[virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues

2018-12-30 Thread Wang, Wei W
On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote:
> 
> I guess you are the first one trying to read virtio config from within 
> interrupt
> context. AFAICT this never worked.

I'm not sure about "never worked". It seems to work well with virtio-pci.
But I'm looking forward to hearing a solid reason why reading the config inside
the interrupt handler is forbidden (if that's true).

> About what happens. The apidoc of ccw_device_start() says it needs to be
> called with the ccw device lock held, so ccw_io_helper() tries to take it 
> (since
> forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and
> io_subchannel_initialize_dev() makes the ccw device lock be the subchannel
> lock. That means when one tries to get virtio config from within a cio
> interrupt context we deadlock, because we try to take a lock we already have.
> 
> That said, I don't think this limitation is by design (i.e. intended).
> Maybe Connie can help us with that question. AFAIK we have nothing
> documented regarding this (neither that can nor can't).
> 
> Obviously, there are multiple ways around this problem, and at the moment
> I can't tell which would be my preferred one.

Yes, it's also not difficult to tweak the virtio-balloon code to avoid that
issue. But if it's really an issue with ccw itself, I think it's better to fix ccw
and leave virtio-balloon unchanged.
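
Just to illustrate what I mean by tweaking the driver (a rough sketch only,
following the config-changed -> work-item pattern virtio-balloon already uses;
names are illustrative, not a final patch):

/* Sketch only: defer the config read from the interrupt path to a work item */
static void update_balloon_size_func(struct work_struct *work)
{
	struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
						 update_balloon_size_work);
	u32 num_pages;

	/* Process context here, so the ccw device/subchannel lock is not held */
	virtio_cread(vb->vdev, struct virtio_balloon_config, num_pages,
		     &num_pages);
	/* ... adjust the balloon towards num_pages ... */
}

static void virtballoon_changed(struct virtio_device *vdev)
{
	struct virtio_balloon *vb = vdev->priv;

	/* Interrupt context (the cio handler on ccw): only schedule work here */
	queue_work(system_freezable_wq, &vb->update_balloon_size_work);
}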

Best,
Wei





[virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues

2018-12-28 Thread Wang, Wei W
On Friday, December 28, 2018 3:57 PM, Christian Borntraeger wrote:
> On 28.12.2018 03:26, Wei Wang wrote:
> > Some vqs don't need to be allocated when the related feature bits are
> > disabled. Callers notify the vq allocation layer by setting the
> > related names[i] to NULL.
> >
> > This patch series fixes the find_vqs implementations to handle this case.
> 
> So the random crashes during boot are gone.
> What still does not work is actually using the balloon.
> 
> So in the qemu monitor using, let's say, "balloon 1000" will hang the guest.
> Seems to be a deadlock in the virtio-ccw code.  We seem to call the config
> code in the interrupt handler.

Yes. It reads a config register from the interrupt handler. Do you know why ccw
doesn't support that, and which internal lock causes the deadlock?
 
Best,
Wei


[virtio-dev] RE: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-09-07 Thread Wang, Wei W
On Friday, September 7, 2018 8:30 PM, Dr. David Alan Gilbert wrote:
> OK, that's much better.
> The ~50% reducton with a 8G VM and a real workload is great, and it does
> what you expect when you put a lot more RAM in and see the 84% reduction
> on a guest with 128G RAM - 54s vs ~9s is a big win!
> 
> (The migrate_set_speed is a bit high, since that's in bytes/s - but it's not
> important).
> 
> That looks good,
> 

Thanks Dave for the feedback.
Hope you can join our discussion and review the v37 patches as well.

Best,
Wei




[virtio-dev] RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 9:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 21:44, Wang, Wei W wrote:
> > On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> >> On 2018/08/06 18:56, Wei Wang wrote:
> >>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >>>> On 2018/08/03 17:32, Wei Wang wrote:
> >>>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
> >>>>> +{
> >>>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>>>> +    vb->shrinker.batch = 0;
> >>>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >>>> Why is the flags field not set? If vb is allocated by
> >>>> kmalloc(GFP_KERNEL) and is nowhere zero-cleared, KASAN would
> >>>> complain about it.
> >>>
> >>> Could you point to where in the code that would complain?
> >>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> >> they seem not related to that.
> >>
> >> Where is vb->shrinker.flags initialized?
> >
> > Is that mandatory to be initialized?
> 
> Of course. ;-)
> 
> > I find it's not initialized in most shrinkers (e.g. zs_register_shrinker,
> huge_zero_page_shrinker).
> 
> Because most shrinkers are "statically initialized (which means that
> unspecified fields are implicitly zero-cleared)" or "dynamically allocated 
> with
> __GFP_ZERO or zero-cleared using
> memset() (which means that all fields are once zero-cleared)".
> 
> And if you once zero-clear vb at allocation time, you will get a bonus that
> calling unregister_shrinker() without corresponding register_shrinker() is 
> safe
> (which will simplify initialization failure path).

Oh, I see, thanks. So it sounds better to directly kzalloc vb.
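
Concretely, something like this in the probe path (an illustrative sketch):
zero-clearing vb means the unset shrinker fields (e.g. shrinker.flags) start
from 0, and unregister_shrinker() on a failure path stays safe:

	/* Sketch: allocate vb zero-cleared instead of with plain kmalloc */
	vb = kzalloc(sizeof(*vb), GFP_KERNEL);
	if (!vb)
		return -ENOMEM;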

Best,
Wei


[virtio-dev] RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 18:56, Wei Wang wrote:
> > On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >> On 2018/08/03 17:32, Wei Wang wrote:
> >>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
> >>> +{
> >>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>> +    vb->shrinker.batch = 0;
> >>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >> Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
> >> and is nowhere zero-cleared, KASAN would complain about it.
> >
> > Could you point to where in the code that would complain?
> > I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> they seem not related to that.
> 
> Where is vb->shrinker.flags initialized?

Is that mandatory to be initialized? I find it's not initialized in most 
shrinkers (e.g. zs_register_shrinker, huge_zero_page_shrinker).

Best,
Wei


[virtio-dev] RE: [PATCH v2 0/2] virtio-balloon: some improvements

2018-07-27 Thread Wang, Wei W
On Friday, July 27, 2018 10:06 PM, Michael S. Tsirkin wrote:
> On Fri, Jul 27, 2018 at 05:24:53PM +0800, Wei Wang wrote:
> > This series is split from the "Virtio-balloon: support free page
> > reporting" series to make some improvements.
> >
> > v1->v2 ChangeLog:
> > - register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is
> negotiated.
> >
> > Wei Wang (2):
> >   virtio-balloon: remove BUG() in init_vqs
> >   virtio_balloon: replace oom notifier with shrinker
> 
> Thanks!
> Given it's very late in the release cycle, I'll merge this for the next Linux
> release.

No problem. Thanks!

Best,
Wei




[virtio-dev] RE: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-22 Thread Wang, Wei W
On Friday, July 20, 2018 8:52 PM, Michael S. Tsirkin wrote:
> On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:
> > This patch series is separated from the previous "Virtio-balloon
> > Enhancement" series. The new feature,
> VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > implemented by this series enables the virtio-balloon driver to report
> > hints of guest free pages to the host. It can be used to accelerate
> > live migration of VMs. Here is an introduction of this usage:
> >
> > Live migration needs to transfer the VM's memory from the source
> > machine to the destination round by round. For the 1st round, all the
> > VM's memory is transferred. From the 2nd round, only the pieces of
> > memory that were written by the guest (after the 1st round) are
> > transferred. One method that is popularly used by the hypervisor to
> > track which part of memory is written is to write-protect all the guest
> memory.
> >
> > This feature enables the optimization by skipping the transfer of
> > guest free pages during VM live migration. It is not concerned that
> > the memory pages are used after they are given to the hypervisor as a
> > hint of the free pages, because they will be tracked by the hypervisor
> > and transferred in the subsequent round if they are used and written.
> >
> > * Tests
> > - Test Environment
> > Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > Guest: 8G RAM, 4 vCPU
> > Migration setup: migrate_set_speed 100G, migrate_set_downtime 2
> > second
> 
> Can we split out patches 1 and 2? They seem appropriate for this release ...

Sounds good to me. I'm not sure if there will be comments on the first 2
patches. If not, could you just take them from here, or do you need me to repost
them separately?

Best,
Wei




[virtio-dev] RE: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-11 Thread Wang, Wei W
On Wednesday, July 11, 2018 7:10 PM, Michal Hocko wrote:
> On Wed 11-07-18 18:52:45, Wei Wang wrote:
> > On 07/11/2018 05:21 PM, Michal Hocko wrote:
> > > On Tue 10-07-18 18:44:34, Linus Torvalds wrote:
> > > [...]
> > > > That was what I tried to encourage with actually removing the
> > > > pages from the page list. That would be an _incremental_
> > > > interface. You can remove MAX_ORDER-1 pages one by one (or a
> > > > hundred at a time), and mark them free for ballooning that way.
> > > > And if you still feel you have tons of free memory, just continue
> removing more pages from the free list.
> > > We already have an interface for that. alloc_pages(GFP_NOWAIT,
> MAX_ORDER -1).
> > > So why do we need any array based interface?
> >
> > Yes, I'm trying to get free pages directly via alloc_pages, so there
> > will be no new mm APIs.
> 
> OK. The above was just a rough example. In fact you would need a more
> complex gfp mask. I assume you only want to balloon only memory directly
> usable by the kernel so it will be
>   (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_RECLAIM

Sounds good to me, thanks.

> 
> > I plan to let free page allocation stop when the remaining system free
> > memory becomes close to min_free_kbytes (prevent swapping).
> 
> ~__GFP_RECLAIM will make sure you can allocate as long as there is any
> memory without reclaim. It will not even poke the kswapd to do the
> background work. So I do not think you would need much more than that.

"close to min_free_kbytes" - I meant when doing the allocations, we 
intentionally reserve some small amount of memory, e.g. 2 free page blocks of 
"MAX_ORDER - 1". So when other applications happen to do some allocation, they 
may easily get some from the reserved memory left on the free list. Without 
that reserved memory, other allocation may cause the system free memory below 
the WMARK[MIN], and kswapd would start to do swapping. This is actually just a 
small optimization to reduce the probability of causing swapping (nice to have, 
but not mandatary because we will allocate free page blocks one by one).
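
To make the idea concrete, a rough sketch (not the driver code; the macro names
and the watermark check are illustrative):

/* Sketch only: reserve a couple of MAX_ORDER - 1 blocks while hinting */
#define HINT_BLOCK_ORDER	(MAX_ORDER - 1)
#define HINT_RESERVED_BLOCKS	2

	gfp_t gfp = (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_RECLAIM;
	unsigned long addr;

	/* Stop while a couple of blocks are still left on the free list */
	while (global_zone_page_state(NR_FREE_PAGES) >
	       (HINT_RESERVED_BLOCKS << HINT_BLOCK_ORDER)) {
		addr = __get_free_pages(gfp, HINT_BLOCK_ORDER);
		if (!addr)
			break;
		/* record addr as a free page hint; give the block back later */
	}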

 > But let me note that I am not really convinced how this (or previous)
> approach will really work in most workloads. We tend to cache heavily so
> there is rarely any memory free.

With less free memory, the improvement becomes smaller, but it is still better
than no optimization. For example, the Linux build workload uses 4~5 GB (out of
8GB) of memory as page cache at its final stage, and there is still a ~44%
reduction in live migration time.

Since we have many cloud customers interested in this feature, I think we can 
let them test the usefulness.

Best,
Wei




[virtio-dev] RE: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-10 Thread Wang, Wei W
On Tuesday, July 10, 2018 5:31 PM, Wang, Wei W wrote:
> Subject: [PATCH v35 1/5] mm: support to get hints of free page blocks
> 
> This patch adds support to get free page blocks from a free page list.
> The physical addresses of the blocks are stored to a list of buffers passed
> from the caller. The obtained free page blocks are hints about free pages,
> because there is no guarantee that they are still on the free page list after
> the function returns.
> 
> One use example of this patch is to accelerate live migration by skipping the
> transfer of free pages reported from the guest. A popular method used by
> the hypervisor to track which part of memory is written during live migration
> is to write-protect all the guest memory. So, those pages that are hinted as
> free pages but are written after this function returns will be captured by the
> hypervisor, and they will be added to the next round of memory transfer.
> 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michael S. Tsirkin 
> Cc: Linus Torvalds 
> ---
>  include/linux/mm.h |  3 ++
>  mm/page_alloc.c    | 98 ++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h index a0fbb9f..5ce654f
> 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2007,6 +2007,9 @@ extern void free_area_init(unsigned long *
> zones_size);  extern void free_area_init_node(int nid, unsigned long *
> zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
> extern void free_initmem(void);
> +unsigned long max_free_page_blocks(int order);
> +int get_from_free_page_list(int order, struct list_head *pages,
> +			    unsigned int size, unsigned long *loaded_num);
> 
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1521100..b67839b
> 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5043,6 +5043,104 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>   show_swap_cache_info();
>  }
> 
> +/**
> + * max_free_page_blocks - estimate the max number of free page blocks
> + * @order: the order of the free page blocks to estimate
> + *
> + * This function gives a rough estimation of the possible maximum
> +number of
> + * free page blocks a free list may have. The estimation works on an
> +assumption
> + * that all the system pages are on that list.
> + *
> + * Context: Any context.
> + *
> + * Return: The largest number of free page blocks that the free list can 
> have.
> + */
> +unsigned long max_free_page_blocks(int order)
> +{
> +	return totalram_pages / (1 << order);
> +}
> +EXPORT_SYMBOL_GPL(max_free_page_blocks);
> +
> +/**
> + * get_from_free_page_list - get hints of free pages from a free page
> +list
> + * @order: the order of the free page list to check
> + * @pages: the list of page blocks used as buffers to load the
> +addresses
> + * @size: the size of each buffer in bytes
> + * @loaded_num: the number of addresses loaded to the buffers
> + *
> + * This function offers hints about free pages. The addresses of free
> +page
> + * blocks are stored to the list of buffers passed from the caller.
> +There is
> + * no guarantee that the obtained free pages are still on the free page
> +list
> + * after the function returns. pfn_to_page on the obtained free pages
> +is
> + * strongly discouraged and if there is an absolute need for that, make
> +sure
> + * to contact MM people to discuss potential problems.
> + *
> + * The addresses are currently stored to a buffer in little endian.
> +This
> + * avoids the overhead of converting endianness by the caller who needs
> +data
> + * in the little endian format. Big endian support can be added on
> +demand in
> + * the future.
> + *
> + * Context: Process context.
> + *
> + * Return: 0 if all the free page block addresses are stored to the buffers;
> + * -ENOSPC if the buffers are not sufficient to store all the
> + * addresses; or -EINVAL if an unexpected argument is received (e.g.
> + * incorrect @order, empty buffer list).
> + */
> +int get_from_free_page_list(int order, struct list_head *pages,
> +			    unsigned int size, unsigned long *loaded_num)
> +{


Hi Linus,

We took your original suggestion - pass in pre-allocated buffers to load the
addresses (now we use a list of pre-allocated page blocks as buffers). Hope
that suggestion is still acceptable (the advantage of this method was explained
here: https://lkml.org/lkml/2018/6/28/184).
Look forward to getting your feedback. Thanks.
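
For reference, a hypothetical caller-side sketch of using the API above (buffer
count and flags are illustrative, not from the patch set):

	/* Sketch: pre-allocate a few MAX_ORDER - 1 page blocks as buffers */
	LIST_HEAD(buffers);
	unsigned long loaded = 0;
	struct page *page;
	int i, ret;

	for (i = 0; i < 4; i++) {
		page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, MAX_ORDER - 1);
		if (!page)
			break;
		list_add_tail(&page->lru, &buffers);
	}

	/* Ask the mm to fill the buffers with free page block addresses */
	ret = get_from_free_page_list(MAX_ORDER - 1, &buffers,
				      PAGE_SIZE << (MAX_ORDER - 1), &loaded);
	/* ret == -ENOSPC: the buffers couldn't hold all the addresses */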

Best,
Wei 




[virtio-dev] RE: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wang, Wei W
On Friday, June 29, 2018 7:54 PM, David Hildenbrand wrote:
> On 29.06.2018 13:31, Wei Wang wrote:
> > On 06/29/2018 03:46 PM, David Hildenbrand wrote:
> >>>
> >>> I'm afraid it can't. For example, when we have a guest booted,
> >>> without too many memory activities. Assume the guest has 8GB free
> >>> memory. The arch_free_page there won't be able to capture the 8GB
> >>> free pages since there is no free() called. This results in no free pages
> reported to host.
> >>
> >> So, it takes some time from when the guest boots up until the balloon
> >> device was initialized and therefore page hinting can start. For that
> >> period, you won't get any arch_free_page()/page hinting callbacks, correct.
> >>
> >> However in the hypervisor, you can theoretically track which pages
> >> the guest actually touched ("dirty"), so you already know "which
> >> pages were never touched while booting up until virtio-balloon was
> >> brought to life". These, you can directly exclude from migration. No
> >> interface required.
> >>
> >> The remaining problem is pages that were touched ("allocated") by the
> >> guest during bootup but freed again, before virtio-balloon came up.
> >> One would have to measure how many pages these usually are, I would
> >> say it would not be that many (because recently freed pages are
> >> likely to be used again next for allocation). However, there are some
> >> pages not being reported.
> >>
> >> During the lifetime of the guest, this should not be a problem,
> >> eventually one of these pages would get allocated/freed again, so the
> >> problem "solves itself over time". You are looking into the special
> >> case of migrating the VM just after it has been started. But we have
> >> the exact same problem also for ordinary free page hinting, so we
> >> should rather solve that problem. It is not migration specific.
> >>
> >> If we are looking for an alternative to "problem solves itself",
> >> something like "if virtio-balloon comes up, it will report all free
> >> pages step by step using free page hinting, just like we would have
> >> from "arch_free_pages()"". This would be the same interface we are
> >> using for free page hinting - and it could even be made configurable in the
> guest.
> >>
> >> The current approach we are discussing internally for details about
> >> Nitesh's work ("how the magic inside arch_fee_pages() will work
> >> efficiently) would allow this as far as I can see just fine.
> >>
> >> There would be a tiny little window between virtio-balloon comes up
> >> and it has reported all free pages step by step, but that can be
> >> considered a very special corner case that I would argue is not worth
> >> it to be optimized.
> >>
> >> If I am missing something important here, sorry in advance :)
> >>
> >
> > Probably I didn't explain that well. Please see my re-try:
> >
> > That work is to monitor page allocation and free activities via
> > arch_alloc_pages and arch_free_pages. It has per-CPU lists to record
> > the pages that are freed to the mm free list, and the per-CPU lists
> > dump the recorded pages to a global list when any of them is full.
> > So its own per-CPU list will only be able to get free pages when an mm
> > free() function gets called. If we have 8GB free memory on
> > the mm free list, but no application uses them and thus no mm free()
> > calls are made. In that case, the arch_free_pages isn't called, and no
> > free pages added to the per-CPU list, but we have 8G free memory right
> > on the mm free list.
> > How would you guarantee the per-CPU lists have got all the free pages
> > that the mm free lists have?
> 
> As I said, if we have some mechanism that will scan the free pages (not
> arch_free_page() once and report hints using the same mechanism step by
> step (not your bulk interface)), this problem is solved. And as I said, this 
> is
> not a migration specific problem, we have the same problem in the current
> page hinting RFC. These pages have to be reported.
> 
> >
> > - I'm also worried about the overhead of maintaining so many per-CPU
> > lists and the global list. For example, if we have applications
> > frequently allocate and free 4KB pages, and each per-CPU list needs to
> > implement the buddy algorithm to sort and merge neighbor pages. Today
> > a server can have more than 100 CPUs, then there will be more than 100
> > per-CPU lists which need to sync to a global list under a lock, I'm
> > not sure if this would scale well.
> 
> The overhead in the current RFC is definitely too high. But I consider this a
> problem to be solved before page hinting would go upstream. And we are
> discussing right now "if we have a reasonable page hinting implementation,
> why would we need your interface in addition".
> 
> >
> > - This seems to be a burden imposed on the core mm memory
> > allocation/free path. The whole overhead needs to be carried during
> > the whole system life cycle. What we actually expected is to just make
> > one call 

[virtio-dev] RE: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wang, Wei W
On Friday, June 29, 2018 10:46 PM, Michael S. Tsirkin wrote:
> 
> On Wed, Jun 27, 2018 at 01:06:32PM +0200, David Hildenbrand wrote:
> > On 25.06.2018 14:05, Wei Wang wrote:
> > > This patch series is separated from the previous "Virtio-balloon
> > > Enhancement" series. The new feature,
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this series
> enables
> > > the virtio-balloon driver to report hints of guest free pages to the
> > > host. It can be used to accelerate live migration of VMs. Here is an
> introduction of this usage:
> > >
> > > Live migration needs to transfer the VM's memory from the source
> > > machine to the destination round by round. For the 1st round, all
> > > the VM's memory is transferred. From the 2nd round, only the pieces
> > > of memory that were written by the guest (after the 1st round) are
> > > transferred. One method that is popularly used by the hypervisor to
> > > track which part of memory is written is to write-protect all the guest
> memory.
> > >
> > > This feature enables the optimization by skipping the transfer of
> > > guest free pages during VM live migration. It is not concerned that
> > > the memory pages are used after they are given to the hypervisor as
> > > a hint of the free pages, because they will be tracked by the
> > > hypervisor and transferred in the subsequent round if they are used and
> written.
> > >
> > > * Tests
> > > - Test Environment
> > > Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > > Guest: 8G RAM, 4 vCPU
> > > Migration setup: migrate_set_speed 100G, migrate_set_downtime 2
> > > second
> > >
> > > - Test Results
> > > - Idle Guest Live Migration Time (results are averaged over 10 runs):
> > > - Optimization v.s. Legacy = 284ms vs 1757ms --> ~84% reduction
> > > - Guest with Linux Compilation Workload (make bzImage -j4):
> > > - Live Migration Time (average)
> > >   Optimization v.s. Legacy = 1402ms v.s. 2528ms --> ~44% reduction
> > > - Linux Compilation Time
> > >   Optimization v.s. Legacy = 5min6s v.s. 5min12s
> > >   --> no obvious difference
> > >
> >
> > Being in version 34 already, this whole thing still looks and feels
> > like a big hack to me. It might just be me, but especially if I read
> > about assumptions like "QEMU will not hotplug memory during
> > migration". This does not feel like a clean solution.
> >
> > I am still not sure if we really need this interface, especially as
> > real free page hinting might be on its way.
> >
> > a) we perform free page hinting by setting all free pages
> > (arch_free_page()) to zero. Migration will detect zero pages and
> > minimize #pages to migrate. I don't think this is a good idea but
> > Michel suggested to do a performance evaluation and Nitesh is looking
> > into that right now.
> 
> Yes this test is needed I think. If we can get most of the benefit without PV
> interfaces, that's nice.
> 
> Wei, I think you need this as part of your performance comparison
> too: set page poisoning value to 0 and enable KSM, compare with your
> patches.

Do you mean live migration with zero pages?
I can first share the amount of memory transferred during live migration that I saw:
Legacy is around 380MB,
Optimization is around 340MB.
This indicates that most pages were already zero and were skipped during the legacy
live migration. But the legacy migration time is still much larger because zero page
checking is costly.
(It's late night here; I can get you that test with my server probably tomorrow)

Best,
Wei









RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-20 Thread Wang, Wei W
On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters 
> > > > > even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
> 
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation order is MAX_ORDER - 1,
which can report up to 2TB of free memory.

"getting two pages isn't harder", do you mean passing two arrays (two 
allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Please see if the following logic aligns to what you think:

uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
unsigned long *arrays;

/*
 * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
 * store all the hints, we need to allocate multiple arrays.
 * max_hints: the max number of 4MB free page blocks
 * hints_per_page: the number of hints each page can store
 * hints_per_array: the number of hints an array can store
 * total_arrays: the number of arrays we need
 */
max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
hints_per_page = PAGE_SIZE / sizeof(__le64);
hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
total_arrays = max_hints / hints_per_array +
	       !!(max_hints % hints_per_array);
arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
for (i = 0; i < total_arrays; i++) {
	arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
				     MAX_ORDER - 1);

	if (!arrays[i])
		goto out;
}
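
For completeness, a hypothetical error path for the "goto out" above (not part
of the sketch itself): free whatever has been allocated so far.

out:
	/* Sketch only: unwind the allocations made before the failure */
	while (i--)
		free_pages(arrays[i], MAX_ORDER - 1);
	kfree(arrays);
	return -ENOMEM;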


- the mm API needs to be changed to support storing hints into multiple separate
arrays provided by the caller.

Best,
Wei




RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Wang, Wei W
On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method
always steals the first free page blocks, and it could certainly be changed to take
more. But all of this can be achieved via kmalloc by the caller, which is more
prudent and makes the code more straightforward. I think we don't need to take that
risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation
that the max free memory to report is 2TB. Back to the motivation of this work:
the cloud guys want to use this optimization to accelerate their guest live
migration. 2TB guests are not common in today's clouds. When huge guests become
common in the future, we can easily tweak this API to fill hints into scattered
buffers (e.g. several 4MB arrays passed to this API) instead of one as in this
version.

This limitation doesn't cause any issue from a functionality perspective. For an
extreme case like a 100TB guest live migration, which is theoretically possible
today, this optimization helps skip 2TB of its free memory. The result is that
it may reduce live migration time by only 2%, but that is still better than not
skipping the 2TB (if not using the feature).

So, for the first release of this feature, I think it is better to have the
simpler and more straightforward solution as we have now, and clearly document
why it can report up to 2TB free memory.
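
For clarity, the arithmetic behind the 2TB figure:
    4MB kmalloc buffer / 8 bytes per __le64 address = 524,288 addresses
    524,288 addresses * 4MB per (MAX_ORDER - 1) block = 2TB of free memory covered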


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB rather than 2TB mostly because it needs to use,
e.g., 2.5TB of memory from time to time. In the worst case, it only has 0.5TB of
free memory to report, but reporting 0.5TB with this optimization is better than no
optimization. (And the current 2TB limitation isn't a limitation for the 3TB
guest in this case.)

Best,
Wei




[virtio-dev] RE: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-16 Thread Wang, Wei W
On Saturday, June 16, 2018 12:50 PM, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> > +/**
> > + * get_from_free_page_list - get free page blocks from a free page
> > +list
> > + * @order: the order of the free page list to check
> > + * @buf: the array to store the physical addresses of the free page
> > +blocks
> > + * @size: the array size
> > + *
> > + * This function offers hints about free pages. There is no guarantee
> > +that
> > + * the obtained free pages are still on the free page list after the
> > +function
> > + * returns. pfn_to_page on the obtained free pages is strongly
> > +discouraged
> > + * and if there is an absolute need for that, make sure to contact MM
> > +people
> > + * to discuss potential problems.
> > + *
> > + * The addresses are currently stored to the array in little endian.
> > +This
> > + * avoids the overhead of converting endianness by the caller who
> > +needs data
> > + * in the little endian format. Big endian support can be added on
> > +demand in
> > + * the future.
> > + *
> > + * Return the number of free page blocks obtained from the free page list.
> > + * The maximum number of free page blocks that can be obtained is
> > +limited to
> > + * the caller's array size.
> > + */
> 
> Please use:
> 
>  * Return: The number of free page blocks obtained from the free page list.
> 
> Also, please include a
> 
>  * Context: Any context.
> 
> or
> 
>  * Context: Process context.
> 
> or whatever other conetext this function can be called from.  Since you're
> taking the lock irqsafe, I assume this can be called from any context, but I
> wonder if it makes sense to have this function callable from interrupt 
> context.
> Maybe this should be callable from process context only.

Thanks, sounds better to make it process context only.

> 
> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> > +{
> > +	struct zone *zone;
> > +	enum migratetype mt;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long addr, flags;
> > +	uint32_t index = 0;
> > +
> > +	for_each_populated_zone(zone) {
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +			list = &zone->free_area[order].free_list[mt];
> > +			list_for_each_entry(page, list, lru) {
> > +				addr = page_to_pfn(page) << PAGE_SHIFT;
> > +				if (likely(index < size)) {
> > +					buf[index++] = cpu_to_le64(addr);
> > +				} else {
> > +					spin_unlock_irqrestore(&zone->lock,
> > +							       flags);
> > +					return index;
> > +				}
> > +			}
> > +		}
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +	}
> > +
> > +	return index;
> > +}
> 
> I wonder if (to address Michael's concern), you shouldn't instead use the 
> first
> free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);

Thanks for sharing. But we would probably not take this approach, for the
reasons below:

1) I'm not sure if getting a block of free pages to use could be that simple
(just plucking it from the list as above). I think it is more prudent to let the
callers allocate the array via the regular allocation functions.

2) Callers may need to use this with their own defined protocols, and they want
the header and payload (i.e. the obtained hints) to be located in physically
contiguous memory (there are tricks they can use to make it work with
non-physically-contiguous memory, but that would just complicate all the
things). In this case, it is better to have callers allocate the memory on
their own, and pass the payload part of that memory to this API to get the
payload filled.
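
As a purely illustrative example of 2) (the struct and header are hypothetical,
not from any patch), a caller that wants header and payload physically
contiguous would do something like:

struct hint_buf {
	struct my_proto_hdr hdr;	/* hypothetical caller-defined header */
	__le64 hints[];			/* payload filled by the mm API */
};

	struct hint_buf *buf;
	uint32_t size, got;

	buf = (void *)__get_free_pages(GFP_KERNEL, MAX_ORDER - 1);
	if (buf) {
		size = ((PAGE_SIZE << (MAX_ORDER - 1)) - sizeof(*buf)) /
		       sizeof(__le64);
		got = get_from_free_page_list(MAX_ORDER - 1, buf->hints, size);
	}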

Best,
Wei





[virtio-dev] RE: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-15 Thread Wang, Wei W
On Saturday, June 16, 2018 7:09 AM, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang  wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages,
> > because there is no guarantee that they are still on the free page
> > list after the function returns.
> 
> Ack. This is the kind of simple interface where I don't need to worry about
> the MM code calling out to random drivers or subsystems.
> 
> I think that "order" should be checked for validity, but from a MM standpoint
> I think this is fine.
> 

Thanks, will add a validity check for "order".

Best,
Wei


[virtio-dev] RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-15 Thread Wang, Wei W
On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > indicates the support of reporting hints of guest free pages to host via
> virtio-balloon.
> > > >
> > > > Host requests the guest to report free page hints by sending a
> > > > command to the guest via setting the
> > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > bit of the host_cmd config register.
> > > >
> > > > As the first step here, virtio-balloon only reports free page
> > > > hints from the max order (10) free page list to host. This has
> > > > generated similar good results as reporting all free page hints during
> our tests.
> > > >
> > > > TODO:
> > > > - support reporting free page hints from smaller order free page lists
> > > >   when there is a need/request from users.
> > > >
> > > > Signed-off-by: Wei Wang 
> > > > Signed-off-by: Liang Li 
> > > > Cc: Michael S. Tsirkin 
> > > > Cc: Michal Hocko 
> > > > Cc: Andrew Morton 
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c     | 187 +---
> > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -43,6 +43,9 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +/* The size of memory in bytes allocated for reporting free page hints */
> > > > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > Doesn't this limit memory size of the guest we can report?
> > > Apparently to several gigabytes ...
> > > OTOH huge guests with lots of free memory is exactly where we would
> > > gain the most ...
> >
> > Yes, the 16-page array can report up to 32GB (each page can hold 512
> addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> memory to host. It is not flexible.
> >
> > How about allocating the buffer according to the guest memory size
> > (proportional)? That is,
> >
> > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > pages blocks that the system can have */ 4m_page_blocks =
> > totalram_pages / 1024;
> >
> > /* Allocating one page can hold 512 free page blocks, so calculates
> > the number of pages that can hold those 4MB blocks. And this
> > allocation should not exceed 1024 pages */ pages_to_allocate =
> > min(4m_page_blocks / 512, 1024);
> >
> > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> 1024 pages as the buffer.
> >
> > When the guest has large memory, it should be easier to succeed in
> allocation of large buffer. If that allocation fails, that implies that 
> nothing
> would be got from the 4MB free page list.
> >
> > I think the proportional allocation is simpler compared to other
> > approaches like
> > - scattered buffer, which will complicate the get_from_free_page_list
> > implementation;
> > - one buffer to call get_from_free_page_list multiple times, which needs
> get_from_free_page_list to maintain states.. also too complicated.
> >
> > Best,
> > Wei
> >
> 
> That's more reasonable, but question remains what to do if that value
> exceeds MAX_ORDER. I'd say maybe tell host we can't report it.

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the
maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the
optimization can still report 2TB of free memory (better than no optimization).

On the other hand, large guests are large mostly because they need to use a lot
of memory. In that case, they usually won't have that much free memory to
report.

> 
> Also allocating it with GFP_KERNEL is out. You only want to take it off the 
> free
> list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Sounds good, thanks.

> Also you can't allocate this on device start. First totalram_pages can change.
> Second that's too much memory to tie up forever.

Yes, makes sense.

Best,
Wei




[virtio-dev] RE: [PATCH v33 0/4] Virtio-balloon: support free page reporting

2018-06-15 Thread Wang, Wei W
On Friday, June 15, 2018 7:30 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> >   - remove the cmd id related interface. Now host can just send a free
> > page hint command to the guest (via the host_cmd config register)
> > to start the reporting.
> 
> Here we go again. And what if reporting was already started previously?
> I don't think it's a good idea to tweak the host/guest interface yet again.

This interface is much simpler, and I'm not sure that would be an issue here
now, because the guest now delivers the whole buffer of hints to host at once,
instead of hint by hint as before, and the guest notifies host after the buffer
is delivered. In any case, the host doorbell handler will be invoked; if host
doesn't need the hints at that time, it will just give back the buffer. There
will be no stale hints remaining in the ring now.
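
For illustration, the driver-side flow could be as simple as (field names follow
the patch description; the work item and workqueue are hypothetical):

	/* Sketch: on a config change, start a reporting round if requested */
	u32 cmd;

	virtio_cread(vb->vdev, struct virtio_balloon_config, host_cmd, &cmd);
	if (cmd & VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT)
		queue_work(vb->balloon_wq, &vb->report_free_page_work);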

Best,
Wei




[virtio-dev] RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-15 Thread Wang, Wei W
On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > the support of reporting hints of guest free pages to host via 
> > virtio-balloon.
> >
> > Host requests the guest to report free page hints by sending a command
> > to the guest via setting the
> VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > bit of the host_cmd config register.
> >
> > As the first step here, virtio-balloon only reports free page hints
> > from the max order (10) free page list to host. This has generated
> > similar good results as reporting all free page hints during our tests.
> >
> > TODO:
> > - support reporting free page hints from smaller order free page lists
> >   when there is a need/request from users.
> >
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Michal Hocko 
> > Cc: Andrew Morton 
> > ---
> >  drivers/virtio/virtio_balloon.c     | 187 +---
> >  include/uapi/linux/virtio_balloon.h |  13 +++
> >  2 files changed, 163 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c
> > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -43,6 +43,9 @@
> >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> >
> > +/* The size of memory in bytes allocated for reporting free page hints */
> > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> 
> Doesn't this limit memory size of the guest we can report?
> Apparently to several gigabytes ...
> OTOH huge guests with lots of free memory is exactly where we would gain
> the most ...

Yes, the 16-page array can report up to 32GB of free memory to host (each page
can hold 512 addresses of 4MB free page blocks, i.e. 2GB of free memory per
page). It is not flexible.

How about allocating the buffer according to the guest memory size 
(proportional)? That is,

/* Calculate the maximum number of 4MB (equal to 1024 pages) free page
blocks that the system can have */
4m_page_blocks = totalram_pages / 1024;

/* One allocated page can hold 512 free page block addresses, so calculate the
number of pages needed to hold those 4MB blocks. This allocation should not
exceed 1024 pages */
pages_to_allocate = min(4m_page_blocks / 512, 1024);

For a 2TB guest, which has 2^19 page blocks (4MB each), we will allocate 1024
pages as the buffer.

When the guest has a large amount of memory, it should be easier to succeed in
allocating a large buffer. If that allocation fails, it implies that nothing
would be obtained from the 4MB free page list anyway.
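
As a worked example of the sizing above: an 8GB guest has 8GB / 4MB = 2048
4MB page blocks, so it needs only 2048 / 512 = 4 buffer pages; the 1024-page
cap is reached only at 1024 * 512 * 4MB = 2TB of guest memory.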

I think the proportional allocation is simpler compared to other approaches,
like:
- a scattered buffer, which will complicate the get_from_free_page_list
implementation;
- one buffer used to call get_from_free_page_list multiple times, which needs
get_from_free_page_list to maintain state... also too complicated.

Best,
Wei







[virtio-dev] RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-05 Thread Wang, Wei W
On Friday, April 6, 2018 2:30 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 03:47:28PM +0000, Wang, Wei W wrote:
> > On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 05, 2018 at 02:05:03AM +, Wang, Wei W wrote:
> > > > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > > > On Thu, Apr 05, 2018 at 12:30:27AM +, Wang, Wei W wrote:
> > > > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > > > stale or not,
> > > > >
> > > > >
> > > > > No - I mean that driver has code that compares two values and
> > > > > stops reporting. Can one of the values be stale?
> > > >
> > > > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to
> > > > decide if it needs to stop reporting hints, and cmd_id_received is
> > > > what the driver reads from host (host notifies the driver to read
> > > > for the latest value). If host sends a new cmd id, it will notify
> > > > the guest to read again. I'm not sure how that could be a stale
> > > > cmd id (or maybe I misunderstood your point here?)
> > > >
> > > > Best,
> > > > Wei
> > >
> > > The comparison is done in one thread, the update in another one.
> >
> > I think this isn't something that could be solved by adding a lock,
> > unless host waits for the driver's ACK about finishing the update
> > (this is not agreed in the QEMU part discussion).
> >
> > Actually virtio_balloon has F_IOMMU_PLATFORM disabled, maybe we
> don't
> > need to worry about that using DMA api case (we only have gpa added to
> > the vq, and having some entries stay in the vq seems fine). For this
> > feature, I think it would not work with F_IOMMU enabled either.
> 
> Adding a code comment explaining all this might be a good idea.

Thanks, will do.

Best,
Wei




[virtio-dev] RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-05 Thread Wang, Wei W
On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 05, 2018 at 12:30:27AM +, Wang, Wei W wrote:
> > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > stale or not,
> > >
> > >
> > > No - I mean that driver has code that compares two values and stops
> > > reporting. Can one of the values be stale?
> >
> > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide
> > if it needs to stop reporting hints, and cmd_id_received is what the
> > driver reads from host (host notifies the driver to read for the
> > latest value). If host sends a new cmd id, it will notify the guest to
> > read again. I'm not sure how that could be a stale cmd id (or maybe I
> > misunderstood your point here?)
> >
> > Best,
> > Wei
> 
> The comparison is done in one thread, the update in another one.

I think this isn't something that could be solved by adding a lock, unless host
waits for the driver's ACK that the update has finished (which was not agreed on
in the QEMU-side discussion).

Actually, virtio_balloon has F_IOMMU_PLATFORM disabled, so maybe we don't need to
worry about the DMA API case (we only add guest physical addresses to the vq, and
having some entries stay in the vq seems fine). For this feature, I think it
would not work with F_IOMMU enabled either.

If there is any further need (I can't think of one so far), I think we could
consider letting host inject a vq interrupt at some point, and then the driver's
handler can do the virtqueue_get_buf work.

Best,
Wei





[virtio-dev] RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-04 Thread Wang, Wei W
On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > I'm afraid the driver couldn't be aware if the added hints are stale
> > or not,
> 
> 
> No - I mean that driver has code that compares two values and stops
> reporting. Can one of the values be stale?

The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide if it
needs to stop reporting hints, and cmd_id_received is what the driver reads
from host (host notifies the driver to read the latest value). If host
sends a new cmd id, it will notify the guest to read again. I'm not sure how
that could be a stale cmd id (or maybe I misunderstood your point here?)

Best,
Wei




[virtio-dev] RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-04 Thread Wang, Wei W
On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > +static int add_one_sg(struct virtqueue *vq, unsigned long pfn,
> > > > +		      uint32_t len)
> > > > +{
> > > > +   struct scatterlist sg;
> > > > +   unsigned int unused;
> > > > +
> > > > +	sg_init_table(&sg, 1);
> > > > +	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> > > > +
> > > > +   /* Detach all the used buffers from the vq */
> > > > +	while (virtqueue_get_buf(vq, &unused))
> > > > +   ;
> > > > +
> > > > +   /*
> > > > +* Since this is an optimization feature, losing a couple of 
> > > > free
> > > > +* pages to report isn't important. We simply return without 
> > > > adding
> > > > +* the page hint if the vq is full.
> > > why not stop scanning of following pages though?
> >
> > Because continuing to send hints is a way to deliver the maximum
> > possible hints to host. For example, host may have a delay in taking
> > hints at some point, and then it resumes to take hints soon. If the
> > driver does not stop when the vq is full, it will be able to put more
> > hints to the vq once the vq has available entries to add.
> 
> What this appears to be is just lack of coordination between host and guest.
> 
> But meanwhile you are spending cycles walking the list uselessly.
> Instead of trying willy-nilly, the standard thing to do is to wait for host to
> consume an entry and proceed.
> 
> Coding it up might be tricky, so it's probably acceptable as is for now, but
> please replace the justification about with a TODO entry that we should
> synchronize with the host.

Thanks. I plan to add

TODO: The current implementation could be further improved by stopping the 
reporting when the vq is full and continuing the reporting when host notifies 
that there are available entries for the driver to add.
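
Purely as an illustration of that TODO (the wait queue and the callback are
hypothetical additions, not part of the posted patch), the synchronization could
look like:

/* Sketch: wake the reporting path when host consumes a buffer off the vq */
static void free_page_vq_cb(struct virtqueue *vq)
{
	struct virtio_balloon *vb = vq->vdev->priv;

	wake_up(&vb->free_page_wq);
}

	/* ...and in add_one_sg(), instead of returning when the vq is full: */
	wait_event(vb->free_page_wq, vq->num_free > 1);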


> 
> 
> >
> > >
> > > > +* We are adding one entry each time, which essentially results 
> > > > in no
> > > > +* memory allocation, so the GFP_KERNEL flag below can be 
> > > > ignored.
> > > > +* Host works by polling the free page vq for hints after 
> > > > sending the
> > > > +* starting cmd id, so the driver doesn't need to kick after 
> > > > filling
> > > > +* the vq.
> > > > +* Lastly, there is always one entry reserved for the cmd id to 
> > > > use.
> > > > +*/
> > > > +   if (vq->num_free > 1)
> > > > +		return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +static int virtio_balloon_send_free_pages(void *opaque, unsigned long
> pfn,
> > > > +  unsigned long nr_pages)
> > > > +{
> > > > +   struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > > +   uint32_t len = nr_pages << PAGE_SHIFT;
> > > > +
> > > > +   /*
> > > > +* If a stop id or a new cmd id was just received from host, 
> > > > stop
> > > > +* the reporting, and return 1 to indicate an active stop.
> > > > +*/
> > > > +	if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb->cmd_id_received)
> > > > +   return 1;
> 
> functions returning int should return 0 or -errno on failure, positive return
> code should indicate progress.
> 
> If you want a boolean, use bool pls.

OK. I plan to change 1 to -EBUSY to indicate the case where host actively asks
the driver to stop reporting (this makes the callback's return value type
consistent with walk_free_mem_block).



> 
> 
> > > > +
> > > this access to cmd_id_use and cmd_id_received without locks bothers
> > > me. Pls document why it's safe.
> >
> > OK. Probably we could add below to the above comments:
> >
> > cmd_id_use and cmd_id_received don't need to be accessed under locks
> > because the reporting does not have to stop immediately before
> > cmd_id_received is changed (i.e. when host requests to stop). That is,
> > reporting more hints after host requests to stop isn't an issue for
> > this optimization feature, because host will simply drop the stale
> > hints next time when it needs a new reporting.
> 
> What about the other direction? Can this observe a stale value and exit
> erroneously?

I'm afraid the driver couldn't know whether the added hints are stale or not,
because host and guest actions happen asynchronously. That is, the host-side
iothread stops taking hints as soon as the migration thread asks it to stop; it
doesn't wait for any ACK from the driver (as we discussed before, host couldn't
always assume that the driver is in a responsive state).

Btw, we also don't need to worry about any memory left in the vq, since only
addresses are added to the vq; there are no real memory allocations.

Best,
Wei


[virtio-dev] RE: [Qemu-devel] [virtio-dev] Re: [PATCH v5 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-26 Thread Wang, Wei W
On Monday, March 26, 2018 11:04 PM, Daniel P. Berrangé wrote:
> On Mon, Mar 26, 2018 at 02:54:45PM +0000, Wang, Wei W wrote:
> > On Monday, March 26, 2018 7:09 PM, Daniel P. Berrangé wrote:
> > >
> > > As far as libvirt is concerned there are three sets of threads it
> > > provides control over
> > >
> > >  - vCPUs - each VCPU in KVM has a thread. Libvirt provides per-thread
> > >tunable control
> > >
> > >  - IOThreads - each named I/O thread can be associated with one or more
> > >devices. Libvirt provides per-thread tunable control.
> > >
> > >  - Emulator - any other QEMU thread which isn't an vCPU thread or IO
> thread
> > >gets called an emulator thread by libvirt. There is no-per thread
> > >tunable control - we can set tunables for entire set of emulator 
> > > threads
> > >at once.
> > >
> >
> >
> > Hi Daniel,
> > Thanks for sharing the details, they are very helpful. I still have a 
> > question:
> >
> > There is no fundamental difference between iothread and our
> > optimization thread (it is similar to the migration thread, which is
> > created when migration begins and terminated when migration is done) -
> > both of them are pthreads and each has a name. Could we also add the
> > similar per-thread tunable control in libvirt for such threads?
> >
> > For example, in QEMU we can add a new migration qmp command,
> > migrate_enable_free_page_optimization (just like other commands
> > migrate_set_speed 10G), this command will create the optimization thread.
> > In this way, creation of the thread is in libvirt's control, and
> > libvirt can then support tuning the thread (e.g. pin it to any pCPU), right?
> 
> Anything is possible from a technical level, but from a design POV we would
> rather not start exposing tunables for arbitrary device type specific threads
> as they are inherantly non-portable and may not even exist in QEMU long
> term as it is just an artifact of the current QEMU implementation.
> 

OK. My actual concern with iothread is that it exists during the whole QEMU 
lifecycle. A block device using it is fine, since block device accesses happen 
frequently throughout the QEMU lifecycle. But for usages like live migration, 
which may happen once a day, keeping this iothread running would be a waste of 
CPU cycles.

Best,
Wei
 


[virtio-dev] RE: [Qemu-devel] [virtio-dev] Re: [PATCH v5 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-26 Thread Wang, Wei W
On Monday, March 26, 2018 7:09 PM, Daniel P. Berrangé wrote:
> 
> As far as libvirt is concerned there are three sets of threads it provides
> control over
> 
>  - vCPUs - each VCPU in KVM has a thread. Libvirt provides per-thread
>tunable control
> 
>  - IOThreads - each named I/O thread can be associated with one or more
>devices. Libvirt provides per-thread tunable control.
> 
>  - Emulator - any other QEMU thread which isn't an vCPU thread or IO thread
>gets called an emulator thread by libvirt. There is no-per thread
>tunable control - we can set tunables for entire set of emulator threads
>at once.
> 


Hi Daniel,
Thanks for sharing the details; they are very helpful. I still have a question:

There is no fundamental difference between iothread and our optimization thread 
(it is similar to the migration thread, which is created when migration begins 
and terminated when migration is done) - both of them are pthreads and each has 
a name. Could we also add similar per-thread tunable control in libvirt for 
such threads?

For example, in QEMU we can add a new migration QMP command, 
migrate_enable_free_page_optimization (just like other commands, e.g. 
migrate_set_speed 10G); this command will create the optimization thread. In 
this way, creation of the thread is under libvirt's control, and libvirt can 
then support tuning the thread (e.g. pinning it to any pCPU), right?


> So, if this balloon driver thread needs to support tuning controls separately
> from other general purpose QEMU threads, then it would ideally use
> iothread infrastructure.
> 
> I don't particularly understand what this code is doing, but please consider
> whether NUMA has any impact on the work done in this thread. Specifically
> when the guest has multiple virtual NUMA nodes, each associated with a
> specific host NUMA node. If there is any memory intensive work being done
> here, then it might need to be executed on the correct host NUMA node
> according to the memory region being touched.
> 
 
I think it would not be significantly impacted by NUMA, because this 
optimization thread doesn't access the guest memory much, except for the 
virtqueue (even with an iothread, we may still not know which pCPU to pin to in 
order to match the virtqueue in the vNUMA case). Essentially, it gets the free 
page address and length, then clears bits from the migration dirty bitmap, 
which is allocated by QEMU itself.
So, I think adding the tunable support is nicer, but I'm not sure it would be 
required.

Best,
Wei


[virtio-dev] RE: [PATCH v29 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-03-25 Thread Wang, Wei W
On Monday, March 26, 2018 10:40 AM, Wang, Wei W wrote:
> Subject: [PATCH v29 3/4] mm/page_poison: expose page_poisoning_enabled
> to kernel modules
> 
> In some usages, e.g. virtio-balloon, a kernel module needs to know if page
> poisoning is in use. This patch exposes the page_poisoning_enabled function
> to kernel modules.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> ---
>  mm/page_poison.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_poison.c b/mm/page_poison.c index e83fd44..762b472
> 100644
> --- a/mm/page_poison.c
> +++ b/mm/page_poison.c
> @@ -17,6 +17,11 @@ static int early_page_poison_param(char *buf)
>  }
> early_param("page_poison", early_page_poison_param);
> 
> +/**
> + * page_poisoning_enabled - check if page poisoning is enabled
> + *
> + * Return true if page poisoning is enabled, or false if not.
> + */
>  bool page_poisoning_enabled(void)
>  {
>   /*
> @@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
> 
>   (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
>   debug_pagealloc_enabled()));
>  }
> +EXPORT_SYMBOL_GPL(page_poisoning_enabled);
> 
>  static void poison_page(struct page *page)
>  {
> --
> 2.7.4


Could we get a review of this patch? We've reviewed other parts, and this one 
seems to be the last part of this feature. Thanks.
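
For reference, a module would key its behaviour off the exported helper roughly 
like this (an illustrative sketch only; the helper name below is hypothetical 
and not part of the patch):

static bool balloon_free_page_hinting_allowed(void)
{
	/*
	 * Hinting conflicts with poisoning sanity checks: a hinted page
	 * that the host drops would fail the poison check when it is
	 * allocated again in the guest.
	 */
	return !page_poisoning_enabled() ||
	       IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY);
}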

Best,
Wei




[virtio-dev] RE: [PATCH v24 1/2] mm: support reporting free page blocks

2018-01-27 Thread Wang, Wei W
On Friday, January 26, 2018 11:00 PM, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 11:29:15AM +0800, Wei Wang wrote:
> > On 01/25/2018 09:41 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote:
> > > > This patch adds support to walk through the free page blocks in
> > > > the system and report them via a callback function. Some page
> > > > blocks may leave the free list after zone->lock is released, so it
> > > > is the caller's responsibility to either detect or prevent the use of 
> > > > such
> pages.
> > > >
> > > > One use example of this patch is to accelerate live migration by
> > > > skipping the transfer of free pages reported from the guest. A
> > > > popular method used by the hypervisor to track which part of
> > > > memory is written during live migration is to write-protect all
> > > > the guest memory. So, those pages that are reported as free pages
> > > > but are written after the report function returns will be captured
> > > > by the hypervisor, and they will be added to the next round of memory
> transfer.
> > > >
> > > > Signed-off-by: Wei Wang 
> > > > Signed-off-by: Liang Li 
> > > > Cc: Michal Hocko 
> > > > Cc: Michael S. Tsirkin 
> > > > Acked-by: Michal Hocko 
> > > > ---
> > > >   include/linux/mm.h |  6 
> > > >   mm/page_alloc.c| 91
> ++
> > > >   2 files changed, 97 insertions(+)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h index
> > > > ea818ff..b3077dd 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid,
> unsigned long * zones_size,
> > > > unsigned long zone_start_pfn, unsigned long 
> > > > *zholes_size);
> > > >   extern void free_initmem(void);
> > > > +extern void walk_free_mem_block(void *opaque,
> > > > +   int min_order,
> > > > +   bool (*report_pfn_range)(void *opaque,
> > > > +unsigned long pfn,
> > > > +unsigned long num));
> > > > +
> > > >   /*
> > > >* Free reserved pages within range [PAGE_ALIGN(start), end &
> PAGE_MASK)
> > > >* into the buddy system. The freed pages will be poisoned with
> > > > pattern diff --git a/mm/page_alloc.c b/mm/page_alloc.c index
> > > > 76c9688..705de22 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
> > > > show_swap_cache_info();
> > > >   }
> > > > +/*
> > > > + * Walk through a free page list and report the found pfn range
> > > > +via the
> > > > + * callback.
> > > > + *
> > > > + * Return false if the callback requests to stop reporting.
> > > > +Otherwise,
> > > > + * return true.
> > > > + */
> > > > +static bool walk_free_page_list(void *opaque,
> > > > +   struct zone *zone,
> > > > +   int order,
> > > > +   enum migratetype mt,
> > > > +   bool (*report_pfn_range)(void *,
> > > > +unsigned long,
> > > > +unsigned long))
> > > > +{
> > > > +   struct page *page;
> > > > +   struct list_head *list;
> > > > +   unsigned long pfn, flags;
> > > > +   bool ret;
> > > > +
> > > > +   spin_lock_irqsave(&zone->lock, flags);
> > > > +   list = &zone->free_area[order].free_list[mt];
> > > > +   list_for_each_entry(page, list, lru) {
> > > > +   pfn = page_to_pfn(page);
> > > > +   ret = report_pfn_range(opaque, pfn, 1 << order);
> > > > +   if (!ret)
> > > > +   break;
> > > > +   }
> > > > +   spin_unlock_irqrestore(&zone->lock, flags);
> > > > +
> > > > +   return ret;
> > > > +}
> > > There are two issues with this API. One is that it is not
> > > restarteable: if you return false, you start from the beginning. So
> > > no way to drop lock, do something slow and then proceed.
> > >
> > > Another is that you are using it to report free page hints.
> > > Presumably the point is to drop these pages - keeping them near head
> > > of the list and reusing the reported ones will just make everything
> > > slower invalidating the hint.
> > >
> > > How about rotating these pages towards the end of the list?
> > > Probably not on each call, callect reported pages and then move them
> > > to tail when we exit.
> >
> >
> > I'm not sure how this would help. For example, we have a list of 2M
> > free page blocks:
> > A-->B-->C-->D-->E-->F-->G-->H
> >
> > After reporting A and B, and put them to 

[virtio-dev] RE: [PATCH v24 1/2] mm: support reporting free page blocks

2018-01-27 Thread Wang, Wei W
On Saturday, January 27, 2018 5:44 AM, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 05:00:09PM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 26, 2018 at 11:29:15AM +0800, Wei Wang wrote:
> > > On 01/25/2018 09:41 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote:
> > > 2) If worse, all the blocks have been split into smaller blocks and
> > > used after the caller comes back.
> > >
> > > where could we continue?
> >
> > I'm not sure. But an alternative appears to be to hold a lock and just
> > block whoever wanted to use any pages.  Yes we are sending hints
> > faster but apparently something wanted these pages, and holding the
> > lock is interfering with this something.
> 
> I've been thinking about it. How about the following scheme:
> 1. register balloon to get a (new) callback when free list runs empty
> 2. take pages off the free list, add them to the balloon specific list
> 3. report to host
> 4. readd to free list at tail
> 5. if callback triggers, interrupt balloon reporting to host,
>    and readd to free list at tail

So in step 2, when we walk through the free page list, take each block, and 
add it to the balloon-specific list, is this performed under the mm lock? If it 
is still under the lock, then what would be the difference compared to walking 
through the free list and adding each block to the virtqueue?

Best,
Wei




[virtio-dev] RE: [Qemu-devel] [PATCH v1 1/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2018-01-17 Thread Wang, Wei W
On Wednesday, January 17, 2018 7:41 PM, Juan Quintela wrote:
> Wei Wang  wrote:
> > +void skip_free_pages_from_dirty_bitmap(RAMBlock *block, ram_addr_t
> offset,
> > +   size_t len) {
> > +long start = offset >> TARGET_PAGE_BITS,
> > + nr = len >> TARGET_PAGE_BITS;
> > +
> > +bitmap_clear(block->bmap, start, nr);
> 
> But what assures us that all the nr pages are dirty?

Actually the free page optimization is used for the bulk stage, where all the 
pages are treated as dirty. In patch 2, we have:

+    if (rs->free_page_support) {
+        bitmap_set(block->bmap, 1, pages);
+    } else {
+        bitmap_set(block->bmap, 0, pages);
+    }


> 
> > +ram_state->migration_dirty_pages -= nr;
> 
> This should be
> ram_state->migration_dirty_pages -=
>  count_ones(block->bmap, start, nr);
> 
> For a count_ones function, no?

Not really. "nr" is the number of bits to clear from the bitmap, so 
"migration_dirty_pages -= nr" shows how many dirty bits left in the bitmap.

One corner case I'm thinking about: when we do
bitmap_clear(block->bmap, start, nr);
would it be possible that "start + nr" exceeds the bitmap size? That is, would 
it be possible that the free page block (e.g. 2MB) from the guest crosses the 
QEMU RAM block boundary?
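
If that can really happen, a simple clamp would avoid it (an illustrative 
sketch only; it reuses block->used_length, which QEMU already tracks for each 
RAMBlock):

    unsigned long pages = block->used_length >> TARGET_PAGE_BITS;

    if (start >= pages) {
        return;
    }
    /* Never clear past the end of this RAMBlock's bitmap. */
    nr = MIN(nr, pages - start);
    bitmap_clear(block->bmap, start, nr);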


> Furthermore, we have one "optimization" and this only works for the second
> stages onward:
> 
> static inline
> unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>   unsigned long start) {
> unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
> unsigned long *bitmap = rb->bmap;
> unsigned long next;
> 
> if (rs->ram_bulk_stage && start > 0) {
> next = start + 1;
> } else {
> next = find_next_bit(bitmap, size, start);
> }
> 
> return next;
> }
> 
> So, for making this really work, we have more work to do.
> 
> Actually, what I think we should do was to _ask_ the guest which pages are
> used at the beggining, instead of just setting all pages as dirty, but that
> requires kernel changes and lot of search of corner cases.
> 

This series optimizes the bulk stage memory transfer. Do you mean you have 
another idea to optimize the second round and onward? How would you let the 
guest track pages that are written?

Best,
Wei






[virtio-dev] RE: [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-20 Thread Wang, Wei W
On Wednesday, December 20, 2017 8:26 PM, Matthew Wilcox wrote:
> On Wed, Dec 20, 2017 at 06:34:36PM +0800, Wei Wang wrote:
> > On 12/19/2017 10:05 PM, Tetsuo Handa wrote:
> > > I think xb_find_set() has a bug in !node path.
> >
> > I think we can probably remove the "!node" path for now. It would be
> > good to get the fundamental part in first, and leave optimization to
> > come as separate patches with corresponding test cases in the future.
> 
> You can't remove the !node path.  You'll see !node when the highest set bit
> is less than 1024.  So do something like this:
> 
>   unsigned long bit;
>   xb_preload(GFP_KERNEL);
>   xb_set_bit(xb, 700);
>   xb_preload_end();
>   bit = xb_find_set(xb, ULONG_MAX, 0);
>   assert(bit == 700);

The above test will result in "!node with bitmap != NULL", and it goes to the 
regular "if (bitmap)" path, which finds 700.

A better test would be
...
xb_set_bit(xb, 700);
assert(xb_find_set(xb, ULONG_MAX, 800) == ULONG_MAX);
...

The first try with the "if (bitmap)" path doesn't find a set bit, and the 
remaining tries will always result in "!node && !bitmap", which implies there 
are no set bits anymore and no need to keep trying in this case.

So, I think we can change it to

if (!node && !bitmap)
	return size;
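
Putting the two cases together, the test would then be (a sketch, following the 
style of the test cases above):

	unsigned long bit;

	xb_preload(GFP_KERNEL);
	xb_set_bit(xb, 700);
	xb_preload_end();

	bit = xb_find_set(xb, ULONG_MAX, 0);
	assert(bit == 700);
	/* No set bit at or after 800, so the search returns the size. */
	assert(xb_find_set(xb, ULONG_MAX, 800) == ULONG_MAX);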


Best,
Wei






[virtio-dev] RE: [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Wang, Wei W


> -Original Message-
> From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
> Sent: Sunday, December 17, 2017 6:22 PM
> To: Wang, Wei W <wei.w.w...@intel.com>; wi...@infradead.org
> Cc: virtio-dev@lists.oasis-open.org; linux-ker...@vger.kernel.org; qemu-
> de...@nongnu.org; virtualizat...@lists.linux-foundation.org;
> k...@vger.kernel.org; linux...@kvack.org; m...@redhat.com;
> mho...@kernel.org; a...@linux-foundation.org; mawil...@microsoft.com;
> da...@redhat.com; cornelia.h...@de.ibm.com;
> mgor...@techsingularity.net; aarca...@redhat.com;
> amit.s...@redhat.com; pbonz...@redhat.com;
> liliang.opensou...@gmail.com; yang.zhang...@gmail.com;
> quan...@aliyun.com; ni...@redhat.com; r...@redhat.com
> Subject: Re: [PATCH v19 3/7] xbitmap: add more operations
> 
> Wei Wang wrote:
> > > But passing GFP_NOWAIT means that we can handle allocation failure.
> > > There is no need to use preload approach when we can handle allocation
> failure.
> >
> > I think the reason we need xb_preload is because radix tree insertion
> > needs the memory being preallocated already (it couldn't suffer from
> > memory failure during the process of inserting, probably because
> > handling the failure there isn't easy, Matthew may know the backstory
> > of
> > this)
> 
> According to https://lwn.net/Articles/175432/ , I think that preloading is
> needed only when failure to insert an item into a radix tree is a significant
> problem.
> That is, when failure to insert an item into a radix tree is not a problem, I
> think that we don't need to use preloading.

It also mentions that the preload attempts to allocate sufficient memory to 
*guarantee* that the next radix tree insertion cannot fail.

If we check radix_tree_node_alloc(), the comment there says "this assumes that 
the caller has performed appropriate preallocation".

So, I think we would risk triggering an issue if we didn't use preload().

> >
> > So, I think we can handle the memory failure with xb_preload, which
> > stops going into the radix tree APIs, but shouldn't call radix tree
> > APIs without the related memory preallocated.
> 
> It seems to me that virtio-ballon case has no problem without using
> preloading.

Why is that?

Best,
Wei





[virtio-dev] RE: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-11 Thread Wang, Wei W
On Monday, December 11, 2017 7:12 PM, Stefan Hajnoczi wrote:
> On Sat, Dec 09, 2017 at 04:23:17PM +0000, Wang, Wei W wrote:
> > On Friday, December 8, 2017 4:34 PM, Stefan Hajnoczi wrote:
> > > On Fri, Dec 8, 2017 at 6:43 AM, Wei Wang <wei.w.w...@intel.com>
> wrote:
> > > > On 12/08/2017 07:54 AM, Michael S. Tsirkin wrote:
> > > >>
> > > >> On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> > > >>>
> > > >>> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> > > >>> <m...@redhat.com>
> > > > Thanks Stefan and Michael for the sharing and discussion. I 
> > > > think above 3 and 4 are debatable (e.g. whether it is simpler 
> > > > really depends). 1 and 2 are implementations, I think both 
> > > > approaches could implement the device that way. We originally 
> > > > thought about one device and driver to support all types (called 
> > > > it transformer sometimes :-) ), that would look interesting from 
> > > > research point of view, but from real usage point of view, I 
> > > > think it would be better to have them separated,
> > > because:
> > > > - different device types have different driver logic, mixing 
> > > > them together would cause the driver to look messy. Imagine that 
> > > > a networking driver developer has to go over the block related 
> > > > code to debug, that also increases the difficulty.
> > >
> > > I'm not sure I understand where things get messy because:
> > > 1. The vhost-pci device implementation in QEMU relays messages but 
> > > has no device logic, so device-specific messages like 
> > > VHOST_USER_NET_SET_MTU are trivial at this layer.
> > > 2. vhost-user slaves only handle certain vhost-user protocol messages.
> > > They handle device-specific messages for their device type only.
> > > This is like vhost drivers today where the ioctl() function 
> > > returns an error if the ioctl is not supported by the device.  It's not 
> > > messy.
> > >
> > > Where are you worried about messy driver logic?
> >
> > Probably I didn’t explain well, please let me summarize my thought a 
> > little
> bit, from the perspective of the control path and data path.
> >
> > Control path: the vhost-user messages - I would prefer just have the 
> > interaction between QEMUs, instead of relaying to the GuestSlave, 
> > because
> > 1) I think the claimed advantage (easier to debug and develop) 
> > doesn’t seem very convincing
> 
> You are defining a mapping from the vhost-user protocol to a custom 
> virtio device interface.  Every time the vhost-user protocol (feature 
> bits, messages,
> etc) is extended it will be necessary to map this new extension to the 
> virtio device interface.
> 
> That's non-trivial.  Mistakes are possible when designing the mapping.
> Using the vhost-user protocol as the device interface minimizes the 
> effort and risk of mistakes because most messages are relayed 1:1.
> 
> > 2) some messages can be directly answered by QemuSlave , and some
> messages are not useful to give to the GuestSlave (inside the VM), 
> e.g. fds, VhostUserMemoryRegion from SET_MEM_TABLE msg (the device 
> first maps the master memory and gives the offset (in terms of the 
> bar, i.e., where does it sit in the bar) of the mapped gpa to the 
> guest. if we give the raw VhostUserMemoryRegion to the guest, that wouldn’t 
> be usable).
> 
> I agree that QEMU has to handle some of messages, but it should still 
> relay all (possibly modified) messages to the guest.
> 
> The point of using the vhost-user protocol is not just to use a 
> familiar binary encoding, it's to match the semantics of vhost-user 
> 100%.  That way the vhost-user software stack can work either in host 
> userspace or with vhost-pci without significant changes.
> 
> Using the vhost-user protocol as the device interface doesn't seem any 
> harder than defining a completely new virtio device interface.  It has 
> the advantages that I've pointed out:
> 
> 1. Simple 1:1 mapping for most that is easy to maintain as the
>vhost-user protocol grows.
> 
> 2. Compatible with vhost-user so slaves can run in host userspace
>or the guest.
> 
> I don't see why it makes sense to define new device interfaces for 
> each device type and create a software stack that is incompatible with 
> vhost-user.


I think this 1:1 mapping wouldn't be easy:

1) We will have 2 Qemu side slaves to achieve this bidirectional relaying, that 
is, the working model will b

[virtio-dev] RE: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-09 Thread Wang, Wei W
On Friday, December 8, 2017 4:34 PM, Stefan Hajnoczi wrote:
> On Fri, Dec 8, 2017 at 6:43 AM, Wei Wang  wrote:
> > On 12/08/2017 07:54 AM, Michael S. Tsirkin wrote:
> >>
> >> On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> >>>
> >>> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> > Thanks Stefan and Michael for the sharing and discussion. I think
> > above 3 and 4 are debatable (e.g. whether it is simpler really
> > depends). 1 and 2 are implementations, I think both approaches could
> > implement the device that way. We originally thought about one device
> > and driver to support all types (called it transformer sometimes :-)
> > ), that would look interesting from research point of view, but from
> > real usage point of view, I think it would be better to have them separated,
> because:
> > - different device types have different driver logic, mixing them
> > together would cause the driver to look messy. Imagine that a
> > networking driver developer has to go over the block related code to
> > debug, that also increases the difficulty.
> 
> I'm not sure I understand where things get messy because:
> 1. The vhost-pci device implementation in QEMU relays messages but has no
> device logic, so device-specific messages like VHOST_USER_NET_SET_MTU are
> trivial at this layer.
> 2. vhost-user slaves only handle certain vhost-user protocol messages.
> They handle device-specific messages for their device type only.  This is like
> vhost drivers today where the ioctl() function returns an error if the ioctl 
> is
> not supported by the device.  It's not messy.
> 
> Where are you worried about messy driver logic?

Probably I didn't explain well; let me summarize my thoughts a little bit, 
from the perspective of the control path and the data path.

Control path: the vhost-user messages - I would prefer to just have the 
interaction between the QEMUs, instead of relaying to the GuestSlave, because
1) I think the claimed advantage (easier to debug and develop) doesn't seem 
very convincing
2) some messages can be directly answered by the QemuSlave, and some messages 
are not useful to give to the GuestSlave (inside the VM), e.g. fds, or the 
VhostUserMemoryRegion from the SET_MEM_TABLE msg (the device first maps the 
master memory and gives the guest the offset of the mapped gpa in terms of the 
bar, i.e., where it sits in the bar; if we gave the raw VhostUserMemoryRegion 
to the guest, that wouldn't be usable).


Data path: that's the discussion we had about one driver vs. separate drivers 
for different device types, and this is not related to the control path.
I meant that if we have one driver for all the types, that driver would look 
messy, because each type has its own data sending/receiving logic. For example, 
the net type deals with a pair of tx and rx queues, and transmission is skb 
based (e.g. xmit_skb), while the block type deals with a request queue. If we 
have one driver, then the driver has to include all of these together.


The last part is whether to make it a virtio device or a regular PCI device.
I don't have a strong preference. I think a virtio device works fine (e.g. use 
some BAR area to create ioeventfds to solve the "no virtqueue, no fds" issue, 
if you and Michael think that's acceptable), and we can reuse some other things 
from virtio, like feature negotiation. But if you and Michael decide to make it 
a regular PCI device, I think that would also work.

Best,
Wei


[virtio-dev] RE: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-09 Thread Wang, Wei W
On Friday, December 8, 2017 10:28 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 08, 2017 at 06:08:05AM +, Stefan Hajnoczi wrote:
> > On Thu, Dec 7, 2017 at 11:54 PM, Michael S. Tsirkin 
> wrote:
> > > On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> > >> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> wrote:
> > >> > On Thu, Dec 07, 2017 at 05:29:14PM +, Stefan Hajnoczi wrote:
> > >> >> On Thu, Dec 7, 2017 at 4:47 PM, Michael S. Tsirkin 
> wrote:
> > >> >> > On Thu, Dec 07, 2017 at 04:29:45PM +, Stefan Hajnoczi wrote:
> > >> >> >> On Thu, Dec 7, 2017 at 2:02 PM, Michael S. Tsirkin
>  wrote:
> > >> >> >> > On Thu, Dec 07, 2017 at 01:08:04PM +, Stefan Hajnoczi
> wrote:
> > >> >> >> >> Instead of responding individually to these points, I hope
> > >> >> >> >> this will explain my perspective.  Let me know if you do
> > >> >> >> >> want individual responses, I'm happy to talk more about
> > >> >> >> >> the points above but I think the biggest difference is our
> perspective on this:
> > >> >> >> >>
> > >> >> >> >> Existing vhost-user slave code should be able to run on
> > >> >> >> >> top of vhost-pci.  For example, QEMU's
> > >> >> >> >> contrib/vhost-user-scsi/vhost-user-scsi.c should work
> > >> >> >> >> inside the guest with only minimal changes to the source
> > >> >> >> >> file (i.e. today it explicitly opens a UNIX domain socket
> > >> >> >> >> and that should be done by libvhost-user instead).  It
> > >> >> >> >> shouldn't be hard to add vhost-pci vfio support to
> contrib/libvhost-user/ alongside the existing UNIX domain socket code.
> > >> >> >> >>
> > >> >> >> >> This seems pretty easy to achieve with the vhost-pci PCI
> > >> >> >> >> adapter that I've described but I'm not sure how to
> > >> >> >> >> implement libvhost-user on top of vhost-pci vfio if the
> > >> >> >> >> device doesn't expose the vhost-user protocol.
> > >> >> >> >>
> > >> >> >> >> I think this is a really important goal.  Let's use a
> > >> >> >> >> single vhost-user software stack instead of creating a
> > >> >> >> >> separate one for guest code only.
> > >> >> >> >>
> > >> >> >> >> Do you agree that the vhost-user software stack should be
> > >> >> >> >> shared between host userspace and guest code as much as
> possible?
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > The sharing you propose is not necessarily practical
> > >> >> >> > because the security goals of the two are different.
> > >> >> >> >
> > >> >> >> > It seems that the best motivation presentation is still the
> > >> >> >> > original rfc
> > >> >> >> >
> > >> >> >> > http://virtualization.linux-foundation.narkive.com/A7FkzAgp
> > >> >> >> > /rfc-vhost-user-enhancements-for-vm2vm-communication
> > >> >> >> >
> > >> >> >> > So comparing with vhost-user iotlb handling is different:
> > >> >> >> >
> > >> >> >> > With vhost-user guest trusts the vhost-user backend on the host.
> > >> >> >> >
> > >> >> >> > With vhost-pci we can strive to limit the trust to qemu only.
> > >> >> >> > The switch running within a VM does not have to be trusted.
> > >> >> >>
> > >> >> >> Can you give a concrete example?
> > >> >> >>
> > >> >> >> I have an idea about what you're saying but it may be wrong:
> > >> >> >>
> > >> >> >> Today the iotlb mechanism in vhost-user does not actually
> > >> >> >> enforce memory permissions.  The vhost-user slave has full
> > >> >> >> access to mmapped memory regions even when iotlb is enabled.
> > >> >> >> Currently the iotlb just adds an indirection layer but no
> > >> >> >> real security.  (Is this correct?)
> > >> >> >
> > >> >> > Not exactly. iotlb protects against malicious drivers within guest.
> > >> >> > But yes, not against a vhost-user driver on the host.
> > >> >> >
> > >> >> >> Are you saying the vhost-pci device code in QEMU should
> > >> >> >> enforce iotlb permissions so the vhost-user slave guest only
> > >> >> >> has access to memory regions that are allowed by the iotlb?
> > >> >> >
> > >> >> > Yes.
> > >> >>
> > >> >> Okay, thanks for confirming.
> > >> >>
> > >> >> This can be supported by the approach I've described.  The
> > >> >> vhost-pci QEMU code has control over the BAR memory so it can
> > >> >> prevent the guest from accessing regions that are not allowed by the
> iotlb.
> > >> >>
> > >> >> Inside the guest the vhost-user slave still has the memory
> > >> >> region descriptions and sends iotlb messages.  This is
> > >> >> completely compatible with the libvirt-user APIs and existing
> > >> >> vhost-user slave code can run fine.  The only unique thing is
> > >> >> that guest accesses to memory regions not allowed by the iotlb do
> not work because QEMU has prevented it.
> > >> >
> > >> > I don't think this can work since suddenly you need to map full
> > >> > IOMMU address space into BAR.
> > >>
> > >> The BAR covers all guest RAM
> > >> but QEMU can set up MemoryRegions that hide parts from the guest
> > >> 

RE: [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-06 Thread Wang, Wei W
On Wednesday, December 6, 2017 9:50 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 05, 2017 at 11:33:09AM +0800, Wei Wang wrote:
> > Vhost-pci is a point-to-point based inter-VM communication solution.
> > This patch series implements the vhost-pci-net device setup and
> > emulation. The device is implemented as a virtio device, and it is set
> > up via the vhost-user protocol to get the neessary info (e.g the
> > memory info of the remote VM, vring info).
> >
> > Currently, only the fundamental functions are implemented. More
> > features, such as MQ and live migration, will be updated in the future.
> >
> > The DPDK PMD of vhost-pci has been posted to the dpdk mailinglist here:
> > http://dpdk.org/ml/archives/dev/2017-November/082615.html
> 
> I have asked questions about the scope of this feature.  In particular, I 
> think
> it's best to support all device types rather than just virtio-net.  Here is a
> design document that shows how this can be achieved.
> 
> What I'm proposing is different from the current approach:
> 1. It's a PCI adapter (see below for justification) 2. The vhost-user 
> protocol is
> exposed by the device (not handled 100% in
>QEMU).  Ultimately I think your approach would also need to do this.
> 
> I'm not implementing this and not asking you to implement it.  Let's just use
> this for discussion so we can figure out what the final vhost-pci will look 
> like.
> 
> Please let me know what you think, Wei, Michael, and others.
> 

Thanks for sharing the thoughts. If I understand it correctly, the key 
difference is that this approach tries to relay every vhost-user msg to the 
guest. I'm not sure about the benefits of doing this.
To make the data plane (i.e. the driver sending/receiving packets) work, I 
think the memory info and vring info are mostly enough. Other things like 
callfd and kickfd don't need to be sent to the guest; they are needed by QEMU 
only for the eventfd and irqfd setup.



> ---
> vhost-pci device specification
> ---
> The vhost-pci device allows guests to act as vhost-user slaves.  This enables
> appliance VMs like network switches or storage targets to back devices in
> other VMs.  VM-to-VM communication is possible without vmexits using
> polling mode drivers.
> 
> The vhost-user protocol has been used to implement virtio devices in
> userspace processes on the host.  vhost-pci maps the vhost-user protocol to
> a PCI adapter so guest software can perform virtio device emulation.
> This is useful in environments where high-performance VM-to-VM
> communication is necessary or where it is preferrable to deploy device
> emulation as VMs instead of host userspace processes.
> 
> The vhost-user protocol involves file descriptor passing and shared memory.
> This precludes vhost-user slave implementations over virtio-vsock, virtio-
> serial, or TCP/IP.  Therefore a new device type is needed to expose the
> vhost-user protocol to guests.
> 
> The vhost-pci PCI adapter has the following resources:
> 
> Queues (used for vhost-user protocol communication):
> 1. Master-to-slave messages
> 2. Slave-to-master messages
> 
> Doorbells (used for slave->guest/master events):
> 1. Vring call (one doorbell per virtqueue) 2. Vring err (one doorbell per
> virtqueue) 3. Log changed
> 
> Interrupts (used for guest->slave events):
> 1. Vring kick (one MSI per virtqueue)
> 
> Shared Memory BARs:
> 1. Guest memory
> 2. Log
> 
> Master-to-slave queue:
> The following vhost-user protocol messages are relayed from the vhost-user
> master.  Each message follows the vhost-user protocol VhostUserMsg layout.
> 
> Messages that include file descriptor passing are relayed but do not carry 
> file
> descriptors.  The relevant resources (doorbells, interrupts, or shared memory
> BARs) are initialized from the file descriptors prior to the message becoming
> available on the Master-to-Slave queue.
> 
> Resources must only be used after the corresponding vhost-user message
> has been received.  For example, the Vring call doorbell can only be used
> after VHOST_USER_SET_VRING_CALL becomes available on the Master-to-
> Slave queue.
> 
> Messages must be processed in order.
> 
> The following vhost-user protocol messages are relayed:
>  * VHOST_USER_GET_FEATURES
>  * VHOST_USER_SET_FEATURES
>  * VHOST_USER_GET_PROTOCOL_FEATURES
>  * VHOST_USER_SET_PROTOCOL_FEATURES
>  * VHOST_USER_SET_OWNER
>  * VHOST_USER_SET_MEM_TABLE
>The shared memory is available in the corresponding BAR.
>  * VHOST_USER_SET_LOG_BASE
>The shared memory is available in the corresponding BAR.
>  * VHOST_USER_SET_LOG_FD
>The logging file descriptor can be signalled through the logging
>virtqueue.
>  * VHOST_USER_SET_VRING_NUM
>  * VHOST_USER_SET_VRING_ADDR
>  * VHOST_USER_SET_VRING_BASE
>  * VHOST_USER_GET_VRING_BASE
>  * VHOST_USER_SET_VRING_KICK
>This message is still needed because it may indicate only polling
>mode is supported.
>  * VHOST_USER_SET_VRING_CALL
>This message is still 

[virtio-dev] RE: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Wang, Wei W
On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > > Wei Wang wrote:
> > >> + * @start: the start of the bit range, inclusive
> > >> + * @end: the end of the bit range, inclusive
> > >> + *
> > >> + * This function is used to clear a bit in the xbitmap. If all the
> > >> +bits of the
> > >> + * bitmap are 0, the bitmap will be freed.
> > >> + */
> > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> > >> +unsigned long end) {
> > >> +struct radix_tree_root *root = &xb->xbrt;
> > >> +struct radix_tree_node *node;
> > >> +void **slot;
> > >> +struct ida_bitmap *bitmap;
> > >> +unsigned int nbits;
> > >> +
> > >> +for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 
> > >> 1) {
> > >> +unsigned long index = start / IDA_BITMAP_BITS;
> > >> +unsigned long bit = start % IDA_BITMAP_BITS;
> > >> +
> > >> +bitmap = __radix_tree_lookup(root, index, &node, &slot);
> > >> +if (radix_tree_exception(bitmap)) {
> > >> +unsigned long ebit = bit + 2;
> > >> +unsigned long tmp = (unsigned long)bitmap;
> > >> +
> > >> +nbits = min(end - start + 1, BITS_PER_LONG - 
> > >> ebit);
> > > "nbits = min(end - start + 1," seems to expect that start == end is
> > > legal for clearing only 1 bit. But this function is no-op if start == end.
> > > Please clarify what "inclusive" intended.
> >
> > If xb_clear_bit_range(xb,10,10), then it is effectively the same as
> > xb_clear_bit(10). Why would it be illegal?
> >
> > "@start inclusive" means that the @start will also be included to be
> > cleared.
> 
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.


How about "start <= end "?

Best,
Wei







RE: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2017-11-17 Thread Wang, Wei W
On Friday, November 17, 2017 8:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> > On 11/16/2017 09:27 PM, Wei Wang wrote:
> > > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature
> > > > > indicates the support of reporting hints of guest free pages to
> > > > > the host via virtio-balloon. The host requests the guest to
> > > > > report the free pages by sending commands via the virtio-balloon
> configuration registers.
> > > > >
> > > > > When the guest starts to report, the first element added to the
> > > > > free page vq is a sequence id of the start reporting command.
> > > > > The id is given by the host, and it indicates whether the
> > > > > following free pages correspond to the command. For example, the
> > > > > host may stop the report and start again with a new command id.
> > > > > The obsolete pages for the previous start command can be
> > > > > detected by the id dismatching on the host. The id is added to
> > > > > the vq using an output buffer, and the free pages are added to
> > > > > the vq using input buffer.
> > > > >
> > > > > Here are some explainations about the added configuration registers:
> > > > > - host2guest_cmd: a register used by the host to send commands
> > > > > to the guest.
> > > > > - guest2host_cmd: written by the guest to ACK to the host about
> > > > > the commands that have been received. The host will clear the
> > > > > corresponding bits on the host2guest_cmd register. The guest
> > > > > also uses this register to send commands to the host (e.g. when finish
> free page reporting).
> > > > > - free_page_cmd_id: the sequence id of the free page report
> > > > > command given by the host.
> > > > >
> > > > > Signed-off-by: Wei Wang 
> > > > > Signed-off-by: Liang Li 
> > > > > Cc: Michael S. Tsirkin 
> > > > > Cc: Michal Hocko 
> > > > > ---
> > > > >
> > > > > +
> > > > > +static void report_free_page(struct work_struct *work) {
> > > > > +struct virtio_balloon *vb;
> > > > > +
> > > > > +vb = container_of(work, struct virtio_balloon,
> > > > > report_free_page_work);
> > > > > +report_free_page_cmd_id(vb);
> > > > > +walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > > > +/*
> > > > > + * The last few free page blocks that were added may not reach 
> > > > > the
> > > > > + * batch size, but need a kick to notify the device to
> > > > > handle them.
> > > > > + */
> > > > > +virtqueue_kick(vb->free_page_vq);
> > > > > +report_free_page_end(vb);
> > > > > +}
> > > > > +
> > > > I think there's an issue here: if pages are poisoned and
> > > > hypervisor subsequently drops them, testing them after allocation
> > > > will trigger a false positive.
> > > >
> > > > The specific configuration:
> > > >
> > > > PAGE_POISONING on
> > > > PAGE_POISONING_NO_SANITY off
> > > > PAGE_POISONING_ZERO off
> > > >
> > > >
> > > > Solutions:
> > > > 1. disable the feature in that configuration
> > > > suggested as an initial step
> > >
> > > Thanks for the finding.
> > > Similar to this option: I'm thinking could we make
> > > walk_free_mem_block() simply return if that option is on?
> > > That is, at the beginning of the function:
> > > if (!page_poisoning_enabled())
> > > return;
> > >
> >
> >
> > Thought about it more, I think it would be better to put this logic to
> > virtio_balloon:
> >
> > send_free_page_cmd_id(vb, &vb->start_cmd_id);
> > if (page_poisoning_enabled() &&
> > !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
> > walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > send_free_page_cmd_id(vb, &vb->stop_cmd_id);
> >
> >
> > walk_free_mem_block() should be a more generic API, and this potential
> > page poisoning issue is specific to live migration which is only one
> > use case of this function, so I think it is better to handle it in the
> > special use case itself.
> >
> > Best,
> > Wei
> >
> 
> It's a quick work-around but it doesn't make me very happy.
> 
> AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
> If this never uses free page hinting at all, it will be much less useful for
> debugging guests.
> 

I understand your concern. I think people who use debugging guests don't regard 
performance as the first priority, and most vendors usually wouldn't use 
debugging guests for their products.

How about taking it as the initial solution? We can explore more solutions 
after this series is done.

Best,
Wei






[virtio-dev] RE: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

2017-10-02 Thread Wang, Wei W
On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > + struct virtio_balloon_ctrlq_cmd *cmd,
> > + bool inbuf)
> > +{
> > +   struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +   ctrlq_add_cmd(vq, cmd, inbuf);
> > +   if (!inbuf) {
> > +   /*
> > +* All the input cmd buffers are replenished here.
> > +* This is necessary because the input cmd buffers are lost
> > +* after live migration. The device needs to rewind all of
> > +* them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for 
> this.


I think this is something that the current live migration lacks - what the 
device has read from the vq is not transferred during live migration. An 
example is the stat_vq_elem:
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

Anything that is added to the vq and needs to be held by the device for later 
use has to consider that live migration might happen at any time, and it then 
needs to be re-taken from the vq by the device on the destination machine.

So, even without this live migration optimization feature, I think all the 
things that are added to the vq for the device to hold need a way for the 
device to rewind them back from the vq - re-adding all the elements to the vq 
is a trick to keep a record of all of them on the vq, so that the device-side 
rewinding can work.

Please let me know if anything is missed or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq) {
> > +   struct virtio_balloon *vb = vq->vdev->priv;
> > +   struct virtio_balloon_ctrlq_cmd *msg;
> > +   unsigned int class, cmd, len;
> > +
> > +   msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +   if (unlikely(!msg))
> > +   return;
> > +
> > +   /* The outbuf is sent by the host for recycling, so just return. */
> > +   if (msg == &vb->free_page_cmd_out)
> > +   return;
> > +
> > +   class = virtio32_to_cpu(vb->vdev, msg->class);
> > +   cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +   switch (class) {
> > +   case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +   if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +   vb->report_free_page_stop = true;
> > +   } else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +   vb->report_free_page_stop = false;
> > +   queue_work(vb->balloon_wq, &vb->report_free_page_work);
> > +   }
> > +   vb->free_page_cmd_in.class =
> > +
>   VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +   ctrlq_send_cmd(vb, >free_page_cmd_in, true);
> > +   break;
> > +   default:
> > +   dev_warn(>vdev->dev, "%s: cmd class not supported\n",
> > +__func__);
> > +   }
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start to report free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal state, so invoking it will 
always start from the beginning.


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START   1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on 
> guest
> is never a great idea.
> 


I think the device can flush the vq (pop out what's left in the vq and push 
the entries back) right after the Stop command is sent to the guest, rather 
than doing the flush when the second live migration is initiated. The entries 
pushed back to the vq will be in the used ring; what would the device need to 
wait for?


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that 
> works.
> 

Thanks for the suggestion. Probably I haven't fully understood it. Please see 
the example
below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) 

[virtio-dev] RE: [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()

2017-09-29 Thread Wang, Wei W
On Monday, September 11, 2017 9:27 PM, Matthew Wilcox wrote
> On Mon, Aug 28, 2017 at 06:08:30PM +0800, Wei Wang wrote:
> > +/**
> > + *  xb_zero - zero a range of bits in the xbitmap
> > + *  @xb: the xbitmap that the bits reside in
> > + *  @start: the start of the range, inclusive
> > + *  @end: the end of the range, inclusive  */ void xb_zero(struct xb
> > +*xb, unsigned long start, unsigned long end) {
> > +   unsigned long i;
> > +
> > +   for (i = start; i <= end; i++)
> > +   xb_clear_bit(xb, i);
> > +}
> > +EXPORT_SYMBOL(xb_zero);
> 
> Um.  This is not exactly going to be quick if you're clearing a range of bits.
> I think it needs to be more along the lines of this:
> 
> void xb_clear(struct xb *xb, unsigned long start, unsigned long end) {
> struct radix_tree_root *root = &xb->xbrt;
> struct radix_tree_node *node;
> void **slot;
> struct ida_bitmap *bitmap;
> 
> for (; end < start; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> unsigned long index = start / IDA_BITMAP_BITS;
> unsigned long bit = start % IDA_BITMAP_BITS;
> 
> bitmap = __radix_tree_lookup(root, index, &node, &slot);
> if (radix_tree_exception(bitmap)) {
> unsigned long ebit = bit + 2;
> unsigned long tmp = (unsigned long)bitmap;
> if (ebit >= BITS_PER_LONG)
> continue;
> tmp &= ... something ...;
> if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> __radix_tree_delete(root, node, slot);
> else
> rcu_assign_pointer(*slot, (void *)tmp);
> } else if (bitmap) {
> unsigned int nbits = end - start + 1;
> if (nbits + bit > IDA_BITMAP_BITS)
> nbits = IDA_BITMAP_BITS - bit;
> bitmap_clear(bitmap->bitmap, bit, nbits);
> if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
> kfree(bitmap);
> __radix_tree_delete(root, node, slot);
> }
> }
> }
> }
> 
> Also note that this should be called xb_clear(), not xb_zero() to fit in with
> bitmap_clear().  And this needs a thorough test suite testing all values for 
> 'start'
> and 'end' between 0 and at least 1024; probably much higher.  And a variable
> number of bits need to be set before calling
> xb_clear() in the test suite.
> 
> Also, this implementation above is missing a few tricks.  For example, if 
> 'bit' is 0
> and 'nbits' == IDA_BITMAP_BITS, we can simply call kfree without first zeroing
> out the bits and then checking if the whole thing is zero.

Thanks for the optimization suggestions. We've seen a significant improvement 
in the ballooning time. Some other optimizations (stated in the changelog) 
haven't been included in the new version. If possible, we can leave those to a 
second-step optimization outside this patch series.

Best,
Wei




[virtio-dev] RE: [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

2017-09-05 Thread Wang, Wei W
Ping for comments if possible. Thanks.

On Monday, August 28, 2017 6:09 PM, Wang, Wei W wrote:
> [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data plane
> separated. In other words, the control related data of each feature will be 
> sent
> via the ctrl_vq cmds, meanwhile each feature may have its own data plane vq.
> 
> Free page report is the the first new feature controlled via ctrl_vq, and a 
> new
> cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest to
> start the free page reporting work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is used bidirectionally. The
> guest would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop of the
> reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the host.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Signed-off-by: Liang Li <liang.z...@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c | 247 +-
> --
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 242 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c 
> b/drivers/virtio/virtio_balloon.c index
> 8ecc1d4..1d384a4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
> 
>  struct virtio_balloon {
>   struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +  *free_page_vq;
> +
> + /* Balloon's own wq for cpu-intensive work items */
> + struct workqueue_struct *balloon_wq;
> + /* The work items submitted to the balloon wq are listed here */
> + struct work_struct report_free_page_work;
> 
>   /* The balloon servicing is delegated to a freezable workqueue. */
>   struct work_struct update_balloon_stats_work; @@ -65,6 +71,9 @@
> struct virtio_balloon {
>   spinlock_t stop_update_lock;
>   bool stop_update;
> 
> + /* Stop reporting free pages */
> + bool report_free_page_stop;
> +
>   /* Waiting for host to ack the pages we released. */
>   wait_queue_head_t acked;
> 
> @@ -93,6 +102,11 @@ struct virtio_balloon {
> 
>   /* To register callback in oom notifier call chain */
>   struct notifier_block nb;
> +
> + /* Host to guest ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> + /* Guest to Host ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
> 
>  static struct virtio_device_id id_table[] = { @@ -177,6 +191,26 @@ static 
> void
> send_balloon_page_sg(struct virtio_balloon *vb,
>   }
>  }
> 
> +static void send_free_page_sg(struct virtqueue *vq, void *addr,
> +uint32_t size) {
> + unsigned int len;
> + int err = -ENOSPC;
> +
> + do {
> + if (vq->num_free) {
> + err = add_one_sg(vq, addr, size);
> + /* Sanity check: this can't really happen */
> + WARN_ON(err);
> + if (!err)
> + virtqueue_kick(vq);
> + }
> +
> + /* Release entries if there are */
> + while (virtqueue_get_buf(vq, &len))
> + ;
> + } while (err == -ENOSPC && vq->num_free); }
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -525,42 +559,206 @@ static void update_balloon_size_func(struct
> work_struct *work)
>   queue_work(system_freezable_wq, work);  }
> 
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +unsigned long nr_pages)
> +{
> + struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> + void *addr = (void *)pfn_to_kaddr(pfn);
> + uint32_t len = nr_pages << PAGE_SHIFT;
> +
> + if (vb->report_free_page_stop)
> + return 1;
> +
> + send_free_page_sg(vb->free_page_vq, addr, len);
> +
> + return 0;
> +}
> +
> +static voi

[virtio-dev] RE: [PATCH v13 4/5] mm: support reporting free page blocks

2017-08-03 Thread Wang, Wei W
On Thursday, August 3, 2017 9:51 PM, Michal Hocko: 
> As I've said earlier. Start simple optimize incrementally with some numbers to
> justify a more subtle code.
> --

OK. Let's start with the simple implementation as you suggested.

Best,
Wei

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] RE: [PATCH v13 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-08-03 Thread Wang, Wei W
On Thursday, August 3, 2017 10:23 PM, Michael S. Tsirkin wrote:
> On Thu, Aug 03, 2017 at 02:38:17PM +0800, Wei Wang wrote:
> > +static void send_one_sg(struct virtio_balloon *vb, struct virtqueue *vq,
> > +   void *addr, uint32_t size)
> > +{
> > +   struct scatterlist sg;
> > +   unsigned int len;
> > +
> > +   sg_init_one(&sg, addr, size);
> > +   while (unlikely(virtqueue_add_inbuf(vq, &sg, 1, vb, GFP_KERNEL)
> > +   == -ENOSPC)) {
> > +   /*
> > +* It is uncommon to see the vq is full, because the sg is sent
> > +* one by one and the device is able to handle it in time. But
> > +* if that happens, we kick and wait for an entry is released.
> 
> is released -> to get used.
> 
> > +*/
> > +   virtqueue_kick(vq);
> > +   while (!virtqueue_get_buf(vq, &len) &&
> > +  !virtqueue_is_broken(vq))
> > +   cpu_relax();
> 
> Please rework to use wait_event in that case too.

For the balloon page case here, it is fine to use wait_event. But for the free 
page
case, I think it might not be suitable because the mm lock is being held.

Best,
Wei

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] RE: [PATCH v4] virtio-net: enable configurable tx queue size

2017-07-06 Thread Wang, Wei W
On Thursday, July 6, 2017 9:49 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 28, 2017 at 10:37:59AM +0800, Wei Wang wrote:
> > This patch enables the virtio-net tx queue size to be configurable
> > between 256 (the default queue size) and 1024 by the user when the
> > vhost-user backend is used.
> >
> > Currently, the maximum tx queue size for other backends is 512 due to
> > the following limitations:
> > - QEMU backend: the QEMU backend implementation in some cases may send
> > 1024+1 iovs to writev.
> > - Vhost_net backend: there are possibilities that the guest sends a
> > vring_desc of memory which crosses a MemoryRegion thereby generating
> > more than 1024 iovs after translation from guest-physical address in
> > the backend.
> >
> > Signed-off-by: Wei Wang 
> 
> Could you pls add a bit info about how this was tested?
> Was any special setup for dpdk necessary?

Yes, I used the vhost-user implementation in DPDK. So, on the host, I have 
the legacy ovs-dpdk setup ready (I'm using dpdk-stable-16.11.1 and 
openvswitch-2.6.1,
the setup steps can be found inside the source code directory).

When booting the guest, I have the following QEMU commands:
-chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,tx_queue_size=1024

To check the guest tx queue size, I simply added a printk() at the end of
virtnet_probe() to print out vi->sq->vq->num_free, which initially equals
the queue size.
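Roughly something like this (a sketch from memory rather than the exact diff I used;
the vi->sq dereference assumes the default single-queue case):

	/* at the end of virtnet_probe(), after the vqs have been created */
	pr_info("virtio-net: tx queue size = %u\n", vi->sq->vq->num_free);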

Then, I did Ping and netperf tests to transmit packets between VMs, which 
worked fine.

If the related configuration support in Libvirt is ready, I think we can get
the customer to try it in their test environment, too.

Best,
Wei






-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] RE: [PATCH v3 1/2] virtio-net: enable configurable tx queue size

2017-06-26 Thread Wang, Wei W
On Tuesday, June 27, 2017 9:51 AM, Eric Blake wrote:
> On 06/22/2017 09:32 PM, Wei Wang wrote:
> > This patch enables the virtio-net tx queue size to be configurable 
> > between 256 (the default queue size) and 1024 by the user when the 
> > vhost-user backend is used.
> 
> When sending a multi-patch series, don't forget the 0/2 cover letter.
> 

Thanks for the reminder, and I'll do it in the next version.

Best,
Wei 


RE: [virtio-dev] Re: [Qemu-devel] [PATCH v3 1/2] virtio-net: enable configurable tx queue size

2017-06-26 Thread Wang, Wei W
On Tuesday, June 27, 2017 10:20 AM, Jason Wang wrote:
> On 2017年06月27日 05:21, Michael S. Tsirkin wrote:
> > On Mon, Jun 26, 2017 at 06:34:25PM +0800, Wei Wang wrote:
> >> On 06/26/2017 04:05 PM, Jason Wang wrote:
> >>>
> But this patch in fact allows 1024 to be used even for vhost-kernel after
> migration from vhost-user?

As discussed before, we only consider migration with the same configuration at 
this stage. 

Best,
Wei


[virtio-dev] RE: [PATCH v11 0/6] Virtio-balloon Enhancement

2017-06-09 Thread Wang, Wei W
On Friday, June 9, 2017 6:42 PM, Wang, Wei W wrote:
> To: virtio-dev@lists.oasis-open.org; linux-ker...@vger.kernel.org; qemu-
> de...@nongnu.org; virtualizat...@lists.linux-foundation.org;
> k...@vger.kernel.org; linux...@kvack.org; m...@redhat.com;
> da...@redhat.com; Hansen, Dave <dave.han...@intel.com>;
> cornelia.h...@de.ibm.com; a...@linux-foundation.org;
> mgor...@techsingularity.net; aarca...@redhat.com; amit.s...@redhat.com;
> pbonz...@redhat.com; Wang, Wei W <wei.w.w...@intel.com>;
> liliang.opensou...@gmail.com
> Subject: [PATCH v11 0/6] Virtio-balloon Enhancement
> 
> This patch series enhances the existing virtio-balloon with the following new
> features:
> 1) fast ballooning: transfer ballooned pages between the guest and host in
> chunks, instead of one by one; and
> 2) cmdq: a new virtqueue to send commands between the device and driver.
> Currently, it supports commands to report memory stats (replace the old statq
> mechanism) and report guest unused pages.

v10->v11 changes:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support adding an indirect desc table to the virtqueue;
3) virtio_balloon: use the cmdq to report guest memory statistics.

> 
> Liang Li (1):
>   virtio-balloon: deflate via a page list
> 
> Wei Wang (5):
>   virtio-balloon: coding format cleanup
>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
>   mm: function to offer a page block on the free list
>   mm: export symbol of next_zone and first_online_pgdat
>   virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
> 
>  drivers/virtio/virtio_balloon.c | 781 --
> --
>  drivers/virtio/virtio_ring.c| 120 +-
>  include/linux/mm.h  |   5 +
>  include/linux/virtio.h  |   7 +
>  include/uapi/linux/virtio_balloon.h |  14 +
>  include/uapi/linux/virtio_ring.h|   3 +
>  mm/mmzone.c |   2 +
>  mm/page_alloc.c |  91 +
>  8 files changed, 950 insertions(+), 73 deletions(-)
> 
> --
> 2.7.4


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] RE: [Qemu-devel] [virtio-dev] Re: [virtio-dev] Re: [PATCH v2 00/16] Vhost-pci for inter-VM communication

2017-05-22 Thread Wang, Wei W
On Monday, May 22, 2017 10:28 AM, Jason Wang wrote:
> On 2017年05月19日 23:33, Stefan Hajnoczi wrote:
> > On Fri, May 19, 2017 at 11:10:33AM +0800, Jason Wang wrote:
> >> On 2017年05月18日 11:03, Wei Wang wrote:
> >>> On 05/17/2017 02:22 PM, Jason Wang wrote:
>  On 2017年05月17日 14:16, Jason Wang wrote:
> > On 2017年05月16日 15:12, Wei Wang wrote:
> >>> Hi:
> >>>
> >>> Care to post the driver codes too?
> >>>
> >> OK. It may take some time to clean up the driver code before post
> >> it out. You can first have a check of the draft at the repo here:
> >> https://github.com/wei-w-wang/vhost-pci-driver
> >>
> >> Best,
> >> Wei
> > Interesting, looks like there's one copy on tx side. We used to
> > have zerocopy support for tun for VM2VM traffic. Could you please
> > try to compare it with your vhost-pci-net by:
> >
> >>> We can analyze from the whole data path - from VM1's network stack
> >>> to send packets -> VM2's network stack to receive packets. The
> >>> number of copies are actually the same for both.
> >> That's why I'm asking you to compare the performance. The only reason
> >> for vhost-pci is performance. You should prove it.
> > There is another reason for vhost-pci besides maximum performance:
> >
> > vhost-pci makes it possible for end-users to run networking or storage
> > appliances in compute clouds.  Cloud providers do not allow end-users
> > to run custom vhost-user processes on the host so you need vhost-pci.
> >
> > Stefan
> 
> Then it has non NFV use cases and the question goes back to the performance
> comparing between vhost-pci and zerocopy vhost_net. If it does not perform
> better, it was less interesting at least in this case.
> 

Probably I can share what we got about vhost-pci and vhost-user:
https://github.com/wei-w-wang/vhost-pci-discussion/blob/master/vhost_pci_vs_vhost_user.pdf
Right now, I don’t have the environment to add the vhost_net test.

Btw, do you have data about vhost_net v.s. vhost_user?

Best,
Wei



[virtio-dev] RE: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ

2017-05-06 Thread Wang, Wei W
On 05/06/2017 06:21 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> > On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > > Add a new vq, miscq, to handle miscellaneous requests between the
> > > > device and the driver.
> > > >
> > > > This patch implemnts the
> VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > > implements
> > >
> > > > request sent from the device.
> > > Commands are sent from host and handled on guest.
> > > In fact how is this so different from stats?
> > > How about reusing the stats vq then? You can use one buffer for
> > > stats and one buffer for commands.
> > >
> >
> > The meaning of the two vqs is a little different. statq is used for
> > reporting statistics, while miscq is intended to be used to handle
> > miscellaneous requests from the guest or host
> 
> misc just means "anything goes". If you want it to mean "commands" name it so.

Ok, will change it.

> > (I think it can
> > also be used the other way around in the future when other new
> > features are added which need the guest to send requests and the host
> > to provide responses).
> >
> > I would prefer to have them separate, because:
> > If we plan to combine them, we need to put the previous statq related
> > implementation under miscq with a new command (I think we can't
> > combine them without using commands to distinguish the two features).
> 
> Right.

> > In this way, an old driver won't work with a new QEMU or a new driver
> > won't work with an old QEMU. Would this be considered as an issue
> > here?
> 
> Compatibility is and should always be handled using feature flags.  There's a
> feature flag for this, isn't it?

The negotiation of the existing feature flag, VIRTIO_BALLOON_F_STATS_VQ,
only indicates support for the old statq implementation. To move the statq
implementation under the cmdq, I think we would need a new feature flag for the
new statq implementation (a short usage sketch follows the define):
#define VIRTIO_BALLOON_F_CMDQ_STATS  5
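A short sketch of how the driver could pick the stats transport based on the
negotiated features (the two helper names are hypothetical; only the
feature-test call is existing API):

	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMDQ_STATS))
		stats_via_cmdq(vb);	/* hypothetical: new cmdq-based path */
	else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
		stats_via_statq(vb);	/* hypothetical: legacy statq path */

This way an old driver on a new QEMU, or a new driver on an old QEMU, simply
falls back to whatever both sides negotiated.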

What do you think?

Best,
Wei





-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS

2017-05-06 Thread Wang, Wei W
On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
> > On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 26, 2017 at 11:03:34AM +, Wang, Wei W wrote:
> > > > Hi Michael, could you please give some feedback?
> > > I'm sorry, I'm not sure feedback on what you are requesting.
> > Oh, just some trivial things (e.g. use a field in the header,
> > hdr->chunks to indicate the number of chunks in the payload) that
> > wasn't confirmed.
> >
> > I will prepare the new version with fixing the agreed issues, and we
> > can continue to discuss those parts if you still find them improper.
> >
> >
> > >
> > > The interface looks reasonable now, even though there's a way to
> > > make it even simpler if we can limit chunk size to 2G (in fact 4G -
> > > 1). Do you think we can live with this limitation?
> > Yes, I think we can. So, is it good to change to use the previous
> > 64-bit chunk format (52-bit base + 12-bit size)?
> 
> This isn't what I meant. virtio ring has descriptors with a 64 bit address 
> and 32 bit
> size.
> 
> If size < 4g is not a significant limitation, why not just use that to pass
> address/size in a standard s/g list, possibly using INDIRECT?

OK, I see your point, thanks. Let me post the two options here for analysis:
Option1 (what we have now):
struct virtio_balloon_page_chunk {
	__le64 chunk_num;
	struct virtio_balloon_page_chunk_entry entry[];
};
Option2:
struct virtio_balloon_page_chunk {
	__le64 chunk_num;
	struct scatterlist entry[];
};

I don't have an issue with changing it to Option2, but I would prefer Option1,
because I think there is no obvious difference between the two options,
while Option1 appears to have a few small advantages here (a sketch of the
assumed entry layout follows the list):
1) "struct virtio_balloon_page_chunk_entry" has a smaller size than
"struct scatterlist", so a page chunk buffer of the same allocated size
can hold more entry[] elements using Option1;
2) INDIRECT needs on-demand kmalloc();
3) no 4G size limit;
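For reference, here is the entry layout assumed in 1) (a sketch mirroring the
64-bit base/size chunk format discussed earlier in the thread; the real header
definition may differ in detail):

	struct virtio_balloon_page_chunk_entry {
		__le64 base;	/* guest physical base address of the chunk */
		__le64 size;	/* size of the chunk in bytes */
	};

That is 16 bytes per entry, versus roughly 32 bytes for a struct scatterlist on
a typical 64-bit build, which is where the "more entries per buffer" point comes from.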

What do you think?

Best,
Wei




-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] virtio-net: configurable TX queue size

2017-05-04 Thread Wang, Wei W
Hi,

I want to re-open a discussion left off a long time ago:
https://lists.gnu.org/archive/html/qemu-devel/2015-11/msg06194.html
and discuss the possibility of changing the hardcoded (256) TX queue
size to be configurable between 256 and 1024.

The reason for this request is that a severe packet-drop issue in the
TX direction was observed with the existing hardcoded 256 queue size,
which causes performance problems for packet-drop-sensitive guest
applications that cannot use indirect descriptor tables. The issue goes away
with a 1K queue size.

The concern mentioned in the previous discussion (please check the link
above) is that the number of chained descriptors would exceed
UIO_MAXIOV (1024) supported by Linux.

From the code, I think the number of chained descriptors is limited to
MAX_SKB_FRAGS + 2 (~18), which is much less than UIO_MAXIOV (a rough
sketch of where that bound comes from is below).
Please point out if I missed anything. Thanks.
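Sketch, based on my reading of the upstream virtio-net xmit path (simplified;
treat the details as approximate rather than an exact quote of the code):

	/* drivers/net/virtio_net.c (simplified): the per-queue scatterlist is
	 * sized for one descriptor for the virtio-net header, one for the
	 * linear part of the skb, and at most MAX_SKB_FRAGS fragments. */
	struct send_queue {
		struct virtqueue *vq;
		struct scatterlist sg[MAX_SKB_FRAGS + 2];
		char name[40];
	};

So even with a 1024-entry ring, a single packet never chains more than
MAX_SKB_FRAGS + 2 descriptors, far below UIO_MAXIOV.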

Best,
Wei




RE: [virtio-dev] Re: Vhost-pci RFC2.0

2017-04-19 Thread Wang, Wei W
On 04/19/2017 07:21 PM, Jan Kiszka wrote:
> On 2017-04-19 13:11, Wei Wang wrote:
> > On 04/19/2017 06:36 PM, Jan Kiszka wrote:
> >> On 2017-04-19 12:02, Wei Wang wrote:
> >>> The design presented here intends to use only one BAR to expose
> >>> both TX and RX. The two VMs share an intermediate memory here,
> >>> why couldn't we give the same permission to TX and RX?
> >>>
> >> For security and/or safety reasons: the TX side can then safely
> >> prepare and sign a message in-place because the RX side cannot
> >> mess around with it while not yet being signed (or check-summed).
> >> Saves one copy from a secure place into the shared memory.
> > If we allow guest1 to write to RX, what safety issue would it
> > cause to guest2?
>  This way, guest1 could trick guest2, in a race condition, to sign a
>  modified message instead of the original one.
> 
> >>> Just align the context that we are talking about: RX is the
> >>> intermediate shared ring that guest1 uses to receive packets and
> >>> guest2 uses to send packet.
> >>>
> >>> Seems the issue is that guest1 will receive a hacked message from RX
> >>> (modified by itself). How would it affect guest2?
> >> Retry: guest2 wants to send a signed/hashed message to guest1. For
> >> that purpose, it starts to build that message inside the shared
> >> memory that
> >> guest1 can at least read, then guest2 signs that message, also in-place.
> >> If guest1 can modify the message inside the ring while guest2 has not
> >> yet signed it, the result is invalid.
> >>
> >> Now, if guest2 is the final receiver of the message, nothing is lost,
> >> guest2 just shot itself into the foot. However, if guest2 is just a
> >> router (gray channel) and the message travels further, guest2 now has
> >> corrupted that channel without allowing the final receive to detect
> >> that. That's the scenario.
> >
> > If guest2 has been a malicious guest, I think it wouldn't make a
> > difference whether we protect the shared RX or not. As a router,
> > guest2 can play tricks on the messages after read it and then send the
> > modified message to a third man, right?
> 
> It can swallow it, "steal" it (redirect), but it can't manipulate the signed 
> content
> without being caught, that's the idea. It's particularly relevant for 
> safety-critical
> traffic from one safe application to another over unreliable channels, but it 
> may
> also be relevant for the integrity of messages in a secure setup.
> 

OK, I see most of your point, thanks. To get to the bottom of it, is it
possible to sign the packet before putting it onto the unreliable channel
(e.g. the shared RX), instead of signing in-place? If that's doable, we can
have a simpler shared channel.


Best,
Wei

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Vhost-pci RFC2.0

2017-04-19 Thread Wang, Wei W
Hi,

We made some design changes to the original vhost-pci design, and want to open
a discussion about the latest design (labelled 2.0) and its extension (2.1).
2.0 design: One VM shares the entire memory of another VM
2.1 design: One VM uses an intermediate memory shared with another VM for
 packet transmission.

For the convenience of discussion, I have some pictures presented at this link:
https://github.com/wei-w-wang/vhost-pci-discussion/blob/master/vhost-pci-rfc2.0.pdf

Fig. 1 shows the common driver frame that we want to use to build the 2.0 and 2.1
designs. A TX/RX engine consists of a local ring and an exotic ring.
Local ring:
1) allocated by the driver itself;
2) registered with the device (i.e. virtio_add_queue())
Exotic ring:
1) ring memory comes from outside the driver, and is exposed to the driver
via a BAR MMIO;
2) does not have a registration in the device, so no ioeventfd/irqfd or
configuration registers are allocated in the device

Fig. 2 shows how the driver frame is used to build the 2.0 design.
1) Asymmetric: vhost-pci-net <-> virtio-net
2) VM1 shares the entire memory of VM2, and the exotic rings are the rings
from VM2.
3) Performance (in terms of copies between VMs):
TX: 0-copy (packets are put to VM2's RX ring directly)
RX: 1-copy (the green arrow line in the VM1's RX engine)

Fig. 3 shows how the driver frame is used to build the 2.1 design.
1) Symmetric: vhost-pci-net <-> vhost-pci-net
2) Share an intermediate memory, allocated by VM1's vhost-pci device,
for data exchange, and the exotic rings are built on the shared memory
3) Performance:
TX: 1-copy
RX: 1-copy

Fig. 4 shows the inter-VM notification path for 2.0 (2.1 is similar).
The four eventfds are allocated by virtio-net, and shared with vhost-pci-net:
Uses virtio-net's TX/RX kickfd as the vhost-pci-net's RX/TX callfd
Uses virtio-net's TX/RX callfd as the vhost-pci-net's RX/TX kickfd
Example of how it works:
After packets are put into vhost-pci-net's TX ring, the driver kicks TX, which
causes the interrupt associated with fd3 to be injected into virtio-net (a
rough sketch of this fd cross-wiring follows).
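Rough sketch of the cross-wiring (the fd numbering is only illustrative; fd3
is anchored to the example above, the rest is an assumption, and the real QEMU
code organizes this differently):

	/* the four eventfds allocated by virtio-net (in VM2) */
	int vnet_tx_kickfd;	/* e.g. fd0 */
	int vnet_tx_callfd;	/* e.g. fd1 */
	int vnet_rx_kickfd;	/* e.g. fd2 */
	int vnet_rx_callfd;	/* fd3 in the example above */

	/* how vhost-pci-net (in VM1) reuses them */
	vpnet_rx_callfd = vnet_tx_kickfd;	/* VM2 kicks TX -> VM1 RX interrupt */
	vpnet_tx_callfd = vnet_rx_kickfd;	/* VM2 kicks RX -> VM1 TX interrupt */
	vpnet_rx_kickfd = vnet_tx_callfd;	/* VM1 kicks RX -> VM2 TX interrupt */
	vpnet_tx_kickfd = vnet_rx_callfd;	/* VM1 kicks TX -> VM2 RX interrupt */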

The draft code of the 2.0 design is ready, and can be found here:
Qemu: https://github.com/wei-w-wang/vhost-pci-device
Guest driver: https://github.com/wei-w-wang/vhost-pci-driver

We tested the 2.0 implementation using the Spirent packet
generator to transmit 64B packets; the results show that the
throughput of vhost-pci reaches around 1.8 Mpps, which is around
twice that of the legacy OVS+DPDK setup. Also, vhost-pci shows
better scalability than OVS+DPDK.


Best,
Wei



[virtio-dev] RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-04-05 Thread Wang, Wei W
On Wednesday, April 5, 2017 12:31 PM, Wei Wang wrote:
> On Wednesday, April 5, 2017 11:54 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> > > On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > > > The implementation of the current virtio-balloon is not very
> > > > efficient, because the ballooned pages are transferred to the host
> > > > one by one. Here is the breakdown of the time in percentage spent
> > > > on each step of the balloon inflating process (inflating 7GB of an 8GB 
> > > > idle
> guest).
> > > >
> > > > 1) allocating pages (6.5%)
> > > > 2) sending PFNs to host (68.3%)
> > > > 3) address translation (6.1%)
> > > > 4) madvise (19%)
> > > >
> > > > It takes about 4126ms for the inflating process to complete.
> > > > The above profiling shows that the bottlenecks are stage 2) and stage 
> > > > 4).
> > > >
> > > > This patch optimizes step 2) by transferring pages to the host in
> > > > chunks. A chunk consists of guest physically continuous pages, and
> > > > it is offered to the host via a base PFN (i.e. the start PFN of
> > > > those physically continuous pages) and the size (i.e. the total
> > > > number of the
> > pages). A chunk is formated as below:
> > > >
> > > > --------------------------------------------------------
> > > > | Base (52 bit)                        | Rsvd (12 bit) |
> > > > --------------------------------------------------------
> > > > --------------------------------------------------------
> > > > | Size (52 bit)                        | Rsvd (12 bit) |
> > > > --------------------------------------------------------
> > > >
> > > > By doing so, step 4) can also be optimized by doing address
> > > > translation and
> > > > madvise() in chunks rather than page by page.
> > > >
> > > > This optimization requires the negotiation of a new feature bit,
> > > > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > > >
> > > > With this new feature, the above ballooning process takes ~590ms
> > > > resulting in an improvement of ~85%.
> > > >
> > > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > > instead of a single page each time.
> > > >
> > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> > > > Suggested-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c | 371
> > +-
> > > > --
> > > >  include/uapi/linux/virtio_balloon.h |   9 +
> > > >  2 files changed, 353 insertions(+), 27 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > b/drivers/virtio/virtio_balloon.c index
> > > > f59cb4f..3f4a161 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -42,6 +42,10 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +#define PAGE_BMAP_SIZE (8 * PAGE_SIZE)
> > > > +#define PFNS_PER_PAGE_BMAP (PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > > +#define PAGE_BMAP_COUNT_MAX 32
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -50,6
> > +54,14
> > > > @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");  static
> > > > struct vfsmount *balloon_mnt;  #endif
> > > >
> > > > +#define BALLOON_CHUNK_BASE_SHIFT 12
> > > > +#define BALLOON_CHUNK_SIZE_SHIFT 12
> > > > +struct balloon_page_chunk {
> > > > +   __le64 base;
> > > > +   __le64 size;
> > > > +};
> > > > +
> > > > +typedef __le64 resp_data_t;
> > > >  struct virtio_balloon {
> > > > struct virtio_device *vdev;
> > > > struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -67,6
> > > > +79,31 @@ struct virtio_balloon {
> > > >
> > > > /* Number of balloon pages we've told the 

[virtio-dev] RE: [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages

2017-03-22 Thread Wang, Wei W
Hi Andrew, 

Do you have any comments on my thoughts? Thanks.

> On 03/17/2017 05:28 AM, Andrew Morton wrote:
> > On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang 
> wrote:
> >
> >> From: Liang Li 
> >>
> >> This patch adds a function to provides a snapshot of the present
> >> system unused pages. An important usage of this function is to
> >> provide the unsused pages to the Live migration thread, which skips
> >> the transfer of thoses unused pages. Newly used pages can be
> >> re-tracked by the dirty page logging mechanisms.
> > I don't think this will be useful for anything other than
> > virtio-balloon.  I guess it would be better to keep this code in the
> > virtio-balloon driver if possible, even though that's rather a
> > layering violation :( What would have to be done to make that
> > possible?  Perhaps we can put some *small* helpers into page_alloc.c
> > to prevent things from becoming too ugly.
> 
> The patch description was too narrowed and may have caused some confusion,
> sorry about that. This function is aimed to be generic. I agree with the
> description suggested by Michael.
> 
> Since the main body of the function is related to operating on the free_list. 
> I
> think it is better to have them located here.
> Small helpers may be less efficient and thereby causing some performance loss
> as well.
> I think one improvement we can make is to remove the "chunk format"
> related things from this function. The function can generally offer the base 
> pfn
> to the caller's recording buffer. Then it will be the caller's responsibility 
> to
> format the pfn if they need.
> 
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
> >>show_swap_cache_info();
> >>   }
> >>
> >> +static int __record_unused_pages(struct zone *zone, int order,
> >> +   __le64 *buf, unsigned int size,
> >> +   unsigned int *offset, bool part_fill) {
> >> +  unsigned long pfn, flags;
> >> +  int t, ret = 0;
> >> +  struct list_head *curr;
> >> +  __le64 *chunk;
> >> +
> >> +  if (zone_is_empty(zone))
> >> +  return 0;
> >> +
> >> +  spin_lock_irqsave(&zone->lock, flags);
> >> +
> >> +  if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> >> +  ret = -ENOSPC;
> >> +  goto out;
> >> +  }
> >> +  for (t = 0; t < MIGRATE_TYPES; t++) {
> >> +  list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >> +  pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> +  chunk = buf + *offset;
> >> +  if (*offset + 2 > size) {
> >> +  ret = -ENOSPC;
> >> +  goto out;
> >> +  }
> >> +  /* Align to the chunk format used in virtio-balloon */
> >> +  *chunk = cpu_to_le64(pfn << 12);
> >> +  *(chunk + 1) = cpu_to_le64((1 << order) << 12);
> >> +  *offset += 2;
> >> +  }
> >> +  }
> >> +
> >> +out:
> >> +  spin_unlock_irqrestore(&zone->lock, flags);
> >> +
> >> +  return ret;
> >> +}
> > This looks like it could disable interrupts for a long time.  Too long?
> 
> What do you think if we give "budgets" to the above function?
> For example, budget=1000, and there are 2000 nodes on the list.
> record() returns with "incomplete" status in the first round, along with the 
> status
> info, "*continue_node".
> 
> *continue_node: pointer to the starting node of the leftover. If 
> *continue_node
> has been used at the time of the second call (i.e. continue_node->next == 
> NULL),
> which implies that the previous 1000 nodes have been used, then the record()
> function can simply start from the head of the list.
> 
> It is up to the caller whether it needs to continue the second round when 
> getting
> "incomplete".
> 
> >
> >> +/*
> >> + * The record_unused_pages() function is used to record the system
> >> +unused
> >> + * pages. The unused pages can be skipped to transfer during live 
> >> migration.
> >> + * Though the unused pages are dynamically changing, dirty page
> >> +logging
> >> + * mechanisms are able to capture the newly used pages though they
> >> +were
> >> + * recorded as unused pages via this function.
> >> + *
> >> + * This function scans the free page list of the specified order to
> >> +record
> >> + * the unused pages, and chunks those continuous pages following the
> >> +chunk
> >> + * format below:
> >> + * ------------------------------------------------
> >> + * |  Base (52-bit)   | Rsvd (12-bit) |
> >> + * ------------------------------------------------
> >> + * ------------------------------------------------
> >> + * |  Size (52-bit)   | Rsvd (12-bit) |
> >> + * ------------------------------------------------
> >> + *
> >> + * @start_zone: zone to start the record operation.
> >> + * @order: order of the free page list to record.
> >> + * @buf: buffer to record the 

[virtio-dev] RE: [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-13 Thread Wang, Wei W
On Sunday, March 12, 2017 12:04 PM, Michael S. Tsirkin wrote:
> On Sun, Mar 12, 2017 at 01:59:54AM +0000, Wang, Wei W wrote:
> > On 03/11/2017 10:10 PM, Matthew Wilcox wrote:
> > > On Sat, Mar 11, 2017 at 07:59:31PM +0800, Wei Wang wrote:
> > > > I'm thinking what if the guest needs to transfer these much
> > > > physically continuous memory to host: 1GB+2MB+64KB+32KB+16KB+4KB.
> > > > Is it going to use Six 64-bit chunks? Would it be simpler if we
> > > > just use the 128-bit chunk format (we can drop the previous normal
> > > > 64-bit format)?
> > >
> > > Is that a likely thing for the guest to need to do though?  Freeing
> > > a 1GB page is much more liikely, IMO.
> >
> > Yes, I think it's very possible. The host can ask for any number of pages 
> > (e.g.
> 1.5GB) that the guest can afford.  Also, the ballooned 1.5G memory is not
> guaranteed to be continuous in any pattern like 1GB+512MB. That's why we
> need to use a bitmap to draw the whole picture first, and then seek for
> continuous bits to chunk.
> >
> > Best,
> > Wei
> 
> While I like the clever format that Matthew came up with, I'm also inclined to
> say let's keep things simple.
> the simplest thing seems to be to use the ext format all the time.
> Except let's reserve the low 12 bits in both address and size, since they are
> already 0, we might be able to use them for flags down the road.

Thanks for reminding us about the hugepage story. I'll use the ext format in
the implementation if there are no further objections from others (a minimal
sketch of how a chunk would be packed is below).
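Minimal sketch, with assumed struct/helper names (base and size are page
aligned, so the low 12 bits stay zero and remain available for flags later,
as suggested):

	struct balloon_chunk_ext {
		__le64 base;	/* guest physical base, bits 11..0 reserved */
		__le64 size;	/* chunk size in bytes, bits 11..0 reserved */
	};

	static void chunk_ext_fill(struct balloon_chunk_ext *c,
				   unsigned long pfn, unsigned long nr_pages)
	{
		c->base = cpu_to_le64((u64)pfn << PAGE_SHIFT);
		c->size = cpu_to_le64((u64)nr_pages << PAGE_SHIFT);
	}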

Best,
Wei

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org