RE: [PATCH v2] virtio/vsock: fix the transport to work with VMADDR_CID_ANY

2021-11-29 Thread Wang, Wei W
Hi Michael,

Do you plan to merge this patch through your tree?
If not, I'll resend to have it applied to the net tree.

Thanks,
Wei

On Friday, November 26, 2021 4:54 PM, Stefano Garzarella wrote:
> On Thu, Nov 25, 2021 at 08:18:23PM -0500, Wei Wang wrote:
> >The VMADDR_CID_ANY flag used by a socket means that the socket isn't
> >bound to any specific CID. For example, a host vsock server may want to
> >be bound with VMADDR_CID_ANY, so that a guest vsock client can connect
> >to the host server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a
> >host vsock client can connect to the same local server with
> >CID=VMADDR_CID_LOCAL (i.e. 1).
> >
> >The current implementation sets the destination socket's svm_cid to a
> >fixed CID value after the first client's connection, which isn't an
> >expected operation. For example, if the guest client first connects to
> >the host server, the server's svm_cid gets set to VMADDR_CID_HOST, then
> >other host clients won't be able to connect to the server anymore.
> >
> >Reproduce steps:
> >1. Run the host server:
> >   socat VSOCK-LISTEN:1234,fork -
> >2. Run a guest client to connect to the host server:
> >   socat - VSOCK-CONNECT:2:1234
> >3. Run a host client to connect to the host server:
> >   socat - VSOCK-CONNECT:1:1234
> >
> >Without this patch, step 3. above fails to connect, and socat complains
> >"socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection reset
> >by peer".
> >With this patch, the above works well.
> >
> >Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
> >Signed-off-by: Wei Wang 
> >---
> > net/vmw_vsock/virtio_transport_common.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Usually fixes for net/vmw_vsock/* are applied through the net tree
> (net...@vger.kernel.org) that seems not CCed. Please
> use ./scripts/get_maintainer.pl next time.
> 
> Maybe this one can be queued by Michael, let's wait a bit, otherwise please
> resend CCing netdev and using "net" tag.
> 
> Anyway the patch LGTM:
> 
> Reviewed-by: Stefano Garzarella 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH] virtio/vsock: fix the transport to work with VMADDR_CID_ANY

2021-11-25 Thread Wang, Wei W
On Thursday, November 25, 2021 6:41 PM, Stefano Garzarella wrote:
> On Thu, Nov 25, 2021 at 09:27:40AM +0000, Wang, Wei W wrote:
> >On Thursday, November 25, 2021 3:16 PM, Wang, Wei W wrote:
> >> -  /* Update CID in case it has changed after a transport reset event */
> >> -  vsk->local_addr.svm_cid = dst.svm_cid;
> >> -
> >>if (space_available)
> >>sk->sk_write_space(sk);
> >>
> >
> >Not sure if anybody knows how this affects the transport reset.
> 
> I believe the primary use case is when a guest is migrated.
> 
> After the migration, the transport gets a reset event from the hypervisor and
> all connected sockets are closed. The ones in listen remain open though.
> 
> Also the guest's CID may have changed after migration. So if an application
> has open listening sockets, bound to the old CID, this should ensure that
> the socket continues to be usable.

OK, thanks for sharing the background.

> 
> The patch would then change this behavior.
> 
> So maybe to avoid problems, we could update the CID only if it is different
> from VMADDR_CID_ANY:
> 
>   if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
>   vsk->local_addr.svm_cid = dst.svm_cid;
> 
> 
> When this code was written, a guest only supported a single transport, so it
> could only have one CID assigned, so that wasn't a problem.
> For that reason I'll add this Fixes tag:
> Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")

Sounds good to me.

Thanks,
Wei


RE: [PATCH] virtio/vsock: fix the transport to work with VMADDR_CID_ANY

2021-11-25 Thread Wang, Wei W
On Thursday, November 25, 2021 3:16 PM, Wang, Wei W wrote:
> - /* Update CID in case it has changed after a transport reset event */
> - vsk->local_addr.svm_cid = dst.svm_cid;
> -
>   if (space_available)
>   sk->sk_write_space(sk);
> 

I'm not sure how this affects the transport reset handling. Does anybody know?

Thanks,
Wei


RE: [RFC] hypercall-vsock: add a new vsock transport

2021-11-25 Thread Wang, Wei W
On Thursday, November 25, 2021 2:38 PM, Jason Wang wrote:
> > We thought about virtio-mmio. There are some barriers:
> > 1) It wasn't originally intended for x86 machines. The only machine
> > type in QEMU that supports it (to run on x86) is microvm. But
> > "microvm" doesn’t support TDX currently, and adding this support might
> need larger effort.
> 
> Can you explain why microvm needs a larger effort? It looks to me it fits
> TDX perfectly since it has less attack surface.

The main thing is that TDVF doesn't support microvm so far (the base OVMF
support for microvm is still under community discussion).

Do you guys think it is possible to add virtio-mmio support to q35?
(e.g. create a special platform bus in some fashion for memory-mapped devices)
Not sure whether that effort would be any smaller.

Thanks,
Wei




RE: [RFC] hypercall-vsock: add a new vsock transport

2021-11-11 Thread Wang, Wei W
On Wednesday, November 10, 2021 7:17 PM, Stefano Garzarella wrote:


> Adding Andra and Sergio, because IIRC Firecracker and libkrun emulates
> virtio-vsock with virtio-mmio so the implementation should be simple and also
> not directly tied to a specific VMM.
> 

OK, that would work for KVM-based guests.
Hyper-V and VMware based guests, however, don't have virtio-mmio support.
If the MigTD (a special guest) we provide is based on virtio-mmio, it would not
be usable for them.

Thanks,
Wei



RE: [RFC] hypercall-vsock: add a new vsock transport

2021-11-11 Thread Wang, Wei W
On Wednesday, November 10, 2021 5:35 PM, Stefan Hajnoczi wrote:
> AF_VSOCK is designed to allow multiple transports, so why not. There is a cost
> to developing and maintaining a vsock transport though.

Yes. The effort could be reduced by simplifying the design as much as possible:
e.g. no ring operations; the guest just sends one packet at a time for the host
to read. (This transport isn't targeting high performance.)

> 
> I think Amazon Nitro enclaves use virtio-vsock and I've CCed Andra in case she
> has thoughts on the pros/cons and how to minimize the trusted computing
> base.

Thanks for adding more of the relevant people to the discussion.

> 
> If simplicity is the top priority then VIRTIO's MMIO transport without 
> indirect
> descriptors and using the packed virtqueue layout reduces the size of the
> implementation:
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-1440002

I listed some considerations for virtio-mmio in my response to Michael.
Please take a look and let me know if you have any different thoughts.

Thanks,
Wei


RE: [RFC] hypercall-vsock: add a new vsock transport

2021-11-10 Thread Wang, Wei W
On Wednesday, November 10, 2021 6:50 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 10, 2021 at 07:12:36AM +0000, Wang, Wei W wrote:
>
> hypercalls are fundamentally hypervisor dependent though.

Yes, each hypervisor needs to support it.
We could keep the design and implementation minimal, so that each hypervisor
can easily support it. Once every hypervisor has that support, the guest
(MigTD) could be a single unified build (i.e. no need for each hypervisor
vendor to develop its own MigTD with its own vsock transport).

> Assuming you can carve up a hypervisor independent hypercall, using it for
> something as mundane and specific as vsock for TDX seems like a huge overkill.
> For example, virtio could benefit from faster vmexits that hypercalls give you
> for signalling.
> How about a combination of virtio-mmio and hypercalls for fast-path signalling
> then?

We thought about virtio-mmio. There are some barriers:
1) It wasn't originally intended for x86 machines. The only machine type in QEMU
that supports it (to run on x86) is microvm. But microvm doesn't support TDX
currently, and adding that support might need a larger effort.
2) It's simpler than virtio-pci, but still more complex than a hypercall.
3) Some CSPs don't have virtio support in their software, so this might add too
much development effort for them.

This usage doesn't need high performance, so a faster hypercall for signalling
isn't required, I think. (But if hypercalls are verified to be much faster than
the current EPT-misconfig based notification, they could be added for general
virtio usage.)

> 
> > 2)   It is simpler. It doesn’t rely on any complex bus enumeration
> >
> > (e.g. virtio-pci based vsock device may need the whole implementation of
> PCI).
> >
> 
> Next thing people will try to do is implement a bunch of other device on top
> of it.  virtio used pci simply because everyone implements pci.  And the
> reason for *that* is because implementing a basic pci bus is dead simple,
> whole of pci.c in qemu is <3000 LOC.

This doesn't include the PCI enumeration in SeaBIOS or the PCI driver in the
guest, though.

Virtio offers high performance; I think that's an important reason that more
devices are continually added to it. For this transport, I can't envision a
bunch of devices being added on top. It's a simple PV method.


> 
> >
> > An example usage is the communication between MigTD and host (Page 8
> > at
> >
> > https://static.sched.com/hosted_files/kvmforum2021/ef/TDX%20Live%20Migration_Wei%20Wang.pdf).
> >
> > MigTD communicates to host to assist the migration of the target (user) TD.
> >
> > MigTD is part of the TCB, so its implementation is expected to be as
> > simple as possible
> >
> > (e.g. bare metal implementation without OS, no PCI driver support).
> >
> >
> 
> Try to list drawbacks? For example, passthrough for nested virt isn't possible
> unlike pci, neither are hardware implementations.
> 

Why wouldn't a hypercall be possible for nested virt?
An L2 hypercall goes to L0 directly, and L0 can decide whether to forward the
call to L1 (in our case, I think there is no need, as the packet will go out),
right?

Its drawbacks are obvious (e.g. low performance).
In general, I think it could be considered a complement to virtio.
Most usages would choose virtio, since they don't worry about the complexity
and they pursue high performance. Some special usages that find virtio too
complex and want something simpler would consider using this transport.

Thanks,
Wei



[RFC] hypercall-vsock: add a new vsock transport

2021-11-09 Thread Wang, Wei W
Hi,

We plan to add a new vsock transport based on hypercalls (e.g. vmcall on Intel
CPUs). It transports AF_VSOCK packets between the guest and host, similar to
virtio-vsock, vmci-vsock, and hyperv-vsock.

Compared to the vsock transports listed above, which are designed for high
performance, the main advantages of hypercall-vsock are:

1) It is VMM-agnostic. For example, a guest using hypercall-vsock can run on
either KVM, Hyper-V, or VMware.

2) It is simpler. It doesn't rely on any complex bus enumeration
(e.g. a virtio-pci based vsock device may need a whole PCI implementation).

An example usage is the communication between MigTD and the host (page 8 at
https://static.sched.com/hosted_files/kvmforum2021/ef/TDX%20Live%20Migration_Wei%20Wang.pdf).
MigTD communicates with the host to assist the migration of the target (user) TD.
MigTD is part of the TCB, so its implementation is expected to be as simple as
possible (e.g. a bare-metal implementation without an OS and no PCI driver
support).

Looking forward to your feedback.

Thanks,
Wei

RE: [PATCH v1 3/3] virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM

2020-02-14 Thread Wang, Wei W
On Friday, February 14, 2020 5:52 PM, David Hildenbrand wrote:
> > Commit 71994620bb25 ("virtio_balloon: replace oom notifier with
> > shrinker") changed the behavior when deflation happens automatically.
> > Instead of deflating when called by the OOM handler, the shrinker is used.
> >
> > However, the balloon is not simply some slab cache that should be
> > shrunk when under memory pressure. The shrinker does not have a
> > concept of priorities, so this behavior cannot be configured.
> >
> > There was a report that this results in undesired side effects when
> > inflating the balloon to shrink the page cache. [1]
> > "When inflating the balloon against page cache (i.e. no free memory
> >  remains) vmscan.c will both shrink page cache, but also invoke the
> >  shrinkers -- including the balloon's shrinker. So the balloon
> >  driver allocates memory which requires reclaim, vmscan gets this
> >  memory by shrinking the balloon, and then the driver adds the
> >  memory back to the balloon. Basically a busy no-op."
> >
> > The name "deflate on OOM" makes it pretty clear when deflation should
> > happen - after other approaches to reclaim memory failed, not while
> > reclaiming. This allows to minimize the footprint of a guest - memory
> > will only be taken out of the balloon when really needed.
> >
> > Especially, a drop_slab() will result in the whole balloon getting
> > deflated - undesired. While handling it via the OOM handler might not
> > be perfect, it keeps existing behavior. If we want a different
> > behavior, then we need a new feature bit and document it properly
> > (although, there should be a clear use case and the intended effects should
> be well described).
> >
> > Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> because
> > this has no such side effects. Always register the shrinker with
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to
> reuse
> > free pages that are still to be processed by the guest. The hypervisor
> > takes care of identifying and resolving possible races between
> > processing a hinting request and the guest reusing a page.
> >
> > In contrast to pre commit 71994620bb25 ("virtio_balloon: replace oom
> > notifier with shrinker"), don't add a module parameter to configure
> > the number of pages to deflate on OOM. Can be re-added if really needed.
> > Also, pay attention that leak_balloon() returns the number of 4k pages
> > - convert it properly in virtio_balloon_oom_notify().
> >
> > Note1: using the OOM handler is frowned upon, but it really is what we
> >need for this feature.
> >
> > Note2: without VIRTIO_BALLOON_F_MUST_TELL_HOST (iow, always with
> QEMU) we
> >could actually skip sending deflation requests to our hypervisor,
> >making the OOM path *very* simple. Basically freeing pages and
> >updating the balloon. If the communication with the host ever
> >becomes a problem on this call path.
> >
> 
> @Michael, how to proceed with this?
> 

I vote for not going back. If there are solid requests and strong reasons in
the future, we can reopen this discussion.

Best,
Wei


RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-09 Thread Wang, Wei W
On Monday, February 10, 2020 11:57 AM, Tetsuo Handa wrote:
> Then, "node-A's NR_FILE_PAGES is already 0 and node-B's NR_FILE_PAGES is
> not 0, but allocation request which triggered this shrinker wants to allocate
> from only node-A"
> would be confused by this change, for the pagecache pages for allocating
> thread's interested node are already depleted but the balloon cannot shrink
> when it should because the pagecache pages for allocating thread's
> uninterested nodes are not yet depleted.

The existing balloon isn't NUMA aware. Regarding "but the balloon cannot
shrink": even if we let the balloon shrink, it could shrink pages from an
uninterested node.

Once we have a NUMA-aware balloon, we could further update the shrinker
to check the per-node counter, node_page_state(NR_FILE_PAGES).

> 
> >
> Well, my comment is rather: "Do not try to reserve guest's memory. In other
> words, do not try to maintain balloons on the guest side. Since host would
> be able to cache file data on the host's cache, guests would be able to
> quickly fetch file data from host's cache via normal I/O requests." ;-)

I didn't get this one. The discussion was about guest pagecache pages vs.
guest balloon pages. Why is the host's pagecache relevant here?

Best,
Wei


RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-09 Thread Wang, Wei W
On Saturday, February 8, 2020 8:33 PM, Tetsuo Handa wrote:
> 
> Is this NUMA aware? Can "node-A's NR_FILE_PAGES is already 0 and
> node-B's NR_FILE_PAGES is not 0, but allocation request which triggered this
> shrinker wants to allocate from only node-B" happen? 

No, it's a global counter.

>Can some thread keep
> this shrinker defunctional by keep increasing NR_FILE_PAGES?

Yes. Actually that's our intention: as long as there are pagecache pages,
reclaiming balloon pages is avoided.


> 
> Is this patch from "Re: Balloon pressuring page cache" thread? I hope that
> the guest could start reclaiming memory based on host's request (like OOM
> notifier chain) which is issued when host thinks that host is getting close to
> OOM and thus guests should start returning their unused memory to host.
> Maybe "periodically (e.g. 5 minutes)" in addition to "upon close to OOM
> condition" is also possible.

That's about host-side usage. The host-side management software decides when
to issue a request to the balloon (either periodically or event driven); I
don't think there is anything we need to do in the balloon driver here.

Best,
Wei


RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-06 Thread Wang, Wei W
On Thursday, February 6, 2020 5:32 PM, David Hildenbrand wrote:
> 
> If the page cache is empty, a drop_slab() will deflate the whole balloon if I
> am not wrong.
> 
> Especially, a echo 3 > /proc/sys/vm/drop_caches
> 
> will first drop the page cache and then drop_slab()

Then that's a problem of "echo 3 > /proc/sys/vm/drop_caches" itself. It
invokes other shrinkers as well; if that's considered an issue, it needs to be
tweaked in mm.

Best,
Wei


RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-06 Thread Wang, Wei W
On Thursday, February 6, 2020 5:31 PM, Michael S. Tsirkin wrote:
> 
> How about just making this a last resort thing to be compatible with existing
> hypervisors? if someone wants to change behaviour that really should use a
> feature bit ...

Yeah, sounds good to me to control via feature bits.

Best,
Wei


RE: Balloon pressuring page cache

2020-02-06 Thread Wang, Wei W
> Isn't the hint only useful during the first round?
> After the first round if a page becomes free then we need to update the copy
> at the migration destination, so freeing a page that previously had contents
> should mark it dirty.

Nope. I think as long as a page is free (no matter whether in the 1st or a
later round), we don't need to send it.

Best,
Wei

RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-06 Thread Wang, Wei W
On Thursday, February 6, 2020 5:10 PM, David Hildenbrand wrote:
> so dropping caches (echo 3 > /proc/sys/vm/drop_caches) will no longer
> deflate the balloon when conservative_shrinker=true?
> 

It should be, yes. We'd need Tyler's help to test it.

Best,
Wei



RE: [PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-06 Thread Wang, Wei W
On Thursday, February 6, 2020 5:04 PM, Michael S. Tsirkin wrote:
> virtio_balloon_shrinker_count(struct shrinker *shrinker,
> > struct virtio_balloon, shrinker);
> > unsigned long count;
> >
> > -   count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
> > +   if (conservative_shrinker && global_node_page_state(NR_FILE_PAGES))
> 
> I'd rather have an API for that in mm/. In particular, do we want other
> shrinkers to run, not just pagecache? To pick an example I'm familiar
> with, kvm mmu cache for nested virt?

We could make it extendable:

#define BALLOON_SHRINKER_AFTER_PAGE_CACHE    (1 << 0)
#define BALLOON_SHRINKER_AFTER_KVM_MMU_CACHE (1 << 1)
...

uint64_t conservative_shrinker;

if ((conservative_shrinker & BALLOON_SHRINKER_AFTER_PAGE_CACHE) &&
    global_node_page_state(NR_FILE_PAGES))
        return 0;

For now, we probably only need BALLOON_SHRINKER_AFTER_PAGE_CACHE.

Best,
Wei


RE: [PATCH v1 3/3] virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM

2020-02-06 Thread Wang, Wei W
On Thursday, February 6, 2020 12:34 AM, David Hildenbrand wrote:
> Commit 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> changed the behavior when deflation happens automatically. Instead of
> deflating when called by the OOM handler, the shrinker is used.
> 
> However, the balloon is not simply some slab cache that should be shrunk
> when under memory pressure. The shrinker does not have a concept of
> priorities, so this behavior cannot be configured.
> 
> There was a report that this results in undesired side effects when inflating
> the balloon to shrink the page cache. [1]
>   "When inflating the balloon against page cache (i.e. no free memory
>remains) vmscan.c will both shrink page cache, but also invoke the
>shrinkers -- including the balloon's shrinker. So the balloon
>driver allocates memory which requires reclaim, vmscan gets this
>memory by shrinking the balloon, and then the driver adds the
>memory back to the balloon. Basically a busy no-op."

I'm not sure we need to go back to the OOM notifier, which has many drawbacks,
as we discussed. I just posted another approach, which is simpler.

Best,
Wei


RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote:
> >
> > Not sure how TCG tracks the dirty bits. But In whatever
> > implementation, the hypervisor should have
> 
> There is only a single bitmap for that purpose. (well, the one where KVM
> syncs to)
> 
> > already dealt with the race between he current round and the previous
> round dirty recording.
> > (the race isn't brought by this feature essentially)
> 
> It is guaranteed to work reliably without this feature as you only clear what
> *has been migrated*, 

It's not "clear what has been migrated" (that would skip nothing).
Anyway, it's a hint used for optimization.

Best,
Wei


RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 5:23 PM, David Hildenbrand wrote:
> So, if you run a TCG guest and use it with free page reporting, the race is
> possible? So the correctness depends on two dirty bitmaps in the hypervisor
> and how they interact. wow this is fragile.
>

I'm not sure how TCG tracks the dirty bits. But in whatever implementation,
the hypervisor should have already dealt with the race between the current
round and the previous round's dirty recording.
(The race isn't introduced by this feature per se.)

Best,
Wei


RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 5:06 PM, David Hildenbrand wrote:

> 
> No, that does not answer my question. Because then, obviously the guest
> can't do any hinting in the last round. I think I am missing something
> important :)

No problem, probably need more details here:

QEMU has a dirty bitmap which indicates all the dirty pages from the previous 
round.
KVM has a dirty bitmap which records what pages are modified in this round.
When a new round starts, QEMU syncs the bitmap from KVM (this round always
sends the pages dirtied from the previous round).

> 1. Guest allocates a page and sends it to the host.
> 2. Shrinker gets active and releases that page again.
> 3. Some user in the guest allocates and modifies that page. The dirty bit is
> set in the hypervisor.

The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when 
the next round starts.

> 4. The host processes the request and clears the bit in the dirty bitmap.

This clears the bit from the QEMU bitmap, and this page will not be sent in 
this round.

> 5. The guest is stopped and the last set of dirty pages is migrated. The
> modified page is not being migrated (because not marked dirty).

When QEMU starts the last round, it first syncs the bitmap from KVM, which
includes the bit set in step 3. Then the modified page gets sent.

Best,
Wei



RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 4:57 PM, David Hildenbrand wrote:
> >> Yes, I agree with you. Yet, I am thinking about one
> >> (unlikely?impossible?) scenario. Can you refresh my brain why that
> >> cannot happen (IOW, why we don't have to wait for the host to process
> >> the request)?
> >>
> >> 1. Guest allocates a page and sends it to the host.
> >> 2. Shrinker gets active and releases that page again.
> >> 3. Some user in the guest allocates and modifies that page. After
> >> that, it is done using that page for the next hour.
> >> 4. The host processes the request and clears the bit in the dirty bitmap.
> >> 5. The guest is being migrated by the host. The modified page is not
> >> being migrated.
> >
> > Whenever the guest modifies a page during migration, it will be
> > captured by the dirty logging and the hypervisor will send the dirtied the
> page in the following round.
> 
> Please explain why the steps I outlined don't apply esp. in the last round.
> Your general statement does not explain why this race can't happen.
> 

The guest is stopped in the last round, thus no page will be modified at that 
time.

Best,
Wei


RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 4:19 PM, David Hildenbrand wrote:
> Yes, I agree with you. Yet, I am thinking about one
> (unlikely?impossible?) scenario. Can you refresh my brain why that cannot
> happen (IOW, why we don't have to wait for the host to process the
> request)?
> 
> 1. Guest allocates a page and sends it to the host.
> 2. Shrinker gets active and releases that page again.
> 3. Some user in the guest allocates and modifies that page. After that, it is
> done using that page for the next hour.
> 4. The host processes the request and clears the bit in the dirty bitmap.
> 5. The guest is being migrated by the host. The modified page is not being
> migrated.

Whenever the guest modifies a page during migration, it will be captured by the
dirty logging, and the hypervisor will send the dirtied page in the following
round.

Some more thoughts to clarify the difference: I think it's all about page
ownership. With VIRTIO_BALLOON_F_FREE_PAGE_HINT, the guest always owns the
page, so the host should not use or unmap it. With VIRTIO_BALLOON_F_REPORTING
or legacy balloon inflation, the guest intends to transfer ownership of the
underlying physical page to the host; that's why host and guest need to
synchronize on whether the ownership transfer has completed.

Best,
Wei


RE: Balloon pressuring page cache

2020-02-05 Thread Wang, Wei W
On Wednesday, February 5, 2020 3:05 PM, Michael S. Tsirkin wrote:
> 
> Well if we did the hint would be reliable, allowing host to immediately drop
> any pages it gets in the hint. 

By "drop", do you mean the host unmapping the page from the guest? I think
that's not allowed for hints.

> Originally I wanted to speed up hinting by never
> waiting for host, but that does not seem to be what was
> implemented: the only place we don't wait is the shrinker

I didn't get this one. For FREE_PAGE_HINT, hints are always sent to the host
without an ack from the host about whether it has read them (please see
get_free_page_and_send).

> and it seems a
> waste that we introduced complexity to host without getting any real
> benefit out of it.
> 
> VIRTIO_BALLOON_F_MUST_TELL_HOST doesn't really apply to hinting right
> now, 

There is no need, I think, as the host isn't allowed to use or unmap a hinted page.

Best,
Wei



RE: Balloon pressuring page cache

2020-02-04 Thread Wang, Wei W
On Wednesday, February 5, 2020 12:50 AM, Michael S. Tsirkin wrote:
> > Michael, any clue on which event we have to wait with
> > VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
> > VIRTIO_BALLOON_F_MUST_TELL_HOST applies to
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT and we'd better document that. It
> introduces complexity with no clear benefit.
> 
> I meant that we must wait for host to see the hint.

Why?

Best,
Wei


RE: Balloon pressuring page cache

2020-02-04 Thread Wang, Wei W
On Tuesday, February 4, 2020 10:30 PM, David Hildenbrand wrote:
> So, I am trying to understand how the code is intended to work, but I am
> afraid I am missing something (or to rephrase: I think I found a BUG :) and
> there is lack of proper documentation about this feature).
> 
> a) We allocate pages and add them to the list as long as we are told to do
> so.
>We send these pages to the host one by one.
> b) We free all pages once we get a STOP signal. Until then, we keep pages
> allocated.

Yes. The reporting ends either when the host sends the guest a STOP cmd, or when
the guest fails to allocate a page (meaning that all the possible free pages are
already taken).

> c) When called via the shrinker, we want to free pages from the list, even
> though the hypervisor did not notify us to do so.
> 
> 
> Issue 1: When we unload the balloon driver in the guest in an unlucky event,
> we won't free the pages. We are missing something like (if I am not wrong):
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b1d2068fa2bd..e2b0925e1e83 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -929,6 +929,10 @@ static void remove_common(struct virtio_balloon *vb)
> leak_balloon(vb, vb->num_pages);
> update_balloon_size(vb);
> 
> +   /* There might be free pages that are being reported: release them. */
> +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> +   return_free_pages_to_mm(vb, ULONG_MAX);
> +
> /* Now we reset the device so we can clean up the queues. */
> vb->vdev->config->reset(vb->vdev);


Right, thanks!

> 
> 
> Issue 2: When called via the shrinker (but also to fix Issue 1), it could be
> that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST.

I don't think it is an issue here.
MUST_TELL_HOST is for the ballooning pages, where pages are offered to the host
to _USE_. For free page hinting, as the name already suggests, it's just a
_HINT_, so in whatever use case, the host should not take the page to use.
So the guest doesn't need to tell the host and wait.

Back to the implementation of virtio_balloon_shrinker_scan, where I don't see
an issue so far: shrink_free_pages just returns pages to mm without waiting for
an ack from the host, while shrink_balloon_pages goes through leak_balloon,
which calls tell_host before releasing the balloon pages.
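To make the asymmetry concrete, here is a toy userspace model (pure illustration, not kernel code; the function names mirror the driver's but the whole structure is an assumption for this sketch) of the two shrinker paths: hinted pages go straight back to mm, while ballooned pages must be told to the host first.

```python
class ToyBalloon:
    def __init__(self):
        self.free_page_hints = ["hintA", "hintB"]  # FREE_PAGE_HINT pages
        self.balloon_pages = ["pageX", "pageY"]    # inflated balloon pages
        self.host_told = []                        # records tell_host() calls

    def shrink_free_pages(self):
        # Hinted pages return to mm with no host ack, because the host
        # may not use or unmap a hinted page in the first place.
        released = list(self.free_page_hints)
        self.free_page_hints.clear()
        return released

    def tell_host(self, pages):
        self.host_told.extend(pages)

    def shrink_balloon_pages(self):
        # Ballooned pages were offered to the host to USE, so the
        # leak_balloon() path must tell_host() before releasing them.
        self.tell_host(self.balloon_pages)
        released = list(self.balloon_pages)
        self.balloon_pages.clear()
        return released

b = ToyBalloon()
freed_hints = b.shrink_free_pages()      # no host interaction
freed_balloon = b.shrink_balloon_pages() # host notified first
```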

Best,
Wei



RE: Balloon pressuring page cache

2020-01-30 Thread Wang, Wei W
On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> On 29.01.20 20:11, Tyler Sanderson wrote:
> >
> >
> > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand  > > wrote:
> >
> > On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> > > A primary advantage of virtio balloon over other memory reclaim
> > > mechanisms is that it can pressure the guest's page cache into
> > shrinking.
> > >
> > > However, since the balloon driver changed to using the shrinker API
> > >
> >
>  e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> > > use case has become a bit more tricky. I'm wondering what the
> intended
> > > device implementation is.
> > >
> > > When inflating the balloon against page cache (i.e. no free memory
> > > remains) vmscan.c will both shrink page cache, but also invoke the
> > > shrinkers -- including the balloon's shrinker. So the balloon driver
> > > allocates memory which requires reclaim, vmscan gets this memory
> by
> > > shrinking the balloon, and then the driver adds the memory back to
> the
> > > balloon. Basically a busy no-op.

Per my understanding, the balloon allocation won't invoke the shrinker, as
__GFP_DIRECT_RECLAIM isn't set, no?


> > >
> > > If file IO is ongoing during this balloon inflation then the page
> > cache
> > > could be growing which further puts "back pressure" on the balloon
> > > trying to inflate. In testing I've seen periods of > 45 seconds where
> > > balloon inflation makes no net forward progress.

I think this is intentional (but could be improved). Inflation does not stop
when an allocation fails; it simply sleeps for a while and resumes, repeating
until there is memory to inflate. That's why you see no inflation progress for
a long time under memory pressure.
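A schematic userspace sketch of that retry behavior (all names and the failure pattern here are assumptions for illustration, not the driver's actual code): inflation never fails outright under pressure, it just sleeps and retries, so no forward progress is visible until allocations start succeeding.

```python
def inflate(target_pages, try_alloc, sleep):
    # Keep retrying until the requested number of pages is inflated.
    inflated = 0
    while inflated < target_pages:
        page = try_alloc()
        if page is None:
            sleep()  # under pressure: no visible progress during this
            continue
        inflated += 1
    return inflated

# Simulate memory pressure: the first 3 allocation attempts fail,
# then allocations succeed.
attempts = {"n": 0}
def try_alloc():
    attempts["n"] += 1
    return None if attempts["n"] <= 3 else "page"

sleeps = {"n": 0}
def sleep():
    sleeps["n"] += 1

result = inflate(5, try_alloc, sleep)
```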


Best,
Wei

RE: [PATCH v2 2/2] virtio-pci: check name when counting MSI-X vectors

2020-01-05 Thread Wang, Wei W
On Saturday, January 4, 2020 2:41 AM, Daniel Verkamp wrote:
> Subject: [PATCH v2 2/2] virtio-pci: check name when counting MSI-X vectors
> 
> VQs without a name specified are not valid; they are skipped in the later loop
> that assigns MSI-X vectors to queues, but the per_vq_vectors loop above that
> counts the required number of vectors previously still counted any queue with a
> non-NULL callback as needing a vector.
> 
> Add a check to the per_vq_vectors loop so that vectors with no name are not
> counted to make the two loops consistent.  This prevents over-counting
> unnecessary vectors (e.g. for features which were not negotiated with the
> device).
> 
> Cc: sta...@vger.kernel.org
> Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
> Reviewed-by: Cornelia Huck 
> Signed-off-by: Daniel Verkamp 
> ---
> 
> v1:
> https://lists.linuxfoundation.org/pipermail/virtualization/2019-December/0448
> 28.html
> 
>  drivers/virtio/virtio_pci_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_pci_common.c
> b/drivers/virtio/virtio_pci_common.c
> index f2862f66c2ac..222d630c41fc 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -294,7 +294,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev,
> unsigned nvqs,
>   /* Best option: one for change interrupt, one per vq. */
>   nvectors = 1;
>   for (i = 0; i < nvqs; ++i)
> - if (callbacks[i])
> + if (names[i] && callbacks[i])
>   ++nvectors;
>       } else {
>   /* Second best: one for change, shared for all vqs. */
> --
> 2.24.1.735.g03f4e72817-goog

Reviewed-by: Wang, Wei W 

Best,
Wei


RE: [PATCH v2 1/2] virtio-balloon: initialize all vq callbacks

2020-01-05 Thread Wang, Wei W
On Saturday, January 4, 2020 2:41 AM, Daniel Verkamp wrote:
> Subject: [PATCH v2 1/2] virtio-balloon: initialize all vq callbacks
> 
> Ensure that elements of the callbacks array that correspond to unavailable
> features are set to NULL; previously, they would be left uninitialized.
> 
> Since the corresponding names array elements were explicitly set to NULL,
> the uninitialized callback pointers would not actually be dereferenced;
> however, the uninitialized callbacks elements would still be read in 
> vp_find_vqs_msix() and used to calculate the number of MSI-X vectors
> required.

The above description no longer seems accurate once your second patch is applied.
 
> Cc: sta...@vger.kernel.org
> Fixes: 86a559787e6f ("virtio-balloon:
> VIRTIO_BALLOON_F_FREE_PAGE_HINT")
> Reviewed-by: Cornelia Huck 
> Signed-off-by: Daniel Verkamp 
> ---
> 
> v1:
> https://lists.linuxfoundation.org/pipermail/virtualization/2019-December/04
> 4829.html
> 
> Changes from v1:
> - Clarified "array" in commit message to "callbacks array"
> 
>  drivers/virtio/virtio_balloon.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 93f995f6cf36..8e400ece9273 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -475,7 +475,9 @@ static int init_vqs(struct virtio_balloon *vb)
>   names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
>   callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
>   names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> + callbacks[VIRTIO_BALLOON_VQ_STATS] = NULL;
>   names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> + callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;

Could you remove the other redundant NULL initializations as well?
https://lists.linuxfoundation.org/pipermail/virtualization/2019-December/044837.html

Best,
Wei


RE: [PATCH v3 2/2] balloon: fix up comments

2019-07-18 Thread Wang, Wei W
On Thursday, July 18, 2019 8:24 PM, Michael S. Tsirkin wrote:
>  /*
>   * balloon_page_alloc - allocates a new page for insertion into the balloon
> - * page list.
> + *   page list.
>   *
> - * Driver must call it to properly allocate a new enlisted balloon page.
> - * Driver must call balloon_page_enqueue before definitively removing it
> from
> - * the guest system.  This function returns the page address for the recently
> - * allocated page or NULL in the case we fail to allocate a new page this 
> turn.
> + * Driver must call this function to properly allocate a new enlisted balloon
> page.

Probably better to say "allocate a new balloon page to enlist"?
"enlisted page" implies that the allocated page has already been added to the
list, which might be misleading.


> + * Driver must call balloon_page_enqueue before definitively removing
> + the page
> + * from the guest system.
> + *
> + * Returns: struct page address for the allocated page or NULL in case it 
> fails
> + *   to allocate a new page.
>   */

Returns: pointer to the page struct of the allocated page, or NULL if
allocation fails.



>  struct page *balloon_page_alloc(void)
>  {
> @@ -130,19 +133,15 @@ EXPORT_SYMBOL_GPL(balloon_page_alloc);
>  /*
>   * balloon_page_enqueue - inserts a new page into the balloon page list.
>   *
> - * @b_dev_info: balloon device descriptor where we will insert a new page
> to
> + * @b_dev_info: balloon device descriptor where we will insert a new
> + page
>   * @page: new page to enqueue - allocated using balloon_page_alloc.
>   *
> - * Driver must call it to properly enqueue a new allocated balloon page
> - * before definitively removing it from the guest system.
> + * Drivers must call this function to properly enqueue a new allocated
> + balloon
> + * page before definitively removing the page from the guest system.
>   *
> - * Drivers must not call balloon_page_enqueue on pages that have been
> - * pushed to a list with balloon_page_push before removing them with
> - * balloon_page_pop. To all pages on a list, use balloon_page_list_enqueue
> - * instead.
> - *
> - * This function returns the page address for the recently enqueued page or
> - * NULL in the case we fail to allocate a new page this turn.
> + * Drivers must not call balloon_page_enqueue on pages that have been
> + pushed to
> + * a list with balloon_page_push before removing them with
> + balloon_page_pop. To
> + * enqueue all pages on a list, use balloon_page_list_enqueue instead.

"To enqueue a list of pages" ?


>   */
>  void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
> struct page *page)
> @@ -157,14 +156,24 @@ EXPORT_SYMBOL_GPL(balloon_page_enqueue);
> 
>  /*
>   * balloon_page_dequeue - removes a page from balloon's page list and
> returns
> - * the its address to allow the driver release the page.
> + * its address to allow the driver to release the page.
>   * @b_dev_info: balloon device decriptor where we will grab a page from.
>   *
> - * Driver must call it to properly de-allocate a previous enlisted balloon
> page
> - * before definetively releasing it back to the guest system.
> - * This function returns the page address for the recently dequeued page or
> - * NULL in the case we find balloon's page list temporarily empty due to
> - * compaction isolated pages.
> + * Driver must call this to properly dequeue a previously enqueued page
 
"call this function"?
 

> + * before definitively releasing it back to the guest system.
> + *
> + * Caller must perform its own accounting to ensure that this
> + * function is called only if some pages are actually enqueued.


"only when" ?

> + *
> + * Note that this function may fail to dequeue some pages even if there

"even when" ?

> + are
> + * some enqueued pages - since the page list can be temporarily empty
> + due to
> + * the compaction of isolated pages.
> + *
> + * TODO: remove the caller accounting requirements, and allow caller to
> + wait
> + * until all pages can be dequeued.
> + *
> + * Returns: struct page address for the dequeued page, or NULL if it fails to
> + *   dequeue any pages.

Returns: pointer to the page struct of the dequeued page, or NULL if no page
gets dequeued.


>   */
>  struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> { @@ -177,9 +186,9 @@ struct page *balloon_page_dequeue(struct
> balloon_dev_info *b_dev_info)
>   if (n_pages != 1) {
>   /*
>* If we are unable to dequeue a balloon page because the
> page
> -  * list is empty and there is no isolated pages, then
> something
> +  * list is empty and there are no isolated pages, then
> something
>* went out of track and some balloon pages are lost.
> -  * BUG() here, otherwise the balloon driver may get stuck
> into
> +  * BUG() here, otherwise 

RE: [PATCH v2 1/2] virtio-balloon: tweak config_changed implementation

2019-01-04 Thread Wang, Wei W
On Friday, January 4, 2019 11:45 PM, Michael S. Tsirkin wrote:
> >  struct virtio_balloon {
> > struct virtio_device *vdev;
> > struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > @@ -77,6 +81,8 @@ struct virtio_balloon {
> > /* Prevent updating balloon when it is being canceled. */
> > spinlock_t stop_update_lock;
> > bool stop_update;
> > +   /* Bitmap to indicate if reading the related config fields are needed
> */
> > +   unsigned long config_read_bitmap;
> >
> > /* The list of allocated free pages, waiting to be given back to mm */
> > struct list_head free_page_list;
> 
> It seems that you never initialize this bitmap. Probably harmless here but
> generally using uninitialized memory isn't good.

We've used kzalloc to allocate the vb struct, so it's already 0-initialized :)

> > +
> >  static int send_free_pages(struct virtio_balloon *vb)  {
> > int err;
> > @@ -620,6 +630,7 @@ static int send_free_pages(struct virtio_balloon *vb)
> >  * stop the reporting.
> >  */
> > cmd_id_active = virtio32_to_cpu(vb->vdev, vb-
> >cmd_id_active);
> > +   virtio_balloon_read_cmd_id_received(vb);
 

[1]


> > if (cmd_id_active != vb->cmd_id_received)
> > break;
> >
> > @@ -637,11 +648,9 @@ static int send_free_pages(struct virtio_balloon
> *vb)
> > return 0;
> >  }
> >
> > -static void report_free_page_func(struct work_struct *work)
> > +static void virtio_balloon_report_free_page(struct virtio_balloon
> > +*vb)
> >  {
> > int err;
> > -   struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
> > -report_free_page_work);
> > struct device *dev = >vdev->dev;
> >
> > /* Start by sending the received cmd id to host with an outbuf. */
> > @@ -659,6 +668,22 @@ static void report_free_page_func(struct
> work_struct *work)
> > dev_err(dev, "Failed to send a stop id, err = %d\n", err);  }
> >
> > +static void report_free_page_func(struct work_struct *work) {
> > +   struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
> > +report_free_page_work);
> > +
> > +   virtio_balloon_read_cmd_id_received(vb);
> 
> This will not achieve what you are trying to do, which is cancel reporting if 
> it's
> in progress.
> 
> You need to re-read each time you compare to cmd_id_active.

Yes, it already does; please see [1] above.


> An API similar to
>   u32 virtio_balloon_cmd_id_received(vb)
> seems to be called for, and I would rename cmd_id_received to
> cmd_id_received_cache to make sure we caught all users.
> 

I'm not sure about adding "cache" here. cmd_id_received refers to the cmd id
that the driver just received from the device. There is another field called
"cmd_id_active", which is the one the driver is actively using.
So cmd_id_received is already a "cache" of "cmd_id_active" in some sense.
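A toy model of the handshake the patch implements (the names mirror the driver's, but this is an illustrative sketch, not the real code): the driver re-reads the received cmd id from the device each iteration and stops reporting once it no longer matches the active cmd id.

```python
def send_free_pages(read_cmd_id_received, cmd_id_active, pages):
    # Report pages one by one, re-reading the device-provided cmd id
    # before each page so a host-requested stop is seen promptly.
    sent = []
    for page in pages:
        if read_cmd_id_received() != cmd_id_active:
            break  # host moved to a new command: stop reporting
        sent.append(page)
    return sent

# Simulated device config: the host switches to cmd id 2 mid-stream.
config = {"cmd": 1}
reads = {"n": 0}
def read_cmd_id_received():
    reads["n"] += 1
    if reads["n"] == 3:
        config["cmd"] = 2  # host requests a stop on the 3rd read
    return config["cmd"]

sent = send_free_pages(read_cmd_id_received, 1, list(range(5)))
```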

Best,
Wei


RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues

2018-12-30 Thread Wang, Wei W
On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote:
> 
> I guess you are the first one trying to read virtio config from within 
> interrupt
> context. AFAICT this never worked.

I'm not sure about "never worked". It seems to work well with virtio-pci.
But I look forward to hearing a solid reason why reading the config inside
the interrupt handler is forbidden (if that's true).

> About what happens. The apidoc of ccw_device_start() says it needs to be
> called with the ccw device lock held, so ccw_io_helper() tries to take it 
> (since
> forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and
> io_subchannel_initialize_dev() makes the ccw device lock be the subchannel
> lock. That means when one tries to get virtio config form within a cio
> interrupt context we deadlock, because we try to take a lock we already have.
> 
> That said, I don't think this limitation is by design (i.e. intended).
> Maybe Connie can help us with that question. AFAIK we have nothing
> documented regarding this (neither that can nor can't).
> 
> Obviously, there are multiple ways around this problem, and at the moment
> I can't tell which would be my preferred one.

Yes, it's also not difficult to tweak the virtio-balloon code to avoid that
issue. But if it's just an issue with ccw itself, I think it's better to tweak
ccw and leave virtio-balloon unchanged.

Best,
Wei



RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues

2018-12-28 Thread Wang, Wei W
On Friday, December 28, 2018 3:57 PM, Christian Borntraeger wrote:
> On 28.12.2018 03:26, Wei Wang wrote:
> > Some vqs don't need to be allocated when the related feature bits are
> > disabled. Callers notice the vq allocation layer by setting the
> > related names[i] to be NULL.
> >
> > This patch series fixes the find_vqs implementations to handle this case.
> 
> So the random crashes during boot are gone.
> What still does not work is actually using the balloon.
> 
> So in the qemu monitor using lets say "balloon 1000"  will hang the guest.
> Seems to be a deadlock in the virtio-ccw code.  We seem to call the config
> code in the interrupt handler.

Yes. It reads a config register from the interrupt handler. Do you know why ccw
doesn't support that, and which internal lock caused the deadlock?
 
Best,
Wei


RE: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-09-07 Thread Wang, Wei W
On Friday, September 7, 2018 8:30 PM, Dr. David Alan Gilbert wrote:
> OK, that's much better.
> The ~50% reduction with an 8G VM and a real workload is great, and it does
> what you expect when you put a lot more RAM in and see the 84% reduction
> on a guest with 128G RAM - 54s vs ~9s is a big win!
> 
> (The migrate_set_speed is a bit high, since that's in bytes/s - but it's not
> important).
> 
> That looks good,
> 

Thanks Dave for the feedback.
Hope you can join our discussion and review the v37 patches as well.

Best,
Wei


RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 9:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 21:44, Wang, Wei W wrote:
> > On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> >> On 2018/08/06 18:56, Wei Wang wrote:
> >>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >>>> On 2018/08/03 17:32, Wei Wang wrote:
> >>>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> >>>>> +*vb) {
> >>>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>>>> +    vb->shrinker.batch = 0;
> >>>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >>>> Why flags field is not set? If vb is allocated by
> >>>> kmalloc(GFP_KERNEL) and is nowhere zero-cleared, KASAN would
> complain it.
> >>>
> >>> Could you point where in the code that would complain it?
> >>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> >> they seem not related to that.
> >>
> >> Where is vb->shrinker.flags initialized?
> >
> > Is that mandatory to be initialized?
> 
> Of course. ;-)
> 
> > I find it's not initialized in most shrinkers (e.g. zs_register_shrinker,
> huge_zero_page_shrinker).
> 
> Because most shrinkers are "statically initialized (which means that
> unspecified fields are implicitly zero-cleared)" or "dynamically allocated 
> with
> __GFP_ZERO or zero-cleared using
> memset() (which means that all fields are once zero-cleared)".
> 
> And if you once zero-clear vb at allocation time, you will get a bonus that
> calling unregister_shrinker() without corresponding register_shrinker() is 
> safe
> (which will simplify initialization failure path).

Oh, I see, thanks. So it sounds better to kzalloc vb directly.
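The benefit Tetsuo describes can be sketched in a userspace analogy (not the kernel API; the field names here are assumptions): with the struct zero-allocated, the shrinker flags and a "registered" marker are guaranteed to start at zero, which makes calling unregister on a never-registered shrinker a safe no-op on the error path.

```python
def kzalloc_vb():
    # kzalloc analogue: every field starts zeroed, unlike plain kmalloc,
    # which would leave both fields as uninitialized garbage.
    return {"shrinker_flags": 0, "shrinker_registered": False}

def register_shrinker(vb):
    vb["shrinker_registered"] = True

def unregister_shrinker(vb):
    # Safe even without a prior register_shrinker(), because the
    # zero-initialized marker lets us detect that case.
    if not vb["shrinker_registered"]:
        return
    vb["shrinker_registered"] = False

vb = kzalloc_vb()
unregister_shrinker(vb)  # never registered: harmless no-op
register_shrinker(vb)
unregister_shrinker(vb)
```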

Best,
Wei

RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 18:56, Wei Wang wrote:
> > On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >> On 2018/08/03 17:32, Wei Wang wrote:
> >>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> >>> +*vb) {
> >>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>> +    vb->shrinker.batch = 0;
> >>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >> Why flags field is not set? If vb is allocated by kmalloc(GFP_KERNEL)
> >> and is nowhere zero-cleared, KASAN would complain it.
> >
> > Could you point where in the code that would complain it?
> > I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> they seem not related to that.
> 
> Where is vb->shrinker.flags initialized?

Is it mandatory to initialize it? I find it isn't initialized in most
shrinkers (e.g. zs_register_shrinker, huge_zero_page_shrinker).

Best,
Wei

RE: [PATCH v2 0/2] virtio-balloon: some improvements

2018-07-27 Thread Wang, Wei W
On Friday, July 27, 2018 10:06 PM, Michael S. Tsirkin wrote:
> On Fri, Jul 27, 2018 at 05:24:53PM +0800, Wei Wang wrote:
> > This series is split from the "Virtio-balloon: support free page
> > reporting" series to make some improvements.
> >
> > v1->v2 ChangeLog:
> > - register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is
> negotiated.
> >
> > Wei Wang (2):
> >   virtio-balloon: remove BUG() in init_vqs
> >   virtio_balloon: replace oom notifier with shrinker
> 
> Thanks!
> Given it's very late in the release cycle, I'll merge this for the next Linux
> release.

No problem. Thanks!

Best,
Wei


RE: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-22 Thread Wang, Wei W
On Friday, July 20, 2018 8:52 PM, Michael S. Tsirkin wrote:
> On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:
> > This patch series is separated from the previous "Virtio-balloon
> > Enhancement" series. The new feature,
> VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > implemented by this series enables the virtio-balloon driver to report
> > hints of guest free pages to the host. It can be used to accelerate
> > live migration of VMs. Here is an introduction of this usage:
> >
> > Live migration needs to transfer the VM's memory from the source
> > machine to the destination round by round. For the 1st round, all the
> > VM's memory is transferred. From the 2nd round, only the pieces of
> > memory that were written by the guest (after the 1st round) are
> > transferred. One method that is popularly used by the hypervisor to
> > track which part of memory is written is to write-protect all the guest
> memory.
> >
> > This feature enables the optimization by skipping the transfer of
> > guest free pages during VM live migration. It is not concerned that
> > the memory pages are used after they are given to the hypervisor as a
> > hint of the free pages, because they will be tracked by the hypervisor
> > and transferred in the subsequent round if they are used and written.
> >
> > * Tests
> > - Test Environment
> > Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > Guest: 8G RAM, 4 vCPU
> > Migration setup: migrate_set_speed 100G, migrate_set_downtime 2
> > second
> 
> Can we split out patches 1 and 2? They seem appropriate for this release ...

Sounds good to me. I'm not sure whether there will be comments on the first 2
patches. If not, can you just take them from here, or do you need me to repost
them separately?

Best,
Wei


RE: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-11 Thread Wang, Wei W
On Wednesday, July 11, 2018 7:10 PM, Michal Hocko wrote:
> On Wed 11-07-18 18:52:45, Wei Wang wrote:
> > On 07/11/2018 05:21 PM, Michal Hocko wrote:
> > > On Tue 10-07-18 18:44:34, Linus Torvalds wrote:
> > > [...]
> > > > That was what I tried to encourage with actually removing the
> > > > pages form the page list. That would be an _incremental_
> > > > interface. You can remove MAX_ORDER-1 pages one by one (or a
> > > > hundred at a time), and mark them free for ballooning that way.
> > > > And if you still feel you have tons of free memory, just continue
> removing more pages from the free list.
> > > We already have an interface for that. alloc_pages(GFP_NOWAIT,
> MAX_ORDER -1).
> > > So why do we need any array based interface?
> >
> > Yes, I'm trying to get free pages directly via alloc_pages, so there
> > will be no new mm APIs.
> 
> OK. The above was just a rough example. In fact you would need a more
> complex gfp mask. I assume you only want to balloon only memory directly
> usable by the kernel so it will be
>   (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_RECLAIM

Sounds good to me, thanks.

> 
> > I plan to let free page allocation stop when the remaining system free
> > memory becomes close to min_free_kbytes (prevent swapping).
> 
> ~__GFP_RECLAIM will make sure you are allocate as long as there is any
> memory without reclaim. It will not even poke the kswapd to do the
> background work. So I do not think you would need much more than that.

"Close to min_free_kbytes" - I meant that when doing the allocations, we
intentionally reserve some small amount of memory, e.g. 2 free page blocks of
order MAX_ORDER - 1. So when other applications happen to do some allocation,
they can easily get some from the reserved memory left on the free list.
Without that reserved memory, other allocations might push the system free
memory below WMARK[MIN], and kswapd would start swapping. This is just a small
optimization to reduce the probability of causing swapping (nice to have, but
not mandatory, because we allocate free page blocks one by one).
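For a rough sense of scale of that reservation, assuming the common x86-64 defaults of MAX_ORDER = 11 and 4 KiB pages (both values are assumptions for illustration, not stated in the thread):

```python
MAX_ORDER = 11   # assumed x86-64 default
PAGE_SIZE = 4096 # assumed 4 KiB pages

block_pages = 1 << (MAX_ORDER - 1)     # pages per order MAX_ORDER-1 block
block_bytes = block_pages * PAGE_SIZE  # bytes per block: 4 MiB
reserved_bytes = 2 * block_bytes       # two blocks kept on the free list: 8 MiB
```

So the reservation amounts to only about 8 MiB under these assumptions, which is why it is a cheap safeguard against dipping below the watermark.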

 > But let me note that I am not really convinced how this (or previous)
> approach will really work in most workloads. We tend to cache heavily so
> there is rarely any memory free.

With less free memory, the improvement becomes smaller, but it should still be
better than no optimization. For example, the Linux build workload causes
4~5 GB (out of 8 GB) of memory to be used as page cache at the final stage,
and there is still a ~44% live migration time reduction.

Since we have many cloud customers interested in this feature, I think we can
let them test its usefulness.

Best,
Wei


RE: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-10 Thread Wang, Wei W
On Tuesday, July 10, 2018 5:31 PM, Wang, Wei W wrote:
> Subject: [PATCH v35 1/5] mm: support to get hints of free page blocks
> 
> This patch adds support to get free page blocks from a free page list.
> The physical addresses of the blocks are stored to a list of buffers passed
> from the caller. The obtained free page blocks are hints about free pages,
> because there is no guarantee that they are still on the free page list after 
> the
> function returns.
> 
> One use example of this patch is to accelerate live migration by skipping the
> transfer of free pages reported from the guest. A popular method used by
> the hypervisor to track which part of memory is written during live migration
> is to write-protect all the guest memory. So, those pages that are hinted as
> free pages but are written after this function returns will be captured by the
> hypervisor, and they will be added to the next round of memory transfer.
> 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michael S. Tsirkin 
> Cc: Linus Torvalds 
> ---
>  include/linux/mm.h |  3 ++
>  mm/page_alloc.c| 98
> ++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h index a0fbb9f..5ce654f
> 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2007,6 +2007,9 @@ extern void free_area_init(unsigned long *
> zones_size);  extern void free_area_init_node(int nid, unsigned long *
> zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
> extern void free_initmem(void);
> +unsigned long max_free_page_blocks(int order);
> +int get_from_free_page_list(int order, struct list_head *pages,
> +	unsigned int size, unsigned long *loaded_num);
> 
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1521100..b67839b
> 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5043,6 +5043,104 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>   show_swap_cache_info();
>  }
> 
> +/**
> + * max_free_page_blocks - estimate the max number of free page blocks
> + * @order: the order of the free page blocks to estimate
> + *
> + * This function gives a rough estimation of the possible maximum
> +number of
> + * free page blocks a free list may have. The estimation works on an
> +assumption
> + * that all the system pages are on that list.
> + *
> + * Context: Any context.
> + *
> + * Return: The largest number of free page blocks that the free list can 
> have.
> + */
> +unsigned long max_free_page_blocks(int order)
> +{
> +	return totalram_pages / (1 << order);
> +}
> +EXPORT_SYMBOL_GPL(max_free_page_blocks);
> +
> +/**
> + * get_from_free_page_list - get hints of free pages from a free page list
> + * @order: the order of the free page list to check
> + * @pages: the list of page blocks used as buffers to load the addresses
> + * @size: the size of each buffer in bytes
> + * @loaded_num: the number of addresses loaded to the buffers
> + *
> + * This function offers hints about free pages. The addresses of free page
> + * blocks are stored to the list of buffers passed from the caller. There is
> + * no guarantee that the obtained free pages are still on the free page list
> + * after the function returns. pfn_to_page on the obtained free pages is
> + * strongly discouraged and if there is an absolute need for that, make sure
> + * to contact MM people to discuss potential problems.
> + *
> + * The addresses are currently stored to a buffer in little endian. This
> + * avoids the overhead of converting endianness by the caller who needs data
> + * in the little endian format. Big endian support can be added on demand in
> + * the future.
> + *
> + * Context: Process context.
> + *
> + * Return: 0 if all the free page block addresses are stored to the buffers;
> + * -ENOSPC if the buffers are not sufficient to store all the addresses;
> + * or -EINVAL if an unexpected argument is received (e.g. incorrect @order,
> + * empty buffer list).
> + */
> +int get_from_free_page_list(int order, struct list_head *pages,
> +			    unsigned int size, unsigned long *loaded_num)
> +{


Hi Linus,

We took your original suggestion: pass in pre-allocated buffers to load the 
addresses (we now use a list of pre-allocated page blocks as buffers). Hope 
that suggestion is still acceptable (the advantage of this method was explained 
here: https://lkml.org/lkml/2018/6/28/184).
Looking forward to your feedback. Thanks.
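For readers following along, the buffer-filling contract described above (fill the caller's buffers in order, return -ENOSPC when they run out) can be sketched in plain userspace C. All names below are hypothetical; the real get_from_free_page_list() walks the zone free lists under zone->lock and fills the caller's page blocks:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch of the buffer-filling contract (hypothetical names).
 * Addresses are stored into the caller's pre-allocated buffers one by
 * one; -ENOSPC is returned when the buffers are insufficient.
 */
static int store_hints(const uint64_t *free_addrs, size_t nr_free,
		       uint64_t *bufs[], size_t nr_bufs,
		       size_t buf_size, unsigned long *loaded_num)
{
	size_t per_buf = buf_size / sizeof(uint64_t);
	size_t i, b = 0, n = 0;

	if (!nr_bufs || !per_buf)
		return -EINVAL;	/* mirrors the -EINVAL case in the doc */

	for (i = 0; i < nr_free; i++) {
		if (n == per_buf) {	/* current buffer is full */
			if (++b == nr_bufs) {
				*loaded_num = b * per_buf;
				return -ENOSPC;	/* buffers insufficient */
			}
			n = 0;
		}
		bufs[b][n++] = free_addrs[i];	/* cpu_to_le64() in-kernel */
	}
	*loaded_num = b * per_buf + n;
	return 0;
}
```

This is only a model of the interface semantics, not of the zone-list walk itself.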

Best,
Wei 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wang, Wei W
On Friday, June 29, 2018 7:54 PM, David Hildenbrand wrote:
> On 29.06.2018 13:31, Wei Wang wrote:
> > On 06/29/2018 03:46 PM, David Hildenbrand wrote:
> >>>
> >>> I'm afraid it can't. For example, when we have a guest booted,
> >>> without too many memory activities. Assume the guest has 8GB free
> >>> memory. The arch_free_page there won't be able to capture the 8GB
> >>> free pages since there is no free() called. This results in no free pages
> reported to host.
> >>
> >> So, it takes some time from when the guest boots up until the balloon
> >> device was initialized and therefore page hinting can start. For that
> >> period, you won't get any arch_free_page()/page hinting callbacks, correct.
> >>
> >> However in the hypervisor, you can theoretically track which pages
> >> the guest actually touched ("dirty"), so you already know "which
> >> pages were never touched while booting up until virtio-balloon was
> >> brought to life". These, you can directly exclude from migration. No
> >> interface required.
> >>
> >> The remaining problem is pages that were touched ("allocated") by the
> >> guest during bootup but freed again, before virtio-balloon came up.
> >> One would have to measure how many pages these usually are, I would
> >> say it would not be that many (because recently freed pages are
> >> likely to be used again next for allocation). However, there are some
> >> pages not being reported.
> >>
> >> During the lifetime of the guest, this should not be a problem,
> >> eventually one of these pages would get allocated/freed again, so the
> >> problem "solves itself over time". You are looking into the special
> >> case of migrating the VM just after it has been started. But we have
> >> the exact same problem also for ordinary free page hinting, so we
> >> should rather solve that problem. It is not migration specific.
> >>
> >> If we are looking for an alternative to "problem solves itself",
> >> something like "if virtio-balloon comes up, it will report all free
> >> pages step by step using free page hinting, just like we would have
> >> from "arch_free_pages()"". This would be the same interface we are
> >> using for free page hinting - and it could even be made configurable in the
> guest.
> >>
> >> The current approach we are discussing internally for details about
> >> Nitesh's work ("how the magic inside arch_fee_pages() will work
> >> efficiently) would allow this as far as I can see just fine.
> >>
> >> There would be a tiny little window between virtio-balloon comes up
> >> and it has reported all free pages step by step, but that can be
> >> considered a very special corner case that I would argue is not worth
> >> it to be optimized.
> >>
> >> If I am missing something important here, sorry in advance :)
> >>
> >
> > Probably I didn't explain that well. Please see my re-try:
> >
> > That work is to monitor page allocation and free activities via
> > arch_alloc_pages and arch_free_pages. It has per-CPU lists to record
> > the pages that are freed to the mm free list, and the per-CPU lists
> > dump the recorded pages to a global list when any of them is full.
> > So its own per-CPU list will only be able to get free pages when an mm
> > free() function gets called. If we have 8GB of free memory on the mm
> > free list but no application uses it, no mm free() calls are made. In
> > that case, arch_free_pages isn't called and no free pages are added to
> > the per-CPU list, even though we have 8GB of free memory sitting on the
> > mm free list.
> > How would you guarantee the per-CPU lists have got all the free pages
> > that the mm free lists have?
> 
> As I said, if we have some mechanism that will scan the free pages (not
> arch_free_page() once and report hints using the same mechanism step by
> step (not your bulk interface)), this problem is solved. And as I said, this 
> is
> not a migration specific problem, we have the same problem in the current
> page hinting RFC. These pages have to be reported.
> 
> >
> > - I'm also worried about the overhead of maintaining so many per-CPU
> > lists and the global list. For example, if we have applications that
> > frequently allocate and free 4KB pages, each per-CPU list needs to
> > implement the buddy algorithm to sort and merge neighboring pages. Today
> > a server can have more than 100 CPUs, so there will be more than 100
> > per-CPU lists which need to sync to a global list under a lock; I'm
> > not sure if this would scale well.
> 
> The overhead in the current RFC is definitely too high. But I consider this a
> problem to be solved before page hinting would go upstream. And we are
> discussing right now "if we have a reasonable page hinting implementation,
> why would we need your interface in addition".
> 
> >
> > - This seems to be a burden imposed on the core mm memory
> > allocation/free path. The whole overhead needs to be carried during
> > the whole system life cycle. What we actually expected is to just make
> > one call 

RE: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wang, Wei W
On Friday, June 29, 2018 10:46 PM, Michael S. Tsirkin wrote:
> To: David Hildenbrand 
> Cc: Wang, Wei W ; virtio-...@lists.oasis-open.org;
> linux-ker...@vger.kernel.org; virtualization@lists.linux-foundation.org;
> k...@vger.kernel.org; linux...@kvack.org; mho...@kernel.org;
> a...@linux-foundation.org; torva...@linux-foundation.org;
> pbonz...@redhat.com; liliang.opensou...@gmail.com;
> yang.zhang...@gmail.com; quan@gmail.com; ni...@redhat.com;
> r...@redhat.com; pet...@redhat.com
> Subject: Re: [PATCH v34 0/4] Virtio-balloon: support free page reporting
> 
> On Wed, Jun 27, 2018 at 01:06:32PM +0200, David Hildenbrand wrote:
> > On 25.06.2018 14:05, Wei Wang wrote:
> > > This patch series is separated from the previous "Virtio-balloon
> > > Enhancement" series. The new feature,
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this series
> enables
> > > the virtio-balloon driver to report hints of guest free pages to the
> > > host. It can be used to accelerate live migration of VMs. Here is an
> introduction of this usage:
> > >
> > > Live migration needs to transfer the VM's memory from the source
> > > machine to the destination round by round. For the 1st round, all
> > > the VM's memory is transferred. From the 2nd round, only the pieces
> > > of memory that were written by the guest (after the 1st round) are
> > > transferred. One method that is popularly used by the hypervisor to
> > > track which part of memory is written is to write-protect all the guest
> memory.
> > >
> > > This feature enables the optimization by skipping the transfer of
> > > guest free pages during VM live migration. It is not concerned that
> > > the memory pages are used after they are given to the hypervisor as
> > > a hint of the free pages, because they will be tracked by the
> > > hypervisor and transferred in the subsequent round if they are used and
> written.
> > >
> > > * Tests
> > > - Test Environment
> > > Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > > Guest: 8G RAM, 4 vCPU
> > > Migration setup: migrate_set_speed 100G, migrate_set_downtime 2
> > > second
> > >
> > > - Test Results
> > > - Idle Guest Live Migration Time (results are averaged over 10 runs):
> > > - Optimization v.s. Legacy = 284ms vs 1757ms --> ~84% reduction
> > > - Guest with Linux Compilation Workload (make bzImage -j4):
> > > - Live Migration Time (average)
> > >   Optimization v.s. Legacy = 1402ms v.s. 2528ms --> ~44% reduction
> > > - Linux Compilation Time
> > >   Optimization v.s. Legacy = 5min6s v.s. 5min12s
> > >   --> no obvious difference
> > >
> >
> > Being in version 34 already, this whole thing still looks and feels
> > like a big hack to me. It might just be me, but especially if I read
> > about assumptions like "QEMU will not hotplug memory during
> > migration". This does not feel like a clean solution.
> >
> > I am still not sure if we really need this interface, especially as
> > real free page hinting might be on its way.
> >
> > a) we perform free page hinting by setting all free pages
> > (arch_free_page()) to zero. Migration will detect zero pages and
> > minimize #pages to migrate. I don't think this is a good idea but
> > Michel suggested to do a performance evaluation and Nitesh is looking
> > into that right now.
> 
> Yes this test is needed I think. If we can get most of the benefit without PV
> interfaces, that's nice.
> 
> Wei, I think you need this as part of your performance comparison
> too: set page poisoning value to 0 and enable KSM, compare with your
> patches.

Do you mean live migration with zero pages?
I can first share the amount of memory transferred during live migration that I saw:
Legacy is around 380MB,
Optimization is around 340MB.
This shows that most pages were already zero and got skipped during the legacy 
live migration, but the legacy time is still much larger because zero-page 
checking is costly.
(It's late night here; I can get you that with my server probably tomorrow.)

Best,
Wei







RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-20 Thread Wang, Wei W
On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters 
> > > > > even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
> 
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, 
which can report up to 2TB of free memory.

Regarding "getting two pages isn't harder" - do you mean passing two arrays (two 
allocations by get_free_pages(, MAX_ORDER - 1)) to the mm API?

Please see if the following logic aligns to what you think:

uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
unsigned long *arrays;

/*
 * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
 * store all the hints, we need to allocate multiple arrays.
 * max_hints: the max number of 4MB free page blocks
 * hints_per_page: the number of hints each page can store
 * hints_per_array: the number of hints an array can store
 * total_arrays: the number of arrays we need
 */
max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
hints_per_page = PAGE_SIZE / sizeof(__le64);
hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
total_arrays = max_hints / hints_per_array +
	       !!(max_hints % hints_per_array);
arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
for (i = 0; i < total_arrays; i++) {
	arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
				     MAX_ORDER - 1);
	if (!arrays[i])
		goto out;
}


- the mm API needs to be changed to support storing hints to multiple separated 
arrays offered by the caller.

Best,
Wei


RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Wang, Wei W
On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method 
always steals the first free page blocks, and sure, it can be changed to take 
more. But all of this can be achieved via kmalloc by the caller, which is more 
prudent and makes the code more straightforward. I think we shouldn't take that 
risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
that the max free memory to report is 2TB. Back to the motivation of this work, 
the cloud guys want to use this optimization to accelerate their guest live 
migration. 2TB guests are not common in today's clouds. When huge guests become 
common in the future, we can easily tweak this API to fill hints into scattered 
buffers (e.g. several 4MB arrays passed to this API) instead of a single one, 
as in this version.

This limitation doesn't cause any issue from a functionality perspective. For 
an extreme case like a 100TB guest live migration, which is theoretically 
possible today, this optimization helps skip 2TB of its free memory. The result 
is that it may reduce live migration time by only ~2%, but that is still better 
than not skipping the 2TB (as without the feature).

So, for the first release of this feature, I think it is better to have the 
simpler and more straightforward solution as we have now, and clearly document 
why it can report up to 2TB free memory.
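The arithmetic behind that 2TB figure can be written out explicitly (assuming 4KB pages and 4MB, i.e. MAX_ORDER - 1, free page blocks - my reading of the thread, not text from the patch itself):

```c
#include <assert.h>
#include <stdint.h>

/*
 * A single 4MB kmalloc-ed buffer holds 4MB / 8B = 512K little-endian
 * addresses, and each address names one 4MB free page block.
 */
static uint64_t max_reportable_bytes(void)
{
	uint64_t buf_bytes = 4ULL << 20;	/* largest kmalloc: 4MB */
	uint64_t nr_hints = buf_bytes / sizeof(uint64_t); /* 524288 hints */
	uint64_t block_bytes = 4ULL << 20;	/* one 4MB block per hint */

	return nr_hints * block_bytes;		/* 2^41 bytes = 2TB */
}
```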


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB rather than 2TB mostly because it needs to 
use, e.g., 2.5TB of memory from time to time. In the worst case, it only has 
0.5TB of free memory to report, but reporting 0.5TB with this optimization is 
still better than no optimization. (And the current 2TB limitation isn't a 
limitation for the 3TB guest in this case.)

Best,
Wei


RE: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-16 Thread Wang, Wei W
On Saturday, June 16, 2018 12:50 PM, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> > +/**
> > + * get_from_free_page_list - get free page blocks from a free page
> > +list
> > + * @order: the order of the free page list to check
> > + * @buf: the array to store the physical addresses of the free page
> > +blocks
> > + * @size: the array size
> > + *
> > + * This function offers hints about free pages. There is no guarantee
> > +that
> > + * the obtained free pages are still on the free page list after the
> > +function
> > + * returns. pfn_to_page on the obtained free pages is strongly
> > +discouraged
> > + * and if there is an absolute need for that, make sure to contact MM
> > +people
> > + * to discuss potential problems.
> > + *
> > + * The addresses are currently stored to the array in little endian.
> > +This
> > + * avoids the overhead of converting endianness by the caller who
> > +needs data
> > + * in the little endian format. Big endian support can be added on
> > +demand in
> > + * the future.
> > + *
> > + * Return the number of free page blocks obtained from the free page list.
> > + * The maximum number of free page blocks that can be obtained is
> > +limited to
> > + * the caller's array size.
> > + */
> 
> Please use:
> 
>  * Return: The number of free page blocks obtained from the free page list.
> 
> Also, please include a
> 
>  * Context: Any context.
> 
> or
> 
>  * Context: Process context.
> 
> or whatever other context this function can be called from.  Since you're
> taking the lock irqsafe, I assume this can be called from any context, but I
> wonder if it makes sense to have this function callable from interrupt 
> context.
> Maybe this should be callable from process context only.

Thanks, sounds better to make it process context only.

> 
> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> > +{
> > +	struct zone *zone;
> > +	enum migratetype mt;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long addr, flags;
> > +	uint32_t index = 0;
> > +
> > +	for_each_populated_zone(zone) {
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +			list = &zone->free_area[order].free_list[mt];
> > +			list_for_each_entry(page, list, lru) {
> > +				addr = page_to_pfn(page) << PAGE_SHIFT;
> > +				if (likely(index < size)) {
> > +					buf[index++] = cpu_to_le64(addr);
> > +				} else {
> > +					spin_unlock_irqrestore(&zone->lock,
> > +							       flags);
> > +					return index;
> > +				}
> > +			}
> > +		}
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +	}
> > +
> > +	return index;
> > +}
> 
> I wonder if (to address Michael's concern), you shouldn't instead use the 
> first
> free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);

Thanks for sharing. But probably we would not take this approach for the 
reasons below:

1) I'm not sure if getting a block of free pages to use could be that simple 
(just pluck it from the list as above). I think it is more prudent to let the 
callers allocate the array via the regular allocation functions. 

2) Callers may need to use this with their own defined protocols, and they want 
the header and payload (i.e. the obtained hints) to be located in physically 
contiguous memory (there are tricks they can use to make it work with 
non-contiguous memory, but that would just complicate things). In this case, it 
is better to have callers allocate the memory on their own, and pass the 
payload part of that memory to this API to get the payload filled.
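To make point 2) concrete, a caller-side layout could look like the sketch below. The struct and its field names are invented for illustration; only the payload part would be handed to the mm API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical protocol layout: a caller-defined header immediately
 * followed by the hint payload, all in one physically contiguous
 * allocation (e.g. from __get_free_pages()).
 */
struct hint_buf {
	uint32_t cmd_id;	/* caller-defined protocol header */
	uint32_t nr_hints;	/* e.g. how many hints follow */
	uint64_t payload[];	/* little-endian addresses, filled by the API */
};
```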

Best,
Wei



RE: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-15 Thread Wang, Wei W
On Saturday, June 16, 2018 7:09 AM, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang  wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages,
> > because there is no guarantee that they are still on the free page
> > list after the function returns.
> 
> Ack. This is the kind of simple interface where I don't need to worry about
> the MM code calling out to random drivers or subsystems.
> 
> I think that "order" should be checked for validity, but from a MM standpoint
> I think this is fine.
> 

Thanks, will add a validity check for "order".

Best,
Wei


RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-15 Thread Wang, Wei W
On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > indicates the support of reporting hints of guest free pages to host via
> virtio-balloon.
> > > >
> > > > Host requests the guest to report free page hints by sending a
> > > > command to the guest via setting the
> > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > bit of the host_cmd config register.
> > > >
> > > > As the first step here, virtio-balloon only reports free page
> > > > hints from the max order (10) free page list to host. This has
> > > > generated similar good results as reporting all free page hints during
> our tests.
> > > >
> > > > TODO:
> > > > - support reporting free page hints from smaller order free page lists
> > > >   when there is a need/request from users.
> > > >
> > > > Signed-off-by: Wei Wang 
> > > > Signed-off-by: Liang Li 
> > > > Cc: Michael S. Tsirkin 
> > > > Cc: Michal Hocko 
> > > > Cc: Andrew Morton 
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c     | 187 +---
> > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > index 6b237e3..582a03b 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -43,6 +43,9 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +/* The size of memory in bytes allocated for reporting free page hints */
> > > > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > Doesn't this limit memory size of the guest we can report?
> > > Apparently to several gigabytes ...
> > > OTOH huge guests with lots of free memory is exactly where we would
> > > gain the most ...
> >
> > Yes, the 16-page array can report up to 32GB (each page can hold 512
> addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> memory to host. It is not flexible.
> >
> > How about allocating the buffer according to the guest memory size
> > (proportional)? That is,
> >
> > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > pages blocks that the system can have */ 4m_page_blocks =
> > totalram_pages / 1024;
> >
> > /* Allocating one page can hold 512 free page blocks, so calculates
> > the number of pages that can hold those 4MB blocks. And this
> > allocation should not exceed 1024 pages */ pages_to_allocate =
> > min(4m_page_blocks / 512, 1024);
> >
> > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> 1024 pages as the buffer.
> >
> > When the guest has large memory, it should be easier to succeed in
> allocation of large buffer. If that allocation fails, that implies that 
> nothing
> would be got from the 4MB free page list.
> >
> > I think the proportional allocation is simpler compared to other
> > approaches like
> > - scattered buffer, which will complicate the get_from_free_page_list
> > implementation;
> > - one buffer to call get_from_free_page_list multiple times, which needs
> get_from_free_page_list to maintain states.. also too complicated.
> >
> > Best,
> > Wei
> >
> 
> That's more reasonable, but question remains what to do if that value
> exceeds MAX_ORDER. I'd say maybe tell host we can't report it.

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the 
maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the 
optimization can still offer 2TB free memory (better than no optimization).

On the other hand, large guests are large mostly because they need to use large 
memory. In that case, they usually won't have that much free memory to report.

> 
> Also allocating it with GFP_KERNEL is out. You only want to take it off the 
> free
> list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Sounds good, thanks.

> Also you can't allocate this on device start. First totalram_pages can change.
> Second that's too much memory to tie up forever.

Yes, makes sense.

Best,
Wei


RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-15 Thread Wang, Wei W
On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > the support of reporting hints of guest free pages to host via 
> > virtio-balloon.
> >
> > Host requests the guest to report free page hints by sending a command
> > to the guest via setting the
> VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > bit of the host_cmd config register.
> >
> > As the first step here, virtio-balloon only reports free page hints
> > from the max order (10) free page list to host. This has generated
> > similar good results as reporting all free page hints during our tests.
> >
> > TODO:
> > - support reporting free page hints from smaller order free page lists
> >   when there is a need/request from users.
> >
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Michal Hocko 
> > Cc: Andrew Morton 
> > ---
> >  drivers/virtio/virtio_balloon.c     | 187 +---
> >  include/uapi/linux/virtio_balloon.h |  13 +++
> >  2 files changed, 163 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 6b237e3..582a03b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -43,6 +43,9 @@
> >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> >
> > +/* The size of memory in bytes allocated for reporting free page hints */
> > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> 
> Doesn't this limit memory size of the guest we can report?
> Apparently to several gigabytes ...
> OTOH huge guests with lots of free memory is exactly where we would gain
> the most ...

Yes, the 16-page array can report up to 32GB of free memory to the host (each 
page can hold 512 addresses of 4MB free page blocks, i.e. 2GB of free memory 
per page). It is not flexible.

How about allocating the buffer according to the guest memory size 
(proportional)? That is,

/* Calculate the maximum number of 4MB (i.e. 1024-page) free page blocks
   that the system can have */
4m_page_blocks = totalram_pages / 1024;

/* One page can hold 512 free page block addresses, so calculate the number
   of pages needed to hold those 4MB blocks. This allocation should not
   exceed 1024 pages */
pages_to_allocate = min(4m_page_blocks / 512, 1024);

For a 2TB guest, which has 2^19 page blocks (4MB each), we will allocate 1024 
pages as the buffer.

When the guest has large memory, it should be easier to successfully allocate 
a large buffer. If that allocation fails, it implies that nothing would be 
obtained from the 4MB free page list anyway.
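The proportional sizing above can be checked with a quick userspace sketch (assuming 4KB pages; a 2TB guest has 2TB / 4KB = 2^29 pages, i.e. 2^19 4MB blocks, and 2^19 / 512 = 1024 buffer pages, exactly the cap):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Sketch of the proposed proportional buffer sizing. */
static uint64_t pages_to_allocate(uint64_t totalram_pages)
{
	uint64_t blocks_4m = totalram_pages / 1024; /* 1024 pages per 4MB block */

	return MIN(blocks_4m / 512, 1024);	/* 512 hints fit in one page */
}
```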

I think the proportional allocation is simpler compared to other approaches 
like:
- a scattered buffer, which would complicate the get_from_free_page_list 
implementation;
- one buffer used to call get_from_free_page_list multiple times, which would 
require get_from_free_page_list to maintain state - also too complicated.

Best,
Wei





RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-05 Thread Wang, Wei W
On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 05, 2018 at 12:30:27AM +, Wang, Wei W wrote:
> > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > stale or not,
> > >
> > >
> > > No - I mean that driver has code that compares two values and stops
> > > reporting. Can one of the values be stale?
> >
> > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide
> > if it needs to stop reporting hints, and cmd_id_received is what the
> > driver reads from host (host notifies the driver to read for the
> > latest value). If host sends a new cmd id, it will notify the guest to
> > read again. I'm not sure how that could be a stale cmd id (or maybe I
> > misunderstood your point here?)
> >
> > Best,
> > Wei
> 
> The comparison is done in one thread, the update in another one.

I think this isn't something that could be solved by adding a lock, unless the 
host waits for the driver's ACK about finishing the update (this was not agreed 
on in the QEMU part of the discussion).

Actually virtio_balloon has F_IOMMU_PLATFORM disabled, so maybe we don't need 
to worry about the DMA API case (we only have gpa added to the vq, and having 
some entries stay in the vq seems fine). For this feature, I think it would not 
work with F_IOMMU enabled either.

If there is any further need (I can't think of one so far), I think we could 
consider letting the host inject a vq interrupt at some point, and then the 
driver's handler can do the virtqueue_get_buf work.

Best,
Wei



RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-04 Thread Wang, Wei W
On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > I'm afraid the driver couldn't be aware if the added hints are stale
> > or not,
> 
> 
> No - I mean that driver has code that compares two values and stops
> reporting. Can one of the values be stale?

The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide if it 
needs to stop reporting hints, and cmd_id_received is what the driver reads 
from host (host notifies the driver to read for the latest value). If host 
sends a new cmd id, it will notify the guest to read again. I'm not sure how 
that could be a stale cmd id (or maybe I misunderstood your point here?)

Best,
Wei


RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-04 Thread Wang, Wei W
On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > > +static int add_one_sg(struct virtqueue *vq, unsigned long pfn,
> > > > > +                      uint32_t len)
> > > > > +{
> > > > > +   struct scatterlist sg;
> > > > > +   unsigned int unused;
> > > > > +
> > > > > +   sg_init_table(&sg, 1);
> > > > > +   sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> > > > > +
> > > > > +   /* Detach all the used buffers from the vq */
> > > > > +   while (virtqueue_get_buf(vq, &unused))
> > > > > +   ;
> > > > +
> > > > +   /*
> > > > +* Since this is an optimization feature, losing a couple of 
> > > > free
> > > > +* pages to report isn't important. We simply return without 
> > > > adding
> > > > +* the page hint if the vq is full.
> > > why not stop scanning of following pages though?
> >
> > Because continuing to send hints is a way to deliver the maximum
> > possible hints to host. For example, host may have a delay in taking
> > hints at some point, and then it resumes to take hints soon. If the
> > driver does not stop when the vq is full, it will be able to put more
> > hints to the vq once the vq has available entries to add.
> 
> What this appears to be is just lack of coordination between host and guest.
> 
> But meanwhile you are spending cycles walking the list uselessly.
> Instead of trying willy-nilly, the standard thing to do is to wait for host to
> consume an entry and proceed.
> 
> Coding it up might be tricky, so it's probably acceptable as is for now, but
> please replace the justification above with a TODO entry that we should
> synchronize with the host.

Thanks. I plan to add

TODO: The current implementation could be further improved by stopping the 
reporting when the vq is full and resuming it when host notifies that there 
are available entries for the driver to add.


> 
> 
> >
> > >
> > > > +* We are adding one entry each time, which essentially results 
> > > > in no
> > > > +* memory allocation, so the GFP_KERNEL flag below can be 
> > > > ignored.
> > > > +* Host works by polling the free page vq for hints after 
> > > > sending the
> > > > +* starting cmd id, so the driver doesn't need to kick after 
> > > > filling
> > > > +* the vq.
> > > > +* Lastly, there is always one entry reserved for the cmd id to 
> > > > use.
> > > > +*/
> > > > +   if (vq->num_free > 1)
> > > > > +   return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +
> > > > > +static int virtio_balloon_send_free_pages(void *opaque,
> > > > > +  unsigned long pfn,
> > > > > +  unsigned long nr_pages)
> > > > +{
> > > > +   struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > > +   uint32_t len = nr_pages << PAGE_SHIFT;
> > > > +
> > > > +   /*
> > > > +* If a stop id or a new cmd id was just received from host, 
> > > > stop
> > > > +* the reporting, and return 1 to indicate an active stop.
> > > > +*/
> > > > > +   if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb->cmd_id_received)
> > > > > +   return 1;
> 
> functions returning int should return 0 or -errno on failure, positive return
> code should indicate progress.
> 
> If you want a boolean, use bool pls.

OK. I plan to change the 1 to -EBUSY to indicate the case where host actively 
asks the driver to stop reporting (this makes the callback's return value type 
consistent with walk_free_mem_block).



> 
> 
> > > > +
> > > this access to cmd_id_use and cmd_id_received without locks bothers
> > > me. Pls document why it's safe.
> >
> > OK. Probably we could add below to the above comments:
> >
> > cmd_id_use and cmd_id_received don't need to be accessed under locks
> > because the reporting does not have to stop immediately before
> > cmd_id_received is changed (i.e. when host requests to stop). That is,
> > reporting more hints after host requests to stop isn't an issue for
> > this optimization feature, because host will simply drop the stale
> > hints next time when it needs a new reporting.
> 
> What about the other direction? Can this observe a stale value and exit
> erroneously?

I'm afraid the driver couldn't tell whether the added hints are stale or not, 
because host and guest actions happen asynchronously. That is, the host-side 
iothread stops taking hints as soon as the migration thread asks it to stop; it 
doesn't wait for any ACK from the driver (as we discussed before, host couldn't 
always assume that the driver is in a responsive state).

Btw, we also don't need to worry about any memory left in the vq: since only 
addresses are added to the vq, there are no real memory allocations.

Best,
Wei

RE: [PATCH v29 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-03-25 Thread Wang, Wei W
On Monday, March 26, 2018 10:40 AM, Wang, Wei W wrote:
> Subject: [PATCH v29 3/4] mm/page_poison: expose page_poisoning_enabled
> to kernel modules
> 
> In some usages, e.g. virtio-balloon, a kernel module needs to know if page
> poisoning is in use. This patch exposes the page_poisoning_enabled function
> to kernel modules.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> ---
>  mm/page_poison.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_poison.c b/mm/page_poison.c index e83fd44..762b472
> 100644
> --- a/mm/page_poison.c
> +++ b/mm/page_poison.c
> @@ -17,6 +17,11 @@ static int early_page_poison_param(char *buf)  }
> early_param("page_poison", early_page_poison_param);
> 
> +/**
> + * page_poisoning_enabled - check if page poisoning is enabled
> + *
> + * Return true if page poisoning is enabled, or false if not.
> + */
>  bool page_poisoning_enabled(void)
>  {
>   /*
> @@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
> 
>   (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
>   debug_pagealloc_enabled()));
>  }
> +EXPORT_SYMBOL_GPL(page_poisoning_enabled);
> 
>  static void poison_page(struct page *page)  {
> --
> 2.7.4


Could we get a review of this patch? We've reviewed other parts, and this one 
seems to be the last part of this feature. Thanks.

Best,
Wei


RE: [vhost:vhost 20/20] ERROR: "page_poisoning_enabled" [drivers/virtio/virtio_balloon.ko] undefined!

2018-02-06 Thread Wang, Wei W
On Wednesday, February 7, 2018 10:52 AM, Michael S. Tsirkin wrote:
> On Wed, Feb 07, 2018 at 10:25:35AM +0800, Wei Wang wrote:
> > On 02/07/2018 09:26 AM, kbuild test robot wrote:
> > > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> vhost
> > > head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
> > > commit: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 [20/20]
> > > virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> > > config: ia64-allmodconfig (attached as .config)
> > > compiler: ia64-linux-gcc (GCC) 7.2.0
> > > reproduce:
> > >  wget https://raw.githubusercontent.com/intel/lkp-
> tests/master/sbin/make.cross -O ~/bin/make.cross
> > >  chmod +x ~/bin/make.cross
> > >  git checkout 96bcd04462b99e2c80e09f6537770a0ca6b288d0
> > >  # save the attached .config to linux build tree
> > >  make.cross ARCH=ia64
> > >
> > > All errors (new ones prefixed by >>):
> > >
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/auxdisplay/img-ascii-lcd.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-
> ath79.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-
> iop.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/iio/accel/kxsd9-i2c.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/iio/adc/qcom-vadc-common.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/media/platform/mtk-vcodec/mtk-vcodec-common.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/media/platform/tegra-cec/tegra_cec.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/mtd/nand/denali_pci.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/pinctrl/pxa/pinctrl-pxa2xx.o
> > > see include/linux/module.h for more information
> > > WARNING: modpost: missing MODULE_LICENSE() in
> drivers/power/reset/zx-reboot.o
> > > see include/linux/module.h for more information
> > > > > ERROR: "page_poisoning_enabled" [drivers/virtio/virtio_balloon.ko]
> undefined!
> >
> > page_poisoning_enabled needs to be exposed. I'll send a small patch to
> > add EXPORT_SYMBOL_GPL(page_poisoning_enabled).
> >
> >
> > Best,
> > Wei
> 
> This will probably miss this release cycle.

OK, if it's too difficult. My bad, I didn't catch that in time, too sad :(

I just resent that patch with the fix.

Best,
Wei


RE: [PATCH v24 1/2] mm: support reporting free page blocks

2018-01-27 Thread Wang, Wei W
On Friday, January 26, 2018 11:00 PM, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 11:29:15AM +0800, Wei Wang wrote:
> > On 01/25/2018 09:41 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote:
> > > > This patch adds support to walk through the free page blocks in
> > > > the system and report them via a callback function. Some page
> > > > blocks may leave the free list after zone->lock is released, so it
> > > > is the caller's responsibility to either detect or prevent the use of 
> > > > such
> pages.
> > > >
> > > > One use example of this patch is to accelerate live migration by
> > > > skipping the transfer of free pages reported from the guest. A
> > > > popular method used by the hypervisor to track which part of
> > > > memory is written during live migration is to write-protect all
> > > > the guest memory. So, those pages that are reported as free pages
> > > > but are written after the report function returns will be captured
> > > > by the hypervisor, and they will be added to the next round of memory
> transfer.
> > > >
> > > > Signed-off-by: Wei Wang 
> > > > Signed-off-by: Liang Li 
> > > > Cc: Michal Hocko 
> > > > Cc: Michael S. Tsirkin 
> > > > Acked-by: Michal Hocko 
> > > > ---
> > > > >   include/linux/mm.h |  6 ++++++
> > > > >   mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++
> > > > >   2 files changed, 97 insertions(+)
> > > >
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index ea818ff..b3077dd 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > > @@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> > > > > unsigned long zone_start_pfn, unsigned long *zholes_size);
> > > >   extern void free_initmem(void);
> > > > +extern void walk_free_mem_block(void *opaque,
> > > > +   int min_order,
> > > > +   bool (*report_pfn_range)(void *opaque,
> > > > +unsigned long 
> > > > pfn,
> > > > +unsigned long 
> > > > num));
> > > > +
> > > >   /*
> > > > >* Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> > > > >* into the buddy system. The freed pages will be poisoned with pattern
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 76c9688..705de22 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > > @@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > > > show_swap_cache_info();
> > > >   }
> > > > > +/*
> > > > > + * Walk through a free page list and report the found pfn range via the
> > > > > + * callback.
> > > > > + *
> > > > > + * Return false if the callback requests to stop reporting. Otherwise,
> > > > > + * return true.
> > > > > + */
> > > > +static bool walk_free_page_list(void *opaque,
> > > > +   struct zone *zone,
> > > > +   int order,
> > > > +   enum migratetype mt,
> > > > +   bool (*report_pfn_range)(void *,
> > > > +unsigned long,
> > > > +unsigned long))
> > > > +{
> > > > +   struct page *page;
> > > > +   struct list_head *list;
> > > > +   unsigned long pfn, flags;
> > > > +   bool ret;
> > > > +
> > > > > +   spin_lock_irqsave(&zone->lock, flags);
> > > > > +   list = &zone->free_area[order].free_list[mt];
> > > > +   list_for_each_entry(page, list, lru) {
> > > > +   pfn = page_to_pfn(page);
> > > > +   ret = report_pfn_range(opaque, pfn, 1 << order);
> > > > +   if (!ret)
> > > > +   break;
> > > > +   }
> > > > > +   spin_unlock_irqrestore(&zone->lock, flags);
> > > > +
> > > > +   return ret;
> > > > +}
> > > There are two issues with this API. One is that it is not
> > > restarteable: if you return false, you start from the beginning. So
> > > no way to drop lock, do something slow and then proceed.
> > >
> > > Another is that you are using it to report free page hints.
> > > Presumably the point is to drop these pages - keeping them near head
> > > of the list and reusing the reported ones will just make everything
> > > slower invalidating the hint.
> > >
> > > How about rotating these pages towards the end of the list?
> > > Probably not on each call, callect reported pages and then move them
> > > to tail when we exit.
> >
> >
> > I'm not sure how this would help. For example, we have a list of 2M
> > free page blocks:
> > A-->B-->C-->D-->E-->F-->G-->H
> >
> > After reporting A and B, and put them to 

RE: [PATCH v24 1/2] mm: support reporting free page blocks

2018-01-27 Thread Wang, Wei W
On Saturday, January 27, 2018 5:44 AM, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 05:00:09PM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 26, 2018 at 11:29:15AM +0800, Wei Wang wrote:
> > > On 01/25/2018 09:41 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote:
> > > 2) If worse, all the blocks have been split into smaller blocks and
> > > used after the caller comes back.
> > >
> > > where could we continue?
> >
> > I'm not sure. But an alternative appears to be to hold a lock and just
> > block whoever wanted to use any pages.  Yes we are sending hints
> > faster but apparently something wanted these pages, and holding the
> > lock is interfering with this something.
> 
> I've been thinking about it. How about the following scheme:
> 1. register balloon to get a (new) callback when free list runs empty
> 2. take pages off the free list, add them to the balloon specific list
> 3. report to host
> 4. readd to free list at tail
> 5. if callback triggers, interrupt balloon reporting to host,
>    and readd to free list at tail

So in step 2, when we walk through the free page list, taking each block and 
adding it to the balloon-specific list, is this performed under the mm lock? If 
it is still under the lock, what would be the difference compared to walking 
through the free list and adding each block to the virtqueue?

Best,
Wei


RE: [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-20 Thread Wang, Wei W
On Wednesday, December 20, 2017 8:26 PM, Matthew Wilcox wrote:
> On Wed, Dec 20, 2017 at 06:34:36PM +0800, Wei Wang wrote:
> > On 12/19/2017 10:05 PM, Tetsuo Handa wrote:
> > > I think xb_find_set() has a bug in !node path.
> >
> > I think we can probably remove the "!node" path for now. It would be
> > good to get the fundamental part in first, and leave optimization to
> > come as separate patches with corresponding test cases in the future.
> 
> You can't remove the !node path.  You'll see !node when the highest set bit
> is less than 1024.  So do something like this:
> 
>   unsigned long bit;
>   xb_preload(GFP_KERNEL);
>   xb_set_bit(xb, 700);
>   xb_preload_end();
>   bit = xb_find_set(xb, ULONG_MAX, 0);
>   assert(bit == 700);

The above test will result in "!node with bitmap != NULL", so it goes to the 
regular "if (bitmap)" path, which finds 700.

A better test would be
...
xb_set_bit(xb, 700);
assert(xb_find_set(xb, ULONG_MAX, 800) == ULONG_MAX);
...

The first try with the "if (bitmap)" path doesn't find a set bit, and the 
remaining tries will always result in "!node && !bitmap", which implies there 
is no set bit anymore and no need to keep trying.

So, I think we can change it to

	if (!node && !bitmap)
		return size;


Best,
Wei




RE: [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Wang, Wei W
On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
> On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> > Here's the API I'm looking at right now.  The user need take no lock;
> > the locking (spinlock) is handled internally to the implementation.

In another place I saw your comment "The xb_ API requires you to handle your own 
locking", which seems to conflict with the above "the user need take no lock".
Doesn't the caller need a lock to avoid concurrent accesses to the ida bitmap?


> I looked at the API some more and found some flaws:
>  - how does xbit_alloc communicate back which bit it allocated?
>  - What if xbit_find_set() is called on a completely empty array with
>a range of 0, ULONG_MAX -- there's no invalid number to return.

We'll change it to "bool xb_find_set(.., unsigned long *result)", where 
returning false indicates that no set bit is found.


>  - xbit_clear() can't return an error.  Neither can xbit_zero().

I found that the current xbit_clear implementation only returns 0, and there 
isn't an error to be returned from this function. In that case, would it be 
better to make the function "void"?


>  - Need to add __must_check to various return values to discourage sloppy
>programming
> 
> So I modify the proposed API we compete with thusly:
> 
> bool xbit_test(struct xbitmap *, unsigned long bit); int __must_check
> xbit_set(struct xbitmap *, unsigned long bit, gfp_t); void xbit_clear(struct
> xbitmap *, unsigned long bit); int __must_check xbit_alloc(struct xbitmap *,
> unsigned long *bit, gfp_t);
> 
> int __must_check xbit_fill(struct xbitmap *, unsigned long start,
> unsigned long nbits, gfp_t); void xbit_zero(struct 
> xbitmap *,
> unsigned long start, unsigned long nbits); int __must_check
> xbit_alloc_range(struct xbitmap *, unsigned long *bit,
> unsigned long nbits, gfp_t);
> 
> bool xbit_find_clear(struct xbitmap *, unsigned long *start, unsigned long
> max); bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned
> long max);
> 
> (I'm a little sceptical about the API accepting 'max' for the find functions 
> and
> 'nbits' in the fill/zero/alloc_range functions, but I think that matches how
> people want to use it, and it matches how bitmap.h works)

Are you suggesting renaming the current xb_ APIs to the above xbit_ names 
(with parameter changes)?

Why would we need xbit_alloc, which looks like ida_get_new? I think set/clear 
should be adequate for the current usages.

Best,
Wei





RE: [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Wang, Wei W


> -Original Message-
> From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
> Sent: Sunday, December 17, 2017 6:22 PM
> To: Wang, Wei W <wei.w.w...@intel.com>; wi...@infradead.org
> Cc: virtio-...@lists.oasis-open.org; linux-ker...@vger.kernel.org; qemu-
> de...@nongnu.org; virtualization@lists.linux-foundation.org;
> k...@vger.kernel.org; linux...@kvack.org; m...@redhat.com;
> mho...@kernel.org; a...@linux-foundation.org; mawil...@microsoft.com;
> da...@redhat.com; cornelia.h...@de.ibm.com;
> mgor...@techsingularity.net; aarca...@redhat.com;
> amit.s...@redhat.com; pbonz...@redhat.com;
> liliang.opensou...@gmail.com; yang.zhang...@gmail.com;
> quan...@aliyun.com; ni...@redhat.com; r...@redhat.com
> Subject: Re: [PATCH v19 3/7] xbitmap: add more operations
> 
> Wei Wang wrote:
> > > But passing GFP_NOWAIT means that we can handle allocation failure.
> > > There is no need to use preload approach when we can handle allocation
> failure.
> >
> > I think the reason we need xb_preload is because radix tree insertion
> > needs the memory being preallocated already (it couldn't suffer from
> > memory failure during the process of inserting, probably because
> > handling the failure there isn't easy, Matthew may know the backstory
> > of
> > this)
> 
> According to https://lwn.net/Articles/175432/ , I think that preloading is
> needed only when failure to insert an item into a radix tree is a significant
> problem.
> That is, when failure to insert an item into a radix tree is not a problem, I
> think that we don't need to use preloading.

It also mentions that the preload attempts to allocate sufficient memory to 
*guarantee* that the next radix tree insertion cannot fail.

If we check radix_tree_node_alloc(), the comment there says "this assumes that 
the caller has performed appropriate preallocation".

So, I think we would risk triggering issues without preload().

> >
> > So, I think we can handle the memory failure with xb_preload, which
> > stops going into the radix tree APIs, but shouldn't call radix tree
> > APIs without the related memory preallocated.
> 
> It seems to me that virtio-ballon case has no problem without using
> preloading.

Why is that?

Best,
Wei



RE: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Wang, Wei W
On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > > Wei Wang wrote:
> > >> + * @start: the start of the bit range, inclusive
> > >> + * @end: the end of the bit range, inclusive
> > >> + *
> > >> + * This function is used to clear a bit in the xbitmap. If all the
> > >> +bits of the
> > >> + * bitmap are 0, the bitmap will be freed.
> > >> + */
> > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> > >> +unsigned long end) {
> > >> +struct radix_tree_root *root = &xb->xbrt;
> > >> +struct radix_tree_node *node;
> > >> +void **slot;
> > >> +struct ida_bitmap *bitmap;
> > >> +unsigned int nbits;
> > >> +
> > >> +for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 
> > >> 1) {
> > >> +unsigned long index = start / IDA_BITMAP_BITS;
> > >> +unsigned long bit = start % IDA_BITMAP_BITS;
> > >> +
> > >> +bitmap = __radix_tree_lookup(root, index, &node, &slot);
> > >> +if (radix_tree_exception(bitmap)) {
> > >> +unsigned long ebit = bit + 2;
> > >> +unsigned long tmp = (unsigned long)bitmap;
> > >> +
> > >> +nbits = min(end - start + 1, BITS_PER_LONG - 
> > >> ebit);
> > > "nbits = min(end - start + 1," seems to expect that start == end is
> > > legal for clearing only 1 bit. But this function is no-op if start == end.
> > > Please clarify what "inclusive" intended.
> >
> > If xb_clear_bit_range(xb,10,10), then it is effectively the same as
> > xb_clear_bit(10). Why would it be illegal?
> >
> > "@start inclusive" means that the @start will also be included to be
> > cleared.
> 
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.


How about "start <= end "?

Best,
Wei





RE: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-30 Thread Wang, Wei W
On Thursday, November 30, 2017 6:36 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > +static inline int xb_set_page(struct virtio_balloon *vb,
> > +  struct page *page,
> > +  unsigned long *pfn_min,
> > +  unsigned long *pfn_max)
> > +{
> > +   unsigned long pfn = page_to_pfn(page);
> > +   int ret;
> > +
> > +   *pfn_min = min(pfn, *pfn_min);
> > +   *pfn_max = max(pfn, *pfn_max);
> > +
> > +   do {
> > +   ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
> > +GFP_NOWAIT | __GFP_NOWARN);
> 
> It is a bit of pity that __GFP_NOWARN here is applied to only xb_preload().
> Memory allocation by xb_set_bit() will after all emit warnings. Maybe
> 
>   xb_init(&vb->page_xb);
>   vb->page_xb.gfp_mask |= __GFP_NOWARN;
> 
> is tolerable? Or, unconditionally apply __GFP_NOWARN at xb_init()?
> 



Please have a check this one: radix_tree_node_alloc()

In our case, I think the code path goes to 

if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
...
ret = kmem_cache_alloc(radix_tree_node_cachep,
   gfp_mask | __GFP_NOWARN);
...
goto out;
}

So I think the __GFP_NOWARN is already there.



>   static inline void xb_init(struct xb *xb)
>   {
>   INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
>   }
> 
> > +   } while (unlikely(ret == -EAGAIN));
> > +
> > +   return ret;
> > +}
> > +
> 
> 
> 
> > @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon
> *vb, size_t num)
> > vb->num_pfns = 0;
> >
> > while ((page = balloon_page_pop())) {
> > +   if (use_sg) {
> > +   if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> > +   __free_page(page);
> > +   break;
> 
> You cannot "break;" without consuming all pages in "pages".


Right, I think it should be "continue" here. Thanks. 

> 
> > +   }
> > +   } else {
> > +   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > +   }
> > +
> > balloon_page_enqueue(&vb->vb_dev_info, page);
> >
> > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > -
> > -   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > if (!virtio_has_feature(vb->vdev,
> >
>   VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> 
> 
> 
> > @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
> > struct page *page;
> > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > LIST_HEAD(pages);
> > +   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> 
> You can pass use_sg as an argument to leak_balloon(). Then, you won't need
> to define leak_balloon_sg_oom(). Since xbitmap allocation does not use
> __GFP_DIRECT_RECLAIM, it is safe to reuse leak_balloon() for OOM path.
> Just be sure to pass use_sg == false because memory allocation for use_sg ==
> true likely fails when called from OOM path. (But trying use_sg == true for
> OOM path and then fallback to use_sg == false is not bad?)
> 


But once the SG mechanism is in use, we cannot use the non-SG mechanism, 
because the interface between the guest and host is not the same for SG and 
non-SG. Making the host support both mechanisms at the same time would 
complicate the interface and implementation.

I also think the old non-SG mechanism should be deprecated, since its 
implementation isn't perfect in some respects. E.g. it balloons pages using an 
outbuf, which implies that no changes are allowed to the balloon pages, and 
this isn't true for ballooning. The new (SG) mechanism has changed this to use 
an inbuf.

So I think using leak_balloon_sg_oom() would be better. Is there any reason 
we couldn't use it?

Best,
Wei



RE: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2017-11-17 Thread Wang, Wei W
On Friday, November 17, 2017 8:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> > On 11/16/2017 09:27 PM, Wei Wang wrote:
> > > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature
> > > > > indicates the support of reporting hints of guest free pages to
> > > > > the host via virtio-balloon. The host requests the guest to
> > > > > report the free pages by sending commands via the virtio-balloon
> configuration registers.
> > > > >
> > > > > When the guest starts to report, the first element added to the
> > > > > free page vq is a sequence id of the start reporting command.
> > > > > The id is given by the host, and it indicates whether the
> > > > > following free pages correspond to the command. For example, the
> > > > > host may stop the report and start again with a new command id.
> > > > > The obsolete pages for the previous start command can be
> > > > > detected by the id dismatching on the host. The id is added to
> > > > > the vq using an output buffer, and the free pages are added to
> > > > > the vq using input buffer.
> > > > >
> > > > > Here are some explainations about the added configuration registers:
> > > > > - host2guest_cmd: a register used by the host to send commands
> > > > > to the guest.
> > > > > - guest2host_cmd: written by the guest to ACK to the host about
> > > > > the commands that have been received. The host will clear the
> > > > > corresponding bits on the host2guest_cmd register. The guest
> > > > > also uses this register to send commands to the host (e.g. when finish
> free page reporting).
> > > > > - free_page_cmd_id: the sequence id of the free page report
> > > > > command given by the host.
> > > > >
> > > > > Signed-off-by: Wei Wang 
> > > > > Signed-off-by: Liang Li 
> > > > > Cc: Michael S. Tsirkin 
> > > > > Cc: Michal Hocko 
> > > > > ---
> > > > >
> > > > > +
> > > > > +static void report_free_page(struct work_struct *work) {
> > > > > +struct virtio_balloon *vb;
> > > > > +
> > > > > +vb = container_of(work, struct virtio_balloon,
> > > > > report_free_page_work);
> > > > > +report_free_page_cmd_id(vb);
> > > > > +walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > > > +/*
> > > > > + * The last few free page blocks that were added may not reach 
> > > > > the
> > > > > + * batch size, but need a kick to notify the device to
> > > > > handle them.
> > > > > + */
> > > > > +virtqueue_kick(vb->free_page_vq);
> > > > > +report_free_page_end(vb);
> > > > > +}
> > > > > +
> > > > I think there's an issue here: if pages are poisoned and
> > > > hypervisor subsequently drops them, testing them after allocation
> > > > will trigger a false positive.
> > > >
> > > > The specific configuration:
> > > >
> > > > PAGE_POISONING on
> > > > PAGE_POISONING_NO_SANITY off
> > > > PAGE_POISONING_ZERO off
> > > >
> > > >
> > > > Solutions:
> > > > 1. disable the feature in that configuration
> > > > suggested as an initial step
> > >
> > > Thanks for the finding.
> > > Similar to this option: could we make walk_free_mem_block() simply
> > > return if that option is on? That is, at the beginning of the function:
> > > if (page_poisoning_enabled())
> > > return;
> > >
> >
> >
> > Thought about it more; I think it would be better to put this logic into
> > virtio_balloon:
> >
> > send_free_page_cmd_id(vb, &vb->start_cmd_id);
> > if (!page_poisoning_enabled() ||
> > IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
> > walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > send_free_page_cmd_id(vb, &vb->stop_cmd_id);
> >
> >
> > walk_free_mem_block() should be a more generic API, and this potential
> > page poisoning issue is specific to live migration, which is only one
> > use case of this function, so I think it is better to handle it in the
> > specific use case itself.
> >
> > Best,
> > Wei
> >
> 
> It's a quick work-around but it doesn't make me very happy.
> 
> AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
> If this never uses free page hinting at all, it will be much less useful for
> debugging guests.
> 

I understand your concern. I think people who run debugging guests don't regard
performance as the first priority, and most vendors wouldn't use debugging
guests for their products.

How about taking this as the initial solution? We can explore more solutions
after this series is done.

Best,
Wei


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

2017-10-02 Thread Wang, Wei W
On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > + struct virtio_balloon_ctrlq_cmd *cmd,
> > + bool inbuf)
> > +{
> > +   struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +   ctrlq_add_cmd(vq, cmd, inbuf);
> > +   if (!inbuf) {
> > +   /*
> > +* All the input cmd buffers are replenished here.
> > +* This is necessary because the input cmd buffers are lost
> > +* after live migration. The device needs to rewind all of
> > +* them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for 
> this.


I think this would be something that the current live migration lacks - what the
device reads from the vq is not transferred during live migration. An example is
the stat_vq_elem:
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

Anything that is added to the vq and held by the device for later use needs to
account for the fact that live migration may happen at any time, and must be
re-taken from the vq by the device on the destination machine.

So, even without this live migration optimization feature, everything that is
added to the vq for the device to hold needs a way for the device to rewind it
back from the vq - re-adding all the elements to the vq is a trick to keep a
record of all of them on the vq, so that the device-side rewinding can work.

Please let me know if anything is missed or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq)
> > +{
> > +   struct virtio_balloon *vb = vq->vdev->priv;
> > +   struct virtio_balloon_ctrlq_cmd *msg;
> > +   unsigned int class, cmd, len;
> > +
> > +   msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +   if (unlikely(!msg))
> > +   return;
> > +
> > +   /* The outbuf is sent by the host for recycling, so just return. */
> > +   if (msg == &vb->free_page_cmd_out)
> > +   return;
> > +
> > +   class = virtio32_to_cpu(vb->vdev, msg->class);
> > +   cmd = virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +   switch (class) {
> > +   case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +   if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +   vb->report_free_page_stop = true;
> > +   } else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +   vb->report_free_page_stop = false;
> > +   queue_work(vb->balloon_wq,
> > +  &vb->report_free_page_work);
> > +   }
> > +   vb->free_page_cmd_in.class =
> > +   VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +   ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > +   break;
> > +   default:
> > +   dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > +__func__);
> > +   }
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start to report free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal state, so each invocation
will always start from the beginning.


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START   1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on
> the guest is never a great idea.
> 


I think the device can flush (pop out what's left in the vq and push them back)
the vq right after the Stop command is sent to the guest, rather than doing the
flush when the 2nd initiation of live migration begins. The entries pushed back
to the vq will be in the used ring, so what would the device need to wait for?


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that 
> works.
> 

Thanks for the suggestion. Probably I haven't fully understood it. Please see
the example below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) 

RE: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-02 Thread Wang, Wei W
On Monday, October 2, 2017 12:30 PM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> > +static int send_balloon_page_sg(struct virtio_balloon *vb,
> > +struct virtqueue *vq,
> > +void *addr,
> > +uint32_t size,
> > +bool batch)
> > +{
> > +   int err;
> > +
> > +   err = add_one_sg(vq, addr, size);
> > +
> > +   /* If batchng is requested, we batch till the vq is full */
> 
> typo
> 
> > +   if (!batch || !vq->num_free)
> > +   kick_and_wait(vq, vb->acked);
> > +
> > +   return err;
> > +}
> 
> If add_one_sg fails, kick_and_wait will hang forever.
> 
> The reason this might work in because
> 1. with 1 sg there are no memory allocations 2. if adding fails on vq full, 
> then
> something
>is in queue and will wake up kick_and_wait.
> 
> So in short this is expected to never fail.
> How about a BUG_ON here then?
> And make it void, and add a comment with above explanation.
> 


Yes, agree that this wouldn't fail - the worker thread performing the
ballooning operations is put to sleep when the vq is full, so there shouldn't
be anyone else putting more sgs onto the vq then.
Btw, not sure if we need to mention memory allocation in the comment; I found
that virtqueue_add() doesn't return an error when the allocation (for indirect
desc-s) fails - it simply avoids the use of indirect descriptors.

What do you think of the following? 

err = add_one_sg(vq, addr, size);
/*
 * This is expected to never fail: there is always at least 1 entry
 * available on the vq, because when the vq is full the worker thread
 * that adds the sg is put into sleep until at least 1 entry is
 * available to use.
 */
BUG_ON(err);

Best,
Wei



 



RE: [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()

2017-09-29 Thread Wang, Wei W
On Monday, September 11, 2017 9:27 PM, Matthew Wilcox wrote
> On Mon, Aug 28, 2017 at 06:08:30PM +0800, Wei Wang wrote:
> > +/**
> > + *  xb_zero - zero a range of bits in the xbitmap
> > + *  @xb: the xbitmap that the bits reside in
> > + *  @start: the start of the range, inclusive
> > + *  @end: the end of the range, inclusive  */ void xb_zero(struct xb
> > +*xb, unsigned long start, unsigned long end) {
> > +   unsigned long i;
> > +
> > +   for (i = start; i <= end; i++)
> > +   xb_clear_bit(xb, i);
> > +}
> > +EXPORT_SYMBOL(xb_zero);
> 
> Um.  This is not exactly going to be quick if you're clearing a range of bits.
> I think it needs to be more along the lines of this:
> 
> void xb_clear(struct xb *xb, unsigned long start, unsigned long end) {
> struct radix_tree_root *root = &xb->xbrt;
> struct radix_tree_node *node;
> void **slot;
> struct ida_bitmap *bitmap;
> 
> for (; start <= end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> unsigned long index = start / IDA_BITMAP_BITS;
> unsigned long bit = start % IDA_BITMAP_BITS;
> 
> bitmap = __radix_tree_lookup(root, index, &node, &slot);
> if (radix_tree_exception(bitmap)) {
> unsigned long ebit = bit + 2;
> unsigned long tmp = (unsigned long)bitmap;
> if (ebit >= BITS_PER_LONG)
> continue;
> tmp &= ... something ...;
> if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> __radix_tree_delete(root, node, slot);
> else
> rcu_assign_pointer(*slot, (void *)tmp);
> } else if (bitmap) {
> unsigned int nbits = end - start + 1;
> if (nbits + bit > IDA_BITMAP_BITS)
> nbits = IDA_BITMAP_BITS - bit;
> bitmap_clear(bitmap->bitmap, bit, nbits);
> if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
> kfree(bitmap);
> __radix_tree_delete(root, node, slot);
> }
> }
> }
> }
> 
> Also note that this should be called xb_clear(), not xb_zero() to fit in with
> bitmap_clear().  And this needs a thorough test suite testing all values for 
> 'start'
> and 'end' between 0 and at least 1024; probably much higher.  And a variable
> number of bits need to be set before calling
> xb_clear() in the test suite.
> 
> Also, this implementation above is missing a few tricks.  For example, if 
> 'bit' is 0
> and 'nbits' == IDA_BITMAP_BITS, we can simply call kfree without first zeroing
> out the bits and then checking if the whole thing is zero.

Thanks for the optimization suggestions. We've seen significant improvement of
the ballooning time. Some other optimizations (stated in the changelog) haven't
been included in the new version. If possible, we can leave that to a second 
step
optimization outside this patch series.

Best,
Wei


RE: [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

2017-09-05 Thread Wang, Wei W
Ping for comments if possible. Thanks.

On Monday, August 28, 2017 6:09 PM, Wang, Wei W wrote:
> [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data plane
> separated. In other words, the control related data of each feature will be 
> sent
> via the ctrl_vq cmds, meanwhile each feature may have its own data plane vq.
> 
> Free page report is the the first new feature controlled via ctrl_vq, and a 
> new
> cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest to
> start the free page reporting work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is used bidirectionally. The
> guest would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop of the
> reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the host.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Signed-off-by: Liang Li <liang.z...@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c | 247 +-
> --
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 242 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8ecc1d4..1d384a4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
> 
>  struct virtio_balloon {
>   struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +  *free_page_vq;
> +
> + /* Balloon's own wq for cpu-intensive work items */
> + struct workqueue_struct *balloon_wq;
> + /* The work items submitted to the balloon wq are listed here */
> + struct work_struct report_free_page_work;
> 
>   /* The balloon servicing is delegated to a freezable workqueue. */
>   struct work_struct update_balloon_stats_work;
> @@ -65,6 +71,9 @@ struct virtio_balloon {
>   spinlock_t stop_update_lock;
>   bool stop_update;
> 
> + /* Stop reporting free pages */
> + bool report_free_page_stop;
> +
>   /* Waiting for host to ack the pages we released. */
>   wait_queue_head_t acked;
> 
> @@ -93,6 +102,11 @@ struct virtio_balloon {
> 
>   /* To register callback in oom notifier call chain */
>   struct notifier_block nb;
> +
> + /* Host to guest ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> + /* Guest to Host ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
> 
>  static struct virtio_device_id id_table[] = {
> @@ -177,6 +191,26 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>   }
>  }
> 
> +static void send_free_page_sg(struct virtqueue *vq, void *addr,
> +uint32_t size)
> +{
> + unsigned int len;
> + int err = -ENOSPC;
> +
> + do {
> + if (vq->num_free) {
> + err = add_one_sg(vq, addr, size);
> + /* Sanity check: this can't really happen */
> + WARN_ON(err);
> + if (!err)
> + virtqueue_kick(vq);
> + }
> +
> + /* Release entries if there are */
> + while (virtqueue_get_buf(vq, &len))
> + ;
> + } while (err == -ENOSPC && vq->num_free);
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -525,42 +559,206 @@ static void update_balloon_size_func(struct work_struct *work)
>   queue_work(system_freezable_wq, work);
>  }
> 
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +unsigned long nr_pages)
> +{
> + struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> + void *addr = (void *)pfn_to_kaddr(pfn);
> + uint32_t len = nr_pages << PAGE_SHIFT;
> +
> + if (vb->report_free_page_stop)
> + return 1;
> +
> + send_free_page_sg(vb->free_page_vq, addr, len);
> +
> + return 0;
> +}
> +
> +static voi

RE: [PATCH] MAINTAINERS: copy virtio on balloon_compaction.c

2017-08-03 Thread Wang, Wei W
On Thursday, August 3, 2017 9:36 PM, Rafael Aquini wrote:
> On Thu, Aug 03, 2017 at 03:42:52PM +0300, Michael S. Tsirkin wrote:
> > Changes to mm/balloon_compaction.c can easily break virtio, and virtio
> > is the only user of that interface.  Add a line to MAINTAINERS so
> > whoever changes that file remembers to copy us.
> >
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  MAINTAINERS | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS index f66488d..6b1d60e 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -13996,6 +13996,7 @@ F:  drivers/block/virtio_blk.c
> >  F: include/linux/virtio*.h
> >  F: include/uapi/linux/virtio_*.h
> >  F: drivers/crypto/virtio/
> > +F: mm/balloon_compaction.c
> >
> >  VIRTIO CRYPTO DRIVER
> >  M: Gonglei 
> > --
> > MST
> 
> Acked-by: Rafael Aquini 

Acked-by: Wei Wang 


RE: [PATCH v13 4/5] mm: support reporting free page blocks

2017-08-03 Thread Wang, Wei W
On Thursday, August 3, 2017 9:51 PM, Michal Hocko: 
> As I've said earlier. Start simple optimize incrementally with some numbers to
> justify a more subtle code.
> --

OK. Let's start with the simple implementation as you suggested.

Best,
Wei


RE: [PATCH v13 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-08-03 Thread Wang, Wei W
On Thursday, August 3, 2017 10:23 PM, Michael S. Tsirkin wrote:
> On Thu, Aug 03, 2017 at 02:38:17PM +0800, Wei Wang wrote:
> > +static void send_one_sg(struct virtio_balloon *vb, struct virtqueue *vq,
> > +   void *addr, uint32_t size)
> > +{
> > +   struct scatterlist sg;
> > +   unsigned int len;
> > +
> > +   sg_init_one(&sg, addr, size);
> > +   while (unlikely(virtqueue_add_inbuf(vq, &sg, 1, vb, GFP_KERNEL)
> > +   == -ENOSPC)) {
> > +   /*
> > +* It is uncommon to see the vq is full, because the sg is sent
> > +* one by one and the device is able to handle it in time. But
> > +* if that happens, we kick and wait for an entry is released.
> 
> is released -> to get used.
> 
> > +*/
> > +   virtqueue_kick(vq);
> > +   while (!virtqueue_get_buf(vq, &len) &&
> > +  !virtqueue_is_broken(vq))
> > +   cpu_relax();
> 
> Please rework to use wait_event in that case too.

For the balloon page case here, it is fine to use wait_event. But for the free
page case, I think it might not be suitable because the mm lock is being held.

Best,
Wei


RE: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-29 Thread Wang, Wei W
On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > add each s/g as a separate input buffer.
> > > > > > Should it be an output buffer?
> > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > writeable by device: DMA_FROM_DEVICE.
> > > > Why would the hypervisor need to zero the buffer?
> > > The page is supplied to hypervisor and can lose the value that is
> > > there.  That is the definition of writeable by device.
> >
> > I think for the free pages, it should be clear that they will be added
> > as output buffer to the device, because (as we discussed) they are
> > just hints, and some of them may be used by the guest after the report_
> > API is invoked.
> > The device/hypervisor should not use or discard them.
> 
> Discarding contents is exactly what you propose doing if migration is going
> on, isn't it?

That's actually a different concept. Please let me explain it with this example:

The hypervisor receives the hint saying the guest PageX is a free page, but as
we know, after that report_ API exits, the guest kernel may take PageX to use,
so PageX is not a free page any more. At that point, if the hypervisor writes
to the page, that would crash the guest.
So, I think the cornerstone of this work is that the hypervisor should not
touch the reported pages.

Best,
Wei


RE: [PATCH v12 6/8] mm: support reporting free page blocks

2017-07-26 Thread Wang, Wei W
On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> > unsigned long pfn,
> >unsigned long nr_pages))
> > {
> > ...
> > for_each_populated_zone(zone) {
> >for_each_migratetype_order(order, type) {
> > report_unused_page_block(zone, order, type,
> > &page); // from patch 6
> > pfn = page_to_pfn(page);
> > visit(opaque1, pfn, 1 << order);
> > }
> > }
> > }
> >
> > The above function scans all the free list and directly sends each
> > free page block to the hypervisor via the virtio_balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> > void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long
> > pfn, unsigned long nr_pages) {
> > struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > ...put the free page block to the ring of vb; }
> >
> >
> > What do you think?
> 
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order to check semantic though. Why? Well,
> it doesn't make much sense to scan low order free blocks all the time
> because they are simply too volatile. Larger blocks tend to surivive for
> longer. So I assume you would only care about larger free blocks. This
> will also make the call cheaper.
> --

OK, I will keep min order there in the next version.

Best,
Wei


RE: [PATCH v12 6/8] mm: support reporting free page blocks

2017-07-25 Thread Wang, Wei W
On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > >>>[...
> > We don't need to do the pfn walk in the guest kernel. When the API
> > reports, for example, a 2MB free page block, the API caller offers to
> > the hypervisor the base address of the page block, and size=2MB, to
> > the hypervisor.
> 
> So you want to skip pfn walks by regularly calling into the page allocator to
> update your bitmap. If that is the case then would an API that would allow you
> to update your bitmap via a callback be s sufficient? Something like
>   void walk_free_mem(int node, int min_order,
>   void (*visit)(unsigned long pfn, unsigned long nr_pages))
> 
> The function will call the given callback for each free memory block on the 
> given
> node starting from the given min_order. The callback will be strictly an 
> atomic
> and very light context. You can update your bitmap from there.

I would need to introduce more about the background here:
The hypervisor and the guest live in their own address spaces. The hypervisor's
bitmap isn't seen by the guest. I think we also wouldn't be able to pass a
callback function from the hypervisor to the guest in this case.

> 
> This would address my main concern that the allocator internals would get
> outside of the allocator proper. 

What issue would exposing the internal for_each_zone() cause?
I think new code that calls it will also be strictly reviewed when it is
pushed upstream.

Best,
Wei





RE: [PATCH v11 0/6] Virtio-balloon Enhancement

2017-06-09 Thread Wang, Wei W
On Friday, June 9, 2017 6:42 PM, Wang, Wei W wrote:
> To: virtio-...@lists.oasis-open.org; linux-ker...@vger.kernel.org; qemu-
> de...@nongnu.org; virtualization@lists.linux-foundation.org;
> k...@vger.kernel.org; linux...@kvack.org; m...@redhat.com;
> da...@redhat.com; Hansen, Dave <dave.han...@intel.com>;
> cornelia.h...@de.ibm.com; a...@linux-foundation.org;
> mgor...@techsingularity.net; aarca...@redhat.com; amit.s...@redhat.com;
> pbonz...@redhat.com; Wang, Wei W <wei.w.w...@intel.com>;
> liliang.opensou...@gmail.com
> Subject: [PATCH v11 0/6] Virtio-balloon Enhancement
> 
> This patch series enhances the existing virtio-balloon with the following new
> features:
> 1) fast ballooning: transfer ballooned pages between the guest and host in
> chunks, instead of one by one; and
> 2) cmdq: a new virtqueue to send commands between the device and driver.
> Currently, it supports commands to report memory stats (replace the old statq
> mechanism) and report guest unused pages.

v10->v11 changes:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

> 
> Liang Li (1):
>   virtio-balloon: deflate via a page list
> 
> Wei Wang (5):
>   virtio-balloon: coding format cleanup
>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
>   mm: function to offer a page block on the free list
>   mm: export symbol of next_zone and first_online_pgdat
>   virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
> 
>  drivers/virtio/virtio_balloon.c | 781 --
> --
>  drivers/virtio/virtio_ring.c| 120 +-
>  include/linux/mm.h  |   5 +
>  include/linux/virtio.h  |   7 +
>  include/uapi/linux/virtio_balloon.h |  14 +
>  include/uapi/linux/virtio_ring.h|   3 +
>  mm/mmzone.c |   2 +
>  mm/page_alloc.c |  91 +
>  8 files changed, 950 insertions(+), 73 deletions(-)
> 
> --
> 2.7.4



RE: [PATCH v10 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS

2017-05-06 Thread Wang, Wei W
On 05/06/2017 06:29 AM, Michael S. Tsirkin wrote:
> On Thu, May 04, 2017 at 04:50:12PM +0800, Wei Wang wrote:
> > Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables the
> > transfer of the ballooned (i.e. inflated/deflated) pages in chunks to
> > the host.
> >
> > The implementation of the previous virtio-balloon is not very
> > efficient, because the ballooned pages are transferred to the host one
> > by one. Here is the breakdown of the time in percentage spent on each
> > step of the balloon inflating process (inflating 7GB of an 8GB idle
> > guest).
> >
> > 1) allocating pages (6.5%)
> > 2) sending PFNs to host (68.3%)
> > 3) address translation (6.1%)
> > 4) madvise (19%)
> >
> > It takes about 4126ms for the inflating process to complete.
> > The above profiling shows that the bottlenecks are stage 2) and stage
> > 4).
> >
> > This patch optimizes step 2) by transferring pages to the host in
> > chunks. A chunk consists of guest physically continuous pages.
> > When the pages are packed into a chunk, they are converted into
> > balloon page size (4KB) pages. A chunk is offered to the host via a
> > base PFN (i.e. the start PFN of those physically continuous
> > pages) and the size (i.e. the total number of the 4KB balloon size
> > pages). A chunk is formatted as below:
> > > --------------------------------
> > > | Base (52 bit) | Rsvd (12 bit) |
> > > --------------------------------
> > > --------------------------------
> > > | Size (52 bit) | Rsvd (12 bit) |
> > > --------------------------------
> >
> > By doing so, step 4) can also be optimized by doing address
> > translation and madvise() in chunks rather than page by page.
> >
> > With this new feature, the above ballooning process takes ~590ms
> > resulting in an improvement of ~85%.
> >
> > TODO: optimize stage 1) by allocating/freeing a chunk of pages instead
> > of a single page each time.
> >
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Liang Li 
> > Suggested-by: Michael S. Tsirkin 
> 
> 
> This is much cleaner, thanks. It might be even better to have wrappers that
> put the array and its size in a struct and manage that struct, but I won't
> require this for submission.

OK, thanks. Would this be your suggestion:

struct virtio_balloon_page_struct { 
unsigned int page_bmap_num;
unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
}

Best,
Wei


RE: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ

2017-05-06 Thread Wang, Wei W
On 05/06/2017 06:21 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> > On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > > Add a new vq, miscq, to handle miscellaneous requests between the
> > > > device and the driver.
> > > >
> > > > This patch implemnts the
> VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > > implements
> > >
> > > > request sent from the device.
> > > Commands are sent from host and handled on guest.
> > > In fact how is this so different from stats?
> > > How about reusing the stats vq then? You can use one buffer for
> > > stats and one buffer for commands.
> > >
> >
> > The meaning of the two vqs is a little different. statq is used for
> > reporting statistics, while miscq is intended to be used to handle
> > miscellaneous requests from the guest or host
> 
> misc just means "anything goes". If you want it to mean "commands" name it so.

Ok, will change it.

> > (I think it can
> > also be used the other way around in the future when other new
> > features are added which need the guest to send requests and the host
> > to provide responses).
> >
> > I would prefer to have them separate, because:
> > If we plan to combine them, we need to put the previous statq related
> > implementation under miscq with a new command (I think we can't
> > combine them without using commands to distinguish the two features).
> 
> Right.

> > In this way, an old driver won't work with a new QEMU or a new driver
> > won't work with an old QEMU. Would this be considered as an issue
> > here?
> 
> Compatibility is and should always be handled using feature flags.  There's a
> feature flag for this, isn't it?

The negotiation of the existing feature flag, VIRTIO_BALLOON_F_STATS_VQ
only indicates the support of the old statq implementation. To move the statq
implementation under cmdq, I think we would need a new feature flag for the
new statq implementation:
#define VIRTIO_BALLOON_F_CMDQ_STATS  5

What do you think?

Best,
Wei






RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS

2017-05-06 Thread Wang, Wei W
On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
> > On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 26, 2017 at 11:03:34AM +, Wang, Wei W wrote:
> > > > Hi Michael, could you please give some feedback?
> > > I'm sorry, I'm not sure feedback on what you are requesting.
> > Oh, just some trivial things (e.g. use a field in the header,
> > hdr->chunks to indicate the number of chunks in the payload) that
> > wasn't confirmed.
> >
> > I will prepare the new version with fixing the agreed issues, and we
> > can continue to discuss those parts if you still find them improper.
> >
> >
> > >
> > > The interface looks reasonable now, even though there's a way to
> > > make it even simpler if we can limit chunk size to 2G (in fact 4G -
> > > 1). Do you think we can live with this limitation?
> > Yes, I think we can. So, is it good to change to use the previous
> > 64-bit chunk format (52-bit base + 12-bit size)?
> 
> This isn't what I meant. The virtio ring has descriptors with a 64-bit
> address and a 32-bit size.
> 
> If size < 4g is not a significant limitation, why not just use that to pass
> address/size in a standard s/g list, possibly using INDIRECT?

OK, I see your point, thanks. Posting the two options here for analysis:
Option1 (what we have now):
struct virtio_balloon_page_chunk {
__le64 chunk_num;
struct virtio_balloon_page_chunk_entry entry[];
};
Option2:
struct virtio_balloon_page_chunk {
__le64 chunk_num;
struct scatterlist entry[];
};

I don't object to changing it to Option2, but I would prefer Option1,
because I think there is no obvious difference between the two options,
while Option1 appears to have a few advantages here:
1) "struct virtio_balloon_page_chunk_entry" is smaller than
"struct scatterlist", so an allocated page chunk buffer of the same size
can hold more entry[] elements using Option1;
2) INDIRECT needs on-demand kmalloc();
3) no 4G size limit.

What do you think?

Best,
Wei





RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS

2017-04-26 Thread Wang, Wei W
Hi Michael, could you please give some feedback?

On Monday, April 17, 2017 11:35 AM, Wei Wang wrote:
> On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> > On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> >> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> >>>
> >>> So we don't need the bitmap to talk to host, it is just a data
> >>> structure we chose to maintain lists of pages, right?
> >> Right. bitmap is the way to gather pages to chunk.
> >> It's only needed in the balloon page case.
> >> For the unused page case, we don't need it, since the free page
> >> blocks are already chunks.
> >>
> >>> OK as far as it goes but you need much better isolation for it.
> >>> Build a data structure with APIs such as _init, _cleanup, _add,
> >>> _clear, _find_first, _find_next.
> >>> Completely unrelated to pages, it just maintains bits.
> >>> Then use it here.
> >>>
> >>>
> static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> static struct vfsmount *balloon_mnt;
> #endif
>  +/* Types of pages to chunk */
>  +#define PAGE_CHUNK_TYPE_BALLOON 0
>  +
> >>> Doesn't look like you are ever adding more types in this patchset.
> >>> Pls keep code simple, generalize it later.
> >>>
> >> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> > I would say add the extra code there too. Or maybe we can avoid adding
> > it altogether.
> 
> I'm trying to have the two features (i.e. "balloon pages" and "unused pages")
> decoupled while trying to use common functions to deal with the commonalities.
> That's the reason to define the above macro.
> Without the macro, we will need to have separate functions, for example,
> instead of one "add_one_chunk()", we need to have
> add_one_balloon_page_chunk() and add_one_unused_page_chunk(), and some
> of the implementation will be duplicated between the two functions.
> Probably we can add it when the second feature comes to the code.
> 
> >
> >> Types of page to chunk are treated differently. Different types of
> >> page chunks are sent to the host via different protocols.
> >>
> >> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> >> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> >>
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq
> >> msg
> >> format:
> >> miscq_hdr +
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> The chunk msg is actually the payload of the miscq msg.
> >>
> >>
> > So just combine the two message formats and then it'll all be easier?
> >
> 
> Yes, it'll be simpler with only one msg format. But the problem I see here
> is that the miscq hdr is something necessary for the "unused page"
> usage, but not needed by the "balloon page" usage. To be more precise:
> struct virtio_balloon_miscq_hdr {
>   __le16 cmd;
>   __le16 flags;
> };
> 'cmd' specifies the command from the miscq (I envision that the miscq will be
> further used to handle other possible miscellaneous requests, either from the
> host or to the host), so 'cmd' is necessary for the miscq. But the inflateq is
> exclusively used for inflating pages, so adding a command to it would be
> redundant and look a little out of place there.
> 'flags': we currently use bit 0 of flags to indicate the completion of a
> command; this is also useful in the "unused page" usage, and not needed by
> the "balloon page" usage.
>  +#define MAX_PAGE_CHUNKS 4096
> >>> This is an order-4 allocation. I'd make it 4095 and then it's an
> >>> order-3 one.
> >> Sounds good, thanks.
> >> I think it would be better to make it 4090. Leave some space for the
> >> hdr as well.
> > And miscq hdr. In fact just let compiler do the math - something like:
> > (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
> Agree, thanks.
> 
> >
> > I skimmed explanation of algorithms below but please make sure code
> > speaks for itself and add comments inline to document it.
> > Whenever you answered me inline this is where you want to try to make
> > code clearer and add comments.
> >
> > Also, pls find ways to abstract the data structure so we don't need to
> > deal with its internals all over the code.
> >
> >
> > 
> >
> {
>   struct scatterlist sg;
>  +struct virtio_balloon_page_chunk_hdr *hdr;
>  +void *buf;
>   unsigned int len;
>  -sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  +switch (type) {
>  +case PAGE_CHUNK_TYPE_BALLOON:
>  +hdr = vb->balloon_page_chunk_hdr;
>  +len = 0;
> 

RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-04-05 Thread Wang, Wei W
On Wednesday, April 5, 2017 12:31 PM, Wei Wang wrote:
> On Wednesday, April 5, 2017 11:54 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> > > On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > > > The implementation of the current virtio-balloon is not very
> > > > efficient, because the ballooned pages are transferred to the host
> > > > one by one. Here is the breakdown of the time in percentage spent
> > > > on each step of the balloon inflating process (inflating 7GB of an 8GB 
> > > > idle
> guest).
> > > >
> > > > 1) allocating pages (6.5%)
> > > > 2) sending PFNs to host (68.3%)
> > > > 3) address translation (6.1%)
> > > > 4) madvise (19%)
> > > >
> > > > It takes about 4126ms for the inflating process to complete.
> > > > The above profiling shows that the bottlenecks are stage 2) and stage 
> > > > 4).
> > > >
> > > > This patch optimizes step 2) by transferring pages to the host in
> > > > chunks. A chunk consists of guest physically continuous pages, and
> > > > it is offered to the host via a base PFN (i.e. the start PFN of
> > > > those physically continuous pages) and the size (i.e. the total
> > > > number of the
> > > > pages). A chunk is formatted as below:
> > > >
> > > > ---------------------------------------
> > > > | Base (52 bit)        | Rsvd (12 bit) |
> > > > ---------------------------------------
> > > > | Size (52 bit)        | Rsvd (12 bit) |
> > > > ---------------------------------------
> > > >
> > > > By doing so, step 4) can also be optimized by doing address
> > > > translation and
> > > > madvise() in chunks rather than page by page.
> > > >
> > > > This optimization requires the negotiation of a new feature bit,
> > > > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > > >
> > > > With this new feature, the above ballooning process takes ~590ms
> > > > resulting in an improvement of ~85%.
> > > >
> > > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > > instead of a single page each time.
> > > >
> > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> > > > Suggested-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c | 371
> > +-
> > > > --
> > > >  include/uapi/linux/virtio_balloon.h |   9 +
> > > >  2 files changed, 353 insertions(+), 27 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > b/drivers/virtio/virtio_balloon.c index
> > > > f59cb4f..3f4a161 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -42,6 +42,10 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +#define PAGE_BMAP_SIZE (8 * PAGE_SIZE)
> > > > +#define PFNS_PER_PAGE_BMAP (PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > > +#define PAGE_BMAP_COUNT_MAX 32
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > > @@ -50,6 +54,14 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > > static struct vfsmount *balloon_mnt;
> > > > #endif
> > > >
> > > > +#define BALLOON_CHUNK_BASE_SHIFT 12
> > > > +#define BALLOON_CHUNK_SIZE_SHIFT 12
> > > > +struct balloon_page_chunk {
> > > > +   __le64 base;
> > > > +   __le64 size;
> > > > +};
> > > > +
> > > > +typedef __le64 resp_data_t;
> > > >  struct virtio_balloon {
> > > > struct virtio_device *vdev;
> > > > struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > > @@ -67,6 +79,31 @@ struct virtio_balloon {
> > > >
> > > > /* Number of balloon pages we've told the 

RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-04-04 Thread Wang, Wei W
On Wednesday, April 5, 2017 11:54 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> > On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient, because the ballooned pages are transferred to the host
> > > one by one. Here is the breakdown of the time in percentage spent on
> > > each step of the balloon inflating process (inflating 7GB of an 8GB idle 
> > > guest).
> > >
> > > 1) allocating pages (6.5%)
> > > 2) sending PFNs to host (68.3%)
> > > 3) address translation (6.1%)
> > > 4) madvise (19%)
> > >
> > > It takes about 4126ms for the inflating process to complete.
> > > The above profiling shows that the bottlenecks are stage 2) and stage 4).
> > >
> > > This patch optimizes step 2) by transferring pages to the host in
> > > chunks. A chunk consists of guest physically continuous pages, and
> > > it is offered to the host via a base PFN (i.e. the start PFN of
> > > those physically continuous pages) and the size (i.e. the total number of 
> > > the
> > > pages). A chunk is formatted as below:
> > >
> > > ---------------------------------------
> > > | Base (52 bit)        | Rsvd (12 bit) |
> > > ---------------------------------------
> > > | Size (52 bit)        | Rsvd (12 bit) |
> > > ---------------------------------------
> > >
> > > By doing so, step 4) can also be optimized by doing address
> > > translation and
> > > madvise() in chunks rather than page by page.
> > >
> > > This optimization requires the negotiation of a new feature bit,
> > > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > >
> > > With this new feature, the above ballooning process takes ~590ms
> > > resulting in an improvement of ~85%.
> > >
> > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > instead of a single page each time.
> > >
> > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> > > Suggested-by: Michael S. Tsirkin <m...@redhat.com>
> > > ---
> > >  drivers/virtio/virtio_balloon.c | 371
> +-
> > > --
> > >  include/uapi/linux/virtio_balloon.h |   9 +
> > >  2 files changed, 353 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index
> > > f59cb4f..3f4a161 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -42,6 +42,10 @@
> > >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> > >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > >
> > > +#define PAGE_BMAP_SIZE   (8 * PAGE_SIZE)
> > > +#define PFNS_PER_PAGE_BMAP   (PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > +#define PAGE_BMAP_COUNT_MAX  32
> > > +
> > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > @@ -50,6 +54,14 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > static struct vfsmount *balloon_mnt;
> > > #endif
> > >
> > > +#define BALLOON_CHUNK_BASE_SHIFT 12
> > > +#define BALLOON_CHUNK_SIZE_SHIFT 12
> > > +struct balloon_page_chunk {
> > > + __le64 base;
> > > + __le64 size;
> > > +};
> > > +
> > > +typedef __le64 resp_data_t;
> > >  struct virtio_balloon {
> > >   struct virtio_device *vdev;
> > >   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > @@ -67,6 +79,31 @@ struct virtio_balloon {
> > >
> > >   /* Number of balloon pages we've told the Host we're not using. */
> > >   unsigned int num_pages;
> > > + /* Pointer to the response header. */
> > > + struct virtio_balloon_resp_hdr *resp_hdr;
> > > + /* Pointer to the start address of response data. */
> > > + resp_data_t *resp_data;
> >
> > I think the implementation has an issue here - both the balloon pages and
> > the unused pages use the same buffer ("resp_data" above) to store chunks.
> > It would cause a race in this case: live migration starts while ballooning
> > is also in progress. I plan to use separate buffers for
> > CHUNKS_OF_BALLOON_PAGES and CHUNKS_OF_UNUSED_PAGES. Please let me know if
> > you have a different suggestion. Thanks.
> >
> > Best,
> > Wei
> 
> Is only one resp data ever in flight for each kind?
> If not you want as many buffers as vq allows.
> 

No, all the kinds currently use only one resp_data. I will make it one
resp_data for each kind.

Best,
Wei




RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-04-04 Thread Wang, Wei W
On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> The implementation of the current virtio-balloon is not very efficient,
> because the ballooned pages are transferred to the host one by one. Here is
> the breakdown of the time in percentage spent on each step of the balloon
> inflating process (inflating 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2) and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in chunks.
> A chunk consists of guest physically continuous pages, and it is offered to
> the host via a base PFN (i.e. the start PFN of those physically continuous
> pages) and the size (i.e. the total number of the pages). A chunk is
> formatted as below:
> 
> ---------------------------------------
> | Base (52 bit)        | Rsvd (12 bit) |
> ---------------------------------------
> | Size (52 bit)        | Rsvd (12 bit) |
> ---------------------------------------
> 
> By doing so, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.
> 
> This optimization requires the negotiation of a new feature bit,
> VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> 
> With this new feature, the above ballooning process takes ~590ms resulting in
> an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of a
> single page each time.
> 
> Signed-off-by: Liang Li 
> Signed-off-by: Wei Wang 
> Suggested-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/virtio_balloon.c | 371 +-
> --
>  include/uapi/linux/virtio_balloon.h |   9 +
>  2 files changed, 353 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c 
> b/drivers/virtio/virtio_balloon.c index
> f59cb4f..3f4a161 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> 
> +#define PAGE_BMAP_SIZE   (8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP   (PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX  32
> +
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,14 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> static struct vfsmount *balloon_mnt;
> #endif
> 
> +#define BALLOON_CHUNK_BASE_SHIFT 12
> +#define BALLOON_CHUNK_SIZE_SHIFT 12
> +struct balloon_page_chunk {
> + __le64 base;
> + __le64 size;
> +};
> +
> +typedef __le64 resp_data_t;
>  struct virtio_balloon {
>   struct virtio_device *vdev;
>   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -67,6 +79,31 @@ struct virtio_balloon {
> 
>   /* Number of balloon pages we've told the Host we're not using. */
>   unsigned int num_pages;
> + /* Pointer to the response header. */
> + struct virtio_balloon_resp_hdr *resp_hdr;
> + /* Pointer to the start address of response data. */
> + resp_data_t *resp_data;

I think the implementation has an issue here - both the balloon pages and the 
unused pages use the same buffer ("resp_data" above) to store chunks. It would 
cause a race in this case: live migration starts while ballooning is also in 
progress. I plan to use separate buffers for CHUNKS_OF_BALLOON_PAGES and 
CHUNKS_OF_UNUSED_PAGES. Please let me know if you have a different suggestion. 
Thanks.

Best,
Wei


RE: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages

2017-03-22 Thread Wang, Wei W
Hi Andrew, 

Do you have any comments on my thoughts? Thanks.

> On 03/17/2017 05:28 AM, Andrew Morton wrote:
> > On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang 
> wrote:
> >
> >> From: Liang Li 
> >>
> >> This patch adds a function that provides a snapshot of the present
> >> system's unused pages. An important usage of this function is to
> >> provide the unused pages to the live migration thread, which skips
> >> the transfer of those unused pages. Newly used pages can be
> >> re-tracked by the dirty page logging mechanisms.
> > I don't think this will be useful for anything other than
> > virtio-balloon.  I guess it would be better to keep this code in the
> > virtio-balloon driver if possible, even though that's rather a
> > layering violation :( What would have to be done to make that
> > possible?  Perhaps we can put some *small* helpers into page_alloc.c
> > to prevent things from becoming too ugly.
> 
> The patch description was too narrow and may have caused some confusion,
> sorry about that. This function is aimed to be generic. I agree with the
> description suggested by Michael.
> 
> Since the main body of the function is related to operating on the
> free_list, I think it is better to have it located here.
> Small helpers may be less efficient and thereby cause some performance loss
> as well.
> I think one improvement we can make is to remove the "chunk format"
> related things from this function. The function can generally offer the
> base pfn to the caller's recording buffer. Then it will be the caller's
> responsibility to format the pfn if they need to.
> 
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
> >>show_swap_cache_info();
> >>   }
> >>
> >> +static int __record_unused_pages(struct zone *zone, int order,
> >> +   __le64 *buf, unsigned int size,
> >> +   unsigned int *offset, bool part_fill) {
> >> +  unsigned long pfn, flags;
> >> +  int t, ret = 0;
> >> +  struct list_head *curr;
> >> +  __le64 *chunk;
> >> +
> >> +  if (zone_is_empty(zone))
> >> +  return 0;
> >> +
> >> +  spin_lock_irqsave(&zone->lock, flags);
> >> +
> >> +  if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> >> +  ret = -ENOSPC;
> >> +  goto out;
> >> +  }
> >> +  for (t = 0; t < MIGRATE_TYPES; t++) {
> >> +  list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >> +  pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> +  chunk = buf + *offset;
> >> +  if (*offset + 2 > size) {
> >> +  ret = -ENOSPC;
> >> +  goto out;
> >> +  }
> >> +  /* Align to the chunk format used in virtio-balloon */
> >> +  *chunk = cpu_to_le64(pfn << 12);
> >> +  *(chunk + 1) = cpu_to_le64((1 << order) << 12);
> >> +  *offset += 2;
> >> +  }
> >> +  }
> >> +
> >> +out:
> >> +  spin_unlock_irqrestore(&zone->lock, flags);
> >> +
> >> +  return ret;
> >> +}
> > This looks like it could disable interrupts for a long time.  Too long?
> 
> What do you think if we give "budgets" to the above function?
> For example, budget=1000, and there are 2000 nodes on the list.
> record() returns with "incomplete" status in the first round, along with the 
> status
> info, "*continue_node".
> 
> *continue_node: pointer to the starting node of the leftover. If 
> *continue_node
> has been used at the time of the second call (i.e. continue_node->next == 
> NULL),
> which implies that the previous 1000 nodes have been used, then the record()
> function can simply start from the head of the list.
> 
> It is up to the caller whether it needs to continue the second round when 
> getting
> "incomplete".
> 
> >
> >> +/*
> >> + * The record_unused_pages() function is used to record the system's
> >> + * unused pages. The unused pages can be skipped in the transfer during
> >> + * live migration. Though the unused pages are dynamically changing,
> >> + * dirty page logging mechanisms are able to capture newly used pages
> >> + * even though they were recorded as unused pages via this function.
> >> + *
> >> + * This function scans the free page list of the specified order to
> >> + * record the unused pages, and chunks those continuous pages following
> >> + * the chunk format below:
> >> + * --------------------------------------
> >> + * |  Base (52-bit)       | Rsvd (12-bit) |
> >> + * --------------------------------------
> >> + * |  Size (52-bit)       | Rsvd (12-bit) |
> >> + * --------------------------------------
> >> + *
> >> + * @start_zone: zone to start the record operation.
> >> + * @order: order of the free page list to record.
> >> + * @buf: buffer to record the 

RE: [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-13 Thread Wang, Wei W
On Sunday, March 12, 2017 12:04 PM, Michael S. Tsirkin wrote:
> On Sun, Mar 12, 2017 at 01:59:54AM +0000, Wang, Wei W wrote:
> > On 03/11/2017 10:10 PM, Matthew Wilcox wrote:
> > > On Sat, Mar 11, 2017 at 07:59:31PM +0800, Wei Wang wrote:
> > > > I'm thinking: what if the guest needs to transfer this much
> > > > physically continuous memory to the host: 1GB+2MB+64KB+32KB+16KB+4KB?
> > > > Is it going to use Six 64-bit chunks? Would it be simpler if we
> > > > just use the 128-bit chunk format (we can drop the previous normal
> > > > 64-bit format)?
> > >
> > > Is that a likely thing for the guest to need to do though?  Freeing
> > > a 1GB page is much more likely, IMO.
> >
> > Yes, I think it's very possible. The host can ask for any number of pages
> > (e.g. 1.5GB) that the guest can afford.  Also, the ballooned 1.5G memory
> > is not guaranteed to be continuous in any pattern like 1GB+512MB. That's
> > why we need to use a bitmap to draw the whole picture first, and then
> > seek continuous bits to chunk.
> >
> > Best,
> > Wei
> 
> While I like the clever format that Matthew came up with, I'm also inclined to
> say let's keep things simple.
> The simplest thing seems to be to use the ext format all the time.
> Except let's reserve the low 12 bits in both address and size; since they are
> already 0, we might be able to use them for flags down the road.

Thanks for reminding us about the hugepage story. I'll use the ext format in 
the implementation if no further objections from others.

Best,
Wei


RE: [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-11 Thread Wang, Wei W
On 03/11/2017 10:10 PM, Matthew Wilcox wrote:
> On Sat, Mar 11, 2017 at 07:59:31PM +0800, Wei Wang wrote:
> > I'm thinking: what if the guest needs to transfer this much physically
> > continuous memory to the host: 1GB+2MB+64KB+32KB+16KB+4KB?
> > Is it going to use Six 64-bit chunks? Would it be simpler if we just
> > use the 128-bit chunk format (we can drop the previous normal 64-bit
> > format)?
> 
> Is that a likely thing for the guest to need to do though?  Freeing a 1GB
> page is much more likely, IMO.

Yes, I think it's very possible. The host can ask for any number of pages (e.g.
1.5GB) that the guest can afford.  Also, the ballooned 1.5G memory is not
guaranteed to be continuous in any pattern like 1GB+512MB. That's why we need
to use a bitmap to draw the whole picture first, and then seek continuous
bits to chunk.

Best,
Wei


A question about vring operation

2015-08-27 Thread Wang, Wei W
Hi all,

I have a question about the vring_avail:
It only includes an idx (equivalent to the ring tail), which is used by the
frontend (virtio_net) to fill bufs. The backend (e.g. vhost_net) maintains the
ring head (last_avail_idx) by itself. The frontend checks whether the ring is
full or empty via a counter (vq->num_free).
My question is: why can't we include the ring head in the vring_avail struct,
so that vq->num_free is not needed and the backend can directly use it
without maintaining its own copy?

Thanks,
Wei
