Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-31 Thread Wei Wang

On 07/31/2017 12:20 AM, Michael S. Tsirkin wrote:

On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote:

On Sun, Jul 30, 2017 at 05:59:17AM +, Wang, Wei W wrote:
That's a hypervisor implementation detail. From the guest's point of view,
discarding contents cannot be distinguished from writing back the old contents.


Besides, ignoring the free page tricks, consider regular ballooning.
We map the page with DONTNEED, then map it back with WILLNEED. The result
is a zero page. So at least one of deflate/inflate should be input.
I'd say both, for symmetry.



OK, I see the point. Thanks.

Best,
Wei



Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-30 Thread Michael S. Tsirkin
On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 30, 2017 at 05:59:17AM +, Wang, Wei W wrote:
> > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > > > add each s/g as a separate input buffer.
> > > > > > > > Should it be an output buffer?
> > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > > > writeable by device: DMA_FROM_DEVICE.
> > > > > > Why would the hypervisor need to zero the buffer?
> > > > > The page is supplied to hypervisor and can lose the value that is
> > > > > there.  That is the definition of writeable by device.
> > > >
> > > > I think for the free pages, it should be clear that they will be added
> > > > as output buffer to the device, because (as we discussed) they are
> > > > just hints, and some of them may be used by the guest after the report_ 
> > > > API is
> > > invoked.
> > > > The device/hypervisor should not use or discard them.
> > > 
> > > Discarding contents is exactly what you propose doing if migration is 
> > > going on,
> > > isn't it?
> > 
> > That's actually a different concept. Please let me explain it with this 
> > example:
> > 
> > The hypervisor receives the hint saying the guest PageX is a free page, but 
> > as we know, 
> > after that report_ API exits, the guest kernel may take PageX to use, so 
> > PageX is not free
> > page any more. At this time, if the hypervisor writes to the page, that 
> > would crash the guest.
> > So, I think the cornerstone of this work is that the hypervisor should not 
> > touch the
> > reported pages.
> > 
> > Best,
> > Wei
> 
> That's a hypervisor implementation detail. From the guest's point of view,
> discarding contents cannot be distinguished from writing back the old contents.
> 

Besides, ignoring the free page tricks, consider regular ballooning.
We map the page with DONTNEED, then map it back with WILLNEED. The result
is a zero page. So at least one of deflate/inflate should be input.
I'd say both, for symmetry.

-- 
MST


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-30 Thread Michael S. Tsirkin
On Sun, Jul 30, 2017 at 05:59:17AM +, Wang, Wei W wrote:
> On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > > add each s/g as a separate input buffer.
> > > > > > > Should it be an output buffer?
> > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > > writeable by device: DMA_FROM_DEVICE.
> > > > > Why would the hypervisor need to zero the buffer?
> > > > The page is supplied to hypervisor and can lose the value that is
> > > > there.  That is the definition of writeable by device.
> > >
> > > I think for the free pages, it should be clear that they will be added
> > > as output buffer to the device, because (as we discussed) they are
> > > just hints, and some of them may be used by the guest after the report_ 
> > > API is
> > invoked.
> > > The device/hypervisor should not use or discard them.
> > 
> > Discarding contents is exactly what you propose doing if migration is going 
> > on,
> > isn't it?
> 
> That's actually a different concept. Please let me explain it with this 
> example:
> 
> The hypervisor receives the hint saying the guest PageX is a free page, but 
> as we know, 
> after that report_ API exits, the guest kernel may take PageX to use, so 
> PageX is not free
> page any more. At this time, if the hypervisor writes to the page, that would 
> crash the guest.
> So, I think the cornerstone of this work is that the hypervisor should not 
> touch the
> reported pages.
> 
> Best,
> Wei

That's a hypervisor implementation detail. From the guest's point of view,
discarding contents cannot be distinguished from writing back the old contents.



RE: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-30 Thread Wang, Wei W
On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > add each s/g as a separate input buffer.
> > > > > > Should it be an output buffer?
> > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > writeable by device: DMA_FROM_DEVICE.
> > > > Why would the hypervisor need to zero the buffer?
> > > The page is supplied to hypervisor and can lose the value that is
> > > there.  That is the definition of writeable by device.
> >
> > I think for the free pages, it should be clear that they will be added
> > as output buffer to the device, because (as we discussed) they are
> > just hints, and some of them may be used by the guest after the report_ API 
> > is
> invoked.
> > The device/hypervisor should not use or discard them.
> 
> Discarding contents is exactly what you propose doing if migration is going 
> on,
> isn't it?

That's actually a different concept. Please let me explain it with this example:

The hypervisor receives a hint saying that guest PageX is a free page, but as
we know, after the report_ API returns, the guest kernel may take PageX back
into use, so PageX is no longer a free page. At that point, if the hypervisor
writes to the page, it would crash the guest. So I think the cornerstone of
this work is that the hypervisor must not touch the reported pages.

Best,
Wei


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-29 Thread Michael S. Tsirkin
On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > OK I thought this over. While we might need these new APIs in
> > > > > > the future, I think that at the moment, there's a way to implement
> > > > > > this feature that is significantly simpler. Just add each s/g
> > > > > > as a separate input buffer.
> > > > > Should it be an output buffer?
> > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > writeable by device: DMA_FROM_DEVICE.
> > > Why would the hypervisor need to zero the buffer?
> > The page is supplied to hypervisor and can lose the value that
> > is there.  That is the definition of writeable by device.
> 
> I think for the free pages, it should be clear that they will be added as
> output buffer to the device, because (as we discussed) they are just hints,
> and some of them may be used by the guest after the report_ API is invoked.
> The device/hypervisor should not use or discard them.

Discarding contents is exactly what you propose doing if
migration is going on, isn't it?

> For the balloon pages, I kind of agree with the existing implementation
> (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of
> _add_inbuf()).


This is because it does not pass SGs; it passes weirdly
formatted PAs within the buffer.

> I think inbuf should be a buffer which will be written by the device and
> read by the
> driver.

If the hypervisor can change it, it's an inbuf. It should not matter
whether the driver reads it.

> The cmd buffer put on the vq for the device to send commands can be
> an
> inbuf, I think.
> 
> 
> > 
> > > I think it may only
> > > need to read out the info(base,size).
> > And then do what?
> 
> 
> For the free pages, the info will be used to clear the corresponding "1" in
> the dirty bitmap.
> For balloon pages, they will be made DONTNEED and given to other host
> processes to
> use (the device won't write them, so no need to set "write" when
> virtqueue_map_desc() in
> the device).
> 
> 
> > 
> > > I think it should be like this:
> > > the cmd hdr buffer: input, because the hypervisor would write it to
> > > send a cmd to the guest
> > > the payload buffer: output, for the hypervisor to read the info
> > These should be split.
> > 
> > We have:
> > 
> > 1. request that hypervisor sends to guest, includes ID: input
> > 2. synchronisation header with ID sent by guest: output
> > 3. list of pages: input
> > 
> > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
> > might make sense.
> 
> I would prefer to make the above item 1 come on the free page vq,
> because the existing stat_vq doesn't support the cmd hdr.
> Now, I think it is also not necessary to move the existing stat_vq
> implementation to
> a new implementation under the cmd hdr, because
> that new support doesn't make a difference for stats, it will still use its
> stat_vq (the free
> page vq will be used for reporting free pages only)
> 
> What do you think?
> 
> 
> Best,
> Wei


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-29 Thread Wei Wang

On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:

On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.

Should it be an output buffer?

Hypervisor overwrites these pages with zeroes. Therefore it is
writeable by device: DMA_FROM_DEVICE.

Why would the hypervisor need to zero the buffer?

The page is supplied to hypervisor and can lose the value that
is there.  That is the definition of writeable by device.


I think for the free pages, it should be clear that they will be added as
output buffer to the device, because (as we discussed) they are just hints,
and some of them may be used by the guest after the report_ API is invoked.
The device/hypervisor should not use or discard them.

For the balloon pages, I kind of agree with the existing implementation
(e.g. inside tell_host()), which uses virtqueue_add_outbuf() (instead of
virtqueue_add_inbuf()). I think an inbuf should be a buffer that is written
by the device and read by the driver. The cmd buffer put on the vq for the
device to send commands can be an inbuf, I think.





I think it may only
need to read out the info(base,size).

And then do what?



For the free pages, the info will be used to clear the corresponding "1"
in the dirty bitmap. For balloon pages, they will be made DONTNEED and
given to other host processes to use (the device won't write them, so there
is no need to set "write" when calling virtqueue_map_desc() in the device).





I think it should be like this:
the cmd hdr buffer: input, because the hypervisor would write it to
send a cmd to the guest
the payload buffer: output, for the hypervisor to read the info

These should be split.

We have:

1. request that hypervisor sends to guest, includes ID: input
2. synchronisation header with ID sent by guest: output
3. list of pages: input

2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
might make sense.


I would prefer to make the above item 1 come on the free page vq, because
the existing stat_vq doesn't support the cmd hdr. I also think it is not
necessary to move the existing stat_vq implementation to a new one under
the cmd hdr, because that new support doesn't make a difference for stats:
stats will still use stat_vq (the free page vq will be used for reporting
free pages only).

What do you think?


Best,
Wei


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-28 Thread Michael S. Tsirkin
On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > OK I thought this over. While we might need these new APIs in
> > > > the future, I think that at the moment, there's a way to implement
> > > > this feature that is significantly simpler. Just add each s/g
> > > > as a separate input buffer.
> > > 
> > > Should it be an output buffer?
> > Hypervisor overwrites these pages with zeroes. Therefore it is
> > writeable by device: DMA_FROM_DEVICE.
> 
> Why would the hypervisor need to zero the buffer?

The page is supplied to hypervisor and can lose the value that
is there.  That is the definition of writeable by device.

> I think it may only
> need to read out the info(base,size).

And then do what?

> I think it should be like this:
> the cmd hdr buffer: input, because the hypervisor would write it to
> send a cmd to the guest
> the payload buffer: output, for the hypervisor to read the info

These should be split.

We have:

1. request that hypervisor sends to guest, includes ID: input
2. synchronisation header with ID sent by guest: output
3. list of pages: input

2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
might make sense.


> > > I think output means from the
> > > driver to device (i.e. DMA_TO_DEVICE).
> > This part is correct I believe.
> > 
> > > > This needs zero new APIs.
> > > > 
> > > > I know that follow-up patches need to add a header in front
> > > > so you might be thinking: how am I going to add this
> > > > header? The answer is quite simple - add it as a separate
> > > > out header.
> > > > 
> > > > Host will be able to distinguish between header and pages
> > > > by looking at the direction, and - should we want to add
> > > > IN data to header - additionally size (<4K => header).
> > > 
> > > I think this works fine when the cmdq is only used for
> > > reporting the unused pages.
> > > It would be an issue
> > > if there are other usages (e.g. report memory statistics)
> > > interleaving. I think one solution would be to lock the cmdq until
> > > a cmd usage is done (e.g. all the unused pages are reported) -
> > > in this case, the periodically updated guest memory statistics
> > > may be delayed for a while occasionally when live migration starts.
> > > Would this be acceptable? If not, probably we can have the cmdq
> > > for one usage only.
> > > 
> > > 
> > > Best,
> > > Wei
> > OK I see, I think the issue is that reporting free pages
> > was structured like stats. Let's split it -
> > send pages on e.g. free_vq, get commands on vq shared with
> > stats.
> > 
> 
> Would it be better to have the "report free page" command to be sent
> through the free_vq? In this case, we will have
> stats_vq: for the stats usage, which is already there
> free_vq: for reporting free pages.
> 
> Best,
> Wei

See above. I would get requests on stats vq but report
free pages separately on free vq.



Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-28 Thread Michael S. Tsirkin
On Fri, Jul 28, 2017 at 04:25:19PM +0800, Wei Wang wrote:
> On 07/12/2017 08:40 PM, Wei Wang wrote:
> > Add a new feature, VIRTIO_BALLOON_F_SG, which enables the driver to
> > transfer a chunk of ballooned (i.e. inflated/deflated) pages using
> > scatter-gather lists to the host.
> > 
> > The implementation of the previous virtio-balloon is not very
> > efficient, because the balloon pages are transferred to the
> > host one by one. Here is the breakdown of the time in percentage
> > spent on each step of the balloon inflating process (inflating
> > 7GB of an 8GB idle guest).
> > 
> > 1) allocating pages (6.5%)
> > 2) sending PFNs to host (68.3%)
> > 3) address translation (6.1%)
> > 4) madvise (19%)
> > 
> > It takes about 4126ms for the inflating process to complete.
> > The above profiling shows that the bottlenecks are stage 2)
> > and stage 4).
> > 
> > This patch optimizes step 2) by transferring pages to the host in
> > sgs. An sg describes a chunk of guest physically contiguous pages.
> > With this mechanism, step 4) can also be optimized by doing address
> > translation and madvise() in chunks rather than page by page.
> > 
> > With this new feature, the above ballooning process takes ~491ms
> > resulting in an improvement of ~88%.
> > 
> 
> 
> I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499,
> which zeros all the ballooned pages and is very time consuming.
> 
> Tests show that the time to balloon 7G pages is increased from ~491 ms to
> 2.8 seconds with the above patch.

Sounds like it should be reverted. Post a revert pls and
we'll discuss.

> How about moving the zero operation to the hypervisor? In this way, we
> will have a much faster balloon process.
> 
> 
> Best,
> Wei

Or in other words hypervisors should not be stupid and
should not try to run ksm on DONTNEED pages.

-- 
MST



Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-28 Thread Wei Wang

On 07/12/2017 08:40 PM, Wei Wang wrote:

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the driver to
transfer a chunk of ballooned (i.e. inflated/deflated) pages using
scatter-gather lists to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically contiguous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~491ms
resulting in an improvement of ~88%.




I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499,
which zeros all the ballooned pages and is very time consuming.

Tests show that the time to balloon 7G pages is increased from ~491 ms to
2.8 seconds with the above patch.

How about moving the zero operation to the hypervisor? In this way, we
will have a much faster balloon process.


Best,
Wei




Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-26 Thread Wei Wang

On 07/27/2017 01:02 AM, Michael S. Tsirkin wrote:

On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote:

On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:

On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:

On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:

On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
  note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
  for alternative layouts which always expose descriptors
  immediately

Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it to the chain tail (which is indexed
by
prev_id, probably better to call it tail_id). Then, the new added desc
becomes
the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
when it's
added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.

I think it can support the 1.1 proposal, too. But before getting
into that, I think we first need to deep dive into the implementation
and usage of _first/next/last. The usage would need to lock the vq
from the first to the end (otherwise, the returned info about the number
of available desc in the vq, i.e. num_free, would be invalid):

lock(vq);
add_first();
add_next();
add_last();
unlock(vq);

However, I think the case isn't this simple, since we need to check more
things
after each add_xx() step. For example, if only one entry is available at the
time
we start to use the vq, that is, num_free is 0 after add_first(), we
wouldn't be
able to add_next and add_last. So, it would work like this:

start:
  ...get free page block..
  lock(vq)
retry:
  ret = add_first(.., &num_free, ..);
  if (ret == -ENOSPC) {
  goto retry;
  } else if (!num_free) {
  add_chain_head();
  unlock(vq);
  kick & wait;
  goto start;
  }
next_one:
  ...get free page block..
  add_next(.., &num_free, ..);
  if (!num_free) {
  add_chain_head();
  unlock(vq);
  kick & wait;
  goto start;
  } else if (num_free == 1) {
  ...get free page block..
  add_last(..);
  unlock(vq);
  kick & wait;
  goto start;
  } else {
  goto next_one;
  }

The above seems unnecessary to me to have three different APIs.
That's the reason to combine them into one virtqueue_add_chain_desc().

-- or, do you have a different thought about using the three APIs?


Implementation Reference:

struct desc_iterator {
  unsigned int head;
  unsigned int tail;
};

add_first(*vq, *desc_iterator, *num_free, ..)
{
  if (vq->vq.num_free < 1)
  return -ENOSPC;
  get_desc(&desc_id);
  desc[desc_id].flag &= ~_F_NEXT;
  desc_iterator->head = desc_id;
  desc_iterator->tail = desc_iterator->head;
  *num_free = vq->vq.num_free;
}

add_next(vq, desc_iterator, *num_free, ..)
{
  get_desc(&desc_id);
  desc[desc_id].flag &= ~_F_NEXT;
  desc[desc_iterator->tail].next = desc_id;
  desc[desc_iterator->tail].flag |= _F_NEXT;
  desc_iterator->tail = desc_id;
  *num_free = vq->vq.num_free;
}

add_last(vq, desc_iterator, ..)
{
  get_desc(&desc_id);
  desc[desc_id].flag &= ~_F_NEXT;
  desc[desc_iterator->tail].next = desc_id;
  desc_iterator->tail = desc_id;

  add_chain_head(); // put the desc_iterator->head to the ring
}


Best,
Wei

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.


Should it be an output buffer?

Hypervisor overwrites these pages with zeroes. Therefore it is
writeable by device: DMA_FROM_DEVICE.


Why would the hypervisor need to zero the buffer? I think it may only
need to read out the info(base,size).

I think it should be like this:
the cmd hdr buffer: input, because the hypervisor would write it to
send a cmd to the guest
the payload buffer: output, for the hypervisor to read the info


I think output means from the
driver to device (i.e. DMA_TO_DEVICE).

This part is correct I believe.


This needs zero new APIs.

I know that follow-up patches need to add a header in front
so you might be thinking: how am I going to add this
header? The answer is quite simple - add it as a separate
out header.

Host will be able to distinguish between header and pages
by looking at the direction, and - should we want to add
IN data to header - additionally size (<4K => header).


I think this works fine when the cmdq is 

Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote:
> On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:
> > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
> > > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> > > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > > > > > So the way I see it, there are several issues:
> > > > > > 
> > > > > > - internal wait - forces multiple APIs like kick/kick_sync
> > > > > >  note how kick_sync can fail but your code never checks return 
> > > > > > code
> > > > > > - need to re-write the last descriptor - might not work
> > > > > >  for alternative layouts which always expose descriptors
> > > > > >  immediately
> > > > > Probably it wasn't clear. Please let me explain the two functions 
> > > > > here:
> > > > > 
> > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> > > > > grabs a desc from the vq and inserts it to the chain tail (which is 
> > > > > indexed
> > > > > by
> > > > > prev_id, probably better to call it tail_id). Then, the new added desc
> > > > > becomes
> > > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each 
> > > > > desc
> > > > > when it's
> > > > > added to the chain, and set when another desc comes to follow later.
> > > > And this only works if there are multiple rings like
> > > > avail + descriptor ring.
> > > > It won't work e.g. with the proposed new layout where
> > > > writing out a descriptor exposes it immediately.
> > > I think it can support the 1.1 proposal, too. But before getting
> > > into that, I think we first need to deep dive into the implementation
> > > and usage of _first/next/last. The usage would need to lock the vq
> > > from the first to the end (otherwise, the returned info about the number
> > > of available desc in the vq, i.e. num_free, would be invalid):
> > > 
> > > lock(vq);
> > > add_first();
> > > add_next();
> > > add_last();
> > > unlock(vq);
> > > 
> > > However, I think the case isn't this simple, since we need to check more
> > > things
> > > after each add_xx() step. For example, if only one entry is available at 
> > > the
> > > time
> > > we start to use the vq, that is, num_free is 0 after add_first(), we
> > > wouldn't be
> > > able to add_next and add_last. So, it would work like this:
> > > 
> > > start:
> > >  ...get free page block..
> > >  lock(vq)
> > > retry:
> > >  ret = add_first(.., &num_free, ..);
> > >  if (ret == -ENOSPC) {
> > >  goto retry;
> > >  } else if (!num_free) {
> > >  add_chain_head();
> > >  unlock(vq);
> > >  kick & wait;
> > >  goto start;
> > >  }
> > > next_one:
> > >  ...get free page block..
> > >  add_next(.., &num_free, ..);
> > >  if (!num_free) {
> > >  add_chain_head();
> > >  unlock(vq);
> > >  kick & wait;
> > >  goto start;
> > >  } else if (num_free == 1) {
> > >  ...get free page block..
> > >  add_last(..);
> > >  unlock(vq);
> > >  kick & wait;
> > >  goto start;
> > >  } else {
> > >  goto next_one;
> > >  }
> > > 
> > > The above seems unnecessary to me to have three different APIs.
> > > That's the reason to combine them into one virtqueue_add_chain_desc().
> > > 
> > > -- or, do you have a different thought about using the three APIs?
> > > 
> > > 
> > > Implementation Reference:
> > > 
> > > struct desc_iterator {
> > >  unsigned int head;
> > >  unsigned int tail;
> > > };
> > > 
> > > add_first(*vq, *desc_iterator, *num_free, ..)
> > > {
> > >  if (vq->vq.num_free < 1)
> > >  return -ENOSPC;
> > >  get_desc(&desc_id);
> > >  desc[desc_id].flag &= ~_F_NEXT;
> > >  desc_iterator->head = desc_id;
> > >  desc_iterator->tail = desc_iterator->head;
> > >  *num_free = vq->vq.num_free;
> > > }
> > > 
> > > add_next(vq, desc_iterator, *num_free, ..)
> > > {
> > >  get_desc(&desc_id);
> > >  desc[desc_id].flag &= ~_F_NEXT;
> > >  desc[desc_iterator->tail].next = desc_id;
> > >  desc[desc_iterator->tail].flag |= _F_NEXT;
> > >  desc_iterator->tail = desc_id;
> > >  *num_free = vq->vq.num_free;
> > > }
> > > 
> > > add_last(vq, desc_iterator, ..)
> > > {
> > >  get_desc(&desc_id);
> > >  desc[desc_id].flag &= ~_F_NEXT;
> > >  desc[desc_iterator->tail].next = desc_id;
> > >  desc_iterator->tail = desc_id;
> > > 
> > >  add_chain_head(); // put the desc_iterator->head to the ring
> > > }
> > > 
> > > 
> > > Best,
> > > Wei
> > OK I thought this over. While we might need these new APIs in
> > the future, I think that at the moment, there's a way to implement
> > this feature that is significantly simpler. Just add each s/g
> > as a separate input buffer.
> 
> 
> Should it be an output buffer?

Hypervisor overwrites these pages with zeroes. Therefore it is
writeable by device: DMA_FROM_DEVICE.

> I think output means from 

Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-25 Thread Wei Wang

On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:

On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:

On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:

On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
 note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
 for alternative layouts which always expose descriptors
 immediately

Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it to the chain tail (which is indexed
by
prev_id, probably better to call it tail_id). Then, the new added desc
becomes
the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
when it's
added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.

I think it can support the 1.1 proposal, too. But before getting
into that, I think we first need to deep dive into the implementation
and usage of _first/next/last. The usage would need to lock the vq
from the first to the end (otherwise, the returned info about the number
of available desc in the vq, i.e. num_free, would be invalid):

lock(vq);
add_first();
add_next();
add_last();
unlock(vq);

However, I think the case isn't this simple, since we need to check more
things
after each add_xx() step. For example, if only one entry is available at the
time
we start to use the vq, that is, num_free is 0 after add_first(), we
wouldn't be
able to add_next and add_last. So, it would work like this:

start:
 ...get free page block..
 lock(vq)
retry:
 ret = add_first(.., &num_free, ..);
 if (ret == -ENOSPC) {
 goto retry;
 } else if (!num_free) {
 add_chain_head();
 unlock(vq);
 kick & wait;
 goto start;
 }
next_one:
 ...get free page block..
 add_next(.., &num_free, ..);
 if (!num_free) {
 add_chain_head();
 unlock(vq);
 kick & wait;
 goto start;
 } else if (num_free == 1) {
 ...get free page block..
 add_last(..);
 unlock(vq);
 kick & wait;
 goto start;
 } else {
 goto next_one;
 }

The above seems unnecessary to me to have three different APIs.
That's the reason to combine them into one virtqueue_add_chain_desc().

-- or, do you have a different thought about using the three APIs?


Implementation Reference:

struct desc_iterator {
 unsigned int head;
 unsigned int tail;
};

add_first(*vq, *desc_iterator, *num_free, ..)
{
 if (vq->vq.num_free < 1)
 return -ENOSPC;
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc_iterator->head = desc_id;
 desc_iterator->tail = desc_iterator->head;
 *num_free = vq->vq.num_free;
}

add_next(vq, desc_iterator, *num_free, ..)
{
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc[desc_iterator->tail].next = desc_id;
 desc[desc_iterator->tail].flag |= _F_NEXT;
 desc_iterator->tail = desc_id;
 *num_free = vq->vq.num_free;
}

add_last(vq, desc_iterator, ..)
{
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc[desc_iterator->tail].next = desc_id;
 desc_iterator->tail = desc_id;

 add_chain_head(); // put the desc_iterator->head to the ring
}


Best,
Wei

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.



Should it be an output buffer? I think output means from the
driver to device (i.e. DMA_TO_DEVICE).



This needs zero new APIs.

I know that follow-up patches need to add a header in front
so you might be thinking: how am I going to add this
header? The answer is quite simple - add it as a separate
out header.

Host will be able to distinguish between header and pages
by looking at the direction, and - should we want to add
IN data to header - additionally size (<4K => header).



I think this works fine when the cmdq is only used for
reporting the unused pages. It would be an issue
if there are other usages (e.g. report memory statistics)
interleaving. I think one solution would be to lock the cmdq until
a cmd usage is done ((e.g. all the unused pages are reported) ) -
in this case, the periodically updated guest memory statistics
may be delayed for a while occasionally when live migration starts.
Would this be acceptable? If not, probably we can have the cmdq
for one usage only.


Best,
Wei


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-25 Thread Wei Wang

On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:

On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:

On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:

On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
 note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
 for alternative layouts which always expose descriptors
 immediately

Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it to the chain tail (which is indexed
by
prev_id, probably better to call it tail_id). Then, the new added desc
becomes
the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
when it's
added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.

I think it can support the 1.1 proposal, too. But before getting
into that, I think we first need to deep dive into the implementation
and usage of _first/next/last. The usage would need to lock the vq
from the first to the end (otherwise, the returned info about the number
of available desc in the vq, i.e. num_free, would be invalid):

lock(vq);
add_first();
add_next();
add_last();
unlock(vq);

However, I think the case isn't this simple, since we need to check more
things
after each add_xx() step. For example, if only one entry is available at the
time
we start to use the vq, that is, num_free is 0 after add_first(), we
wouldn't be
able to add_next and add_last. So, it would work like this:

start:
 ...get free page block..
 lock(vq)
retry:
 ret = add_first(.., &num_free, ..);
 if (ret == -ENOSPC) {
 goto retry;
 } else if (!num_free) {
 add_chain_head();
 unlock(vq);
 kick & wait;
 goto start;
 }
next_one:
 ...get free page block..
 add_next(.., &num_free, ..);
 if (!num_free) {
 add_chain_head();
 unlock(vq);
 kick & wait;
 goto start;
 } else if (num_free == 1) {
 ...get free page block..
 add_last(..);
 unlock(vq);
 kick & wait;
 goto start;
 } else {
 goto next_one;
 }

The above seems unnecessary to me to have three different APIs.
That's the reason to combine them into one virtqueue_add_chain_desc().

-- or, do you have a different thought about using the three APIs?


Implementation Reference:

struct desc_iterator {
 unsigned int head;
 unsigned int tail;
};

add_first(*vq, *desc_iterator, *num_free, ..)
{
 if (vq->vq.num_free < 1)
 return -ENOSPC;
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc_iterator->head = desc_id;
 desc_iterator->tail = desc_iterator->head;
 *num_free = vq->vq.num_free;
}

add_next(vq, desc_iterator, *num_free,..)
{
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc[desc_iterator->tail].next = desc_id;
 desc[desc_iterator->tail].flag |= _F_NEXT;
 desc_iterator->tail = desc_id;
 *num_free = vq->vq.num_free;
}

add_last(vq, desc_iterator,..)
{
 get_desc(&desc_id);
 desc[desc_id].flag &= ~_F_NEXT;
 desc[desc_iterator->tail].next = desc_id;
 desc_iterator->tail = desc_id;

 add_chain_head(); // put the desc_iterator.head to the ring
}


Best,
Wei

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.



Should it be an output buffer? I think output means from the
driver to device (i.e. DMA_TO_DEVICE).



This needs zero new APIs.

I know that follow-up patches need to add a header in front
so you might be thinking: how am I going to add this
header? The answer is quite simple - add it as a separate
out header.

Host will be able to distinguish between header and pages
by looking at the direction, and - should we want to add
IN data to header - additionally size (<4K => header).



I think this works fine when the cmdq is only used for
reporting the unused pages. It would be an issue
if there are other usages (e.g. report memory statistics)
interleaving. I think one solution would be to lock the cmdq until
a cmd usage is done (e.g. all the unused pages are reported) -
in this case, the periodically updated guest memory statistics
may be delayed for a while occasionally when live migration starts.
Would this be acceptable? If not, probably we can have the cmdq
for one usage only.


Best,
Wei


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-22 Thread Michael S. Tsirkin
On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
> > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > > > So the way I see it, there are several issues:
> > > > 
> > > > - internal wait - forces multiple APIs like kick/kick_sync
> > > > note how kick_sync can fail but your code never checks return code
> > > > - need to re-write the last descriptor - might not work
> > > > for alternative layouts which always expose descriptors
> > > > immediately
> > > Probably it wasn't clear. Please let me explain the two functions here:
> > > 
> > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> > > grabs a desc from the vq and inserts it to the chain tail (which is 
> > > indexed
> > > by
> > > prev_id, probably better to call it tail_id). Then, the new added desc
> > > becomes
> > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
> > > when it's
> > > added to the chain, and set when another desc comes to follow later.
> > And this only works if there are multiple rings like
> > avail + descriptor ring.
> > It won't work e.g. with the proposed new layout where
> > writing out a descriptor exposes it immediately.
> 
> I think it can support the 1.1 proposal, too. But before getting
> into that, I think we first need to deep dive into the implementation
> and usage of _first/next/last. The usage would need to lock the vq
> from the first to the end (otherwise, the returned info about the number
> of available desc in the vq, i.e. num_free, would be invalid):
> 
> lock(vq);
> add_first();
> add_next();
> add_last();
> unlock(vq);
> 
> However, I think the case isn't this simple, since we need to check more
> things
> after each add_xx() step. For example, if only one entry is available at the
> time
> we start to use the vq, that is, num_free is 0 after add_first(), we
> wouldn't be
> able to add_next and add_last. So, it would work like this:
> 
> start:
> ...get free page block..
> lock(vq)
> retry:
> ret = add_first(.., &num_free, ..);
> if (ret == -ENOSPC) {
> goto retry;
> } else if (!num_free) {
> add_chain_head();
> unlock(vq);
> kick & wait;
> goto start;
> }
> next_one:
> ...get free page block..
> add_next(.., &num_free, ..);
> if (!num_free) {
> add_chain_head();
> unlock(vq);
> kick & wait;
> goto start;
> } else if (num_free == 1) {
> ...get free page block..
> add_last(..);
> unlock(vq);
> kick & wait;
> goto start;
> } else {
> goto next_one;
> }
> 
> The above seems unnecessary to me to have three different APIs.
> That's the reason to combine them into one virtqueue_add_chain_desc().
> 
> -- or, do you have a different thought about using the three APIs?
> 
> 
> Implementation Reference:
> 
> struct desc_iterator {
> unsigned int head;
> unsigned int tail;
> };
> 
> add_first(*vq, *desc_iterator, *num_free, ..)
> {
> if (vq->vq.num_free < 1)
> return -ENOSPC;
> get_desc(&desc_id);
> desc[desc_id].flag &= ~_F_NEXT;
> desc_iterator->head = desc_id;
> desc_iterator->tail = desc_iterator->head;
> *num_free = vq->vq.num_free;
> }
> 
> add_next(vq, desc_iterator, *num_free,..)
> {
> get_desc(&desc_id);
> desc[desc_id].flag &= ~_F_NEXT;
> desc[desc_iterator->tail].next = desc_id;
> desc[desc_iterator->tail].flag |= _F_NEXT;
> desc_iterator->tail = desc_id;
> *num_free = vq->vq.num_free;
> }
> 
> add_last(vq, desc_iterator,..)
> {
> get_desc(&desc_id);
> desc[desc_id].flag &= ~_F_NEXT;
> desc[desc_iterator->tail].next = desc_id;
> desc_iterator->tail = desc_id;
> 
> add_chain_head(); // put the desc_iterator.head to the ring
> }
> 
> 
> Best,
> Wei

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.

This needs zero new APIs.

I know that follow-up patches need to add a header in front
so you might be thinking: how am I going to add this
header? The answer is quite simple - add it as a separate
out header.

Host will be able to distinguish between header and pages
by looking at the direction, and - should we want to add
IN data to header - additionally size (<4K => header).

We will be able to look at extended APIs separately down
the road.

-- 
MST


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-14 Thread Wei Wang

On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:

On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
for alternative layouts which always expose descriptors
immediately

Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it to the chain tail (which is indexed
by
prev_id, probably better to call it tail_id). Then, the new added desc
becomes
the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
when it's
added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.


I think it can support the 1.1 proposal, too. But before getting
into that, I think we first need to deep dive into the implementation
and usage of _first/next/last. The usage would need to lock the vq
from the first to the end (otherwise, the returned info about the number
of available desc in the vq, i.e. num_free, would be invalid):

lock(vq);
add_first();
add_next();
add_last();
unlock(vq);

However, I think the case isn't this simple, since we need to check more 
things
after each add_xx() step. For example, if only one entry is available at 
the time
we start to use the vq, that is, num_free is 0 after add_first(), we 
wouldn't be

able to add_next and add_last. So, it would work like this:

start:
...get free page block..
lock(vq)
retry:
ret = add_first(.., &num_free, ..);
if (ret == -ENOSPC) {
goto retry;
} else if (!num_free) {
add_chain_head();
unlock(vq);
kick & wait;
goto start;
}
next_one:
...get free page block..
add_next(.., &num_free, ..);
if (!num_free) {
add_chain_head();
unlock(vq);
kick & wait;
goto start;
} else if (num_free == 1) {
...get free page block..
add_last(..);
unlock(vq);
kick & wait;
goto start;
} else {
goto next_one;
}

The above seems unnecessary to me to have three different APIs.
That's the reason to combine them into one virtqueue_add_chain_desc().

-- or, do you have a different thought about using the three APIs?


Implementation Reference:

struct desc_iterator {
unsigned int head;
unsigned int tail;
};

add_first(*vq, *desc_iterator, *num_free, ..)
{
if (vq->vq.num_free < 1)
return -ENOSPC;
get_desc(&desc_id);
desc[desc_id].flag &= ~_F_NEXT;
desc_iterator->head = desc_id;
desc_iterator->tail = desc_iterator->head;
*num_free = vq->vq.num_free;
}

add_next(vq, desc_iterator, *num_free,..)
{
get_desc(&desc_id);
desc[desc_id].flag &= ~_F_NEXT;
desc[desc_iterator->tail].next = desc_id;
desc[desc_iterator->tail].flag |= _F_NEXT;
desc_iterator->tail = desc_id;
*num_free = vq->vq.num_free;
}

add_last(vq, desc_iterator,..)
{
get_desc(&desc_id);
desc[desc_id].flag &= ~_F_NEXT;
desc[desc_iterator->tail].next = desc_id;
desc_iterator->tail = desc_id;

add_chain_head(); // put the desc_iterator.head to the ring
}


Best,
Wei




Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-13 Thread Michael S. Tsirkin
On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > 
> > So the way I see it, there are several issues:
> > 
> > - internal wait - forces multiple APIs like kick/kick_sync
> >note how kick_sync can fail but your code never checks return code
> > - need to re-write the last descriptor - might not work
> >for alternative layouts which always expose descriptors
> >immediately
> 
> Probably it wasn't clear. Please let me explain the two functions here:
> 
> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> grabs a desc from the vq and inserts it to the chain tail (which is indexed
> by
> prev_id, probably better to call it tail_id). Then, the new added desc
> becomes
> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
> when it's
> added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.

> 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end.
> 
> So, if people want to add a desc and immediately expose it to the other end,
> i.e. build a single desc chain, they can just add and expose:
> 
> virtqueue_add_chain_desc(..);
> virtqueue_add_chain(..,head_id);
> 
> Would you see any issues here?

The way the new APIs poll used ring internally.

> 
> > - some kind of iterator type would be nicer instead of
> >maintaining head/prev explicitly
> 
> Why would we need to iterate the chain?

In your patches prev/tail are iterators - they keep track of
where you are in the chain.

> I think it would be simpler to use
> a wrapper struct:
> 
> struct virtqueue_desc_chain {
> unsigned int head;  // head desc id of the chain
> unsigned int tail; // tail desc id of the chain
> };
> 
> The new desc will be put to desc[tail].next, and we don't need to walk
> from the head desc[head].next when inserting a new desc to the chain, right?
> 
> 
> > 
> > As for the use, it would be better to do
> > 
> > if (!add_next(vq, ...)) {
> > add_last(vq, ...)
> > kick
> > wait
> > }
> 
> "!add_next(vq, ...)" means that the vq is full?


No - it means there's only 1 entry left for the last descriptor.


> If so, what would add_last()
> do then?
> 
> > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
> > in fact it merely puts the branch in the virtio code.
> > 
> 
> Actually it wasn't intended to improve performance. It is used to indicate
> the "init" state
> of the chain. So, when virtqueue_add_chain_desc(vq, &head_id, ..) finds
> head_id == VIRTQUEUE_DESC_ID_INIT, it will
> assign the grabbed desc id to head_id. In some sense, it is equivalent to
> add_first().
> 
> Do you have a different opinion here?
> 
> Best,
> Wei
> 

It is but let's make it explicit here - an API function is better
than a special value.

-- 
MST


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-13 Thread Wei Wang

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:


So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
   note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
   for alternative layouts which always expose descriptors
   immediately


Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it to the chain tail (which is 
indexed by
prev_id, probably better to call it tail_id). Then, the new added desc 
becomes
the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc 
when it's

added to the chain, and set when another desc comes to follow later.

2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end.

So, if people want to add a desc and immediately expose it to the other end,
i.e. build a single desc chain, they can just add and expose:

virtqueue_add_chain_desc(..);
virtqueue_add_chain(..,head_id);

Would you see any issues here?



- some kind of iterator type would be nicer instead of
   maintaining head/prev explicitly


Why would we need to iterate the chain? I think it would be simpler to use
a wrapper struct:

struct virtqueue_desc_chain {
unsigned int head;  // head desc id of the chain
unsigned int tail; // tail desc id of the chain
};

The new desc will be put to desc[tail].next, and we don't need to walk
from the head desc[head].next when inserting a new desc to the chain, right?




As for the use, it would be better to do

if (!add_next(vq, ...)) {
add_last(vq, ...)
kick
wait
}


"!add_next(vq, ...)" means that the vq is full? If so, what would 
add_last() do then?




Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
in fact it merely puts the branch in the virtio code.



Actually it wasn't intended to improve performance. It is used to 
indicate the "init" state
of the chain. So, when virtqueue_add_chain_desc(vq, &head_id, ..) finds
head_id == VIRTQUEUE_DESC_ID_INIT, it will
assign the grabbed desc id to head_id. In some sense, it is equivalent
to add_first().


Do you have a different opinion here?

Best,
Wei





Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread kbuild test robot
Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

>> ERROR: ".xb_set_bit" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_zero" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_find_next_bit" [drivers/virtio/virtio_balloon.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread kbuild test robot
Hi Wei,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: i386-randconfig-x071-07121639 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers//virtio/virtio_balloon.c: In function 'tell_host_one_page':
>> drivers//virtio/virtio_balloon.c:535:39: warning: cast to pointer from 
>> integer of different size [-Wint-to-pointer-cast]
 virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
  ^

vim +535 drivers//virtio/virtio_balloon.c

   527  
   528  static void tell_host_one_page(struct virtio_balloon *vb, struct 
virtqueue *vq,
   529 struct page *page)
   530  {
   531  unsigned int id = VIRTQUEUE_DESC_ID_INIT;
   532  u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
   533  
   534  virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, , , 0);
 > 535  virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
   536  virtqueue_kick_async(vq, vb->acked);
   537  }
   538  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip



Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread Michael S. Tsirkin
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables transferring
> chunks of ballooned (i.e. inflated/deflated) pages to the host
> using scatter-gather lists.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~491ms
> resulting in an improvement of ~88%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Suggested-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/virtio_balloon.c | 141 ++---
>  drivers/virtio/virtio_ring.c| 199 
> +---
>  include/linux/virtio.h  |  20 
>  include/uapi/linux/virtio_balloon.h |   1 +
>  4 files changed, 329 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..aa4e7ec 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>   /* Synchronize access/update to this struct virtio_balloon elements */
>   struct mutex balloon_lock;
>  
> + /* The xbitmap used to record ballooned pages */
> + struct xb page_xb;
> +
>   /* The array of pfns we tell the Host about. */
>   unsigned int num_pfns;
>   __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +/*
> + * Send balloon pages in sgs to host.
> + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap
> + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for
> + * continuous "1" bits, which correspond to continuous pages, to chunk into
> + * sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be serached.

searched

> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT,
> +  prev_id = VIRTQUEUE_DESC_ID_INIT;
> + unsigned long sg_pfn_start, sg_pfn_end;
> + uint64_t sg_addr;
> + uint32_t sg_size;
> +
> + sg_pfn_start = page_xb_start;
> + while (sg_pfn_start < page_xb_end) {
> + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> + page_xb_end, 1);
> + if (sg_pfn_start == page_xb_end + 1)
> + break;
> + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +   page_xb_end, 0);
> + sg_addr = sg_pfn_start << PAGE_SHIFT;
> + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE;

There's an issue here - this might not fit in uint32_t.
You need to limit sg_pfn_end - something like:

/* make sure sg_size below fits in a 32 bit integer */
sg_pfn_end = min(sg_pfn_end, sg_pfn_start + UINT_MAX >> PAGE_SIZE);

> + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id,
> +  &prev_id, 0);
> + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> + sg_pfn_start = sg_pfn_end + 1;
> + }
> +
> + if (head_id != VIRTQUEUE_DESC_ID_INIT) {
> + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
> + virtqueue_kick_async(vq, vb->acked);
> + }
> +}
> +
> +/* Update pfn_max and pfn_min according to the pfn of @page */
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> + struct page *page,
> + unsigned long *pfn_min,
> + unsigned long *pfn_max)
> +{
> + unsigned 

Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread Michael S. Tsirkin
On Wed, Jul 12, 2017 at 09:29:00PM +0800, Wei Wang wrote:
> On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> > > index 28b0e96..9f27101 100644
> > > --- a/include/linux/virtio.h
> > > +++ b/include/linux/virtio.h
> > > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> > > void *data,
> > > gfp_t gfp);
> > > +/* A desc with this init id is treated as an invalid desc */
> > > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
> > > +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> > > +  uint64_t addr,
> > > +  uint32_t len,
> > > +  unsigned int *head_id,
> > > +  unsigned int *prev_id,
> > > +  bool in);
> > > +
> > > +int virtqueue_add_chain(struct virtqueue *_vq,
> > > + unsigned int head,
> > > + bool indirect,
> > > + struct vring_desc *indirect_desc,
> > > + void *data,
> > > + void *ctx);
> > > +
> > >   bool virtqueue_kick(struct virtqueue *vq);
> > > +bool virtqueue_kick_sync(struct virtqueue *vq);
> > > +
> > > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
> > > +
> > >   bool virtqueue_kick_prepare(struct virtqueue *vq);
> > >   bool virtqueue_notify(struct virtqueue *vq);
> > I don't much care for this API. It does exactly what balloon needs,
> > but at cost of e.g. transparently busy-waiting. Unlikely to be
> > a good fit for anything else.
> 
> If you were referring to this API - virtqueue_add_chain_desc():
> 
> Busy waiting only happens when the vq is full (i.e. no desc left). If
> necessary, I think we can add an input parameter like
> "bool busywaiting", then the caller can decide to simply get a -ENOSPC
> or busy wait to add when no desc is available.

I think this just shows this API is too high level.
This policy should live in drivers.

> > 
> > If you don't like my original _first/_next/_last, you will
> > need to come up with something else.
> 
> I thought the above virtqueue_add_chain_desc() performs the same
> functionality as _first/next/last, which are used to grab descs from the
> vq and chain them together. If not, could you please elaborate the
> usage of the original proposal?
> 
> Best,
> Wei
> 

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
  note how kick_sync can fail but your code never checks return code
- need to re-write the last descriptor - might not work
  for alternative layouts which always expose descriptors
  immediately
- some kind of iterator type would be nicer instead of
  maintaining head/prev explicitly


As for the use, it would be better to do

if (!add_next(vq, ...)) {
add_last(vq, ...)
kick
wait
}

Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
in fact it merely puts the branch in the virtio code.



-- 
MST



Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread Wei Wang

On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote:

On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:

diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 28b0e96..9f27101 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
  void *data,
  gfp_t gfp);
  
+/* A desc with this init id is treated as an invalid desc */

+#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
+int virtqueue_add_chain_desc(struct virtqueue *_vq,
+uint64_t addr,
+uint32_t len,
+unsigned int *head_id,
+unsigned int *prev_id,
+bool in);
+
+int virtqueue_add_chain(struct virtqueue *_vq,
+   unsigned int head,
+   bool indirect,
+   struct vring_desc *indirect_desc,
+   void *data,
+   void *ctx);
+
  bool virtqueue_kick(struct virtqueue *vq);
  
+bool virtqueue_kick_sync(struct virtqueue *vq);

+
+bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
+
  bool virtqueue_kick_prepare(struct virtqueue *vq);
  
  bool virtqueue_notify(struct virtqueue *vq);

I don't much care for this API. It does exactly what balloon needs,
but at cost of e.g. transparently busy-waiting. Unlikely to be
a good fit for anything else.


If you were referring to this API - virtqueue_add_chain_desc():

Busy waiting only happens when the vq is full (i.e. no desc left). If
necessary, I think we can add an input parameter like
"bool busywaiting", then the caller can decide to simply get a -ENOSPC
or busy wait to add when no desc is available.



If you don't like my original _first/_next/_last, you will
need to come up with something else.


I thought the above virtqueue_add_chain_desc() performs the same
functionality as _first/next/last, which are used to grab descs from the
vq and chain them together. If not, could you please elaborate the
usage of the original proposal?

Best,
Wei






Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread Michael S. Tsirkin
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 28b0e96..9f27101 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> void *data,
> gfp_t gfp);
>  
> +/* A desc with this init id is treated as an invalid desc */
> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
> +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> +  uint64_t addr,
> +  uint32_t len,
> +  unsigned int *head_id,
> +  unsigned int *prev_id,
> +  bool in);
> +
> +int virtqueue_add_chain(struct virtqueue *_vq,
> + unsigned int head,
> + bool indirect,
> + struct vring_desc *indirect_desc,
> + void *data,
> + void *ctx);
> +
>  bool virtqueue_kick(struct virtqueue *vq);
>  
> +bool virtqueue_kick_sync(struct virtqueue *vq);
> +
> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
> +
>  bool virtqueue_kick_prepare(struct virtqueue *vq);
>  
>  bool virtqueue_notify(struct virtqueue *vq);

I don't much care for this API. It does exactly what balloon needs,
but at cost of e.g. transparently busy-waiting. Unlikely to be
a good fit for anything else.

If you don't like my original _first/_next/_last, you will
need to come up with something else.

-- 
MST




[PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-12 Thread Wei Wang
Add a new feature, VIRTIO_BALLOON_F_SG, which enables transferring
chunks of ballooned (i.e. inflated/deflated) pages to the host
using scatter-gather lists.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~491ms
resulting in an improvement of ~88%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 141 ++---
 drivers/virtio/virtio_ring.c| 199 +---
 include/linux/virtio.h  |  20 
 include/uapi/linux/virtio_balloon.h |   1 +
 4 files changed, 329 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..aa4e7ec 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
/* Synchronize access/update to this struct virtio_balloon elements */
struct mutex balloon_lock;
 
+   /* The xbitmap used to record ballooned pages */
+   struct xb page_xb;
+
/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb,
  page_to_balloon_pfn(page) + i);
 }
 
+/*
+ * Send balloon pages in sgs to host.
+ * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap
+ * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for
+ * continuous "1" bits, which correspond to continuous pages, to chunk into
+ * sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be serached.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ unsigned long page_xb_start,
+ unsigned long page_xb_end)
+{
+   unsigned int head_id = VIRTQUEUE_DESC_ID_INIT,
+prev_id = VIRTQUEUE_DESC_ID_INIT;
+   unsigned long sg_pfn_start, sg_pfn_end;
+   uint64_t sg_addr;
+   uint32_t sg_size;
+
+   sg_pfn_start = page_xb_start;
+   while (sg_pfn_start < page_xb_end) {
+   sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+   page_xb_end, 1);
+   if (sg_pfn_start == page_xb_end + 1)
+   break;
+   sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+ page_xb_end, 0);
+   sg_addr = sg_pfn_start << PAGE_SHIFT;
+   sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE;
+   virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id,
+&prev_id, 0);
+   xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+   sg_pfn_start = sg_pfn_end + 1;
+   }
+
+   if (head_id != VIRTQUEUE_DESC_ID_INIT) {
+   virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
+   virtqueue_kick_async(vq, vb->acked);
+   }
+}
+
+/* Update pfn_max and pfn_min according to the pfn of @page */
+static inline void update_pfn_range(struct virtio_balloon *vb,
+   struct page *page,
+   unsigned long *pfn_min,
+   unsigned long *pfn_max)
+{
+   unsigned long pfn = page_to_pfn(page);
+
+   *pfn_min = min(pfn, *pfn_min);
+   *pfn_max = max(pfn, *pfn_max);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
unsigned num_allocated_pages;
+   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+   unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
/* We can only do one array worth at a time.