Re: [virtio-dev] Re: [PATCH v5 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-21 Thread Wei Wang

On 03/20/2018 11:24 AM, Michael S. Tsirkin wrote:

On Tue, Mar 20, 2018 at 11:18:23AM +0800, Wei Wang wrote:

On 03/20/2018 10:59 AM, Michael S. Tsirkin wrote:

On Tue, Mar 20, 2018 at 10:16:09AM +0800, Wei Wang wrote:

On 03/20/2018 06:55 AM, Michael S. Tsirkin wrote:

On Mon, Mar 19, 2018 at 05:01:38PM +0800, Wei Wang wrote:

On 03/19/2018 12:24 PM, Michael S. Tsirkin wrote:

On Sun, Mar 18, 2018 at 06:36:20PM +0800, Wei Wang wrote:

On 03/16/2018 11:16 PM, Michael S. Tsirkin wrote:

On Fri, Mar 16, 2018 at 06:48:28PM +0800, Wei Wang wrote:

OTOH it seems that if thread stops nothing will wake it up
whem vm is restarted. Such bahaviour change across vmstop/vmstart
is unexpected.
I do not understand why we want to increment the counter
on vm stop though. It does make sense to stop the thread
but why not resume where we left off when vm is resumed?


I'm not sure which counter we incremented. But it would be clear if we have
a high level view of how it works (it is symmetric actually). Basically, we
start the optimization when each round starts and stop it at the end of each
round (i.e. before we do the bitmap sync), as shown below:

1) 1st Round starts --> free_page_start
2) 1st Round in progress..
3) 1st Round ends  --> free_page_stop
4) 2nd Round starts --> free_page_start
5) 2nd Round in progress..
6) 2nd Round ends --> free_page_stop
..

For example, in 2), the VM is stopped. virtio_balloon_poll_free_page_hints
finds the vq is empty (i.e. elem == NULL) and the runstate is stopped, the
optimization thread exits immediately. That is, this optimization thread is
gone forever (the optimization we can do for this round is done). We won't
know when would the VM be woken up:
A) If the VM is woken up very soon when the migration thread is still in
progress of 2), then in 4) a new optimization thread (not the same one for
the first round) will be created and start the optimization for the 2nd
round as usual (If you have questions about 3) in this case, that
free_page_stop will do nothing than just return, since the optimization
thread has exited) ;
B) If the VM is woken up after the whole migration has ended, there is still
no point in resuming the optimization.

I think this would be the simple design for the first release of this
optimization. There are possibilities to improve case A) above by continuing
optimization for the 1st Round as it is still in progress, but I think
adding that complexity for this rare case wouldn't be worthwhile (at least
for now). What would you think?


Best,
Wei

In my opinion this just makes the patch very messy.

E.g. attempts to attach a debugger to the guest will call vmstop and
then behaviour changes. This is a receipe for heisenbugs which are then
extremely painful to debug.

It is not really hard to make things symmetrical:
e.g. if you stop on vmstop then you should start on vmstart, etc.
And stopping thread should not involve a bunch of state
changes, just stop it and that's it.


"stop it" - do you mean to
1) make the thread exit (i.e.make virtio_balloon_poll_free_page_hints exit
the while loop and return NULL); or
2) keep the thread staying in the while loop but yield running (e.g.
sleep(1) or block on a mutex)? (or please let me know if you suggested a
different implementation about stopping the thread)

I would say it makes more sense to make it block on something.

BTW I still think you are engaging in premature optimization here.
What you are doing here is a "data plane for balloon".
I would make the feature work first by processing this in a BH.
Creating threads immediately opens up questions of isolation,
cgroups etc.


Could you please share more about how creating threads affects isolation and
cgroup?

When threads are coming and going, they are hard to control.

Consider the rich API libvirt exposes for controlling the io threads:

https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation



and how does BH solve it? Thanks.
Best,
Wei

It's late at night so I don't remember whether it's the emulator
or the io thread that runs the BH, but the point is it's
already controlled by libvirt.



OK. I've tried to implement it this way: create an iothread via the qemu 
cmdline option: --device 
virtio-balloon,free-page-hint=true,iothread=iothread10, and schedule a 
BH to run in the iothread context when free_page_start() is called.


I think the drawback of this method is that the iothread exists during 
the whole QEMU lifetime, which wastes CPU cycles (e.g. when live 
migration isn't in progress). The method in v5 is like migration thread, 
which comes only when the migration request is received and goes when 
migration is done. Any thought?


Best,
Wei









-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts

2018-03-21 Thread Michael S. Tsirkin
On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:
> 
> 
> On 2018年03月20日 12:26, Jonathan Helman wrote:
> > > On Mar 19, 2018, at 7:31 PM, Jason Wang  wrote:
> > > 
> > > 
> > > 
> > > On 2018年03月20日 06:14, Jonathan Helman wrote:
> > > > Export the number of successful and failed hugetlb page
> > > > allocations via the virtio balloon driver. These 2 counts
> > > > come directly from the vm_events HTLB_BUDDY_PGALLOC and
> > > > HTLB_BUDDY_PGALLOC_FAIL.
> > > > 
> > > > Signed-off-by: Jonathan Helman 
> > > Reviewed-by: Jason Wang 
> > Thanks.
> > 
> > > > ---
> > > >   drivers/virtio/virtio_balloon.c | 6 ++
> > > >   include/uapi/linux/virtio_balloon.h | 4 +++-
> > > >   2 files changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_balloon.c 
> > > > b/drivers/virtio/virtio_balloon.c
> > > > index dfe5684..6b237e3 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -272,6 +272,12 @@ static unsigned int update_balloon_stats(struct 
> > > > virtio_balloon *vb)
> > > > pages_to_bytes(events[PSWPOUT]));
> > > > update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, 
> > > > events[PGMAJFAULT]);
> > > > update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, 
> > > > events[PGFAULT]);
> > > > +#ifdef CONFIG_HUGETLB_PAGE
> > > > +   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
> > > > +   events[HTLB_BUDDY_PGALLOC]);
> > > > +   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
> > > > +   events[HTLB_BUDDY_PGALLOC_FAIL]);
> > > > +#endif
> > > >   #endif
> > > > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
> > > > pages_to_bytes(i.freeram));
> > > > diff --git a/include/uapi/linux/virtio_balloon.h 
> > > > b/include/uapi/linux/virtio_balloon.h
> > > > index 4e8b830..40297a3 100644
> > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > @@ -53,7 +53,9 @@ struct virtio_balloon_config {
> > > >   #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
> > > >   #define VIRTIO_BALLOON_S_AVAIL6   /* Available memory as in /proc 
> > > > */
> > > >   #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
> > > > -#define VIRTIO_BALLOON_S_NR   8
> > > > +#define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations 
> > > > */
> > > > +#define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation 
> > > > failures */
> > > > +#define VIRTIO_BALLOON_S_NR   10
> > > > /*
> > > >* Memory statistics structure.
> > > Not for this patch, but it looks to me that exporting such nr through 
> > > uapi is fragile.
> > Sorry, can you explain what you mean here?
> > 
> > Jon
> 
> Spec said "Within an output buffer submitted to the statsq, the device MUST
> ignore entries with tag values that it does not recognize". So exporting
> VIRTIO_BALLOON_S_NR seems useless and device implementation can not depend
> on such number in uapi.
> 
> Thanks

Suggestions? I don't like to break build for people ...

> > 
> > > Thanks
> > 
> > -
> > To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
> > 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts

2018-03-21 Thread Jason Wang



On 2018年03月20日 12:26, Jonathan Helman wrote:

On Mar 19, 2018, at 7:31 PM, Jason Wang  wrote:



On 2018年03月20日 06:14, Jonathan Helman wrote:

Export the number of successful and failed hugetlb page
allocations via the virtio balloon driver. These 2 counts
come directly from the vm_events HTLB_BUDDY_PGALLOC and
HTLB_BUDDY_PGALLOC_FAIL.

Signed-off-by: Jonathan Helman 

Reviewed-by: Jason Wang 

Thanks.


---
  drivers/virtio/virtio_balloon.c | 6 ++
  include/uapi/linux/virtio_balloon.h | 4 +++-
  2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index dfe5684..6b237e3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -272,6 +272,12 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
pages_to_bytes(events[PSWPOUT]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+#ifdef CONFIG_HUGETLB_PAGE
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
+   events[HTLB_BUDDY_PGALLOC]);
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
+   events[HTLB_BUDDY_PGALLOC_FAIL]);
+#endif
  #endif
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
pages_to_bytes(i.freeram));
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 4e8b830..40297a3 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -53,7 +53,9 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
  #define VIRTIO_BALLOON_S_AVAIL6   /* Available memory as in /proc */
  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
-#define VIRTIO_BALLOON_S_NR   8
+#define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
+#define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
+#define VIRTIO_BALLOON_S_NR   10
/*
   * Memory statistics structure.

Not for this patch, but it looks to me that exporting such nr through uapi is 
fragile.

Sorry, can you explain what you mean here?

Jon


Spec said "Within an output buffer submitted to the statsq, the device 
MUST ignore entries with tag values that it does not recognize". So 
exporting VIRTIO_BALLOON_S_NR seems useless and device implementation 
can not depend on such number in uapi.


Thanks




Thanks


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org




-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-21 Thread Jean-Philippe Brucker
On 21/03/18 06:43, Tian, Kevin wrote:
[...]
>> +
>> +#include 
>> +
>> +#define MSI_IOVA_BASE   0x800
>> +#define MSI_IOVA_LENGTH 0x10
> 
> this is ARM specific, and according to virtio-iommu spec isn't it
> better probed on the endpoint instead of hard-coding here?

These values are arbitrary, not really ARM-specific even if ARM is the
only user yet: we're just reserving a random IOVA region for mapping MSIs.
It is hard-coded because of the way iommu-dma.c works, but I don't quite
remember why that allocation isn't dynamic.

As said on the v0.6 spec thread, I'm not sure allocating the IOVA range in
the host is preferable. With nested translation the guest has to map it
anyway, and I believe dealing with IOVA allocation should be left to the
guest when possible.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-03-21 Thread Jean-Philippe Brucker
Hi Kevin,

Thanks for the comments

On 19/03/18 10:03, Tian, Kevin wrote:
> BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it 
> intended?

In my opinion BYPASS is a bit different from the other features: while the
others are needed for correctness, this one is optional and even if the
guest supports BYPASS, it should be allowed not to accept it. For security
reasons it may not want to let endpoints access the whole guest-physical
address space.

> --2.6.2 Device Requirements: Device operations--
> 
> "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, 
> the device MUST truncate the range described by virt_start 
> and virt_end in requests to ft in the range described by 
> input_range."
> 
> "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device 
> MUST ignore bits above domain_bits in field domain of requests."
> 
> shouldn't device return error upon above situation instead
> of continuing operation with modified value?

The intent was to allow device and driver to pass metadata in the top bits
of pointers and domain IDs, perhaps for a future extension or for
debugging. I'm not convinced it's useful anymore (and do wonder what my
initial reasoning was...) Such extension would be determined by a new
feature bit and the device would know that the fields contain additional
info, if the driver accepted that feature.

> --2.6.4 DETACH request--
> 
> " Detach an endpoint from its domain. When this request 
> completes, the endpoint cannot access any mapping from that 
> domain anymore "
> 
> Does it make sense to mention BYPASS situation upon this request?

Sure, I'll add something to clarify that situation

> --2.6.8.2 Property RESV_MEM --
> 
> "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to 
> virtual addresses in this region are not translated by the device. 
> They may either be aborted by the device (or the underlying 
> IOMMU), bypass it, or never even reach it"
> 
> in 3.3 hardware device assignment you described an example 
> that a reserved range is stolen by host to map the MSI 
> doorbell... such map behavior is not described above.

Right, we can say that accesses to the region may be used for host translation

> Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI.
> I know there were quite some discussion around this flag before,
> but my mental picture now still is a bit difficult to understand its
> usage based on examples in implementation notes:
> 
>   - for x86, you describe it as indicating MSI bypass for
> oxfeex. However guest doesn't need to know this fact. Only
> requirement is to treat it as reserved range (as on bare metal)
> then T_RESERVED is sufficient for this purpose>
>   - for ARM, either let guest or let host to choose a virtual
> address for mapping to MSI doorbell. the former doesn't require
> a reserved range. for the latter also T_RESERVED is enough as
> the example in hardware device assignment section.

It might be nicer to have the host decide it, but when the physical IOMMU
is ARM SMMU, nesting translation complicates things, because the guest
*has* to create a mapping:

* The guest is in charge of stage-1. It creates IOVA->GPA mapping for the
  MSI doorbell. The GPA is mapped to the physical MSI doorbell at
  stage-2 by the host.

* when it writes that IOVA in the virtual MSI-X tables, the host updates
  the physical MSI-X tables using an ioctl (it can't map the physical
  tables into the guest, because MSI-X vector data also contain physical
  device info that shouldn't be known by the guest.)

So we need support for software-mapped MSIs in the guest anyway. In Linux
the virtio-iommu driver can use existing plumbing for that, no change
needed in the guest.

> Then what's the real point of differentiating MSI bypass from
> normal reserved range? Is there other example where guest
> may use such info to do special thing?

The presence of an MSI bypass regions allows the driver and the DMA layer
(in Linux iommu_dma_map_msi_msg) to know whether it needs to create a
mapping or if it's bypassed. When writing the entry into the MSI-X table,
if the address in the MSI vector already corresponds to an MSI bypass
region, then there is no need to create a mapping. Making the bypass MSI
region a normal RESV range hides that information.

Another way of choosing would be with #ifdef CONFIG_ARM64, CONFIG_X86 etc,
but I find it nasty, and I personally prefer using MSI bypass for ARM when
possible.

> -- 3.2 Message Signaled Interrupts --
> 
> " Section 3.2.2 describes the added complexity (from the host 
> point of view) of translating the IRQ chip doorbell "
> 
> however later 3.2.2 only says one sentence about host
> complexity - " However, this mode of operations may add 
> significant complexity in the host implementation". If you plan
> to elaborate that part in the future, please add a note. :-)

Oh right, I don't really want to elaborate, I'll try to reword 3.2. The
kvmtool patches have support for both MSI bypass and translation
(selectable