date:20191204

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Alex Williamson

On Thu, 5 Dec 2019 11:49:00 +0530
Kirti Wankhede  wrote:

> On 12/5/2019 11:26 AM, Alex Williamson wrote:
> > On Thu, 5 Dec 2019 11:12:23 +0530
> > Kirti Wankhede  wrote:
> >   
> >> On 12/5/2019 6:58 AM, Yan Zhao wrote:  
> >>> On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:  
>  On Wed, 4 Dec 2019 23:40:25 +0530
>  Kirti Wankhede  wrote:
>  
> > On 12/3/2019 11:34 PM, Alex Williamson wrote:  
> >> On Mon, 25 Nov 2019 19:57:39 -0500
> >> Yan Zhao  wrote:
> >> 
> >>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
>  On Fri, 15 Nov 2019 00:26:07 +0530
>  Kirti Wankhede  wrote:
> 
> > On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> >> On Thu, 14 Nov 2019 01:07:21 +0530
> >> Kirti Wankhede  wrote:
> >>  
> >>> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
>  On Tue, 12 Nov 2019 22:33:37 +0530
>  Kirti Wankhede  wrote:
>  
> > All pages pinned by vendor driver through vfio_pin_pages API 
> > should be
> > considered as dirty during migration. IOMMU container maintains 
> > a list of
> > all such pinned pages. Added an ioctl defination to get bitmap 
> > of such  
> 
>  definition
>  
> > pinned pages for requested IO virtual address range.  
> 
>  Additionally, all mapped pages are considered dirty when 
>  physically
>  mapped through to an IOMMU, modulo we discussed devices opting 
>  in to
>  per page pinning to indicate finer granularity with a TBD 
>  mechanism to
>  figure out if any non-opt-in devices remain.
>  
> >>>
> >>> You mean, in case of device direct assignment (device pass 
> >>> through)?  
> >>
> >> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are 
> >> fully
> >> pinned and mapped, then the correct dirty page set is all mapped 
> >> pages.
> >> We discussed using the vpfn list as a mechanism for vendor drivers 
> >> to
> >> reduce their migration footprint, but we also discussed that we 
> >> would
> >> need a way to determine that all participants in the container have
> >> explicitly pinned their working pages or else we must consider the
> >> entire potential working set as dirty.
> >>  
> >
> > How can vendor driver tell this capability to iommu module? Any 
> > suggestions?  
> 
>  I think it does so by pinning pages.  Is it acceptable that if the
>  vendor driver pins any pages, then from that point forward we 
>  consider
>  the IOMMU group dirty page scope to be limited to pinned pages?  
>  There  
> >>> we should also be aware of that dirty page scope is pinned pages + 
> >>> unpinned pages,
> >>> which means ever since a page is pinned, it should be regarded as 
> >>> dirty
> >>> no matter whether it's unpinned later. only after log_sync is called 
> >>> and
> >>> dirty info retrieved, its dirty state should be cleared.  
> >>
> >> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> >> or else we'd lose information that the page potentially had been
> >> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> >> list and both the currently pinned vpfns and the dirty vpfns are walked
> >> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> >> The container would need to know that dirty tracking is enabled and
> >> only manage the dirty vpfns list when necessary.  Thanks,
> >> 
> >
> > If page is unpinned, then that page is available in free page pool for
> > others to use, then how can we say that unpinned page has valid data?
> >
> > If suppose, one driver A unpins a page and when driver B of some other
> > device gets that page and he pins it, uses it, and then unpins it, then
> > how can we say that page has valid data for driver A?
> >
> > Can you give one example where unpinned page data is considered reliable
> > and valid?  
> 
>  We can only pin pages that the user has already allocated* and mapped
>  through the vfio DMA API.  The pinning of the page simply locks the
>  page for the vendor driver to access it and unpinning that page only
>  indicates that access is complete.  Pages are not freed when a vendor
>  driver unpins them, they still exist and at this point we're now
>  assuming the device dirtied the page while it was pinned.  Thanks,
> 
>  Alex
> 
>  *

Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-04 Thread Jason Wang


Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SR-IOV devices, VFs are passed through to the guest directly without host
driver mediation. However, when VMs migrate with passed-through VFs,
dynamic host mediation is required to (1) get device states and (2) get
dirty pages. Since device states, as well as other critical information
required for dirty page tracking for VFs, are usually retrieved from PFs,
it is handy to provide an extension in the PF driver to centrally control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and



A silly question, what's the reason for doing this, is this a must for 
dirty page tracking?




  (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


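                                 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
                                |                                 |
 __________ register mediate ops|  ___________     ___________    |
|          |<-------------------|-|    VF     |   |           |   |
| vfio-pci |                    | |  mediate  |   | PF driver |   |
|__________|------------------->|-|  driver   |   |___________|   |
     |        open(pdev)        |  -----------          |         |
     |                          |                       |         |
     |                          |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
    \|/                                                \|/
-----------                                        ------------
|    VF   |                                        |    PF    |
-----------                                        ------------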


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than binding directly to the VF, the VF mediate driver registers a
mediate ops into vfio-pci at driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on whether it
supports the pdev.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.
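
The open-time lookup described above can be sketched as a first-match walk
over the registered ops. This is only an illustration: the struct layouts,
the sketch_* names, the fixed-size array standing in for the list, and the
two example drivers are all assumptions, not the patchset's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only devfn matters here. */
struct sketch_pdev {
    unsigned int devfn;
};

/* Hypothetical shape of a mediate ops; the real signatures may differ. */
struct sketch_mediate_ops {
    const char *name;
    /* Returns 0 and fills *handle if the driver supports pdev, else < 0. */
    int (*open)(const struct sketch_pdev *pdev, unsigned long *handle);
};

#define SKETCH_MAX_OPS 8
static const struct sketch_mediate_ops *sketch_ops_list[SKETCH_MAX_OPS];
static size_t sketch_ops_count;

/* Registration may happen before any device is bound. */
static int sketch_register_mediate_ops(const struct sketch_mediate_ops *ops)
{
    assert(ops != NULL && ops->open != NULL);
    if (sketch_ops_count == SKETCH_MAX_OPS)
        return -1;
    sketch_ops_list[sketch_ops_count++] = ops;
    return 0;
}

/* Walk the list and stop at the first ops whose open() succeeds. */
static const struct sketch_mediate_ops *
sketch_open_device(const struct sketch_pdev *pdev, unsigned long *handle)
{
    for (size_t i = 0; i < sketch_ops_count; i++) {
        const struct sketch_mediate_ops *ops = sketch_ops_list[i];

        if (ops->open(pdev, handle) == 0)
            return ops;   /* bind the device to this ops and its handle */
    }
    return NULL;          /* no mediate driver claimed the device */
}

/* Two example drivers, each claiming one devfn. */
static int sketch_drv_a_open(const struct sketch_pdev *pdev,
                             unsigned long *handle)
{
    if (pdev->devfn != 0x10)
        return -1;
    *handle = 0xA0;
    return 0;
}

static int sketch_drv_b_open(const struct sketch_pdev *pdev,
                             unsigned long *handle)
{
    if (pdev->devfn != 0x11)
        return -1;
    *handle = 0xB0;
    return 0;
}

static const struct sketch_mediate_ops sketch_drv_a = {
    "drv_a", sketch_drv_a_open
};
static const struct sketch_mediate_ops sketch_drv_b = {
    "drv_b", sketch_drv_b_open
};
```

With both example drivers registered, opening a device with devfn 0x11 skips
drv_a and binds to drv_b; a device that no driver claims gets no mediate ops.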

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.
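
As a rough sketch of the 'pt' convention for rw: the mediate hook runs first
and reports whether the access should still go to hardware. The hook
signature, the sketch_* names, the 16-byte fake BAR and the trapped register
are all assumptions made for illustration; the patchset's real interface may
differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte "hardware" BAR standing in for the real device. */
static unsigned char sketch_hw_bar[16];

/* Hypothetical mediate rw hook: sets *pt to request passthrough or not. */
typedef long (*sketch_mediate_rw_fn)(char *buf, size_t count, size_t pos,
                                     bool iswrite, bool *pt);

/* Default passthrough path: access the fake BAR directly. */
static long sketch_hw_rw(char *buf, size_t count, size_t pos, bool iswrite)
{
    assert(buf != NULL);
    if (pos > sizeof(sketch_hw_bar) || count > sizeof(sketch_hw_bar) - pos)
        return -1;
    if (iswrite)
        memcpy(sketch_hw_bar + pos, buf, count);
    else
        memcpy(buf, sketch_hw_bar + pos, count);
    return (long)count;
}

/* Call the mediate hook first; go to hardware only if it asks for pt. */
static long sketch_rw(sketch_mediate_rw_fn mediate, char *buf, size_t count,
                      size_t pos, bool iswrite)
{
    bool pt = true;

    if (mediate) {
        long ret = mediate(buf, count, pos, iswrite, &pt);

        if (ret < 0 || !pt)
            return ret;   /* error, or the access was fully emulated */
    }
    return sketch_hw_rw(buf, count, pos, iswrite);
}

/* Example hook: emulate 1-byte reads of offset 0, pass the rest through. */
static long sketch_trap_reg0(char *buf, size_t count, size_t pos,
                             bool iswrite, bool *pt)
{
    if (pos == 0 && count == 1 && !iswrite) {
        buf[0] = (char)0xAB;
        *pt = false;
        return 1;
    }
    *pt = true;
    return 0;
}
```

Here a read of offset 0 is answered by the hook alone, while any other access,
or any access with no hook installed, reaches the fake BAR unchanged.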

When vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF
mediate driver differentiate two open VFs that share the same device
id and vendor id.

When VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow vendor driver customizing default
region info/rw/mmap of a region.
(2) provide a migration region to support migration



What's the benefit of introducing a region? It looks to me we don't 
expect the region to be accessed directly from guest. Could we simply 
extend device fd ioctl for doing such things?




(3) provide a dynamic trap bar info region to allow vendor driver
control trap/untrap of device pci bars

This vfio-pci + mediate ops way differs from mdev way in that
(1) mdev way needs to create a 1:1 mdev device on top of one VF, device
specific mdev parent driver is bound to VF directly.
(2) vfio-pci + mediate ops way does not create mdev devices and VF
mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.

The reason why we don't choose the way of writing mdev parent driver is
that
(1) VFs are almost all the time directly passthroughed. Directly binding
to vfio-pci can make most of the code shared/reused.



Can we split out the common parts from vfio-pci?



  If we write a
vendor specific mdev parent driver, most of the code (like passthrough
style of rw/mmap) still needs to be copied from vfio-pci driver, which is
actually a duplicated and tedious work.



The mediate ops looks quite similar to what vfio-mdev did. And it looks 
to me we need to consider live migration for mdev as well. In that case, 
do we still expect

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Kirti Wankhede





On 12/5/2019 11:26 AM, Alex Williamson wrote:

On Thu, 5 Dec 2019 11:12:23 +0530
Kirti Wankhede  wrote:


On 12/5/2019 6:58 AM, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:

On Wed, 4 Dec 2019 23:40:25 +0530
Kirti Wankhede  wrote:
  

On 12/3/2019 11:34 PM, Alex Williamson wrote:

On Mon, 25 Nov 2019 19:57:39 -0500
Yan Zhao  wrote:
  

On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:

On Fri, 15 Nov 2019 00:26:07 +0530
Kirti Wankhede  wrote:
 

On 11/14/2019 1:37 AM, Alex Williamson wrote:

On Thu, 14 Nov 2019 01:07:21 +0530
Kirti Wankhede  wrote:
   

On 11/13/2019 4:00 AM, Alex Williamson wrote:

On Tue, 12 Nov 2019 22:33:37 +0530
Kirti Wankhede  wrote:
  

All pages pinned by vendor driver through vfio_pin_pages API should be
considered as dirty during migration. IOMMU container maintains a list of
all such pinned pages. Added an ioctl defination to get bitmap of such


definition
  

pinned pages for requested IO virtual address range.


Additionally, all mapped pages are considered dirty when physically
mapped through to an IOMMU, modulo we discussed devices opting in to
per page pinning to indicate finer granularity with a TBD mechanism to
figure out if any non-opt-in devices remain.
  


You mean, in case of device direct assignment (device pass through)?


Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
pinned and mapped, then the correct dirty page set is all mapped pages.
We discussed using the vpfn list as a mechanism for vendor drivers to
reduce their migration footprint, but we also discussed that we would
need a way to determine that all participants in the container have
explicitly pinned their working pages or else we must consider the
entire potential working set as dirty.
   


How can vendor driver tell this capability to iommu module? Any suggestions?


I think it does so by pinning pages.  Is it acceptable that if the
vendor driver pins any pages, then from that point forward we consider
the IOMMU group dirty page scope to be limited to pinned pages?  There

we should also be aware of that dirty page scope is pinned pages + unpinned 
pages,
which means ever since a page is pinned, it should be regarded as dirty
no matter whether it's unpinned later. only after log_sync is called and
dirty info retrieved, its dirty state should be cleared.


Yes, good point.  We can't just remove a vpfn when a page is unpinned
or else we'd lose information that the page potentially had been
dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
list and both the currently pinned vpfns and the dirty vpfns are walked
on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
The container would need to know that dirty tracking is enabled and
only manage the dirty vpfns list when necessary.  Thanks,
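
The bookkeeping suggested above (report currently pinned pages plus pages
unpinned since the last sync, then clear only the latter) might look roughly
like this. It is a sketch under loud assumptions: a 64-bit bitmap replaces
the vpfn lists, and struct sketch_container and the sketch_* functions are
hypothetical, not the vfio type1 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NPAGES 64u   /* hypothetical size of the tracked IOVA range */

/* Hypothetical container state; not the real struct vfio_iommu. */
struct sketch_container {
    bool     tracking;        /* dirty tracking enabled (migration active) */
    uint64_t pinned;          /* bit n set: page n is currently pinned */
    uint64_t unpinned_dirty;  /* bit n set: page n unpinned since last sync */
};

static void sketch_pin(struct sketch_container *c, unsigned int page)
{
    assert(page < SKETCH_NPAGES);
    c->pinned |= UINT64_C(1) << page;
}

static void sketch_unpin(struct sketch_container *c, unsigned int page)
{
    uint64_t bit;

    assert(page < SKETCH_NPAGES);
    bit = UINT64_C(1) << page;
    if (!(c->pinned & bit))
        return;
    c->pinned &= ~bit;
    /* Remember the page only while tracking: it may have been dirtied. */
    if (c->tracking)
        c->unpinned_dirty |= bit;
}

/* Report pinned plus recently unpinned pages, then forget the latter. */
static uint64_t sketch_log_sync(struct sketch_container *c)
{
    uint64_t dirty = c->pinned | c->unpinned_dirty;

    c->unpinned_dirty = 0;
    return dirty;
}
```

A page that was pinned and then unpinned is reported once more by the next
sync and dropped afterwards, while a still-pinned page keeps being reported.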
  


If page is unpinned, then that page is available in free page pool for
others to use, then how can we say that unpinned page has valid data?

If suppose, one driver A unpins a page and when driver B of some other
device gets that page and he pins it, uses it, and then unpins it, then
how can we say that page has valid data for driver A?

Can you give one example where unpinned page data is considered reliable
and valid?


We can only pin pages that the user has already allocated* and mapped
through the vfio DMA API.  The pinning of the page simply locks the
page for the vendor driver to access it and unpinning that page only
indicates that access is complete.  Pages are not freed when a vendor
driver unpins them, they still exist and at this point we're now
assuming the device dirtied the page while it was pinned.  Thanks,

Alex

* An exception here is that the page might be demand allocated and the
act of pinning the page could actually allocate the backing page for
the user if they have not faulted the page to trigger that allocation
previously.  That page remains mapped for the user's virtual address
space even after the unpinning though.
  


Yes, I can give an example in GVT.
when a gem_object is allocated in guest, before submitting it to guest
vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
global graphics address for hardware access. At that time, we shadow
those cmds and pin pages through vfio pin_pages(), and submit the shadow
gem_object to physical hardware.
After guest driver thinks the submitted gem_object has completed hardware
DMA, it unpins those pinned GGTT graphics memory addresses. Then in
host, we unpin the shadow pages through vfio unpin_pages.
But, at this point, guest driver is still free to access the gem_object
through vCPUs, and guest user space is probably still mapping an object
into the gem_object in guest driver.
So, missing the dirty page tracking for unpinned pages would cause
data inconsistency.
   


If pages are accessed by guest through vCPUs, then RAM module

Re: [PATCH v2] virtio-pci: disable vring processing when bus-mastering is disabled

2019-12-04 Thread Alexey Kardashevskiy

Hi,

I was wondering if this is going anywhere or if SLOF is still expected
to get fixed and if it is SLOF, then what exactly in SLOF's behaviour is
incorrect and requires fixing? I am a bit lost here. Thanks,




On 20/11/2019 11:50, Michael Roth wrote:
> Currently the SLOF firmware for pseries guests will disable/re-enable
> a PCI device multiple times via IO/MEM/MASTER bits of PCI_COMMAND
> register after the initial probe/feature negotiation, as it tends to
> work with a single device at a time at various stages like probing
> and running block/network bootloaders without doing a full reset
> in-between.
> 
> In QEMU, when PCI_COMMAND_MASTER is disabled we disable the
> corresponding IOMMU memory region, so DMA accesses (including to vring
> fields like idx/flags) will no longer undergo the necessary
> translation. Normally we wouldn't expect this to happen since it would
> be misbehavior on the driver side to continue driving DMA requests.
> 
> However, in the case of pseries, with iommu_platform=on, we trigger the
> following sequence when tearing down the virtio-blk dataplane ioeventfd
> in response to the guest unsetting PCI_COMMAND_MASTER:
> 
>   #2  0x55922651 in virtqueue_map_desc 
> (vdev=vdev@entry=0x56dbcfb0, p_num_sg=p_num_sg@entry=0x7fffe657e1a8, 
> addr=addr@entry=0x7fffe657e240, iov=iov@entry=0x7fffe6580240, 
> max_num_sg=max_num_sg@entry=1024, is_write=is_write@entry=false, pa=0, sz=0)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:757
>   #3  0x55922a89 in virtqueue_pop (vq=vq@entry=0x56dc8660, 
> sz=sz@entry=184)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:950
>   #4  0x558d3eca in virtio_blk_get_request (vq=0x56dc8660, 
> s=0x56dbcfb0)
>   at /home/mdroth/w/qemu.git/hw/block/virtio-blk.c:255
>   #5  0x558d3eca in virtio_blk_handle_vq (s=0x56dbcfb0, 
> vq=0x56dc8660)
>   at /home/mdroth/w/qemu.git/hw/block/virtio-blk.c:776
>   #6  0x5591dd66 in virtio_queue_notify_aio_vq 
> (vq=vq@entry=0x56dc8660)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:1550
>   #7  0x5591ecef in virtio_queue_notify_aio_vq (vq=0x56dc8660)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:1546
>   #8  0x5591ecef in virtio_queue_host_notifier_aio_poll 
> (opaque=0x56dc86c8)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:2527
>   #9  0x55d02164 in run_poll_handlers_once 
> (ctx=ctx@entry=0x5688bfc0, timeout=timeout@entry=0x7fffe65844a8)
>   at /home/mdroth/w/qemu.git/util/aio-posix.c:520
>   #10 0x55d02d1b in try_poll_mode (timeout=0x7fffe65844a8, 
> ctx=0x5688bfc0)
>   at /home/mdroth/w/qemu.git/util/aio-posix.c:607
>   #11 0x55d02d1b in aio_poll (ctx=ctx@entry=0x5688bfc0, 
> blocking=blocking@entry=true)
>   at /home/mdroth/w/qemu.git/util/aio-posix.c:639
>   #12 0x55d0004d in aio_wait_bh_oneshot (ctx=0x5688bfc0, 
> cb=cb@entry=0x558d5130 , 
> opaque=opaque@entry=0x56de86f0)
>   at /home/mdroth/w/qemu.git/util/aio-wait.c:71
>   #13 0x558d59bf in virtio_blk_data_plane_stop (vdev=)
>   at /home/mdroth/w/qemu.git/hw/block/dataplane/virtio-blk.c:288
>   #14 0x55b906a1 in virtio_bus_stop_ioeventfd 
> (bus=bus@entry=0x56dbcf38)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio-bus.c:245
>   #15 0x55b90dbb in virtio_bus_stop_ioeventfd 
> (bus=bus@entry=0x56dbcf38)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio-bus.c:237
>   #16 0x55b92a8e in virtio_pci_stop_ioeventfd (proxy=0x56db4e40)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio-pci.c:292
>   #17 0x55b92a8e in virtio_write_config (pci_dev=0x56db4e40, 
> address=, val=1048832, len=)
>   at /home/mdroth/w/qemu.git/hw/virtio/virtio-pci.c:613
> 
> I.e. the calling code is only scheduling a one-shot BH for
> virtio_blk_data_plane_stop_bh, but somehow we end up trying to process
> an additional virtqueue entry before we get there. This is likely due
> to the following check in virtio_queue_host_notifier_aio_poll:
> 
>   static bool virtio_queue_host_notifier_aio_poll(void *opaque)
>   {
>   EventNotifier *n = opaque;
>   VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
>   bool progress;
> 
>   if (!vq->vring.desc || virtio_queue_empty(vq)) {
>   return false;
>   }
> 
>   progress = virtio_queue_notify_aio_vq(vq);
> 
> namely the call to virtio_queue_empty(). In this case, since no new
> requests have actually been issued, shadow_avail_idx == last_avail_idx,
> so we actually try to access the vring via vring_avail_idx() to get
> the latest non-shadowed idx:
> 
>   int virtio_queue_empty(VirtQueue *vq)
>   {
>   bool empty;
>   ...
> 
>   if (vq->shadow_avail_idx != vq->last_avail_idx) {
>   return 0;
>   }
> 
>   rcu_read_lock();
>   empty = vring_avail_idx(vq) == vq->last_avail_idx;
>   rcu_read_unlock();
>
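
The short-circuit described above can be modeled in isolation: when the
shadow index already differs from last_avail_idx the queue is known to be
non-empty without touching guest memory, otherwise the avail index has to be
read from the vring. The struct and the sketch_* functions below are a
reduced, hypothetical model, not QEMU's VirtQueue code; a plain pointer read
stands in for the translated DMA access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced model of a virtqueue; not QEMU's VirtQueue. */
struct sketch_vq {
    uint16_t shadow_avail_idx;        /* cached copy of the avail index */
    uint16_t last_avail_idx;          /* next entry the device consumes */
    const uint16_t *guest_avail_idx;  /* stands in for the vring DMA read */
    unsigned int guest_reads;         /* counts guest memory accesses */
};

/* Read the avail index from "guest memory" and refresh the shadow copy. */
static uint16_t sketch_vring_avail_idx(struct sketch_vq *vq)
{
    assert(vq->guest_avail_idx != NULL);
    vq->guest_reads++;
    vq->shadow_avail_idx = *vq->guest_avail_idx;
    return vq->shadow_avail_idx;
}

static bool sketch_queue_empty(struct sketch_vq *vq)
{
    /* Known-pending entries: no guest memory access is needed. */
    if (vq->shadow_avail_idx != vq->last_avail_idx)
        return false;
    /* Otherwise the vring itself must be consulted. */
    return sketch_vring_avail_idx(vq) == vq->last_avail_idx;
}
```

In this model the guest read happens only when the two cached indices are
equal, which is the path the backtrace above ends up taking.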

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Alex Williamson

On Thu, 5 Dec 2019 11:12:23 +0530
Kirti Wankhede  wrote:

> On 12/5/2019 6:58 AM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:  
> >> On Wed, 4 Dec 2019 23:40:25 +0530
> >> Kirti Wankhede  wrote:
> >>  
> >>> On 12/3/2019 11:34 PM, Alex Williamson wrote:  
>  On Mon, 25 Nov 2019 19:57:39 -0500
>  Yan Zhao  wrote:
>   
> > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> >> On Fri, 15 Nov 2019 00:26:07 +0530
> >> Kirti Wankhede  wrote:
> >> 
> >>> On 11/14/2019 1:37 AM, Alex Williamson wrote:  
>  On Thu, 14 Nov 2019 01:07:21 +0530
>  Kirti Wankhede  wrote:
>    
> > On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >> On Tue, 12 Nov 2019 22:33:37 +0530
> >> Kirti Wankhede  wrote:
> >>  
> >>> All pages pinned by vendor driver through vfio_pin_pages API 
> >>> should be
> >>> considered as dirty during migration. IOMMU container maintains a 
> >>> list of
> >>> all such pinned pages. Added an ioctl defination to get bitmap of 
> >>> such  
> >>
> >> definition
> >>  
> >>> pinned pages for requested IO virtual address range.  
> >>
> >> Additionally, all mapped pages are considered dirty when physically
> >> mapped through to an IOMMU, modulo we discussed devices opting in 
> >> to
> >> per page pinning to indicate finer granularity with a TBD 
> >> mechanism to
> >> figure out if any non-opt-in devices remain.
> >>  
> >
> > You mean, in case of device direct assignment (device pass 
> > through)?  
> 
>  Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>  pinned and mapped, then the correct dirty page set is all mapped 
>  pages.
>  We discussed using the vpfn list as a mechanism for vendor drivers to
>  reduce their migration footprint, but we also discussed that we would
>  need a way to determine that all participants in the container have
>  explicitly pinned their working pages or else we must consider the
>  entire potential working set as dirty.
>    
> >>>
> >>> How can vendor driver tell this capability to iommu module? Any 
> >>> suggestions?  
> >>
> >> I think it does so by pinning pages.  Is it acceptable that if the
> >> vendor driver pins any pages, then from that point forward we consider
> >> the IOMMU group dirty page scope to be limited to pinned pages?  There 
> >>  
> > we should also be aware of that dirty page scope is pinned pages + 
> > unpinned pages,
> > which means ever since a page is pinned, it should be regarded as dirty
> > no matter whether it's unpinned later. only after log_sync is called and
> > dirty info retrieved, its dirty state should be cleared.  
> 
>  Yes, good point.  We can't just remove a vpfn when a page is unpinned
>  or else we'd lose information that the page potentially had been
>  dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
>  list and both the currently pinned vpfns and the dirty vpfns are walked
>  on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
>  The container would need to know that dirty tracking is enabled and
>  only manage the dirty vpfns list when necessary.  Thanks,
>   
> >>>
> >>> If page is unpinned, then that page is available in free page pool for
> >>> others to use, then how can we say that unpinned page has valid data?
> >>>
> >>> If suppose, one driver A unpins a page and when driver B of some other
> >>> device gets that page and he pins it, uses it, and then unpins it, then
> >>> how can we say that page has valid data for driver A?
> >>>
> >>> Can you give one example where unpinned page data is considered reliable
> >>> and valid?  
> >>
> >> We can only pin pages that the user has already allocated* and mapped
> >> through the vfio DMA API.  The pinning of the page simply locks the
> >> page for the vendor driver to access it and unpinning that page only
> >> indicates that access is complete.  Pages are not freed when a vendor
> >> driver unpins them, they still exist and at this point we're now
> >> assuming the device dirtied the page while it was pinned.  Thanks,
> >>
> >> Alex
> >>
> >> * An exception here is that the page might be demand allocated and the
> >>act of pinning the page could actually allocate the backing page for
> >>the user if they have not faulted the page to trigger that allocation
> >>previously.  That page remains mapped for the user's virtual address
> >>space even after the unpinning though.
> >>  
> > 
> > Yes, I can give an example in GVT.
> > when a gem_object is

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Yan Zhao

On Thu, Dec 05, 2019 at 01:42:23PM +0800, Kirti Wankhede wrote:
> 
> 
> On 12/5/2019 6:58 AM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
> >> On Wed, 4 Dec 2019 23:40:25 +0530
> >> Kirti Wankhede  wrote:
> >>
> >>> On 12/3/2019 11:34 PM, Alex Williamson wrote:
>  On Mon, 25 Nov 2019 19:57:39 -0500
>  Yan Zhao  wrote:
> 
> > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
> >> On Fri, 15 Nov 2019 00:26:07 +0530
> >> Kirti Wankhede  wrote:
> >>   
> >>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
>  On Thu, 14 Nov 2019 01:07:21 +0530
>  Kirti Wankhede  wrote:
>  
> > On 11/13/2019 4:00 AM, Alex Williamson wrote:
> >> On Tue, 12 Nov 2019 22:33:37 +0530
> >> Kirti Wankhede  wrote:
> >>
> >>> All pages pinned by vendor driver through vfio_pin_pages API 
> >>> should be
> >>> considered as dirty during migration. IOMMU container maintains a 
> >>> list of
> >>> all such pinned pages. Added an ioctl defination to get bitmap of 
> >>> such
> >>
> >> definition
> >>
> >>> pinned pages for requested IO virtual address range.
> >>
> >> Additionally, all mapped pages are considered dirty when physically
> >> mapped through to an IOMMU, modulo we discussed devices opting in 
> >> to
> >> per page pinning to indicate finer granularity with a TBD 
> >> mechanism to
> >> figure out if any non-opt-in devices remain.
> >>
> >
> > You mean, in case of device direct assignment (device pass through)?
> 
>  Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>  pinned and mapped, then the correct dirty page set is all mapped 
>  pages.
>  We discussed using the vpfn list as a mechanism for vendor drivers to
>  reduce their migration footprint, but we also discussed that we would
>  need a way to determine that all participants in the container have
>  explicitly pinned their working pages or else we must consider the
>  entire potential working set as dirty.
>  
> >>>
> >>> How can vendor driver tell this capability to iommu module? Any 
> >>> suggestions?
> >>
> >> I think it does so by pinning pages.  Is it acceptable that if the
> >> vendor driver pins any pages, then from that point forward we consider
> >> the IOMMU group dirty page scope to be limited to pinned pages?  There
> > we should also be aware of that dirty page scope is pinned pages + 
> > unpinned pages,
> > which means ever since a page is pinned, it should be regarded as dirty
> > no matter whether it's unpinned later. only after log_sync is called and
> > dirty info retrieved, its dirty state should be cleared.
> 
>  Yes, good point.  We can't just remove a vpfn when a page is unpinned
>  or else we'd lose information that the page potentially had been
>  dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
>  list and both the currently pinned vpfns and the dirty vpfns are walked
>  on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
>  The container would need to know that dirty tracking is enabled and
>  only manage the dirty vpfns list when necessary.  Thanks,
> 
> >>>
> >>> If page is unpinned, then that page is available in free page pool for
> >>> others to use, then how can we say that unpinned page has valid data?
> >>>
> >>> If suppose, one driver A unpins a page and when driver B of some other
> >>> device gets that page and he pins it, uses it, and then unpins it, then
> >>> how can we say that page has valid data for driver A?
> >>>
> >>> Can you give one example where unpinned page data is considered reliable
> >>> and valid?
> >>
> >> We can only pin pages that the user has already allocated* and mapped
> >> through the vfio DMA API.  The pinning of the page simply locks the
> >> page for the vendor driver to access it and unpinning that page only
> >> indicates that access is complete.  Pages are not freed when a vendor
> >> driver unpins them, they still exist and at this point we're now
> >> assuming the device dirtied the page while it was pinned.  Thanks,
> >>
> >> Alex
> >>
> >> * An exception here is that the page might be demand allocated and the
> >>act of pinning the page could actually allocate the backing page for
> >>the user if they have not faulted the page to trigger that allocation
> >>previously.  That page remains mapped for the user's virtual address
> >>space even after the unpinning though.
> >>
> > 
> > Yes, I can give an example in GVT.
> > when a gem_object is allocated in guest, before submitting it to guest
> > vGPU,

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Kirti Wankhede





On 12/5/2019 6:58 AM, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:

On Wed, 4 Dec 2019 23:40:25 +0530
Kirti Wankhede  wrote:


On 12/3/2019 11:34 PM, Alex Williamson wrote:

On Mon, 25 Nov 2019 19:57:39 -0500
Yan Zhao  wrote:
   

On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:

On Fri, 15 Nov 2019 00:26:07 +0530
Kirti Wankhede  wrote:
  

On 11/14/2019 1:37 AM, Alex Williamson wrote:

On Thu, 14 Nov 2019 01:07:21 +0530
Kirti Wankhede  wrote:


On 11/13/2019 4:00 AM, Alex Williamson wrote:

On Tue, 12 Nov 2019 22:33:37 +0530
Kirti Wankhede  wrote:
   

All pages pinned by vendor driver through vfio_pin_pages API should be
considered as dirty during migration. IOMMU container maintains a list of
all such pinned pages. Added an ioctl defination to get bitmap of such


definition
   

pinned pages for requested IO virtual address range.


Additionally, all mapped pages are considered dirty when physically
mapped through to an IOMMU, modulo we discussed devices opting in to
per page pinning to indicate finer granularity with a TBD mechanism to
figure out if any non-opt-in devices remain.
   


You mean, in case of device direct assignment (device pass through)?


Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
pinned and mapped, then the correct dirty page set is all mapped pages.
We discussed using the vpfn list as a mechanism for vendor drivers to
reduce their migration footprint, but we also discussed that we would
need a way to determine that all participants in the container have
explicitly pinned their working pages or else we must consider the
entire potential working set as dirty.



How can vendor driver tell this capability to iommu module? Any suggestions?


I think it does so by pinning pages.  Is it acceptable that if the
vendor driver pins any pages, then from that point forward we consider
the IOMMU group dirty page scope to be limited to pinned pages?  There

we should also be aware of that dirty page scope is pinned pages + unpinned 
pages,
which means ever since a page is pinned, it should be regarded as dirty
no matter whether it's unpinned later. only after log_sync is called and
dirty info retrieved, its dirty state should be cleared.


Yes, good point.  We can't just remove a vpfn when a page is unpinned
or else we'd lose information that the page potentially had been
dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
list and both the currently pinned vpfns and the dirty vpfns are walked
on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
The container would need to know that dirty tracking is enabled and
only manage the dirty vpfns list when necessary.  Thanks,
   


If page is unpinned, then that page is available in free page pool for
others to use, then how can we say that unpinned page has valid data?

If suppose, one driver A unpins a page and when driver B of some other
device gets that page and he pins it, uses it, and then unpins it, then
how can we say that page has valid data for driver A?

Can you give one example where unpinned page data is considered reliable
and valid?


We can only pin pages that the user has already allocated* and mapped
through the vfio DMA API.  The pinning of the page simply locks the
page for the vendor driver to access it and unpinning that page only
indicates that access is complete.  Pages are not freed when a vendor
driver unpins them, they still exist and at this point we're now
assuming the device dirtied the page while it was pinned.  Thanks,

Alex

* An exception here is that the page might be demand allocated and the
   act of pinning the page could actually allocate the backing page for
   the user if they have not faulted the page to trigger that allocation
   previously.  That page remains mapped for the user's virtual address
   space even after the unpinning though.



Yes, I can give an example in GVT.
When a gem_object is allocated in the guest, before submitting it to the guest
vGPU, gfx cmds in its ring buffer need to be pinned into GGTT to get a
global graphics address for hardware access. At that time, we shadow
those cmds and pin pages through vfio pin_pages(), and submit the shadow
gem_object to physical hardware.
After the guest driver thinks the submitted gem_object has completed hardware
DMA, it unpins those pinned GGTT graphics memory addresses. Then in the
host, we unpin the shadow pages through vfio unpin_pages.
But at this point, the guest driver is still free to access the gem_object
through vCPUs, and guest user space is probably still mapping an object
into the gem_object in the guest driver.
So, missing the dirty page tracking for unpinned pages would cause
data inconsistency.



If pages are accessed by guest through vCPUs, then RAM module in QEMU 
will take care of tracking those pages as dirty.


All unpinned pages might not be used, so tracking all unpinned pages 
during VM

Re: [PATCH v17 6/7] migration: Include migration support for machine check handling

2019-12-04 Thread Ganesh




On 11/19/19 8:15 AM, David Gibson wrote:

On Thu, Oct 24, 2019 at 01:13:06PM +0530, Ganesh Goudar wrote:

From: Aravinda Prasad 

This patch includes migration support for machine check
handling. Especially this patch blocks VM migration
requests until the machine check error handling is
complete as these errors are specific to the source
hardware and are irrelevant on the target hardware.

[Do not set FWNMI cap in post_load, now it's done in .apply hook]
Signed-off-by: Ganesh Goudar 
Signed-off-by: Aravinda Prasad 
---
  hw/ppc/spapr.c | 41 +
  hw/ppc/spapr_events.c  | 16 +++-
  hw/ppc/spapr_rtas.c|  2 ++
  include/hw/ppc/spapr.h |  2 ++
  4 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 346ec5ba6c..e0d0f95ec0 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -46,6 +46,7 @@
  #include "migration/qemu-file-types.h"
  #include "migration/global_state.h"
  #include "migration/register.h"
+#include "migration/blocker.h"
  #include "mmu-hash64.h"
  #include "mmu-book3s-v3.h"
  #include "cpu-models.h"
@@ -1751,6 +1752,8 @@ static void spapr_machine_reset(MachineState *machine)
  
  /* Signal all vCPUs waiting on this condition */

  qemu_cond_broadcast(&spapr->mc_delivery_cond);
+
+migrate_del_blocker(spapr->fwnmi_migration_blocker);
  }
  
  static void spapr_create_nvram(SpaprMachineState *spapr)

@@ -2041,6 +2044,43 @@ static const VMStateDescription vmstate_spapr_dtb = {
  },
  };
  
+static bool spapr_fwnmi_needed(void *opaque)

+{
+SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+
+return spapr->guest_machine_check_addr != -1;
+}
+
+static int spapr_fwnmi_pre_save(void *opaque)
+{
+SpaprMachineState *spapr = (SpaprMachineState *)opaque;
+
+/*
+ * With -only-migratable QEMU option, we cannot block migration.
+ * Hence check if machine check handling is in progress and print
+ * a warning message.
+ */

IIUC the logic below this could also occur in the case where the fwnmi
event occurs after a migration has started, but before it completes,
not just with -only-migratable.  Is that correct?

Yes



+if (spapr->mc_status != -1) {
+warn_report("A machine check is being handled during migration. The"
+"handler may run and log hardware error on the destination");
+}
+
+return 0;
+}
+
+static const VMStateDescription vmstate_spapr_machine_check = {
+.name = "spapr_machine_check",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = spapr_fwnmi_needed,
+.pre_save = spapr_fwnmi_pre_save,
+.fields = (VMStateField[]) {
+VMSTATE_UINT64(guest_machine_check_addr, SpaprMachineState),
+VMSTATE_INT32(mc_status, SpaprMachineState),
+VMSTATE_END_OF_LIST()
+},
+};
+
  static const VMStateDescription vmstate_spapr = {
  .name = "spapr",
  .version_id = 3,
@@ -2075,6 +2115,7 @@ static const VMStateDescription vmstate_spapr = {
  &vmstate_spapr_cap_large_decr,
  &vmstate_spapr_cap_ccf_assist,
  &vmstate_spapr_cap_fwnmi,
+&vmstate_spapr_machine_check,
  NULL
  }
  };
diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
index db44e09154..30d9371c88 100644
--- a/hw/ppc/spapr_events.c
+++ b/hw/ppc/spapr_events.c
@@ -43,6 +43,7 @@
  #include "qemu/main-loop.h"
  #include "hw/ppc/spapr_ovec.h"
  #include 
+#include "migration/blocker.h"
  
  #define RTAS_LOG_VERSION_MASK   0xff00

  #define   RTAS_LOG_VERSION_60x0600
@@ -842,6 +843,8 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
  {
  SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
  CPUState *cs = CPU(cpu);
+int ret;
+Error *local_err = NULL;
  
  if (spapr->guest_machine_check_addr == -1) {

  /*
@@ -871,8 +874,19 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
  return;
  }
  }
-spapr->mc_status = cpu->vcpu_id;
  
+ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);

+if (ret == -EBUSY) {
+/*
+ * We don't want to abort so we let the migration to continue.
+ * In a rare case, the machine check handler will run on the target.
+ * Though this is not preferable, it is better than aborting
+ * the migration or killing the VM.
+ */
+warn_report_err(local_err);

I suspect the error message in local_err won't be particularly
meaningful on its own.  Perhaps you need to add a prefix to clarify
that the problem is you've received a fwnmi after migration has
commenced?

ok



+}
+
+spapr->mc_status = cpu->vcpu_id;
  spapr_mce_dispatch_elog(cpu, recovered);
  }
  
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c

index 0328b1f341..c78d96ee7e 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -50,6 +50,7 @@
  #include "hw/ppc/fdt.h"
  #include "target/ppc/mmu-hash64.h"

Re: [RFC] QEMU Gating CI

2019-12-04 Thread Cleber Rosa

On Tue, Dec 03, 2019 at 05:54:38PM +, Peter Maydell wrote:
> On Mon, 2 Dec 2019 at 14:06, Cleber Rosa  wrote:
> >
> > RFC: QEMU Gating CI
> > ===
> >
> > This RFC attempts to address most of the issues described in
> > "Requirements/GatingCI"[1].  Another relevant write-up is the "State of
> > QEMU CI as we enter 4.0"[2].
> >
> > The general approach is one to minimize the infrastructure maintenance
> > and development burden, leveraging as much as possible "other people's"
> > infrastructure and code.  GitLab's CI/CD platform is the most relevant
> > component dealt with here.
> 
> Thanks for writing up this RFC.
> 
> My overall view is that there's some interesting stuff in
> here and definitely some things we'll want to cover at some
> point, but there's also a fair amount that is veering away
> from solving the immediate problem we want to solve, and
> which we should thus postpone for later (beyond making some
> reasonable efforts not to design something which paints us
> into a corner so it's annoyingly hard to improve later).
>

Right.  I think this is a valid perspective to consider as we define
the order and scope of tasks.  I'll follow up with a more
straightforward suggestion with the bare minimum actions for a first
round.

> > To exemplify my point, if one specific test run as part of "check-tcg"
> > is found to be faulty on a specific job (say on a specific OS), the
> > entire "check-tcg" test set may be disabled as a CI-level maintenance
> > action.  Of course a follow up action to deal with the specific test
> > is required, probably in the form of a Launchpad bug and patches
> > dealing with the issue, but without necessarily a CI related angle to
> > it.
> >
> > If/when test result presentation and control mechanism evolve, we may
> > feel confident and go into finer grained granularity.  For instance, a
> > mechanism for disabling nothing but "tests/migration-test" on a given
> > environment would be possible and desirable from a CI management level.
> 
> For instance, we don't have anything today for granularity of
> definition of what tests we run where or where we disable them.
> So we don't need it in order to move away from the scripting
> approach I have at the moment. We can just say "the CI system
> will run make and make check (and maybe in some hosts some
> additional test-running commands) on these hosts" and hardcode
> that into whatever yaml file the CI system's configured in.
>

I absolutely agree.  That's why I even considered *if* this will be done,
and not only *when*.  Because I happen to be biased from working on a
test runner/framework, this is something that I had to at least talk
about, so that it can be evaluated and maybe turned into a goal.

> > Pre-merge
> > ~
> >
> > The natural way to have pre-merge CI jobs in GitLab is to send "Merge
> > Requests"[3] (abbreviated as "MR" from now on).  In most projects, a
> > MR comes from individual contributors, usually the authors of the
> > changes themselves.  It's my understanding that the current maintainer
> > model employed in QEMU will *not* change at this time, meaning that
> > code contributions and reviews will continue to happen on the mailing
> > list.  A maintainer then, having collected a number of patches, would
> > submit a MR either in addition or in substitution to the Pull Requests
> > sent to the mailing list.
> 
> Eventually it would be nice to allow any submaintainer
> to send a merge request to the CI system (though you would
> want it to have a "but don't apply until somebody else approves it"
> gate as well as the automated testing part). But right now all
> we need is for the one person managing merges and releases
> to be able to say "here's the branch where I merged this
> pullrequest, please test it". At any rate, supporting multiple
> submaintainers all talking to the CI independently should be
> out of scope for now.
>

OK, noted.

> > Multi-maintainer model
> > ~~
> >
> > The previous section already introduced some of the proposed workflow
> > that can enable such a multi-maintainer model.  With a Gating CI
> > system, though, it will be natural to have a smaller "Mean time
> > between (CI) failures", simply because of the expected increased
> > number of systems and checks.  A lot of countermeasures have to be
> > employed to keep that MTBF in check.
> >
> > For one, it's imperative that the maintainers for such systems and
> > jobs are clearly defined and readily accessible.  Either the same
> > MAINTAINERS file or a more suitable variation of such data should be
> > defined before activating the *gating* rules.  This would allow a
> > routing to request the attention of the maintainer responsible.
> >
> > In case of unresponsive maintainers, or any other condition that
> > renders and keeps one or more CI jobs failing for a given previously
> > established amount of time, the job can be demoted with an
> > "allow_failure" configuration[7].

Re: [PATCH v2 2/4] target/arm: Abstract the generic timer frequency

2019-12-04 Thread Andrew Jeffery




On Wed, 4 Dec 2019, at 03:57, Philippe Mathieu-Daudé wrote:
> On 12/3/19 1:48 PM, Andrew Jeffery wrote:
> > On Tue, 3 Dec 2019, at 16:39, Philippe Mathieu-Daudé wrote:
> >> On 12/3/19 5:14 AM, Andrew Jeffery wrote:
> >>> Prepare for SoCs such as the ASPEED AST2600 whose firmware configures
> >>> CNTFRQ to values significantly larger than the static 62.5MHz value
> >>> currently derived from GTIMER_SCALE. As the OS potentially derives its
> >>> timer periods from the CNTFRQ value the lack of support for running
> >>> QEMUTimers at the appropriate rate leads to sticky behaviour in the
> >>> guest.
> >>>
> >>> Substitute the GTIMER_SCALE constant with use of a helper to derive the
> >>> period from gt_cntfrq stored in struct ARMCPU. Initially set gt_cntfrq
> >>> to the frequency associated with GTIMER_SCALE so current behaviour is
> >>> maintained.
> >>>
> >>> Signed-off-by: Andrew Jeffery 
> >>> Reviewed-by: Richard Henderson 
> >>> ---
> >>>target/arm/cpu.c|  2 ++
> >>>target/arm/cpu.h| 10 ++
> >>>target/arm/helper.c | 10 +++---
> >>>3 files changed, 19 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> >>> index 7a4ac9339bf9..5698a74061bb 100644
> >>> --- a/target/arm/cpu.c
> >>> +++ b/target/arm/cpu.c
> >>> @@ -974,6 +974,8 @@ static void arm_cpu_initfn(Object *obj)
> >>>if (tcg_enabled()) {
> >>>cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
> >>>}
> >>> +
> >>> +cpu->gt_cntfrq = NANOSECONDS_PER_SECOND / GTIMER_SCALE;
> >>>}
> >>>
> >>>static Property arm_cpu_reset_cbar_property =
> >>> diff --git a/target/arm/cpu.h b/target/arm/cpu.h
> >>> index 83a809d4bac4..666c03871fdf 100644
> >>> --- a/target/arm/cpu.h
> >>> +++ b/target/arm/cpu.h
> >>> @@ -932,8 +932,18 @@ struct ARMCPU {
> >>> */
> >>>DECLARE_BITMAP(sve_vq_map, ARM_MAX_VQ);
> >>>DECLARE_BITMAP(sve_vq_init, ARM_MAX_VQ);
> >>> +
> >>> +/* Generic timer counter frequency, in Hz */
> >>> +uint64_t gt_cntfrq;
> >>
> >> You can also make the unit explicit by calling it 'gt_cntfrq_hz'.
> > 
> > Fair call, I'll fix that.
> > 
> >>
> >>>};
> >>>
> >>> +static inline unsigned int gt_cntfrq_period_ns(ARMCPU *cpu)
> >>> +{
> >>> +/* XXX: Could include qemu/timer.h to get NANOSECONDS_PER_SECOND? */
> >>
> >> Why inline this call? I doubt there is a significant performance gain.
> > 
> > It wasn't so much performance. It started out as a macro for a simple
> > calculation because I didn't want to duplicate it across a number of
> > places, then I wanted type safety for the pointer so I switched the macro
> > in the header to an inline function. So it is an evolution of the patch
> > rather than something that came from an explicit goal of e.g. performance.
> 
> OK. Eventually NANOSECONDS_PER_SECOND will move to "qemu/units.h".
> 
> Should the XXX comment stay? I'm not sure, it is confusing.

I'll remove that. 

> 
> Reviewed-by: Philippe Mathieu-Daudé 

Thanks. However, did you still want your comment on 4/4 addressed (move
the comment to this patch)?

Andrew
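
For readers following the thread, here is a minimal standalone sketch of the period calculation being discussed, assuming the field is renamed gt_cntfrq_hz as suggested in the review. The toy_ names and struct are hypothetical stand-ins and not QEMU's definitions; the value 16 for the scale follows from the 62.5MHz default mentioned in the commit message (1e9 / 62.5e6):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for NANOSECONDS_PER_SECOND and GTIMER_SCALE. */
#define TOY_NANOSECONDS_PER_SECOND 1000000000ULL
#define TOY_GTIMER_SCALE 16ULL   /* ns per tick at the default 62.5MHz */

/* Only the one field this sketch needs; not the real struct ARMCPU. */
struct toy_cpu {
    uint64_t gt_cntfrq_hz;   /* generic timer counter frequency, in Hz */
};

/* Default frequency, mirroring the assignment added to arm_cpu_initfn(). */
void toy_cpu_init(struct toy_cpu *cpu)
{
    cpu->gt_cntfrq_hz = TOY_NANOSECONDS_PER_SECOND / TOY_GTIMER_SCALE;
}

/* Timer period in nanoseconds for the configured counter frequency. */
unsigned int toy_gt_cntfrq_period_ns(const struct toy_cpu *cpu)
{
    assert(cpu->gt_cntfrq_hz != 0);   /* a zero frequency has no period */
    return (unsigned int)(TOY_NANOSECONDS_PER_SECOND / cpu->gt_cntfrq_hz);
}
```

Because this is integer division, any frequency above 1GHz would truncate to a period of 0 in this sketch, so a real helper would probably need to handle that case.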

Re: [PATCH v17 5/7] ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls

2019-12-04 Thread Ganesh




On 11/19/19 8:09 AM, David Gibson wrote:

On Thu, Oct 24, 2019 at 01:13:05PM +0530, Ganesh Goudar wrote:

From: Aravinda Prasad 

This patch adds support in QEMU to handle "ibm,nmi-register"
and "ibm,nmi-interlock" RTAS calls.

The machine check notification address is saved when the
OS issues "ibm,nmi-register" RTAS call.

This patch also handles the case when multiple processors
experience machine check at or about the same time by
handling "ibm,nmi-interlock" call. In such cases, as per
PAPR, subsequent processors serialize waiting for the first
processor to issue the "ibm,nmi-interlock" call. The second
processor that also received a machine check error waits
till the first processor is done reading the error log.
The first processor issues "ibm,nmi-interlock" call
when the error log is consumed.
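
The serialization described above can be sketched as a small single-threaded state model. This is only an illustration under stated assumptions: the toy_ names are hypothetical, and where the real code waits on a condition variable, this model simply returns false:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MC_IDLE (-1)   /* no machine check is currently being handled */

struct toy_machine {
    int mc_status;   /* id of the vCPU handling a machine check, or -1 */
};

/*
 * A vCPU that hit a machine check tries to take ownership of the error log.
 * Returns false if another vCPU is still handling one (caller must wait).
 */
bool toy_mce_claim(struct toy_machine *m, int vcpu_id)
{
    assert(vcpu_id >= 0);
    if (m->mc_status != TOY_MC_IDLE) {
        return false;
    }
    m->mc_status = vcpu_id;
    return true;
}

/*
 * Models "ibm,nmi-interlock": only the vCPU that owns the error log may
 * release it, which lets the next waiting vCPU proceed.
 */
bool toy_nmi_interlock(struct toy_machine *m, int vcpu_id)
{
    assert(vcpu_id >= 0);
    if (m->mc_status != vcpu_id) {
        return false;
    }
    m->mc_status = TOY_MC_IDLE;
    return true;
}
```

The second processor's claim succeeds only after the first one has issued the interlock call, matching the ordering in the commit message.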

[Move fwnmi registeration to .apply hook]

s/registeration/registration/

Thanks



Signed-off-by: Ganesh Goudar 
Signed-off-by: Aravinda Prasad 
---
  hw/ppc/spapr_caps.c|  9 +--
  hw/ppc/spapr_rtas.c| 57 ++
  include/hw/ppc/spapr.h |  5 +++-
  3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index 976d709210..1675ebd45e 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -509,9 +509,14 @@ static void cap_fwnmi_mce_apply(SpaprMachineState *spapr, 
uint8_t val,
   * of software injected faults like duplicate SLBs).
   */
  warn_report("Firmware Assisted Non-Maskable Interrupts not supported in 
TCG");

This logic still isn't quite right.  To start with the warn_report()
above possibly wants to be more weakly worded.  With TCG, FWNMI won't
generally *do* anything, and there are some edge cases where the
behaviour is arguably incorrect.  However there's no reason we can't
make the RTAS calls work basically as expected and in almost all cases
things will behave correctly - at least according to the case where no
fwnmi events are delivered...

ok



-} else if (kvm_enabled() && (kvmppc_set_fwnmi() != 0)) {
-error_setg(errp,
+} else if (kvm_enabled()) {
+if (!kvmppc_set_fwnmi()) {
+/* Register ibm,nmi-register and ibm,nmi-interlock RTAS calls */
+spapr_fwnmi_register();

..but here you only register the RTAS calls in the KVM case, which
breaks that.  If there really is a strong reason to do this, then the
warn_report() above should be error_setg() and fail the apply.


+} else {
+error_setg(errp,
  "Firmware Assisted Non-Maskable Interrupts not supported by KVM, try 
cap-fwnmi-mce=off");
+}
  }
  }
  
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c

index 2c066a372d..0328b1f341 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -400,6 +400,55 @@ static void rtas_get_power_level(PowerPCCPU *cpu, 
SpaprMachineState *spapr,
  rtas_st(rets, 1, 100);
  }
  
+static void rtas_ibm_nmi_register(PowerPCCPU *cpu,

+  SpaprMachineState *spapr,
+  uint32_t token, uint32_t nargs,
+  target_ulong args,
+  uint32_t nret, target_ulong rets)
+{
+hwaddr rtas_addr = spapr_get_rtas_addr();
+
+if (!rtas_addr) {
+rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);
+return;
+}
+
+if (spapr_get_cap(spapr, SPAPR_CAP_FWNMI_MCE) == SPAPR_CAP_OFF) {
+rtas_st(rets, 0, RTAS_OUT_NOT_SUPPORTED);

Actually, since you explicitly test for the cap being enabled here,
there's no reason not to *always* register this RTAS call.  Also this
test for the feature flag should go first, before delving into the
device tree for the RTAS address.

Sure, will do



+return;
+}
+
+spapr->guest_machine_check_addr = rtas_ld(args, 1);
+rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+}
+
+static void rtas_ibm_nmi_interlock(PowerPCCPU *cpu,
+   SpaprMachineState *spapr,
+   uint32_t token, uint32_t nargs,
+   target_ulong args,
+   uint32_t nret, target_ulong rets)
+{
+if (spapr->guest_machine_check_addr == -1) {
+/* NMI register not called */
+rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+return;
+}
+
+if (spapr->mc_status != cpu->vcpu_id) {
+/* The vCPU that hit the NMI should invoke "ibm,nmi-interlock" */
+rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+return;
+}
+
+/*
+ * vCPU issuing "ibm,nmi-interlock" is done with NMI handling,
+ * hence unset mc_status.
+ */
+spapr->mc_status = -1;
+qemu_cond_signal(&spapr->mc_delivery_cond);
+rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+}
+
  static struct rtas_call {
  const char *name;
  spapr_rtas_fn fn;
@@ -503,6 +552,14 @@ hwaddr spapr_get_rtas_addr(void)
  return (hwaddr)fdt32_to_cpu(*rtas_data);
  }
  
+void

Re: [PATCH v2 1/3] virtio: add ability to delete vq through a pointer

2019-12-04 Thread Pankaj Gupta



> 
> On 2019/12/4 16:33, Pankaj Gupta wrote:
> > 
> >> From: Pan Nengyuan 
> >>
> >> Devices tend to maintain vq pointers, allow deleting them through a vq
> >> pointer.
> >>
> >> Signed-off-by: Michael S. Tsirkin 
> >> Signed-off-by: Pan Nengyuan 
> >> ---
> >> Changes v2 to v1:
> >> - add a new function virtio_delete_queue to cleanup vq through a vq
> >> pointer
> >> ---
> >>  hw/virtio/virtio.c | 16 +++-
> >>  include/hw/virtio/virtio.h |  2 ++
> >>  2 files changed, 13 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 04716b5..6de3cfd 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -2330,17 +2330,23 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev,
> >> int
> >> queue_size,
> >>  return &vdev->vq[i];
> >>  }
> >>  
> >> +void virtio_delete_queue(VirtQueue *vq)
> >> +{
> >> +vq->vring.num = 0;
> >> +vq->vring.num_default = 0;
> >> +vq->handle_output = NULL;
> >> +vq->handle_aio_output = NULL;
> >> +g_free(vq->used_elems);
> >> +vq->used_elems = NULL;
> >> +}
> >> +
> >>  void virtio_del_queue(VirtIODevice *vdev, int n)
> >>  {
> >>  if (n < 0 || n >= VIRTIO_QUEUE_MAX) {
> >>  abort();
> >>  }
> >>  
> >> -vdev->vq[n].vring.num = 0;
> >> -vdev->vq[n].vring.num_default = 0;
> >> -vdev->vq[n].handle_output = NULL;
> >> -vdev->vq[n].handle_aio_output = NULL;
> >> -g_free(vdev->vq[n].used_elems);
> >> +virtio_delete_queue(&vdev->vq[n]);
> >>  }
> >>  
> >>  static void virtio_set_isr(VirtIODevice *vdev, int value)
> >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >> index c32a815..e18756d 100644
> >> --- a/include/hw/virtio/virtio.h
> >> +++ b/include/hw/virtio/virtio.h
> >> @@ -183,6 +183,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int
> >> queue_size,
> >>  
> >>  void virtio_del_queue(VirtIODevice *vdev, int n);
> >>  
> >> +void virtio_delete_queue(VirtQueue *vq);
> >> +
> >>  void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
> >>  unsigned int len);
> >>  void virtqueue_flush(VirtQueue *vq, unsigned int count);
> >> --
> >> 2.7.2.windows.1
> >>
> >>
> > > Overall it looks good to me.
> > 
> > Just one point: e.g in virtio_rng: "virtio_rng_device_unrealize" function
> > We are doing : virtio_del_queue(vdev, 0);
> > 
> > One can directly call "virtio_delete_queue". It can become confusing
> > to call multiple functions for same purpose. Instead, Can we make
> > "virtio_delete_queue" static inline?
> > 
> Yes, it will be a little confusing, but I think it will have the same
> problem if we make "virtio_delete_queue" static inline. We can also
> call it directly (e.g. virtio-serial-bus.c, virtio-balloon.c).
> 
> How about replacing the function name to make it more clear (e.g
> virtio_delete_queue -> virtio_queue_cleanup) ? It's too similar between
> "virtio_del_queue" and "virtio_delete_queue".

I am just thinking if we need these two separate functions.

Yes, changing name of virtio_delete_queue -> virtio_queue_cleanup
should be good enough.

Thanks,
Pankaj
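
A minimal sketch of how the two entry points could relate after the proposed rename, assuming toy types in place of VirtQueue and VirtIODevice. All toy_ names are hypothetical, and only two of the fields from the patch are modelled:

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_QUEUE_MAX 8   /* hypothetical stand-in for VIRTIO_QUEUE_MAX */

struct toy_vq {
    unsigned int num;    /* ring size; 0 means the queue is not in use */
    void *used_elems;    /* heap buffer owned by the queue */
};

struct toy_dev {
    struct toy_vq vq[TOY_QUEUE_MAX];
};

/* Pointer-based cleanup, for devices that keep their own vq pointers. */
void toy_queue_cleanup(struct toy_vq *vq)
{
    assert(vq != NULL);
    vq->num = 0;
    free(vq->used_elems);
    vq->used_elems = NULL;   /* makes a second cleanup harmless */
}

/* Index-based wrapper kept for existing callers. */
void toy_del_queue(struct toy_dev *dev, int n)
{
    if (n < 0 || n >= TOY_QUEUE_MAX) {
        abort();
    }
    toy_queue_cleanup(&dev->vq[n]);
}
```

With distinct names, the index-based call reads as "delete queue n of this device" and the pointer-based call as "clean up this queue", which is the distinction the rename is meant to make visible.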

> 
> > Other than that:
> > Reviewed-by: Pankaj Gupta 
> > 
> >>
> >>
> > 
> > 
> > .
> > 
> 
> 
>

Re: [PULL v2 4/6] spapr: Add /chosen to FDT only at reset time to preserve kernel and initramdisk

2019-12-04 Thread Alexey Kardashevskiy




On 04/12/2019 21:32, Laurent Vivier wrote:
> On 04/12/2019 05:40, Alexey Kardashevskiy wrote:
>>
>>
>> On 04/12/2019 15:23, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 04/12/2019 03:09, Laurent Vivier wrote:

 Bad reply, the problem is with

 "spapr: Render full FDT on ibm,client-architecture-support"
>>>
>>>
>>> https://git.qemu.org/?p=SLOF.git;a=blob;f=board-qemu/slof/fdt.fs;h=3e4c1b34b8af2dcebde57e548c94417e5e20e1cc;hb=HEAD#l265
>>>
>>> A "bit ugly" became really ugly as before we were only patching
>>> interrupt-map for PHB (7 cells per line) only but now we have to patch
>>> (or, rather, skip) the PCI bridge interrupt-map (9 cells per line).
>>>
>>> Fixing now...
>>
>>
>> Basically, this:
>>
>>
>> diff --git a/board-qemu/slof/fdt.fs b/board-qemu/slof/fdt.fs
>> index 3e4c1b34b8af..463a2a8c0c2d 100644
>> --- a/board-qemu/slof/fdt.fs
>> +++ b/board-qemu/slof/fdt.fs
>> @@ -300,8 +300,13 @@ fdt-claim-reserve
>> \ ." Replacing in " dup node>path type cr
>> >r
>> s" interrupt-map" r@ get-property 0= IF
>> -  ( old new prop-addr prop-len  R: node )
>> -  fdt-replace-interrupt-map
>> +  dup e00 = IF
>> +  ( old new prop-addr prop-len  R: node )
>> +  fdt-replace-interrupt-map
>> +  ELSE
>> + 2drop
>> +  ."  no idea what this is" cr
>> +  THEN
>> THEN
> 
> This does not fix the problem for me.

That's strange, does it crash the same way?

Anyway I made 2 patches:
https://patchwork.ozlabs.org/patch/1204467/
https://patchwork.ozlabs.org/patch/1204468/

Please give them a try. Thanks,


-- 
Alexey

[RFC PATCH 1/2] hw/vfio: add a 'disablable' flag to sparse mmaped region

2019-12-04 Thread Yan Zhao

Add a 'disablable' flag to each sparse mmaped region; this flag is off by
default.

vfio_region_disablable_mmaps_set_enabled() will enable/disable mmapped
subregions if its 'disablable' flag is on.

Cc: Kevin Tian 
Signed-off-by: Yan Zhao 
---
 hw/vfio/common.c  | 28 +++-
 hw/vfio/trace-events  |  3 ++-
 include/hw/vfio/vfio-common.h |  2 ++
 linux-headers/linux/vfio.h|  2 ++
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6f36b02e3e..79f694dd19 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -883,11 +883,13 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion 
*region,
 for (i = 0, j = 0; i < sparse->nr_areas; i++) {
 trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
 sparse->areas[i].offset +
-sparse->areas[i].size);
+sparse->areas[i].size,
+sparse->areas[i].disablable);
 
 if (sparse->areas[i].size) {
 region->mmaps[j].offset = sparse->areas[i].offset;
 region->mmaps[j].size = sparse->areas[i].size;
+region->mmaps[j].disablable = sparse->areas[i].disablable;
 j++;
 }
 }
@@ -1084,6 +1086,30 @@ void vfio_region_mmaps_set_enabled(VFIORegion *region, 
bool enabled)
 enabled);
 }
 
+/**
+ * enable/disable vfio regions with mmaped subregions
+ * It only disable mmapped subregions with disablable flag on
+ */
+void vfio_region_disablable_mmaps_set_enabled(VFIORegion *region, bool enabled)
+{
+int i;
+
+if (!region->mem) {
+return;
+}
+
+for (i = 0; i < region->nr_mmaps; i++) {
+if (region->mmaps[i].mmap && region->mmaps[i].disablable) {
+memory_region_set_enabled(&region->mmaps[i].mem, enabled);
+trace_vfio_region_disablable_mmaps_set_enabled(
+memory_region_name(region->mem),
+region->mmaps[i].offset,
+region->mmaps[i].offset + region->mmaps[i].size,
+enabled);
+}
+}
+}
+
 void vfio_reset_handler(void *opaque)
 {
 VFIOGroup *group;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 414a5e69ec..7b2d07529e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -113,7 +113,7 @@ vfio_region_finalize(const char *name, int index) "Device 
%s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps 
enabled: %d"
 vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) 
"Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) 
"Device %s region %d: %d sparse mmap entries"
-vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) 
"sparse entry %d [0x%lx - 0x%lx]"
+vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end, 
bool disablable) "sparse entry %d [0x%lx - 0x%lx] disablable %d"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t 
subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
 
@@ -161,3 +161,4 @@ vfio_load_device_config_state(char *name) " (%s)"
 vfio_load_state(char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(char *name, uint64_t data_offset, uint64_t 
data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
 vfio_get_dirty_page_list(char *name, uint64_t start, uint64_t pfn_count, 
uint64_t page_size) " (%s) start 0x%"PRIx64" pfn_count 0x%"PRIx64 " page size 
0x%"PRIx64
+vfio_region_disablable_mmaps_set_enabled(const char *name, unsigned long 
offset, unsigned long end, bool enabled) "Region %s mmaps [0x%lx - 0x%lx] set 
to %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 41ff5ebba2..8cfe46c681 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -45,6 +45,7 @@ typedef struct VFIOMmap {
 void *mmap;
 off_t offset;
 size_t size;
+bool disablable; /* whether this region is able to get diabled */
 } VFIOMmap;
 
 typedef struct VFIORegion {
@@ -187,6 +188,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, 
VFIORegion *region,
   int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_disablable_mmaps_set_enabled(VFIORegion *region, bool 
enabled);
 void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 4bc0236b08..f9f0ea8eda 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -258,6 +258,8 @@ struct

[RFC PATCH 0/2] QEMU: Dynamic trap/untrap of VFIO PCI BARs

2019-12-04 Thread Yan Zhao

This patchset enables PCI BARs to be dynamically trapped/passthroughed
in response to vendor driver's needs.

To dynamically trap/untrap PCI BARs, three pieces of information are required:
(1) which parts of the PCI BARs are to be trapped/passed through
(2) when to do the trap/passthrough transition
(3) whether to trap or to pass through

Patch 1 lets the vendor driver specify which sparse mmaped subregions are
disablable, thereby providing the first piece of information.

Patch 2 probes and creates a dynamic trap bar info region, whose
"dt_fd" field provides the second piece of information, and whose
"trap" field provides the third.

The corresponding kernel implementation is at
https://www.spinics.net/lists/kernel/msg3337337.html.


Yan Zhao (2):
  hw/vfio: add a 'disablable' flag to sparse mmaped region
  hw/vfio/pci: init dynamic-trap-bar-info region

 hw/vfio/common.c  |  28 +++-
 hw/vfio/pci.c | 117 ++
 hw/vfio/pci.h |   5 ++
 hw/vfio/trace-events  |   4 +-
 include/hw/vfio/vfio-common.h |   2 +
 linux-headers/linux/vfio.h|  13 
 6 files changed, 167 insertions(+), 2 deletions(-)

-- 
2.17.1

[RFC PATCH 2/2] hw/vfio/pci: init dynamic-trap-bar-info region

2019-12-04 Thread Yan Zhao

For devices that support a device region of type
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO, probe and init a
dynamic-trap-bar-info region which holds
(1) the fd of an eventfd,
(2) whether to trap or untrap the sparse mmaped PCI BARs.

The vendor driver should first specify the device PCI BARs to be sparse mmapped,
which means those BARs are sparsely passed through.
If it wants certain sub-regions to be dynamically trapped, it should
also set the 'disablable' flag for those sub-regions.

When the vendor driver signals the eventfd, QEMU reads the 'trap' field of this
dynamic trap bar info region, then disables/enables the disablable subregions
of the PCI BAR regions.
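
The flow described above can be sketched roughly as follows, assuming a toy model in which each sparse mmap area is just a set of flags. The toy_ names are hypothetical, and the eventfd and the region read are replaced by a plain function argument:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one sparse mmap area of a BAR; not QEMU's VFIOMmap. */
struct toy_mmap {
    bool mapped;       /* area is actually mmapped */
    bool disablable;   /* vendor driver allows toggling this area */
    bool enabled;      /* true: direct access, false: accesses are trapped */
};

/* Toggle only areas that are both mapped and flagged disablable. */
size_t toy_disablable_set_enabled(struct toy_mmap *mmaps, size_t n,
                                  bool enabled)
{
    size_t touched = 0;

    assert(mmaps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mmaps[i].mapped && mmaps[i].disablable) {
            mmaps[i].enabled = enabled;
            touched++;
        }
    }
    return touched;
}

/*
 * Called when the eventfd fires, with the freshly read 'trap' value.
 * A non-zero trap value disables the mmaps so that accesses trap.
 * Returns false if the state did not change and nothing was done.
 */
bool toy_handle_trap_event(struct toy_mmap *mmaps, size_t n,
                           uint32_t *cur_state, uint32_t trap)
{
    if (trap == *cur_state) {
        return false;
    }
    toy_disablable_set_enabled(mmaps, n, !trap);
    *cur_state = trap;
    return true;
}
```

Areas without the 'disablable' flag are never touched, so they stay passed through regardless of the trap state.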

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 hw/vfio/pci.c  | 117 +
 hw/vfio/pci.h  |   5 ++
 hw/vfio/trace-events   |   1 +
 linux-headers/linux/vfio.h |  11 
 4 files changed, 134 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c04f4bcfb8..3837f77185 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2638,6 +2638,120 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
 return 0;
 }
 
+static void vfio_dt_notifier_handler(void *opaque)
+{
+VFIOPCIDevice *vdev = opaque;
+int i;
+__u32 dt_state;
+
+if (!event_notifier_test_and_clear(&vdev->dt_notifier)) {
+return;
+}
+
+if (vdev->dt_offset < 0) {
+return;
+}
+
+if (pread(vdev->vbasedev.fd, &dt_state,
+sizeof(dt_state),
+vdev->dt_offset +
+offsetof(struct vfio_device_dt_bar_info_region, trap))
+!= sizeof(dt_state)) {
+error_report("vfio failed to read from dt region");
+return;
+}
+
+if (dt_state == vdev->dt_state) {
+return;
+}
+
+for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+vfio_region_disablable_mmaps_set_enabled(&vdev->bars[i].region,
+ !dt_state);
+}
+
+vdev->dt_state = dt_state;
+
+}
+
+static void vfio_register_dt_notifier(VFIOPCIDevice *vdev)
+{
+if (vdev->enable_dt) {
+return;
+}
+
+if (event_notifier_init(&vdev->dt_notifier, 0)) {
+error_report("vfio: unable to init event notifier for dynamic trap");
+return;
+}
+
+qemu_set_fd_handler(event_notifier_get_fd(&vdev->dt_notifier),
+vfio_dt_notifier_handler, NULL, vdev);
+}
+
+static void vfio_unregister_dt_notifier(VFIOPCIDevice *vdev)
+{
+if (!vdev->enable_dt) {
+return;
+}
+
+qemu_set_fd_handler(event_notifier_get_fd(&vdev->dt_notifier),
+NULL, NULL, vdev);
+event_notifier_cleanup(&vdev->dt_notifier);
+vdev->enable_dt = false;
+vdev->dt_offset = -1;
+vdev->dt_state = false;
+}
+
+/**
+ * init a dynamic trap bar info region
+ * this region is used for qemu to communicate to vendor driver
+ *
+ * if this device region is queried from vendor driver, qemu will
+ * create an eventfd and write fd of this eventfd to dt_fd field of
+ * this region.
+ *
+ * when vendor driver notifys this dt_fd, qemu first read trap field
+ * of this region to get trap/untrap info. Then qemu will disable/enable
+ * mmaped subregions of pci bar regions according to this info.
+ *
+ */
+static void init_dt_region(VFIOPCIDevice *vdev)
+{
+struct vfio_region_info *reg_info;
+int ret;
+uint32_t dt_fd;
+vdev->dt_state = false;
+
+ret = vfio_get_dev_region_info(&vdev->vbasedev,
+VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
+VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
+&reg_info);
+if (ret || reg_info->size < sizeof(dt_fd)) {
+goto out;
+}
+
+vdev->dt_offset = reg_info->offset;
+
+vfio_register_dt_notifier(vdev);
+dt_fd = event_notifier_get_fd(&vdev->dt_notifier);
+
+trace_vfio_init_dt_region(vdev->vbasedev.name, vdev->vendor_id,
+  vdev->device_id, reg_info->offset,
+  reg_info->offset + reg_info->size - 1, dt_fd);
+
+if (pwrite(vdev->vbasedev.fd, &dt_fd,
+sizeof(dt_fd),
+vdev->dt_offset) != sizeof(dt_fd)) {
+error_report("vfio failed to write to dt region");
+vfio_unregister_dt_notifier(vdev);
+}
+vdev->enable_dt = true;
+out:
+g_free(reg_info);
+}
+
+
 static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 {
 VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3173,6 +3287,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
  vdev->vbasedev.name);
 }
 
+init_dt_region(vdev);
+
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn_quirk(vdev);
@@ -3214,6 +3330,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
 vfio_unregister_req_notifier(vdev);
 vfio_unregister_err_notifier(vdev);
+vfio_unregister_dt_notifier(vdev);
 pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);

[PATCH v5 2/2] block/nbd: fix memory leak in nbd_open()

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

In the current implementation there is a memory leak when
nbd_client_connect() returns an error status. Here is an easy way to
reproduce:

1. run qemu-iotests as follows and check the result with asan:
./check -raw 143

Following is the asan output backtrace:
Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a560 in calloc (/usr/lib64/libasan.so.3+0xc7560)
#1 0x7f6295e7e015 in g_malloc0  (/usr/lib64/libglib-2.0.so.0+0x50015)
#2 0x56281dab4642 in qobject_input_start_struct  
/mnt/sdb/qemu-4.2.0-rc0/qapi/qobject-input-visitor.c:295
#3 0x56281dab1a04 in visit_start_struct  
/mnt/sdb/qemu-4.2.0-rc0/qapi/qapi-visit-core.c:49
#4 0x56281dad1827 in visit_type_SocketAddress  qapi/qapi-visit-sockets.c:386
#5 0x56281da8062f in nbd_config   /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1716
#6 0x56281da8062f in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1829
#7 0x56281da8062f in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873

Direct leak of 15 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a3a0 in malloc (/usr/lib64/libasan.so.3+0xc73a0)
#1 0x7f6295e7dfbd in g_malloc (/usr/lib64/libglib-2.0.so.0+0x4ffbd)
#2 0x7f6295e96ace in g_strdup (/usr/lib64/libglib-2.0.so.0+0x68ace)
#3 0x56281da804ac in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1834
#4 0x56281da804ac in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a3a0 in malloc (/usr/lib64/libasan.so.3+0xc73a0)
#1 0x7f6295e7dfbd in g_malloc (/usr/lib64/libglib-2.0.so.0+0x4ffbd)
#2 0x7f6295e96ace in g_strdup (/usr/lib64/libglib-2.0.so.0+0x68ace)
#3 0x56281dab41a3 in qobject_input_type_str_keyval 
/mnt/sdb/qemu-4.2.0-rc0/qapi/qobject-input-visitor.c:536
#4 0x56281dab2ee9 in visit_type_str 
/mnt/sdb/qemu-4.2.0-rc0/qapi/qapi-visit-core.c:297
#5 0x56281dad0fa1 in visit_type_UnixSocketAddress_members 
qapi/qapi-visit-sockets.c:141
#6 0x56281dad17b6 in visit_type_SocketAddress_members 
qapi/qapi-visit-sockets.c:366
#7 0x56281dad186a in visit_type_SocketAddress qapi/qapi-visit-sockets.c:393
#8 0x56281da8062f in nbd_config /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1716
#9 0x56281da8062f in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1829
#10 0x56281da8062f in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873
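
The leak pattern reported above can be sketched in isolation. This is a minimal illustration and not QEMU code: struct state, process_options(), connect_client(), clear_state() and open_dev() are hypothetical stand-ins for BDRVNBDState, nbd_process_options(), nbd_client_connect(), nbd_clear_bdrvstate() and nbd_open(), and plain malloc()/free() replace the GLib allocators.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for BDRVNBDState: parsed options own heap memory. */
struct state {
    char *export;
    char *tlscredsid;
};

/* Free everything the option parser allocated and reset the pointers. */
void clear_state(struct state *s)
{
    free(s->export);
    s->export = NULL;
    free(s->tlscredsid);
    s->tlscredsid = NULL;
}

/* Stand-in for option parsing: allocates a string owned by the state. */
int process_options(struct state *s, const char *export)
{
    s->export = malloc(strlen(export) + 1);
    if (!s->export) {
        return -1;
    }
    strcpy(s->export, export);
    return 0;
}

/* Stand-in for the connect step; fails on request. */
int connect_client(int should_fail)
{
    return should_fail ? -1 : 0;
}

int open_dev(struct state *s, const char *export, int connect_fails)
{
    int ret = process_options(s, export);

    if (ret < 0) {
        clear_state(s);
        return ret;
    }

    ret = connect_client(connect_fails);
    if (ret < 0) {
        clear_state(s);   /* without this call the parsed options leak */
        return ret;
    }
    return 0;
}
```

Because clear_state() resets each pointer after freeing it, a later close path can call it again without a double free.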

Fixes: 8f071c9db506e03ab
Reported-by: Euler Robot 
Signed-off-by: Pan Nengyuan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Cc: qemu-stable 
Cc: Vladimir Sementsov-Ogievskiy 
---
Changes v2 to v1:
- add a new function to do the common cleanups (suggested by Stefano
  Garzarella).
---
Changes v3 to v2:
- split in two patches (suggested by Stefano Garzarella)
---
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to
  nbd_clear_bdrvstate and add Fixes tag.
---
Changes v5 to v4:
- correct the wrong email address
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index 8b4a65a..9062409 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1891,6 +1891,7 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
 
 ret = nbd_client_connect(bs, errp);
 if (ret < 0) {
+nbd_clear_bdrvstate(s);
 return ret;
 }
 /* successfully connected */
-- 
2.7.2.windows.1

[PATCH v5 1/2] block/nbd: extract the common cleanup code

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

The BDRVNBDState cleanup code is common in two places, add
nbd_clear_bdrvstate() function to do these cleanups.

Signed-off-by: Stefano Garzarella 
Signed-off-by: Pan Nengyuan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
v3:
- new patch, split from 2/2 patch (suggested by Stefano Garzarella)
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to
  nbd_clear_bdrvstate and set cleared fields to NULL (suggested by Eric
  Blake)
- remove static function prototype. (suggested by Eric Blake)
---
Changes v5 to v4:
- correct the wrong email address
---
 block/nbd.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 1239761..8b4a65a 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -94,6 +94,19 @@ typedef struct BDRVNBDState {
 
 static int nbd_client_connect(BlockDriverState *bs, Error **errp);
 
+void nbd_clear_bdrvstate(BDRVNBDState *s)
+{
+object_unref(OBJECT(s->tlscreds));
+qapi_free_SocketAddress(s->saddr);
+s->saddr = NULL;
+g_free(s->export);
+s->export = NULL;
+g_free(s->tlscredsid);
+s->tlscredsid = NULL;
+g_free(s->x_dirty_bitmap);
+s->x_dirty_bitmap = NULL;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -1855,10 +1868,7 @@ static int nbd_process_options(BlockDriverState *bs, QDict *options,
 
  error:
 if (ret < 0) {
-object_unref(OBJECT(s->tlscreds));
-qapi_free_SocketAddress(s->saddr);
-g_free(s->export);
-g_free(s->tlscredsid);
+nbd_clear_bdrvstate(s);
 }
 qemu_opts_del(opts);
 return ret;
@@ -1937,12 +1947,7 @@ static void nbd_close(BlockDriverState *bs)
 BDRVNBDState *s = bs->opaque;
 
 nbd_client_close(bs);
-
-object_unref(OBJECT(s->tlscreds));
-qapi_free_SocketAddress(s->saddr);
-g_free(s->export);
-g_free(s->tlscredsid);
-g_free(s->x_dirty_bitmap);
+nbd_clear_bdrvstate(s);
 }
 
 static int64_t nbd_getlength(BlockDriverState *bs)
-- 
2.7.2.windows.1

[PATCH v5 0/2] block/nbd: fix memory leak in nbd_open

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

This series adds a new function to do the common cleanups, and fixes a memory
leak in nbd_open when nbd_client_connect returns an error status.

---
Changes v2 to v1:
- add a new function to do the common cleanups (suggested by Stefano 
Garzarella).
---
Changes v3 to v2:
- split in two patches (suggested by Stefano Garzarella)
---
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to nbd_clear_bdrvstate and 
add Fixes tag(suggested by Eric Blake).
- remove static function prototype. (suggested by Eric Blake)
---
Changes v5 to v4:
- correct the wrong email address

Pan Nengyuan (2):
  block/nbd: extract the common cleanup code
  block/nbd: fix memory leak in nbd_open()

 block/nbd.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

-- 
2.7.2.windows.1

[RFC PATCH 6/9] sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0

2019-12-04 Thread Yan Zhao

This sample code first sets
caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in its open() callback, so
that the vfio-pci driver creates a dynamic-trap-bar-info region for it
(of type VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and
subtype VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO).

Then, in igd_dt_get_region_info(), this sample driver customizes the
size of the dynamic-trap-bar-info region.
This sample driver also customizes the BAR 0 region to be sparse mmapped
(passing through only the subregion that starts at BAR0_DYNAMIC_TRAP_OFFSET
with size BAR0_DYNAMIC_TRAP_SIZE) and marks this sparse mmapped subregion
as disablable.

When QEMU detects the dynamic trap bar info region, it creates
an eventfd and writes its fd into the 'dt_fd' field of this region.

When a BAR0 register below BAR0_DYNAMIC_TRAP_OFFSET is trapped, the
driver signals the eventfd to notify QEMU to read the 'trap' field of the
dynamic trap bar info region and to trap the previously passed-through
subregion.
After registers within the range given by BAR0_DYNAMIC_TRAP_OFFSET and
BAR0_DYNAMIC_TRAP_SIZE are trapped, this sample driver notifies QEMU via
the eventfd to pass this subregion through again.
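
To make the BAR0 layout concrete, here is a small sketch that classifies a BAR0 offset. The two macro values are the ones from this patch; the helper bar0_offset_is_passthrough() and its subregion_enabled parameter are hypothetical and only model whether the disablable mmap is currently enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch. */
#define BAR0_DYNAMIC_TRAP_OFFSET (32 * 1024)
#define BAR0_DYNAMIC_TRAP_SIZE   (32 * 1024)

/*
 * Hypothetical helper: returns true when an access at 'off' within BAR0
 * would go straight to hardware, false when it would be trapped.
 * 'subregion_enabled' models whether the disablable mmap is currently on.
 */
bool bar0_offset_is_passthrough(uint64_t off, bool subregion_enabled)
{
    bool in_subregion = off >= BAR0_DYNAMIC_TRAP_OFFSET &&
                        off < (uint64_t)BAR0_DYNAMIC_TRAP_OFFSET +
                              BAR0_DYNAMIC_TRAP_SIZE;

    /* Everything outside the sparse mmap area is always trapped. */
    return in_subregion && subregion_enabled;
}
```

With the subregion enabled, offsets in [32K, 64K) are passed through and everything else is trapped; once the subregion is disabled, every BAR0 access is trapped.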

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 samples/vfio-pci/igd_dt.c | 176 ++
 1 file changed, 176 insertions(+)

diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
index 857e8d01b0d1..58ef110917f1 100644
--- a/samples/vfio-pci/igd_dt.c
+++ b/samples/vfio-pci/igd_dt.c
@@ -29,6 +29,9 @@
 /* This driver supports opening at most 256 devices */
 #define MAX_OPEN_DEVICE 256
 
+#define BAR0_DYNAMIC_TRAP_OFFSET (32*1024)
+#define BAR0_DYNAMIC_TRAP_SIZE (32*1024)
+
 /*
  * below are pciids of two IGD devices supported in this driver
  * It is only for demo purpose.
@@ -47,10 +50,30 @@ struct igd_dt_device {
__u32 vendor;
__u32 device;
__u32 handle;
+
+   __u64 dt_region_index;
+   struct eventfd_ctx *dt_trigger;
+   bool is_highend_trapped;
+   bool is_trap_triggered;
 };
 
 static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];
 
+static bool is_handle_valid(int handle)
+{
+   mutex_lock(&device_bit_lock);
+
+   if (handle >= MAX_OPEN_DEVICE || !igd_device_array[handle] ||
+   !test_bit(handle, igd_device_bits)) {
+   pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+   __func__);
+   mutex_unlock(&device_bit_lock);
+   return false;
+   }
+   mutex_unlock(&device_bit_lock);
+   return true;
+}
+
 int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
 {
int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
@@ -88,6 +111,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
igd_device->vendor = pdev->vendor;
igd_device->device = pdev->device;
igd_device->handle = handle;
+   igd_device->dt_region_index = -1;
igd_device_array[handle] = igd_device;
set_bit(handle, igd_device_bits);
 
@@ -95,6 +119,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
pdev->vendor, pdev->device, handle);
 
*mediate_handle = handle;
+   *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;
 
 error:
    mutex_unlock(&device_bit_lock);
@@ -135,14 +160,165 @@ static void igd_dt_get_region_info(int handle,
struct vfio_info_cap *caps,
struct vfio_region_info_cap_type *cap_type)
 {
+   struct vfio_region_info_cap_sparse_mmap *sparse;
+   size_t size;
+   int nr_areas, ret;
+
+   if (!is_handle_valid(handle))
+   return;
+
+   switch (info->index) {
+   case VFIO_PCI_BAR0_REGION_INDEX:
+   info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+   nr_areas = 1;
+
+   size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+   sparse = kzalloc(size, GFP_KERNEL);
+   if (!sparse)
+   return;
+
+   sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+   sparse->header.version = 1;
+   sparse->nr_areas = nr_areas;
+
+   sparse->areas[0].offset = BAR0_DYNAMIC_TRAP_OFFSET;
+   sparse->areas[0].size = BAR0_DYNAMIC_TRAP_SIZE;
+   sparse->areas[0].disablable = 1;//able to get disabled
+
+   ret = vfio_info_add_capability(caps, &sparse->header,
+   size);
+   kfree(sparse);
+   break;
+   case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+   case VFIO_PCI_CONFIG_REGION_INDEX:
+   case VFIO_PCI_ROM_REGION_INDEX:
+   case VFIO_PCI_VGA_REGION_INDEX:
+   break;
+   default:
+   if ((cap_type->type ==
+   VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO) &&
+   (cap_type->subtype ==
+

[RFC PATCH 8/9] i40e/vf_migration: mediate migration region

2019-12-04 Thread Yan Zhao

In vfio_pci_mediate_ops->get_region_info(), the migration region's len and
flags are overridden and its region index is saved.

vfio_pci_mediate_ops->rw() and vfio_pci_mediate_ops->mmap() override the
default rw/mmap for the migration region.

This is only a sample implementation in i40e vf migration to demonstrate
what the vf migration code will look like. The actual dirty page tracking
and device state retrieving code will be sent in the future. Currently only
comments are used as placeholders.
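
The device_state combinations that set_device_state() in this patch accepts can be sketched as a standalone classifier. The DEV_STATE_* constants below are local mirrors of the bit layout described for device_state (bit 0 _RUNNING, bit 1 _SAVING, bit 2 _RESUMING); they are assumptions for illustration and not the uapi macros, and enum dev_phase and classify_device_state() are hypothetical names.

```c
#include <stdint.h>

/* Local mirrors of the described bit layout, not the real uapi macros. */
#define DEV_STATE_RUNNING  (1u << 0)
#define DEV_STATE_SAVING   (1u << 1)
#define DEV_STATE_RESUMING (1u << 2)
#define DEV_STATE_MASK     (DEV_STATE_RUNNING | DEV_STATE_SAVING | \
                            DEV_STATE_RESUMING)

/* Hypothetical names for the phases the switch distinguishes. */
enum dev_phase {
    PHASE_STOPPED,
    PHASE_RUNNING,
    PHASE_PRECOPY,        /* _SAVING | _RUNNING: first dirty page scan */
    PHASE_STOP_AND_COPY,  /* _SAVING only: last dirty page scan */
    PHASE_RESUMING,
    PHASE_INVALID,
};

enum dev_phase classify_device_state(uint32_t state)
{
    if (state & ~DEV_STATE_MASK) {
        return PHASE_INVALID;   /* reserved bits must stay clear */
    }

    switch (state) {
    case 0:
        return PHASE_STOPPED;
    case DEV_STATE_RUNNING:
        return PHASE_RUNNING;
    case DEV_STATE_SAVING | DEV_STATE_RUNNING:
        return PHASE_PRECOPY;
    case DEV_STATE_SAVING:
        return PHASE_STOP_AND_COPY;
    case DEV_STATE_RESUMING:
        return PHASE_RESUMING;
    default:
        /* e.g. _SAVING and _RESUMING set at the same time */
        return PHASE_INVALID;
    }
}
```

Any combination outside these five, such as _SAVING together with _RESUMING, is treated as invalid, which corresponds to the default branch returning an error in the patch.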

It's based on QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html).

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 335 +-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  14 +
 2 files changed, 345 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index b2d913459600..5bb509fed66e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -14,6 +14,55 @@ static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG + 1];
 static DEFINE_MUTEX(device_bit_lock);
 static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];
 
+static bool is_handle_valid(int handle)
+{
+   mutex_lock(&device_bit_lock);
+
+   if (handle >= MAX_OPEN_DEVICE || !i40e_vf_dev_array[handle] ||
+   !test_bit(handle, open_device_bits)) {
+   pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+  __func__);
+   mutex_unlock(&device_bit_lock);
+   return false;
+   }
+   mutex_unlock(&device_bit_lock);
+   return true;
+}
+
+static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 state)
+{
+   int ret = 0;
+   struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+
+   if (state == mig_ctl->device_state)
+   return ret;
+
+   switch (state) {
+   case VFIO_DEVICE_STATE_RUNNING:
+   break;
+   case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+   // alloc dirty page tracking resources and
+   // do the first round dirty page scanning
+   break;
+   case VFIO_DEVICE_STATE_SAVING:
+   // do the last round of dirty page scanning
+   break;
+   case ~VFIO_DEVICE_STATE_MASK & VFIO_DEVICE_STATE_MASK:
+   // release dirty page tracking resources
+   //if (mig_ctl->device_state == VFIO_DEVICE_STATE_SAVING)
+   //  i40e_release_scan_resources(i40e_vf_dev);
+   break;
+   case VFIO_DEVICE_STATE_RESUMING:
+   break;
+   default:
+   ret = -EFAULT;
+   }
+
+   mig_ctl->device_state = state;
+
+   return ret;
+}
+
 int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
 {
int i, ret = 0;
@@ -24,6 +73,8 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
struct i40e_vf *vf;
unsigned int vf_devfn, devfn;
int vf_id = -1;
+   struct vfio_device_migration_info *mig_ctl = NULL;
+   void *dirty_bitmap_base = NULL;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -68,18 +119,41 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
i40e_vf_dev->vf_dev = vf_dev;
i40e_vf_dev->handle = handle;
 
-   pr_info("%s: device %x %x, vf id %d, handle=%x\n",
-   __func__, pdev->vendor, pdev->device, vf_id, handle);
+   mig_ctl = kzalloc(sizeof(*mig_ctl), GFP_KERNEL);
+   if (!mig_ctl) {
+   ret = -ENOMEM;
+   goto error;
+   }
+
+   dirty_bitmap_base = vmalloc_user(MIGRATION_DIRTY_BITMAP_SIZE);
+   if (!dirty_bitmap_base) {
+   ret = -ENOMEM;
+   goto error;
+   }
+
+   i40e_vf_dev->dirty_bitmap = dirty_bitmap_base;
+   i40e_vf_dev->mig_ctl = mig_ctl;
+   i40e_vf_dev->migration_region_size = DIRTY_BITMAP_OFFSET +
+   MIGRATION_DIRTY_BITMAP_SIZE;
+   i40e_vf_dev->migration_region_index = -1;
+
+   vf = &pf->vf[vf_id];
 
i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
-   vf = &pf->vf[vf_id];
*dm_handle = handle;
+
+   *caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+
+   pr_info("%s: device %x %x, vf id %d, handle=%x\n",
+   __func__, pdev->vendor, pdev->device, vf_id, handle);
 error:
    mutex_unlock(&device_bit_lock);
 
if (ret < 0) {
module_put(THIS_MODULE);
+   kfree(mig_ctl);
+   vfree(dirty_bitmap_base);
kfree(i40e_vf_dev);
}
 
@@ -112,32 +186,285 @@ void i40e_vf_migration_release(int handle)
i40e_vf_dev->vf_vendor, i40e_vf_dev->vf_device,

[RFC PATCH 3/9] vfio/pci: register a default migration region

2019-12-04 Thread Yan Zhao

The vendor driver specifies whether it supports a migration region through
the cap VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it creates a default migration region on
behalf of the vendor driver with region len=0 and region->ops=null.
The vendor driver should override this region's len, flags, rw, mmap in
its vfio_pci_mediate_ops.
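
The cap-driven registration described above can be sketched outside the kernel as follows. Only the cap value (0x01) and the region type value (3) come from this patch; struct region, struct device, MAX_REGIONS and register_default_regions() are hypothetical simplifications of what vfio-pci does in vfio_pci_open().

```c
#include <stddef.h>
#include <stdint.h>

#define VFIO_PCI_DEVICE_CAP_MIGRATION 0x01  /* value from the patch */
#define REGION_TYPE_MIGRATION         3     /* VFIO_REGION_TYPE_MIGRATION */
#define MAX_REGIONS                   4     /* arbitrary for this sketch */

/* Hypothetical, simplified region record. */
struct region {
    uint32_t type;
    size_t len;
    const void *ops;
};

struct device {
    struct region regions[MAX_REGIONS];
    int num_regions;
};

/*
 * Register a placeholder migration region only when the vendor driver
 * reported the cap bit from open(). Returns the number of regions added,
 * or -1 when the region table is full.
 */
int register_default_regions(struct device *dev, uint64_t caps)
{
    struct region *r;

    if (!(caps & VFIO_PCI_DEVICE_CAP_MIGRATION)) {
        return 0;
    }
    if (dev->num_regions >= MAX_REGIONS) {
        return -1;
    }

    r = &dev->regions[dev->num_regions++];
    r->type = REGION_TYPE_MIGRATION;
    r->len = 0;      /* the vendor driver is expected to override this */
    r->ops = NULL;   /* and to supply rw/mmap through its mediate ops */
    return 1;
}
```

The placeholder has no ops of its own, which is why callers must check region->ops for NULL before dereferencing it.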

This migration region definition is aligned to QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c |  15 
 include/linux/vfio.h|   1 +
 include/uapi/linux/vfio.h   | 149 
 3 files changed, 165 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f3730252ee82..059660328be2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+/**
+ * init a region to hold migration ctl & data
+ */
+void init_migration_region(struct vfio_pci_device *vdev)
+{
+   vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION,
+   NULL, 0,
+   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+   NULL);
+}
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
struct resource *res;
@@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
vdev->mediate_ops = mentry->ops;
vdev->mediate_handle = handle;
 
+   if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
+   init_migration_region(vdev);
+
    pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0265e779acd1..cddea8e9dcb2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,6 +197,7 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
 struct vfio_pci_mediate_ops {
    char *name;
+#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
    void (*release)(int handle);
    void (*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..caf8845a67a6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -306,6 +306,155 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_GFX            (1)
 #define VFIO_REGION_TYPE_CCW   (2)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION  (3)
+#define VFIO_REGION_SUBTYPE_MIGRATION   (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *  To indicate vendor driver the state VFIO device should be transitioned
+ *  to. If device state transition fails, write on this field return error.
+ *  It consists of 3 bits:
+ *  - If bit 0 set, indicates _RUNNING state. When it is reset, that indicates
+ *    _STOPPED state. When device is changed to _STOPPED, driver should stop
+ *    device before write() returns.
+ *  - If bit 1 set, indicates _SAVING state.
+ *  - If bit 2 set, indicates _RESUMING state.
+ *  Bits 3 - 31 are reserved for future use. User should perform
+ *  read-modify-write operation on this field.
+ *  _SAVING and _RESUMING bits set at the same time is invalid state.
+ *
+ * pending bytes: (read only)
+ *  Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *  User application should read data_offset in migration region from where
+ *  user application should read device data during _SAVING state or write
+ *  device data during _RESUMING state or read dirty pages bitmap. See below
+ *  for detail of sequence to be followed.
+ *
+ * data_size: (read/write)
+ *  User application should read data_size to get size of data copied in
+ *  migration region during _SAVING state and write size of data copied in
+ *  migration region during _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *  Start address pfn to get bitmap of dirty pages from vendor driver during
+ *  _SAVING state.
+ *
+ *

[RFC PATCH 5/9] samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD

2019-12-04 Thread Yan Zhao

This is a sample driver to use mediate ops for passthrough IGDs.

This sample driver does not directly bind to IGD device but defines what
IGD devices to support via a pciidlist.

It registers its vfio_pci_mediate_ops to vfio-pci on driver loading.

When vfio_pci->open() calls vfio_pci_mediate_ops->open(), the sample
driver checks the vendor id and device id of the pdev passed in. If they
match an entry in pciidlist, success is returned; otherwise, failure is
returned.

After a successful vfio_pci_mediate_ops->open(), vfio-pci further calls
the .get_region_info/.rw/.mmap interfaces with a mediate handle for each
region, so accesses to the regions get mediated/customized.

When vfio-pci->release() is called on the IGD, it first calls
vfio_pci_mediate_ops->release() with a mediate_handle to close the
opened IGD device instance in this sample driver.

This sample driver unregisters its vfio_pci_mediate_ops on driver exit.
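
The id check performed in open() can be sketched in plain C. The two vendor/device pairs are the ones listed in this patch's pciidlist; struct id_entry, supported_ids and device_is_supported() are hypothetical names used only for this illustration, which ignores the subvendor/class fields of struct pci_device_id.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced id record: only the fields the sample driver compares. */
struct id_entry {
    uint16_t vendor;
    uint16_t device;
};

/* The two IGD ids listed in the patch's pciidlist. */
static const struct id_entry supported_ids[] = {
    { 0x8086, 0x5927 },
    { 0x8086, 0x193b },
};

/* Returns true when the vendor/device pair matches a table entry. */
bool device_is_supported(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(supported_ids) / sizeof(supported_ids[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        if (supported_ids[i].vendor == vendor &&
            supported_ids[i].device == device) {
            return true;
        }
    }
    return false;
}
```

Supporting another device in this scheme only requires adding its id pair to the table.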

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 samples/Kconfig   |   6 ++
 samples/Makefile  |   1 +
 samples/vfio-pci/Makefile |   2 +
 samples/vfio-pci/igd_dt.c | 191 ++
 4 files changed, 200 insertions(+)
 create mode 100644 samples/vfio-pci/Makefile
 create mode 100644 samples/vfio-pci/igd_dt.c

diff --git a/samples/Kconfig b/samples/Kconfig
index c8dacb4dda80..2da42a725c03 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -169,4 +169,10 @@ config SAMPLE_VFS
  as mount API and statx().  Note that this is restricted to the x86
  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_VFIO_PCI_IGD_DT
+   tristate "Build example driver to dynamically trap a passthroughed device bound to VFIO-PCI -- loadable modules only"
+   depends on VFIO_PCI && m
+   help
+ Build a sample driver to show how to dynamically trap a passthroughed device that is bound to VFIO-PCI
+
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 7d6e4ca28d69..f0f422e7dd11 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -18,5 +18,6 @@ subdir-$(CONFIG_SAMPLE_SECCOMP)   += seccomp
 obj-$(CONFIG_SAMPLE_TRACE_EVENTS)  += trace_events/
 obj-$(CONFIG_SAMPLE_TRACE_PRINTK)  += trace_printk/
 obj-$(CONFIG_VIDEO_PCI_SKELETON)   += v4l/
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT)   += vfio-pci/
 obj-y  += vfio-mdev/
 subdir-$(CONFIG_SAMPLE_VFS)+= vfs
diff --git a/samples/vfio-pci/Makefile b/samples/vfio-pci/Makefile
new file mode 100644
index ..4b8acc145d65
--- /dev/null
+++ b/samples/vfio-pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT) += igd_dt.o
diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
new file mode 100644
index ..857e8d01b0d1
--- /dev/null
+++ b/samples/vfio-pci/igd_dt.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic trap IGD device that bound to vfio-pci device driver
+ * Copyright(c) 2019 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "Intel Corporation"
+
+/* helper macros copied from vfio-pci */
+#define VFIO_PCI_OFFSET_SHIFT   40
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+/* This driver supports opening at most 256 devices */
+#define MAX_OPEN_DEVICE 256
+
+/*
+ * below are pciids of two IGD devices supported in this driver
+ * It is only for demo purpose.
+ * You can add more device ids in this list to support any pci devices
+ * that you want to dynamically trap its pci bars
+ */
+static const struct pci_device_id pciidlist[] = {
+   {0x8086, 0x5927, ~0, ~0, 0x3, 0xff, 0},
+   {0x8086, 0x193b, ~0, ~0, 0x3, 0xff, 0},
+};
+
+static long igd_device_bits[MAX_OPEN_DEVICE/BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+
+struct igd_dt_device {
+   __u32 vendor;
+   __u32 device;
+   __u32 handle;
+};
+
+static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];
+
+int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
+{
+   int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
+   int i, ret = 0;
+   struct igd_dt_device *igd_device;
+   int handle;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   for (i = 0; i < supported_dev_cnt; i++) {
+   if (pciidlist[i].vendor == pdev->vendor &&
+   pciidlist[i].device == pdev->device)
+   goto support;
+   }
+
+

[RFC PATCH 7/9] i40e/vf_migration: register mediate_ops to vfio-pci

2019-12-04 Thread Yan Zhao

Register vfio_pci_mediate_ops to vfio-pci when i40e binds to the PF, to
support mediating the vfio-pci ops of its VFs.
Unregister vfio_pci_mediate_ops when i40e unbinds from the PF.

vfio_pci_mediate_ops->open() returns success if the devfn of the device
passed in equals the devfn of one of the PF's VFs.
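
The devfn match in open() can be sketched as a pure function. The arithmetic follows the loop in this patch (PF devfn plus SR-IOV offset plus stride times the VF index, masked to 8 bits); struct sriov_params and vf_id_from_devfn() are hypothetical and flatten the fields that the patch reads from the PF's struct pci_dev.

```c
/* Hypothetical flattened view of the PF's SR-IOV parameters. */
struct sriov_params {
    unsigned int pf_devfn;
    unsigned int offset;   /* first VF offset */
    unsigned int stride;   /* VF stride */
    int num_vfs;
};

/*
 * Returns the index of the VF whose devfn equals 'vf_devfn', or -1 when
 * the device is not one of this PF's VFs. Like the patch, this compares
 * only the low 8 bits and does not look at the bus number.
 */
int vf_id_from_devfn(const struct sriov_params *p, unsigned int vf_devfn)
{
    int i;

    for (i = 0; i < p->num_vfs; i++) {
        unsigned int devfn = (p->pf_devfn + p->offset +
                              p->stride * (unsigned int)i) & 0xff;

        if (devfn == vf_devfn) {
            return i;
        }
    }
    return -1;
}
```

A return value of -1 corresponds to the -EINVAL path in i40e_vf_migration_open().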

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 drivers/net/ethernet/intel/Kconfig|   2 +-
 drivers/net/ethernet/intel/i40e/Makefile  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h|   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 169 ++
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  52 ++
 6 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 154e2e818ec6..b5c7fdf55380 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -240,7 +240,7 @@ config IXGBEVF_IPSEC
 config I40E
tristate "Intel(R) Ethernet Controller XL710 Family support"
imply PTP_1588_CLOCK
-   depends on PCI
+   depends on PCI && VFIO_PCI
---help---
  This driver supports Intel(R) Ethernet Controller XL710 Family of
  devices.  For more information on how to identify your adapter, go
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 2f21b3e89fd0..ae7a6a23dba9 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -24,6 +24,7 @@ i40e-objs := i40e_main.o \
i40e_ddp.o \
i40e_client.o   \
i40e_virtchnl_pf.o \
-   i40e_xsk.o
+   i40e_xsk.o  \
+   i40e_vf_migration.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 2af9f6308f84..0141c94b835f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -1162,4 +1162,6 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
 int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
  struct i40e_cloud_filter *filter,
  bool add);
+int i40e_vf_migration_register(void);
+void i40e_vf_migration_unregister(void);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..92d1c3fdc808 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15274,6 +15274,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/* print a string summarizing features */
i40e_print_features(pf);
 
+   i40e_vf_migration_register();
return 0;
 
/* Unwind what we've done if something failed in the setup */
@@ -15320,6 +15321,8 @@ static void i40e_remove(struct pci_dev *pdev)
i40e_status ret_code;
int i;
 
+   i40e_vf_migration_unregister();
+
i40e_dbg_pf_exit(pf);
 
i40e_ptp_stop(pf);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
new file mode 100644
index ..b2d913459600
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "i40e.h"
+#include "i40e_vf_migration.h"
+
+static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];
+
+int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
+{
+   int i, ret = 0;
+   struct i40e_vf_migration *i40e_vf_dev = NULL;
+   int handle;
+   struct pci_dev *pf_dev, *vf_dev;
+   struct i40e_pf *pf;
+   struct i40e_vf *vf;
+   unsigned int vf_devfn, devfn;
+   int vf_id = -1;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   pf_dev = pdev->physfn;
+   pf = pci_get_drvdata(pf_dev);
+   vf_dev = pdev;
+   vf_devfn = vf_dev->devfn;
+
+   for (i = 0; i < pci_num_vf(pf_dev); i++) {
+   devfn = (pf_dev->devfn + pf_dev->sriov->offset +
+pf_dev->sriov->stride * i) & 0xff;
+   if (devfn == vf_devfn) {
+   vf_id = i;
+   break;
+   }
+   }
+
+   if (vf_id == -1) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mutex_lock(&device_bit_lock);
+   handle = find_next_zero_bit(open_device_bits, MAX_OPEN_DEVICE, 0);
+

[RFC PATCH 2/9] vfio/pci: test existence before calling region->ops

2019-12-04 Thread Yan Zhao

For regions registered through vfio_pci_register_dev_region(),
check that region->ops is not null before calling into it.

In the next two patches, dev regions with null region->ops are
registered by default on behalf of the vendor driver, so we need this
check to prevent a null pointer dereference if the vendor driver forgets
to handle those dev regions.
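
The guard pattern this patch adds can be sketched with a reduced ops table. All names below (struct region_ops, struct region, region_rw(), region_release(), echo_rw(), mark_released()) are hypothetical; the sketch only shows that both the ops pointer and the individual callback are checked before the call.

```c
#include <errno.h>
#include <stddef.h>

struct region;

/* Reduced ops table: every callback is optional. */
struct region_ops {
    long (*rw)(struct region *r, char *buf, size_t count);
    void (*release)(struct region *r);
};

struct region {
    const struct region_ops *ops;  /* may be NULL for default regions */
    int released;
};

/* Reject the access when there is no ops table or no rw callback. */
long region_rw(struct region *r, char *buf, size_t count)
{
    if (!r->ops || !r->ops->rw) {
        return -EINVAL;
    }
    return r->ops->rw(r, buf, count);
}

/* Skip regions that have nothing to release. */
void region_release(struct region *r)
{
    if (!r->ops || !r->ops->release) {
        return;
    }
    r->ops->release(r);
}

/* Example callbacks used to exercise the guards. */
long echo_rw(struct region *r, char *buf, size_t count)
{
    (void)r;
    (void)buf;
    return (long)count;
}

void mark_released(struct region *r)
{
    r->released = 1;
}
```

Note that the release guard must skip a region when the callback is missing, not when it is present; otherwise the call that follows dereferences a null function pointer.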

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 55080ff29495..f3730252ee82 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -398,8 +398,12 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
vdev->virq_disabled = false;
 
-   for (i = 0; i < vdev->num_regions; i++)
+   for (i = 0; i < vdev->num_regions; i++) {
+   if (!vdev->region[i].ops || !vdev->region[i].ops->release)
+   continue;
+
    vdev->region[i].ops->release(vdev, &vdev->region[i]);
+   }
 
vdev->num_regions = 0;
kfree(vdev->region);
@@ -900,7 +904,8 @@ static long vfio_pci_ioctl(void *device_data,
if (ret)
return ret;
 
-   if (vdev->region[i].ops->add_capability) {
+   if (vdev->region[i].ops &&
+   vdev->region[i].ops->add_capability) {
ret = vdev->region[i].ops->add_capability(vdev,
    &vdev->region[i], &caps);
if (ret)
@@ -1251,6 +1256,9 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
default:
index -= VFIO_PCI_NUM_REGIONS;
+   if (!vdev->region[index].ops || !vdev->region[index].ops->rw)
+   return -EINVAL;
+
return vdev->region[index].ops->rw(vdev, buf,
   count, ppos, iswrite);
}
-- 
2.17.1

[RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-04 Thread Yan Zhao

The dynamic trap bar info region is a channel through which QEMU and the
vendor driver communicate dynamic trap info. It is of type
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.

This region has two fields: dt_fd and trap.
When QEMU detects a device region of this type, it creates an eventfd and
writes its eventfd id to the dt_fd field.
When the vendor driver signals this eventfd, QEMU reads the trap field of
this info region.
- If trap is true, QEMU searches the device's PCI BAR regions and disables
all the sparse mmaped subregions (if the sparse mmaped subregion is
disablable).
- If trap is false, QEMU re-enables those subregions.

A typical usage is
1. The vendor driver first cuts its bar 0 into several sections, all in a
sparse mmap array. So initially, all of bar 0 is passed through.
2. The vendor driver specifies some of the bar 0 sections as disablable.
3. When migration starts, the vendor driver signals dt_fd and sets trap to
true to notify QEMU to disable the bar 0 sections that have the disablable
flag set.
4. QEMU disables those bar 0 sections, which lets the vendor driver trap
accesses to bar 0 registers and makes dirty page tracking possible.
5. On migration failure, the vendor driver signals dt_fd to QEMU again.
QEMU reads the trap field of this info region, which is now false, and
passes through the whole bar 0 region again.
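The signalling pattern in the steps above can be tried out in userspace with a Linux eventfd. The sketch below is only a model under stated assumptions: `demo_trap_info` and the `demo_*` functions are hypothetical, the "vendor driver" side uses write(2) where a kernel driver would use eventfd_signal(), and the trap flag is updated before signalling so that the reader never observes a stale value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical in-memory model of the info region's two fields. */
struct demo_trap_info {
	int dt_fd;	/* eventfd supplied by the consumer (QEMU's role) */
	bool trap;	/* true: trap the disablable sub-regions */
};

/* Consumer side: create the eventfd and publish it in dt_fd. */
int demo_consumer_setup(struct demo_trap_info *info)
{
	info->dt_fd = eventfd(0, EFD_NONBLOCK);
	info->trap = false;
	return info->dt_fd < 0 ? -1 : 0;
}

/* Producer side (vendor driver's role): update trap, then signal. */
int demo_producer_set_trap(struct demo_trap_info *info, bool trap)
{
	uint64_t one = 1;

	info->trap = trap;
	return write(info->dt_fd, &one, sizeof(one)) ==
	       (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: returns 1 if trapping was requested, 0 if passthrough
 * was requested, and -1 if no signal is pending.
 */
int demo_consumer_poll(struct demo_trap_info *info)
{
	uint64_t count;

	if (read(info->dt_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return info->trap ? 1 : 0;
}
```

A consumer would normally wait on dt_fd with poll() or an event loop rather than calling a non-blocking read directly; the direct read keeps the sketch short.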

The vendor driver indicates whether it supports the dynamic-trap-bar-info
region through the cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it creates a default dynamic_trap_bar_info
region on behalf of the vendor driver with region len=0 and
region->ops=NULL.
The vendor driver should override this region's len, flags, rw, mmap in its
vfio_pci_mediate_ops.

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 16 
 include/linux/vfio.h|  3 ++-
 include/uapi/linux/vfio.h   | 11 +++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 059660328be2..62b811ca43e4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
NULL);
 }
 
+/**
+ * register a region to hold info for dynamically trap bar regions
+ */
+void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
+{
+   vfio_pci_register_dev_region(vdev,
+   VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
+   VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
+   NULL, 0,
+   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+   NULL);
+}
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
struct resource *res;
@@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
init_migration_region(vdev);
 
+   if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
+   init_dynamic_trap_bar_info_region(vdev);
+
pr_info("vfio pci found mediate_ops %s, 
caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index cddea8e9dcb2..cf8ecf687bee 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
 struct vfio_pci_mediate_ops {
char*name;
-#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
+#define VFIO_PCI_DEVICE_CAP_MIGRATION  (0x01)
+#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR   (0x02)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
void(*release)(int handle);
void(*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index caf8845a67a6..74a2d0b57741 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -258,6 +258,9 @@ struct vfio_region_info {
 struct vfio_region_sparse_mmap_area {
__u64   offset; /* Offset of mmap'able area within region */
__u64   size;   /* Size of mmap'able area */
+   __u32   disablable; /* whether this mmap'able area can
+*  be dynamically disabled
+*/
 };
 
 struct vfio_region_info_cap_sparse_mmap {
@@ -454,6 +457,14 @@ struct vfio_device_migration_info {
 #define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
 } __attribute__((packed));
 
+/* Region type and sub-type to hold info to dynamically trap bars */
+#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO (4)
+#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO  (1)
+
+struct

[RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-04 Thread Yan Zhao

When vfio-pci is bound to a physical device, almost all the hardware
resources are passed through.
Sometimes, the vendor driver of this physical device may want to mediate
some hardware resource access for a short period of time, e.g. dirty page
tracking during live migration.

Here we introduce mediate ops in vfio-pci for this purpose.

A vendor driver can register mediate ops with vfio-pci.
But rather than binding directly to the passed-through device, the
vendor driver is now either a module that does not bind to any device or
a module that binds to another device.
E.g. when passing through a VF device that is bound to the vfio-pci module,
the PF driver that binds to the PF device can register with vfio-pci to
mediate the VF's regions, hence supporting VF live migration.

The sequence goes like this:
1. The vendor driver registers its vfio_pci_mediate_ops with the vfio-pci
driver.

2. vfio-pci maintains a list of the registered vfio_pci_mediate_ops.

3. Whenever vfio-pci opens a device, it searches the list and calls
vfio_pci_mediate_ops->open() to check whether a vendor driver supports
mediating this device.
When vfio_pci_mediate_ops->open() returns success, vfio-pci stops
searching the list and stores a mediate handle to represent this open in
the vendor driver.
(So if multiple vendor drivers support mediating a device through
vfio_pci_mediate_ops, only one will win, depending on their registration
order.)

4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
ops, it chains into vfio_pci_mediate_ops->get_region_info(), so that the
vendor driver is able to override a region's default flags and caps,
e.g. adding a sparse mmap cap to pass through only sub-regions of a whole
region.

5. vfio_pci_rw()/vfio_pci_mmap() first call into
vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmap().
If pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() go on to pass this
read/write/mmap through to the physical device; otherwise they just
return without touching the physical device.

6. When vfio-pci closes a device, vfio_pci_release() chains into
vfio_pci_mediate_ops->release() to close the reference in the vendor driver.

7. The vendor driver unregisters its vfio_pci_mediate_ops when the driver
exits.
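Steps 1 to 3 of the sequence above (register, keep a list, first successful open() wins) can be sketched as follows. This is a simplified userspace model, not the patch's implementation: the `demo_*` names are hypothetical, a fixed array replaces the kernel linked list, the device is identified by a bare devfn number, and locking and refcounting are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_OPS 8

/* Hypothetical, reduced form of a mediate ops table. */
struct demo_mediate_ops {
	const char *name;
	/* Return 0 and fill *handle if this driver can mediate devfn. */
	int (*open)(unsigned int devfn, unsigned int *handle);
};

/* Registered ops, kept in registration order. */
const struct demo_mediate_ops *demo_ops_list[DEMO_MAX_OPS];
size_t demo_ops_count;

int demo_register(const struct demo_mediate_ops *ops)
{
	assert(ops && ops->open);
	if (demo_ops_count == DEMO_MAX_OPS)
		return -1;
	demo_ops_list[demo_ops_count++] = ops;
	return 0;
}

/* Walk the list in registration order; the first successful open() wins. */
const struct demo_mediate_ops *demo_open(unsigned int devfn,
					 unsigned int *handle)
{
	for (size_t i = 0; i < demo_ops_count; i++) {
		if (demo_ops_list[i]->open(devfn, handle) == 0)
			return demo_ops_list[i];
	}
	return NULL;	/* no mediation: plain passthrough */
}

/* Two toy drivers; both claim devfn 0x10, only "b" claims 0x11. */
int demo_open_a(unsigned int devfn, unsigned int *handle)
{
	if (devfn != 0x10)
		return -1;
	*handle = 1;
	return 0;
}

int demo_open_b(unsigned int devfn, unsigned int *handle)
{
	if (devfn != 0x10 && devfn != 0x11)
		return -1;
	*handle = 2;
	return 0;
}

const struct demo_mediate_ops demo_a = { "a", demo_open_a };
const struct demo_mediate_ops demo_b = { "b", demo_open_b };
```

With "a" registered before "b", opening devfn 0x10 binds to "a" even though "b" would also accept it, which is the "only one will win" behaviour noted in step 3.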

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 146 
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/linux/vfio.h|  16 +++
 3 files changed, 164 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 02206162eaa9..55080ff29495 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 "Disable using the PCI D3 low power state for idle, unused 
devices");
 
+static LIST_HEAD(mediate_ops_list);
+static DEFINE_MUTEX(mediate_ops_list_lock);
+struct vfio_pci_mediate_ops_list_entry {
+   struct vfio_pci_mediate_ops *ops;
+   int refcnt;
+   struct list_headnext;
+};
+
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
+   if (vdev->mediate_ops && vdev->mediate_ops->release) {
+   vdev->mediate_ops->release(vdev->mediate_handle);
+   vdev->mediate_ops = NULL;
+   }
}
 
mutex_unlock(&vdev->reflck->lock);
@@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
 {
struct vfio_pci_device *vdev = device_data;
int ret = 0;
+   struct vfio_pci_mediate_ops_list_entry *mentry;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
goto error;
 
vfio_spapr_pci_eeh_open(vdev->pdev);
+   mutex_lock(&mediate_ops_list_lock);
+   list_for_each_entry(mentry, &mediate_ops_list, next) {
+   u64 caps;
+   u32 handle;
+
+   memset(&caps, 0, sizeof(caps));
+   ret = mentry->ops->open(vdev->pdev, &caps, &handle);
+   if (!ret)  {
+   vdev->mediate_ops = mentry->ops;
+   vdev->mediate_handle = handle;
+
+   pr_info("vfio pci found mediate_ops %s, 
caps=%llx, handle=%x for %x:%x\n",
+   vdev->mediate_ops->name, caps,
+   handle, vdev->pdev->vendor,
+   vdev->pdev->device);
+   /*
+* only find the first matching mediate_ops,
+*

[PATCH v4 1/2] block/nbd: extract the common cleanup code

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

The BDRVNBDState cleanup code is duplicated in two places; add an
nbd_clear_bdrvstate() function to do these cleanups.
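The reason for resetting each freed field to NULL can be shown with a small standalone sketch. It assumes plain libc `free()` instead of the GLib and QAPI helpers used in block/nbd.c, and `demo_state`, `demo_clear_state()` and `demo_strdup()` are hypothetical names: once every pointer is cleared, calling the cleanup from more than one path is harmless because `free(NULL)` is a no-op.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state object owning two heap-allocated strings. */
struct demo_state {
	char *export_name;
	char *tls_id;
};

/* Free the owned fields and reset them so a second call is a no-op. */
void demo_clear_state(struct demo_state *s)
{
	assert(s);
	free(s->export_name);
	s->export_name = NULL;
	free(s->tls_id);
	s->tls_id = NULL;
}

/* Small helper: duplicate a string on the heap (NULL on failure). */
char *demo_strdup(const char *src)
{
	size_t len = strlen(src) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, src, len);
	return copy;
}
```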

Signed-off-by: Stefano Garzarella 
Signed-off-by: Pan Nengyuan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
v3:
- new patch, split from 2/2 patch (suggested by Stefano Garzarella)
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to
  nbd_clear_bdrvstate and set cleared fields to NULL (suggested by Eric
  Blake)
- remove static function prototype. (suggested by Eric Blake)
---
 block/nbd.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 1239761..8b4a65a 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -94,6 +94,19 @@ typedef struct BDRVNBDState {
 
 static int nbd_client_connect(BlockDriverState *bs, Error **errp);
 
+void nbd_clear_bdrvstate(BDRVNBDState *s)
+{
+object_unref(OBJECT(s->tlscreds));
+qapi_free_SocketAddress(s->saddr);
+s->saddr = NULL;
+g_free(s->export);
+s->export = NULL;
+g_free(s->tlscredsid);
+s->tlscredsid = NULL;
+g_free(s->x_dirty_bitmap);
+s->x_dirty_bitmap = NULL;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -1855,10 +1868,7 @@ static int nbd_process_options(BlockDriverState *bs, 
QDict *options,
 
  error:
 if (ret < 0) {
-object_unref(OBJECT(s->tlscreds));
-qapi_free_SocketAddress(s->saddr);
-g_free(s->export);
-g_free(s->tlscredsid);
+nbd_clear_bdrvstate(s);
 }
 qemu_opts_del(opts);
 return ret;
@@ -1937,12 +1947,7 @@ static void nbd_close(BlockDriverState *bs)
 BDRVNBDState *s = bs->opaque;
 
 nbd_client_close(bs);
-
-object_unref(OBJECT(s->tlscreds));
-qapi_free_SocketAddress(s->saddr);
-g_free(s->export);
-g_free(s->tlscredsid);
-g_free(s->x_dirty_bitmap);
+nbd_clear_bdrvstate(s);
 }
 
 static int64_t nbd_getlength(BlockDriverState *bs)
-- 
2.7.2.windows.1

[RFC PATCH 9/9] i40e/vf_migration: support dynamic trap of bar0

2019-12-04 Thread Yan Zhao

Mediate the dynamic_trap_info region to dynamically trap bar0.

bar0 is sparsely mmaped into 5 sub-regions, of which only two need to be
dynamically trapped.
By mediating the dynamic_trap_info region and passing this information to
QEMU, the two sub-regions of bar0 can be trapped when migration starts and
put back into passthrough when migration fails.
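The five-way split described above can be illustrated with a small helper that carves a region around two one-page windows. This is a sketch under assumptions: `demo_area`, `demo_split_bar()` and `DEMO_PAGE_SIZE` are hypothetical, the offsets in any real driver come from its register layout, and the helper produces contiguous areas with no gaps.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical mirror of a sparse mmap area with a disablable flag. */
struct demo_area {
	uint64_t offset;
	uint64_t size;
	uint32_t disablable;	/* 1: may be switched between trap/passthrough */
};

/*
 * Split [0, region_size) into five contiguous areas around two one-page
 * windows starting at trap_a and trap_b (trap_a before trap_b).
 * Only the two windows are marked disablable. Returns 0 on success and
 * -1 if the windows overlap or do not fit in the region.
 */
int demo_split_bar(uint64_t region_size, uint64_t trap_a, uint64_t trap_b,
		   struct demo_area areas[5])
{
	uint64_t after_a = trap_a + DEMO_PAGE_SIZE;
	uint64_t after_b = trap_b + DEMO_PAGE_SIZE;

	if (after_a > trap_b || after_b > region_size)
		return -1;

	areas[0] = (struct demo_area){ 0, trap_a, 0 };
	areas[1] = (struct demo_area){ trap_a, DEMO_PAGE_SIZE, 1 };
	areas[2] = (struct demo_area){ after_a, trap_b - after_a, 0 };
	areas[3] = (struct demo_area){ trap_b, DEMO_PAGE_SIZE, 1 };
	areas[4] = (struct demo_area){ after_b, region_size - after_b, 0 };
	return 0;
}
```

Keeping the trapped windows to single pages limits the cost of trapping: only accesses that land in those two pages leave the fast mmap path while migration is active.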

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 140 +-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  12 ++
 2 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c 
b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 5bb509fed66e..0b9d5be85049 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -29,6 +29,21 @@ static bool is_handle_valid(int handle)
return true;
 }
 
+static
+void i40e_vf_migration_dynamic_trap_bar(struct i40e_vf_migration *i40e_vf_dev)
+{
+   if (i40e_vf_dev->dt_trigger)
+   eventfd_signal(i40e_vf_dev->dt_trigger, 1);
+}
+
+static void i40e_vf_trap_bar0(struct i40e_vf_migration *i40e_vf_dev, bool trap)
+{
+   if (i40e_vf_dev->trap_bar0 != trap) {
+   i40e_vf_dev->trap_bar0 = trap;
+   i40e_vf_migration_dynamic_trap_bar(i40e_vf_dev);
+   }
+}
+
 static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 
state)
 {
int ret = 0;
@@ -39,8 +54,10 @@ static size_t set_device_state(struct i40e_vf_migration 
*i40e_vf_dev, u32 state)
 
switch (state) {
case VFIO_DEVICE_STATE_RUNNING:
+   i40e_vf_trap_bar0(i40e_vf_dev, false);
break;
case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+   i40e_vf_trap_bar0(i40e_vf_dev, true);
// alloc dirty page tracking resources and
// do the first round dirty page scanning
break;
@@ -137,16 +154,22 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 
*caps, u32 *dm_handle)
MIGRATION_DIRTY_BITMAP_SIZE;
i40e_vf_dev->migration_region_index = -1;
 
+   i40e_vf_dev->dt_region_index = -1;
+   i40e_vf_dev->trap_bar0 = false;
+
vf = >vf[vf_id];
 
i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
+
*dm_handle = handle;
 
*caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+   *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;
 
pr_info("%s: device %x %x, vf id %d, handle=%x\n",
__func__, pdev->vendor, pdev->device, vf_id, handle);
+
 error:
mutex_unlock(_bit_lock);
 
@@ -188,6 +211,10 @@ void i40e_vf_migration_release(int handle)
 
kfree(i40e_vf_dev->mig_ctl);
vfree(i40e_vf_dev->dirty_bitmap);
+
+   if (i40e_vf_dev->dt_trigger)
+   eventfd_ctx_put(i40e_vf_dev->dt_trigger);
+
kfree(i40e_vf_dev);
 
module_put(THIS_MODULE);
@@ -216,6 +243,47 @@ static void migration_region_sparse_mmap_cap(struct 
vfio_info_cap *caps)
kfree(sparse);
 }
 
+static void bar0_sparse_mmap_cap(struct vfio_region_info *info,
+struct vfio_info_cap *caps)
+{
+   struct vfio_region_info_cap_sparse_mmap *sparse;
+   size_t size;
+   int nr_areas = 5;
+
+   size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+   sparse = kzalloc(size, GFP_KERNEL);
+   if (!sparse)
+   return;
+
+   sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+   sparse->header.version = 1;
+   sparse->nr_areas = nr_areas;
+
+   sparse->areas[0].offset = 0;
+   sparse->areas[0].size = IAVF_VF_TAIL_START;
+   sparse->areas[0].disablable = 0; /* always passed through */
+
+   sparse->areas[1].offset = IAVF_VF_TAIL_START;
+   sparse->areas[1].size = PAGE_SIZE;
+   sparse->areas[1].disablable = 1; /* able to get toggled */
+
+   sparse->areas[2].offset = IAVF_VF_TAIL_START + PAGE_SIZE;
+   sparse->areas[2].size = IAVF_VF_ARQH1 - sparse->areas[2].offset;
+   sparse->areas[2].disablable = 0; /* always passed through */
+
+   sparse->areas[3].offset = IAVF_VF_ARQT1;
+   sparse->areas[3].size = PAGE_SIZE;
+   sparse->areas[3].disablable = 1; /* able to get toggled */
+
+   sparse->areas[4].offset = IAVF_VF_ARQT1 + PAGE_SIZE;
+   sparse->areas[4].size = info->size - sparse->areas[4].offset;
+   sparse->areas[4].disablable = 0; /* always passed through */
+
+   vfio_info_add_capability(caps, &sparse->header, size);
+   kfree(sparse);
+}
+
 static void
 i40e_vf_migration_get_region_info(int handle,
  struct vfio_region_info *info,
@@ -227,9 +295,8 @@ i40e_vf_migration_get_region_info(int handle,
 
switch (info->index) {
case VFIO_PCI_BAR0_REGION_INDEX:
-   info->flags = VFIO_REGION_INFO_FLAG_READ |
-

[RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-04 Thread Yan Zhao

For SRIOV devices, VFs are passed through into the guest directly without
host driver mediation. However, when migrating VMs with passed-through VFs,
dynamic host mediation is required to (1) get device states and (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in the PF driver to centrally control
VFs' migration.

Therefore, in order to (1) pass through VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and (3) centralize
VF critical state retrieval and VF controls in one driver, we propose
to introduce mediate ops on top of the current vfio-pci device driver.


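                                   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

 __________   register mediate ops|  ___________     ___________    |
|          |<-----------------------|     VF    |   |           |
| vfio-pci |                      | |  mediate  |   | PF driver |   |
|__________|----------------------->|   driver  |   |___________|
     |            open(pdev)      |  -----------          |         |
     |                                                    |
     |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
    \|/                                                  \|/
-----------                                         ------------
|    VF   |                                         |    PF    |
-----------                                         ------------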


The VF mediate driver could be a standalone driver that does not bind to
any devices (as in the demo code in patches 5-6) or it could be a built-in
extension of the PF driver (as in patches 7-9).

Rather than binding directly to the VF, the VF mediate driver registers
mediate ops with vfio-pci at driver init. vfio-pci maintains a list of such
mediate ops.
(Note that the VF mediate driver can register mediate ops with vfio-pci
before vfio-pci binds to any devices, and the VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with the pdev of the
device being opened as a parameter.
The VF mediate driver should return success or failure depending on whether
it supports the pdev.
E.g. the VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it stops
querying other mediate ops and binds the device being opened to this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF are intercepted by the VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and can be customized there.
vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap additionally
return 'pt' to indicate whether vfio-pci should go on to pass the access
through to hw.
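The role of 'pt' in the rw path can be sketched as below. This is an illustrative userspace model only: the callback type and the `demo_*` functions are hypothetical and far simpler than the real signatures, which also carry a handle, file offset and user buffer. The mediate callback runs first and decides through `*pt` whether the default hardware path still runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mediate hook: returns bytes handled and sets *pt. */
typedef long (*demo_mediate_rw_fn)(char *buf, size_t count, bool *pt);

/* Stand-in for the default passthrough access to the device. */
long demo_hw_rw(char *buf, size_t count)
{
	for (size_t i = 0; i < count; i++)
		buf[i] = 'H';
	return (long)count;
}

/* Call the mediate hook first; fall through to hardware only if pt is set. */
long demo_rw(demo_mediate_rw_fn mediate, char *buf, size_t count)
{
	assert(buf);
	if (mediate) {
		bool pt = true;
		long ret = mediate(buf, count, &pt);

		if (!pt)
			return ret;	/* fully handled by the mediate driver */
	}
	return demo_hw_rw(buf, count);
}

/* Hook that traps the access and answers it itself. */
long demo_trap_rw(char *buf, size_t count, bool *pt)
{
	for (size_t i = 0; i < count; i++)
		buf[i] = 'M';
	*pt = false;
	return (long)count;
}

/* Hook that only observes and lets the access go to hardware. */
long demo_passthrough_rw(char *buf, size_t count, bool *pt)
{
	(void)buf;
	(void)count;
	*pt = true;
	return 0;
}
```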

When vfio-pci closes the VF, it calls vfio_pci_mediate_ops->release()
with the mediate handle as a parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF
mediate driver differentiate between two open VFs with the same device
id and vendor id.

When the VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow the vendor driver to customize the default
region info/rw/mmap of a region.
(2) providing a migration region to support migration.
(3) providing a dynamic trap bar info region to allow the vendor driver to
control trap/untrap of device pci bars.

The vfio-pci + mediate ops way differs from the mdev way in that
(1) the mdev way needs to create a 1:1 mdev device on top of one VF, and the
device specific mdev parent driver is bound to the VF directly.
(2) the vfio-pci + mediate ops way does not create mdev devices and the VF
mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.

The reasons why we don't choose to write an mdev parent driver are
that
(1) VFs are passed through directly almost all the time. Binding directly
to vfio-pci lets most of the code be shared/reused. If we write a
vendor specific mdev parent driver, most of the code (like the passthrough
style of rw/mmap) still needs to be copied from the vfio-pci driver, which
is duplicated and tedious work.
(2) For features like dynamically trapping/untrapping pci bars, if they are
in vfio-pci, they are available to most people without repeated code
copying and re-testing.
(3) with a 1:1 mdev driver that passes through VFs most of the time, people
have to decide whether to bind VFs to vfio-pci or to the mdev parent driver
before a real migration need arises. However, if vfio-pci is bound
initially, they have no chance to do live migration when there's a need
later.

In this patchset,
- patches 1-4 enable vfio-pci to call mediate ops registered by vendor
  driver

[PATCH v4 2/2] block/nbd: fix memory leak in nbd_open()

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

In the current implementation there is a memory leak when
nbd_client_connect() returns an error status. Here is an easy way to
reproduce:

1. run qemu-iotests as follows and check the result with asan:
./check -raw 143

The asan output backtrace is as follows:
Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a560 in calloc (/usr/lib64/libasan.so.3+0xc7560)
#1 0x7f6295e7e015 in g_malloc0  (/usr/lib64/libglib-2.0.so.0+0x50015)
#2 0x56281dab4642 in qobject_input_start_struct  
/mnt/sdb/qemu-4.2.0-rc0/qapi/qobject-input-visitor.c:295
#3 0x56281dab1a04 in visit_start_struct  
/mnt/sdb/qemu-4.2.0-rc0/qapi/qapi-visit-core.c:49
#4 0x56281dad1827 in visit_type_SocketAddress  qapi/qapi-visit-sockets.c:386
#5 0x56281da8062f in nbd_config   /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1716
#6 0x56281da8062f in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1829
#7 0x56281da8062f in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873

Direct leak of 15 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a3a0 in malloc (/usr/lib64/libasan.so.3+0xc73a0)
#1 0x7f6295e7dfbd in g_malloc (/usr/lib64/libglib-2.0.so.0+0x4ffbd)
#2 0x7f6295e96ace in g_strdup (/usr/lib64/libglib-2.0.so.0+0x68ace)
#3 0x56281da804ac in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1834
#4 0x56281da804ac in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7f629688a3a0 in malloc (/usr/lib64/libasan.so.3+0xc73a0)
#1 0x7f6295e7dfbd in g_malloc (/usr/lib64/libglib-2.0.so.0+0x4ffbd)
#2 0x7f6295e96ace in g_strdup (/usr/lib64/libglib-2.0.so.0+0x68ace)
#3 0x56281dab41a3 in qobject_input_type_str_keyval 
/mnt/sdb/qemu-4.2.0-rc0/qapi/qobject-input-visitor.c:536
#4 0x56281dab2ee9 in visit_type_str 
/mnt/sdb/qemu-4.2.0-rc0/qapi/qapi-visit-core.c:297
#5 0x56281dad0fa1 in visit_type_UnixSocketAddress_members 
qapi/qapi-visit-sockets.c:141
#6 0x56281dad17b6 in visit_type_SocketAddress_members 
qapi/qapi-visit-sockets.c:366
#7 0x56281dad186a in visit_type_SocketAddress qapi/qapi-visit-sockets.c:393
#8 0x56281da8062f in nbd_config /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1716
#9 0x56281da8062f in nbd_process_options 
/mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1829
#10 0x56281da8062f in nbd_open /mnt/sdb/qemu-4.2.0-rc0/block/nbd.c:1873

Fixes: 8f071c9db506e03ab
Reported-by: Euler Robot 
Signed-off-by: Pan Nengyuan 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Cc: qemu-stable 
Cc: Vladimir Sementsov-Ogievskiy 
---
Changes v2 to v1:
- add a new function to do the common cleanups (suggested by Stefano
  Garzarella).
---
Changes v3 to v2:
- split in two patches(suggested by Stefano Garzarella)
---
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to
  nbd_clear_bdrvstate and add Fixes tag.
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index 8b4a65a..9062409 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1891,6 +1891,7 @@ static int nbd_open(BlockDriverState *bs, QDict *options, 
int flags,
 
 ret = nbd_client_connect(bs, errp);
 if (ret < 0) {
+nbd_clear_bdrvstate(s);
 return ret;
 }
 /* successfully connected */
-- 
2.7.2.windows.1

[PATCH v4 0/2] block/nbd: fix memory leak in nbd_open

2019-12-04 Thread pannengyuan

From: Pan Nengyuan 

This series adds a new function to do the common cleanups, and fixes a
memory leak in nbd_open() when nbd_client_connect() returns an error status.

---
Changes v2 to v1:
- add a new function to do the common cleanups (suggested by Stefano 
Garzarella).
---
Changes v3 to v2:
- split in two patches(suggested by Stefano Garzarella)
---
Changes v4 to v3:
- replace function name from nbd_free_bdrvstate_prop to nbd_clear_bdrvstate and 
add Fixes tag(suggested by Eric Blake).
- remove static function prototype. (suggested by Eric Blake)

Pan Nengyuan (2):
  block/nbd: extract the common cleanup code
  block/nbd: fix memory leak in nbd_open()

 block/nbd.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

-- 
2.7.2.windows.1

Re: [PATCH] Revert "qemu-options.hx: Update for reboot-timeout parameter"

2019-12-04 Thread Han Han

OK. Updated in version 2.

On Wed, Dec 4, 2019 at 8:21 PM Dr. David Alan Gilbert 
wrote:

> * Han Han (h...@redhat.com) wrote:
> > This reverts commit bbd9e6985ff342cbe15b9cb7eb30e842796fbbe8.
>
> Patchew spotted you're missing the signed-off-by; please send one.
>
> Dave
>
> > In 20a1922032 we allowed reboot-timeout=-1 again, so update the doc
> > accordingly.
> > ---
> >  qemu-options.hx | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 65c9473b73..e14d88e9b2 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -327,8 +327,8 @@ format(true color). The resolution should be
> supported by the SVGA mode, so
> >  the recommended is 320x240, 640x480, 800x640.
> >
> >  A timeout could be passed to bios, guest will pause for
> @var{rb_timeout} ms
> > -when boot failed, then reboot. If @option{reboot-timeout} is not set,
> > -guest will not reboot by default. Currently Seabios for X86
> > +when boot failed, then reboot. If @var{rb_timeout} is '-1', guest will
> not
> > +reboot, qemu passes '-1' to bios by default. Currently Seabios for X86
> >  system support it.
> >
> >  Do strict boot via @option{strict=on} as far as firmware/BIOS
> > --
> > 2.24.0.rc1
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>
>

-- 
Best regards,
---
Han Han
Quality Engineer
Redhat.

Email: h...@redhat.com
Phone: +861065339333

[PATCH v2] Revert "qemu-options.hx: Update for reboot-timeout parameter"

2019-12-04 Thread Han Han

This reverts commit bbd9e6985ff342cbe15b9cb7eb30e842796fbbe8.

In 20a1922032 we allowed reboot-timeout=-1 again, so update the doc
accordingly.

Signed-off-by: Han Han 
---
 qemu-options.hx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index 65c9473b..e14d88e9 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -327,8 +327,8 @@ format(true color). The resolution should be supported by 
the SVGA mode, so
 the recommended is 320x240, 640x480, 800x640.
 
 A timeout could be passed to bios, guest will pause for @var{rb_timeout} ms
-when boot failed, then reboot. If @option{reboot-timeout} is not set,
-guest will not reboot by default. Currently Seabios for X86
+when boot failed, then reboot. If @var{rb_timeout} is '-1', guest will not
+reboot, qemu passes '-1' to bios by default. Currently Seabios for X86
 system support it.
 
 Do strict boot via @option{strict=on} as far as firmware/BIOS
-- 
2.23.0

Re: [PATCH v2 1/3] virtio: add ability to delete vq through a pointer

2019-12-04 Thread Pan Nengyuan




On 2019/12/4 22:40, Eric Blake wrote:
> On 12/4/19 1:31 AM, pannengy...@huawei.com wrote:
>> From: Pan Nengyuan 
>>
>> Devices tend to maintain vq pointers, allow deleting them trough a vq
>> pointer.
> 
> through

Thanks. I'm sorry for my carelessness.

> 
>>
>> Signed-off-by: Michael S. Tsirkin 
>> Signed-off-by: Pan Nengyuan 
>> ---
> 
> Also, don't forget to send a 0/3 cover letter (any series longer than
> one patch should have a cover letter; it is possible to configure git to
> do this automatically: https://wiki.qemu.org/Contribute/SubmitAPatch has
> this tip and others)

ok, thanks.

>

Re: [PATCH v2 1/3] virtio: add ability to delete vq through a pointer

2019-12-04 Thread Pan Nengyuan




On 2019/12/4 16:33, Pankaj Gupta wrote:
> 
>> From: Pan Nengyuan 
>>
>> Devices tend to maintain vq pointers, allow deleting them trough a vq
>> pointer.
>>
>> Signed-off-by: Michael S. Tsirkin 
>> Signed-off-by: Pan Nengyuan 
>> ---
>> Changes v2 to v1:
>> - add a new function virtio_delete_queue to cleanup vq through a vq pointer
>> ---
>>  hw/virtio/virtio.c | 16 +++-
>>  include/hw/virtio/virtio.h |  2 ++
>>  2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 04716b5..6de3cfd 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -2330,17 +2330,23 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int
>> queue_size,
>>  return >vq[i];
>>  }
>>  
>> +void virtio_delete_queue(VirtQueue *vq)
>> +{
>> +vq->vring.num = 0;
>> +vq->vring.num_default = 0;
>> +vq->handle_output = NULL;
>> +vq->handle_aio_output = NULL;
>> +g_free(vq->used_elems);
>> +vq->used_elems = NULL;
>> +}
>> +
>>  void virtio_del_queue(VirtIODevice *vdev, int n)
>>  {
>>  if (n < 0 || n >= VIRTIO_QUEUE_MAX) {
>>  abort();
>>  }
>>  
>> -vdev->vq[n].vring.num = 0;
>> -vdev->vq[n].vring.num_default = 0;
>> -vdev->vq[n].handle_output = NULL;
>> -vdev->vq[n].handle_aio_output = NULL;
>> -g_free(vdev->vq[n].used_elems);
>> +virtio_delete_queue(>vq[n]);
>>  }
>>  
>>  static void virtio_set_isr(VirtIODevice *vdev, int value)
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index c32a815..e18756d 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -183,6 +183,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int
>> queue_size,
>>  
>>  void virtio_del_queue(VirtIODevice *vdev, int n);
>>  
>> +void virtio_delete_queue(VirtQueue *vq);
>> +
>>  void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
>>  unsigned int len);
>>  void virtqueue_flush(VirtQueue *vq, unsigned int count);
>> --
>> 2.7.2.windows.1
>>
>>
> Overall it ooks good to me.
> 
> Just one point: e.g in virtio_rng: "virtio_rng_device_unrealize" function
> We are doing : virtio_del_queue(vdev, 0);
> 
> One can directly call "virtio_delete_queue". It can become confusing
> to call multiple functions for same purpose. Instead, Can we make 
> "virtio_delete_queue" static inline?
> 
Yes, it is a little confusing, but I think we will have the same
problem if we make "virtio_delete_queue" static inline: it can still be
called directly (e.g. in virtio-serial-bus.c and virtio-balloon.c).

How about renaming the function to make it clearer (e.g.
virtio_delete_queue -> virtio_queue_cleanup)? "virtio_del_queue" and
"virtio_delete_queue" are too similar.

> Other than that:
> Reviewed-by: Pankaj Gupta 
> 
>>
>>
> 
> 
> .
>

[PATCH] util/cutils: Expand do_strtosz parsing precision to 64 bits

2019-12-04 Thread Tao Xu

Parse input string both as a double and as a uint64_t, then use the
method which consumes more characters. Update the related test cases.
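The "parse both ways, keep whichever consumed more characters" idea can be sketched in standalone C as follows. This is not the util/cutils.c code: `demo_parse_size()` is a hypothetical helper, unit suffixes are left out, input must start with a digit or '.', and `unsigned long long` is assumed to be 64 bits wide. Plain integers take the exact integer path, so full 64-bit precision is kept; only inputs with a fraction or exponent fall back to the double.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a non-negative size. Returns 0 and stores the value in *result,
 * -EINVAL if the input does not start with a number, or -ERANGE if the
 * value does not fit in 64 bits. *end is set past the consumed text.
 */
int demo_parse_size(const char *str, uint64_t *result, const char **end)
{
	char *end_int, *end_dbl;
	unsigned long long ival;
	double dval;
	int int_overflow;

	assert(str && result && end);

	/* Reject signs, whitespace, "inf", "nan" and other non-numbers. */
	if (!((*str >= '0' && *str <= '9') || *str == '.'))
		return -EINVAL;

	errno = 0;
	ival = strtoull(str, &end_int, 10);
	int_overflow = (errno == ERANGE);

	dval = strtod(str, &end_dbl);

	if (end_int == str && end_dbl == str)
		return -EINVAL;

	if (end_dbl > end_int) {
		/* Fraction or exponent present: only the double saw it all. */
		if (!(dval >= 0.0) || dval >= 18446744073709551616.0)
			return -ERANGE;
		*result = (uint64_t)dval;
		*end = end_dbl;
	} else {
		/* Plain integer: keep the exact 64-bit value. */
		if (int_overflow)
			return -ERANGE;
		*result = (uint64_t)ival;
		*end = end_int;
	}
	return 0;
}
```

For example, "9007199254740993" (2^53+1) is returned exactly, whereas a parser that goes through a double alone would round it to 2^53.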

Signed-off-by: Tao Xu 
---
 tests/test-cutils.c| 37 -
 tests/test-keyval.c| 47 ---
 tests/test-qemu-opts.c | 39 --
 util/cutils.c  | 74 ++
 4 files changed, 73 insertions(+), 124 deletions(-)

diff --git a/tests/test-cutils.c b/tests/test-cutils.c
index 1aa8351520..4a7030c611 100644
--- a/tests/test-cutils.c
+++ b/tests/test-cutils.c
@@ -1970,40 +1970,25 @@ static void test_qemu_strtosz_simple(void)
 g_assert_cmpint(err, ==, 0);
 g_assert_cmpint(res, ==, 12345);
 
-/* Note: precision is 53 bits since we're parsing with strtod() */
-
-str = "9007199254740991"; /* 2^53-1 */
-err = qemu_strtosz(str, &endptr, &res);
-g_assert_cmpint(err, ==, 0);
-g_assert_cmpint(res, ==, 0x1fffffffffffff);
-g_assert(endptr == str + 16);
-
-str = "9007199254740992"; /* 2^53 */
-err = qemu_strtosz(str, &endptr, &res);
-g_assert_cmpint(err, ==, 0);
-g_assert_cmpint(res, ==, 0x20000000000000);
-g_assert(endptr == str + 16);
+/* Note: precision is 64 bits (UINT64_MAX) */
 
 str = "9007199254740993"; /* 2^53+1 */
 err = qemu_strtosz(str, &endptr, &res);
 g_assert_cmpint(err, ==, 0);
-g_assert_cmpint(res, ==, 0x20000000000000); /* rounded to 53 bits */
+g_assert_cmpint(res, ==, 0x20000000000001);
 g_assert(endptr == str + 16);
 
-str = "18446744073709549568"; /* 0xfffffffffffff800 (53 msbs set) */
+str = "18446744073709550591"; /* 0xfffffffffffffbff */
 err = qemu_strtosz(str, &endptr, &res);
 g_assert_cmpint(err, ==, 0);
-g_assert_cmpint(res, ==, 0xfffffffffffff800);
+g_assert_cmpint(res, ==, 0xfffffffffffffbff);
 g_assert(endptr == str + 20);
 
-str = "18446744073709550591"; /* 0xfffffffffffffbff */
+str = "18446744073709551615"; /* 2^64-1 (UINT64_MAX) */
 err = qemu_strtosz(str, &endptr, &res);
 g_assert_cmpint(err, ==, 0);
-g_assert_cmpint(res, ==, 0xfffffffffffff800); /* rounded to 53 bits */
+g_assert_cmpint(res, ==, 0xffffffffffffffff);
 g_assert(endptr == str + 20);
-
-/* 0x7ffffffffffffe00..0x7fffffffffffffff get rounded to
- * 0x8000000000000000, thus -ERANGE; see test_qemu_strtosz_erange() */
 }
 
 static void test_qemu_strtosz_units(void)
@@ -2145,16 +2130,6 @@ static void test_qemu_strtosz_erange(void)
 g_assert_cmpint(err, ==, -ERANGE);
 g_assert(endptr == str + 2);
 
-str = "18446744073709550592"; /* 0xfffffffffffffc00 */
-err = qemu_strtosz(str, &endptr, &res);
-g_assert_cmpint(err, ==, -ERANGE);
-g_assert(endptr == str + 20);
-
-str = "18446744073709551615"; /* 2^64-1 */
-err = qemu_strtosz(str, &endptr, &res);
-g_assert_cmpint(err, ==, -ERANGE);
-g_assert(endptr == str + 20);
-
 str = "18446744073709551616"; /* 2^64 */
 err = qemu_strtosz(str, &endptr, &res);
 g_assert_cmpint(err, ==, -ERANGE);
diff --git a/tests/test-keyval.c b/tests/test-keyval.c
index 09b0ae3c68..fad941fcb8 100644
--- a/tests/test-keyval.c
+++ b/tests/test-keyval.c
@@ -383,59 +383,26 @@ static void test_keyval_visit_size(void)
 visit_end_struct(v, NULL);
 visit_free(v);
 
-/* Note: precision is 53 bits since we're parsing with strtod() */
+/* Note: precision is 64 bits (UINT64_MAX) */
 
-/* Around limit of precision: 2^53-1, 2^53, 2^53+1 */
-qdict = keyval_parse("sz1=9007199254740991,"
- "sz2=9007199254740992,"
- "sz3=9007199254740993",
+/* Around limit of precision: UINT64_MAX - 1, UINT64_MAX */
+qdict = keyval_parse("sz1=18446744073709551614,"
+ "sz2=18446744073709551615",
  NULL, &error_abort);
 v = qobject_input_visitor_new_keyval(QOBJECT(qdict));
 qobject_unref(qdict);
 visit_start_struct(v, NULL, NULL, 0, &error_abort);
 visit_type_size(v, "sz1", &sz, &error_abort);
-g_assert_cmphex(sz, ==, 0x1fffffffffffff);
+g_assert_cmphex(sz, ==, 0xfffffffffffffffe);
 visit_type_size(v, "sz2", &sz, &error_abort);
-g_assert_cmphex(sz, ==, 0x20000000000000);
-visit_type_size(v, "sz3", &sz, &error_abort);
-g_assert_cmphex(sz, ==, 0x20000000000000);
-visit_check_struct(v, &error_abort);
-visit_end_struct(v, NULL);
-visit_free(v);
-
-/* Close to signed upper limit 0x7ffffffffffffc00 (53 msbs set) */
-qdict = keyval_parse("sz1=9223372036854774784," /* 7ffffffffffffc00 */
- "sz2=9223372036854775295", /* 7ffffffffffffdff */
- NULL, &error_abort);
-v = qobject_input_visitor_new_keyval(QOBJECT(qdict));
-qobject_unref(qdict);
-visit_start_struct(v, NULL, NULL, 0, &error_abort);
-visit_type_size(v, "sz1", &sz, &error_abort);
-g_assert_cmphex(sz, ==, 0x7ffffffffffffc00);
-visit_type_size(v, "sz2", &sz, &error_abort);
-g_assert_cmphex(sz, ==, 0x7ffffffffffffc00);
-visit_check_struct(v, &error_abort);
-
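
The approach in the commit message (parse the string both as a double and
as a uint64_t, then keep whichever parse consumed more characters) can be
sketched roughly as below. This is only an illustration under stated
assumptions: parse_size_sketch() is a hypothetical helper, not QEMU's
do_strtosz(); it handles no unit suffixes, and rejecting a leading minus
sign with -EINVAL is a choice made for the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not QEMU code): parse a non-negative size without
 * unit suffixes. Returns 0 on success, -EINVAL or -ERANGE on failure.
 * Assumes unsigned long long is 64 bits wide, as on mainstream platforms.
 */
static int parse_size_sketch(const char *s, uint64_t *out, const char **end)
{
    const char *p = s;
    char *end_d, *end_u;
    unsigned long long u;
    double d;
    int err_u;

    assert(s != NULL && out != NULL && end != NULL);

    /* strtoull() would silently wrap negative input, so reject it early. */
    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '-') {
        *end = s;
        return -EINVAL;
    }

    /* Parse the same input twice: once as an integer, once as a double. */
    errno = 0;
    u = strtoull(s, &end_u, 10);
    err_u = errno;
    d = strtod(s, &end_d);

    if (end_u == s && end_d == s) {
        *end = s;
        return -EINVAL;             /* neither parser consumed anything */
    }

    if (end_u >= end_d) {
        /* Pure integer input: keep the full 64-bit precision. */
        *end = end_u;
        if (err_u == ERANGE) {
            return -ERANGE;
        }
        *out = (uint64_t)u;
        return 0;
    }

    /* The double parse consumed more (fraction or exponent present). */
    *end = end_d;
    if (!(d >= 0.0 && d < 18446744073709551616.0)) {   /* 2^64, also NaN */
        return -ERANGE;
    }
    *out = (uint64_t)d;
    return 0;
}
```

With this, "9007199254740993" (2^53+1) comes back exactly rather than
rounded to 53 bits, while "1.5e3" still goes through the double path.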

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Yan Zhao

On Thu, Dec 05, 2019 at 02:34:57AM +0800, Alex Williamson wrote:
> On Wed, 4 Dec 2019 23:40:25 +0530
> Kirti Wankhede  wrote:
> 
> > On 12/3/2019 11:34 PM, Alex Williamson wrote:
> > > On Mon, 25 Nov 2019 19:57:39 -0500
> > > Yan Zhao  wrote:
> > >   
> > >> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> > >>> On Fri, 15 Nov 2019 00:26:07 +0530
> > >>> Kirti Wankhede  wrote:
> > >>>  
> >  On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > > On Thu, 14 Nov 2019 01:07:21 +0530
> > > Kirti Wankhede  wrote:
> > >
> > >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> > >>> On Tue, 12 Nov 2019 22:33:37 +0530
> > >>> Kirti Wankhede  wrote:
> > >>>   
> >  All pages pinned by vendor driver through vfio_pin_pages API 
> >  should be
> >  considered as dirty during migration. IOMMU container maintains a 
> >  list of
> >  all such pinned pages. Added an ioctl defination to get bitmap of 
> >  such  
> > >>>
> > >>> definition
> > >>>   
> >  pinned pages for requested IO virtual address range.  
> > >>>
> > >>> Additionally, all mapped pages are considered dirty when physically
> > >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> > >>> per page pinning to indicate finer granularity with a TBD mechanism 
> > >>> to
> > >>> figure out if any non-opt-in devices remain.
> > >>>   
> > >>
> > >> You mean, in case of device direct assignment (device pass through)? 
> > >>  
> > >
> > > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > > pinned and mapped, then the correct dirty page set is all mapped 
> > > pages.
> > > We discussed using the vpfn list as a mechanism for vendor drivers to
> > > reduce their migration footprint, but we also discussed that we would
> > > need a way to determine that all participants in the container have
> > > explicitly pinned their working pages or else we must consider the
> > > entire potential working set as dirty.
> > >
> > 
> >  How can vendor driver tell this capability to iommu module? Any 
> >  suggestions?  
> > >>>
> > >>> I think it does so by pinning pages.  Is it acceptable that if the
> > >>> vendor driver pins any pages, then from that point forward we consider
> > >>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> > >> we should also be aware of that dirty page scope is pinned pages + 
> > >> unpinned pages,
> > >> which means ever since a page is pinned, it should be regarded as dirty
> > >> no matter whether it's unpinned later. only after log_sync is called and
> > >> dirty info retrieved, its dirty state should be cleared.  
> > > 
> > > Yes, good point.  We can't just remove a vpfn when a page is unpinned
> > > or else we'd lose information that the page potentially had been
> > > dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> > > list and both the currently pinned vpfns and the dirty vpfns are walked
> > > on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> > > The container would need to know that dirty tracking is enabled and
> > > only manage the dirty vpfns list when necessary.  Thanks,
> > >   
> > 
> > If page is unpinned, then that page is available in free page pool for 
> > others to use, then how can we say that unpinned page has valid data?
> > 
> > If suppose, one driver A unpins a page and when driver B of some other 
> > device gets that page and he pins it, uses it, and then unpins it, then 
> > how can we say that page has valid data for driver A?
> > 
> > Can you give one example where unpinned page data is considered reliable 
> > and valid?
> 
> We can only pin pages that the user has already allocated* and mapped
> through the vfio DMA API.  The pinning of the page simply locks the
> page for the vendor driver to access it and unpinning that page only
> indicates that access is complete.  Pages are not freed when a vendor
> driver unpins them, they still exist and at this point we're now
> assuming the device dirtied the page while it was pinned.  Thanks,
> 
> Alex
> 
> * An exception here is that the page might be demand allocated and the
>   act of pinning the page could actually allocate the backing page for
>   the user if they have not faulted the page to trigger that allocation
>   previously.  That page remains mapped for the user's virtual address
>   space even after the unpinning though.
>

Yes, I can give an example in GVT.
When a gem_object is allocated in the guest, before it is submitted to the
guest vGPU, the gfx cmds in its ring buffer need to be pinned into GGTT to
get a global graphics address for hardware access. At that time, we shadow
those cmds, pin the pages through vfio pin_pages(), and submit the shadow
gem_object to the physical hardware.
After guest driver
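
The bookkeeping discussed in the quoted text (an unpinned page stays dirty
until the next log_sync, which reports pinned plus previously pinned pages
and then clears the latter) might look roughly like this. It is a sketch
only: the struct and function names are hypothetical and are not the vfio
kernel structures, the range is limited to 64 pages so that one uint64_t
serves as the bitmap, and pin reference counts are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NPAGES 64    /* hypothetical range: one bit per page */

struct dirty_tracker {
    uint64_t pinned;    /* pages currently pinned by a vendor driver */
    uint64_t dirty;     /* pages unpinned since the last log_sync */
    bool tracking;      /* dirty logging enabled (migration running) */
};

static void tracker_pin(struct dirty_tracker *t, unsigned int pfn)
{
    assert(pfn < SKETCH_NPAGES);
    t->pinned |= UINT64_C(1) << pfn;
}

static void tracker_unpin(struct dirty_tracker *t, unsigned int pfn)
{
    uint64_t bit;

    assert(pfn < SKETCH_NPAGES);
    bit = UINT64_C(1) << pfn;
    t->pinned &= ~bit;
    /* The device may have written the page while it was pinned. */
    if (t->tracking) {
        t->dirty |= bit;
    }
}

/* Report pinned + previously pinned pages, then clear the dirty set. */
static uint64_t tracker_log_sync(struct dirty_tracker *t)
{
    uint64_t bitmap = t->pinned | t->dirty;

    t->dirty = 0;
    return bitmap;
}
```

A page that was pinned and then unpinned shows up in exactly one
log_sync; a page that stays pinned shows up in every one.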

RE: issue about virtio-blk queue size

2019-12-04 Thread Wangyong

>
> On Thu, Nov 28, 2019 at 08:44:43AM +, Wangyong wrote:
> > Hi all,
>
> This looks interesting, please continue this discussion on the QEMU mailing 
> list
>  so that others can participate.
>
> >
> > This patch makes virtio_blk queue size configurable
> >
> > commit 6040aedddb5f474a9c2304b6a432a652d82b3d3c
> > Author: Mark Kanda 
> > Date:   Mon Dec 11 09:16:24 2017 -0600
> >
> > virtio-blk: make queue size configurable
> >
> > But when we set the queue size to more than 128, it will not take effect.
> >
> > That's because linux aio's maximum outstanding requests at a time is
> > always less than or equal to 128
> >
> > The following code limits the outstanding requests at a time:
> >
> > #define MAX_EVENTS 128
> >
> > laio_do_submit()
> > {
> >
> > if (!s->io_q.blocked &&
> > (!s->io_q.plugged ||
> >  s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> > ioq_submit(s);
> > }
> > }
> >
> > Should we make the value of MAX_EVENTS configurable ?
>
> Increasing MAX_EVENTS to a larger hardcoded value seems reasonable as a
> shortterm fix.  Please first check how /proc/sys/fs/aio-max-nr and
> io_setup(2) handle this resource limit.  The patch must not break existing
> systems where 128 works today.
[root@node2 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)

[root@node2 ~]# cat /proc/sys/fs/aio-max-nr
4294967296

> > MAX_EVENTS should have the same value as queue size ?
>
> Multiple virtio-blk devices can share a single AioContext,
Do you mean multiple virtio-blk devices configured with one IOThread?
Performance with multiple virtio-blk devices will be worse in that case.

>so setting it to the
> queue size may not be enough.  That's why I suggest increasing the
> hardcoded limit for now unless someone thinks up a way to size MAX_EVENTS
> correctly.
>
> > I set the virtio blk queue size to 1024, then tested the results as
> > follows
> >
> > fio --filename=/dev/vda -direct=1 -iodepth=1024 -thread -rw=randread
> > -ioengine=libaio -bs=8k -size=50G -numjobs=1 -runtime=600
> > -group_reporting -name=test
> > guest:
> >
> > [root@localhost ~]# cat /sys/module/virtio_blk/parameters/queue_depth
> > 1024
> >
> > [root@localhost ~]# cat /sys/block/vda/queue/nr_requests
> > 1024
> >
> > Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
> > vda        0.00    0.00     0.00  1432.00      0.00  11456.00     16.00   1024.91  720.82     0.00   720.82   0.70  100.10
>
> This iostat output doesn't correspond to the fio -rw=randread command-line
> you posted because it shows writes instead of reads ;).  I assume nothing else
> was changed in the fio command-line.
fio --filename=/dev/vda -direct=1 -iodepth=1024 -thread -rw=randread 
-ioengine=libaio -bs=8k -size=50G -numjobs=1 -runtime=600 -group_reporting 
-name=test

MAX_EVENTS = 128

guest:

Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
vda        0.00    0.00  1324.00     0.00  10592.00      0.00     16.00   1023.90  769.05   769.05     0.00   0.76  100.00

host:

root@cvk~/build# cat /sys/block/sda/queue/nr_requests
1024

Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
sda        0.00    0.00  1359.00     0.00  10872.00      0.00     16.00    127.91   93.93    93.93     0.00   0.74  100.00


I redefined this macro (MAX_EVENTS = 1024):
#define MAX_EVENTS 1024
Then I retested. The results are as follows (I/O performance improves
considerably):

guest:

[root@localhost ~]# cat /sys/module/virtio_blk/parameters/queue_depth
1024

[root@localhost ~]# cat /sys/block/vda/queue/nr_requests
1024

Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
vda        0.00    0.00  1743.00     0.00  13944.00      0.00     16.00   1024.50  584.94   584.94     0.00   0.57  100.10


host:

root@cvk~/build# cat /sys/block/sda/queue/nr_requests
1024


Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
sda        0.00    0.00  1414.00     1.00  11312.00      1.00     15.99   1023.37  726.36   726.86    24.00   0.71  100.00
>
> >
> > host:
> >
> > root@cvk~/build# cat /sys/block/sda/queue/nr_requests
> > 1024
> >
> > Device:  rrqm/s  wrqm/s      r/s      w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
> > sda        0.00   11.00     0.00  1402.00      0.00  11244.00     16.04    128.00   88.30     0.00    88.30   0.71  100.00
> >
> >
> >
> > I redefined this macro(MAX_EVENTS = 1024) #define MAX_EVENTS 1024
> >
> > Then retested, the results are as follows: （IO performance will be
> > greatly improved）
> >
> > fio --filename=/dev/vda -direct=1 -iodepth=1024 -thread -rw=randread
> > -ioengine=libaio -bs=8k -size=50G -numjobs=1 -runtime=600
> > -group_reporting -name=test
> >
> > guest:
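
As a cross-check on the iostat figures in this message, Little's law
(mean requests in flight = throughput x mean latency) can be applied to
the host-side numbers. The helper below is a hypothetical illustration,
not part of QEMU or iostat; it only does the arithmetic.

```c
#include <assert.h>

/*
 * Little's law: mean number of requests in flight equals the request
 * rate (requests per second) times the mean latency (in seconds).
 * await_ms is iostat's "await" column, in milliseconds.
 */
static double mean_queue_depth(double reqs_per_sec, double await_ms)
{
    assert(reqs_per_sec >= 0.0 && await_ms >= 0.0);
    return reqs_per_sec * await_ms / 1000.0;
}
```

For the host with MAX_EVENTS = 128, mean_queue_depth(1359.00, 93.93)
gives about 127.7, close to the reported avgqu-sz of 127.91, so the host
queue sits right at the 128 limit. With MAX_EVENTS = 1024,
mean_queue_depth(1414.00 + 1.00, 726.36) gives about 1027.8 against a
reported 1023.37; the small gaps are presumably due to rounding in the
iostat output.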

Re: guest / host buffer sharing ...

2019-12-04 Thread Dylan Reid

On Thu, Nov 21, 2019 at 4:59 PM Tomasz Figa  wrote:
>
> On Thu, Nov 21, 2019 at 6:41 AM Geoffrey McRae  wrote:
> >
> >
> >
> > On 2019-11-20 23:13, Tomasz Figa wrote:
> > > Hi Geoffrey,
> > >
> > > On Thu, Nov 7, 2019 at 7:28 AM Geoffrey McRae 
> > > wrote:
> > >>
> > >>
> > >>
> > >> On 2019-11-06 23:41, Gerd Hoffmann wrote:
> > >> > On Wed, Nov 06, 2019 at 05:36:22PM +0900, David Stevens wrote:
> > >> >> > (1) The virtio device
> > >> >> > =
> > >> >> >
> > >> >> > Has a single virtio queue, so the guest can send commands to 
> > >> >> > register
> > >> >> > and unregister buffers.  Buffers are allocated in guest ram.  Each 
> > >> >> > buffer
> > >> >> > has a list of memory ranges for the data. Each buffer also has some
> > >> >>
> > >> >> Allocating from guest ram would work most of the time, but I think
> > >> >> it's insufficient for many use cases. It doesn't really support things
> > >> >> such as contiguous allocations, allocations from carveouts or <4GB,
> > >> >> protected buffers, etc.
> > >> >
> > >> > If there are additional constrains (due to gpu hardware I guess)
> > >> > I think it is better to leave the buffer allocation to virtio-gpu.
> > >>
> > >> The entire point of this for our purposes is due to the fact that we
> > >> can
> > >> not allocate the buffer, it's either provided by the GPU driver or
> > >> DirectX. If virtio-gpu were to allocate the buffer we might as well
> > >> forget
> > >> all this and continue using the ivshmem device.
> > >
> > > I don't understand why virtio-gpu couldn't allocate those buffers.
> > > Allocation doesn't necessarily mean creating new memory. Since the
> > > virtio-gpu device on the host talks to the GPU driver (or DirectX?),
> > > why couldn't it return one of the buffers provided by those if
> > > BIND_SCANOUT is requested?
> > >
> >
> > Because in our application we are a user-mode application in windows
> > that is provided with buffers that were allocated by the video stack in
> > windows. We are not using a virtual GPU but a physical GPU via vfio
> > passthrough and as such we are limited in what we can do. Unless I have
> > completely missed what virtio-gpu does, from what I understand it's
> > attempting to be a virtual GPU in its own right, which is not at all
> > suitable for our requirements.
>
> Not necessarily. virtio-gpu in its basic shape is an interface for
> allocating frame buffers and sending them to the host to display.
>
> It sounds to me like a PRIME-based setup similar to how integrated +
> discrete GPUs are handled on regular systems could work for you. The
> virtio-gpu device would be used like the integrated GPU that basically
> just drives the virtual screen. The guest component that controls the
> display of the guest (typically some sort of a compositor) would
> allocate the frame buffers using virtio-gpu and then import those to
> the vfio GPU when using it for compositing the parts of the screen.
> The parts of the screen themselves would be rendered beforehand by
> applications into local buffers managed fully by the vfio GPU, so
> there wouldn't be any need to involve virtio-gpu there. Only the
> compositor would have to be aware of it.
>
> Of course if your guest is not Linux, I have no idea if that can be
> handled in any reasonable way. I know those integrated + discrete GPU
> setups do work on Windows, but things are obviously 100% proprietary,
> so I don't know if one could make them work with virtio-gpu as the
> integrated GPU.
>
> >
> > This discussion seems to have moved away completely from the original
> > simple feature we need, which is to share a random block of guest
> > allocated ram with the host. While it would be nice if it's contiguous
> > ram, it's not an issue if it's not, and with udmabuf (now I understand
> > it) it can be made to appear contiguous if it is so desired anyway.
> >
> > vhost-user could be used for this if it is fixed to allow dynamic
> > remapping, all the other bells and whistles that are virtio-gpu are
> > useless to us.
> >
>
> As far as I followed the thread, my impression is that we don't want
> to have an ad-hoc interface just for sending memory to the host. The
> thread was started to look for a way to create identifiers for guest
> memory, which proper virtio devices could use to refer to the memory
> within requests sent to the host.
>
> That said, I'm not really sure if there is any benefit of making it
> anything other than just the specific virtio protocol accepting
> scatterlist of guest pages directly.
>
> Putting the ability to obtain the shared memory itself, how do you
> trigger a copy from the guest frame buffer to the shared memory?

Adding Zach for more background on virtio-wl particular use cases.

Re: [PATCH v2 1/7] iotests: Provide a function for checking the creation of huge files

2019-12-04 Thread Cleber Rosa

On Wed, Dec 04, 2019 at 04:46:12PM +0100, Thomas Huth wrote:
> Some tests create huge (but sparse) files, and to be able to run those
> tests in certain limited environments (like CI containers), we have to
> check for the possibility to create such files first. Thus let's introduce
> a common function to check for large files, and replace the already
> existing checks in the iotests 005 and 220 with this function.
> 
> Reviewed-by: Alex Bennée 
> Signed-off-by: Thomas Huth 
> ---
>  tests/qemu-iotests/005   |  5 +
>  tests/qemu-iotests/220   |  6 ++
>  tests/qemu-iotests/common.rc | 10 ++
>  3 files changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/tests/qemu-iotests/005 b/tests/qemu-iotests/005
> index 58442762fe..b6d03ac37d 100755
> --- a/tests/qemu-iotests/005
> +++ b/tests/qemu-iotests/005
> @@ -59,10 +59,7 @@ fi
>  # Sanity check: For raw, we require a file system that permits the creation
>  # of a HUGE (but very sparse) file. Check we can create it before continuing.
>  if [ "$IMGFMT" = "raw" ]; then
> -if ! truncate --size=5T "$TEST_IMG"; then
> -_notrun "file system on $TEST_DIR does not support large enough 
> files"
> -fi
> -rm "$TEST_IMG"
> +_require_large_file 5T
>  fi
>  
>  echo
> diff --git a/tests/qemu-iotests/220 b/tests/qemu-iotests/220
> index 2d62c5dcac..15159270d3 100755
> --- a/tests/qemu-iotests/220
> +++ b/tests/qemu-iotests/220
> @@ -42,10 +42,8 @@ echo "== Creating huge file =="
>  
>  # Sanity check: We require a file system that permits the creation
>  # of a HUGE (but very sparse) file.  tmpfs works, ext4 does not.
> -if ! truncate --size=513T "$TEST_IMG"; then
> -_notrun "file system on $TEST_DIR does not support large enough files"
> -fi
> -rm "$TEST_IMG"
> +_require_large_file 513T
> +
>  IMGOPTS='cluster_size=2M,refcount_bits=1' _make_test_img 513T
>  
>  echo "== Populating refcounts =="
> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> index 0cc8acc9ed..6f0582c79a 100644
> --- a/tests/qemu-iotests/common.rc
> +++ b/tests/qemu-iotests/common.rc
> @@ -643,5 +643,15 @@ _require_drivers()
>  done
>  }
>  
> +# Check that we have a file system that allows huge (but very sparse) files
> +#
> +_require_large_file()
> +{
> +if ! truncate --size="$1" "$TEST_IMG"; then
> +_notrun "file system on $TEST_DIR does not support large enough 
> files"
> +fi
> +rm "$TEST_IMG"
> +}
> +
>  # make sure this script returns success
>  true
> -- 
> 2.18.1
> 

This is a good refactor even without considering the CI environment
issues it will help to address.

Reviewed-by: Cleber Rosa 
Tested-by: Cleber Rosa

[PATCH-for-5.0] hw/alpha/dp264: Use the DECchip Tulip network interface

2019-12-04 Thread Philippe Mathieu-Daudé

Commit 34ea023d4b9 introduced the Tulip PCI NIC.
Since this better models the DP264 hardware, use it.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/alpha/dp264.c | 4 ++--
 hw/alpha/Kconfig | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
index 51b3cf7a61..4424551ba1 100644
--- a/hw/alpha/dp264.c
+++ b/hw/alpha/dp264.c
@@ -85,9 +85,9 @@ static void clipper_init(MachineState *machine)
 /* VGA setup.  Don't bother loading the bios.  */
 pci_vga_init(pci_bus);
 
-/* Network setup.  e1000 is good enough, failing Tulip support.  */
+/* Network setup */
 for (i = 0; i < nb_nics; i++) {
-pci_nic_init_nofail(&nd_table[i], pci_bus, "e1000", NULL);
+pci_nic_init_nofail(&nd_table[i], pci_bus, "tulip", NULL);
 }
 
 /* 2 82C37 (dma) */
diff --git a/hw/alpha/Kconfig b/hw/alpha/Kconfig
index 15c59ff264..552e6a4c23 100644
--- a/hw/alpha/Kconfig
+++ b/hw/alpha/Kconfig
@@ -2,7 +2,7 @@ config DP264
 bool
 imply PCI_DEVICES
 imply TEST_DEVICES
-imply E1000_PCI
+imply TULIP
 select I82374
 select I8254
 select I8259
-- 
2.21.0

Re: [PATCH v4 26/40] target/arm: Update define_one_arm_cp_reg_with_opaque for VHE

2019-12-04 Thread Alex Bennée



Richard Henderson  writes:

> On 12/4/19 10:58 AM, Alex Bennée wrote:
>>> @@ -7437,13 +7437,10 @@ void define_one_arm_cp_reg_with_opaque(ARMCPU *cpu,
>>>  mask = PL0_RW;
>>>  break;
>>>  case 4:
>>> +case 5:
>>>  /* min_EL EL2 */
>>>  mask = PL2_RW;
>>>  break;
>>> -case 5:
>>> -/* unallocated encoding, so not possible */
>>> -assert(false);
>>> -break;
>> 
>> This change is fine - I don't think we should have asserted here anyway.
>> But don't we generate an unallocated exception if the CPU is v8.0?
>
> This change is only for validation of the system registers themselves.  It has
> nothing to do with the usage of system registers from the actual guest.

So what is the mechanism that feeds back to the translator?
access_check_cp_reg only seems to care about XSCALE. I guess
cp_access_ok would trip if you weren't at EL2 but what if you are a v8.0
at EL2?

-- 
Alex Bennée

Re: [PATCH v2 2/2] migration: savevm_state_handler_insert: constant-time element insertion

2019-12-04 Thread David Gibson

On Wed, Dec 04, 2019 at 04:49:15PM +, Dr. David Alan Gilbert wrote:
> * Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> > On Mon, Oct 21, 2019 at 09:14:44AM +0100, Dr. David Alan Gilbert wrote:
> > > * David Gibson (da...@gibson.dropbear.id.au) wrote:
> > > > On Fri, Oct 18, 2019 at 10:43:52AM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Laurent Vivier (lviv...@redhat.com) wrote:
> > > > > > On 18/10/2019 10:16, Dr. David Alan Gilbert wrote:
> > > > > > > * Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> > > > > > >> savevm_state's SaveStateEntry TAILQ is a priority queue.  
> > > > > > >> Priority
> > > > > > >> sorting is maintained by searching from head to tail for a 
> > > > > > >> suitable
> > > > > > >> insertion spot.  Insertion is thus an O(n) operation.
> > > > > > >>
> > > > > > >> If we instead keep track of the head of each priority's subqueue
> > > > > > >> within that larger queue we can reduce this operation to O(1) 
> > > > > > >> time.
> > > > > > >>
> > > > > > >> savevm_state_handler_remove() becomes slightly more complex to
> > > > > > > >> accommodate these gains: we need to replace the head of a 
> > > > > > >> priority's
> > > > > > >> subqueue when removing it.
> > > > > > >>
> > > > > > >> With O(1) insertion, booting VMs with many SaveStateEntry 
> > > > > > >> objects is
> > > > > > >> more plausible.  For example, a ppc64 VM with maxmem=8T has 
> > > > > > >> 4 such
> > > > > > >> objects to insert.
> > > > > > > 
> > > > > > > Separate from reviewing this patch, I'd like to understand why 
> > > > > > > you've
> > > > > > > got 4 objects.  This feels very very wrong and is likely to 
> > > > > > > cause
> > > > > > > problems to random other bits of qemu as well.
> > > > > > 
> > > > > > I think the 4 objects are the "dr-connectors" that are used to 
> > > > > > plug
> > > > > > peripherals (memory, pci card, cpus, ...).
> > > > > 
> > > > > Yes, Scott confirmed that in the reply to the previous version.
> > > > > IMHO nothing in qemu is designed to deal with that many 
> > > > > devices/objects
> > > > > - I'm sure that something other than the migration code is going to
> > > > > get upset.
> > > > 
> > > > It kind of did.  Particularly when there was n^2 and n^3 cubed
> > > > behaviour in the property stuff we had some ludicrously long startup
> > > > times (hours) with large maxmem values.
> > > > 
> > > > Fwiw, the DRCs for PCI slots, DRCs and PHBs aren't really a problem.
> > > > The problem is the memory DRCs, there's one for each LMB - each 256MiB
> > > > chunk of memory (or possible memory).
> > > > 
> > > > > Is perhaps the structure wrong somewhere - should there be a single 
> > > > > DRC
> > > > > device that knows about all DRCs?
> > > > 
> > > > Maybe.  The tricky bit is how to get there from here without breaking
> > > > migration or something else along the way.
> > > 
> > > Switch on the next machine type version - it doesn't matter if migration
> > > is incompatible then.
> > 
> > 1mo bump.
> > 
> > Is there anything I need to do with this patch in particular to make it 
> > suitable
> > for merging?
> 
> Apologies for the delay;  hopefully this will go in one of the pulls
> just after the tree opens again.
> 
> Please please try and work on reducing the number of objects somehow -
> while this migration fix is a useful short term fix, and not too
> invasive; having that many objects around qemu is a really really bad
> idea so needs fixing properly.

I'm hoping to have a crack at this tomorrow.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
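
The idea in the quoted commit message (track the head of each priority's
subqueue so that insertion no longer walks the whole list) can be sketched
as follows. This is a simplified illustration, not the migration/savevm.c
code: struct entry, struct pqueue, NPRIO and the function names are all
hypothetical, and a hand-rolled doubly linked list stands in for QTAILQ.
Insertion scans at most NPRIO subqueue heads, which is constant with
respect to the number of entries.

```c
#include <assert.h>
#include <stddef.h>

#define NPRIO 4     /* hypothetical number of priorities; higher goes first */

struct entry {
    int prio;
    struct entry *prev, *next;
};

struct pqueue {
    struct entry *head, *tail;
    struct entry *first[NPRIO];   /* head of each priority's subqueue */
};

static void pq_insert(struct pqueue *q, struct entry *e)
{
    struct entry *before = NULL;
    int p;

    assert(e->prio >= 0 && e->prio < NPRIO);

    /* Find the head of the nearest non-empty lower-priority subqueue. */
    for (p = e->prio - 1; p >= 0; p--) {
        if (q->first[p]) {
            before = q->first[p];
            break;
        }
    }

    if (before) {
        /* Link e in front of that head, i.e. at the end of its own run. */
        e->next = before;
        e->prev = before->prev;
        if (before->prev) {
            before->prev->next = e;
        } else {
            q->head = e;
        }
        before->prev = e;
    } else {
        /* Nothing of lower priority exists: append at the tail. */
        e->next = NULL;
        e->prev = q->tail;
        if (q->tail) {
            q->tail->next = e;
        } else {
            q->head = e;
        }
        q->tail = e;
    }

    if (!q->first[e->prio]) {
        q->first[e->prio] = e;
    }
}

static void pq_remove(struct pqueue *q, struct entry *e)
{
    /* Removing a subqueue head: promote its successor, if same priority. */
    if (q->first[e->prio] == e) {
        struct entry *n = e->next;

        q->first[e->prio] = (n && n->prio == e->prio) ? n : NULL;
    }
    if (e->prev) {
        e->prev->next = e->next;
    } else {
        q->head = e->next;
    }
    if (e->next) {
        e->next->prev = e->prev;
    } else {
        q->tail = e->prev;
    }
    e->prev = e->next = NULL;
}
```

The extra work in pq_remove() corresponds to the added complexity in
savevm_state_handler_remove() that the commit message mentions.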



Re: [PATCH v2 7/7] travis.yml: Enable builds on arm64, ppc64le and s390x

2019-12-04 Thread David Gibson

On Wed, Dec 04, 2019 at 04:46:18PM +0100, Thomas Huth wrote:
> Travis recently added the possibility to test on these architectures,
> too, so let's enable them in our travis.yml file to extend our test
> coverage.
> 
> Unfortunately, the libssh in this Ubuntu version (bionic) is in a pretty
> unusable Frankenstein state and libspice-server-dev is not available here,
> so we can not use the global list of packages to install, but have to
> provide individual package lists instead.
> 
> Also, some of the iotests crash when using "dist: bionic" on arm64
> and ppc64le, thus these two builders have to use "dist: xenial" until
> the problem is understood / fixed.
> 
> Signed-off-by: Thomas Huth 

Acked-by: David Gibson 

> ---
>  .travis.yml | 86 +
>  1 file changed, 86 insertions(+)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 445b0646c1..0e6458b0af 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -354,6 +354,92 @@ matrix:
>  - TEST_CMD="make -j3 check-tcg V=1"
>  - CACHE_NAME="${TRAVIS_BRANCH}-linux-gcc-debug-tcg"
>  
> +- arch: arm64
> +  dist: xenial
> +  addons:
> +apt_packages:
> +  - libaio-dev
> +  - libattr1-dev
> +  - libbrlapi-dev
> +  - libcap-ng-dev
> +  - libgcrypt20-dev
> +  - libgnutls28-dev
> +  - libgtk-3-dev
> +  - libiscsi-dev
> +  - liblttng-ust-dev
> +  - libncurses5-dev
> +  - libnfs-dev
> +  - libnss3-dev
> +  - libpixman-1-dev
> +  - libpng-dev
> +  - librados-dev
> +  - libsdl2-dev
> +  - libseccomp-dev
> +  - liburcu-dev
> +  - libusb-1.0-0-dev
> +  - libvdeplug-dev
> +  - libvte-2.91-dev
> +  env:
> +- TEST_CMD="make check check-tcg V=1"
> +- CONFIG="--disable-containers --target-list=${MAIN_SOFTMMU_TARGETS}"
> +
> +- arch: ppc64le
> +  dist: xenial
> +  addons:
> +apt_packages:
> +  - libaio-dev
> +  - libattr1-dev
> +  - libbrlapi-dev
> +  - libcap-ng-dev
> +  - libgcrypt20-dev
> +  - libgnutls28-dev
> +  - libgtk-3-dev
> +  - libiscsi-dev
> +  - liblttng-ust-dev
> +  - libncurses5-dev
> +  - libnfs-dev
> +  - libnss3-dev
> +  - libpixman-1-dev
> +  - libpng-dev
> +  - librados-dev
> +  - libsdl2-dev
> +  - libseccomp-dev
> +  - liburcu-dev
> +  - libusb-1.0-0-dev
> +  - libvdeplug-dev
> +  - libvte-2.91-dev
> +  env:
> +- TEST_CMD="make check check-tcg V=1"
> +- CONFIG="--disable-containers 
> --target-list=${MAIN_SOFTMMU_TARGETS},ppc64le-linux-user"
> +
> +- arch: s390x
> +  dist: bionic
> +  addons:
> +apt_packages:
> +  - libaio-dev
> +  - libattr1-dev
> +  - libbrlapi-dev
> +  - libcap-ng-dev
> +  - libgcrypt20-dev
> +  - libgnutls28-dev
> +  - libgtk-3-dev
> +  - libiscsi-dev
> +  - liblttng-ust-dev
> +  - libncurses5-dev
> +  - libnfs-dev
> +  - libnss3-dev
> +  - libpixman-1-dev
> +  - libpng-dev
> +  - librados-dev
> +  - libsdl2-dev
> +  - libseccomp-dev
> +  - liburcu-dev
> +  - libusb-1.0-0-dev
> +  - libvdeplug-dev
> +  - libvte-2.91-dev
> +  env:
> +- TEST_CMD="make check check-tcg V=1"
> +- CONFIG="--disable-containers 
> --target-list=${MAIN_SOFTMMU_TARGETS},s390x-linux-user"
>  
>  # Release builds
>  # The make-release script expect a QEMU version, so our tag must start 
> with a 'v'.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



[PATCH-for-5.0] roms/edk2-funcs.sh: Use available GCC for ARM/Aarch64 targets

2019-12-04 Thread Philippe Mathieu-Daudé

CentOS 7.7 only provides cross GCC 4.8.5, but the script forces
us to use GCC5. Since the same machinery is valid to check the
GCC version, remove the $emulation_target check.

  $ cat /etc/redhat-release
  CentOS Linux release 7.7.1908 (Core)

  $ aarch64-linux-gnu-gcc -v 2>&1 | tail -1
  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)

Signed-off-by: Philippe Mathieu-Daudé 
---
Patch to review with --ignore-all-space
---
 roms/edk2-funcs.sh | 48 +++---
 1 file changed, 20 insertions(+), 28 deletions(-)

diff --git a/roms/edk2-funcs.sh b/roms/edk2-funcs.sh
index 3f4485b201..a455611c0d 100644
--- a/roms/edk2-funcs.sh
+++ b/roms/edk2-funcs.sh
@@ -135,35 +135,27 @@ qemu_edk2_get_toolchain()
 return 1
   fi
 
-  case "$emulation_target" in
-(arm|aarch64)
-  printf 'GCC5\n'
+  if ! cross_prefix=$(qemu_edk2_get_cross_prefix "$emulation_target"); then
+return 1
+  fi
+
+  gcc_version=$("${cross_prefix}gcc" -v 2>&1 | tail -1 | awk '{print $3}')
+  # Run "git-blame" on "OvmfPkg/build.sh" in edk2 for more information on
+  # the mapping below.
+  case "$gcc_version" in
+([1-3].*|4.[0-7].*)
+  printf '%s: unsupported gcc version "%s"\n' \
+"$program_name" "$gcc_version" >&2
+  return 1
   ;;
-
-(i386|x86_64)
-  if ! cross_prefix=$(qemu_edk2_get_cross_prefix "$emulation_target"); then
-return 1
-  fi
-
-  gcc_version=$("${cross_prefix}gcc" -v 2>&1 | tail -1 | awk '{print $3}')
-  # Run "git-blame" on "OvmfPkg/build.sh" in edk2 for more information on
-  # the mapping below.
-  case "$gcc_version" in
-([1-3].*|4.[0-7].*)
-  printf '%s: unsupported gcc version "%s"\n' \
-"$program_name" "$gcc_version" >&2
-  return 1
-  ;;
-(4.8.*)
-  printf 'GCC48\n'
-  ;;
-(4.9.*|6.[0-2].*)
-  printf 'GCC49\n'
-  ;;
-(*)
-  printf 'GCC5\n'
-  ;;
-  esac
+(4.8.*)
+  printf 'GCC48\n'
+  ;;
+(4.9.*|6.[0-2].*)
+  printf 'GCC49\n'
+  ;;
+(*)
+  printf 'GCC5\n'
   ;;
   esac
 }
-- 
2.21.0
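
For readers who want the version-to-toolchain mapping from the case
statement above in one place, here is an approximate C rendering. It is
an illustration only: the enum and function names are hypothetical, and
unlike the shell globs it parses major.minor numerically and reports an
unparsable string as unsupported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum edk2_toolchain {
    TC_UNSUPPORTED,     /* gcc older than 4.8, or version not parsable */
    TC_GCC48,
    TC_GCC49,
    TC_GCC5
};

/* Hypothetical helper mirroring the gcc_version case statement. */
static enum edk2_toolchain edk2_toolchain_sketch(const char *gcc_version)
{
    int major, minor;

    assert(gcc_version != NULL);
    if (sscanf(gcc_version, "%d.%d", &major, &minor) != 2) {
        return TC_UNSUPPORTED;
    }
    /* ([1-3].*|4.[0-7].*) */
    if ((major >= 1 && major <= 3) ||
        (major == 4 && minor >= 0 && minor <= 7)) {
        return TC_UNSUPPORTED;
    }
    /* (4.8.*) */
    if (major == 4 && minor == 8) {
        return TC_GCC48;
    }
    /* (4.9.*|6.[0-2].*) */
    if ((major == 4 && minor == 9) ||
        (major == 6 && minor >= 0 && minor <= 2)) {
        return TC_GCC49;
    }
    /* (*) */
    return TC_GCC5;
}
```

With the CentOS 7.7 cross compiler from the commit message ("4.8.5"),
this yields TC_GCC48 rather than the previously hard-coded GCC5.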

Re: [PATCH 01/10] hw: arm: add Allwinner H3 System-on-Chip

2019-12-04 Thread Niek Linnenbank

Hello Philippe,

On Wed, Dec 4, 2019 at 5:53 PM Philippe Mathieu-Daudé 
wrote:

> Hi Niek,
>
> On 12/2/19 10:09 PM, Niek Linnenbank wrote:
> > The Allwinner H3 is a System on Chip containing four ARM Cortex A7
> > processor cores. Features and specifications include DDR2/DDR3 memory,
> > SD/MMC storage cards, 10/100/1000Mbit ethernet, USB 2.0, HDMI and
> > various I/O modules. This commit adds support for the Allwinner H3
> > System on Chip.
> >
> > Signed-off-by: Niek Linnenbank 
> > ---
> >   MAINTAINERS |   7 ++
> >   default-configs/arm-softmmu.mak |   1 +
> >   hw/arm/Kconfig  |   8 ++
> >   hw/arm/Makefile.objs|   1 +
> >   hw/arm/allwinner-h3.c   | 215 
> >   include/hw/arm/allwinner-h3.h   | 118 ++
> >   6 files changed, 350 insertions(+)
> >   create mode 100644 hw/arm/allwinner-h3.c
> >   create mode 100644 include/hw/arm/allwinner-h3.h
>
> Since your series changes various files, can you have a look at the
> scripts/git.orderfile file and setup it for your QEMU contributions?
>

OK, done! I didn't know such a script existed, thanks.
I ran this command in my local repository:
 $ git config diff.orderFile scripts/git.orderfile
It seems to work: when I regenerate the patches, the order of the diff is
different.



> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 5e5e3e52d6..29c9936037 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -479,6 +479,13 @@ F: hw/*/allwinner*
> >   F: include/hw/*/allwinner*
> >   F: hw/arm/cubieboard.c
> >
> > +Allwinner-h3
> > +M: Niek Linnenbank 
> > +L: qemu-...@nongnu.org
> > +S: Maintained
> > +F: hw/*/allwinner-h3*
> > +F: include/hw/*/allwinner-h3*
> > +
> >   ARM PrimeCell and CMSDK devices
> >   M: Peter Maydell 
> >   L: qemu-...@nongnu.org
> > diff --git a/default-configs/arm-softmmu.mak
> b/default-configs/arm-softmmu.mak
> > index 1f2e0e7fde..d75a239c2c 100644
> > --- a/default-configs/arm-softmmu.mak
> > +++ b/default-configs/arm-softmmu.mak
> > @@ -40,3 +40,4 @@ CONFIG_FSL_IMX25=y
> >   CONFIG_FSL_IMX7=y
> >   CONFIG_FSL_IMX6UL=y
> >   CONFIG_SEMIHOSTING=y
> > +CONFIG_ALLWINNER_H3=y
> > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> > index c6e7782580..ebf8d2325f 100644
> > --- a/hw/arm/Kconfig
> > +++ b/hw/arm/Kconfig
> > @@ -291,6 +291,14 @@ config ALLWINNER_A10
> >   select SERIAL
> >   select UNIMP
> >
> > +config ALLWINNER_H3
> > +bool
> > +select ALLWINNER_A10_PIT
> > +select SERIAL
> > +select ARM_TIMER
> > +select ARM_GIC
> > +select UNIMP
> > +
> >   config RASPI
> >   bool
> >   select FRAMEBUFFER
> > diff --git a/hw/arm/Makefile.objs b/hw/arm/Makefile.objs
> > index fe749f65fd..956e496052 100644
> > --- a/hw/arm/Makefile.objs
> > +++ b/hw/arm/Makefile.objs
> > @@ -34,6 +34,7 @@ obj-$(CONFIG_DIGIC) += digic.o
> >   obj-$(CONFIG_OMAP) += omap1.o omap2.o
> >   obj-$(CONFIG_STRONGARM) += strongarm.o
> >   obj-$(CONFIG_ALLWINNER_A10) += allwinner-a10.o cubieboard.o
> > +obj-$(CONFIG_ALLWINNER_H3) += allwinner-h3.o
> >   obj-$(CONFIG_RASPI) += bcm2835_peripherals.o bcm2836.o raspi.o
> >   obj-$(CONFIG_STM32F205_SOC) += stm32f205_soc.o
> >   obj-$(CONFIG_XLNX_ZYNQMP_ARM) += xlnx-zynqmp.o xlnx-zcu102.o
> > diff --git a/hw/arm/allwinner-h3.c b/hw/arm/allwinner-h3.c
> > new file mode 100644
> > index 00..470fdfebef
> > --- /dev/null
> > +++ b/hw/arm/allwinner-h3.c
> > @@ -0,0 +1,215 @@
> > +/*
> > + * Allwinner H3 System on Chip emulation
> > + *
> > + * Copyright (C) 2019 Niek Linnenbank 
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "exec/address-spaces.h"
> > +#include "qapi/error.h"
> > +#include "qemu/module.h"
> > +#include "qemu/units.h"
> > +#include "cpu.h"
> > +#include "hw/sysbus.h"
> > +#include "hw/arm/allwinner-h3.h"
> > +#include "hw/misc/unimp.h"
> > +#include "sysemu/sysemu.h"
> > +
> > +static void aw_h3_init(Object *obj)
> > +{
> > +AwH3State *s = AW_H3(obj);
> > +
> > +sysbus_init_child_obj(obj, "gic", &s->gic, sizeof(s->gic),
> > +  TYPE_ARM_GIC);
> > +
> > +sysbus_init_child_obj(obj, "timer", &s->timer, sizeof(s->timer),
> > +  TYPE_AW_A10_PIT);
> > +}
> > +
> > +static void

Re: [PATCH v6 0/9] Clock framework API

2019-12-04 Thread Philippe Mathieu-Daudé


On 12/4/19 5:40 PM, Damien Hedde wrote:

On 12/2/19 5:15 PM, Peter Maydell wrote:


The one topic I think we could do with discussing is whether
a simple uint64_t giving the frequency of the clock in Hz is
the right representation. In particular in your patch 9 the
board has a clock frequency that's not a nice integer number
of Hz. I think Philippe also mentioned on irc some board where
the UART clock ends up at a weird frequency. Since the
representation of the frequency is baked into the migration
format it's going to be easier to get it right first rather
than trying to change it later.


An important clarification for Damien: IIUC we cannot migrate float/double types.


So what should the representation be? Some random thoughts:

1) ptimer internally uses a 'period plus fraction' representation:
  int64_t period is the integer part of the period in nanoseconds,
  uint32_t period_frac is the fractional part of the period
(if you like you can think of this as "96-bit integer
period measured in units of one-2^32nd of a nanosecond").
However its only public interfaces for setting the frequency
are (a) set the frequency in Hz (uint32_t) or (b) set
the period in nanoseconds (int64_t); the period_frac part
is used to handle frequencies which don't work out to
a nice whole number of nanoseconds per cycle.


This is very clear, thanks Peter!

The period+period_frac split allows us to migrate the 96 bits:

VMSTATE_UINT32(period_frac, ptimer_state),
VMSTATE_INT64(period, ptimer_state),
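
A minimal sketch of how a frequency in Hz could be split into such a
period/period_frac pair, assuming the fraction is counted in units of
2^-32 ns as described above. The PeriodFrac type and the period_from_hz()
helper are hypothetical names used only for illustration; this is not the
ptimer code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container mirroring the two fields discussed above. */
typedef struct {
    int64_t period;       /* whole nanoseconds per cycle */
    uint32_t period_frac; /* remainder, in units of 2^-32 ns */
} PeriodFrac;

/* Split the period of a clock running at 'hz' into whole and fractional ns. */
static PeriodFrac period_from_hz(uint32_t hz)
{
    const uint64_t ns_per_sec = 1000000000ULL;
    PeriodFrac p = { 0, 0 };
    uint64_t rem;

    if (hz == 0) {
        return p; /* this sketch treats 0 Hz as "no period" */
    }
    p.period = (int64_t)(ns_per_sec / hz);
    rem = ns_per_sec % hz;
    assert(rem < hz); /* so (rem << 32) / hz always fits in 32 bits */
    p.period_frac = (uint32_t)((rem << 32) / hz);
    return p;
}
```

For a 3 Hz clock this gives 333333333 ns plus 1431655765/2^32 ns, that is,
the third of a nanosecond that a whole-nanosecond period would drop.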


2) I hear that SystemC uses "value plus a time unit", with
the smallest unit being a picosecond. (I think SystemC
also lets you specify the duty cycle, but we definitely
don't want to get into that!)


The "value" is internally stored in a 64-bit unsigned integer.



3) QEMUTimers are basically just nanosecond timers


Similarly to SystemC, the QEMUTimer macros use a 'scale' unit:

#define SCALE_MS 1000000
#define SCALE_US 1000
#define SCALE_NS 1



4) The MAME emulator seems to work with periods of
96-bit attoseconds (represented internally by a
32-bit count of seconds plus a 64-bit count of
attoseconds). One attosecond is 1e-18 seconds.

Does anybody else have experience with other modelling
or emulator technology and how it represents clocks ?


5) In linux, a clock rate is an "unsigned long" representing Hz.



I feel we should at least be able to represent clocks
with the same accuracy that ptimer has.


Then it is maybe a good idea to store the period and not the frequency in
clocks so that we don't lose anything when we switch from a clock to a
ptimer?


I think storing the period as an integer type is a good idea.

However if we store the period in nanoseconds, we get at most 1GHz 
frequency.


The attosecond granularity feels overkill.

If we use a 96-bit integer to store picoseconds and use similar SCALE 
macros we get to 1THz.


Regardless of the unit chosen, as long as it is an integer, we can migrate it.
If we can migrate the period, we don't need to migrate the frequency.
We can then use the float type with the timer API to pass frequencies 
(which in the modeled hardware are ratios, likely not integers).


So we could use set_freq(100e6 / 3), set_freq(40e6 / 5.5) directly.
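
As a rough sketch of that idea, a floating-point frequency passed to the API
could be converted once into an integer period, which would then be the only
state to migrate. The 2^-32 ns unit and the names CLOCK_TICKS_PER_SEC and
clock_period_from_freq() are assumptions made for illustration; they are not
part of the series:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed unit: one tick is 2^-32 ns, so one second is 1e9 * 2^32 ticks. */
#define CLOCK_TICKS_PER_SEC (UINT64_C(1000000000) << 32)

/*
 * Convert a (possibly fractional) frequency in Hz to an integer period in
 * ticks, rounding to nearest. Only this integer would need to be migrated.
 */
static uint64_t clock_period_from_freq(double hz)
{
    double ticks;

    if (hz <= 0.0) {
        return 0; /* in this sketch, 0 stands for a disabled clock */
    }
    ticks = (double)CLOCK_TICKS_PER_SEC / hz;
    assert(ticks < 18446744073709551616.0); /* must fit in 64 bits */
    return (uint64_t)(ticks + 0.5);
}
```

With these assumptions, clock_period_from_freq(100e6 / 3) yields 128849018880
ticks (30 ns), so the float only appears at the API boundary.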


Regarding the clock, I don't see any strong obstacle to switch
internally to a period based value.
The only thing we have to choose is how to represent a disabled clock.
Since putting a "0" period into a ptimer will disable the timer in
ptimer_reload(), we can choose that (and it's a good value because we
can multiply or divide it, it stays the same).

We could use the same representation as a ptimer. But if we don't keep a
C number representation, then computation of frequencies/periods will be
complicated at best and error prone.

From that point of view, if we could stick to a 64-bit integer (or
floating point number) it would be great. Can we use a sub-nanosecond
unit that fits our needs?

I did some tests with a unit of 2^-32 of nanoseconds on 64 bits (is that
the unit of the ptimer fractional part?) and if I'm not mistaken
+ we have a frequency range from ~0.2Hz up to 10^18Hz
+ the resolution is decreasing with the frequency (but at 100MHz we have
a ~2.3mHz resolution, at 1GHz it's ~0.23Hz and at 10GHz ~23Hz
resolution). We hit 1Hz resolution around 2GHz.

So it sounds to me we have largely enough resolution to model clocks in
the range of frequencies we will have to handle. What do you think ?
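
The resolution figures above can be checked with a short sketch, assuming a
64-bit period counted in ticks of 2^-32 ns. TICKS_PER_SEC and the three
helpers are hypothetical names, not an existing QEMU API:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed unit: one tick is 2^-32 ns, so one second is 1e9 * 2^32 ticks. */
#define TICKS_PER_SEC (UINT64_C(1000000000) << 32)

/* Period in ticks for an integer frequency; 0 Hz maps to period 0. */
static uint64_t period_ticks_from_hz(uint64_t hz)
{
    return hz ? TICKS_PER_SEC / hz : 0;
}

/* Frequency in Hz (truncated) for a period in ticks; 0 stays 0. */
static uint64_t hz_from_period_ticks(uint64_t period)
{
    return period ? TICKS_PER_SEC / period : 0;
}

/*
 * Frequency change, in Hz, caused by lengthening the period by one tick:
 * T/P - T/(P + 1) = T / (P * (P + 1)).
 */
static double hz_resolution(uint64_t period)
{
    double p = (double)period;

    assert(period != 0);
    return (double)TICKS_PER_SEC / (p * (p + 1.0));
}
```

Under these assumptions, hz_resolution(period_ticks_from_hz(f)) comes out at
about 2.3 mHz for 100 MHz, 0.23 Hz for 1 GHz and 23 Hz for 10 GHz, matching
the estimates above. A period of 0 also round-trips to 0 Hz, which fits the
"disabled clock" convention.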


Back to your series, I wonder why you want to store the frequency in 
ClockIn. ClockIn shouldn't be aware of the frequency at which it is clocked. 
What matters is ClockOut, and each device exposing ClockOuts has a 
(migrated) state of the output frequencies (either in fields, or encoded 
in registers). Once migrated, after the state is loaded back into the 
device, we call post_load(). Isn't that a good place to call 
clock_set_frequency(ClockOut[]), which will correctly set each ClockIn 
frequency?


IOW I don't think ClockIn/ClockOut require to

Re: [PATCH 04/10] arm: allwinner-h3: add USB host controller

2019-12-04 Thread Niek Linnenbank

On Wed, Dec 4, 2019 at 5:11 PM Aleksandar Markovic <
aleksandar.m.m...@gmail.com> wrote:

>
>
> On Monday, December 2, 2019, Niek Linnenbank 
> wrote:
>
>> The Allwinner H3 System on Chip contains multiple USB 2.0 bus
>> connections which provide software access using the Enhanced
>> Host Controller Interface (EHCI) and Open Host Controller
>> Interface (OHCI) interfaces. This commit adds support for
>> both interfaces in the Allwinner H3 System on Chip.
>>
>> Signed-off-by: Niek Linnenbank 
>> ---
>
>
> Niek, hi!
>
> I would like to clarify a detail here:
>
> The spec of the SoC enumerates (in 8.5.2.4. USB Host Register List) a
> number of registers for reading various USB-related states, but also for
> setting some of USB features.
>
> Does this series cover these registers, and interaction with them? If yes,
> how and where? If not, do you think it is not necessary at all? Or perhaps
> that it is a non-crucial limitation of this series?
>

Hello Aleksandar!

Very good question, I will try to explain what I did to support USB for the
Allwinner H3 emulation.
EHCI and OHCI are both standardized interfaces to the USB bus and both
provide their own standardized software interface.
Because they are standards, operating system drivers can implement a
generic driver which uses the defined interface and
re-use it in multiple boards/platforms. Things that can be different
between boards are, for example the base address in
memory where the registers are provided.

In QEMU I found that both the OHCI and EHCI host controllers are already
emulated and used by other boards as well. For example,
you can find the OHCI registers from 8.5.2.4 implemented in the file
hw/usb/hcd-ohci.c:1515 in ohci_mem_read(). So for the Allwinner
H3 I simply had to define the base address for both controllers and create
the objects. At that point, the Linux kernel can access
the USB bus with the generic EHCI/OHCI platform drivers. In the Linux code,
you can see in the file ./arch/arm/boot/dts/sunxi-h3-h5.dtsi:281
the definitions named ehci0-ehci3 and ohci0-ohci3 where it specifies in the
device tree configuration to load the generic drivers.


>
> Thanks in advance, and congrats for your, it seems, first submission!
>
>
Thank you Aleksandar! Indeed, it is my first submission. I will do my best
to
update the patches to comply with the QEMU coding style and best practises.

Regards,
Niek


> Aleksandar
>
>
>  hw/arm/allwinner-h3.c| 20 
>>  hw/usb/hcd-ehci-sysbus.c | 17 +
>>  hw/usb/hcd-ehci.h|  1 +
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/hw/arm/allwinner-h3.c b/hw/arm/allwinner-h3.c
>> index 5566e979ec..afeb49c0ac 100644
>> --- a/hw/arm/allwinner-h3.c
>> +++ b/hw/arm/allwinner-h3.c
>> @@ -26,6 +26,7 @@
>>  #include "hw/sysbus.h"
>>  #include "hw/arm/allwinner-h3.h"
>>  #include "hw/misc/unimp.h"
>> +#include "hw/usb/hcd-ehci.h"
>>  #include "sysemu/sysemu.h"
>>
>>  static void aw_h3_init(Object *obj)
>> @@ -183,6 +184,25 @@ static void aw_h3_realize(DeviceState *dev, Error
>> **errp)
>>  }
>>  sysbus_mmio_map(SYS_BUS_DEVICE(&s->ccu), 0, AW_H3_CCU_BASE);
>>
>> +/* Universal Serial Bus */
>> +sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI0_BASE,
>> + s->irq[AW_H3_GIC_SPI_EHCI0]);
>> +sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI1_BASE,
>> + s->irq[AW_H3_GIC_SPI_EHCI1]);
>> +sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI2_BASE,
>> + s->irq[AW_H3_GIC_SPI_EHCI2]);
>> +sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI3_BASE,
>> + s->irq[AW_H3_GIC_SPI_EHCI3]);
>> +
>> +sysbus_create_simple("sysbus-ohci", AW_H3_OHCI0_BASE,
>> + s->irq[AW_H3_GIC_SPI_OHCI0]);
>> +sysbus_create_simple("sysbus-ohci", AW_H3_OHCI1_BASE,
>> + s->irq[AW_H3_GIC_SPI_OHCI1]);
>> +sysbus_create_simple("sysbus-ohci", AW_H3_OHCI2_BASE,
>> + s->irq[AW_H3_GIC_SPI_OHCI2]);
>> +sysbus_create_simple("sysbus-ohci", AW_H3_OHCI3_BASE,
>> + s->irq[AW_H3_GIC_SPI_OHCI3]);
>> +
>>  /* UART */
>>  if (serial_hd(0)) {
>>  serial_mm_init(get_system_memory(), AW_H3_UART0_REG_BASE, 2,
>> diff --git a/hw/usb/hcd-ehci-sysbus.c b/hw/usb/hcd-ehci-sysbus.c
>> index 020211fd10..174c3446ef 100644
>> --- a/hw/usb/hcd-ehci-sysbus.c
>> +++ b/hw/usb/hcd-ehci-sysbus.c
>> @@ -145,6 +145,22 @@ static const TypeInfo ehci_exynos4210_type_info = {
>>  .class_init= ehci_exynos4210_class_init,
>>  };
>>
>> +static void ehci_aw_h3_class_init(ObjectClass *oc, void *data)
>> +{
>> +SysBusEHCIClass *sec = SYS_BUS_EHCI_CLASS(oc);
>> +DeviceClass *dc = DEVICE_CLASS(oc);
>> +
>> +sec->capsbase = 0x0;
>> +sec->opregbase = 0x10;
>> +set_bit(DEVICE_CATEGORY_USB, dc->categories);
>> +}
>> +
>> +static const TypeInfo ehci_aw_h3_type_info = {
>> +.name  = TYPE_AW_H3_EHCI,

Re: [PATCH v2 0/7] Enable Travis builds on arm64, ppc64le and s390x

2019-12-04 Thread Cleber Rosa

On Wed, Dec 04, 2019 at 04:46:11PM +0100, Thomas Huth wrote:
> Travis recently added build hosts for arm64, ppc64le and s390x, so
> this is a welcome addition to our Travis testing matrix.
> 
> Unfortunately, the builds are running in quite restricted LXD containers
> there, for example it is not possible to create huge files there (even
> if they are just sparse), and certain system calls are blocked. So we
> have to change some tests first to stop them failing in such environments.
>

Hi Thomas,

FYI, Avocado[1] has been running checks on those arches for a little
over two weeks and in my experience, there are still some reliability
issues (besides the other limitations you're already aware of).

During the last week I've stopped seeing "machines" that wouldn't boot,
or severe networking limitations, but things are still not as smooth
as I'd like.

Anyway, I think we should insist on it, and give it a bit more time,
so I definitely agree with and appreciate this work.

[1] https://travis-ci.org/avocado-framework/avocado/builds

- Cleber.

> v2:
>  - Added "make check-tcg" and Alex' patch to disable cross-containers
>  - Explicitly set "dist: xenial" for arm64 and ppc64le since some
>iotests are crashing on bionic on these hosts.
>  - Dropped "libcap-dev" from the package list since it will be replaced
>by libcapng-dev soon.
> 
> Alex Bennée (1):
>   configure: allow disable of cross compilation containers
> 
> Thomas Huth (6):
>   iotests: Provide a function for checking the creation of huge files
>   iotests: Skip test 060 if it is not possible to create large files
>   iotests: Skip test 079 if it is not possible to create large files
>   tests/hd-geo-test: Skip test when images can not be created
>   tests/test-util-filemonitor: Skip test on non-x86 Travis containers
>   travis.yml: Enable builds on arm64, ppc64le and s390x
> 
>  .travis.yml   | 86 +++
>  configure |  8 +++-
>  tests/hd-geo-test.c   | 12 -
>  tests/qemu-iotests/005|  5 +-
>  tests/qemu-iotests/060|  3 ++
>  tests/qemu-iotests/079|  3 ++
>  tests/qemu-iotests/220|  6 +--
>  tests/qemu-iotests/common.rc  | 10 
>  tests/tcg/configure.sh|  6 ++-
>  tests/test-util-filemonitor.c | 11 +
>  10 files changed, 138 insertions(+), 12 deletions(-)
> 
> -- 
> 2.18.1
>

Re: [PATCH 02/10] hw: arm: add Xunlong Orange Pi PC machine

2019-12-04 Thread Niek Linnenbank

On Wed, Dec 4, 2019 at 10:03 AM Philippe Mathieu-Daudé 
wrote:

> On 12/3/19 8:33 PM, Niek Linnenbank wrote:
> > Hello Philippe,
> >
> > Thanks for your quick review comments!
> > I'll start working on a v2 of the patches and include the changes you
> > suggested.
>
> Thanks, but I'd suggest waiting a few more days to give time to other
> reviewers. Otherwise, having multiple versions of a big series reviewed at
> the same time is very confusing.
> I have other minor comments on other patches, but need to find the time
> to continue reviewing.
>
>
OK Philippe, I will follow your advice and wait a few more days before
submitting a new version.
I'll wait at least until you have had a chance to review all the patches. I'm
new to the QEMU
community, so I will need to learn the process along the way.

Regards,
Niek





-- 
Niek Linnenbank

Re: [PATCH v4 26/40] target/arm: Update define_one_arm_cp_reg_with_opaque for VHE

2019-12-04 Thread Richard Henderson

On 12/4/19 10:58 AM, Alex Bennée wrote:
>> @@ -7437,13 +7437,10 @@ void define_one_arm_cp_reg_with_opaque(ARMCPU *cpu,
>>  mask = PL0_RW;
>>  break;
>>  case 4:
>> +case 5:
>>  /* min_EL EL2 */
>>  mask = PL2_RW;
>>  break;
>> -case 5:
>> -/* unallocated encoding, so not possible */
>> -assert(false);
>> -break;
> 
> This change is fine - I don't think we should have asserted here anyway.
> But don't we generate an unallocated exception if the CPU is v8.0?

This change is only for validation of the system registers themselves.  It has
nothing to do with the usage of system registers from the actual guest.


r~

[for-5.0 PATCH 3/4] ppc: Don't use CPUPPCState::irq_input_state with modern Book3s CPU models

2019-12-04 Thread Greg Kurz

The power7_set_irq() and power9_set_irq() functions set this but it is
never actually used. Modern Book3s compatible CPUs are only supported
by the pnv and spapr machines. They have an interrupt controller, XICS
for POWER7/8 and XIVE for POWER9, whose models don't need to track
IRQ input states at the CPU level.

Drop these lines to avoid confusion.

Signed-off-by: Greg Kurz 
---
 hw/ppc/ppc.c |   16 ++--
 target/ppc/cpu.h |4 +++-
 2 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index fab73f1b1fc9..45834f98d176 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -275,10 +275,9 @@ void ppc970_irq_init(PowerPCCPU *cpu)
 static void power7_set_irq(void *opaque, int pin, int level)
 {
 PowerPCCPU *cpu = opaque;
-CPUPPCState *env = &cpu->env;
 
 LOG_IRQ("%s: env %p pin %d level %d\n", __func__,
-env, pin, level);
+&cpu->env, pin, level);
 
 switch (pin) {
 case POWER7_INPUT_INT:
@@ -292,11 +291,6 @@ static void power7_set_irq(void *opaque, int pin, int 
level)
 LOG_IRQ("%s: unknown IRQ pin %d\n", __func__, pin);
 return;
 }
-if (level) {
-env->irq_input_state |= 1 << pin;
-} else {
-env->irq_input_state &= ~(1 << pin);
-}
 }
 
 void ppcPOWER7_irq_init(PowerPCCPU *cpu)
@@ -311,10 +305,9 @@ void ppcPOWER7_irq_init(PowerPCCPU *cpu)
 static void power9_set_irq(void *opaque, int pin, int level)
 {
 PowerPCCPU *cpu = opaque;
-CPUPPCState *env = &cpu->env;
 
 LOG_IRQ("%s: env %p pin %d level %d\n", __func__,
-env, pin, level);
+&cpu->env, pin, level);
 
 switch (pin) {
 case POWER9_INPUT_INT:
@@ -334,11 +327,6 @@ static void power9_set_irq(void *opaque, int pin, int 
level)
 LOG_IRQ("%s: unknown IRQ pin %d\n", __func__, pin);
 return;
 }
-if (level) {
-env->irq_input_state |= 1 << pin;
-} else {
-env->irq_input_state &= ~(1 << pin);
-}
 }
 
 void ppcPOWER9_irq_init(PowerPCCPU *cpu)
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index e3e82327b723..f9528fc29d98 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1090,7 +1090,9 @@ struct CPUPPCState {
 #if !defined(CONFIG_USER_ONLY)
 /*
  * This is the IRQ controller, which is implementation dependent
- * and only relevant when emulating a complete machine.
+ * and only relevant when emulating a complete machine. Note that
+ * this isn't used by recent Book3s compatible CPUs (POWER7 and
+ * newer).
  */
 uint32_t irq_input_state;
 void **irq_inputs;

[for-5.0 PATCH 4/4] ppc: Ignore the CPU_INTERRUPT_EXITTB interrupt with KVM

2019-12-04 Thread Greg Kurz

This only makes sense with an emulated CPU. Don't set the bit in
CPUState::interrupt_request when using KVM to avoid confusion.

Signed-off-by: Greg Kurz 
---
 target/ppc/helper_regs.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/target/ppc/helper_regs.h b/target/ppc/helper_regs.h
index 85dfe7687fbb..d78c2af63eac 100644
--- a/target/ppc/helper_regs.h
+++ b/target/ppc/helper_regs.h
@@ -22,6 +22,7 @@
 
 #include "qemu/main-loop.h"
 #include "exec/exec-all.h"
+#include "sysemu/kvm.h"
 
 /* Swap temporary saved registers with GPRs */
 static inline void hreg_swap_gpr_tgpr(CPUPPCState *env)
@@ -102,6 +103,10 @@ static inline void hreg_compute_hflags(CPUPPCState *env)
 
 static inline void cpu_interrupt_exittb(CPUState *cs)
 {
+if (kvm_enabled()) {
+return;
+}
+
 if (!qemu_mutex_iothread_locked()) {
 qemu_mutex_lock_iothread();
 cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);

[for-5.0 PATCH 2/4] xics: Don't deassert outputs

2019-12-04 Thread Greg Kurz

The correct way to do this is to deassert the input pins on the CPU side.
This is the case since a previous change.

Signed-off-by: Greg Kurz 
---
 hw/intc/xics.c |3 ---
 1 file changed, 3 deletions(-)

diff --git a/hw/intc/xics.c b/hw/intc/xics.c
index 0b259a09c545..1952009e6d22 100644
--- a/hw/intc/xics.c
+++ b/hw/intc/xics.c
@@ -289,9 +289,6 @@ void icp_reset(ICPState *icp)
 icp->pending_priority = 0xff;
 icp->mfrr = 0xff;
 
-/* Make all outputs are deasserted */
-qemu_set_irq(icp->output, 0);
-
 if (kvm_irqchip_in_kernel()) {
 Error *local_err = NULL;

[for-5.0 PATCH 1/4] ppc: Deassert the external interrupt pin in KVM on reset

2019-12-04 Thread Greg Kurz

When a CPU is reset, QEMU makes sure no interrupt is pending by clearing
CPUPPCState::pending_interrupts in ppc_cpu_reset(). In the case of a
complete machine emulation, eg. a sPAPR machine, an external interrupt
request could still be pending in KVM though, eg. an IPI. It will be
eventually presented to the guest, which is supposed to acknowledge it at
the interrupt controller. If the interrupt controller is emulated in QEMU,
either XICS or XIVE, ppc_set_irq() won't deassert the external interrupt
pin in KVM since it isn't pending anymore for QEMU. When the vCPU re-enters
the guest, the interrupt request is still pending and the vCPU will try
again to acknowledge it. This causes an infinite loop and eventually hangs
the guest.

The code has been broken since the beginning. The issue wasn't hit before
because accel=kvm,kernel-irqchip=off is an awkward setup that never got
used until recently with the LC92x IBM systems (aka, Boston).

Add a ppc_irq_reset() function to do the necessary cleanup, ie. deassert
the IRQ pins of the CPU in QEMU and most importantly the external interrupt
pin for this vCPU in KVM.

Reported-by: Satheesh Rajendran 
Signed-off-by: Greg Kurz 
---
 hw/ppc/ppc.c|8 
 include/hw/ppc/ppc.h|2 ++
 target/ppc/translate_init.inc.c |1 +
 3 files changed, 11 insertions(+)

diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index 8dd982fc1e40..fab73f1b1fc9 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -1515,3 +1515,11 @@ PowerPCCPU *ppc_get_vcpu_by_pir(int pir)
 
 return NULL;
 }
+
+void ppc_irq_reset(PowerPCCPU *cpu)
+{
+CPUPPCState *env = &cpu->env;
+
+env->irq_input_state = 0;
+kvmppc_set_interrupt(cpu, PPC_INTERRUPT_EXT, 0);
+}
diff --git a/include/hw/ppc/ppc.h b/include/hw/ppc/ppc.h
index 585be6ab98c5..89e1dd065af7 100644
--- a/include/hw/ppc/ppc.h
+++ b/include/hw/ppc/ppc.h
@@ -77,6 +77,7 @@ static inline void ppc970_irq_init(PowerPCCPU *cpu) {}
 static inline void ppcPOWER7_irq_init(PowerPCCPU *cpu) {}
 static inline void ppcPOWER9_irq_init(PowerPCCPU *cpu) {}
 static inline void ppce500_irq_init(PowerPCCPU *cpu) {}
+static inline void ppc_irq_reset(PowerPCCPU *cpu) {}
 #else
 void ppc40x_irq_init(PowerPCCPU *cpu);
 void ppce500_irq_init(PowerPCCPU *cpu);
@@ -84,6 +85,7 @@ void ppc6xx_irq_init(PowerPCCPU *cpu);
 void ppc970_irq_init(PowerPCCPU *cpu);
 void ppcPOWER7_irq_init(PowerPCCPU *cpu);
 void ppcPOWER9_irq_init(PowerPCCPU *cpu);
+void ppc_irq_reset(PowerPCCPU *cpu);
 #endif
 
 /* PPC machines for OpenBIOS */
diff --git a/target/ppc/translate_init.inc.c b/target/ppc/translate_init.inc.c
index ba726dec4d00..64a838095c7a 100644
--- a/target/ppc/translate_init.inc.c
+++ b/target/ppc/translate_init.inc.c
@@ -10461,6 +10461,7 @@ static void ppc_cpu_reset(CPUState *s)
 env->pending_interrupts = 0;
 s->exception_index = POWERPC_EXCP_NONE;
 env->error_code = 0;
+ppc_irq_reset(cpu);
 
 /* tininess for underflow is detected before rounding */
 set_float_detect_tininess(float_tininess_before_rounding,

[for-5.0 PATCH 0/4] ppc: Fix interrupt controller emulation

2019-12-04 Thread Greg Kurz

Guest hangs have been observed recently on POWER9 hosts, specifically LC92x
"Boston" systems, when the guests are being rebooted multiple times. The
issue isn't POWER9 specific though. It is caused by a very long standing bug
when using the uncommon accel=kvm,kernel-irqchip=off machine configuration
which happens to be enforced on LC92x because of a host FW limitation. This
affects both the XICS and XIVE emulated interrupt controllers.

The actual fix is in patch 1. Patch 2 is a followup cleanup. The other
patches are unrelated cleanups I came up with while investigating.

Since this bug always existed and we're already in rc4, I think it is better
to fix it in 5.0 and possibly backport it to stable and downstream if needed.

--
Greg

---

Greg Kurz (4):
  ppc: Deassert the external interrupt pin in KVM on reset
  xics: Don't deassert outputs
  ppc: Don't use CPUPPCState::irq_input_state with modern Book3s CPU models
  ppc: Ignore the CPU_INTERRUPT_EXITTB interrupt with KVM


 hw/intc/xics.c  |3 ---
 hw/ppc/ppc.c|   24 ++--
 include/hw/ppc/ppc.h|2 ++
 target/ppc/cpu.h|4 +++-
 target/ppc/helper_regs.h|5 +
 target/ppc/translate_init.inc.c |1 +
 6 files changed, 21 insertions(+), 18 deletions(-)

Re: [PATCH v2 1/5] virtiofsd: Get rid of unused fields in fv_QueueInfo

2019-12-04 Thread Dr. David Alan Gilbert

* Vivek Goyal (vgo...@redhat.com) wrote:
> There are some unused fields in "struct fv_QueueInfo". Get rid of these 
> fields.
> 
> Signed-off-by: Vivek Goyal 
> ---
>  contrib/virtiofsd/fuse_virtio.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> index 31c8542b6c..2a9cd60a01 100644
> --- a/contrib/virtiofsd/fuse_virtio.c
> +++ b/contrib/virtiofsd/fuse_virtio.c
> @@ -50,12 +50,6 @@ struct fv_QueueInfo {
>  int qidx;
>  int kick_fd;
>  int kill_fd; /* For killing the thread */
> -
> -/* The element for the command currently being processed */
> -VuVirtqElement *qe;
> -/* If any of the qe vec elements (towards vmm) are unmappable */
> -unsigned int elem_bad_in;
> -bool reply_sent;

Yep, those last two got moved into FVRequest as part of the thread pool
stuff.


Reviewed-by: Dr. David Alan Gilbert 

>  };
>  
>  /* A FUSE request */
> -- 
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

[PATCH v2 5/5] virtiofsd: Implement blocking posix locks

2019-12-04 Thread Vivek Goyal

As of now we don't support fcntl(F_SETLKW) and if we see one, we return
-EOPNOTSUPP.

Change that by accepting these requests and returning a reply immediately
asking caller to wait. Once lock is available, send a notification to
the waiter indicating lock is available.

Signed-off-by: Vivek Goyal 
---
 contrib/virtiofsd/fuse_kernel.h|  7 +++
 contrib/virtiofsd/fuse_lowlevel.c  | 23 ++-
 contrib/virtiofsd/fuse_lowlevel.h  | 25 
 contrib/virtiofsd/fuse_virtio.c| 97 --
 contrib/virtiofsd/passthrough_ll.c | 49 ---
 5 files changed, 185 insertions(+), 16 deletions(-)

diff --git a/contrib/virtiofsd/fuse_kernel.h b/contrib/virtiofsd/fuse_kernel.h
index 2bdc8b1c88..432eb14d14 100644
--- a/contrib/virtiofsd/fuse_kernel.h
+++ b/contrib/virtiofsd/fuse_kernel.h
@@ -444,6 +444,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_STORE = 4,
FUSE_NOTIFY_RETRIEVE = 5,
FUSE_NOTIFY_DELETE = 6,
+   FUSE_NOTIFY_LOCK = 7,
FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -836,6 +837,12 @@ struct fuse_notify_retrieve_in {
uint64_tdummy4;
 };
 
+struct fuse_notify_lock_out {
+   uint64_tunique;
+   int32_t error;
+   int32_t padding;
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
 
diff --git a/contrib/virtiofsd/fuse_lowlevel.c 
b/contrib/virtiofsd/fuse_lowlevel.c
index d4a42d9804..3d9c289510 100644
--- a/contrib/virtiofsd/fuse_lowlevel.c
+++ b/contrib/virtiofsd/fuse_lowlevel.c
@@ -183,7 +183,8 @@ int fuse_send_reply_iov_nofree(fuse_req_t req, int error, 
struct iovec *iov,
 {
struct fuse_out_header out;
 
-   if (error <= -1000 || error > 0) {
+   /* error = 1 has been used to signal client to wait for notification */
+   if (error <= -1000 || error > 1) {
fuse_log(FUSE_LOG_ERR, "fuse: bad error value: %i\n",   error);
error = -ERANGE;
}
@@ -291,6 +292,12 @@ int fuse_reply_err(fuse_req_t req, int err)
return send_reply(req, -err, NULL, 0);
 }
 
+int fuse_reply_wait(fuse_req_t req)
+{
+   /* TODO: This is a hack. Fix it */
+   return send_reply(req, 1, NULL, 0);
+}
+
 void fuse_reply_none(fuse_req_t req)
 {
fuse_free_req(req);
@@ -2207,6 +2214,20 @@ static int send_notify_iov(struct fuse_session *se, int 
notify_code,
return fuse_send_msg(se, NULL, iov, count);
 }
 
+int fuse_lowlevel_notify_lock(struct fuse_session *se, uint64_t unique,
+ int32_t error)
+{
+   struct fuse_notify_lock_out outarg = {0};
+   struct iovec iov[2];
+
+   outarg.unique = unique;
+   outarg.error = -error;
+
+   iov[1].iov_base = &outarg;
+   iov[1].iov_len = sizeof(outarg);
+   return send_notify_iov(se, FUSE_NOTIFY_LOCK, iov, 2);
+}
+
 int fuse_lowlevel_notify_poll(struct fuse_pollhandle *ph)
 {
if (ph != NULL) {
diff --git a/contrib/virtiofsd/fuse_lowlevel.h 
b/contrib/virtiofsd/fuse_lowlevel.h
index e664d2d12d..4126b4f967 100644
--- a/contrib/virtiofsd/fuse_lowlevel.h
+++ b/contrib/virtiofsd/fuse_lowlevel.h
@@ -1251,6 +1251,22 @@ struct fuse_lowlevel_ops {
  */
 int fuse_reply_err(fuse_req_t req, int err);
 
+/**
+ * Ask caller to wait for lock.
+ *
+ * Possible requests:
+ *   setlkw
+ *
+ * If caller sends a blocking lock request (setlkw), then reply to caller
+ * asking it to wait for the lock to become available. Once the lock is
+ * available, the caller will receive a notification with the request's
+ * unique id. The notification carries whether the lock was obtained or not.
+ *
+ * @param req request handle
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_wait(fuse_req_t req);
+
 /**
  * Don't send reply
  *
@@ -1704,6 +1720,15 @@ int fuse_lowlevel_notify_delete(struct fuse_session *se,
 int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
   off_t offset, struct fuse_bufvec *bufv,
   enum fuse_buf_copy_flags flags);
+/**
+ * Notify event related to previous lock request
+ *
+ * @param se the session object
+ * @param unique the unique id of the request which requested setlkw
+ * @param error zero for success, -errno for the failure
+ */
+int fuse_lowlevel_notify_lock(struct fuse_session *se, uint64_t unique,
+ int32_t error);
 
 /* --- *
  * Utility functions  *
diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index 94cf9b3791..129dd329f6 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -208,6 +208,83 @@ static void copy_iov(struct iovec *src_iov, int src_count,
 }
 }
 
+static int virtio_send_notify_msg(struct fuse_session *se, struct iovec *iov,
+ int count)
+{
+struct fv_QueueInfo *qi;
+VuDev *dev =

[PATCH v2 3/5] virtiofsd: Create a notification queue

2019-12-04 Thread Vivek Goyal

Add a notification queue which will be used to send async notifications
for file lock availability.

Signed-off-by: Vivek Goyal 
---
 contrib/virtiofsd/fuse_i.h |  1 +
 contrib/virtiofsd/fuse_virtio.c| 74 +++---
 hw/virtio/vhost-user-fs-pci.c  |  2 +-
 hw/virtio/vhost-user-fs.c  | 37 +--
 include/hw/virtio/vhost-user-fs.h  |  1 +
 include/standard-headers/linux/virtio_fs.h |  3 +
 6 files changed, 87 insertions(+), 31 deletions(-)

diff --git a/contrib/virtiofsd/fuse_i.h b/contrib/virtiofsd/fuse_i.h
index 966b1a3baa..4eeae0bfeb 100644
--- a/contrib/virtiofsd/fuse_i.h
+++ b/contrib/virtiofsd/fuse_i.h
@@ -74,6 +74,7 @@ struct fuse_session {
char *vu_socket_lock;
struct fv_VuDev *virtio_dev;
int thread_pool_size;
+   bool notify_enabled;
 };
 
 struct fuse_chan {
diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index 2a9cd60a01..b1eebcf054 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/iov.h"
 #include "qapi/error.h"
+#include "standard-headers/linux/virtio_fs.h"
 #include "fuse_i.h"
 #include "fuse_kernel.h"
 #include "fuse_misc.h"
@@ -92,23 +93,31 @@ struct fv_VuDev {
  */
 size_t nqueues;
 struct fv_QueueInfo **qi;
-};
-
-/* From spec */
-struct virtio_fs_config {
-char tag[36];
-uint32_t num_queues;
+/* True if notification queue is being used */
+bool notify_enabled;
 };
 
 /* Callback from libvhost-user */
 static uint64_t fv_get_features(VuDev *dev)
 {
-return 1ULL << VIRTIO_F_VERSION_1;
+uint64_t features;
+
+features = 1ull << VIRTIO_F_VERSION_1 |
+   1ull << VIRTIO_FS_F_NOTIFICATION;
+
+return features;
 }
 
 /* Callback from libvhost-user */
 static void fv_set_features(VuDev *dev, uint64_t features)
 {
+struct fv_VuDev *vud = container_of(dev, struct fv_VuDev, dev);
+struct fuse_session *se = vud->se;
+
+if ((1ull << VIRTIO_FS_F_NOTIFICATION) & features) {
+vud->notify_enabled = true;
+se->notify_enabled = true;
+}
 }
 
 /*
@@ -765,6 +774,9 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
 {
 struct fv_VuDev *vud = container_of(dev, struct fv_VuDev, dev);
 struct fv_QueueInfo *ourqi;
+void * (*thread_func) (void *) = fv_queue_thread;
+int valid_queues = 2; /* One hiprio queue and one request queue */
+bool notification_q = false;
 
 fuse_log(FUSE_LOG_INFO, "%s: qidx=%d started=%d\n", __func__, qidx,
  started);
@@ -776,10 +788,18 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
  * well-behaved client in mind and may not protect against all types of
  * races yet.
  */
-if (qidx > 1) {
-fuse_log(FUSE_LOG_ERR,
- "%s: multiple request queues not yet implemented, please only "
- "configure 1 request queue\n",
+if (vud->notify_enabled) {
+valid_queues++;
+/*
+ * If notification queue is enabled, then qidx 1 is notification queue.
+ */
+if (qidx == 1)
+notification_q = true;
+}
+
+if (qidx >= valid_queues) {
+fuse_log(FUSE_LOG_ERR, "%s: multiple request queues not yet "
+ "implemented, please only configure 1 request queue\n",
  __func__);
 exit(EXIT_FAILURE);
 }
@@ -803,13 +823,19 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
 assert(vud->qi[qidx]->kick_fd == -1);
 }
 ourqi = vud->qi[qidx];
+pthread_mutex_init(&ourqi->vq_lock, NULL);
+/*
+ * For notification queue, we don't have to start a thread yet.
+ */
+if (notification_q)
+return;
+
 ourqi->kick_fd = dev->vq[qidx].kick_fd;
 
 ourqi->kill_fd = eventfd(0, EFD_CLOEXEC | EFD_SEMAPHORE);
 assert(ourqi->kill_fd != -1);
-pthread_mutex_init(&ourqi->vq_lock, NULL);
 
-if (pthread_create(&ourqi->thread, NULL, fv_queue_thread, ourqi)) {
+if (pthread_create(&ourqi->thread, NULL, thread_func, ourqi)) {
 fuse_log(FUSE_LOG_ERR, "%s: Failed to create thread for queue %d\n",
  __func__, qidx);
 assert(0);
@@ -819,17 +845,19 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
 assert(qidx < vud->nqueues);
 ourqi = vud->qi[qidx];
 
-/* Kill the thread */
-if (eventfd_write(ourqi->kill_fd, 1)) {
-fuse_log(FUSE_LOG_ERR, "Eventfd_read for queue: %m\n");
-}
-ret = pthread_join(ourqi->thread, NULL);
-if (ret) {
-fuse_log(FUSE_LOG_ERR, "%s: Failed to join thread idx %d err %d\n",
- __func__, qidx, ret);
+if (!notification_q) {
+/* Kill the thread */
+if
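The queue-index check in this patch can be read on its own as the following minimal sketch. It is not the daemon's code: every `sketch_`/`SKETCH_` name is made up for illustration, and the bit number used for the notification feature is a placeholder, because the real VIRTIO_FS_F_NOTIFICATION value lives in virtio_fs.h, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VIRTIO_F_VERSION_1 is feature bit 32 in the virtio specification. */
#define SKETCH_VIRTIO_F_VERSION_1 32
/*
 * Placeholder bit number (assumption): the real VIRTIO_FS_F_NOTIFICATION
 * value is defined in virtio_fs.h and is not visible in this excerpt.
 */
#define SKETCH_VIRTIO_FS_F_NOTIFICATION 0

/* Feature bits the daemon offers, mirroring fv_get_features(). */
static uint64_t sketch_offered_features(void)
{
    return (1ull << SKETCH_VIRTIO_F_VERSION_1) |
           (1ull << SKETCH_VIRTIO_FS_F_NOTIFICATION);
}

/* True if the acked feature set includes the notification queue. */
static bool sketch_notify_negotiated(uint64_t acked)
{
    return (acked & (1ull << SKETCH_VIRTIO_FS_F_NOTIFICATION)) != 0;
}

/* One hiprio queue and one request queue, plus the notification queue. */
static int sketch_valid_queues(bool notify_enabled)
{
    return notify_enabled ? 3 : 2;
}

/* With notifications enabled, queue index 1 is the notification queue. */
static bool sketch_is_notification_queue(int qidx, bool notify_enabled)
{
    return notify_enabled && qidx == 1;
}

/* Mirrors the "qidx >= valid_queues" rejection in the patch. */
static bool sketch_qidx_ok(int qidx, bool notify_enabled)
{
    return qidx >= 0 && qidx < sketch_valid_queues(notify_enabled);
}
```

With the feature negotiated, index 2 becomes the single request queue; without it, index 2 is rejected as a second request queue.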

[PATCH v2 1/5] virtiofsd: Get rid of unused fields in fv_QueueInfo

2019-12-04 Thread Vivek Goyal

There are some unused fields in "struct fv_QueueInfo". Get rid of these fields.

Signed-off-by: Vivek Goyal 
---
 contrib/virtiofsd/fuse_virtio.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index 31c8542b6c..2a9cd60a01 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -50,12 +50,6 @@ struct fv_QueueInfo {
 int qidx;
 int kick_fd;
 int kill_fd; /* For killing the thread */
-
-/* The element for the command currently being processed */
-VuVirtqElement *qe;
-/* If any of the qe vec elements (towards vmm) are unmappable */
-unsigned int elem_bad_in;
-bool reply_sent;
 };
 
 /* A FUSE request */
-- 
2.20.1

[PATCH v2 0/5] [RFC] virtiofsd, vhost-user-fs: Add support for notification queue

2019-12-04 Thread Vivek Goyal

Hi,

Here is V2 of RFC patches for adding a notification queue to
vhost-user-fs device to send notifications from host to guest.
It also has patches to support remote posix locks which make use of this
newly introduced notification queue.

I have taken care of most of the comments from the last iteration. One
major TODO item remains: being able to interrupt/stop threads blocked on
locks when the guest reboots.

Patches are also available here.

https://github.com/rhvgoyal/qemu/commits/blocking-locks-v2

Associated kernel changes are available here.

https://github.com/rhvgoyal/linux/commits/blocking-locks-v2

Thanks
Vivek

Vivek Goyal (5):
  virtiofsd: Get rid of unused fields in fv_QueueInfo
  virtiofsd: Release file locks using F_UNLCK
  virtiofd: Create a notification queue
  virtiofsd: Specify size of notification buffer using config space
  virtiofsd: Implement blocking posix locks

 contrib/virtiofsd/fuse_i.h |   1 +
 contrib/virtiofsd/fuse_kernel.h|   7 +
 contrib/virtiofsd/fuse_lowlevel.c  |  23 ++-
 contrib/virtiofsd/fuse_lowlevel.h  |  25 +++
 contrib/virtiofsd/fuse_virtio.c| 208 +
 contrib/virtiofsd/passthrough_ll.c |  80 ++--
 hw/virtio/vhost-user-fs-pci.c  |   2 +-
 hw/virtio/vhost-user-fs.c  |  63 ++-
 include/hw/virtio/vhost-user-fs.h  |   3 +
 include/standard-headers/linux/virtio_fs.h |   5 +
 10 files changed, 354 insertions(+), 63 deletions(-)

-- 
2.20.1

[PATCH v2 4/5] virtiofsd: Specify size of notification buffer using config space

2019-12-04 Thread Vivek Goyal

The daemon specifies the size of the notification buffer it needs, and it
does so through the config space.

Only the ->notify_buf_size value of the config space comes from the daemon.
The rest of it is filled in by the qemu device emulation code.

Signed-off-by: Vivek Goyal 
---
 contrib/virtiofsd/fuse_virtio.c| 31 ++
 hw/virtio/vhost-user-fs.c  | 26 ++
 include/hw/virtio/vhost-user-fs.h  |  2 ++
 include/standard-headers/linux/virtio_fs.h |  2 ++
 4 files changed, 61 insertions(+)

diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index b1eebcf054..94cf9b3791 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -869,6 +869,35 @@ static bool fv_queue_order(VuDev *dev, int qidx)
 return false;
 }
 
+static uint64_t fv_get_protocol_features(VuDev *dev)
+{
+   return 1ull << VHOST_USER_PROTOCOL_F_CONFIG;
+}
+
+static int fv_get_config(VuDev *dev, uint8_t *config, uint32_t len)
+{
+   struct virtio_fs_config fscfg = {};
+   unsigned notify_size, roundto = 64;
+   union fuse_notify_union {
+   struct fuse_notify_poll_wakeup_out  wakeup_out;
+   struct fuse_notify_inval_inode_out  inode_out;
+   struct fuse_notify_inval_entry_out  entry_out;
+   struct fuse_notify_delete_out   delete_out;
+   struct fuse_notify_store_outstore_out;
+   struct fuse_notify_retrieve_out retrieve_out;
+   };
+
+   notify_size = sizeof(struct fuse_out_header) +
+ sizeof(union fuse_notify_union);
+   notify_size = ((notify_size + roundto)/roundto) * roundto;
+
+   fscfg.notify_buf_size = notify_size;
+   memcpy(config, &fscfg, len);
+   fuse_log(FUSE_LOG_DEBUG, "%s:Setting notify_buf_size=%d\n", __func__,
+ fscfg.notify_buf_size);
+   return 0;
+}
+
 static const VuDevIface fv_iface = {
 .get_features = fv_get_features,
 .set_features = fv_set_features,
@@ -877,6 +906,8 @@ static const VuDevIface fv_iface = {
 .queue_set_started = fv_queue_set_started,
 
 .queue_is_processed_in_order = fv_queue_order,
+.get_protocol_features = fv_get_protocol_features,
+.get_config = fv_get_config,
 };
 
 /*
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index fe9dbe..5a6d244b98 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -277,16 +277,40 @@ uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, VhostUserFSSlaveMsg *sm,
 return (uint64_t)done;
 }
 
+static int vhost_user_fs_handle_config_change(struct vhost_dev *dev)
+{
+return 0;
+}
+
+const VhostDevConfigOps fs_ops = {
+.vhost_dev_config_notifier = vhost_user_fs_handle_config_change,
+};
 
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
 {
 VHostUserFS *fs = VHOST_USER_FS(vdev);
 struct virtio_fs_config fscfg = {};
+int ret;
+
+/*
+ * As of now we only get notification buffer size from device. And that's
+ * needed only if notification queue is enabled.
+ */
+if (fs->notify_enabled) {
+ret = vhost_dev_get_config(&fs->vhost_dev, (uint8_t *)&fs->fscfg,
+   sizeof(struct virtio_fs_config));
+if (ret < 0) {
+error_report("vhost-user-fs: get device config space failed."
+ " ret=%d\n", ret);
+return;
+}
+}
 
 memcpy((char *)fscfg.tag, fs->conf.tag,
MIN(strlen(fs->conf.tag) + 1, sizeof(fscfg.tag)));
 
 virtio_stl_p(vdev, &fscfg.num_request_queues, fs->conf.num_request_queues);
+virtio_stl_p(vdev, &fscfg.notify_buf_size, fs->fscfg.notify_buf_size);
 
 memcpy(config, &fscfg, sizeof(fscfg));
 }
@@ -545,6 +569,8 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
 fs->vhost_dev.nvqs = 2 + fs->conf.num_request_queues;
 
 fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, fs->vhost_dev.nvqs);
+
+vhost_dev_set_config_notifier(&fs->vhost_dev, &fs_ops);
 ret = vhost_dev_init(&fs->vhost_dev, &fs->vhost_user,
  VHOST_BACKEND_TYPE_USER, 0);
 if (ret < 0) {
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index bd47e0da98..f667cc4b5a 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -14,6 +14,7 @@
 #ifndef _QEMU_VHOST_USER_FS_H
 #define _QEMU_VHOST_USER_FS_H
 
+#include "standard-headers/linux/virtio_fs.h"
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-user.h"
@@ -58,6 +59,7 @@ typedef struct {
 struct vhost_virtqueue *vhost_vqs;
 struct vhost_dev vhost_dev;
 VhostUserState vhost_user;
+struct virtio_fs_config fscfg;
 
 /*< public >*/
 MemoryRegion cache;
diff --git a/include/standard-headers/linux/virtio_fs.h b/include/standard-headers/linux/virtio_fs.h
index 9ee95f584f..719216a262 100644
---
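A note on the rounding in fv_get_config() above: the expression ((notify_size + roundto)/roundto) * roundto always advances to the next multiple, so a size that is already a multiple of 64 grows by another 64. Whether that extra slack is intended is not stated in the patch. The sketch below compares it with the conventional round-up form; both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Formula as written in the patch: always moves to the next multiple. */
static size_t round_as_in_patch(size_t n, size_t roundto)
{
    assert(roundto != 0);
    return ((n + roundto) / roundto) * roundto;
}

/* Conventional round-up: values that are already aligned stay unchanged. */
static size_t round_up_exact(size_t n, size_t roundto)
{
    assert(roundto != 0);
    return ((n + roundto - 1) / roundto) * roundto;
}
```

For an unaligned size such as 100 both give 128. They differ only for aligned input, where the patch formula returns 192 for 128.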

[PATCH v2 2/5] virtiofsd: Release file locks using F_UNLCK

2019-12-04 Thread Vivek Goyal

We emulate POSIX locks for the guest using open file description (OFD)
locks in virtiofsd. When an fd is closed in the guest, we find the associated
OFD lock fd (if there is one) and close it to release all the locks.

Assumption here is that there is no other thread using lo_inode_plock
structure or plock->fd, hence it is safe to do so.

But we are about to introduce a blocking variant of locks (SETLKW),
which means a thread might be waiting for a lock to become available while
using plock->fd. So there may still be users of the plock structure.

So release locks using fcntl(SETLK, F_UNLCK) instead and plock will
be freed later.

Signed-off-by: Vivek Goyal 
---
 contrib/virtiofsd/passthrough_ll.c | 31 --
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
index bc214df0c7..6aa56882e8 100644
--- a/contrib/virtiofsd/passthrough_ll.c
+++ b/contrib/virtiofsd/passthrough_ll.c
@@ -936,6 +936,14 @@ static void put_shared(struct lo_data *lo, struct lo_inode *inode)
}
 }
 
+static void posix_locks_value_destroy(gpointer data)
+{
+   struct lo_inode_plock *plock = data;
+
+   close(plock->fd);
+   free(plock);
+}
+
 /* Increments nlookup and caller must release refcount using
  * lo_inode_put().
  */
@@ -994,7 +1002,9 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
inode->key.ino = e->attr.st_ino;
inode->key.dev = e->attr.st_dev;
pthread_mutex_init(&inode->plock_mutex, NULL);
-   inode->posix_locks = g_hash_table_new(g_direct_hash, g_direct_equal);
+   inode->posix_locks = g_hash_table_new_full(g_direct_hash,
+   g_direct_equal, NULL,
+   posix_locks_value_destroy);
 
get_shared(lo, inode);
 
@@ -1436,9 +1446,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
if (!inode->nlookup) {
lo_map_remove(&lo->ino_map, inode->fuse_ino);
 g_hash_table_remove(lo->inodes, &inode->key);
-   if (g_hash_table_size(inode->posix_locks)) {
-   fuse_log(FUSE_LOG_WARNING, "Hash table is not empty\n");
-   }
g_hash_table_destroy(inode->posix_locks);
pthread_mutex_destroy(&inode->plock_mutex);
 
@@ -1868,6 +1875,7 @@ static struct lo_inode_plock *lookup_create_plock_ctx(struct lo_data *lo,
plock->fd = fd;
g_hash_table_insert(inode->posix_locks,
GUINT_TO_POINTER(plock->lock_owner), plock);
+   fuse_log(FUSE_LOG_DEBUG, "lookup_create_plock_ctx(): Inserted element in posix_locks hash table with value pointer %p\n", plock);
return plock;
 }
 
@@ -2046,6 +2054,7 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
(void) ino;
struct lo_inode *inode;
struct lo_inode_plock *plock;
+   struct flock flock;
 
inode = lo_inode(req, ino);
if (!inode) {
@@ -2058,14 +2067,16 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
plock = g_hash_table_lookup(inode->posix_locks,
GUINT_TO_POINTER(fi->lock_owner));
if (plock) {
-   g_hash_table_remove(inode->posix_locks,
-   GUINT_TO_POINTER(fi->lock_owner));
/*
-* We had used open() for locks and had only one fd. So
-* closing this fd should release all OFD locks.
+* An fd is being closed. For posix locks, this means
+* drop all the associated locks.
 */
-   close(plock->fd);
-   free(plock);
+   memset(&flock, 0, sizeof(struct flock));
+   flock.l_type = F_UNLCK;
+   flock.l_whence = SEEK_SET;
+   /* Unlock whole file */
+   flock.l_start = flock.l_len = 0;
+   fcntl(plock->fd, F_OFD_SETLK, &flock);
}
pthread_mutex_unlock(&inode->plock_mutex);
 
-- 
2.20.1
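For readers unfamiliar with OFD locks, the release path used in lo_flush() above can be tried in isolation with the minimal, Linux-only sketch below. The helper names are made up for illustration and are not part of virtiofsd. The fallback F_OFD_* values are the Linux asm-generic ones and are only used if the libc headers did not define them, so treat them as an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Fallback values (assumption: Linux asm-generic/fcntl.h numbering). */
#ifndef F_OFD_GETLK
#define F_OFD_GETLK 36
#define F_OFD_SETLK 37
#define F_OFD_SETLKW 38
#endif

/* Apply a whole-file OFD lock operation of the given type to fd. */
static int ofd_set_whole_file(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl)); /* l_pid must be 0 for OFD commands */
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* Take a non-blocking write lock on the whole file. */
static int ofd_lock_whole_file(int fd)
{
    return ofd_set_whole_file(fd, F_WRLCK);
}

/* Drop every lock held through this open file description, keep fd open. */
static int ofd_unlock_whole_file(int fd)
{
    return ofd_set_whole_file(fd, F_UNLCK);
}

/* Returns 1 if a write lock via fd would conflict, 0 if not, -1 on error. */
static int ofd_would_conflict(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_OFD_GETLK, &fl) < 0) {
        return -1;
    }
    return fl.l_type != F_UNLCK;
}
```

The point of unlocking rather than closing is that the fd stays valid, so another thread that is still using the same descriptor is not left with a closed or reused fd.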

Re: [PATCH v4 26/40] target/arm: Update define_one_arm_cp_reg_with_opaque for VHE

2019-12-04 Thread Alex Bennée



Richard Henderson  writes:

> For ARMv8.1, op1 == 5 is reserved for EL2 aliases of
> EL1 and EL0 registers.
>
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/helper.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index 023b8963cf..1812588fa1 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -7437,13 +7437,10 @@ void define_one_arm_cp_reg_with_opaque(ARMCPU *cpu,
>  mask = PL0_RW;
>  break;
>  case 4:
> +case 5:
>  /* min_EL EL2 */
>  mask = PL2_RW;
>  break;
> -case 5:
> -/* unallocated encoding, so not possible */
> -assert(false);
> -break;

This change is fine - I don't think we should have asserted here anyway.
But don't we generate an unallocated exception if the CPU is v8.0?


>  case 6:
>  /* min_EL EL3 */
>  mask = PL3_RW;


-- 
Alex Bennée

Re: qom device lifecycle interaction with hotplug/hotunplug ?

2019-12-04 Thread Eduardo Habkost

On Wed, Dec 04, 2019 at 05:21:25PM +0100, Jens Freimann wrote:
> On Wed, Dec 04, 2019 at 11:35:37AM -0300, Eduardo Habkost wrote:
> > On Wed, Dec 04, 2019 at 10:18:24AM +0100, Jens Freimann wrote:
> > > On Tue, Dec 03, 2019 at 06:40:04PM -0300, Eduardo Habkost wrote:
> > > > +jfreimann, +mst
> > > >
> > > > On Sat, Nov 30, 2019 at 11:10:19AM +, Peter Maydell wrote:
> > > > > > On Fri, 29 Nov 2019 at 20:05, Eduardo Habkost  wrote:
> > > > > > So, to summarize the current issues:
> > > > > >
> > > > > > 1) realize triggers a plug operation implicitly.
> > > > > > 2) unplug triggers unrealize implicitly.
> > > > > >
> > > > > > Do you expect to see use cases that will require us to implement
> > > > > > realize-without-plug?
> > > > >
> > > > > I don't think so, but only because of the oddity that
> > > > > we put lots of devices on the 'sysbus' and claim that
> > > > > that's plugging them into the bus. The common case of
> > > > > 'realize' is where one device (say an SoC) has a bunch of child
> > > > > devices (like UARTs); the SoC's realize method realizes its child
> > > > > devices. Those devices all end up plugged into the 'sysbus'
> > > > > but there's no actual bus there, it's fictional and about
> > > > > the only thing it matters for is reset propagation (which
> > > > > we don't model right either). A few devices don't live on
> > > > > buses at all.
> > > >
> > > > That's my impression as well.
> > > >
> > > > >
> > > > > > Similarly, do you expect use cases that will require us to
> > > > > > implement unplug-without-unrealize?
> > > > >
> > > > > I don't know enough about hotplug to answer this one:
> > > > > it's essentially what I'm hoping you'd be able to answer.
> > > > > I vaguely had in mind that eg the user might be able to
> > > > > create a 'disk' object, plug it into a SCSI bus, then
> > > > > unplug it from the bus without the disk and all its data
> > > > > evaporating, and maybe plug it back into the SCSI
> > > > > bus (or some other SCSI bus) later ? But I don't know
> > > > > anything about how we expose that kind of thing to the
> > > > > user via QMP/HMP.
> > > >
> > > > This ability isn't exposed to the user at all.  Our existing
> > > > interfaces are -device, device_add and device_del.
> > > >
> > > > We do have something new that sounds suspiciously similar to
> > > > "unplugged but not unrealized", though: the new hidden device
> > > > API, added by commit f3a850565693 ("qdev/qbus: add hidden device
> > > > support").
> > > >
> > > > Jens, Michael, what exactly is the difference between a "hidden"
> > > > device and a "unplugged" device?
> > > 
> > > "hidden" the way we use it for virtio-net failover is actually unplugged.
> > > But it doesn't have to be that way. You can register a function that decides
> > > if the device should be hidden, i.e. plugged now, or do something else
> > > with it (in the virtio-net failover case we just save everything we
> > > need to plug the device later).
> > > 
> > > We did introduce a "unplugged but not unrealized" function too as part
> > > of the failover feature. See "a99c4da9fc pci: mark devices partially
> > > unplugged"
> > > 
> > > This was needed so we would be able to re-plug the device in case a
> > > migration failed and we need to hotplug the primary device back to the
> > > guest. To avoid the risk of not getting the resources the device needs
> > > we don't unrealize but just trigger the unplug from the guest OS.
> > 
> > Thanks for the explanation.  Let me confirm if I understand the
> > purpose of the new mechanisms: should_be_hidden is a mechanism
> > for implementing realize-without-plug.  partially_hotplugged is a
> > mechanism for implementing unplug-without-unrealize.  Is that
> > correct?
> 
> should_be_hidden is a mechanism for implementing
> realize-without-plug: kind of. It's a mechanism that ensures
> qdev_device_add() returns early as long as the condition to hide the
> device is true. You could do the realize-without-plug in the handler
> function that decides if the device should be "hidden".

Oh, right.  I thought "qdev_device_add() returns early" meant
"return after realize, before plug".  Now I see it returns before
object_new().  This means we have another user-visible device
state: "defined (in QemuOpts), but not created".

> 
> partially_hotplugged is a mechanism for implementing
> unplug-without-unrealize: yes.

Thanks!

-- 
Eduardo

Re: [PATCH v4 25/40] target/arm: Update timer access for VHE

2019-12-04 Thread Alex Bennée



Richard Henderson  writes:

> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

> ---
>  target/arm/helper.c | 102 +++-
>  1 file changed, 81 insertions(+), 21 deletions(-)
>
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index a4a7f82661..023b8963cf 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -2287,10 +2287,18 @@ static CPAccessResult gt_cntfrq_access(CPUARMState *env, const ARMCPRegInfo *ri,
>   * Writable only at the highest implemented exception level.
>   */
>  int el = arm_current_el(env);
> +uint64_t hcr;
> +uint32_t cntkctl;
>  
>  switch (el) {
>  case 0:
> -if (!extract32(env->cp15.c14_cntkctl, 0, 2)) {
> +hcr = arm_hcr_el2_eff(env);
> +if ((hcr & (HCR_E2H | HCR_TGE)) == (HCR_E2H | HCR_TGE)) {
> +cntkctl = env->cp15.cnthctl_el2;
> +} else {
> +cntkctl = env->cp15.c14_cntkctl;
> +}
> +if (!extract32(cntkctl, 0, 2)) {
>  return CP_ACCESS_TRAP;
>  }
>  break;
> @@ -2318,17 +2326,47 @@ static CPAccessResult gt_counter_access(CPUARMState *env, int timeridx,
>  {
>  unsigned int cur_el = arm_current_el(env);
>  bool secure = arm_is_secure(env);
> +uint64_t hcr = arm_hcr_el2_eff(env);
>  
> -/* CNT[PV]CT: not visible from PL0 if ELO[PV]CTEN is zero */
> -if (cur_el == 0 &&
> -!extract32(env->cp15.c14_cntkctl, timeridx, 1)) {
> -return CP_ACCESS_TRAP;
> -}
> +switch (cur_el) {
> +case 0:
> +/* If HCR_EL2.<E2H,TGE> == '11': check CNTHCTL_EL2.EL0[PV]CTEN. */
> +if ((hcr & (HCR_E2H | HCR_TGE)) == (HCR_E2H | HCR_TGE)) {
> +return (extract32(env->cp15.cnthctl_el2, timeridx, 1)
> +? CP_ACCESS_OK : CP_ACCESS_TRAP_EL2);
> +}
>  
> -if (arm_feature(env, ARM_FEATURE_EL2) &&
> -timeridx == GTIMER_PHYS && !secure && cur_el < 2 &&
> -!extract32(env->cp15.cnthctl_el2, 0, 1)) {
> -return CP_ACCESS_TRAP_EL2;
> +/* CNT[PV]CT: not visible from PL0 if EL0[PV]CTEN is zero */
> +if (!extract32(env->cp15.c14_cntkctl, timeridx, 1)) {
> +return CP_ACCESS_TRAP;
> +}
> +
> +/* If HCR_EL2.<E2H,TGE> == '10': check CNTHCTL_EL2.EL1PCTEN. */
> +if (hcr & HCR_E2H) {
> +if (timeridx == GTIMER_PHYS &&
> +!extract32(env->cp15.cnthctl_el2, 10, 1)) {
> +return CP_ACCESS_TRAP_EL2;
> +}
> +} else {
> +/* If HCR_EL2.<E2H> == 0: check CNTHCTL_EL2.EL1PCEN. */
> +if (arm_feature(env, ARM_FEATURE_EL2) &&
> +timeridx == GTIMER_PHYS && !secure &&
> +!extract32(env->cp15.cnthctl_el2, 1, 1)) {
> +return CP_ACCESS_TRAP_EL2;
> +}
> +}
> +break;
> +
> +case 1:
> +/* Check CNTHCTL_EL2.EL1PCTEN, which changes location based on E2H. */
> +if (arm_feature(env, ARM_FEATURE_EL2) &&
> +timeridx == GTIMER_PHYS && !secure &&
> +(hcr & HCR_E2H
> + ? !extract32(env->cp15.cnthctl_el2, 10, 1)
> + : !extract32(env->cp15.cnthctl_el2, 0, 1))) {
> +return CP_ACCESS_TRAP_EL2;
> +}
> +break;
>  }
>  return CP_ACCESS_OK;
>  }
> @@ -2338,19 +2376,41 @@ static CPAccessResult gt_timer_access(CPUARMState *env, int timeridx,
>  {
>  unsigned int cur_el = arm_current_el(env);
>  bool secure = arm_is_secure(env);
> +uint64_t hcr = arm_hcr_el2_eff(env);
>  
> -/* CNT[PV]_CVAL, CNT[PV]_CTL, CNT[PV]_TVAL: not visible from PL0 if
> - * EL0[PV]TEN is zero.
> - */
> -if (cur_el == 0 &&
> -!extract32(env->cp15.c14_cntkctl, 9 - timeridx, 1)) {
> -return CP_ACCESS_TRAP;
> -}
> +switch (cur_el) {
> +case 0:
> +if ((hcr & (HCR_E2H | HCR_TGE)) == (HCR_E2H | HCR_TGE)) {
> +/* If HCR_EL2.<E2H,TGE> == '11': check CNTHCTL_EL2.EL0[PV]TEN. */
> +return (extract32(env->cp15.cnthctl_el2, 9 - timeridx, 1)
> +? CP_ACCESS_OK : CP_ACCESS_TRAP_EL2);
> +}
>  
> -if (arm_feature(env, ARM_FEATURE_EL2) &&
> -timeridx == GTIMER_PHYS && !secure && cur_el < 2 &&
> -!extract32(env->cp15.cnthctl_el2, 1, 1)) {
> -return CP_ACCESS_TRAP_EL2;
> +/*
> + * CNT[PV]_CVAL, CNT[PV]_CTL, CNT[PV]_TVAL: not visible from
> + * EL0 if EL0[PV]TEN is zero.
> + */
> +if (!extract32(env->cp15.c14_cntkctl, 9 - timeridx, 1)) {
> +return CP_ACCESS_TRAP;
> +}
> +/* fall through */
> +
> +case 1:
> +if (arm_feature(env, ARM_FEATURE_EL2) &&
> +timeridx == GTIMER_PHYS && !secure) {
> +if (hcr & HCR_E2H) {
> +/* If HCR_EL2.<E2H,TGE> == '10': check CNTHCTL_EL2.EL1PTEN. */
> +if (!extract32(env->cp15.cnthctl_el2, 11, 1)) {
> +
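The EL0 register selection in the gt_cntfrq_access() hunk quoted above can be read as the following standalone sketch. It is an illustration, not QEMU code: all `sk_`/`SK_` names are made up, and the HCR_EL2 bit positions (TGE = bit 27, E2H = bit 34) are taken from the Arm architecture rather than from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HCR_EL2 bit positions per the Arm architecture (not shown in the patch). */
#define SK_HCR_TGE (1ULL << 27)
#define SK_HCR_E2H (1ULL << 34)

/* Extract 'length' bits (1..32) of 'value' starting at bit 'start'. */
static uint32_t sk_extract32(uint32_t value, int start, int length)
{
    return (value >> start) & (~0U >> (32 - length));
}

/*
 * At EL0, pick the control register that gates timer access:
 * CNTHCTL_EL2 when E2H and TGE are both set, CNTKCTL_EL1 otherwise.
 */
static uint32_t sk_el0_timer_ctl(uint64_t hcr, uint32_t cnthctl_el2,
                                 uint32_t cntkctl_el1)
{
    const uint64_t both = SK_HCR_E2H | SK_HCR_TGE;

    return (hcr & both) == both ? cnthctl_el2 : cntkctl_el1;
}

/* An EL0 CNTFRQ access traps when bits [1:0] of that register are clear. */
static bool sk_el0_cntfrq_traps(uint64_t hcr, uint32_t cnthctl_el2,
                                uint32_t cntkctl_el1)
{
    uint32_t ctl = sk_el0_timer_ctl(hcr, cnthctl_el2, cntkctl_el1);

    return sk_extract32(ctl, 0, 2) == 0;
}
```

Setting only E2H is not enough to switch registers; the EL1 control still applies unless TGE is set as well.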

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Alex Williamson

On Wed, 4 Dec 2019 23:40:25 +0530
Kirti Wankhede  wrote:

> On 12/3/2019 11:34 PM, Alex Williamson wrote:
> > On Mon, 25 Nov 2019 19:57:39 -0500
> > Yan Zhao  wrote:
> >   
> >> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:  
> >>> On Fri, 15 Nov 2019 00:26:07 +0530
> >>> Kirti Wankhede  wrote:
> >>>  
>  On 11/14/2019 1:37 AM, Alex Williamson wrote:  
> > On Thu, 14 Nov 2019 01:07:21 +0530
> > Kirti Wankhede  wrote:
> >
> >> On 11/13/2019 4:00 AM, Alex Williamson wrote:  
> >>> On Tue, 12 Nov 2019 22:33:37 +0530
> >>> Kirti Wankhede  wrote:
> >>>   
>  All pages pinned by vendor driver through vfio_pin_pages API should be
>  considered as dirty during migration. IOMMU container maintains a list of
>  all such pinned pages. Added an ioctl defination to get bitmap of such  
> >>>
> >>> definition
> >>>   
>  pinned pages for requested IO virtual address range.  
> >>>
> >>> Additionally, all mapped pages are considered dirty when physically
> >>> mapped through to an IOMMU, modulo we discussed devices opting in to
> >>> per page pinning to indicate finer granularity with a TBD mechanism to
> >>> figure out if any non-opt-in devices remain.
> >>>   
> >>
> >> You mean, in case of device direct assignment (device pass through)?  
> >
> > Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
> > pinned and mapped, then the correct dirty page set is all mapped pages.
> > We discussed using the vpfn list as a mechanism for vendor drivers to
> > reduce their migration footprint, but we also discussed that we would
> > need a way to determine that all participants in the container have
> > explicitly pinned their working pages or else we must consider the
> > entire potential working set as dirty.
> >
> 
>  How can vendor driver tell this capability to iommu module? Any suggestions?  
> >>>
> >>> I think it does so by pinning pages.  Is it acceptable that if the
> >>> vendor driver pins any pages, then from that point forward we consider
> >>> the IOMMU group dirty page scope to be limited to pinned pages?  There  
> >> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
> >> which means ever since a page is pinned, it should be regarded as dirty
> >> no matter whether it's unpinned later. only after log_sync is called and
> >> dirty info retrieved, its dirty state should be cleared.  
> > 
> > Yes, good point.  We can't just remove a vpfn when a page is unpinned
> > or else we'd lose information that the page potentially had been
> > dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> > list and both the currently pinned vpfns and the dirty vpfns are walked
> > on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> > The container would need to know that dirty tracking is enabled and
> > only manage the dirty vpfns list when necessary.  Thanks,
> >   
> 
> If page is unpinned, then that page is available in free page pool for 
> others to use, then how can we say that unpinned page has valid data?
> 
> If suppose, one driver A unpins a page and when driver B of some other 
> device gets that page and he pins it, uses it, and then unpins it, then 
> how can we say that page has valid data for driver A?
> 
> Can you give one example where unpinned page data is considered reliable 
> and valid?

We can only pin pages that the user has already allocated* and mapped
through the vfio DMA API.  The pinning of the page simply locks the
page for the vendor driver to access it and unpinning that page only
indicates that access is complete.  Pages are not freed when a vendor
driver unpins them; they still exist, and at this point we're
assuming the device dirtied the page while it was pinned.  Thanks,

Alex

* An exception here is that the page might be demand allocated and the
  act of pinning the page could actually allocate the backing page for
  the user if they have not faulted the page to trigger that allocation
  previously.  That page remains mapped for the user's virtual address
  space even after the unpinning though.
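The bookkeeping proposed earlier in this thread (currently pinned vpfns plus a list of vpfns unpinned since the last log_sync) can be sketched in userspace C as below. This is a toy model and not the vfio type1 implementation: the struct, the function names, and the fixed 64-page range are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NPAGES 64 /* toy address range, hypothetical */

struct dirty_tracker {
    bool tracking;              /* dirty logging currently enabled */
    bool pinned[SKETCH_NPAGES]; /* pages pinned right now */
    bool dirty[SKETCH_NPAGES];  /* pages unpinned since the last sync */
};

static void tracker_init(struct dirty_tracker *t, bool tracking)
{
    memset(t, 0, sizeof(*t));
    t->tracking = tracking;
}

static void tracker_pin(struct dirty_tracker *t, size_t pfn)
{
    assert(pfn < SKETCH_NPAGES);
    t->pinned[pfn] = true;
}

/* Unpinning must not lose the fact that the page may have been written. */
static void tracker_unpin(struct dirty_tracker *t, size_t pfn)
{
    assert(pfn < SKETCH_NPAGES);
    t->pinned[pfn] = false;
    if (t->tracking) {
        t->dirty[pfn] = true;
    }
}

/*
 * Report pinned pages plus pages unpinned since the last sync, then clear
 * the unpinned-dirty set. Returns the number of pages reported dirty.
 */
static size_t tracker_log_sync(struct dirty_tracker *t,
                               bool out[SKETCH_NPAGES])
{
    size_t count = 0;

    for (size_t i = 0; i < SKETCH_NPAGES; i++) {
        out[i] = t->pinned[i] || t->dirty[i];
        count += out[i] ? 1 : 0;
        t->dirty[i] = false;
    }
    return count;
}
```

A page that was pinned and then unpinned shows up in exactly one more sync, while a page that stays pinned is reported on every sync.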

Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.

2019-12-04 Thread Kirti Wankhede





On 12/3/2019 11:34 PM, Alex Williamson wrote:

On Mon, 25 Nov 2019 19:57:39 -0500
Yan Zhao  wrote:


On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:

On Fri, 15 Nov 2019 00:26:07 +0530
Kirti Wankhede  wrote:
   

On 11/14/2019 1:37 AM, Alex Williamson wrote:

On Thu, 14 Nov 2019 01:07:21 +0530
Kirti Wankhede  wrote:
 

On 11/13/2019 4:00 AM, Alex Williamson wrote:

On Tue, 12 Nov 2019 22:33:37 +0530
Kirti Wankhede  wrote:


All pages pinned by vendor driver through vfio_pin_pages API should be
considered as dirty during migration. IOMMU container maintains a list of
all such pinned pages. Added an ioctl defination to get bitmap of such


definition


pinned pages for requested IO virtual address range.


Additionally, all mapped pages are considered dirty when physically
mapped through to an IOMMU, modulo we discussed devices opting in to
per page pinning to indicate finer granularity with a TBD mechanism to
figure out if any non-opt-in devices remain.



You mean, in case of device direct assignment (device pass through)?


Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
pinned and mapped, then the correct dirty page set is all mapped pages.
We discussed using the vpfn list as a mechanism for vendor drivers to
reduce their migration footprint, but we also discussed that we would
need a way to determine that all participants in the container have
explicitly pinned their working pages or else we must consider the
entire potential working set as dirty.
 


How can vendor driver tell this capability to iommu module? Any suggestions?


I think it does so by pinning pages.  Is it acceptable that if the
vendor driver pins any pages, then from that point forward we consider
the IOMMU group dirty page scope to be limited to pinned pages?  There

we should also be aware of that dirty page scope is pinned pages + unpinned pages,
which means ever since a page is pinned, it should be regarded as dirty
no matter whether it's unpinned later. only after log_sync is called and
dirty info retrieved, its dirty state should be cleared.


Yes, good point.  We can't just remove a vpfn when a page is unpinned
or else we'd lose information that the page potentially had been
dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
list and both the currently pinned vpfns and the dirty vpfns are walked
on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
The container would need to know that dirty tracking is enabled and
only manage the dirty vpfns list when necessary.  Thanks,



If page is unpinned, then that page is available in free page pool for 
others to use, then how can we say that unpinned page has valid data?


If suppose, one driver A unpins a page and when driver B of some other 
device gets that page and he pins it, uses it, and then unpins it, then 
how can we say that page has valid data for driver A?


Can you give one example where unpinned page data is considered reliable 
and valid?


Thanks,
Kirti


Alex
  

are complications around non-singleton IOMMU groups, but I think we're
already leaning towards that being a non-worthwhile problem to solve.
So if we require that only singleton IOMMU groups can pin pages and we
pass the IOMMU group as a parameter to
vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
flag on its local vfio_group struct to indicate dirty page scope is
limited to pinned pages.  We might want to keep a flag on the
vfio_iommu struct to indicate if all of the vfio_groups for each
vfio_domain in the vfio_iommu.domain_list have dirty page scope limited
to pinned pages, as an optimization to avoid walking lists too often.
Then, if vfio_iommu.domain_list is not empty and this new flag does not
limit the dirty page scope, everything within each vfio_dma is
considered dirty.


Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
include/uapi/linux/vfio.h | 23 +++
1 file changed, 23 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 35b09427ad9f..6fd3822aa610 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
#define VFIO_IOMMU_ENABLE   _IO(VFIO_TYPE, VFIO_BASE + 15)
#define VFIO_IOMMU_DISABLE  _IO(VFIO_TYPE, VFIO_BASE + 16)

+/**

+ * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ * struct vfio_iommu_type1_dirty_bitmap)
+ *
+ * IOCTL to get dirty pages bitmap for IOMMU container during migration.
+ * Get dirty pages bitmap of given IO virtual addresses range using
+ * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
+ * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
+ * bitmap and should set size of allocated memory in bitmap_size field.
+ * One bit is used to represent per page consecutively starting

[PATCH for-5.0 6/8] acpi: cpuhp: spec: add typical usecases

2019-12-04 Thread Igor Mammedov

Document work-flows for
  * finding a CPU with pending 'insert/remove' event
  * enumerating present and possible CPUs

Signed-off-by: Igor Mammedov 
---
 docs/specs/acpi_cpu_hotplug.txt | 29 +
 1 file changed, 29 insertions(+)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index f3c552d..58c16c6 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -64,6 +64,7 @@ write access:
 [0x0-0x3] CPU selector: (DWORD access)
   selects active CPU device. All following accesses to other
   registers will read/store data from/to selected CPU.
+  Valid values: [0 .. max_cpus)
 [0x4] CPU device control fields: (1 byte access)
 bits:
 0: reserved, OSPM must clear it before writing to register.
@@ -96,3 +97,31 @@ write access:
  ACPI_DEVICE_OST QMP event from QEMU to external applications
  with current values of OST event and status registers.
 other values: reserved
+
+Typical usecases:
+- Get a cpu with pending event
+  1. Store 0x0 to the 'CPU selector' register.
+  2. Store 0x0 to the 'Command field' register.
+  3. Read the 'CPU device status fields' register.
+  4. If both bit#1 and bit#2 are clear in the value read, there is no CPU
+ with a pending event and selected CPU remains unchanged.
+  5. Otherwise, read the 'Command data' register. The value read is the
+ selector of the CPU with the pending event (which is already
+ selected).
+
+- Enumerate present/non present CPUs
+  01. Set the present CPU count to 0.
+  02. Set the iterator to 0.
+  03. Store 0x0 to the 'CPU selector' register, to ensure that it's in
+  a valid state and that access to other registers won't be ignored.
+  04. Store 0x0 to the 'Command field' register to make 'Command data'
+  register return 'CPU selector' value of selected CPU
+  05. Read the 'CPU device status fields' register.
+  06. If bit#0 is set, increment the present CPU count.
+  07. Increment the iterator.
+  08. Store the iterator to the 'CPU selector' register.
+  09. Read the 'Command data' register.
+  10. If the value read is not zero, goto 05.
+  11. Otherwise store 0x0 to the 'CPU selector' register, to put it
+  into a valid state and exit.
+  The iterator at this point equals "max_cpus".
-- 
2.7.4
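
The enumeration work-flow from the patch above can be sketched against a tiny
software model of the register block. The cpuhp_mock structure and its
accessors are hypothetical stand-ins for port I/O, not QEMU code; the model
assumes no CPU has a pending event, so storing 0x0 to the 'Command field'
leaves the selector unchanged:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the hotplug register block (not QEMU code). */
struct cpuhp_mock {
    uint32_t    max_cpus; /* number of possible CPUs */
    const bool *present;  /* present[i]: CPU i is enabled */
    uint32_t    selector; /* last value stored to 'CPU selector' */
    uint8_t     command;  /* last value stored to 'Command field' */
};

static bool cpuhp_selector_valid(const struct cpuhp_mock *m)
{
    return m->selector < m->max_cpus;
}

static void cpuhp_write_selector(struct cpuhp_mock *m, uint32_t v)
{
    m->selector = v;
}

static void cpuhp_write_command(struct cpuhp_mock *m, uint8_t v)
{
    /* Writes to other registers are ignored while the selector is invalid. */
    if (cpuhp_selector_valid(m)) {
        m->command = v;
    }
}

static uint8_t cpuhp_read_status(const struct cpuhp_mock *m)
{
    /* bit#0: device is enabled; reads return 0 for an invalid selector. */
    return cpuhp_selector_valid(m) && m->present[m->selector] ? 0x1 : 0x0;
}

static uint32_t cpuhp_read_cmd_data(const struct cpuhp_mock *m)
{
    /* With 'Command field' == 0 this returns the 'CPU selector' value. */
    if (!cpuhp_selector_valid(m) || m->command != 0) {
        return 0;
    }
    return m->selector;
}

/* Steps 01..11 of the work-flow; returns the present CPU count and
 * stores the final iterator value (== max_cpus) in *max_cpus_out. */
static uint32_t cpuhp_enumerate(struct cpuhp_mock *m, uint32_t *max_cpus_out)
{
    uint32_t present = 0; /* step 01 */
    uint32_t iter = 0;    /* step 02 */

    assert(m->max_cpus > 0);
    cpuhp_write_selector(m, 0);              /* step 03 */
    cpuhp_write_command(m, 0);               /* step 04 */
    do {
        if (cpuhp_read_status(m) & 0x1) {    /* steps 05, 06 */
            present++;
        }
        iter++;                              /* step 07 */
        cpuhp_write_selector(m, iter);       /* step 08 */
    } while (cpuhp_read_cmd_data(m) != 0);   /* steps 09, 10 */
    cpuhp_write_selector(m, 0);              /* step 11 */

    *max_cpus_out = iter;
    return present;
}
```

The loop terminates because an out-of-range selector makes every read return
0, which is exactly the behavior patch 3/8 documents.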

[PATCH for-5.0 3/8] acpi: cpuhp: spec: clarify 'CPU selector' register usage and endianness

2019-12-04 Thread Igor Mammedov

* Move reserved registers to the top of the section, so reader would be
  aware of effects when reading registers description.
* State registers endianness explicitly at the beginning of the section
* Describe registers behavior in case of 'CPU selector' register contains
  value that doesn't point to a possible CPU.

Signed-off-by: Igor Mammedov 
---
 docs/specs/acpi_cpu_hotplug.txt | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index ee219c8..4e65286 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -30,6 +30,18 @@ Register block base address:
 Register block size:
 ACPI_CPU_HOTPLUG_REG_LEN = 12
 
+All accesses to registers described below, imply little-endian byte order.
+
+Reserved registers behavior:
+   - write accesses are ignored
+   - read accesses return all bits set to 0.
+
+The last stored value in 'CPU selector' must refer to a possible CPU, otherwise
+  - reads from any register return 0
+  - writes to any other register are ignored until valid value is stored into it
+On QEMU start, 'CPU selector' is initialized to a valid value, on reset it
+keeps the current value.
+
 read access:
 offset:
 [0x0-0x3] reserved
@@ -86,9 +98,3 @@ write access:
  ACPI_DEVICE_OST QMP event from QEMU to external applications
  with current values of OST event and status registers.
 other values: reserved
-
-Selecting CPU device beyond possible range has no effect on platform:
-   - write accesses to CPU hot-plug registers not documented above are
- ignored
-   - read accesses to CPU hot-plug registers not documented above return
- all bits set to 0.
-- 
2.7.4

[PATCH for-5.0 5/8] acpi: cpuhp: spec: clarify store into 'Command data' when 'Command field' == 0

2019-12-04 Thread Igor Mammedov

Write section of 'Command data' register should describe what happens
when it's written into. Correct description in case the last stored
'Command field' value equals to 0, to reflect that currently it's not
supported.

Signed-off-by: Igor Mammedov 
---
 docs/specs/acpi_cpu_hotplug.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index 19c508f..f3c552d 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -90,8 +90,7 @@ write access:
 other values: reserved
 [0x6-0x7] reserved
 [0x8] Command data: (DWORD access)
-  current 'Command field' value:
-  0: OSPM reads value of CPU selector
+  if last stored 'Command field' value:
   1: stores value into OST event register
   2: stores value into OST status register, triggers
  ACPI_DEVICE_OST QMP event from QEMU to external applications
-- 
2.7.4

[PATCH for-5.0 4/8] acpi: cpuhp: spec: fix 'Command data' description

2019-12-04 Thread Igor Mammedov

Correct the returned value description for the case 'Command field' == 0x0:
it's not PXM but the 'CPU selector' value of a CPU with a pending event.

In addition describe 0 blanket value in case of not supported
'Command field' value.

Signed-off-by: Igor Mammedov 
---
 docs/specs/acpi_cpu_hotplug.txt | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index 4e65286..19c508f 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -56,9 +56,8 @@ read access:
3-7: reserved and should be ignored by OSPM
 [0x5-0x7] reserved
 [0x8] Command data: (DWORD access)
-  in case of error or unsupported command reads is 0xFFFFFFFF
-  current 'Command field' value:
-  0: returns PXM value corresponding to device
+  contains 0 unless last stored in 'Command field' value is one of:
+  0: contains 'CPU selector' value of a CPU with pending event[s]
 
 write access:
 offset:
@@ -81,9 +80,9 @@ write access:
   value:
 0: selects a CPU device with inserting/removing events and
following reads from 'Command data' register return
-   selected CPU (CPU selector value). If no CPU with events
-   found, the current CPU selector doesn't change and
-   corresponding insert/remove event flags are not set.
+   selected CPU ('CPU selector' value).
+   If no CPU with events found, the current 'CPU selector' doesn't
+   change and corresponding insert/remove event flags are not set.
 1: following writes to 'Command data' register set OST event
register in QEMU
 2: following writes to 'Command data' register set OST status
-- 
2.7.4

[PATCH for-5.0 2/8] tests: q35: MCH: add default SMBASE SMRAM lock test

2019-12-04 Thread Igor Mammedov

test lockable SMRAM at default SMBASE feature, introduced by
patch "q35: implement 128K SMRAM at default SMBASE address"

Signed-off-by: Igor Mammedov 
---
 tests/q35-test.c | 105 +++
 1 file changed, 105 insertions(+)

diff --git a/tests/q35-test.c b/tests/q35-test.c
index a68183d..dd02660 100644
--- a/tests/q35-test.c
+++ b/tests/q35-test.c
@@ -186,6 +186,109 @@ static void test_tseg_size(const void *data)
 qtest_quit(qts);
 }
 
+#define SMBASE 0x30000
+#define SMRAM_TEST_PATTERN 0x32
+#define SMRAM_TEST_RESET_PATTERN 0x23
+
+static void test_smram_smbase_lock(void)
+{
+QPCIBus *pcibus;
+QPCIDevice *pcidev;
+QDict *response;
+QTestState *qts;
+int i;
+
+qts = qtest_init("-M q35");
+
+pcibus = qpci_new_pc(qts, NULL);
+g_assert(pcibus != NULL);
+
+pcidev = qpci_device_find(pcibus, 0);
+g_assert(pcidev != NULL);
+
+/* check that SMRAM is not enabled by default */
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0);
+qtest_writeb(qts, SMBASE, SMRAM_TEST_PATTERN);
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, SMRAM_TEST_PATTERN);
+
+/* check that writing junk to 0x9c before negotiating is ignored */
+for (i = 0; i < 0xff; i++) {
+qpci_config_writeb(pcidev, MCH_HOST_BRIDGE_F_SMBASE, i);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0);
+}
+
+/* enable SMRAM at SMBASE */
+qpci_config_writeb(pcidev, MCH_HOST_BRIDGE_F_SMBASE, 0xff);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0x01);
+/* lock SMRAM at SMBASE */
+qpci_config_writeb(pcidev, MCH_HOST_BRIDGE_F_SMBASE, 0x02);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0x02);
+
+/* check that SMRAM at SMBASE is locked and can't be unlocked */
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, 0xff);
+for (i = 0; i <= 0xff; i++) {
+/* make sure register is immutable */
+qpci_config_writeb(pcidev, MCH_HOST_BRIDGE_F_SMBASE, i);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0x02);
+
+/* RAM access should go into black hole */
+qtest_writeb(qts, SMBASE, SMRAM_TEST_PATTERN);
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, 0xff);
+}
+
+/* reset */
+response = qtest_qmp(qts, "{'execute': 'system_reset', 'arguments': {} }");
+g_assert(response);
+g_assert(!qdict_haskey(response, "error"));
+qobject_unref(response);
+
+/* check RAM at SMBASE is available after reset */
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, SMRAM_TEST_PATTERN);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == 0);
+qtest_writeb(qts, SMBASE, SMRAM_TEST_RESET_PATTERN);
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, SMRAM_TEST_RESET_PATTERN);
+
+g_free(pcidev);
+qpci_free_pc(pcibus);
+
+qtest_quit(qts);
+}
+
+static void test_without_smram_base(void)
+{
+QPCIBus *pcibus;
+QPCIDevice *pcidev;
+QTestState *qts;
+int i;
+
+qts = qtest_init("-M pc-q35-4.1");
+
+pcibus = qpci_new_pc(qts, NULL);
+g_assert(pcibus != NULL);
+
+pcidev = qpci_device_find(pcibus, 0);
+g_assert(pcidev != NULL);
+
+/* check that RAM accessible */
+qtest_writeb(qts, SMBASE, SMRAM_TEST_PATTERN);
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, SMRAM_TEST_PATTERN);
+
+/* check that writing to 0x9c succeeds */
+for (i = 0; i <= 0xff; i++) {
+qpci_config_writeb(pcidev, MCH_HOST_BRIDGE_F_SMBASE, i);
+g_assert(qpci_config_readb(pcidev, MCH_HOST_BRIDGE_F_SMBASE) == i);
+}
+
+/* check that RAM is still accessible */
+qtest_writeb(qts, SMBASE, SMRAM_TEST_PATTERN + 1);
+g_assert_cmpint(qtest_readb(qts, SMBASE), ==, (SMRAM_TEST_PATTERN + 1));
+
+g_free(pcidev);
+qpci_free_pc(pcibus);
+
+qtest_quit(qts);
+}
+
 int main(int argc, char **argv)
 {
 g_test_init(&argc, &argv, NULL);
@@ -197,5 +300,7 @@ int main(int argc, char **argv)
 qtest_add_data_func("/q35/tseg-size/8mb", &tseg_8mb, test_tseg_size);
 qtest_add_data_func("/q35/tseg-size/ext/16mb", &tseg_ext_16mb,
 test_tseg_size);
+qtest_add_func("/q35/smram/smbase_lock", test_smram_smbase_lock);
+qtest_add_func("/q35/smram/legacy_smbase", test_without_smram_base);
 return g_test_run();
 }
-- 
2.7.4

[PATCH for-5.0 1/8] q35: implement 128K SMRAM at default SMBASE address

2019-12-04 Thread Igor Mammedov

Use commit (2f295167e0 q35/mch: implement extended TSEG sizes) for
inspiration and (ab)use reserved register in config space at 0x9c
offset [*] to extend q35 pci-host with ability to use 128K at
0x30000 as SMRAM and hide it (like TSEG) from non-SMM context.

Usage:
  1: write 0xff in the register
  2: if the feature is supported, follow up read from the register
 should return 0x01. At this point RAM at 0x30000 is still
 available for SMI handler configuration from non-SMM context
  3: writing 0x02 in the register, locks SMBASE area, making its contents
 available only from SMM context. In non-SMM context, reads return
 0xff and writes are ignored. Further writes into the register are
 ignored until the system reset.

*) https://www.mail-archive.com/qemu-devel@nongnu.org/msg455991.html
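
The register protocol described above can be modelled as a small state
machine. This is a sketch only: smbase_reg, smbase_reset and smbase_write are
hypothetical names, and the write-mask handling is an assumption modelled on
the mch_update_smbase_smram() hunk in this patch rather than a copy of it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define F_SMBASE_QUERY  0xff /* guest writes this to probe for the feature */
#define F_SMBASE_IN_RAM 0x01 /* feature negotiated, RAM still accessible */
#define F_SMBASE_LCK    0x02 /* SMBASE area locked until system reset */

/* Hypothetical model of the config-space byte at offset 0x9c. */
struct smbase_reg {
    uint8_t value; /* what a config read returns */
    uint8_t wmask; /* which bits a config write may change */
};

static void smbase_reset(struct smbase_reg *r)
{
    assert(r != NULL);
    r->value = 0x00;
    r->wmask = 0xff;
}

static void smbase_write(struct smbase_reg *r, uint8_t val)
{
    assert(r != NULL);
    /* A config write only touches the bits allowed by the write mask. */
    r->value = (uint8_t)((r->value & ~r->wmask) | (val & r->wmask));

    if (r->value == F_SMBASE_QUERY) {
        /* Step 1/2: negotiation; from now on only the lock bit is writable. */
        r->wmask = F_SMBASE_LCK;
        r->value = F_SMBASE_IN_RAM;
        return;
    }
    if (r->wmask == 0xff) {
        /* Not negotiated yet: discard whatever was written. */
        r->value = 0x00;
        return;
    }
    if (r->value & F_SMBASE_LCK) {
        /* Step 3: lock; no further writes are accepted until reset. */
        r->wmask = 0x00;
        r->value = F_SMBASE_LCK;
    }
}
```

This mirrors what the test in patch 2/8 checks: junk writes read back as 0,
0xff reads back as 0x01, 0x02 locks, and the register is immutable afterwards.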

Signed-off-by: Igor Mammedov 
---
 include/hw/pci-host/q35.h | 10 ++
 hw/i386/pc.c  |  4 ++-
 hw/pci-host/q35.c | 80 ++-
 3 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index b3bcf2e..976fbae 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -32,6 +32,7 @@
 #include "hw/acpi/ich9.h"
 #include "hw/pci-host/pam.h"
 #include "hw/i386/intel_iommu.h"
+#include "qemu/units.h"
 
 #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
 #define Q35_HOST_DEVICE(obj) \
@@ -54,6 +55,8 @@ typedef struct MCHPCIState {
 MemoryRegion smram_region, open_high_smram;
 MemoryRegion smram, low_smram, high_smram;
 MemoryRegion tseg_blackhole, tseg_window;
+MemoryRegion smbase_blackhole, smbase_window;
+bool has_smram_at_smbase;
 Range pci_hole;
 uint64_t below_4g_mem_size;
 uint64_t above_4g_mem_size;
@@ -97,6 +100,13 @@ typedef struct Q35PCIHost {
 #define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY  0xffff
 #define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX    0xfff
 
+#define MCH_HOST_BRIDGE_SMBASE_SIZE            (128 * KiB)
+#define MCH_HOST_BRIDGE_SMBASE_ADDR            0x30000
+#define MCH_HOST_BRIDGE_F_SMBASE               0x9c
+#define MCH_HOST_BRIDGE_F_SMBASE_QUERY         0xff
+#define MCH_HOST_BRIDGE_F_SMBASE_IN_RAM        0x01
+#define MCH_HOST_BRIDGE_F_SMBASE_LCK           0x02
+
 #define MCH_HOST_BRIDGE_PCIEXBAR               0x60    /* 64bit register */
 #define MCH_HOST_BRIDGE_PCIEXBAR_SIZE          8       /* 64bit register */
 #define MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT       0xb0000000
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index ac08e63..9c4b4ac 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -103,7 +103,9 @@
 
 struct hpet_fw_config hpet_cfg = {.count = UINT8_MAX};
 
-GlobalProperty pc_compat_4_1[] = {};
+GlobalProperty pc_compat_4_1[] = {
+{ "mch", "smbase-smram", "off" },
+};
 const size_t pc_compat_4_1_len = G_N_ELEMENTS(pc_compat_4_1);
 
 GlobalProperty pc_compat_4_0[] = {};
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 158d270..c1bd9f7 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -275,20 +275,20 @@ static const TypeInfo q35_host_info = {
  * MCH D0:F0
  */
 
-static uint64_t tseg_blackhole_read(void *ptr, hwaddr reg, unsigned size)
+static uint64_t blackhole_read(void *ptr, hwaddr reg, unsigned size)
 {
 return 0xffffffff;
 }
 
-static void tseg_blackhole_write(void *opaque, hwaddr addr, uint64_t val,
- unsigned width)
+static void blackhole_write(void *opaque, hwaddr addr, uint64_t val,
+unsigned width)
 {
 /* nothing */
 }
 
-static const MemoryRegionOps tseg_blackhole_ops = {
-.read = tseg_blackhole_read,
-.write = tseg_blackhole_write,
+static const MemoryRegionOps blackhole_ops = {
+.read = blackhole_read,
+.write = blackhole_write,
 .endianness = DEVICE_NATIVE_ENDIAN,
 .valid.min_access_size = 1,
 .valid.max_access_size = 4,
@@ -430,6 +430,46 @@ static void mch_update_ext_tseg_mbytes(MCHPCIState *mch)
 }
 }
 
+static void mch_update_smbase_smram(MCHPCIState *mch)
+{
+PCIDevice *pd = PCI_DEVICE(mch);
+uint8_t *reg = pd->config + MCH_HOST_BRIDGE_F_SMBASE;
+bool lck;
+
+if (!mch->has_smram_at_smbase) {
+return;
+}
+
+if (*reg == MCH_HOST_BRIDGE_F_SMBASE_QUERY) {
+pd->wmask[MCH_HOST_BRIDGE_F_SMBASE] =
+MCH_HOST_BRIDGE_F_SMBASE_LCK;
+*reg = MCH_HOST_BRIDGE_F_SMBASE_IN_RAM;
+return;
+}
+
+/*
+ * default/reset state, discard written value
+ * which will disable SMRAM blackhole at SMBASE
+ */
+if (pd->wmask[MCH_HOST_BRIDGE_F_SMBASE] == 0xff) {
+*reg = 0x00;
+}
+
+memory_region_transaction_begin();
+if (*reg & MCH_HOST_BRIDGE_F_SMBASE_LCK) {
+/* disable all writes */
+pd->wmask[MCH_HOST_BRIDGE_F_SMBASE] &=
+~MCH_HOST_BRIDGE_F_SMBASE_LCK;
+*reg = MCH_HOST_BRIDGE_F_SMBASE_LCK;
+lck = true;
+} else {
+lck = false;
+}
+

[PATCH for-5.0 8/8] acpi: cpuhp: spec: document procedure for enabling modern CPU hotplug

2019-12-04 Thread Igor Mammedov

Describe how to enable and detect modern CPU hotplug interface.
Detection part is based on new CPHP_GET_CPU_ID_CMD command,
introduced by "acpi: cpuhp: add CPHP_GET_CPU_ID_CMD command" patch.

Signed-off-by: Igor Mammedov 
---
 docs/specs/acpi_cpu_hotplug.txt | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index bb33144..667b264 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -15,14 +15,14 @@ CPU present bitmap for:
   PIIX-PM  (IO port 0xaf00-0xaf1f, 1-byte access)
   One bit per CPU. Bit position reflects corresponding CPU APIC ID. Read-only.
   The first DWORD in bitmap is used in write mode to switch from legacy
-  to new CPU hotplug interface, write 0 into it to do switch.
+  to modern CPU hotplug interface, write 0 into it to do switch.
 ---
 QEMU sets corresponding CPU bit on hot-add event and issues SCI
 with GPE.2 event set. CPU present map is read by ACPI BIOS GPE.2 handler
 to notify OS about CPU hot-add events. CPU hot-remove isn't supported.
 
 =
-ACPI CPU hotplug interface registers:
+Modern ACPI CPU hotplug interface registers:
 -
 Register block base address:
 ICH9-LPC IO port 0x0cd8
@@ -105,6 +105,24 @@ write access:
   other values: reserved
 
 Typical usecases:
+- (x86) Detecting and enabling modern CPU hotplug interface.
+  QEMU starts with legacy CPU hotplug interface enabled. Detecting and
+  switching to modern interface is based on the 2 legacy CPU hotplug features:
+1. Writes into CPU bitmap are ignored.
+2. CPU bitmap always has bit#0 set, corresponding to boot CPU.
+
+  Use following steps to detect and enable modern CPU hotplug interface:
+1. Store 0x0 to the 'CPU selector' register,
+   attempting to switch to modern mode
+2. Store 0x0 to the 'CPU selector' register,
+   to ensure valid selector value
+3. Store 0x3 to the 'Command field' register,
+   sets the 'Command data 2' register into architecture specific
+   CPU identifier mode
+4. Read the 'Command data 2' register.
+   If read value is 0x0, the modern interface is enabled.
+   Otherwise legacy or no CPU hotplug interface available
+
 - Get a cpu with pending event
   1. Store 0x0 to the 'CPU selector' register.
   2. Store 0x0 to the 'Command field' register.
-- 
2.7.4

[PATCH for-5.0 0/8] q35: CPU hotplug with secure boot, part 1+2

2019-12-04 Thread Igor Mammedov

Series consists of 2 parts: 1st is lockable SMRAM at SMBASE
and the 2nd adds means to enumerate APIC IDs for possible CPUs.

1st part [1-2/8]:
 In order to support CPU hotplug in secure boot mode,
 UEFI firmware needs to relocate the SMI handler of a hotplugged CPU
 in a way that won't allow a ring 0 user to break into the privileged
 SMM mode that firmware maintains during runtime.
 The approach used hides RAM at default SMBASE, making it
 accessible only in SMM mode, which lets us make sure that the
 SMI handler installed by firmware cannot be hijacked by an
 unprivileged user (similar to TSEG behavior).

2nd part:
 mostly fixes and extra documentation on how to detect and use
 modern CPU hotplug interface (MMIO block).
 So firmware could reuse it for enumerating possible CPUs and
 detecting hotplugged CPU(s). It also adds support for
 CPHP_GET_CPU_ID_CMD command [7/8], which should allow firmware
 to fetch APIC IDs for possible CPUs which is necessary for
 initializing internal structures for possible CPUs on boot.
 

CC: m...@redhat.com
CC: pbonz...@redhat.com
CC: ler...@redhat.com
CC: phi...@redhat.com

Igor Mammedov (8):
  q35: implement 128K SMRAM at default SMBASE address
  tests: q35: MCH: add default SMBASE SMRAM lock test
  acpi: cpuhp: spec: clarify 'CPU selector' register usage and
endianness
  acpi: cpuhp: spec: fix 'Command data' description
  acpi: cpuhp: spec: clarify store into 'Command data' when 'Command
field' == 0
  acpi: cpuhp: spec: add typical usecases
  acpi: cpuhp: add CPHP_GET_CPU_ID_CMD command
  acpi: cpuhp: spec: document procedure for enabling modern CPU hotplug

 include/hw/pci-host/q35.h   |  10 
 docs/specs/acpi_cpu_hotplug.txt |  91 +++---
 hw/acpi/cpu.c   |  15 ++
 hw/acpi/trace-events|   1 +
 hw/i386/pc.c|   4 +-
 hw/pci-host/q35.c   |  80 +++---
 tests/q35-test.c| 105 
 7 files changed, 281 insertions(+), 25 deletions(-)

-- 
2.7.4

[PATCH for-5.0 7/8] acpi: cpuhp: add CPHP_GET_CPU_ID_CMD command

2019-12-04 Thread Igor Mammedov

Extend CPU hotplug interface to return architecture specific
identifier for current CPU in 2 registers:
 - lower 32 bits existing ACPI_CPU_CMD_DATA_OFFSET_RW
 - upper 32 bits in new ACPI_CPU_CMD_DATA2_OFFSET_R at
   offset 0.

Target user is UEFI firmware, which needs a way to enumerate
all CPUs (including possible CPUs) to allocate and initialize
CPU structures on boot.
(for x86 it needs the APIC ID; later the command will also be used to
retrieve ARM's MPIDR, which serves a purpose similar to the APIC ID)

The new ACPI_CPU_CMD_DATA2_OFFSET_R register will also be used
to detect presence of modern CPU hotplug, which will be described
in follow up patch.

Signed-off-by: Igor Mammedov 
---
v1:
 - s/ACPI_CPU_CMD_DATA2_OFFSET_RW/ACPI_CPU_CMD_DATA2_OFFSET_R/.
---
 docs/specs/acpi_cpu_hotplug.txt | 10 --
 hw/acpi/cpu.c   | 15 +++
 hw/acpi/trace-events|  1 +
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/docs/specs/acpi_cpu_hotplug.txt b/docs/specs/acpi_cpu_hotplug.txt
index 58c16c6..bb33144 100644
--- a/docs/specs/acpi_cpu_hotplug.txt
+++ b/docs/specs/acpi_cpu_hotplug.txt
@@ -44,7 +44,11 @@ keeps the current value.
 
 read access:
 offset:
-[0x0-0x3] reserved
+[0x0-0x3] Command data 2: (DWORD access)
+  if last stored 'Command field' value:
+  3: upper 32 bits of architecture specific identifying CPU value
+ (in x86 case: 0x0)
+  other values: reserved
 [0x4] CPU device status fields: (1 byte access)
 bits:
0: Device is enabled and may be used by guest
@@ -96,7 +100,9 @@ write access:
   2: stores value into OST status register, triggers
  ACPI_DEVICE_OST QMP event from QEMU to external applications
  with current values of OST event and status registers.
-other values: reserved
+  3: lower 32 bit of architecture specific identifier
+ (in x86 case: APIC ID)
+  other values: reserved
 
 Typical usecases:
 - Get a cpu with pending event
diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index 87f30a3..87813ce 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -12,11 +12,13 @@
 #define ACPI_CPU_FLAGS_OFFSET_RW 4
 #define ACPI_CPU_CMD_OFFSET_WR 5
 #define ACPI_CPU_CMD_DATA_OFFSET_RW 8
+#define ACPI_CPU_CMD_DATA2_OFFSET_R 0
 
 enum {
 CPHP_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
 CPHP_OST_EVENT_CMD = 1,
 CPHP_OST_STATUS_CMD = 2,
+CPHP_GET_CPU_ID_CMD = 3,
 CPHP_CMD_MAX
 };
 
@@ -74,11 +76,24 @@ static uint64_t cpu_hotplug_rd(void *opaque, hwaddr addr, unsigned size)
 case CPHP_GET_NEXT_CPU_WITH_EVENT_CMD:
val = cpu_st->selector;
break;
+case CPHP_GET_CPU_ID_CMD:
+   val = cdev->arch_id & 0xFFFFFFFF;
+   break;
 default:
break;
 }
 trace_cpuhp_acpi_read_cmd_data(cpu_st->selector, val);
 break;
+case ACPI_CPU_CMD_DATA2_OFFSET_R:
+switch (cpu_st->command) {
+case CPHP_GET_CPU_ID_CMD:
+   val = cdev->arch_id >> 32;
+   break;
+default:
+   break;
+}
+trace_cpuhp_acpi_read_cmd_data2(cpu_st->selector, val);
+break;
 default:
 break;
 }
diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
index 96b8273..afbc77d 100644
--- a/hw/acpi/trace-events
+++ b/hw/acpi/trace-events
@@ -23,6 +23,7 @@ cpuhp_acpi_read_flags(uint32_t idx, uint8_t flags) "idx[0x%"PRIx32"] flags: 0x%"
 cpuhp_acpi_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
 cpuhp_acpi_write_cmd(uint32_t idx, uint8_t cmd) "idx[0x%"PRIx32"] cmd: 0x%"PRIx8
 cpuhp_acpi_read_cmd_data(uint32_t idx, uint32_t data) "idx[0x%"PRIx32"] data: 0x%"PRIx32
+cpuhp_acpi_read_cmd_data2(uint32_t idx, uint32_t data) "idx[0x%"PRIx32"] data: 0x%"PRIx32
 cpuhp_acpi_cpu_has_events(uint32_t idx, bool ins, bool rm) "idx[0x%"PRIx32"] inserting: %d, removing: %d"
 cpuhp_acpi_clear_inserting_evt(uint32_t idx) "idx[0x%"PRIx32"]"
 cpuhp_acpi_clear_remove_evt(uint32_t idx) "idx[0x%"PRIx32"]"
-- 
2.7.4
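
For illustration, the way the 64-bit architecture specific identifier is
split across the two 32-bit registers (and how firmware could put it back
together) can be sketched as follows. The helper names are hypothetical;
only the mask and shift come from the patch above:

```c
#include <assert.h>
#include <stdint.h>

/* Value returned by 'Command data' when 'Command field' == 3. */
static uint32_t cpu_id_lower(uint64_t arch_id)
{
    return (uint32_t)(arch_id & 0xFFFFFFFFu);
}

/* Value returned by 'Command data 2' when 'Command field' == 3. */
static uint32_t cpu_id_upper(uint64_t arch_id)
{
    return (uint32_t)(arch_id >> 32);
}

/* What a reader of both registers would do to recover the identifier. */
static uint64_t cpu_id_join(uint32_t lower, uint32_t upper)
{
    return ((uint64_t)upper << 32) | lower;
}

/* Round-trip check on an arbitrary example value. */
static void cpu_id_selfcheck(void)
{
    const uint64_t id = UINT64_C(0x0000008100000003);

    assert(cpu_id_join(cpu_id_lower(id), cpu_id_upper(id)) == id);
}
```

On x86 the identifier is the APIC ID, which fits in 32 bits, so the upper
half reads as 0x0; the second register matters for wider identifiers.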

Re: [PATCH 01/10] hw: arm: add Allwinner H3 System-on-Chip

2019-12-04 Thread Philippe Mathieu-Daudé


Hi Niek,

On 12/2/19 10:09 PM, Niek Linnenbank wrote:

The Allwinner H3 is a System on Chip containing four ARM Cortex A7
processor cores. Features and specifications include DDR2/DDR3 memory,
SD/MMC storage cards, 10/100/1000Mbit ethernet, USB 2.0, HDMI and
various I/O modules. This commit adds support for the Allwinner H3
System on Chip.

Signed-off-by: Niek Linnenbank 
---
  MAINTAINERS |   7 ++
  default-configs/arm-softmmu.mak |   1 +
  hw/arm/Kconfig  |   8 ++
  hw/arm/Makefile.objs|   1 +
  hw/arm/allwinner-h3.c   | 215 
  include/hw/arm/allwinner-h3.h   | 118 ++
  6 files changed, 350 insertions(+)
  create mode 100644 hw/arm/allwinner-h3.c
  create mode 100644 include/hw/arm/allwinner-h3.h


Since your series changes various files, can you have a look at the
scripts/git.orderfile file and set it up for your QEMU contributions?




diff --git a/MAINTAINERS b/MAINTAINERS
index 5e5e3e52d6..29c9936037 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -479,6 +479,13 @@ F: hw/*/allwinner*
  F: include/hw/*/allwinner*
  F: hw/arm/cubieboard.c
  
+Allwinner-h3

+M: Niek Linnenbank 
+L: qemu-...@nongnu.org
+S: Maintained
+F: hw/*/allwinner-h3*
+F: include/hw/*/allwinner-h3*
+
  ARM PrimeCell and CMSDK devices
  M: Peter Maydell 
  L: qemu-...@nongnu.org
diff --git a/default-configs/arm-softmmu.mak b/default-configs/arm-softmmu.mak
index 1f2e0e7fde..d75a239c2c 100644
--- a/default-configs/arm-softmmu.mak
+++ b/default-configs/arm-softmmu.mak
@@ -40,3 +40,4 @@ CONFIG_FSL_IMX25=y
  CONFIG_FSL_IMX7=y
  CONFIG_FSL_IMX6UL=y
  CONFIG_SEMIHOSTING=y
+CONFIG_ALLWINNER_H3=y
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index c6e7782580..ebf8d2325f 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -291,6 +291,14 @@ config ALLWINNER_A10
  select SERIAL
  select UNIMP
  
+config ALLWINNER_H3

+bool
+select ALLWINNER_A10_PIT
+select SERIAL
+select ARM_TIMER
+select ARM_GIC
+select UNIMP
+
  config RASPI
  bool
  select FRAMEBUFFER
diff --git a/hw/arm/Makefile.objs b/hw/arm/Makefile.objs
index fe749f65fd..956e496052 100644
--- a/hw/arm/Makefile.objs
+++ b/hw/arm/Makefile.objs
@@ -34,6 +34,7 @@ obj-$(CONFIG_DIGIC) += digic.o
  obj-$(CONFIG_OMAP) += omap1.o omap2.o
  obj-$(CONFIG_STRONGARM) += strongarm.o
  obj-$(CONFIG_ALLWINNER_A10) += allwinner-a10.o cubieboard.o
+obj-$(CONFIG_ALLWINNER_H3) += allwinner-h3.o
  obj-$(CONFIG_RASPI) += bcm2835_peripherals.o bcm2836.o raspi.o
  obj-$(CONFIG_STM32F205_SOC) += stm32f205_soc.o
  obj-$(CONFIG_XLNX_ZYNQMP_ARM) += xlnx-zynqmp.o xlnx-zcu102.o
diff --git a/hw/arm/allwinner-h3.c b/hw/arm/allwinner-h3.c
new file mode 100644
index 0000000000..470fdfebef
--- /dev/null
+++ b/hw/arm/allwinner-h3.c
@@ -0,0 +1,215 @@
+/*
+ * Allwinner H3 System on Chip emulation
+ *
+ * Copyright (C) 2019 Niek Linnenbank 
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/address-spaces.h"
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "qemu/units.h"
+#include "cpu.h"
+#include "hw/sysbus.h"
+#include "hw/arm/allwinner-h3.h"
+#include "hw/misc/unimp.h"
+#include "sysemu/sysemu.h"
+
+static void aw_h3_init(Object *obj)
+{
+AwH3State *s = AW_H3(obj);
+
+sysbus_init_child_obj(obj, "gic", &s->gic, sizeof(s->gic),
+  TYPE_ARM_GIC);
+
+sysbus_init_child_obj(obj, "timer", &s->timer, sizeof(s->timer),
+  TYPE_AW_A10_PIT);
+}
+
+static void aw_h3_realize(DeviceState *dev, Error **errp)
+{
+AwH3State *s = AW_H3(dev);
+SysBusDevice *sysbusdev = NULL;
+Error *err = NULL;
+unsigned i = 0;
+
+/* CPUs */
+for (i = 0; i < AW_H3_NUM_CPUS; i++) {


In https://www.mail-archive.com/qemu-devel@nongnu.org/msg662942.html
Markus noted some incorrect pattern, and apparently you inherited it.
You should initialize 'err' in the loop.


+Object *cpuobj = object_new(ARM_CPU_TYPE_NAME("cortex-a7"));
+CPUState *cpustate = CPU(cpuobj);


We lose access to the CPUs. Can you use an array of AW_H3_NUM_CPUS cpus
in AwH3State?



+
+/* Set the proper CPU index */
+cpustate->cpu_index = i;
+
+/* Provide Power State Coordination Interface */
+object_property_set_int(cpuobj, QEMU_PSCI_CONDUIT_HVC,
+

Re: [PATCH v2 2/2] migration: savevm_state_handler_insert: constant-time element insertion

2019-12-04 Thread Dr. David Alan Gilbert

* Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> On Mon, Oct 21, 2019 at 09:14:44AM +0100, Dr. David Alan Gilbert wrote:
> > * David Gibson (da...@gibson.dropbear.id.au) wrote:
> > > On Fri, Oct 18, 2019 at 10:43:52AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Laurent Vivier (lviv...@redhat.com) wrote:
> > > > > On 18/10/2019 10:16, Dr. David Alan Gilbert wrote:
> > > > > > * Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> > > > > >> savevm_state's SaveStateEntry TAILQ is a priority queue.  Priority
> > > > > >> sorting is maintained by searching from head to tail for a suitable
> > > > > >> insertion spot.  Insertion is thus an O(n) operation.
> > > > > >>
> > > > > >> If we instead keep track of the head of each priority's subqueue
> > > > > >> within that larger queue we can reduce this operation to O(1) time.
> > > > > >>
> > > > > > >> savevm_state_handler_remove() becomes slightly more complex to
> > > > > > >> accommodate these gains: we need to replace the head of a priority's
> > > > > > >> subqueue when removing it.
> > > > > > >>
> > > > > > >> With O(1) insertion, booting VMs with many SaveStateEntry objects is
> > > > > > >> more plausible.  For example, a ppc64 VM with maxmem=8T has 4 such
> > > > > > >> objects to insert.
> > > > > > 
> > > > > > > Separate from reviewing this patch, I'd like to understand why you've
> > > > > > > got 4 objects.  This feels very very wrong and is likely to cause
> > > > > > > problems to random other bits of qemu as well.
> > > > > > 
> > > > > > I think the 4 objects are the "dr-connectors" that are used to plug
> > > > > > peripherals (memory, pci card, cpus, ...).
> > > > 
> > > > Yes, Scott confirmed that in the reply to the previous version.
> > > > IMHO nothing in qemu is designed to deal with that many devices/objects
> > > > - I'm sure that something other than the migration code is going to
> > > > get upset.
> > > 
> > > > It kind of did.  Particularly when there was n^2 and n^3
> > > behaviour in the property stuff we had some ludicrously long startup
> > > times (hours) with large maxmem values.
> > > 
> > > Fwiw, the DRCs for PCI slots, DRCs and PHBs aren't really a problem.
> > > The problem is the memory DRCs, there's one for each LMB - each 256MiB
> > > chunk of memory (or possible memory).
> > > 
> > > > Is perhaps the structure wrong somewhere - should there be a single DRC
> > > > device that knows about all DRCs?
> > > 
> > > Maybe.  The tricky bit is how to get there from here without breaking
> > > migration or something else along the way.
> > 
> > Switch on the next machine type version - it doesn't matter if migration
> > is incompatible then.
> 
> 1mo bump.
> 
> Is there anything I need to do with this patch in particular to make it
> suitable for merging?

Apologies for the delay;  hopefully this will go in one of the pulls
just after the tree opens again.

Please please try and work on reducing the number of objects somehow -
while this migration fix is a useful short term fix, and not too
invasive; having that many objects around qemu is a really really bad
idea so needs fixing properly.

Dave

> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH v6 0/9] Clock framework API

2019-12-04 Thread Damien Hedde




On 12/2/19 5:15 PM, Peter Maydell wrote:
> 
> The one topic I think we could do with discussing is whether
> a simple uint64_t giving the frequency of the clock in Hz is
> the right representation. In particular in your patch 9 the
> board has a clock frequency that's not a nice integer number
> of Hz. I think Philippe also mentioned on irc some board where
> the UART clock ends up at a weird frequency. Since the
> representation of the frequency is baked into the migration
> format it's going to be easier to get it right first rather
> than trying to change it later.
> 
> So what should the representation be? Some random thoughts:
> 
> 1) ptimer internally uses a 'period plus fraction' representation:
>  int64_t period is the integer part of the period in nanoseconds,
>  uint32_t period_frac is the fractional part of the period
> (if you like you can think of this as "96-bit integer
> period measured in units of one-2^32nd of a nanosecond").
> However its only public interfaces for setting the frequency
> are (a) set the frequency in Hz (uint32_t) or (b) set
> the period in nanoseconds (int64_t); the period_frac part
> is used to handle frequencies which don't work out to
> a nice whole number of nanoseconds per cycle.
> 
> 2) I hear that SystemC uses "value plus a time unit", with
> the smallest unit being a picosecond. (I think SystemC
> also lets you specify the duty cycle, but we definitely
> don't want to get into that!)

The "value" is internally stored in a 64-bit unsigned integer.

> 
> 3) QEMUTimers are basically just nanosecond timers
> 
> 4) The MAME emulator seems to work with periods of
> 96-bit attoseconds (represented internally by a
> 32-bit count of seconds plus a 64-bit count of
> attoseconds). One attosecond is 1e-18 seconds.
> 
> Does anybody else have experience with other modelling
> or emulator technology and how it represents clocks ?

5) In linux, a clock rate is an "unsigned long" representing Hz.

> 
> I feel we should at least be able to represent clocks
> with the same accuracy that ptimer has.

Then it is maybe a good idea to store the period and not the frequency in
clocks so that we don't lose anything when we switch from a clock to a
ptimer ?

Regarding the clock, I don't see any strong obstacle to switching
internally to a period based value.
The only thing we have to choose is how to represent a disabled clock.
Since putting a "0" period to a ptimer will disable the timer in
ptimer_reload(), we can choose that (and it's a good value because we
can multiply or divide it, it stays the same).

We could use the same representation as a ptimer. But if we don't keep a
C number representation, then computation of frequencies/periods will be
complicated at best and error prone.

From that point of view, if we could stick to a 64-bit integer (or
floating point number) it would be great. Can we use a sub nanosecond
unit that fits our needs ?

I did some tests with a unit of 2^-32 nanoseconds on 64 bits (is that
the unit of the ptimer fractional part ?) and if I'm not mistaken:
+ we have a frequency range from ~0.2Hz up to 10^18Hz
+ the resolution gets coarser as the frequency increases (but at 100MHz
we have a ~2.3mHz resolution, at 1GHz it's ~0.23Hz and at 10GHz ~23Hz
resolution). We hit 1Hz resolution around 2GHz.

So it sounds to me we have largely enough resolution to model clocks in
the range of frequencies we will have to handle. What do you think ?

--
Damien
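
The figures quoted above can be reproduced with a few lines of C. This is only a minimal sketch, assuming a 64-bit period counted in units of 2^-32 ns as in the test described in the mail; the helper names (clock_period_to_hz(), clock_hz_to_period(), clock_resolution_hz()) and the CLOCK_UNIT_SECONDS macro are hypothetical and are not part of the proposed clock API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed unit: one period tick is 2^-32 ns, i.e. 1e-9 / 2^32 seconds. */
#define CLOCK_UNIT_SECONDS (1e-9 / 4294967296.0)

/* Hypothetical helper: frequency in Hz for a period; 0 means "disabled". */
static double clock_period_to_hz(uint64_t period)
{
    if (period == 0) {
        return 0.0;
    }
    return 1.0 / ((double)period * CLOCK_UNIT_SECONDS);
}

/* Hypothetical helper: nearest representable period for a frequency. */
static uint64_t clock_hz_to_period(double hz)
{
    double ticks;

    assert(hz >= 0.0);
    if (hz == 0.0) {
        return 0;                   /* disabled clock */
    }
    ticks = 1.0 / (hz * CLOCK_UNIT_SECONDS) + 0.5;
    if (ticks >= 18446744073709551616.0) {
        return UINT64_MAX;          /* slower than the lowest rate */
    }
    if (ticks < 1.0) {
        return 1;                   /* faster than the highest rate */
    }
    return (uint64_t)ticks;
}

/* Frequency step between adjacent periods near hz: f(p) - f(p + 1). */
static double clock_resolution_hz(double hz)
{
    uint64_t p = clock_hz_to_period(hz);

    if (p == 0) {
        return 0.0;
    }
    /* f(p) - f(p + 1) simplifies to f(p) / (p + 1). */
    return clock_period_to_hz(p) / ((double)p + 1.0);
}
```

With these helpers, clock_resolution_hz(100e6) comes out near 2.3 mHz and clock_period_to_hz(UINT64_MAX) near 0.23 Hz, which matches the range and resolution estimates in the mail. A zero period is kept as the "disabled" value, as suggested for ptimer compatibility.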

Re: [PATCH v2 1/2] migration: add savevm_state_handler_remove()

2019-12-04 Thread Dr. David Alan Gilbert

* Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> Create a function to abstract common logic needed when removing a
> SaveStateEntry element from the savevm_state.handlers queue.
> 
> For now we just remove the element.  Soon it will involve additional
> cleanup.
> 
> Signed-off-by: Scott Cheloha 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/savevm.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 8d95e261f6..b2e3b7222a 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -725,6 +725,11 @@ static void savevm_state_handler_insert(SaveStateEntry *nse)
>      }
>  }
>  
> +static void savevm_state_handler_remove(SaveStateEntry *se)
> +{
> +    QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
> +}
> +
>  /* TODO: Individual devices generally have very little idea about the rest
>     of the system, so instance_id should be removed/replaced.
>     Meanwhile pass -1 as instance_id if you do not already have a clearly
> @@ -777,7 +782,7 @@ void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
>  
>      QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>          if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
> -            QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
> +            savevm_state_handler_remove(se);
>              g_free(se->compat);
>              g_free(se);
>          }
> @@ -841,7 +846,7 @@ void vmstate_unregister(DeviceState *dev, const VMStateDescription *vmsd,
>  
>      QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>          if (se->vmsd == vmsd && se->opaque == opaque) {
> -            QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
> +            savevm_state_handler_remove(se);
>              g_free(se->compat);
>              g_free(se);
>          }
> -- 
> 2.23.0
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH v2 2/2] migration: savevm_state_handler_insert: constant-time element insertion

2019-12-04 Thread Dr. David Alan Gilbert

* Scott Cheloha (chel...@linux.vnet.ibm.com) wrote:
> savevm_state's SaveStateEntry TAILQ is a priority queue.  Priority
> sorting is maintained by searching from head to tail for a suitable
> insertion spot.  Insertion is thus an O(n) operation.
> 
> If we instead keep track of the head of each priority's subqueue
> within that larger queue we can reduce this operation to O(1) time.
> 
> savevm_state_handler_remove() becomes slightly more complex to
> accommodate these gains: we need to replace the head of a priority's
> subqueue when removing it.
> 
> With O(1) insertion, booting VMs with many SaveStateEntry objects is
> more plausible.  For example, a ppc64 VM with maxmem=8T has 4 such
> objects to insert.
> 
> Signed-off-by: Scott Cheloha 

OK, it took me a while to figure out why you didn't just
turn handlers into handlers[MIG_PRI_MAX]; but I guess the problem is
you would have to change all the foreach's scattered around that walk
the list.  So


Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/savevm.c | 26 +++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b2e3b7222a..f7a2d36bba 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -250,6 +250,7 @@ typedef struct SaveStateEntry {
>  
>  typedef struct SaveState {
>      QTAILQ_HEAD(, SaveStateEntry) handlers;
> +    SaveStateEntry *handler_pri_head[MIG_PRI_MAX + 1];
>      int global_section_id;
>      uint32_t len;
>      const char *name;
> @@ -261,6 +262,7 @@ typedef struct SaveState {
>  
>  static SaveState savevm_state = {
>      .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
> +    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
>      .global_section_id = 0,
>  };
>  
> @@ -709,24 +711,42 @@ static void savevm_state_handler_insert(SaveStateEntry *nse)
>  {
>      MigrationPriority priority = save_state_priority(nse);
>      SaveStateEntry *se;
> +    int i;
>  
>      assert(priority <= MIG_PRI_MAX);
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (save_state_priority(se) < priority) {
> +    for (i = priority - 1; i >= 0; i--) {
> +        se = savevm_state.handler_pri_head[i];
> +        if (se != NULL) {
> +            assert(save_state_priority(se) < priority);
>              break;
>          }
>      }
>  
> -    if (se) {
> +    if (i >= 0) {
>          QTAILQ_INSERT_BEFORE(se, nse, entry);
>      } else {
>          QTAILQ_INSERT_TAIL(&savevm_state.handlers, nse, entry);
>      }
> +
> +    if (savevm_state.handler_pri_head[priority] == NULL) {
> +        savevm_state.handler_pri_head[priority] = nse;
> +    }
>  }
>  
>  static void savevm_state_handler_remove(SaveStateEntry *se)
>  {
> +    SaveStateEntry *next;
> +    MigrationPriority priority = save_state_priority(se);
> +
> +    if (se == savevm_state.handler_pri_head[priority]) {
> +        next = QTAILQ_NEXT(se, entry);
> +        if (next != NULL && save_state_priority(next) == priority) {
> +            savevm_state.handler_pri_head[priority] = next;
> +        } else {
> +            savevm_state.handler_pri_head[priority] = NULL;
> +        }
> +    }
>      QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
>  }
>  
>  
> -- 
> 2.23.0
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
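
The technique in this patch (one sorted list plus an array that remembers the first entry of each priority) can be sketched outside QEMU as follows. This is a minimal illustration under stated assumptions, not the savevm code: it uses a hand-rolled doubly linked list instead of the QTAILQ macros, and DemoEntry, DemoQueue, DEMO_PRI_MAX, demo_insert() and demo_remove() are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PRI_MAX 3   /* hypothetical highest priority */

typedef struct DemoEntry {
    int priority;
    struct DemoEntry *prev, *next;
} DemoEntry;

typedef struct DemoQueue {
    DemoEntry *first, *last;                /* highest priority first */
    DemoEntry *pri_head[DEMO_PRI_MAX + 1];  /* first entry per priority */
} DemoQueue;

/* Insert e after all entries of priority >= e->priority. */
static void demo_insert(DemoQueue *q, DemoEntry *e)
{
    DemoEntry *before = NULL;
    int i;

    assert(e->priority >= 0 && e->priority <= DEMO_PRI_MAX);

    /* Find the head of the nearest lower-priority subqueue, if any. */
    for (i = e->priority - 1; i >= 0; i--) {
        if (q->pri_head[i] != NULL) {
            before = q->pri_head[i];
            break;
        }
    }

    if (before != NULL) {
        /* Link e immediately in front of that subqueue head. */
        e->next = before;
        e->prev = before->prev;
        if (before->prev != NULL) {
            before->prev->next = e;
        } else {
            q->first = e;
        }
        before->prev = e;
    } else {
        /* No lower priority present: append at the tail. */
        e->next = NULL;
        e->prev = q->last;
        if (q->last != NULL) {
            q->last->next = e;
        } else {
            q->first = e;
        }
        q->last = e;
    }

    if (q->pri_head[e->priority] == NULL) {
        q->pri_head[e->priority] = e;
    }
}

/* Unlink e, promoting its successor if e was a subqueue head. */
static void demo_remove(DemoQueue *q, DemoEntry *e)
{
    if (q->pri_head[e->priority] == e) {
        DemoEntry *next = e->next;

        q->pri_head[e->priority] =
            (next != NULL && next->priority == e->priority) ? next : NULL;
    }
    if (e->prev != NULL) {
        e->prev->next = e->next;
    } else {
        q->first = e->next;
    }
    if (e->next != NULL) {
        e->next->prev = e->prev;
    } else {
        q->last = e->prev;
    }
    e->prev = e->next = NULL;
}
```

The insertion loop runs over priorities rather than over entries, so its cost is bounded by DEMO_PRI_MAX and does not grow with the number of queued entries. Code that walks the whole list from first to last is unaffected, which is the reason given in the review for not splitting the list into one queue per priority.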

Re: virtiofsd: Where should it live?

2019-12-04 Thread Dr. David Alan Gilbert

We seem to be settling on either fsdev/virtiofsd or tools/virtiofsd,
with tools picking up some speed as people seem to want to put a bunch
of other stuff in there.

Unless anyone shouts really loud, I'll work on making it
tools/virtiofsd.

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH] target/sparc: Remove old TODO file

2019-12-04 Thread Artyom Tarasenko

On Wed, Dec 4, 2019 at 5:27 PM Thomas Huth  wrote:
>
> On 30/09/2019 19.10, Thomas Huth wrote:
> > This file hasn't seen a real (non-trivial) update since 2008 anymore,
> > so we can assume that it is pretty much out of date and nobody cares
> > for it anymore. Let's simply remove it.
> >
> > Signed-off-by: Thomas Huth 
> > ---
> >  target/sparc/TODO | 88 ---
> >  1 file changed, 88 deletions(-)
> >  delete mode 100644 target/sparc/TODO
> >
> > diff --git a/target/sparc/TODO b/target/sparc/TODO
> > deleted file mode 100644
> > index b8c727e858..00
> > --- a/target/sparc/TODO
> > +++ /dev/null
> > @@ -1,88 +0,0 @@
> > -TODO-list:
> > -
> > -CPU common:
> > -- Unimplemented features/bugs:
> > - - Delay slot handling may fail sometimes (branch end of page, delay
> > - slot next page)
> > - - Atomical instructions
> > - - CPU features should match real CPUs (also ASI selection)
> > -- Optimizations/improvements:
> > - - Condition code/branch handling like x86, also for FPU?
> > - - Remove remaining explicit alignment checks
> > - - Global register for regwptr, so that windowed registers can be
> > - accessed directly
> > - - Improve Sparc32plus addressing
> > - - NPC/PC static optimisations (use JUMP_TB when possible)? (Is this
> > - obsolete?)
> > - - Synthetic instructions
> > - - MMU model dependent on CPU model
> > - - Select ASI helper at translation time (on V9 only if known)
> > - - KQemu/KVM support for VM only
> > - - Hardware breakpoint/watchpoint support
> > - - Cache emulation mode
> > - - Reverse-endian pages
> > - - Faster FPU emulation
> > - - Busy loop detection
> > -
> > -Sparc32 CPUs:
> > -- Unimplemented features/bugs:
> > - - Sun4/Sun4c MMUs
> > - - Some V8 ASIs
> > -
> > -Sparc64 CPUs:
> > -- Unimplemented features/bugs:
> > - - Interrupt handling
> > - - Secondary address space, other MMU functions
> > - - Many V9/UA2005/UA2007 ASIs
> > - - Rest of V9 instructions, missing VIS instructions
> > - - IG/MG/AG vs. UA2007 globals
> > - - Full hypervisor support
> > - - SMP/CMT
> > - - Sun4v CPUs
> > -
> > -Sun4:
> > -- To be added
> > -
> > -Sun4c:
> > -- A lot of unimplemented features
> > -- Maybe split from Sun4m
> > -
> > -Sun4m:
> > -- Unimplemented features/bugs:
> > - - Hardware devices do not match real boards
> > - - Floppy does not work
> > - - CS4231: merge with cs4231a, add DMA
> > - - Add cg6, bwtwo
> > - - Arbitrary resolution support
> > - - PCI for MicroSparc-IIe
> > - - JavaStation machines
> > - - SBus slot probing, FCode ROM support
> > - - SMP probing support
> > - - Interrupt routing does not match real HW
> > - - SuSE 7.3 keyboard sometimes unresponsive
> > - - Gentoo 2004.1 SMP does not work
> > - - SS600MP ledma -> lebuffer
> > - - Type 5 keyboard
> > - - Less fixed hardware choices
> > - - DBRI audio (Am7930)
> > - - BPP parallel
> > - - Diagnostic switch
> > - - ESP PIO mode
> > -
> > -Sun4d:
> > -- A lot of unimplemented features:
> > - - SBI
> > - - IO-unit
> > -- Maybe split from Sun4m
> > -
> > -Sun4u:
> > -- Unimplemented features/bugs:
> > - - Interrupt controller
> > - - PCI/IOMMU support (Simba, JIO, Tomatillo, Psycho, Schizo, Safari...)
> > - - SMP
> > - - Happy Meal Ethernet, flash, I2C, GPIO
> > - - A lot of real machine types
> > -
> > -Sun4v:
> > -- A lot of unimplemented features
> > - - A lot of real machine types
> >
>
> Ping?

Sorry for the delay. You are right, the file doesn't reflect the
current state, so

Reviewed-by: Artyom Tarasenko 


-- 
Regards,
Artyom Tarasenko

SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu

Re: qom device lifecycle interaction with hotplug/hotunplug ?

2019-12-04 Thread Jens Freimann

On Wed, Dec 04, 2019 at 11:35:37AM -0300, Eduardo Habkost wrote:

On Wed, Dec 04, 2019 at 10:18:24AM +0100, Jens Freimann wrote:

On Tue, Dec 03, 2019 at 06:40:04PM -0300, Eduardo Habkost wrote:
> +jfreimann, +mst
>
> On Sat, Nov 30, 2019 at 11:10:19AM +, Peter Maydell wrote:
> > On Fri, 29 Nov 2019 at 20:05, Eduardo Habkost  wrote:
> > > So, to summarize the current issues:
> > >
> > > 1) realize triggers a plug operation implicitly.
> > > 2) unplug triggers unrealize implicitly.
> > >
> > > Do you expect to see use cases that will require us to implement
> > > realize-without-plug?
> >
> > I don't think so, but only because of the oddity that
> > we put lots of devices on the 'sysbus' and claim that
> > that's plugging them into the bus. The common case of
> > 'realize' is where one device (say an SoC) has a bunch of child
> > devices (like UARTs); the SoC's realize method realizes its child
> > devices. Those devices all end up plugged into the 'sysbus'
> > but there's no actual bus there, it's fictional and about
> > the only thing it matters for is reset propagation (which
> > we don't model right either). A few devices don't live on
> > buses at all.
>
> That's my impression as well.
>
> >
> > > Similarly, do you expect use cases that will require us to
> > > implement unplug-without-unrealize?
> >
> > I don't know enough about hotplug to answer this one:
> > it's essentially what I'm hoping you'd be able to answer.
> > I vaguely had in mind that eg the user might be able to
> > create a 'disk' object, plug it into a SCSI bus, then
> > unplug it from the bus without the disk and all its data
> > evaporating, and maybe plug it back into the SCSI
> > bus (or some other SCSI bus) later ? But I don't know
> > anything about how we expose that kind of thing to the
> > user via QMP/HMP.
>
> This ability isn't exposed to the user at all.  Our existing
> interfaces are -device, device_add and device_del.
>
> We do have something new that sounds suspiciously similar to
> "unplugged but not unrealized", though: the new hidden device
> API, added by commit f3a850565693 ("qdev/qbus: add hidden device
> support").
>
> Jens, Michael, what exactly is the difference between a "hidden"
> device and a "unplugged" device?

"hidden" the way we use it for virtio-net failover is actually unplugged. But it
doesn't have to be that way. You can register a function that decides
if the device should be hidden, i.e. plugged now, or do something else
with it (in the virtio-net failover case we just save everything we
need to plug the device later).

We did introduce a "unplugged but not unrealized" function too as part
of the failover feature. See "a99c4da9fc pci: mark devices partially
unplugged"

This was needed so we would be able to re-plug the device in case a
migration failed and we need to hotplug the primary device back to the
guest. To avoid the risk of not getting the resources the device needs
we don't unrealize but just trigger the unplug from the guest OS.

Thanks for the explanation.  Let me confirm if I understand the
purpose of the new mechanisms: should_be_hidden is a mechanism
for implementing realize-without-plug.  partially_hotplugged is a
mechanism for implementing unplug-without-unrealize.  Is that
correct?

should_be_hidden is a mechanism for implementing
realize-without-plug: kind of. It's a mechanism that ensures
qdev_device_add() returns early as long as the condition to hide the
device is true. You could do the realize-without-plug in the handler
function that decides if the device should be "hidden".

partially_hotplugged is a mechanism for implementing
unplug-without-unrealize: yes. 

regards
Jens

Re: [PATCH] target/sparc: Remove old TODO file

2019-12-04 Thread Thomas Huth

On 30/09/2019 19.10, Thomas Huth wrote:
> This file hasn't seen a real (non-trivial) update since 2008 anymore,
> so we can assume that it is pretty much out of date and nobody cares
> for it anymore. Let's simply remove it.
> 
> Signed-off-by: Thomas Huth 
> ---
>  target/sparc/TODO | 88 ---
>  1 file changed, 88 deletions(-)
>  delete mode 100644 target/sparc/TODO
> 
> diff --git a/target/sparc/TODO b/target/sparc/TODO
> deleted file mode 100644
> index b8c727e858..00
> --- a/target/sparc/TODO
> +++ /dev/null
> @@ -1,88 +0,0 @@
> -TODO-list:
> -
> -CPU common:
> -- Unimplemented features/bugs:
> - - Delay slot handling may fail sometimes (branch end of page, delay
> - slot next page)
> - - Atomical instructions
> - - CPU features should match real CPUs (also ASI selection)
> -- Optimizations/improvements:
> - - Condition code/branch handling like x86, also for FPU?
> - - Remove remaining explicit alignment checks
> - - Global register for regwptr, so that windowed registers can be
> - accessed directly
> - - Improve Sparc32plus addressing
> - - NPC/PC static optimisations (use JUMP_TB when possible)? (Is this
> - obsolete?)
> - - Synthetic instructions
> - - MMU model dependent on CPU model
> - - Select ASI helper at translation time (on V9 only if known)
> - - KQemu/KVM support for VM only
> - - Hardware breakpoint/watchpoint support
> - - Cache emulation mode
> - - Reverse-endian pages
> - - Faster FPU emulation
> - - Busy loop detection
> -
> -Sparc32 CPUs:
> -- Unimplemented features/bugs:
> - - Sun4/Sun4c MMUs
> - - Some V8 ASIs
> -
> -Sparc64 CPUs:
> -- Unimplemented features/bugs:
> - - Interrupt handling
> - - Secondary address space, other MMU functions
> - - Many V9/UA2005/UA2007 ASIs
> - - Rest of V9 instructions, missing VIS instructions
> - - IG/MG/AG vs. UA2007 globals
> - - Full hypervisor support
> - - SMP/CMT
> - - Sun4v CPUs
> -
> -Sun4:
> -- To be added
> -
> -Sun4c:
> -- A lot of unimplemented features
> -- Maybe split from Sun4m
> -
> -Sun4m:
> -- Unimplemented features/bugs:
> - - Hardware devices do not match real boards
> - - Floppy does not work
> - - CS4231: merge with cs4231a, add DMA
> - - Add cg6, bwtwo
> - - Arbitrary resolution support
> - - PCI for MicroSparc-IIe
> - - JavaStation machines
> - - SBus slot probing, FCode ROM support
> - - SMP probing support
> - - Interrupt routing does not match real HW
> - - SuSE 7.3 keyboard sometimes unresponsive
> - - Gentoo 2004.1 SMP does not work
> - - SS600MP ledma -> lebuffer
> - - Type 5 keyboard
> - - Less fixed hardware choices
> - - DBRI audio (Am7930)
> - - BPP parallel
> - - Diagnostic switch
> - - ESP PIO mode
> -
> -Sun4d:
> -- A lot of unimplemented features:
> - - SBI
> - - IO-unit
> -- Maybe split from Sun4m
> -
> -Sun4u:
> -- Unimplemented features/bugs:
> - - Interrupt controller
> - - PCI/IOMMU support (Simba, JIO, Tomatillo, Psycho, Schizo, Safari...)
> - - SMP
> - - Happy Meal Ethernet, flash, I2C, GPIO
> - - A lot of real machine types
> -
> -Sun4v:
> -- A lot of unimplemented features
> - - A lot of real machine types
> 

Ping?

 Thomas

Re: [PATCH v4 14/40] target/arm: Recover 4 bits from TBFLAGs

2019-12-04 Thread Richard Henderson

On 12/4/19 7:53 AM, Alex Bennée wrote:
> 
> Richard Henderson  writes:
> 
>> On 12/4/19 3:43 AM, Alex Bennée wrote:
> 
>>>>  void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns)
>>>>  {
>>>> -    DisasContext dc;
>>>> +    DisasContext dc = { };
>>>
>>> We seemed to have dropped an initialise here which seems unrelated.
>>> We seemed to have dropped an initialise here which seems unrelated.
>>
>> Added, not dropped.
> 
> But is it related to this patch or fixing another bug?

It is related to the patch.

We used to initialize all of the a32 and m32 fields in DisasContext by
assignment.  Now we only initialize either the a32 or m32 by assignment,
because the bits overlap in tbflags.  So zero out the other bits here.

I'll add this to the commit message.

r~
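
The point about zeroing the fields that are no longer assigned can be shown with a small stand-alone example. This is a sketch under stated assumptions: DemoContext and demo_context_init() are hypothetical stand-ins, not the real DisasContext, and the sketch uses the portable "{ 0 }" spelling where the patch uses the empty-brace form "{ }" (a GNU C extension, standardised in C23).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a translation context with two field groups. */
typedef struct DemoContext {
    bool is_m_profile;
    uint32_t a_only_field;   /* only meaningful for the A-profile case */
    uint32_t m_only_field;   /* only meaningful for the M-profile case */
} DemoContext;

/* Decode shared flag bits into whichever group applies. */
static DemoContext demo_context_init(bool is_m_profile, uint32_t flags)
{
    /*
     * Zero everything first: only one of the two groups is assigned
     * below, so the other would otherwise hold indeterminate values.
     */
    DemoContext dc = { 0 };

    dc.is_m_profile = is_m_profile;
    if (is_m_profile) {
        dc.m_only_field = flags & 0xf;
    } else {
        dc.a_only_field = flags & 0xf;
    }

    /* At most one of the two groups has been written. */
    assert(dc.a_only_field == 0 || dc.m_only_field == 0);
    return dc;
}
```

Without the initializer, reading the unassigned group would be reading an uninitialized automatic variable, so the zeroing belongs with the change that made the assignments conditional.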

Re: [PATCH v4 23/40] target/arm: Update ctr_el0_access for EL2

2019-12-04 Thread Alex Bennée



Richard Henderson  writes:

> Update to include checks against HCR_EL2.TID2.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

> ---
>  target/arm/helper.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index ffa82b5509..9ad5015d5c 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -5212,11 +5212,27 @@ static const ARMCPRegInfo el3_cp_reginfo[] = {
>  static CPAccessResult ctr_el0_access(CPUARMState *env, const ARMCPRegInfo *ri,
>                                       bool isread)
>  {
> -    /* Only accessible in EL0 if SCTLR.UCT is set (and only in AArch64,
> -     * but the AArch32 CTR has its own reginfo struct)
> -     */
> -    if (arm_current_el(env) == 0 && !(env->cp15.sctlr_el[1] & SCTLR_UCT)) {
> -        return CP_ACCESS_TRAP;
> +    int cur_el = arm_current_el(env);
> +
> +    if (cur_el < 2) {
> +        uint64_t hcr = arm_hcr_el2_eff(env);
> +
> +        if (cur_el == 0) {
> +            if ((hcr & (HCR_E2H | HCR_TGE)) == (HCR_E2H | HCR_TGE)) {
> +                if (!(env->cp15.sctlr_el[2] & SCTLR_UCT)) {
> +                    return CP_ACCESS_TRAP_EL2;
> +                }
> +            } else {
> +                if (!(env->cp15.sctlr_el[1] & SCTLR_UCT)) {
> +                    return CP_ACCESS_TRAP;
> +                }
> +                if (hcr & HCR_TID2) {
> +                    return CP_ACCESS_TRAP_EL2;
> +                }
> +            }
> +        } else if (hcr & HCR_TID2) {
> +            return CP_ACCESS_TRAP_EL2;
> +        }
>      }
>      return CP_ACCESS_OK;
>  }


-- 
Alex Bennée

Re: [PATCH] target/i386: relax assert when old host kernels don't include msrs

2019-12-04 Thread Paolo Bonzini

On 04/12/19 16:47, Eduardo Habkost wrote:
> On Wed, Dec 04, 2019 at 04:34:45PM +0100, Paolo Bonzini wrote:
>> On 04/12/19 16:07, Catherine Ho wrote:
>>>> Ok, so the problem is that some MSR didn't exist in that version.  Which
>>> I thought in my platform, the only MSR didn't exist is MSR_IA32_VMX_BASIC
>>> (0x480). If I remove this kvm_msr_entry_add(), everything is ok, the guest
>>> can be boot up successfully.
>>>
>>
>> MSR_IA32_VMX_BASIC was added in kvm-4.10.  Maybe the issue is the
>> _value_ that is being written to the VM is not valid?  Can you check
>> what's happening in vmx_restore_vmx_basic?
> 
> I believe env->features[FEAT_VMX_BASIC] will be initialized to 0
> if the host kernel doesn't have KVM_CAP_GET_MSR_FEATURES.

But the host must have MSR features if the MSRs are added:

    if (kvm_feature_msrs && cpu_has_vmx(env)) {
        kvm_msr_entry_add_vmx(cpu, env->features);
    }

Looks like feature MSRs were backported to 4.14, but
1389309c811b0c954bf3b591b761d79b1700283d and the previous commit weren't.

Paolo

Re: [PATCH v4 16/40] target/arm: Rearrange ARMMMUIdxBit

2019-12-04 Thread Philippe Mathieu-Daudé


On 12/3/19 3:29 AM, Richard Henderson wrote:

Define via macro expansion, so that renumbering of the base ARMMMUIdx
symbols is automatically reflected in the bit definitions.

Signed-off-by: Richard Henderson 
---
  target/arm/cpu.h | 39 +++
  1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 5f295c7e60..6ba5126852 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -2886,27 +2886,34 @@ typedef enum ARMMMUIdx {
  ARMMMUIdx_Stage1_E1 = 1 | ARM_MMU_IDX_NOTLB,
  } ARMMMUIdx;
  
-/* Bit macros for the core-mmu-index values for each index,

+/*
+ * Bit macros for the core-mmu-index values for each index,
   * for use when calling tlb_flush_by_mmuidx() and friends.
   */
+#define TO_CORE_BIT(NAME) \
+ARMMMUIdxBit_##NAME = 1 << (ARMMMUIdx_##NAME & ARM_MMU_IDX_COREIDX_MASK)
+
  typedef enum ARMMMUIdxBit {
-ARMMMUIdxBit_EL10_0 = 1 << 0,
-ARMMMUIdxBit_EL10_1 = 1 << 1,
-ARMMMUIdxBit_E2 = 1 << 2,
-ARMMMUIdxBit_SE3 = 1 << 3,
-ARMMMUIdxBit_SE0 = 1 << 4,
-ARMMMUIdxBit_SE1 = 1 << 5,
-ARMMMUIdxBit_Stage2 = 1 << 6,
-ARMMMUIdxBit_MUser = 1 << 0,
-ARMMMUIdxBit_MPriv = 1 << 1,
-ARMMMUIdxBit_MUserNegPri = 1 << 2,
-ARMMMUIdxBit_MPrivNegPri = 1 << 3,
-ARMMMUIdxBit_MSUser = 1 << 4,
-ARMMMUIdxBit_MSPriv = 1 << 5,
-ARMMMUIdxBit_MSUserNegPri = 1 << 6,
-ARMMMUIdxBit_MSPrivNegPri = 1 << 7,
+TO_CORE_BIT(EL10_0),
+TO_CORE_BIT(EL10_1),
+TO_CORE_BIT(E2),
+TO_CORE_BIT(SE0),
+TO_CORE_BIT(SE1),
+TO_CORE_BIT(SE3),
+TO_CORE_BIT(Stage2),
+
+TO_CORE_BIT(MUser),
+TO_CORE_BIT(MPriv),
+TO_CORE_BIT(MUserNegPri),
+TO_CORE_BIT(MPrivNegPri),
+TO_CORE_BIT(MSUser),
+TO_CORE_BIT(MSPriv),
+TO_CORE_BIT(MSUserNegPri),
+TO_CORE_BIT(MSPrivNegPri),
  } ARMMMUIdxBit;
  
+#undef TO_CORE_BIT

+
  #define MMU_USER_IDX 0
  
  static inline int arm_to_core_mmu_idx(ARMMMUIdx mmu_idx)




Reviewed-by: Philippe Mathieu-Daudé
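
The macro pattern in this patch can be tried in isolation. The following is a minimal sketch with made-up names (DemoIdx, DemoIdxBit, DEMO_IDX_TYPE_A, DEMO_IDX_COREIDX_MASK, demo_to_core_idx()); only the shape of TO_CORE_BIT follows the patch, and the numeric values are assumptions chosen for illustration.

```c
#include <assert.h>

/* Hypothetical type tag kept in the upper bits of each index value. */
#define DEMO_IDX_TYPE_A        0x10
#define DEMO_IDX_COREIDX_MASK  0x7

typedef enum DemoIdx {
    DemoIdx_User   = 0 | DEMO_IDX_TYPE_A,
    DemoIdx_Kernel = 1 | DEMO_IDX_TYPE_A,
    DemoIdx_Hyp    = 2 | DEMO_IDX_TYPE_A,
} DemoIdx;

/*
 * Derive each bit from the matching index, so renumbering DemoIdx
 * changes the bit values without further edits.
 */
#define TO_CORE_BIT(NAME) \
    DemoIdxBit_##NAME = 1 << (DemoIdx_##NAME & DEMO_IDX_COREIDX_MASK)

typedef enum DemoIdxBit {
    TO_CORE_BIT(User),
    TO_CORE_BIT(Kernel),
    TO_CORE_BIT(Hyp),
} DemoIdxBit;

#undef TO_CORE_BIT

/* Strip the type tag to obtain the core index. */
static inline int demo_to_core_idx(DemoIdx idx)
{
    return idx & DEMO_IDX_COREIDX_MASK;
}

/* Compile-time check that a bit follows its index. */
static_assert(DemoIdxBit_Hyp == (1 << 2), "bit must follow the index");
```

Swapping the numbers of DemoIdx_Kernel and DemoIdx_Hyp would swap the two bit values as well, which is the property the commit message describes.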

Re: [PATCH 04/10] arm: allwinner-h3: add USB host controller

2019-12-04 Thread Aleksandar Markovic

On Monday, December 2, 2019, Niek Linnenbank wrote:

> The Allwinner H3 System on Chip contains multiple USB 2.0 bus
> connections which provide software access using the Enhanced
> Host Controller Interface (EHCI) and Open Host Controller
> Interface (OHCI) interfaces. This commit adds support for
> both interfaces in the Allwinner H3 System on Chip.
>
> Signed-off-by: Niek Linnenbank 
> ---


Niek, hi!

I would like to clarify a detail here:

The spec of the SoC enumerates (in 8.5.2.4. USB Host Register List) a
number of registers for reading various USB-related states, but also for
setting some of USB features.

Does this series cover these registers, and interaction with them? If yes,
how and where? If not, do you think it is not necessary at all? Or perhaps
that it is a non-crucial limitation of this series?

Thanks in advance, and congrats on what seems to be your first submission!

Aleksandar


 hw/arm/allwinner-h3.c| 20 
>  hw/usb/hcd-ehci-sysbus.c | 17 +
>  hw/usb/hcd-ehci.h|  1 +
>  3 files changed, 38 insertions(+)
>
> diff --git a/hw/arm/allwinner-h3.c b/hw/arm/allwinner-h3.c
> index 5566e979ec..afeb49c0ac 100644
> --- a/hw/arm/allwinner-h3.c
> +++ b/hw/arm/allwinner-h3.c
> @@ -26,6 +26,7 @@
>  #include "hw/sysbus.h"
>  #include "hw/arm/allwinner-h3.h"
>  #include "hw/misc/unimp.h"
> +#include "hw/usb/hcd-ehci.h"
>  #include "sysemu/sysemu.h"
>
>  static void aw_h3_init(Object *obj)
> @@ -183,6 +184,25 @@ static void aw_h3_realize(DeviceState *dev, Error **errp)
>      }
>      sysbus_mmio_map(SYS_BUS_DEVICE(&s->ccu), 0, AW_H3_CCU_BASE);
>
> +    /* Universal Serial Bus */
> +    sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI0_BASE,
> +                         s->irq[AW_H3_GIC_SPI_EHCI0]);
> +    sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI1_BASE,
> +                         s->irq[AW_H3_GIC_SPI_EHCI1]);
> +    sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI2_BASE,
> +                         s->irq[AW_H3_GIC_SPI_EHCI2]);
> +    sysbus_create_simple(TYPE_AW_H3_EHCI, AW_H3_EHCI3_BASE,
> +                         s->irq[AW_H3_GIC_SPI_EHCI3]);
> +
> +    sysbus_create_simple("sysbus-ohci", AW_H3_OHCI0_BASE,
> +                         s->irq[AW_H3_GIC_SPI_OHCI0]);
> +    sysbus_create_simple("sysbus-ohci", AW_H3_OHCI1_BASE,
> +                         s->irq[AW_H3_GIC_SPI_OHCI1]);
> +    sysbus_create_simple("sysbus-ohci", AW_H3_OHCI2_BASE,
> +                         s->irq[AW_H3_GIC_SPI_OHCI2]);
> +    sysbus_create_simple("sysbus-ohci", AW_H3_OHCI3_BASE,
> +                         s->irq[AW_H3_GIC_SPI_OHCI3]);
> +
>      /* UART */
>      if (serial_hd(0)) {
>          serial_mm_init(get_system_memory(), AW_H3_UART0_REG_BASE, 2,
> diff --git a/hw/usb/hcd-ehci-sysbus.c b/hw/usb/hcd-ehci-sysbus.c
> index 020211fd10..174c3446ef 100644
> --- a/hw/usb/hcd-ehci-sysbus.c
> +++ b/hw/usb/hcd-ehci-sysbus.c
> @@ -145,6 +145,22 @@ static const TypeInfo ehci_exynos4210_type_info = {
>      .class_init    = ehci_exynos4210_class_init,
>  };
>
> +static void ehci_aw_h3_class_init(ObjectClass *oc, void *data)
> +{
> +    SysBusEHCIClass *sec = SYS_BUS_EHCI_CLASS(oc);
> +    DeviceClass *dc = DEVICE_CLASS(oc);
> +
> +    sec->capsbase = 0x0;
> +    sec->opregbase = 0x10;
> +    set_bit(DEVICE_CATEGORY_USB, dc->categories);
> +}
> +
> +static const TypeInfo ehci_aw_h3_type_info = {
> +    .name          = TYPE_AW_H3_EHCI,
> +    .parent        = TYPE_SYS_BUS_EHCI,
> +    .class_init    = ehci_aw_h3_class_init,
> +};
> +
>  static void ehci_tegra2_class_init(ObjectClass *oc, void *data)
>  {
>      SysBusEHCIClass *sec = SYS_BUS_EHCI_CLASS(oc);
> @@ -267,6 +283,7 @@ static void ehci_sysbus_register_types(void)
>      type_register_static(&ehci_platform_type_info);
>      type_register_static(&ehci_xlnx_type_info);
>      type_register_static(&ehci_exynos4210_type_info);
> +    type_register_static(&ehci_aw_h3_type_info);
>      type_register_static(&ehci_tegra2_type_info);
>      type_register_static(&ehci_ppc4xx_type_info);
>      type_register_static(&ehci_fusbh200_type_info);
> diff --git a/hw/usb/hcd-ehci.h b/hw/usb/hcd-ehci.h
> index 0298238f0b..edb59311c4 100644
> --- a/hw/usb/hcd-ehci.h
> +++ b/hw/usb/hcd-ehci.h
> @@ -342,6 +342,7 @@ typedef struct EHCIPCIState {
>  #define TYPE_SYS_BUS_EHCI "sysbus-ehci-usb"
>  #define TYPE_PLATFORM_EHCI "platform-ehci-usb"
>  #define TYPE_EXYNOS4210_EHCI "exynos4210-ehci-usb"
> +#define TYPE_AW_H3_EHCI "aw-h3-ehci-usb"
>  #define TYPE_TEGRA2_EHCI "tegra2-ehci-usb"
>  #define TYPE_PPC4xx_EHCI "ppc4xx-ehci-usb"
>  #define TYPE_FUSBH200_EHCI "fusbh200-ehci-usb"
> --
> 2.17.1
>
>
>

Re: [PATCH v2 3/7] iotests: Skip test 079 if it is not possible to create large files

2019-12-04 Thread Philippe Mathieu-Daudé


On 12/4/19 4:46 PM, Thomas Huth wrote:

Test 079 fails in the arm64, s390x and ppc64le LXD containers on Travis
(which we will hopefully enable in our CI soon). These containers
apparently do not allow large files to be created. Test 079 tries to
create a 4G sparse file, which is apparently already too big for these
containers, so check first whether we can really create such files before
executing the test.

Signed-off-by: Thomas Huth 
---
  tests/qemu-iotests/079 | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/079 b/tests/qemu-iotests/079
index 81f0c21f53..78536d3bbf 100755
--- a/tests/qemu-iotests/079
+++ b/tests/qemu-iotests/079
@@ -39,6 +39,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
  _supported_fmt qcow2
  _supported_proto file nfs
  
+# Some containers (e.g. non-x86 on Travis) do not allow large files

+_require_large_file 4G
+
  echo "=== Check option preallocation and cluster_size ==="
  echo
  cluster_sizes="16384 32768 65536 131072 262144 524288 1048576 2097152 4194304"



Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH v4 14/40] target/arm: Recover 4 bits from TBFLAGs

2019-12-04 Thread Alex Bennée



Richard Henderson  writes:

> On 12/4/19 3:43 AM, Alex Bennée wrote:

>>>  void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns)
>>>  {
>>> -DisasContext dc;
>>> +DisasContext dc = { };
>> 
>> We seemed to have dropped an initialise here which seems unrelated.
>
> Added, not dropped.

But is it related to this patch or fixing another bug?


-- 
Alex Bennée

Re: [PATCH] target/i386: relax assert when old host kernels don't include msrs

2019-12-04 Thread Eduardo Habkost

On Wed, Dec 04, 2019 at 04:34:45PM +0100, Paolo Bonzini wrote:
> On 04/12/19 16:07, Catherine Ho wrote:
> >> Ok, so the problem is that some MSR didn't exist in that version.  Which
> > I thought in my platform, the only MSR didn't exist is MSR_IA32_VMX_BASIC
> > (0x480). If I remove this kvm_msr_entry_add(), everything is ok, the guest can
> > be boot up successfully.
> > 
> 
> MSR_IA32_VMX_BASIC was added in kvm-4.10.  Maybe the issue is the
> _value_ that is being written to the VM is not valid?  Can you check
> what's happening in vmx_restore_vmx_basic?

I believe env->features[FEAT_VMX_BASIC] will be initialized to 0
if the host kernel doesn't have KVM_CAP_GET_MSR_FEATURES.

-- 
Eduardo

Re: [PATCH v2 2/7] iotests: Skip test 060 if it is not possible to create large files

2019-12-04 Thread Philippe Mathieu-Daudé


On 12/4/19 4:46 PM, Thomas Huth wrote:

Test 060 fails in the arm64, s390x and ppc64le LXD containers on Travis
(which we will hopefully enable in our CI soon). These containers
apparently do not allow large files to be created. The repair process
in test 060 creates a file of 64 GiB, so test first whether such large
files are possible and skip the test if that's not the case.

Signed-off-by: Thomas Huth 
---
  tests/qemu-iotests/060 | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index b91d8321bb..d96f17a484 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -49,6 +49,9 @@ _supported_fmt qcow2
  _supported_proto file
  _supported_os Linux
  
+# The repair process will create a large file - so check for availability first
+_require_large_file 64G
+
  rt_offset=65536  # 0x10000 (XXX: just an assumption)
  rb_offset=131072 # 0x20000 (XXX: just an assumption)
  l1_offset=196608 # 0x30000 (XXX: just an assumption)



Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH v2 1/7] iotests: Provide a function for checking the creation of huge files

2019-12-04 Thread Philippe Mathieu-Daudé


On 12/4/19 4:46 PM, Thomas Huth wrote:

Some tests create huge (but sparse) files, and to be able to run those
tests in certain limited environments (like CI containers), we have to
check for the possibility to create such files first. Thus let's introduce
a common function to check for large files, and replace the already
existing checks in the iotests 005 and 220 with this function.

Reviewed-by: Alex Bennée 
Signed-off-by: Thomas Huth 
---
  tests/qemu-iotests/005   |  5 +
  tests/qemu-iotests/220   |  6 ++
  tests/qemu-iotests/common.rc | 10 ++
  3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/005 b/tests/qemu-iotests/005
index 58442762fe..b6d03ac37d 100755
--- a/tests/qemu-iotests/005
+++ b/tests/qemu-iotests/005
@@ -59,10 +59,7 @@ fi
  # Sanity check: For raw, we require a file system that permits the creation
  # of a HUGE (but very sparse) file. Check we can create it before continuing.
  if [ "$IMGFMT" = "raw" ]; then
-    if ! truncate --size=5T "$TEST_IMG"; then
-        _notrun "file system on $TEST_DIR does not support large enough files"
-    fi
-    rm "$TEST_IMG"
+    _require_large_file 5T
  fi
  
  echo

diff --git a/tests/qemu-iotests/220 b/tests/qemu-iotests/220
index 2d62c5dcac..15159270d3 100755
--- a/tests/qemu-iotests/220
+++ b/tests/qemu-iotests/220
@@ -42,10 +42,8 @@ echo "== Creating huge file =="
  
  # Sanity check: We require a file system that permits the creation
  # of a HUGE (but very sparse) file.  tmpfs works, ext4 does not.
-if ! truncate --size=513T "$TEST_IMG"; then
-    _notrun "file system on $TEST_DIR does not support large enough files"
-fi
-rm "$TEST_IMG"
+_require_large_file 513T
+
  IMGOPTS='cluster_size=2M,refcount_bits=1' _make_test_img 513T
  
  echo "== Populating refcounts =="

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 0cc8acc9ed..6f0582c79a 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -643,5 +643,15 @@ _require_drivers()
  done
  }
  
+# Check that we have a file system that allows huge (but very sparse) files
+#
+_require_large_file()
+{
+    if ! truncate --size="$1" "$TEST_IMG"; then
+        _notrun "file system on $TEST_DIR does not support large enough files"
+    fi
+    rm "$TEST_IMG"
+}


:)

Reviewed-by: Philippe Mathieu-Daudé 


+
  # make sure this script returns success
  true
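
For readers who want to try the probe outside the iotests harness, here is a minimal standalone sketch of the same idea as the _require_large_file helper quoted above. The function name can_create_sparse_file and the mktemp-based probe file are assumptions made for illustration; the real helper uses $TEST_IMG and _notrun from common.rc. It also assumes GNU coreutils truncate, as the patch does.

```shell
#!/bin/sh
# Hypothetical standalone probe, not part of qemu-iotests.
# Returns 0 if a sparse file of size $1 can be created in directory $2.
can_create_sparse_file()
{
    size="$1"
    dir="$2"
    # Use a unique probe file so nothing existing is overwritten.
    probe=$(mktemp "$dir/large-file-probe.XXXXXX") || return 1
    if truncate --size="$size" "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    rm -f "$probe"
    return 1
}

if can_create_sparse_file 4G "${TMPDIR:-/tmp}"; then
    echo "large sparse files are supported"
else
    echo "skipping: large sparse files are not supported"
fi
```

The probe only extends the file length, so no data blocks should be written; whether it succeeds depends on the file system and on any limits the container imposes.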

[PATCH v2 6/7] configure: allow disable of cross compilation containers

2019-12-04 Thread Thomas Huth

From: Alex Bennée 

Our docker infrastructure isn't quite as multiarch as we would wish, so
let's allow the user to disable it if they want. This will allow us to
still run check-tcg on non-x86 CI setups.

Signed-off-by: Alex Bennée 
Reviewed-by: Stefan Weil 
Signed-off-by: Thomas Huth 
---
 configure  | 8 +++-
 tests/tcg/configure.sh | 6 --
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index 6099be1d84..fe6d0971f1 100755
--- a/configure
+++ b/configure
@@ -302,6 +302,7 @@ audio_win_int=""
 libs_qga=""
 debug_info="yes"
 stack_protector=""
+use_containers="yes"
 
 if test -e "$source_path/.git"
 then
@@ -1539,6 +1540,10 @@ for opt do
   ;;
   --disable-plugins) plugins="no"
   ;;
+  --enable-containers) use_containers="yes"
+  ;;
+  --disable-containers) use_containers="no"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1722,6 +1727,7 @@ Advanced options (experts only):
track the maximum stack usage of stacks created by 
qemu_alloc_stack
   --enable-plugins
enable plugins via shared library loading
+  --disable-containers don't use containers for cross-building
 
 Optional features, enabled with --enable-FEATURE and
 disabled with --disable-FEATURE, default is enabled if available:
@@ -8039,7 +8045,7 @@ done
 (for i in $cross_cc_vars; do
   export $i
 done
-export target_list source_path
+export target_list source_path use_containers
 $source_path/tests/tcg/configure.sh)
 
 # temporary config to build submodules
diff --git a/tests/tcg/configure.sh b/tests/tcg/configure.sh
index 6c4a471aea..210e68396f 100755
--- a/tests/tcg/configure.sh
+++ b/tests/tcg/configure.sh
@@ -36,8 +36,10 @@ TMPC="${TMPDIR1}/qemu-conf.c"
 TMPE="${TMPDIR1}/qemu-conf.exe"
 
 container="no"
-if has "docker" || has "podman"; then
-  container=$($python $source_path/tests/docker/docker.py probe)
+if test $use_containers = "yes"; then
+    if has "docker" || has "podman"; then
+        container=$($python $source_path/tests/docker/docker.py probe)
+    fi
 fi
 
 # cross compilers defaults, can be overridden with --cross-cc-ARCH
-- 
2.18.1
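
The gating logic in this patch can be sketched in isolation as follows. This is only an illustration under stated assumptions: parse_container_opt and want_container_probe are hypothetical names, and has is re-implemented here with command -v rather than taken from the QEMU scripts. The sketch quotes "$use_containers" so that the test still behaves if the variable is empty or unset, which could matter if tests/tcg/configure.sh were ever run without the variable exported by configure.

```shell
#!/bin/sh
# Illustrative only: mirrors the shape of the option handling in the patch.
use_containers="yes"

# Hypothetical helper: update use_containers from one command-line option.
parse_container_opt()
{
    case "$1" in
    --enable-containers) use_containers="yes" ;;
    --disable-containers) use_containers="no" ;;
    esac
}

# Stand-in for the scripts' has(): true if the command is on PATH.
has()
{
    command -v "$1" >/dev/null 2>&1
}

# Prints "yes" when a container runtime may be probed, "no" otherwise.
want_container_probe()
{
    if test "$use_containers" = "yes" && { has docker || has podman; }; then
        echo "yes"
    else
        echo "no"
    fi
}

for opt in "$@"; do
    parse_container_opt "$opt"
done
echo "container probing: $(want_container_probe)"
```

Running the sketch with --disable-containers reports "no" regardless of whether docker or podman is installed, which is the behaviour the patch is after for non-x86 CI hosts.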

[PATCH v2 5/7] tests/test-util-filemonitor: Skip test on non-x86 Travis containers

2019-12-04 Thread Thomas Huth

test-util-filemonitor fails in restricted non-x86 Travis containers
since they apparently blacklisted some required system calls there.
Let's simply skip the test if we detect such an environment.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Alex Bennée 
Signed-off-by: Thomas Huth 
---
 tests/test-util-filemonitor.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tests/test-util-filemonitor.c b/tests/test-util-filemonitor.c
index 301cd2db61..45009c69f4 100644
--- a/tests/test-util-filemonitor.c
+++ b/tests/test-util-filemonitor.c
@@ -406,10 +406,21 @@ test_file_monitor_events(void)
     char *pathdst = NULL;
     QFileMonitorTestData data;
     GHashTable *ids = g_hash_table_new(g_int64_hash, g_int64_equal);
+    char *travis_arch;
 
     qemu_mutex_init(&data.lock);
     data.records = NULL;
 
+    /*
+     * This test does not work on Travis LXD containers since some
+     * syscalls are blocked in that environment.
+     */
+    travis_arch = getenv("TRAVIS_ARCH");
+    if (travis_arch && !g_str_equal(travis_arch, "x86_64")) {
+        g_test_skip("Test does not work on non-x86 Travis containers.");
+        return;
+    }
+
     /*
      * The file monitor needs the main loop running in
      * order to receive events from inotify. We must
-- 
2.18.1
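
The skip condition added by this patch can be restated as a small shell predicate, which may help when reasoning about which environments are affected. The function name should_skip_filemonitor is hypothetical. The sketch treats a set-but-empty TRAVIS_ARCH the same way the C code does: getenv() returns a non-NULL pointer for it, so the test is skipped.

```shell
#!/bin/sh
# Hypothetical predicate mirroring the C check in the patch.
# Returns 0 (skip) when TRAVIS_ARCH is set and is not "x86_64".
should_skip_filemonitor()
{
    # ${TRAVIS_ARCH+set} expands to "set" only if the variable is set,
    # even when it is empty, matching a non-NULL getenv() result.
    test "${TRAVIS_ARCH+set}" = "set" && test "$TRAVIS_ARCH" != "x86_64"
}

if should_skip_filemonitor; then
    echo "skip: non-x86 Travis container"
else
    echo "run: not a non-x86 Travis container"
fi
```

Outside Travis the variable is normally unset, so the predicate is false and the test runs as before.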

[PATCH v2 2/7] iotests: Skip test 060 if it is not possible to create large files

2019-12-04 Thread Thomas Huth

Test 060 fails in the arm64, s390x and ppc64le LXD containers on Travis
(which we will hopefully enable in our CI soon). These containers
apparently do not allow large files to be created. The repair process
in test 060 creates a file of 64 GiB, so test first whether such large
files are possible and skip the test if that's not the case.

Signed-off-by: Thomas Huth 
---
 tests/qemu-iotests/060 | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index b91d8321bb..d96f17a484 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -49,6 +49,9 @@ _supported_fmt qcow2
 _supported_proto file
 _supported_os Linux
 
+# The repair process will create a large file - so check for availability first
+_require_large_file 64G
+
 rt_offset=65536  # 0x10000 (XXX: just an assumption)
 rb_offset=131072 # 0x20000 (XXX: just an assumption)
 l1_offset=196608 # 0x30000 (XXX: just an assumption)
-- 
2.18.1
