Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-29 Thread Yongji Xie

On 2016/6/30 10:53, Alex Williamson wrote:


On Thu, 30 Jun 2016 10:40:23 +0800
Yongji Xie  wrote:


Hi Alex,

On 2016/6/30 4:03, Alex Williamson wrote:


On Tue, 28 Jun 2016 13:47:23 -0600
Alex Williamson  wrote:
  

On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie  wrote:
  

Hi, Alex

On 2016/6/25 0:43, Alex Williamson wrote:
  

On Fri, 24 Jun 2016 23:37:02 +0800
Yongji Xie  wrote:
   

Hi, Alex

On 2016/6/24 11:37, Alex Williamson wrote:
   

On Fri, 24 Jun 2016 10:52:58 +0800
Yongji Xie  wrote:

On 2016/6/24 0:12, Alex Williamson wrote:

On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}

Isn't it true that request_resource() only tells us that, at a given
point in time, no other driver has reserved that resource?  It does
not seem to guarantee that the resource isn't routed to another
device, or that another driver won't at some point attempt to request
that same resource.  For example, if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed at the expense of drivers loaded later,
which would then fail.  The behavior will depend on driver load order,
and we're not actually ensuring that the overflow resource is unused,
just that we got it first.  Can we do better?  Am I missing something
that prevents this?  Thanks,

Alex
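Alex's concern can be illustrated with a toy userspace model. The `toy_*` names below are hypothetical; the real `request_resource()` checks the new range against the existing children of a `struct resource` parent (and against the parent's bounds, which this sketch ignores), but the first-come outcome is the same: whichever requester claims an overlapping range first wins, regardless of which one should own it.

```c
#include <stdbool.h>

/* Hypothetical toy range; end is inclusive, like struct resource. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

#define TOY_MAX_CLAIMS 8

/* Flat list of claims under one parent window (a toy resource tree). */
struct toy_parent {
	struct toy_range claims[TOY_MAX_CLAIMS];
	int nr;
};

static bool toy_overlaps(struct toy_range a, struct toy_range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/*
 * Toy model of request_resource(): succeed only if nothing already
 * claimed overlaps the requested range.  It records who asked first,
 * not who should own the range.
 */
static bool toy_request(struct toy_parent *p, struct toy_range r)
{
	for (int i = 0; i < p->nr; i++)
		if (toy_overlaps(p->claims[i], r))
			return false;
	if (p->nr == TOY_MAX_CLAIMS)
		return false;
	p->claims[p->nr++] = r;
	return true;
}

/* Claim @first, then report whether @second can still be claimed. */
static bool toy_second_succeeds(struct toy_range first,
				struct toy_range second)
{
	struct toy_parent p = { .nr = 0 };

	toy_request(&p, first);
	return toy_request(&p, second);
}
```

With a dummy range covering the tail of a page and another driver's range inside that tail, the second requester fails in either order, which is the load-order dependence described above.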

Couldn't the PCI resource allocator prevent this? It first finds an
empty slot in the resource tree and then requests that resource in
allocate_resource() when a PCI device is probed.
Also, why would a PCI device driver attempt to call
request_resource()? Shouldn't that be done during PCI enumeration?

Hi Yongji,

It looks like most PCI drivers call pci_request_regions().  From there the
call path is:

pci_request_selected_regions
  __pci_request_selected_regions
    __pci_request_region
      __request_mem_region
        __request_region
          __request_resource

We sometimes see this driver ordering issue with users attempting to
blacklist native PCI drivers to leave a device free for use by
vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
driver can request the device's resources, causing a failure when vfio-pci
tries to request those same resources.  I expect that, unless it's a
boot device like the VGA device in my example, the resources are not
enabled until the driver opens the device, so the request_resource()
call doesn't occur until that point.

For another trivial example, look at /proc/iomem as you load and unload
a driver.  On my laptop, with e1000e unloaded, I see:

  e1200000-e121ffff : 0000:00:19.0
  e123e000-e123efff : 0000:00:19.0

When e1000e is loaded, each of these becomes claimed by the e1000e
driver:

  e1200000-e121ffff : 0000:00:19.0
    e1200000-e121ffff : e1000e
  e123e000-e123efff : 0000:00:19.0
    e123e000-e123efff : e1000e

Clearly the PCI core knows the resource is associated with the device, but
I don't think we're tapping into that with request_resource(); we're
just potentially stealing resou

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-29 Thread Alex Williamson
On Thu, 30 Jun 2016 10:40:23 +0800
Yongji Xie  wrote:

> Hi Alex,
> 
> On 2016/6/30 4:03, Alex Williamson wrote:
> 
> > On Tue, 28 Jun 2016 13:47:23 -0600
> > Alex Williamson  wrote:
> >  
> >> On Tue, 28 Jun 2016 18:09:46 +0800
> >> Yongji Xie  wrote:
> >>  
> >>> Hi, Alex
> >>>
> >>> On 2016/6/25 0:43, Alex Williamson wrote:
> >>>  
>  On Fri, 24 Jun 2016 23:37:02 +0800
>  Yongji Xie  wrote:
>    
> > Hi, Alex
> >
> > On 2016/6/24 11:37, Alex Williamson wrote:
> >   
> >> On Fri, 24 Jun 2016 10:52:58 +0800
> >> Yongji Xie  wrote:  
> >>> On 2016/6/24 0:12, Alex Williamson wrote:  
>  On Mon, 30 May 2016 21:06:37 +0800
>  Yongji Xie  wrote:  
> > +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > +{
> > +	struct resource *res;
> > +	int bar;
> > +	struct vfio_pci_dummy_resource *dummy_res;
> > +
> > +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > +
> > +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > +		res = vdev->pdev->resource + bar;
> > +
> > +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > +			goto no_mmap;
> > +
> > +		if (!(res->flags & IORESOURCE_MEM))
> > +			goto no_mmap;
> > +
> > +		/*
> > +		 * The PCI core shouldn't set up a resource with a
> > +		 * type but zero size. But there may be bugs that
> > +		 * cause us to do that.
> > +		 */
> > +		if (!resource_size(res))
> > +			goto no_mmap;
> > +
> > +		if (resource_size(res) >= PAGE_SIZE) {
> > +			vdev->bar_mmap_supported[bar] = true;
> > +			continue;
> > +		}
> > +
> > +		if (!(res->start & ~PAGE_MASK)) {
> > +			/*
> > +			 * Add a dummy resource to reserve the remainder
> > +			 * of the exclusive page in case a hot-added
> > +			 * device's BAR is assigned into it.
> > +			 */
> > +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > +			if (dummy_res == NULL)
> > +				goto no_mmap;
> > +
> > +			dummy_res->resource.start = res->end + 1;
> > +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > +			dummy_res->resource.flags = res->flags;
> > +			if (request_resource(res->parent,
> > +					&dummy_res->resource)) {
> > +				kfree(dummy_res);
> > +				goto no_mmap;
> > +			}
>  Isn't it true that request_resource() only tells us that, at a given
>  point in time, no other driver has reserved that resource?  It does
>  not seem to guarantee that the resource isn't routed to another
>  device, or that another driver won't at some point attempt to request
>  that same resource.  For example, if a user constructs their initrd
>  to bind vfio-pci to devices before other modules load, this
>  request_resource() may succeed at the expense of drivers loaded later,
>  which would then fail.  The behavior will depend on driver load order,
>  and we're not actually ensuring that the overflow resource is unused,
>  just that we got it first.  Can we do better?  Am I missing something
>  that prevents this?  Thanks,
> 
>  Alex  
> >>> Couldn't the PCI resource allocator prevent this? It first finds an
> >>> empty slot in the resource tree and then requests that resource in
> >>> allocate_resource() when a PCI device is probed.
> >>> Also, why would a PCI device driver attempt to call
> >>> request_resource()? Shouldn't that be done during PCI enumeration?
> >> Hi Yongji,
> >>
> >> It looks like most PCI drivers call pci_request_regions().  From there the
> >> call path is:
> >>
> >> pci_request_selected_regions
> >>   __pci_request_selected_regions
> >>     __pci_request_region
> >>       __request_mem_region
> >>         __request_region
> >>           __request_resource
> >>
> >> We sometimes see this driver ordering issue with users attempting to
> >> blacklist native PCI drivers to leave a device free for use by
> >> vfio-pci.  If the device is a graphics card, the generic VESA or

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-29 Thread Yongji Xie

Hi Alex,

On 2016/6/30 4:03, Alex Williamson wrote:


On Tue, 28 Jun 2016 13:47:23 -0600
Alex Williamson  wrote:


On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie  wrote:


Hi, Alex

On 2016/6/25 0:43, Alex Williamson wrote:
   

On Fri, 24 Jun 2016 23:37:02 +0800
Yongji Xie  wrote:


Hi, Alex

On 2016/6/24 11:37, Alex Williamson wrote:


On Fri, 24 Jun 2016 10:52:58 +0800
Yongji Xie  wrote:

On 2016/6/24 0:12, Alex Williamson wrote:

On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}

Isn't it true that request_resource() only tells us that, at a given
point in time, no other driver has reserved that resource?  It does
not seem to guarantee that the resource isn't routed to another
device, or that another driver won't at some point attempt to request
that same resource.  For example, if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed at the expense of drivers loaded later,
which would then fail.  The behavior will depend on driver load order,
and we're not actually ensuring that the overflow resource is unused,
just that we got it first.  Can we do better?  Am I missing something
that prevents this?  Thanks,

Alex
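The page arithmetic in the quoted hunk can be checked outside the kernel. The sketch below is a hypothetical userspace mirror of it, assuming 4 KiB pages (`TOY_PAGE_SIZE`) instead of the kernel's `PAGE_SIZE`; treating a sub-page BAR that does not start on a page boundary as not mmap-able is also an assumption, since the hunk is cut off before that branch.

```c
/* Assumed 4 KiB pages for illustration; the kernel uses PAGE_SIZE. */
#define TOY_PAGE_SIZE 4096UL
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

enum toy_bar_mmap {
	TOY_NO_MMAP,		/* leave the BAR un-mmap-able */
	TOY_MMAP_DIRECT,	/* BAR already spans whole pages */
	TOY_MMAP_WITH_DUMMY,	/* sub-page BAR; reserve the page tail */
};

/*
 * Classify a memory BAR of @size bytes at @start, following the checks
 * in the quoted hunk.  For TOY_MMAP_WITH_DUMMY, *@dstart and *@dend
 * receive the inclusive range the dummy resource would cover.
 */
static enum toy_bar_mmap toy_classify_bar(unsigned long start,
					  unsigned long size,
					  unsigned long *dstart,
					  unsigned long *dend)
{
	if (size == 0)
		return TOY_NO_MMAP;		/* bogus zero-sized resource */
	if (size >= TOY_PAGE_SIZE)
		return TOY_MMAP_DIRECT;
	if (start & ~TOY_PAGE_MASK)
		return TOY_NO_MMAP;		/* assumption: unaligned sub-page BAR */

	*dstart = start + size;			/* res->end + 1 */
	*dend = start + TOY_PAGE_SIZE - 1;	/* res->start + PAGE_SIZE - 1 */
	return TOY_MMAP_WITH_DUMMY;
}
```

For a 0x100-byte BAR at 0xe0000000, the dummy range comes out as 0xe0000100-0xe0000fff, i.e. `res->end + 1` through `res->start + PAGE_SIZE - 1`.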

Couldn't the PCI resource allocator prevent this? It first finds an
empty slot in the resource tree and then requests that resource in
allocate_resource() when a PCI device is probed.
Also, why would a PCI device driver attempt to call
request_resource()? Shouldn't that be done during PCI enumeration?

Hi Yongji,

It looks like most PCI drivers call pci_request_regions().  From there the
call path is:

pci_request_selected_regions
  __pci_request_selected_regions
    __pci_request_region
      __request_mem_region
        __request_region
          __request_resource

We sometimes see this driver ordering issue with users attempting to
blacklist native PCI drivers to leave a device free for use by
vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
driver can request the device's resources, causing a failure when vfio-pci
tries to request those same resources.  I expect that, unless it's a
boot device like the VGA device in my example, the resources are not
enabled until the driver opens the device, so the request_resource()
call doesn't occur until that point.

For another trivial example, look at /proc/iomem as you load and unload
a driver.  On my laptop, with e1000e unloaded, I see:

  e1200000-e121ffff : 0000:00:19.0
  e123e000-e123efff : 0000:00:19.0

When e1000e is loaded, each of these becomes claimed by the e1000e
driver:

  e1200000-e121ffff : 0000:00:19.0
    e1200000-e121ffff : e1000e
  e123e000-e123efff : 0000:00:19.0
    e123e000-e123efff : e1000e

Clearly the PCI core knows the resource is associated with the device, but
I don't think we're tapping into that with request_resource(); we're
just potentially stealing resources that another driver might have
claimed otherwise, as I described above.  That's my suspicion at
least; feel free to show

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-29 Thread Alex Williamson
On Tue, 28 Jun 2016 13:47:23 -0600
Alex Williamson  wrote:

> On Tue, 28 Jun 2016 18:09:46 +0800
> Yongji Xie  wrote:
> 
> > Hi, Alex
> > 
> > On 2016/6/25 0:43, Alex Williamson wrote:
> >   
> > > On Fri, 24 Jun 2016 23:37:02 +0800
> > > Yongji Xie  wrote:
> > >
> > >> Hi, Alex
> > >>
> > >> On 2016/6/24 11:37, Alex Williamson wrote:
> > >>
> > >>> On Fri, 24 Jun 2016 10:52:58 +0800
> > >>> Yongji Xie  wrote:
> >  On 2016/6/24 0:12, Alex Williamson wrote:
> > > On Mon, 30 May 2016 21:06:37 +0800
> > > Yongji Xie  wrote:
> > >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > >> +{
> > >> +	struct resource *res;
> > >> +	int bar;
> > >> +	struct vfio_pci_dummy_resource *dummy_res;
> > >> +
> > >> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > >> +
> > >> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > >> +		res = vdev->pdev->resource + bar;
> > >> +
> > >> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > >> +			goto no_mmap;
> > >> +
> > >> +		if (!(res->flags & IORESOURCE_MEM))
> > >> +			goto no_mmap;
> > >> +
> > >> +		/*
> > >> +		 * The PCI core shouldn't set up a resource with a
> > >> +		 * type but zero size. But there may be bugs that
> > >> +		 * cause us to do that.
> > >> +		 */
> > >> +		if (!resource_size(res))
> > >> +			goto no_mmap;
> > >> +
> > >> +		if (resource_size(res) >= PAGE_SIZE) {
> > >> +			vdev->bar_mmap_supported[bar] = true;
> > >> +			continue;
> > >> +		}
> > >> +
> > >> +		if (!(res->start & ~PAGE_MASK)) {
> > >> +			/*
> > >> +			 * Add a dummy resource to reserve the remainder
> > >> +			 * of the exclusive page in case a hot-added
> > >> +			 * device's BAR is assigned into it.
> > >> +			 */
> > >> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > >> +			if (dummy_res == NULL)
> > >> +				goto no_mmap;
> > >> +
> > >> +			dummy_res->resource.start = res->end + 1;
> > >> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > >> +			dummy_res->resource.flags = res->flags;
> > >> +			if (request_resource(res->parent,
> > >> +					&dummy_res->resource)) {
> > >> +				kfree(dummy_res);
> > >> +				goto no_mmap;
> > >> +			}
> > > Isn't it true that request_resource() only tells us that, at a given
> > > point in time, no other driver has reserved that resource?  It does
> > > not seem to guarantee that the resource isn't routed to another
> > > device, or that another driver won't at some point attempt to request
> > > that same resource.  For example, if a user constructs their initrd
> > > to bind vfio-pci to devices before other modules load, this
> > > request_resource() may succeed at the expense of drivers loaded later,
> > > which would then fail.  The behavior will depend on driver load order,
> > > and we're not actually ensuring that the overflow resource is unused,
> > > just that we got it first.  Can we do better?  Am I missing something
> > > that prevents this?  Thanks,
> > >
> > > Alex
> >  Couldn't the PCI resource allocator prevent this? It first finds an
> >  empty slot in the resource tree and then requests that resource in
> >  allocate_resource() when a PCI device is probed.
> >  Also, why would a PCI device driver attempt to call
> >  request_resource()? Shouldn't that be done during PCI enumeration?
> > >>> Hi Yongji,
> > >>>
> > >>> It looks like most PCI drivers call pci_request_regions().  From there the
> > >>> call path is:
> > >>>
> > >>> pci_request_selected_regions
> > >>>   __pci_request_selected_regions
> > >>>     __pci_request_region
> > >>>       __request_mem_region
> > >>>         __request_region
> > >>>           __request_resource
> > >>>
> > >>> We sometimes see this driver ordering issue with users attempting to
> > >>> blacklist native PCI drivers to leave a device free for use by
> > >>> vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
> > >>> driver can request the device's resources, causing a failure when vfio-pci
> > >>> tries to request those same resources.  I expect that, unless it's a
> > >>> boot device like the VGA device in my example, the resources are not enabl

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-28 Thread Alex Williamson
On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie  wrote:

> Hi, Alex
> 
> On 2016/6/25 0:43, Alex Williamson wrote:
> 
> > On Fri, 24 Jun 2016 23:37:02 +0800
> > Yongji Xie  wrote:
> >  
> >> Hi, Alex
> >>
> >> On 2016/6/24 11:37, Alex Williamson wrote:
> >>  
> >>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>> Yongji Xie  wrote:  
>  On 2016/6/24 0:12, Alex Williamson wrote:  
> > On Mon, 30 May 2016 21:06:37 +0800
> > Yongji Xie  wrote:  
> >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> +{
> >> +	struct resource *res;
> >> +	int bar;
> >> +	struct vfio_pci_dummy_resource *dummy_res;
> >> +
> >> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >> +
> >> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >> +		res = vdev->pdev->resource + bar;
> >> +
> >> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >> +			goto no_mmap;
> >> +
> >> +		if (!(res->flags & IORESOURCE_MEM))
> >> +			goto no_mmap;
> >> +
> >> +		/*
> >> +		 * The PCI core shouldn't set up a resource with a
> >> +		 * type but zero size. But there may be bugs that
> >> +		 * cause us to do that.
> >> +		 */
> >> +		if (!resource_size(res))
> >> +			goto no_mmap;
> >> +
> >> +		if (resource_size(res) >= PAGE_SIZE) {
> >> +			vdev->bar_mmap_supported[bar] = true;
> >> +			continue;
> >> +		}
> >> +
> >> +		if (!(res->start & ~PAGE_MASK)) {
> >> +			/*
> >> +			 * Add a dummy resource to reserve the remainder
> >> +			 * of the exclusive page in case a hot-added
> >> +			 * device's BAR is assigned into it.
> >> +			 */
> >> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >> +			if (dummy_res == NULL)
> >> +				goto no_mmap;
> >> +
> >> +			dummy_res->resource.start = res->end + 1;
> >> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> +			dummy_res->resource.flags = res->flags;
> >> +			if (request_resource(res->parent,
> >> +					&dummy_res->resource)) {
> >> +				kfree(dummy_res);
> >> +				goto no_mmap;
> >> +			}
> > Isn't it true that request_resource() only tells us that, at a given
> > point in time, no other driver has reserved that resource?  It does
> > not seem to guarantee that the resource isn't routed to another
> > device, or that another driver won't at some point attempt to request
> > that same resource.  For example, if a user constructs their initrd
> > to bind vfio-pci to devices before other modules load, this
> > request_resource() may succeed at the expense of drivers loaded later,
> > which would then fail.  The behavior will depend on driver load order,
> > and we're not actually ensuring that the overflow resource is unused,
> > just that we got it first.  Can we do better?  Am I missing something
> > that prevents this?  Thanks,
> >
> > Alex  
>  Couldn't the PCI resource allocator prevent this? It first finds an
>  empty slot in the resource tree and then requests that resource in
>  allocate_resource() when a PCI device is probed.
>  Also, why would a PCI device driver attempt to call
>  request_resource()? Shouldn't that be done during PCI enumeration?
> >>> Hi Yongji,
> >>>
> >>> It looks like most PCI drivers call pci_request_regions().  From there the
> >>> call path is:
> >>>
> >>> pci_request_selected_regions
> >>>   __pci_request_selected_regions
> >>>     __pci_request_region
> >>>       __request_mem_region
> >>>         __request_region
> >>>           __request_resource
> >>>
> >>> We sometimes see this driver ordering issue with users attempting to
> >>> blacklist native PCI drivers to leave a device free for use by
> >>> vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
> >>> driver can request the device's resources, causing a failure when vfio-pci
> >>> tries to request those same resources.  I expect that, unless it's a
> >>> boot device like the VGA device in my example, the resources are not
> >>> enabled until the driver opens the device, so the request_resource()
> >>> call doesn't occur until that point.
> >>>
> >>> For another trivial example, look at /proc/iomem as you load and unload
> >>> a dr

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-28 Thread Yongji Xie

Hi, Alex

On 2016/6/25 0:43, Alex Williamson wrote:


On Fri, 24 Jun 2016 23:37:02 +0800
Yongji Xie  wrote:


Hi, Alex

On 2016/6/24 11:37, Alex Williamson wrote:


On Fri, 24 Jun 2016 10:52:58 +0800
Yongji Xie  wrote:

On 2016/6/24 0:12, Alex Williamson wrote:

On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}

Isn't it true that request_resource() only tells us that, at a given
point in time, no other driver has reserved that resource?  It does
not seem to guarantee that the resource isn't routed to another
device, or that another driver won't at some point attempt to request
that same resource.  For example, if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed at the expense of drivers loaded later,
which would then fail.  The behavior will depend on driver load order,
and we're not actually ensuring that the overflow resource is unused,
just that we got it first.  Can we do better?  Am I missing something
that prevents this?  Thanks,

Alex
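The dummy reservation in the quoted hunk targets the hot-add case: an allocator that hands out the first free, suitably aligned gap could place a small hot-added BAR in the unused remainder of the page. The sketch below is a simplified first-fit model, not the kernel's `allocate_resource()`/`find_resource()` logic (which also honors alignment callbacks and other constraints); the `toy_*` names are hypothetical, and claims are assumed sorted by start address and contained in the window.

```c
#include <stdbool.h>

/* Hypothetical claimed range; end is inclusive, like struct resource. */
struct toy_claim {
	unsigned long start;
	unsigned long end;
};

/* Round x up to a multiple of align (align must be a power of two). */
static unsigned long toy_align_up(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * First-fit search over the gaps between @claims inside the window
 * [win_start, win_end].  @claims must be sorted by start and lie inside
 * the window; @size must be at least 1.  On success stores the chosen
 * start in *@out and returns 0; returns -1 if nothing fits.
 */
static int toy_first_fit(unsigned long win_start, unsigned long win_end,
			 const struct toy_claim *claims, int nr,
			 unsigned long size, unsigned long align,
			 unsigned long *out)
{
	unsigned long gap_start = win_start;

	for (int i = 0; i <= nr; i++) {
		bool last = (i == nr);

		/* A gap exists only if the next claim starts past gap_start. */
		if (last || claims[i].start > gap_start) {
			unsigned long gap_end = last ? win_end
						     : claims[i].start - 1;
			unsigned long cand = toy_align_up(gap_start, align);

			/* cand >= gap_start also rejects wrap-around. */
			if (cand >= gap_start && cand <= gap_end &&
			    gap_end - cand >= size - 1) {
				*out = cand;
				return 0;
			}
		}
		if (!last)
			gap_start = claims[i].end + 1;
	}
	return -1;
}
```

In a window starting at 0xe0000000 with only a 256-byte BAR claimed at its base, a 256-byte, 256-byte-aligned request lands at 0xe0000100, inside the same page; with the dummy range 0xe0000100-0xe0000fff also claimed, it moves to 0xe0001000.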

Couldn't the PCI resource allocator prevent this? It first finds an
empty slot in the resource tree and then requests that resource in
allocate_resource() when a PCI device is probed.
Also, why would a PCI device driver attempt to call
request_resource()? Shouldn't that be done during PCI enumeration?

Hi Yongji,

It looks like most PCI drivers call pci_request_regions().  From there the
call path is:

pci_request_selected_regions
  __pci_request_selected_regions
    __pci_request_region
      __request_mem_region
        __request_region
          __request_resource

We sometimes see this driver ordering issue with users attempting to
blacklist native PCI drivers to leave a device free for use by
vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
driver can request the device's resources, causing a failure when vfio-pci
tries to request those same resources.  I expect that, unless it's a
boot device like the VGA device in my example, the resources are not
enabled until the driver opens the device, so the request_resource()
call doesn't occur until that point.

For another trivial example, look at /proc/iomem as you load and unload
a driver.  On my laptop, with e1000e unloaded, I see:

  e1200000-e121ffff : 0000:00:19.0
  e123e000-e123efff : 0000:00:19.0

When e1000e is loaded, each of these becomes claimed by the e1000e
driver:

  e1200000-e121ffff : 0000:00:19.0
    e1200000-e121ffff : e1000e
  e123e000-e123efff : 0000:00:19.0
    e123e000-e123efff : e1000e

Clearly the PCI core knows the resource is associated with the device, but
I don't think we're tapping into that with request_resource(); we're
just potentially stealing resources that another driver might have
claimed otherwise, as I described above.  That's my suspicion at
least; feel free to show otherwise if it's incorrect.  Thanks,

Alex
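The nesting in the /proc/iomem listings above is what distinguishes a range the PCI core has assigned to a device from one a driver has actually requested. A small parser makes that explicit; `toy_iomem_entry` and the helpers are hypothetical, and the sketch assumes two-space indentation per nesting level and addresses visible to the reader (recent kernels show zeroed addresses to unprivileged users).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/iomem line. */
struct toy_iomem_entry {
	int depth;		/* nesting level, two spaces per level */
	unsigned long start;
	unsigned long end;	/* inclusive */
	char name[64];
};

/* Parse e.g. "  e123e000-e123efff : 0000:00:19.0"; returns 0 on success. */
static int toy_parse_iomem_line(const char *line, struct toy_iomem_entry *e)
{
	int spaces = 0;

	while (line[spaces] == ' ')
		spaces++;
	e->depth = spaces / 2;
	if (sscanf(line + spaces, "%lx-%lx : %63[^\n]",
		   &e->start, &e->end, e->name) != 3)
		return -1;
	return 0;
}

/*
 * A driver's claim typically shows up as a child one level below the
 * device's entry covering the same range (e.g. "e1000e" under
 * "0000:00:19.0" after pci_request_regions()).
 */
static int toy_is_driver_claim(const struct toy_iomem_entry *dev,
			       const struct toy_iomem_entry *child)
{
	return child->depth == dev->depth + 1 &&
	       child->start == dev->start && child->end == dev->end;
}
```

A gap between device entries, such as the tail of a page after a small BAR, has no entry at all, so nothing in the tree ties it to any device.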
  

Thanks for your explanation. But I still have one question.
Shouldn't the PCI core have claimed all PCI devices' resources
after probing those device

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-24 Thread Alex Williamson
On Fri, 24 Jun 2016 23:37:02 +0800
Yongji Xie  wrote:

> Hi, Alex
> 
> On 2016/6/24 11:37, Alex Williamson wrote:
> 
> > On Fri, 24 Jun 2016 10:52:58 +0800
> > Yongji Xie  wrote:  
> >> On 2016/6/24 0:12, Alex Williamson wrote:  
> >>> On Mon, 30 May 2016 21:06:37 +0800
> >>> Yongji Xie  wrote:  
>  +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  +{
>  +	struct resource *res;
>  +	int bar;
>  +	struct vfio_pci_dummy_resource *dummy_res;
>  +
>  +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>  +
>  +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
>  +		res = vdev->pdev->resource + bar;
>  +
>  +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>  +			goto no_mmap;
>  +
>  +		if (!(res->flags & IORESOURCE_MEM))
>  +			goto no_mmap;
>  +
>  +		/*
>  +		 * The PCI core shouldn't set up a resource with a
>  +		 * type but zero size. But there may be bugs that
>  +		 * cause us to do that.
>  +		 */
>  +		if (!resource_size(res))
>  +			goto no_mmap;
>  +
>  +		if (resource_size(res) >= PAGE_SIZE) {
>  +			vdev->bar_mmap_supported[bar] = true;
>  +			continue;
>  +		}
>  +
>  +		if (!(res->start & ~PAGE_MASK)) {
>  +			/*
>  +			 * Add a dummy resource to reserve the remainder
>  +			 * of the exclusive page in case a hot-added
>  +			 * device's BAR is assigned into it.
>  +			 */
>  +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
>  +			if (dummy_res == NULL)
>  +				goto no_mmap;
>  +
>  +			dummy_res->resource.start = res->end + 1;
>  +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
>  +			dummy_res->resource.flags = res->flags;
>  +			if (request_resource(res->parent,
>  +					&dummy_res->resource)) {
>  +				kfree(dummy_res);
>  +				goto no_mmap;
>  +			}
> >>> Isn't it true that request_resource() only tells us that, at a given
> >>> point in time, no other driver has reserved that resource?  It does
> >>> not seem to guarantee that the resource isn't routed to another
> >>> device, or that another driver won't at some point attempt to request
> >>> that same resource.  For example, if a user constructs their initrd
> >>> to bind vfio-pci to devices before other modules load, this
> >>> request_resource() may succeed at the expense of drivers loaded later,
> >>> which would then fail.  The behavior will depend on driver load order,
> >>> and we're not actually ensuring that the overflow resource is unused,
> >>> just that we got it first.  Can we do better?  Am I missing something
> >>> that prevents this?  Thanks,
> >>>
> >>> Alex  
> >> Couldn't the PCI resource allocator prevent this? It first finds an
> >> empty slot in the resource tree and then requests that resource in
> >> allocate_resource() when a PCI device is probed.
> >> Also, why would a PCI device driver attempt to call
> >> request_resource()? Shouldn't that be done during PCI enumeration?
> > Hi Yongji,
> >
> > It looks like most PCI drivers call pci_request_regions().  From there the
> > call path is:
> >
> > pci_request_selected_regions
> >   __pci_request_selected_regions
> >     __pci_request_region
> >       __request_mem_region
> >         __request_region
> >           __request_resource
> >
> > We sometimes see this driver ordering issue with users attempting to
> > blacklist native PCI drivers to leave a device free for use by
> > vfio-pci.  If the device is a graphics card, the generic VESA or UEFI
> > driver can request the device's resources, causing a failure when vfio-pci
> > tries to request those same resources.  I expect that, unless it's a
> > boot device like the VGA device in my example, the resources are not
> > enabled until the driver opens the device, so the request_resource()
> > call doesn't occur until that point.
> >
> > For another trivial example, look at /proc/iomem as you load and unload
> > a driver.  On my laptop, with e1000e unloaded, I see:
> >
> >   e1200000-e121ffff : 0000:00:19.0
> >   e123e000-e123efff : 0000:00:19.0
> >
> > When e1000e is loaded, each of these becomes claimed by the e1000e
> > driver:
> >
> >   e1200000-e121ffff : 0000:00:19.0
> 

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-24 Thread Yongji Xie

Hi, Alex

On 2016/6/24 11:37, Alex Williamson wrote:


On Fri, 24 Jun 2016 10:52:58 +0800
Yongji Xie  wrote:

On 2016/6/24 0:12, Alex Williamson wrote:

On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					&dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}

Isn't it true that request_resource() only tells us that, at a given
point in time, no other driver has reserved that resource?  It does
not seem to guarantee that the resource isn't routed to another
device, or that another driver won't at some point attempt to request
that same resource.  For example, if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed at the expense of drivers loaded later,
which would then fail.  The behavior will depend on driver load order,
and we're not actually ensuring that the overflow resource is unused,
just that we got it first.  Can we do better?  Am I missing something
that prevents this?  Thanks,

Alex

Couldn't the PCI resource allocator prevent this? It first finds an
empty slot in the resource tree and then requests that resource in
allocate_resource() when a PCI device is probed.
Also, why would a PCI device driver attempt to call
request_resource()? Shouldn't that be done during PCI enumeration?

Hi Yongji,

It looks like most PCI drivers call pci_request_regions().  From there the
call path is:

pci_request_selected_regions
  __pci_request_selected_regions
    __pci_request_region
      __request_mem_region
        __request_region
          __request_resource

We see this driver ordering issue sometimes with users attempting to
blacklist native pci drivers, trying to leave a device free for use by
vfio-pci.  If the device is a graphics card, the generic vesa or uefi
driver can request device resources causing a failure when vfio-pci
tries to request those same resources.  I expect that unless it's a
boot device, like vga in my example, the resources are not enabled
until the driver opens the device, therefore the request_resource() call
doesn't occur until that point.

For another trivial example, look at /proc/iomem as you load and unload
a driver, on my laptop with e1000e unloaded I see:

   e120-e121 : :00:19.0
   e123e000-e123efff : :00:19.0

When e1000e is loaded, each of these becomes claimed by the e1000e
driver:

   e120-e121 : :00:19.0
     e120-e121 : e1000e
   e123e000-e123efff : :00:19.0
     e123e000-e123efff : e1000e

Clearly pci core knows the resource is associated with the device, but
I don't think we're tapping into that with request_resource(), we're
just potentially stealing resources that another driver might have
claimed otherwise as I described above.  That's my suspicion at
least, feel free to show otherwise if it's incorrect.  Thanks,

Alex



Thanks for your explanation. But I still have one question.
Shouldn't the PCI core have claimed all of a PCI device's resources
after probing the device? If so, request_resource() will fail
when vfio-pci tries to steal resources that another driver might
request later. Anyth

RE: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-23 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, June 24, 2016 11:37 AM
> 
> On Fri, 24 Jun 2016 10:52:58 +0800
> Yongji Xie  wrote:
> > On 2016/6/24 0:12, Alex Williamson wrote:
> > > On Mon, 30 May 2016 21:06:37 +0800
> > > Yongji Xie  wrote:
> > >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > >> +{
> > >> +struct resource *res;
> > >> +int bar;
> > >> +struct vfio_pci_dummy_resource *dummy_res;
> > >> +
> > >> +INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > >> +
> > >> +for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > >> +res = vdev->pdev->resource + bar;
> > >> +
> > >> +if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > >> +goto no_mmap;
> > >> +
> > >> +if (!(res->flags & IORESOURCE_MEM))
> > >> +goto no_mmap;
> > >> +
> > >> +/*
> > >> + * The PCI core shouldn't set up a resource with a
> > >> + * type but zero size. But there may be bugs that
> > >> + * cause us to do that.
> > >> + */
> > >> +if (!resource_size(res))
> > >> +goto no_mmap;
> > >> +
> > >> +if (resource_size(res) >= PAGE_SIZE) {
> > >> +vdev->bar_mmap_supported[bar] = true;
> > >> +continue;
> > >> +}
> > >> +
> > >> +if (!(res->start & ~PAGE_MASK)) {
> > >> +/*
> > >> + * Add a dummy resource to reserve the remainder
> > >> + * of the exclusive page in case that hot-add
> > >> + * device's bar is assigned into it.
> > >> + */
> > >> +dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > >> +if (dummy_res == NULL)
> > >> +goto no_mmap;
> > >> +
> > >> +dummy_res->resource.start = res->end + 1;
> > >> +dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > >> +dummy_res->resource.flags = res->flags;
> > >> +if (request_resource(res->parent,
> > >> +&dummy_res->resource)) {
> > >> +kfree(dummy_res);
> > >> +goto no_mmap;
> > >> +}
> > > Isn't it true that request_resource() only tells us that at a given
> > > point in time, no other drivers have reserved that resource?  It seems
> > > like it does not guarantee that the resource isn't routed to another
> > > device or that another driver won't at some point attempt to request
> > > that same resource.  So for example if a user constructs their initrd
> > > to bind vfio-pci to devices before other modules load, this
> > > request_resource() may succeed, at the expense of drivers loaded later
> > > now failing.  The behavior will depend on driver load order and we're
> > > not actually ensuring that the overflow resource is unused, just that
> > > we got it first.  Can we do better?  Am I missing something that
> > > prevents this?  Thanks,
> > >
> > > Alex
> >
> > Couldn't the PCI resource allocator prevent this? It first finds
> > an empty slot in the resource tree and then requests that
> > resource in allocate_resource() when a PCI device is probed.
> > And I'd like to know why a PCI device driver would call
> > request_resource() at all. Shouldn't this be done during PCI enumeration?
> 
> Hi Yongji,
> 
> Looks like most pci drivers call pci_request_regions().  From there the
> call path is:
> 
> pci_request_selected_regions
>   __pci_request_selected_regions
>     __pci_request_region
>       __request_mem_region
>         __request_region
>           __request_resource
> 
> We see this driver ordering issue sometimes with users attempting to
> blacklist native pci drivers, trying to leave a device free for use by
> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
> driver can request device resources causing a failure when vfio-pci
> tries to request those same resources.  I expect that unless it's a
> boot device, like vga in my example, the resources are not enabled
> until the driver opens the device, therefore the request_resource() call
> doesn't occur until that point.
> 
> For another trivial example, look at /proc/iomem as you load and unload
> a driver, on my laptop with e1000e unloaded I see:
> 
>   e120-e121 : :00:19.0
>   e123e000-e123efff : :00:19.0
> 
> When e1000e is loaded, each of these becomes claimed by the e1000e
> driver:
> 
>   e120-e121 : :00:19.0
>     e120-e121 : e1000e
>   e123e000-e123efff : :00:19.0
>     e123e000-e123efff :

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-23 Thread Alex Williamson
On Fri, 24 Jun 2016 10:52:58 +0800
Yongji Xie  wrote:
> On 2016/6/24 0:12, Alex Williamson wrote:
> > On Mon, 30 May 2016 21:06:37 +0800
> > Yongji Xie  wrote:
> >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> +{
> >> +  struct resource *res;
> >> +  int bar;
> >> +  struct vfio_pci_dummy_resource *dummy_res;
> >> +
> >> +  INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >> +
> >> +  for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >> +  res = vdev->pdev->resource + bar;
> >> +
> >> +  if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >> +  goto no_mmap;
> >> +
> >> +  if (!(res->flags & IORESOURCE_MEM))
> >> +  goto no_mmap;
> >> +
> >> +  /*
> >> +   * The PCI core shouldn't set up a resource with a
> >> +   * type but zero size. But there may be bugs that
> >> +   * cause us to do that.
> >> +   */
> >> +  if (!resource_size(res))
> >> +  goto no_mmap;
> >> +
> >> +  if (resource_size(res) >= PAGE_SIZE) {
> >> +  vdev->bar_mmap_supported[bar] = true;
> >> +  continue;
> >> +  }
> >> +
> >> +  if (!(res->start & ~PAGE_MASK)) {
> >> +  /*
> >> +   * Add a dummy resource to reserve the remainder
> >> +   * of the exclusive page in case that hot-add
> >> +   * device's bar is assigned into it.
> >> +   */
> >> +  dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >> +  if (dummy_res == NULL)
> >> +  goto no_mmap;
> >> +
> >> +  dummy_res->resource.start = res->end + 1;
> >> +  dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> +  dummy_res->resource.flags = res->flags;
> >> +  if (request_resource(res->parent,
> >> +  &dummy_res->resource)) {
> >> +  kfree(dummy_res);
> >> +  goto no_mmap;
> >> +  }  
> > Isn't it true that request_resource() only tells us that at a given
> > point in time, no other drivers have reserved that resource?  It seems
> > like it does not guarantee that the resource isn't routed to another
> > device or that another driver won't at some point attempt to request
> > that same resource.  So for example if a user constructs their initrd
> > to bind vfio-pci to devices before other modules load, this
> > request_resource() may succeed, at the expense of drivers loaded later
> > now failing.  The behavior will depend on driver load order and we're
> > not actually ensuring that the overflow resource is unused, just that
> > we got it first.  Can we do better?  Am I missing something that
> > prevents this?  Thanks,
> >
> > Alex  
> 
> Couldn't the PCI resource allocator prevent this? It first finds
> an empty slot in the resource tree and then requests that
> resource in allocate_resource() when a PCI device is probed.
> And I'd like to know why a PCI device driver would call
> request_resource() at all. Shouldn't this be done during PCI enumeration?

Hi Yongji,

Looks like most pci drivers call pci_request_regions().  From there the
call path is:

pci_request_selected_regions
  __pci_request_selected_regions
    __pci_request_region
      __request_mem_region
        __request_region
          __request_resource

We see this driver ordering issue sometimes with users attempting to
blacklist native pci drivers, trying to leave a device free for use by
vfio-pci.  If the device is a graphics card, the generic vesa or uefi
driver can request device resources causing a failure when vfio-pci
tries to request those same resources.  I expect that unless it's a
boot device, like vga in my example, the resources are not enabled
until the driver opens the device, therefore the request_resource() call
doesn't occur until that point.

For another trivial example, look at /proc/iomem as you load and unload
a driver, on my laptop with e1000e unloaded I see:

  e120-e121 : :00:19.0
  e123e000-e123efff : :00:19.0

When e1000e is loaded, each of these becomes claimed by the e1000e
driver:

  e120-e121 : :00:19.0
    e120-e121 : e1000e
  e123e000-e123efff : :00:19.0
    e123e000-e123efff : e1000e

Clearly pci core knows the resource is associated with the device, but
I don't think we're tapping into that with request_resource(), we're
just potentially stealing resources that another driver might have
claimed otherwise as I described above.  That's my suspicion at
least, feel free to show otherwise if it's incorrect.  Thanks,

Alex


Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-23 Thread Yongji Xie

Hi, Alex

On 2016/6/24 0:12, Alex Williamson wrote:


On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:


The current vfio-pci implementation does not allow mmap of
sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
of such a BAR may be shared with other BARs. This causes
performance issues when we pass through a PCI device with
this kind of BAR: the guest cannot access the BARs directly,
so every MMIO access to them is emulated in the host.

However, not all sub-page BARs share a page with other BARs.
We should allow mmap of those sub-page MMIO BARs that we can
be sure do not share a page with other BARs.

This patch adds support for this case. We also try to add a
dummy resource to reserve the remainder of the page, into
which a hot-added device's BAR might otherwise be assigned.
It is not necessary to handle the case where the BAR is not
page aligned, because we can't expect the BAR to be assigned
to the same offset within a page in the guest when we pass
it through. And it's hard to access such a BAR from userspace
because we have no way to get the BAR's offset within a page.

Signed-off-by: Yongji Xie 
---
  drivers/vfio/pci/vfio_pci.c |   87 ---
  drivers/vfio/pci/vfio_pci_private.h |8 
  2 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 188b1ff..3cca2a7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,73 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					     &dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}

Isn't it true that request_resource() only tells us that at a given
point in time, no other drivers have reserved that resource?  It seems
like it does not guarantee that the resource isn't routed to another
device or that another driver won't at some point attempt to request
that same resource.  So for example if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed, at the expense of drivers loaded later
now failing.  The behavior will depend on driver load order and we're
not actually insuring that the overflow resource is unused, just that
we got it first.  Can we do better?  Am I missing something that
prevents this?  Thanks,

Alex


Couldn't the PCI resource allocator prevent this? It first finds
an empty slot in the resource tree and then requests that
resource in allocate_resource() when a PCI device is probed.
And I'd like to know why a PCI device driver would call
request_resource() at all. Shouldn't this be done during PCI enumeration?

Thanks,
Yongji


+			dummy_res->index = bar;
+			list_add(&dummy_res->res_next,
+				 &vdev->dummy_resources_list);
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+		/*
+		 * Here we don't handle the case when the BAR is not p

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-23 Thread Alex Williamson
On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

> The current vfio-pci implementation does not allow mmap of
> sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
> of such a BAR may be shared with other BARs. This causes
> performance issues when we pass through a PCI device with
> this kind of BAR: the guest cannot access the BARs directly,
> so every MMIO access to them is emulated in the host.
> 
> However, not all sub-page BARs share a page with other BARs.
> We should allow mmap of those sub-page MMIO BARs that we can
> be sure do not share a page with other BARs.
> 
> This patch adds support for this case. We also try to add a
> dummy resource to reserve the remainder of the page, into
> which a hot-added device's BAR might otherwise be assigned.
> It is not necessary to handle the case where the BAR is not
> page aligned, because we can't expect the BAR to be assigned
> to the same offset within a page in the guest when we pass
> it through. And it's hard to access such a BAR from userspace
> because we have no way to get the BAR's offset within a page.
> 
> Signed-off-by: Yongji Xie 
> ---
>  drivers/vfio/pci/vfio_pci.c |   87 ---
>  drivers/vfio/pci/vfio_pci_private.h |8 
>  2 files changed, 89 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 188b1ff..3cca2a7 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,6 +110,73 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> +{
> + struct resource *res;
> + int bar;
> + struct vfio_pci_dummy_resource *dummy_res;
> +
> + INIT_LIST_HEAD(&vdev->dummy_resources_list);
> +
> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> + res = vdev->pdev->resource + bar;
> +
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> + goto no_mmap;
> +
> + if (!(res->flags & IORESOURCE_MEM))
> + goto no_mmap;
> +
> + /*
> +  * The PCI core shouldn't set up a resource with a
> +  * type but zero size. But there may be bugs that
> +  * cause us to do that.
> +  */
> + if (!resource_size(res))
> + goto no_mmap;
> +
> + if (resource_size(res) >= PAGE_SIZE) {
> + vdev->bar_mmap_supported[bar] = true;
> + continue;
> + }
> +
> + if (!(res->start & ~PAGE_MASK)) {
> + /*
> +  * Add a dummy resource to reserve the remainder
> +  * of the exclusive page in case that hot-add
> +  * device's bar is assigned into it.
> +  */
> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> + if (dummy_res == NULL)
> + goto no_mmap;
> +
> + dummy_res->resource.start = res->end + 1;
> + dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> + dummy_res->resource.flags = res->flags;
> + if (request_resource(res->parent,
> + &dummy_res->resource)) {
> + kfree(dummy_res);
> + goto no_mmap;
> + }

Isn't it true that request_resource() only tells us that at a given
point in time, no other drivers have reserved that resource?  It seems
like it does not guarantee that the resource isn't routed to another
device or that another driver won't at some point attempt to request
that same resource.  So for example if a user constructs their initrd
to bind vfio-pci to devices before other modules load, this
request_resource() may succeed, at the expense of drivers loaded later
now failing.  The behavior will depend on driver load order and we're
not actually ensuring that the overflow resource is unused, just that
we got it first.  Can we do better?  Am I missing something that
prevents this?  Thanks,

Alex

> + dummy_res->index = bar;
> + list_add(&dummy_res->res_next,
> + &vdev->dummy_resources_list);
> + vdev->bar_mmap_supported[bar] = true;
> + continue;
> + }
> + /*
> +  * Here we don't handle the case when the BAR is not page
> +  * aligned because we can't expect the BAR will be
> +  * assigned into the same location in a page in guest
> +  * when we passthrough the BAR. And it's hard to access
> +  * this BAR in userspace because we have no way to get
> +  * the BAR'

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-22 Thread Alex Williamson
On Thu, 23 Jun 2016 10:39:30 +0800
Yongji Xie  wrote:

> Hi, Alex
> 
> On 2016/6/23 6:04, Alex Williamson wrote:
> 
> > On Mon, 30 May 2016 21:06:37 +0800
> > Yongji Xie  wrote:
> >  
> >> The current vfio-pci implementation does not allow mmap of
> >> sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
> >> of such a BAR may be shared with other BARs. This causes
> >> performance issues when we pass through a PCI device with
> >> this kind of BAR: the guest cannot access the BARs directly,
> >> so every MMIO access to them is emulated in the host.
> >>
> >> However, not all sub-page BARs share a page with other BARs.
> >> We should allow mmap of those sub-page MMIO BARs that we can
> >> be sure do not share a page with other BARs.
> >>
> >> This patch adds support for this case. We also try to add a
> >> dummy resource to reserve the remainder of the page, into
> >> which a hot-added device's BAR might otherwise be assigned.
> >> It is not necessary to handle the case where the BAR is not
> >> page aligned, because we can't expect the BAR to be assigned
> >> to the same offset within a page in the guest when we pass
> >> it through. And it's hard to access such a BAR from userspace
> >> because we have no way to get the BAR's offset within a page.
> >>
> >> Signed-off-by: Yongji Xie 
> >> ---  
> > Hi Yongji,
> >
> > On 5/22, message-id
> > <201605230345.u4n3djip043...@mx0a-001b2d01.pphosted.com> you indicated
> > you'd post the QEMU code which is enabled by this patch "soon".  Have I
> > missed that?  I'm still waiting to see it.  Thanks,
> >
> > Alex  
> 
> I posted it on May 24th [1]. Do I need to resend it?
> 
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg04125.html

I found it.  Thanks,

Alex

> >>   drivers/vfio/pci/vfio_pci.c |   87 ---
> >>   drivers/vfio/pci/vfio_pci_private.h |8 
> >>   2 files changed, 89 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 188b1ff..3cca2a7 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -110,6 +110,73 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> >>return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> >>   }
> >>   
> >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> +{
> >> +  struct resource *res;
> >> +  int bar;
> >> +  struct vfio_pci_dummy_resource *dummy_res;
> >> +
> >> +  INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >> +
> >> +  for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >> +  res = vdev->pdev->resource + bar;
> >> +
> >> +  if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >> +  goto no_mmap;
> >> +
> >> +  if (!(res->flags & IORESOURCE_MEM))
> >> +  goto no_mmap;
> >> +
> >> +  /*
> >> +   * The PCI core shouldn't set up a resource with a
> >> +   * type but zero size. But there may be bugs that
> >> +   * cause us to do that.
> >> +   */
> >> +  if (!resource_size(res))
> >> +  goto no_mmap;
> >> +
> >> +  if (resource_size(res) >= PAGE_SIZE) {
> >> +  vdev->bar_mmap_supported[bar] = true;
> >> +  continue;
> >> +  }
> >> +
> >> +  if (!(res->start & ~PAGE_MASK)) {
> >> +  /*
> >> +   * Add a dummy resource to reserve the remainder
> >> +   * of the exclusive page in case that hot-add
> >> +   * device's bar is assigned into it.
> >> +   */
> >> +  dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >> +  if (dummy_res == NULL)
> >> +  goto no_mmap;
> >> +
> >> +  dummy_res->resource.start = res->end + 1;
> >> +  dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> +  dummy_res->resource.flags = res->flags;
> >> +  if (request_resource(res->parent,
> >> +  &dummy_res->resource)) {
> >> +  kfree(dummy_res);
> >> +  goto no_mmap;
> >> +  }
> >> +  dummy_res->index = bar;
> >> +  list_add(&dummy_res->res_next,
> >> +  &vdev->dummy_resources_list);
> >> +  vdev->bar_mmap_supported[bar] = true;
> >> +  continue;
> >> +  }
> >> +  /*
> >> +   * Here we don't handle the case when the BAR is not page
> >> +   * aligned because we can't expect the BAR will be
> >> +   * assigned into the same location in a page in guest
> >> +   * when we passthrough the BAR. And it's hard to access
> >> +   * this BAR in userspace because we have no way to get
> >> +   * the BAR's location in 

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-22 Thread Yongji Xie

Hi, Alex

On 2016/6/23 6:04, Alex Williamson wrote:


On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:


The current vfio-pci implementation does not allow mmap of
sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
of such a BAR may be shared with other BARs. This causes
performance issues when we pass through a PCI device with
this kind of BAR: the guest cannot access the BARs directly,
so every MMIO access to them is emulated in the host.

However, not all sub-page BARs share a page with other BARs.
We should allow mmap of those sub-page MMIO BARs that we can
be sure do not share a page with other BARs.

This patch adds support for this case. We also try to add a
dummy resource to reserve the remainder of the page, into
which a hot-added device's BAR might otherwise be assigned.
It is not necessary to handle the case where the BAR is not
page aligned, because we can't expect the BAR to be assigned
to the same offset within a page in the guest when we pass
it through. And it's hard to access such a BAR from userspace
because we have no way to get the BAR's offset within a page.

Signed-off-by: Yongji Xie 
---

Hi Yongji,

On 5/22, message-id
<201605230345.u4n3djip043...@mx0a-001b2d01.pphosted.com> you indicated
you'd post the QEMU code which is enabled by this patch "soon".  Have I
missed that?  I'm still waiting to see it.  Thanks,

Alex


I posted it on May 24th [1]. Do I need to resend it?

[1] http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg04125.html

Thanks,
Yongji


  drivers/vfio/pci/vfio_pci.c |   87 ---
  drivers/vfio/pci/vfio_pci_private.h |8 
  2 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 188b1ff..3cca2a7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,73 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+{
+	struct resource *res;
+	int bar;
+	struct vfio_pci_dummy_resource *dummy_res;
+
+	INIT_LIST_HEAD(&vdev->dummy_resources_list);
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		res = vdev->pdev->resource + bar;
+
+		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
+			goto no_mmap;
+
+		if (!(res->flags & IORESOURCE_MEM))
+			goto no_mmap;
+
+		/*
+		 * The PCI core shouldn't set up a resource with a
+		 * type but zero size. But there may be bugs that
+		 * cause us to do that.
+		 */
+		if (!resource_size(res))
+			goto no_mmap;
+
+		if (resource_size(res) >= PAGE_SIZE) {
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+
+		if (!(res->start & ~PAGE_MASK)) {
+			/*
+			 * Add a dummy resource to reserve the remainder
+			 * of the exclusive page in case a hot-added
+			 * device's BAR is assigned into it.
+			 */
+			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
+			if (dummy_res == NULL)
+				goto no_mmap;
+
+			dummy_res->resource.start = res->end + 1;
+			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.flags = res->flags;
+			if (request_resource(res->parent,
+					     &dummy_res->resource)) {
+				kfree(dummy_res);
+				goto no_mmap;
+			}
+			dummy_res->index = bar;
+			list_add(&dummy_res->res_next,
+				 &vdev->dummy_resources_list);
+			vdev->bar_mmap_supported[bar] = true;
+			continue;
+		}
+		/*
+		 * Here we don't handle the case where the BAR is not page
+		 * aligned, because we can't expect the BAR to be
+		 * assigned to the same offset within a page in the guest
+		 * when we pass it through. And it's hard to access
+		 * such a BAR from userspace because we have no way to get
+		 * the BAR's offset within a page.
+		 */
+no_mmap:
+		vdev->bar_mmap_supported[bar] = false;
+	}
+}
+
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
 
@@ -218,12 +285,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)

}
   

Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

2016-06-22 Thread Alex Williamson
On Mon, 30 May 2016 21:06:37 +0800
Yongji Xie  wrote:

> The current vfio-pci implementation does not allow mmap of
> sub-page (size < PAGE_SIZE) MMIO BARs, because the MMIO page
> of such a BAR may be shared with other BARs. This causes
> performance issues when we pass through a PCI device with
> this kind of BAR: the guest cannot access the BARs directly,
> so every MMIO access to them is emulated in the host.
> 
> However, not all sub-page BARs share a page with other BARs.
> We should allow mmap of those sub-page MMIO BARs that we can
> be sure do not share a page with other BARs.
> 
> This patch adds support for this case. We also try to add a
> dummy resource to reserve the remainder of the page, into
> which a hot-added device's BAR might otherwise be assigned.
> It is not necessary to handle the case where the BAR is not
> page aligned, because we can't expect the BAR to be assigned
> to the same offset within a page in the guest when we pass
> it through. And it's hard to access such a BAR from userspace
> because we have no way to get the BAR's offset within a page.
> 
> Signed-off-by: Yongji Xie 
> ---

Hi Yongji,

On 5/22, message-id
<201605230345.u4n3djip043...@mx0a-001b2d01.pphosted.com> you indicated
you'd post the QEMU code which is enabled by this patch "soon".  Have I
missed that?  I'm still waiting to see it.  Thanks,

Alex

>  drivers/vfio/pci/vfio_pci.c |   87 ---
>  drivers/vfio/pci/vfio_pci_private.h |8 
>  2 files changed, 89 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 188b1ff..3cca2a7 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,6 +110,73 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> +{
> + struct resource *res;
> + int bar;
> + struct vfio_pci_dummy_resource *dummy_res;
> +
> + INIT_LIST_HEAD(&vdev->dummy_resources_list);
> +
> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> + res = vdev->pdev->resource + bar;
> +
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> + goto no_mmap;
> +
> + if (!(res->flags & IORESOURCE_MEM))
> + goto no_mmap;
> +
> + /*
> +  * The PCI core shouldn't set up a resource with a
> +  * type but zero size. But there may be bugs that
> +  * cause us to do that.
> +  */
> + if (!resource_size(res))
> + goto no_mmap;
> +
> + if (resource_size(res) >= PAGE_SIZE) {
> + vdev->bar_mmap_supported[bar] = true;
> + continue;
> + }
> +
> + if (!(res->start & ~PAGE_MASK)) {
> + /*
> +  * Add a dummy resource to reserve the remainder
> +  * of the exclusive page in case that hot-add
> +  * device's bar is assigned into it.
> +  */
> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> + if (dummy_res == NULL)
> + goto no_mmap;
> +
> + dummy_res->resource.start = res->end + 1;
> + dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> + dummy_res->resource.flags = res->flags;
> + if (request_resource(res->parent,
> + &dummy_res->resource)) {
> + kfree(dummy_res);
> + goto no_mmap;
> + }
> + dummy_res->index = bar;
> + list_add(&dummy_res->res_next,
> + &vdev->dummy_resources_list);
> + vdev->bar_mmap_supported[bar] = true;
> + continue;
> + }
> + /*
> +  * Here we don't handle the case when the BAR is not page
> +  * aligned because we can't expect the BAR will be
> +  * assigned into the same location in a page in guest
> +  * when we passthrough the BAR. And it's hard to access
> +  * this BAR in userspace because we have no way to get
> +  * the BAR's location in a page.
> +  */
> +no_mmap:
> + vdev->bar_mmap_supported[bar] = false;
> + }
> +}
> +
>  static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
>  static void vfio_pci_disable(struct vfio_pci_device *vdev);
>  
> @@ -218,12 +285,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   }
>   }
>  
> + vfio_pci_probe_mmaps(vdev);
> +
>   return 0;
>  }
>  
>  static void vfio_pci_disable(stru