Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-09 Thread Dong Jia
On Thu, 5 May 2016 13:23:11 -0700
Neo Jia  wrote:

> > > I also noticed in another thread:
> > > -
> > > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with 
> > > iommu and without iommu
> > > 
> > > Kirti did:
> > > 1. don't pin the pages in the map ioctl for the vGPU case.
> > > 2. export vfio_pin_pages and vfio_unpin_pages.
> > > 
> > > Although their patches didn't show how these interfaces were used, I
> > > guess they can either use these interfaces to pin/unpin all of the
> > > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > > to finish my #1? If the answer is yes, then I could change my plan and  
> > 
> > Yes, we would absolutely only want one vfio iommu backend doing this,
> > there's nothing device specific about it.  We're looking at supporting
> > both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> > wants the on-demand approach while Intel vGPU wants to pin the entire
> > guest, at least for an initial solution.  This iommu backend would need
> > to support both as determined by the mediated device backend.  
> 
> Right, we will add a new callback to the mediated device backend interface
> for this purpose in the v4 patch.
Dear Neo:
Thanks for this information.

What interests me most is the new vfio iommu backend. Looking forward to
your new patches. :>

> 
> Thanks,
> Neo
> 
> >   
> > > do:
> > > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > > When starting the guest, form the database of guest physical to host
> > > virtual translations.
> > > 
> > > #2. In the driver of the ccw devices, when an I/O instruction was
> > > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > > physical address, then translate the ccw program for I/O operation.
> > > 
> > > So which one is the right way to go?  
> > 
> > As above, I think we have a need to support both approaches in this new
> > iommu backend, it will be up to you to determine which is appropriate
> > for your devices and guest drivers.  A fully pinned guest has a latency
> > advantage, but obviously there are numerous disadvantages for the
> > pinning itself.  Pinning on-demand has overhead to set up each DMA
> > operation by the device but has a much smaller pinning footprint.
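
To make the "both modes" point concrete, here is a purely hypothetical sketch
of how a mediated device backend could tell the iommu backend which pinning
strategy it wants; none of these names come from the actual patches:

/*
 * Illustration only: the mediated device backend advertises whether it wants
 * all guest memory pinned up front or pages pinned on demand, and the vfio
 * iommu backend picks its strategy accordingly.
 */
enum mdev_pin_mode_example {
        MDEV_PIN_ALL_GUEST_MEMORY,      /* e.g. pin the whole guest up front */
        MDEV_PIN_ON_DEMAND,             /* e.g. pin per DMA/ccw request      */
};

struct mdev_backend_ops_example {
        /* Queried by the iommu backend to choose its pinning strategy. */
        enum mdev_pin_mode_example (*get_pin_mode)(void *mdev);
        /* ... other callbacks ... */
};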



Dong Jia




Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-09 Thread Dong Jia
On Thu, 5 May 2016 13:19:45 -0600
Alex Williamson  wrote:

> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia  wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson  wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia  wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson  wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is
> > > > > > definitely a good example to understand how these patches work.
> > > > > > Here is a little more detail on how an I/O request triggered by
> > > > > > the Qemu guest will be handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > > (u_ccwchain).
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?
> > > > Yes.
> > > >   
> > > > > 
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > > program, which becomes runnable for a real device.
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?
> > > > Yes. Exactly.
> > > >   
> > > > > 
> > > > > > K3. With the necessary information contained in the orb passed in
> > > > > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > > for the I/O result.
> > > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > > update the user space irb.
> > > > > > K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > 
> > > > > If the answers to my questions above are both yes,
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we
> > > > > plan to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.
> > > > I've read through the mail threads that discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not simply be applied
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of addresses in the memory space for DMA operations. Any
> > > > address inside this range will not be used for any other purpose. Thus
> > > > we can add a memory listener on this range, and pin the pages for
> > > > further use (DMA operations). And we can keep the pages pinned during
> > > > the life cycle of the VM (not quite accurate, or I should say 'the
> > > > target device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the 

Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-05 Thread Neo Jia
On Thu, May 05, 2016 at 01:19:45PM -0600, Alex Williamson wrote:
> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia  wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson  wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia  wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson  wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is
> > > > > > definitely a good example to understand how these patches work.
> > > > > > Here is a little more detail on how an I/O request triggered by
> > > > > > the Qemu guest will be handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > > (u_ccwchain).
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?
> > > > Yes.
> > > >   
> > > > > 
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > > program, which becomes runnable for a real device.
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?
> > > > Yes. Exactly.
> > > >   
> > > > > 
> > > > > > K3. With the necessary information contained in the orb passed in
> > > > > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > > for the I/O result.
> > > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > > update the user space irb.
> > > > > > K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > 
> > > > > If the answers to my questions above are both yes,
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we
> > > > > plan to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.
> > > > I've read through the mail threads that discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not simply be applied
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of addresses in the memory space for DMA operations. Any
> > > > address inside this range will not be used for any other purpose. Thus
> > > > we can add a memory listener on this range, and pin the pages for
> > > > further use (DMA operations). And we can keep the pages pinned during
> > > > the life cycle of the VM (not quite accurate, or I should say 'the
> > > > target device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the 

Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-05 Thread Alex Williamson
[cc +Intel,NVIDIA]

On Thu, 5 May 2016 18:29:08 +0800
Dong Jia  wrote:

> On Wed, 4 May 2016 13:26:53 -0600
> Alex Williamson  wrote:
> 
> > On Wed, 4 May 2016 17:26:29 +0800
> > Dong Jia  wrote:
> >   
> > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > Alex Williamson  wrote:
> > > 
> > > Dear Alex:
> > > 
> > > Thanks for the comments.
> > > 
> > > [...]
> > >   
> > > > > 
> > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is
> > > > > definitely a good example to understand how these patches work.
> > > > > Here is a little more detail on how an I/O request triggered by
> > > > > the Qemu guest will be handled (without error handling).
> > > > > 
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > > 
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > (u_ccwchain).
> > > > 
> > > > Is this replacing guest physical address in the program with QEMU
> > > > virtual addresses?
> > > Yes.
> > >   
> > > > 
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > program, which becomes runnable for a real device.
> > > > 
> > > > And here we translate and likely pin QEMU virtual address to physical
> > > > addresses to further modify the program sent into the channel?
> > > Yes. Exactly.
> > >   
> > > > 
> > > > > K3. With the necessary information contained in the orb passed in
> > > > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > for the I/O result.
> > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > update the user space irb.
> > > > > K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > 
> > > > If the answers to my questions above are both yes,
> > > Yes, they are.
> > >   
> > > > then this is really a mediated interface, not a direct assignment.
> > > Right. This is true.
> > >   
> > > > We don't need an iommu
> > > > because we're policing and translating the program for the device
> > > > before it gets sent to hardware.  I think there are better ways than
> > > > noiommu to handle such devices perhaps even with better performance
> > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > to implement for vGPU support would work here.
> > > > 
> > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > translation or isolation since a vGPU is largely a software construct,
> > > > but we do have software policing and translating how the GPU is
> > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > programming them into an IOMMU for a device, it simply stores the
> > > > translations for use by later requests.  This means that a device
> > > > programmed in a VM with guest physical addresses can have the
> > > > vfio kernel convert that address to process virtual address, pin the
> > > > page and program the hardware with the host physical address in one
> > > > step.
> > > I've read through the mail threads that discuss how to add vGPU
> > > support in VFIO. I'm afraid that proposal could not simply be applied
> > > to this case, especially if we want to make the vfio api completely
> > > compatible with the existing usage.
> > > 
> > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > fixed range of addresses in the memory space for DMA operations. Any
> > > address inside this range will not be used for any other purpose. Thus
> > > we can add a memory listener on this range, and pin the pages for
> > > further use (DMA operations). And we can keep the pages pinned during
> > > the life cycle of the VM (not quite accurate, or I should say 'the
> > > target device').  
> > 
> > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > assignment pins all of guest memory, not just a dedicated, exclusive
> > range of it, in order to map it through the hardware IOMMU.  That gives
> > the guest the ability to transparently perform DMA with the device
> > since the IOMMU maps the guest physical to host physical translations.  
> Thanks for this explanation.
> 
> I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> s390 architecture, we set the IOMMU width by calling
> memory_region_add_subregion before initializing the address_space of
> the PCI device, which will be registered 

Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-05 Thread Dong Jia
On Wed, 4 May 2016 13:26:53 -0600
Alex Williamson  wrote:

> On Wed, 4 May 2016 17:26:29 +0800
> Dong Jia  wrote:
> 
> > On Fri, 29 Apr 2016 11:17:35 -0600
> > Alex Williamson  wrote:
> > 
> > Dear Alex:
> > 
> > Thanks for the comments.
> > 
> > [...]
> > 
> > > > 
> > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > good example to understand how these patches work. Here is a little
> > > > more detail on how an I/O request triggered by the Qemu guest will be
> > > > handled (without error handling).
> > > > 
> > > > Explanation:
> > > > Q1-Q4: Qemu side process.
> > > > K1-K6: Kernel side process.
> > > > 
> > > > Q1. Intercept a ssch instruction.
> > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > (u_ccwchain).  
> > > 
> > > Is this replacing guest physical address in the program with QEMU
> > > virtual addresses?  
> > Yes.
> > 
> > >   
> > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > program, which becomes runnable for a real device.  
> > > 
> > > And here we translate and likely pin QEMU virtual address to physical
> > > addresses to further modify the program sent into the channel?  
> > Yes. Exactly.
> > 
> > >   
> > > > K3. With the necessary information contained in the orb passed in
> > > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > for the I/O result.
> > > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > update the user space irb.
> > > > K6. Copy irb and scsw back to user space.
> > > > Q4. Update the irb for the guest.  
> > > 
> > > If the answers to my questions above are both yes,  
> > Yes, they are.
> > 
> > > then this is really a mediated interface, not a direct assignment.  
> > Right. This is true.
> > 
> > > We don't need an iommu
> > > because we're policing and translating the program for the device
> > > before it gets sent to hardware.  I think there are better ways than
> > > noiommu to handle such devices perhaps even with better performance
> > > than this two-stage translation.  In fact, I think the solution we plan
> > > to implement for vGPU support would work here.
> > > 
> > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > translation or isolation since a vGPU is largely a software construct,
> > > but we do have software policing and translating how the GPU is
> > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > backend that uses the existing map and unmap ioctls, but rather than
> > > programming them into an IOMMU for a device, it simply stores the
> > > translations for use by later requests.  This means that a device
> > > programmed in a VM with guest physical addresses can have the
> > > vfio kernel convert that address to process virtual address, pin the
> > > page and program the hardware with the host physical address in one
> > > step.  
> > I've read through the mail threads that discuss how to add vGPU
> > support in VFIO. I'm afraid that proposal could not simply be applied
> > to this case, especially if we want to make the vfio api completely
> > compatible with the existing usage.
> > 
> > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > fixed range of addresses in the memory space for DMA operations. Any
> > address inside this range will not be used for any other purpose. Thus
> > we can add a memory listener on this range, and pin the pages for
> > further use (DMA operations). And we can keep the pages pinned during
> > the life cycle of the VM (not quite accurate, or I should say 'the
> > target device').
> 
> That's not entirely accurate.  Ignoring a guest IOMMU, current device
> assignment pins all of guest memory, not just a dedicated, exclusive
> range of it, in order to map it through the hardware IOMMU.  That gives
> the guest the ability to transparently perform DMA with the device
> since the IOMMU maps the guest physical to host physical translations.
Thanks for this explanation.

I noticed in the Qemu part, when we tried to introduce vfio-pci to the
s390 architecture, we set the IOMMU width by calling
memory_region_add_subregion before initializing the address_space of
the PCI device, which will be registered with the vfio_memory_listener
later. The 'width' of the subregion is what I called the 'range' in the
former reply.
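
As a rough illustration of that setup (the function and variable names below,
including dma_addr_start/dma_addr_end, are placeholders rather than the actual
s390 PCI code):

/* Sketch only: constrain the DMA window that the vfio_memory_listener will
 * later observe.  The region's size -- its "width" -- is exactly the guest
 * DMA range. */
#include "exec/memory.h"

static MemoryRegion dma_window_mr;
static AddressSpace dma_window_as;

static void setup_dma_window(Object *owner, MemoryRegion *dma_mr,
                             hwaddr dma_addr_start, hwaddr dma_addr_end)
{
    memory_region_init(&dma_window_mr, owner, "dma-window",
                       dma_addr_end - dma_addr_start);

    /* Only this subregion is visible to the device's address space, so only
     * this range will be registered with the listener (and pinned) later. */
    memory_region_add_subregion(&dma_window_mr, 0, dma_mr);

    address_space_init(&dma_window_as, &dma_window_mr, "dma-window-as");
}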

The first reason we did that is that we know the dma memory range
exactly, and we got the width from 'dma_addr_end - dma_addr_start'. The
second reason we have to do that is that using the following statement
will cause the initialization of the guest tremendously 

Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-04 Thread Alex Williamson
On Wed, 4 May 2016 17:26:29 +0800
Dong Jia  wrote:

> On Fri, 29 Apr 2016 11:17:35 -0600
> Alex Williamson  wrote:
> 
> Dear Alex:
> 
> Thanks for the comments.
> 
> [...]
> 
> > > 
> > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > good example to understand how these patches work. Here is a little
> > > more detail on how an I/O request triggered by the Qemu guest will be
> > > handled (without error handling).
> > > 
> > > Explanation:
> > > Q1-Q4: Qemu side process.
> > > K1-K6: Kernel side process.
> > > 
> > > Q1. Intercept a ssch instruction.
> > > Q2. Translate the guest ccw program to a user space ccw program
> > > (u_ccwchain).  
> > 
> > Is this replacing guest physical address in the program with QEMU
> > virtual addresses?  
> Yes.
> 
> >   
> > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > K2. Translate the user space ccw program to a kernel space ccw
> > > program, which becomes runnable for a real device.  
> > 
> > And here we translate and likely pin QEMU virtual address to physical
> > addresses to further modify the program sent into the channel?  
> Yes. Exactly.
> 
> >   
> > > K3. With the necessary information contained in the orb passed in
> > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > for the I/O result.
> > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > update the user space irb.
> > > K6. Copy irb and scsw back to user space.
> > > Q4. Update the irb for the guest.  
> > 
> > If the answers to my questions above are both yes,  
> Yes, they are.
> 
> > then this is really a mediated interface, not a direct assignment.  
> Right. This is true.
> 
> > We don't need an iommu
> > because we're policing and translating the program for the device
> > before it gets sent to hardware.  I think there are better ways than
> > noiommu to handle such devices perhaps even with better performance
> > than this two-stage translation.  In fact, I think the solution we plan
> > to implement for vGPU support would work here.
> > 
> > Like your device, a vGPU is mediated, we don't have IOMMU level
> > translation or isolation since a vGPU is largely a software construct,
> > but we do have software policing and translating how the GPU is
> > programmed.  To do this we're creating a type1 compatible vfio iommu
> > backend that uses the existing map and unmap ioctls, but rather than
> > programming them into an IOMMU for a device, it simply stores the
> > translations for use by later requests.  This means that a device
> > programmed in a VM with guest physical addresses can have the
> > vfio kernel convert that address to process virtual address, pin the
> > page and program the hardware with the host physical address in one
> > step.  
> I've read through the mail threads that discuss how to add vGPU
> support in VFIO. I'm afraid that proposal could not simply be applied
> to this case, especially if we want to make the vfio api completely
> compatible with the existing usage.
> 
> AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> fixed range of addresses in the memory space for DMA operations. Any
> address inside this range will not be used for any other purpose. Thus
> we can add a memory listener on this range, and pin the pages for
> further use (DMA operations). And we can keep the pages pinned during
> the life cycle of the VM (not quite accurate, or I should say 'the
> target device').

That's not entirely accurate.  Ignoring a guest IOMMU, current device
assignment pins all of guest memory, not just a dedicated, exclusive
range of it, in order to map it through the hardware IOMMU.  That gives
the guest the ability to transparently perform DMA with the device
since the IOMMU maps the guest physical to host physical translations.
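
For reference, a minimal user-space sketch of what that full-memory mapping
looks like with the existing type1 ioctl (the fd and address arguments are
placeholders):

/* Sketch: map one contiguous chunk of guest RAM through the hardware IOMMU.
 * With type1, QEMU issues this for all of guest memory, which pins it all. */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_guest_ram(int container_fd, void *qemu_vaddr,
                         unsigned long long guest_phys, unsigned long long size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (unsigned long long)qemu_vaddr; /* process virtual address */
    map.iova  = guest_phys;                     /* guest physical address  */
    map.size  = size;

    /* The kernel pins the pages and programs the IOMMU with the guest
     * physical to host physical translation. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}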

That's not what vGPU is about.  In the case of vGPU the proposal is to
use the same QEMU vfio MemoryListener API, but only for the purpose of
having an accurate database of guest physical to process virtual
translations for the VM.  In your above example, this means step Q2 is
eliminated because step K2 has the information to perform both a guest
physical to process virtual translation and to pin the page to get a
host physical address.  So you'd only need to modify the program once.
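
A very rough model of "simply stores the translations" (plain C, purely
illustrative; these structures are not from the actual vGPU series):

/* The "iommu" just remembers iova -> process-virtual translations so that a
 * later request can resolve a guest physical address and pin the backing
 * page on demand. */
#include <stddef.h>
#include <stdint.h>

struct dma_translation {
    uint64_t iova;   /* guest physical address as seen by the device */
    uint64_t vaddr;  /* QEMU process virtual address                  */
    uint64_t size;
};

#define MAX_TRANSLATIONS 128
static struct dma_translation db[MAX_TRANSLATIONS];
static size_t db_len;

/* MAP_DMA path: record the translation; no hardware is programmed. */
static int record_translation(uint64_t iova, uint64_t vaddr, uint64_t size)
{
    if (db_len == MAX_TRANSLATIONS)
        return -1;
    db[db_len++] = (struct dma_translation){ iova, vaddr, size };
    return 0;
}

/* Later request path: resolve a guest physical address to a process virtual
 * address; the real backend would then pin that page to obtain the host
 * physical address in the same step. */
static int resolve_iova(uint64_t iova, uint64_t *vaddr)
{
    size_t i;

    for (i = 0; i < db_len; i++) {
        if (iova >= db[i].iova && iova < db[i].iova + db[i].size) {
            *vaddr = db[i].vaddr + (iova - db[i].iova);
            return 0;
        }
    }
    return -1;
}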

> Well, a Subchannel Device does not have such a range of addresses. The
> device driver simply calls kmalloc() to get a piece of memory, and
> assembles a ccw program with it, before issuing the ccw program to
> perform an I/O operation. So the Qemu memory listener can't tell whether
> an address is for an I/O operation or for something else. And this makes
> the memory listener unnecessary for our case.

It's only 

Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-04 Thread Dong Jia
On Fri, 29 Apr 2016 11:17:35 -0600
Alex Williamson  wrote:

Dear Alex:

Thanks for the comments.

[...]

> > 
> > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > good example to understand how these patches work. Here is a little
> > more detail on how an I/O request triggered by the Qemu guest will be
> > handled (without error handling).
> > 
> > Explanation:
> > Q1-Q4: Qemu side process.
> > K1-K6: Kernel side process.
> > 
> > Q1. Intercept a ssch instruction.
> > Q2. Translate the guest ccw program to a user space ccw program
> > (u_ccwchain).
> 
> Is this replacing guest physical address in the program with QEMU
> virtual addresses?
Yes.

> 
> > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > K2. Translate the user space ccw program to a kernel space ccw
> > program, which becomes runnable for a real device.
> 
> And here we translate and likely pin QEMU virtual address to physical
> addresses to further modify the program sent into the channel?
Yes. Exactly.

> 
> > K3. With the necessary information contained in the orb passed in
> > by Qemu, issue the k_ccwchain to the device, and wait event q
> > for the I/O result.
> > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > update the user space irb.
> > K6. Copy irb and scsw back to user space.
> > Q4. Update the irb for the guest.
> 
> If the answers to my questions above are both yes,
Yes, they are.

> then this is really a mediated interface, not a direct assignment.
Right. This is true.

> We don't need an iommu
> because we're policing and translating the program for the device
> before it gets sent to hardware.  I think there are better ways than
> noiommu to handle such devices perhaps even with better performance
> than this two-stage translation.  In fact, I think the solution we plan
> to implement for vGPU support would work here.
> 
> Like your device, a vGPU is mediated, we don't have IOMMU level
> translation or isolation since a vGPU is largely a software construct,
> but we do have software policing and translating how the GPU is
> programmed.  To do this we're creating a type1 compatible vfio iommu
> backend that uses the existing map and unmap ioctls, but rather than
> programming them into an IOMMU for a device, it simply stores the
> translations for use by later requests.  This means that a device
> programmed in a VM with guest physical addresses can have the
> vfio kernel convert that address to process virtual address, pin the
> page and program the hardware with the host physical address in one
> step.
I've read through the mail threads that discuss how to add vGPU
support in VFIO. I'm afraid that proposal could not simply be applied
to this case, especially if we want to make the vfio api completely
compatible with the existing usage.

AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
fixed range of addresses in the memory space for DMA operations. Any
address inside this range will not be used for any other purpose. Thus
we can add a memory listener on this range, and pin the pages for
further use (DMA operations). And we can keep the pages pinned during
the life cycle of the VM (not quite accurate, or I should say 'the
target device').
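
On the QEMU side, the listener-based approach sketched here (names are
placeholders; in real QEMU this role is played by the vfio_memory_listener)
only works because such a fixed range exists:

/* Sketch: react to sections being added to the device's DMA address space.
 * Every section in the watched (fixed) range is a candidate for a DMA map,
 * i.e. for pinning, for the lifetime of the VM. */
#include "exec/memory.h"

static void dma_range_region_add(MemoryListener *listener,
                                 MemoryRegionSection *section)
{
    /* A real implementation would call the vfio map ioctl here. */
}

static MemoryListener dma_range_listener = {
    .region_add = dma_range_region_add,
};

/* Registered against the device's DMA address space, so only addresses in
 * that dedicated range are ever considered:
 *   memory_listener_register(&dma_range_listener, &device_dma_as);
 */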

Well, a Subchannel Device does not have such a range of addresses. The
device driver simply calls kmalloc() to get a piece of memory, and
assembles a ccw program with it, before issuing the ccw program to
perform an I/O operation. So the Qemu memory listener can't tell whether
an address is for an I/O operation or for something else. And this makes
the memory listener unnecessary for our case.
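
For contrast, a simplified sketch of the guest-side pattern (the command code
and sizes are placeholders, lock handling is omitted, and real drivers go
through more of the common I/O layer):

/* Any kmalloc'd buffer can become the target of channel I/O, so there is no
 * fixed DMA window for a listener to watch. */
#include <linux/errno.h>
#include <linux/slab.h>
#include <asm/cio.h>
#include <asm/ccwdev.h>

#define EXAMPLE_CCW_CMD 0x02    /* placeholder command code */

static int start_example_io(struct ccw_device *cdev)
{
    struct ccw1 *ccw;
    void *buf;

    /* GFP_DMA keeps the buffer below 2G so the 31-bit cda can address it. */
    buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);
    ccw = kmalloc(sizeof(*ccw), GFP_KERNEL | GFP_DMA);
    if (!buf || !ccw) {
        kfree(buf);
        kfree(ccw);
        return -ENOMEM;
    }

    ccw->cmd_code = EXAMPLE_CCW_CMD;
    ccw->flags    = 0;
    ccw->count    = 4096;
    /* Kernel memory is identity-mapped here, so the pointer value doubles
     * as the data address the channel will use. */
    ccw->cda      = (__u32)(unsigned long)buf;

    /* Hand the channel program to the subchannel; completion arrives later
     * as an I/O interrupt. */
    return ccw_device_start(cdev, ccw, 0 /* intparm */, 0 /* lpm */, 0);
}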

The only point at which we know we should pin pages for I/O is when an
I/O instruction (e.g. ssch) is intercepted. At this point, we know the
address contained in the parameter of the ssch instruction points to a
piece of memory that contains a ccw program. Then we do: pin the pages
--> convert the ccw program --> perform the I/O --> return the I/O
result --> and unpin the pages.
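
A hedged sketch of that sequence, assuming a pin/unpin interface along the
lines of the proposed vfio_pin_pages()/vfio_unpin_pages() export (the exact
signatures from Kirti's series are not shown in this thread, so the helpers
below are made-up stand-ins):

/* Illustration only: on an intercepted ssch, pin the guest page behind each
 * data address, patch the address to the host physical frame, run the I/O,
 * then unpin.  Unwinding earlier pins on a mid-chain error is omitted, and
 * cda is widened here for simplicity (a real format-1 cda is 31 bits). */
struct example_ccw {
    unsigned char  cmd_code;
    unsigned char  flags;
    unsigned short count;
    unsigned long  cda;         /* data address inside the channel program */
};

/* Hypothetical stand-ins for the pin/unpin export and the I/O submission. */
long example_pin_page(unsigned long user_pfn, unsigned long *host_pfn);
void example_unpin_page(unsigned long host_pfn);
int  example_start_io(struct example_ccw *chain, int nr);

static int handle_ssch(struct example_ccw *chain, int nr)
{
    unsigned long host_pfn;
    int i, ret;

    /* Pin and translate each CCW's data address on demand. */
    for (i = 0; i < nr; i++) {
        ret = example_pin_page(chain[i].cda >> 12, &host_pfn);
        if (ret)
            return ret;
        chain[i].cda = (host_pfn << 12) | (chain[i].cda & 0xfff);
    }

    ret = example_start_io(chain, nr);      /* issue and wait for the IRB */

    /* Unpin once the I/O result has been delivered. */
    for (i = 0; i < nr; i++)
        example_unpin_page(chain[i].cda >> 12);

    return ret;
}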

> 
> This architecture also makes the vfio api completely compatible with
> existing usage without tainting QEMU with support for noiommu devices.
> I would strongly suggest following a similar approach and dropping the
> noiommu interface.  We really do not need to confuse users with noiommu
> devices that are safe and assignable and devices where noiommu should
> warn them to stay away.  Thanks,
Understood. But as explained above, even if we introduce a new vfio
iommu backend, what it does would probably look quite like what the
no-iommu backend does. Any ideas about this?

> 
> Alex
> 


Dong Jia




Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-04-29 Thread Alex Williamson
On Fri, 29 Apr 2016 14:11:47 +0200
Dong Jia Shi  wrote:

> vfio: ccw: basic vfio-ccw infrastructure
> 
> 
> Introduction
> 
> 
> Here we describe the vfio support for Channel I/O devices (aka. CCW
> devices) for Linux/s390. The motivation for vfio-ccw is to pass CCW
> devices through to a virtual machine, with vfio as the means.
> 
> Unlike other hardware architectures, s390 has defined a unified
> I/O access method, the so-called Channel I/O. It has its own
> access patterns:
> - Channel programs run asynchronously on a separate (co)processor.
> - The channel subsystem will access any memory designated by the caller
>   in the channel program directly, i.e. there is no iommu involved.
> Thus when we introduce vfio support for these devices, we realize it
> with a no-iommu vfio implementation.
> 
> This document does not intend to explain the s390 hardware architecture
> in every detail. More information/references can be found here:
> - A good start to know Channel I/O in general:
>   https://en.wikipedia.org/wiki/Channel_I/O
> - s390 architecture:
>   s390 Principles of Operation manual (IBM Form. No. SA22-7832)
> - The existing Qemu code which implements a simple emulated channel
>   subsystem could also be a good reference. It makes it easier to
>   follow the flow.
>   qemu/hw/s390x/css.c
> 
> Motivation of vfio-ccw
> ----------------------
> 
> Currently, a guest virtualized via qemu/kvm on s390 only sees
> paravirtualized virtio devices via the "Virtio Over Channel I/O
> (virtio-ccw)" transport. This makes virtio devices discoverable via
> standard operating system algorithms for handling channel devices.
> 
> However, this is not enough. On s390, for the majority of devices, which
> use the standard Channel I/O based mechanism, we also need to provide
> the ability to pass them through to a Qemu virtual machine.
> This includes devices that don't have a virtio counterpart (e.g. tape
> drives) or that have specific characteristics which guests want to
> exploit.
> 
> For passing a device to a guest, we want to use the same interface as
> everybody else, namely vfio. Thus, we would like to introduce vfio
> support for channel devices. And we would like to name this new vfio
> device "vfio-ccw".
> 
> Access patterns of CCW devices
> ------------------------------
> 
> The s390 architecture implements a so-called channel subsystem that
> provides a unified view of the devices physically attached to the
> system. Although the s390 hardware platform knows about a huge variety
> of peripheral attachments, such as disk devices (aka. DASDs), tapes,
> and communication controllers, they can all be accessed by a
> well-defined access method, and they present I/O completion in a
> unified way: I/O interruptions.
> 
> All I/O requires the use of channel command words (CCWs). A CCW is an
> instruction to a specialized I/O channel processor. A channel program
> is a sequence of CCWs which are executed by the I/O channel subsystem.
> To issue a CCW program to the channel subsystem, it is required to
> build an operation request block (ORB), which can be used to point out
> the format of the CCW and other control information to the system. The
> operating system signals the I/O channel subsystem to begin executing
> the channel program with a SSCH (start sub-channel) instruction. The
> central processor is then free to proceed with non-I/O instructions
> until interrupted. The I/O completion result is received by the
> interrupt handler in the form of interrupt response block (IRB).
> 
> Back to vfio-ccw, in short:
> - ORBs and CCW programs are built in user space (with virtual
>   addresses).
> - ORBs and CCW programs are passed to the kernel.
> - The kernel translates virtual addresses to real addresses and starts
>   the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
> - CCW programs run asynchronously on a separate processor.
> - I/O completion will be signaled to the host with I/O interruptions.
>   And it will be copied as IRB to user space.
> 
> 
> vfio-ccw patches overview
> -------------------------
> 
> It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
> our patches are based on the current no-iommu implementation. It's a
> good start to launch the code review for vfio-ccw. Note that the
> implementation is far from complete yet; but we'd like to get feedback
> for the general architecture.
> 
> The current no-iommu implementation would consider vfio-ccw as
> unsupported and will taint the kernel. This should not be the case for
> vfio-ccw. But whether the end result will be using the existing
> no-iommu code or a new module would be an implementation detail.
> 
> * CCW translation APIs
> - Description:
>   These introduce a group of APIs (starting with 'ccwchain_') to do CCW
>   translation. The CCWs passed in by a user space program are organized
>   in a buffer, with 

[Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-04-29 Thread Dong Jia Shi
vfio: ccw: basic vfio-ccw infrastructure


Introduction


Here we describe the vfio support for Channel I/O devices (aka. CCW
devices) for Linux/s390. The motivation for vfio-ccw is to pass CCW
devices through to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 has defined a unified
I/O access method, the so-called Channel I/O. It has its own
access patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus when we introduce vfio support for these devices, we realize it
with a no-iommu vfio implementation.

This document does not intend to explain the s390 hardware architecture
in every detail. More information/references can be found here:
- A good start to know Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing Qemu code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to
  follow the flow.
  qemu/hw/s390x/css.c

Motivation of vfio-ccw
----------------------

Currently, a guest virtualized via qemu/kvm on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, for the majority of devices, which
use the standard Channel I/O based mechanism, we also need to provide
the ability to pass them through to a Qemu virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. Thus, we would like to introduce vfio
support for channel devices. And we would like to name this new vfio
device "vfio-ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Although the s390 hardware platform knows about a huge variety
of peripheral attachments, such as disk devices (aka. DASDs), tapes,
and communication controllers, they can all be accessed by a
well-defined access method, and they present I/O completion in a
unified way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program
is a sequence of CCWs which are executed by the I/O channel subsystem.
To issue a CCW program to the channel subsystem, it is required to
build an operation request block (ORB), which can be used to point out
the format of the CCW and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of interrupt response block (IRB).
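
For readers unfamiliar with the format, this is essentially the layout of a
single format-1 CCW as described above (it mirrors the definition the Linux
common I/O code uses; shown here purely for orientation):

/* One channel command word: a single instruction to the I/O channel
 * processor.  A channel program is a chain of these. */
#include <linux/types.h>

struct ccw1 {
    __u8  cmd_code;   /* command: read, write, control, ...              */
    __u8  flags;      /* chaining and control flags (CC, CD, SLI, ...)   */
    __u16 count;      /* byte count of the data area                     */
    __u32 cda;        /* data address (31-bit) the channel will access   */
} __attribute__((packed, aligned(8)));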

Back to vfio-ccw, in short:
- ORBs and CCW programs are built in user space (with virtual
  addresses).
- ORBs and CCW programs are passed to the kernel.
- The kernel translates virtual addresses to real addresses and starts
  the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
- CCW programs run asynchronously on a separate processor.
- I/O completion will be signaled to the host with I/O interruptions.
  And it will be copied as IRB to user space.
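
Putting the list above into a user-space sketch (the request structure and
ioctl number are hypothetical stand-ins for the interface this RFC proposes;
only the overall shape is meant to match):

/* Build an ORB/CCW program with user virtual addresses and hand it to the
 * kernel, which translates the addresses, issues the SSCH and copies the
 * resulting IRB back -- the Q/K steps discussed in the replies in this
 * thread. */
#include <stdint.h>
#include <sys/ioctl.h>

#define EXAMPLE_CCW_CMD_REQUEST 0xCC01      /* placeholder request number */

struct example_ccw_cmd_request {
    uint64_t orb;          /* user virtual address of the ORB             */
    uint64_t ccwchain;     /* user virtual address of the CCW program     */
    uint32_t nr_ccws;      /* number of CCWs in the chain                 */
    uint64_t irb;          /* buffer that receives the interrupt response */
};

static int submit_channel_program(int device_fd, void *orb, void *chain,
                                  uint32_t nr, void *irb)
{
    struct example_ccw_cmd_request req = {
        .orb      = (uintptr_t)orb,
        .ccwchain = (uintptr_t)chain,
        .nr_ccws  = nr,
        .irb      = (uintptr_t)irb,
    };

    /* Returns once the I/O interrupt has been received and the IRB copied
     * back to user space. */
    return ioctl(device_fd, EXAMPLE_CCW_CMD_REQUEST, &req);
}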


vfio-ccw patches overview
-------------------------

It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
our patches are based on the current no-iommu implementation. It's a
good start to launch the code review for vfio-ccw. Note that the
implementation is far from complete yet; but we'd like to get feedback
for the general architecture.

The current no-iommu implementation would consider vfio-ccw as
unsupported and will taint the kernel. This should not be the case for
vfio-ccw. But whether the end result will be using the existing
no-iommu code or a new module would be an implementation detail.

* CCW translation APIs
- Description:
  These introduce a group of APIs (starting with 'ccwchain_') to do CCW
  translation. The CCWs passed in by a user space program are organized
  in a buffer, with their user virtual memory addresses. These APIs will
  copy the CCWs into the kernel space, and assemble a runnable kernel
  CCW program by updating the user virtual addresses with their
  corresponding physical addresses.
- Patches:
  vfio: ccw: introduce page array interfaces
  vfio: ccw: