Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

2020-11-23 Thread Neo Jia
On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote:
> 
> 
> On 2020/11/21 6:01, Alex Williamson wrote:
> > On Fri, 20 Nov 2020 22:05:49 +0800
> > Shenming Lu  wrote:
> >
> >> On 2020/11/20 1:41, Alex Williamson wrote:
> >>> On Thu, 19 Nov 2020 14:13:24 +0530
> >>> Kirti Wankhede  wrote:
> >>>
>  On 11/14/2020 2:47 PM, Shenming Lu wrote:
> > When running VFIO migration, I found that the restoring of VFIO PCI 
> > device’s
> > config space is before VGIC on ARM64 target. But generally, interrupt 
> > controllers
> > need to be restored before PCI devices.
> 
>  Is there any other way by which VGIC can be restored before PCI device?
> >>
> >> As far as I know, it seems to have to depend on priorities in the 
> >> non-iterable process.
> >>
> 
> > Besides, if a VFIO PCI device is
> > configured to have directly-injected MSIs (VLPIs), the restoring of its 
> > config
> > space will trigger the configuring of these VLPIs (in kernel), where it 
> > would
> > return an error as I saw due to the dependency on kvm’s vgic.
> >
> 
>  Can this be fixed in kernel to re-initialize the kernel state?
> >>
> >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
> >> But the fact is that this error is not caused by kernel, it is due to the 
> >> incorrect
> >> calling order of qemu...
> >>
> 
> > To avoid this, we can move the saving of the config space from the 
> > iterable
> > process to the non-iterable process, so that it will be called after 
> > VGIC
> > according to their priorities.
> >
> 
>  With this change, at resume side, pre-copy phase data would reach
>  destination without restored config space. VFIO device on destination
>  might need it's config space setup and validated before it can accept
>  further VFIO device specific migration state.
> 
>  This also changes bit-stream, so it would break migration with original
>  migration patch-set.
> >>>
> >>> Config space can continue to change while in pre-copy, if we're only
> >>> sending config space at the initiation of pre-copy, how are any changes
> >>> that might occur before the VM is stopped conveyed to the target?  For
> >>> example the guest might reboot and a device returned to INTx mode from
> >>> MSI during pre-copy.  Thanks,
> >>
> >> What I see is that the config space is only saved once in 
> >> save_live_complete_precopy
> >> currently...
> >> As you said, a VFIO device might need it's config space setup first, and
> >> the config space can continue to change while in pre-copy, Did you mean we
> >> have to migrate the config space in save_live_iterate?
> >> However, I still have a little doubt about the restoring dependence between
> >> the qemu emulated config space and the device data...
> >>
> >> Besides, if we surely can't move the saving of the config space back, can 
> >> we
> >> just move some actions which are triggered by the restoring of the config 
> >> space
> >> back (such as vfio_msix_enable())?
> >
> > It seems that the significant benefit to enabling interrupts during
> > pre-copy would be to reduce the latency and failure potential during
> > the final phase of migration.  Do we have any data for how much it adds
> > to the device contributed downtime to configure interrupts only at the
> > final stage?  My guess is that it's a measurable delay on its own.  At
> > the same time, we can't ignore the differences in machine specific
> > dependencies and if we don't even sync the config space once the VM is
> > stopped... this all seems not ready to call supported, especially if we
> > have concerns already about migration bit-stream compatibility.
> >
> 
> I have another question for this, if we restore the config space while in 
> pre-copy
> (include enabling interrupts), does it affect the _RESUMING state (paused) of 
> the
> device on the dst host (cause it to send interrupts? which should not be 
> allowed
> in this stage). Does the restore sequence need to be further discussed and 
> reach
> a consensus(spec) (taking into account other devices and the corresponding 
> actions
> of the vendor driver)?
> 
> > Given our timing relative to QEMU 5.2, the only path I feel comfortable
> > with is to move forward with downgrading vfio migration support to be
> > enabled via an experimental option.  Objections?  Thanks,
> 
> Alright, but this issue is related to our ARM GICv4.1 migration scheme, could 
> you
> give a rough idea about this (where to enable interrupts, we hope it to be 
> after
> the restoring of VGIC)?

I disagree. If this is only specific to the Huawei ARM GIC implementation, why do
we want to make the entire VFIO-based migration support an experimental feature?

Thanks,
Neo

> 
> Thanks,
> Shenming
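
For readers who want to see the mechanism being argued about, here is a rough
sketch, assuming QEMU's VMStateDescription priority machinery (MIG_PRI_GICV3 and
friends); the vfio_restore_config_space() helper and the field layout are
placeholders for illustration, not the actual patch:

/*
 * Illustrative sketch only: moving the config space out of the iterable
 * SaveVMHandlers path (save_live_complete_precopy) into a non-iterable
 * VMStateDescription means QEMU's MigrationPriority ordering applies, so
 * higher-priority state such as the GICv3 (MIG_PRI_GICV3, MIG_PRI_GICV3_ITS)
 * lands in the stream, and is therefore restored, before this
 * default-priority section.
 */
static int vfio_config_post_load(void *opaque, int version_id)
{
    VFIODevice *vbasedev = opaque;

    /* By the time this runs, the higher-priority VGIC has been restored,
     * so re-enabling MSIs/VLPIs from the emulated config space can succeed. */
    return vfio_restore_config_space(vbasedev);   /* hypothetical helper */
}

static const VMStateDescription vmstate_vfio_config = {
    .name = "vfio/config",
    .version_id = 1,
    .minimum_version_id = 1,
    .priority = MIG_PRI_DEFAULT,      /* sorted after MIG_PRI_GICV3 on save,
                                         hence restored after it on load */
    .post_load = vfio_config_post_load,
    .fields = (VMStateField[]) {
        /* the emulated PCI config space bytes would be described here */
        VMSTATE_END_OF_LIST()
    },
};

Whether interrupts may be enabled at all while the destination device is still
in the _RESUMING state is the separate open question raised in the thread.
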



Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental

2020-11-10 Thread Neo Jia
On Tue, Nov 10, 2020 at 08:20:50AM -0700, Alex Williamson wrote:
> 
> 
> On Tue, 10 Nov 2020 19:46:20 +0530
> Kirti Wankhede  wrote:
> 
> > On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.william...@redhat.com) wrote:
> > >> On Mon, 9 Nov 2020 19:44:17 +
> > >> "Dr. David Alan Gilbert"  wrote:
> > >>
> > >>> * Alex Williamson (alex.william...@redhat.com) wrote:
> >  Per the proposed documentation for vfio device migration:
> > 
> > Dirty pages are tracked when device is in stop-and-copy phase
> > because if pages are marked dirty during pre-copy phase and
> > content is transfered from source to destination, there is no
> > way to know newly dirtied pages from the point they were copied
> > earlier until device stops. To avoid repeated copy of same
> > content, pinned pages are marked dirty only during
> > stop-and-copy phase.
> > 
> >  Essentially, since we don't have hardware dirty page tracking for
> >  assigned devices at this point, we consider any page that is pinned
> >  by an mdev vendor driver or pinned and mapped through the IOMMU to
> >  be perpetually dirty.  In the worst case, this may result in all of
> >  guest memory being considered dirty during every iteration of live
> >  migration.  The current vfio implementation of migration has chosen
> >  to mask device dirtied pages until the final stages of migration in
> >  order to avoid this worst case scenario.
> > 
> >  Allowing the device to implement a policy decision to prioritize
> >  reduced migration data like this jeopardizes QEMU's overall ability
> >  to implement any degree of service level guarantees during migration.
> >  For example, any estimates towards achieving acceptable downtime
> >  margins cannot be trusted when such a device is present.  The vfio
> >  device should participate in dirty page tracking to the best of its
> >  ability throughout migration, even if that means the dirty footprint
> >  of the device impedes migration progress, allowing both QEMU and
> >  higher level management tools to decide whether to continue the
> >  migration or abort due to failure to achieve the desired behavior.
> > >>>
> > >>> I don't feel particularly badly about the decision to squash it in
> > >>> during the stop-and-copy phase; for devices where the pinned memory
> > >>> is large, I don't think doing it during the main phase makes much sense;
> > >>> especially if you then have to deal with tracking changes in pinning.
> > >>
> > >>
> > >> AFAIK the kernel support for tracking changes in page pinning already
> > >> exists, this is largely the vfio device in QEMU that decides when to
> > >> start exposing the device dirty footprint to QEMU.  I'm a bit surprised
> > >> by this answer though, we don't really know what the device memory
> > >> footprint is.  It might be large, it might be nothing, but by not
> > >> participating in dirty page tracking until the VM is stopped, we can't
> > >> know what the footprint is and how it will affect downtime.  Is it
> > >> really the place of a QEMU device driver to impose this sort of policy?
> > >
> > > If it could actually track changes then I'd agree we shouldn't impose
> > > any policy; but if it's just marking the whole area as dirty we're going
> > > to need a bodge somewhere; this bodge doesn't look any worse than the
> > > others to me.
> > >
> > >>
> > >>> Having said that, I agree with marking it as experimental, because
> > >>> I'm dubious how useful it will be for the same reason, I worry
> > >>> about whether the downtime will be so large to make it pointless.
> > >>
> >
> > Not all device state is large, for example NIC might only report
> > currently mapped RX buffers which usually not more than a 1GB and could
> > be as low as 10's of MB. GPU might or might not have large data, that
> > depends on its use cases.
> 
> 
> Right, it's only if we have a vendor driver that doesn't pin any memory
> when dirty tracking is enabled and we're running without a viommu that
> we would expect all of guest memory to be continuously dirty.
> 
> 
> > >> TBH I think that's the wrong reason to mark it experimental.  There's
> > >> clearly demand for vfio device migration and even if the practical use
> > >> cases are initially small, they will expand over time and hardware will
> > >> get better.  My objection is that the current behavior masks the
> > >> hardware and device limitations, leading to unrealistic expectations.
> > >> If the user expects minimal downtime, configures convergence to account
> > >> for that, QEMU thinks it can achieve it, and then the device marks
> > >> everything dirty, that's not supportable.
> > >
> > > Yes, agreed.
> >
> > Yes, there is demand for vfio device migration and many devices owners
> > started scoping and 
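
The downtime concern Alex raises can be made concrete with a little arithmetic;
the following self-contained sketch uses made-up numbers (not measurements)
purely to show how a dirty footprint that is hidden until stop-and-copy
invalidates the pre-copy downtime estimate:

#include <stdio.h>

/* All numbers are illustrative, not measurements. */
int main(void)
{
    double link_gbytes_per_s = 2.0;   /* migration bandwidth                 */
    double reported_dirty_gb = 0.2;   /* dirty data QEMU sees during pre-copy */
    double pinned_hidden_gb  = 16.0;  /* pinned memory reported dirty only
                                         at stop-and-copy                    */

    /* QEMU's convergence logic only sees the reported dirty data ...        */
    double estimated_downtime = reported_dirty_gb / link_gbytes_per_s;
    /* ... but the final stop-and-copy also has to move the hidden pages.    */
    double actual_downtime = (reported_dirty_gb + pinned_hidden_gb) /
                             link_gbytes_per_s;

    printf("estimated downtime: %.2f s\n", estimated_downtime);
    printf("actual downtime:    %.2f s\n", actual_downtime);
    return 0;
}
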

Re: [Qemu-devel] [PATCH 1/2] vfio/mdev: add version field as mandatory attribute for mdev device

2019-04-23 Thread Neo Jia
On Tue, Apr 23, 2019 at 11:39:39AM +0100, Daniel P. Berrangé wrote:
> On Fri, Apr 19, 2019 at 04:35:04AM -0400, Yan Zhao wrote:
> > device version attribute in mdev sysfs is used by user space software
> > (e.g. libvirt) to query device compatibility for live migration of VFIO
> > mdev devices. This attribute is mandatory if a mdev device supports live
> > migration.
> > 
> > It consists of two parts: common part and vendor proprietary part.
> > common part: 32 bit. lower 16 bits is vendor id and higher 16 bits
> >  identifies device type. e.g., for pci device, it is
> >  "pci vendor id" | (VFIO_DEVICE_FLAGS_PCI << 16).
> > vendor proprietary part: this part is varied in length. vendor driver can
> >  specify any string to identify a device.
> > 
> > When reading this attribute, it should show device version string of the
> > device of type . If a device does not support live migration, it
> > should return errno.
> > When writing a string to this attribute, it returns errno for
> > incompatibility or returns written string length in compatibility case.
> > If a device does not support live migration, it always returns errno.
> > 
> > For user space software to use:
> > 1.
> > Before starting live migration, user space software first reads source side
> > mdev device's version. e.g.
> > "#cat \
> > /sys/bus/pci/devices/\:00\:02.0/5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/mdev_type/version"
> > 00028086-193b-i915-GVTg_V5_4
> > 
> > 2.
> > Then, user space software writes the source side returned version string
> > to device version attribute in target side, and checks the return value.
> > If a negative errno is returned in the target side, then mdev devices in
> > source and target sides are not compatible;
> > If a positive number is returned and it equals to the length of written
> > string, then the two mdev devices in source and target side are compatible.
> > e.g.
> > (a) compatibility case
> > "# echo 00028086-193b-i915-GVTg_V5_4 >
> > /sys/bus/pci/devices/\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/mdev_type/version"
> > 
> > (b) incompatibility case
> > "#echo 00028086-193b-i915-GVTg_V5_1 >
> > /sys/bus/pci/devices/\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/mdev_type/version"
> > -bash: echo: write error: Invalid argument
> 
> What you have written here seems to imply that each mdev type is able to
> support many different versions at the same time. Writing a version into
> this sysfs file then chooses which of the many versions to actually use.
> 
> This is good as it allows for live migration across driver software upgrades.
> 
> A mgmt application may well want to know what versions are supported for an
> mdev type *before* starting a migration. A mgmt app can query all the 100's
> of hosts it knows and thus figure out which are valid to use as the target
> of a migration.
> 
> IOW, we want to avoid the ever hitting the incompatibility case in the
> first place, by only choosing to migrate to a host that we know is going
> to be compatible.
> 
> This would need some kind of way to report the full list of supported
> versions against the mdev supported types on the host.

What would be the typical scenario / use case for the mgmt layer to query the
version information? Do they expect this to be done completely offline, as long
as the vendor driver is installed on each host?

Thanks,
Neo

> 
> 
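
For reference, the read-then-write check described in the patch could look
roughly like this from the management side; the sysfs paths are whatever the
mdev device exposes (as in the examples above), and this is a sketch of the
proposed protocol rather than an existing tool:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int mdev_migration_compatible(const char *src_version_attr,
                                     const char *dst_version_attr)
{
    char version[256] = {0};
    int fd, ret;

    /* 1. Read the version string of the source-side mdev device. */
    fd = open(src_version_attr, O_RDONLY);
    if (fd < 0)
        return -errno;              /* no version attr: no live migration */
    ret = read(fd, version, sizeof(version) - 1);
    close(fd);
    if (ret <= 0)
        return -EINVAL;

    /* 2. Write it to the target side; a failed or short write means the
     *    two mdev devices are not compatible. */
    fd = open(dst_version_attr, O_WRONLY);
    if (fd < 0)
        return -errno;
    ret = write(fd, version, strlen(version));
    close(fd);

    return (ret == (int)strlen(version)) ? 0 : -EINVAL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src version attr> <dst version attr>\n",
                argv[0]);
        return 1;
    }
    return mdev_migration_compatible(argv[1], argv[2]) ? 2 : 0;
}
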



Re: [Qemu-devel] [PATCH v3 0/5] Add migration support for VFIO device

2019-02-20 Thread Neo Jia
On Thu, Feb 21, 2019 at 05:52:53AM +, Tian, Kevin wrote:
> > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > Sent: Thursday, February 21, 2019 1:25 PM
> > 
> > On 2/20/2019 3:52 PM, Dr. David Alan Gilbert wrote:
> > > * Kirti Wankhede (kwankh...@nvidia.com) wrote:
> > >> Add migration support for VFIO device
> > >
> > > Hi Kirti,
> > >   Can you explain how this compares and works with Yan Zhao's
> > > set?
> > 
> > This patch set is incremental version of my previous patch set:
> > https://patchwork.ozlabs.org/cover/1000719/
> > This takes care of the feedbacks received on previous version.
> > 
> > This patch set is different than Yan Zhao's set.
> > 
> 
> I can help give some background about Yan's work:
> 
> There was a big gap between Kirti's last version and the overall review
> comments, especially this one:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg576652.html

Hi Kevin,

> 
> Then there was no reply from Kirti whether she agreed with the comments
> and was working on a new version.

Sorry, we should have acked those comments when we received them last time.

> 
> Then we think we should jump in to keep the ball moving, based on
> a fresh implementation according to recommended direction, i.e. focusing
> on device state management instead of sticking to migration flow in kernel
> API design.
> 
> and also more importantly we provided kernel side implementation based
> on Intel GVT-g to give the whole picture of both user/kernel side changes.
> That should give people a better understanding of how those new APIs
> are expected to be used by Qemu, and to be implemented by vendor driver.
> 
> That is why Yan just shared her work.

Really glad to see that the v2 version works for you guys; thanks for the
driver-side changes.

> 
> Now it's great to see that Kirti is still actively working on this effort and 
> is
> also moving toward the right direction. Let's have a close look at two
> implementations and then choose a cleaner one as base for future
> enhancements. :-)

Yes, the v3 has addressed all the comments / concerns raised on the v2; I think
we should take a look and keep moving.

Just a quick thought - would it be possible / better to have Kirti focus on the
QEMU patches and Yan take care of the GVT-g kernel driver side changes? This
would give us the best testing coverage. Hope I don't step on anybody's toes
here. ;-)

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [PATCH v14 00/22] Add Mediated device support

2016-11-17 Thread Neo Jia
On Thu, Nov 17, 2016 at 02:25:15PM -0700, Alex Williamson wrote:
> On Thu, 17 Nov 2016 02:16:12 +0530
> Kirti Wankhede  wrote:
> > 
> >  Documentation/ABI/testing/sysfs-bus-vfio-mdev |  111 ++
> >  Documentation/vfio-mediated-device.txt|  399 +++
> >  MAINTAINERS   |9 +
> >  drivers/vfio/Kconfig  |1 +
> >  drivers/vfio/Makefile |1 +
> >  drivers/vfio/mdev/Kconfig |   17 +
> >  drivers/vfio/mdev/Makefile|5 +
> >  drivers/vfio/mdev/mdev_core.c |  385 +++
> >  drivers/vfio/mdev/mdev_driver.c   |  119 ++
> >  drivers/vfio/mdev/mdev_private.h  |   41 +
> >  drivers/vfio/mdev/mdev_sysfs.c|  286 +
> >  drivers/vfio/mdev/vfio_mdev.c |  180 +++
> >  drivers/vfio/pci/vfio_pci.c   |   83 +-
> >  drivers/vfio/platform/vfio_platform_common.c  |   31 +-
> >  drivers/vfio/vfio.c   |  340 +-
> >  drivers/vfio/vfio_iommu_type1.c   |  872 +++---
> >  include/linux/mdev.h  |  177 +++
> >  include/linux/vfio.h  |   32 +-
> >  include/uapi/linux/vfio.h |   10 +
> >  samples/vfio-mdev/Makefile|   13 +
> >  samples/vfio-mdev/mtty.c  | 1503 +
> >  21 files changed, 4358 insertions(+), 257 deletions(-)
> >  create mode 100644 Documentation/ABI/testing/sysfs-bus-vfio-mdev
> >  create mode 100644 Documentation/vfio-mediated-device.txt
> >  create mode 100644 drivers/vfio/mdev/Kconfig
> >  create mode 100644 drivers/vfio/mdev/Makefile
> >  create mode 100644 drivers/vfio/mdev/mdev_core.c
> >  create mode 100644 drivers/vfio/mdev/mdev_driver.c
> >  create mode 100644 drivers/vfio/mdev/mdev_private.h
> >  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
> >  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
> >  create mode 100644 include/linux/mdev.h
> >  create mode 100644 samples/vfio-mdev/Makefile
> >  create mode 100644 samples/vfio-mdev/mtty.c
> 
> As discussed, I dropped patch 12, updated the documentation, and added
> 'retries' initialization.  This is now applied to my next branch for
> v4.10.  Thanks to the reviewers and Kirti and Neo for your hard work!

Really appreciate your help and the reviews that allowed us to get here, and
thanks to the various reviewers for their comments and suggestions!

Thanks,
Neo


> Thanks,
> 
> Alex



Re: [Qemu-devel] [PATCH 1/2] KVM: page track: add a new notifier type: track_flush_slot

2016-10-14 Thread Neo Jia
On Fri, Oct 14, 2016 at 10:51:24AM -0600, Alex Williamson wrote:
> On Fri, 14 Oct 2016 09:35:45 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:
> > > On Fri, 14 Oct 2016 08:41:58 -0600
> > > Alex Williamson <alex.william...@redhat.com> wrote:
> > >   
> > > > On Fri, 14 Oct 2016 18:37:45 +0800
> > > > Jike Song <jike.s...@intel.com> wrote:
> > > >   
> > > > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 11/10/2016 11:21, Xiao Guangrong wrote:  
> > > > > >>
> > > > > >>
> > > > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:  
> > > > > >>>
> > > > > >>>
> > > > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:  
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:  
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:  
> > > > > >>>>>>> Hi Neo,
> > > > > >>>>>>>
> > > > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the 
> > > > > >>>>>>> PPGTT,
> > > > > >>>>>>> while nVidia does.  
> > > > > >>>>>>
> > > > > >>>>>> Hi Paolo and Xiaoguang,
> > > > > >>>>>>
> > > > > >>>>>> I am just wondering how device driver can register a notifier 
> > > > > >>>>>> so he
> > > > > >>>>>> can be
> > > > > >>>>>> notified for write-protected pages when writes are happening.  
> > > > > >>>>>> 
> > > > > >>>>>
> > > > > >>>>> It can't yet, but the API is ready for that.  
> > > > > >>>>> kvm_vfio_set_group is
> > > > > >>>>> currently where a struct kvm_device* and struct vfio_group* 
> > > > > >>>>> touch.
> > > > > >>>>> Given
> > > > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be 
> > > > > >>>>> passed to
> > > > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a 
> > > > > >>>>> callback
> > > > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > > > >>>>>
> > > > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it 
> > > > > >>>>> briefly
> > > > > >>>>> at KVM Forum but I don't remember the details.  
> > > > > >>>>
> > > > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that 
> > > > > >>>> we can
> > > > > >>>> figure out the kvm instance based on the fd.
> > > > > >>>>
> > > > > >>>> We got a new idea, how about search the kvm instance by 
> > > > > >>>> mm_struct, it
> > > > > >>>> can work as KVMGT is running in the vcpu context and it is much 
> > > > > >>>> more
> > > > > >>>> straightforward.  
> > > > > >>>
> > > > > >>> Perhaps I didn't understand your suggestion, but the same 
> > > > > >>> mm_struct can
> > > > > >>> have more than 1 struct kvm so I'm not sure that it can work. 
> > > > > >>>  
> > > > > >>
> > > > > >> vcpu->pid is valid during vcpu running so that it can be used to 
> > > > > >> figure
> > > > > >> out which kvm instance owns the vcpu whose pid is the one as 
> > > > > >> current
> > > > > >> thread, i think it can work. :)  
> > > > > > 
> > > > > > No

Re: [Qemu-devel] [PATCH 1/2] KVM: page track: add a new notifier type: track_flush_slot

2016-10-14 Thread Neo Jia
On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:
> On Fri, 14 Oct 2016 08:41:58 -0600
> Alex Williamson <alex.william...@redhat.com> wrote:
> 
> > On Fri, 14 Oct 2016 18:37:45 +0800
> > Jike Song <jike.s...@intel.com> wrote:
> > 
> > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:  
> > > > 
> > > > 
> > > > On 11/10/2016 11:21, Xiao Guangrong wrote:
> > > >>
> > > >>
> > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:
> > > >>>
> > > >>>
> > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:
> > > >>>>>>> Hi Neo,
> > > >>>>>>>
> > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the 
> > > >>>>>>> PPGTT,
> > > >>>>>>> while nVidia does.
> > > >>>>>>
> > > >>>>>> Hi Paolo and Xiaoguang,
> > > >>>>>>
> > > >>>>>> I am just wondering how device driver can register a notifier so he
> > > >>>>>> can be
> > > >>>>>> notified for write-protected pages when writes are happening.
> > > >>>>>
> > > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > > >>>>> Given
> > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed 
> > > >>>>> to
> > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a 
> > > >>>>> callback
> > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > >>>>>
> > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it 
> > > >>>>> briefly
> > > >>>>> at KVM Forum but I don't remember the details.
> > > >>>>
> > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we 
> > > >>>> can
> > > >>>> figure out the kvm instance based on the fd.
> > > >>>>
> > > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > > >>>> straightforward.
> > > >>>
> > > >>> Perhaps I didn't understand your suggestion, but the same mm_struct 
> > > >>> can
> > > >>> have more than 1 struct kvm so I'm not sure that it can work.
> > > >>
> > > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > > >> out which kvm instance owns the vcpu whose pid is the one as current
> > > >> thread, i think it can work. :)
> > > > 
> > > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > > multiple VMs.
> > > > 
> > > > Passing file descriptors around are the right way to connect 
> > > > subsystems.
> > > 
> > > [CC Alex, Kevin and Qemu-devel]
> > > 
> > > Hi Paolo & Alex,
> > > 
> > > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > > on the correct direction, I'll send the split ones. Thanks!
> > > 
> > > --
> > > Thanks,
> > > Jike
> > > 
> > > 
> > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > index bec694c..f715d37 100644
> > > --- a/hw/vfio/pci-quirks.c
> > > +++ b/hw/vfio/pci-quirks.c
> > > @@ -10,12 +10,14 @@
> > >   * the COPYING file in the top-level directory.
> > >   */
> > >  
> > > +#include 
> > >  #include "qemu/osde

Re: [Qemu-devel] summary of current vfio mdev upstreaming status

2016-09-29 Thread Neo Jia
On Thu, Sep 29, 2016 at 05:05:47PM +0800, Xiao Guangrong wrote:
> 
> 
> On 09/29/2016 04:55 PM, Jike Song wrote:
> > Hi all,
> > 
> > In order to have a clear understanding about the VFIO mdev upstreaming
> > status, I'd like to summarize it. Please share your opinions on this,
> > and correct my misunderstandings.
> > 
> > The whole vfio mdev series can be logically divided into several parts,
> > they work together to provide the mdev support.
> 
> I think what Jike want to suggest is how about partially push/develop the
> mdev. As jike listed, there are some parts can be independent and they have
> mostly been agreed.
> 
> Such development plan can make the discussion be much efficient in the
> community. Also it make the possibility that Intel, Nvdia, IBM can focus
> on different parts and co-develop it.

Hi Guangrong,

JFYI, we are preparing the v8 patches to accommodate most of the comments we
have discussed so far, and we will also include several things that we have
decided on for sysfs.

I definitely would like to see more interactive discussions, especially on the
sysfs class front, from the Intel folks.

Regarding the patch development, and given the current status (especially where
we are and what we have been through), I am very confident that we can fully
handle this ourselves, but thanks for offering help anyway!

We should be able to react as fast as possible based on the public mailing list
discussions, so again I don't think that part is an issue.

Thanks,
Neo

> 
> The maintainer can hold these development patches in local branch before
> pushing the full-functionality version to upstream.
> 
> Thanks!
> 
> 



Re: [Qemu-devel] summary of current vfio mdev upstreaming status

2016-09-29 Thread Neo Jia
On Thu, Sep 29, 2016 at 04:55:39PM +0800, Jike Song wrote:
> Hi all,
> 
> In order to have a clear understanding about the VFIO mdev upstreaming
> status, I'd like to summarize it. Please share your opinions on this,
> and correct my misunderstandings.
> 
> The whole vfio mdev series can be logically divided into several parts,
> they work together to provide the mdev support.

Hi Jike,

Thanks for summarizing this, but I will defer to Kirti to comment on the actual
upstream status of her patches; a couple of things to note though:

1) The iommu type1 patches have been extensively reviewed by Alex already, and
we have one action item left to implement, which is already queued up in the v8
patchset.

2) Regarding the sysfs interface and the libvirt discussion, I would like to
hear what kind of attributes the Intel folks have so far, as Daniel is asking
about adding a class "gpu" which would make several attributes mandatory.

Thanks,
Neo

> 
> 
> 
> PART 1: mdev core driver
> 
>   [task]
>   -   the mdev bus/device support
>   -   the utilities of mdev lifecycle management
>   -   the physical device register/unregister interfaces
> 
>   [status]
>   -   basically agreed by community
> 
> 
> PART 2: vfio bus driver for mdev
> 
>   [task]
>   -   interfaces with vendor drivers
>   -   the vfio bus implementation
> 
>   [status]
> 
>   -   basically agreed by community
> 
> 
> PART 3: iommu support for mdev
> 
>   [task]
>   -   iommu support for mdev
> 
>   [status]
>   -   Kirti's v7 implementation, not yet fully reviewed
> 
> 
> PART 4: sysfs interfaces for mdev
> 
>   [task]
>   -   define the hierarchy of minimal sysfs directories/files
>   -   check the validity from vendor drivers, init/de-init 
> them
>   [status]
>   -   interfaces are in discussion
> 
> 
> PART 6: Documentation
> 
>   [task]
>   -   clearly document the architecture and interfaces
>   -   coding example for vendor drivers
> 
>   [status]
>   -   N/A
> 
> 
> What I'm curious here is 'PART 4', which is needed by other parts to
> perform further steps, is it possible to accelerate the process somehow? :-)
> 
> 
> --
> Thanks,
> Jike
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [Qemu-devel] [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-29 Thread Neo Jia
On Thu, Sep 29, 2016 at 09:03:40AM +0100, Daniel P. Berrange wrote:
> On Wed, Sep 28, 2016 at 12:22:35PM -0700, Neo Jia wrote:
> > On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> > > On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:
> > > > On Thu, 22 Sep 2016 09:41:20 +0530
> > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > 
> > > > > >>>>> My concern is that a type id seems arbitrary but we're 
> > > > > >>>>> specifying that
> > > > > >>>>> it be unique.  We already have something unique, the name.  So 
> > > > > >>>>> why try
> > > > > >>>>> to make the type id unique as well?  A vendor can accidentally 
> > > > > >>>>> create
> > > > > >>>>> their vendor driver so that a given name means something very
> > > > > >>>>> specific.  On the other hand they need to be extremely 
> > > > > >>>>> deliberate to
> > > > > >>>>> coordinate that a type id means a unique thing across all their 
> > > > > >>>>> product
> > > > > >>>>> lines.
> > > > > >>>>>   
> > > > > >>>>
> > > > > >>>> Let me clarify, type id should be unique in the list of
> > > > > >>>> mdev_supported_types. You can't have 2 directories in with same 
> > > > > >>>> name.
> > > > > >>>
> > > > > >>> Of course, but does that mean it's only unique to the machine I'm
> > > > > >>> currently running on?  Let's say I have a Tesla P100 on my system 
> > > > > >>> and
> > > > > >>> type-id 11 is named "GRID-M60-0B".  At some point in the future I
> > > > > >>> replace the Tesla P100 with a Q1000 (made up).  Is type-id 11 on 
> > > > > >>> that
> > > > > >>> new card still going to be a "GRID-M60-0B"?  If not then we've 
> > > > > >>> based
> > > > > >>> our XML on the wrong attribute.  If the new device does not 
> > > > > >>> support
> > > > > >>> "GRID-M60-0B" then we should generate an error, not simply 
> > > > > >>> initialize
> > > > > >>> whatever type-id 11 happens to be on this new card.
> > > > > >>> 
> > > > > >>
> > > > > >> If there are 2 M60 in the system then you would find '11' type 
> > > > > >> directory
> > > > > >> in mdev_supported_types of both M60. If you have P100, '11' type 
> > > > > >> would
> > > > > >> not be there in its mdev_supported_types, it will have different 
> > > > > >> types.
> > > > > >>
> > > > > >> For example, if you replace M60 with P100, but XML is not updated. 
> > > > > >> XML
> > > > > >> have type '11'. When libvirt would try to create mdev device, 
> > > > > >> libvirt
> > > > > >> would have to find 'create' file in sysfs in following directory 
> > > > > >> format:
> > > > > >>
> > > > > >>  --- mdev_supported_types
> > > > > >>  |-- 11
> > > > > >>  |   |-- create
> > > > > >>
> > > > > >> but now for P100, '11' directory is not there, so libvirt should 
> > > > > >> throw
> > > > > >> error on not able to find '11' directory.  
> > > > > > 
> > > > > > This really seems like an accident waiting to happen.  What happens
> > > > > > when the user replaces their M60 with an Intel XYZ device that 
> > > > > > happens
> > > > > > to expose a type 11 mdev class gpu device?  How is libvirt supposed 
> > > > > > to
> > > > > > know that the XML used to refer to a GRID-M60-0B and now it's an
> > > > > > INTEL-IGD-XYZ?  Doesn't basing the XML entry on the name and 
> > > > > > removing
> > > > > > yet another arbitrary requirement that we have some sort of globally
> > > > > > unique type-id database make a lot of sense?  The same issue applies
> > > > > > for simple debug-ability, if I'm reviewing the XML for a domain and 
> > > > > > the
> > > > > > name is the primary index for the mdev device, I know what it is.
> > > > > > Seeing type-id='11' is meaningless.
> > > > > >  
> > > > > 
> > > > > Let me clarify again, type '11' is a string that vendor driver would
> > > > > define (see my previous reply below) it could be "11" or 
> > > > > "GRID-M60-0B".
> > > > > If 2 vendors used same string we can't control that. right?
> > > > > 
> > > > > 
> > > > > >>>> Lets remove 'id' from type id in XML if that is the concern. 
> > > > > >>>> Supported
> > > > > >>>> types is going to be defined by vendor driver, so let vendor 
> > > > > >>>> driver
> > > > > >>>> decide what to use for directory name and same should be used in 
> > > > > >>>> device
> > > > > >>>> xml file, it could be '11' or "GRID M60-0B":
> > > > > >>>>
> > > > > >>>> 
> > > > > >>>>   my-vgpu
> > > > > >>>>   pci__86_00_0
> > > > > >>>>   
> > > > > >>>> 

Re: [Qemu-devel] [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Neo Jia
On Wed, Sep 28, 2016 at 04:31:25PM -0400, Laine Stump wrote:
> On 09/28/2016 03:59 PM, Neo Jia wrote:
> > On Wed, Sep 28, 2016 at 07:45:38PM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Thursday, September 29, 2016 3:23 AM
> > > > 
> > > > On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> > > > > On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:
> > > > > > On Thu, 22 Sep 2016 09:41:20 +0530
> > > > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > > > 
> > > > > > > > > > > > My concern is that a type id seems arbitrary but we're 
> > > > > > > > > > > > specifying that
> > > > > > > > > > > > it be unique.  We already have something unique, the 
> > > > > > > > > > > > name.  So why try
> > > > > > > > > > > > to make the type id unique as well?  A vendor can 
> > > > > > > > > > > > accidentally create
> > > > > > > > > > > > their vendor driver so that a given name means 
> > > > > > > > > > > > something very
> > > > > > > > > > > > specific.  On the other hand they need to be extremely 
> > > > > > > > > > > > deliberate to
> > > > > > > > > > > > coordinate that a type id means a unique thing across 
> > > > > > > > > > > > all their product
> > > > > > > > > > > > lines.
> > > > > > > > > > > > 
> > > > > > > > > > > Let me clarify, type id should be unique in the list of
> > > > > > > > > > > mdev_supported_types. You can't have 2 directories in 
> > > > > > > > > > > with same name.
> > > > > > > > > > Of course, but does that mean it's only unique to the 
> > > > > > > > > > machine I'm
> > > > > > > > > > currently running on?  Let's say I have a Tesla P100 on my 
> > > > > > > > > > system and
> > > > > > > > > > type-id 11 is named "GRID-M60-0B".  At some point in the 
> > > > > > > > > > future I
> > > > > > > > > > replace the Tesla P100 with a Q1000 (made up).  Is type-id 
> > > > > > > > > > 11 on that
> > > > > > > > > > new card still going to be a "GRID-M60-0B"?  If not then 
> > > > > > > > > > we've based
> > > > > > > > > > our XML on the wrong attribute.  If the new device does not 
> > > > > > > > > > support
> > > > > > > > > > "GRID-M60-0B" then we should generate an error, not simply 
> > > > > > > > > > initialize
> > > > > > > > > > whatever type-id 11 happens to be on this new card.
> > > > > > > > > > 
> > > > > > > > > If there are 2 M60 in the system then you would find '11' 
> > > > > > > > > type directory
> > > > > > > > > in mdev_supported_types of both M60. If you have P100, '11' 
> > > > > > > > > type would
> > > > > > > > > not be there in its mdev_supported_types, it will have 
> > > > > > > > > different types.
> > > > > > > > > 
> > > > > > > > > For example, if you replace M60 with P100, but XML is not 
> > > > > > > > > updated. XML
> > > > > > > > > have type '11'. When libvirt would try to create mdev device, 
> > > > > > > > > libvirt
> > > > > > > > > would have to find 'create' file in sysfs in following 
> > > > > > > > > directory format:
> > > > > > > > > 
> > > > > > > > >   --- mdev_supported_types
> > > > > > > > >   |-- 11
> > > > > > > > >   |   |-- create
> > > > > > > > > 
> > > > > > > > > but now for P100, '11' directory is 

Re: [Qemu-devel] [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Neo Jia
On Wed, Sep 28, 2016 at 01:55:47PM -0600, Alex Williamson wrote:
> On Wed, 28 Sep 2016 12:22:35 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> > > On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:  
> > > > On Thu, 22 Sep 2016 09:41:20 +0530
> > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > >   
> > > > > >>>>> My concern is that a type id seems arbitrary but we're 
> > > > > >>>>> specifying that
> > > > > >>>>> it be unique.  We already have something unique, the name.  So 
> > > > > >>>>> why try
> > > > > >>>>> to make the type id unique as well?  A vendor can accidentally 
> > > > > >>>>> create
> > > > > >>>>> their vendor driver so that a given name means something very
> > > > > >>>>> specific.  On the other hand they need to be extremely 
> > > > > >>>>> deliberate to
> > > > > >>>>> coordinate that a type id means a unique thing across all their 
> > > > > >>>>> product
> > > > > >>>>> lines.
> > > > > >>>>> 
> > > > > >>>>
> > > > > >>>> Let me clarify, type id should be unique in the list of
> > > > > >>>> mdev_supported_types. You can't have 2 directories in with same 
> > > > > >>>> name.  
> > > > > >>>
> > > > > >>> Of course, but does that mean it's only unique to the machine I'm
> > > > > >>> currently running on?  Let's say I have a Tesla P100 on my system 
> > > > > >>> and
> > > > > >>> type-id 11 is named "GRID-M60-0B".  At some point in the future I
> > > > > >>> replace the Tesla P100 with a Q1000 (made up).  Is type-id 11 on 
> > > > > >>> that
> > > > > >>> new card still going to be a "GRID-M60-0B"?  If not then we've 
> > > > > >>> based
> > > > > >>> our XML on the wrong attribute.  If the new device does not 
> > > > > >>> support
> > > > > >>> "GRID-M60-0B" then we should generate an error, not simply 
> > > > > >>> initialize
> > > > > >>> whatever type-id 11 happens to be on this new card.
> > > > > >>>   
> > > > > >>
> > > > > >> If there are 2 M60 in the system then you would find '11' type 
> > > > > >> directory
> > > > > >> in mdev_supported_types of both M60. If you have P100, '11' type 
> > > > > >> would
> > > > > >> not be there in its mdev_supported_types, it will have different 
> > > > > >> types.
> > > > > >>
> > > > > >> For example, if you replace M60 with P100, but XML is not updated. 
> > > > > >> XML
> > > > > >> have type '11'. When libvirt would try to create mdev device, 
> > > > > >> libvirt
> > > > > >> would have to find 'create' file in sysfs in following directory 
> > > > > >> format:
> > > > > >>
> > > > > >>  --- mdev_supported_types
> > > > > >>  |-- 11
> > > > > >>  |   |-- create
> > > > > >>
> > > > > >> but now for P100, '11' directory is not there, so libvirt should 
> > > > > >> throw
> > > > > >> error on not able to find '11' directory.
> > > > > > 
> > > > > > This really seems like an accident waiting to happen.  What happens
> > > > > > when the user replaces their M60 with an Intel XYZ device that 
> > > > > > happens
> > > > > > to expose a type 11 mdev class gpu device?  How is libvirt supposed 
> > > > > > to
> > > > > > know that the XML used to refer to a GRID-M60-0B and now it's an
> > > > > > INTEL-IGD-XYZ?  Doesn't basing the XML entry on the name and 
> > > > > > removing
> > > > > > yet another arbitrary requirement that we have some sort of globally
> > > > > > unique type-id database make a lot of sense?  The same issue applies
> > > > > > for simple debug-ability, if I'm reviewing the XML for a domain and 
> > > > > > the
> > > > > > name is the primary index for the mdev device, I know what it is.
> > > > > > Seeing type-id='11' is meaningless.
> > > > > >
> > > > > 
> > > > > Let me clarify again, type '11' is a string that vendor driver would
> > > > > define (see my previous reply below) it could be "11" or 
> > > > > "GRID-M60-0B".
> > > > > If 2 vendors used same string we can't control that. right?
> > > > > 
> > > > >   
> > > > > >>>> Lets remove 'id' from type id in XML if that is the concern. 
> > > > > >>>> Supported
> > > > > >>>> types is going to be defined by vendor driver, so let vendor 
> > > > > >>>> driver
> > > > > >>>> decide what to use for directory name and same should be used in 
> > > > > >>>> device
> > > > > >>>> xml file, it could be '11' or "GRID M60-0B":
> > > > > >>>>
> > > > > >>>> 
> > > > > >>>>   my-vgpu
> > > > > >>>>   pci__86_00_0
> > > > > >>>>   
> > > > > >>>> 

Re: [Qemu-devel] [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Neo Jia
On Wed, Sep 28, 2016 at 07:45:38PM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Thursday, September 29, 2016 3:23 AM
> > 
> > On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> > > On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:
> > > > On Thu, 22 Sep 2016 09:41:20 +0530
> > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > >
> > > > > >>>>> My concern is that a type id seems arbitrary but we're 
> > > > > >>>>> specifying that
> > > > > >>>>> it be unique.  We already have something unique, the name.  So 
> > > > > >>>>> why try
> > > > > >>>>> to make the type id unique as well?  A vendor can accidentally 
> > > > > >>>>> create
> > > > > >>>>> their vendor driver so that a given name means something very
> > > > > >>>>> specific.  On the other hand they need to be extremely 
> > > > > >>>>> deliberate to
> > > > > >>>>> coordinate that a type id means a unique thing across all their 
> > > > > >>>>> product
> > > > > >>>>> lines.
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> Let me clarify, type id should be unique in the list of
> > > > > >>>> mdev_supported_types. You can't have 2 directories in with same 
> > > > > >>>> name.
> > > > > >>>
> > > > > >>> Of course, but does that mean it's only unique to the machine I'm
> > > > > >>> currently running on?  Let's say I have a Tesla P100 on my system 
> > > > > >>> and
> > > > > >>> type-id 11 is named "GRID-M60-0B".  At some point in the future I
> > > > > >>> replace the Tesla P100 with a Q1000 (made up).  Is type-id 11 on 
> > > > > >>> that
> > > > > >>> new card still going to be a "GRID-M60-0B"?  If not then we've 
> > > > > >>> based
> > > > > >>> our XML on the wrong attribute.  If the new device does not 
> > > > > >>> support
> > > > > >>> "GRID-M60-0B" then we should generate an error, not simply 
> > > > > >>> initialize
> > > > > >>> whatever type-id 11 happens to be on this new card.
> > > > > >>>
> > > > > >>
> > > > > >> If there are 2 M60 in the system then you would find '11' type 
> > > > > >> directory
> > > > > >> in mdev_supported_types of both M60. If you have P100, '11' type 
> > > > > >> would
> > > > > >> not be there in its mdev_supported_types, it will have different 
> > > > > >> types.
> > > > > >>
> > > > > >> For example, if you replace M60 with P100, but XML is not updated. 
> > > > > >> XML
> > > > > >> have type '11'. When libvirt would try to create mdev device, 
> > > > > >> libvirt
> > > > > >> would have to find 'create' file in sysfs in following directory 
> > > > > >> format:
> > > > > >>
> > > > > >>  --- mdev_supported_types
> > > > > >>  |-- 11
> > > > > >>  |   |-- create
> > > > > >>
> > > > > >> but now for P100, '11' directory is not there, so libvirt should 
> > > > > >> throw
> > > > > >> error on not able to find '11' directory.
> > > > > >
> > > > > > This really seems like an accident waiting to happen.  What happens
> > > > > > when the user replaces their M60 with an Intel XYZ device that 
> > > > > > happens
> > > > > > to expose a type 11 mdev class gpu device?  How is libvirt supposed 
> > > > > > to
> > > > > > know that the XML used to refer to a GRID-M60-0B and now it's an
> > > > > > INTEL-IGD-XYZ?  Doesn't basing the XML entry on the name and 
> > > > > > removing
> > > > > > yet another arbitrary requirement that we have some sort of globally
> > > > > > unique type-id database make a lot of sense?  The same issue applies
> > > > > > for simple debug-ability, if I'm reviewing the XML for a domain and 
> > > > > > the
> > > > > > name is the primary index for the mdev device, I know what it is.
> > > > > > Seeing type-id='11' is meaningless.
> > > > > >
> > > > >
> > > > > Let me clarify again, type '11' is a string that vendor driver would
> > > > > define (see my previous reply below) it could be "11" or 
> > > > > "GRID-M60-0B".
> > > > > If 2 vendors used same string we can't control that. right?
> > > > >
> > > > >
> > > > > >>>> Lets remove 'id' from type id in XML if that is the concern. 
> > > > > >>>> Supported
> > > > > >>>> types is going to be defined by vendor driver, so let vendor 
> > > > > >>>> driver
> > > > > >>>> decide what to use for directory name and same should be used in 
> > > > > >>>> device
> > > > > >>>> xml file, it could be '11' or "GRID M60-0B":
> > > > > >>>>
> > > > > >>>> 
> > > > > >>>>   my-vgpu
> > > > > >>>>   pci__86_00_0
> > > > > >>>>   
> > > > > >>>> 

Re: [Qemu-devel] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Neo Jia
On Tue, Sep 20, 2016 at 10:47:53AM +0100, Daniel P. Berrange wrote:
> On Tue, Sep 20, 2016 at 02:05:52AM +0530, Kirti Wankhede wrote:
> > 
> > Hi libvirt experts,
> > 
> > Thanks for valuable input on v1 version of RFC.
> > 
> > Quick brief, VFIO based mediated device framework provides a way to
> > virtualize their devices without SR-IOV, like NVIDIA vGPU, Intel KVMGT
> > and IBM's channel IO. This framework reuses VFIO APIs for all the
> > functionalities for mediated devices which are currently being used for
> > pass through devices. This framework introduces a set of new sysfs files
> > for device creation and its life cycle management.
> > 
> > Here is the summary of discussion on v1:
> > 1. Discover mediated device:
> > As part of physical device initialization process, vendor driver will
> > register their physical devices, which will be used to create virtual
> > device (mediated device, aka mdev) to the mediated framework.
> > 
> > Vendor driver should specify mdev_supported_types in directory format.
> > This format is class based, for example, display class directory format
> > should be as below. We need to define such set for each class of devices
> > which would be supported by mediated device framework.
> > 
> >  --- mdev_destroy
> >  --- mdev_supported_types
> >  |-- 11
> >  |   |-- create
> >  |   |-- name
> >  |   |-- fb_length
> >  |   |-- resolution
> >  |   |-- heads
> >  |   |-- max_instances
> >  |   |-- params
> >  |   |-- requires_group
> >  |-- 12
> >  |   |-- create
> >  |   |-- name
> >  |   |-- fb_length
> >  |   |-- resolution
> >  |   |-- heads
> >  |   |-- max_instances
> >  |   |-- params
> >  |   |-- requires_group
> >  |-- 13
> >  |-- create
> >  |-- name
> >  |-- fb_length
> >  |-- resolution
> >  |-- heads
> >  |-- max_instances
> >  |-- params
> >  |-- requires_group
> > 
> > 
> > In the above example directory '11' represents a type id of mdev device.
> > 'name', 'fb_length', 'resolution', 'heads', 'max_instance' and
> > 'requires_group' would be Read-Only files that vendor would provide to
> > describe about that type.
> > 
> > 'create':
> > Write-only file. Mandatory.
> > Accepts string to create mediated device.
> > 
> > 'name':
> > Read-Only file. Mandatory.
> > Returns string, the name of that type id.
> 
> Presumably this is a human-targetted title/description of
> the device.
> 
> > 
> > 'fb_length':
> > Read-only file. Mandatory.
> > Returns {K,M,G}, size of framebuffer.
> > 
> > 'resolution':
> > Read-Only file. Mandatory.
> > Returns 'hres x vres' format. Maximum supported resolution.
> > 
> > 'heads':
> > Read-Only file. Mandatory.
> > Returns integer. Number of maximum heads supported.
> 
> None of these should be mandatory as that makes the mdev
> useless for non-GPU devices.
> 
> I'd expect to see a 'class' or 'type' attribute in the
> directory whcih tells you what kind of mdev it is. A
> valid 'class' value would be 'gpu'. The fb_length,
> resolution, and heads parameters would only be mandatory
> when class==gpu.
> 

Hi Daniel,

Here you are proposing to add a class named "gpu", which will make all those
GPU-related attributes mandatory, so that libvirt can allow the user to better
parse/present a particular mdev configuration?

I am just wondering if there is another option where we make all the attributes
that an mdev device can have optional, but still meaningful to libvirt, so
libvirt can still parse / recognize them as a class "mdev".

In general, I am just trying to understand the requirement from libvirt and see
how we can fit this requirement for both Intel and NVIDIA, since Intel is also
moving to the type-based interface although they don't have the "class" concept
yet.

Thanks,
Neo

> > 'max_instance':
> > Read-Only file. Mandatory.
> > Returns integer.  Returns maximum mdev device could be created
> > at the moment when this file is read. This count would be updated by
> > vendor driver. Before creating mdev device of this type, check if
> > max_instance is > 0.
> > 
> > 'params'
> > Write-Only file. Optional.
> > String input. Libvirt would pass the string given in XML file to
> > this file and then create mdev device. Set empty string to clear params.
> > For example, set parameter 'frame_rate_limiter=0' to disable frame rate
> > limiter for performance benchmarking, then create device of type 11. The
> > device created would have that parameter set by vendor driver.
> 
> Nope, libvirt will explicitly *NEVER* allow arbitrary opaque
> passthrough of vendor specific data in this way.
> 
> > The parent device would look like:
> > 
> >
> >  pci__86_00_0
> >  
> >0
> >134
> >0
> >0
> >
> >  
> >  
> >
> >GRID M60-0B
> >512M
> >

Re: [Qemu-devel] [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Neo Jia
On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:
> > On Thu, 22 Sep 2016 09:41:20 +0530
> > Kirti Wankhede  wrote:
> > 
> > > > My concern is that a type id seems arbitrary but we're specifying 
> > > > that
> > > > it be unique.  We already have something unique, the name.  So why 
> > > > try
> > > > to make the type id unique as well?  A vendor can accidentally 
> > > > create
> > > > their vendor driver so that a given name means something very
> > > > specific.  On the other hand they need to be extremely deliberate to
> > > > coordinate that a type id means a unique thing across all their 
> > > > product
> > > > lines.
> > > >   
> > > 
> > >  Let me clarify, type id should be unique in the list of
> > >  mdev_supported_types. You can't have 2 directories in with same 
> > >  name.
> > > >>>
> > > >>> Of course, but does that mean it's only unique to the machine I'm
> > > >>> currently running on?  Let's say I have a Tesla P100 on my system and
> > > >>> type-id 11 is named "GRID-M60-0B".  At some point in the future I
> > > >>> replace the Tesla P100 with a Q1000 (made up).  Is type-id 11 on that
> > > >>> new card still going to be a "GRID-M60-0B"?  If not then we've based
> > > >>> our XML on the wrong attribute.  If the new device does not support
> > > >>> "GRID-M60-0B" then we should generate an error, not simply initialize
> > > >>> whatever type-id 11 happens to be on this new card.
> > > >>> 
> > > >>
> > > >> If there are 2 M60 in the system then you would find '11' type 
> > > >> directory
> > > >> in mdev_supported_types of both M60. If you have P100, '11' type would
> > > >> not be there in its mdev_supported_types, it will have different types.
> > > >>
> > > >> For example, if you replace M60 with P100, but XML is not updated. XML
> > > >> have type '11'. When libvirt would try to create mdev device, libvirt
> > > >> would have to find 'create' file in sysfs in following directory 
> > > >> format:
> > > >>
> > > >>  --- mdev_supported_types
> > > >>  |-- 11
> > > >>  |   |-- create
> > > >>
> > > >> but now for P100, '11' directory is not there, so libvirt should throw
> > > >> error on not able to find '11' directory.  
> > > > 
> > > > This really seems like an accident waiting to happen.  What happens
> > > > when the user replaces their M60 with an Intel XYZ device that happens
> > > > to expose a type 11 mdev class gpu device?  How is libvirt supposed to
> > > > know that the XML used to refer to a GRID-M60-0B and now it's an
> > > > INTEL-IGD-XYZ?  Doesn't basing the XML entry on the name and removing
> > > > yet another arbitrary requirement that we have some sort of globally
> > > > unique type-id database make a lot of sense?  The same issue applies
> > > > for simple debug-ability, if I'm reviewing the XML for a domain and the
> > > > name is the primary index for the mdev device, I know what it is.
> > > > Seeing type-id='11' is meaningless.
> > > >  
> > > 
> > > Let me clarify again, type '11' is a string that vendor driver would
> > > define (see my previous reply below) it could be "11" or "GRID-M60-0B".
> > > If 2 vendors used same string we can't control that. right?
> > > 
> > > 
> > >  Lets remove 'id' from type id in XML if that is the concern. 
> > >  Supported
> > >  types is going to be defined by vendor driver, so let vendor driver
> > >  decide what to use for directory name and same should be used in 
> > >  device
> > >  xml file, it could be '11' or "GRID M60-0B":
> > > 
> > >  
> > >    my-vgpu
> > >    pci__86_00_0
> > >    
> > >  

Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-09-08 Thread Neo Jia
On Thu, Sep 08, 2016 at 04:09:39PM +0800, Jike Song wrote:
> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
> > +
> > +/**
> > + * struct parent_ops - Structure to be registered for each parent device to
> > + * register the device to mdev module.
> > + *
> > + * @owner: The module owner.
> > + * @dev_attr_groups:   Default attributes of the parent device.
> > + * @mdev_attr_groups:  Default attributes of the mediated device.
> > + * @supported_config:  Called to get information about supported types.
> > + * @dev : device structure of parent device.
> > + * @config: should return string listing supported config
> > + * Returns integer: success (0) or error (< 0)
> > + * @create:Called to allocate basic resources in parent 
> > device's
> > + * driver for a particular mediated device. It is
> > + * mandatory to provide create ops.
> > + * @mdev: mdev_device structure on of mediated device
> > + *   that is being created
> > + * @mdev_params: extra parameters required by parent
> > + * device's driver.
> > + * Returns integer: success (0) or error (< 0)
> > + * @destroy:   Called to free resources in parent device's 
> > driver for a
> > + * a mediated device. It is mandatory to provide destroy
> > + * ops.
> > + * @mdev: mdev_device device structure which is being
> > + *destroyed
> > + * Returns integer: success (0) or error (< 0)
> > + * If VMM is running and destroy() is called that means the
> > + * mdev is being hotunpluged. Return error if VMM is
> > + * running and driver doesn't support mediated device
> > + * hotplug.
> > + * @reset: Called to reset mediated device.
> > + * @mdev: mdev_device device structure.
> > + * Returns integer: success (0) or error (< 0)
> > + * @set_online_status: Called to change to status of mediated device.
> > + * @mdev: mediated device.
> > + * @online: set true or false to make mdev device online or
> > + * offline.
> > + * Returns integer: success (0) or error (< 0)
> > + * @get_online_status: Called to get online/offline status of  
> > mediated device
> > + * @mdev: mediated device.
> > + * @online: Returns status of mediated device.
> > + * Returns integer: success (0) or error (< 0)
> > + * @read:  Read emulation callback
> > + * @mdev: mediated device structure
> > + * @buf: read buffer
> > + * @count: number of bytes to read
> > + * @pos: address.
> > + * Retuns number on bytes read on success or error.
> > + * @write: Write emulation callback
> > + * @mdev: mediated device structure
> > + * @buf: write buffer
> > + * @count: number of bytes to be written
> > + * @pos: address.
> > + * Returns number of bytes written on success or error.
> > + * @get_irq_info:  Called to retrieve information about mediated device IRQ
> > + * @mdev: mediated device structure
> > + * @irq_info: VFIO IRQ flags and count.
> > + * Returns integer: success (0) or error (< 0)
> > + * @set_irqs:  Called to send interrupt configuration
> > + * information that the VMM sets.
> > + * @mdev: mediated device structure
> > + * @flags, index, start, count and *data : same as that of
> > + * struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> > + * @get_device_info:   Called to get VFIO device information for a mediated
> > + * device.
> > + * @vfio_device_info: VFIO device info.
> > + * Returns integer: success (0) or error (< 0)
> > + * @get_region_info:   Called to get VFIO region size and flags of mediated
> > + * device.
> > + * @mdev: mediated device structure
> > + * @region_info: output, returns size and flags of
> > + *   requested region.
> > + * @cap_type_id: returns id of capability.
> > + * @cap_type: returns pointer to capability structure
> > + * corresponding to capability id.
> > + * Returns integer: success (0) or error (< 0)
> > + *
> > + * A parent device that supports mediated devices should be registered with
> > + * the mdev module with a parent_ops structure.
> > + */
> > +
> > +struct parent_ops {
> > +   struct module   *owner;
> > +   const struct 

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Neo Jia
On Wed, Sep 07, 2016 at 07:27:19PM +0100, Daniel P. Berrange wrote:
> On Wed, Sep 07, 2016 at 11:17:39AM -0700, Neo Jia wrote:
> > On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
> > > On Wed, 7 Sep 2016 21:45:31 +0530
> > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > 
> > > > To hot-plug an mdev device into a domain that already has an mdev
> > > > device assigned, the mdev device should be created with the same group
> > > > number as the existing devices and then hot-plugged. If there is no mdev
> > > > device in that domain, then the group number should be a unique number.
> > > > 
> > > > This simplifies the mdev grouping and also provides flexibility for
> > > > vendor driver implementation.
> > > 
> > > The 'start' operation for NVIDIA mdev devices allocate peer-to-peer
> > > resources between mdev devices.  Does this not represent some degree of
> > > an isolation hole between those devices?  Will peer-to-peer DMA between
> > > devices honor the guest IOVA when mdev devices are placed into separate
> > > address spaces, such as possible with vIOMMU?
> > 
> > Hi Alex,
> > 
> > In reality, the p2p operation will only work within the same translation domain.
> > 
> > As we are discussing the multiple-mdev-per-VM use cases, I think we probably
> > should not limit it to just the p2p operation.
> > 
> > So, in general, the NVIDIA vGPU device model's requirement is to know/register
> > all mdevs per VM before opening any of those mdev devices.
> 
> It concerns me that if we bake this rule into the sysfs interface,
> then it feels like we're making life very hard for future support
> for hotplug / unplug of mdevs to running VMs.

Hi Daniel,

I don't think the grouping will stop anybody from supporting hotplug / unplug, at
least from a syntax point of view.

> 
> Conversely, if we can solve the hotplug/unplug problem, then we
> potentially would not need this grouping concept.

I think Kirti has also mentioned hotplug support in her proposal; would you mind
commenting on that thread so I can see whether I have missed anything?

Thanks,
Neo

> 
> I'd hate us to do all this complex work to group multiple mdevs per
> VM only to throw it away later when hotplug support is made to
> work.
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Neo Jia
On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
> On Wed, 7 Sep 2016 21:45:31 +0530
> Kirti Wankhede  wrote:
> 
> > To hot-plug an mdev device into a domain that already has an mdev
> > device assigned, the mdev device should be created with the same group
> > number as the existing devices and then hot-plugged. If there is no mdev
> > device in that domain, then the group number should be a unique number.
> > 
> > This simplifies the mdev grouping and also provides flexibility for
> > vendor driver implementation.
> 
> The 'start' operation for NVIDIA mdev devices allocate peer-to-peer
> resources between mdev devices.  Does this not represent some degree of
> an isolation hole between those devices?  Will peer-to-peer DMA between
> devices honor the guest IOVA when mdev devices are placed into separate
> address spaces, such as possible with vIOMMU?

Hi Alex,

In reality, the p2p operation will only work within the same translation domain.

As we are discussing the multiple-mdev-per-VM use cases, I think we probably
should not limit it to just the p2p operation.

So, in general, the NVIDIA vGPU device model's requirement is to know/register
all mdevs per VM before opening any of those mdev devices.

> 
> I don't particularly like the iommu group solution either, which is why
> in my latest proposal I've given the vendor driver a way to indicate
> this grouping is required so more flexible mdev devices aren't
> restricted by this.  But the limited knowledge I have of the hardware
> configuration which imposes this restriction on NVIDIA devices seems to
> suggest that iommu grouping of these sets is appropriate.  The vfio-core
> infrastructure is almost entirely built for managing vfio group, which
> are just a direct mapping of iommu groups.  So the complexity of iommu
> groups is already handled.  Adding a new layer of grouping into mdev
> seems like it's increasing the complexity further, not decreasing it.

I really appreciate your thoughts on this issue and your consideration of how the
NVIDIA vGPU device model works, but so far I still feel we are borrowing a very
meaningful concept, the "iommu group", to solve a device model issue, which I hope
can instead be worked around by a more independent piece of logic; that is why
Kirti is proposing the "mdev group".

Let's see if we can address your concerns / questions in Kirti's reply.

Thanks,
Neo

> Thanks,
> 
> Alex



Re: [Qemu-devel] [RFC v2 0/4] adding mdev bus and vfio support

2016-09-06 Thread Neo Jia
On Wed, Sep 07, 2016 at 10:22:26AM +0800, Jike Song wrote:
> On 09/02/2016 11:03 PM, Alex Williamson wrote:
> > On Fri,  2 Sep 2016 16:16:08 +0800
> > Jike Song  wrote:
> > 
> >> This patchset is based on NVidia's "Add Mediated device support" series, 
> >> version 6:
> >>
> >>http://www.spinics.net/lists/kvm/msg136472.html
> > 
> > 
> > Hi Jike,
> > 
> > I'm thrilled by your active participation here, but I'm confused which
> > versions I should be reviewing and where the primary development is
> > going.  Kirti sent v7 a week ago, so I would have expected a revision
> > based on that rather than a re-write based on v6 plus incorporation of a
> > few of Kirti's patches directly.
> 
> Hi Alex,
> 
> [Sorry! I replied to this on Monday but it was silently dropped by our
> firewall]
> 
> 
> 
> The v1 of this patchset was sent as incremental patches, based on Nvidia's v6,
> to demonstrate how it is possible and beneficial to:
> 
>   1, Introduce an independent device between physical and mdev;
>   2, Simplify vfio-mdev and make it the most flexible for vendor drivers;
> 
> Unfortunately neither was understood or adopted in v7:
> 
>   http://www.spinics.net/lists/kvm/msg137081.html
> 
> So here came the v2, as a standalone series, to give a whole and straight
> demonstration. The reason of still basing on v6:
> 
>   - Addressed all v6 comments (except the iommu part);
>   - There is no comments yet for v7 (except the sysfs ones);
> 
> 
> 
> > I liked the last version of these
> > changes a lot, but we need to figure out how to combine development
> > because we do not have infinite cycles for review available :-\  Thanks!
> 
> Fully understand.
> 
> Here is the dilemma: v6 is an obsolete version to work upon, v7 is still not
> at the direction we prefer. 

Hi Jike,

I wish I could have met you in person at KVM Forum a couple of weeks ago so we
could have had a better discussion.

We are trying our best to accommodate almost all requirements / comments from
use cases and code reviews while keeping architectural changes between revisions
to a minimum (or none at all).

> We would be highly glad and thankful if Neo/Kirti
> would adopt the code in their next version, which would certainly form a
> simpler and more consolidated base for future co-development; otherwise
> we could at least discuss the concerns, if any.
> 

As I have said in my previous response to you, if you have any questions about
adopting the framework that we have developed, you are very welcome to
comment/speak out on the code review thread like others. And if it is a
reasonable request and won't break other vendors' use cases, we will adopt it
(one example is the online file and the removal of the mdev PCI dependency).

Just an update for you regarding the v7 patches: currently we are actively
trying to lock down the sysfs and management interface discussion.

So, if you would like to make upstreaming happen sooner, please join us in the
v7 and follow-on patch discussion instead of rewriting those patches.

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 
> >>
> >>
> >> Key Changes from Nvidia v6:
> >>
> >>- Introduced an independent struct device to host device, thereby
> >>  formed a physical-host-mdev hierarchy, and highly reused Linux
> >>  driver core support;
> >>
> >>- Added online/offline to mdev_bus_type, leveraging the 'online'
> >>  attr support from Linux driver core;
> >>
> >>- Removed mdev_class and other unnecessary stuff;
> >>
> >>/*
> >> * Given the changes above, the code volume of mdev core driver
> >> * dramatically reduced by ~50%.
> >> */
> >>
> >>
> >>- Interfaces between vfio_mdev and vendor driver are high-level,
> >>  e.g. ioctl instead of get_irq_info/set_irq_info and reset,
> >>  start/stop became mdev oriented, etc.;
> >>
> >>/*
> >> * Given the changes above, the code volume of mdev core driver
> >> * dramatically reduced by ~64%.
> >> */
> >>
> >>
> >> Test
> >>
> >>- Tested with KVMGT
> >>
> >> TODO
> >>
> >>- Re-implement the attribute group of host device as long as the
> >>  sysfs hierarchy in discussion gets finalized;
> >>
> >>- Move common routines from current vfio-pci into a higher location,
> >>  export them for various VFIO bus drivers and/or mdev vendor drivers;
> >>
> >>- Add implementation examples for vendor drivers to Documentation;
> >>
> >>- Refine IOMMU changes
> >>
> >>
> >>
> >> Jike Song (2):
> >>   Mediated device Core driver
> >>   vfio: VFIO bus driver for MDEV devices
> >>
> >> Kirti Wankhede (2):
> >>   vfio iommu: Add support for mediated devices
> >>   docs: Add Documentation for Mediated devices
> >>
> >>  Documentation/vfio-mediated-device.txt | 203 ++
> >>  drivers/vfio/Kconfig   |   1 +
> >>  drivers/vfio/Makefile  |   1 +
> >>  drivers/vfio/mdev/Kconfig  |  18 ++
> >>  drivers/vfio/mdev/Makefile |   5 +
> 

Re: [Qemu-devel] [RFC v2 4/4] docs: Add Documentation for Mediated devices

2016-09-02 Thread Neo Jia
On Fri, Sep 02, 2016 at 05:09:46PM -0500, Eric Blake wrote:
> * PGP Signed by an unknown key
> 
> On 09/02/2016 03:16 AM, Jike Song wrote:
> > From: Kirti Wankhede <kwankh...@nvidia.com>
> > 
> > Add file Documentation/vfio-mediated-device.txt that include details of
> > mediated device framework.
> > 
> > Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> > Signed-off-by: Neo Jia <c...@nvidia.com>
> > Signed-off-by: Jike Song <jike.s...@intel.com>
> > ---
> >  Documentation/vfio-mediated-device.txt | 203 
> > +
> >  1 file changed, 203 insertions(+)
> >  create mode 100644 Documentation/vfio-mediated-device.txt
> > 
> > diff --git a/Documentation/vfio-mediated-device.txt 
> > b/Documentation/vfio-mediated-device.txt
> > new file mode 100644
> > index 000..39bdcd9
> > --- /dev/null
> > +++ b/Documentation/vfio-mediated-device.txt
> > @@ -0,0 +1,203 @@
> > +VFIO Mediated devices [1]
> > +---
> 
> Many files under Documentation trim the  decorator to the length of
> the line above.
> 
> Also, since you have no explicit copyright/license notice, your
> documentation is under GPLv2+ per the top level.  Other files do this,
> and if you are okay with it, I won't complain; but if you intended
> something else, or even if you wanted to make it explicit rather than
> implicit, then you may want to copy the example of files that call out a
> quick blurb on copyright and licensing.
> 

Hi Eric,

Thanks for the review, and really sorry about the extra email thread for this
review; we already have one that has been actively going on for a while, from
the RFC through the current v7.

http://www.spinics.net/lists/kvm/msg137208.html

And the related latest v7 document is at:

http://www.spinics.net/lists/kvm/msg137210.html

We will address all your review comments there.

Thanks,
Neo


> > +
> > +There are more and more use cases/demands to virtualize the DMA devices 
> > which
> > +doesn't have SR_IOV capability built-in. To do this, drivers of different
> 
> s/doesn't/don't/
> 
> > +devices had to develop their own management interface and set of APIs and 
> > then
> > +integrate it to user space software. We've identified common requirements 
> > and
> > +unified management interface for such devices to make user space software
> > +integration easier.
> > +
> > +The VFIO driver framework provides unified APIs for direct device access. 
> > It is
> > +an IOMMU/device agnostic framework for exposing direct device access to
> > +user space, in a secure, IOMMU protected environment. This framework is
> > +used for multiple devices like GPUs, network adapters and compute 
> > accelerators.
> > +With direct device access, virtual machines or user space applications have
> > +direct access of physical device. This framework is reused for mediated 
> > devices.
> > +
> > +Mediated core driver provides a common interface for mediated device 
> > management
> > +that can be used by drivers of different devices. This module provides a 
> > generic
> > +interface to create/destroy mediated device, add/remove it to mediated bus
> 
> s/mediated/a mediated/ twice
> 
> > +driver, add/remove device to IOMMU group. It also provides an interface to
> 
> s/add/and add/
> s/device to/a device to an/
> 
> > +register different types of bus drivers, for example, Mediated VFIO PCI 
> > driver
> > +is designed for mediated PCI devices and supports VFIO APIs. Similarly, 
> > driver
> 
> s/driver/the driver/
> 
> > +can be designed to support any type of mediated device and added to this
> > +framework. Mediated bus driver add/delete mediated device to VFIO Group.
> 
> Missing a verb and several articles, but I'm not sure what you meant.
> Maybe:
> 
> A mediated bus driver can add/delete mediated devices to a VFIO Group.
> 
> > +
> > +Below is the high level block diagram, with NVIDIA, Intel and IBM devices
> > +as examples, since these are the devices which are going to actively use
> > +this module as of now.
> > +
> 
> > +
> > +
> > +Registration Interfaces
> > +---
> > +
> 
> Again, rather long separator,
> 
> > +Mediated core driver provides two types of registration interfaces:
> > +
> > +1. Registration interface for mediated bus driver:
> > +-
> 
> while this 

Re: [Qemu-devel] [RFC v2 0/4] adding mdev bus and vfio support

2016-09-02 Thread Neo Jia
On Fri, Sep 02, 2016 at 09:03:52AM -0600, Alex Williamson wrote:
> On Fri,  2 Sep 2016 16:16:08 +0800
> Jike Song  wrote:
> 
> > This patchset is based on NVidia's "Add Mediated device support" series, 
> > version 6:
> > 
> > http://www.spinics.net/lists/kvm/msg136472.html
> 
> 
> Hi Jike,
> 
> I'm thrilled by your active participation here, but I'm confused which
> versions I should be reviewing and where the primary development is
> going.  Kirti sent v7 a week ago, so I would have expected a revision
> based on that rather than a re-write based on v6 plus incorporation of a
> few of Kirti's patches directly.  I liked the last version of these
> changes a lot, but we need to figure out how to combine development
> because we do not have infinite cycles for review available :-\  Thanks!

Agree with Alex, and the primary development is on Kirti's v7 patches thread.

Jike, could you please join us in the existing code review thread?

I know you have already joined the recent sysfs discussion there, but I would
like to see your comments on the rest of the items so we know how best to
accommodate your requirements and needs in future revisions.

I believe that would be the best and fastest way to collaborate and that is the 
main purpose of having code review cycles.

Thanks,
Neo

> 
> Alex
> 
> > 
> > 
> > Key Changes from Nvidia v6:
> > 
> > - Introduced an independent struct device to host device, thereby
> >   formed a physical-host-mdev hierarchy, and highly reused Linux
> >   driver core support;
> > 
> > - Added online/offline to mdev_bus_type, leveraging the 'online'
> >   attr support from Linux driver core;
> > 
> > - Removed mdev_class and other unnecessary stuff;
> > 
> > /*
> >  * Given the changes above, the code volume of mdev core driver
> >  * dramatically reduced by ~50%.
> >  */
> > 
> > 
> > - Interfaces between vfio_mdev and vendor driver are high-level,
> >   e.g. ioctl instead of get_irq_info/set_irq_info and reset,
> >   start/stop became mdev oriented, etc.;
> > 
> > /*
> >  * Given the changes above, the code volume of mdev core driver
> >  * dramatically reduced by ~64%.
> >  */
> > 
> > 
> > Test
> > 
> > - Tested with KVMGT
> > 
> > TODO
> > 
> > - Re-implement the attribute group of host device as long as the
> >   sysfs hierarchy in discussion gets finalized;
> > 
> > - Move common routines from current vfio-pci into a higher location,
> >   export them for various VFIO bus drivers and/or mdev vendor drivers;
> > 
> > - Add implementation examples for vendor drivers to Documentation;
> > 
> > - Refine IOMMU changes
> > 
> > 
> > 
> > Jike Song (2):
> >   Mediated device Core driver
> >   vfio: VFIO bus driver for MDEV devices
> > 
> > Kirti Wankhede (2):
> >   vfio iommu: Add support for mediated devices
> >   docs: Add Documentation for Mediated devices
> > 
> >  Documentation/vfio-mediated-device.txt | 203 ++
> >  drivers/vfio/Kconfig   |   1 +
> >  drivers/vfio/Makefile  |   1 +
> >  drivers/vfio/mdev/Kconfig  |  18 ++
> >  drivers/vfio/mdev/Makefile |   5 +
> >  drivers/vfio/mdev/mdev_core.c  | 250 +
> >  drivers/vfio/mdev/mdev_driver.c| 155 ++
> >  drivers/vfio/mdev/mdev_private.h   |  29 ++
> >  drivers/vfio/mdev/mdev_sysfs.c | 155 ++
> >  drivers/vfio/mdev/vfio_mdev.c  | 187 
> >  drivers/vfio/vfio.c|  82 ++
> >  drivers/vfio/vfio_iommu_type1.c| 499 
> > +
> >  include/linux/mdev.h   | 159 +++
> >  include/linux/vfio.h   |  13 +-
> >  14 files changed, 1709 insertions(+), 48 deletions(-)
> >  create mode 100644 Documentation/vfio-mediated-device.txt
> >  create mode 100644 drivers/vfio/mdev/Kconfig
> >  create mode 100644 drivers/vfio/mdev/Makefile
> >  create mode 100644 drivers/vfio/mdev/mdev_core.c
> >  create mode 100644 drivers/vfio/mdev/mdev_driver.c
> >  create mode 100644 drivers/vfio/mdev/mdev_private.h
> >  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
> >  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
> >  create mode 100644 include/linux/mdev.h
> > 
> 



Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration

2016-08-21 Thread Neo Jia
On Fri, Aug 19, 2016 at 03:22:48PM -0400, Laine Stump wrote:
> On 08/18/2016 12:41 PM, Neo Jia wrote:
> > Hi libvirt experts,
> > 
> > I am starting this email thread to discuss the potential solution / 
> > proposal of
> > integrating vGPU support into libvirt for QEMU.
> 
> Thanks for the detailed description. This is very helpful.
> 
> 
> > 
> > Some quick background, NVIDIA is implementing a VFIO based mediated device
> > framework to allow people to virtualize their devices without SR-IOV, for
> > example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing 
> > the
> > VFIO API to process the memory / interrupt as what QEMU does today with 
> > passthru
> > device.
> > 
> > The difference here is that we are introducing a set of new sysfs file for
> > virtual device discovery and life cycle management due to its virtual 
> > nature.
> > 
> > Here is the summary of the sysfs, when they will be created and how they 
> > should
> > be used:
> > 
> > 1. Discover mediated device
> > 
> > As part of physical device initialization process, vendor driver will 
> > register
> > their physical devices, which will be used to create virtual device 
> > (mediated
> > device, aka mdev) to the mediated framework.
> 
> 
> We've discussed this question offline, but I just want to make sure I
> understood correctly - all initialization of the physical device on the host
> is already handled "elsewhere", so libvirt doesn't need to be concerned with
> any physical device lifecycle or configuration (setting up the number or
> types of vGPUs), correct? 

Hi Laine,

Yes, that is right, at least for NVIDIA vGPU.

> Do you think this would also be the case for other
> vendors using the same APIs? I guess this all comes down to whether or not
> the setup of the physical device is defined within the bounds of the common
> infrastructure/API, or if it's something that's assumed to have just
> magically happened somewhere else.

I would assume that is the case for other vendors as well, although this common
infrastructure doesn't put any restrictions on the physical device setup or
initialization, so a vendor can choose to defer some of it until the point when
the virtual device gets created.

But if we just look at it from the level of the API exposed to libvirt, it is
the vendor driver's responsibility to ensure that the virtual device will be
available within a reasonable amount of time after the "online" sysfs file is
set to 1. Where the HW setup happens is not enforced by this common API.

In the NVIDIA case, once our kernel driver registers the physical devices it
owns with the "common infrastructure", those physical devices are already fully
initialized and ready for virtual device creation.
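
For illustration only (not part of the proposal), here is a minimal sketch of how
a management tool could drive that "online" contract; the paths follow the sysfs
layout described in this thread, and the error handling is an assumption:

# hedged sketch: bring an mdev online and read back its state
mdev=/sys/bus/mdev/devices/$mdev_UUID        # mdev sysfs path per this proposal
if echo 1 > "$mdev/online"; then
    state=$(cat "$mdev/online")              # expected to read back "1"
    echo "mdev $mdev_UUID online state: $state"
else
    echo "failed to bring $mdev_UUID online" >&2
fi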

> 
> 
> > 
> > Then, the sysfs file "mdev_supported_types" will be available under the 
> > physical
> > device sysfs, and it will indicate the supported mdev and configuration for 
> > this
> > particular physical device, and the content may change dynamically based on 
> > the
> > system's current configurations, so libvirt needs to query this file every 
> > time
> > before create a mdev.
> 
> I had originally thought that libvirt would be setting up and managing a
> pool of virtual devices, similar to what we currently do with SRIOV VFs. But
> from this it sounds like the management of this pool is completely handled
> by your drivers (especially since the contents of the pool can apparently
> completely change at any instant). In one way that makes life easier for
> libvirt, because it doesn't need to manage anything.

The pool (vGPU type availability) is only subject to change when virtual
devices get created or destroyed, as for now we don't support heterogeneous vGPU
types on the same physical GPU. Even if we add such support in the future, the
point of change is still the same.

> 
> On the other hand, it makes thing less predictable. For example, when
> libvirt defines a domain, it queries the host system to see what types of
> devices are legal in guests on this host, and expects those devices to be
> available at a later time. As I understand it (and I may be completely
> wrong), when no vGPUs are running on the hardware, there is a choice of
> several different models of vGPU (like the example you give below), but when
> the first vGPU is started up, that triggers the host driver to restrict the
> available models. If that's the case, then a particular vGPU could be
> "available" when a domain is defined, but not an option by the time the
> domain is started. That's not a show stopper, but

Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration

2016-08-21 Thread Neo Jia
On Fri, Aug 19, 2016 at 02:42:27PM +0200, Michal Privoznik wrote:
> On 18.08.2016 18:41, Neo Jia wrote:
> > Hi libvirt experts,
> 
> Hi, welcome to the list.
> 
> > 
> > I am starting this email thread to discuss the potential solution / 
> > proposal of
> > integrating vGPU support into libvirt for QEMU.
> > 
> > Some quick background, NVIDIA is implementing a VFIO based mediated device
> > framework to allow people to virtualize their devices without SR-IOV, for
> > example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing 
> > the
> > VFIO API to process the memory / interrupt as what QEMU does today with 
> > passthru
> > device.
> 
> So as far as I understand, this is solely NVIDIA's API and other vendors
> (e.g. Intel) will use their own or is this a standard that others will
> comply to?

Hi Michal,

Based on the initial vGPU VFIO design discussion thread on the QEMU mailing
list, I believe this is what NVIDIA, Intel, and other companies will comply with.

(People from related parties are CC'ed in this email, such as Intel and IBM.)

As you know, I can't speak for Intel, so I would like to defer this question to
them, but above is my understanding based on the QEMU/KVM community discussions.

> 
> > 
> > The difference here is that we are introducing a set of new sysfs file for
> > virtual device discovery and life cycle management due to its virtual 
> > nature.
> > 
> > Here is the summary of the sysfs, when they will be created and how they 
> > should
> > be used:
> > 
> > 1. Discover mediated device
> > 
> > As part of physical device initialization process, vendor driver will 
> > register
> > their physical devices, which will be used to create virtual device 
> > (mediated
> > device, aka mdev) to the mediated framework.
> > 
> > Then, the sysfs file "mdev_supported_types" will be available under the 
> > physical
> > device sysfs, and it will indicate the supported mdev and configuration for 
> > this 
> > particular physical device, and the content may change dynamically based on 
> > the
> > system's current configurations, so libvirt needs to query this file every 
> > time
> > before create a mdev.
> 
> Ah, that was gonna be my question. Because in the example below, you
> used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create". And I
> was wondering where does the number 20 come from. Now what I am
> wondering about is how libvirt should expose these to users. Moreover,
> how it should let users to chose.
> We have a node device driver where I guess we could expose possible
> options and then require some explicit value in the domain XML (but what
> value would that be? I don't think taking vgpu_type_id-s as they are
> would be a great idea).

Right, the vgpu_type_id is just a handle for a given type of vGPU device in the
NVIDIA case.  How about exposing the "vgpu_type", which is a meaningful name
for vGPU end users?

Also, when you say "let users to chose", do you mean exposing some virsh
command to allow users to dump their potential virtual devices and pick
one?
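
Purely as a sketch of what exposing "vgpu_type" could look like on the management
side (not an existing libvirt or virsh interface), a tool could map the
human-readable name back to the vgpu_type_id by parsing the proposed sysfs file;
the PCI address below is just the example device from this thread:

# hedged sketch: resolve "GRID M60-4Q" to its vgpu_type_id
types=/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
id=$(grep '"GRID M60-4Q"' "$types" | awk -F, '{gsub(/ /,"",$1); print $1}')
echo "GRID M60-4Q -> vgpu_type_id=$id"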

> 
> > 
> > Note: different vendors might have their own specific configuration sysfs as
> > well, if they don't have pre-defined types.
> > 
> > For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and 
> > here is
> > NVIDIA specific configuration on an idle system.
> > 
> > For example, to query the "mdev_supported_types" on this Tesla M60:
> > 
> > cat /sys/bus/pci/devices/:86:00.0/mdev_supported_types
> > # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> > max_resolution
> > 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
> > 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
> > 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
> > 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
> > 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
> > 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
> > 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
> > 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160
> > 
> > 2. Create/destroy mediated device
> > 
> > Two sysfs files are available under the physical device sysfs path : 
> > mdev_create
> > and mdev_destroy
> > 
> > Th

[Qemu-devel] [RFC] libvirt vGPU QEMU integration

2016-08-18 Thread Neo Jia
Hi libvirt experts,

I am starting this email thread to discuss the potential solution / proposal of
integrating vGPU support into libvirt for QEMU.

Some quick background: NVIDIA is implementing a VFIO-based mediated device
framework to allow people to virtualize their devices without SR-IOV, for
example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing the
VFIO API to handle memory and interrupts just as QEMU does today with a
passthrough device.

The difference here is that we are introducing a set of new sysfs files for
virtual device discovery and life-cycle management, due to the devices' virtual
nature.

Here is a summary of these sysfs files, when they are created, and how they
should be used:

1. Discover mediated device

As part of the physical device initialization process, the vendor driver
registers its physical devices with the mediated framework; these are then used
to create virtual devices (mediated devices, aka mdevs).

Then the sysfs file "mdev_supported_types" will be available under the physical
device's sysfs path. It indicates the supported mdev types and configuration for
this particular physical device, and its content may change dynamically based on
the system's current configuration, so libvirt needs to query this file every
time before creating an mdev.

Note: different vendors might have their own specific configuration sysfs as
well, if they don't have pre-defined types.

For example, we have an NVIDIA Tesla M60 registered at 86:00.0, and here is the
NVIDIA-specific configuration on an idle system.

For example, to query the "mdev_supported_types" on this Tesla M60:

cat /sys/bus/pci/devices/:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
max_resolution
11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160

2. Create/destroy mediated device

Two sysfs files are available under the physical device sysfs path : mdev_create
and mdev_destroy

The syntax of creating a mdev is:

echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_create

The syntax of destroying a mdev is:

echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_destroy

The $mdev_UUID is a unique identifier for this mdev device to be created, and it
is unique per system.

For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
the Tesla M60 output above) and a VM UUID to be passed as the
"vendor_specific_argument_list".

If no vendor-specific arguments are required, either "$mdev_UUID" or
"$mdev_UUID:" will be accepted as input syntax for the above two commands.

To create an M60-4Q device, libvirt needs to do:

echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
/sys/bus/pci/devices/\:86\:00.0/mdev_create

Then you will see a virtual device show up at:

/sys/bus/mdev/devices/$mdev_UUID/

For NVIDIA, to use multiple virtual devices per VM, they all have to be created
upfront before bringing any of them online.

Regarding error reporting and detection: on failure, a write() to the sysfs file
returns an error code, and a write to the sysfs file from a command prompt shows
the string corresponding to that error code.
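
Putting the create step and the error reporting together, a hedged sketch (the
UUIDs are placeholders and the type id is taken from the sample table above):

# hedged sketch: create an mdev and check the sysfs write for failure
mdev_UUID=$(uuidgen)
# $VM_UUID is the VM's UUID (a placeholder in this sketch)
create=/sys/bus/pci/devices/0000:86:00.0/mdev_create
if ! echo "$mdev_UUID:vgpu_type_id=17,vm_uuid=$VM_UUID" > "$create"; then
    echo "mdev_create failed for $mdev_UUID" >&2
    exit 1
fi
ls /sys/bus/mdev/devices/$mdev_UUID/    # the new virtual device shows up here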

3. Start/stop mediated device

Under the virtual device sysfs, you will see a new "online" sysfs file.

You can do "cat /sys/bus/mdev/devices/$mdev_UUID/online" to get the current
status of this virtual device (0 or 1), and to start or stop a virtual device
you can do:

echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online

libvirt needs to query the current state before changing state.

Note: if you have multiple devices, you need to write to each device's "online"
file individually.

For NVIDIA, if there are multiple mdevs per VM, libvirt needs to bring all of
them "online" before starting QEMU.

4. Launch QEMU/VM

Pass the mdev sysfs path to QEMU as vfio-pci device:

-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

5. Shutdown sequence 

libvirt needs to shut down QEMU, bring the virtual device offline, and then
destroy the virtual device.
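
As a sketch, with the QEMU shutdown itself left to whatever mechanism libvirt
already uses (paths and UUIDs as in the earlier examples):

# hedged sketch of the teardown order: stop QEMU first, then offline, then destroy
echo 0 > /sys/bus/mdev/devices/$mdev_UUID/online
echo "$mdev_UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev_destroy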

6. VM Reset

No change or requirement for libvirt, as this will be handled via the VFIO reset
API, and the QEMU process will keep running as before.

7. Hot-plug

It is optional for vendors to support hot-plug.

The same syntax is used to create a virtual device for hot-plug.

For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
needs to write to the "destroy" sysfs to

Re: [Qemu-devel] [RFC v6-based v1 0/5] refine mdev framework

2016-08-17 Thread Neo Jia
On Wed, Aug 17, 2016 at 04:58:14PM +0800, Dong Jia wrote:
> On Tue, 16 Aug 2016 16:14:12 +0800
> Jike Song  wrote:
> 
> > 
> > This patchset is based on NVidia's "Add Mediated device support" series, 
> > version 6:
> > 
> > http://www.spinics.net/lists/kvm/msg136472.html
> > 
> > 
> > Background:
> > 
> > The patchset from NVidia introduced the Mediated Device support to
> > Linux/VFIO. With that series, one can create virtual devices (supporting
> > by underlying physical device and vendor driver), and assign them to
> > userspace like QEMU/KVM, in the same way as device assignment via VFIO.
> > 
> > Based on that, NVidia and Intel implemented their vGPU solutions, IBM
> > implemented its CCW pass-through.  However, there are limitations
> > imposed by current (v6 in particular) mdev framework: the mdev must be
> > represented as a PCI device, several vfio capabilities such as
> > sparse mmap are not possible, and so forth.
> > 
> > This series aims to address above limitations and simplify the 
> > implementation.
> > 
> > 
> > Key Changes:
> > 
> > - An independent "struct device" was introduced to parent_device, thus
> >   a hierarchy in driver core is formed with physical device, parent 
> > device
> >   and mdev device;
> > 
> > - Leveraging the mechanism and APIs provided by Linux driver core, it
> >   is now safe to remove all refcnts and locks;
> > 
> > - vfio_mpci (later renamed to vfio_mdev) was made BUS-agnostic: all
> >   PCI-specific logic was removed, accesses from userspace are now
> >   passed to vendor driver directly, thus guaranteed that full VFIO
> >   capabilities provided: e.g. dynamic regions, sparse mmap, etc.;
> > 
> >   With vfio_mdev being BUS-agnostic, it is enough to have only one
> >   driver for all mdev devices;
> 
> Hi Jike:
> 
> I don't know what happened, but finding out which direction this will
> likely go seems my first priority now...

Hi Dong,

Just want to let you know that we are preparing the v7 patches to incorporate
the latest review comments from the Intel folks and Alex; some changes in this
patch set that were also mentioned in the recent review are already queued up in
the new version.

> 
> I'd say, either with only the original mdev v6, or patched this series,
> vfio-ccw could live. But this series saves my work of mimicing the
> vfio-mpci code in my vfio-mccw driver. I like this incremental patches.

Thanks for sharing your progress; it's good to know our current v6 solution works
for you. We are still evaluating the vfio_mdev changes here, as I still prefer to
share the general VFIO PCI handling inside a common VFIO PCI driver; that
modularization will reduce the impact of future changes and potential regressions
across architectures, between PCI and CCW.

Thanks,
Neo

> 
> I'm wondering if Alex and the Nvidia folks have some comments for this
> from their point of views. And I'm really looking forward on their
> feedback.
> 
> Thanks!
> 
> > 
> > - UUID was removed from the interface between mdev and vendor driver;
> > 
> > 
> > TODO
> > 
> > remove mdev stuff from vfio.h
> > update doc
> > 
> 
> 
> 
> Dong Jia
> 



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-16 Thread Neo Jia
On Tue, Aug 16, 2016 at 02:51:03PM -0600, Alex Williamson wrote:
> On Tue, 16 Aug 2016 13:30:06 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> > > On Mon, 15 Aug 2016 12:59:08 -0700
> > > Neo Jia <c...@nvidia.com> wrote:
> > >   
> > > > > > >
> > > > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > > > probably want to start/stop them individually.  Actually, why is 
> > > > > > > it
> > > > > > > that we can't use the mediated device being opened and released to
> > > > > > > automatically signal to the backend vendor driver to commit and 
> > > > > > > release
> > > > > > > resources? I don't fully understand why userspace needs this 
> > > > > > > interface.
> > > > >   
> > > 
> > > That doesn't give an individual user the ability to stop and start
> > > their devices though, because in order for a user to have write
> > > permissions there, they get permission to DoS other users by pumping
> > > arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> > > have mdev level granularity of granting start/stop privileges.  Really
> > > though, do we want QEMU fumbling around through sysfs or do we want an
> > > interface through the vfio API to perform start/stop?  Thanks,  
> > 
> > Hi Alex,
> > 
> > I think those two suggestions make sense, so we will move the "start/stop"
> > under the mdev sysfs.
> > 
> > This will be incorporated in our next v7 patch, and doing so will make
> > the locking scheme easier.
> 
> Thanks Neo.  Also note that the semantics change when we move to per
> device control.  It would be redundant to 'echo $UUID' into a start
> file which only controls a single device.  So that means we probably
> just want an 'echo 1'.  But if we can 'echo 1' then we can also 'echo
> 0', so we can reduce this to a single sysfs file.  Sysfs already has a
> common interface for starting and stopping devices, the "online" file.
> So I think we should probably move in that direction.  Additionally, an
> "online" file should support a _show() function, so if we have an Intel
> vGPU that perhaps does not need start/stop support, online could report
> "1" after create to show that it's already online, possibly even
> generate an error trying to change the online state.  Thanks,

Agreed. We will adopt a similar syntax and support the _show() function.
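
As a sketch of the direction agreed above, usage on the management side would
reduce to a single per-device file (paths as in this thread):

# hedged sketch: per-mdev "online" replaces the global mdev_start/mdev_stop
echo 1 > /sys/bus/mdev/devices/$UUID/online    # start / commit resources
cat /sys/bus/mdev/devices/$UUID/online         # _show(): reports 0 or 1
echo 0 > /sys/bus/mdev/devices/$UUID/online    # stop / release resources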

Thanks,
Neo

> 
> Alex



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-16 Thread Neo Jia
On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and 
> > > > > release
> > > > > resources? I don't fully understand why userspace needs this 
> > > > > interface.  
> > > 
> 
> That doesn't give an individual user the ability to stop and start
> their devices though, because in order for a user to have write
> permissions there, they get permission to DoS other users by pumping
> arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> have mdev level granularity of granting start/stop privileges.  Really
> though, do we want QEMU fumbling around through sysfs or do we want an
> interface through the vfio API to perform start/stop?  Thanks,

Hi Alex,

I think those two suggestions make sense, so we will move the "start/stop"
under the mdev sysfs.

This will be incorporated in our next v7 patch, and doing so will make
the locking scheme easier.

Thanks,
Neo

> 
> Alex



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-16 Thread Neo Jia
On Tue, Aug 16, 2016 at 05:58:54AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, August 16, 2016 1:44 PM
> > 
> > On Tue, Aug 16, 2016 at 04:52:30AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 12:17 PM
> > > >
> > > > On Tue, Aug 16, 2016 at 03:50:44AM +, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > > > >
> > > > > > On Tue, Aug 16, 2016 at 12:30:25AM +, Tian, Kevin wrote:
> > > > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > > > >
> > > > > > > > > >
> > > > > > > > > > For NVIDIA vGPU solution we need to know all devices 
> > > > > > > > > > assigned to a VM
> > in
> > > > > > > > > > one shot to commit resources of all vGPUs assigned to a VM 
> > > > > > > > > > along with
> > > > > > > > > > some common resources.
> > > > > > > > >
> > > > > > > > > Kirti, can you elaborate the background about above one-shot 
> > > > > > > > > commit
> > > > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > > > >
> > > > > > > > > As I relied in another mail, I really hope start/stop become 
> > > > > > > > > a per-mdev
> > > > > > > > > attribute instead of global one, e.g.:
> > > > > > > > >
> > > > > > > > > echo "0/1" >
> > > > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > > > >
> > > > > > > > > In many scenario the user space client may only want to talk 
> > > > > > > > > to mdev
> > > > > > > > > instance directly, w/o need to contact its parent device. 
> > > > > > > > > Still take
> > > > > > > > > live migration for example, I don't think Qemu wants to know 
> > > > > > > > > parent
> > > > > > > > > device of assigned mdev instances.
> > > > > > > >
> > > > > > > > Hi Kevin,
> > > > > > > >
> > > > > > > > Having a global /sys/class/mdev/mdev_start doesn't require 
> > > > > > > > anybody to
> > know
> > > > > > > > parent device. you can just do
> > > > > > > >
> > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > > > >
> > > > > > > > without knowing the parent device.
> > > > > > > >
> > > > > > >
> > > > > > > You can look at some existing sysfs example, e.g.:
> > > > > > >
> > > > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > > > >
> > > > > > > You may also argue why not using a global style:
> > > > > > >
> > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > > > >
> > > > > > > There are many similar examples...
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > My response above is to your question about using the global sysfs 
> > > > > > entry as you
> > > > > > don't want to have the global path because
> > > > > >
> > > > > > "I don't think Qemu wants to know parent device of assigned mdev 
> > > > > > instances.".
> > > > > >
> > > > > > So I just want to confirm with you that (in case you miss):
> > > &

Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Tue, Aug 16, 2016 at 04:52:30AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, August 16, 2016 12:17 PM
> > 
> > On Tue, Aug 16, 2016 at 03:50:44AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > >
> > > > On Tue, Aug 16, 2016 at 12:30:25AM +, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > >
> > > > > > > >
> > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned 
> > > > > > > > to a VM in
> > > > > > > > one shot to commit resources of all vGPUs assigned to a VM 
> > > > > > > > along with
> > > > > > > > some common resources.
> > > > > > >
> > > > > > > Kirti, can you elaborate the background about above one-shot 
> > > > > > > commit
> > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > >
> > > > > > > As I relied in another mail, I really hope start/stop become a 
> > > > > > > per-mdev
> > > > > > > attribute instead of global one, e.g.:
> > > > > > >
> > > > > > > echo "0/1" >
> > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > >
> > > > > > > In many scenario the user space client may only want to talk to 
> > > > > > > mdev
> > > > > > > instance directly, w/o need to contact its parent device. Still 
> > > > > > > take
> > > > > > > live migration for example, I don't think Qemu wants to know 
> > > > > > > parent
> > > > > > > device of assigned mdev instances.
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody 
> > > > > > to know
> > > > > > parent device. you can just do
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > >
> > > > > > or
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > >
> > > > > > without knowing the parent device.
> > > > > >
> > > > >
> > > > > You can look at some existing sysfs example, e.g.:
> > > > >
> > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > >
> > > > > You may also argue why not using a global style:
> > > > >
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > >
> > > > > There are many similar examples...
> > > >
> > > > Hi Kevin,
> > > >
> > > > My response above is to your question about using the global sysfs 
> > > > entry as you
> > > > don't want to have the global path because
> > > >
> > > > "I don't think Qemu wants to know parent device of assigned mdev 
> > > > instances.".
> > > >
> > > > So I just want to confirm with you that (in case you miss):
> > > >
> > > > /sys/class/mdev/mdev_start | mdev_stop
> > > >
> > > > doesn't require the knowledge of parent device.
> > > >
> > >
> > > Qemu is just one example, where your explanation of parent device
> > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > instance, so any mdev attributes touched by Qemu should be put under
> > > that node (e.g. start/stop for live migration usage as I explained 
> > > earlier).
> > 
> > Exactly, qemu is passed with the actual sysfs path.
> > 
> > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at 
> > all.
> > 
> > QEMU will take the sysfs path as input:
> > 
> >  -device
> > vfio-pci,sys

Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Tue, Aug 16, 2016 at 03:50:44AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, August 16, 2016 11:46 AM
> > 
> > On Tue, Aug 16, 2016 at 12:30:25AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > 
> > > > > >
> > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a 
> > > > > > VM in
> > > > > > one shot to commit resources of all vGPUs assigned to a VM along 
> > > > > > with
> > > > > > some common resources.
> > > > >
> > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > requirement? It's hard to understand such a requirement.
> > > > >
> > > > > As I relied in another mail, I really hope start/stop become a 
> > > > > per-mdev
> > > > > attribute instead of global one, e.g.:
> > > > >
> > > > > echo "0/1" > 
> > > > > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > >
> > > > > In many scenario the user space client may only want to talk to mdev
> > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > device of assigned mdev instances.
> > > >
> > > > Hi Kevin,
> > > >
> > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to 
> > > > know
> > > > parent device. you can just do
> > > >
> > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > >
> > > > or
> > > >
> > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > >
> > > > without knowing the parent device.
> > > >
> > >
> > > You can look at some existing sysfs example, e.g.:
> > >
> > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > >
> > > You may also argue why not using a global style:
> > >
> > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > >
> > > There are many similar examples...
> > 
> > Hi Kevin,
> > 
> > My response above is to your question about using the global sysfs entry as 
> > you
> > don't want to have the global path because
> > 
> > "I don't think Qemu wants to know parent device of assigned mdev 
> > instances.".
> > 
> > So I just want to confirm with you that (in case you miss):
> > 
> > /sys/class/mdev/mdev_start | mdev_stop
> > 
> > doesn't require the knowledge of parent device.
> > 
> 
> Qemu is just one example, where your explanation of parent device
> makes sense but still it's not good for Qemu to populate /sys/class/mdev
> directly. Qemu is passed with the actual sysfs path of assigned mdev
> instance, so any mdev attributes touched by Qemu should be put under 
> that node (e.g. start/stop for live migration usage as I explained earlier).

Exactly, QEMU is passed the actual sysfs path.

So QEMU doesn't touch the /sys/class/mdev/mdev_start | mdev_stop files at all.

QEMU takes the sysfs path as input:

 -device 
vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,id=vgpu0

As you say, in live migration QEMU needs to access "start" and "stop". Could you
please share more details, such as how QEMU accesses the "start" and "stop"
sysfs files, when, and what triggers that?

Thanks,
Neo

> 

> Thanks
> Kevin



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Tue, Aug 16, 2016 at 12:30:25AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, August 16, 2016 3:59 AM

> > > >
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.
> > >
> > > Kirti, can you elaborate the background about above one-shot commit
> > > requirement? It's hard to understand such a requirement.
> > >
> > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > attribute instead of global one, e.g.:
> > >
> > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > >
> > > In many scenario the user space client may only want to talk to mdev
> > > instance directly, w/o need to contact its parent device. Still take
> > > live migration for example, I don't think Qemu wants to know parent
> > > device of assigned mdev instances.
> > 
> > Hi Kevin,
> > 
> > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > parent device. you can just do
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > or
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > 
> > without knowing the parent device.
> > 
> 
> You can look at some existing sysfs example, e.g.:
> 
> echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> 
> You may also argue why not using a global style:
> 
> echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> 
> There are many similar examples...

Hi Kevin,

My response above is to your question about using the global sysfs entry, since
you don't want to have the global path because

"I don't think Qemu wants to know parent device of assigned mdev instances.".

So I just want to confirm with you (in case you missed it) that:

/sys/class/mdev/mdev_start | mdev_stop

doesn't require knowledge of the parent device.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:38:52AM +, Tian, Kevin wrote:
> > > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > 
> > > > 
> > > > 
> > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > >  
> > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > >>>
> > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to 
> > > > >>> get
> > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > >>> respectively.  That seems to imply that all of instances for a given
> > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > >>> Userspace tools can figure out which uuids to start for a given 
> > > > >>> user, I
> > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > >>>  
> > > > >>
> > > > >> Initially we started discussion with VM_UUID+instance suggestion, 
> > > > >> where
> > > > >> instance was introduced to support multiple devices in a VM.  
> > > > >
> > > > > The instance number was never required in order to support multiple
> > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > devices with that same UUID and therefore associate udev events to a
> > > > > given VM.  Only then does an instance number become necessary since 
> > > > > the
> > > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > > like a very dodgy solution when we should probably just be querying
> > > > > libvirt to give us a device to VM association.  
> > > 
> > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > for mdev in the basic design. It's bound to NVIDIA management stack too 
> > > tightly.
> > > 
> > > I'm OK to give enough flexibility for various upper level management 
> > > stacks,
> > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > option where either UUID or STRING could be optional? Upper management 
> > > stack can choose its own policy to identify a mdev:
> > > 
> > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > b) STRING only, which could be an index (0, 1, 2, ...), or any 
> > > combination 
> > > (vgpu0, vgpu1, etc.)
> > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > a numeric index
> > >   
> > > > >  
> > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources 
> > > > >> of
> > > > >> all instances of similar devices assigned to VM.
> > > > >>
> > > > >> For example, to create 2 devices:
> > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > >>
> > > > >> Commit resources for above devices with single 'mdev_start':
> > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > >> 'instance', so 'mdev_create' would look like:
> > > > >>
> > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > &

Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Mon, Aug 15, 2016 at 04:52:39PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 15:09:30 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> > > On Mon, 15 Aug 2016 09:38:52 +
> > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > >   
> > > > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > > 
> > > > > 
> > > > > 
> > > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > > >
> > > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > > >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > > > >>>
> > > > > >>> This is used later by mdev_device_start() and mdev_device_stop() 
> > > > > >>> to get
> > > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > > >>> respectively.  That seems to imply that all of instances for a 
> > > > > >>> given
> > > > > >>> uuid come from the same parent_device.  Where is that enforced?  
> > > > > >>> I'm
> > > > > >>> still having a hard time buying into the uuid+instance plan when 
> > > > > >>> it
> > > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > > >>> Userspace tools can figure out which uuids to start for a given 
> > > > > >>> user, I
> > > > > >>> don't see much value in collecting them to instances within a 
> > > > > >>> uuid.
> > > > > >>>
> > > > > >>
> > > > > >> Initially we started discussion with VM_UUID+instance suggestion, 
> > > > > >> where
> > > > > >> instance was introduced to support multiple devices in a VM.
> > > > > >
> > > > > > The instance number was never required in order to support multiple
> > > > > > devices in a VM, IIRC this UUID+instance scheme was to appease 
> > > > > > NVIDIA
> > > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > > devices with that same UUID and therefore associate udev events to a
> > > > > > given VM.  Only then does an instance number become necessary since 
> > > > > > the
> > > > > > UUID needs to be static for a vGPUs within a VM.  This has always 
> > > > > > felt
> > > > > > like a very dodgy solution when we should probably just be querying
> > > > > > libvirt to give us a device to VM association.
> > > > 
> > > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > > for mdev in the basic design. It's bound to NVIDIA management stack too 
> > > > tightly.
> > > > 
> > > > I'm OK to give enough flexibility for various upper level management 
> > > > stacks,
> > > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > > option where either UUID or STRING could be optional? Upper management 
> > > > stack can choose its own policy to identify a mdev:
> > > > 
> > > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > > b) STRING only, which could be an index (0, 1, 2, ...), or any 
> > > > combination 
> > > > (vgpu0, vgpu1, etc.)
> > > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > > a numeric index
> > > >   
> > > > > >
> > > > > >> 'mdev_create' creates device and 'mdev_start' is to commit 
> > > > > >> resources of
> > > > > >> all instances of similar devices assigned to VM.
> > > > > >>
> > > > > >> For example, to create 2 devices:
> > > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > > >> # echo "$UUID:1:param

Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 09:38:52 +
> "Tian, Kevin"  wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > Sent: Saturday, August 13, 2016 8:37 AM
> > > 
> > > 
> > > 
> > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > Kirti Wankhede  wrote:
> > > >  
> > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > >>> Kirti Wankhede  wrote:
> > > >>>
> > > >>> This is used later by mdev_device_start() and mdev_device_stop() to 
> > > >>> get
> > > >>> the parent_device so it can call the start and stop ops callbacks
> > > >>> respectively.  That seems to imply that all of instances for a given
> > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > >>> still having a hard time buying into the uuid+instance plan when it
> > > >>> seems like each mdev_device should have an actual unique uuid.
> > > >>> Userspace tools can figure out which uuids to start for a given user, 
> > > >>> I
> > > >>> don't see much value in collecting them to instances within a uuid.
> > > >>>  
> > > >>
> > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > >> instance was introduced to support multiple devices in a VM.  
> > > >
> > > > The instance number was never required in order to support multiple
> > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > devices with that same UUID and therefore associate udev events to a
> > > > given VM.  Only then does an instance number become necessary since the
> > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > like a very dodgy solution when we should probably just be querying
> > > > libvirt to give us a device to VM association.  
> > 
> > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > for mdev in the basic design. It's bound to NVIDIA management stack too 
> > tightly.
> > 
> > I'm OK to give enough flexibility for various upper level management stacks,
> > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > option where either UUID or STRING could be optional? Upper management 
> > stack can choose its own policy to identify a mdev:
> > 
> > a) $UUID only, so each mdev is allocated with a unique UUID
> > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > (vgpu0, vgpu1, etc.)
> > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > a numeric index
> > 
> > > >  
> > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > >> all instances of similar devices assigned to VM.
> > > >>
> > > >> For example, to create 2 devices:
> > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > >>
> > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > >>
> > > >> Commit resources for above devices with single 'mdev_start':
> > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > >> 'instance', so 'mdev_create' would look like:
> > > >>
> > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > >>
> > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > >> would be vendor specific parameters.
> > > >>
> > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > >>
> > > >> Then 'mdev_start' would be:
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > >>
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > >
> > > > I'm not sure a comma separated list makes sense here, for both
> > > > simplicity in the kernel and more fine grained error reporting, we
> > > > probably want to start/stop them individually.  Actually, why is it
> > > > that we can't use the mediated device being opened and released to
> > > > automatically signal to the backend vendor driver to commit and release
> > > > resources? I don't fully understand why userspace needs this interface. 
> > > >  
> > 
> > There is a meaningful use of start/stop interface, as required in live
> > migration support. Such interface allows vendor driver to quiescent 
> > mdev activity on source device before mdev hardware state is snapshot,
> > and then resume mdev activity on dest device after its state is recovered.
> > Intel has implemented experimental live migration support in KVMGT (soon
> > to release), based on above two interfaces (plus another two to get/set
> > mdev state).
> 
> Ok, that's actually an 

Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-15 Thread Neo Jia
On Mon, Aug 15, 2016 at 09:38:52AM +, Tian, Kevin wrote:
> > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > Sent: Saturday, August 13, 2016 8:37 AM
> > 
> > 
> > 
> > On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > Kirti Wankhede  wrote:
> > >
> > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > >>> Kirti Wankhede  wrote:
> > >>>
> > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > >>> the parent_device so it can call the start and stop ops callbacks
> > >>> respectively.  That seems to imply that all of instances for a given
> > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > >>> still having a hard time buying into the uuid+instance plan when it
> > >>> seems like each mdev_device should have an actual unique uuid.
> > >>> Userspace tools can figure out which uuids to start for a given user, I
> > >>> don't see much value in collecting them to instances within a uuid.
> > >>>
> > >>
> > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > >> instance was introduced to support multiple devices in a VM.
> > >
> > > The instance number was never required in order to support multiple
> > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > devices with that same UUID and therefore associate udev events to a
> > > given VM.  Only then does an instance number become necessary since the
> > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > like a very dodgy solution when we should probably just be querying
> > > libvirt to give us a device to VM association.
> 
> Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> for mdev in the basic design. It's bound to NVIDIA management stack too 
> tightly.
> 
> I'm OK to give enough flexibility for various upper level management stacks,
> e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> option where either UUID or STRING could be optional? Upper management 
> stack can choose its own policy to identify a mdev:
> 
> a) $UUID only, so each mdev is allocated with a unique UUID
> b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> (vgpu0, vgpu1, etc.)
> c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> a numeric index
> 
> > >
> > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > >> all instances of similar devices assigned to VM.
> > >>
> > >> For example, to create 2 devices:
> > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > >>
> > >> "$UUID-0" and "$UUID-1" devices are created.
> > >>
> > >> Commit resources for above devices with single 'mdev_start':
> > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > >>
> > >> Considering $UUID to be a unique UUID of a device, we don't need
> > >> 'instance', so 'mdev_create' would look like:
> > >>
> > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > >>
> > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > >> would be vendor specific parameters.
> > >>
> > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > >>
> > >> Then 'mdev_start' would be:
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > >>
> > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > >>
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> > >
> > > I'm not sure a comma separated list makes sense here, for both
> > > simplicity in the kernel and more fine grained error reporting, we
> > > probably want to start/stop them individually.  Actually, why is it
> > > that we can't use the mediated device being opened and released to
> > > automatically signal to the backend vendor driver to commit and release
> > > resources? I don't fully understand why userspace needs this interface.
> 
> There is a meaningful use of start/stop interface, as required in live
> migration support. Such interface allows vendor driver to quiescent 
> mdev activity on source device before mdev hardware state is snapshot,
> and then resume mdev activity on dest device after its state is recovered.
> Intel has implemented experimental live migration support in KVMGT (soon
> to release), based on above two interfaces (plus another two to get/set
> mdev state).
> 
> > >
> > 
> > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > one shot to commit resources of all vGPUs assigned to a VM along with
> > some common resources.
> 
> Kirti, can you elaborate the background about above one-shot commit
> requirement? It's hard to understand such a requirement. 
> 
> As I replied in another 
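
To make the start/stop semantics discussed above a little more concrete, below is a minimal, purely illustrative sketch of how a vendor driver's start/shutdown callbacks (the phy_device_ops->start/shutdown hooks quoted later in this thread) might commit, resume and quiesce all instances that share a UUID. struct vendor_vgpu, vendor_vgpu_list and the vendor_* helpers are assumptions for illustration, not part of the posted patches.

static int vendor_mdev_start(uuid_le uuid)
{
	struct vendor_vgpu *vgpu;
	int ret;

	/* one-shot: commit resources for every instance created with this
	 * UUID, then let the devices run (e.g. on the migration target) */
	list_for_each_entry(vgpu, &vendor_vgpu_list, next) {
		if (uuid_le_cmp(vgpu->uuid, uuid))
			continue;
		ret = vendor_commit_resources(vgpu);
		if (ret)
			return ret;
		vendor_resume_activity(vgpu);
	}
	return 0;
}

static int vendor_mdev_shutdown(uuid_le uuid)
{
	struct vendor_vgpu *vgpu;

	/* quiesce DMA and interrupt activity so the mdev hardware state can
	 * be snapshot on the migration source */
	list_for_each_entry(vgpu, &vendor_vgpu_list, next)
		if (!uuid_le_cmp(vgpu->uuid, uuid))
			vendor_quiesce_activity(vgpu);
	return 0;
}
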

Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-06-08 Thread Neo Jia
On Wed, Jun 08, 2016 at 02:13:49PM +0800, Dong Jia wrote:
> On Tue, 7 Jun 2016 20:48:42 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> > > On Tue, 7 Jun 2016 19:39:21 -0600
> > > Alex Williamson <alex.william...@redhat.com> wrote:
> > > 
> > > > On Wed, 8 Jun 2016 01:18:42 +
> > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > > > 
> > > > > > On Tue, 7 Jun 2016 03:03:32 +
> > > > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > > > > >   
> > > > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > > > >
> > > > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > > > Neo Jia <c...@nvidia.com> wrote:
> > > > > > > >  
> > > > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > > > Neo Jia <c...@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > > > This intends to handle an intercepted channel I/O 
> > > > > > > > > > instruction. It
> > > > > > > > > > basically need to do the following thing:  
> > > > > > > > >
> > > > > > > > > May I ask how and when QEMU knows that he needs to issue such 
> > > > > > > > > VFIO ioctl at
> > > > > > > > > first place?  
> > > > > > > >
> > > > > > > > Yep, this is my question as well.  It sounds a bit like there's 
> > > > > > > > an
> > > > > > > > emulated device in QEMU that's trying to tell the mediated 
> > > > > > > > device when
> > > > > > > > to start an operation when we probably should be passing through
> > > > > > > > whatever i/o operations indicate that status directly to the 
> > > > > > > > mediated
> > > > > > > > device. Thanks,
> > > > > > > >
> > > > > > > > Alex  
> > > > > > >
> > > > > > > Below is copied from Dong's earlier post which said clear that
> > > > > > > a guest cmd submission will trigger the whole flow:
> > > > > > >
> > > > > > > 
> > > > > > > Explanation:
> > > > > > > Q1-Q4: Qemu side process.
> > > > > > > K1-K6: Kernel side process.
> > > > > > >
> > > > > > > Q1. Intercept a ssch instruction.
> > > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > > > (u_ccwchain).
> > > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > > > program, which becomes runnable for a real device.
> > > > > > > K3. With the necessary information contained in the orb 
> > > > > > > passed in
> > > > > > > by Qemu, issue the k_ccwchain to the device, and wait 
> > > > > > > event q
> > > > > > > for the I/O result.
> > > > > > > K4. Interrupt handler gets the I/O result, and wakes up the 
> > > > > > > wait q.
> > > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the 
> > > > > > > result to
> > > > > > > update the user space irb.
> > > > > > > K6. Copy irb and scsw back to user space.
> > > > > > > Q4. Update the irb for the guest.
> > > > > > > -
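
For readers following the Q1..Q4 / K1..K6 flow quoted above, a minimal sketch of what the QEMU side (Q3/Q4) boils down to is shown below. The ioctl name comes from the proposed vfio-ccw interface; the request layout and the handler name are assumptions made only for illustration.

#include <sys/ioctl.h>
#include <errno.h>

struct ccw_cmd_request {
	void *u_ccwchain;	/* user-space copy of the guest CCW program */
	void *orb;		/* operation request block taken from the guest */
	void *irb;		/* interrupt response block, filled in by the kernel */
};

static int handle_intercepted_ssch(int vfio_dev_fd, struct ccw_cmd_request *req)
{
	/* Q3: hand the request to the kernel; steps K1..K6 happen there */
	if (ioctl(vfio_dev_fd, VFIO_DEVICE_CCW_CMD_REQUEST, req) < 0)
		return -errno;

	/* Q4: req->irb now carries the I/O result to reflect back to the guest */
	return 0;
}
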

Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-06-07 Thread Neo Jia
On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> On Tue, 7 Jun 2016 19:39:21 -0600
> Alex Williamson <alex.william...@redhat.com> wrote:
> 
> > On Wed, 8 Jun 2016 01:18:42 +
> > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > 
> > > > On Tue, 7 Jun 2016 03:03:32 +
> > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > > >   
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > >
> > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > Neo Jia <c...@nvidia.com> wrote:
> > > > > >  
> > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > Neo Jia <c...@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > This intends to handle an intercepted channel I/O instruction. 
> > > > > > > > It
> > > > > > > > basically need to do the following thing:  
> > > > > > >
> > > > > > > May I ask how and when QEMU knows that he needs to issue such 
> > > > > > > VFIO ioctl at
> > > > > > > first place?  
> > > > > >
> > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > emulated device in QEMU that's trying to tell the mediated device 
> > > > > > when
> > > > > > to start an operation when we probably should be passing through
> > > > > > whatever i/o operations indicate that status directly to the 
> > > > > > mediated
> > > > > > device. Thanks,
> > > > > >
> > > > > > Alex  
> > > > >
> > > > > Below is copied from Dong's earlier post which said clear that
> > > > > a guest cmd submission will trigger the whole flow:
> > > > >
> > > > > 
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > >
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > (u_ccwchain).
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > program, which becomes runnable for a real device.
> > > > > K3. With the necessary information contained in the orb passed in
> > > > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > for the I/O result.
> > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait 
> > > > > q.
> > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > update the user space irb.
> > > > > K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > >   
> > > > 
> > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > QEMU if those are operations that are not visible to the mediated
> > > > device; which they very well might be, since it's described as an
> > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> Dear Alex, Kevin and Neo,
> 
> 'ssch' is a privileged I/O instruction, which should be finally issued
> to the dedicated subchannel of the physical device.
> 
> BTW, I did remove step Q2 with all of the user-space translation code,
> according to your comments in another thread.
> 
> > > 
> > > 
> > > >   
> > > > > My understanding is that such thing belongs to ho

Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-06-06 Thread Neo Jia
On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:
> On Sun, 5 Jun 2016 23:27:42 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> 2. VFIO_DEVICE_CCW_CMD_REQUEST
> This intends to handle an intercepted channel I/O instruction. It
> basically need to do the following thing:

May I ask how and when QEMU knows that it needs to issue such a VFIO ioctl in the
first place?

Thanks,
Neo

>   a. Copy the raw data of the CCW program (a group of chained CCWs) from
>  user into kernel space buffers.
>   b. Do CCW program translation based on the raw data to get a
>  real-device runnable CCW program. We'd pin pages for those CCWs
>  which have memory space pointers for their offload, and update the
>  CCW program with the pinned results (phys).
>   c. Issue the translated CCW program to a real-device to perform the
>  I/O operation, and wait for the I/O result interrupt.
>   d. Once we got the I/O result, copy the result back to user, and
>  unpin the pages.
> 
> Step c could only be done by the physical device driver, since it's it
> that the int_handler belongs to.
> Step b and d should be done by the physical device driver. Or we'd
> pin/unpin pages in the mediated device driver?
> 
> That's why I asked for the new callback.
> 

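As a rough illustration of steps a..d above, the CMD_REQUEST handler inside the physical (vfio-pccw) driver could be shaped as follows. struct pccw_private, struct ccw_io_region, the io_done completion and all ccw_* helpers are placeholders for the translation/pinning/issue logic under discussion, not existing interfaces.

static long pccw_cmd_request(struct pccw_private *priv,
			     void __user *argp, size_t len)
{
	struct ccw_io_region *region;
	long ret;

	region = kzalloc(len, GFP_KERNEL);
	if (!region)
		return -ENOMEM;

	/* a. copy the raw CCW program data in from user space */
	if (copy_from_user(region, argp, len)) {
		ret = -EFAULT;
		goto out_free;
	}

	/* b. translate it into a real-device runnable program, pinning the
	 *    guest pages referenced by the CCW data addresses */
	ret = ccw_translate_and_pin(priv, region);
	if (ret)
		goto out_free;

	/* c. issue the translated program and wait for the interrupt handler
	 *    to report the I/O result */
	ret = ccw_issue(priv, region);
	if (!ret)
		wait_for_completion(&priv->io_done);

	/* d. copy the result (irb/scsw) back to user space and unpin */
	if (copy_to_user(argp, region, len))
		ret = -EFAULT;
	ccw_unpin(priv, region);
out_free:
	kfree(region);
	return ret;
}
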


Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-06-06 Thread Neo Jia
On Mon, Jun 06, 2016 at 02:01:48PM +0800, Dong Jia wrote:
> On Mon, 6 Jun 2016 10:57:49 +0530
> Kirti Wankhede  wrote:
> 
> > 
> > 
> > On 6/3/2016 2:27 PM, Dong Jia wrote:
> > > On Wed, 25 May 2016 01:28:15 +0530
> > > Kirti Wankhede  wrote:
> > > 
> > > 
> > > ...snip...
> > > 
> > >> +struct phy_device_ops {
> > >> +struct module   *owner;
> > >> +const struct attribute_group **dev_attr_groups;
> > >> +const struct attribute_group **mdev_attr_groups;
> > >> +
> > >> +int (*supported_config)(struct device *dev, char *config);
> > >> +int (*create)(struct device *dev, uuid_le uuid,
> > >> +  uint32_t instance, char *mdev_params);
> > >> +int (*destroy)(struct device *dev, uuid_le uuid,
> > >> +   uint32_t instance);
> > >> +int (*start)(uuid_le uuid);
> > >> +int (*shutdown)(uuid_le uuid);
> > >> +ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t 
> > >> count,
> > >> +enum mdev_emul_space address_space, loff_t pos);
> > >> +ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t 
> > >> count,
> > >> + enum mdev_emul_space address_space, loff_t 
> > >> pos);
> > >> +int (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> > >> +unsigned int index, unsigned int start,
> > >> +unsigned int count, void *data);
> > >> +int (*get_region_info)(struct mdev_device *vdev, int 
> > >> region_index,
> > >> + struct pci_region_info *region_info);
> > >> +int (*validate_map_request)(struct mdev_device *vdev,
> > >> +unsigned long virtaddr,
> > >> +unsigned long *pfn, unsigned 
> > >> long *size,
> > >> +pgprot_t *prot);
> > >> +};
> > > 
> > > Dear Kirti:
> > > 
> > > When I rebased my vfio-ccw patches on this series, I found I need an
> > > extra 'ioctl' callback in phy_device_ops.
> > > 
> > 
> > Thanks for taking closer look. As per my knowledge ccw is not PCI
> > device, right? Correct me if I'm wrong.
> Dear Kirti:
> 
> You are right. CCW is different to PCI. The official term is 'Channel
> I/O device'. They use 'Channels' (co-processors) and CCWs (channel
> command words) to handle I/O operations.
> 
> > I'm curious to know. Are you planning to write a driver (vfio-mccw) for
> > mediated ccw device?
> I wrote two drivers:
> 1. A vfio-pccw driver for the physical ccw device, which will reigister
> the device and callbacks to mdev framework. With this, I could create
> a mediated ccw device for the physical one then.
> 2. A vfio-mccw driver for the mediated ccw device, which will add
> itself to a vfio_group, mimiced what vfio-mpci did.
> 
> The problem is, vfio-mccw need to implement new ioctls besides the
> existing ones (VFIO_DEVICE_GET_INFO, etc). And these ioctls really need
> the physical device help to handle.

Hi Dong,

Could you please help us understand a bit more about the new VFIO ioctls? Since they
are new, they would be sent down by QEMU in this case, right? Could you share more
details?

Thanks,
Neo

> 
> > 
> > Thanks,
> > Kirti
> > 
> > > The ccw physical device only supports one ccw mediated device. And I
> > > have two new ioctl commands for the ccw mediated device. One is 
> > > to hot-reset the resource in the physical device that allocated for
> > > the mediated device, the other is to do an I/O instruction translation
> > > and perform an I/O operation on the physical device. I found the
> > > existing callbacks could not meet my requirements.
> > > 
> > > Something like the following would be fine for my case:
> > >   int (*ioctl)(struct mdev_device *vdev,
> > >unsigned int cmd,
> > >unsigned long arg);
> > > 
> > > What do you think about this?
> > > 
> > > 
> > > Dong Jia
> > > 
> > 
> 
> 
> Dong Jia
> 

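
A minimal sketch of the extension requested here, i.e. an optional ioctl callback in phy_device_ops that the mediated bus driver falls back to for device-specific commands. The dispatch function, the vdev->parent->ops lookup and vfio_mdev_get_device_info() are assumptions made for illustration only.

struct phy_device_ops {
	/* ... existing callbacks quoted earlier in this thread ... */
	int (*ioctl)(struct mdev_device *vdev, unsigned int cmd,
		     unsigned long arg);
};

static long vfio_mdev_ioctl(void *device_data, unsigned int cmd,
			    unsigned long arg)
{
	struct mdev_device *vdev = device_data;
	const struct phy_device_ops *ops = vdev->parent->ops;

	switch (cmd) {
	case VFIO_DEVICE_GET_INFO:
		/* handled generically by the mediated bus driver */
		return vfio_mdev_get_device_info(vdev, arg);
	default:
		/* anything else (e.g. a CCW translate or hot-reset command)
		 * is forwarded to the physical device driver, if it cares */
		if (ops && ops->ioctl)
			return ops->ioctl(vdev, cmd, arg);
		return -ENOTTY;
	}
}
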


Re: [Qemu-devel] [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices

2016-06-02 Thread Neo Jia
On Wed, Jun 01, 2016 at 04:40:19PM +0800, Dong Jia wrote:
> On Wed, 25 May 2016 01:28:17 +0530
> Kirti Wankhede  wrote:
> 
> > +
> > +/*
> > + * Pin a set of guest PFNs and return their associated host PFNs for API
> > + * supported domain only.
> > + * @vaddr [in]: array of guest PFNs
> > + * @npage [in]: count of array elements
> > + * @prot [in] : protection flags
> > + * @pfn_base[out] : array of host PFNs
> > + */
> > +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> > +  int prot, dma_addr_t *pfn_base)
> > +{
> > +   struct vfio_iommu *iommu = iommu_data;
> > +   struct vfio_domain *domain = NULL;
> > +   int i = 0, ret = 0;
> > +   long retpage;
> > +   unsigned long remote_vaddr = 0;
> > +   dma_addr_t *pfn = pfn_base;
> > +   struct vfio_dma *dma;
> > +
> > +   if (!iommu || !vaddr || !pfn_base)
> > +   return -EINVAL;
> > +
> > +   mutex_lock(&iommu->lock);
> > +
> > +   if (!iommu->mediated_domain) {
> > +   ret = -EINVAL;
> > +   goto pin_done;
> > +   }
> > +
> > +   domain = iommu->mediated_domain;
> > +
> > +   for (i = 0; i < npage; i++) {
> > +   struct vfio_pfn *p, *lpfn;
> > +   unsigned long tpfn;
> > +   dma_addr_t iova;
> > +   long pg_cnt = 1;
> > +
> > +   iova = vaddr[i] << PAGE_SHIFT;
> Dear Kirti:
> 
> Got one question for the vaddr-iova conversion here.
> Is this a common rule that can be applied to all architectures?
> AFAIK, this is wrong for the s390 case. Or I must miss something...

I need more details about the "wrong" part. 
IIUC, you are thinking about the guest iommu case?

Thanks,
Neo

> 
> If the answer to the above question is 'no', should we introduce a new
> argument to pass in the iovas? Say 'dma_addr_t *iova'.
> 
> > +
> > +   dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> > +   if (!dma) {
> > +   ret = -EINVAL;
> > +   goto pin_done;
> > +   }
> > +
> > +   remote_vaddr = dma->vaddr + iova - dma->iova;
> > +
> > +   retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> > + pg_cnt, prot, &tpfn);
> > +   if (retpage <= 0) {
> > +   WARN_ON(!retpage);
> > +   ret = (int)retpage;
> > +   goto pin_done;
> > +   }
> > +
> > +   pfn[i] = tpfn;
> > +
> > +   /* search if pfn exist */
> > +   p = vfio_find_pfn(domain, tpfn);
> > +   if (p) {
> > +   atomic_inc(&p->ref_count);
> > +   continue;
> > +   }
> > +
> > +   /* add to pfn_list */
> > +   lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> > +   if (!lpfn) {
> > +   ret = -ENOMEM;
> > +   goto pin_done;
> > +   }
> > +   lpfn->vaddr = remote_vaddr;
> > +   lpfn->iova = iova;
> > +   lpfn->pfn = pfn[i];
> > +   lpfn->npage = 1;
> > +   lpfn->prot = prot;
> > +   atomic_inc(&lpfn->ref_count);
> > +   vfio_link_pfn(domain, lpfn);
> > +   }
> > +
> > +   ret = i;
> > +
> > +pin_done:
> > +   mutex_unlock(&iommu->lock);
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(vfio_pin_pages);
> 
> 
> 
> Dong Jia
> 

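
To show how the exported helper above might be consumed, here is a hedged sketch of a vendor driver pinning a batch of guest PFNs through vfio_pin_pages() and then mapping the resulting host PFNs with the regular DMA API (which, as discussed later in this thread, copes with both the with-IOMMU and without-IOMMU cases). struct vendor_vgpu and its iommu_data/dev fields are assumptions; a real driver would also check dma_mapping_error() on each mapping.

static int vendor_map_guest_pages(struct vendor_vgpu *vgpu,
				  dma_addr_t *gfns, long npage,
				  dma_addr_t *dma_addrs)
{
	dma_addr_t *pfns;
	long pinned, i;

	pfns = kcalloc(npage, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

	/* gfn -> host pfn translation and pinning done by the type1 backend */
	pinned = vfio_pin_pages(vgpu->iommu_data, gfns, npage,
				IOMMU_READ | IOMMU_WRITE, pfns);
	if (pinned <= 0) {
		kfree(pfns);
		return pinned ? (int)pinned : -EFAULT;
	}

	/* pfn -> bus address: dma_map_page() returns an HPA-based address
	 * without an IOMMU and an IOVA with one */
	for (i = 0; i < pinned; i++)
		dma_addrs[i] = dma_map_page(vgpu->dev, pfn_to_page(pfns[i]),
					    0, PAGE_SIZE, DMA_BIDIRECTIONAL);

	kfree(pfns);
	return 0;
}
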


Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 05:23:44PM +0800, Jike Song wrote:
> On 05/13/2016 04:31 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:
> >>
> >> We use page tracking framework, which is newly added to KVM recently,
> >> to mark RAM pages as read-only so write accesses are intercepted to 
> >> device model.
> > 
> > Yes, I am aware of that patchset from Guangrong. So far the interface are 
> > all
> > requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
> > 
> > - kvm_page_track_add_page(): add the page to the tracking pool after
> >   that later specified access on that page will be tracked
> > 
> > - kvm_page_track_remove_page(): remove the page from the tracking pool,
> >   the specified access on the page is not tracked after the last user is
> >   gone
> > 
> > void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
> > enum kvm_page_track_mode mode);
> > void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
> >enum kvm_page_track_mode mode);
> > 
> > Really curious how you are going to have access to the struct kvm *kvm, or 
> > you
> > are relying on the userfaultfd to track the write faults only as part of the
> > QEMU userfault thread?
> >
> 
> Hi Neo,
> 
> For the vGPU used as a device for KVM guest, there will be interfaces
> wrapped or implemented in KVM layer, as a rival thing diverted from
> the interfaces for Xen. That is where the KVM related code supposed to be.

Hi Jike,

Is this discussed anywhere on the mailing list already? Sorry if I have missed
such a conversation.

Thanks,
Neo

> 
> --
> Thanks,
> Jike

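
For context, a minimal sketch of how a module that does hold a struct kvm pointer would use the two interfaces quoted above to write-protect a page backing a GTT. Obtaining that struct kvm pointer from the vGPU/mdev stack is exactly the open question in this mail, so nothing below should be read as an existing KVMGT interface.

/* KVM_PAGE_TRACK_WRITE is the tracking mode defined by the page-track
 * patchset referenced above. */
static void track_gtt_page(struct kvm *kvm, gfn_t gfn)
{
	/* later guest writes to this gfn are intercepted and can be
	 * forwarded to the device model for GTT shadowing */
	kvm_page_track_add_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
}

static void untrack_gtt_page(struct kvm *kvm, gfn_t gfn)
{
	kvm_page_track_remove_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
}
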


Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 05:46:17PM +0800, Jike Song wrote:
> On 05/13/2016 04:12 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> >>
> >> If you're trying to equate the scale of what we need to track vs what
> >> type1 currently tracks, they're significantly different.  Possible
> >> things we need to track include the pfn, the iova, and possibly a
> >> reference count or some sort of pinned page map.  In the pin-all model
> >> we can assume that every page is pinned on map and unpinned on unmap,
> >> so a reference count or map is unnecessary.  We can also assume that we
> >> can always regenerate the pfn with get_user_pages() from the vaddr, so
> >> we don't need to track that.  
> > 
> > Hi Alex,
> > 
> > Thanks for pointing this out, we will not track those in our next rev and
> > get_user_pages will be used from the vaddr as you suggested to handle the
> > single VM with both passthru + mediated device case.
> >
> 
> Just a gut feeling:
> 
> Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
> every time for a particular process. If the VM has dozens of VCPU, which
> is not rare, the semaphore is likely to be the bottleneck.

Hi Jike,

We do need to hold mm->mmap_sem for the VMM/QEMU process, but I don't quite follow
the reasoning about "dozens of VCPUs". One situation I can think of is another thread
within the KVM kernel code competing for the mmap_sem of the VMM/QEMU process, such
as hva_to_pfn; after a quick search, that path seems to be used mostly by the ioctl
"KVM_ASSIGN_PCI_DEVICE".

We will definitely conduct performance analysis with a large configuration on
servers with E5-2697 v4. :-)

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 04:39:37PM +0800, Dong Jia wrote:
> On Fri, 13 May 2016 00:24:34 -0700
> Neo Jia <c...@nvidia.com> wrote:
> 
> > On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > > On Thu, 12 May 2016 13:05:52 -0600
> > > Alex Williamson <alex.william...@redhat.com> wrote:
> > > 
> > > > On Thu, 12 May 2016 08:00:36 +
> > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > > 
> > > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > > Jike Song <jike.s...@intel.com> wrote:
> > > > > >   
> > > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > > >>>> From: Song, Jike
> > > > > > > >>>>
> > > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* 
> > > > > > > >>>> underlying IOMMU
> > > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > > >>>> programming them into an IOMMU for a device, it simply 
> > > > > > > >>>> stores the
> > > > > > > >>>> translations for use by later requests".
> > > > > > > >>>>
> > > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must 
> > > > > > > >>>> be disabled.
> > > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually 
> > > > > > > >>>> programs
> > > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > > > > >>>> dma_map_page;
> > > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> 
> > > > > > > >>>> HPA
> > > > > > > >>>> translations without any knowledge about hardware IOMMU, how 
> > > > > > > >>>> is the
> > > > > > > >>>> device model supposed to do to get an IOVA for a given GPA 
> > > > > > > >>>> (thereby HPA
> > > > > > > >>>> by the IOMMU backend here)?
> > > > > > > >>>>
> > > > > > > >>>> If things go as guessed above, as vfio_pin_pages() 
> > > > > > > >>>> indicates, it
> > > > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult 
> > > > > > > >>>> for the
> > > > > > > >>>> device model to figure out:
> > > > > > > >>>>
> > > > > > > >>>>  1, for a given GPA, how to avoid calling dma_map_page 
> > > > > > > >>>> multiple times?
> > > > > > > >>>>  2, for which page to call dma_unmap_page?
> > > > > > > >>>>
> > > > > > > >>>> --  
> > > > > > > >>>
> > > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > > >>> Then in this file we only need to cache GPA to whatever 
> > > > > > > >>> dmadr_t
> > > > > > > >>> returned by dma_map_page.
> > > > > > > >>>  
> > > > > > > >>
> > > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility 
> > > > > > > >> here?  
> > > > > > > >
> > > > > > > > Hi Jike,
> > &

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 08:02:41AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 3:38 PM
> > 
> > On Fri, May 13, 2016 at 07:13:44AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Friday, May 13, 2016 2:42 PM
> > > >
> > > >
> > > > >
> > > > > We possibly have the same requirement from the mediate driver backend:
> > > > >
> > > > >   a) get a GFN, when guest try to tell hardware;
> > > > >   b) consult the vfio iommu with that GFN[1]: will you find me a 
> > > > > proper dma_addr?
> > > >
> > > > We will provide you the pfn via vfio_pin_pages, so you can map it for 
> > > > dma
> > > > purpose in your i915 driver, which is what we are doing today.
> > > >
> > >
> > > Can such 'map' operation be consolidated in vGPU core driver? I don't 
> > > think
> > > Intel vGPU driver has any feature proactively relying on iommu. The reason
> > > why we keep talking iommu is just because the kernel may enable iommu
> > > for physical GPU so we need make sure our device model can work in such
> > > configuration. And this requirement should apply to all vendors, not Intel
> > > specific (like you said you are doing it already today).
> > 
> > Hi Kevin,
> > 
> > Actually, such requirement is already satisfied today as all vendor drivers
> > should transparently work with and without system iommu on bare-metal, 
> > right?
> > 
> > So I don't see any new requirement here, also such consolidation doesn't 
> > help
> > any but adding complexity to the system as vendor driver will not remove
> > their own dma_map_xxx functions as they are still required to support
> > non-mediated cases.
> > 
> 
> Thanks for your information, which makes it clearer where the difference is. 
> :-)
> 
> Based on your description, looks you treat guest pages same as normal process
> pages, which all share the same code path when mapping as DMA target, so it
> is pointless to separate guest page map out to vGPU core driver. Is this
> understanding correct?

Yes.

It is Linux's responsibility to allocate the physical pages for the QEMU process,
which happen to be the guest physical memory that we might use as a DMA target.
From the device's point of view, it is just some physical location it needs to hit.

> 
> In our side, so far guest pages are treated differently from normal process
> pages, which is the main reason why I asked whether we can consolidate that
> part. Looks now it's not necessary since it's already not a common 
> requirement.

> 
> One additional question though. Jike already mentioned the need to shadow
> GPU MMU (called GTT table in Intel side) in our device model. 'shadow' here
> basically means we need translate from 'gfn' in guest pte to 'dmadr_t'
> as returned by dma_map_xxx. Based on gfn->pfn translation provided by
> VFIO (in your v3 driver), gfn->dmadr_t mapping can be constructed accordingly
> in the vendor driver. So do you have similar requirement like this? If yes, do
> you think any value to unify that translation structure or prefer to maintain
> it by vendor driver?

Yes, I think it would make sense to do this in the vendor driver, as it keeps the
iommu type1 backend clean: it will only track the gfn-to-pfn translation/pinning
(on the CPU side). Then you can reuse your existing driver code to map the pfn as
a DMA target.

You can also add an optimization such as keeping a small cache within your device
driver: if a gfn is already translated, there is no need to query again.

Thanks,
Neo

> 
> Thanks
> Kevin

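
A sketch of the small per-device cache suggested above: look up gfn -> dma_addr first, and only fall back to pinning and mapping on a miss. It assumes the vendor driver declares a hash table with DECLARE_HASHTABLE(gfn_cache, ...) in struct vendor_vgpu and has a vendor_pin_and_map() helper that does the vfio_pin_pages() + dma_map_page() work; both are illustrative assumptions.

struct gfn_dma_entry {
	gfn_t gfn;
	dma_addr_t dma_addr;
	struct hlist_node node;
};

static int vendor_gfn_to_dma(struct vendor_vgpu *vgpu, gfn_t gfn,
			     dma_addr_t *dma_addr)
{
	struct gfn_dma_entry *e;

	/* fast path: gfn already translated and pinned, no need to query
	 * the vfio iommu backend again */
	hash_for_each_possible(vgpu->gfn_cache, e, node, gfn) {
		if (e->gfn == gfn) {
			*dma_addr = e->dma_addr;
			return 0;
		}
	}

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return -ENOMEM;

	/* slow path: pin via vfio_pin_pages() and map through dma_map_page() */
	e->gfn = gfn;
	e->dma_addr = vendor_pin_and_map(vgpu, gfn);
	hash_add(vgpu->gfn_cache, &e->node, gfn);

	*dma_addr = e->dma_addr;
	return 0;
}
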


Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 3:42 PM
> > 
> > On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> > > On 05/13/2016 02:43 PM, Neo Jia wrote:
> > > > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> > > >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> > > >>>> From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
> > > >>>> 2016 3:49 AM
> > > >>>>
> > > >>>>>
> > > >>>>>> Perhaps one possibility would be to allow the vgpu driver
> > > >>>>>> to register map and unmap callbacks.  The unmap callback
> > > >>>>>> might provide the invalidation interface that we're so far
> > > >>>>>> missing.  The combination of map and unmap callbacks might
> > > >>>>>> simplify the Intel approach of pinning the entire VM memory
> > > >>>>>> space, ie. for each map callback do a translation (pin) and
> > > >>>>>> dma_map_page, for each unmap do a dma_unmap_page and
> > > >>>>>> release the translation.
> > > >>>>>
> > > >>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
> > > >>>>> refering to gpu_device_ops as implemented in Kirti's patch)
> > > >>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely
> > > >>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
> > > >>>>> IOMMU compatibility.
> > > >>>>>
> > > >>>>> PS, this has very little to do with pinning wholly or
> > > >>>>> partially. Intel KVMGT has once been had the whole guest
> > > >>>>> memory pinned, only because we used a spinlock, which can't
> > > >>>>> sleep at runtime.  We have removed that spinlock in our
> > > >>>>> another upstreaming effort, not here but for i915 driver, so
> > > >>>>> probably no biggie.
> > > >>>>>
> > > >>>>
> > > >>>> OK, then you guys don't need to pin everything. The next
> > > >>>> question will be if you can send the pinning request from your
> > > >>>> mediated driver backend to request memory pinning like we have
> > > >>>> demonstrated in the v3 patch, function vfio_pin_pages and
> > > >>>> vfio_unpin_pages?
> > > >>>>
> > > >>>
> > > >>> Jike can you confirm this statement? My feeling is that we don't
> > > >>> have such logic in our device model to figure out which pages
> > > >>> need to be pinned on demand. So currently pin-everything is same
> > > >>> requirement in both KVM and Xen side...
> > > >>
> > > >> [Correct me in case of any neglect:)]
> > > >>
> > > >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> > > >> from a GPU is certainly a DMA operation. The DMA facility of most
> > > >> platforms, IGD and NVIDIA GPU included, is not capable of
> > > >> faulting-handling-retrying.
> > > >>
> > > >> As for vGPU solutions like Nvidia and Intel provide, the memory
> > > >> address region used by Guest for GPU access, whenever Guest sets
> > > >> the mappings, it is intercepted by Host, so it's safe to only pin
> > > >> the page before it get used by Guest. This probably doesn't need
> > > >> device model to change :)
> > > >
> > > > Hi Jike
> > > >
> > > > Just out of curiosity, how does the host intercept this before it
> > > > goes on the bus?
> > > >
> > >
> > > Hi Neo,
> > >
> > > [prologize if I mis-expressed myself, bad English ..]
> > >
> > > I was talking about intercepting the setting-up of GPU page tables,
> > > not the DMA itself.  For currently Intel GPU, the page tables are
> > > MMIO registers or simply RAM pages, called GTT (Graphics Translation
> > > Table), the writing event to an GTT entry from Guest, is always
> > > intercepted by Host.
> > 
> > Hi Jike,
> > 
> > Thanks for the details, one more question if the page tables are guest RAM, 
> > how do you
> > intercept it from host? I can see it get intercepted when it is in MMIO 
> > range.
> > 
> 
> We use page tracking framework, which is newly added to KVM recently,
> to mark RAM pages as read-only so write accesses are intercepted to 
> device model.

Yes, I am aware of that patchset from Guangrong. So far the interfaces all require
a struct kvm *; the declarations below are copied from https://lkml.org/lkml/2015/11/30/644:

- kvm_page_track_add_page(): add the page to the tracking pool after
  that later specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool,
  the specified access on the page is not tracked after the last user is
  gone

void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
   enum kvm_page_track_mode mode);

I'm really curious how you are going to get access to the struct kvm *kvm, or are
you relying on userfaultfd to track the write faults only, as part of the QEMU
userfault thread?

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> On 05/13/2016 02:43 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >>>> From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
> >>>> 2016 3:49 AM
> >>>> 
> >>>>> 
> >>>>>> Perhaps one possibility would be to allow the vgpu driver
> >>>>>> to register map and unmap callbacks.  The unmap callback
> >>>>>> might provide the invalidation interface that we're so far
> >>>>>> missing.  The combination of map and unmap callbacks might
> >>>>>> simplify the Intel approach of pinning the entire VM memory
> >>>>>> space, ie. for each map callback do a translation (pin) and
> >>>>>> dma_map_page, for each unmap do a dma_unmap_page and
> >>>>>> release the translation.
> >>>>> 
> >>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
> >>>>> refering to gpu_device_ops as implemented in Kirti's patch)
> >>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely 
> >>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
> >>>>> IOMMU compatibility.
> >>>>> 
> >>>>> PS, this has very little to do with pinning wholly or
> >>>>> partially. Intel KVMGT has once been had the whole guest
> >>>>> memory pinned, only because we used a spinlock, which can't
> >>>>> sleep at runtime.  We have removed that spinlock in our
> >>>>> another upstreaming effort, not here but for i915 driver, so
> >>>>> probably no biggie.
> >>>>> 
> >>>> 
> >>>> OK, then you guys don't need to pin everything. The next
> >>>> question will be if you can send the pinning request from your
> >>>> mediated driver backend to request memory pinning like we have
> >>>> demonstrated in the v3 patch, function vfio_pin_pages and 
> >>>> vfio_unpin_pages?
> >>>> 
> >>> 
> >>> Jike can you confirm this statement? My feeling is that we don't
> >>> have such logic in our device model to figure out which pages
> >>> need to be pinned on demand. So currently pin-everything is same
> >>> requirement in both KVM and Xen side...
> >> 
> >> [Correct me in case of any neglect:)]
> >> 
> >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> >> from a GPU is certainly a DMA operation. The DMA facility of most
> >> platforms, IGD and NVIDIA GPU included, is not capable of
> >> faulting-handling-retrying.
> >> 
> >> As for vGPU solutions like Nvidia and Intel provide, the memory
> >> address region used by Guest for GPU access, whenever Guest sets
> >> the mappings, it is intercepted by Host, so it's safe to only pin
> >> the page before it get used by Guest. This probably doesn't need
> >> device model to change :)
> > 
> > Hi Jike
> > 
> > Just out of curiosity, how does the host intercept this before it
> > goes on the bus?
> > 
> 
> Hi Neo,
> 
> [prologize if I mis-expressed myself, bad English ..] 
> 
> I was talking about intercepting the setting-up of GPU page tables,
> not the DMA itself.  For currently Intel GPU, the page tables are
> MMIO registers or simply RAM pages, called GTT (Graphics Translation
> Table), the writing event to an GTT entry from Guest, is always
> intercepted by Host.

Hi Jike,

Thanks for the details. One more question: if the page tables are in guest RAM, how
do you intercept them from the host? I can see they get intercepted when they are in
the MMIO range.

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 07:13:44AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 2:42 PM
> > 
> > 
> > >
> > > We possibly have the same requirement from the mediate driver backend:
> > >
> > >   a) get a GFN, when guest try to tell hardware;
> > >   b) consult the vfio iommu with that GFN[1]: will you find me a proper 
> > > dma_addr?
> > 
> > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > purpose in your i915 driver, which is what we are doing today.
> > 
> 
> Can such 'map' operation be consolidated in vGPU core driver? I don't think 
> Intel vGPU driver has any feature proactively relying on iommu. The reason 
> why we keep talking iommu is just because the kernel may enable iommu 
> for physical GPU so we need make sure our device model can work in such
> configuration. And this requirement should apply to all vendors, not Intel
> specific (like you said you are doing it already today).

Hi Kevin,

Actually, such a requirement is already satisfied today, as all vendor drivers
should transparently work with and without a system IOMMU on bare metal, right?

So I don't see any new requirement here. Such consolidation also doesn't help
anything but adds complexity to the system, since vendor drivers will not remove
their own dma_map_xxx functions; they are still required to support
non-mediated cases.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> On Thu, 12 May 2016 13:05:52 -0600
> Alex Williamson <alex.william...@redhat.com> wrote:
> 
> > On Thu, 12 May 2016 08:00:36 +
> > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > 
> > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > Jike Song <jike.s...@intel.com> wrote:
> > > >   
> > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > >>>> From: Song, Jike
> > > > > >>>>
> > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > > > >>>> IOMMU
> > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > >>>> translations for use by later requests".
> > > > > >>>>
> > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be 
> > > > > >>>> disabled.
> > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually 
> > > > > >>>> programs
> > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > > >>>> dma_map_page;
> > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > >>>> translations without any knowledge about hardware IOMMU, how is 
> > > > > >>>> the
> > > > > >>>> device model supposed to do to get an IOVA for a given GPA 
> > > > > >>>> (thereby HPA
> > > > > >>>> by the IOMMU backend here)?
> > > > > >>>>
> > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for 
> > > > > >>>> the
> > > > > >>>> device model to figure out:
> > > > > >>>>
> > > > > >>>>  1, for a given GPA, how to avoid calling dma_map_page multiple 
> > > > > >>>> times?
> > > > > >>>>  2, for which page to call dma_unmap_page?
> > > > > >>>>
> > > > > >>>> --  
> > > > > >>>
> > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > >>> returned by dma_map_page.
> > > > > >>>  
> > > > > >>
> > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility 
> > > > > >> here?  
> > > > > >
> > > > > > Hi Jike,
> > > > > >
> > > > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > > > important
> > > > > > that part is actually orthogonal to what we are discussing here as 
> > > > > > we will only
> > > > > > cache the mapping between  > > > > > va>, once we
> > > > > > have pinned pages later with the help of above info, you can map it 
> > > > > > into the
> > > > > > proper iommu domain if the system has configured so.
> > > > > >  
> > > > >
> > > > > Hi Neo,
> > > > >
> > > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > > elsewhere,
> > > > > but to find out whether a pfn was previously mapped or not, you have 
> > &

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >> From: Neo Jia [mailto:c...@nvidia.com]
> >> Sent: Friday, May 13, 2016 3:49 AM
> >>
> >>>
> >>>> Perhaps one possibility would be to allow the vgpu driver to register
> >>>> map and unmap callbacks.  The unmap callback might provide the
> >>>> invalidation interface that we're so far missing.  The combination of
> >>>> map and unmap callbacks might simplify the Intel approach of pinning the
> >>>> entire VM memory space, ie. for each map callback do a translation
> >>>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> >>>> the translation.
> >>>
> >>> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> >>> gpu_device_ops as
> >>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> >>> keeping vGPU purely
> >>> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> >>> compatibility.
> >>>
> >>> PS, this has very little to do with pinning wholly or partially. Intel 
> >>> KVMGT has
> >>> once been had the whole guest memory pinned, only because we used a 
> >>> spinlock,
> >>> which can't sleep at runtime.  We have removed that spinlock in our 
> >>> another
> >>> upstreaming effort, not here but for i915 driver, so probably no biggie.
> >>>
> >>
> >> OK, then you guys don't need to pin everything. The next question will be 
> >> if you
> >> can send the pinning request from your mediated driver backend to request 
> >> memory
> >> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages 
> >> and
> >> vfio_unpin_pages?
> >>
> > 
> > Jike can you confirm this statement? My feeling is that we don't have such 
> > logic
> > in our device model to figure out which pages need to be pinned on demand. 
> > So
> > currently pin-everything is same requirement in both KVM and Xen side...
> 
> [Correct me in case of any neglect:)]
> 
> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM from a GPU is
> certainly a DMA operation. The DMA facility of most platforms, IGD and NVIDIA
> GPU included, is not capable of faulting-handling-retrying.
> 
> As for vGPU solutions like Nvidia and Intel provide, the memory address region
> used by Guest for GPU access, whenever Guest sets the mappings, it is
> intercepted by Host, so it's safe to only pin the page before it get used by
> Guest. This probably doesn't need device model to change :)

Hi Jike

Just out of curiosity, how does the host intercept this before it goes on the
bus?

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 02:08:36PM +0800, Jike Song wrote:
> On 05/13/2016 03:49 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> >> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> >> <alex.william...@redhat.com> wrote:
> >>> On Wed, 11 May 2016 17:15:15 +0800
> >>> Jike Song <jike.s...@intel.com> wrote:
> >>>
> >>>> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >>>>> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >>>>>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >>>>>>>> From: Song, Jike
> >>>>>>>>
> >>>>>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>>>>>>> hardware. It just, as you said in another mail, "rather than
> >>>>>>>> programming them into an IOMMU for a device, it simply stores the
> >>>>>>>> translations for use by later requests".
> >>>>>>>>
> >>>>>>>> That imposes a constraint on gfx driver: hardware IOMMU must be 
> >>>>>>>> disabled.
> >>>>>>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>>>>>>> the hardware IOMMU with IOVA returned by pci_map_page or 
> >>>>>>>> dma_map_page;
> >>>>>>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>>>>>>> translations without any knowledge about hardware IOMMU, how is the
> >>>>>>>> device model supposed to do to get an IOVA for a given GPA (thereby 
> >>>>>>>> HPA
> >>>>>>>> by the IOMMU backend here)?
> >>>>>>>>
> >>>>>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >>>>>>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >>>>>>>> device model to figure out:
> >>>>>>>>
> >>>>>>>>  1, for a given GPA, how to avoid calling dma_map_page multiple 
> >>>>>>>> times?
> >>>>>>>>  2, for which page to call dma_unmap_page?
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>
> >>>>>>> We have to support both w/ iommu and w/o iommu case, since
> >>>>>>> that fact is out of GPU driver control. A simple way is to use
> >>>>>>> dma_map_page which internally will cope with w/ and w/o iommu
> >>>>>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>>>>>> Then in this file we only need to cache GPA to whatever dmadr_t
> >>>>>>> returned by dma_map_page.
> >>>>>>>
> >>>>>>
> >>>>>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >>>>>
> >>>>> Hi Jike,
> >>>>>
> >>>>> With mediated passthru, you still can use hardware iommu, but more 
> >>>>> important
> >>>>> that part is actually orthogonal to what we are discussing here as we 
> >>>>> will only
> >>>>> cache the mapping between , 
> >>>>> once we
> >>>>> have pinned pages later with the help of above info, you can map it 
> >>>>> into the
> >>>>> proper iommu domain if the system has configured so.
> >>>>>
> >>>>
> >>>> Hi Neo,
> >>>>
> >>>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >>>> but to find out whether a pfn was previously mapped or not, you have to
> >>>> track it with another rbtree-alike data structure (the IOMMU driver 
> >>>> simply
> >>>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >>>> IOMMU backend we are discussing here.
> >>>>
> >>>> And it is also semantically correct for an IOMMU backend to handle both 
> >>>> w/
> >>>> and w/o an IOMMU hardware? :)
> >>>
> >>> A problem with the iommu doing the dma_map_page() though is for what
> >>> device does it do this?  In the mediated case

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Neo Jia
On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2016 08:00:36 +
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song <jike.s...@intel.com> wrote:
> > >   
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > >>>> From: Song, Jike
> > > > >>>>
> > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > > >>>> IOMMU
> > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > >>>> translations for use by later requests".
> > > > >>>>
> > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be 
> > > > >>>> disabled.
> > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > >>>> dma_map_page;
> > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > >>>> device model supposed to do to get an IOVA for a given GPA 
> > > > >>>> (thereby HPA
> > > > >>>> by the IOMMU backend here)?
> > > > >>>>
> > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for 
> > > > >>>> the
> > > > >>>> device model to figure out:
> > > > >>>>
> > > > >>>>1, for a given GPA, how to avoid calling dma_map_page multiple 
> > > > >>>> times?
> > > > >>>>2, for which page to call dma_unmap_page?
> > > > >>>>
> > > > >>>> --  
> > > > >>>
> > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > >>> returned by dma_map_page.
> > > > >>>  
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here? 
> > > > >>  
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > > important
> > > > > that part is actually orthogonal to what we are discussing here as we 
> > > > > will only
> > > > > cache the mapping between , 
> > > > > once we
> > > > > have pinned pages later with the help of above info, you can map it 
> > > > > into the
> > > > > proper iommu domain if the system has configured so.
> > > > >  
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-alike data structure (the IOMMU driver 
> > > > simply
> > > > doesn't bother with tracking), that seems somehow duplicate with the 
> > > > vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And it is also semantically correct for an IOMMU backend to handle both 
> > > > w/
> > > > and w/o an IOMMU hardware? :)  
> > > 
> > > A problem with the iommu doing the d

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Neo Jia
On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> <alex.william...@redhat.com> wrote:
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song <jike.s...@intel.com> wrote:
> >
> >> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >> >>>> From: Song, Jike
> >> >>>>
> >> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >> >>>> hardware. It just, as you said in another mail, "rather than
> >> >>>> programming them into an IOMMU for a device, it simply stores the
> >> >>>> translations for use by later requests".
> >> >>>>
> >> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be 
> >> >>>> disabled.
> >> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >> >>>> translations without any knowledge about hardware IOMMU, how is the
> >> >>>> device model supposed to do to get an IOVA for a given GPA (thereby 
> >> >>>> HPA
> >> >>>> by the IOMMU backend here)?
> >> >>>>
> >> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >> >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >> >>>> device model to figure out:
> >> >>>>
> >> >>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >> >>>>  2, for which page to call dma_unmap_page?
> >> >>>>
> >> >>>> --
> >> >>>
> >> >>> We have to support both w/ iommu and w/o iommu case, since
> >> >>> that fact is out of GPU driver control. A simple way is to use
> >> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >> >>> returned by dma_map_page.
> >> >>>
> >> >>
> >> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >> >
> >> > Hi Jike,
> >> >
> >> > With mediated passthru, you still can use hardware iommu, but more 
> >> > important
> >> > that part is actually orthogonal to what we are discussing here as we 
> >> > will only
> >> > cache the mapping between , 
> >> > once we
> >> > have pinned pages later with the help of above info, you can map it into 
> >> > the
> >> > proper iommu domain if the system has configured so.
> >> >
> >>
> >> Hi Neo,
> >>
> >> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >> but to find out whether a pfn was previously mapped or not, you have to
> >> track it with another rbtree-alike data structure (the IOMMU driver simply
> >> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >> IOMMU backend we are discussing here.
> >>
> >> And it is also semantically correct for an IOMMU backend to handle both w/
> >> and w/o an IOMMU hardware? :)
> >
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.
> >
> 
> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> a particular pdev in a vGPU IOMMU BE.
> 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-10 Thread Neo Jia
On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >> From: Song, Jike
> >>
> >> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >> hardware. It just, as you said in another mail, "rather than
> >> programming them into an IOMMU for a device, it simply stores the
> >> translations for use by later requests".
> >>
> >> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >> translations without any knowledge about hardware IOMMU, how is the
> >> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >> by the IOMMU backend here)?
> >>
> >> If things go as guessed above, as vfio_pin_pages() indicates, it
> >> pin & translate vaddr to PFN, then it will be very difficult for the
> >> device model to figure out:
> >>
> >>1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>2, for which page to call dma_unmap_page?
> >>
> >> --
> > 
> > We have to support both w/ iommu and w/o iommu case, since
> > that fact is out of GPU driver control. A simple way is to use
> > dma_map_page which internally will cope with w/ and w/o iommu
> > case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > Then in this file we only need to cache GPA to whatever dmadr_t
> > returned by dma_map_page.
> > 
> 
> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?

Hi Jike,

With mediated passthru, you can still use a hardware iommu, but more importantly,
that part is actually orthogonal to what we are discussing here, as we will only
cache the mapping between ; once we have pinned the pages later with the help of
the above info, you can map them into the proper iommu domain if the system is
configured so.
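
As a rough sketch of that caching (hypothetical names, not from the v3 patch):
the backend pins the guest page and records the gfn together with whatever
dma_addr_t dma_map_page() hands back -- an HPA without an IOMMU, an IOVA with one:

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    /* Hypothetical cache entry: gfn -> pinned page -> dma_addr_t. */
    struct vgpu_map_entry {
            unsigned long gfn;      /* guest frame number                 */
            struct page *page;      /* pinned host page                   */
            dma_addr_t dma_addr;    /* HPA w/o an iommu, IOVA w/ an iommu */
    };

    /* Pin one guest page and record whatever dma_map_page() returns for it. */
    static int vgpu_map_one(struct device *gpu, unsigned long gfn,
                            unsigned long hva, struct vgpu_map_entry *e)
    {
            if (get_user_pages_fast(hva, 1, 1, &e->page) != 1)
                    return -EFAULT;

            /* dma_map_page() copes with both the w/ and w/o iommu cases. */
            e->dma_addr = dma_map_page(gpu, e->page, 0, PAGE_SIZE,
                                       DMA_BIDIRECTIONAL);
            if (dma_mapping_error(gpu, e->dma_addr)) {
                    put_page(e->page);
                    return -EIO;
            }
            e->gfn = gfn;
            return 0;
    }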

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure

2016-05-05 Thread Neo Jia
On Thu, May 05, 2016 at 01:19:45PM -0600, Alex Williamson wrote:
> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia  wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson  wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia  wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson  wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > > > good example to help understand how these patches work. Here is a little
> > > > > > bit more detail on how an I/O request triggered by the Qemu guest will be
> > > > > > handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > > (u_ccwchain).
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?
> > > > Yes.
> > > >   
> > > > > 
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > > K2. Translate the user space ccw program to a kernel space ccw
> > > > > > program, which becomes runnable for a real device.
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?
> > > > Yes. Exactly.
> > > >   
> > > > > 
> > > > > > K3. With the necessary information contained in the orb passed 
> > > > > > in
> > > > > > by Qemu, issue the k_ccwchain to the device, and wait event 
> > > > > > q
> > > > > > for the I/O result.
> > > > > > K4. Interrupt handler gets the I/O result, and wakes up the 
> > > > > > wait q.
> > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result 
> > > > > > to
> > > > > > update the user space irb.
> > > > > > K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > 
> > > > > If the answers to my questions above are both yes,
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we 
> > > > > plan
> > > > > to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.
> > > > I've read through the mail threads those discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not be simply addressed
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of address in the memory space for DMA operations. Any
> > > > address inside this range will not be used for other purpose. Thus we
> > > > can add memory listener on this range, and pin the pages for further
> > > > use (DMA operation). And we can keep the pages pinned during the life
> > > > cycle of the VM (not quite accurate, or I should say 'the target
> > > > device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the 

Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device

2016-05-05 Thread Neo Jia
On Thu, May 05, 2016 at 09:24:26AM +, Tian, Kevin wrote:
> > From: Alex Williamson
> > Sent: Thursday, May 05, 2016 1:06 AM
> > > > > +
> > > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct 
> > > > > vm_fault
> > *vmf)
> > > > > +{
> > > > > + int ret = 0;
> > > > > + struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > > + struct vgpu_device *vgpu_dev;
> > > > > + struct gpu_device *gpu_dev;
> > > > > + u64 virtaddr = (u64)vmf->virtual_address;
> > > > > + u64 offset, phyaddr;
> > > > > + unsigned long req_size, pgoff;
> > > > > + pgprot_t pg_prot;
> > > > > +
> > > > > + if (!vdev && !vdev->vgpu_dev)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + vgpu_dev = vdev->vgpu_dev;
> > > > > + gpu_dev  = vgpu_dev->gpu_dev;
> > > > > +
> > > > > + offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > > + phyaddr  = virtaddr - vma->vm_start + offset;
> > > > > + pgoff= phyaddr >> PAGE_SHIFT;
> > > > > + req_size = vma->vm_end - virtaddr;
> > > > > + pg_prot  = vma->vm_page_prot;
> > > > > +
> > > > > + if (gpu_dev->ops->validate_map_request) {
> > > > > + ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr,
> > > > > + &pgoff, &req_size, &pg_prot);
> > > > > + if (ret)
> > > > > + return ret;
> > > > > +
> > > > > + if (!req_size)
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> > > >
> > > > So not supporting validate_map_request() means that the user can
> > > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > > scenario or should this callback be required?  It's not clear to me how
> > > > the vendor driver determines what this maps to, do they compare it to
> > > > the physical device's own BAR addresses?
> > >
> > > I didn't quite understand too. Based on earlier discussion, do we need
> > > something like this, or could achieve the purpose just by leveraging
> > > recent sparse mmap support?
> > 
> > The reason for faulting in the mmio space, if I recall correctly, is to
> > enable an ordering where the user driver (QEMU) can mmap regions of the
> > device prior to resources being allocated on the host GPU to handle
> > them.  Sparse mmap only partially handles that, it's not dynamic.  With
> > this faulting mechanism, the host GPU doesn't need to commit resources
> > until the mmap is actually accessed.  Thanks,
> > 
> > Alex
> 
> Neo/Kirti, any specific example how above exactly works? I can see
> difference from sparse mmap based on Alex's explanation, but still
> cannot map the 1st sentence to a real scenario clearly. Now our side
> doesn't use such faulting-based method. So I'd like to understand it
> clearly and then see any value to do same thing for Intel GPU.

Hi Kevin,

The short answer is CPU access to GPU resources via MMIO region.
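
As a purely hypothetical illustration of the fault-time flow Alex describes,
a vendor's validate_map_request() could defer committing host GPU resources
until the mapped region is actually touched; the callback signature below is
the one shown in the quoted hunk, and example_alloc_backing_pfn() is a made-up
helper:

    /*
     * Sketch of a vendor validate_map_request() implementation.  Host GPU
     * resources are only committed here, at fault time, not when QEMU first
     * mmap()s the region.  example_alloc_backing_pfn() is a made-up helper.
     */
    static int example_validate_map_request(struct vgpu_device *vgpu,
                                            u64 virtaddr, unsigned long *pgoff,
                                            unsigned long *req_size,
                                            pgprot_t *pg_prot)
    {
            unsigned long host_pfn = example_alloc_backing_pfn(vgpu, *pgoff);

            if (!host_pfn)
                    return -ENOMEM;

            *pgoff    = host_pfn;       /* redirect the fault to the host GPU BAR */
            *req_size = PAGE_SIZE;      /* map one page per fault                 */
            *pg_prot  = pgprot_noncached(*pg_prot);
            return 0;
    }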

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device

2016-05-04 Thread Neo Jia
On Wed, May 04, 2016 at 11:06:19AM -0600, Alex Williamson wrote:
> On Wed, 4 May 2016 03:23:13 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Wednesday, May 04, 2016 6:43 AM  
> > > > +
> > > > +   if (gpu_dev->ops->write) {
> > > > +   ret = gpu_dev->ops->write(vgpu_dev,
> > > > + user_data,
> > > > + count,
> > > > + 
> > > > vgpu_emul_space_config,
> > > > + pos);
> > > > +   }
> > > > +
> > > > +   memcpy((void *)(vdev->vconfig + pos), (void 
> > > > *)user_data, count);  
> > > 
> > > So write is expected to user_data to allow only the writable bits to be
> > > changed?  What's really being saved in the vconfig here vs the vendor
> > > vgpu driver?  It seems like we're only using it to cache the BAR
> > > values, but we're not providing the BAR emulation here, which seems
> > > like one of the few things we could provide so it's not duplicated in
> > > every vendor driver.  But then we only need a few u32s to do that, not
> > > all of config space.  
> > 
> > We can borrow same vconfig emulation from existing vfio-pci driver.
> > But doing so doesn't mean that vendor vgpu driver cannot have its
> > own vconfig emulation further. vGPU is not like a real device, since
> > there may be no physical config space implemented for each vGPU.
> > So anyway vendor vGPU driver needs to create/emulate the virtualized 
> > config space while the way how is created might be vendor specific. 
> > So better to keep the interface to access raw vconfig space from
> > vendor vGPU driver.
> 
> I'm hoping config space will be very simple for a vgpu, so I don't know
> that it makes sense to add that complexity early on.  Neo/Kirti, what
> capabilities do you expect to provide?  Who provides the MSI
> capability?  Is a PCIe capability provided?  Others?

Currently only standard PCI caps.

MSI cap is emulated by the vendor drivers via the above interface.

No PCIe caps so far.

>  
> > > > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > > > +   size_t count, loff_t *ppos, bool iswrite)
> > > > +{
> > > > +   unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > > +   struct vfio_vgpu_device *vdev = device_data;
> > > > +
> > > > +   if (index >= VFIO_PCI_NUM_REGIONS)
> > > > +   return -EINVAL;
> > > > +
> > > > +   switch (index) {
> > > > +   case VFIO_PCI_CONFIG_REGION_INDEX:
> > > > +   return vgpu_dev_config_rw(vdev, buf, count, ppos, 
> > > > iswrite);
> > > > +
> > > > +   case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > > +   return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > > > +
> > > > +   case VFIO_PCI_ROM_REGION_INDEX:
> > > > +   case VFIO_PCI_VGA_REGION_INDEX:  
> > > 
> > > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> > > support a VGA region and then fail to provide read/write access to it
> > > like we said it has.  
> > 
> > For Intel side we plan to not support VGA region when upstreaming our
> > KVMGT work, which means Intel vGPU will be exposed only as a 
> > secondary graphics card then so legacy VGA is not required. Also no
> > VBIOS/ROM requirement. Guess we can remove above two regions.
> 
> So this needs to be optional based on what the mediation driver
> provides.  It seems like we're just making passthroughs for the vendor
> mediation driver to speak vfio.
> 
> > > > +
> > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct 
> > > > vm_fault *vmf)
> > > > +{
> > > > +   int ret = 0;
> > > > +   struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > +   struct vgpu_device *vgpu_dev;
> > > > +   struct gpu_device *gpu_dev;
> > > > +   u64 virtaddr = (u64)vmf->virtual_address;
> > > > +   u64 offset, phyaddr;
> > > > +   unsigned long req_size, pgoff;
> > > > +   pgprot_t pg_prot;
> > > > +
> > > > +   if (!vdev && !vdev->vgpu_dev)
> > > > +   return -EINVAL;
> > > > +
> > > > +   vgpu_dev = vdev->vgpu_dev;
> > > > +   gpu_dev  = vgpu_dev->gpu_dev;
> > > > +
> > > > +   offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > +   phyaddr  = virtaddr - vma->vm_start + offset;
> > > > +   pgoff= phyaddr >> PAGE_SHIFT;
> > > > +   req_size = vma->vm_end - virtaddr;
> > > > +   pg_prot  = vma->vm_page_prot;
> > > > +
> > > > +   if (gpu_dev->ops->validate_map_request) {
> > > > > +   ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr,
> > > > > +   &pgoff, &req_size, &pg_prot);
> > > > +   if (ret)
> > > > +   

Re: [Qemu-devel] [RFC PATCH v3 0/3] Add vGPU support

2016-05-04 Thread Neo Jia
On Wed, May 04, 2016 at 01:05:36AM +, Tian, Kevin wrote:
> > From: Kirti Wankhede
> > Sent: Tuesday, May 03, 2016 2:41 AM
> > 
> > This series adds vGPU support to v4.6 Linux host kernel. Purpose of this 
> > series
> > is to provide a common interface for vGPU management that can be used
> > by different GPU drivers. This series introduces vGPU core module that 
> > create
> > and manage vGPU devices, VFIO based driver for vGPU devices that are 
> > created by
> > vGPU core module and update VFIO type1 IOMMU module to support vGPU devices.
> > 
> > What's new in v3?
> > VFIO type1 IOMMU module supports devices which are IOMMU capable. This 
> > version
> > of patched adds support for vGPU devices, which are not IOMMU capable, to 
> > use
> > existing VFIO IOMMU module. VFIO Type1 IOMMU patch provide new set of APIs 
> > for
> > guest page translation.
> > 
> > What's left to do?
> > VFIO driver for vGPU device doesn't support devices with MSI-X enabled.
> > 
> > Please review.
> > 
> 
> Thanks Kirti/Neo for your nice work! We are integrating this common
> framework with KVMGT. Once ready it'll be released as an experimental
> feature in our next community release.
> 
> One curious question. There are some additional changes in our side.
> What is the best way to collaborate our effort before this series is
> accepted in upstream kernel? Do you prefer to receiving patches from
> us directly, or having it hosted some place so both sides can contribute?

Yes, sending it directly to Kirti and myself will work best; we can sort
out the process offline.

Thanks,
Neo

> 
> Of course we'll conduct high-level discussions of our changes and reach
> agreement first before merging with your code.
> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-11 Thread Neo Jia
On Fri, Mar 11, 2016 at 10:56:24AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 08:55:44 -0800
> Neo Jia <c...@nvidia.com> wrote:
> 
> > > > Alex, what's your opinion on this?  
> > > 
> > > The sticky point is how vfio, which is only handling the vGPU, has a
> > > reference to the physical GPU on which to call DMA API operations.  If
> > > that reference is provided by the vendor vGPU driver, for example
> > > vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> > > be opposed to such an API.  I would not condone vfio deriving or owning
> > > a reference to the physical device on its own though, that's in the
> > > realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> > > reduce duplicate code if the vfio vGPU iommu interface could handle the
> > > iommu mapping for the vendor vgpu driver when necessary.  Thanks,  
> > 
> > Hi Alex,
> > 
> > Since we don't want to allow vfio iommu to derive or own a reference to the
> > physical device, I think it is still better not providing such pci_dev to 
> > the 
> > vfio iommu type1 driver.
> > 
> > Also, I need to point out that if the vfio iommu is going to set up iommu 
> > page
> > table for the real underlying physical device, given the fact of the single 
> > RID we
> > are all having here, the iommu mapping code has to return the new "IOVA" 
> > that is
> > mapped to the HPA, which the GPU vendor driver will have to put on its DMA
> > engine. This is very different than the current VFIO IOMMU mapping logic.
> > 
> > And we still have to provide another interface to translate the GPA to
> > HPA for CPU mapping.
> > 
> > In the current RFC, we only need to have a single interface to provide the 
> > most
> > basic information to the GPU vendor driver and without taking the risk of
> > leaking a ref to VFIO IOMMU.
> 
> I don't see this as some fundamental difference of opinion, it's really
> just whether vfio provides a "pin this GFN and return the HPA" function
> or whether that function could be extended to include "... and also map
> it through the DMA API for the provided device and return the host
> IOVA".  It might even still be a single function to vfio for CPU vs
> device mapping where the device and IOVA return pointer are NULL when
> only pinning is required for CPU access (though maybe there are better
> ways to provide CPU access than pinning).  A wrapper could even give the
> appearance that those are two separate functions.
> 
> So long as vfio isn't owning or deriving the device for the DMA API
> calls and we don't introduce some complication in page accounting, this
> really just seems like a question of whether moving the DMA API
> handling into vfio is common between the vendor vGPU drivers and are we
> reducing the overall amount and complexity of code by giving the vendor
> drivers the opportunity to do both operations with one interface.

Hi Alex,

OK, I will look into adding such a facility and will probably include it in a
later rev of the VGPU IOMMU if we don't run into any surprises or the issues you
mentioned above.
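
For reference, a rough sketch of the kind of combined interface discussed above;
the name and signature are hypothetical, not an existing VFIO API, and
example_pin_gfn() is a made-up helper standing in for the pinning path. When
dev/iova are NULL the call only pins and returns the HPA for CPU access;
otherwise it also maps the page through the DMA API for the provided device:

    /*
     * Hypothetical single entry point: pin a guest pfn and return the HPA;
     * when dev and iova are non-NULL, also map the page through the DMA API
     * for the vendor-provided device and return the host IOVA.
     */
    static int vgpu_pin_and_map(struct vgpu_iommu *iommu, unsigned long gfn,
                                phys_addr_t *hpa,
                                struct device *dev, dma_addr_t *iova)
    {
            struct page *page = example_pin_gfn(iommu, gfn);  /* made-up helper */

            if (!page)
                    return -EFAULT;

            *hpa = page_to_phys(page);

            if (dev && iova) {
                    *iova = dma_map_page(dev, page, 0, PAGE_SIZE,
                                         DMA_BIDIRECTIONAL);
                    if (dma_mapping_error(dev, *iova)) {
                            put_page(page);
                            return -EIO;
                    }
            }
            return 0;
    }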

Thanks,
Neo

> If as Kevin suggest it also provides some additional abstractions
> for Xen vs KVM, even better.  Thanks,
> 
> Alex



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-11 Thread Neo Jia
On Fri, Mar 11, 2016 at 09:13:15AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 04:46:23 +
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
> 
> > > From: Neo Jia [mailto:c...@nvidia.com]
> > > Sent: Friday, March 11, 2016 12:20 PM
> > > 
> > > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:  
> > > >  
> > > > >> Is it supposed to be the caller who should set
> > > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > > >> vgpu_dma_do_translate()?
> > > > >>  
> > > > >
> > > > > Don't think you need to call dma_map_page here. Once you have the pfn 
> > > > > available
> > > > > to your GPU kernel driver, you can just go ahead to setup the mapping 
> > > > > as you
> > > > > normally do such as calling pci_map_sg and its friends.
> > > > >  
> > > >
> > > > Technically it's definitely OK to call DMA API from the caller rather 
> > > > than here,
> > > > however personally I think it is a bit counter-intuitive: IOMMU page 
> > > > tables
> > > > should be constructed within the VFIO IOMMU driver.
> > > >  
> > > 
> > > Hi Jike,
> > > 
> > > For vGPU, what we have is just a virtual device and a fake IOMMU group, 
> > > therefore
> > > the actual interaction with the real GPU should be managed by the GPU 
> > > vendor driver.
> > >   
> > 
> > Hi, Neo,
> > 
> > Seems we have a different thought on this. Regardless of whether it's a 
> > virtual/physical 
> > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> > 
> > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry 
> > (GPA->HPA);
> > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to 
> > IOMMU entry 
> > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> > 
> > This would provide an unified way to manage the translation in VFIO, and 
> > then vendor
> > specific driver only needs to query and use returned IOVA corresponding to 
> > a GPA. 
> > 
> > Doing so has another benefit, to make underlying vGPU driver VMM agnostic. 
> > For KVM,
> > yes we can use pci_map_sg. However for Xen it's different (today Dom0 
> > doesn't see
> > IOMMU. In the future there'll be a PVIOMMU implementation) so different 
> > code path is 
> > required. It's better to abstract such specific knowledge out of vGPU 
> > driver, which just
> > uses whatever dma_addr returned by other agent (VFIO here, or another Xen 
> > specific
> > agent) in a centralized way.
> > 
> > Alex, what's your opinion on this?
> 
> The sticky point is how vfio, which is only handling the vGPU, has a
> reference to the physical GPU on which to call DMA API operations.  If
> that reference is provided by the vendor vGPU driver, for example
> vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> be opposed to such an API.  I would not condone vfio deriving or owning
> a reference to the physical device on its own though, that's in the
> realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> reduce duplicate code if the vfio vGPU iommu interface could handle the
> iommu mapping for the vendor vgpu driver when necessary.  Thanks,

Hi Alex,

Since we don't want to allow vfio iommu to derive or own a reference to the
physical device, I think it is still better not providing such pci_dev to the 
vfio iommu type1 driver.

Also, I need to point out that if the vfio iommu is going to set up iommu page
table for the real underlying physical device, given the fact of the single RID we
are all having here, the iommu mapping code has to return the new "IOVA" that is
mapped to the HPA, which the GPU vendor driver will have to put on its DMA
engine. This is very different than the current VFIO IOMMU mapping logic.

And we still have to provide another interface to translate the GPA to
HPA for CPU mapping.

In the current RFC, we only need to have a single interface to provide the most
basic information to the GPU vendor driver and without taking the risk of
leaking a ref to VFIO IOMMU.

Thanks,
Neo

> 
> Alex



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-10 Thread Neo Jia
On Fri, Mar 11, 2016 at 04:46:23AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, March 11, 2016 12:20 PM
> > 
> > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> > >
> > > >> Is it supposed to be the caller who should set
> > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > >> vgpu_dma_do_translate()?
> > > >>
> > > >
> > > > Don't think you need to call dma_map_page here. Once you have the pfn 
> > > > available
> > > > to your GPU kernel driver, you can just go ahead to setup the mapping 
> > > > as you
> > > > normally do such as calling pci_map_sg and its friends.
> > > >
> > >
> > > Technically it's definitely OK to call DMA API from the caller rather 
> > > than here,
> > > however personally I think it is a bit counter-intuitive: IOMMU page 
> > > tables
> > > should be constructed within the VFIO IOMMU driver.
> > >
> > 
> > Hi Jike,
> > 
> > For vGPU, what we have is just a virtual device and a fake IOMMU group, 
> > therefore
> > the actual interaction with the real GPU should be managed by the GPU 
> > vendor driver.
> > 
> 
> Hi, Neo,
> 
> Seems we have a different thought on this. Regardless of whether it's a 
> virtual/physical 
> device, imo, VFIO should manage IOMMU configuration. The only difference is:
> 
> - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry 
> (GPA->HPA);
> - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to 
> IOMMU entry 
> set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);

How does it make any sense for us to do a dma_map_page for a physical device 
that we don't 
have any direct interaction with?

> 
> This would provide an unified way to manage the translation in VFIO, and then 
> vendor
> specific driver only needs to query and use returned IOVA corresponding to a 
> GPA. 
> 
> Doing so has another benefit, to make underlying vGPU driver VMM agnostic. 
> For KVM,
> yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't 
> see
> IOMMU. In the future there'll be a PVIOMMU implementation) so different code 
> path is 
> required. It's better to abstract such specific knowledge out of vGPU driver, 
> which just
> uses whatever dma_addr returned by other agent (VFIO here, or another Xen 
> specific
> agent) in a centralized way.
> 
> Alex, what's your opinion on this?
> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-10 Thread Neo Jia
On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> 
> >> Is it supposed to be the caller who should set
> >> up IOMMU by DMA api such as dma_map_page(), after calling
> >> vgpu_dma_do_translate()?
> >>
> > 
> > Don't think you need to call dma_map_page here. Once you have the pfn 
> > available
> > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > normally do such as calling pci_map_sg and its friends.
> > 
> 
> Technically it's definitely OK to call DMA API from the caller rather than 
> here,
> however personally I think it is a bit counter-intuitive: IOMMU page tables
> should be constructed within the VFIO IOMMU driver.
> 

Hi Jike,

For vGPU, what we have is just a virtual device and a fake IOMMU group;
therefore the actual interaction with the real GPU should be managed by the GPU
vendor driver.

With the default TYPE1 IOMMU, it works with vfio-pci, which owns the device.

Thanks,
Neo

> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-07 Thread Neo Jia
On Mon, Mar 07, 2016 at 02:07:15PM +0800, Jike Song wrote:
> Hi Neo,
> 
> On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <c...@nvidia.com> wrote:
> > On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> >> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> >> > +   vgpu_dma->size = map->size;
> >> > +
> >> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
> >>
> >> Hi Kirti & Neo,
> >>
> >> seems that no one actually setup mappings for IOMMU here?
> >>
> >
> > Hi Jike,
> >
> > Yes.
> >
> > The actual mapping should be done by the host kernel driver after calling 
> > the
> > translation/pinning API vgpu_dma_do_translate.
> 
> Thanks for the reply. I mis-deleted the mail in my intel account, so
> reply with private mail account, sorry for that.
> 
> 
> In vgpu_dma_do_translate():
> 
> for (i = 0; i < count; i++) {
>     {snip}
>     dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
>     vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /* size */);
> 
>     remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
>     if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
>         pfn = page_to_pfn(page[0]);
>     }
>     gfn_buffer[i] = pfn;
> }
> 
> If I understand correctly, the purpose of above code, is given an
> array of gfns, try to pin & return associated pfns. There is still no
> IOMMU mappings here.  

Yes.

> Is it supposed to be the caller who should set
> up IOMMU by DMA api such as dma_map_page(), after calling
> vgpu_dma_do_translate()?
> 

I don't think you need to call dma_map_page here. Once you have the pfn available
to your GPU kernel driver, you can just go ahead and set up the mapping as you
normally do, such as by calling pci_map_sg and its friends.
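
A rough sketch of that host-driver side, with the 'example_' name made up for
illustration: once vgpu_dma_do_translate() has filled the buffer with pinned
pfns, the vendor driver maps them for its own physical device with the usual
DMA API:

    #include <linux/pci.h>
    #include <linux/scatterlist.h>

    /* Map already-pinned pfns for DMA by the physical GPU (sketch only). */
    static int example_map_for_gpu(struct pci_dev *pdev, unsigned long *pfns,
                                   int count, struct scatterlist *sgl)
    {
            int i, nents;

            sg_init_table(sgl, count);
            for (i = 0; i < count; i++)
                    sg_set_page(&sgl[i], pfn_to_page(pfns[i]), PAGE_SIZE, 0);

            /* Standard DMA mapping against the physical GPU. */
            nents = pci_map_sg(pdev, sgl, count, PCI_DMA_BIDIRECTIONAL);

            return nents ? 0 : -EIO;
    }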

Thanks,
Neo

> 
> -- 
> Thanks,
> Jike



Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-03-03 Thread Neo Jia
On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> > +   vgpu_dma->size = map->size;
> > +
> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
> 
> Hi Kirti & Neo,
> 
> seems that no one actually setup mappings for IOMMU here?
> 

Hi Jike,

Yes.

The actual mapping should be done by the host kernel driver after calling the
translation/pinning API vgpu_dma_do_translate.

Thanks,
Neo

> > 
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver

2016-02-29 Thread Neo Jia
On Mon, Feb 29, 2016 at 05:39:02AM +, Tian, Kevin wrote:
> > From: Kirti Wankhede
> > Sent: Wednesday, February 24, 2016 12:24 AM
> > 
> > Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> > Signed-off-by: Neo Jia <c...@nvidia.com>
> 
> Hi, Kirti/Neo,
> 
> Thanks a lot for you updated version. Having not looked into detail
> code, first come with some high level comments.
> 
> First, in a glimpse the majority of the code (possibly >95%) is device
> agnostic, though we call it vgpu today. Just thinking about the
> extensibility and usability of this framework, would it be better to 
> name it in a way that any other type of I/O device can be fit into 
> this framework? I don't have a good idea of the name now, but 
> a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
> vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
> just one category of users of this general vdev framework. In the
> future it's easily extended to support other I/O virtualization based 
> on similar vgpu concept;
> 
> Second, are these 3 patches already working with nvidia device,
> or are they just conceptual implementation w/o completing actual
> test yet? We'll start moving our implementation toward this direction
> too, so would be good to know the current status and how we can
> further cooperate to move forward. Based on that we can start 
> giving more comments on next level detail.
> 

Hi Kevin,

Yes, we do have an engineering prototype up and running with this set of kernel
patches we have posted.

Please let us know if you have any questions while integrating your vgpu 
solution
within this framework.

Thanks,
Neo

> Thanks
> Kevin



[Qemu-devel] [PATCH v2] replace fixed str limit by g_strdup_printf

2016-02-24 Thread Neo Jia
A trivial change to remove the fixed string limit by using g_strdup_printf

Tested-by: Neo Jia <c...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
---
 hw/vfio/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 30eb945a4fc1..d091d8cf0e6e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -919,7 +919,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
 off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
 DeviceState *dev = DEVICE(vdev);
-char name[32];
+char *name;
 int fd = vdev->vbasedev.fd;
 
 if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
@@ -962,10 +962,11 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 
 trace_vfio_pci_size_rom(vdev->vbasedev.name, size);
 
-snprintf(name, sizeof(name), "vfio[%s].rom", vdev->vbasedev.name);
+name = g_strdup_printf("vfio[%s].rom", vdev->vbasedev.name);
 
 memory_region_init_io(&vdev->pdev.rom, OBJECT(vdev),
                       &vfio_rom_ops, vdev, name, size);
+g_free(name);
 
 pci_register_bar(&vdev->pdev, PCI_ROM_SLOT,
                  PCI_BASE_ADDRESS_SPACE_MEMORY, &vdev->pdev.rom);
-- 
1.8.3.1




[Qemu-devel] [PATCH] replace fixed str limit by g_strdup_printf

2016-02-24 Thread Neo Jia
A trivial change to remove the fixed string limit by using g_strdup_printf
and g_strconcat

Tested-by: Neo Jia <c...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
---
 hw/vfio/pci.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bfe421528618..795e035a6a9b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -874,7 +874,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
 off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
 DeviceState *dev = DEVICE(vdev);
-char name[32];
+char *name;
 int fd = vdev->vbasedev.fd;
 
 if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
@@ -917,16 +917,18 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 
 trace_vfio_pci_size_rom(vdev->vbasedev.name, size);
 
-snprintf(name, sizeof(name), "vfio[%s].rom", vdev->vbasedev.name);
+name = g_strdup_printf("vfio[%s].rom", vdev->vbasedev.name);
 
 memory_region_init_io(&vdev->pdev.rom, OBJECT(vdev),
                       &vfio_rom_ops, vdev, name, size);
+g_free(name);
 
 pci_register_bar(&vdev->pdev, PCI_ROM_SLOT,
                  PCI_BASE_ADDRESS_SPACE_MEMORY, &vdev->pdev.rom);
 
 vdev->pdev.has_rom = true;
 vdev->rom_read_failed = false;
+
 }
 
 void vfio_vga_write(void *opaque, hwaddr addr,
@@ -1318,7 +1320,7 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 {
 VFIOBAR *bar = &vdev->bars[nr];
 uint64_t size = bar->region.size;
-char name[64];
+char *name;
 uint32_t pci_bar;
 uint8_t type;
 int ret;
@@ -1328,8 +1330,6 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 return;
 }
 
-snprintf(name, sizeof(name), "VFIO %s BAR %d", vdev->vbasedev.name, nr);
-
 /* Determine what type of BAR this is for registration */
 ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
             vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
@@ -1338,6 +1338,8 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 return;
 }
 
+name = g_strdup_printf("VFIO %s BAR %d", vdev->vbasedev.name, nr);
+
 pci_bar = le32_to_cpu(pci_bar);
 bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
 bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
@@ -1357,7 +1359,8 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 size = vdev->msix->table_offset & qemu_real_host_page_mask;
 }
 
-strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+name = g_strconcat(name, " mmap", NULL);
+
 if (vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
                      &bar->region.mmap_mem, &bar->region.mmap,
   size, 0, name)) {
@@ -1372,7 +1375,8 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
   PCI_MSIX_ENTRY_SIZE));
 
 size = start < bar->region.size ? bar->region.size - start : 0;
-strncat(name, " msix-hi", sizeof(name) - strlen(name) - 1);
+name = g_strconcat(name, " msix-hi", NULL);
+
 /* VFIOMSIXInfo contains another MemoryRegion for this mapping */
 if (vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
                      &vdev->msix->mmap_mem,
@@ -1382,6 +1386,7 @@ static void vfio_map_bar(VFIOPCIDevice *vdev, int nr)
 }
 
 vfio_bar_quirk_setup(vdev, nr);
+g_free(name);
 }
 
 static void vfio_map_bars(VFIOPCIDevice *vdev)
-- 
1.8.3.1




Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-17 Thread Neo Jia
On Wed, Feb 17, 2016 at 02:08:18PM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > For example, how to locate the path of a given VM?
> 
> You go ask libvirt, the domain xml will have the info.
> 
> > Whoever is going to configure
> > the qemu has to walk through *all* the current vgpu path to locate the UUID 
> > to
> > match the QEMU's VM UUID.
> 
> No.  qemu simply uses the path it get passed from libvirt.  libvirt
> simply uses whatever is stored in the domain xml.
> 
> i.e. you'll have a config like this:
> 
>   <domain type='kvm'>
>     <name>rhel7-vfio</name>
>     <uuid>0990b05d-4fbd-49bf-88e4-e87974c64fba</uuid>
>     [ ... ]
>     <devices>
>       [ ... ]
>       <hostdev ...>
>         <source>
>           <address ... bus='0x00' slot='0x04' function='0x0'/>
>         </source>
>       </hostdev>
> 
> 
> > I think I has answered this, UUID is not a user space or kernel space
> > concept, it is just a generic way to represent object,
> 
> Yes.  But the above sounds like you want to use UUIDs to *link* two
> objects, by assigning the same uuid to both vm and vgpu.  This is *not*
> how uuids should be used.  Each object should have its own uuid.
> 
> You can use uuids to name the vgpus if you want of course.  But the vgpu
> uuid will have no relationship whatsoever to the vm uuid then.
> 

Agreed. I should have made it clear that it should be a separate object.

Thanks,
Neo

> cheers,
>   Gerd
> 



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-17 Thread Neo Jia
On Wed, Feb 17, 2016 at 09:52:04AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Wednesday, February 17, 2016 5:35 PM
> > 
> > On Wed, Feb 17, 2016 at 08:57:08AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Wednesday, February 17, 2016 3:55 PM
> > >
> > If we can make it free, why not?
> 
> I can buy-in this argument.

Great!

> 
> Qemu is not a kernel component. And UUID is OPTIONAL for Qemu.
> 
> KVM is the kernel component. It doesn't use UUID at all. the relation between
> UUID and VM is fully maintained in user space.

Hold on... we are talking about vgpu.ko, not KVM, right?

A UUID is just a generic way to represent an object; here we use a uuid to
represent a virtual gpu device.

> 
> > 
> > Please also note that using UUID to represent a virtual gpu device directory
> > doesn't mean UUID is part of a GPU resource.
> 
> but it adds a hard dependency on another resource - UUID. 
> 
> > 
> > >
> > > So let's keep UUID as an optional parameter. When UUID is provided, it
> > > will be included in the vGPU name then your requirement can be met.
> > >
> > 
> > Like I have said before, we are seeking a generic interface to allow upper 
> > layer
> > software stack to manage vgpu device for different vendors, so we should 
> > not really
> > consider "an optional path for vgpu device discovery" at all.
> > 
> > This is why I think we should use this UUID as a generic management 
> > interface,
> > and we shouldn't have anything optional.
> > 
> 
> I don't buy-in this argument. I always think kernel design should provide 
> enough flexibility, instead of assuming user space behavior.
> 

I think you are using the wrong term here. Flexibility doesn't apply; what
we are trying to achieve is a generic interface for upper-layer
software to manage vgpu devices.

> Let me also add some Citrix friends. See how they feel about the necessity of
> having UUID in vgpu name.

Sorry?

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-17 Thread Neo Jia
On Wed, Feb 17, 2016 at 08:57:08AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Wednesday, February 17, 2016 3:55 PM
> 
> 'whoever' is too strict here. I don't think UUID is required in all scenarios.
> 
> In your scenario:
> 
> - You will pass VM UUID when creating a vgpu.
> - Consequently a /sys/device/virtual/vgpu/$UUID-$vgpu-id is created
> - Then you can identify $UUID-$vgpu-id is right for the very VM, by matching 
> all available vgpu nodes with VM UUID;
> 
> When it is a bit convenient, I don't see it significant. Looping directory is
> not unusual for file/directory operations and it happens infrequently only
> for vgpu life-cycle mgmt..

Hi Kevin,

The search is expensive when you have 8 physical gpus and each can support up
to 32 vgpus. vgpu life-cycle management also happens a lot in real production
scenarios.

If we can make it free, why not?

> 
> Please think about my original proposal carefully. I'm not opposing encoding
> UUID in vgpu name. What I'm opposing is not to make it mandatory, i.e. when
> UUID is not provided, we should still allow vgpu creation using some default
> descriptive string.

Probably you have not quite gotten the generic design that we are proposing here.

The goal is to have a unified interface for all gpu vendors, and to expose that
to the upper-layer software stack, so I don't think we should have an optional
vgpu device discovery path at all.

If we have an optional case, does that mean libvirt will have a different
implementation and qemu will have a different implementation? I don't think that
is acceptable.

Since you have admitted this design is convenient and performs better, I think we
should stay with it.

> 
> 'user space dependency' means you need additional user-space operations
> (say uuidgen here) before you can utilize GPU virtualization feature, which
> is not necessary. In reality, UUID is not a GPU resource. It's not what GPU 
> virtualization intrinsically needs to handle. Let's keep vGPU-core sub-system
> modulo for its real functionalities.

Don't you need to create a UUID to make qemu happy? I don't get this argument.

Please also note that using a UUID to represent a virtual gpu device directory
doesn't mean the UUID is part of a GPU resource.

> 
> So let's keep UUID as an optional parameter. When UUID is provided, it
> will be included in the vGPU name then your requirement can be met.
> 

Like I have said before, we are seeking a generic interface to allow the upper-layer
software stack to manage vgpu devices from different vendors, so we should not
really consider "an optional path for vgpu device discovery" at all.

This is why I think we should use this UUID as a generic management interface,
and we shouldn't have anything optional.

Thanks,
Neo

> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-17 Thread Neo Jia
On Wed, Feb 17, 2016 at 07:51:12AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Wednesday, February 17, 2016 3:32 PM
> > 
> > On Wed, Feb 17, 2016 at 07:52:53AM +0100, Gerd Hoffmann wrote:
> > >   Hi,
> > >
> > > > The answer is simple, having a UUID as part of the device name will 
> > > > give you a
> > > > unique sysfs path that will be opened by QEMU.
> > >
> > > A descriptive name will work too, and I think it'll be easier to make
> > > those names persistent because you don't have to store the uuids
> > > somewhere to re-create the same setup afer reboot.
> > 
> > Hi Gerd,
> > 
> > Right, UUID will be persistent cross reboot. The qemu vgpu path for a given 
> > VM will
> > not get changed when it gets reboots and multiple other devices have been
> > created in the middle.
> 
> Curious why persistence matters here. It's completely OK to assign 
> another vgpu to this VM across reboot, as long as the new vgpu
> provides same capability to previous one.

Hi Kevin,

Those virtual devices might be destroyed and re-created as part of the reboot or
shutdown. The user doesn't want to change his configuration; if the
path is not associated with a UUID, the user might get a different vgpu than in his
previous configuration, which is unpredictable.

We can't change the vgpu configuration behind the user's back, especially when we
have a lot of physical devices and tons of virtual gpus that can be assigned.

Also, if the user wants to move the VM configuration to a different host, it is more
natural to decouple the vgpu configuration from the VM itself. The VM will only
open the device addressed by the corresponding UUID; all you need to do is
describe the vgpu, and no matter how you describe it or how many virtual devices
have already been created on that physical device, your QEMU path is always
persistent.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Wed, Feb 17, 2016 at 07:46:15AM +, Tian, Kevin wrote:
> > From: Neo Jia
> > Sent: Wednesday, February 17, 2016 3:26 PM
> > 
> > 
> 
> > 
> > If your most concern is having this kind of path doesn't provide enough
> > information of the virtual device, we can add more sysfs attributes within 
> > the
> > directory of /sys/devices/virtual/vgpu/$UUID-$vgpu_idx/ to reflect the
> > information you want.
> 
> Like Gerd said, you can have something like this:
> 
> -device vfio-pci,sysfsdev=/sys/devices/virtual/vgpu/vgpu_idx/UUID

Hi Kevin,

The vgpu_idx is not a unique number at all.

For example, how do you locate the path for a given VM? Whoever is going to configure
qemu has to walk through *all* the current vgpu paths to locate the UUID that
matches the QEMU VM's UUID. This is not required if you have the UUID as part of the
device path.

> 
> > 
> > Even with UUID, you don't need libvirt at all. you can get uuid by running
> > uuidgen command, I don't need libvirt to code up and test the RFC that I 
> > have
> > sent out early. :-)
> 
> although simple, it still creates unnecessary user space dependency for
> kernel resource management...

I think I have answered this: a UUID is not a user space or kernel space
concept, it is just a generic way to represent an object; it just makes sure that
the virtual gpu device directory can be uniquely addressed.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Wed, Feb 17, 2016 at 07:52:53AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > The answer is simple, having a UUID as part of the device name will give 
> > you a
> > unique sysfs path that will be opened by QEMU.
> 
> A descriptive name will work too, and I think it'll be easier to make
> those names persistent because you don't have to store the uuids
> somewhere to re-create the same setup afer reboot.

Hi Gerd,

Right, the UUID will be persistent across reboots. The qemu vgpu path for a given
VM will not change when it reboots, even if multiple other devices have been
created in the meantime.

> 
> > If you are worried about losing meaningful name here, we can create a sysfs 
> > file
> > to capture the vendor device description if you like.
> 
> You can also store the uuid in a sysfs file ...

Another benefit is that having the UUID as part of the virtual vgpu device path
allows whoever configures QEMU to discover the virtual device sysfs path
automatically, for free.

Thanks,
Neo

> 
> cheers,
>   Gerd
> 



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Wed, Feb 17, 2016 at 06:02:36AM +, Tian, Kevin wrote:
> > From: Neo Jia
> > Sent: Wednesday, February 17, 2016 1:38 PM
> > > > >
> > > > >
> > > >
> > > > Hi Kevin,
> > > >
> > > > The answer is simple, having a UUID as part of the device name will 
> > > > give you a
> > > > unique sysfs path that will be opened by QEMU.
> > > >
> > > > vgpu-vendor-0 and vgpu-vendor-1 will not be unique as we can have 
> > > > multiple
> > > > virtual gpu devices per VM coming from same or different physical 
> > > > devices.
> > >
> > > That is not a problem. We can add physical device info too like 
> > > vgpu-vendor-0-0,
> > > vgpu-vendor-1-0, ...
> > >
> > > Please note Qemu doesn't care about the actual name. It just accepts a 
> > > sysfs path
> > > to open.
> > 
> > Hi Kevin,
> > 
> > No, I think you are making things even more complicated than it is required,
> > also it is not generic anymore as you are requiring the QEMU to know more 
> > than
> > he needs to.
> > 
> > The way you name those devices will require QEMU to know the relation
> > between virtual devices and physical devices. I don't think that is good.
> 
> I don't think you get my point. Look at how device is assigned in Qemu today:
> 
> -device vfio-pci,host=02:00.0
> 
> Then in a recent patch from Alex, Qemu will accept sysfsdev as well:
> 
> -device vfio-pci,sysfsdev=/sys/devices/pci:00/:00:1c.0/:02:00.0
> 
> Then with vgu (one example from Alex's original post):
> 
> -device vfio-pci,sysfsdev=/sys/devices/virtual/intel-vgpu/vgpu0@:00:02.0

Hi Kevin,

I am fully aware of Alex's patch; that is just an example, and he doesn't
exclude the case of using a UUID in the device path, as he mentioned at the
beginning of this long email thread. Actually, he has already agreed with this
UUID-$vgpu_idx path. :-)

Also, you should note that our proposal doesn't require anything like
intel-vgpu or nvidia-vgpu in the path; having non-vendor-specific information
within the vgpu device path is one of the requirements, IIRC.

-device vfio-pci,sysfsdev=/sys/devices/virtual/intel-vgpu/vgpu0@:00:02.0

> 
> Qemu doesn't need to know the relation between virtual/physical devices at
> all. It's just a path regardless of how vgpu name is created (either with your
> UUID proposal or my descriptive string proposal)

No, with a path like the one above, QEMU needs to know that the virtual device
was created from that physical device :00:02.0, right? (You have actually
mentioned this yourself below.) If QEMU doesn't want to know that, then it just
transfers the burden to the upper layer stack such as libvirt, which has to
figure out the right path for this new VM; with vgpu<$id>, the <$id> becomes
just another generic number generated by the vgpu core driver.

So why not just use UUID here?

> 
> > 
> > My flow is like this:
> > 
> > libvirt creats a VM object, it will have a UUID. then it will use the UUID 
> > to
> > create virtual gpu devices, then it will pass the UUID to the QEMU (actually
> > QEMU already has the VM UUID), then it will just open up the unique path.
> 
> If you look at above example, it's not UUID itself being passed to Qemu. It's
> the sysfsdev path.

The UUID is always passed to QEMU; just look at your QEMU command line.

Yes, it is a sysfs path. Even with a UUID it is still a sysfs path; the only
difference is that we have the UUID embedded within the device name.

> 
> > 
> > Also, you need to consider those 0-0 numbers are not generic as the UUID.
> 
> that encoding could be flexible to include any meaningful string. libvirt can
> itself manages how UUID is mapped to an actual vgpu name.
> 
> > >
> > > >
> > > > If you are worried about losing meaningful name here, we can create a 
> > > > sysfs file
> > > > to capture the vendor device description if you like.
> > > >
> > >
> > > Having the vgpu name descriptive is more informative imo. User can simply 
> > > check
> > > sysfs names to know raw information w/o relying on 3rd party agent to 
> > > query
> > > information around an opaque UUID.
> > >
> > 
> > You are actually arguing against your own design here, unfortunately. If you
> > look at your design carefully, it is your design actually require to have a 
> > 3rd
> > party code to figure out the VM and virtual gpu device relation as it is
> > never documented in the sysfs.
> 
> No. It's not about

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Tue, Feb 16, 2016 at 10:09:43PM -0700, Eric Blake wrote:
> * PGP Signed by an unknown key
> 
> On 02/16/2016 10:04 PM, Tian, Kevin wrote:
> 
> 
> ...rather than making readers scroll through 16k bytes of repetitions of
> the same things they saw earlier in the thread, but getting worse with
> each iteration due to excessive quoting.
> 

Hi Eric,

Sorry about that, I will pay attention to this.

Thanks,
Neo

> -- 
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 
> 
> * Unknown Key
> * 0x2527436A



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Wed, Feb 17, 2016 at 05:04:31AM +, Tian, Kevin wrote:
> > From: Neo Jia
> > Sent: Wednesday, February 17, 2016 12:18 PM
> > 
> > On Wed, Feb 17, 2016 at 03:31:24AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, February 16, 2016 4:49 PM
> > > >
> > > > On Tue, Feb 16, 2016 at 08:10:42AM +, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > Sent: Tuesday, February 16, 2016 3:53 PM
> > > > > >
> > > > > > On Tue, Feb 16, 2016 at 07:40:47AM +, Tian, Kevin wrote:
> > > > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > > > Sent: Tuesday, February 16, 2016 3:37 PM
> > > > > > > >
> > > > > > > > On Tue, Feb 16, 2016 at 07:27:09AM +, Tian, Kevin wrote:
> > > > > > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > > > > > Sent: Tuesday, February 16, 2016 3:13 PM
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > > > > > > > > > > > From: Alex Williamson 
> > > > > > > > > > > > [mailto:alex.william...@redhat.com]
> > > > > > > > > > > > Sent: Thursday, February 04, 2016 3:33 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > > > > > > > > > > > >   Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Actually I have a long puzzle in this area. 
> > > > > > > > > > > > > > Definitely libvirt will
> > use
> > > > UUID
> > > > > > to
> > > > > > > > > > > > > > mark a VM. And obviously UUID is not recorded 
> > > > > > > > > > > > > > within KVM.
> > Then
> > > > how
> > > > > > does
> > > > > > > > > > > > > > libvirt talk to KVM based on UUID? It could be a 
> > > > > > > > > > > > > > good reference
> > to
> > > > this
> > > > > > design.
> > > > > > > > > > > > >
> > > > > > > > > > > > > libvirt keeps track which qemu instance belongs to 
> > > > > > > > > > > > > which vm.
> > > > > > > > > > > > > qemu also gets started with "-uuid ...", so one can 
> > > > > > > > > > > > > query qemu
> > via
> > > > > > > > > > > > > monitor ("info uuid") to figure what the uuid is.  It 
> > > > > > > > > > > > > is also in the
> > > > > > > > > > > > > smbios tables so the guest can see it in the system 
> > > > > > > > > > > > > information
> > table.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The uuid is not visible to the kernel though, the kvm 
> > > > > > > > > > > > > kernel driver
> > > > > > > > > > > > > doesn't know what the uuid is (and neither does 
> > > > > > > > > > > > > vfio).  qemu uses
> > > > file
> > > > > > > > > > > > > handles to talk to both kvm and vfio.  qemu notifies 
> > > > > > > > > > > > > both kvm
> > and
> > > > vfio
> > > > > > > > > > > > > about anything relevant events (guest address space 
> > > > > > > > > > > > > changes
> > etc)
> > > > and
> > > > > > > > > > > > > connects file descriptors (eventfd -> irqfd).
> > > > > > > > > > > >
> > > > > > > > > > > > I think the original link to using a VM UUID for the 
> > > > > > > > > > > > vGPU comes from
> > > 

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Wed, Feb 17, 2016 at 03:31:24AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, February 16, 2016 4:49 PM
> > 
> > On Tue, Feb 16, 2016 at 08:10:42AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, February 16, 2016 3:53 PM
> > > >
> > > > On Tue, Feb 16, 2016 at 07:40:47AM +, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > Sent: Tuesday, February 16, 2016 3:37 PM
> > > > > >
> > > > > > On Tue, Feb 16, 2016 at 07:27:09AM +, Tian, Kevin wrote:
> > > > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > > > Sent: Tuesday, February 16, 2016 3:13 PM
> > > > > > > >
> > > > > > > > On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > > > > > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > > > > > Sent: Thursday, February 04, 2016 3:33 AM
> > > > > > > > > >
> > > > > > > > > > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > > > > > > > > > >   Hi,
> > > > > > > > > > >
> > > > > > > > > > > > Actually I have a long puzzle in this area. Definitely 
> > > > > > > > > > > > libvirt will use
> > UUID
> > > > to
> > > > > > > > > > > > mark a VM. And obviously UUID is not recorded within 
> > > > > > > > > > > > KVM. Then
> > how
> > > > does
> > > > > > > > > > > > libvirt talk to KVM based on UUID? It could be a good 
> > > > > > > > > > > > reference to
> > this
> > > > design.
> > > > > > > > > > >
> > > > > > > > > > > libvirt keeps track which qemu instance belongs to which 
> > > > > > > > > > > vm.
> > > > > > > > > > > qemu also gets started with "-uuid ...", so one can query 
> > > > > > > > > > > qemu via
> > > > > > > > > > > monitor ("info uuid") to figure what the uuid is.  It is 
> > > > > > > > > > > also in the
> > > > > > > > > > > smbios tables so the guest can see it in the system 
> > > > > > > > > > > information table.
> > > > > > > > > > >
> > > > > > > > > > > The uuid is not visible to the kernel though, the kvm 
> > > > > > > > > > > kernel driver
> > > > > > > > > > > doesn't know what the uuid is (and neither does vfio).  
> > > > > > > > > > > qemu uses
> > file
> > > > > > > > > > > handles to talk to both kvm and vfio.  qemu notifies both 
> > > > > > > > > > > kvm and
> > vfio
> > > > > > > > > > > about anything relevant events (guest address space 
> > > > > > > > > > > changes etc)
> > and
> > > > > > > > > > > connects file descriptors (eventfd -> irqfd).
> > > > > > > > > >
> > > > > > > > > > I think the original link to using a VM UUID for the vGPU 
> > > > > > > > > > comes from
> > > > > > > > > > NVIDIA having a userspace component which might get 
> > > > > > > > > > launched from
> > a udev
> > > > > > > > > > event as the vGPU is created or the set of vGPUs within 
> > > > > > > > > > that UUID is
> > > > > > > > > > started.  Using the VM UUID then gives them a way to 
> > > > > > > > > > associate that
> > > > > > > > > > userspace process with a VM instance.  Maybe it could 
> > > > > > > > > > register with
> > > > > > > > > > libvirt for some sort of service provided for the VM, I 
> > > > > > > > > > don't know.
> > > > > > >

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-16 Thread Neo Jia
On Tue, Feb 16, 2016 at 08:10:42AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, February 16, 2016 3:53 PM
> > 
> > On Tue, Feb 16, 2016 at 07:40:47AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, February 16, 2016 3:37 PM
> > > >
> > > > On Tue, Feb 16, 2016 at 07:27:09AM +, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > > > Sent: Tuesday, February 16, 2016 3:13 PM
> > > > > >
> > > > > > On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > > > Sent: Thursday, February 04, 2016 3:33 AM
> > > > > > > >
> > > > > > > > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > > > > > > > >   Hi,
> > > > > > > > >
> > > > > > > > > > Actually I have a long puzzle in this area. Definitely 
> > > > > > > > > > libvirt will use UUID
> > to
> > > > > > > > > > mark a VM. And obviously UUID is not recorded within KVM. 
> > > > > > > > > > Then how
> > does
> > > > > > > > > > libvirt talk to KVM based on UUID? It could be a good 
> > > > > > > > > > reference to this
> > design.
> > > > > > > > >
> > > > > > > > > libvirt keeps track which qemu instance belongs to which vm.
> > > > > > > > > qemu also gets started with "-uuid ...", so one can query 
> > > > > > > > > qemu via
> > > > > > > > > monitor ("info uuid") to figure what the uuid is.  It is also 
> > > > > > > > > in the
> > > > > > > > > smbios tables so the guest can see it in the system 
> > > > > > > > > information table.
> > > > > > > > >
> > > > > > > > > The uuid is not visible to the kernel though, the kvm kernel 
> > > > > > > > > driver
> > > > > > > > > doesn't know what the uuid is (and neither does vfio).  qemu 
> > > > > > > > > uses file
> > > > > > > > > handles to talk to both kvm and vfio.  qemu notifies both kvm 
> > > > > > > > > and vfio
> > > > > > > > > about anything relevant events (guest address space changes 
> > > > > > > > > etc) and
> > > > > > > > > connects file descriptors (eventfd -> irqfd).
> > > > > > > >
> > > > > > > > I think the original link to using a VM UUID for the vGPU comes 
> > > > > > > > from
> > > > > > > > NVIDIA having a userspace component which might get launched 
> > > > > > > > from a udev
> > > > > > > > event as the vGPU is created or the set of vGPUs within that 
> > > > > > > > UUID is
> > > > > > > > started.  Using the VM UUID then gives them a way to associate 
> > > > > > > > that
> > > > > > > > userspace process with a VM instance.  Maybe it could register 
> > > > > > > > with
> > > > > > > > libvirt for some sort of service provided for the VM, I don't 
> > > > > > > > know.
> > > > > > >
> > > > > > > Intel doesn't have this requirement. It should be enough as long 
> > > > > > > as
> > > > > > > libvirt maintains which sysfs vgpu node is associated to a VM 
> > > > > > > UUID.
> > > > > > >
> > > > > > > >
> > > > > > > > > qemu needs a sysfs node as handle to the vfio device, 
> > > > > > > > > something
> > > > > > > > > like /sys/devices/virtual/vgpu/.   can be a uuid 
> > > > > > > > > if you
> > want
> > > > > > > > > have it that way, but it could be pretty much anything.  The 
> > > > > > > > > sysfs node
> > > > > > > > > will

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-15 Thread Neo Jia
On Tue, Feb 16, 2016 at 07:40:47AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, February 16, 2016 3:37 PM
> > 
> > On Tue, Feb 16, 2016 at 07:27:09AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Tuesday, February 16, 2016 3:13 PM
> > > >
> > > > On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Thursday, February 04, 2016 3:33 AM
> > > > > >
> > > > > > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > > > > > >   Hi,
> > > > > > >
> > > > > > > > Actually I have a long puzzle in this area. Definitely libvirt 
> > > > > > > > will use UUID to
> > > > > > > > mark a VM. And obviously UUID is not recorded within KVM. Then 
> > > > > > > > how does
> > > > > > > > libvirt talk to KVM based on UUID? It could be a good reference 
> > > > > > > > to this design.
> > > > > > >
> > > > > > > libvirt keeps track which qemu instance belongs to which vm.
> > > > > > > qemu also gets started with "-uuid ...", so one can query qemu via
> > > > > > > monitor ("info uuid") to figure what the uuid is.  It is also in 
> > > > > > > the
> > > > > > > smbios tables so the guest can see it in the system information 
> > > > > > > table.
> > > > > > >
> > > > > > > The uuid is not visible to the kernel though, the kvm kernel 
> > > > > > > driver
> > > > > > > doesn't know what the uuid is (and neither does vfio).  qemu uses 
> > > > > > > file
> > > > > > > handles to talk to both kvm and vfio.  qemu notifies both kvm and 
> > > > > > > vfio
> > > > > > > about anything relevant events (guest address space changes etc) 
> > > > > > > and
> > > > > > > connects file descriptors (eventfd -> irqfd).
> > > > > >
> > > > > > I think the original link to using a VM UUID for the vGPU comes from
> > > > > > NVIDIA having a userspace component which might get launched from a 
> > > > > > udev
> > > > > > event as the vGPU is created or the set of vGPUs within that UUID is
> > > > > > started.  Using the VM UUID then gives them a way to associate that
> > > > > > userspace process with a VM instance.  Maybe it could register with
> > > > > > libvirt for some sort of service provided for the VM, I don't know.
> > > > >
> > > > > Intel doesn't have this requirement. It should be enough as long as
> > > > > libvirt maintains which sysfs vgpu node is associated to a VM UUID.
> > > > >
> > > > > >
> > > > > > > qemu needs a sysfs node as handle to the vfio device, something
> > > > > > > like /sys/devices/virtual/vgpu/.   can be a uuid if 
> > > > > > > you want
> > > > > > > have it that way, but it could be pretty much anything.  The 
> > > > > > > sysfs node
> > > > > > > will probably show up as-is in the libvirt xml when assign a vgpu 
> > > > > > > to a
> > > > > > > vm.  So the name should be something stable (i.e. when using a 
> > > > > > > uuid as
> > > > > > > name you should better not generate a new one on each boot).
> > > > > >
> > > > > > Actually I don't think there's really a persistent naming issue, 
> > > > > > that's
> > > > > > probably where we diverge from the SR-IOV model.  SR-IOV cannot
> > > > > > dynamically add a new VF, it needs to reset the number of VFs to 
> > > > > > zero,
> > > > > > then re-allocate all of them up to the new desired count.  That has 
> > > > > > some
> > > > > > obvious implications.  I think with both vendors here, we can
> > > > > > dynamically allocate new vGPUs, so I would expect that libvirt would
> > > > > > create each vGPU instance as it's needed.  None would be created by

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-15 Thread Neo Jia
On Tue, Feb 16, 2016 at 07:27:09AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, February 16, 2016 3:13 PM
> > 
> > On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Thursday, February 04, 2016 3:33 AM
> > > >
> > > > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > > > >   Hi,
> > > > >
> > > > > > Actually I have a long puzzle in this area. Definitely libvirt will 
> > > > > > use UUID to
> > > > > > mark a VM. And obviously UUID is not recorded within KVM. Then how 
> > > > > > does
> > > > > > libvirt talk to KVM based on UUID? It could be a good reference to 
> > > > > > this design.
> > > > >
> > > > > libvirt keeps track which qemu instance belongs to which vm.
> > > > > qemu also gets started with "-uuid ...", so one can query qemu via
> > > > > monitor ("info uuid") to figure what the uuid is.  It is also in the
> > > > > smbios tables so the guest can see it in the system information table.
> > > > >
> > > > > The uuid is not visible to the kernel though, the kvm kernel driver
> > > > > doesn't know what the uuid is (and neither does vfio).  qemu uses file
> > > > > handles to talk to both kvm and vfio.  qemu notifies both kvm and vfio
> > > > > about anything relevant events (guest address space changes etc) and
> > > > > connects file descriptors (eventfd -> irqfd).
> > > >
> > > > I think the original link to using a VM UUID for the vGPU comes from
> > > > NVIDIA having a userspace component which might get launched from a udev
> > > > event as the vGPU is created or the set of vGPUs within that UUID is
> > > > started.  Using the VM UUID then gives them a way to associate that
> > > > userspace process with a VM instance.  Maybe it could register with
> > > > libvirt for some sort of service provided for the VM, I don't know.
> > >
> > > Intel doesn't have this requirement. It should be enough as long as
> > > libvirt maintains which sysfs vgpu node is associated to a VM UUID.
> > >
> > > >
> > > > > qemu needs a sysfs node as handle to the vfio device, something
> > > > > like /sys/devices/virtual/vgpu/.   can be a uuid if you 
> > > > > want
> > > > > have it that way, but it could be pretty much anything.  The sysfs 
> > > > > node
> > > > > will probably show up as-is in the libvirt xml when assign a vgpu to a
> > > > > vm.  So the name should be something stable (i.e. when using a uuid as
> > > > > name you should better not generate a new one on each boot).
> > > >
> > > > Actually I don't think there's really a persistent naming issue, that's
> > > > probably where we diverge from the SR-IOV model.  SR-IOV cannot
> > > > dynamically add a new VF, it needs to reset the number of VFs to zero,
> > > > then re-allocate all of them up to the new desired count.  That has some
> > > > obvious implications.  I think with both vendors here, we can
> > > > dynamically allocate new vGPUs, so I would expect that libvirt would
> > > > create each vGPU instance as it's needed.  None would be created by
> > > > default without user interaction.
> > > >
> > > > Personally I think using a UUID makes sense, but it needs to be
> > > > userspace policy whether that UUID has any implicit meaning like
> > > > matching the VM UUID.  Having an index within a UUID bothers me a bit,
> > > > but it doesn't seem like too much of a concession to enable the use case
> > > > that NVIDIA is trying to achieve.  Thanks,
> > > >
> > >
> > > I would prefer to making UUID an optional parameter, while not tieing
> > > sysfs vgpu naming to UUID. This would be more flexible to different
> > > scenarios where UUID might not be required.
> > 
> > Hi Kevin,
> > 
> > Happy Chinese New Year!
> > 
> > I think having UUID as the vgpu device name will allow us to have an gpu 
> > vendor
> > agnostic solution for the upper layer software stack such as QEMU, who is
> > supposed to open the device.
> > 
> 
> Qemu can use whatever sysfs path provided to open the device, regardless
> of whether there is an UUID within the path...
> 

Hi Kevin,

Then it provides even more benefit to using a UUID, since libvirt can then be
implemented in a GPU-vendor-agnostic way, right? :-)

The UUID can be the VM UUID or a vGPU group object UUID; that really depends on
the high level software stack. Again, the benefit is being GPU vendor agnostic.

Thanks,
Neo

> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-15 Thread Neo Jia
On Tue, Feb 16, 2016 at 06:49:30AM +, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, February 04, 2016 3:33 AM
> > 
> > On Wed, 2016-02-03 at 09:28 +0100, Gerd Hoffmann wrote:
> > >   Hi,
> > >
> > > > Actually I have a long puzzle in this area. Definitely libvirt will use 
> > > > UUID to
> > > > mark a VM. And obviously UUID is not recorded within KVM. Then how does
> > > > libvirt talk to KVM based on UUID? It could be a good reference to this 
> > > > design.
> > >
> > > libvirt keeps track which qemu instance belongs to which vm.
> > > qemu also gets started with "-uuid ...", so one can query qemu via
> > > monitor ("info uuid") to figure what the uuid is.  It is also in the
> > > smbios tables so the guest can see it in the system information table.
> > >
> > > The uuid is not visible to the kernel though, the kvm kernel driver
> > > doesn't know what the uuid is (and neither does vfio).  qemu uses file
> > > handles to talk to both kvm and vfio.  qemu notifies both kvm and vfio
> > > about anything relevant events (guest address space changes etc) and
> > > connects file descriptors (eventfd -> irqfd).
> > 
> > I think the original link to using a VM UUID for the vGPU comes from
> > NVIDIA having a userspace component which might get launched from a udev
> > event as the vGPU is created or the set of vGPUs within that UUID is
> > started.  Using the VM UUID then gives them a way to associate that
> > userspace process with a VM instance.  Maybe it could register with
> > libvirt for some sort of service provided for the VM, I don't know.
> 
> Intel doesn't have this requirement. It should be enough as long as
> libvirt maintains which sysfs vgpu node is associated to a VM UUID.
> 
> > 
> > > qemu needs a sysfs node as handle to the vfio device, something
> > > like /sys/devices/virtual/vgpu/.   can be a uuid if you want
> > > have it that way, but it could be pretty much anything.  The sysfs node
> > > will probably show up as-is in the libvirt xml when assign a vgpu to a
> > > vm.  So the name should be something stable (i.e. when using a uuid as
> > > name you should better not generate a new one on each boot).
> > 
> > Actually I don't think there's really a persistent naming issue, that's
> > probably where we diverge from the SR-IOV model.  SR-IOV cannot
> > dynamically add a new VF, it needs to reset the number of VFs to zero,
> > then re-allocate all of them up to the new desired count.  That has some
> > obvious implications.  I think with both vendors here, we can
> > dynamically allocate new vGPUs, so I would expect that libvirt would
> > create each vGPU instance as it's needed.  None would be created by
> > default without user interaction.
> > 
> > Personally I think using a UUID makes sense, but it needs to be
> > userspace policy whether that UUID has any implicit meaning like
> > matching the VM UUID.  Having an index within a UUID bothers me a bit,
> > but it doesn't seem like too much of a concession to enable the use case
> > that NVIDIA is trying to achieve.  Thanks,
> > 
> 
> I would prefer to making UUID an optional parameter, while not tieing
> sysfs vgpu naming to UUID. This would be more flexible to different
> scenarios where UUID might not be required.

Hi Kevin,

Happy Chinese New Year!

I think having the UUID as the vgpu device name will allow us to have a
GPU-vendor-agnostic solution for the upper layer software stack such as QEMU,
which is supposed to open the device.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] RE: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-02-03 Thread Neo Jia
On Wed, Feb 03, 2016 at 08:04:16AM +, Tian, Kevin wrote:
> > From: Zhiyuan Lv
> > Sent: Tuesday, February 02, 2016 3:35 PM
> > 
> > Hi Gerd/Alex,
> > 
> > On Mon, Feb 01, 2016 at 02:44:55PM -0700, Alex Williamson wrote:
> > > On Mon, 2016-02-01 at 14:10 +0100, Gerd Hoffmann wrote:
> > > >   Hi,
> > > >
> > > > > > Unfortunately it's not the only one. Another example is, 
> > > > > > device-model
> > > > > > may want to write-protect a gfn (RAM). In case that this request 
> > > > > > goes
> > > > > > to VFIO .. how it is supposed to reach KVM MMU?
> > > > >
> > > > > Well, let's work through the problem.  How is the GFN related to the
> > > > > device?  Is this some sort of page table for device mappings with a 
> > > > > base
> > > > > register in the vgpu hardware?
> > > >
> > > > IIRC this is needed to make sure the guest can't bypass execbuffer
> > > > verification and works like this:
> > > >
> > > >   (1) guest submits execbuffer.
> > > >   (2) host makes execbuffer readonly for the guest
> > > >   (3) verify the buffer (make sure it only accesses resources owned by
> > > >   the vm).
> > > >   (4) pass on execbuffer to the hardware.
> > > >   (5) when the gpu is done with it make the execbuffer writable again.
> > >
> > > Ok, so are there opportunities to do those page protections outside of
> > > KVM?  We should be able to get the vma for the buffer, can we do
> > > something with that to make it read-only.  Alternatively can the vgpu
> > > driver copy it to a private buffer and hardware can execute from that?
> > > I'm not a virtual memory expert, but it doesn't seem like an
> > > insurmountable problem.  Thanks,
> > 
> > Originally iGVT-g used write-protection for privilege execbuffers, as Gerd
> > described. Now the latest implementation has removed wp to do buffer copy
> > instead, since the privilege command buffers are usually small. So that part
> > is fine.
> > 
> > But we need write-protection for graphics page table shadowing as well. Once
> > guest driver modifies gpu page table, we need to know that and manipulate
> > shadow page table accordingly. buffer copy cannot help here. Thanks!
> > 
> 
> 
> 4) Map/unmap guest memory
> --
> It's there for KVM.
> 
> 5) Pin/unpin guest memory
> --
> IGD or any PCI passthru should have same requirement. So we should be
> able to leverage existing code in VFIO. The only tricky thing (Jike may
> elaborate after he is back), is that KVMGT requires to pin EPT entry too,
> which requires some further change in KVM side. But I'm not sure whether
> it still holds true after some design changes made in this thread. So I'll
> leave to Jike to further comment.
> 

Hi Kevin,

I think you should be able to map and pin guest memory via the IOMMU API, not
KVM.
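As a rough kernel-side sketch of what I mean (the in-place gfn-to-bus-address
semantics assumed here is my reading of the interface, not a final API), a
vendor driver would go through the VGPU TYPE1 IOMMU interface rather than KVM:

#include <linux/types.h>
#include <linux/errno.h>

/* Proposed VGPU TYPE1 IOMMU symbol from the RFC in this thread. */
extern int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);

static int vendor_pin_guest_pages(dma_addr_t *gfns, uint32_t count)
{
    int ret;

    /* Translate (and pin) the guest pfns through the IOMMU path, no KVM. */
    ret = vgpu_dma_do_translate(gfns, count);
    if (ret)
        return ret;

    /* ... program the GPU page tables / DMA engine with the results ... */
    return 0;
}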

> Well, then I realize pretty much opens have been covered with a solution
> when ending this write-up. Then we should move forward to come up a
> prototype upon which we can then identify anything missing or overlooked
> (definitely there would be), and also discuss several remaining opens atop
>  (such as exit-less emulation, pin/unpin, etc.). Another thing we need
> to think is whether this new design is still compatible to Xen side.
> 
> Thanks a lot all for the great discussion (especially Alex with many good
> inputs)! I believe it becomes much clearer now than 2 weeks ago, about 
> how to integrate KVMGT with VFIO. :-)
> 

It is great to see you guys are on board with the VFIO solution! As Kirti has
mentioned in other threads, let's review the current registration APIs and
figure out what we need to add for both solutions.
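Purely as a strawman for that discussion (the ops structure and its members
below are hypothetical illustrations, not the proposed API; the actual
registration prototype in the RFC is longer than what is quoted in this
thread), the registration interface could take roughly this shape:

#include <linux/pci.h>

/* Hypothetical vendor callbacks -- names and signatures are illustrative. */
struct vgpu_device_ops {
    int  (*vgpu_create)(struct pci_dev *pdev, const char *uuid, int idx,
                        int vgpu_type);
    void (*vgpu_destroy)(struct pci_dev *pdev, const char *uuid, int idx);
    /* MMIO/config access, interrupt setup, start/shutdown, etc. */
};

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct vgpu_device_ops *ops);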

Thanks,
Neo

> Thanks
> Kevin
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [Qemu-devel] [iGVT-g] RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-02-03 Thread Neo Jia
On Thu, Feb 04, 2016 at 03:01:36AM +, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, February 04, 2016 4:45 AM
> > >
> > > First, Jike told me before his vacation, that we cannot do any change to
> > > KVM module according to community comments. Now I think it's not true.
> > > We can do necessary changes, as long as it is done in a structural/layered
> > > approach, w/o hard assumption on KVMGT as the only user. That's the
> > > guideline we need to obey. :-)
> > 
> > We certainly need to separate the functionality that you're trying to
> > enable from the more pure concept of vfio.  vfio is a userspace driver
> > interfaces, not a userspace driver interface for KVM-based virtual
> > machines.  Maybe it's more of a gimmick that we can assign PCI devices
> > to QEMU tcg VMs, but that's really just the proof of concept for more
> > useful capabilities, like supporting DPDK applications.  So, I
> > begrudgingly agree that structured/layered interactions are acceptable,
> > but consider what use cases may be excluded by doing so.
> 
> Understand. We shouldn't assume VFIO always connected to KVM. For 
> example, once we have vfio-vgpu ready, it can be used to drive container
> usage too, not exactly always connecting with KVM/Qemu. Actually thinking
> more from this angle there is a new open which I'll describe in the end...
> 
> > >
> > > 4) Map/unmap guest memory
> > > --
> > > It's there for KVM.
> > 
> > Map and unmap for who?  For the vGPU or for the VM?  It seems like we
> > know how to map guest memory for the vGPU without KVM, but that's
> > covered in 7), so I'm not entirely sure what this is specifying.
> 
> Map guest memory for emulation purpose in vGPU device model, e.g. to r/w
> guest GPU page table, command buffer, etc. It's the basic requirement as
> we see in any device model.
> 
> 7) provides the database (both GPA->IOVA and GPA->HPA), where GPA->HPA
> can be used to implement this interface for KVM. However for Xen it's
> different, as special foreign domain mapping hypercall is involved which is
> Xen specific so not appropriate to be in VFIO. 
> 
> That's why we list this interface separately as a key requirement (though
> it's obvious for KVM)

Hi Kevin,

It seems you are trying to map the guest physical memory into your kernel driver
on the host, right? 

If yes, I think we already have the required information to achieve that.

The type1 IOMMU VGPU interface already provides the mapping information (the
GPA->IOVA and GPA->HPA database you mention above), which is enough for us to
do any lookup.

> 
> > 
> > > 5) Pin/unpin guest memory
> > > --
> > > IGD or any PCI passthru should have same requirement. So we should be
> > > able to leverage existing code in VFIO. The only tricky thing (Jike may
> > > elaborate after he is back), is that KVMGT requires to pin EPT entry too,
> > > which requires some further change in KVM side. But I'm not sure whether
> > > it still holds true after some design changes made in this thread. So I'll
> > > leave to Jike to further comment.
> > 
> > PCI assignment requires pinning all of guest memory, I would think that
> > IGD would only need to pin selective memory, so is this simply stating
> > that both have the need to pin memory, not that they'll do it to the
> > same extent?
> 
> For simplicity let's first pin all memory, while taking selective pinning as a
> future enhancement.
> 
> The tricky thing is that existing 'pin' action in VFIO doesn't actually pin
> EPT entry too (only pin host page tables for Qemu process). There are 
> various places where EPT entries might be invalidated when guest is 
> running, while KVMGT requires EPT entries to be pinned too. Let's wait 
> for Jike to elaborate whether this part is still required today.

Sorry, I don't quite follow the logic here. The current VFIO TYPE1 IOMMU
(including the API and the underlying IOMMU implementation) will pin the guest
physical memory and install those pages into the proper device domain. Yes, it
is only for the QEMU process, since that is where the VM is running.

Am I missing something here?
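For reference, this is the existing user-space flow I am referring to (a sketch
only; container_fd is assumed to be an open VFIO container already attached to
a group):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* QEMU maps a chunk of guest RAM: TYPE1 pins the pages and installs the
 * IOVA->HPA translation into the device's IOMMU domain. */
static int map_guest_ram(int container_fd, void *hva,
                         uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uint64_t)(uintptr_t)hva;  /* QEMU virtual address of guest RAM */
    map.iova  = gpa;                       /* guest physical address used as IOVA */
    map.size  = size;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}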

> 
> > 
> > > 6) Write-protect a guest memory page
> > > --
> > > The primary purpose is for GPU page table shadowing. We need to track
> > > modifications on guest GPU page table, so shadow part can be synchronized
> > > accordingly. Just think about CPU page table shadowing. And old example
> > > as Zhiyuan pointed out, is to write-protect guest cmd buffer. But it 
> > > becomes
> > > not necessary now.
> > >
> > > So we need KVM to provide an interface so some agents can request such
> > > write-protection action (not just for KVMGT. could be for other tracking
> > > usages). Guangrong has been working on a general page tracking mechanism,
> > > upon which write-protection can be easily built on. The review is still in
> > > progress.
> > 
> > I have a hard time believing we don't have the mechanics to do this
> > outside of KVM.  We should be able to write protect user pages from the
> > kernel, 

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-02 Thread Neo Jia
On Tue, Feb 02, 2016 at 09:00:43AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > And for UUID, I remember Alex had a concern on using it in kernel. 
> > Honestly speaking I don't have a good idea here. In Xen side there is a VM 
> > ID
> > which can be easily used as the index. But for KVM, what would be the best
> > identifier to associate with a VM?
> 
> The vgpu code doesn't need to associate the vgpu device with a vm in the
> first place.  You get all guest address space information from qemu, via
> vfio iommu interface.
> 
> When qemu does't use kvm (tcg mode), things should still work fine.
> Using vfio-based vgpu devices with non-qemu apps (some kind of test
> suite for example) should work fine too.

Hi Gerd and Kevin,

I thought Alex had agreed with the UUID as long as it is not tied to the VM;
probably his comment just got lost in our previous long email thread.

Thanks,
Neo

> 
> cheers,
>   Gerd
> 



Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-02 Thread Neo Jia
On Tue, Feb 02, 2016 at 08:18:44AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, February 02, 2016 4:13 PM
> > 
> > On Tue, Feb 02, 2016 at 09:00:43AM +0100, Gerd Hoffmann wrote:
> > >   Hi,
> > >
> > > > And for UUID, I remember Alex had a concern on using it in kernel.
> > > > Honestly speaking I don't have a good idea here. In Xen side there is a 
> > > > VM ID
> > > > which can be easily used as the index. But for KVM, what would be the 
> > > > best
> > > > identifier to associate with a VM?
> > >
> > > The vgpu code doesn't need to associate the vgpu device with a vm in the
> > > first place.  You get all guest address space information from qemu, via
> > > vfio iommu interface.
> > >
> > > When qemu does't use kvm (tcg mode), things should still work fine.
> > > Using vfio-based vgpu devices with non-qemu apps (some kind of test
> > > suite for example) should work fine too.
> > 
> > Hi Gerd and Kevin,
> > 
> > I thought Alex had agreed with the UUID as long as it is not tied with VM,
> > probably it is just his comment gets lost in our previous long email thread.
> > 
> 
> I think the point is... what is the value to introduce a UUID here? If
> what Gerd describes is enough, we can simply invent a vgpu ID which
> is returned at vgpu_create, and then used as the index for other
> interfaces.
> 

Hi Kevin,

It can just be a plain UUID, and the meaning of the UUID is up to the upper
layer SW. For example, with libvirt you can create a new "vgpu group" object
representing a list of vgpu devices, so the UUID will be the input to
vgpu_create instead of a return value.

For TCG mode, this should just work as long as libvirt can create the proper
internal objects there, plus the other vfio iommu interfaces Gerd has called
out, although the vector->kvm_interrupt part might need some tweaks.

Thanks,
Neo

> But I still need to think about whether there's value to have a VM
> identifier within vgpu core driver, especially regarding to how this
> vgpu core driver connects to KVM hypervisor or other hypervisor.
> I saw another long thread about that part. Jike has started his 
> vacation now. I'll follow up with it tomorrow.
> 
> Thanks
> Kevin



Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-27 Thread Neo Jia
On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > --
> > > > > >  
> > > > > > vgpu_supported_types - RO, list the current supported virtual GPU 
> > > > > > types and its
> > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > > 
> > > > > > vgpu_create - WO, input syntax , create a 
> > > > > > virtual
> > > > > > gpu device on a target physical GPU. idx: virtual device index 
> > > > > > inside a VM
> > > > > >  
> > > > > > vgpu_destroy - WO, input syntax , destroy a virtual 
> > > > > > gpu device on a
> > > > > > target physical GPU
> > > > >  
> > > > >  
> > > > > I've noted in previous discussions that we need to separate user 
> > > > > policy
> > > > > from kernel policy here, the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >  
> > > > > Personally I'd also prefer to get rid of the concept of indexes 
> > > > > within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > > interface design.
> > > > >  
> > > >  
> > > > Hi Alex,
> > > >  
> > > > I agree with you that we should not put UUID concept into a kernel API. 
> > > > At
> > > > this point (without any prototyping), I am thinking of using a list of 
> > > > virtual
> > > > devices instead of UUID.
> > > 
> > > Hi Neo,
> > > 
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
> > > 
> > > > > >  
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > > uint64_t virt_bar_addr,
> > > > > > uint64_t phys_bar_addr,
> > > > > > uint32_t len,
> > > > > > uint32_t flags
> > > > > > )
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >  
> > > > >  
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA 
> > > > > mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >  
> > > >  
> > > > So you are basically suggesting that we are going to take a mmap fault 
> > > > and
> > > > within that fault handler, we will go into vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >  
> > > > Is my understanding correct?
> > > 
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key though is that you want to remap it onto the vma
> > > accessing the vfio device file, not scanning it out of an IOVA mapping
> > > that might be dynamic and doing a vma lookup based on the point in time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change while the former should be a one time fault.
> > 
> > Hi Alex,
> > 
> > The fact is that the vendor driver can only prevent such mmap fault by 
> >

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-27 Thread Neo Jia
On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > --
> > > >  
> > > > vgpu_supported_types - RO, list the current supported virtual GPU types 
> > > > and its
> > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > "vgpu_supported_types".
> > > > 
> > > > vgpu_create - WO, input syntax , create a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a 
> > > > VM
> > > >  
> > > > vgpu_destroy - WO, input syntax , destroy a virtual gpu 
> > > > device on a
> > > > target physical GPU
> > > 
> > > 
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > > 
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition on the nvidia implementation into the kernel
> > > interface design.
> > > 
> > 
> > Hi Alex,
> > 
> > I agree with you that we should not put UUID concept into a kernel API. At
> > this point (without any prototyping), I am thinking of using a list of 
> > virtual
> > devices instead of UUID.
> 
> Hi Neo,
> 
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
> 
> > > >  
> > > > int vgpu_map_virtual_bar
> > > > (
> > > > uint64_t virt_bar_addr,
> > > > uint64_t phys_bar_addr,
> > > > uint32_t len,
> > > > uint32_t flags
> > > > )
> > > >  
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > 
> > > 
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > > 
> > 
> > So you are basically suggesting that we are going to take a mmap fault and
> > within that fault handler, we will go into vendor driver to look up the
> > "pre-registered" mapping and remap there.
> > 
> > Is my understanding correct?
> 
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only prevent such an mmap fault by
looking up the <iova, hva> mapping table that we have saved from the IOMMU
memory listener when the guest region gets programmed. Also, as you have
mentioned below, the mapping between iova and hva shouldn't change once the
SBIOS and the guest OS are done with their job.

Yes, you are right that it is a one-time fault, but the GPU work is heavily
pipelined.

Probably we should just limit this interface to the guest MMIO region, and we
could add some cross-checking with the VFIO driver, which has been monitoring
the config space accesses, to make sure nothing gets moved around?
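Roughly, the one-time-fault path would look like this (a sketch only:
vendor_lookup_bar_pfn() is a made-up helper, and the fault handler signature
shown is the newer kernel style; 2016-era kernels take a vma argument and use
vm_insert_pfn() instead):

#include <linux/mm.h>

/* Hypothetical helper: return the physical BAR pfn backing this vGPU page. */
extern unsigned long vendor_lookup_bar_pfn(struct vm_area_struct *vma,
                                           unsigned long pgoff);

static vm_fault_t vgpu_mmio_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    unsigned long pfn = vendor_lookup_bar_pfn(vma, vmf->pgoff);

    if (!pfn)
        return VM_FAULT_SIGBUS;

    /* Map the faulting page of the virtual BAR onto the physical BAR page. */
    return vmf_insert_pfn(vma, vmf->address, pfn);
}

static const struct vm_operations_struct vgpu_mmio_vm_ops = {
    .fault = vgpu_mmio_fault,
};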

> 
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
> 
>

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-26 Thread Neo Jia
e two symbols to outside for MMIO mapping and page
translation and pinning. 

Also, with an mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking an
mmap fault hit, and we can also support different MMIO sizes between the
virtual and physical device.

int vgpu_map_virtual_bar
(
uint64_t virt_bar_addr,
uint64_t phys_bar_addr,
uint32_t len,
uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);

Still a lot to be added and modified, such as supporting multiple VMs and 
multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
kernel driver, error handling, roll-back and locked memory size per user, etc. 
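As an illustration only (the flags value and the exact mapping semantics here
are assumptions, not part of the proposal text), a vendor driver could use the
first interface above like this once it has picked the physical BAR slice
backing a vGPU:

/* Expose a slice of the physical BAR as the guest-visible virtual BAR. */
static int vendor_expose_vgpu_bar(uint64_t guest_bar_gpa,
                                  uint64_t phys_bar_base,
                                  uint64_t slice_offset,
                                  uint32_t vgpu_bar_len)
{
    return vgpu_map_virtual_bar(guest_bar_gpa,
                                phys_bar_base + slice_offset,
                                vgpu_bar_len,
                                0 /* flags: assumed */);
}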

4. Modules
==

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
   TYPE1 v1 and v2 interface. 

vgpu.ko  - provide registration interface and virtual device
   VFIO access.

5. QEMU note
==

To allow us to focus on the VGPU kernel driver prototyping, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is to have a new way to discover the
device.

6. Examples
==

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

[root@cjia-vgx-kvm ~]# cat 
/sys/bus/pci/devices/\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > 
/sys/bus/pci/devices/\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not
been tested with multiple vgpu devices yet, but we will support that.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if there are any
existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
inform the GPU vendor driver to clean up its resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC, as it is a pretty isolated module
and does not impact the basic vGPU functionality; also, we have already had a
good discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8 Patches
==

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 
4.4.0-rc5

Thanks,
Kirti and Neo


> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin
From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <c...@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POV implementation to allow a GPU driver vendor to plug
into the VFIO framework and provide its virtual GPU support. This kernel module
provides a registration interface for GPU vendors and generic DMA tracking APIs.

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-26 Thread Neo Jia
PU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
IOMMU code) and expose two symbols to outside for MMIO mapping and page
translation and pinning. 

Also, with an mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking an
mmap fault hit, and we can also support different MMIO sizes between the
virtual and physical device.

int vgpu_map_virtual_bar
(
uint64_t virt_bar_addr,
uint64_t phys_bar_addr,
uint32_t len,
uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);

Still a lot to be added and modified, such as supporting multiple VMs and 
multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
kernel driver, error handling, roll-back and locked memory size per user, etc. 

4. Modules
==

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
   TYPE1 v1 and v2 interface. 

vgpu.ko  - provide registration interface and virtual device
   VFIO access.

5. QEMU note
==

To allow us to focus on the VGPU kernel driver prototyping, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is to have a new way to discover the
device.

6. Examples
==

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

[root@cjia-vgx-kvm ~]# cat 
/sys/bus/pci/devices/\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > 
/sys/bus/pci/devices/\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not
been tested with multiple vgpu devices yet, but we will support that.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if there are any
existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
inform the GPU vendor driver to clean up its resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC, as it is a pretty isolated module
and does not impact the basic vGPU functionality; also, we have already had a
good discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8 Patches
==

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 
4.4.0-rc5

Thanks,
Kirti and Neo

From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <c...@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick proof-of-concept implementation to allow GPU driver vendors
to plug into the VFIO framework to provide their virtual GPU support. This kernel
module provides a registration interface for GPU vendors and generic DMA tracking
APIs.

extern int vgpu_register_device(struct pci_dev *dev,
cons

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-26 Thread Neo Jia
On Tue, Jan 26, 2016 at 07:24:52PM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Tuesday, January 26, 2016 6:21 PM
> > 
> > 0. High level overview
> > ==
> > 
> >   user space:
> > 
> >   [ASCII architecture diagram, garbled by the archive's line wrapping.
> >   Recoverable content: in user space, QEMU VFIO issues VFIO IOCTLs
> >   towards the VFIO bus driver and VFIO IOMMU IOCTLs towards the
> >   TYPE1 / VGPU TYPE1 IOMMU backends; in kernel space, the vendor
> >   drivers (nvidia.ko, i915.ko) register with VGPU.ko and receive
> >   callbacks from it.]
> > 
> >  access flow:
> > 
> >   Guest MMIO / PCI config access
> >   |
> >   -
> >   |
> >   +-> KVM VM_EXITs  (kernel)
> >   |
> >   -
> >   |
> >   +-> QEMU VFIO driver (user)
> >   |
> >   -
> >   |
> >   +>  VGPU kernel driver (kernel)
> >   |
> >   |
> >   +> vendor driver callback
> > 
> > 
> 
> There is one difference between nvidia and intel implementations. We have
> vgpu device model in kernel, as part of i915.ko. So I/O emulation requests
> are forwarded directly in kernel side. 

Hi Kevin,

With the vendor driver callback, it will always forward to the kernel driver. If
you are talking about the QEMU VFIO driver (user) part I put on the above
diagram, that is how QEMU VFIO handles MMIO or PCI config access today, and we
don't change anything about it in this design.

Thanks,
Neo


> 
> Thanks
> Kevin



Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-26 Thread Neo Jia
On Tue, Jan 26, 2016 at 09:21:42PM +, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Wednesday, January 27, 2016 12:37 AM
> > 
> > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > On 2016/1/26 15:41, Jike Song wrote:
> > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > [cc +Neo @Nvidia]
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > I would expect we can spell out next level tasks toward above
> > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > some common VFIO framework changes that he can help :-)
> > > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > would you please have a look?
> > > > > >
> > > > > > Bus Driver
> > > > > >
> > > > > > { in i915/vgt/xxx.c }
> > > > > >
> > > > > > - define a subset of vfio_pci interfaces
> > > > > > - selective pass-through (say aperture)
> > > > > > - trap MMIO: interface w/ QEMU
> > > > >
> > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > don't apply, but you'll need to support the full device interface,
> > > > > right?  That includes the region info ioctl and access through the 
> > > > > vfio
> > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > >
> > > >
> > > > [All interfaces I thought are via ioctl:)  For other stuff like file
> > > > descriptor we'll definitely keep it.]
> > > >
> > > > The list of ioctl commands provided by vfio_pci:
> > > >
> > > > - VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > - VFIO_DEVICE_PCI_HOT_RESET
> > > >
> > > > As you said, above 2 don't apply. But for this:
> > > >
> > > > - VFIO_DEVICE_RESET
> > > >
> > > > In my opinion it should be kept, no matter what will be provided in
> > > > the bus driver.
> > > >
> > > > - VFIO_PCI_ROM_REGION_INDEX
> > > > - VFIO_PCI_VGA_REGION_INDEX
> > > >
> > > > I suppose above 2 don't apply neither? For a vgpu we don't provide a
> > > > ROM BAR or VGA region.
> > > >
> > > > - VFIO_DEVICE_GET_INFO
> > > > - VFIO_DEVICE_GET_REGION_INFO
> > > > - VFIO_DEVICE_GET_IRQ_INFO
> > > > - VFIO_DEVICE_SET_IRQS
> > > >
> > > > Above 4 are needed of course.
> > > >
> > > > We will need to extend:
> > > >
> > > > - VFIO_DEVICE_GET_REGION_INFO
> > > >
> > > >
> > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > should be trapped instead of being mmap-ed.
> > >
> > > I may not in the context, but i am curious how to handle the DONT_MAP in
> > > vfio driver? Since there are no real MMIO maps into the region and i
> > > suppose the access to the region should be handled by vgpu in i915
> > > driver, but currently most of the mmio accesses are handled by Qemu.
> > 
> > VFIO supports the following region attributes:
> > 
> > #define VFIO_REGION_INFO_FLAG_READ  (1 << 0) /* Region supports read */
> > #define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
> > #define VFIO_REGION_INFO_FLAG_MMAP  (1 << 2) /* Region supports mmap */
> > 
> > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > the specified offsets of the device file descriptor, depending on what
> > accesses are supported.  This is all reported through the REGION_INFO
> > ioctl for a given index.  If mmap is supported, the VM will have direct
> > access to the area, without faulting to KVM other than to populate the
> > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > returns out to QEMU to service the request, which then finds the
> > MemoryRegion serviced through vfio, which will then perform a
> > pread/pwrite through to the kernel vfio bus driver to handle the
> > access.  Thanks,
> > 
> 
> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to 
> KVM, so VM MMIO access will be forwarded to KVMGT directly for 
> emulation in kernel. If we reuse above R/W flags, the whole emulation 
> path would be unnecessarily long with obvious performance impact. We
> either need a new flag here to indicate in-kernel emulation (bias from
> passthrough support), or just hide the region alternatively (let KVMGT
> to handle I/O emulation itself like today).
> 

Hi Kevin,

Maybe there is some confusion about the VFIO interface that we are going to use
here. I thought we were going to adopt VFIO so nobody would need to plug into
the kvm module directly.

Thanks,
Neo


> Thanks
> Kevin
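
To make the region-flag behaviour Alex describes in the quoted text above
concrete, here is a minimal user-space sketch (not part of the proposed patches)
that queries BAR0's region info on an already-opened VFIO device fd and falls
back to pread() when the bus driver does not advertise mmap support; the helper
name read_bar0_dword() is ours.

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/vfio.h>

int read_bar0_dword(int device_fd, uint64_t bar_offset, uint32_t *val)
{
	struct vfio_region_info info = { .argsz = sizeof(info),
					 .index = VFIO_PCI_BAR0_REGION_INDEX };

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
		/* fast path: direct mapping, no trap per access */
		void *map = mmap(NULL, info.size, PROT_READ, MAP_SHARED,
				 device_fd, info.offset);
		if (map == MAP_FAILED)
			return -1;
		*val = *(volatile uint32_t *)((char *)map + bar_offset);
		munmap(map, info.size);
		return 0;
	}

	/* trapped path: every access goes through the kernel bus driver */
	if (!(info.flags & VFIO_REGION_INFO_FLAG_READ))
		return -1;
	return pread(device_fd, val, sizeof(*val),
		     info.offset + bar_offset) == sizeof(*val) ? 0 : -1;
}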



Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-26 Thread Neo Jia
On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > 
> > Hi Alex, Kevin and Jike,
> > 
> > (Seems I shouldn't use attachment, resend it again to the list, patches are
> > inline at the end)
> > 
> > Thanks for adding me to this technical discussion, a great opportunity
> > for us to design together which can bring both Intel and NVIDIA vGPU 
> > solution to
> > KVM platform.
> > 
> > Instead of directly jumping to the proposal that we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out 
> > couple
> > quick comments / thoughts regarding the existing discussions on this thread 
> > as
> > fundamentally I think we are solving the same problem, DMA, interrupt and 
> > MMIO.
> > 
> > Then we can look at what we have, hopefully we can reach some consensus 
> > soon.
> > 
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices create here rather than at the point where we start
> > > doing vfio "stuff".
> > 
> > Infact to keep vfio-vgpu to be more generic, vgpu device creation and 
> > management
> > can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> > group and VFIO group.
> 
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
> 
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.
> 
> > Graphics driver can register with vfio-vgpu to get management and emulation 
> > call
> > backs to graphics driver.   
> > 
> > We already have struct vgpu_device in our proposal that keeps pointer to
> > physical device.  
> > 
> > > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > > purpose. Anyway they can share the same injection interface;
> > 
> > eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
> > available to graphics driver so that graphics driver can inject interrupts
> > directly when physical device triggers interrupt. 
> > 
> > Here is the proposal we have, please review.
> > 
> > Please note the patches we have put out here is mainly for POC purpose to
> > verify our understanding also can serve the purpose to reduce confusions 
> > and speed up 
> > our design, although we are very happy to refine that to something 
> > eventually
> > can be used for both parties and upstreamed.
> > 
> > Linux vGPU kernel design
> > ==
> > 
> > Here we are proposing a generic Linux kernel module based on VFIO framework
> > which allows different GPU vendors to plugin and provide their GPU 
> > virtualization
> > solution on KVM, the benefits of having such generic kernel module are:
> > 
> > 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
> > 
> > 2) GPU HW agnostic management API for upper layer software such as libvirt
> > 
> > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver 
> > vendor
> > 
> > 0. High level overview
> > ==
> > 
> >  
> >   user space:
> > +---+  VFIO IOMMU IOCTLs
> >   +-| QEMU VFIO |-+
> > VFIO IOCTLs   | +---+ |
> > 

[Qemu-devel] Windows guest debugging on KVM/Qemu

2010-05-24 Thread Neo Jia
hi,

I am using KVM/Qemu to debug my Windows guest according to KVM wiki
page (http://www.linux-kvm.org/page/WindowsGuestDrivers/GuestDebugging).
It works for me, and I can also use only one Windows guest by binding its
serial port to a TCP port and running Virtual Serial Ports Emulator on
my Windows dev machine.

The problem is that this kind of connection is really slow. Is there
any known issue with the KVM serial port driver? There was a good
discussion about the same issue one year ago; I am not sure if there has
been any improvement since then.
(http://www.mail-archive.com/k...@vger.kernel.org/msg21145.html).

Thanks,
Neo
-- 
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!



[Qemu-devel] Re: guest kernel debugging through serial port

2010-03-17 Thread Neo Jia
Here is what I have asked before. The reason I want to assign a
real serial port to the guest is that debugging through the network
becomes really slow.

Thanks,
Neo

On Thu, Mar 11, 2010 at 2:44 AM, Neo Jia neo...@gmail.com wrote:
 hi,

 I have followed the windows guest debugging procedure from
 http://www.linux-kvm.org/page/WindowsGuestDrivers/GuestDebugging. And
 it works when I start two guests and bind tcp port to guest serial
 port, but it is really slow.

 And if I use -serial /dev/ttyS1 for the guest debugging target, I
 can't talk to it from my dev machine that has connected to ttyS1 with
 target machine (host).

 Is this a known problem?

 Thanks,
 Neo

 --
 I would remember that if researchers were not ambitious
 probably today we haven't the technology we are using!




-- 
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] guest kernel debugging through serial port

2010-03-11 Thread Neo Jia
hi,

I have followed the Windows guest debugging procedure from
http://www.linux-kvm.org/page/WindowsGuestDrivers/GuestDebugging. It
works when I start two guests and bind a TCP port to the guest serial
port, but it is really slow.

And if I use -serial /dev/ttyS1 for the guest debugging target, I
can't talk to it from my dev machine, which is connected to ttyS1 on
the target machine (host).

Is this a known problem?

Thanks,
Neo

-- 
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] Could not initialize SDL (kqemu)

2007-04-27 Thread Neo Jia

hi,

When I try to use kqemu on my IA32 Linux, it throws out "Could
not initialize SDL -- exiting".

Could you help me to figure it out?

Thanks,
Neo
--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] How can qemu to generate a signal 0 on i386 target (Linux) and i386 host?

2007-04-26 Thread Neo Jia

hi,

I am using kgdb to debug the Linux kernel. Both the target and the host are
IA32 platforms. But I got the following from my gdb console:

Program terminated with signal 0, Signal 0.
The program no longer exists.

In fact, this signal is not defined on my gdb.


From the post http://sourceware.org/ml/gdb/2004-03/msg1.html, it
seems that this signal is generated by qemu instead of being sent by
the underlying hardware.

So, I am wondering if anybody can point me to the code in qemu which
takes care of those signals.

Thanks,
Neo

--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] Re: How to debug Linux kernel on qemu with kgdb?

2007-04-25 Thread Neo Jia

On 4/25/07, Jan Kiszka [EMAIL PROTECTED] wrote:

Neo Jia wrote:
 hi,

 I am trying to use debug kgdb patched linux kernel on my qemu. Both
 the native and target platform are IA32. I am wondering if there is
 anyone can show me the procedure?

Yep, see https://mail.gna.org/public/xenomai-core/2006-09/msg00202.html

(BTW, I think that kgdb bug is still unfixed - I never got a feedback.)


I can connect gdb through /dev/pts/XX. My qemu is launched by

qemu -nographic -hda linux.img -kernel
./2.6.15.5-kgdb/vmlinuz-2.6.15.5-kgdb -serial pty -append kgdbwait
console=ttyS0 root=/dev/hda sb=0x220,5,1,5 ide2=noprobe ide3=noprobe
ide4=noprobe ide5=noprobe

Do you know where I can get the console output?

I would like to get kgdb + qemu working to debug the Linux kernel.

Thanks,
Neo



Jan






--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] Re: How to debug Linux kernel on qemu with kgdb?

2007-04-25 Thread Neo Jia

On 4/25/07, Neo Jia [EMAIL PROTECTED] wrote:

On 4/25/07, Jan Kiszka [EMAIL PROTECTED] wrote:
 Neo Jia wrote:
  hi,
 
  I am trying to use debug kgdb patched linux kernel on my qemu. Both
  the native and target platform are IA32. I am wondering if there is
  anyone can show me the procedure?

 Yep, see https://mail.gna.org/public/xenomai-core/2006-09/msg00202.html

 (BTW, I think that kgdb bug is still unfixed - I never got a feedback.)

I can connect gdb through /dev/pts/XX. My qemu is lanuched by

qemu -nographic -hda linux.img -kernel
./2.6.15.5-kgdb/vmlinuz-2.6.15.5-kgdb -serial pty -append kgdbwait
console=ttyS0 root=/dev/hda sb=0x220,5,1,5 ide2=noprobe ide3=noprobe
ide4=noprobe ide5=noprobe

Do you know where can I get the console output?

I would like to work out kgdb + qemu to debug linux kernel.

Thanks,
Neo



BTW, the error message I got is "Program terminated with signal 0, Signal 0."

Thanks,
Neo



 Jan





--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!




[Qemu-devel] Re: How to debug Linux kernel on qemu with kgdb?

2007-04-25 Thread Neo Jia

On 4/25/07, Jan Kiszka [EMAIL PROTECTED] wrote:

Neo Jia wrote:
 On 4/25/07, Jan Kiszka [EMAIL PROTECTED] wrote:
 Neo Jia wrote:
  hi,
 
  I am trying to use debug kgdb patched linux kernel on my qemu. Both
  the native and target platform are IA32. I am wondering if there is
  anyone can show me the procedure?

 Yep, see https://mail.gna.org/public/xenomai-core/2006-09/msg00202.html

 (BTW, I think that kgdb bug is still unfixed - I never got a feedback.)

 I can connect gdb through /dev/pts/XX. My qemu is lanuched by

 qemu -nographic -hda linux.img -kernel
 ./2.6.15.5-kgdb/vmlinuz-2.6.15.5-kgdb -serial pty -append kgdbwait
 console=ttyS0 root=/dev/hda sb=0x220,5,1,5 ide2=noprobe ide3=noprobe
 ide4=noprobe ide5=noprobe

 Do you know where can I get the console output?


Use ... -serial stdio -serial pty ... and attached kgdb to the second
serial port (I think to recall that is default anyway). The first one is
then used for the kernel console.

 I would like to work out kgdb + qemu to debug linux kernel.

??? So you really want to debug the kernel when kgdb is applied, ie.
actually debug kgdb? If you only intend to debug the kernel itself, qemu
-s + gdb is enough.


Jan,

I just would like to debug the kernel itself. I have tried qemu -s + gdb,
but it keeps getting apic_timer_interrupt when I use the n command.

The following is the output:


gdb vmlinux

GNU gdb 6.5.50.20060621-cvs
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as i686-pc-linux-gnu...Using host
libthread_db library /lib/tls/libthread_db.so.1.

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0xfff0 in ?? ()
warning: shared library handler failed to enable breakpoint
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.
default_idle () at include/asm/bitops.h:252
252 return ((1UL << (nr & 31)) & (addr[nr >> 5])) != 0;
(gdb) b sys_ex
sys_execve  sys_exit  sys_exit_group
(gdb) b sys_execve
Breakpoint 1 at 0xc0101ac1: file arch/i386/kernel/process.c, line 791.
(gdb) c
Continuing.

Breakpoint 1, sys_execve (regs=
 {ebx = 135197704, ecx = 135197864, edx = 135244936, esi =
135197704, edi = 135197704, ebp = -1079176984, eax = 11, xds = 123,
xes = 123, orig_eax = 11, eip = -1208835017, xcs = 115, eflags = 582,
esp = -1079177012, xss = 123})
   at arch/i386/kernel/process.c:791
791 filename = getname((char __user *) regs.ebx);
(gdb) n
0xc0103666 in apic_timer_interrupt () at include/asm/current.h:9
9   {
(gdb) quit

Thanks,
Neo



Jan






--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!



