Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Mon, Mar 28, 2022 at 8:23 PM Jason Gunthorpe wrote: > > On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote: > > To me, it looks easier to not answer this question by letting > > userspace know about the change, > > That is not backwards compatible, so I don't think it helps unless we > say if you open /dev/vfio/vfio you get old behavior and if you open > /dev/iommu you get new... Actually, this is one way to go. Trying to behave exactly like type1 might not be easy. > > Nor does it answer if I can fix RDMA or not :\ > vDPA has backend feature negotiation, so userspace can actually tell vDPA to go with the new accounting approach. Not sure RDMA can do the same. Thanks > So we really do need to know what exactly is the situation here. > > Jason > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe wrote: > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote: > > > It's simply because we don't want to break existing userspace. [1] > > I'm still waiting to hear what exactly breaks in real systems. > > As I explained this is not a significant change, but it could break > something in a few special scenarios. > > Also the one place we do have ABI breaks is security, and ulimit is a > security mechanism that isn't working right. So we do clearly need to > understand *exactly* what real thing breaks - if anything. > > Jason > To tell the truth, I don't know. I remember that OpenStack may do some accounting, so adding Sean for more comments. But we really can't imagine OpenStack is the only userspace that may use this. To me, it looks easier to not answer this question by letting userspace know about the change. Thanks
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:57 AM > > > > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > > > > > From: Jason Wang > > > > Sent: Thursday, March 24, 2022 10:28 AM > > > > > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin > > wrote: > > > > > > > > > > > From: Jason Gunthorpe > > > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > > > Doesn't this difference need to be accounted for in how libvirt > > > > > > > manages > > VM > > > > > > > resource limits? > > > > > > > > > > > > AFAICT, no, but it should be checked. > > > > > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > > > memory limits. > > > > > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > > > { > > > > > > rlim = tsk->signal->rlim + resource; > > > > > > [..] 
> > > > > > if (new_rlim) > > > > > > *rlim = *new_rlim; > > > > > > > > > > > > Which vfio reads back here: > > > > > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > > > > > And iommufd does the same read back: > > > > > > > > > > > > lock_limit = > > > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > > > PAGE_SHIFT; > > > > > > npages = pages->npinned - pages->last_npinned; > > > > > > do { > > > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > > > >locked_vm); > > > > > > new_pages = cur_pages + npages; > > > > > > if (new_pages > lock_limit) > > > > > > return -ENOMEM; > > > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > > > cur_pages, > > > > > >new_pages) != cur_pages); > > > > > > > > > > > > So it does work essentially the same. > > > > > > > > > > > > The difference is more subtle, io_uring/etc puts the charge in the > > > > > > user > > > > > > so it is additive with things like io_uring and additively spans all > > > > > > the user's processes. > > > > > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > > > > > The user visible difference will be that a limit X that worked with > > > > > > VFIO may start to fail after a kernel upgrade as the charge > > > > > > accounting > > > > > > is now cross user and additive with things like iommufd. > > > > > > > > > > > > This whole area is a bit peculiar (eg mlock itself works > > > > > > differently), > > > > > > IMHO, but with most of the places doing pins voting to use > > > > > > user->locked_vm as the charge it seems the right path in today's > > > > > > kernel. > > > > > > > > > > > > Certainly havi
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:28 AM > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > Doesn't this difference need to be accounted for in how libvirt manages VM > > > > > resource limits? > > > > > > > > AFAICT, no, but it should be checked. > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > memory limits. > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > { > > > > rlim = tsk->signal->rlim + resource; > > > > [..] > > > > if (new_rlim) > > > > *rlim = *new_rlim; > > > > > > > > Which vfio reads back here: > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > And iommufd does the same read back: > > > > > > > > lock_limit = > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > PAGE_SHIFT; > > > > npages = pages->npinned - pages->last_npinned; > > > > do { > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > >locked_vm); > > > > new_pages = cur_pages + npages; > > > > if (new_pages > lock_limit) > > > > return -ENOMEM; > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > cur_pages, > > > >new_pages) != cur_pages); > > > > > > > > So it does work essentially the same. 
> > > > > > > > The difference is more subtle, io_uring/etc puts the charge in the user > > > > so it is additive with things like io_uring and additively spans all > > > > the user's processes. > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > The user visible difference will be that a limit X that worked with > > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > > is now cross user and additive with things like iommufd. > > > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > > IMHO, but with most of the places doing pins voting to use > > > > user->locked_vm as the charge it seems the right path in today's > > > > kernel. > > > > > > > > Certainly having qemu concurrently using three different subsystems > > > > (vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > consistency within qemu. > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > the same accounting scheme given the management stack > > > has no insight into qemu on which one is actually used thus > > > cannot adapt to the subtle difference in between. In this > > > regard either we start fixing vfio type1 to use user->locked_vm > > > now or have iommufd follow vfio type1 for upward compatibility > > > and then change them together at a later point. > > > > > > I prefer the former as IMHO I don't know when a later > > > point will be w/o
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > From: Jason Gunthorpe > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > I'm still picking my way through the series, but the later compat > > > interface doesn't mention this difference as an outstanding issue. > > > Doesn't this difference need to be accounted for in how libvirt manages VM > > > resource limits? > > > > AFAICT, no, but it should be checked. > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > memory limits. > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > { > > rlim = tsk->signal->rlim + resource; > > [..] > > if (new_rlim) > > *rlim = *new_rlim; > > > > Which vfio reads back here: > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > And iommufd does the same read back: > > > > lock_limit = > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > PAGE_SHIFT; > > npages = pages->npinned - pages->last_npinned; > > do { > > cur_pages = atomic_long_read(&pages->source_user- > > >locked_vm); > > new_pages = cur_pages + npages; > > if (new_pages > lock_limit) > > return -ENOMEM; > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > cur_pages, > >new_pages) != cur_pages); > > > > So it does work essentially the same. > > > > The difference is more subtle, io_uring/etc puts the charge in the user > > so it is additive with things like io_uring and additively spans all > > the user's processes. > > > > However vfio is accounting only per-process and only for itself - no > > other subsystem uses locked as the charge variable for DMA pins. 
> > > > The user visible difference will be that a limit X that worked with > > VFIO may start to fail after a kernel upgrade as the charge accounting > > is now cross user and additive with things like iommufd. > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > IMHO, but with most of the places doing pins voting to use > > user->locked_vm as the charge it seems the right path in today's > > kernel. > > > > Certainly having qemu concurrently using three different subsystems > > (vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > I plan to fix RDMA like this as well so at least we can have > > consistency within qemu. > > > > I have an impression that iommufd and vfio type1 must use > the same accounting scheme given the management stack > has no insight into qemu on which one is actually used thus > cannot adapt to the subtle difference in between. In this > regard either we start fixing vfio type1 to use user->locked_vm > now or have iommufd follow vfio type1 for upward compatibility > and then change them together at a later point. > > I prefer the former as IMHO I don't know when a later > point will be w/o certain kernel changes to actually break the userspace > policy built on a wrong accounting scheme... I wonder if the kernel is the right place to do this. We have new uAPI so the management layer can know the difference in accounting in advance by -device vfio-pci,iommufd=on ? > > Thanks > Kevin >
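The per-user charge loop quoted in the thread above can be modelled in plain C11. This is only an illustrative userspace sketch of the accounting pattern (a cmpxchg loop against a shared per-user counter); the function name, the counter, and the limit values are made up, not the iommufd code:

```c
/* Sketch of iommufd-style per-user locked_vm accounting: try to charge
 * npages against lock_limit atomically, failing with no side effects
 * if the limit would be exceeded. Illustrative names, not kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long locked_vm;   /* stand-in for user->locked_vm */

static bool charge_locked_vm(long npages, long lock_limit)
{
    long cur = atomic_load(&locked_vm);
    long new;

    do {
        new = cur + npages;
        if (new > lock_limit)
            return false;       /* would exceed RLIMIT_MEMLOCK */
        /* on failure, cur is reloaded and the loop retries */
    } while (!atomic_compare_exchange_weak(&locked_vm, &cur, new));
    return true;
}
```

Because the counter lives in one shared object per user rather than per process, charges from multiple subsystems (vfio, RDMA, io_uring) naturally add up, which is exactly the user-visible behavior change being discussed.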
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Thu, Dec 9, 2021 at 1:41 PM Tian, Kevin wrote: > > > From: Jason Gunthorpe > > Sent: Thursday, December 2, 2021 9:55 PM > > > > Further, there is no reason why IMS should be reserved exclusively for > > VFIO! > > This is correct. Just as what you agreed with Thomas, the only difference > between IMS and MSI is where the messages are stored. Physically > it is unreasonable for the HW to check whether an interrupt is used for > a specific software usage (e.g. virtualization) since it doesn't have such > knowledge. At most an entry is associated with a PASID, but again the PASID > can be used anywhere. Note that vDPA had a plan to use IMS for parents that can have a huge number of instances. Thanks > > The wording in the current IDXD spec is not clear on this part. We'll talk to > the spec owner to have it fixed. > > Thanks > Kevin >
Re: [PATCH 0/5] iommu/virtio: Add identity domains
On Mon, Oct 18, 2021 at 11:35 PM Michael S. Tsirkin wrote: > > On Mon, Oct 18, 2021 at 04:23:41PM +0100, Jean-Philippe Brucker wrote: > > On Thu, Oct 14, 2021 at 03:00:38AM +, Tian, Kevin wrote: > > > > From: Jean-Philippe Brucker > > > > Sent: Wednesday, October 13, 2021 8:11 PM > > > > > > > > Support identity domains, allowing IOMMU protection to be enabled for only a > > > > subset of endpoints (those assigned to userspace, for example). Users > > > > may enable identity domains at compile time > > > > (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time > > > > (iommu.passthrough=1) or > > > > runtime (/sys/kernel/iommu_groups/*/type = identity). > > > > > > Do we want to use consistent terms between spec (bypass domain) > > > and code (identity domain)? > > > > I don't think we have to. Linux uses "identity" domains and "passthrough" > > IOMMU. The old virtio-iommu feature was "bypass" so we should keep that > > for the new one, to be consistent. And then I've used "bypass" for domains > > as well, in the spec, because it would look strange to use a different > > term for the same concept. I find that it sort of falls into place: Linux' > > identity domains can be implemented either with bypass or identity-mapped > > virtio-iommu domains. > > > > > > > > > > Patches 1-2 support identity domains using the optional > > > > VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the > > > > spec, see [1] for the latest proposal. > > > > > > > > Patches 3-5 add a fallback to identity mappings, when the feature is not > > > > supported. > > > > > > > > Note that this series doesn't touch the global bypass bit added by > > > > VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU > > > > should > > > > be attached to a domain, so global bypass isn't in use after endpoints > > > > > > I saw a concept of deferred attach in iommu core. See iommu_is_attach_deferred(). 
Currently this is vendor specific and I haven't > > > looked into the exact reason why some vendor sets it now. Just > > > be curious whether the same reason might be applied to virtio-iommu. > > > > > > > are probed. Before that, the global bypass policy is decided by the > > > > hypervisor and firmware. So I don't think Linux needs to touch the > > > > > > This reminds me one thing. The spec says that the global bypass > > > bit is sticky and not affected by reset. > > > > The spec talks about *device* reset, triggered by software writing 0 to > > the status register, but it doesn't mention system reset. Would be good to > > clarify that. Something like: > > > > If the device offers the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, it MAY > > initialize the \field{bypass} field to 1. Field \field{bypass} SHOULD > > NOT change on device reset, but SHOULD be restored to its initial > > value on system reset. > > > > > This implies that in the case > > > of rebooting the VM into a different OS, the previous OS actually > > > has the right to override this setting for the next OS. Is it a right > > > design? Even the firmware itself is unable to identify the original > > > setting enforced by the hypervisor after reboot. I feel the hypervisor > > > setting should be recovered after reset since it reflects the > > > security measure enforced by the virtual platform? > > > > So I think clarifying system reset should address your questions. > > I believe we should leave bypass sticky across device reset, so a FW->OS > > transition, where the OS resets the device, does not open a vulnerability > > (if bypass was enabled at boot and then left disabled by FW.) > > > > It's still a good idea for the OS to restore on shutdown the bypass value > > it was booted with. So it can kexec into an OS that doesn't support > > virtio-iommu, for example. > > > > Thanks, > > Jean > > Is this stickiness really important? Can't this be addressed just by > hypervisor disabling bypass at boot? 
And I'm not sure if sticky can survive transport reset. Thanks > > -- > MST >
Re: [PATCH 1/5] iova: Move fast alloc size roundup into alloc_iova_fast()
On Fri, Sep 24, 2021 at 6:07 PM John Garry wrote: > > It really is a property of the IOVA rcache code that we need to alloc a > power-of-2 size, so relocate the functionality to resize into > alloc_iova_fast(), rather than the callsites. > > Signed-off-by: John Garry Acked-by: Jason Wang > --- > drivers/iommu/dma-iommu.c| 8 > drivers/iommu/iova.c | 9 + > drivers/vdpa/vdpa_user/iova_domain.c | 8 > 3 files changed, 9 insertions(+), 16 deletions(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 896bea04c347..a99b3445fef8 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -444,14 +444,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct > iommu_domain *domain, > > shift = iova_shift(iovad); > iova_len = size >> shift; > - /* > -* Freeing non-power-of-two-sized allocations back into the IOVA > caches > -* will come back to bite us badly, so we have to waste a bit of space > -* rounding up anything cacheable to make sure that can't happen. The > -* order of the unadjusted size will still match upon freeing. > -*/ > - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) > - iova_len = roundup_pow_of_two(iova_len); > > dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit); > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c > index 9e8bc802ac05..ff567cbc42f7 100644 > --- a/drivers/iommu/iova.c > +++ b/drivers/iommu/iova.c > @@ -497,6 +497,15 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long > size, > unsigned long iova_pfn; > struct iova *new_iova; > > + /* > +* Freeing non-power-of-two-sized allocations back into the IOVA > caches > +* will come back to bite us badly, so we have to waste a bit of space > +* rounding up anything cacheable to make sure that can't happen. The > +* order of the unadjusted size will still match upon freeing. 
> +*/ > + if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) > + size = roundup_pow_of_two(size); > + > iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); > if (iova_pfn) > return iova_pfn; > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c > b/drivers/vdpa/vdpa_user/iova_domain.c > index 1daae2608860..2b1143f11d8f 100644 > --- a/drivers/vdpa/vdpa_user/iova_domain.c > +++ b/drivers/vdpa/vdpa_user/iova_domain.c > @@ -292,14 +292,6 @@ vduse_domain_alloc_iova(struct iova_domain *iovad, > unsigned long iova_len = iova_align(iovad, size) >> shift; > unsigned long iova_pfn; > > - /* > -* Freeing non-power-of-two-sized allocations back into the IOVA > caches > -* will come back to bite us badly, so we have to waste a bit of space > -* rounding up anything cacheable to make sure that can't happen. The > -* order of the unadjusted size will still match upon freeing. > -*/ > - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) > - iova_len = roundup_pow_of_two(iova_len); > iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true); > > return iova_pfn << shift; > -- > 2.26.2 >
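The rounding this patch centralizes can be modelled in standalone C. This is only a sketch of the size-adjustment logic; the constant mirrors the kernel's IOVA_RANGE_CACHE_MAX_SIZE (6 at the time of this series, assumed here), and roundup_pow_of_two() is a minimal stand-in for the kernel helper:

```c
#include <assert.h>

/* log2 of the largest rcache-able size, plus one; 6 in the kernel at
 * the time of this series (an assumption for this sketch) */
#define IOVA_RANGE_CACHE_MAX_SIZE 6

/* minimal stand-in for the kernel's roundup_pow_of_two() */
static unsigned long roundup_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* What alloc_iova_fast() now does to the requested length before the
 * rcache lookup: cacheable (small) sizes become powers of two, so that
 * freed IOVAs always land back in a matching power-of-two cache bucket;
 * sizes at or above the cache threshold pass through unchanged. */
static unsigned long rcache_adjust_len(unsigned long iova_len)
{
    if (iova_len < (1UL << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
        iova_len = roundup_pow_of_two(iova_len);
    return iova_len;
}
```

Moving this adjustment into alloc_iova_fast() means every caller (dma-iommu, VDUSE) gets the same behavior without duplicating the check.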
Re: [PATCH v13 02/13] eventfd: Export eventfd_wake_count to modules
On 2021/8/31 18:36, Xie Yongji wrote: Export eventfd_wake_count so that some modules can use the eventfd_signal_count() to check whether the eventfd_signal() call should be deferred to a safe context. Signed-off-by: Xie Yongji And this matches the comment inside eventfd_signal(): /* * Deadlock or stack overflow issues can happen if we recurse here * through waitqueue wakeup handlers. If the caller uses potentially * nested waitqueues with custom wakeup handlers, then it should * check eventfd_signal_count() before calling this function. If * it returns true, the eventfd_signal() call should be deferred to a * safe context. */ So: Acked-by: Jason Wang --- fs/eventfd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/eventfd.c b/fs/eventfd.c index e265b6dd4f34..1b3130b8d6c1 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -26,6 +26,7 @@ #include DEFINE_PER_CPU(int, eventfd_wake_count); +EXPORT_PER_CPU_SYMBOL_GPL(eventfd_wake_count); static DEFINE_IDA(eventfd_ida);
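The deferral pattern the quoted comment describes — check the wake count first and punt to a safe context when already inside a wakeup handler — looks roughly like the following. This is a userspace sketch with made-up names, not the actual module code; the real consumer would read the per-CPU eventfd_wake_count and use schedule_work() for the deferred path:

```c
#include <assert.h>
#include <stdbool.h>

static int signalled;   /* counts inline signals in this sketch */

/* stand-in for eventfd_signal() */
static void do_signal(void)
{
    signalled++;
}

/* Returns true if the signal went out inline, false if it had to be
 * deferred because we are already running inside a wakeup handler
 * (wake_count != 0), which could otherwise recurse and deadlock or
 * overflow the stack. */
static bool signal_or_defer(int wake_count)
{
    if (wake_count) {
        /* real code would schedule_work() and signal from there */
        return false;
    }
    do_signal();
    return true;
}
```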
Re: [PATCH v11 12/12] Documentation: Add documentation for VDUSE
On 2021/8/18 20:06, Xie Yongji wrote: VDUSE (vDPA Device in Userspace) is a framework to support implementing software-emulated vDPA devices in userspace. This document is intended to clarify the VDUSE design and usage. Signed-off-by: Xie Yongji Acked-by: Jason Wang --- Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/vduse.rst | 233 ++ 2 files changed, 234 insertions(+) create mode 100644 Documentation/userspace-api/vduse.rst diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 0b5eefed027e..c432be070f67 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -27,6 +27,7 @@ place where this information is gathered. iommu media/index sysfs-platform_profile + vduse .. only:: subproject and html diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst new file mode 100644 index ..42ef59ea5314 --- /dev/null +++ b/Documentation/userspace-api/vduse.rst @@ -0,0 +1,233 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +A vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specification and a +vendor-specific control path. vDPA devices can be either physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. And +to make the device emulation more secure, the emulated vDPA device's +control path is handled in the kernel and only the data path is +implemented in the userspace. + +Note that only the virtio block device is supported by the VDUSE framework now, +which can reduce security risks when the userspace process that implements +the data path is run by an unprivileged user. Support for other device +types can be added after the security issue of the corresponding device driver +is clarified or fixed in the future. 
+ +Create/Destroy VDUSE devices + + +VDUSE devices are created as follows: + +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on + /dev/vduse/control. + +2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME. + +3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first + messages will arrive while attaching the VDUSE instance to vDPA bus. + +4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE + instance to vDPA bus. + +VDUSE devices are destroyed as follows: + +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE + instance from vDPA bus. + +2. Close the file descriptor referring to /dev/vduse/$NAME. + +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on + /dev/vduse/control. + +The netlink messages can be sent via vdpa tool in iproute2 or use the +below sample codes: + +.. code-block:: c + + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) + { + struct nl_sock *nlsock; + struct nl_msg *msg; + int famid; + + nlsock = nl_socket_alloc(); + if (!nlsock) + return -ENOMEM; + + if (genl_connect(nlsock)) + goto free_sock; + + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); + if (famid < 0) + goto close_sock; + + msg = nlmsg_alloc(); + if (!msg) + goto close_sock; + + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) + goto nla_put_failure; + + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); + if (cmd == VDPA_CMD_DEV_NEW) + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); + + if (nl_send_sync(nlsock, msg)) + goto close_sock; + + nl_close(nlsock); + nl_socket_free(nlsock); + + return 0; + nla_put_failure: + nlmsg_free(msg); + close_sock: + nl_close(nlsock); + free_sock: + nl_socket_free(nlsock); + return -1; + } + +How VDUSE works +--- + +As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on +/dev/vduse/control. 
With this ioctl, userspace can specify some basic configuration such as device name (uniquely identifying a VDUSE device), virtio features, virtio configuration space, the number of virtqueues and so on for this emulated device. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. Userspace can use the VDUSE_VQ_SETUP ioctl on /dev/vduse/$NAME to add per-virtqueue configuration such as the max size of virtqueue
Re: [PATCH v11 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/8/18 20:06, Xie Yongji wrote: This VDUSE driver enables implementing software-emulated vDPA devices in userspace. The vDPA device is created by ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. In order to make the device emulation more secure, the device's control path is handled in the kernel. A message mechanism is introduced to forward some dataplane-related control messages to userspace. And in the data path, the DMA buffer will be mapped into the userspace address space in different ways depending on the vDPA bus to which the vDPA device is attached. In the virtio-vdpa case, the MMU-based software IOTLB is used to achieve that. And in the vhost-vdpa case, the DMA buffer resides in a userspace memory region which can be shared with the VDUSE userspace process by transferring the shmfd. For more details on VDUSE design and usage, please see the follow-on Documentation commit. Signed-off-by: Xie Yongji Acked-by: Jason Wang
Re: [PATCH v11 05/12] vhost-vdpa: Handle the failure of vdpa_reset()
On 2021/8/18 8:06 PM, Xie Yongji wrote:

> The vdpa_reset() may fail now. This adds a check of its return value and
> fails vhost_vdpa_open() on error.
>
> Signed-off-by: Xie Yongji
> Acked-by: Jason Wang
> ---
>  drivers/vhost/vdpa.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b1c91b4db0ba..d99d75ad30cc 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -116,12 +116,13 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
>  	irq_bypass_unregister_producer(&vq->call_ctx.producer);
>  }
>
> -static void vhost_vdpa_reset(struct vhost_vdpa *v)
> +static int vhost_vdpa_reset(struct vhost_vdpa *v)
>  {
>  	struct vdpa_device *vdpa = v->vdpa;
>
> -	vdpa_reset(vdpa);
>  	v->in_batch = 0;
> +
> +	return vdpa_reset(vdpa);
>  }
>
>  static long vhost_vdpa_get_device_id(struct vhost_vdpa *v, u8 __user *argp)
> @@ -868,7 +869,9 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
>  		return -EBUSY;
>
>  	nvqs = v->nvqs;
> -	vhost_vdpa_reset(v);
> +	r = vhost_vdpa_reset(v);
> +	if (r)
> +		goto err;
>
>  	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
>  	if (!vqs) {
Re: [PATCH v11 04/12] vdpa: Add reset callback in vdpa_config_ops
在 2021/8/18 下午8:06, Xie Yongji 写道: This adds a new callback to support device specific reset behavior. The vdpa bus driver will call the reset function instead of setting status to zero during resetting if device driver supports the new callback. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 9 +++-- include/linux/vdpa.h | 11 ++- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b07aa161f7ad..b1c91b4db0ba 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; u8 status, status_old; - int nvqs = v->nvqs; + int ret, nvqs = v->nvqs; u16 i; if (copy_from_user(&status, statusp, sizeof(status))) @@ -172,7 +172,12 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) if (status != 0 && (ops->get_status(vdpa) & ~status) != 0) return -EINVAL; - ops->set_status(vdpa, status); + if (status == 0 && ops->reset) { + ret = ops->reset(vdpa); + if (ret) + return ret; + } else + ops->set_status(vdpa, status); if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !(status_old & VIRTIO_CONFIG_S_DRIVER_OK)) for (i = 0; i < nvqs; i++) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 8a645f8f4476..af7ea5ad795f 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -196,6 +196,9 @@ struct vdpa_iova_range { *@vdev: vdpa device *Returns the iova range supported by *the device. + * @reset: Reset device (optional) + * @vdev: vdpa device + * Returns integer: success (0) or error (< 0) It looks to me we'd better make this mandatory. This help to reduce the confusion for the parent driver. 
Thanks * @set_map: Set device memory mapping (optional) *Needed for device that using device *specific DMA translation (on-chip IOMMU) @@ -263,6 +266,7 @@ struct vdpa_config_ops { const void *buf, unsigned int len); u32 (*get_generation)(struct vdpa_device *vdev); struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); + int (*reset)(struct vdpa_device *vdev); /* DMA ops */ int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); @@ -351,12 +355,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline void vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = false; + if (ops->reset) + return ops->reset(vdev); + ops->set_status(vdev, 0); + + return 0; } static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v11 03/12] vdpa: Fix some coding style issues
在 2021/8/18 下午8:06, Xie Yongji 写道: Fix some code indent issues and following checkpatch warning: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' 371: FILE: include/linux/vdpa.h:371: +static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, Signed-off-by: Xie Yongji Acked-by: Jason Wang --- include/linux/vdpa.h | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 954b340f6c2f..8a645f8f4476 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -43,17 +43,17 @@ struct vdpa_vq_state_split { * @last_used_idx: used index */ struct vdpa_vq_state_packed { -u16last_avail_counter:1; -u16last_avail_idx:15; -u16last_used_counter:1; -u16last_used_idx:15; + u16 last_avail_counter:1; + u16 last_avail_idx:15; + u16 last_used_counter:1; + u16 last_used_idx:15; }; struct vdpa_vq_state { - union { - struct vdpa_vq_state_split split; - struct vdpa_vq_state_packed packed; - }; + union { + struct vdpa_vq_state_split split; + struct vdpa_vq_state_packed packed; + }; }; struct vdpa_mgmt_dev; @@ -131,7 +131,7 @@ struct vdpa_iova_range { *@vdev: vdpa device *@idx: virtqueue index *@state: pointer to returned state (last_avail_idx) - * @get_vq_notification: Get the notification area for a virtqueue + * @get_vq_notification: Get the notification area for a virtqueue *@vdev: vdpa device *@idx: virtqueue index *Returns the notifcation area @@ -353,25 +353,25 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) static inline void vdpa_reset(struct vdpa_device *vdev) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = false; -ops->set_status(vdev, 0); + ops->set_status(vdev, 0); } static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = true; -return 
ops->set_features(vdev, features); + return ops->set_features(vdev, features); } - -static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, - void *buf, unsigned int len) +static inline void vdpa_get_config(struct vdpa_device *vdev, + unsigned int offset, void *buf, + unsigned int len) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; /* * Config accesses aren't supposed to trigger before features are set. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v11 01/12] iova: Export alloc_iova_fast() and free_iova_fast()
On 2021/8/18 8:06 PM, Xie Yongji wrote:

> Export alloc_iova_fast() and free_iova_fast() so that some modules can
> make use of the per-CPU cache to get rid of the rbtree spinlock in
> alloc_iova() and free_iova() during IOVA allocation.
>
> Signed-off-by: Xie Yongji
> Acked-by: Jason Wang

(If we need a respin, I'd suggest putting the numbers you measured here.)

Thanks

> ---
>  drivers/iommu/iova.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index b6cf5f16123b..3941ed6bb99b 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
>
>  	return new_iova->pfn_lo;
>  }
> +EXPORT_SYMBOL_GPL(alloc_iova_fast);
>
>  /**
>   * free_iova_fast - free iova pfn range into rcache
> @@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
>
>  	free_iova(iovad, pfn);
>  }
> +EXPORT_SYMBOL_GPL(free_iova_fast);
>
>  #define fq_ring_for_each(i, fq) \
>  	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
On 2021/8/9 1:56 PM, Yongji Xie wrote:

> On Thu, Aug 5, 2021 at 9:31 PM Jason Wang wrote:
>> On 2021/8/5 8:34 PM, Yongji Xie wrote:
>>>> My main point, though, is that if you've already got something else
>>>> keeping track of the actual addresses, then the way you're using an
>>>> iova_domain appears to be something you could do with a trivial bitmap
>>>> allocator. That's why I don't buy the efficiency argument. The main
>>>> design points of the IOVA allocator are to manage large address spaces
>>>> while trying to maximise spatial locality to minimise the underlying
>>>> pagetable usage, and allocating with a flexible limit to support
>>>> multiple devices with different addressing capabilities in the same
>>>> address space. If none of those aspects are relevant to the use-case -
>>>> which AFAICS appears to be true here - then as a general-purpose
>>>> resource allocator it's rubbish and has an unreasonably massive memory
>>>> overhead and there are many, many better choices.
>>>
>>> OK, I get your point. Actually we used the genpool allocator in the
>>> early version. Maybe we can fall back to using it.
>>
>> I think maybe you can share some perf numbers to see how much
>> alloc_iova_fast() can help.
>
> I did some fio tests [1] with a ram-backed vduse block device [2].
> Following are some performance data:
>
>                   numjobs=1   numjobs=2   numjobs=4   numjobs=8
> iova_alloc_fast   145k iops   265k iops   514k iops   758k iops
> iova_alloc        137k iops   170k iops   128k iops   113k iops
> gen_pool_alloc    143k iops   270k iops   458k iops   521k iops
>
> The iova_alloc_fast() has the best performance since we always hit the
> per-CPU cache. Regardless of the per-CPU cache, the genpool allocator
> should be better than the iova allocator.

I think we see convincing numbers for using alloc_iova_fast() over
gen_pool_alloc() (45% improvement at numjobs=8).

Thanks

> [1] fio jobfile:
>
> [global]
> rw=randread
> direct=1
> ioengine=libaio
> iodepth=16
> time_based=1
> runtime=60s
> group_reporting
> bs=4k
> filename=/dev/vda
>
> [job]
> numjobs=..
>
> [2]
> $ qemu-storage-daemon \
>     --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>     --monitor chardev=charmonitor \
>     --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>     --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
>
> The qemu-storage-daemon can be built from this repo:
> https://github.com/bytedance/qemu/tree/vduse-test.
>
> Thanks,
> Yongji
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
在 2021/8/5 下午8:34, Yongji Xie 写道: My main point, though, is that if you've already got something else keeping track of the actual addresses, then the way you're using an iova_domain appears to be something you could do with a trivial bitmap allocator. That's why I don't buy the efficiency argument. The main design points of the IOVA allocator are to manage large address spaces while trying to maximise spatial locality to minimise the underlying pagetable usage, and allocating with a flexible limit to support multiple devices with different addressing capabilities in the same address space. If none of those aspects are relevant to the use-case - which AFAICS appears to be true here - then as a general-purpose resource allocator it's rubbish and has an unreasonably massive memory overhead and there are many, many better choices. OK, I get your point. Actually we used the genpool allocator in the early version. Maybe we can fall back to using it. I think maybe you can share some perf numbers to see how much alloc_iova_fast() can help. Thanks ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
在 2021/8/4 下午5:07, Yongji Xie 写道: On Wed, Aug 4, 2021 at 4:54 PM Jason Wang wrote: 在 2021/8/4 下午4:50, Yongji Xie 写道: On Wed, Aug 4, 2021 at 4:32 PM Jason Wang wrote: 在 2021/8/3 下午5:38, Yongji Xie 写道: On Tue, Aug 3, 2021 at 4:09 PM Jason Wang wrote: 在 2021/7/29 下午3:34, Xie Yongji 写道: The device reset may fail in virtio-vdpa case now, so add checks to its return value and fail the register_virtio_device(). So the reset() would be called by the driver during remove as well, or is it sufficient to deal only with the reset during probe? Actually there is no way to handle failure during removal. And it should be safe with the protection of software IOTLB even if the reset() fails. Thanks, Yongji If this is true, does it mean we don't even need to care about reset failure? But we need to handle the failure in the vhost-vdpa case, isn't it? Yes, but: - This patch is for virtio not for vhost, if we don't care virtio, we can avoid the changes - For vhost, there could be two ways probably: 1) let the set_status to report error 2) require userspace to re-read for status It looks to me you want to go with 1) and I'm not sure whether or not it's too late to go with 2). Looks like 2) can't work if reset failure happens in vhost_vdpa_release() and vhost_vdpa_open(). Yes, you're right. Thanks Thanks, Yongji ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
在 2021/8/4 下午4:50, Yongji Xie 写道: On Wed, Aug 4, 2021 at 4:32 PM Jason Wang wrote: 在 2021/8/3 下午5:38, Yongji Xie 写道: On Tue, Aug 3, 2021 at 4:09 PM Jason Wang wrote: 在 2021/7/29 下午3:34, Xie Yongji 写道: The device reset may fail in virtio-vdpa case now, so add checks to its return value and fail the register_virtio_device(). So the reset() would be called by the driver during remove as well, or is it sufficient to deal only with the reset during probe? Actually there is no way to handle failure during removal. And it should be safe with the protection of software IOTLB even if the reset() fails. Thanks, Yongji If this is true, does it mean we don't even need to care about reset failure? But we need to handle the failure in the vhost-vdpa case, isn't it? Yes, but: - This patch is for virtio not for vhost, if we don't care virtio, we can avoid the changes - For vhost, there could be two ways probably: 1) let the set_status to report error 2) require userspace to re-read for status It looks to me you want to go with 1) and I'm not sure whether or not it's too late to go with 2). Thanks Thanks, Yongji ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure
在 2021/8/3 下午5:50, Yongji Xie 写道: On Tue, Aug 3, 2021 at 4:10 PM Jason Wang wrote: 在 2021/7/29 下午3:34, Xie Yongji 写道: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b07aa161f7ad..dd05c1e1133c 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; u8 status, status_old; - int nvqs = v->nvqs; + int timeout = 0, nvqs = v->nvqs; u16 i; if (copy_from_user(&status, statusp, sizeof(status))) @@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return -EINVAL; ops->set_status(vdpa, status); + if (status == 0) { + while (ops->get_status(vdpa)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } Spec has introduced the reset a one of the basic facility. And consider we differ reset here. This makes me think if it's better to introduce a dedicated vdpa ops for reset? Do you mean replace the ops.set_status(vdev, 0) with the ops.reset()? Then we can remove the timeout processing which is device specific stuff. Exactly. Thanks Thanks, Yongji ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
On 2021/8/3 5:38 PM, Yongji Xie wrote:

> On Tue, Aug 3, 2021 at 4:09 PM Jason Wang wrote:
>> On 2021/7/29 3:34 PM, Xie Yongji wrote:
>>> The device reset may fail in the virtio-vdpa case now, so add a check
>>> of its return value and fail register_virtio_device() on error.
>>
>> So the reset() would be called by the driver during remove as well, or
>> is it sufficient to deal only with the reset during probe?
>
> Actually there is no way to handle failure during removal. And it should
> be safe with the protection of the software IOTLB even if the reset()
> fails.
>
> Thanks,
> Yongji

If this is true, does it mean we don't even need to care about reset
failure?

Thanks
Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero
在 2021/8/3 下午5:31, Yongji Xie 写道: On Tue, Aug 3, 2021 at 3:58 PM Jason Wang wrote: 在 2021/7/29 下午3:34, Xie Yongji 写道: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vdpa_reset() after timeout. Signed-off-by: Xie Yongji --- include/linux/vdpa.h | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 406d53a606ac..d1a80ef05089 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -6,6 +6,7 @@ #include #include #include +#include /** * struct vdpa_calllback - vDPA callback definition. @@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline void vdpa_reset(struct vdpa_device *vdev) +#define VDPA_RESET_TIMEOUT_MS 1000 + +static inline int vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; + int timeout = 0; vdev->features_valid = false; ops->set_status(vdev, 0); + while (ops->get_status(vdev)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } I wonder if it's better to do this in the vDPA parent? Thanks Sorry, I didn't get you here. Do you mean vDPA parent driver (e.g. VDUSE)? Yes, since the how it's expected to behave depends on the specific hardware. Even for the spec, the behavior is transport specific: PCI: requires reread until 0 MMIO: doesn't require but it might not work for the hardware so we decide to change CCW: the succeed of the ccw command means the success of the reset Thanks Actually I didn't find any other place where I can do set_status() and get_status(). Thanks, Yongji ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 02/17] file: Export receive_fd() to modules
On 2021/8/3 5:01 PM, Yongji Xie wrote:

> On Tue, Aug 3, 2021 at 3:46 PM Jason Wang wrote:
>> On 2021/7/29 3:34 PM, Xie Yongji wrote:
>>> Export receive_fd() so that some modules can use it to pass a file
>>> descriptor between processes without missing any security checks.
>>>
>>> Signed-off-by: Xie Yongji
>>> ---
>>>  fs/file.c            | 6 ++++++
>>>  include/linux/file.h | 7 +++----
>>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index 86dc9956af32..210e540672aa 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
>>>  	return new_fd;
>>>  }
>>>
>>> +int receive_fd(struct file *file, unsigned int o_flags)
>>> +{
>>> +	return __receive_fd(file, NULL, o_flags);
>>
>> Any reason that receive_fd_user() can live in the file.h?
>
> Since no modules use it.
>
> Thanks,
> Yongji

Ok.

Acked-by: Jason Wang
Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure
在 2021/7/29 下午3:34, Xie Yongji 写道: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index b07aa161f7ad..dd05c1e1133c 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) struct vdpa_device *vdpa = v->vdpa; const struct vdpa_config_ops *ops = vdpa->config; u8 status, status_old; - int nvqs = v->nvqs; + int timeout = 0, nvqs = v->nvqs; u16 i; if (copy_from_user(&status, statusp, sizeof(status))) @@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return -EINVAL; ops->set_status(vdpa, status); + if (status == 0) { + while (ops->get_status(vdpa)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } Spec has introduced the reset a one of the basic facility. And consider we differ reset here. This makes me think if it's better to introduce a dedicated vdpa ops for reset? Thanks + } if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !(status_old & VIRTIO_CONFIG_S_DRIVER_OK)) for (i = 0; i < nvqs; i++) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()
在 2021/7/29 下午3:34, Xie Yongji 写道: The device reset may fail in virtio-vdpa case now, so add checks to its return value and fail the register_virtio_device(). So the reset() would be called by the driver during remove as well, or is it sufficient to deal only with the reset during probe? Thanks Signed-off-by: Xie Yongji --- drivers/virtio/virtio.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index a15beb6b593b..8df75425fb43 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -349,7 +349,9 @@ int register_virtio_device(struct virtio_device *dev) /* We always start by resetting the device, in case a previous * driver messed it up. This also tests that code path a little. */ - dev->config->reset(dev); + err = dev->config->reset(dev); + if (err) + goto err_reset; /* Acknowledge that we've seen the device. */ virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE); @@ -362,10 +364,13 @@ int register_virtio_device(struct virtio_device *dev) */ err = device_add(&dev->dev); if (err) - ida_simple_remove(&virtio_index_ida, dev->index); -out: - if (err) - virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); + goto err_add; + + return 0; +err_add: + virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); +err_reset: + ida_simple_remove(&virtio_index_ida, dev->index); return err; } EXPORT_SYMBOL_GPL(register_virtio_device); ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 07/17] virtio: Don't set FAILED status bit on device index allocation failure
On 2021/7/29 3:34 PM, Xie Yongji wrote:

> We don't need to set the FAILED status bit on device index allocation
> failure since the device initialization hasn't been started yet. This
> doesn't affect runtime; found in code review.
>
> Signed-off-by: Xie Yongji

Does it really harm?

Thanks

> ---
>  drivers/virtio/virtio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 4b15c00c0a0a..a15beb6b593b 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -338,7 +338,7 @@ int register_virtio_device(struct virtio_device *dev)
>  	/* Assign a unique device index and hence name. */
>  	err = ida_simple_get(&virtio_index_ida, 0, 0, GFP_KERNEL);
>  	if (err < 0)
> -		goto out;
> +		return err;
>
>  	dev->index = err;
>  	dev_set_name(&dev->dev, "virtio%u", dev->index);
Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero
在 2021/7/29 下午3:34, Xie Yongji 写道: Re-read the device status to ensure it's set to zero during resetting. Otherwise, fail the vdpa_reset() after timeout. Signed-off-by: Xie Yongji --- include/linux/vdpa.h | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 406d53a606ac..d1a80ef05089 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -6,6 +6,7 @@ #include #include #include +#include /** * struct vdpa_calllback - vDPA callback definition. @@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline void vdpa_reset(struct vdpa_device *vdev) +#define VDPA_RESET_TIMEOUT_MS 1000 + +static inline int vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; + int timeout = 0; vdev->features_valid = false; ops->set_status(vdev, 0); + while (ops->get_status(vdev)) { + timeout += 20; + if (timeout > VDPA_RESET_TIMEOUT_MS) + return -EIO; + + msleep(20); + } I wonder if it's better to do this in the vDPA parent? Thanks + + return 0; } static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 03/17] vdpa: Fix code indentation
在 2021/7/29 下午3:34, Xie Yongji 写道: Use tabs to indent the code instead of spaces. Signed-off-by: Xie Yongji --- include/linux/vdpa.h | 29 ++--- 1 file changed, 14 insertions(+), 15 deletions(-) It looks to me not all the warnings are addressed. Or did you silent checkpatch.pl -f? Thanks diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 7c49bc5a2b71..406d53a606ac 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -43,17 +43,17 @@ struct vdpa_vq_state_split { * @last_used_idx: used index */ struct vdpa_vq_state_packed { -u16last_avail_counter:1; -u16last_avail_idx:15; -u16last_used_counter:1; -u16last_used_idx:15; + u16 last_avail_counter:1; + u16 last_avail_idx:15; + u16 last_used_counter:1; + u16 last_used_idx:15; }; struct vdpa_vq_state { - union { - struct vdpa_vq_state_split split; - struct vdpa_vq_state_packed packed; - }; + union { + struct vdpa_vq_state_split split; + struct vdpa_vq_state_packed packed; + }; }; struct vdpa_mgmt_dev; @@ -131,7 +131,7 @@ struct vdpa_iova_range { *@vdev: vdpa device *@idx: virtqueue index *@state: pointer to returned state (last_avail_idx) - * @get_vq_notification: Get the notification area for a virtqueue + * @get_vq_notification: Get the notification area for a virtqueue *@vdev: vdpa device *@idx: virtqueue index *Returns the notifcation area @@ -342,25 +342,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) static inline void vdpa_reset(struct vdpa_device *vdev) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = false; -ops->set_status(vdev, 0); + ops->set_status(vdev, 0); } static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = true; -return ops->set_features(vdev, features); + return ops->set_features(vdev, features); } - static inline void 
vdpa_get_config(struct vdpa_device *vdev, unsigned offset, void *buf, unsigned int len) { -const struct vdpa_config_ops *ops = vdev->config; + const struct vdpa_config_ops *ops = vdev->config; /* * Config accesses aren't supposed to trigger before features are set. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 02/17] file: Export receive_fd() to modules
在 2021/7/29 下午3:34, Xie Yongji 写道: Export receive_fd() so that some modules can use it to pass file descriptor between processes without missing any security stuffs. Signed-off-by: Xie Yongji --- fs/file.c| 6 ++ include/linux/file.h | 7 +++ 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/file.c b/fs/file.c index 86dc9956af32..210e540672aa 100644 --- a/fs/file.c +++ b/fs/file.c @@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags) return new_fd; } +int receive_fd(struct file *file, unsigned int o_flags) +{ + return __receive_fd(file, NULL, o_flags); Any reason that receive_fd_user() can live in the file.h? Thanks +} +EXPORT_SYMBOL_GPL(receive_fd); + static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags) { int err = -EBADF; diff --git a/include/linux/file.h b/include/linux/file.h index 2de2e4613d7b..51e830b4fe3a 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file); extern int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags); + +extern int receive_fd(struct file *file, unsigned int o_flags); + static inline int receive_fd_user(struct file *file, int __user *ufd, unsigned int o_flags) { @@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int __user *ufd, return -EFAULT; return __receive_fd(file, ufd, o_flags); } -static inline int receive_fd(struct file *file, unsigned int o_flags) -{ - return __receive_fd(file, NULL, o_flags); -} int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags); extern void flush_delayed_fput(void); ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
在 2021/7/29 下午3:34, Xie Yongji 写道: Export alloc_iova_fast() and free_iova_fast() so that some modules can use it to improve iova allocation efficiency. It's better to explain why alloc_iova() is not sufficient here. Thanks Signed-off-by: Xie Yongji --- drivers/iommu/iova.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b6cf5f16123b..3941ed6bb99b 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, return new_iova->pfn_lo; } +EXPORT_SYMBOL_GPL(alloc_iova_fast); /** * free_iova_fast - free iova pfn range into rcache @@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) free_iova(iovad, pfn); } +EXPORT_SYMBOL_GPL(free_iova_fast); #define fq_ring_for_each(i, fq) \ for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v10 17/17] Documentation: Add documentation for VDUSE
在 2021/7/29 下午3:35, Xie Yongji 写道: VDUSE (vDPA Device in Userspace) is a framework to support implementing software-emulated vDPA devices in userspace. This document is intended to clarify the VDUSE design and usage. Signed-off-by: Xie Yongji --- Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/vduse.rst | 232 ++ 2 files changed, 233 insertions(+) create mode 100644 Documentation/userspace-api/vduse.rst diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 0b5eefed027e..c432be070f67 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -27,6 +27,7 @@ place where this information is gathered. iommu media/index sysfs-platform_profile + vduse .. only:: subproject and html diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst new file mode 100644 index ..30c9d1482126 --- /dev/null +++ b/Documentation/userspace-api/vduse.rst @@ -0,0 +1,232 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. And +to make the device emulation more secure, the emulated vDPA device's +control path is handled in the kernel and only the data path is +implemented in the userspace. + +Note that only virtio block device is supported by VDUSE framework now, +which can reduce security risks when the userspace process that implements +the data path is run by an unprivileged user. The support for other device +types can be added after the security issue of corresponding device driver +is clarified or fixed in the future. + +Create/Destroy VDUSE devices + + +VDUSE devices are created as follows: + +1. 
Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on + /dev/vduse/control. + +2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME. + +3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first + messages will arrive while attaching the VDUSE instance to vDPA bus. + +4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE + instance to vDPA bus. + +VDUSE devices are destroyed as follows: + +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE + instance from vDPA bus. + +2. Close the file descriptor referring to /dev/vduse/$NAME. + +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on + /dev/vduse/control. + +The netlink messages can be sent via vdpa tool in iproute2 or use the +below sample codes: + +.. code-block:: c + + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) + { + struct nl_sock *nlsock; + struct nl_msg *msg; + int famid; + + nlsock = nl_socket_alloc(); + if (!nlsock) + return -ENOMEM; + + if (genl_connect(nlsock)) + goto free_sock; + + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); + if (famid < 0) + goto close_sock; + + msg = nlmsg_alloc(); + if (!msg) + goto close_sock; + + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) + goto nla_put_failure; + + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); + if (cmd == VDPA_CMD_DEV_NEW) + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); + + if (nl_send_sync(nlsock, msg)) + goto close_sock; + + nl_close(nlsock); + nl_socket_free(nlsock); + + return 0; + nla_put_failure: + nlmsg_free(msg); + close_sock: + nl_close(nlsock); + free_sock: + nl_socket_free(nlsock); + return -1; + } + +How VDUSE works +--- + +As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on +/dev/vduse/control. 
With this ioctl, userspace can specify some basic configuration +such as device name (uniquely identify a VDUSE device), virtio features, virtio +configuration space, the number of virtqueues and so on for this emulated device. +Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device +emulation. Userspace can use the VDUSE_VQ_SETUP ioctl on /dev/vduse/$NAME to +add per-virtqueue configuration such as the max size of virtqueue to the device. + +After the initialization, the VDUSE device can b
Re: [PATCH v10 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/7/29 下午3:35, Xie Yongji 写道: This VDUSE driver enables implementing software-emulated vDPA devices in userspace. The vDPA device is created by ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. In order to make the device emulation more secure, the device's control path is handled in kernel. A message mechnism is introduced to forward some dataplane related control messages to userspace. And in the data path, the DMA buffer will be mapped into userspace address space through different ways depending on the vDPA bus to which the vDPA device is attached. In virtio-vdpa case, the MMU-based software IOTLB is used to achieve that. And in vhost-vdpa case, the DMA buffer is reside in a userspace memory region which can be shared to the VDUSE userspace processs via transferring the shmfd. For more details on VDUSE design and usage, please see the follow-on Documentation commit. Signed-off-by: Xie Yongji --- Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig | 10 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/vduse_dev.c | 1541 include/uapi/linux/vduse.h | 220 +++ 6 files changed, 1778 insertions(+) create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 1409e40e6345..293ca3aef358 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -300,6 +300,7 @@ Code Seq#Include File Comments 'z' 10-4F drivers/s390/crypto/zcrypt_api.hconflict! 
'|' 00-7F linux/media.h 0x80 00-1F linux/fb.h +0x81 00-1F linux/vduse.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index a503c1b2bfd9..6e23bce6433a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK vDPA block device simulator which terminates IO request in a memory buffer. +config VDPA_USER + tristate "VDUSE (vDPA Device in Userspace) support" + depends on EVENTFD && MMU && HAS_DMA + select DMA_OPS + select VHOST_IOTLB + select IOMMU_IOVA + help + With VDUSE it is possible to emulate a vDPA Device + in a userspace program. + config IFCVF tristate "Intel IFC VF vDPA driver" depends on PCI_MSI diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile index 67fe7f3d6943..f02ebed33f19 100644 --- a/drivers/vdpa/Makefile +++ b/drivers/vdpa/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VDPA) += vdpa.o obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ +obj-$(CONFIG_VDPA_USER) += vdpa_user/ obj-$(CONFIG_IFCVF)+= ifcvf/ obj-$(CONFIG_MLX5_VDPA) += mlx5/ obj-$(CONFIG_VP_VDPA)+= virtio_pci/ diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile new file mode 100644 index ..260e0b26af99 --- /dev/null +++ b/drivers/vdpa/vdpa_user/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +vduse-y := vduse_dev.o iova_domain.o + +obj-$(CONFIG_VDPA_USER) += vduse.o diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c new file mode 100644 index ..6addc62e7de6 --- /dev/null +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -0,0 +1,1541 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VDUSE: vDPA Device in Userspace + * + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved. 
+ * + * Author: Xie Yongji + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iova_domain.h" + +#define DRV_AUTHOR "Yongji Xie " +#define DRV_DESC "vDPA Device in Userspace" +#define DRV_LICENSE "GPL v2" + +#define VDUSE_DEV_MAX (1U << MINORBITS) +#define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024) +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024) +#define VDUSE_REQUEST_TIMEOUT 30 I think we need make this as a module parameter. 0 probably means we need to wait for ever. This can help in the case when the userspace is attached by GDB. If Michael is still not happy, we can find other solution (e.g only offload the datapath). Other looks good. Thanks ___
Re: [PATCH v9 17/17] Documentation: Add documentation for VDUSE
在 2021/7/13 下午4:46, Xie Yongji 写道: VDUSE (vDPA Device in Userspace) is a framework to support implementing software-emulated vDPA devices in userspace. This document is intended to clarify the VDUSE design and usage. Signed-off-by: Xie Yongji --- Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/vduse.rst | 248 ++ 2 files changed, 249 insertions(+) create mode 100644 Documentation/userspace-api/vduse.rst diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 0b5eefed027e..c432be070f67 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -27,6 +27,7 @@ place where this information is gathered. iommu media/index sysfs-platform_profile + vduse .. only:: subproject and html diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst new file mode 100644 index ..2c0d56d4b2da --- /dev/null +++ b/Documentation/userspace-api/vduse.rst @@ -0,0 +1,248 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. And +to make the device emulation more secure, the emulated vDPA device's +control path is handled in the kernel and only the data path is +implemented in the userspace. + +Note that only virtio block device is supported by VDUSE framework now, +which can reduce security risks when the userspace process that implements +the data path is run by an unprivileged user. The support for other device +types can be added after the security issue of corresponding device driver +is clarified or fixed in the future. 
+ +Start/Stop VDUSE devices + + +VDUSE devices are started as follows: Not native speaker but "created" is probably better. + +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on + /dev/vduse/control. + +2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME. + +3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first + messages will arrive while attaching the VDUSE instance to vDPA bus. + +4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE + instance to vDPA bus. I think 4 should be done before 3? + +VDUSE devices are stopped as follows: "removed" or "destroyed" is better than "stopped" here. + +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE + instance from vDPA bus. + +2. Close the file descriptor referring to /dev/vduse/$NAME. + +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on + /dev/vduse/control. + +The netlink messages can be sent via vdpa tool in iproute2 or use the +below sample codes: + +.. code-block:: c + + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) + { + struct nl_sock *nlsock; + struct nl_msg *msg; + int famid; + + nlsock = nl_socket_alloc(); + if (!nlsock) + return -ENOMEM; + + if (genl_connect(nlsock)) + goto free_sock; + + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); + if (famid < 0) + goto close_sock; + + msg = nlmsg_alloc(); + if (!msg) + goto close_sock; + + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) + goto nla_put_failure; + + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); + if (cmd == VDPA_CMD_DEV_NEW) + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); + + if (nl_send_sync(nlsock, msg)) + goto close_sock; + + nl_close(nlsock); + nl_socket_free(nlsock); + + return 0; + nla_put_failure: + nlmsg_free(msg); + close_sock: + nl_close(nlsock); + free_sock: + nl_socket_free(nlsock); + return -1; + } + +How VDUSE works +--- + +As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on 
+/dev/vduse/control. With this ioctl, userspace can specify some basic configuration +such as device name (uniquely identify a VDUSE device), virtio features, virtio +configuration space, bounce buffer size This bounce buffer size looks questionable. We'd better not expose any implementation details to userspace. I think we can simply start with a module parameter for VDUSE? and
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/15 12:03 PM, Yongji Xie wrote: Which ioctl can be used for this? I mean we can introduce a new ioctl for that in the future. Ok, I see. I wonder if it's better to do something similar to ccw: 1) require userspace to update the status bit in the response, 2) update dev->status to the status in the response if there is no timeout. Then userspace could set NEEDS_RESET if necessary. But NEEDS_RESET does not only happen in this case. Yes, so you probably need an ioctl for userspace to update the device status (NEEDS_RESET). Yes, but I'd like to do that after the initial version is merged, since NEEDS_RESET is not supported now. Right. Thanks
Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
在 2021/7/14 下午5:57, Dan Carpenter 写道: On Wed, Jul 14, 2021 at 05:41:54PM +0800, Jason Wang wrote: 在 2021/7/14 下午4:05, Dan Carpenter 写道: On Wed, Jul 14, 2021 at 10:14:32AM +0800, Jason Wang wrote: 在 2021/7/13 下午7:31, Dan Carpenter 写道: On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote: @@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size) } } -static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, - struct vhost_iotlb_msg *msg) +static int vhost_vdpa_pa_map(struct vhost_vdpa *v, +u64 iova, u64 size, u64 uaddr, u32 perm) { struct vhost_dev *dev = &v->vdev; - struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; unsigned long list_size = PAGE_SIZE / sizeof(struct page *); unsigned int gup_flags = FOLL_LONGTERM; unsigned long npages, cur_base, map_pfn, last_pfn = 0; unsigned long lock_limit, sz2pin, nchunks, i; - u64 iova = msg->iova; + u64 start = iova; long pinned; int ret = 0; - if (msg->iova < v->range.first || - msg->iova + msg->size - 1 > v->range.last) - return -EINVAL; This is not related to your patch, but can the "msg->iova + msg->size" addition can have an integer overflow. From looking at the callers it seems like it can. msg comes from: vhost_chr_write_iter() --> dev->msg_handler(dev, &msg); --> vhost_vdpa_process_iotlb_msg() --> vhost_vdpa_process_iotlb_update() Yes. If I'm thinking of the right thing then these are allowed to overflow to 0 because of the " - 1" but not further than that. I believe the check needs to be something like: if (msg->iova < v->range.first || msg->iova - 1 > U64_MAX - msg->size || I guess we don't need - 1 here? The - 1 is important. The highest address is 0x. So it goes start + size = 0 and then start + size - 1 == 0x. Right, so actually msg->iova = 0xfffe, msg->size=2 is valid. I believe so, yes. It's inclusive of 0xfffe and 0x. (Not an expert). 
I think so, and we probably need to fix vhost_overflow() as well, which does: static bool vhost_overflow(u64 uaddr, u64 size) { /* Make sure 64 bit math will not overflow. */ return uaddr > ULONG_MAX || size > ULONG_MAX || uaddr > ULONG_MAX - size; } Thanks regards, dan carpenter
Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
在 2021/7/14 下午4:05, Dan Carpenter 写道: On Wed, Jul 14, 2021 at 10:14:32AM +0800, Jason Wang wrote: 在 2021/7/13 下午7:31, Dan Carpenter 写道: On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote: @@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size) } } -static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, - struct vhost_iotlb_msg *msg) +static int vhost_vdpa_pa_map(struct vhost_vdpa *v, +u64 iova, u64 size, u64 uaddr, u32 perm) { struct vhost_dev *dev = &v->vdev; - struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; unsigned long list_size = PAGE_SIZE / sizeof(struct page *); unsigned int gup_flags = FOLL_LONGTERM; unsigned long npages, cur_base, map_pfn, last_pfn = 0; unsigned long lock_limit, sz2pin, nchunks, i; - u64 iova = msg->iova; + u64 start = iova; long pinned; int ret = 0; - if (msg->iova < v->range.first || - msg->iova + msg->size - 1 > v->range.last) - return -EINVAL; This is not related to your patch, but can the "msg->iova + msg->size" addition can have an integer overflow. From looking at the callers it seems like it can. msg comes from: vhost_chr_write_iter() --> dev->msg_handler(dev, &msg); --> vhost_vdpa_process_iotlb_msg() --> vhost_vdpa_process_iotlb_update() Yes. If I'm thinking of the right thing then these are allowed to overflow to 0 because of the " - 1" but not further than that. I believe the check needs to be something like: if (msg->iova < v->range.first || msg->iova - 1 > U64_MAX - msg->size || I guess we don't need - 1 here? The - 1 is important. The highest address is 0x. So it goes start + size = 0 and then start + size - 1 == 0x. Right, so actually msg->iova = 0xfffe, msg->size=2 is valid. Thanks I guess we could move the - 1 to the other side? msg->iova > U64_MAX - msg->size + 1 || regards, dan carpenter ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/7/14 下午2:49, Yongji Xie 写道: On Wed, Jul 14, 2021 at 1:45 PM Jason Wang wrote: 在 2021/7/13 下午4:46, Xie Yongji 写道: This VDUSE driver enables implementing software-emulated vDPA devices in userspace. The vDPA device is created by ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. In order to make the device emulation more secure, the device's control path is handled in kernel. A message mechnism is introduced to forward some dataplane related control messages to userspace. And in the data path, the DMA buffer will be mapped into userspace address space through different ways depending on the vDPA bus to which the vDPA device is attached. In virtio-vdpa case, the MMU-based IOMMU driver is used to achieve that. And in vhost-vdpa case, the DMA buffer is reside in a userspace memory region which can be shared to the VDUSE userspace processs via transferring the shmfd. For more details on VDUSE design and usage, please see the follow-on Documentation commit. Signed-off-by: Xie Yongji --- Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig | 10 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/vduse_dev.c | 1502 include/uapi/linux/vduse.h | 221 +++ 6 files changed, 1740 insertions(+) create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 1409e40e6345..293ca3aef358 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -300,6 +300,7 @@ Code Seq#Include File Comments 'z' 10-4F drivers/s390/crypto/zcrypt_api.h conflict! 
'|' 00-7F linux/media.h 0x80 00-1F linux/fb.h +0x81 00-1F linux/vduse.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index a503c1b2bfd9..6e23bce6433a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK vDPA block device simulator which terminates IO request in a memory buffer. +config VDPA_USER + tristate "VDUSE (vDPA Device in Userspace) support" + depends on EVENTFD && MMU && HAS_DMA + select DMA_OPS + select VHOST_IOTLB + select IOMMU_IOVA + help + With VDUSE it is possible to emulate a vDPA Device + in a userspace program. + config IFCVF tristate "Intel IFC VF vDPA driver" depends on PCI_MSI diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile index 67fe7f3d6943..f02ebed33f19 100644 --- a/drivers/vdpa/Makefile +++ b/drivers/vdpa/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VDPA) += vdpa.o obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ +obj-$(CONFIG_VDPA_USER) += vdpa_user/ obj-$(CONFIG_IFCVF)+= ifcvf/ obj-$(CONFIG_MLX5_VDPA) += mlx5/ obj-$(CONFIG_VP_VDPA)+= virtio_pci/ diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile new file mode 100644 index ..260e0b26af99 --- /dev/null +++ b/drivers/vdpa/vdpa_user/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +vduse-y := vduse_dev.o iova_domain.o + +obj-$(CONFIG_VDPA_USER) += vduse.o diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c new file mode 100644 index ..c994a4a4660c --- /dev/null +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -0,0 +1,1502 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VDUSE: vDPA Device in Userspace + * + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved. 
+ * + * Author: Xie Yongji + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iova_domain.h" + +#define DRV_AUTHOR "Yongji Xie " +#define DRV_DESC "vDPA Device in Userspace" +#define DRV_LICENSE "GPL v2" + +#define VDUSE_DEV_MAX (1U << MINORBITS) +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024) +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024) +#define VDUSE_REQUEST_TIMEOUT 30 + +struct vduse_virtqueue { + u16 index; + u16 num_max; + u32 num; + u64 desc_addr; +
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/14 2:47 PM, Greg KH wrote: On Wed, Jul 14, 2021 at 02:02:50PM +0800, Jason Wang wrote: On 2021/7/14 1:54 PM, Michael S. Tsirkin wrote: On Wed, Jul 14, 2021 at 01:45:39PM +0800, Jason Wang wrote: +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + int ret; + + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + msg->req.request_id = dev->msg_unique++; + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + + wait_event_killable_timeout(msg->waitq, msg->completed, + VDUSE_REQUEST_TIMEOUT * HZ); + spin_lock(&dev->msg_lock); + if (!msg->completed) { + list_del(&msg->list); + msg->resp.result = VDUSE_REQ_RESULT_FAILED; + } + ret = (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO; I think we should mark the device as malfunctioning when there is a timeout and forbid any userspace operations except for the destroy afterwards, for safety. It looks like if one tried to run gdb on the program, the behaviour would change completely because the kernel wants it to respond within a specific time. Looks like a recipe for heisenbugs. Let's not build interfaces with arbitrary timeouts like that. Interruptible wait exists for this very reason. The problem is: do we want a userspace program like modprobe to be stuck for an indefinite time and expect the administrator to kill it? Why would modprobe be stuck forever? Is this on the module probe path? Yes, it is called in the device probing path, where the kernel forwards the device configuration request to userspace and waits for its response. If it turns out to be tricky, we can implement the whole device inside the kernel and leave only the datapath in userspace (as TUN does). Thanks
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/14 1:54 PM, Michael S. Tsirkin wrote: On Wed, Jul 14, 2021 at 01:45:39PM +0800, Jason Wang wrote: +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + int ret; + + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + msg->req.request_id = dev->msg_unique++; + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + + wait_event_killable_timeout(msg->waitq, msg->completed, + VDUSE_REQUEST_TIMEOUT * HZ); + spin_lock(&dev->msg_lock); + if (!msg->completed) { + list_del(&msg->list); + msg->resp.result = VDUSE_REQ_RESULT_FAILED; + } + ret = (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO; I think we should mark the device as malfunctioning when there is a timeout and forbid any userspace operations except for the destroy afterwards, for safety. It looks like if one tried to run gdb on the program, the behaviour would change completely because the kernel wants it to respond within a specific time. Looks like a recipe for heisenbugs. Let's not build interfaces with arbitrary timeouts like that. Interruptible wait exists for this very reason. The problem is: do we want a userspace program like modprobe to be stuck for an indefinite time and expect the administrator to kill it? Thanks Let userspace have its own code to set and use timers. This does mean that userspace will likely have to change a bit to support this driver; such is life.
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/7/13 下午4:46, Xie Yongji 写道: This VDUSE driver enables implementing software-emulated vDPA devices in userspace. The vDPA device is created by ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device interface (/dev/vduse/$NAME) is exported to userspace for device emulation. In order to make the device emulation more secure, the device's control path is handled in kernel. A message mechnism is introduced to forward some dataplane related control messages to userspace. And in the data path, the DMA buffer will be mapped into userspace address space through different ways depending on the vDPA bus to which the vDPA device is attached. In virtio-vdpa case, the MMU-based IOMMU driver is used to achieve that. And in vhost-vdpa case, the DMA buffer is reside in a userspace memory region which can be shared to the VDUSE userspace processs via transferring the shmfd. For more details on VDUSE design and usage, please see the follow-on Documentation commit. Signed-off-by: Xie Yongji --- Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig | 10 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/vduse_dev.c | 1502 include/uapi/linux/vduse.h | 221 +++ 6 files changed, 1740 insertions(+) create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 1409e40e6345..293ca3aef358 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -300,6 +300,7 @@ Code Seq#Include File Comments 'z' 10-4F drivers/s390/crypto/zcrypt_api.hconflict! 
'|' 00-7F linux/media.h 0x80 00-1F linux/fb.h +0x81 00-1F linux/vduse.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index a503c1b2bfd9..6e23bce6433a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK vDPA block device simulator which terminates IO request in a memory buffer. +config VDPA_USER + tristate "VDUSE (vDPA Device in Userspace) support" + depends on EVENTFD && MMU && HAS_DMA + select DMA_OPS + select VHOST_IOTLB + select IOMMU_IOVA + help + With VDUSE it is possible to emulate a vDPA Device + in a userspace program. + config IFCVF tristate "Intel IFC VF vDPA driver" depends on PCI_MSI diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile index 67fe7f3d6943..f02ebed33f19 100644 --- a/drivers/vdpa/Makefile +++ b/drivers/vdpa/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VDPA) += vdpa.o obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ +obj-$(CONFIG_VDPA_USER) += vdpa_user/ obj-$(CONFIG_IFCVF)+= ifcvf/ obj-$(CONFIG_MLX5_VDPA) += mlx5/ obj-$(CONFIG_VP_VDPA)+= virtio_pci/ diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile new file mode 100644 index ..260e0b26af99 --- /dev/null +++ b/drivers/vdpa/vdpa_user/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +vduse-y := vduse_dev.o iova_domain.o + +obj-$(CONFIG_VDPA_USER) += vduse.o diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c new file mode 100644 index ..c994a4a4660c --- /dev/null +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -0,0 +1,1502 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VDUSE: vDPA Device in Userspace + * + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved. 
+ * + * Author: Xie Yongji + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iova_domain.h" + +#define DRV_AUTHOR "Yongji Xie " +#define DRV_DESC "vDPA Device in Userspace" +#define DRV_LICENSE "GPL v2" + +#define VDUSE_DEV_MAX (1U << MINORBITS) +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024) +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024) +#define VDUSE_REQUEST_TIMEOUT 30 + +struct vduse_virtqueue { + u16 index; + u16 num_max; + u32 num; + u64 desc_addr; + u64 driver_addr; + u64 device_addr; + struct vdpa_vq_state state; + bool ready; + bool kicked; + spinlock_t kick_lock; + spinlock_t irq_lock; + s
Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/13 9:27 PM, Dan Carpenter wrote: On Tue, Jul 13, 2021 at 04:46:55PM +0800, Xie Yongji wrote: +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name) +{ + struct vduse_vdpa *vdev; + int ret; + + if (dev->vdev) + return -EEXIST; + + vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev, +&vduse_vdpa_config_ops, name, true); + if (!vdev) + return -ENOMEM; This should be an IS_ERR() check instead of a NULL check. Yes. The vdpa_alloc_device() macro is doing something very complicated but I'm not sure what. It calls container_of() and that looks buggy until you spot the BUILD_BUG_ON_ZERO() compile-time assert which ensures that the container_of() is a no-op. Only one of the callers checks for error pointers correctly, so maybe it's too complicated or maybe there should be better documentation. We need better documentation for this macro and to fix all the buggy callers. Yong Ji, want to do that? Thanks regards, dan carpenter
Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
On 2021/7/13 7:31 PM, Dan Carpenter wrote:
> On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote:
>> @@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 iova, u64 size)
>>  	}
>>  }
>>  
>> -static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>> -					   struct vhost_iotlb_msg *msg)
>> +static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>> +			     u64 iova, u64 size, u64 uaddr, u32 perm)
>>  {
>>  	struct vhost_dev *dev = &v->vdev;
>> -	struct vhost_iotlb *iotlb = dev->iotlb;
>>  	struct page **page_list;
>>  	unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
>>  	unsigned int gup_flags = FOLL_LONGTERM;
>>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>>  	unsigned long lock_limit, sz2pin, nchunks, i;
>> -	u64 iova = msg->iova;
>> +	u64 start = iova;
>>  	long pinned;
>>  	int ret = 0;
>>  
>> -	if (msg->iova < v->range.first ||
>> -	    msg->iova + msg->size - 1 > v->range.last)
>> -		return -EINVAL;
>
> This is not related to your patch, but the "msg->iova + msg->size"
> addition can have an integer overflow. From looking at the callers it
> seems like it can. msg comes from:
>
>   vhost_chr_write_iter()
>   --> dev->msg_handler(dev, &msg);
>   --> vhost_vdpa_process_iotlb_msg()
>   --> vhost_vdpa_process_iotlb_update()

Yes.

> If I'm thinking of the right thing then these are allowed to overflow
> to 0 because of the " - 1" but not further than that. I believe the
> check needs to be something like:
>
>	if (msg->iova < v->range.first ||
>	    msg->iova - 1 > U64_MAX - msg->size ||

I guess we don't need - 1 here?

Thanks

>	    msg->iova + msg->size - 1 > v->range.last)
>
> But writing integer overflow checks correctly is notoriously difficult.
> Do you think you could send a fix for that which is separate from the
> patchset? We'd want to backport it to stable.
>
> regards,
> dan carpenter
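The overflow-safe form of the range check Dan is describing can be sketched in userspace like this (`iova_range_ok()` and its parameters are illustrative names; `first`/`last` play the role of v->range.first/v->range.last, and the actual kernel fix may differ in detail):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Overflow-safe check that [iova, iova + size - 1] lies within
 * [first, last]. The naive "iova + size - 1 > last" test is wrong
 * because iova + size can wrap around past U64_MAX. */
static bool iova_range_ok(uint64_t iova, uint64_t size,
			  uint64_t first, uint64_t last)
{
	if (size == 0)
		return false;
	if (iova < first)
		return false;
	/* reject if iova + size - 1 would wrap past U64_MAX */
	if (size - 1 > UINT64_MAX - iova)
		return false;
	return iova + size - 1 <= last;
}
```

The wrap test is rewritten as a subtraction (`size - 1 > UINT64_MAX - iova`) precisely so that no intermediate expression can itself overflow.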
Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On 2021/7/7 11:54 PM, Stefan Hajnoczi wrote:
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:
On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:
On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi wrote:
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device configuration space is generally used for rarely-changing or initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements. Here the language (especially the word "generally") is weaker and means there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG approach is reads that have side-effects. For example, imagine a field containing an error code if the device encounters a problem unrelated to a specific virtqueue request. Reading from this field resets the error code to 0, saving the driver an extra configuration space write access and possibly race conditions. It isn't possible to implement those semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is not suitable for this kind of error propagating. And it would be very hard to implement such kind of semantic in some transports. Virtqueue should be much better. As Yong Ji quoted, the config space is used for "rarely-changing or initialization-time parameters".

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version.
And to handle the message failure, I'm going to add a return value to virtio_config_ops.get() and virtio_cread_* API so that the error can be propagated to the virtio device driver. Then the virtio-blk device driver can be modified to handle that. Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value? The VIRTIO spec provides no way for the device to report errors from config space accesses. The QEMU virtio-pci implementation returns -1 from invalid virtio_config_read*() and silently discards virtio_config_write*() accesses. VDUSE can take the same approach with VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

I'd like to stick to the current assumption that get_config won't fail. That is to say,

1) maintain a config in the kernel, make sure the config space read can always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't know whether or not caching is in effect. The interface you outlined above requires caching. Is there a reason why the host kernel vDPA code needs to cache the configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference with vhost-user.

virtio_cread() can sleep:

#define virtio_cread(vdev, structname, member, ptr)		\
	do {							\
		typeof(((structname*)0)->member) virtio_cread_v;\
								\
		might_sleep();					\
		^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a buggy/malicious userspace may refuse to respond to the get_config.
It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to userspace and waiting for its response. And we know for sure this model can work, an example is TUN/TAP: netdevice is abstracted in the kernel and datapath is done via sendmsg()/recvmsg(). Maintaining the config in the kernel follows this model and it can simplify the device generation implementation. For config space write, it requires more thought but fortunately it's not commonly used. So VDUSE can choose to filter out the device/features that depend on the config write.

This is the problem. There are other messages like SET_FEATURES where I guess we'll face the same challenge.

Probably not
Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:
On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:
On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi wrote:
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device configuration space is generally used for rarely-changing or initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements. Here the language (especially the word "generally") is weaker and means there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG approach is reads that have side-effects. For example, imagine a field containing an error code if the device encounters a problem unrelated to a specific virtqueue request. Reading from this field resets the error code to 0, saving the driver an extra configuration space write access and possibly race conditions. It isn't possible to implement those semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is not suitable for this kind of error propagating. And it would be very hard to implement such kind of semantic in some transports. Virtqueue should be much better. As Yong Ji quoted, the config space is used for "rarely-changing or initialization-time parameters".

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to handle the message failure, I'm going to add a return value to virtio_config_ops.get() and virtio_cread_* API so that the error can be propagated to the virtio device driver.
Then the virtio-blk device driver can be modified to handle that. Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value? The VIRTIO spec provides no way for the device to report errors from config space accesses. The QEMU virtio-pci implementation returns -1 from invalid virtio_config_read*() and silently discards virtio_config_write*() accesses. VDUSE can take the same approach with VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

I'd like to stick to the current assumption that get_config won't fail. That is to say,

1) maintain a config in the kernel, make sure the config space read can always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't know whether or not caching is in effect. The interface you outlined above requires caching. Is there a reason why the host kernel vDPA code needs to cache the configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference with vhost-user.

virtio_cread() can sleep:

#define virtio_cread(vdev, structname, member, ptr)		\
	do {							\
		typeof(((structname*)0)->member) virtio_cread_v;\
								\
		might_sleep();					\
		^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to userspace and waiting for its response.
And we know for sure this model can work, an example is TUN/TAP: netdevice is abstracted in the kernel and datapath is done via sendmsg()/recvmsg(). Maintaining the config in the kernel follows this model and it can simplify the device generation implementation. For config space write, it requires more thought but fortunately it's not commonly used. So VDUSE can choose to filter out the device/features that depend on the config write.

This is the problem. There are other messages like SET_FEATURES where I guess we'll face the same challenge.

Probably not, userspace device can tell the kernel about the device_features and mandated_features during creation, and the fe
Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device configuration space is generally used for rarely-changing or initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements. Here the language (especially the word "generally") is weaker and means there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG approach is reads that have side-effects. For example, imagine a field containing an error code if the device encounters a problem unrelated to a specific virtqueue request. Reading from this field resets the error code to 0, saving the driver an extra configuration space write access and possibly race conditions. It isn't possible to implement those semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is not suitable for this kind of error propagating. And it would be very hard to implement such kind of semantic in some transports. Virtqueue should be much better. As Yong Ji quoted, the config space is used for "rarely-changing or initialization-time parameters".

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to handle the message failure, I'm going to add a return value to virtio_config_ops.get() and virtio_cread_* API so that the error can be propagated to the virtio device driver. Then the virtio-blk device driver can be modified to handle that. Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value? The VIRTIO spec provides no way for the device to report errors from config space accesses.
The QEMU virtio-pci implementation returns -1 from invalid virtio_config_read*() and silently discards virtio_config_write*() accesses. VDUSE can take the same approach with VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

I'd like to stick to the current assumption that get_config won't fail. That is to say,

1) maintain a config in the kernel, make sure the config space read can always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't know whether or not caching is in effect. The interface you outlined above requires caching. Is there a reason why the host kernel vDPA code needs to cache the configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference with vhost-user.
2) Stick to the current assumption that virtio_cread() should always succeed.

Thanks

Here are the vhost-user protocol messages:

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space
:size: a 32-bit configuration space access size in bytes
:flags: a 32-bit value:
  - 0: Vhost master messages used for writeable fields
  - 1: Vhost master messages used for live migration
:payload: Size bytes array holding the contents of the virtio device's configuration space

...

``VHOST_USER_GET_CONFIG``
  :id: 24
  :equivalent ioctl: N/A
  :master payload: virtio device config space
  :slave payload: virtio device config space

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is submitted by the vhost-user master to fetch the contents of the virtio device configuration space, vhost-user slave's payload size MUST match master's request, vhost-user slave uses zero length of payload to indicate an error to vhost-user master.
The vhost-user master may cache the contents to avoid repeated ``VHOST_USER_GET_CONFIG`` calls.

``VHOST_USER_SET_CONFIG``
  :id: 25
  :equivalent ioctl: N/A
  :master payload: virtio device config space
  :slave payload: N/A

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is submitted by the vhost-user master when the Guest changes the virtio device configuration space and also can be used for live migration on the destination host. The vhost-user slave must check the flags field, and slaves MUST NOT accept SET_CONFIG for read-only configuration space fields unless the live migration bit is set.

Stefan
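The message layout quoted above (offset/size/flags header followed by the config payload) can be expressed as a C struct; the struct name and the 256-byte payload cap are our own illustration, only the three 32-bit header fields come from the quoted spec text:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative layout of the VHOST_USER_GET_CONFIG / SET_CONFIG payload
 * described in docs/interop/vhost-user.rst. The struct name and the
 * fixed 256-byte payload bound are assumptions for this sketch; the
 * real message carries exactly 'size' payload bytes. */
struct vhost_user_config_msg {
	uint32_t offset;      /* offset into the virtio device's config space */
	uint32_t size;        /* access size in bytes */
	uint32_t flags;       /* 0 = writable-field access, 1 = live migration */
	uint8_t  payload[256]; /* 'size' bytes of config space contents */
} __attribute__((packed));
```

With three consecutive 32-bit fields, the header occupies 12 bytes before the payload, which is what a wire-format reader has to skip.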
Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device configuration space is generally used for rarely-changing or initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements. Here the language (especially the word "generally") is weaker and means there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG approach is reads that have side-effects. For example, imagine a field containing an error code if the device encounters a problem unrelated to a specific virtqueue request. Reading from this field resets the error code to 0, saving the driver an extra configuration space write access and possibly race conditions. It isn't possible to implement those semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is not suitable for this kind of error propagating. And it would be very hard to implement such kind of semantic in some transports. Virtqueue should be much better. As Yong Ji quoted, the config space is used for "rarely-changing or initialization-time parameters".

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to handle the message failure, I'm going to add a return value to virtio_config_ops.get() and virtio_cread_* API so that the error can be propagated to the virtio device driver. Then the virtio-blk device driver can be modified to handle that. Jason and Stefan, what do you think of this way?

I'd like to stick to the current assumption that get_config won't fail. That is to say,

1) maintain a config in the kernel, make sure the config space read can always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

Thanks
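The three-point model above can be sketched as a small userspace demo (all names here are illustrative, not the actual VDUSE code): the kernel keeps a cached copy of the config space so get_config() never has to wait on userspace, while an ioctl-style setter lets the daemon push updates:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical model of the proposed scheme: a cached config in the
 * "kernel" side, updated by userspace via an ioctl-like call. */
struct fake_vduse_dev {
	uint8_t config[256];
	uint32_t config_size;
};

/* userspace -> kernel: update the cached config (ioctl handler sketch) */
static int vduse_set_config_cache(struct fake_vduse_dev *dev,
				  const void *buf, uint32_t offset, uint32_t len)
{
	if (offset > dev->config_size || len > dev->config_size - offset)
		return -1; /* would be -EINVAL in the kernel */
	memcpy(dev->config + offset, buf, len);
	return 0;
}

/* virtio core -> device: config read served from the cache, so it can
 * always succeed without blocking on a (possibly unresponsive) daemon */
static void vduse_get_config(struct fake_vduse_dev *dev,
			     uint32_t offset, void *buf, uint32_t len)
{
	if (offset <= dev->config_size && len <= dev->config_size - offset)
		memcpy(buf, dev->config + offset, len);
}
```

The point of the design is visible in the code: the read path touches only kernel-owned memory, so a buggy or malicious daemon can delay config *updates* but can never stall a driver's virtio_cread().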
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/1 6:26 PM, Yongji Xie wrote:
On Thu, Jul 1, 2021 at 3:55 PM Jason Wang wrote:
On 2021/7/1 2:50 PM, Yongji Xie wrote:
On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi wrote:
On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi wrote:
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:

+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 features; /* device features */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have two virtqueues, where the first one allows up to 1024 descriptors and the second one allows only 128 descriptors, for example.

Good point! But it looks like virtio-vdpa/virtio-pci doesn't support that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size: u16 (*get_vq_num_max)(struct vdpa_device *vdev); virtio-pci supports per-virtqueue sizes because the struct virtio_pci_common_cfg->queue_size register is per-queue (controlled by queue_select).

Oh, yes. I missed queue_select.

I guess this is a question for Jason: will vdpa keep this limitation? If yes, then VDUSE can stick to it too without running into problems in the future.

I think it's better to extend the get_vq_num_max() per virtqueue. Currently, vDPA assumes the parent to have a global max size. This seems to work on most of the parents but not vp-vDPA (which could be backed by QEMU, in that case cvq's size is smaller). Fortunately, we haven't enabled cvq support in the userspace now. I can post the fixes.

OK. If so, it looks like we need to support the per-vq configuration. I wonder if it's better to use something like: VDUSE_CREATE_DEVICE -> VDUSE_SETUP_VQ -> VDUSE_SETUP_VQ -> ...
-> VDUSE_ENABLE_DEVICE to do initialization rather than only use VDUSE_CREATE_DEVICE.

This should be fine.

Thanks

Thanks,
Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/7/1 2:50 PM, Yongji Xie wrote:
On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi wrote:
On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi wrote:
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:

+/* ioctls */
+
+struct vduse_dev_config {
+	char name[VDUSE_NAME_MAX]; /* vduse device name */
+	__u32 vendor_id; /* virtio vendor id */
+	__u32 device_id; /* virtio device id */
+	__u64 features; /* device features */
+	__u64 bounce_size; /* bounce buffer size for iommu */
+	__u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have two virtqueues, where the first one allows up to 1024 descriptors and the second one allows only 128 descriptors, for example.

Good point! But it looks like virtio-vdpa/virtio-pci doesn't support that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size: u16 (*get_vq_num_max)(struct vdpa_device *vdev); virtio-pci supports per-virtqueue sizes because the struct virtio_pci_common_cfg->queue_size register is per-queue (controlled by queue_select).

Oh, yes. I missed queue_select.

I guess this is a question for Jason: will vdpa keep this limitation? If yes, then VDUSE can stick to it too without running into problems in the future.

I think it's better to extend the get_vq_num_max() per virtqueue. Currently, vDPA assumes the parent to have a global max size. This seems to work on most of the parents but not vp-vDPA (which could be backed by QEMU, in that case cvq's size is smaller). Fortunately, we haven't enabled cvq support in the userspace now. I can post the fixes.

+	__u16 padding; /* padding */
+	__u32 vq_num; /* the number of virtqueues */
+	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?

This will be used by vring_create_virtqueue() too.
If there is no official definition for the meaning of this value then "/* same as vring_create_virtqueue()'s vring_align parameter */" would be clearer. That way the reader knows what to research in order to understand how this field works.

OK.

I don't remember but maybe it was used to support vrings when the host/guest have non-4KB page sizes.

I wonder if anyone has an official definition for this value?

Not sure. Maybe we might need some alignment which is less than PAGE_SIZE sometimes. So I see CCW always uses 4096, but I'm not sure whether or not it's smaller than PAGE_SIZE.

Thanks

Thanks,
Yongji
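The per-virtqueue extension of get_vq_num_max() discussed above can be sketched as a userspace model (everything here is hypothetical illustration, not the merged vDPA API): the op gains a queue index so a parent like vp-vDPA can report a smaller control virtqueue than its data queues:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a vDPA parent with per-virtqueue maximums.
 * Today's vdpa_config_ops has u16 (*get_vq_num_max)(struct vdpa_device *);
 * the discussed extension adds a queue index parameter. */
struct fake_vdpa_dev {
	uint16_t data_vq_num_max; /* max descriptors for data queues */
	uint16_t cvq_num_max;     /* smaller max for the control vq */
	uint16_t cvq_index;       /* which queue is the cvq */
};

/* proposed per-vq form: the index decides which maximum applies */
static uint16_t get_vq_num_max(struct fake_vdpa_dev *d, uint16_t idx)
{
	return idx == d->cvq_index ? d->cvq_num_max : d->data_vq_num_max;
}
```

This mirrors what virtio-pci already allows through the per-queue virtio_pci_common_cfg->queue_size register selected by queue_select.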
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/29 2:40 PM, Yongji Xie wrote:
On Tue, Jun 29, 2021 at 12:13 PM Jason Wang wrote:
On 2021/6/28 6:32 PM, Yongji Xie wrote:

The large barrier is bounce-buffer mapping: SPDK requires hugepages for NVMe over PCIe and RDMA, so take some preallocated hugepages to map as bounce buffer is necessary. Or it's hard to avoid an extra memcpy from bounce-buffer to hugepage. If you can add an option to map hugepages as bounce-buffer, then SPDK could also be a potential user of vduse.

I think we can support registering user space memory for bounce-buffer use like XDP does. But this needs to pin the pages, so I didn't consider it in this initial version.

Note that userspace should be unaware of the existence of the bounce buffer.

If so, it might be hard to use umem. Because we can't use umem for coherent mapping which needs physical address contiguous space.

Thanks,
Yongji

We probably can use umem for memory other than the virtqueue (still via mmap()).

Thanks
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/28 6:32 PM, Yongji Xie wrote:

The large barrier is bounce-buffer mapping: SPDK requires hugepages for NVMe over PCIe and RDMA, so take some preallocated hugepages to map as bounce buffer is necessary. Or it's hard to avoid an extra memcpy from bounce-buffer to hugepage. If you can add an option to map hugepages as bounce-buffer, then SPDK could also be a potential user of vduse.

I think we can support registering user space memory for bounce-buffer use like XDP does. But this needs to pin the pages, so I didn't consider it in this initial version.

Note that userspace should be unaware of the existence of the bounce buffer. So we need to think carefully of mmap() vs umem registering.

Thanks
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/28 1:54 PM, Liu, Xiaodong wrote:

Several issues:

- VDUSE needs to limit the total size of the bounce buffers (64M if I was not wrong). Does it work for SPDK?

Yes, Jason. It is enough and works for SPDK. Since it's a kind of bounce buffer mainly for in-flight IO, so limited size like 64MB is enough.

Ok.

- VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we need introduce new flags for supporting this)

Same with your worry, I'm afraid too that it is hard for a kernel module to directly preallocate hugepages internally. What I tried is that:

1. A simple agent daemon (represents for one device) `preallocates` and maps dozens of 2MB hugepages (like 64MB) for one device.
2. The daemon passes its mapping addr&len and hugepage fd to kernel module through created IOCTL.
3. Kernel module remaps the hugepages inside kernel.

Such a model should work, but the main "issue" is that it introduces overhead in the case of vhost-vDPA. Note that in the case of vhost-vDPA, we don't use bounce buffer, the userspace pages were shared directly. And since DMA is not done per page, it prevents us from using tricks like vm_insert_page() in those cases.

4. Vhost user target gets and maps hugepage fd from kernel module in vhost-user msg through Unix Domain Socket cmsg. Then kernel module and target map on the same hugepage based bounce buffer for in-flight IO.

If there is one option in VDUSE to map userspace preallocated memory, then VDUSE should be able to mandate it even if it is hugepage based.

As above, this requires some kind of re-design since VDUSE depends on the model of mmap(MAP_SHARED) instead of umem registering.

Thanks
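The fd-sharing idea in steps 1-4 rests on a standard Linux pattern: one process creates a shared memory fd, and any party (kernel module via ioctl, vhost-user target via SCM_RIGHTS) that maps the same fd sees the same pages. A minimal runnable sketch (names are ours; a real SPDK-style daemon would pass MFD_HUGETLB for actual hugepages, plain memfd is used here so the sketch runs without hugepage setup):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

/* Sketch of the daemon's bounce-buffer preallocation: create a shared
 * memory fd sized for the bounce pool. The returned fd is what would be
 * handed to the kernel module (ioctl) or to a vhost-user peer (cmsg).
 * For real hugepages, add MFD_HUGETLB to the memfd_create() flags. */
static int make_bounce_fd(size_t size)
{
	int fd = memfd_create("vduse-bounce", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Two independent MAP_SHARED mappings of that fd alias the same physical pages, which is exactly why the kernel module and the vhost-user target in step 4 can service the same in-flight IO without an extra copy.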
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/6/29 11:56 AM, Yongji Xie wrote:
On Tue, Jun 29, 2021 at 11:29 AM Jason Wang wrote:
On 2021/6/29 10:26 AM, Yongji Xie wrote:
On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote:
On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace, reply is not needed. Userspace will use a command to update the status when the datapath is stopped. The status could be fetched via get_status().

2b looks more spec compliant.

Looks good to me. And I think we can use the reply of the message to update the status instead of introducing a new command.

Just notice this part in virtio_finalize_features():

	virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
	status = dev->config->get_status(dev);
	if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no reply doesn't work for FEATURES_OK. So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to update the status.

So it looks to me we need to synchronize for both get_status() and set_status().

We should not send messages to userspace in the FEATURES_OK case. So the synchronization is not necessary.

As discussed previously, there could be a device that mandates some features (VIRTIO_F_RING_PACKED). So it can choose to not accept FEATURES_OK if packed virtqueue is not negotiated. In this case we need to relay the message to userspace.

OK, I see. If so, I prefer to only use noreply for set_status(). We do not set the status bit if the message is failed. In this way, we don't need to change lots of virtio core codes to handle the failure of set_status()/get_status().

It should work.

Thanks

Thanks,
Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/6/29 10:26 AM, Yongji Xie wrote:
On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote:
On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace, reply is not needed. Userspace will use a command to update the status when the datapath is stopped. The status could be fetched via get_status().

2b looks more spec compliant.

Looks good to me. And I think we can use the reply of the message to update the status instead of introducing a new command.

Just notice this part in virtio_finalize_features():

	virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
	status = dev->config->get_status(dev);
	if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no reply doesn't work for FEATURES_OK. So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to update the status.

So it looks to me we need to synchronize for both get_status() and set_status().

We should not send messages to userspace in the FEATURES_OK case. So the synchronization is not necessary.

As discussed previously, there could be a device that mandates some features (VIRTIO_F_RING_PACKED). So it can choose to not accept FEATURES_OK if packed virtqueue is not negotiated. In this case we need to relay the message to userspace.

Thanks

Thanks,
Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace, reply is not needed. Userspace will use a command to update the status when the datapath is stopped. The status could be fetched via get_status().

2b looks more spec compliant.

Looks good to me. And I think we can use the reply of the message to update the status instead of introducing a new command.

Just notice this part in virtio_finalize_features():

	virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
	status = dev->config->get_status(dev);
	if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no reply doesn't work for FEATURES_OK. So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to update the status.

So it looks to me we need to synchronize for both get_status() and set_status().

Thanks
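The "don't set the bit if the message fails" resolution reached in this thread can be modeled in a few lines (a hypothetical userspace sketch, not the VDUSE implementation; `userspace_acked` stands in for the timed-out reply from the daemon). Because the cached status only gains FEATURES_OK when userspace actually acknowledged the set_status message, the driver's re-read in virtio_finalize_features() detects the failure with no changes to virtio core:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

#define VIRTIO_CONFIG_S_FEATURES_OK 8

/* Hypothetical model: set_status() relays to userspace; on timeout or
 * rejection the cached status is simply left unchanged. */
struct fake_dev {
	uint8_t status;
	bool userspace_acked; /* did the daemon accept the message in time? */
};

static void set_status(struct fake_dev *d, uint8_t status)
{
	if (d->userspace_acked)
		d->status = status;
	/* on failure: no update, so get_status() still lacks the new bit */
}

static uint8_t get_status(struct fake_dev *d)
{
	return d->status;
}
```

This matches the quoted virtio_finalize_features() flow: write FEATURES_OK, read the status back, and treat a missing FEATURES_OK bit as negotiation failure.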
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/28 6:33 PM, Liu Xiaodong wrote:
On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:

This series introduces a framework that makes it possible to implement software-emulated vDPA devices in userspace. And to make it simple, the emulated vDPA device's control path is handled in the kernel and only the data path is implemented in the userspace.

Since the emulated vDPA device's control path is handled in the kernel, a message mechanism is introduced to make userspace be aware of the data path related changes. Userspace can use read()/write() to receive/reply the control messages.

In the data path, the core is mapping dma buffer into VDUSE daemon's address space, which can be implemented in different ways depending on the vdpa bus to which the vDPA device is attached. In virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma buffer resides in a userspace memory region which can be shared to the VDUSE userspace process via transferring the shmfd.

The details and our use case are shown below:

[ASCII architecture diagram, garbled in this archive: a container's /dev/vdx block device (virtio-blk driver on the virtio bus via the virtio-vdpa driver) and a QEMU VM's /dev/vhost-vdpa-x (vhost-vdpa driver) both attach over the vdpa bus to the vduse driver's vdpa device, whose block-device emulation runs in the userspace VDUSE daemon and connects through a NIC over TCP/IP to remote storages.]

We make use of it to implement a block device connecting to our distributed storage, which can be used both in containers and VMs. Thus, we can have a unified technology stack in these two cases.
To test it with null-blk: $ qemu-storage-daemon \ --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \ --monitor chardev=charmonitor \ --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/24 下午5:16, Yongji Xie 写道: On Thu, Jun 24, 2021 at 4:14 PM Jason Wang wrote: 在 2021/6/24 下午12:46, Yongji Xie 写道: So we need to deal with both FEATURES_OK and reset, but probably not DRIVER_OK. OK, I see. Thanks for the explanation. One more question is how about clearing the corresponding status bit in get_status() rather than making set_status() fail, since the spec recommends this way for validation, which is done in virtio_dev_remove() and virtio_finalize_features(). Thanks, Yongji I think you can. Or it would be even better if we just don't set the bit during set_status(). Yes, that's what I mean. I just realized that in vdpa_reset() we had: static inline void vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = false; ops->set_status(vdev, 0); } We probably need to add the synchronization here, e.g. re-read with a timeout. Looks like the timeout is already in set_status(). Do you mean VDUSE's implementation? Do we really need a duplicated one here? 1) this is the timeout at the vDPA layer instead of the VDUSE layer. 2) it really depends on what's the meaning of the timeout for set_status() of VDUSE. Do we want: 2a) for set_status(): relay the message to userspace and wait for the userspace to quiesce the datapath, or 2b) for set_status(): simply relay the message to userspace, no reply is needed. Userspace will use a command to update the status when the datapath is stopped. Then the status could be fetched via get_status(). 2b looks more spec compliant. And how to handle failure? Adding a return value to virtio_config_ops->reset() and passing the error to the upper layer? Something like this. Thanks Thanks, Yongji ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/24 下午12:46, Yongji Xie 写道: So we need to deal with both FEATURES_OK and reset, but probably not DRIVER_OK. OK, I see. Thanks for the explanation. One more question is how about clearing the corresponding status bit in get_status() rather than making set_status() fail, since the spec recommends this way for validation, which is done in virtio_dev_remove() and virtio_finalize_features(). Thanks, Yongji I think you can. Or it would be even better if we just don't set the bit during set_status(). I just realized that in vdpa_reset() we had: static inline void vdpa_reset(struct vdpa_device *vdev) { const struct vdpa_config_ops *ops = vdev->config; vdev->features_valid = false; ops->set_status(vdev, 0); } We probably need to add the synchronization here, e.g. re-read with a timeout. Thanks
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/23 下午1:50, Yongji Xie 写道: On Wed, Jun 23, 2021 at 11:31 AM Jason Wang wrote: 在 2021/6/22 下午4:14, Yongji Xie 写道: On Tue, Jun 22, 2021 at 3:50 PM Jason Wang wrote: 在 2021/6/22 下午3:22, Yongji Xie 写道: We need to find a way to propagate the error to the userspace. E.g. if we want to stop the device, we will delay the status reset until we get a response from the userspace? I didn't get how to delay the status reset. And should it be a DoS that we want to fix if the userspace doesn't give a response forever? You're right. So let's make set_status() able to fail first, then propagate its failure via VHOST_VDPA_SET_STATUS. OK. So we only need to propagate the failure in the vhost-vdpa case, right? I think not, we need to deal with the reset for virtio as well: E.g. in register_virtio_device(), we have: /* We always start by resetting the device, in case a previous * driver messed it up. This also tests that code path a little. */ dev->config->reset(dev); We probably need to make reset able to fail and then fail register_virtio_device() as well. OK, looks like virtio_add_status() and virtio_device_ready()[1] should also be modified if we need to propagate the failure in the virtio-vdpa case. Or do we only need to care about the reset case? [1] https://lore.kernel.org/lkml/20210517093428.670-1-xieyon...@bytedance.com/ My understanding is DRIVER_OK is not something that needs to be validated: " DRIVER_OK (4) Indicates that the driver is set up and ready to drive the device. " Since the spec doesn't require re-reading and checking whether DRIVER_OK is set in 3.1.1 Driver Requirements: Device Initialization. It's more about "telling the device that the driver is ready." But we do have some status bits that require synchronization with the device. 
1) FEATURES_OK: the spec requires re-reading the status bit to check whether or not it was set by the device: " Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable. " This is useful for some devices which can only support a subset of the features, e.g. a device that can only work with packed virtqueues. This means the current design of set_features() won't work; we need either: 1a) relay the set_features request to userspace, or 1b) introduce a mandated_device_features during device creation, validate the driver features during set_features(), and don't set FEATURES_OK if they don't match. 2) Some transports (PCI) require re-reading the status to ensure the synchronization: " After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device. " So we need to deal with both FEATURES_OK and reset, but probably not DRIVER_OK. Thanks Thanks, Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/22 下午4:14, Yongji Xie 写道: On Tue, Jun 22, 2021 at 3:50 PM Jason Wang wrote: 在 2021/6/22 下午3:22, Yongji Xie 写道: We need to find a way to propagate the error to the userspace. E.g. if we want to stop the device, we will delay the status reset until we get a response from the userspace? I didn't get how to delay the status reset. And should it be a DoS that we want to fix if the userspace doesn't give a response forever? You're right. So let's make set_status() able to fail first, then propagate its failure via VHOST_VDPA_SET_STATUS. OK. So we only need to propagate the failure in the vhost-vdpa case, right? I think not, we need to deal with the reset for virtio as well: E.g. in register_virtio_device(), we have: /* We always start by resetting the device, in case a previous * driver messed it up. This also tests that code path a little. */ dev->config->reset(dev); We probably need to make reset able to fail and then fail register_virtio_device() as well. Thanks Thanks, Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/22 下午3:22, Yongji Xie 写道: We need fix a way to propagate the error to the userspace. E.g if we want to stop the deivce, we will delay the status reset until we get respose from the userspace? I didn't get how to delay the status reset. And should it be a DoS that we want to fix if the userspace doesn't give a response forever? You're right. So let's make set_status() can fail first, then propagate its failure via VHOST_VDPA_SET_STATUS. + } +} + +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + + return dev->config_size; +} + +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset, + void *buf, unsigned int len) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + + memcpy(buf, dev->config + offset, len); +} + +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset, + const void *buf, unsigned int len) +{ + /* Now we only support read-only configuration space */ +} + +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + + return dev->generation; +} + +static int vduse_vdpa_set_map(struct vdpa_device *vdpa, + struct vhost_iotlb *iotlb) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + int ret; + + ret = vduse_domain_set_map(dev->domain, iotlb); + if (ret) + return ret; + + ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX); + if (ret) { + vduse_domain_clear_map(dev->domain, iotlb); + return ret; + } + + return 0; +} + +static void vduse_vdpa_free(struct vdpa_device *vdpa) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + + dev->vdev = NULL; +} + +static const struct vdpa_config_ops vduse_vdpa_config_ops = { + .set_vq_address = vduse_vdpa_set_vq_address, + .kick_vq= vduse_vdpa_kick_vq, + .set_vq_cb = vduse_vdpa_set_vq_cb, + .set_vq_num = vduse_vdpa_set_vq_num, + .set_vq_ready = vduse_vdpa_set_vq_ready, + .get_vq_ready = vduse_vdpa_get_vq_ready, + .set_vq_state = 
vduse_vdpa_set_vq_state, + .get_vq_state = vduse_vdpa_get_vq_state, + .get_vq_align = vduse_vdpa_get_vq_align, + .get_features = vduse_vdpa_get_features, + .set_features = vduse_vdpa_set_features, + .set_config_cb = vduse_vdpa_set_config_cb, + .get_vq_num_max = vduse_vdpa_get_vq_num_max, + .get_device_id = vduse_vdpa_get_device_id, + .get_vendor_id = vduse_vdpa_get_vendor_id, + .get_status = vduse_vdpa_get_status, + .set_status = vduse_vdpa_set_status, + .get_config_size= vduse_vdpa_get_config_size, + .get_config = vduse_vdpa_get_config, + .set_config = vduse_vdpa_set_config, + .get_generation = vduse_vdpa_get_generation, + .set_map= vduse_vdpa_set_map, + .free = vduse_vdpa_free, +}; + +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + unsigned long attrs) +{ + struct vduse_dev *vdev = dev_to_vduse(dev); + struct vduse_iova_domain *domain = vdev->domain; + + return vduse_domain_map_page(domain, page, offset, size, dir, attrs); +} + +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ + struct vduse_dev *vdev = dev_to_vduse(dev); + struct vduse_iova_domain *domain = vdev->domain; + + return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs); +} + +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_addr, gfp_t flag, + unsigned long attrs) +{ + struct vduse_dev *vdev = dev_to_vduse(dev); + struct vduse_iova_domain *domain = vdev->domain; + unsigned long iova; + void *addr; + + *dma_addr = DMA_MAPPING_ERROR; + addr = vduse_domain_alloc_coherent(domain, size, + (dma_addr_t *)&iova, flag, attrs); + if (!addr) + return NULL; + + *dma_addr = (dma_addr_t)iova; + + return addr; +} + +static void vduse_dev_free_coherent(struct device *dev, size_t size, + void *vaddr, dma_addr_t dma_addr, + unsigned long attrs) +{ + struct vduse_dev *vdev = 
dev_to_vduse(dev); + struct vduse_iova_domain *domain = vdev->dom
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/21 下午6:41, Yongji Xie 写道: On Mon, Jun 21, 2021 at 5:14 PM Jason Wang wrote: 在 2021/6/15 下午10:13, Xie Yongji 写道: This VDUSE driver enables implementing vDPA devices in userspace. The vDPA device's control path is handled in kernel and the data path is handled in userspace. A message mechnism is used by VDUSE driver to forward some control messages such as starting/stopping datapath to userspace. Userspace can use read()/write() to receive/reply those control messages. And some ioctls are introduced to help userspace to implement the data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file descriptors referring to vDPA device's iova regions. Then userspace can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD ioctls can be used to inject interrupt and setup the kickfd for virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the configuration space and inject a config interrupt. Signed-off-by: Xie Yongji --- Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig | 10 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/vduse_dev.c | 1453 include/uapi/linux/vduse.h | 143 ++ 6 files changed, 1613 insertions(+) create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 9bfc2b510c64..acd95e9dcfe7 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -300,6 +300,7 @@ Code Seq#Include File Comments 'z' 10-4F drivers/s390/crypto/zcrypt_api.h conflict! 
'|' 00-7F linux/media.h 0x80 00-1F linux/fb.h +0x81 00-1F linux/vduse.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index a503c1b2bfd9..6e23bce6433a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK vDPA block device simulator which terminates IO request in a memory buffer. +config VDPA_USER + tristate "VDUSE (vDPA Device in Userspace) support" + depends on EVENTFD && MMU && HAS_DMA + select DMA_OPS + select VHOST_IOTLB + select IOMMU_IOVA + help + With VDUSE it is possible to emulate a vDPA Device + in a userspace program. + config IFCVF tristate "Intel IFC VF vDPA driver" depends on PCI_MSI diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile index 67fe7f3d6943..f02ebed33f19 100644 --- a/drivers/vdpa/Makefile +++ b/drivers/vdpa/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VDPA) += vdpa.o obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ +obj-$(CONFIG_VDPA_USER) += vdpa_user/ obj-$(CONFIG_IFCVF)+= ifcvf/ obj-$(CONFIG_MLX5_VDPA) += mlx5/ obj-$(CONFIG_VP_VDPA)+= virtio_pci/ diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile new file mode 100644 index ..260e0b26af99 --- /dev/null +++ b/drivers/vdpa/vdpa_user/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +vduse-y := vduse_dev.o iova_domain.o + +obj-$(CONFIG_VDPA_USER) += vduse.o diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c new file mode 100644 index ..5271cbd15e28 --- /dev/null +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -0,0 +1,1453 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VDUSE: vDPA Device in Userspace + * + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved. 
+ * + * Author: Xie Yongji + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iova_domain.h" + +#define DRV_AUTHOR "Yongji Xie " +#define DRV_DESC "vDPA Device in Userspace" +#define DRV_LICENSE "GPL v2" + +#define VDUSE_DEV_MAX (1U << MINORBITS) +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024) +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024) +#define VDUSE_REQUEST_TIMEOUT 30 + +struct vduse_virtqueue { + u16 index; + u32 num; + u32 avail_idx; + u6
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/6/15 下午10:13, Xie Yongji 写道: This VDUSE driver enables implementing vDPA devices in userspace. The vDPA device's control path is handled in kernel and the data path is handled in userspace. A message mechnism is used by VDUSE driver to forward some control messages such as starting/stopping datapath to userspace. Userspace can use read()/write() to receive/reply those control messages. And some ioctls are introduced to help userspace to implement the data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file descriptors referring to vDPA device's iova regions. Then userspace can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD ioctls can be used to inject interrupt and setup the kickfd for virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the configuration space and inject a config interrupt. Signed-off-by: Xie Yongji --- Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig | 10 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/vduse_dev.c | 1453 include/uapi/linux/vduse.h | 143 ++ 6 files changed, 1613 insertions(+) create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 9bfc2b510c64..acd95e9dcfe7 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -300,6 +300,7 @@ Code Seq#Include File Comments 'z' 10-4F drivers/s390/crypto/zcrypt_api.hconflict! 
'|' 00-7F linux/media.h 0x80 00-1F linux/fb.h +0x81 00-1F linux/vduse.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig index a503c1b2bfd9..6e23bce6433a 100644 --- a/drivers/vdpa/Kconfig +++ b/drivers/vdpa/Kconfig @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK vDPA block device simulator which terminates IO request in a memory buffer. +config VDPA_USER + tristate "VDUSE (vDPA Device in Userspace) support" + depends on EVENTFD && MMU && HAS_DMA + select DMA_OPS + select VHOST_IOTLB + select IOMMU_IOVA + help + With VDUSE it is possible to emulate a vDPA Device + in a userspace program. + config IFCVF tristate "Intel IFC VF vDPA driver" depends on PCI_MSI diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile index 67fe7f3d6943..f02ebed33f19 100644 --- a/drivers/vdpa/Makefile +++ b/drivers/vdpa/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VDPA) += vdpa.o obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ +obj-$(CONFIG_VDPA_USER) += vdpa_user/ obj-$(CONFIG_IFCVF)+= ifcvf/ obj-$(CONFIG_MLX5_VDPA) += mlx5/ obj-$(CONFIG_VP_VDPA)+= virtio_pci/ diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile new file mode 100644 index ..260e0b26af99 --- /dev/null +++ b/drivers/vdpa/vdpa_user/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +vduse-y := vduse_dev.o iova_domain.o + +obj-$(CONFIG_VDPA_USER) += vduse.o diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c new file mode 100644 index ..5271cbd15e28 --- /dev/null +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -0,0 +1,1453 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * VDUSE: vDPA Device in Userspace + * + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved. 
+ * + * Author: Xie Yongji + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iova_domain.h" + +#define DRV_AUTHOR "Yongji Xie " +#define DRV_DESC "vDPA Device in Userspace" +#define DRV_LICENSE "GPL v2" + +#define VDUSE_DEV_MAX (1U << MINORBITS) +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024) +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024) +#define VDUSE_REQUEST_TIMEOUT 30 + +struct vduse_virtqueue { + u16 index; + u32 num; + u32 avail_idx; + u64 desc_addr; + u64 driver_addr; + u64 device_addr; + bool ready; + bool kicked; + spinlock_t kick_lock; + spinlock_t irq_lock; + struct eventfd_ctx *kickfd; +
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/10 下午7:47, Jason Gunthorpe 写道: On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote: 在 2021/6/8 下午9:20, Jason Gunthorpe 写道: On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote: Well, this sounds like a re-invention of io_uring which has already worked for multifds. How so? io_uring is about sending work to the kernel, not getting structured events back? Actually it can. Userspace can poll multiple fds via preparing multiple sqes with the IORING_OP_ADD flag. Poll is only a part of what is needed here; the main issue is transferring the PRI events to userspace quickly. Do we really care about, e.g., at most one more syscall in this case? I think the time spent on demand paging is much more than transferring the #PF to userspace. What's more, well-designed vIOMMU-capable IOMMU hardware should have the ability to inject such an event directly into the guest if the #PF happens on L1. This means another ring, and we need to introduce ioctl()s to add or remove ioasids from the poll. And it still needs a kind of fallback like a list if the ring is full. The max size of the ring should be determinable based on the PRI concurrency of each device and the number of devices sharing the ring. This has at least one assumption: that the #PF event is the only event for the ring, and I'm not sure this is the case. Thanks In any event, I'm not entirely convinced eliding the PRI user/kernel copy is the main issue here.. If we want this to be low latency I think it ends up with some kernel driver component assisting the vIOMMU emulation and avoiding the round trip to userspace Jason
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/10 上午10:00, Jason Wang 写道: 在 2021/6/8 下午9:20, Jason Gunthorpe 写道: On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote: Well, this sounds like a re-invention of io_uring which has already worked for multifds. How so? io_uring is about sending work to the kernel, not getting structured events back? Actually it can. Userspace can poll multiple fds via preparing multiple sqes with the IORING_OP_ADD flag. IORING_OP_POLL_ADD actually. Thanks It is more like one of the perf rings This means another ring and we need to introduce ioctl()s to add or remove ioasids from the poll. And it still needs a kind of fallback like a list if the ring is full. Thanks Jason
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/8 下午6:45, Enrico Weigelt, metux IT consult 写道: On 07.06.21 20:01, Jason Gunthorpe wrote: it is what it is, select has a fixed size bitmap of FD #s and a hard upper bound on that size as part of the glibc ABI - can't be fixed. in glibc ABI ? Uuuuh! Note that dealing with select() or trying to overcome the limitation via epoll() directly in the application is not good practice (and it's not portable). It's suggested to use the building blocks provided by glib, e.g. the main event loop[1]. That is how Qemu solves the issue of dealing with a lot of file descriptors. Thanks [1] https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html --mtx
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/8 下午9:20, Jason Gunthorpe 写道: On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote: Well, this sounds like a re-invention of io_uring which has already worked for multifds. How so? io_uring is about sending work to the kernel, not getting structured events back? Actually it can. Userspace can poll multiple fds via preparing multiple sqes with the IORING_OP_ADD flag. It is more like one of the perf rings This means another ring and we need to introduce ioctl()s to add or remove ioasids from the poll. And it still needs a kind of fallback like a list if the ring is full. Thanks Jason
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/8 上午3:41, Alex Williamson 写道: On Mon, 7 Jun 2021 16:08:02 -0300 Jason Gunthorpe wrote: On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote: It is up to qemu if it wants to proceed or not. There is no issue with allowing the use of no-snoop and blocking wbinvd, other than some drivers may malfunction. If the user is certain they don't have malfunctioning drivers then no issue to go ahead. A driver that knows how to use the device in a coherent way can certainly proceed, but I suspect that's not something we can ask of QEMU. QEMU has no visibility into the in-use driver and sketchy ability to virtualize the no-snoop enable bit to prevent non-coherent DMA from the device. There might be an experimental ("x-" prefixed) QEMU device option to allow user override, but QEMU should disallow the possibility of malfunctioning drivers by default. If we have devices that probe as supporting no-snoop, but actually can't generate such traffic, we might need a quirk list somewhere. Compatibility is important, but when I look in the kernel code I see very few places that call wbinvd(). Basically all DRM for something relevant to qemu. That tells me that the vast majority of PCI devices do not generate no-snoop traffic. Unfortunately, even just looking at devices across a couple of laptops, most devices do support and have NoSnoop+ set by default. I don't notice anything in the kernel that actually tries to set this enable (a handful actively disable it), so I assume it's done by the firmware. I wonder whether or not it was done via ACPI: " 6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether or not a bus-master device supports hardware managed cache coherency. Expected values are 0 to indicate it is not supported, and 1 to indicate that it is supported. All other values are reserved. ... On Intel platforms, if the _CCA object is not supplied, the OSPM will assume the devices are hardware cache coherent. 
" Thanks It's not safe for QEMU to make an assumption that only GPUs will actually make use of it. I think it makes the software design much simpler if the security check is very simple. Possessing a suitable device in an ioasid fd container is enough to flip on the feature and we don't need to track changes from that point on. We don't need to revoke wbinvd if the ioasid fd changes, for instance. Better to keep the kernel very simple in this regard. You're suggesting that a user isn't forced to give up wbinvd emulation if they lose access to their device? Sure, why do we need to be stricter? It is the same logic I gave earlier, once an attacker process has access to wbinvd an attacker can just keep its access indefinitely. The main use case for revokation assumes that qemu would be compromised after a device is hot-unplugged and you want to block off wbinvd. But I have a hard time seeing that as useful enough to justify all the complicated code to do it... It's currently just a matter of the kvm-vfio device holding a reference to the group so that it cannot be used elsewhere so long as it's being used to elevate privileges on a given KVM instance. If we conclude that access to a device with the right capability is required to gain a privilege, I don't really see how we can wave aside that the privilege isn't lost with the device. For KVM qemu can turn on/off on hot plug events as it requires to give VM security. It doesn't need to rely on the kernel to control this. Yes, QEMU can reject a hot-unplug event, but then QEMU retains the privilege that the device grants it. Releasing the device and retaining the privileged gained by it seems wrong. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/3 上午1:21, Jason Gunthorpe 写道: On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote: 在 2021/6/2 上午1:31, Jason Gunthorpe 写道: On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote: We can open up to ~0U file descriptors, I don't see why we need to restrict it in the uAPI. There are significant problems with such large file descriptor tables. High FD numbers mean things like select don't work at all anymore, and IIRC there are more complications. I don't see how much difference there is between IOASID and other types of fds. People can choose to use poll or epoll. Not really: once one thing in an application uses a large number of FDs, the entire application is affected. If any open() can return a 'very big number' then nothing in the process is allowed to ever use select. It is not a trivial thing to ask for. And with the current proposal (assuming there's an N:1 ioasid-to-ioasid-fd mapping), I wonder how select can work for a specific ioasid. pagefault events are one thing that comes to mind. Bundling them all together into a single ring buffer is going to be necessary. Multifds just complicate this too Jason Well, this sounds like a re-invention of io_uring which has already worked for multifds. Thanks
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/7 下午10:14, Jason Gunthorpe 写道: On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote: Note that the fact that no drivers call these things doesn't mean it is not supported by the spec. Of course it does. If the spec doesn't define exactly when the driver should call the cache flushes for no-snoop transactions then the protocol doesn't support no-snoop. Just to make sure we are on the same page: what I meant is, if DMA behavior like no-snoop is device specific, there's no need to mandate a general virtio attribute. We can describe it per device. That the devices implemented in the current spec do not use non-coherent DMA doesn't mean any future devices won't do that. The driver could choose a transport- (e.g. PCI), platform- (ACPI) or device-specific (general virtio command) way to detect and flush the cache when necessary. no-snoop is only used in very specific sequences of operations, like certain GPU usages, because regaining coherence on x86 is incredibly expensive. I.e. I wouldn't ever expect a NIC to use no-snoop because NICs expect packets to be processed by the CPU. For a NIC, yes. But virtio is more than just a NIC. We've already supported GPU and crypto devices. In this case, no-snoop will be useful since the data is not necessarily expected to be processed by the CPU. And a lot of other types of devices are being proposed. Thanks "non-coherent DMA" is some general euphemism that evokes images of embedded platforms that don't have coherent DMA at all and have low cost ways to regain coherence. This is not at all what we are talking about here at all. Jason
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/4 下午7:58, Jason Gunthorpe 写道: On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote: nor do any virtio drivers implement the required platform specific cache flushing to make no-snoop TLPs work. I don't get why virtio drivers need to do that. I think the DMA API should hide those arch/platform specific stuffs from us. It is not arch/platform stuff. If the device uses no-snoop then a very platform specific recovery is required in the device driver. It is not part of the normal DMA API; it is side APIs like flush_agp_cache() or wbinvd() that are used by GPU drivers only. Yes, and virtio doesn't support AGP. If drivers/virtio doesn't explicitly call these things it doesn't support no-snoop - hence no VDPA device can ever use no-snoop. Note that the fact that no drivers call these things doesn't mean it is not supported by the spec. Actually, the spec doesn't forbid non-coherent DMA; anyway, we can raise a new thread on the virtio mailing list to discuss that. But considering virtio has already supported GPU, crypto and sound devices, and devices like codec and video are being proposed, it doesn't help if we mandate coherent DMA now. Thanks Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has nothing to do with no-snoop. Jason
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 9:55 PM, Andi Kleen wrote:
>> Ok, but what I meant is this: if we don't read from the descriptor
>> ring, and we validate all the other metadata supplied by the device
>> (used id and len), then there should be no way for the device to
>> suppress the DMA flags used to write to the indirect descriptor table.
>> Or do you have an example of how it can do that?
>
> I don't. If you can validate everything it's probably ok. The only
> drawback is even more code to audit and test.
>
> -Andi

Ok, then I'm going to post a formal series; please have a look and we can
start from there.

Thanks
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/4 1:33 AM, Andy Lutomirski wrote:
> On 6/2/21 5:41 PM, Andi Kleen wrote:
>> Only allow split mode when in a protected guest. Followon patches
>> harden the split mode code paths, and we don't want a malicious host
>> to force anything else. Also disallow indirect mode for similar
>> reasons.
>
> I read this as "the virtio driver is buggy. Let's disable most of the
> buggy code in one special case in which we need a driver without bugs.
> In all the other cases (e.g. hardware virtio device connected over
> USB-C), driver bugs are still allowed."
>
> Can we just fix the driver without special cases?

I think we can; this is what this series tries to do:

https://www.spinics.net/lists/kvm/msg241825.html

It tries to fix the issues without special-casing any specific feature.

Thanks

> --Andy
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/4 2:19 AM, Jacob Pan wrote:
> Hi Shenming,
>
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu wrote:
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>> The drivers register per page table fault handlers to /dev/ioasid
>>>> which will then register itself to the iommu core to listen and
>>>> route the per-device I/O page faults.
>>>
>>> I'm still confused why drivers need fault handlers at all?
>>
>> Essentially it is the userspace that needs the fault handlers. One
>> case is to deliver the faults to the vIOMMU, and another case is to
>> enable IOPF on the GPA address space for on-demand paging; it seems
>> that both could be specified in/through the IOASID_ALLOC ioctl?
>
> I would think IOASID_BIND_PGTABLE is where the fault handler should be
> registered. There wouldn't be any IO page fault without the binding
> anyway.
>
> I also don't understand why device drivers should register the fault
> handler; the fault is detected by the pIOMMU and injected to the
> vIOMMU. So I think it should be the IOASID itself that registers the
> handler.

As discussed in another thread, I think the reason is that ATS doesn't
forbid the #PF to be reported via a device-specific way.

Thanks

>> Thanks,
>> Shenming
>
> Thanks,
> Jacob
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/4 2:00 AM, Andi Kleen wrote:
> On 6/3/2021 10:33 AM, Andy Lutomirski wrote:
>> On 6/2/21 5:41 PM, Andi Kleen wrote:
>>> Only allow split mode when in a protected guest. Followon patches
>>> harden the split mode code paths, and we don't want a malicious host
>>> to force anything else. Also disallow indirect mode for similar
>>> reasons.
>>
>> I read this as "the virtio driver is buggy. Let's disable most of the
>> buggy code in one special case in which we need a driver without bugs.
>> In all the other cases (e.g. hardware virtio device connected over
>> USB-C), driver bugs are still allowed."
>
> My understanding is most of the other modes (except for split with
> separate descriptors) are obsolete and just there for compatibility. As
> long as they're deprecated they won't harm anyone.
>
> -Andi

By "modes" do you mean packed vs split? If yes, it's not just for
compatibility: though the packed virtqueue is designed to be more
hardware friendly, most hardware vendors chose to start from split.

Thanks
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/4 3:31 AM, Andy Lutomirski wrote:
> On Thu, Jun 3, 2021, at 11:00 AM, Andi Kleen wrote:
>> On 6/3/2021 10:33 AM, Andy Lutomirski wrote:
>>> On 6/2/21 5:41 PM, Andi Kleen wrote:
>>>> Only allow split mode when in a protected guest. Followon patches
>>>> harden the split mode code paths, and we don't want a malicious
>>>> host to force anything else. Also disallow indirect mode for
>>>> similar reasons.
>>>
>>> I read this as "the virtio driver is buggy. Let's disable most of
>>> the buggy code in one special case in which we need a driver without
>>> bugs. In all the other cases (e.g. hardware virtio device connected
>>> over USB-C), driver bugs are still allowed."
>>
>> My understanding is most of the other modes (except for split with
>> separate descriptors) are obsolete and just there for compatibility.
>> As long as they're deprecated they won't harm anyone.
>
> Tell that to every crypto downgrade attack ever.
>
> I see two credible solutions:
>
> 1. Actually harden the virtio driver.
>
> 2. Have a new virtio-modern driver and use it for modern use cases.
>    Maybe rename the old driver virtio-legacy or virtio-insecure. They
>    can share code.

Note that we have already split the legacy driver out; it can be turned
off via Kconfig.

> Another snag you may hit: virtio's heuristic for whether to use proper
> DMA ops or to bypass them is a giant kludge. I'm very slightly
> optimistic that getting the heuristic wrong will make the driver fail
> to operate but won't allow the host to take over the guest, but I'm not
> really convinced. And I wrote that code!
>
> A virtio-modern mode probably should not have a heuristic, and the
> various iommu-bypassing modes should be fixed to work at the bus level,
> not the device level.

I remember there was a very long discussion about this, probably without
any conclusion. Fortunately, the management layer has been taught to
enforce VIRTIO_F_ACCESS_PLATFORM for encrypted guests. A possible way to
fix this without any conflicts is to mandate VIRTIO_F_ACCESS_PLATFORM in
version 1.2.

Thanks
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/3 9:09 PM, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:
>> Basically, we don't want to bother with a pseudo KVM device like what
>> VFIO did. So for simplicity, we ruled out IOMMUs that can't enforce
>> coherency in vhost-vDPA if the parent purely depends on the platform
>> IOMMU:
>
> VDPA HW cannot issue no-snoop TLPs in the first place.

Note that virtio/vDPA is not necessarily a PCI device.

> virtio does not define a protocol to discover such a functionality,

Actually we have VIRTIO_F_ACCESS_PLATFORM (33):

  "This feature indicates that the device can be used on a platform
  where device access to data in memory is limited and/or translated.
  E.g. this is the case if the device can be located behind an IOMMU
  that translates bus addresses from the device into physical addresses
  in memory, if the device can be limited to only access certain memory
  addresses or if special commands such as a cache flush can be needed
  to synchronise data in memory with the device."

> nor do any virtio drivers implement the required platform specific
> cache flushing to make no-snoop TLPs work.

I don't get why virtio drivers need to do that. I think the DMA API
should hide those arch/platform specific details from us.

> It is fundamentally part of the virtio HW PCI API that a device vendor
> cannot alter.

The spec doesn't forbid this; it just leaves the detection and the
action to the driver in a platform-specific way.

Thanks

> Basically since we already know that the virtio kernel drivers do not
> call the cache flush instruction we don't need the weird KVM logic to
> turn it on at all. Enforcing no-snoop at the IOMMU here is
> redundant/confusing.
>
> Jason
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 10:56 AM, Andi Kleen wrote:
>> I agree, but I want to know why the indirect descriptor needs to be
>> disabled. The table can't be written by the device since it's not a
>> coherent swiotlb mapping.
>
> I had all kinds of problems with uninitialized entries in the indirect
> table. So I gave up on it and concluded it would be too difficult to
> secure.
>
> -Andi

Ok, but what I meant is this: if we don't read from the descriptor ring,
and we validate all the other metadata supplied by the device (used id
and len), then there should be no way for the device to suppress the DMA
flags used to write to the indirect descriptor table. Or do you have an
example of how it can do that?

Thanks
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/3 4:37 AM, Alex Williamson wrote:
> On Wed, 2 Jun 2021 16:54:04 -0300 Jason Gunthorpe wrote:
>> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
>>> Right, the device can generate the no-snoop transactions, but it's
>>> the IOMMU that essentially determines whether those transactions are
>>> actually still cache coherent, AIUI.
>>
>> Wow, this is really confusing stuff in the code.
>>
>> At the PCI level there is a TLP bit called no-snoop that is platform
>> specific. The general intention is to allow devices to selectively
>> bypass the CPU caching for DMAs. GPUs like to use this feature for
>> performance.
>>
>> Yes I assume there are some exciting security issues here. Looks like
>> allowing cache bypass does something bad inside VMs? Looks like
>> allowing the VM to use the cache clear instruction that is mandatory
>> with cache bypass DMA causes some QOS issues?
>
> OK. IIRC, largely a DoS issue if userspace gets to choose when to
> emulate wbinvd rather than it being demanded for correct operation.
>
>> So how does it work?
>>
>> What I see in intel/iommu.c is that some domains support "snoop
>> control" or not, based on some HW flag. This indicates if the
>> DMA_PTE_SNP bit is supported on a page by page basis or not.
>>
>> Since x86 always leans toward "DMA cache coherent" I'm reading some
>> tea leaves here:
>>
>>	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
>>					   transactions */
>>
>> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
>> snoop bit in TLPs for IOVAs that have DMA_PTE_SNP set?
>
> That's my understanding as well.
>
>> Further, I guess IOMMUs that don't support PTE_SNP, or have
>> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
>> and all)
>
> Yes.
>
>> So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
>> incoherent with the CPU caches, it just means that the snoop bit in
>> the TLP cannot be enforced. ie the device *could* do no-snoop DMA if
>> it wants. Devices that never do no-snoop remain DMA coherent on x86,
>> as they always have been.
>
> Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
> DMA to be coherent via the IOMMU.
>
>> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
>> the PCI device is blocked from using no-snoop in its TLPs.
>>
>> I wonder if ARM implemented this consistently? I see VDPA is
>> confused..

Basically, we don't want to bother with a pseudo KVM device like what
VFIO did. So for simplicity, we rule out IOMMUs that can't enforce
coherency in vhost-vDPA if the parent purely depends on the platform
IOMMU:

	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		return -ENOTSUPP;

For parents that use their own translation logic, an implicit assumption
is that the hardware will always perform cache coherent DMA.

Thanks
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
On 2021/6/3 10:18 AM, Andi Kleen wrote:
>> It looks to me all the evils come from the fact that we depend on the
>> descriptor ring. So the checks in this patch would be unnecessary if
>> we don't even read from the descriptor ring, which could be
>> manipulated by the device. This is what my series tries to achieve:
>> https://www.spinics.net/lists/kvm/msg241825.html
>
> I would argue that you should boundary check in any case. It was always
> a bug to not have boundary checks in such a data structure with
> multiple users, trust or not. But yes your patch series is interesting
> and definitely makes sense for TDX too. Best would be to have both I
> guess, and always check the boundaries everywhere.

I agree, but some of the checks are unnecessary if we do this series on
top of mine.

> So what's the merge status of your series?

If I understand Michael correctly, I will send a formal series and he
will try to merge it for 5.14.

Thanks

> -Andi
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 9:48 AM, Andi Kleen wrote:
>> So we will see a huge performance regression without indirect
>> descriptors. We need to consider addressing this.
>
> A regression would be when some existing case would be slower. That's
> not the case because the behavior for the existing cases does not
> change.
>
> Anyways when there are performance problems they can be addressed, but
> first is to make it secure.

I agree, but I want to know why the indirect descriptor needs to be
disabled. The table can't be written by the device since it's not a
coherent swiotlb mapping.

Thanks

> -Andi
>
>>> Thanks
>>>  			break;
>>>  		case VIRTIO_RING_F_EVENT_IDX:
>>>  			break;
>>> @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
>>>  		case VIRTIO_F_ACCESS_PLATFORM:
>>>  			break;
>>>  		case VIRTIO_F_RING_PACKED:
>>> +			if (protected_guest_has(VM_MEM_ENCRYPT))
>>> +				goto clear;
>>>  			break;
>>>  		case VIRTIO_F_ORDER_PLATFORM:
>>>  			break;
>>> +	clear:
>>>  		default:
>>>  			/* We don't understand this bit. */
>>>  			__virtio_clear_bit(vdev, i);
Re: [PATCH v1 3/8] virtio: Harden split buffer detachment
On 2021/6/3 8:41 AM, Andi Kleen wrote:
> Harden the split buffer detachment path by adding boundary checking.
> Note that when this fails we may fail to unmap some swiotlb mapping,
> which could result in a leak and a DOS. But that's acceptable because a
> malicious host can DOS us anyways.
>
> Signed-off-by: Andi Kleen
> ---
>  drivers/virtio/virtio_ring.c | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index d37ff5a0ff58..1e9aa1e95e1b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
>  	return needs_kick;
>  }
>
> -static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> -			     void **ctx)
> +static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> +			    void **ctx)
>  {
>  	unsigned int i, j;
>  	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>
> +	/* We'll leak DMA mappings when this happens, but nothing
> +	 * can be done about that. In the worst case the host
> +	 * could DOS us, but it can of course do that anyways.
> +	 */
> +	if (!inside_split_ring(vq, head))
> +		return -EIO;

I think the caller has already done this for us, with even more checks
on the token (virtqueue_get_buf_ctx_split()):

	if (unlikely(i >= vq->split.vring.num)) {
		BAD_RING(vq, "id %u out of range\n", i);
		return NULL;
	}

	if (unlikely(!vq->split.desc_state[i].data)) {
		BAD_RING(vq, "id %u is not a head!\n", i);
		return NULL;
	}

> +
>  	/* Clear data ptr. */
>  	vq->split.desc_state[head].data = NULL;
>
> @@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  	while (vq->split.vring.desc[i].flags & nextflag) {
>  		vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
>  		i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
> +		if (!inside_split_ring(vq, i))
> +			return -EIO;

Similarly, if we don't depend on the metadata stored in the descriptor,
we don't need this check.

>  		vq->vq.num_free++;
>  	}
>
> @@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  	/* Free the indirect table, if any, now that it's unmapped. */
>  	if (!indir_desc)
> -		return;
> +		return 0;
>
>  	len = virtio32_to_cpu(vq->vq.vdev, vq->split.vring.desc[head].len);
> @@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  	} else if (ctx) {
>  		*ctx = vq->split.desc_state[head].indir_desc;
>  	}
> +	return 0;
>  }
>
>  static inline bool more_used_split(const struct vring_virtqueue *vq)
> @@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>  	void *ret;
>  	unsigned int i;
>  	u16 last_used;
> +	int err;
>
>  	START_USE(vq);
>
> @@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>  	/* detach_buf_split clears data, so grab it now. */
>  	ret = vq->split.desc_state[i].data;
> -	detach_buf_split(vq, i, ctx);
> +	err = detach_buf_split(vq, i, ctx);
> +	if (err) {
> +		END_USE(vq);

This reminds me that we don't use END_USE() after BAD_RING(), which
should be fixed.

Thanks

> +		return NULL;
> +	}
> +
>  	vq->last_used_idx++;
>  	/* If we expect an interrupt for the next entry, tell host
>  	 * by writing event index and flush out the write before
> @@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
>  		/* detach_buf_split clears data, so grab it now. */
>  		buf = vq->split.desc_state[i].data;
>  		detach_buf_split(vq, i, NULL);
> +		/* Don't need to check for error because nothing is returned */
>  		vq->split.avail_idx_shadow--;
>  		vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>  				vq->split.avail_idx_shadow);
Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring
On 2021/6/3 8:41 AM, Andi Kleen wrote:
> In protected guest mode we don't trust the host. This means we need to
> make sure the host cannot subvert us through virtio communication. In
> general it can corrupt our virtio data and cause a DOS, but it should
> not be able to access any data that is not explicitly under IO.
>
> Also boundary checking so that the free list (which is accessible to
> the host) cannot point outside the virtio ring. Note it could still
> contain loops or similar, but these should only cause a DOS, not a
> memory corruption or leak.
>
> When we detect any out of bounds descriptor trigger an IO error. We
> also use a WARN() (in case it was a software bug instead of an attack).
> This implies that a malicious host can flood the guest kernel log, but
> that's only a DOS and acceptable in the threat model.
>
> This patch only hardens the initial consumption of the free list, the
> freeing comes later. Any of these errors can cause DMA memory leaks,
> but there is nothing we can do about that and that would be just a DOS.
>
> Signed-off-by: Andi Kleen
> ---
>  drivers/virtio/virtio_ring.c | 46
>  1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index f35629fa47b1..d37ff5a0ff58 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
>  	return desc;
>  }
>
> +/* assumes no indirect mode */
> +static inline bool inside_split_ring(struct vring_virtqueue *vq,
> +				     unsigned index)
> +{
> +	return !WARN(index >= vq->split.vring.num,
> +		     "desc index %u out of bounds (%u)\n",
> +		     index, vq->split.vring.num);

It's better to use BAD_RING() to stop the virtqueue in this case.

> +}
> +
>  static inline int virtqueue_add_split(struct virtqueue *_vq,
>  				      struct scatterlist *sgs[],
>  				      unsigned int total_sg,
> @@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	unsigned int i, n, avail, descs_used, prev, err_idx;
>  	int head;
>  	bool indirect;
> +	int io_err;
>
>  	START_USE(vq);
>
> @@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	for (n = 0; n < out_sgs; n++) {
>  		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
> +			dma_addr_t addr;
> +
> +			io_err = -EIO;
> +			if (!inside_split_ring(vq, i))
> +				goto unmap_release;
> +			io_err = -ENOMEM;
> +			addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
>  			if (vring_mapping_error(vq, addr))
>  				goto unmap_release;
>
> @@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	for (; n < (out_sgs + in_sgs); n++) {
>  		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
> +			dma_addr_t addr;
> +
> +			io_err = -EIO;
> +			if (!inside_split_ring(vq, i))
> +				goto unmap_release;
> +			io_err = -ENOMEM;
> +			addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
>  			if (vring_mapping_error(vq, addr))
>  				goto unmap_release;

It looks to me all the evils come from the fact that we depend on the
descriptor ring. So the checks in this patch would be unnecessary if we
don't even read from the descriptor ring, which could be manipulated by
the device. This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html

Thanks

> @@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  		dma_addr_t addr = vring_map_single(
>  			vq, desc, total_sg * sizeof(struct vring_desc),
>  			DMA_TO_DEVICE);
> +		io_err = -ENOMEM;
>  		if (vring_mapping_error(vq, addr))
>  			goto unmap_release;
>
> @@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	/* We're using some buffers from the free list. */
>  	vq->vq.num_free -= descs_used;
>
> +	io_err = -EIO;
> +	if (!inside_split_ring(vq, head))
> +		goto unmap_release;
> +
>  	/* Update free pointer */
>  	if (indirect)
>  		vq->free_head = virtio16_to_cpu(_vq->vdev,
> @@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	/* Put entry in available array (but don't update avail->idx until they
>  	 * do sync). */
>  	avail = vq->split.avail_
Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest
On 2021/6/3 8:41 AM, Andi Kleen wrote:
> When running under TDX the virtio host is untrusted. The bulk of the
> kernel memory is encrypted and protected, but the virtio ring is in
> special shared memory that is shared with the untrusted host.
>
> This means virtio needs to be hardened against any attacks from the
> host through the ring. Of course it's impossible to prevent DOS (the
> host can choose at any time to stop doing IO), but there should be no
> buffer overruns or similar that might give access to any private memory
> in the guest.
>
> virtio has a lot of modes, most are difficult to harden. The best for
> hardening seems to be split mode without indirect descriptors. This
> also simplifies the hardening job because it's only a single code path.
>
> Only allow split mode when in a protected guest. Followon patches
> harden the split mode code paths, and we don't want a malicious host to
> force anything else. Also disallow indirect mode for similar reasons.
>
> Signed-off-by: Andi Kleen
> ---
>  drivers/virtio/virtio_ring.c | 12
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71e16b53e9c1..f35629fa47b1 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #ifdef DEBUG
> @@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
>  	unsigned int i;
>
>  	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
> +
> +		/*
> +		 * In protected guest mode disallow packed or indirect
> +		 * because they ain't hardened.
> +		 */
> +
>  		switch (i) {
>  		case VIRTIO_RING_F_INDIRECT_DESC:
> +			if (protected_guest_has(VM_MEM_ENCRYPT))
> +				goto clear;

So we will see a huge performance regression without indirect
descriptors. We need to consider addressing this.

Thanks

>  			break;
>  		case VIRTIO_RING_F_EVENT_IDX:
>  			break;
> @@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
>  		case VIRTIO_F_ACCESS_PLATFORM:
>  			break;
>  		case VIRTIO_F_RING_PACKED:
> +			if (protected_guest_has(VM_MEM_ENCRYPT))
> +				goto clear;
>  			break;
>  		case VIRTIO_F_ORDER_PLATFORM:
>  			break;
> +	clear:
>  		default:
>  			/* We don't understand this bit. */
>  			__virtio_clear_bit(vdev, i);
Re: Virtio hardening for TDX
On 2021/6/3 8:41 AM, Andi Kleen wrote:
> [v1: Initial post]
>
> With confidential computing like TDX the guest doesn't trust the host
> anymore. The host is allowed to DOS of course, but it is not allowed to
> read or write any guest memory not explicitly shared with it.
>
> This has implications for virtio. Traditionally virtio didn't assume
> the other side of the communication channel is malicious, and therefore
> didn't do any boundary checks in virtio ring data structures.
>
> This patchkit does hardening for virtio. In a TDX like model the only
> host memory accesses allowed are in the virtio ring, as well as the
> (forced) swiotlb buffer. This patchkit does various changes to ensure
> there can be no access outside these two areas. It is possible for the
> host to break the communication, but this should result in an IO error
> on the guest, but no memory safety violations.
>
> virtio is quite complicated with many modes. To simplify the task we
> enforce that virtio is only in split mode without indirect descriptors,
> when running as a TDX guest. We also enforce use of the DMA API. Then
> these code paths are hardened against any corruptions on the ring.
>
> This patchkit has components in three subsystems:
> - Hardening changes to virtio, all in the generic virtio-ring
> - Hardening changes to kernel/dma swiotlb to harden swiotlb against
>   malicious pointers. It requires an API change which needed a tree
>   sweep.
> - A single x86 patch to enable arch_has_restricted_memory_access for
>   TDX
>
> It depends on Sathya's earlier patchkit that adds the basic
> infrastructure for TDX. This is only needed for the "am I running in
> TDX" part.

Note that it's probably needed by other cases as well:

1) Other encrypted VM technologies
2) VDUSE[1]
3) Smart NICs

We have already had discussions and some patches have been
posted[2][3][4]. I think the basic idea is similar: basically, we don't
trust any metadata provided by the device.

[2] is the series that uses metadata stored in private memory which
can't be accessed by swiotlb; that series aims to eliminate all the
possible attacks via virtqueue metadata.
[3] is one example of validating the used length.
[4] is the fix for the malicious config space.

Thanks

[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:29 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
>
> For some scenarios every guest PASID will require an IOASID ID # so
> there is a large enough demand that FDs alone are not a good fit.
>
> Further there are global container wide properties that are hard to
> carry over to a multi-FD model, like the attachment of devices to the
> container at the startup.

So if we implement a per-fd model, the global "container" properties
could be handled via the parent fd, e.g. attaching the parent to the
device at startup.

>> So this RFC treats fd as a container of address spaces which is each
>> tagged by an IOASID. If the container and address space is 1:1 then
>> the container seems useless.
>
> The examples at the bottom of the document show multiple IOASIDs in
> the container for a parent/child type relationship

This can also be done per fd? An fd parent can have multiple fd
children.

Thanks

> Jason
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>> We can open up to ~0U file descriptors, I don't see why we need to
>> restrict it in uAPI.
>
> There are significant problems with such large file descriptor tables.
> High FD numbers mean things like select don't work at all anymore and
> IIRC there are more complications.

I don't see how this differs between IOASID fds and other types of fds;
people can choose to use poll or epoll. And with the current proposal
(assuming an N:1 mapping from IOASIDs to the fd), I wonder how select
could work for one specific IOASID.

Thanks

> A huge number of FDs for typical usages should be avoided.
>
> Jason
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
>> I summarized five opens here, about:
>>
>> 1) Finalizing the name to replace /dev/ioasid;
>> 2) Whether one device is allowed to bind to multiple IOASID fd's;
>> 3) Carry device information in invalidation/fault reporting uAPI;
>> 4) What should/could be specified when allocating an IOASID;
>> 5) The protocol between vfio group and kvm;
>>
>> For 1), two alternative names are mentioned: /dev/iommu and /dev/ioas.
>> I don't have a strong preference and would like to hear votes from all
>> stakeholders. /dev/iommu is slightly better imho for two reasons.
>> First, per AMD's presentation in last KVM forum they implement vIOMMU
>> in hardware thus need to support user-managed domains. An iommu uAPI
>> notation might make more sense moving forward. Second, it makes later
>> uAPI naming easier as 'IOASID' can always be put as an object, e.g.
>> IOMMU_ALLOC_IOASID instead of IOASID_ALLOC_IOASID. :)
>
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time.

It looks to me like using "/dev/iommu" excludes the possibility of
implementing IOASID in a device-specific way (e.g. through cooperation
between a device MMU and the platform IOMMU)? What's more, the ATS spec
doesn't forbid a device #PF to be reported via a device-specific way.

Thanks

> We've also talked about this as /dev/sva for a while and now
> /dev/ioasid.
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects. Then we don't have naming aliasing with kernel constructs.
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/1 下午2:16, Tian, Kevin 写道: From: Jason Wang Sent: Tuesday, June 1, 2021 2:07 PM 在 2021/6/1 下午1:42, Tian, Kevin 写道: From: Jason Wang Sent: Tuesday, June 1, 2021 1:30 PM 在 2021/6/1 下午1:23, Lu Baolu 写道: Hi Jason W, On 6/1/21 1:08 PM, Jason Wang wrote: 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd? 3) If not, how GET_INFO work? oh, missed this question in prior reply. Personally, no special reason yet. But using ID may give us opportunity to customize the management of the handle. For one, better lookup efficiency by using xarray to store the allocated IDs. For two, could categorize the allocated IDs (parent or nested). GET_INFO just works with an input FD and an ID. I'm not sure I get this, for nesting cases you can still make the child an fd. And a question still, under what case we need to create multiple ioasids on a single ioasid fd? One possible situation where multiple IOASIDs per FD could be used is that devices with different underlying IOMMU capabilities are sharing a single FD. In this case, only devices with consistent underlying IOMMU capabilities could be put in an IOASID and multiple IOASIDs per FD could be applied. Though, I still not sure about "multiple IOASID per-FD" vs "multiple IOASID FDs" for such case. Right, that's exactly my question. The latter seems much more easier to be understood and implemented. A simple reason discussed in previous thread - there could be 1M's I/O address spaces per device while #FD's are precious resource. Is the concern for ulimit or performance? Note that we had #define NR_OPEN_MAX ~0U And with the fd semantic, you can do a lot of other stuffs: close on exec, passing via SCM_RIGHTS. yes, fd has its merits. For the case of 1M, I would like to know what's the use case for a single process to handle 1M+ address spaces? This single process is Qemu with an assigned device. 
Within the guest there could be many guest processes. Though in reality I didn't see such 1M processes on a single device, better not restrict it in uAPI? Sorry I don't get here. We can open up to ~0U file descriptors, I don't see why we need to restrict it in uAPI. Thanks So this RFC treats fd as a container of address spaces which is each tagged by an IOASID. If the container and address space is 1:1 then the container seems useless. yes, 1:1 then container is useless. But here it's assumed 1:M then even a single fd is sufficient for all intended usages. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
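The fd-vs-ID tradeoff debated above can be made concrete with a small userspace sketch. This is purely illustrative and not part of the proposal: a fixed-size table stands in for the kernel xarray that would back ID-based IOASID handles, and every name here (ioasid_table, ioasid_alloc, ioasid_lookup) is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: a minimal ID-keyed handle table standing in
 * for the xarray-backed IOASID store discussed in the thread. All names
 * are hypothetical; the real proposal keeps IOASIDs behind /dev/ioasid. */

#define MAX_IOASIDS 64

struct ioasid_entry {
    int used;
    int nested;          /* categorize IDs: parent (0) vs nested (1) */
    uint64_t pgtable;    /* opaque page-table handle */
};

struct ioasid_table {
    struct ioasid_entry slots[MAX_IOASIDS];
};

/* Allocate an ID; returns -1 when the table is full. */
static int ioasid_alloc(struct ioasid_table *t, int nested, uint64_t pgtable)
{
    for (int id = 0; id < MAX_IOASIDS; id++) {
        if (!t->slots[id].used) {
            t->slots[id].used = 1;
            t->slots[id].nested = nested;
            t->slots[id].pgtable = pgtable;
            return id;
        }
    }
    return -1;
}

/* O(1) lookup by ID -- the efficiency argument made for IDs over fds. */
static struct ioasid_entry *ioasid_lookup(struct ioasid_table *t, int id)
{
    if (id < 0 || id >= MAX_IOASIDS || !t->slots[id].used)
        return NULL;
    return &t->slots[id];
}
```

With IDs, a lookup is a bounds-checked index into an array (or xarray in the kernel) rather than a walk of the process fd table, and a per-entry flag is enough to categorize parent vs nested IOASIDs — the two management advantages claimed in the thread. What IDs give up, as noted above, is the fd semantics: close-on-exec and passing via SCM_RIGHTS.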
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/1 下午1:42, Tian, Kevin 写道: From: Jason Wang Sent: Tuesday, June 1, 2021 1:30 PM 在 2021/6/1 下午1:23, Lu Baolu 写道: Hi Jason W, On 6/1/21 1:08 PM, Jason Wang wrote: 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd? 3) If not, how GET_INFO work? oh, missed this question in prior reply. Personally, no special reason yet. But using ID may give us opportunity to customize the management of the handle. For one, better lookup efficiency by using xarray to store the allocated IDs. For two, could categorize the allocated IDs (parent or nested). GET_INFO just works with an input FD and an ID. I'm not sure I get this, for nesting cases you can still make the child an fd. And a question still, under what case we need to create multiple ioasids on a single ioasid fd? One possible situation where multiple IOASIDs per FD could be used is that devices with different underlying IOMMU capabilities are sharing a single FD. In this case, only devices with consistent underlying IOMMU capabilities could be put in an IOASID and multiple IOASIDs per FD could be applied. Though, I still not sure about "multiple IOASID per-FD" vs "multiple IOASID FDs" for such case. Right, that's exactly my question. The latter seems much more easier to be understood and implemented. A simple reason discussed in previous thread - there could be 1M's I/O address spaces per device while #FD's are precious resource. Is the concern for ulimit or performance? Note that we had #define NR_OPEN_MAX ~0U And with the fd semantic, you can do a lot of other stuffs: close on exec, passing via SCM_RIGHTS. For the case of 1M, I would like to know what's the use case for a single process to handle 1M+ address spaces? So this RFC treats fd as a container of address spaces which is each tagged by an IOASID. If the container and address space is 1:1 then the container seems useless. 
Thanks Thanks Kevin
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 1:23 PM, Lu Baolu wrote: Hi Jason W, On 6/1/21 1:08 PM, Jason Wang wrote: 2) If yes, what's the reason for not simply using the fd opened from /dev/ioas? (This is the question that is not answered) And what happens if we call GET_INFO on the ioasid_fd? 3) If not, how does GET_INFO work? oh, missed this question in the prior reply. Personally, no special reason yet. But using an ID may give us the opportunity to customize the management of the handle. For one, better lookup efficiency by using an xarray to store the allocated IDs. For two, we could categorize the allocated IDs (parent or nested). GET_INFO just works with an input FD and an ID. I'm not sure I get this; for nesting cases you can still make the child an fd. And a question still: in what case do we need to create multiple ioasids on a single ioasid fd? One possible situation where multiple IOASIDs per FD could be used is that devices with different underlying IOMMU capabilities are sharing a single FD. In this case, only devices with consistent underlying IOMMU capabilities could be put in an IOASID, and multiple IOASIDs per FD could be applied. Though, I'm still not sure about "multiple IOASIDs per FD" vs "multiple IOASID FDs" for such a case. Right, that's exactly my question. The latter seems much easier to understand and implement. Thanks Best regards, baolu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 12:27 PM, Shenming Lu wrote: On 2021/6/1 10:36, Jason Wang wrote: On 2021/5/31 4:41 PM, Liu Yi L wrote: I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support hardware nesting. Or is there a way to detect the capability beforehand? I think it could fail in IOASID_CREATE_NESTING. If the gpa_ioasid is not able to support nesting, then it should fail. I think GET_INFO only works after the ATTACH. yes. After attaching to gpa_ioasid, userspace could GET_INFO on the gpa_ioasid and check if nesting is supported or not. right? Some more questions: 1) Is the handle returned by IOASID_ALLOC an fd? 2) If yes, what's the reason for not simply using the fd opened from /dev/ioas? (This is the question that is not answered) And what happens if we call GET_INFO on the ioasid_fd? 3) If not, how does GET_INFO work? It seems that the return value from IOASID_ALLOC is an IOASID number in the ioasid_data struct; then when calling GET_INFO, we should convey this IOASID number to get the associated I/O address space attributes (which depend on the physical IOMMU and could be discovered when attaching a device to the IOASID fd or number), right? Right, but the question is why we need such indirection, unless there's a case where you need to create multiple IOASIDs per ioasid fd. It's simpler to attach the metadata to the ioasid fd itself. Thanks Thanks, Shenming
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/6/1 上午11:31, Liu Yi L 写道: On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote: 在 2021/5/31 下午4:41, Liu Yi L 写道: I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support hardware nesting. Or is there way to detect the capability before? I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid is not able to support nesting, then should fail it. I think GET_INFO only works after the ATTACH. yes. After attaching to gpa_ioasid, userspace could GET_INFO on the gpa_ioasid and check if nesting is supported or not. right? Some more questions: 1) Is the handle returned by IOASID_ALLOC an fd? it's an ID so far in this proposal. Ok. 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd? 3) If not, how GET_INFO work? oh, missed this question in prior reply. Personally, no special reason yet. But using ID may give us opportunity to customize the management of the handle. For one, better lookup efficiency by using xarray to store the allocated IDs. For two, could categorize the allocated IDs (parent or nested). GET_INFO just works with an input FD and an ID. I'm not sure I get this, for nesting cases you can still make the child an fd. And a question still, under what case we need to create multiple ioasids on a single ioasid fd? (This case is not demonstrated in your examples). Thanks /* Bind guest I/O page table */ bind_data = { .ioasid = giova_ioasid; .addr = giova_pgtable; // and format information }; ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); /* Invalidate IOTLB when required */ inv_data = { .ioasid = giova_ioasid; // granular information }; ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data); /* See 5.6 for I/O page fault handling */ 5.5. Guest SVA (vSVA) ++ After boots the guest further create a GVA address spaces (gpasid1) on dev1. Dev2 is not affected (still attached to giova_ioasid). 
As explained in section 4, the user should avoid exposing ENQCMD on both pdev and mdev. The sequence applies to all device types (be it pdev or mdev), except for one additional step to call KVM for an ENQCMD-capable mdev: My understanding is that ENQCMD is Intel specific and not a requirement for having vSVA. ENQCMD is not really Intel specific, although only Intel supports it today. The PCIe DMWr capability is what software uses to enumerate ENQCMD support on the device side. yes, it is not a requirement for vSVA. They are orthogonal. Right, then it's better to mention DMWr instead of a vendor-specific instruction in a general framework like ioasid. good suggestion. :)
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/5/31 4:41 PM, Liu Yi L wrote: I guess VFIO_ATTACH_IOASID will fail if the underlying layer doesn't support hardware nesting. Or is there a way to detect the capability beforehand? I think it could fail in IOASID_CREATE_NESTING. If the gpa_ioasid is not able to support nesting, then it should fail. I think GET_INFO only works after the ATTACH. yes. After attaching to gpa_ioasid, userspace could GET_INFO on the gpa_ioasid and check if nesting is supported or not. right? Some more questions: 1) Is the handle returned by IOASID_ALLOC an fd? 2) If yes, what's the reason for not simply using the fd opened from /dev/ioas? (This is the question that is not answered) And what happens if we call GET_INFO on the ioasid_fd? 3) If not, how does GET_INFO work? /* Bind guest I/O page table */ bind_data = { .ioasid = giova_ioasid; .addr = giova_pgtable; // and format information }; ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data); /* Invalidate IOTLB when required */ inv_data = { .ioasid = giova_ioasid; // granular information }; ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data); /* See 5.6 for I/O page fault handling */ 5.5. Guest SVA (vSVA) ++ After boot the guest further creates a GVA address space (gpasid1) on dev1. Dev2 is not affected (still attached to giova_ioasid). As explained in section 4, the user should avoid exposing ENQCMD on both pdev and mdev. The sequence applies to all device types (be it pdev or mdev), except for one additional step to call KVM for an ENQCMD-capable mdev: My understanding is that ENQCMD is Intel specific and not a requirement for having vSVA. ENQCMD is not really Intel specific, although only Intel supports it today. The PCIe DMWr capability is what software uses to enumerate ENQCMD support on the device side. yes, it is not a requirement for vSVA. They are orthogonal. Right, then it's better to mention DMWr instead of a vendor-specific instruction in a general framework like ioasid.
Thanks
Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/5/31 下午12:27, Yongji Xie 写道: On Fri, May 28, 2021 at 10:31 AM Jason Wang wrote: 在 2021/5/27 下午9:17, Yongji Xie 写道: On Thu, May 27, 2021 at 4:41 PM Jason Wang wrote: 在 2021/5/27 下午3:34, Yongji Xie 写道: On Thu, May 27, 2021 at 1:40 PM Jason Wang wrote: 在 2021/5/27 下午1:08, Yongji Xie 写道: On Thu, May 27, 2021 at 1:00 PM Jason Wang wrote: 在 2021/5/27 下午12:57, Yongji Xie 写道: On Thu, May 27, 2021 at 12:13 PM Jason Wang wrote: 在 2021/5/17 下午5:55, Xie Yongji 写道: + +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + wait_event_killable(msg->waitq, msg->completed); What happens if the userspace(malicous) doesn't give a response forever? It looks like a DOS. If yes, we need to consider a way to fix that. How about using wait_event_killable_timeout() instead? Probably, and then we need choose a suitable timeout and more important, need to report the failure to virtio. Makes sense to me. But it looks like some vdpa_config_ops/virtio_config_ops such as set_status() didn't have a return value. Now I add a WARN_ON() for the failure. Do you mean we need to add some change for virtio core to handle the failure? Maybe, but I'm not sure how hard we can do that. We need to change all virtio device drivers in this way. Probably. We had NEEDS_RESET but it looks we don't implement it. Could it handle the failure of get_feature() and get/set_config()? Looks not: " The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver. " This looks implies that NEEDS_RESET may only work after device is probed. But in the current design, even the reset() is not reliable. 
Or a rough idea is that maybe need some relaxing to be coupled loosely with userspace. E.g the device (control path) is implemented in the kernel but the datapath is implemented in the userspace like TUN/TAP. I think it can work for most cases. One problem is that the set_config might change the behavior of the data path at runtime, e.g. virtnet_set_mac_address() in the virtio-net driver and cache_type_store() in the virtio-blk driver. Not sure if this path is able to return before the datapath is aware of this change. Good point. But set_config() should be rare: E.g in the case of virtio-net with VERSION_1, config space is read only, and it was set via control vq. For block, we can 1) start from without WCE or 2) we add a config change notification to userspace or I prefer this way. And I think we also need to do similar things for set/get_vq_state(). Yes, I agree. Hi Jason, Now I'm working on this. But I found the config change notification must be synchronous in the virtio-blk case, which means the kernel still needs to wait for the response from userspace in set_config(). Otherwise, some I/Os might still run the old way after we change the cache_type in sysfs. The simple ways to solve this problem are: 1. Only support read-only config space, disable WCE as you suggested 2. Add a return value to set_config() and handle the failure only in virtio-blk driver 3. Print some warnings after timeout since it only affects the dataplane which is under userspace's control Any suggestions? Let's go without WCE first and make VDUSE work first. We can then think of a solution for WCE on top. Thanks Thanks, Yongji
Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/5/28 上午11:54, Yongji Xie 写道: On Fri, May 28, 2021 at 9:33 AM Jason Wang wrote: 在 2021/5/27 下午6:14, Yongji Xie 写道: On Thu, May 27, 2021 at 4:43 PM Jason Wang wrote: 在 2021/5/27 下午4:41, Jason Wang 写道: 在 2021/5/27 下午3:34, Yongji Xie 写道: On Thu, May 27, 2021 at 1:40 PM Jason Wang wrote: 在 2021/5/27 下午1:08, Yongji Xie 写道: On Thu, May 27, 2021 at 1:00 PM Jason Wang wrote: 在 2021/5/27 下午12:57, Yongji Xie 写道: On Thu, May 27, 2021 at 12:13 PM Jason Wang wrote: 在 2021/5/17 下午5:55, Xie Yongji 写道: + +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + wait_event_killable(msg->waitq, msg->completed); What happens if the userspace(malicous) doesn't give a response forever? It looks like a DOS. If yes, we need to consider a way to fix that. How about using wait_event_killable_timeout() instead? Probably, and then we need choose a suitable timeout and more important, need to report the failure to virtio. Makes sense to me. But it looks like some vdpa_config_ops/virtio_config_ops such as set_status() didn't have a return value. Now I add a WARN_ON() for the failure. Do you mean we need to add some change for virtio core to handle the failure? Maybe, but I'm not sure how hard we can do that. We need to change all virtio device drivers in this way. Probably. We had NEEDS_RESET but it looks we don't implement it. Could it handle the failure of get_feature() and get/set_config()? Looks not: " The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver. " This looks implies that NEEDS_RESET may only work after device is probed. But in the current design, even the reset() is not reliable. 
Or a rough idea is that maybe need some relaxing to be coupled loosely with userspace. E.g the device (control path) is implemented in the kernel but the datapath is implemented in the userspace like TUN/TAP. I think it can work for most cases. One problem is that the set_config might change the behavior of the data path at runtime, e.g. virtnet_set_mac_address() in the virtio-net driver and cache_type_store() in the virtio-blk driver. Not sure if this path is able to return before the datapath is aware of this change. Good point. But set_config() should be rare: E.g in the case of virtio-net with VERSION_1, config space is read only, and it was set via control vq. For block, we can 1) start from without WCE or 2) we add a config change notification to userspace or 3) extend the spec to use vq instead of config space Thanks Another thing if we want to go this way: We need find a way to terminate the data path from the kernel side, to implement to reset semantic. Do you mean terminate the data path in vdpa_reset(). Yes. Is it ok to just notify userspace to stop data path asynchronously? For well-behaved userspace, yes but no for buggy or malicious ones. But the buggy or malicious daemons can't do anything if my understanding is correct. You're right. I originally thought there can still have bouncing. But consider we don't do that during fault. It should be safe. I had an idea, how about terminate IOTLB in this case? Then we're in fact turn datapath off. Sorry, I didn't get your point here. What do you mean by terminating IOTLB? I meant terminate the bouncing but it looks safe after a second thought :) Thanks Remove iotlb mapping? But userspace can still access the mapped region. Thanks, Yongji
Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/5/27 下午9:17, Yongji Xie 写道: On Thu, May 27, 2021 at 4:41 PM Jason Wang wrote: 在 2021/5/27 下午3:34, Yongji Xie 写道: On Thu, May 27, 2021 at 1:40 PM Jason Wang wrote: 在 2021/5/27 下午1:08, Yongji Xie 写道: On Thu, May 27, 2021 at 1:00 PM Jason Wang wrote: 在 2021/5/27 下午12:57, Yongji Xie 写道: On Thu, May 27, 2021 at 12:13 PM Jason Wang wrote: 在 2021/5/17 下午5:55, Xie Yongji 写道: + +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + wait_event_killable(msg->waitq, msg->completed); What happens if the userspace(malicous) doesn't give a response forever? It looks like a DOS. If yes, we need to consider a way to fix that. How about using wait_event_killable_timeout() instead? Probably, and then we need choose a suitable timeout and more important, need to report the failure to virtio. Makes sense to me. But it looks like some vdpa_config_ops/virtio_config_ops such as set_status() didn't have a return value. Now I add a WARN_ON() for the failure. Do you mean we need to add some change for virtio core to handle the failure? Maybe, but I'm not sure how hard we can do that. We need to change all virtio device drivers in this way. Probably. We had NEEDS_RESET but it looks we don't implement it. Could it handle the failure of get_feature() and get/set_config()? Looks not: " The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver. " This looks implies that NEEDS_RESET may only work after device is probed. But in the current design, even the reset() is not reliable. Or a rough idea is that maybe need some relaxing to be coupled loosely with userspace. 
E.g the device (control path) is implemented in the kernel but the datapath is implemented in the userspace like TUN/TAP. I think it can work for most cases. One problem is that the set_config might change the behavior of the data path at runtime, e.g. virtnet_set_mac_address() in the virtio-net driver and cache_type_store() in the virtio-blk driver. Not sure if this path is able to return before the datapath is aware of this change. Good point. But set_config() should be rare: E.g in the case of virtio-net with VERSION_1, config space is read only, and it was set via control vq. For block, we can 1) start from without WCE or 2) we add a config change notification to userspace or I prefer this way. And I think we also need to do similar things for set/get_vq_state(). Yes, I agree. Thanks Thanks, Yongji
Re: [RFC] /dev/ioasid uAPI proposal
在 2021/5/27 下午3:58, Tian, Kevin 写道: /dev/ioasid provides an unified interface for managing I/O page tables for devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, etc.) are expected to use this interface instead of creating their own logic to isolate untrusted device DMAs initiated by userspace. Not a native speaker but /dev/ioas seems better? This proposal describes the uAPI of /dev/ioasid and also sample sequences with VFIO as example in typical usages. The driver-facing kernel API provided by the iommu layer is still TBD, which can be discussed after consensus is made on this uAPI. It's based on a lengthy discussion starting from here: https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ It ends up to be a long writing due to many things to be summarized and non-trivial effort required to connect them into a complete proposal. Hope it provides a clean base to converge. TOC 1. Terminologies and Concepts 2. uAPI Proposal 2.1. /dev/ioasid uAPI 2.2. /dev/vfio uAPI 2.3. /dev/kvm uAPI 3. Sample structures and helper functions 4. PASID virtualization 5. Use Cases and Flows 5.1. A simple example 5.2. Multiple IOASIDs (no nesting) 5.3. IOASID nesting (software) 5.4. IOASID nesting (hardware) 5.5. Guest SVA (vSVA) 5.6. I/O page fault 5.7. BIND_PASID_TABLE 1. Terminologies and Concepts - IOASID FD is the container holding multiple I/O address spaces. User manages those address spaces through FD operations. Multiple FD's are allowed per process, but with this proposal one FD should be sufficient for all intended usages. IOASID is the FD-local software handle representing an I/O address space. Each IOASID is associated with a single I/O page table. IOASIDs can be nested together, implying the output address from one I/O page table (represented by child IOASID) must be further translated by another I/O page table (represented by parent IOASID). 
I/O address space can be managed through two protocols, according to whether the corresponding I/O page table is constructed by the kernel or the user. When kernel-managed, a dma mapping protocol (similar to existing VFIO iommu type1) is provided for the user to explicitly specify how the I/O address space is mapped. Otherwise, a different protocol is provided for the user to bind an user-managed I/O page table to the IOMMU, plus necessary commands for iotlb invalidation and I/O fault handling. Pgtable binding protocol can be used only on the child IOASID's, implying IOASID nesting must be enabled. This is because the kernel doesn't trust userspace. Nesting allows the kernel to enforce its DMA isolation policy through the parent IOASID. IOASID nesting can be implemented in two ways: hardware nesting and software nesting. With hardware support the child and parent I/O page tables are walked consecutively by the IOMMU to form a nested translation. When it's implemented in software, the ioasid driver Need to explain what did "ioasid driver" mean. I guess it's the module that implements the IOASID abstraction: 1) RID 2) RID+PASID 3) others And if yes, does it allow the device for software specific implementation: 1) swiotlb or 2) device specific IOASID implementation is responsible for merging the two-level mappings into a single-level shadow I/O page table. Software nesting requires both child/parent page tables operated through the dma mapping protocol, so any change in either level can be captured by the kernel to update the corresponding shadow mapping. An I/O address space takes effect in the IOMMU only after it is attached to a device. The device in the /dev/ioasid context always refers to a physical one or 'pdev' (PF or VF). One I/O address space could be attached to multiple devices. In this case, /dev/ioasid uAPI applies to all attached devices under the specified IOASID. 
Based on the underlying IOMMU capability one device might be allowed to attach to multiple I/O address spaces, with DMAs accessing them by carrying different routing information. One of them is the default I/O address space routed by PCI Requestor ID (RID) or ARM Stream ID. The remaining are routed by RID + Process Address Space ID (PASID) or Stream+Substream ID. For simplicity the following context uses RID and PASID when talking about the routing information for I/O address spaces. Device attachment is initiated through passthrough framework uAPI (use VFIO for simplicity in following context). VFIO is responsible for identifying the routing information and registering it to the ioasid driver when calling ioasid attach helper function. It could be RID if the assigned device is pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, user might also provide its view of virtual routing information (vPASID) in the attach call, e.g. when multiple user-managed I/O address spaces are attached to the vfio_device. I
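The software-nesting scheme described in the proposal above — the ioasid driver merging the child and parent mappings into a single-level shadow I/O page table, with both levels going through the dma mapping protocol so either level's changes can be captured — can be sketched with page-granular arrays standing in for real I/O page tables. This is an illustrative model only, not the proposed implementation; all names are hypothetical.

```c
#include <stdint.h>

/* Sketch of software nesting: the child table maps IOVA -> GPA and the
 * parent maps GPA -> HPA; the "shadow" is the merged single-level
 * IOVA -> HPA table that the IOMMU actually walks. Flat arrays indexed
 * by page frame number stand in for real multi-level I/O page tables. */

#define NPAGES   16
#define INVALID  UINT64_MAX

/* Recompute one shadow entry. In the real design this would run on any
 * map/unmap at either level, which is why software nesting requires both
 * levels to be operated through the kernel's dma mapping protocol. */
static void shadow_update(const uint64_t child[NPAGES],
                          const uint64_t parent[NPAGES],
                          uint64_t shadow[NPAGES], int iova_pfn)
{
    uint64_t gpa_pfn = child[iova_pfn];

    if (gpa_pfn == INVALID || gpa_pfn >= NPAGES ||
        parent[gpa_pfn] == INVALID)
        shadow[iova_pfn] = INVALID;     /* no translation: access faults */
    else
        shadow[iova_pfn] = parent[gpa_pfn];
}
```

Hardware nesting avoids all of this bookkeeping by walking the two tables consecutively in the IOMMU; the shadow approach trades extra kernel work on every mapping change for working on IOMMUs without nested-translation support.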
Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace
在 2021/5/27 下午6:14, Yongji Xie 写道: On Thu, May 27, 2021 at 4:43 PM Jason Wang wrote: 在 2021/5/27 下午4:41, Jason Wang 写道: 在 2021/5/27 下午3:34, Yongji Xie 写道: On Thu, May 27, 2021 at 1:40 PM Jason Wang wrote: 在 2021/5/27 下午1:08, Yongji Xie 写道: On Thu, May 27, 2021 at 1:00 PM Jason Wang wrote: 在 2021/5/27 下午12:57, Yongji Xie 写道: On Thu, May 27, 2021 at 12:13 PM Jason Wang wrote: 在 2021/5/17 下午5:55, Xie Yongji 写道: + +static int vduse_dev_msg_sync(struct vduse_dev *dev, + struct vduse_dev_msg *msg) +{ + init_waitqueue_head(&msg->waitq); + spin_lock(&dev->msg_lock); + vduse_enqueue_msg(&dev->send_list, msg); + wake_up(&dev->waitq); + spin_unlock(&dev->msg_lock); + wait_event_killable(msg->waitq, msg->completed); What happens if the userspace(malicous) doesn't give a response forever? It looks like a DOS. If yes, we need to consider a way to fix that. How about using wait_event_killable_timeout() instead? Probably, and then we need choose a suitable timeout and more important, need to report the failure to virtio. Makes sense to me. But it looks like some vdpa_config_ops/virtio_config_ops such as set_status() didn't have a return value. Now I add a WARN_ON() for the failure. Do you mean we need to add some change for virtio core to handle the failure? Maybe, but I'm not sure how hard we can do that. We need to change all virtio device drivers in this way. Probably. We had NEEDS_RESET but it looks we don't implement it. Could it handle the failure of get_feature() and get/set_config()? Looks not: " The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver. " This looks implies that NEEDS_RESET may only work after device is probed. But in the current design, even the reset() is not reliable. Or a rough idea is that maybe need some relaxing to be coupled loosely with userspace. 
E.g the device (control path) is implemented in the kernel but the datapath is implemented in the userspace like TUN/TAP. I think it can work for most cases. One problem is that the set_config might change the behavior of the data path at runtime, e.g. virtnet_set_mac_address() in the virtio-net driver and cache_type_store() in the virtio-blk driver. Not sure if this path is able to return before the datapath is aware of this change. Good point. But set_config() should be rare: E.g in the case of virtio-net with VERSION_1, config space is read only, and it was set via control vq. For block, we can 1) start from without WCE or 2) we add a config change notification to userspace or 3) extend the spec to use vq instead of config space Thanks Another thing if we want to go this way: We need find a way to terminate the data path from the kernel side, to implement to reset semantic. Do you mean terminate the data path in vdpa_reset(). Yes. Is it ok to just notify userspace to stop data path asynchronously? For well-behaved userspace, yes but no for buggy or malicious ones. I had an idea, how about terminate IOTLB in this case? Then we're in fact turn datapath off. Thanks Userspace should not be able to do any I/O at that time because the iotlb mapping is already removed. Thanks, Yongji