Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-28 Thread Jason Wang
On Mon, Mar 28, 2022 at 8:23 PM Jason Gunthorpe  wrote:
>
> On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote:
> > To me, it seems easier to sidestep this question by letting
> > userspace know about the change,
>
> That is not backwards compatible, so I don't think it helps unless we
> say if you open /dev/vfio/vfio you get old behavior and if you open
> /dev/iommu you get new...

Actually, this is one way to go. Trying to behave exactly like type1
might not be easy.

>
> Nor does it answer if I can fix RDMA or not :\
>

vDPA has backend feature negotiation, so userspace can actually tell
vDPA to opt in to the new accounting approach. I'm not sure RDMA can do
the same.
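For reference, the opt-in handshake this would rely on can be sketched roughly as follows. This is a Python stand-in for the VHOST_{GET,SET}_BACKEND_FEATURES flow, illustrative only; the accounting feature bit and its value are hypothetical, invented here to show the shape of the negotiation.

```python
# Sketch of a backend feature negotiation, modeled loosely on the
# vhost-vdpa VHOST_{GET,SET}_BACKEND_FEATURES handshake.
# BACKEND_F_USER_LOCKED_VM is a made-up "new accounting" bit; the
# other bit value is also illustrative, not the real uAPI constant.

BACKEND_F_IOTLB_MSG_V2 = 1 << 0      # stand-in for an existing feature
BACKEND_F_USER_LOCKED_VM = 1 << 7    # hypothetical opt-in bit

class VdpaBackend:
    def __init__(self, supported):
        self.supported = supported   # features the kernel offers
        self.acked = 0               # features userspace acked

    def get_backend_features(self):
        return self.supported

    def set_backend_features(self, features):
        if features & ~self.supported:
            raise ValueError("userspace acked an unsupported feature")
        self.acked = features

    def uses_user_locked_vm(self):
        # The kernel would charge user->locked_vm only if userspace
        # explicitly opted in to the new accounting.
        return bool(self.acked & BACKEND_F_USER_LOCKED_VM)

backend = VdpaBackend(BACKEND_F_IOTLB_MSG_V2 | BACKEND_F_USER_LOCKED_VM)
wanted = backend.get_backend_features() & BACKEND_F_USER_LOCKED_VM
backend.set_backend_features(wanted)
```

The point is simply that, unlike RDMA, vDPA already has a channel through which userspace can signal that it understands the new accounting before the kernel switches schemes.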

Thanks

> So we really do need to know what exactly is the situation here.
>
> Jason
>

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-27 Thread Jason Wang
On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe  wrote:
>
> On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
>
> > It's simply because we don't want to break existing userspace. [1]
>
> I'm still waiting to hear what exactly breaks in real systems.
>
> As I explained, this is not a significant change, but it could break
> something in a few special scenarios.
>
> Also the one place we do have ABI breaks is security, and ulimit is a
> security mechanism that isn't working right. So we do clearly need to
> understand *exactly* what real thing breaks - if anything.
>
> Jason
>

To tell the truth, I don't know. I remember that OpenStack may do some
accounting, so I'm adding Sean for more comments. But we really can't
imagine OpenStack is the only userspace that may use this.

To me, it seems easier to sidestep this question by letting
userspace know about the change,

Thanks



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:57 AM
> >
> > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Wang 
> > > > Sent: Thursday, March 24, 2022 10:28 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin 
> > wrote:
> > > > >
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > > >
> > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > I'm still picking my way through the series, but the later compat
> > > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > > Doesn't this difference need to be accounted in how libvirt
> > > > > > > manages VM resource limits?
> > > > > >
> > > > > > AFAICT, no, but it should be checked.
> > > > > >
> > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > > memory limits.
> > > > > >
> > > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > > >
> > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > > {
> > > > > >   rlim = tsk->signal->rlim + resource;
> > > > > > [..]
> > > > > >   if (new_rlim)
> > > > > >   *rlim = *new_rlim;
> > > > > >
> > > > > > Which vfio reads back here:
> > > > > >
> > > > > > drivers/vfio/vfio_iommu_type1.c:  unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > > drivers/vfio/vfio_iommu_type1.c:  unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >
> > > > > > And iommufd does the same read back:
> > > > > >
> > > > > >         lock_limit =
> > > > > >                 task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >         npages = pages->npinned - pages->last_npinned;
> > > > > >         do {
> > > > > >                 cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > > > >                 new_pages = cur_pages + npages;
> > > > > >                 if (new_pages > lock_limit)
> > > > > >                         return -ENOMEM;
> > > > > >         } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > >                                      cur_pages, new_pages) != cur_pages);
> > > > > >
> > > > > > So it does work essentially the same.
> > > > > >
> > > > > > The difference is more subtle, iouring/etc puts the charge in the
> > > > > > user so it is additive with things like iouring and additively
> > > > > > spans all the users processes.
> > > > > >
> > > > > > However vfio is accounting only per-process and only for itself - no
> > > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > > >
> > > > > > The user visible difference will be that a limit X that worked with
> > > > > > VFIO may start to fail after a kernel upgrade as the charge
> > > > > > accounting is now cross user and additive with things like iommufd.
> > > > > >
> > > > > > This whole area is a bit peculiar (eg mlock itself works
> > > > > > differently), IMHO, but with most of the places doing pins voting
> > > > > > to use user->locked_vm as the charge it seems the right path in
> > > > > > today's kernel.
> > > > > >
> > > > > > Certainly having qemu concurrently using three different subsystems
> > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:28 AM
> >
> > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > >
> > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > >
> > > > > I'm still picking my way through the series, but the later compat
> > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > > resource limits?
> > > >
> > > > AFAICT, no, but it should be checked.
> > > >
> > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > memory limits.
> > > >
> > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > >
> > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > {
> > > >   rlim = tsk->signal->rlim + resource;
> > > > [..]
> > > >   if (new_rlim)
> > > >   *rlim = *new_rlim;
> > > >
> > > > Which vfio reads back here:
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c:  unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > drivers/vfio/vfio_iommu_type1.c:  unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > >
> > > > And iommufd does the same read back:
> > > >
> > > >         lock_limit =
> > > >                 task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > >         npages = pages->npinned - pages->last_npinned;
> > > >         do {
> > > >                 cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > >                 new_pages = cur_pages + npages;
> > > >                 if (new_pages > lock_limit)
> > > >                         return -ENOMEM;
> > > >         } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > >                                      cur_pages, new_pages) != cur_pages);
> > > >
> > > > So it does work essentially the same.
> > > >
> > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > so it is additive with things like iouring and additively spans all
> > > > the users processes.
> > > >
> > > > However vfio is accounting only per-process and only for itself - no
> > > > other subsystem uses locked as the charge variable for DMA pins.
> > > >
> > > > The user visible difference will be that a limit X that worked with
> > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > is now cross user and additive with things like iommufd.
> > > >
> > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > IMHO, but with most of the places doing pins voting to use
> > > > user->locked_vm as the charge it seems the right path in today's
> > > > kernel.
> > > >
> > > > Certainly having qemu concurrently using three different subsystems
> > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > >
> > > > I plan to fix RDMA like this as well so at least we can have
> > > > consistency within qemu.
> > > >
> > >
> > > I have an impression that iommufd and vfio type1 must use
> > > the same accounting scheme given the management stack
> > > has no insight into qemu on which one is actually used thus
> > > cannot adapt to the subtle difference in between. in this
> > > regard either we start fixing vfio type1 to use user->locked_vm
> > > now or have iommufd follow vfio type1 for upward compatibility
> > > and then change them together at a later point.
> > >
> > > I prefer the former, as IMHO I don't know when a later point will
> > > come w/o certain kernel changes to actually break the userspace
> > > policy built on a wrong accounting scheme...

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
>
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 23, 2022 12:15 AM
> >
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> >
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> >
> > AFAICT, no, but it should be checked.
> >
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.
> >
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> >
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> >   rlim = tsk->signal->rlim + resource;
> > [..]
> >   if (new_rlim)
> >   *rlim = *new_rlim;
> >
> > Which vfio reads back here:
> >
> > drivers/vfio/vfio_iommu_type1.c:  unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:  unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> > And iommufd does the same read back:
> >
> >         lock_limit =
> >                 task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >         npages = pages->npinned - pages->last_npinned;
> >         do {
> >                 cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> >                 new_pages = cur_pages + npages;
> >                 if (new_pages > lock_limit)
> >                         return -ENOMEM;
> >         } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> >                                      cur_pages, new_pages) != cur_pages);
> >
> > So it does work essentially the same.
> >
> > The difference is more subtle, iouring/etc puts the charge in the user
> > so it is additive with things like iouring and additively spans all
> > the users processes.
> >
> > However vfio is accounting only per-process and only for itself - no
> > other subsystem uses locked as the charge variable for DMA pins.
> >
> > The user visible difference will be that a limit X that worked with
> > VFIO may start to fail after a kernel upgrade as the charge accounting
> > is now cross user and additive with things like iommufd.
> >
> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> >
> > Certainly having qemu concurrently using three different subsystems
> > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > RLIMIT_MEMLOCK differently cannot be sane or correct.
> >
> > I plan to fix RDMA like this as well so at least we can have
> > consistency within qemu.
> >
>
> I have an impression that iommufd and vfio type1 must use
> the same accounting scheme given the management stack
> has no insight into qemu on which one is actually used thus
> cannot adapt to the subtle difference in between. in this
> regard either we start fixing vfio type1 to use user->locked_vm
> now or have iommufd follow vfio type1 for upward compatibility
> and then change them together at a later point.
>
> I prefer the former, as IMHO I don't know when a later point will
> come w/o certain kernel changes to actually break the userspace
> policy built on a wrong accounting scheme...

I wonder if the kernel is the right place to do this. We have new uAPI,
so the management layer can know about the difference in accounting in
advance by

-device vfio-pci,iommufd=on

?
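To make the concern in this thread concrete: the difference between the two accounting schemes can be simulated in a few lines. This is purely an illustration of the semantics described above (per-process mm->locked_vm for vfio type1, per-user additive user->locked_vm for iommufd/io_uring), not kernel code.

```python
# Why "a limit X that worked with VFIO may start to fail after a
# kernel upgrade": type1 checks each process's own charge in
# isolation, while user->locked_vm is shared across all of the
# user's processes and all pinning subsystems.

RLIMIT_MEMLOCK = 1024  # pages, for illustration

def type1_ok(pins):
    # vfio type1: each pin charge checked against the limit alone.
    return all(p <= RLIMIT_MEMLOCK for p in pins)

def user_locked_vm_ok(pins):
    # iommufd/io_uring style: charges are additive in user->locked_vm.
    return sum(pins) <= RLIMIT_MEMLOCK

# qemu pins 800 pages via vfio and 800 via io_uring under one user:
pins = [800, 800]
passes_old = type1_ok(pins)           # True: each charge fits
passes_new = user_locked_vm_ok(pins)  # False: 1600 > 1024, -ENOMEM
```

Under the old scheme both pins succeed; under the new scheme the same workload needs the limit raised, which is exactly what the management stack would have to learn about in advance.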

>
> Thanks
> Kevin
>



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-08 Thread Jason Wang
On Thu, Dec 9, 2021 at 1:41 PM Tian, Kevin  wrote:
>
> > From: Jason Gunthorpe 
> > Sent: Thursday, December 2, 2021 9:55 PM
> >
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO!
>
> This is correct. Just as what you agreed with Thomas, the only difference
> between IMS and MSI is on where the messages are stored. Physically
> it is unreasonable for the HW to check whether an interrupt is used for
> a specific software usage (e.g. virtualization) since it doesn't have such
> knowledge. At most an entry is associated to PASID, but again the PASID
> can be used anywhere.

Note that vDPA has planned to use IMS for parent devices that can
have a huge number of instances.

Thanks

>
> The wording in current IDXD spec is not clear on this part. We'll talk to
> the spec owner to have it fixed.
>
> Thanks
> Kevin
>



Re: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-18 Thread Jason Wang
On Mon, Oct 18, 2021 at 11:35 PM Michael S. Tsirkin  wrote:
>
> On Mon, Oct 18, 2021 at 04:23:41PM +0100, Jean-Philippe Brucker wrote:
> > On Thu, Oct 14, 2021 at 03:00:38AM +, Tian, Kevin wrote:
> > > > From: Jean-Philippe Brucker 
> > > > Sent: Wednesday, October 13, 2021 8:11 PM
> > > >
> > > > Support identity domains, allowing to only enable IOMMU protection for a
> > > > subset of endpoints (those assigned to userspace, for example). Users
> > > > may enable identity domains at compile time
> > > > (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time
> > > > (iommu.passthrough=1) or
> > > > runtime (/sys/kernel/iommu_groups/*/type = identity).
> > >
> > > Do we want to use consistent terms between spec (bypass domain)
> > > and code (identity domain)?
> >
> > I don't think we have to. Linux uses "identity" domains and "passthrough"
> > IOMMU. The old virtio-iommu feature was "bypass" so we should keep that
> > for the new one, to be consistent. And then I've used "bypass" for domains
> > as well, in the spec, because it would look strange to use a different
> > term for the same concept. I find that it sort of falls into place: Linux'
> > identity domains can be implemented either with bypass or identity-mapped
> > virtio-iommu domains.
> >
> > > >
> > > > Patches 1-2 support identity domains using the optional
> > > > VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the
> > > > spec, see [1] for the latest proposal.
> > > >
> > > > Patches 3-5 add a fallback to identity mappings, when the feature is not
> > > > supported.
> > > >
> > > > Note that this series doesn't touch the global bypass bit added by
> > > > VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU
> > > > should
> > > > be attached to a domain, so global bypass isn't in use after endpoints
> > >
> > > I saw a concept of deferred attach in the iommu core. See
> > > iommu_is_attach_deferred(). Currently this is vendor specific and I haven't
> > > looked into the exact reason why some vendor sets it now. Just
> > > be curious whether the same reason might be applied to virtio-iommu.
> > >
> > > > are probed. Before that, the global bypass policy is decided by the
> > > > hypervisor and firmware. So I don't think Linux needs to touch the
> > >
> > > This reminds me one thing. The spec says that the global bypass
> > > bit is sticky and not affected by reset.
> >
> > The spec talks about *device* reset, triggered by software writing 0 to
> > the status register, but it doesn't mention system reset. Would be good to
> > clarify that. Something like:
> >
> > If the device offers the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, it MAY
> > initialize the \field{bypass} field to 1. Field \field{bypass} SHOULD
> > NOT change on device reset, but SHOULD be restored to its initial
> > value on system reset.
> >
> > > This implies that in the case
> > > of rebooting the VM into a different OS, the previous OS actually
> > > has the right to override this setting for the next OS. Is it a right
> > > design? Even the firmware itself is unable to identify the original
> > > setting enforced by the hypervisor after reboot. I feel the hypervisor
> > > setting should be recovered after reset since it reflects the
> > > security measure enforced by the virtual platform?
> >
> > So I think clarifying system reset should address your questions.
> > I believe we should leave bypass sticky across device reset, so a FW->OS
> > transition, where the OS resets the device, does not open a vulnerability
> > (if bypass was enabled at boot and then left disabled by FW.)
> >
> > It's still a good idea for the OS to restore on shutdown the bypass value
> > it was booted with. So it can kexec into an OS that doesn't support
> > virtio-iommu, for example.
> >
> > Thanks,
> > Jean
>
> Is this stickiness really important? Can't this be addressed just by
> hypervisor disabling bypass at boot?

And I'm not sure if the sticky bit can survive a transport reset.
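A toy model of the \field{bypass} semantics being discussed may help here: sticky across a device reset (software writing 0 to the status register), restored to the hypervisor's initial value on system reset. This assumes Jean-Philippe's proposed clarification above and is only an illustration; whether a transport reset behaves like either case is exactly the open question.

```python
# Toy model of the proposed bypass-field semantics for virtio-iommu.
# Not spec text: just the state transitions under discussion.

class VirtioIommu:
    def __init__(self, initial_bypass):
        self.initial_bypass = initial_bypass  # hypervisor's policy
        self.bypass = initial_bypass

    def device_reset(self):
        # Proposed: bypass SHOULD NOT change on device reset, so a
        # FW->OS handoff (OS resets the device) cannot silently
        # reopen bypass that firmware disabled.
        pass

    def system_reset(self):
        # Proposed: restored, so a rebooted OS sees the platform's
        # security policy again regardless of the previous OS.
        self.bypass = self.initial_bypass

dev = VirtioIommu(initial_bypass=1)
dev.bypass = 0       # firmware disables bypass before booting the OS
dev.device_reset()   # bypass stays 0: no window of vulnerability
dev.system_reset()   # back to the hypervisor's initial value
```

Clarifying which of these two paths a transport reset takes would answer the question either way.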

Thanks

>
> --
> MST
>



Re: [PATCH 1/5] iova: Move fast alloc size roundup into alloc_iova_fast()

2021-10-10 Thread Jason Wang
On Fri, Sep 24, 2021 at 6:07 PM John Garry  wrote:
>
> It really is a property of the IOVA rcache code that we need to alloc a
> power-of-2 size, so relocate the functionality to resize into
> alloc_iova_fast(), rather than the callsites.
>
> Signed-off-by: John Garry 

Acked-by: Jason Wang 

> ---
>  drivers/iommu/dma-iommu.c| 8 
>  drivers/iommu/iova.c | 9 +
>  drivers/vdpa/vdpa_user/iova_domain.c | 8 
>  3 files changed, 9 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 896bea04c347..a99b3445fef8 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -444,14 +444,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
>
> shift = iova_shift(iovad);
> iova_len = size >> shift;
> -   /*
> -    * Freeing non-power-of-two-sized allocations back into the IOVA caches
> -* will come back to bite us badly, so we have to waste a bit of space
> -* rounding up anything cacheable to make sure that can't happen. The
> -* order of the unadjusted size will still match upon freeing.
> -*/
> -   if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> -   iova_len = roundup_pow_of_two(iova_len);
>
> dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 9e8bc802ac05..ff567cbc42f7 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -497,6 +497,15 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
> unsigned long iova_pfn;
> struct iova *new_iova;
>
> +   /*
> +    * Freeing non-power-of-two-sized allocations back into the IOVA caches
> +* will come back to bite us badly, so we have to waste a bit of space
> +* rounding up anything cacheable to make sure that can't happen. The
> +* order of the unadjusted size will still match upon freeing.
> +*/
> +   if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +   size = roundup_pow_of_two(size);
> +
> iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
> if (iova_pfn)
> return iova_pfn;
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 1daae2608860..2b1143f11d8f 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -292,14 +292,6 @@ vduse_domain_alloc_iova(struct iova_domain *iovad,
> unsigned long iova_len = iova_align(iovad, size) >> shift;
> unsigned long iova_pfn;
>
> -   /*
> -    * Freeing non-power-of-two-sized allocations back into the IOVA caches
> -* will come back to bite us badly, so we have to waste a bit of space
> -* rounding up anything cacheable to make sure that can't happen. The
> -* order of the unadjusted size will still match upon freeing.
> -*/
> -   if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> -   iova_len = roundup_pow_of_two(iova_len);
> iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true);
>
> return iova_pfn << shift;
> --
> 2.26.2
>
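For readers skimming the hunk, the effect of the relocated rounding is easy to see in miniature. IOVA_RANGE_CACHE_MAX_SIZE is 6 in the kernel at the time of this patch, so only allocations below 2^5 = 32 pages are cacheable and get rounded; the sketch below just mirrors that arithmetic for illustration.

```python
# Mirror of the check moved into alloc_iova_fast(): cacheable sizes
# are rounded up to a power of two so that a later free always lands
# back in an rcache bucket of matching order.

IOVA_RANGE_CACHE_MAX_SIZE = 6  # kernel value at the time of the patch

def roundup_pow_of_two(n):
    # Smallest power of two >= n (n > 0), like the kernel helper.
    return 1 << (n - 1).bit_length()

def alloc_iova_fast_size(size):
    if size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)):
        size = roundup_pow_of_two(size)
    return size

sizes = {s: alloc_iova_fast_size(s) for s in (3, 5, 16, 31, 33)}
# 3 -> 4, 5 -> 8, 16 -> 16, 31 -> 32; 33 is too big for the rcache
# and passes through unchanged.
```

Moving this into alloc_iova_fast() means all three callsites (dma-iommu, iova, vduse) get the rule for free, which is the point of the patch.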



Re: [PATCH v13 02/13] eventfd: Export eventfd_wake_count to modules

2021-08-31 Thread Jason Wang


On 2021/8/31 6:36 PM, Xie Yongji wrote:

Export eventfd_wake_count so that modules can use
eventfd_signal_count() to check whether an
eventfd_signal() call should be deferred to a safe context.

Signed-off-by: Xie Yongji 



And this matches the comment inside eventfd_signal():

    /*
     * Deadlock or stack overflow issues can happen if we recurse here
     * through waitqueue wakeup handlers. If the caller users potentially
     * nested waitqueues with custom wakeup handlers, then it should
     * check eventfd_signal_count() before calling this function. If
     * it returns true, the eventfd_signal() call should be deferred to a
     * safe context.
     */


So:

Acked-by: Jason Wang 



---
  fs/eventfd.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..1b3130b8d6c1 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -26,6 +26,7 @@
  #include 
  
  DEFINE_PER_CPU(int, eventfd_wake_count);

+EXPORT_PER_CPU_SYMBOL_GPL(eventfd_wake_count);
  
  static DEFINE_IDA(eventfd_ida);
  

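The usage pattern the export enables can be sketched as follows. Names mirror the kernel's, but this is a Python simulation of the deferral logic only, not driver code; in a real module the deferred call would go through a workqueue or similar safe context.

```python
# Sketch: check eventfd_signal_count() (backed by the per-CPU
# eventfd_wake_count) and defer the signal instead of recursing
# through nested waitqueue wakeup handlers.

eventfd_wake_count = 0   # stands in for the per-CPU counter
deferred_work = []       # stands in for a workqueue

def eventfd_signal_count():
    return eventfd_wake_count > 0

def eventfd_signal(ctx):
    global eventfd_wake_count
    eventfd_wake_count += 1
    try:
        ctx["count"] += 1    # wake waiters (may run nested handlers)
    finally:
        eventfd_wake_count -= 1

def signal_safely(ctx):
    if eventfd_signal_count():
        # Already inside a wakeup handler: defer to a safe context
        # to avoid deadlock or stack overflow.
        deferred_work.append(ctx)
    else:
        eventfd_signal(ctx)

ctx = {"count": 0}
signal_safely(ctx)        # not nested: signaled directly
eventfd_wake_count = 1    # pretend we are inside a wakeup handler
signal_safely(ctx)        # deferred instead of recursing
```

Without the export, a module has no way to perform the first check, which is why the symbol is needed.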


Re: [PATCH v11 12/12] Documentation: Add documentation for VDUSE

2021-08-23 Thread Jason Wang


On 2021/8/18 8:06 PM, Xie Yongji wrote:

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 



---
  Documentation/userspace-api/index.rst |   1 +
  Documentation/userspace-api/vduse.rst | 233 ++
  2 files changed, 234 insertions(+)
  create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
 iommu
 media/index
 sysfs-platform_profile
+   vduse
  
  .. only::  subproject and html
  
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst

new file mode 100644
index ..42ef59ea5314
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,233 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace. And
+to make the device emulation more secure, the emulated vDPA device's
+control path is handled in the kernel and only the data path is
+implemented in the userspace.
+
+Note that only virtio block device is supported by VDUSE framework now,
+which can reduce security risks when the userspace process that implements
+the data path is run by an unprivileged user. The support for other device
+types can be added after the security issue of corresponding device driver
+is clarified or fixed in the future.
+
+Create/Destroy VDUSE devices
+----------------------------
+
+VDUSE devices are created as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME.
+
+3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are destroyed as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME.
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control.
+
+The netlink messages can be sent via vdpa tool in iproute2 or use the
+below sample codes:
+
+.. code-block:: c
+
+   static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+   {
+   struct nl_sock *nlsock;
+   struct nl_msg *msg;
+   int famid;
+
+   nlsock = nl_socket_alloc();
+   if (!nlsock)
+   return -ENOMEM;
+
+   if (genl_connect(nlsock))
+   goto free_sock;
+
+   famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+   if (famid < 0)
+   goto close_sock;
+
+   msg = nlmsg_alloc();
+   if (!msg)
+   goto close_sock;
+
+   if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
+   goto nla_put_failure;
+
+   NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+   if (cmd == VDPA_CMD_DEV_NEW)
+   NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+
+   if (nl_send_sync(nlsock, msg))
+   goto close_sock;
+
+   nl_close(nlsock);
+   nl_socket_free(nlsock);
+
+   return 0;
+   nla_put_failure:
+   nlmsg_free(msg);
+   close_sock:
+   nl_close(nlsock);
+   free_sock:
+   nl_socket_free(nlsock);
+   return -1;
+   }
+
+How VDUSE works
+---------------
+
+As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control. With this ioctl, userspace can specify some basic
+configuration such as device name (uniquely identify a VDUSE device),
+virtio features, virtio configuration space, the number of virtqueues
+and so on for this emulated device. Then a char device interface
+(/dev/vduse/$NAME) is exported to userspace for device emulation.
+Userspace can use the VDUSE_VQ_SETUP ioctl on /dev/vduse/$NAME to
+add per-virtqueue configuration such as the max size of virtqueue t
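The create/destroy steps documented above can be condensed into a toy state machine that makes the required ordering explicit. This performs no real ioctls or netlink calls; it only mirrors the documented sequence (create, set up virtqueues, attach via VDPA_CMD_DEV_NEW; detach via VDPA_CMD_DEV_DEL before destroying).

```python
# Toy state machine for the documented VDUSE device lifecycle.
# Purely illustrative: the real operations are ioctls on
# /dev/vduse/control and /dev/vduse/$NAME plus vdpa netlink messages.

class VduseDevice:
    def __init__(self, name):          # ioctl(VDUSE_CREATE_DEV)
        self.name = name
        self.vqs_ready = False
        self.attached = False

    def setup_vqs(self):               # ioctl(VDUSE_VQ_SETUP) per vq
        self.vqs_ready = True

    def attach(self):                  # VDPA_CMD_DEV_NEW netlink message
        if not self.vqs_ready:
            raise RuntimeError("set up virtqueues before attaching")
        self.attached = True

    def detach(self):                  # VDPA_CMD_DEV_DEL netlink message
        self.attached = False

    def destroy(self):                 # ioctl(VDUSE_DESTROY_DEV)
        if self.attached:
            raise RuntimeError("detach from the vDPA bus first")

dev = VduseDevice("vduse-blk0")        # device name is illustrative
dev.setup_vqs()
dev.attach()
dev.detach()
dev.destroy()
```

Getting the ordering wrong (attaching before VQ setup, or destroying while attached) is exactly what the two numbered lists in the document rule out.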

Re: [PATCH v11 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-08-23 Thread Jason Wang


On 2021/8/18 8:06 PM, Xie Yongji wrote:

This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.

In order to make the device emulation more secure, the device's
control path is handled in the kernel. A message mechanism is introduced
to forward some dataplane related control messages to userspace.

And in the data path, the DMA buffer will be mapped into userspace
address space through different ways depending on the vDPA bus to
which the vDPA device is attached. In virtio-vdpa case, the MMU-based
software IOTLB is used to achieve that. And in vhost-vdpa case, the
DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process by transferring the shmfd.

For more details on VDUSE design and usage, please see the follow-on
Documentation commit.

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v11 05/12] vhost-vdpa: Handle the failure of vdpa_reset()

2021-08-23 Thread Jason Wang


在 2021/8/18 下午8:06, Xie Yongji 写道:

The vdpa_reset() may fail now. This adds check to its return
value and fail the vhost_vdpa_open().

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 



---
  drivers/vhost/vdpa.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b1c91b4db0ba..d99d75ad30cc 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -116,12 +116,13 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
	irq_bypass_unregister_producer(&v->call_ctx.producer);
  }
  
-static void vhost_vdpa_reset(struct vhost_vdpa *v)

+static int vhost_vdpa_reset(struct vhost_vdpa *v)
  {
struct vdpa_device *vdpa = v->vdpa;
  
-	vdpa_reset(vdpa);

v->in_batch = 0;
+
+   return vdpa_reset(vdpa);
  }
  
  static long vhost_vdpa_get_device_id(struct vhost_vdpa *v, u8 __user *argp)

@@ -868,7 +869,9 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
return -EBUSY;
  
  	nvqs = v->nvqs;

-   vhost_vdpa_reset(v);
+   r = vhost_vdpa_reset(v);
+   if (r)
+   goto err;
  
  	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);

if (!vqs) {



Re: [PATCH v11 04/12] vdpa: Add reset callback in vdpa_config_ops

2021-08-23 Thread Jason Wang


在 2021/8/18 下午8:06, Xie Yongji 写道:

This adds a new callback to support device specific reset
behavior. The vdpa bus driver will call the reset function
instead of setting status to zero during resetting if device
driver supports the new callback.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c |  9 +++--
  include/linux/vdpa.h | 11 ++-
  2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b07aa161f7ad..b1c91b4db0ba 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
u8 status, status_old;
-   int nvqs = v->nvqs;
+   int ret, nvqs = v->nvqs;
u16 i;
  
	if (copy_from_user(&status, statusp, sizeof(status)))

@@ -172,7 +172,12 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
if (status != 0 && (ops->get_status(vdpa) & ~status) != 0)
return -EINVAL;
  
-	ops->set_status(vdpa, status);

+   if (status == 0 && ops->reset) {
+   ret = ops->reset(vdpa);
+   if (ret)
+   return ret;
+   } else
+   ops->set_status(vdpa, status);
  
  	if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !(status_old & VIRTIO_CONFIG_S_DRIVER_OK))

for (i = 0; i < nvqs; i++)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 8a645f8f4476..af7ea5ad795f 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -196,6 +196,9 @@ struct vdpa_iova_range {
   *@vdev: vdpa device
   *Returns the iova range supported by
   *the device.
+ * @reset: Reset device (optional)
+ * @vdev: vdpa device
+ * Returns integer: success (0) or error (< 0)



It looks to me we'd better make this mandatory. This helps to reduce the
confusion for the parent driver.


Thanks



   * @set_map:  Set device memory mapping (optional)
   *Needed for device that using device
   *specific DMA translation (on-chip IOMMU)
@@ -263,6 +266,7 @@ struct vdpa_config_ops {
   const void *buf, unsigned int len);
u32 (*get_generation)(struct vdpa_device *vdev);
struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev);
+   int (*reset)(struct vdpa_device *vdev);
  
  	/* DMA ops */

int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
@@ -351,12 +355,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
return vdev->dma_dev;
  }
  
-static inline void vdpa_reset(struct vdpa_device *vdev)

+static inline int vdpa_reset(struct vdpa_device *vdev)
  {
const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = false;

+   if (ops->reset)
+   return ops->reset(vdev);
+
ops->set_status(vdev, 0);
+
+   return 0;
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)



Re: [PATCH v11 03/12] vdpa: Fix some coding style issues

2021-08-23 Thread Jason Wang


在 2021/8/18 下午8:06, Xie Yongji 写道:

Fix some code indent issues and following checkpatch warning:

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
371: FILE: include/linux/vdpa.h:371:
+static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 



---
  include/linux/vdpa.h | 34 +-
  1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 954b340f6c2f..8a645f8f4476 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -43,17 +43,17 @@ struct vdpa_vq_state_split {
   * @last_used_idx: used index
   */
  struct vdpa_vq_state_packed {
-	u16	last_avail_counter:1;
-	u16	last_avail_idx:15;
-	u16	last_used_counter:1;
-	u16	last_used_idx:15;
+   u16 last_avail_counter:1;
+   u16 last_avail_idx:15;
+   u16 last_used_counter:1;
+   u16 last_used_idx:15;
  };
  
  struct vdpa_vq_state {

- union {
-  struct vdpa_vq_state_split split;
-  struct vdpa_vq_state_packed packed;
- };
+   union {
+   struct vdpa_vq_state_split split;
+   struct vdpa_vq_state_packed packed;
+   };
  };
  
  struct vdpa_mgmt_dev;

@@ -131,7 +131,7 @@ struct vdpa_iova_range {
   *@vdev: vdpa device
   *@idx: virtqueue index
   *@state: pointer to returned state (last_avail_idx)
- * @get_vq_notification:   Get the notification area for a virtqueue
+ * @get_vq_notification:   Get the notification area for a virtqueue
   *@vdev: vdpa device
   *@idx: virtqueue index
   *Returns the notifcation area
@@ -353,25 +353,25 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
  
  static inline void vdpa_reset(struct vdpa_device *vdev)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = false;

-ops->set_status(vdev, 0);
+   ops->set_status(vdev, 0);
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = true;

-return ops->set_features(vdev, features);
+   return ops->set_features(vdev, features);
  }
  
-

-static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
-  void *buf, unsigned int len)
+static inline void vdpa_get_config(struct vdpa_device *vdev,
+  unsigned int offset, void *buf,
+  unsigned int len)
  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	/*

 * Config accesses aren't supposed to trigger before features are set.



Re: [PATCH v11 01/12] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-23 Thread Jason Wang


在 2021/8/18 下午8:06, Xie Yongji 写道:

Export alloc_iova_fast() and free_iova_fast() so that
some modules can make use of the per-CPU cache to get
rid of rbtree spinlock in alloc_iova() and free_iova()
during IOVA allocation.

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 

(If we need a respin, I'd suggest putting the numbers you measured here)

Thanks



---
  drivers/iommu/iova.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b6cf5f16123b..3941ed6bb99b 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
  
  	return new_iova->pfn_lo;

  }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
  
  /**

   * free_iova_fast - free iova pfn range into rcache
@@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
  
  	free_iova(iovad, pfn);

  }
+EXPORT_SYMBOL_GPL(free_iova_fast);
  
  #define fq_ring_for_each(i, fq) \

	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)



Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-09 Thread Jason Wang


在 2021/8/9 下午1:56, Yongji Xie 写道:

On Thu, Aug 5, 2021 at 9:31 PM Jason Wang  wrote:


在 2021/8/5 下午8:34, Yongji Xie 写道:

My main point, though, is that if you've already got something else
keeping track of the actual addresses, then the way you're using an
iova_domain appears to be something you could do with a trivial bitmap
allocator. That's why I don't buy the efficiency argument. The main
design points of the IOVA allocator are to manage large address spaces
while trying to maximise spatial locality to minimise the underlying
pagetable usage, and allocating with a flexible limit to support
multiple devices with different addressing capabilities in the same
address space. If none of those aspects are relevant to the use-case -
which AFAICS appears to be true here - then as a general-purpose
resource allocator it's rubbish and has an unreasonably massive memory
overhead and there are many, many better choices.


OK, I get your point. Actually we used the genpool allocator in the
early version. Maybe we can fall back to using it.


I think maybe you can share some perf numbers to see how much
alloc_iova_fast() can help.


I did some fio tests[1] with a ram-backend vduse block device[2].

Following are some performance data:

                 numjobs=1   numjobs=2   numjobs=4   numjobs=8
iova_alloc_fast  145k iops   265k iops   514k iops   758k iops
iova_alloc       137k iops   170k iops   128k iops   113k iops
gen_pool_alloc   143k iops   270k iops   458k iops   521k iops

The iova_alloc_fast() has the best performance since we always hit the
per-cpu cache. Regardless of the per-cpu cache, the genpool allocator
should be better than the iova allocator.



I think we see convincing numbers for using alloc_iova_fast() over
gen_pool_alloc() (45% improvement at numjobs=8).


Thanks




[1] fio jobfile:

[global]
rw=randread
direct=1
ioengine=libaio
iodepth=16
time_based=1
runtime=60s
group_reporting
bs=4k
filename=/dev/vda
[job]
numjobs=..

[2]  $ qemu-storage-daemon \
   --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
   --monitor chardev=charmonitor \
   --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
   --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be built from this repo:
https://github.com/bytedance/qemu/tree/vduse-test.

Thanks,
Yongji




Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-05 Thread Jason Wang


在 2021/8/5 下午8:34, Yongji Xie 写道:

My main point, though, is that if you've already got something else
keeping track of the actual addresses, then the way you're using an
iova_domain appears to be something you could do with a trivial bitmap
allocator. That's why I don't buy the efficiency argument. The main
design points of the IOVA allocator are to manage large address spaces
while trying to maximise spatial locality to minimise the underlying
pagetable usage, and allocating with a flexible limit to support
multiple devices with different addressing capabilities in the same
address space. If none of those aspects are relevant to the use-case -
which AFAICS appears to be true here - then as a general-purpose
resource allocator it's rubbish and has an unreasonably massive memory
overhead and there are many, many better choices.


OK, I get your point. Actually we used the genpool allocator in the
early version. Maybe we can fall back to using it.



I think maybe you can share some perf numbers to see how much 
alloc_iova_fast() can help.


Thanks







Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-05 Thread Jason Wang


在 2021/8/4 下午5:07, Yongji Xie 写道:

On Wed, Aug 4, 2021 at 4:54 PM Jason Wang  wrote:


在 2021/8/4 下午4:50, Yongji Xie 写道:

On Wed, Aug 4, 2021 at 4:32 PM Jason Wang  wrote:

在 2021/8/3 下午5:38, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 4:09 PM Jason Wang  wrote:

在 2021/7/29 下午3:34, Xie Yongji 写道:

The device reset may fail in virtio-vdpa case now, so add checks to
its return value and fail the register_virtio_device().

So the reset() would be called by the driver during remove as well, or
is it sufficient to deal only with the reset during probe?


Actually there is no way to handle failure during removal. And it
should be safe with the protection of software IOTLB even if the
reset() fails.

Thanks,
Yongji

If this is true, does it mean we don't even need to care about reset
failure?


But we need to handle the failure in the vhost-vdpa case, isn't it?


Yes, but:

- This patch is for virtio not for vhost, if we don't care virtio, we
can avoid the changes
- For vhost, there could be two ways probably:

1) let the set_status to report error
2) require userspace to re-read for status

It looks to me you want to go with 1) and I'm not sure whether or not
it's too late to go with 2).


Looks like 2) can't work if reset failure happens in
vhost_vdpa_release() and vhost_vdpa_open().



Yes, you're right.

Thanks




Thanks,
Yongji




Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-04 Thread Jason Wang


在 2021/8/4 下午4:50, Yongji Xie 写道:

On Wed, Aug 4, 2021 at 4:32 PM Jason Wang  wrote:


在 2021/8/3 下午5:38, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 4:09 PM Jason Wang  wrote:

在 2021/7/29 下午3:34, Xie Yongji 写道:

The device reset may fail in virtio-vdpa case now, so add checks to
its return value and fail the register_virtio_device().

So the reset() would be called by the driver during remove as well, or
is it sufficient to deal only with the reset during probe?


Actually there is no way to handle failure during removal. And it
should be safe with the protection of software IOTLB even if the
reset() fails.

Thanks,
Yongji


If this is true, does it mean we don't even need to care about reset
failure?


But we need to handle the failure in the vhost-vdpa case, isn't it?



Yes, but:

- This patch is for virtio not for vhost, if we don't care virtio, we 
can avoid the changes

- For vhost, there could be two ways probably:

1) let the set_status to report error
2) require userspace to re-read for status

It looks to me you want to go with 1) and I'm not sure whether or not 
it's too late to go with 2).


Thanks




Thanks,
Yongji




Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure

2021-08-04 Thread Jason Wang


在 2021/8/3 下午5:50, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 4:10 PM Jason Wang  wrote:


在 2021/7/29 下午3:34, Xie Yongji 写道:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout.

Signed-off-by: Xie Yongji 
---
   drivers/vhost/vdpa.c | 11 ++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b07aa161f7ad..dd05c1e1133c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
   struct vdpa_device *vdpa = v->vdpa;
   const struct vdpa_config_ops *ops = vdpa->config;
   u8 status, status_old;
- int nvqs = v->nvqs;
+ int timeout = 0, nvqs = v->nvqs;
   u16 i;

    if (copy_from_user(&status, statusp, sizeof(status)))
@@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
   return -EINVAL;

   ops->set_status(vdpa, status);
+ if (status == 0) {
+ while (ops->get_status(vdpa)) {
+ timeout += 20;
+ if (timeout > VDPA_RESET_TIMEOUT_MS)
+ return -EIO;
+
+ msleep(20);
+ }


The spec has introduced reset as one of the basic facilities. And consider
that we defer reset here.

This makes me think if it's better to introduce a dedicated vdpa ops for
reset?


Do you mean replace the ops.set_status(vdev, 0) with the ops.reset()?
Then we can remove the timeout processing which is device specific
stuff.



Exactly.

Thanks




Thanks,
Yongji




Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-04 Thread Jason Wang


在 2021/8/3 下午5:38, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 4:09 PM Jason Wang  wrote:


在 2021/7/29 下午3:34, Xie Yongji 写道:

The device reset may fail in virtio-vdpa case now, so add checks to
its return value and fail the register_virtio_device().


So the reset() would be called by the driver during remove as well, or
is it sufficient to deal only with the reset during probe?


Actually there is no way to handle failure during removal. And it
should be safe with the protection of software IOTLB even if the
reset() fails.

Thanks,
Yongji



If this is true, does it mean we don't even need to care about reset 
failure?


Thanks



Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero

2021-08-04 Thread Jason Wang


在 2021/8/3 下午5:31, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 3:58 PM Jason Wang  wrote:


在 2021/7/29 下午3:34, Xie Yongji 写道:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vdpa_reset() after timeout.

Signed-off-by: Xie Yongji 
---
   include/linux/vdpa.h | 15 ++-
   1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 406d53a606ac..d1a80ef05089 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -6,6 +6,7 @@
   #include 
   #include 
   #include 
+#include <linux/delay.h>

   /**
* struct vdpa_calllback - vDPA callback definition.
@@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
   return vdev->dma_dev;
   }

-static inline void vdpa_reset(struct vdpa_device *vdev)
+#define VDPA_RESET_TIMEOUT_MS 1000
+
+static inline int vdpa_reset(struct vdpa_device *vdev)
   {
   const struct vdpa_config_ops *ops = vdev->config;
+ int timeout = 0;

   vdev->features_valid = false;
   ops->set_status(vdev, 0);
+ while (ops->get_status(vdev)) {
+ timeout += 20;
+ if (timeout > VDPA_RESET_TIMEOUT_MS)
+ return -EIO;
+
+ msleep(20);
+ }


I wonder if it's better to do this in the vDPA parent?

Thanks


Sorry, I didn't get you here. Do you mean vDPA parent driver (e.g.
VDUSE)?



Yes, since how it's expected to behave depends on the specific hardware.

Even for the spec, the behavior is transport specific:

PCI: requires reread until 0
MMIO: doesn't require it, but it might not work for the hardware, so we
decided to change it

CCW: the success of the CCW command means the success of the reset

Thanks



Actually I didn't find any other place where I can do
set_status() and get_status().

Thanks,
Yongji




Re: [PATCH v10 02/17] file: Export receive_fd() to modules

2021-08-04 Thread Jason Wang


在 2021/8/3 下午5:01, Yongji Xie 写道:

On Tue, Aug 3, 2021 at 3:46 PM Jason Wang  wrote:


在 2021/7/29 下午3:34, Xie Yongji 写道:

Export receive_fd() so that some modules can use
it to pass file descriptors between processes without
missing any security checks.

Signed-off-by: Xie Yongji 
---
   fs/file.c| 6 ++
   include/linux/file.h | 7 +++
   2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 86dc9956af32..210e540672aa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
   return new_fd;
   }

+int receive_fd(struct file *file, unsigned int o_flags)
+{
+ return __receive_fd(file, NULL, o_flags);


Any reason that receive_fd_user() can live in file.h?


Since no modules use it.

Thanks,
Yongji



Ok.


Acked-by: Jason Wang 







Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b07aa161f7ad..dd05c1e1133c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
u8 status, status_old;
-   int nvqs = v->nvqs;
+   int timeout = 0, nvqs = v->nvqs;
u16 i;
  
	if (copy_from_user(&status, statusp, sizeof(status)))

@@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp)
return -EINVAL;
  
  	ops->set_status(vdpa, status);

+   if (status == 0) {
+   while (ops->get_status(vdpa)) {
+   timeout += 20;
+   if (timeout > VDPA_RESET_TIMEOUT_MS)
+   return -EIO;
+
+   msleep(20);
+   }



The spec has introduced reset as one of the basic facilities. And consider
that we defer reset here.


This makes me think if it's better to introduce a dedicated vdpa ops for 
reset?


Thanks



+   }
  
  	if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !(status_old & VIRTIO_CONFIG_S_DRIVER_OK))

for (i = 0; i < nvqs; i++)



Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

The device reset may fail in virtio-vdpa case now, so add checks to
its return value and fail the register_virtio_device().



So the reset() would be called by the driver during remove as well, or 
is it sufficient to deal only with the reset during probe?


Thanks




Signed-off-by: Xie Yongji 
---
  drivers/virtio/virtio.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index a15beb6b593b..8df75425fb43 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -349,7 +349,9 @@ int register_virtio_device(struct virtio_device *dev)
  
  	/* We always start by resetting the device, in case a previous

 * driver messed it up.  This also tests that code path a little. */
-   dev->config->reset(dev);
+   err = dev->config->reset(dev);
+   if (err)
+   goto err_reset;
  
  	/* Acknowledge that we've seen the device. */

virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
@@ -362,10 +364,13 @@ int register_virtio_device(struct virtio_device *dev)
 */
	err = device_add(&dev->dev);
if (err)
-   ida_simple_remove(&virtio_index_ida, dev->index);
-out:
-   if (err)
-   virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+   goto err_add;
+
+   return 0;
+err_add:
+   virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+err_reset:
+   ida_simple_remove(&virtio_index_ida, dev->index);
return err;
  }
  EXPORT_SYMBOL_GPL(register_virtio_device);



Re: [PATCH v10 07/17] virtio: Don't set FAILED status bit on device index allocation failure

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

We don't need to set the FAILED status bit on device index allocation
failure since device initialization hasn't been started yet.
This doesn't affect runtime; found in code review.

Signed-off-by: Xie Yongji 



Does it really harm?

Thanks



---
  drivers/virtio/virtio.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 4b15c00c0a0a..a15beb6b593b 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -338,7 +338,7 @@ int register_virtio_device(struct virtio_device *dev)
/* Assign a unique device index and hence name. */
	err = ida_simple_get(&virtio_index_ida, 0, 0, GFP_KERNEL);
if (err < 0)
-   goto out;
+   return err;
  
  	dev->index = err;

dev_set_name(>dev, "virtio%u", dev->index);



Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vdpa_reset() after timeout.

Signed-off-by: Xie Yongji 
---
  include/linux/vdpa.h | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 406d53a606ac..d1a80ef05089 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -6,6 +6,7 @@
  #include 
  #include 
  #include 
+#include <linux/delay.h>
  
  /**

   * struct vdpa_calllback - vDPA callback definition.
@@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
return vdev->dma_dev;
  }
  
-static inline void vdpa_reset(struct vdpa_device *vdev)

+#define VDPA_RESET_TIMEOUT_MS 1000
+
+static inline int vdpa_reset(struct vdpa_device *vdev)
  {
const struct vdpa_config_ops *ops = vdev->config;
+   int timeout = 0;
  
  	vdev->features_valid = false;

ops->set_status(vdev, 0);
+   while (ops->get_status(vdev)) {
+   timeout += 20;
+   if (timeout > VDPA_RESET_TIMEOUT_MS)
+   return -EIO;
+
+   msleep(20);
+   }



I wonder if it's better to do this in the vDPA parent?

Thanks



+
+   return 0;
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)



Re: [PATCH v10 03/17] vdpa: Fix code indentation

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Use tabs to indent the code instead of spaces.

Signed-off-by: Xie Yongji 
---
  include/linux/vdpa.h | 29 ++---
  1 file changed, 14 insertions(+), 15 deletions(-)



It looks to me that not all the warnings are addressed.

Or did you silence checkpatch.pl -f?

Thanks




diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 7c49bc5a2b71..406d53a606ac 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -43,17 +43,17 @@ struct vdpa_vq_state_split {
   * @last_used_idx: used index
   */
  struct vdpa_vq_state_packed {
-	u16	last_avail_counter:1;
-	u16	last_avail_idx:15;
-	u16	last_used_counter:1;
-	u16	last_used_idx:15;
+   u16 last_avail_counter:1;
+   u16 last_avail_idx:15;
+   u16 last_used_counter:1;
+   u16 last_used_idx:15;
  };
  
  struct vdpa_vq_state {

- union {
-  struct vdpa_vq_state_split split;
-  struct vdpa_vq_state_packed packed;
- };
+   union {
+   struct vdpa_vq_state_split split;
+   struct vdpa_vq_state_packed packed;
+   };
  };
  
  struct vdpa_mgmt_dev;

@@ -131,7 +131,7 @@ struct vdpa_iova_range {
   *@vdev: vdpa device
   *@idx: virtqueue index
   *@state: pointer to returned state (last_avail_idx)
- * @get_vq_notification:   Get the notification area for a virtqueue
+ * @get_vq_notification:   Get the notification area for a virtqueue
   *@vdev: vdpa device
   *@idx: virtqueue index
   *Returns the notifcation area
@@ -342,25 +342,24 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
  
  static inline void vdpa_reset(struct vdpa_device *vdev)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = false;

-ops->set_status(vdev, 0);
+   ops->set_status(vdev, 0);
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = true;

-return ops->set_features(vdev, features);
+   return ops->set_features(vdev, features);
  }
  
-

  static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
   void *buf, unsigned int len)
  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	/*

 * Config accesses aren't supposed to trigger before features are set.



Re: [PATCH v10 02/17] file: Export receive_fd() to modules

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Export receive_fd() so that some modules can use
it to pass file descriptors between processes without
missing any security checks.

Signed-off-by: Xie Yongji 
---
  fs/file.c| 6 ++
  include/linux/file.h | 7 +++
  2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 86dc9956af32..210e540672aa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
return new_fd;
  }
  
+int receive_fd(struct file *file, unsigned int o_flags)

+{
+   return __receive_fd(file, NULL, o_flags);



Any reason that receive_fd_user() can live in file.h?

Thanks



+}
+EXPORT_SYMBOL_GPL(receive_fd);
+
  static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
  {
int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 2de2e4613d7b..51e830b4fe3a 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
  
  extern int __receive_fd(struct file *file, int __user *ufd,

unsigned int o_flags);
+
+extern int receive_fd(struct file *file, unsigned int o_flags);
+
  static inline int receive_fd_user(struct file *file, int __user *ufd,
  unsigned int o_flags)
  {
@@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int 
__user *ufd,
return -EFAULT;
return __receive_fd(file, ufd, o_flags);
  }
-static inline int receive_fd(struct file *file, unsigned int o_flags)
-{
-   return __receive_fd(file, NULL, o_flags);
-}
  int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
  
  extern void flush_delayed_fput(void);



Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-03 Thread Jason Wang


On 2021/7/29 3:34 PM, Xie Yongji wrote:

Export alloc_iova_fast() and free_iova_fast() so that
some modules can use them to improve iova allocation efficiency.



It's better to explain why alloc_iova() is not sufficient here.

Thanks




Signed-off-by: Xie Yongji 
---
  drivers/iommu/iova.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b6cf5f16123b..3941ed6bb99b 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long 
size,
  
  	return new_iova->pfn_lo;

  }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
  
  /**

   * free_iova_fast - free iova pfn range into rcache
@@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long 
pfn, unsigned long size)
  
  	free_iova(iovad, pfn);

  }
+EXPORT_SYMBOL_GPL(free_iova_fast);
  
  #define fq_ring_for_each(i, fq) \

for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % 
IOVA_FQ_SIZE)




Re: [PATCH v10 17/17] Documentation: Add documentation for VDUSE

2021-08-03 Thread Jason Wang


On 2021/7/29 3:35 PM, Xie Yongji wrote:

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/index.rst |   1 +
  Documentation/userspace-api/vduse.rst | 232 ++
  2 files changed, 233 insertions(+)
  create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
 iommu
 media/index
 sysfs-platform_profile
+   vduse
  
  .. only::  subproject and html
  
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst

new file mode 100644
index ..30c9d1482126
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,232 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+A vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications, together with a
+vendor-specific control path. vDPA devices can be either physically
+located on the hardware or emulated by software. VDUSE is a framework
+that makes it possible to implement software-emulated vDPA devices in
+userspace. To make the device emulation more secure, the emulated vDPA
+device's control path is handled in the kernel and only the data path
+is implemented in userspace.
+
+Note that only the virtio block device is supported by the VDUSE framework
+now, which can reduce security risks when the userspace process that
+implements the data path runs as an unprivileged user. Support for other
+device types can be added after the security issues of the corresponding
+device drivers are clarified or fixed in the future.
+
+Create/Destroy VDUSE devices
+
+
+VDUSE devices are created as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME.
+
+3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are destroyed as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME.
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control.
+
+The netlink messages can be sent via vdpa tool in iproute2 or use the
+below sample codes:
+
+.. code-block:: c
+
+   static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+   {
+   struct nl_sock *nlsock;
+   struct nl_msg *msg;
+   int famid;
+
+   nlsock = nl_socket_alloc();
+   if (!nlsock)
+   return -ENOMEM;
+
+   if (genl_connect(nlsock))
+   goto free_sock;
+
+   famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+   if (famid < 0)
+   goto close_sock;
+
+   msg = nlmsg_alloc();
+   if (!msg)
+   goto close_sock;
+
+   if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
cmd, 0))
+   goto nla_put_failure;
+
+   NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+   if (cmd == VDPA_CMD_DEV_NEW)
+   NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
"vduse");
+
+   if (nl_send_sync(nlsock, msg))
+   goto close_sock;
+
+   nl_close(nlsock);
+   nl_socket_free(nlsock);
+
+   return 0;
+   nla_put_failure:
+   nlmsg_free(msg);
+   close_sock:
+   nl_close(nlsock);
+   free_sock:
+   nl_socket_free(nlsock);
+   return -1;
+   }
+
+How VDUSE works
+---
+
+As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control. With this ioctl, userspace can specify some basic 
configuration
+such as device name (uniquely identify a VDUSE device), virtio features, virtio
+configuration space, the number of virtqueues and so on for this emulated 
device.
+Then a char device interface (/dev/vduse/$NAME) is exported to userspace for 
device
+emulation. Userspace can use the VDUSE_VQ_SETUP ioctl on /dev/vduse/$NAME to
+add per-virtqueue configuration such as the max size of virtqueue to the 
device.
+
+After the initialization, the VDUSE device can 

Re: [PATCH v10 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-08-03 Thread Jason Wang


On 2021/7/29 3:35 PM, Xie Yongji wrote:

This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.

In order to make the device emulation more secure, the device's
control path is handled in the kernel. A message mechanism is introduced
to forward some dataplane related control messages to userspace.

And in the data path, the DMA buffer will be mapped into userspace
address space through different ways depending on the vDPA bus to
which the vDPA device is attached. In virtio-vdpa case, the MMU-based
software IOTLB is used to achieve that. And in vhost-vdpa case, the
DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process by transferring the shmfd.

For more details on VDUSE design and usage, please see the follow-on
Documentation commit.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |   10 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/vduse_dev.c | 1541 
  include/uapi/linux/vduse.h |  220 +++
  6 files changed, 1778 insertions(+)
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 1409e40e6345..293ca3aef358 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#Include File  
 Comments
  'z'   10-4F  drivers/s390/crypto/zcrypt_api.hconflict!
  '|'   00-7F  linux/media.h
  0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
  0x89  00-06  arch/x86/include/asm/sockios.h
  0x89  0B-DF  linux/sockios.h
  0x89  E0-EF  linux/sockios.h 
SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
  vDPA block device simulator which terminates IO request in a
  memory buffer.
  
+config VDPA_USER

+   tristate "VDUSE (vDPA Device in Userspace) support"
+   depends on EVENTFD && MMU && HAS_DMA
+   select DMA_OPS
+   select VHOST_IOTLB
+   select IOMMU_IOVA
+   help
+ With VDUSE it is possible to emulate a vDPA Device
+ in a userspace program.
+
  config IFCVF
tristate "Intel IFC VF vDPA driver"
depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-$(CONFIG_VDPA) += vdpa.o
  obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
  obj-$(CONFIG_IFCVF)+= ifcvf/
  obj-$(CONFIG_MLX5_VDPA) += mlx5/
  obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..6addc62e7de6
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1541 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30



I think we need to make this a module parameter. 0 probably means we
wait forever.


This can help in the case when the userspace process is being debugged
with GDB. If Michael is still not happy, we can find another solution
(e.g. only offload the datapath).


Other looks good.

Thanks



Re: [PATCH v9 17/17] Documentation: Add documentation for VDUSE

2021-07-14 Thread Jason Wang


On 2021/7/13 4:46 PM, Xie Yongji wrote:

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/index.rst |   1 +
  Documentation/userspace-api/vduse.rst | 248 ++
  2 files changed, 249 insertions(+)
  create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
 iommu
 media/index
 sysfs-platform_profile
+   vduse
  
  .. only::  subproject and html
  
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst

new file mode 100644
index ..2c0d56d4b2da
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,248 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+A vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications, together with a
+vendor-specific control path. vDPA devices can be either physically
+located on the hardware or emulated by software. VDUSE is a framework
+that makes it possible to implement software-emulated vDPA devices in
+userspace. To make the device emulation more secure, the emulated vDPA
+device's control path is handled in the kernel and only the data path
+is implemented in userspace.
+
+Note that only the virtio block device is supported by the VDUSE framework
+now, which can reduce security risks when the userspace process that
+implements the data path runs as an unprivileged user. Support for other
+device types can be added after the security issues of the corresponding
+device drivers are clarified or fixed in the future.
+
+Start/Stop VDUSE devices
+
+
+VDUSE devices are started as follows:



I'm not a native speaker, but "created" is probably better.



+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME.
+
+3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.



I think 4 should be done before 3?



+
+VDUSE devices are stopped as follows:



"removed" or "destroyed" is better than "stopped" here.



+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME.
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control.
+
+The netlink messages can be sent via vdpa tool in iproute2 or use the
+below sample codes:
+
+.. code-block:: c
+
+   static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+   {
+   struct nl_sock *nlsock;
+   struct nl_msg *msg;
+   int famid;
+
+   nlsock = nl_socket_alloc();
+   if (!nlsock)
+   return -ENOMEM;
+
+   if (genl_connect(nlsock))
+   goto free_sock;
+
+   famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+   if (famid < 0)
+   goto close_sock;
+
+   msg = nlmsg_alloc();
+   if (!msg)
+   goto close_sock;
+
+   if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
cmd, 0))
+   goto nla_put_failure;
+
+   NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+   if (cmd == VDPA_CMD_DEV_NEW)
+   NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
"vduse");
+
+   if (nl_send_sync(nlsock, msg))
+   goto close_sock;
+
+   nl_close(nlsock);
+   nl_socket_free(nlsock);
+
+   return 0;
+   nla_put_failure:
+   nlmsg_free(msg);
+   close_sock:
+   nl_close(nlsock);
+   free_sock:
+   nl_socket_free(nlsock);
+   return -1;
+   }
+
+How VDUSE works
+---
+
+As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control. With this ioctl, userspace can specify some basic 
configuration
+such as device name (uniquely identify a VDUSE device), virtio features, virtio
+configuration space, bounce buffer size



This bounce buffer size looks questionable. We'd better not expose any 
implementation details to userspace.


I think we can simply start with a module parameter for VDUSE?



  

Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-14 Thread Jason Wang


On 2021/7/15 12:03 PM, Yongji Xie wrote:

Which ioctl can be used for this?


I mean we can introduce a new ioctl for that in the future.



Ok, I see.





I wonder if it's better to do something similar to ccw:

1) requires the userspace to update the status bit in the response
2) update the dev->status to the status in the response if no timeout

Then userspace could then set NEEDS_RESET if necessary.


But NEEDS_RESET does not only happen in this case.

Yes, so you need an ioctl for userspace to update the device status
(NEEDS_RESET) probably.


Yes, but I'd like to do that after the initial version is merged since
NEEDS_RESET is not supported now.



Right.

Thanks







Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()

2021-07-14 Thread Jason Wang


On 2021/7/14 5:57 PM, Dan Carpenter wrote:

On Wed, Jul 14, 2021 at 05:41:54PM +0800, Jason Wang wrote:

On 2021/7/14 4:05 PM, Dan Carpenter wrote:

On Wed, Jul 14, 2021 at 10:14:32AM +0800, Jason Wang wrote:

On 2021/7/13 7:31 PM, Dan Carpenter wrote:

On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote:

@@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 
iova, u64 size)
}
}
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
-  struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+u64 iova, u64 size, u64 uaddr, u32 perm)
{
struct vhost_dev *dev = &v->vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
unsigned int gup_flags = FOLL_LONGTERM;
unsigned long npages, cur_base, map_pfn, last_pfn = 0;
unsigned long lock_limit, sz2pin, nchunks, i;
-   u64 iova = msg->iova;
+   u64 start = iova;
long pinned;
int ret = 0;
-   if (msg->iova < v->range.first ||
-   msg->iova + msg->size - 1 > v->range.last)
-   return -EINVAL;

This is not related to your patch, but can the "msg->iova + msg->size"
addition can have an integer overflow.  From looking at the callers it
seems like it can.  msg comes from:
 vhost_chr_write_iter()
 --> dev->msg_handler(dev, &msg);
 --> vhost_vdpa_process_iotlb_msg()
--> vhost_vdpa_process_iotlb_update()

Yes.



If I'm thinking of the right thing then these are allowed to overflow to
0 because of the " - 1" but not further than that.  I believe the check
needs to be something like:

if (msg->iova < v->range.first ||
msg->iova - 1 > U64_MAX - msg->size ||

I guess we don't need - 1 here?

The - 1 is important.  The highest address is 0xffffffffffffffff.  So it goes
start + size = 0 and then start + size - 1 == 0xffffffffffffffff.


Right, so actually

msg->iova = 0xfffffffffffffffe, msg->size=2 is valid.

I believe so, yes.  It's inclusive of 0xfffffffffffffffe and 0xffffffffffffffff.
(Not an expert).



I think so, and we probably need to fix vhost_overflow() as well which did:

static bool vhost_overflow(u64 uaddr, u64 size)
{
    /* Make sure 64 bit math will not overflow. */
    return uaddr > ULONG_MAX || size > ULONG_MAX || uaddr > 
ULONG_MAX - size;

}

Thanks




regards,
dan carpenter




Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()

2021-07-14 Thread Jason Wang


On 2021/7/14 4:05 PM, Dan Carpenter wrote:

On Wed, Jul 14, 2021 at 10:14:32AM +0800, Jason Wang wrote:

On 2021/7/13 7:31 PM, Dan Carpenter wrote:

On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote:

@@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 
iova, u64 size)
}
   }
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
-  struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+u64 iova, u64 size, u64 uaddr, u32 perm)
   {
struct vhost_dev *dev = &v->vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
unsigned int gup_flags = FOLL_LONGTERM;
unsigned long npages, cur_base, map_pfn, last_pfn = 0;
unsigned long lock_limit, sz2pin, nchunks, i;
-   u64 iova = msg->iova;
+   u64 start = iova;
long pinned;
int ret = 0;
-   if (msg->iova < v->range.first ||
-   msg->iova + msg->size - 1 > v->range.last)
-   return -EINVAL;

This is not related to your patch, but can the "msg->iova + msg->size"
addition can have an integer overflow.  From looking at the callers it
seems like it can.  msg comes from:
vhost_chr_write_iter()
--> dev->msg_handler(dev, &msg);
--> vhost_vdpa_process_iotlb_msg()
   --> vhost_vdpa_process_iotlb_update()


Yes.



If I'm thinking of the right thing then these are allowed to overflow to
0 because of the " - 1" but not further than that.  I believe the check
needs to be something like:

if (msg->iova < v->range.first ||
msg->iova - 1 > U64_MAX - msg->size ||


I guess we don't need - 1 here?

The - 1 is important.  The highest address is 0xffffffffffffffff.  So it goes
start + size = 0 and then start + size - 1 == 0xffffffffffffffff.



Right, so actually

msg->iova = 0xfffffffffffffffe, msg->size=2 is valid.

Thanks




I guess we could move the - 1 to the other side?

msg->iova > U64_MAX - msg->size + 1 ||

regards,
dan carpenter





Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-14 Thread Jason Wang


On 2021/7/14 2:49 PM, Yongji Xie wrote:

On Wed, Jul 14, 2021 at 1:45 PM Jason Wang  wrote:


On 2021/7/13 4:46 PM, Xie Yongji wrote:

This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.

In order to make the device emulation more secure, the device's
control path is handled in the kernel. A message mechanism is introduced
to forward some dataplane related control messages to userspace.

And in the data path, the DMA buffer will be mapped into userspace
address space through different ways depending on the vDPA bus to
which the vDPA device is attached. In virtio-vdpa case, the MMU-based
IOMMU driver is used to achieve that. And in vhost-vdpa case, the
DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process by transferring the shmfd.

For more details on VDUSE design and usage, please see the follow-on
Documentation commit.

Signed-off-by: Xie Yongji 
---
   Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
   drivers/vdpa/Kconfig   |   10 +
   drivers/vdpa/Makefile  |1 +
   drivers/vdpa/vdpa_user/Makefile|5 +
   drivers/vdpa/vdpa_user/vduse_dev.c | 1502 

   include/uapi/linux/vduse.h |  221 +++
   6 files changed, 1740 insertions(+)
   create mode 100644 drivers/vdpa/vdpa_user/Makefile
   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
   create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 1409e40e6345..293ca3aef358 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#Include File  
 Comments
   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h
conflict!
   '|'   00-7F  linux/media.h
   0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
   0x89  00-06  arch/x86/include/asm/sockios.h
   0x89  0B-DF  linux/sockios.h
   0x89  E0-EF  linux/sockios.h 
SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
 vDPA block device simulator which terminates IO request in a
 memory buffer.

+config VDPA_USER
+ tristate "VDUSE (vDPA Device in Userspace) support"
+ depends on EVENTFD && MMU && HAS_DMA
+ select DMA_OPS
+ select VHOST_IOTLB
+ select IOMMU_IOVA
+ help
+   With VDUSE it is possible to emulate a vDPA Device
+   in a userspace program.
+
   config IFCVF
   tristate "Intel IFC VF vDPA driver"
   depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
   # SPDX-License-Identifier: GPL-2.0
   obj-$(CONFIG_VDPA) += vdpa.o
   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
   obj-$(CONFIG_IFCVF)+= ifcvf/
   obj-$(CONFIG_MLX5_VDPA) += mlx5/
   obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..c994a4a4660c
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+ u16 index;
+ u16 num_max;
+ u32 num;
+ u64 desc_addr;
+ u64 driver_addr;
+  

Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-14 Thread Jason Wang


On 2021/7/14 2:47 PM, Greg KH wrote:

On Wed, Jul 14, 2021 at 02:02:50PM +0800, Jason Wang wrote:

On 2021/7/14 1:54 PM, Michael S. Tsirkin wrote:

On Wed, Jul 14, 2021 at 01:45:39PM +0800, Jason Wang wrote:

+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+ struct vduse_dev_msg *msg)
+{
+   int ret;
+
+   init_waitqueue_head(&msg->waitq);
+   spin_lock(&dev->msg_lock);
+   msg->req.request_id = dev->msg_unique++;
+   vduse_enqueue_msg(&dev->send_list, msg);
+   wake_up(&dev->waitq);
+   spin_unlock(&dev->msg_lock);
+
+   wait_event_killable_timeout(msg->waitq, msg->completed,
+   VDUSE_REQUEST_TIMEOUT * HZ);
+   spin_lock(&dev->msg_lock);
+   if (!msg->completed) {
+   list_del(&msg->list);
+   msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+   }
+   ret = (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;

I think we should mark the device as malfunctioning when there is a timeout and
forbid any userspace operations except destroy afterwards, for safety.

This looks like if one tried to run gdb on the program the behaviour
will change completely because the kernel wants it to respond within a
specific time. Looks like a recipe for heisenbugs.

Let's not build interfaces with arbitrary timeouts like that.
Interruptible wait exists for this very reason.


The problem is: do we want a userspace program like modprobe to be stuck for
an indefinite time and expect the administrator to kill it?

Why would modprobe be stuck forever?

Is this on the module probe path?



Yes, it is called in the device probing path, where the kernel forwards 
the device configuration request to userspace and waits for its response.


If it turns out to be tricky, we can implement the whole device inside 
the kernel and leave only the datapath in userspace (as TUN does).


Thanks







Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-14 Thread Jason Wang


On 2021/7/14 1:54 PM, Michael S. Tsirkin wrote:

On Wed, Jul 14, 2021 at 01:45:39PM +0800, Jason Wang wrote:

+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+ struct vduse_dev_msg *msg)
+{
+   int ret;
+
+   init_waitqueue_head(&msg->waitq);
+   spin_lock(&dev->msg_lock);
+   msg->req.request_id = dev->msg_unique++;
+   vduse_enqueue_msg(&dev->send_list, msg);
+   wake_up(&dev->waitq);
+   spin_unlock(&dev->msg_lock);
+
+   wait_event_killable_timeout(msg->waitq, msg->completed,
+   VDUSE_REQUEST_TIMEOUT * HZ);
+   spin_lock(&dev->msg_lock);
+   if (!msg->completed) {
+   list_del(&msg->list);
+   msg->resp.result = VDUSE_REQ_RESULT_FAILED;
+   }
+   ret = (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;


I think we should mark the device as malfunctioning when there is a timeout and
forbid any userspace operations except destroy afterwards, for safety.

This looks like if one tried to run gdb on the program the behaviour
will change completely because the kernel wants it to respond within a
specific time. Looks like a recipe for heisenbugs.

Let's not build interfaces with arbitrary timeouts like that.
Interruptible wait exists for this very reason.



The problem is: do we want a userspace program like modprobe to be stuck 
for an indefinite time and expect the administrator to kill it?


Thanks



  Let userspace have its
own code to set and use timers. This does mean that userspace will
likely have to change a bit to support this driver, such is life.




Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-13 Thread Jason Wang


On 2021/7/13 4:46 PM, Xie Yongji wrote:

This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.

In order to make the device emulation more secure, the device's
control path is handled in the kernel. A message mechanism is introduced
to forward some dataplane related control messages to userspace.

And in the data path, the DMA buffer will be mapped into userspace
address space through different ways depending on the vDPA bus to
which the vDPA device is attached. In virtio-vdpa case, the MMU-based
IOMMU driver is used to achieve that. And in vhost-vdpa case, the
DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process by transferring the shmfd.

For more details on VDUSE design and usage, please see the follow-on
Documentation commit.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |   10 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/vduse_dev.c | 1502 
  include/uapi/linux/vduse.h |  221 +++
  6 files changed, 1740 insertions(+)
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 1409e40e6345..293ca3aef358 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#Include File  
 Comments
  'z'   10-4F  drivers/s390/crypto/zcrypt_api.hconflict!
  '|'   00-7F  linux/media.h
  0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
  0x89  00-06  arch/x86/include/asm/sockios.h
  0x89  0B-DF  linux/sockios.h
  0x89  E0-EF  linux/sockios.h 
SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
  vDPA block device simulator which terminates IO request in a
  memory buffer.
  
+config VDPA_USER

+   tristate "VDUSE (vDPA Device in Userspace) support"
+   depends on EVENTFD && MMU && HAS_DMA
+   select DMA_OPS
+   select VHOST_IOTLB
+   select IOMMU_IOVA
+   help
+ With VDUSE it is possible to emulate a vDPA Device
+ in a userspace program.
+
  config IFCVF
tristate "Intel IFC VF vDPA driver"
depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-$(CONFIG_VDPA) += vdpa.o
  obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
  obj-$(CONFIG_IFCVF)+= ifcvf/
  obj-$(CONFIG_MLX5_VDPA) += mlx5/
  obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..c994a4a4660c
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+   u16 index;
+   u16 num_max;
+   u32 num;
+   u64 desc_addr;
+   u64 driver_addr;
+   u64 device_addr;
+   struct vdpa_vq_state state;
+   bool ready;
+   bool kicked;
+   spinlock_t kick_lock;
+   spinlock_t irq_lock;
+   

Re: [PATCH v9 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-13 Thread Jason Wang


On 2021/7/13 9:27 PM, Dan Carpenter wrote:

On Tue, Jul 13, 2021 at 04:46:55PM +0800, Xie Yongji wrote:

+static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
+{
+   struct vduse_vdpa *vdev;
+   int ret;
+
+   if (dev->vdev)
+   return -EEXIST;
+
+   vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
+_vdpa_config_ops, name, true);
+   if (!vdev)
+   return -ENOMEM;

This should be an IS_ERR() check instead of a NULL check.



Yes.




The vdpa_alloc_device() macro is doing something very complicated but
I'm not sure what.  It calls container_of() and that looks buggy until
you spot the BUILD_BUG_ON_ZERO() compile time assert which ensures that
the container_of() is a no-op.

Only one of the callers checks for error pointers correctly so maybe
it's too complicated or maybe there should be better documentation.



We need better documentation for this macro and fix all the buggy callers.

Yong Ji, want to do that?

Thanks




regards,
dan carpenter




Re: [PATCH v9 13/17] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()

2021-07-13 Thread Jason Wang


On 2021/7/13 7:31 PM, Dan Carpenter wrote:

On Tue, Jul 13, 2021 at 04:46:52PM +0800, Xie Yongji wrote:

@@ -613,37 +618,28 @@ static void vhost_vdpa_unmap(struct vhost_vdpa *v, u64 
iova, u64 size)
}
  }
  
-static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,

-  struct vhost_iotlb_msg *msg)
+static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
+u64 iova, u64 size, u64 uaddr, u32 perm)
  {
struct vhost_dev *dev = >vdev;
-   struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
unsigned int gup_flags = FOLL_LONGTERM;
unsigned long npages, cur_base, map_pfn, last_pfn = 0;
unsigned long lock_limit, sz2pin, nchunks, i;
-   u64 iova = msg->iova;
+   u64 start = iova;
long pinned;
int ret = 0;
  
-	if (msg->iova < v->range.first ||

-   msg->iova + msg->size - 1 > v->range.last)
-   return -EINVAL;

This is not related to your patch, but can the "msg->iova + msg->size"
addition have an integer overflow?  From looking at the callers it
seems like it can.  msg comes from:
   vhost_chr_write_iter()
   --> dev->msg_handler(dev, );
   --> vhost_vdpa_process_iotlb_msg()
  --> vhost_vdpa_process_iotlb_update()



Yes.




If I'm thinking of the right thing then these are allowed to overflow to
0 because of the " - 1" but not further than that.  I believe the check
needs to be something like:

if (msg->iova < v->range.first ||
msg->iova - 1 > U64_MAX - msg->size ||



I guess we don't need - 1 here?

Thanks



msg->iova + msg->size - 1 > v->range.last)

But writing integer overflow checks correctly is notoriously difficult.
Do you think you could send a fix for that which is separate from the
patcheset?  We'd want to backport it to stable.

regards,
dan carpenter




Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


On 2021/7/7 11:54 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:

On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagation. And it would be very hard
to implement such semantics in some transports. A virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or initialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption that get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.

virtio_cread() can sleep:

 #define virtio_cread(vdev, structname, member, ptr) \
 do {\
 typeof(((structname*)0)->member) virtio_cread_v;\
 \
 might_sleep();  \
 ^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depend on the config write.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.

Probably not, the userspace device can tell the kernel about the device_features
and mandated_features during creation, and the feature negotiation could be
done purely in the kernel.

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagation. And it would be very hard
to implement such semantics in some transports. A virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or initialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption that get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.

virtio_cread() can sleep:

#define virtio_cread(vdev, structname, member, ptr) \
do {\
typeof(((structname*)0)->member) virtio_cread_v;\
\
might_sleep();  \
^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depend on the config write.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.


Probably not, userspace device can tell the kernel about the device_features
and mandated_features during creation, and the feature negotiation could be
done purely in the kernel.

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Jason Wang


On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.


Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagation. And it would be very hard
to implement such semantics in some transports. A virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or initialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption that get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?



Because:

1) Kernel can not wait forever in get_config(), this is the major 
difference with vhost-user.
2) Stick to the current assumption that virtio_cread() should always 
succeed.


Thanks




Here are the vhost-user protocol messages:

   Virtio device config space
   ^^

   ++--+---+-+
   | offset | size | flags | payload |
   ++--+---+-+

   :offset: a 32-bit offset of virtio device's configuration space

   :size: a 32-bit configuration space access size in bytes

   :flags: a 32-bit value:
 - 0: Vhost master messages used for writeable fields
 - 1: Vhost master messages used for live migration

   :payload: Size bytes array holding the contents of the virtio
 device's configuration space

   ...

   ``VHOST_USER_GET_CONFIG``
 :id: 24
 :equivalent ioctl: N/A
 :master payload: virtio device config space
 :slave payload: virtio device config space

 When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
 submitted by the vhost-user master to fetch the contents of the
 virtio device configuration space, vhost-user slave's payload size
 MUST match master's request, vhost-user slave uses zero length of
 payload to indicate an error to vhost-user master. The vhost-user
 master may cache the contents to avoid repeated
 ``VHOST_USER_GET_CONFIG`` calls.

   ``VHOST_USER_SET_CONFIG``
 :id: 25
 :equivalent ioctl: N/A
 :master payload: virtio device config space
 :slave payload: N/A

 When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
 submitted by the vhost-user master when the Guest changes the virtio
 device configuration space and also can be used for live migration
 on the destination host. The vhost-user slave must check the flags
 field, and slaves MUST NOT accept SET_CONFIG for read-only
 configuration space fields unless the live migration bit is set.

Stefan



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-04 Thread Jason Wang


On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.



Note that though you're correct, my understanding is that config space
is not suitable for this kind of error propagation. And it would be very
hard to implement such semantics in some transports. A virtqueue
should be much better. As Yong Ji quoted, the config space is used for
"rarely-changing or initialization-time parameters".




Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?



I'd like to stick to the current assumption that get_config won't fail.
That is to say,


1) maintain a config in the kernel, make sure the config space read can
always succeed

2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

Thanks







Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-01 Thread Jason Wang


On 2021/7/1 6:26 PM, Yongji Xie wrote:

On Thu, Jul 1, 2021 at 3:55 PM Jason Wang  wrote:


On 2021/7/1 2:50 PM, Yongji Xie wrote:

On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi  wrote:

On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:

On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:

+/* ioctls */
+
+struct vduse_dev_config {
+ char name[VDUSE_NAME_MAX]; /* vduse device name */
+ __u32 vendor_id; /* virtio vendor id */
+ __u32 device_id; /* virtio device id */
+ __u64 features; /* device features */
+ __u64 bounce_size; /* bounce buffer size for iommu */
+ __u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.


Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size:
u16 (*get_vq_num_max)(struct vdpa_device *vdev);

virtio-pci supports per-virtqueue sizes because the struct
virtio_pci_common_cfg->queue_size register is per-queue (controlled by
queue_select).


Oh, yes. I miss queue_select.


I guess this is a question for Jason: will vdpa keep this limitation?
If yes, then VDUSE can stick to it too without running into problems in
the future.


I think it's better to extend the get_vq_num_max() per virtqueue.

Currently, vDPA assumes the parent to have a global max size. This seems
to work on most of the parents but not vp-vDPA (which could be backed by
QEMU, in that case cvq's size is smaller).

Fortunately, we haven't had cvq support in the userspace yet.

I can post the fixes.


OK. If so, it looks like we need to support the per-vq configuration.
I wonder if it's better to use something like: VDUSE_CREATE_DEVICE ->
VDUSE_SETUP_VQ -> VDUSE_SETUP_VQ -> ... -> VDUSE_ENABLE_DEVICE to do
initialization rather than only use VDUSE_CREATE_DEVICE.



This should be fine.

Thanks




Thanks,
Yongji




Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-01 Thread Jason Wang


On 2021/7/1 2:50 PM, Yongji Xie wrote:

On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi  wrote:

On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:

On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:

+/* ioctls */
+
+struct vduse_dev_config {
+ char name[VDUSE_NAME_MAX]; /* vduse device name */
+ __u32 vendor_id; /* virtio vendor id */
+ __u32 device_id; /* virtio device id */
+ __u64 features; /* device features */
+ __u64 bounce_size; /* bounce buffer size for iommu */
+ __u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.


Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size:
u16 (*get_vq_num_max)(struct vdpa_device *vdev);

virtio-pci supports per-virtqueue sizes because the struct
virtio_pci_common_cfg->queue_size register is per-queue (controlled by
queue_select).


Oh, yes. I miss queue_select.


I guess this is a question for Jason: will vdpa keep this limitation?
If yes, then VDUSE can stick to it too without running into problems in
the future.



I think it's better to extend the get_vq_num_max() per virtqueue.

Currently, vDPA assumes the parent to have a global max size. This seems 
to work on most of the parents but not vp-vDPA (which could be backed by 
QEMU, in that case cvq's size is smaller).


Fortunately, we haven't had cvq support in the userspace yet.

I can post the fixes.





+ __u16 padding; /* padding */
+ __u32 vq_num; /* the number of virtqueues */
+ __u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?


  This will be used by vring_create_virtqueue() too.

If there is no official definition for the meaning of this value then
"/* same as vring_create_virtqueue()'s vring_align parameter */" would
be clearer. That way the reader knows what to research in order to
understand how this field works.


OK.


I don't remember but maybe it was used to support vrings when the
host/guest have non-4KB page sizes. I wonder if anyone has an official
definition for this value?

Not sure. Maybe we might need some alignment which is less than
PAGE_SIZE sometimes.



So I see CCW always uses 4096, but I'm not sure whether or not it's
smaller than PAGE_SIZE.


Thanks




Thanks,
Yongji




Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-29 Thread Jason Wang


On 2021/6/29 2:40 PM, Yongji Xie wrote:

On Tue, Jun 29, 2021 at 12:13 PM Jason Wang  wrote:


On 2021/6/28 6:32 PM, Yongji Xie wrote:

The large barrier is bounce-buffer mapping: SPDK requires hugepages
for NVMe over PCIe and RDMA, so taking some preallocated hugepages to
map as a bounce buffer is necessary. Or it's hard to avoid an extra
memcpy from bounce-buffer to hugepage.
If you can add an option to map hugepages as bounce-buffer,
then SPDK could also be a potential user of vduse.


I think we can support registering user space memory for bounce-buffer
use like XDP does. But this needs to pin the pages, so I didn't
consider it in this initial version.


Note that userspace should be unaware of the existence of the bounce buffer.


If so, it might be hard to use umem, because we can't use umem for
coherent mapping, which needs physically contiguous memory.

Thanks,
Yongji



We probably can use umem for memory other than the virtqueue (still via 
mmap()).


Thanks



Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/28 6:32 PM, Yongji Xie wrote:

The large barrier is bounce-buffer mapping: SPDK requires hugepages
for NVMe over PCIe and RDMA, so taking some preallocated hugepages to
map as a bounce buffer is necessary. Or it's hard to avoid an extra
memcpy from bounce-buffer to hugepage.
If you can add an option to map hugepages as bounce-buffer,
then SPDK could also be a potential user of vduse.


I think we can support registering user space memory for bounce-buffer
use like XDP does. But this needs to pin the pages, so I didn't
consider it in this initial version.



Note that userspace should be unaware of the existence of the bounce buffer.

So we need to think carefully of mmap() vs umem registering.

Thanks


Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang



On 2021/6/28 1:54 PM, Liu, Xiaodong wrote:

Several issues:

- VDUSE needs to limit the total size of the bounce buffers (64M if I was not
wrong). Does it work for SPDK?

Yes, Jason. It is enough and works for SPDK.
Since it's a kind of bounce buffer mainly for in-flight IO, a limited size like
64MB is enough.



Ok.





- VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we
need introduce new flags for supporting this)

I share your worry: I'm afraid it is hard for a kernel module
to directly preallocate hugepages internally.
What I tried is that:
1. A simple agent daemon (represents for one device)  `preallocates` and maps
 dozens of 2MB hugepages (like 64MB) for one device.
2. The daemon passes its mapping addr and hugepage fd to kernel
 module through created IOCTL.
3. Kernel module remaps the hugepages inside kernel.



Such a model should work, but the main "issue" is that it introduces
overheads in the case of vhost-vDPA.


Note that in the case of vhost-vDPA, we don't use bounce buffer, the  
userspace pages were shared directly.


And since DMA is not done per page, it prevents us from using tricks  
like vm_insert_page() in those cases.




4. Vhost user target gets and maps hugepage fd from kernel module
 in vhost-user msg through Unix Domain Socket cmsg.
Then kernel module and target map on the same hugepage based
bounce buffer for in-flight IO.

If there is one option in VDUSE to map userspace preallocated memory, then
VDUSE should be able to mandate it even if it is hugepage based.



As above, this requires some kind of re-design since VDUSE depends on  
the model of mmap(MAP_SHARED) instead of umem registering.


Thanks



Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/29 11:56 AM, Yongji Xie wrote:

On Tue, Jun 29, 2021 at 11:29 AM Jason Wang  wrote:


On 2021/6/29 10:26 AM, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:

On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_status().

2b looks more spec compliant.


Looks good to me. And I think we can use the reply of the message to
update the status instead of introducing a new command.


Just notice this part in virtio_finalize_features():

   virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
   status = dev->config->get_status(dev);
   if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no-reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to
update the status.

So it looks to me we need to synchronize for both get_status() and
set_status().


We should not send messages to userspace in the FEATURES_OK case. So
the synchronization is not necessary.


As discussed previously, there could be a device that mandates some
features (VIRTIO_F_RING_PACKED). So it can choose to not accept
FEATURES_OK if packed virtqueue is not negotiated.

In this case we need to relay the message to userspace.


OK, I see. If so, I prefer to only use noreply for set_status(). We do
not set the status bit if the message fails. In this way, we don't
need to change lots of virtio core code to handle the failure of
set_status()/get_status().



It should work.

Thanks




Thanks,
Yongji




Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/29 10:26 AM, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:


On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_status().

2b looks more spec compliant.


Looks good to me. And I think we can use the reply of the message to
update the status instead of introducing a new command.


Just notice this part in virtio_finalize_features():

  virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
  status = dev->config->get_status(dev);
  if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no-reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to
update the status.

So it looks to me we need to synchronize for both get_status() and
set_status().


We should not send messages to userspace in the FEATURES_OK case. So
the synchronization is not necessary.



As discussed previously, there could be a device that mandates some
features (VIRTIO_F_RING_PACKED). So it can choose to not accept
FEATURES_OK if packed virtqueue is not negotiated.


In this case we need to relay the message to userspace.

Thanks




Thanks,
Yongji




Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-27 Thread Jason Wang


On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_status().

2b looks more spec compliant.


Looks good to me. And I think we can use the reply of the message to
update the status instead of introducing a new command.



Just notice this part in virtio_finalize_features():

    virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
    status = dev->config->get_status(dev);
    if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no-reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to 
update the status.


So it looks to me we need to synchronize for both get_status() and
set_status().


Thanks



Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-27 Thread Jason Wang


On 2021/6/28 6:33 PM, Liu Xiaodong wrote:

On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:

This series introduces a framework that makes it possible to implement
software-emulated vDPA devices in userspace. And to make it simple, the
emulated vDPA device's control path is handled in the kernel and only the
data path is implemented in the userspace.

Since the emulated vDPA device's control path is handled in the kernel,
a message mechanism is introduced to make userspace aware of the data
path related changes. Userspace can use read()/write() to receive and
reply to the control messages.

In the data path, the core maps the dma buffer into the VDUSE daemon's
address space, which can be implemented in different ways depending on
the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
a bounce-buffering mechanism to achieve that. And in the vhost-vdpa case, the
dma buffer resides in a userspace memory region which can be shared with the
VDUSE userspace process via transferring the shmfd.

The details and our use case are shown below:

[Line-wrapped ASCII architecture diagram. It shows three boxes on top: a
container (/dev/vdx), a QEMU VM (/dev/vhost-vdpa-x), and the userspace VDUSE
daemon (vDPA device emulation plus a block driver). In the kernel, the
container's block device goes through the virtio-blk driver, the virtio bus,
the virtio-blk device, and the virtio-vdpa driver, while the VM's vhost device
goes through the vhost-vdpa driver; both reach a vdpa device on the vdpa bus,
where the vduse driver connects to the VDUSE daemon, which talks to remote
storages over TCP/IP through the NIC.]

We make use of it to implement a block device connecting to
our distributed storage, which can be used both in containers and
VMs. Thus, we can have a unified technology stack in these two cases.

To test it with null-blk:

   $ qemu-storage-daemon \
   --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
   --monitor chardev=charmonitor \
   --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
  

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-24 Thread Jason Wang


On 2021/6/24 5:16 PM, Yongji Xie wrote:

On Thu, Jun 24, 2021 at 4:14 PM Jason Wang  wrote:


On 2021/6/24 12:46 PM, Yongji Xie wrote:

So we need to deal with both FEATURES_OK and reset, but probably not
DRIVER_OK.


OK, I see. Thanks for the explanation. One more question is how about
clearing the corresponding status bit in get_status() rather than
making set_status() fail. Since the spec recommends this way for
validation which is done in virtio_dev_remove() and
virtio_finalize_features().

Thanks,
Yongji


I think you can. Or it would be even better if we just don't set the
bit during set_status().


Yes, that's what I mean.


I just realize that in vdpa_reset() we had:

static inline void vdpa_reset(struct vdpa_device *vdev)
{
  const struct vdpa_config_ops *ops = vdev->config;

  vdev->features_valid = false;
  ops->set_status(vdev, 0);
}

We probably need to add the synchronization here. E.g. re-read with a
timeout.


Looks like the timeout is already in set_status().



Do you mean the VDUSE's implementation?



  Do we really need a
duplicated one here?



1) this is the timeout at the vDPA layer instead of the VDUSE layer.
2) it really depends on what's the meaning of the timeout for set_status 
of VDUSE.


Do we want:

2a) for set_status(): relay the message to userspace and wait for the
userspace to quiesce the datapath


or

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_status().


2b looks more spec compliant.


And how to handle failure? Adding a return value
to virtio_config_ops->reset() and passing the error to the upper
layer?



Something like this.

Thanks




Thanks,
Yongji



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-24 Thread Jason Wang


On 2021/6/24 12:46 PM, Yongji Xie wrote:

So we need to deal with both FEATURES_OK and reset, but probably not
DRIVER_OK.


OK, I see. Thanks for the explanation. One more question is how about
clearing the corresponding status bit in get_status() rather than
making set_status() fail. Since the spec recommends this way for
validation which is done in virtio_dev_remove() and
virtio_finalize_features().

Thanks,
Yongji



I think you can. Or it would be even better if we just don't set the
bit during set_status().


I just realize that in vdpa_reset() we had:

static inline void vdpa_reset(struct vdpa_device *vdev)
{
    const struct vdpa_config_ops *ops = vdev->config;

    vdev->features_valid = false;
    ops->set_status(vdev, 0);
}

We probably need to add the synchronization here. E.g. re-read with a
timeout.


Thanks


Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-23 Thread Jason Wang


On 2021/6/23 1:50 PM, Yongji Xie wrote:

On Wed, Jun 23, 2021 at 11:31 AM Jason Wang  wrote:


On 2021/6/22 4:14 PM, Yongji Xie wrote:

On Tue, Jun 22, 2021 at 3:50 PM Jason Wang  wrote:

On 2021/6/22 3:22 PM, Yongji Xie wrote:

We need to find a way to propagate the error to the userspace.

E.g. if we want to stop the device, we will delay the status reset until
we get a response from the userspace?


I didn't get how to delay the status reset. And should it be a DoS
that we want to fix if the userspace doesn't give a response forever?

You're right. So let's make set_status() able to fail first, then propagate
its failure via VHOST_VDPA_SET_STATUS.


OK. So we only need to propagate the failure in the vhost-vdpa case, right?


I think not, we need to deal with the reset for virtio as well:

E.g. in register_virtio_device(), we have:

  /* We always start by resetting the device, in case a previous
   * driver messed it up.  This also tests that code path a
little. */
dev->config->reset(dev);

We probably need to make reset able to fail and then fail
register_virtio_device() as well.


OK, looks like virtio_add_status() and virtio_device_ready()[1] should
be also modified if we need to propagate the failure in the
virtio-vdpa case. Or do we only need to care about the reset case?

[1] https://lore.kernel.org/lkml/20210517093428.670-1-xieyon...@bytedance.com/



My understanding is DRIVER_OK is not something that needs to be validated:

"

DRIVER_OK (4)
Indicates that the driver is set up and ready to drive the device.

"

Since the spec doesn't require re-reading the status to check if DRIVER_OK is
set in 3.1.1 Driver Requirements: Device Initialization.


It's more about "telling the device that driver is ready."

But we do have some status bits that require synchronization with
the device.


1) FEATURES_OK: the spec requires re-reading the status bit to check whether
or not it was set by the device:


"

Re-read device status to ensure the FEATURES_OK bit is still set: 
otherwise, the device does not support our subset of features and the 
device is unusable.


"

This is useful for some devices which can only support a subset of the
features. E.g. a device that can only work with packed virtqueues. This
means the current design of set_features won't work; we need either:


1a) relay the set_features request to userspace

or

1b) introduce a mandated_device_features during device creation and 
validate the driver features during the set_features(), and don't set 
FEATURES_OK if they don't match.



2) Some transports (PCI) requires to re-read the status to ensure the 
synchronization.


"

After writing 0 to device_status, the driver MUST wait for a read of 
device_status to return 0 before reinitializing the device.


"

So we need to deal with both FEATURES_OK and reset, but probably not 
DRIVER_OK.


Thanks




Thanks,
Yongji




Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-22 Thread Jason Wang


On 2021/6/22 4:14 PM, Yongji Xie wrote:

On Tue, Jun 22, 2021 at 3:50 PM Jason Wang  wrote:


On 2021/6/22 3:22 PM, Yongji Xie wrote:

We need to find a way to propagate the error to the userspace.

E.g. if we want to stop the device, we will delay the status reset until
we get a response from the userspace?


I didn't get how to delay the status reset. And should it be a DoS
that we want to fix if the userspace doesn't give a response forever?


You're right. So let's make set_status() able to fail first, then propagate
its failure via VHOST_VDPA_SET_STATUS.


OK. So we only need to propagate the failure in the vhost-vdpa case, right?



I think not, we need to deal with the reset for virtio as well:

E.g. in register_virtio_device(), we have:

    /* We always start by resetting the device, in case a previous
 * driver messed it up.  This also tests that code path a 
little. */

  dev->config->reset(dev);

We probably need to make reset able to fail and then fail the
register_virtio_device() as well.


Thanks




Thanks,
Yongji




Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-22 Thread Jason Wang


On 2021/6/22 3:22 PM, Yongji Xie wrote:

We need to find a way to propagate the error to the userspace.

E.g. if we want to stop the device, we will delay the status reset until
we get a response from the userspace?


I didn't get how to delay the status reset. And should it be a DoS
that we want to fix if the userspace doesn't give a response forever?



You're right. So let's make set_status() able to fail first, then propagate
its failure via VHOST_VDPA_SET_STATUS.






+ }
+}
+
+static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ return dev->config_size;
+}
+
+static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int 
offset,
+   void *buf, unsigned int len)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ memcpy(buf, dev->config + offset, len);
+}
+
+static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int 
offset,
+ const void *buf, unsigned int len)
+{
+ /* Now we only support read-only configuration space */
+}
+
+static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ return dev->generation;
+}
+
+static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
+ struct vhost_iotlb *iotlb)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+ int ret;
+
+ ret = vduse_domain_set_map(dev->domain, iotlb);
+ if (ret)
+ return ret;
+
+ ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
+ if (ret) {
+ vduse_domain_clear_map(dev->domain, iotlb);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void vduse_vdpa_free(struct vdpa_device *vdpa)
+{
+ struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+
+ dev->vdev = NULL;
+}
+
+static const struct vdpa_config_ops vduse_vdpa_config_ops = {
+ .set_vq_address = vduse_vdpa_set_vq_address,
+ .kick_vq= vduse_vdpa_kick_vq,
+ .set_vq_cb  = vduse_vdpa_set_vq_cb,
+ .set_vq_num = vduse_vdpa_set_vq_num,
+ .set_vq_ready   = vduse_vdpa_set_vq_ready,
+ .get_vq_ready   = vduse_vdpa_get_vq_ready,
+ .set_vq_state   = vduse_vdpa_set_vq_state,
+ .get_vq_state   = vduse_vdpa_get_vq_state,
+ .get_vq_align   = vduse_vdpa_get_vq_align,
+ .get_features   = vduse_vdpa_get_features,
+ .set_features   = vduse_vdpa_set_features,
+ .set_config_cb  = vduse_vdpa_set_config_cb,
+ .get_vq_num_max = vduse_vdpa_get_vq_num_max,
+ .get_device_id  = vduse_vdpa_get_device_id,
+ .get_vendor_id  = vduse_vdpa_get_vendor_id,
+ .get_status = vduse_vdpa_get_status,
+ .set_status = vduse_vdpa_set_status,
+ .get_config_size= vduse_vdpa_get_config_size,
+ .get_config = vduse_vdpa_get_config,
+ .set_config = vduse_vdpa_set_config,
+ .get_generation = vduse_vdpa_get_generation,
+ .set_map= vduse_vdpa_set_map,
+ .free   = vduse_vdpa_free,
+};
+
+static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
+  unsigned long offset, size_t size,
+  enum dma_data_direction dir,
+  unsigned long attrs)
+{
+ struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_iova_domain *domain = vdev->domain;
+
+ return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
+}
+
+static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_iova_domain *domain = vdev->domain;
+
+ return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
+}
+
+static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_addr, gfp_t flag,
+ unsigned long attrs)
+{
+ struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_iova_domain *domain = vdev->domain;
+ unsigned long iova;
+ void *addr;
+
+ *dma_addr = DMA_MAPPING_ERROR;
+ addr = vduse_domain_alloc_coherent(domain, size,
+ (dma_addr_t *)&iova, flag, attrs);
+ if (!addr)
+ return NULL;
+
+ *dma_addr = (dma_addr_t)iova;
+
+ return addr;
+}
+
+static void vduse_dev_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_addr,
+ unsigned long attrs)
+{
+ struct vduse_dev *vdev = dev_to_vduse(dev);
+ struct vduse_iova_domain *domain = vdev->domain;

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-21 Thread Jason Wang


On 2021/6/21 6:41 PM, Yongji Xie wrote:

On Mon, Jun 21, 2021 at 5:14 PM Jason Wang  wrote:


On 2021/6/15 10:13 PM, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
The vDPA device's control path is handled in kernel and the data
path is handled in userspace.

A message mechanism is used by the VDUSE driver to forward some control
messages such as starting/stopping the datapath to userspace. Userspace
can use read()/write() to receive and reply to those control messages.

And some ioctls are introduced to help userspace to implement the
data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
descriptors referring to vDPA device's iova regions. Then userspace
can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
ioctls can be used to inject interrupt and setup the kickfd for
virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
configuration space and inject a config interrupt.

Signed-off-by: Xie Yongji 
---
   Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
   drivers/vdpa/Kconfig   |   10 +
   drivers/vdpa/Makefile  |1 +
   drivers/vdpa/vdpa_user/Makefile|5 +
   drivers/vdpa/vdpa_user/vduse_dev.c | 1453 

   include/uapi/linux/vduse.h |  143 ++
   6 files changed, 1613 insertions(+)
   create mode 100644 drivers/vdpa/vdpa_user/Makefile
   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
   create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 9bfc2b510c64..acd95e9dcfe7 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#Include File  
 Comments
   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h
conflict!
   '|'   00-7F  linux/media.h
   0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
   0x89  00-06  arch/x86/include/asm/sockios.h
   0x89  0B-DF  linux/sockios.h
   0x89  E0-EF  linux/sockios.h 
SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
 vDPA block device simulator which terminates IO request in a
 memory buffer.

+config VDPA_USER
+ tristate "VDUSE (vDPA Device in Userspace) support"
+ depends on EVENTFD && MMU && HAS_DMA
+ select DMA_OPS
+ select VHOST_IOTLB
+ select IOMMU_IOVA
+ help
+   With VDUSE it is possible to emulate a vDPA Device
+   in a userspace program.
+
   config IFCVF
   tristate "Intel IFC VF vDPA driver"
   depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
   # SPDX-License-Identifier: GPL-2.0
   obj-$(CONFIG_VDPA) += vdpa.o
   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
   obj-$(CONFIG_IFCVF)+= ifcvf/
   obj-$(CONFIG_MLX5_VDPA) += mlx5/
   obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..5271cbd15e28
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1453 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+ u16 index;
+ u32 num;
+ u32 avail_idx;
+ u64 desc_addr;
+ u64 driver_

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-21 Thread Jason Wang


On 2021/6/15 10:13 PM, Xie Yongji wrote:

This VDUSE driver enables implementing vDPA devices in userspace.
The vDPA device's control path is handled in kernel and the data
path is handled in userspace.

A message mechanism is used by the VDUSE driver to forward some control
messages such as starting/stopping the datapath to userspace. Userspace
can use read()/write() to receive and reply to those control messages.

And some ioctls are introduced to help userspace to implement the
data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
descriptors referring to vDPA device's iova regions. Then userspace
can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
ioctls can be used to inject interrupt and setup the kickfd for
virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
configuration space and inject a config interrupt.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |   10 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/vduse_dev.c | 1453 
  include/uapi/linux/vduse.h |  143 ++
  6 files changed, 1613 insertions(+)
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 9bfc2b510c64..acd95e9dcfe7 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#Include File  
 Comments
  'z'   10-4F  drivers/s390/crypto/zcrypt_api.hconflict!
  '|'   00-7F  linux/media.h
  0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
  0x89  00-06  arch/x86/include/asm/sockios.h
  0x89  0B-DF  linux/sockios.h
  0x89  E0-EF  linux/sockios.h 
SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
  vDPA block device simulator which terminates IO request in a
  memory buffer.
  
+config VDPA_USER

+   tristate "VDUSE (vDPA Device in Userspace) support"
+   depends on EVENTFD && MMU && HAS_DMA
+   select DMA_OPS
+   select VHOST_IOTLB
+   select IOMMU_IOVA
+   help
+ With VDUSE it is possible to emulate a vDPA Device
+ in a userspace program.
+
  config IFCVF
tristate "Intel IFC VF vDPA driver"
depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-$(CONFIG_VDPA) += vdpa.o
  obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
  obj-$(CONFIG_IFCVF)+= ifcvf/
  obj-$(CONFIG_MLX5_VDPA) += mlx5/
  obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..5271cbd15e28
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1453 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30
+
+struct vduse_virtqueue {
+   u16 index;
+   u32 num;
+   u32 avail_idx;
+   u64 desc_addr;
+   u64 driver_addr;
+   u64 device_addr;
+   bool ready;
+   bool kicked;
+   spinlock_t kick_lock;
+   spinlock_t irq_lock;
+   struct eventfd_ctx *kickfd;
+  

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-10 Thread Jason Wang


On 2021/6/10 7:47 PM, Jason Gunthorpe wrote:

On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote:

On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:


Well, this sounds like a re-invention of io_uring which has already worked
for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?


Actually it can. Userspace can poll multiple fds by preparing multiple sqes
with the IORING_OP_ADD flag.

Poll is only a part of what is needed here, the main issue is
transfering the PRI events to userspace quickly.



Do we really care about, e.g., at most one more syscall in this case? I think
the time spent on demand paging is much more than that spent transferring the
#PF to userspace. What's more, well-designed vIOMMU-capable IOMMU hardware
should have the ability to inject such an event directly into the guest if
the #PF happens at L1.






This means another ring and we need to introduce an ioctl() to add or remove
ioasids from the poll. And it still needs a kind of fallback like a list if
the ring is full.

The max size of the ring should be determinable based on the PRI
concurrency of each device and the number of devices sharing the ring



This has at least one assumption: that the #PF event is the only event for
the ring. I'm not sure this is the case.


Thanks




In any event, I'm not entirely convinced eliding the PRI user/kernel
copy is the main issue here.. If we want this to be low latency I
think it ends up with some kernel driver component assisting the
vIOMMU emulation and avoiding the round trip to userspace

Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/10 10:00 AM, Jason Wang wrote:


On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:

Well, this sounds like a re-invention of io_uring which has already 
worked

for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?



Actually it can. Userspace can poll multiple fds by preparing
multiple sqes with the IORING_OP_ADD flag.



IORING_OP_POLL_ADD actually.

Thanks







It is more like one of the perf rings



This means another ring and we need to introduce an ioctl() to add or remove
ioasids from the poll. And it still needs a kind of fallback like a
list if the ring is full.


Thanks




Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/8 6:45 PM, Enrico Weigelt, metux IT consult wrote:

On 07.06.21 20:01, Jason Gunthorpe wrote:

 it is what it is, select has a fixed size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - can't be
fixed.


in glibc ABI ? Uuuuh!



Note that dealing with select() or trying to overcome the limitation via
epoll() directly in the application is not good practice (nor is it
portable).


It's suggested to use the building blocks provided by glib, e.g. the main
event loop [1]. That is how QEMU solves the issue of dealing with a lot
of file descriptors.


Thanks

[1] https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html




--mtx




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:


Well, this sounds like a re-invention of io_uring which has already worked
for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?



Actually it can. Userspace can poll multiple fds by preparing multiple 
sqes with the IORING_OP_ADD flag.





It is more like one of the perf rings



This means another ring, and we need to introduce an ioctl() to add or remove 
ioasids from the poll. And it still needs a kind of fallback, like a list, 
if the ring is full.


Thanks




Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/8 3:41 AM, Alex Williamson wrote:

On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe  wrote:


On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:


It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device.  There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default.  If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM, for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default.  I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.



I wonder whether or not it was done via ACPI:

"

6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether 
or not a bus-master device supports hardware managed cache coherency. 
Expected values are 0 to indicate it is not supported, and 1 to indicate 
that it is supported. All other values are reserved.


...

On Intel platforms, if the _CCA object is not supplied, the OSPM will 
assume the devices are hardware cache coherent.


"

Thanks



It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.


I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device?

Sure, why do we need to be stricter? It is the same logic I gave
earlier, once an attacker process has access to wbinvd an attacker can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance.  If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.


For KVM qemu can turn on/off on hot plug events as it requires to give
VM security. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it.  Releasing the device and
retaining the privilege gained by it seems wrong.  Thanks,

Alex




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/3 1:21 AM, Jason Gunthorpe wrote:

On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:

On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:

We can open up to ~0U file descriptors, I don't see why we need to restrict
it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.


I don't see how much difference for IOASID and other type of fds. People can
choose to use poll or epoll.

Not really, once one thing in an application uses a large number of FDs the
entire application is affected. If any open() can return 'very big
number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for


And with the current proposal (assuming there's an N:1 ioasid to ioasid fd
relationship), I wonder how select can work for the specific ioasid.

pagefault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary. Multifds
just complicate this too

Jason



Well, this sounds like a re-invention of io_uring which has already 
worked for multifds.


Thanks



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/7 10:14 PM, Jason Gunthorpe wrote:

On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:


Note that the fact that no drivers call these things doesn't mean it is not
supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-soop.



Just to make sure we are on the same page: what I meant is, if a DMA 
behavior like no-snoop is device specific, there's no need to mandate 
a general virtio attribute; we can describe it per device. That the 
devices implemented in the current spec do not use non-coherent DMA 
doesn't mean any future devices won't do so. The driver could choose a 
transport (e.g. PCI), platform (ACPI) or device-specific (general virtio 
command) way to detect and flush the cache when necessary.





no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NIC's expect
packets to be processed by the CPU.



For a NIC, yes. But virtio is more than just NICs. We've already supported 
GPU and crypto devices. In those cases, no-snoop can be useful since the 
data is not necessarily expected to be processed by the CPU.


And a lot of other types of devices are being proposed.

Thanks




"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here at all.
  
Jason





Re: [RFC] /dev/ioasid uAPI proposal

2021-06-06 Thread Jason Wang


On 2021/6/4 7:58 PM, Jason Gunthorpe wrote:

On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:

nor do any virtio drivers implement the required platform specific
cache flushing to make no-snoop TLPs work.

I don't get why virtio drivers need to do that. I think the DMA API should hide
that arch/platform-specific stuff from us.

It is not arch/platform stuff. If the device uses no-snoop then a
very platform specific recovery is required in the device driver.

It is not part of the normal DMA API, it is side APIs like
flush_agp_cache() or wbinvd() that are used by GPU drivers only.



Yes and virtio doesn't support AGP.




If drivers/virtio doesn't explicitly call these things it doesn't
support no-snoop - hence no VDPA device can ever use no-snoop.



Note that the fact that no drivers call these things doesn't mean it is not 
supported by the spec.


Actually, the spec doesn't forbid non-coherent DMA; anyway, we can start 
a new thread on the virtio mailing list to discuss that.


But considering virtio already supports GPU, crypto and sound devices, 
and devices like codec and video are being proposed, it doesn't help 
to mandate coherent DMA now.


Thanks




Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
nothing to do with no-snoop.

Jason




Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Jason Wang


On 2021/6/3 9:55 PM, Andi Kleen wrote:


Ok, but what I meant is this: if we don't read from the descriptor 
ring, and we validate all the other metadata supplied by the device 
(used id and len), then there should be no way for the device to 
suppress the DMA flags to write to the indirect descriptor table.


Or do you have an example how it can do that?


I don't. If you can validate everything it's probably ok

The only drawback is even more code to audit and test.

-Andi




Ok, then I'm going to post a formal series; please have a look and we 
can start from there.


Thanks


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Jason Wang


On 2021/6/4 1:33 AM, Andy Lutomirski wrote:

On 6/2/21 5:41 PM, Andi Kleen wrote:

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
a malicious host to force anything else. Also disallow
indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."

Can we just fix the driver without special cases?



I think we can, this is what this series tries to do:

https://www.spinics.net/lists/kvm/msg241825.html

It tries to fix the problem without special-casing any specific features.

Thanks





--Andy




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-03 Thread Jason Wang


On 2021/6/4 2:19 AM, Jacob Pan wrote:

Hi Shenming,

On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu 
wrote:


On 2021/6/2 1:33, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
   

The drivers register per page table fault handlers to /dev/ioasid which
will then register itself to iommu core to listen and route the per-
device I/O page faults.

I'm still confused why drivers need fault handlers at all?

Essentially it is the userspace that needs the fault handlers,
one case is to deliver the faults to the vIOMMU, and another
case is to enable IOPF on the GPA address space for on-demand
paging, it seems that both could be specified in/through the
IOASID_ALLOC ioctl?


I would think IOASID_BIND_PGTABLE is where fault handler should be
registered. There wouldn't be any IO page fault without the binding anyway.

I also don't understand why device drivers should register the fault
handler, the fault is detected by the pIOMMU and injected to the vIOMMU. So
I think it should be the IOASID itself register the handler.



As discussed in another thread.

I think the reason is that ATS doesn't forbid the #PF from being reported 
in a device-specific way.


Thanks





Thanks,
Shenming



Thanks,

Jacob




Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Jason Wang


On 2021/6/4 2:00 AM, Andi Kleen wrote:


On 6/3/2021 10:33 AM, Andy Lutomirski wrote:

On 6/2/21 5:41 PM, Andi Kleen wrote:

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
a malicious host to force anything else. Also disallow
indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."


My understanding is most of the other modes (except for split with 
separate descriptors) are obsolete and just there for compatibility. 
As long as they're deprecated they won't harm anyone.


-Andi



For "mode" do you packed vs split? If yes, it's not just for 
compatibility. Though packed virtqueue is designed to be more hardware 
friendly, most hardware vendors choose to start from split.


Thanks


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Jason Wang


On 2021/6/4 3:31 AM, Andy Lutomirski wrote:


On Thu, Jun 3, 2021, at 11:00 AM, Andi Kleen wrote:

On 6/3/2021 10:33 AM, Andy Lutomirski wrote:

On 6/2/21 5:41 PM, Andi Kleen wrote:

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
a malicious host to force anything else. Also disallow
indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."

My understanding is most of the other modes (except for split with
separate descriptors) are obsolete and just there for compatibility. As
long as they're deprecated they won't harm anyone.



Tell that to every crypto downgrade attack ever.

I see two credible solutions:

1. Actually harden the virtio driver.

2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
rename the old driver virtio-legacy or virtio-insecure.  They can share code.



Note that we have already split the legacy driver out, and it can be turned off 
via Kconfig.





Another snag you may hit: virtio’s heuristic for whether to use proper DMA ops 
or to bypass them is a giant kludge. I’m very slightly optimistic that getting 
the heuristic wrong will make the driver fail to operate but won’t allow the 
host to take over the guest, but I’m not really convinced. And I wrote that 
code!  A virtio-modern mode probably should not have a heuristic, and the 
various iommu-bypassing modes should be fixed to work at the bus level, not the 
device level.



I remember there was a very long discussion about this, probably 
without any conclusion. Fortunately, the management layer has been 
taught to enforce VIRTIO_F_ACCESS_PLATFORM for encrypted guests.


A possible way to fix this without any conflicts is to mandate 
VIRTIO_F_ACCESS_PLATFORM in version 1.2.


Thanks







Re: [RFC] /dev/ioasid uAPI proposal

2021-06-03 Thread Jason Wang


On 2021/6/3 9:09 PM, Jason Gunthorpe wrote:

On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:


Basically, we don't want to bother with a pseudo KVM device like what VFIO
did. So for simplicity, we rule out IOMMUs that can't enforce coherency
in vhost-vDPA if the parent purely depends on the platform IOMMU:

VDPA HW cannot issue no-snoop TLPs in the first place.



Note that virtio/vDPA is not necessarily a PCI device.




virtio does not define a protocol to discover such a functionality,



Actually we had:

VIRTIO_F_ACCESS_PLATFORM(33)
This feature indicates that the device can be used on a platform where 
device access to data in memory is limited and/or translated. E.g. this 
is the case if the device can be located behind an IOMMU that translates 
bus addresses from the device into physical addresses in memory, if the 
device can be limited to only access certain memory addresses or if 
special commands such as a cache flush can be needed to synchronise data 
in memory with the device.




nor do any virtio drivers implement the required platform specific
cache flushing to make no-snoop TLPs work.



I don't get why virtio drivers need to do that. I think the DMA API should 
hide that arch/platform-specific stuff from us.





It is fundamentally part of the virtio HW PCI API that a device vendor
cannot alter.



The spec doesn't forbid this; it just leaves the detection and action 
to the driver in a platform-specific way.


Thanks




Basically since we already know that the virtio kernel drivers do not
call the cache flush instruction we don't need the weird KVM logic to
turn it on at all.

Enforcing no-snoop at the IOMMU here is redundant/confusing.

Jason




Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


On 2021/6/3 10:56 AM, Andi Kleen wrote:




I agree, but I want to know why the indirect descriptor needs to be 
disabled. The table can't be written by the device since it's not a 
coherent swiotlb mapping.


I had all kinds of problems with uninitialized entries in the indirect 
table. So I gave up on it and concluded it would be too difficult to 
secure.



-Andi




Ok, but what I meant is this: if we don't read from the descriptor ring, 
and we validate all the other metadata supplied by the device (used id and 
len), then there should be no way for the device to suppress the DMA 
flags to write to the indirect descriptor table.


Or do you have an example how it can do that?

Thanks


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/3 4:37 AM, Alex Williamson wrote:

On Wed, 2 Jun 2021 16:54:04 -0300
Jason Gunthorpe  wrote:


On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:

Right, the device can generate the no-snoop transactions, but it's the
IOMMU that essentially determines whether those transactions are
actually still cache coherent, AIUI.

Wow, this is really confusing stuff in the code.

At the PCI level there is a TLP bit called no-snoop that is platform
specific. The general intention is to allow devices to selectively
bypass the CPU caching for DMAs. GPUs like to use this feature for
performance.

Yes


I assume there is some exciting security issues here. Looks like
allowing cache bypass does something bad inside VMs? Looks like
allowing the VM to use the cache clear instruction that is mandatory
with cache bypass DMA causes some QOS issues? OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.


So how does it work?

What I see in the intel/iommu.c is that some domains support "snoop
control" or not, based on some HW flag. This indicates if the
DMA_PTE_SNP bit is supported on a page by page basis or not.

Since x86 always leans toward "DMA cache coherent" I'm reading some
tea leaves here:

IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
   transactions */

And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

That's my understanding as well.


Further, I guess IOMMUs that don't support PTE_SNP, or have
DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
and all)

Yes.


So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
incoherent with the CPU caches, it just means that that snoop bit in
the TLP cannot be enforced. ie the device *could* do no-snoop DMA
if it wants. Devices that never do no-snoop remain DMA coherent on
x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.


IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
the PCI device is blocked from using no-snoop in its TLPs.

I wonder if ARM implemented this consistently? I see VDPA is
confused..



Basically, we don't want to bother with a pseudo KVM device like what VFIO 
did. So for simplicity, we rule out IOMMUs that can't enforce 
coherency in vhost-vDPA if the parent purely depends on the platform IOMMU:



    if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
    return -ENOTSUPP;

For parents that use their own translation logic, an implicit 
assumption is that the hardware will always perform cache-coherent DMA.


Thanks



Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Jason Wang


On 2021/6/3 10:18 AM, Andi Kleen wrote:


It looks to me that all the evils come from the fact that we depend on 
the descriptor ring.


So the checks in this patch would be unnecessary if we didn't even 
read from the descriptor ring, which can be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html


I would argue that you should boundary check in any case. It was 
always a bug to not have boundary checks in such a data structure with 
multiple users, trust or not.


But yes your patch series is interesting and definitely makes sense 
for TDX too.


Best would be to have both I guess, and always check the boundaries 
everywhere.



I agree, but some of the checks are unnecessary if we do this series on 
top of my series.





So what's the merge status of your series?



If I understand correctly from Michael, I will send a formal series and 
he will try to merge it for 5.14.


Thanks




-Andi





Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


On 2021/6/3 9:48 AM, Andi Kleen wrote:


So we will see a huge performance regression without indirect 
descriptors. We need to consider addressing this.


A regression would be when some existing case would be slower.

That's not the case because the behavior for the existing cases does 
not change.


Anyways when there are performance problems they can be addressed, but 
first is to make it secure.



I agree, but I want to know why the indirect descriptor needs to be 
disabled. The table can't be written by the device since it's not a coherent 
swiotlb mapping.


Thanks




-Andi




Thanks



  break;
  case VIRTIO_RING_F_EVENT_IDX:
  break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct 
virtio_device *vdev)

  case VIRTIO_F_ACCESS_PLATFORM:
  break;
  case VIRTIO_F_RING_PACKED:
+    if (protected_guest_has(VM_MEM_ENCRYPT))
+    goto clear;
  break;
  case VIRTIO_F_ORDER_PLATFORM:
  break;
+    clear:
  default:
  /* We don't understand this bit. */
  __virtio_clear_bit(vdev, i);







Re: [PATCH v1 3/8] virtio: Harden split buffer detachment

2021-06-02 Thread Jason Wang


On 2021/6/3 8:41 AM, Andi Kleen wrote:

Harden the split buffer detachment path by adding boundary checking. Note
that when this fails we may fail to unmap some swiotlb mapping, which could
result in a leak and a DOS. But that's acceptable because a malicious host
can DOS us anyways.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 25 +++++++++++++++++++++----
  1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d37ff5a0ff58..1e9aa1e95e1b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue 
*_vq)
return needs_kick;
  }
  
-static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,

-void **ctx)
+static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+   void **ctx)
  {
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
  
+	/* We'll leak DMA mappings when this happens, but nothing

+* can be done about that. In the worst case the host
+* could DOS us, but it can of course do that anyways.
+*/
+   if (!inside_split_ring(vq, head))
+   return -EIO;



I think the caller has already done this for us, with even more checks on 
the token (virtqueue_get_buf_ctx_split()):


    if (unlikely(i >= vq->split.vring.num)) {
    BAD_RING(vq, "id %u out of range\n", i);
    return NULL;
    }
    if (unlikely(!vq->split.desc_state[i].data)) {
    BAD_RING(vq, "id %u is not a head!\n", i);
    return NULL;
    }



+
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
  
@@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,

while (vq->split.vring.desc[i].flags & nextflag) {
vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
+   if (!inside_split_ring(vq, i))
+   return -EIO;



Similarly, if we don't depend on the metadata stored in the descriptor, 
we don't need this check.




vq->vq.num_free++;
}
  
@@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
  
  		/* Free the indirect table, if any, now that it's unmapped. */

if (!indir_desc)
-   return;
+   return 0;
  
  		len = virtio32_to_cpu(vq->vq.vdev,

vq->split.vring.desc[head].len);
@@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
} else if (ctx) {
*ctx = vq->split.desc_state[head].indir_desc;
}
+   return 0;
  }
  
  static inline bool more_used_split(const struct vring_virtqueue *vq)

@@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
void *ret;
unsigned int i;
u16 last_used;
+   int err;
  
  	START_USE(vq);
  
@@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
  
  	/* detach_buf_split clears data, so grab it now. */

ret = vq->split.desc_state[i].data;
-   detach_buf_split(vq, i, ctx);
+   err = detach_buf_split(vq, i, ctx);
+   if (err) {
+   END_USE(vq);



This reminds me that we don't use END_USE() after BAD_RING(), which 
should be fixed.


Thanks



+   return NULL;
+   }
+
vq->last_used_idx++;
/* If we expect an interrupt for the next entry, tell host
 * by writing event index and flush out the write before
@@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct 
virtqueue *_vq)
/* detach_buf_split clears data, so grab it now. */
buf = vq->split.desc_state[i].data;
detach_buf_split(vq, i, NULL);
+   /* Don't need to check for error because nothing is returned */
vq->split.avail_idx_shadow--;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);



Re: [PATCH v1 2/8] virtio: Add boundary checks to virtio ring

2021-06-02 Thread Jason Wang


On 2021/6/3 8:41 AM, Andi Kleen wrote:

In protected guest mode we don't trust the host.

This means we need to make sure the host cannot subvert us through
virtio communication. In general it can corrupt our virtio data
and cause a DOS, but it should not be able to access any data
that is not explicitly under IO.

Also add boundary checking so that the free list (which is accessible
to the host) cannot point outside the virtio ring. Note it could
still contain loops or similar, but these should only cause a DOS,
not a memory corruption or leak.

When we detect any out of bounds descriptor trigger an IO error.
We also use a WARN() (in case it was a software bug instead of
an attack). This implies that a malicious host can flood
the guest kernel log, but that's only a DOS and acceptable
in the threat model.

This patch only hardens the initial consumption of the free list,
the freeing comes later.

Any of these errors can cause DMA memory leaks, but there is nothing
we can do about that and that would be just a DOS.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
  1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f35629fa47b1..d37ff5a0ff58 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct 
virtqueue *_vq,
return desc;
  }
  
+/* assumes no indirect mode */

+static inline bool inside_split_ring(struct vring_virtqueue *vq,
+unsigned index)
+{
+   return !WARN(index >= vq->split.vring.num,
+   "desc index %u out of bounds (%u)\n",
+   index, vq->split.vring.num);



It's better to use BAD_RING() to stop the virtqueue in this case.



+}
+
  static inline int virtqueue_add_split(struct virtqueue *_vq,
  struct scatterlist *sgs[],
  unsigned int total_sg,
@@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
unsigned int i, n, avail, descs_used, prev, err_idx;
int head;
bool indirect;
+   int io_err;
  
  	START_USE(vq);
  
@@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
  
  	for (n = 0; n < out_sgs; n++) {

for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_TO_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;
  
@@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

}
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-   dma_addr_t addr = vring_map_one_sg(vq, sg, 
DMA_FROM_DEVICE);
+   dma_addr_t addr;
+
+   io_err = -EIO;
+   if (!inside_split_ring(vq, i))
+   goto unmap_release;
+   io_err = -ENOMEM;
+   addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
if (vring_mapping_error(vq, addr))
goto unmap_release;



It looks to me that all the evils come from the fact that we depend on the 
descriptor ring.


So the checks in this patch would be unnecessary if we didn't even read 
from the descriptor ring, which can be manipulated by the device.


This is what my series tries to achieve:

https://www.spinics.net/lists/kvm/msg241825.html

Thanks



  
@@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

dma_addr_t addr = vring_map_single(
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
+   io_err = -ENOMEM;
if (vring_mapping_error(vq, addr))
goto unmap_release;
  
@@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,

/* We're using some buffers from the free list. */
vq->vq.num_free -= descs_used;
  
+	io_err = -EIO;

+   if (!inside_split_ring(vq, head))
+   goto unmap_release;
+
/* Update free pointer */
if (indirect)
vq->free_head = virtio16_to_cpu(_vq->vdev,
@@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = 

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-02 Thread Jason Wang


On 2021/6/3 8:41 AM, Andi Kleen wrote:

When running under TDX the virtio host is untrusted. The bulk
of the kernel memory is encrypted and protected, but the virtio
ring is in special shared memory that is shared with the
untrusted host.

This means virtio needs to be hardened against any attacks from
the host through the ring. Of course it's impossible to prevent DOS
(the host can choose at any time to stop doing IO), but there
should be no buffer overruns or similar that might give access to
any private memory in the guest.

virtio has a lot of modes, most are difficult to harden.

The best for hardening seems to be split mode without indirect
descriptors. This also simplifies the hardening job because
it's only a single code path.

Only allow split mode when in a protected guest. Followon
patches harden the split mode code paths, and we don't want
a malicious host to force anything else. Also disallow
indirect mode for similar reasons.

Signed-off-by: Andi Kleen 
---
  drivers/virtio/virtio_ring.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifdef DEBUG

@@ -2221,8 +,16 @@ void vring_transport_features(struct virtio_device *vdev)
unsigned int i;
  
  	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {

+
+   /*
+* In protected guest mode disallow packed or indirect
+* because they ain't hardened.
+*/
+
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;



So we will see a huge performance regression without indirect descriptors. 
We need to consider how to address this.


Thanks



break;
case VIRTIO_RING_F_EVENT_IDX:
break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
case VIRTIO_F_ACCESS_PLATFORM:
break;
case VIRTIO_F_RING_PACKED:
+   if (protected_guest_has(VM_MEM_ENCRYPT))
+   goto clear;
break;
case VIRTIO_F_ORDER_PLATFORM:
break;
+   clear:
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: Virtio hardening for TDX

2021-06-02 Thread Jason Wang


On 2021/6/3 8:41 AM, Andi Kleen wrote:

[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host
anymore. The host is allowed to DOS of course, but it is not allowed
to read or write any guest memory not explicitly shared with it.

This has implications for virtio. Traditionally virtio didn't assume
the other side of the communication channel is malicious, and therefore
didn't do any boundary checks in virtio ring data structures.

This patchkit does hardening for virtio.  In a TDX like model
the only host memory accesses allowed are in the virtio ring,
as well as the (forced) swiotlb buffer.

This patch kit does various changes to ensure there can be no
access outside these two areas. It is possible for the host
to break the communication, but this should result in an IO
error on the guest, not in memory safety violations.

virtio is quite complicated with many modes. To simplify
the task we enforce that virtio is only in split mode without
indirect descriptors, when running as a TDX guest. We also
enforce use of the DMA API.

Then these code paths are hardened against any corruptions
on the ring.

This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable the arch_has_restricted_memory_access
for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. This is only needed for the "am I running in TDX" part.



Note that it's probably needed by other cases as well:

1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs

We have already had discussions and some patches have been posted[2][3][4].

I think the basic idea is similar: basically, we don't trust any 
metadata provided by the device.


[2] is the series that uses metadata stored in private memory which 
can't be accessed via swiotlb; that series aims to eliminate all 
possible attacks via virtqueue metadata.

[3] is one example, for the used length validation
[4] is the fix for the malicious config space

Thanks

[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376








Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 1:29 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:


For the case of 1M, I would like to know what's the use case for a single
process to handle 1M+ address spaces?

For some scenarios every guest PASID will require a IOASID ID # so
there is a large enough demand that FDs alone are not a good fit.

Further there are global container wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at the startup.



So if we implement a per-fd model, the global "container" properties could 
be done via the parent fd, e.g. attaching the parent to the device at 
startup.






So this RFC treats fd as a container of address spaces, each of which is
tagged by an IOASID.

If the container and address space are 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship



This can also be done per fd? An fd parent can have multiple fd children.

Thanks




Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
  

We can open up to ~0U file descriptors, I don't see why we need to restrict
it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.



I don't see much difference between IOASID fds and other types of fds. 
People can choose to use poll or epoll.


And with the current proposal (assuming there's an N:1 mapping of 
ioasids to an ioasid fd), I wonder how select can work for a specific 
ioasid.


Thanks




A huge number of FDs for typical usages should be avoided.

Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread Jason Wang


On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:

I summarized five opens here, about:

1)  Finalizing the name to replace /dev/ioasid;
2)  Whether one device is allowed to bind to multiple IOASID fd's;
3)  Carry device information in invalidation/fault reporting uAPI;
4)  What should/could be specified when allocating an IOASID;
5)  The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for
two reasons. First, per AMD's presentation in last KVM forum they
implement vIOMMU in hardware thus need to support user-managed
domains. An iommu uAPI notation might make more sense moving
forward. Second, it makes later uAPI naming easier as 'IOASID' can
be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID.:)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time.



It looks to me like using "/dev/iommu" excludes the possibility of 
implementing IOASID in a device-specific way (e.g. through the 
co-operation of a device MMU + platform IOMMU)?


What's more, the ATS spec doesn't forbid the device #PF to be reported in 
a device-specific way.


Thanks



We've also talked about this as /dev/sva for a while and
now /dev/ioasid

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.
  



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Wang


On 2021/6/1 2:16 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 2:07 PM

On 2021/6/1 1:42 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 1:30 PM

On 2021/6/1 1:23 PM, Lu Baolu wrote:

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:

2) If yes, what's the reason for not simply using the fd opened from
/dev/ioas. (This is the question that is not answered) and what
happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the

management

of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.

I'm not sure I get this, for nesting cases you can still make the
child an fd.

And a question still, under what case we need to create multiple
ioasids on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASID per-FD" vs "multiple
IOASID FDs" for such a case.

Right, that's exactly my question. The latter seems much easier to
understand and implement.


A simple reason discussed in the previous thread - there could be 1M
I/O address spaces per device while #FDs are a precious resource.


Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantic, you can do a lot of other stuff: close on
exec, passing via SCM_RIGHTS.

yes, fd has its merits.


For the case of 1M, I would like to know what's the use case for a
single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest
there could be many guest processes. Though in reality I didn't see
such 1M processes on a single device, better not restrict it in uAPI?



Sorry, I don't get it here.

We can open up to ~0U file descriptors, I don't see why we need to 
restrict it in uAPI.


Thanks







So this RFC treats fd as a container of address spaces, each of which is
tagged by an IOASID.


If the container and address space are 1:1 then the container seems useless.


yes, 1:1 then container is useless. But here it's assumed 1:M then
even a single fd is sufficient for all intended usages.

Thanks
Kevin



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Jason Wang


On 2021/6/1 1:42 PM, Tian, Kevin wrote:

From: Jason Wang
Sent: Tuesday, June 1, 2021 1:30 PM

On 2021/6/1 1:23 PM, Lu Baolu wrote:

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:

2) If yes, what's the reason for not simply using the fd opened from
/dev/ioas. (This is the question that is not answered) and what
happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the management
of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.


I'm not sure I get this, for nesting cases you can still make the
child an fd.

And a question still, under what case we need to create multiple
ioasids on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASID per-FD" vs "multiple
IOASID FDs" for such a case.


Right, that's exactly my question. The latter seems much easier to
understand and implement.


A simple reason discussed in the previous thread - there could be 1M
I/O address spaces per device while #FDs are a precious resource.



Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantic, you can do a lot of other stuff: close on 
exec, passing via SCM_RIGHTS.


For the case of 1M, I would like to know what's the use case for a 
single process to handle 1M+ address spaces?




So this RFC treats fd as a container of address spaces, each of which is
tagged by an IOASID.



If the container and address space are 1:1 then the container seems useless.

Thanks




Thanks
Kevin



Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Jason Wang


On 2021/6/1 1:23 PM, Lu Baolu wrote:

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:

2) If yes, what's the reason for not simply using the fd opened from
/dev/ioas. (This is the question that is not answered) and what 
happens

if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the management
of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.



I'm not sure I get this, for nesting cases you can still make the 
child an fd.


And a question still, under what case we need to create multiple 
ioasids on a single ioasid fd?


One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I'm still not sure about "multiple IOASID per-FD" vs "multiple
IOASID FDs" for such a case.



Right, that's exactly my question. The latter seems much easier to 
understand and implement.


Thanks




Best regards,
baolu




Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Jason Wang


On 2021/6/1 12:27 PM, Shenming Lu wrote:

On 2021/6/1 10:36, Jason Wang wrote:

On 2021/5/31 4:41 PM, Liu Yi L wrote:

I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
hardware nesting. Or is there way to detect the capability before?

I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
is not able to support nesting, then should fail it.


I think GET_INFO only works after the ATTACH.

yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
gpa_ioasid and check if nesting is supported or not. right?


Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?
2) If yes, what's the reason for not simply using the fd opened from /dev/ioas. 
(This is the question that is not answered) and what happens if we call 
GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

It seems that the return value from IOASID_ALLOC is an IOASID number in the
ioasid_data struct, then when calling GET_INFO, we should convey this IOASID
number to get the associated I/O address space attributes (depend on the
physical IOMMU, which could be discovered when attaching a device to the
IOASID fd or number), right?



Right, but the question is why we need such indirection. Unless there's a 
case where you need to create multiple IOASIDs per ioasid fd, it's 
simpler to attach the metadata to the ioasid fd itself.


Thanks




Thanks,
Shenming




Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Jason Wang


On 2021/6/1 11:31 AM, Liu Yi L wrote:

On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:


On 2021/5/31 4:41 PM, Liu Yi L wrote:

I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
hardware nesting. Or is there way to detect the capability before?

I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
is not able to support nesting, then should fail it.
  

I think GET_INFO only works after the ATTACH.

yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
gpa_ioasid and check if nesting is supported or not. right?


Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?

it's an ID so far in this proposal.



Ok.





2) If yes, what's the reason for not simply using the fd opened from
/dev/ioas. (This is the question that is not answered) and what happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?

oh, missed this question in prior reply. Personally, no special reason
yet. But using ID may give us opportunity to customize the management
of the handle. For one, better lookup efficiency by using xarray to
store the allocated IDs. For two, could categorize the allocated IDs
(parent or nested). GET_INFO just works with an input FD and an ID.



I'm not sure I get this, for nesting cases you can still make the child 
an fd.


And a question still, under what case we need to create multiple ioasids 
on a single ioasid fd?


(This case is not demonstrated in your examples).

Thanks




  

	/* Bind guest I/O page table */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

/* See 5.6 for I/O page fault handling */

5.5. Guest SVA (vSVA)
++

After boot, the guest further creates a GVA address space (gpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
pdev and mdev.

The sequence applies to all device types (being pdev or mdev), except
one additional step to call KVM for ENQCMD-capable mdev:

My understanding is ENQCMD is Intel specific and not a requirement for
having vSVA.

ENQCMD is not really Intel specific although only Intel supports it today.
The PCIe DMWr capability is the capability for software to enumerate the
ENQCMD support in device side. yes, it is not a requirement for vSVA. They
are orthogonal.


Right, then it's better to mention DMWr instead of a vendor specific
instruction in a general framework like ioasid.

good suggestion. :)




Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Jason Wang


On 2021/5/31 4:41 PM, Liu Yi L wrote:

I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
hardware nesting. Or is there way to detect the capability before?

I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
is not able to support nesting, then should fail it.


I think GET_INFO only works after the ATTACH.

yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
gpa_ioasid and check if nesting is supported or not. right?



Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?
2) If yes, what's the reason for not simply using the fd opened from 
/dev/ioas. (This is the question that is not answered) and what happens 
if we call GET_INFO for the ioasid_fd?

3) If not, how GET_INFO work?





	/* Bind guest I/O page table */
	bind_data = {
		.ioasid	= giova_ioasid;
		.addr	= giova_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular information
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

/* See 5.6 for I/O page fault handling */

5.5. Guest SVA (vSVA)
++

After boot, the guest further creates a GVA address space (gpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
pdev and mdev.

The sequence applies to all device types (being pdev or mdev), except
one additional step to call KVM for ENQCMD-capable mdev:

My understanding is ENQCMD is Intel specific and not a requirement for
having vSVA.

ENQCMD is not really Intel specific although only Intel supports it today.
The PCIe DMWr capability is the capability for software to enumerate the
ENQCMD support in device side. yes, it is not a requirement for vSVA. They
are orthogonal.



Right, then it's better to mention DMWr instead of a vendor specific 
instruction in a general framework like ioasid.


Thanks







Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-30 Thread Jason Wang


On 2021/5/31 12:27 PM, Yongji Xie wrote:

On Fri, May 28, 2021 at 10:31 AM Jason Wang  wrote:


On 2021/5/27 9:17 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:41 PM Jason Wang  wrote:

On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	wait_event_killable(msg->waitq, msg->completed);

What happens if the userspace (malicious) doesn't give a response forever?

It looks like a DOS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.

Probably.



We had NEEDS_RESET but it looks like we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?

Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.



Or a rough idea is that we may need some relaxing to be coupled loosely
with userspace, e.g. the device (control path) is implemented in the
kernel but the datapath is implemented in userspace like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.

Good point.

But set_config() should be rare:

E.g in the case of virtio-net with VERSION_1, config space is read only,
and it was set via control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or

I prefer this way. And I think we also need to do similar things for
set/get_vq_state().


Yes, I agree.


Hi Jason,

Now I'm working on this. But I found the config change notification
must be synchronous in the virtio-blk case, which means the kernel
still needs to wait for the response from userspace in set_config().
Otherwise, some I/Os might still run the old way after we change the
cache_type in sysfs.

The simple ways to solve this problem are:

1. Only support read-only config space, disable WCE as you suggested
2. Add a return value to set_config() and handle the failure only in
virtio-blk driver
3. Print some warnings after timeout since it only affects the
dataplane which is under userspace's control

Any suggestions?



Let's go without WCE and make VDUSE work first. We can then think 
of a solution for WCE on top.


Thanks




Thanks,
Yongji




Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-28 Thread Jason Wang


On 2021/5/28 11:54 AM, Yongji Xie wrote:

On Fri, May 28, 2021 at 9:33 AM Jason Wang  wrote:


On 2021/5/27 6:14 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:

On 2021/5/27 4:41 PM, Jason Wang wrote:

On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	wait_event_killable(msg->waitq, msg->completed);

What happens if the userspace (malicious) doesn't give a response forever?

It looks like a DOS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.

Probably.



We had NEEDS_RESET but it looks like we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?

Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.



Or a rough idea is that we may need some relaxing to be coupled loosely
with userspace, e.g. the device (control path) is implemented in the
kernel but the datapath is implemented in userspace like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.

Good point.

But set_config() should be rare:

E.g in the case of virtio-net with VERSION_1, config space is read
only, and it was set via control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space

Thanks

Another thing if we want to go this way:

We need to find a way to terminate the data path from the kernel side, to
implement the reset semantic.


Do you mean terminate the data path in vdpa_reset().


Yes.



Is it ok to just notify userspace to stop the data path asynchronously?


For well-behaved userspace, yes but no for buggy or malicious ones.


But the buggy or malicious daemons can't do anything if my
understanding is correct.



You're right. I originally thought there could still be bouncing. But 
given that we don't do that during a fault:


It should be safe.





I had an idea: how about terminating the IOTLB in this case? Then we
in fact turn the datapath off.


Sorry, I didn't get your point here. What do you mean by terminating
IOTLB?



I meant terminating the bouncing, but it looks safe after a second thought :)

Thanks



Remove the iotlb mapping? But userspace can still access the mapped
region.

Thanks,
Yongji




Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


On 2021/5/27 9:17 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:41 PM Jason Wang  wrote:


On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	wait_event_killable(msg->waitq, msg->completed);

What happens if the userspace (malicious) doesn't give a response forever?

It looks like a DOS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.


Probably.



We had NEEDS_RESET but it looks like we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?


Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.



Or a rough idea is that we may need some relaxing to be coupled loosely
with userspace, e.g. the device (control path) is implemented in the
kernel but the datapath is implemented in userspace like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.


Good point.

But set_config() should be rare:

E.g. in the case of virtio-net with VERSION_1, the config space is read-only,
and it is set via the control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or

I prefer this way. And I think we also need to do similar things for
set/get_vq_state().



Yes, I agree.

Thanks




Thanks,
Yongji



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC] /dev/ioasid uAPI proposal

2021-05-27 Thread Jason Wang


On 2021/5/27 3:58 PM, Tian, Kevin wrote:

/dev/ioasid provides an unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.



Not a native speaker but /dev/ioas seems better?




This proposal describes the uAPI of /dev/ioasid and also sample sequences
with VFIO as example in typical usages. The driver-facing kernel API provided
by the iommu layer is still TBD, which can be discussed after consensus is
made on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/

It ends up to be a long writing due to many things to be summarized and
non-trivial effort required to connect them into a complete proposal.
Hope it provides a clean base to converge.

TOC

1. Terminologies and Concepts
2. uAPI Proposal
 2.1. /dev/ioasid uAPI
 2.2. /dev/vfio uAPI
 2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
 5.1. A simple example
 5.2. Multiple IOASIDs (no nesting)
 5.3. IOASID nesting (software)
 5.4. IOASID nesting (hardware)
 5.5. Guest SVA (vSVA)
 5.6. I/O page fault
 5.7. BIND_PASID_TABLE


1. Terminologies and Concepts
-

IOASID FD is the container holding multiple I/O address spaces. User
manages those address spaces through FD operations. Multiple FD's are
allowed per process, but with this proposal one FD should be sufficient for
all intended usages.

IOASID is the FD-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to
whether the corresponding I/O page table is constructed by the kernel or
the user. When kernel-managed, a dma mapping protocol (similar to
existing VFIO iommu type1) is provided for the user to explicitly specify
how the I/O address space is mapped. Otherwise, a different protocol is
provided for the user to bind a user-managed I/O page table to the
IOMMU, plus necessary commands for iotlb invalidation and I/O fault
handling.

Pgtable binding protocol can be used only on the child IOASID's, implying
IOASID nesting must be enabled. This is because the kernel doesn't trust
userspace. Nesting allows the kernel to enforce its DMA isolation policy
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and
software nesting. With hardware support the child and parent I/O page
tables are walked consecutively by the IOMMU to form a nested translation.
When it's implemented in software, the ioasid driver



Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow software-specific implementations for the device:

1) swiotlb or
2) device specific IOASID implementation



is responsible for
merging the two-level mappings into a single-level shadow I/O page table.
Software nesting requires both child/parent page tables operated through
the dma mapping protocol, so any change in either level can be captured
by the kernel to update the corresponding shadow mapping.

An I/O address space takes effect in the IOMMU only after it is attached
to a device. The device in the /dev/ioasid context always refers to a
physical one or 'pdev' (PF or VF).

One I/O address space could be attached to multiple devices. In this case,
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed
to attach to multiple I/O address spaces, with DMAs accessing them by
carrying different routing information. One of them is the default I/O
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
remaining are routed by RID + Process Address Space ID (PASID) or
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through passthrough framework uAPI (use
VFIO for simplicity in following context). VFIO is responsible for identifying
the routing information and registering it to the ioasid driver when calling
ioasid attach helper function. It could be RID if the assigned device is
pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
user might also provide its view of virtual routing information (vPASID) in
the attach call, e.g. when multiple user-managed I/O address spaces are
attached to the vfio_device. 

Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


On 2021/5/27 6:14 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:


On 2021/5/27 4:41 PM, Jason Wang wrote:

On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang 
wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang 
wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+			      struct vduse_dev_msg *msg)
+{
+	init_waitqueue_head(&msg->waitq);
+	spin_lock(&dev->msg_lock);
+	vduse_enqueue_msg(&dev->send_list, msg);
+	wake_up(&dev->waitq);
+	spin_unlock(&dev->msg_lock);
+	wait_event_killable(msg->waitq, msg->completed);

What happens if (malicious) userspace never gives a response?

It looks like a DoS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?

Maybe, but I'm not sure how hard that would be.


We need to change all virtio device drivers in this way.


Probably.



We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?


Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even reset() is not reliable.



Or a rough idea: maybe we need some relaxing so the kernel is loosely
coupled with userspace. E.g. the device (control path) is implemented in
the kernel but the datapath is implemented in userspace, like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.


Good point.

But set_config() should be rare:

E.g. in the case of virtio-net with VERSION_1, the config space is
read-only, and it is set via the control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space

Thanks


Another thing if we want to go this way:

We need find a way to terminate the data path from the kernel side, to
implement to reset semantic.


Do you mean terminating the data path in vdpa_reset()?



Yes.



Is it OK to just notify userspace to stop the data path asynchronously?



For well-behaved userspace, yes but no for buggy or malicious ones.

I had an idea: how about terminating the IOTLB in this case? Then we in
fact turn the datapath off.


Thanks



Userspace should not be able to do any I/O at that time because the
iotlb mapping is already removed.

Thanks,
Yongji



