Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

2023-05-30 Thread Jason Gunthorpe
On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework. The use case is nested
> translation, where modern IOMMU hardware supports two-stage translation
> tables. The second-stage translation table is managed by the host VMM
> while the first-stage translation table is owned by the user space.
> Hence, any IO page fault that occurs on the first-stage page table
> should be delivered to user space and handled there. User space should
> then report the handling result back to the device through the IOMMUFD
> response uAPI.
> 
> User space indicates its capability of handling IO page faults by setting
> the IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE flag when allocating a user HWPT.
> IOMMUFD will then set up its infrastructure for page fault delivery.
> Together with the iopf-capable flag, user space should also provide an
> eventfd on which it will listen for incoming page fault messages.
> 
> On successful allocation of an iopf-capable HWPT, a fault fd is also
> returned. User space can read fault messages from it once the eventfd
> is signaled.

This is a performance path, so we really need to think about this more;
polling on an eventfd and then reading a different fd is not a good
design.

What I would like is to have a design from the start that fits into
io_uring, so we can have pre-posted 'recvs' in io_uring that just get
completed at high speed when PRIs come in.

This suggests that the PRI should be delivered via read() on a single
FD and pollability on the single FD without any eventfd.
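For illustration, the userspace side of such a single-FD design could be
as simple as this sketch (the message layout and the helper below are
placeholders, not a proposed uAPI):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Placeholder layout; the actual record format is what needs discussion */
struct iopf_msg {
        uint32_t pasid;
        uint32_t grpid;
        uint64_t addr;
};

static void handle_fault(const struct iopf_msg *msg)
{
        /* resolve the fault, then write a response back on the same fd */
}

static void fault_loop(int fault_fd)
{
        struct pollfd pfd = { .fd = fault_fd, .events = POLLIN };
        struct iopf_msg msg;

        /* fault_fd is assumed to be O_NONBLOCK so the inner loop drains it */
        while (poll(&pfd, 1, -1) > 0) {
                while (read(fault_fd, &msg, sizeof(msg)) == sizeof(msg))
                        handle_fault(&msg);
        }
}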

> Besides the overall design, I'd like to hear comments about below
> designs:
> 
> - The IOMMUFD fault message format. It is very similar to that in
>   uapi/linux/iommu which has been discussed before and partially used by
>   the IOMMU SVA implementation. I'd like to get more comments on the
>   format when it comes to IOMMUFD.

We have to have the same discussion as always, does a generic fault
message format make any sense here?

For PRI it seems more likely that it would, but it needs a careful
cross-vendor check.

Jason


Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

2023-06-19 Thread Jason Gunthorpe
On Fri, Jun 16, 2023 at 12:32:32PM +0100, Jean-Philippe Brucker wrote:

> We might need to revisit supporting stop markers: request that each device
> driver declares whether their device uses stop markers on unbind() ("This
> mechanism must indicate that a Stop Marker Message will be generated."
> says the spec, but doesn't say if the function always uses one or the
> other mechanism so it's per-unbind). Then we still have to synchronize
> unbind() with the fault handler to deal with the pending stop marker,
> which might have already gone through or be generated later.

An explicit API to wait for the stop marker makes sense, with the
expectation that well behaved devices will generate it and well
behaved drivers will wait for it.

Things like VFIO should have a way to barrier/drain the PRI queue
after issuing FLR. ie the VMM processing FLR should also barrier the
real HW queues and flush them to VM visibility.

> with stop markers, the host needs to flush the PRI queue when a PASID is
> detached. I guess on Intel detaching the PASID goes through the host which
> can flush the host queue. On Arm we'll probably need to flush the queue
> when receiving a PASID cache invalidation, which the guest issues after
> clearing a PASID table entry.

We are trying to get ARM to a point where invalidations don't need to
be trapped. It would be good not to rely on that anywhere.

Jason


Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

2023-06-23 Thread Jason Gunthorpe
On Fri, Jun 23, 2023 at 02:18:38PM +0800, Baolu Lu wrote:

>   struct io_uring ring;
> 
>   io_uring_queue_init(IOPF_ENTRIES, &ring, 0);
> 
>   while (1) {
>           struct io_uring_sqe *sqe;
>           struct io_uring_cqe *cqe;
>           void *buf = malloc(IOPF_SIZE);
> 
>           // Post a read for the next fault message
>           sqe = io_uring_get_sqe(&ring);
>           io_uring_prep_read(sqe, iopf_fd, buf, IOPF_SIZE, 0);
>           io_uring_submit(&ring);
> 
>           // Wait for the read to complete
>           if (io_uring_wait_cqe(&ring, &cqe) < 0)
>                   break;
> 
>           // Check if the read completed
>           if (cqe->res <= 0) {
>                   io_uring_cqe_seen(&ring, cqe);
>                   break;
>           }
> 
>           // Handle the page fault
>           handle_page_fault(buf);
>           io_uring_cqe_seen(&ring, cqe);
> 
>           // Respond to the fault
>           void *resp = malloc(IOPF_RESPONSE_SIZE);
> 
>           // ... fill in the response message for the handled fault ...
>           sqe = io_uring_get_sqe(&ring);
>           io_uring_prep_write(sqe, iopf_fd, resp, IOPF_RESPONSE_SIZE, 0);
>           io_uring_submit(&ring);
> 
>           // Reap the write completion
>           io_uring_wait_cqe(&ring, &cqe);
>           io_uring_cqe_seen(&ring, cqe);
> 
>           free(resp);
>           free(buf);
>   }
> 
> Did I understand you correctly?

Yes, basically this is the right idea. There are more complex ways to
use io_uring that would be faster still.

And the kernel side can have support to speed it up as well.

I'm wondering if we should be pushing invalidations on io_uring as
well?

Jason


Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

2023-06-26 Thread Jason Gunthorpe
On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:

> Agreed. We should avoid the workqueue in the sva iopf framework. Perhaps we
> could go ahead with the code below? It will be registered to the device with
> iommu_register_device_fault_handler() in the IOMMU_DEV_FEAT_IOPF enabling
> path, and unregistered in the disable path, of course.

This maze needs to be undone as well.

It makes no sense that all the drivers are calling 

 iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);

The driver should RX a PRI fault and deliver it to some core code
function, this looks like a good start:

> static int io_pgfault_handler(struct iommu_fault *fault, void *cookie)
> {
>         ioasid_t pasid = fault->prm.pasid;
>         struct device *dev = cookie;
>         struct iommu_domain *domain;
> 
>         if (fault->type != IOMMU_FAULT_PAGE_REQ)
>                 return -EOPNOTSUPP;
> 
>         if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
>                 domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
>         else
>                 domain = iommu_get_domain_for_dev(dev);
> 
>         if (!domain || !domain->iopf_handler)
>                 return -ENODEV;
> 
>         if (domain->type == IOMMU_DOMAIN_SVA)
>                 return iommu_queue_iopf(fault, cookie);
> 
>         return domain->iopf_handler(fault, dev, domain->fault_data);
> }

Then we find the domain that owns the translation and invoke its
domain->ops->iopf_handler()

If the driver created a SVA domain then the op should point to some
generic 'handle sva fault' function. There shouldn't be weird SVA
stuff in the core code.

The weird SVA stuff is really just a generic per-device workqueue
dispatcher, so if we think that is valuable then it should be
integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
true for instance). Then it could route the fault through the
workqueue and still invoke domain->ops->iopf_handler.
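
Roughly, the core dispatch would then look like this sketch
(use_iopf_workqueue and queue_iopf_work() are illustrative names for the
idea above, not existing code):

static int iommu_deliver_iopf(struct iommu_fault *fault, struct device *dev,
                              struct iommu_domain *domain)
{
        if (!domain || !domain->iopf_handler)
                return -ENODEV;

        /* route through the generic per-device workqueue if requested */
        if (domain->ops->use_iopf_workqueue)
                return queue_iopf_work(dev, fault, domain);

        return domain->iopf_handler(fault, dev, domain->fault_data);
}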

The word "SVA" should not appear in any of this.

Not sure what iommu_register_device_fault_handler() has to do with all
of this.. Setting up the dev_iommu stuff to allow for the workqueue
should happen dynamically during domain attach, ideally in the core
code before calling to the driver.

Also, I can understand there is a need to turn on PRI support really
early, and it can make sense to have some IOMMU_DEV_FEAT_IOPF/SVA to
ask to turn it on.. But that should really only be needed if the HW
cannot turn it on dynamically during domain attach of a PRI enabled
domain.

It needs cleaning up..

Jason


Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

2023-06-28 Thread Jason Gunthorpe
On Wed, Jun 28, 2023 at 10:00:56AM +0800, Baolu Lu wrote:
> > If the driver created a SVA domain then the op should point to some
> > generic 'handle sva fault' function. There shouldn't be weird SVA
> > stuff in the core code.
> > 
> > The weird SVA stuff is really just a generic per-device workqueue
> > dispatcher, so if we think that is valuable then it should be
> > integrated into the iommu_domain (domain->ops->use_iopf_workqueue =
> > true for instance). Then it could route the fault through the
> > workqueue and still invoke domain->ops->iopf_handler.
> > 
> > The word "SVA" should not appear in any of this.
> 
> Yes. We should make it generic. The domain->use_iopf_workqueue flag
> denotes that the page faults of a fault group should be put together and
> then be handled and responded in a workqueue. Otherwise, the page fault
> is dispatched to domain->iopf_handler directly.

It might be better to have iopf_handler and
iopf_handler_work function pointers to distinguish the two cases.

> > Not sure what iommu_register_device_fault_handler() has to do with all
> > of this.. Setting up the dev_iommu stuff to allow for the workqueue
> > should happen dynamically during domain attach, ideally in the core
> > code before calling to the driver.
> 
> There are two pointers under struct dev_iommu for fault handling.
> 
> /**
>  * struct dev_iommu - Collection of per-device IOMMU data
>  *
>  * @fault_param: IOMMU detected device fault reporting data
>  * @iopf_param:  I/O Page Fault queue and data
> 
> [...]
> 
> struct dev_iommu {
>         struct mutex lock;
>         struct iommu_fault_param        *fault_param;
>         struct iopf_device_param        *iopf_param;
> 
> My understanding is that @fault_param is a place holder for generic
> things, while @iopf_param is workqueue specific.

Well, lets look

struct iommu_fault_param {
        iommu_dev_fault_handler_t handler;
        void *data;

These two make no sense now. handler is always iommu_queue_iopf. Given
our domain-centric design we want the function pointer in the domain,
not in the device. So delete them.

        struct list_head faults;
        struct mutex lock;

Queue of unhandled/unacked faults? Seems sort of reasonable

> @iopf_param could be allocated on demand (perhaps renaming it to something
> more meaningful?). It happens before a domain with the use_iopf_workqueue
> flag set attaches to a device. iopf_param stays alive until device_release.

Yes

Do this for the iommu_fault_param as well; in fact, probably just put
the two things together in one allocation and allocate it when we attach
a PRI-using domain. I don't think we need to micro-optimize further.
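
Something along these lines, i.e. one lazily allocated per-device
structure (the name and field layout here are illustrative only):

struct dev_iommu_fault {
        struct mutex lock;
        struct list_head faults;        /* unacked fault groups, was fault_param */
        struct iopf_queue *queue;       /* workqueue state, was iopf_param */
};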
 
Jason


Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-14 Thread Jason Gunthorpe
On Fri, Jul 14, 2023 at 09:05:21AM +0200, Christian Brauner wrote:

> I have no skin in the game aside from having to drop this conversion
> which I'm fine to do if there are actually users for this btu really,
> that looks a lot like abusing an api that really wasn't designed for
> this.

Yeah, I think so too. The ACPI thing should use its own FD if it wants
to feed actual data..

Jason



Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-17 Thread Jason Gunthorpe
On Mon, Jul 17, 2023 at 01:08:31PM -0600, Alex Williamson wrote:

> What would that mechanism be?  We've been iterating on getting the
> serialization and buffering correct, but I don't know of another means
> that combines the notification with a value, so we'd likely end up with
> an eventfd only for notification and a separate ring buffer for
> notification values.

All FDs do this. You just have to make a FD with custom
file_operations that does what this wants. The uAPI shouldn't be able
to tell if the FD is backing it with an eventfd or otherwise. Have the
kernel return the FD instead of accepting it. Follow the basic design
of eg mlx5vf_save_fops
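
The shape of that is roughly as follows (sketch only; the acpi_notify_*
names are made up here, mlx5vf_save_fops is just the pattern to copy):

static const struct file_operations acpi_notify_fops = {
        .owner   = THIS_MODULE,
        .read    = acpi_notify_read,    /* dequeue one notification + value */
        .poll    = acpi_notify_poll,    /* POLLIN while the queue is non-empty */
        .release = acpi_notify_release,
};

static int acpi_notify_get_fd(struct acpi_notify_ctx *ctx)
{
        struct file *filp;
        int fd;

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0)
                return fd;

        filp = anon_inode_getfile("[acpi-notify]", &acpi_notify_fops, ctx,
                                  O_RDONLY);
        if (IS_ERR(filp)) {
                put_unused_fd(fd);
                return PTR_ERR(filp);
        }

        fd_install(fd, filp);
        return fd;      /* handed back to userspace instead of taking an eventfd */
}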

Jason


Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-18 Thread Jason Gunthorpe
On Mon, Jul 17, 2023 at 04:52:03PM -0600, Alex Williamson wrote:
> On Mon, 17 Jul 2023 19:12:16 -0300
> Jason Gunthorpe  wrote:
> 
> > On Mon, Jul 17, 2023 at 01:08:31PM -0600, Alex Williamson wrote:
> > 
> > > What would that mechanism be?  We've been iterating on getting the
> > > serialization and buffering correct, but I don't know of another means
> > > that combines the notification with a value, so we'd likely end up with
> > > an eventfd only for notification and a separate ring buffer for
> > > notification values.  
> > 
> > All FDs do this. You just have to make a FD with custom
> > file_operations that does what this wants. The uAPI shouldn't be able
> > to tell if the FD is backing it with an eventfd or otherwise. Have the
> > kernel return the FD instead of accepting it. Follow the basic design
> > of eg mlx5vf_save_fops
> 
> Sure, userspace could poll on any fd and read a value from it, but at
> that point we're essentially duplicating a lot of what eventfd provides
> for a minor(?) semantic difference over how the counter value is
> interpreted.  Using an actual eventfd allows the ACPI notification to
> work as just another interrupt index within the existing vfio IRQ
> uAPI.

Yes, duplicated, sort of. Whatever the "ack" is that allows pushing a
new value can be revised to run as part of the read.

But I don't really view it as a minor difference. eventfd is a
counter. It should not be abused otherwise, even if it can be made to
work.

It really isn't an IRQ if it is pushing an async message w/data.

Jason


Re: [RFC] iommu/virtio: Use single flush queue (EXPERIMENTAL)

2023-08-02 Thread Jason Gunthorpe
On Wed, Aug 02, 2023 at 01:36:12PM +0100, Jean-Philippe Brucker wrote:

> automatically get plugged into a VM without user intervention. Here I
> guess the devices we don't trust will be virtual devices implemented by
> other VMs. We don't have any method to identify them yet, so
> iommu.strict=1 and CONFIG_IOMMU_DEFAULT_DMA_STRICT is the best we can do
> at the moment.

VMs should work the same way as bare metal. The hypervisor should
pass in an ACPI/etc indication if specific devices are to be
untrusted. Otherwise the VM should assume trusted devices.

The hypervisor can already read all of the VM's memory, so it doesn't
make a lot of sense for the VM to try to be defensive here in the
general case.

Jason


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-19 Thread Jason Gunthorpe
On Tue, Sep 19, 2023 at 09:15:19AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
> > > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > > index 17dcd826f5c2..3649586f0e5c 100644
> > > --- a/drivers/iommu/virtio-iommu.c
> > > +++ b/drivers/iommu/virtio-iommu.c
> > > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> > >   int ret;
> > >   unsigned long flags;
> > > + /*
> > > +  * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > > +  * is initialized e.g. via iommu_create_device_direct_mappings()
> > > +  */
> > > + if (!viommu)
> > > + return 0;
> > 
> > Minor nit: I'd be inclined to make that check explicitly in the places where
> > it definitely is expected, rather than allowing *any* sync to silently do
> > nothing if called incorrectly. Plus then they could use
> > vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
> > (it did take me a moment to figure out how we could get to .iotlb_sync_map
> > with a NULL viommu without viommu_map_pages() blowing up first...)

This makes more sense to me

Ultimately this driver should reach a point where every iommu_domain
always has a non-null domain->viommu because it will be set during
alloc.

But it can still have nr_endpoints == 0; doesn't it make sense to
avoid the sync in this case?
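
i.e. something like this sketch on top of the patch under discussion:

static int viommu_iotlb_sync_map(struct iommu_domain *domain,
                                 unsigned long iova, size_t size)
{
        struct viommu_domain *vdomain = to_viommu_domain(domain);

        /* nothing attached yet, nothing to sync (needs nr_endpoints locking) */
        if (!vdomain->nr_endpoints)
                return 0;
        return viommu_sync_req(vdomain->viommu);
}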

(btw this driver is missing locking around vdomain->nr_endpoints)

> They're not strictly equivalent: this check works around a temporary issue
> with the IOMMU core, which calls map/unmap before the domain is
> finalized.

Where? The above points to iommu_create_device_direct_mappings() but
it doesn't because the pgsize_bitmap == 0:

static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
                                               struct device *dev)
{
        struct iommu_resv_region *entry;
        struct list_head mappings;
        unsigned long pg_size;
        int ret = 0;

        pg_size = domain->pgsize_bitmap ? 1UL << __ffs(domain->pgsize_bitmap) : 0;
        INIT_LIST_HEAD(&mappings);

        if (WARN_ON_ONCE(iommu_is_dma_domain(domain) && !pg_size))

Indeed, the driver should be failing all maps until the domain is
finalized because it has no way to check that the IOVA matches the
eventual aperture.
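
For virtio that could be as simple as this sketch (refusing the map
rather than relying on pgsize_bitmap == 0 to make the core skip it):

static int viommu_map_pages(struct iommu_domain *domain, unsigned long iova,
                            phys_addr_t paddr, size_t pgsize, size_t pgcount,
                            int prot, gfp_t gfp, size_t *mapped)
{
        struct viommu_domain *vdomain = to_viommu_domain(domain);

        /* aperture and page sizes are unknown until the domain is finalised */
        if (!vdomain->viommu)
                return -ENODEV;

        /* ... existing mapping logic ... */
        return 0;
}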

Jason


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-22 Thread Jason Gunthorpe
On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
> > > They're not strictly equivalent: this check works around a temporary issue
> > > with the IOMMU core, which calls map/unmap before the domain is
> > > finalized.
> > 
> > Where? The above points to iommu_create_device_direct_mappings() but
> > it doesn't because the pgsize_bitmap == 0:
> 
> __iommu_domain_alloc() sets pgsize_bitmap in this case:
> 
> /*
>  * If not already set, assume all sizes by default; the driver
>  * may override this later
>  */
> if (!domain->pgsize_bitmap)
> domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

Drivers shouldn't do that.

The core code was fixed to try again with mapping reserved regions to
support these kinds of drivers.

Jason


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-22 Thread Jason Gunthorpe
On Fri, Sep 22, 2023 at 02:13:18PM +0100, Robin Murphy wrote:
> On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
> > > > > They're not strictly equivalent: this check works around a temporary issue
> > > > > with the IOMMU core, which calls map/unmap before the domain is
> > > > > finalized.
> > > > 
> > > > Where? The above points to iommu_create_device_direct_mappings() but
> > > > it doesn't because the pgsize_bitmap == 0:
> > > 
> > > __iommu_domain_alloc() sets pgsize_bitmap in this case:
> > > 
> > >  /*
> > >   * If not already set, assume all sizes by default; the driver
> > >   * may override this later
> > >   */
> > >  if (!domain->pgsize_bitmap)
> > >  domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;
> > 
> > Drivers shouldn't do that.
> > 
> > The core code was fixed to try again with mapping reserved regions to
> > support these kinds of drivers.
> 
> This is still the "normal" code path, really; I think it's only AMD that
> started initialising the domain bitmap "early" and warranted making it
> conditional.

My main point was that iommu_create_device_direct_mappings() should
fail for unfinalized domains; setting pgsize_bitmap to allow it to
succeed is not a nice hack, and it is not necessary now.

What do you think about something like this to replace
iommu_create_device_direct_mappings(), that does enforce things
properly?


static int resv_cmp(void *priv, const struct list_head *llhs,
                    const struct list_head *lrhs)
{
        struct iommu_resv_region *lhs = list_entry(llhs, struct iommu_resv_region, list);
        struct iommu_resv_region *rhs = list_entry(lrhs, struct iommu_resv_region, list);

        if (lhs->start == rhs->start)
                return 0;
        if (lhs->start < rhs->start)
                return -1;
        return 1;
}

static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
                                               struct device *dev)
{
        struct iommu_resv_region *entry;
        struct iommu_resv_region *tmp;
        struct list_head mappings;
        struct list_head direct;
        phys_addr_t cur = 0;
        int ret = 0;

        INIT_LIST_HEAD(&mappings);
        INIT_LIST_HEAD(&direct);

        iommu_get_resv_regions(dev, &mappings);

        list_for_each_entry_safe(entry, tmp, &mappings, list) {
                if (entry->type == IOMMU_RESV_DIRECT)
                        dev->iommu->require_direct = 1;

                if ((domain->type & __IOMMU_DOMAIN_PAGING) &&
                    (entry->type == IOMMU_RESV_DIRECT ||
                     entry->type == IOMMU_RESV_DIRECT_RELAXABLE)) {
                        if (domain->geometry.aperture_start > entry->start ||
                            domain->geometry.aperture_end == 0 ||
                            (domain->geometry.aperture_end - 1) <
                                    (entry->start + entry->length - 1)) {
                                ret = -EINVAL;
                                goto out;
                        }
                        list_move(&entry->list, &direct);
                }
        }

        if (list_empty(&direct))
                goto out;

        /*
         * FW can have overlapping ranges, sort the list by start address
         * and map any duplicated IOVA only once.
         */
        list_sort(NULL, &direct, resv_cmp);
        list_for_each_entry(entry, &direct, list) {
                phys_addr_t start_pfn = entry->start / PAGE_SIZE;
                phys_addr_t last_pfn =
                        (entry->length - 1 + entry->start) / PAGE_SIZE;

                if (start_pfn < cur)
                        start_pfn = cur;

                if (start_pfn <= last_pfn) {
                        ret = iommu_map(domain, start_pfn * PAGE_SIZE,
                                        start_pfn * PAGE_SIZE,
                                        (last_pfn - start_pfn + 1) * PAGE_SIZE,
                                        entry->prot, GFP_KERNEL);
                        if (ret)
                                goto out;
                        cur = last_pfn + 1;
                }
        }

out:
        list_splice(&direct, &mappings);
        iommu_put_resv_regions(dev, &mappings);

        return ret;
}


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-22 Thread Jason Gunthorpe
On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:

> virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> either; it sets it once it's discovered any instance, since apparently it's
> assuming that all instances must support identical page sizes, and thus once
> it's seen one it can work "normally" per the core code's assumptions. It's
> also I think the only driver which has a "finalise" bodge but *can* still
> properly support map-before-attach, by virtue of having to replay mappings
> to every new endpoint anyway.

Well it can't quite do that since it doesn't know the geometry - it
all is sort of guessing and hoping it doesn't explode on replay. If it
knows the geometry it wouldn't need finalize...

> > What do you think about something like this to replace
> > iommu_create_device_direct_mappings(), that does enforce things
> > properly?
> 
> I fail to see how that would make any practical difference. Either the
> mappings can be correctly set up in a pagetable *before* the relevant device
> is attached to that pagetable, or they can't (if the driver doesn't have
> enough information to be able to do so) and we just have to really hope
> nothing blows up in the race window between attaching the device to an empty
> pagetable and having a second try at iommu_create_device_direct_mappings().
> That's a driver-level issue and has nothing to do with pgsize_bitmap either
> way.

Except we don't detect this in the core code correctly, that is my
point. We should detect the aperture conflict, not check pgsize_bitmap
to see if it is the first or second try.

Jason


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 10:48:21AM +0800, Baolu Lu wrote:
> On 9/23/23 7:33 AM, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently it's
> > > assuming that all instances must support identical page sizes, and thus once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > > also I think the only driver which has a "finalise" bodge but *can* still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> The ultimate solution to this problem seems to be to add a device pointer
> to the parameters of ops->domain_alloc. So once the domain is allocated,
> it is fully initialized. Attaching this domain to a device that is not
> compatible will return -EINVAL, and the caller then has to allocate a new
> domain for this device.

Sure, it looks like this, and it works already for default domain
setup.

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 8599394c9157d7..1697f5a2370785 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -637,15 +637,10 @@ static void viommu_event_handler(struct virtqueue *vq)
 
 /* IOMMU API */
 
-static struct iommu_domain *viommu_domain_alloc(unsigned type)
+static struct viommu_domain *__viommu_domain_alloc(void)
 {
struct viommu_domain *vdomain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_DMA &&
-   type != IOMMU_DOMAIN_IDENTITY)
-   return NULL;
-
vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
if (!vdomain)
return NULL;
@@ -654,16 +649,32 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type)
spin_lock_init(&vdomain->mappings_lock);
vdomain->mappings = RB_ROOT_CACHED;
 
+   return vdomain;
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+   struct viommu_domain *vdomain;
+
+   /*
+* viommu sometimes creates an identity domain out of a
+* paging domain, that can't use the global static.
+* During attach it will fill in an identity page table.
+*/
+   if (type != IOMMU_DOMAIN_IDENTITY)
+   return NULL;
+   vdomain = __viommu_domain_alloc();
+   if (!vdomain)
+   return NULL;
return &vdomain->domain;
 }
 
 static int viommu_domain_finalise(struct viommu_endpoint *vdev,
- struct iommu_domain *domain)
+ struct viommu_domain *vdomain)
 {
int ret;
unsigned long viommu_page_size;
struct viommu_dev *viommu = vdev->viommu;
-   struct viommu_domain *vdomain = to_viommu_domain(domain);
 
viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
if (viommu_page_size > PAGE_SIZE) {
@@ -680,13 +691,13 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
 
vdomain->id = (unsigned int)ret;
 
-   domain->pgsize_bitmap   = viommu->pgsize_bitmap;
-   domain->geometry= viommu->geometry;
+   vdomain->domain.pgsize_bitmap = viommu->pgsize_bitmap;
+   vdomain->domain.geometry = viommu->geometry;
 
vdomain->map_flags  = viommu->map_flags;
vdomain->viommu = viommu;
 
-   if (domain->type == IOMMU_DOMAIN_IDENTITY) {
+   if (vdomain->domain.type == IOMMU_DOMAIN_IDENTITY) {
if (virtio_has_feature(viommu->vdev,
   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
vdomain->bypass = true;
@@ -717,6 +728,24 @@ static void viommu_domain_free(struct iommu_domain *domain)
kfree(vdomain);
 }
 
+static struct iommu_domain *viommu_domain_alloc_paging(struct device *dev)
+{
+   struct viommu_domain *vdomain;
+   vdomain = __viommu_domain_alloc();
+   if (!vdomain)
+   return NULL;
+
+   if (dev) {
+   struct viommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+   if (viommu_domain_finalise(vdev, vdomain)) {
+   viommu_domain_free(&vdomain->domain);
+   return NULL;
+   }
+   }
+   return &vdomain->domain;
+}
+
 static int viommu_a

Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:
> On 2023-09-23 00:33, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently 
> > > it's
> > > assuming that all instances must support identical page sizes, and thus 
> > > once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > > also I think the only driver which has a "finalise" bodge but *can* still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > 
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> I think it's entirely reasonable to assume that any direct mappings
> specified for a device are valid for that device and its IOMMU. However, in
> the particular case of virtio, it really shouldn't ever have direct mappings
> anyway, since even if the underlying hardware did have any, the host can
> enforce the actual direct-mapping aspect itself, and just present them as
> unusable regions to the guest.

I assume this machinery is for the ARM GIC ITS page

> Again, that's irrelevant. It can only be about whether the actual
> ->map_pages call succeeds or not. A driver could well know up-front that all
> instances support the same pgsize_bitmap and aperture, and set both at
> ->domain_alloc time, yet still be unable to handle an actual mapping without
> knowing which instance(s) that needs to interact with (e.g. omap-iommu).

I think this is a different issue. The domain is supposed to represent
the actual io pte storage, and the storage is supposed to exist even
when the domain is not attached to anything.

As we said with tegra-gart, it is a bug in the driver if all the
mappings disappear when the last device is detached from the domain.
Driver bugs like this turn into significant issues with vfio/iommufd
as this will result in warn_on's and memory leaking.

So, I disagree that this is something we should be allowing in the API
design. map_pages should succeed (memory allocation failures aside) if
an IOVA within the aperture and valid flags are presented, regardless
of the attachment status. Calling map_pages with an IOVA outside the
aperture should be a caller bug.

It looks like omap is just mis-designed to store the pgd in the
omap_iommu, not the omap_iommu_domain :( The pgd is clearly a
per-domain object in our API. And why does every instance need its own
copy of the identical pgd?

Jason


Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

2023-11-02 Thread Jason Gunthorpe
On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.

Having now looked more closely at the ARM requirements, it seems we
will need generic events, not just page fault events, to have a
complete emulation.

So I'd like to see this generalized into a channel to carry any
events..

> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.

This is the right way to approach it, and more broadly this shouldn't
be an iommufd specific thing. Kernel drivers will also need to create
fault capable PAGING iommu domains.

Jason


Re: [RFC PATCH 2/5] iommu/vt-d: Add generic IO page table support

2023-11-06 Thread Jason Gunthorpe
On Mon, Nov 06, 2023 at 02:12:23AM -0500, Tina Zhang wrote:
> Add basic hook up code to implement generic IO page table framework.
> 
> Signed-off-by: Tina Zhang 
> ---
>  drivers/iommu/intel/Kconfig |  1 +
>  drivers/iommu/intel/iommu.c | 94 +
>  drivers/iommu/intel/iommu.h |  7 +++
>  drivers/iommu/io-pgtable.c  |  3 ++
>  include/linux/io-pgtable.h  |  2 +
>  5 files changed, 107 insertions(+)

If this is going to happen can we also convert vt-d to actually use
the io page table stuff directly and shuffle the code around so it is
structured like the rest of the io page table implementations?

Jason


Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

2023-11-07 Thread Jason Gunthorpe
On Tue, Nov 07, 2023 at 08:35:10AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, November 2, 2023 8:48 PM
> >
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > Hi folks,
> > >
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework for nested translation.
> > Nested
> > > translation is a hardware feature that supports two-stage translation
> > > tables for IOMMU. The second-stage translation table is managed by the
> > > host VMM, while the first-stage translation table is owned by user
> > > space. This allows user space to control the IOMMU mappings for its
> > > devices.
> > 
> > Having now looked more closely at the ARM requirements it seems we
> > will need generic events, not just page fault events to have a
> > complete emulation.
> 
> Can you elaborate?

There are many events related to objects in guest memory or controlled
by the guest, e.g. C_BAD_CD and C_BAD_STE. These should be relayed or
the emulation will not work well.

> > > User space indicates its capability of handling IO page faults by
> > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > > for page fault delivery. On a successful return of HWPT allocation, the
> > > user can retrieve and respond to page faults by reading and writing to
> > > the file descriptor (FD) returned in out_fault_fd.
> > 
> > This is the right way to approach it, and more broadly this shouldn't
> > be an iommufd specific thing. Kernel drivers will also need to create
> > fault capable PAGING iommu domains.
> 
> Are you suggesting a common interface used by both iommufd and
> kernel drivers?

Yes
 
> but I didn't get the last piece. If those domains are created by kernel
> drivers why would they require a uAPI for userspace to specify fault
> capable?

Not to userspace, but a kAPI to request a fault-capable domain and to
supply the fault handler, e.g.:

 iommu_domain_alloc_faultable(dev, handler);
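
Roughly (sketch, illustrative prototype only):

struct iommu_domain *
iommu_domain_alloc_faultable(struct device *dev,
                             int (*handler)(struct iopf_group *group, void *data),
                             void *data);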

Jason


Re: [PATCH RFC 00/17] Solve iommu probe races around iommu_fwspec

2023-11-08 Thread Jason Gunthorpe
On Wed, Nov 08, 2023 at 06:34:58PM +, André Draszik wrote:

> For me, it's working fine so far on master, and I've also done my own back port
> to 6.1 and am currently testing both. An official back port once finalised
> could be useful, though :-)

Great, I'll post a non-RFC version next week (LPC permitting)

BTW, kbuild 0-day caught your note in the other email and a bunch of
other wonky stuff I've fixed on the github version.

Thanks,
Jason



Re: [RFC PATCH 2/5] iommu/vt-d: Add generic IO page table support

2023-11-08 Thread Jason Gunthorpe
On Thu, Nov 09, 2023 at 12:10:59AM +, Zhang, Tina wrote:

> > If this is going to happen can we also convert vt-d to actually use the io
> > page table stuff directly and shuffle the code around so it is structured
> > like the rest of the io page table implementations?

> Converting the VT-d driver to use the io page table framework involves a
> lot of code change. I made a local version of it, and it didn't show much
> benefit.

Well, it structures the code in a similar way to all the other
drivers, though I admit to having not looked closely at the io page
table stuff.
 
> VT-d only supports one 1st-stage IO pgtable format and one 2nd-stage
> IO pgtable format. So, the current IO pgtable handling operations seem
> more efficient compared to adding another layer of callbacks in them.

I would like to de-virtualize those callbacks; it is completely
pointless when we have per-format iommu_domain ops now.

Jason



Re: [PATCH RFC 10/17] acpi: Do not use dev->iommu within acpi_iommu_configure()

2023-11-13 Thread Jason Gunthorpe
On Sun, Nov 12, 2023 at 09:44:18AM -0800, Moritz Fischer wrote:
> On Fri, Nov 03, 2023 at 01:44:55PM -0300, Jason Gunthorpe wrote:
> > This call chain is using dev->iommu->fwspec to pass around the fwspec
> > between the three parts (acpi_iommu_configure(), acpi_iommu_fwspec_init(),
> > iommu_probe_device()).
> > 
> > However there is no locking around the accesses to dev->iommu, so this is
> > all racy.
> > 
> > Allocate a clean, local, fwspec at the start of acpu_iommu_configure(),
> Nit: s/acpu_iommu_configure/acpi_iommu_configure_id() ?

Yep

Thanks
Jason

> > pass it through all functions on the stack to fill it with data, and
> > finally pass it into iommu_probe_device_fwspec() which will load it into
> > dev->iommu under a lock.
> > 
> > Signed-off-by: Jason Gunthorpe 
> 
> Reviewed-by: Moritz Fischer 
> > ---
> >  drivers/acpi/arm64/iort.c | 39 -
> >  drivers/acpi/scan.c   | 89 ++-
> >  drivers/acpi/viot.c   | 44 ++-
> >  drivers/iommu/iommu.c |  5 +--
> >  include/acpi/acpi_bus.h   |  8 ++--
> >  include/linux/acpi_iort.h |  3 +-
> >  include/linux/acpi_viot.h |  5 ++-
> >  include/linux/iommu.h |  2 +
> >  8 files changed, 97 insertions(+), 98 deletions(-)
> > 
> > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
> > index 6496ff5a6ba20d..accd01dcfe93f5 100644
> > --- a/drivers/acpi/arm64/iort.c
> > +++ b/drivers/acpi/arm64/iort.c
> > @@ -1218,10 +1218,9 @@ static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
> > return pci_rc->ats_attribute & ACPI_IORT_ATS_SUPPORTED;
> >  }
> >  
> > -static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
> > -                           u32 streamid)
> > +static int iort_iommu_xlate(struct iommu_fwspec *fwspec, struct device *dev,
> > +                           struct acpi_iort_node *node, u32 streamid)
> >  {
> > -   const struct iommu_ops *ops;
> > struct fwnode_handle *iort_fwnode;
> >  
> > if (!node)
> > @@ -1239,17 +1238,14 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
> >  * in the kernel or not, defer the IOMMU configuration
> >  * or just abort it.
> >  */
> > -   ops = iommu_ops_from_fwnode(iort_fwnode);
> > -   if (!ops)
> > -   return iort_iommu_driver_enabled(node->type) ?
> > -  -EPROBE_DEFER : -ENODEV;
> > -
> > -   return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops);
> > +   return acpi_iommu_fwspec_init(fwspec, dev, streamid, iort_fwnode,
> > + iort_iommu_driver_enabled(node->type));
> >  }
> >  
> >  struct iort_pci_alias_info {
> > struct device *dev;
> > struct acpi_iort_node *node;
> > +   struct iommu_fwspec *fwspec;
> >  };
> >  
> >  static int iort_pci_iommu_init(struct pci_dev *pdev, u16 alias, void *data)
> > @@ -1260,7 +1256,7 @@ static int iort_pci_iommu_init(struct pci_dev *pdev, u16 alias, void *data)
> >  
> > parent = iort_node_map_id(info->node, alias, &streamid,
> >   IORT_IOMMU_TYPE);
> > -   return iort_iommu_xlate(info->dev, parent, streamid);
> > +   return iort_iommu_xlate(info->fwspec, info->dev, parent, streamid);
> >  }
> >  
> >  static void iort_named_component_init(struct device *dev,
> > @@ -1280,7 +1276,8 @@ static void iort_named_component_init(struct device *dev,
> > dev_warn(dev, "Could not add device properties\n");
> >  }
> >  
> > -static int iort_nc_iommu_map(struct device *dev, struct acpi_iort_node *node)
> > +static int iort_nc_iommu_map(struct iommu_fwspec *fwspec, struct device *dev,
> > +                            struct acpi_iort_node *node)
> >  {
> > struct acpi_iort_node *parent;
> > int err = -ENODEV, i = 0;
> > @@ -1293,13 +1290,13 @@ static int iort_nc_iommu_map(struct device *dev, struct acpi_iort_node *node)
> >i++);
> >  
> > if (parent)
> > -   err = iort_iommu_xlate(dev, parent, streamid);
> > +   err = iort_iommu_xlate(fwspec, dev, parent, streamid);
> > } while (parent && !err);
> >  
> > return err;
> >  }
> >  
> > -static int iort_nc_iommu_map_id(struct device *de

Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

2024-03-06 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 11:15:50PM +0800, Zhangfei Gao wrote:
> 
> Double checked, this does not send flags, 0 is OK,
> Only domain_alloc_user in iommufd_hwpt_paging_alloc requires flags.
> 
> In my debug, I need this patch, otherwise NULL pointer errors happen
> since SVA is not set.

This is some driver bug; we need to get rid of these
iommu_dev_enable_feature() requirements.

Attaching domains with properties should be entirely sufficient.

Jason



Re: [PATCH v3 2/8] iommu/sva: Use iopf domain attach/detach interface

2024-03-08 Thread Jason Gunthorpe
On Mon, Jan 22, 2024 at 03:38:57PM +0800, Lu Baolu wrote:
> @@ -215,7 +202,23 @@ static struct iopf_group *iopf_group_alloc(struct iommu_fault_param *iopf_param,
>   group = abort_group;
>   }
>  
> + cookie = iopf_pasid_cookie_get(iopf_param->dev, pasid);
> + if (!cookie && pasid != IOMMU_NO_PASID)
> + cookie = iopf_pasid_cookie_get(iopf_param->dev, IOMMU_NO_PASID);
> + if (IS_ERR(cookie) || !cookie) {
> + /*
> +  * The PASID of this device was not attached by an I/O-capable
> +  * domain. Ask the caller to abort handling of this fault.
> +  * Otherwise, the reference count will be switched to the new
> +  * iopf group and will be released in iopf_free_group().
> +  */
> + kfree(group);
> + group = abort_group;
> + cookie = NULL;
> + }

If this is the main point of the cookie mechanism - why not just have
a refcount inside the domain itself?

I'm really having a hard time making sense of this cookie thing, we
have a lifetime issue on the domain pointer, why is adding another
structure the answer?

I see we also need to stick a pointer in the domain for iommufd to get
back to the hwpt, but that doesn't seem to need such a big system to
accomplish - just add a void *. It would make sense for the domain to
have some optional pointer to a struct for all the fault related data
that becomes allocated when a PRI domain is created..

Also, I thought we'd basically said that domain detach is supposed to
flush any open PRIs before returning; what happened to that as a
solution to the lifetime problem?
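
i.e. something like this sketch, whether the drain lives inside the
detach itself or right after it (iopf_queue_flush_dev() being the
existing wait-for-pending-faults helper):

static void detach_and_drain(struct iommu_domain *domain, struct device *dev,
                             ioasid_t pasid)
{
        iommu_detach_device_pasid(domain, dev, pasid);  /* no new faults */
        iopf_queue_flush_dev(dev);      /* pending faults handled + responded */
        /* the domain can now go away with no outstanding PRI references */
}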

Regardless I think we should split this into two series - improve the PRI
lifetime model for domains and then put iommufd on top of it..

Jason



Re: [PATCH v3 3/8] iommufd: Add fault and response message definitions

2024-03-08 Thread Jason Gunthorpe
On Mon, Jan 22, 2024 at 03:38:58PM +0800, Lu Baolu wrote:

> +/**
> + * enum iommu_hwpt_pgfault_flags - flags for struct iommu_hwpt_pgfault
> + * @IOMMU_PGFAULT_FLAGS_PASID_VALID: The pasid field of the fault data is
> + *   valid.
> + * @IOMMU_PGFAULT_FLAGS_LAST_PAGE: It's the last fault of a fault group.
> + */
> +enum iommu_hwpt_pgfault_flags {
> + IOMMU_PGFAULT_FLAGS_PASID_VALID = (1 << 0),
> + IOMMU_PGFAULT_FLAGS_LAST_PAGE   = (1 << 1),
> +};
> +
> +/**
> + * enum iommu_hwpt_pgfault_perm - perm bits for struct iommu_hwpt_pgfault
> + * @IOMMU_PGFAULT_PERM_READ: request for read permission
> + * @IOMMU_PGFAULT_PERM_WRITE: request for write permission
> + * @IOMMU_PGFAULT_PERM_EXEC: request for execute permission
> + * @IOMMU_PGFAULT_PERM_PRIV: request for privileged permission

You are going to have to elaborate what PRIV is for.. We don't have
any concept of this in the UAPI for iommufd so what is a userspace
supposed to do if it hits this? EXEC is similar, we can't actually
enable exec permissions from userspace IIRC..

> +enum iommu_hwpt_pgfault_perm {
> + IOMMU_PGFAULT_PERM_READ = (1 << 0),
> + IOMMU_PGFAULT_PERM_WRITE= (1 << 1),
> + IOMMU_PGFAULT_PERM_EXEC = (1 << 2),
> + IOMMU_PGFAULT_PERM_PRIV = (1 << 3),
> +};
> +
> +/**
> + * struct iommu_hwpt_pgfault - iommu page fault data
> + * @size: sizeof(struct iommu_hwpt_pgfault)
> + * @flags: Combination of enum iommu_hwpt_pgfault_flags
> + * @dev_id: id of the originated device
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @perm: Combination of enum iommu_hwpt_pgfault_perm
> + * @addr: page address
> + */
> +struct iommu_hwpt_pgfault {
> + __u32 size;
> + __u32 flags;
> + __u32 dev_id;
> + __u32 pasid;
> + __u32 grpid;
> + __u32 perm;
> + __u64 addr;
> +};

Do we need an addr + size here? I've seen a few things where I wonder
if that might become an enhancement someday.

> +/**
> + * struct iommu_hwpt_page_response - IOMMU page fault response
> + * @size: sizeof(struct iommu_hwpt_page_response)
> + * @flags: Must be set to 0
> + * @dev_id: device ID of target device for the response
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @code: response code. The supported codes include:
> + *0: Successful; 1: Response Failure; 2: Invalid Request.

This should be an enum
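
For instance (illustrative names, mirroring the codes listed above):

enum iommu_hwpt_page_response_code {
        IOMMU_PGFAULT_RESP_SUCCESS = 0,
        IOMMU_PGFAULT_RESP_FAILURE = 1,
        IOMMU_PGFAULT_RESP_INVALID = 2,
};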

> + * @addr: The fault address. Must match the addr field of the
> + *last iommu_hwpt_pgfault of a reported iopf group.
> + */
> +struct iommu_hwpt_page_response {
> + __u32 size;
> + __u32 flags;
> + __u32 dev_id;
> + __u32 pasid;
> + __u32 grpid;
> + __u32 code;
> + __u64 addr;
> +};

Do we want some kind of opaque ID value from the kernel here to match
request with response exactly? Or is the plan to search on the addr?

Jason



Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

2024-03-08 Thread Jason Gunthorpe
On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
> --- /dev/null
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2024 Intel Corporation
> + */
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "iommufd_private.h"
> +
> +static int device_add_fault(struct iopf_group *group)
> +{
> + struct iommufd_device *idev = group->cookie->private;
> + void *curr;
> +
> + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
> +   NULL, group, GFP_KERNEL);
> +
> + return curr ? xa_err(curr) : 0;
> +}
> +
> +static void device_remove_fault(struct iopf_group *group)
> +{
> + struct iommufd_device *idev = group->cookie->private;
> +
> + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
> +  NULL, GFP_KERNEL);

xa_erase ?

Is grpid OK to use this way? Doesn't it come from the originating
device?

> +void iommufd_fault_destroy(struct iommufd_object *obj)
> +{
> + struct iommufd_fault *fault = container_of(obj, struct iommufd_fault, obj);
> + struct iopf_group *group, *next;
> +
> + mutex_lock(&fault->mutex);
> + list_for_each_entry_safe(group, next, &fault->deliver, node) {
> + list_del(&group->node);
> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
> + iopf_free_group(group);
> + }
> + list_for_each_entry_safe(group, next, &fault->response, node) {
> + list_del(&group->node);
> + device_remove_fault(group);
> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
> + iopf_free_group(group);
> + }
> + mutex_unlock(&fault->mutex);
> +
> + mutex_destroy(&fault->mutex);

It is really weird to lock a mutex we are about to destroy? At this
point the refcount on the iommufd_object is zero so there had better
not be any other threads with access to this pointer!

> +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> + struct iommufd_fault *fault = filep->private_data;
> + struct iommu_hwpt_pgfault data;
> + struct iommufd_device *idev;
> + struct iopf_group *group;
> + struct iopf_fault *iopf;
> + size_t done = 0;
> + int rc;
> +
> + if (*ppos || count % fault_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->deliver) && count > done) {
> + group = list_first_entry(&fault->deliver,
> +  struct iopf_group, node);
> +
> + if (list_count_nodes(&group->faults) * fault_size > count - done)
> + break;
> +
> + idev = (struct iommufd_device *)group->cookie->private;
> + list_for_each_entry(iopf, &group->faults, list) {
> + iommufd_compose_fault_message(&iopf->fault, &data, idev);
> + rc = copy_to_user(buf + done, &data, fault_size);
> + if (rc)
> + goto err_unlock;
> + done += fault_size;
> + }
> +
> + rc = device_add_fault(group);

See I wonder if this should be some xa_alloc or something instead of
trying to use the grpid?
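
e.g. something like this instead of keying on grpid (sketch; the xarray
would need to be initialized with XA_FLAGS_ALLOC):

u32 cookie;

rc = xa_alloc(&idev->faults, &cookie, group, xa_limit_32b, GFP_KERNEL);
if (rc)
        goto err_unlock;
/* report 'cookie' in the fault message and xa_erase() it on response */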

> + while (!list_empty(&fault->response) && count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + idev = container_of(iommufd_get_object(fault->ictx,
> +                                         response.dev_id,
> +                                         IOMMUFD_OBJ_DEVICE),
> +                     struct iommufd_device, obj);

It seems unfortunate that we do this lookup for every iteration of the
loop; I would lift it up and cache it..

> + if (IS_ERR(idev))
> + break;
> +
> + group = device_get_fault(idev, response.grpid);

The locking looks wrong here; it should hold the fault mutex, and we
should probably do an xchg to zero it at the same time we fetch it.

> + if (!group ||
> + response.addr != group->last_fault.fault.prm.addr ||
> + ((group->last_fault.fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
> +   response.pasid != group->last_fault.fault.prm.pasid)) {

Why? If grpid is unique then just trust it.

> + iommufd_put_object(fault->ictx, &idev->obj);
> + break;
> + }
> +
> + iopf_group_response(group, response.code);
> +
> + mutex_lock(&fault->mutex);
> + list_del(&group->node);
>

Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

2024-03-08 Thread Jason Gunthorpe
On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:

> @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
>   * @__reserved: Must be 0
>   * @data_type: One of enum iommu_hwpt_data_type
>   * @data_len: Length of the type specific data
> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> + *            IOMMU_HWPT_FAULT_ID_VALID is set.
>   * @data_uptr: User pointer to the type specific data
>   *
>   * Explicitly allocate a hardware page table object. This is the same object
> @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
>   __u32 __reserved;
>   __u32 data_type;
>   __u32 data_len;
> + __u32 fault_id;
>   __aligned_u64 data_uptr;
>  };

?? We can't add fault_id in the middle of the struct??

> + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> + struct iommufd_fault *fault;
> +
> + fault = iommufd_get_fault(ucmd, cmd->fault_id);
> + if (IS_ERR(fault)) {
> + rc = PTR_ERR(fault);
> + goto out_hwpt;
> + }
> + hwpt->fault = fault;
> + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> + hwpt->domain->fault_data = hwpt;
> + hwpt->fault_capable = true;

I wonder if there should be an iommu API to make a domain fault
capable?

Jason



Re: [PATCH v3 2/8] iommu/sva: Use iopf domain attach/detach interface

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 14, 2024 at 03:41:23PM +0800, Baolu Lu wrote:

> The whole cookie mechanism aims to address two things:
> 
> - Extend the domain lifetime until all pending page faults are
> resolved.

Like you answered, I think the flush is a simpler scheme..

> - Associate information about the iommu device with each attachment of
>   the domain so that the iommufd device object ID could be quickly
>   retrieved in the fault delivering path.
 
> > I see we also need to stick a pointer in the domain for iommufd to get
> > back to the hwpt, but that doesn't seem to need such a big system to
> > accomplish - just add a void *. It would make sense for the domain to
> > have some optional pointer to a struct for all the fault related data
> > that becomes allocated when a PRI domain is created..
> 
> It's not getting back hwpt from domain, just getting the iommufd_device
> in the fault delivering path. The iommufd_device is not per-domain, but
> per-domain-attachment.

It does make sense you'd need that, but I think something like this is
a more direct way to get it. The caller allocates the handle struct.
The iopf path will provide the handle from the XA to the callback.
container_of(), not a void *, is used in the caller's API.

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 68e648b5576706..0d29d8f0847cd9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -46,6 +46,8 @@ static unsigned int iommu_def_domain_type __read_mostly;
static bool iommu_dma_strict __read_mostly = IS_ENABLED(CONFIG_IOMMU_DEFAULT_DMA_STRICT);
 static u32 iommu_cmd_line __read_mostly;
 
+enum { IOMMU_PASID_ARRAY_DOMAIN, IOMMU_PASID_ARRAY_HANDLE };
+
 struct iommu_group {
struct kobject kobj;
struct kobject *devices_kobj;
@@ -3516,18 +3518,20 @@ static void __iommu_remove_group_pasid(struct iommu_group *group,
 }
 
 /*
- * iommu_attach_device_pasid() - Attach a domain to pasid of device
+ * __iommu_attach_device_pasid() - Attach a domain to pasid of device
  * @domain: the iommu domain.
  * @dev: the attached device.
  * @pasid: the pasid of the device.
  *
  * Return: 0 on success, or an error.
  */
-int iommu_attach_device_pasid(struct iommu_domain *domain,
- struct device *dev, ioasid_t pasid)
+int __iommu_attach_device_pasid(struct iommu_domain *domain, struct device *dev,
+   ioasid_t pasid,
+   struct iommu_pasid_handle *handle)
 {
/* Caller must be a probed driver on dev */
struct iommu_group *group = dev->iommu_group;
+   void *xa_entry;
void *curr;
int ret;
 
@@ -3541,9 +3545,14 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
return -EINVAL;
 
mutex_lock(&group->mutex);
-   curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
+   if (handle)
+   xa_entry = xa_tag_pointer(handle, IOMMU_PASID_ARRAY_HANDLE);
+   else
+   xa_entry = xa_tag_pointer(domain, IOMMU_PASID_ARRAY_DOMAIN);
+   curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, xa_entry,
+ GFP_KERNEL);
if (curr) {
-   ret = xa_err(curr) ? : -EBUSY;
+   ret = xa_err(curr) ?: -EBUSY;
goto out_unlock;
}
 
@@ -3556,7 +3565,7 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
mutex_unlock(&group->mutex);
return ret;
 }
-EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);
+EXPORT_SYMBOL_GPL(__iommu_attach_device_pasid);
 
 /*
  * iommu_detach_device_pasid() - Detach the domain from pasid of device
@@ -3600,13 +3609,23 @@ struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev,
 {
/* Caller must be a probed driver on dev */
struct iommu_group *group = dev->iommu_group;
+   struct iommu_pasid_handle *handle;
struct iommu_domain *domain;
+   void *xa_entry;
 
if (!group)
return NULL;
 
xa_lock(&group->pasid_array);
-   domain = xa_load(&group->pasid_array, pasid);
+   xa_entry = xa_load(&group->pasid_array, pasid);
+   if (xa_pointer_tag(xa_entry) == IOMMU_PASID_ARRAY_HANDLE) {
+   handle = xa_untag_pointer(xa_entry);
+   domain = handle->domain;
+   } else if (xa_pointer_tag(xa_entry) == IOMMU_PASID_ARRAY_DOMAIN) {
+   domain = xa_untag_pointer(xa_entry);
+   } else {
+   domain = NULL;
+   }
if (type && domain && domain->type != type)
domain = ERR_PTR(-EBUSY);
xa_unlock(&group->pasid_array);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1ea2a820e1eb03..47c121d4e13f65 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -947,8 +947,27 @@ void iommu_device_release_dma_owner(struct device *dev);
 
 struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct
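
To make the container_of() idea concrete, a caller such as iommufd could
embed the handle in its own per-attachment struct. This is only a sketch
based on the diff above; the iommufd-side struct, callback signature and
helper names are assumptions, not the real API:

	struct iommufd_attach {
		struct iommu_pasid_handle handle;
		struct iommufd_device *idev;
	};

	static int iommufd_iopf_cb(struct iopf_group *group,
				   struct iommu_pasid_handle *handle)
	{
		/* container_of(), not a void *, recovers the caller's data */
		struct iommufd_attach *attach =
			container_of(handle, struct iommufd_attach, handle);

		/* hypothetical helper that queues the fault for userspace */
		return iommufd_fault_deliver(attach->idev, group);
	}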

Re: [PATCH v3 3/8] iommufd: Add fault and response message definitions

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 14, 2024 at 09:41:45PM +0800, Baolu Lu wrote:
> On 2024/3/9 1:50, Jason Gunthorpe wrote:
> > On Mon, Jan 22, 2024 at 03:38:58PM +0800, Lu Baolu wrote:
> > 
> > > +/**
> > > + * enum iommu_hwpt_pgfault_flags - flags for struct iommu_hwpt_pgfault
> > > + * @IOMMU_PGFAULT_FLAGS_PASID_VALID: The pasid field of the fault data is
> > > + *   valid.
> > > + * @IOMMU_PGFAULT_FLAGS_LAST_PAGE: It's the last fault of a fault group.
> > > + */
> > > +enum iommu_hwpt_pgfault_flags {
> > > + IOMMU_PGFAULT_FLAGS_PASID_VALID = (1 << 0),
> > > + IOMMU_PGFAULT_FLAGS_LAST_PAGE   = (1 << 1),
> > > +};
> > > +
> > > +/**
> > > + * enum iommu_hwpt_pgfault_perm - perm bits for struct iommu_hwpt_pgfault
> > > + * @IOMMU_PGFAULT_PERM_READ: request for read permission
> > > + * @IOMMU_PGFAULT_PERM_WRITE: request for write permission
> > > + * @IOMMU_PGFAULT_PERM_EXEC: request for execute permission
> > > + * @IOMMU_PGFAULT_PERM_PRIV: request for privileged permission
> > 
> > You are going to have to elaborate what PRIV is for.. We don't have
> > any concept of this in the UAPI for iommufd so what is a userspace
> > supposed to do if it hits this? EXEC is similar, we can't actually
> > enable exec permissions from userspace IIRC..
> 
> The PCIe spec, section "10.4.1 Page Request Message" and "6.20.2 PASID
> Information Layout":
> 
> The PCI PASID TLP Prefix defines "Execute Requested" and "Privileged
> Mode Requested" bits.
> 
> PERM_EXEC indicates a page request with a PASID that has the "Execute
> Requested" bit set. Similarly, PERM_PRIV indicates a page request with a
>  PASID that has "Privileged Mode Requested" bit set.

Oh, I see! OK, maybe just add a note that it follows PCIe 10.4.1

> > > +struct iommu_hwpt_pgfault {
> > > + __u32 size;
> > > + __u32 flags;
> > > + __u32 dev_id;
> > > + __u32 pasid;
> > > + __u32 grpid;
> > > + __u32 perm;
> > > + __u64 addr;
> > > +};
> > 
> > Do we need an addr + size here? I've seen a few things where I wonder
> > if that might become an enhancment someday.
> 
> I am not sure. The page size is not part of ATS/PRI. Can you please
> elaborate a bit about how the size could be used? Perhaps I
> misunderstood here?

size would be a hint of how much data the requestor is expecting to
fetch. E.g. if the PRI initiator knows it is going to do a 10MB transfer
it could fill in 10MB and the OS could pre-fault in 10MB of IOVA.

It is not in the spec, it may never be in the spec, but it seems like
it would be good to consider it, at least make sure we have
compatibility to add it later.

> > > + * @addr: The fault address. Must match the addr field of the
> > > + *last iommu_hwpt_pgfault of a reported iopf group.
> > > + */
> > > +struct iommu_hwpt_page_response {
> > > + __u32 size;
> > > + __u32 flags;
> > > + __u32 dev_id;
> > > + __u32 pasid;
> > > + __u32 grpid;
> > > + __u32 code;
> > > + __u64 addr;
> > > +};
> > 
> > Do we want some kind of opaque ID value from the kernel here to match
> > request with response exactly? Or is the plan to search on the addr?
> 
> I am using the "addr" as the opaque data to search request in this
> series. Is it enough?

I'm not sure, the other discussion about grpid seems to be the main
question so let's see there.

Jason



Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

2024-03-22 Thread Jason Gunthorpe
On Fri, Mar 15, 2024 at 09:16:43AM +0800, Baolu Lu wrote:
> On 3/9/24 3:05 AM, Jason Gunthorpe wrote:
> > On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:
> > 
> > > @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
> > >* @__reserved: Must be 0
> > >* @data_type: One of enum iommu_hwpt_data_type
> > >* @data_len: Length of the type specific data
> > > + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field 
> > > of
> > > + *IOMMU_HWPT_FAULT_ID_VALID is set.
> > >* @data_uptr: User pointer to the type specific data
> > >*
> > >* Explicitly allocate a hardware page table object. This is the same 
> > > object
> > > @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
> > >   __u32 __reserved;
> > >   __u32 data_type;
> > >   __u32 data_len;
> > > + __u32 fault_id;
> > >   __aligned_u64 data_uptr;
> > >   };
> > 
> > ?? We can't add fault_id in the middle of the struct??
> 
> Yes. I should add the new field at the end.
> 
> By the way, with a __u32 added, this data structure is not 64-byte-
> aligned anymore. Do we need to add another unused u32 entry, or just let
> the compiler handle it?

Yes, add a reserved u32 to ensure the struct is always without
implicit padding.

> > 
> > > + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> > > + struct iommufd_fault *fault;
> > > +
> > > + fault = iommufd_get_fault(ucmd, cmd->fault_id);
> > > + if (IS_ERR(fault)) {
> > > + rc = PTR_ERR(fault);
> > > + goto out_hwpt;
> > > + }
> > > + hwpt->fault = fault;
> > > + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> > > + hwpt->domain->fault_data = hwpt;
> > > + hwpt->fault_capable = true;
> > 
> > I wonder if there should be an iommu API to make a domain fault
> > capable?
> 
> The iommu core identifies a fault-capable domain by checking its
> domain->iopf_handler. Anyway, what's the difference between a fault or
> non-fault capable domain from iommu core's point of view?

From the core? Nothing. I'm just wondering from an API perspective if
we should have a little inline to indicate it.

Jason



Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

2024-03-22 Thread Jason Gunthorpe
On Fri, Mar 15, 2024 at 09:46:06AM +0800, Baolu Lu wrote:
> On 3/9/24 2:03 AM, Jason Gunthorpe wrote:
> > On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
> > > --- /dev/null
> > > +++ b/drivers/iommu/iommufd/fault.c
> > > @@ -0,0 +1,255 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/* Copyright (C) 2024 Intel Corporation
> > > + */
> > > +#define pr_fmt(fmt) "iommufd: " fmt
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +#include "iommufd_private.h"
> > > +
> > > +static int device_add_fault(struct iopf_group *group)
> > > +{
> > > + struct iommufd_device *idev = group->cookie->private;
> > > + void *curr;
> > > +
> > > + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
> > > +   NULL, group, GFP_KERNEL);
> > > +
> > > + return curr ? xa_err(curr) : 0;
> > > +}
> > > +
> > > +static void device_remove_fault(struct iopf_group *group)
> > > +{
> > > + struct iommufd_device *idev = group->cookie->private;
> > > +
> > > + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
> > > +  NULL, GFP_KERNEL);
> > 
> > xa_erase ?
> 
> Yes. Sure.
> 
> > Is grpid OK to use this way? Doesn't it come from the originating
> > device?
> 
> The group ID is generated by the hardware. Here, we use it as an index
> in the fault array to ensure it can be quickly retrieved in the page
> fault response path.

I'm nervous about this, we are trusting HW outside the kernel to
provide unique grp id's which are integral to how the kernel
operates..

> > > +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user 
> > > *buf,
> > > +size_t count, loff_t *ppos)
> > > +{
> > > + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> > > + struct iommufd_fault *fault = filep->private_data;
> > > + struct iommu_hwpt_pgfault data;
> > > + struct iommufd_device *idev;
> > > + struct iopf_group *group;
> > > + struct iopf_fault *iopf;
> > > + size_t done = 0;
> > > + int rc;
> > > +
> > > + if (*ppos || count % fault_size)
> > > + return -ESPIPE;
> > > +
> > > + mutex_lock(&fault->mutex);
> > > + while (!list_empty(&fault->deliver) && count > done) {
> > > + group = list_first_entry(&fault->deliver,
> > > +  struct iopf_group, node);
> > > +
> > > + if (list_count_nodes(&group->faults) * fault_size > count - 
> > > done)
> > > + break;
> > > +
> > > + idev = (struct iommufd_device *)group->cookie->private;
> > > + list_for_each_entry(iopf, &group->faults, list) {
> > > + iommufd_compose_fault_message(&iopf->fault, &data, 
> > > idev);
> > > + rc = copy_to_user(buf + done, &data, fault_size);
> > > + if (rc)
> > > + goto err_unlock;
> > > + done += fault_size;
> > > + }
> > > +
> > > + rc = device_add_fault(group);
> > 
> > See I wonder if this should be some xa_alloc or something instead of
> > trying to use the grpid?
> 
> So this magic number will be passed to user space in the fault message.
> And the user will then include this number in its response message. The
> response message is valid only when the magic number matches. Do I get
> you correctly?

Yes, then it is simple xa_alloc() and xa_load() without any other
searching and we don't have to rely on the grpid to be correctly
formed by the PCI device.

But I don't know about performance; xa_alloc() is pretty fast but
trusting the grpid would be faster.

IMHO from a uapi perspective we should have a definite "cookie" that
gets echoed back. If the kernel uses xa_alloc or grpid to build that
cookie it doesn't matter to the uAPI.
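
A minimal sketch of the xa_alloc() approach (the field names such as
fault->response and group->cookie follow later revisions of the series
and are illustrative only):

	/* Delivery side: allocate a kernel-owned cookie instead of trusting grpid */
	rc = xa_alloc(&fault->response, &group->cookie, group,
		      xa_limit_32b, GFP_KERNEL);
	if (rc)
		return rc;
	/* group->cookie is echoed to userspace in the fault message */

	/* Response side: the echoed cookie looks the group back up directly */
	group = xa_erase(&fault->response, response.cookie);
	if (!group)
		return -EINVAL;	/* userspace echoed an unknown cookie */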

Jason



Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

2024-03-22 Thread Jason Gunthorpe
On Wed, Mar 20, 2024 at 04:18:05PM +, Shameerali Kolothum Thodi wrote:
> 
> What I have noticed is that,
> -read interface works fine and I can receive struct iommu_hwpt_pgfault data.
> -But once Guest handles the page faults and returns the page response,
>  the write to fault fd never reaches the kernel. The sequence is like below,
>  
>   sqe = io_uring_get_sqe(ring);
>   io_uring_prep_write(sqe, hwpt->fault_fd, resp, sizeof(*resp), 0);
>   io_uring_sqe_set_data(sqe, resp);
>   io_uring_submit(ring);
>   ret = io_uring_wait_cqe(ring, &cqe); 
>   
> Please find the function here[2]
> 
> The above cqe wait never returns and hardware times out without receiving
> page response. My understanding of io_uring default op is that it tries to 
> issue an sqe as non-blocking first. But it looks like the above write sequence
> ends up in kernel poll_wait() as well. Not sure how we can avoid that for
> write.

Ah, right, it is because poll can't be choosy about read/write, it has
to work equally for both directions. iommufd_fault_fops_poll() never
returns EPOLLOUT.

It should just always return EPOLLOUT because we don't have any queue
to manage.
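
A sketch of what that would look like (the wait-queue field name is an
assumption; deliver and mutex are the fields from the quoted patch):

	static __poll_t iommufd_fault_fops_poll(struct file *filep,
						struct poll_table_struct *wait)
	{
		struct iommufd_fault *fault = filep->private_data;
		__poll_t pollflags = EPOLLOUT;	/* writes never need to wait */

		poll_wait(filep, &fault->wait_queue, wait);
		mutex_lock(&fault->mutex);
		if (!list_empty(&fault->deliver))
			pollflags |= EPOLLIN | EPOLLRDNORM;
		mutex_unlock(&fault->mutex);

		return pollflags;
	}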

Jason



Re: [PATCH v4 1/9] iommu: Introduce domain attachment handle

2024-04-03 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 09:15:11AM +0800, Lu Baolu wrote:
> Currently, when attaching a domain to a device or its PASID, domain is
> stored within the iommu group. It could be retrieved for use during the
> window between attachment and detachment.
> 
> With new features introduced, there's a need to store more information
> than just a domain pointer. This information essentially represents the
> association between a domain and a device. For example, the SVA code
> already has a custom struct iommu_sva which represents a bond between
> sva domain and a PASID of a device. Looking forward, the IOMMUFD needs
> a place to store the iommufd_device pointer in the core, so that the
> device object ID could be quickly retrieved in the critical fault handling
> path.
> 
> Introduce domain attachment handle that explicitly represents the
> attachment relationship between a domain and a device or its PASID.
> A caller-specific data field can be used by the caller to store additional
> information beyond a domain pointer, depending on its specific use case.
> 
> Co-developed-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu-priv.h |   9 +++
>  drivers/iommu/iommu.c  | 158 +
>  2 files changed, 153 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> index 5f731d994803..08c0667cef54 100644
> --- a/drivers/iommu/iommu-priv.h
> +++ b/drivers/iommu/iommu-priv.h
> @@ -28,4 +28,13 @@ void iommu_device_unregister_bus(struct iommu_device 
> *iommu,
>const struct bus_type *bus,
>struct notifier_block *nb);
>  
> +struct iommu_attach_handle {
> + struct iommu_domain *domain;
> + refcount_t  users;

I don't understand how the refcounting can be generally useful. There
is no way to free this:

> + void*priv;

When the refcount goes to zero.

Jason



Re: [PATCH v4 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-04-03 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 09:15:12AM +0800, Lu Baolu wrote:
> + /* A bond already exists, just take a reference`. */
> + handle = iommu_attach_handle_get(group, iommu_mm->pasid);
> + if (handle) {
> + mutex_unlock(&iommu_sva_lock);
> + return handle;
>   }

At least in this context this is not enough; we need to ensure that the
domain on the PASID is actually an SVA domain and it was installed by
this mechanism, not an iommufd domain for instance.

ie you probably need a type field in the iommu_attach_handle to tell
what the priv is.

Otherwise this seems like a great idea!

> - iommu_detach_device_pasid(domain, dev, iommu_mm->pasid);
> - if (--domain->users == 0) {
> - list_del(&domain->next);
> - iommu_domain_free(domain);
> + iommu_attach_handle_put(handle);
> + if (refcount_read(&handle->users) == 1) {
> + iommu_detach_device_pasid(domain, dev, iommu_mm->pasid);
> + if (--domain->users == 0) {
> + list_del(&domain->next);
> + iommu_domain_free(domain);
> + }
>   }

Though I'm not convinced the refcount should be elevated into the core
structure. The prior patch I showed you, where the caller can provide
the memory for the handle and we don't have a priv, would make it easy
to put the refcount in an SVA-derived handle struct without more
allocation. Then we don't need this weirdness.

>   mutex_unlock(&iommu_sva_lock);
> - kfree(handle);

Also do we need iommu_sva_lock here anymore? I wonder if the group
mutex would be sufficient..

Jason



Re: [PATCH v4 1/9] iommu: Introduce domain attachment handle

2024-04-08 Thread Jason Gunthorpe
On Sat, Apr 06, 2024 at 12:34:14PM +0800, Baolu Lu wrote:
> On 4/3/24 7:58 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 03, 2024 at 09:15:11AM +0800, Lu Baolu wrote:
> > > Currently, when attaching a domain to a device or its PASID, domain is
> > > stored within the iommu group. It could be retrieved for use during the
> > > window between attachment and detachment.
> > > 
> > > With new features introduced, there's a need to store more information
> > > than just a domain pointer. This information essentially represents the
> > > association between a domain and a device. For example, the SVA code
> > > already has a custom struct iommu_sva which represents a bond between
> > > sva domain and a PASID of a device. Looking forward, the IOMMUFD needs
> > > a place to store the iommufd_device pointer in the core, so that the
> > > device object ID could be quickly retrieved in the critical fault handling
> > > path.
> > > 
> > > Introduce domain attachment handle that explicitly represents the
> > > attachment relationship between a domain and a device or its PASID.
> > > A caller-specific data field can be used by the caller to store additional
> > > information beyond a domain pointer, depending on its specific use case.
> > > 
> > > Co-developed-by: Jason Gunthorpe
> > > Signed-off-by: Jason Gunthorpe
> > > Signed-off-by: Lu Baolu
> > > ---
> > >   drivers/iommu/iommu-priv.h |   9 +++
> > >   drivers/iommu/iommu.c  | 158 +
> > >   2 files changed, 153 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> > > index 5f731d994803..08c0667cef54 100644
> > > --- a/drivers/iommu/iommu-priv.h
> > > +++ b/drivers/iommu/iommu-priv.h
> > > @@ -28,4 +28,13 @@ void iommu_device_unregister_bus(struct iommu_device 
> > > *iommu,
> > >const struct bus_type *bus,
> > >struct notifier_block *nb);
> > > +struct iommu_attach_handle {
> > > + struct iommu_domain *domain;
> > > + refcount_t  users;
> > I don't understand how the refcounting can be generally useful. There
> > is no way to free this:
> > 
> > > + void*priv;
> > When the refcount goes to zero.
> 
> This field is set by the caller, so the caller ensures that the pointer
> can only be freed after iommu domain detachment. For iopf, the caller
> should automatically respond to all outstanding iopf's in the domain
> detach path.
> 
> In the sva case, which uses the workqueue to handle iopf,
> flush_workqueue() is called in the domain detach path to ensure that all
> outstanding iopf's are completed before detach completion.

Which is back to what is the point of the refcount at all?

> +static void iommufd_auto_response_handle(struct iommufd_fault *fault,
> +struct iommu_attach_handle *handle)
> +{
> +   struct iommufd_device *idev = handle->priv;

The caller already has the iommufd_device, so it doesn't need the handle.

> +   struct iopf_group *group;
> +   unsigned long index;
> +
> +   mutex_lock(&fault->mutex);
> +   xa_for_each(&idev->faults, index, group) {
> +   xa_erase(&idev->faults, index);
> +   iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
> +   }
> +   mutex_unlock(&fault->mutex);

This makes sense, yes..

>  void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
>  struct iommufd_device *idev)
>  {
> +   struct iommufd_fault *fault = hwpt->fault;
> +   struct iommu_attach_handle *handle;
> +
> if (WARN_ON(!hwpt->fault_capable))
> return;
> 
> +   handle = iommu_attach_handle_get(idev->igroup->group,
> IOMMU_NO_PASID);
> iommu_detach_group(hwpt->domain, idev->igroup->group);
> iommufd_fault_iopf_disable(idev);

But is this right? Couldn't there be PASID's doing PRI?

Jason



Re: [PATCH v4 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-04-08 Thread Jason Gunthorpe
On Sat, Apr 06, 2024 at 02:09:34PM +0800, Baolu Lu wrote:
> On 4/3/24 7:59 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 03, 2024 at 09:15:12AM +0800, Lu Baolu wrote:
> > > + /* A bond already exists, just take a reference`. */
> > > + handle = iommu_attach_handle_get(group, iommu_mm->pasid);
> > > + if (handle) {
> > > + mutex_unlock(&iommu_sva_lock);
> > > + return handle;
> > >   }
> > At least in this context this is not enough we need to ensure that the
> > domain on the PASID is actually an SVA domain and it was installed by
> > this mechanism, not an iommufd domain for instance.
> > 
> > ie you probably need a type field in the iommu_attach_handle to tell
> > what the priv is.
> > 
> > Otherwise this seems like a great idea!
> 
> Yes, you are right. For the SVA case, I will add the following changes.
> The IOMMUFD path will also need such enhancement. I will update it in
> the next version.

The only use for this is the PRI callbacks right? Maybe instead of
adding a handle type let's just check domain->iopf_handler?

Ie SVA will pass &iommu_sva_iopf_handler as its "type"
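
As a sketch of what the check would look like on the SVA bind path
(assuming the handle layout and users refcount from the earlier patches):

	handle = iommu_attach_handle_get(group, iommu_mm->pasid);
	if (handle) {
		/* only reuse the attachment if it is an SVA bond we installed */
		if (handle->domain->iopf_handler != iommu_sva_iopf_handler)
			return ERR_PTR(-EBUSY);
		refcount_inc(&handle->users);
		return handle;
	}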

Jason



Re: [PATCH v4 1/9] iommu: Introduce domain attachment handle

2024-04-09 Thread Jason Gunthorpe
On Tue, Apr 09, 2024 at 09:53:26AM +0800, Baolu Lu wrote:
> On 4/8/24 10:05 PM, Jason Gunthorpe wrote:
> > > void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> > >   struct iommufd_device *idev)
> > >   {
> > > +   struct iommufd_fault *fault = hwpt->fault;
> > > +   struct iommu_attach_handle *handle;
> > > +
> > >  if (WARN_ON(!hwpt->fault_capable))
> > >  return;
> > > 
> > > +   handle = iommu_attach_handle_get(idev->igroup->group,
> > > IOMMU_NO_PASID);
> > >  iommu_detach_group(hwpt->domain, idev->igroup->group);
> > >  iommufd_fault_iopf_disable(idev);
> > But is this right? Couldn't there be PASID's doing PRI?
> 
> As far as I can see, there are two types of user PASID.
> 
> 1. When a device is assigned to userspace, the PASID table is managed by
>the userspace.
> 
>Userspace doesn't need PASID attach/detach/replace uAPIs in this
>scenario. All I/O page faults are directed to userspace through the
>hw pagetable attached to the RID.
> 
>If hw pagetable is detached from the RID, or a non-iopf-capable
>hw pagetable is attached the RID, the PRI for user PASID is already
>broken.

I would say in this case the special nesting HWPT should have an
indication if PRI should be supported or not when it is created and
that should drive the PRI enablement..

>The current code base doesn't yet support PASID attach/detach/replace
>uAPIs. Therefore, above code is safe and reasonable. However, we will
>need to revisit this code when those APIs become available.

Okay, I see.

Can we do the PASID iommufd stuff soon? What is the plan here?

Jason



Re: [PATCH v4 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-04-09 Thread Jason Gunthorpe
On Tue, Apr 09, 2024 at 10:11:28AM +0800, Baolu Lu wrote:
> On 4/8/24 10:19 PM, Jason Gunthorpe wrote:
> > On Sat, Apr 06, 2024 at 02:09:34PM +0800, Baolu Lu wrote:
> > > On 4/3/24 7:59 PM, Jason Gunthorpe wrote:
> > > > On Wed, Apr 03, 2024 at 09:15:12AM +0800, Lu Baolu wrote:
> > > > > + /* A bond already exists, just take a reference`. */
> > > > > + handle = iommu_attach_handle_get(group, iommu_mm->pasid);
> > > > > + if (handle) {
> > > > > + mutex_unlock(&iommu_sva_lock);
> > > > > + return handle;
> > > > >   }
> > > > At least in this context this is not enough we need to ensure that the
> > > > domain on the PASID is actually an SVA domain and it was installed by
> > > > this mechanism, not an iommufd domain for instance.
> > > > 
> > > > ie you probably need a type field in the iommu_attach_handle to tell
> > > > what the priv is.
> > > > 
> > > > Otherwise this seems like a great idea!
> > > Yes, you are right. For the SVA case, I will add the following changes.
> > > The IOMMUFD path will also need such enhancement. I will update it in
> > > the next version.
> > The only use for this is the PRI callbacks right? Maybe instead of
> > adding a handle type let's just check domain->iopf_handler  ?
> > 
> > Ie SVA will pass &iommu_sva_iopf_handler as its "type"
> 
> Sorry that I don't fully understand the proposal here.

I was talking specifically about the type field you suggested adding
to the handle struct.

Instead of adding a type field, check the domain->iopf_handler to
determine the domain and thus handle type.

> The problem is that the context code (SVA, IOMMUFD, etc.) needs to make
> sure that the attach handle is really what it has installed during
> domain attachment. The context code needs some mechanism to include some
> kind of "owner cookie" in the attach handle, so that it could check
> against it later for valid use.

Right, you have a derived struct for each user and you need a way to
check if casting from the general handle struct to the derived struct
is OK.

I'm suggesting using domain->iopf_handler as the type key.

Jason



Re: [PATCH v4 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-04-29 Thread Jason Gunthorpe
On Sun, Apr 28, 2024 at 06:22:28PM +0800, Baolu Lu wrote:

> /* A bond already exists, just take a reference`. */
> handle = iommu_attach_handle_get(group, iommu_mm->pasid);
> if (handle) {
> if (handle->domain->iopf_handler != iommu_sva_iopf_handler)
> {
> ret = -EBUSY;
> goto out_unlock;
> }
> 
> refcount_inc(&handle->users);
> mutex_unlock(&iommu_sva_lock);
> return handle;
> }
> 
> But it appears that this code is not lock safe. If the domain on the
> PASID is not a SVA domain, the check of "handle->domain->iopf_handler !=
> iommu_sva_iopf_handler" could result in a use-after-free issue as the
> other thread might detach the domain in between the fetch and check
> lines.

For the above you just need to pass in the iommu_sva_iopf_handler as
an argument to attach_handle_get() and have it check it under the
xa_lock.

The whole thing is already protected under the ugly sva_lock.

Ideally it would be protected by the group mutex..

Jason



Re: [PATCH v5 3/9] iommu: Add attachment handle to struct iopf_group

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:04PM +0800, Lu Baolu wrote:
> @@ -206,8 +197,11 @@ void iommu_report_device_fault(struct device *dev, 
> struct iopf_fault *evt)
>   if (group == &abort_group)
>   goto err_abort;
>  
> - group->domain = get_domain_for_iopf(dev, fault);
> - if (!group->domain)
> + if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ||
> + get_attach_handle_for_iopf(dev, fault->prm.pasid, group))
> + get_attach_handle_for_iopf(dev, IOMMU_NO_PASID, group);

That seems a bit weird looking?

get_attach_handle_for_iopf(dev,
			   (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ?
				fault->prm.pasid : IOMMU_NO_PASID,
			   group);

Jason



Re: [PATCH v5 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:03PM +0800, Lu Baolu wrote:
> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> index da1addaa1a31..ae65e0b85d69 100644
> --- a/drivers/iommu/iommu-priv.h
> +++ b/drivers/iommu/iommu-priv.h
> @@ -30,6 +30,13 @@ void iommu_device_unregister_bus(struct iommu_device 
> *iommu,
>  
>  struct iommu_attach_handle {
>   struct iommu_domain *domain;
> + union {
> + /* attach data for SVA domain */
> + struct {
> + struct device   *dev;
> + refcount_t  users;
> + };
> + };
>  };

FWIW I was thinking of having the caller allocate the handle and pass it
down, but this seems workable too and is a bit simpler.

> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index bdc2e6fda782..b325097421c1 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -106,7 +106,7 @@ static long uacce_fops_compat_ioctl(struct file *filep,
>  static int uacce_bind_queue(struct uacce_device *uacce, struct uacce_queue 
> *q)
>  {
>   u32 pasid;
> - struct iommu_sva *handle;
> + struct iommu_attach_handle *handle;

Though I'm much less keen on this..

Maybe

  struct iommu_attach_handle {
struct iommu_domain *domain;
union {
  struct iommu_sva sva;
};
  };

?

Then container_of(sva) to get back to handle and keep the meaningful
type?
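
i.e. something like this sketch (the helper name is made up):

	/* recover the handle from the embedded iommu_sva member */
	static inline struct iommu_attach_handle *
	to_attach_handle(struct iommu_sva *sva)
	{
		return container_of(sva, struct iommu_attach_handle, sva);
	}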

Jason



Re: [PATCH v5 5/9] iommufd: Add iommufd fault object

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:06PM +0800, Lu Baolu wrote:
> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> index ae65e0b85d69..1a0450a83bd0 100644
> --- a/drivers/iommu/iommu-priv.h
> +++ b/drivers/iommu/iommu-priv.h
> @@ -36,6 +36,10 @@ struct iommu_attach_handle {
>   struct device   *dev;
>   refcount_t  users;
>   };
> + /* attach data for IOMMUFD */
> + struct {
> + void*idev;
> + };

We can use a proper type here, just forward declare it.

But this sequence in the other patch:

+   ret = iommu_attach_group(hwpt->domain, idev->igroup->group);
+   if (ret) {
+   iommufd_fault_iopf_disable(idev);
+   return ret;
+   }
+
+   handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 
0);
+   handle->idev = idev;

Is why I was imagining the caller would allocate, because now we have
the issue that a fault-capable domain was installed into the IOMMU
before its handle could be fully set up, so we have a race where a
fault could come in right between those things. Then what happens?
I suppose we can retry the fault and by the time it comes back the
race should resolve. A bit ugly I suppose.

> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 83b45dce94a4..1819b28e9e6b 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -50,6 +50,7 @@ enum {
>   IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
>   IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
>   IOMMUFD_CMD_HWPT_INVALIDATE,
> + IOMMUFD_CMD_FAULT_ALLOC,
>  };

I think I'd call this a CMD_FAULT_QUEUE_ALLOC - does that make sense?

Jason



Re: [PATCH v5 6/9] iommufd: Fault-capable hwpt attach/detach/replace

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:07PM +0800, Lu Baolu wrote:
> diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> index 13125c0feecb..6357229bf3b4 100644
> --- a/drivers/iommu/iommufd/fault.c
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -15,6 +15,124 @@
>  #include "../iommu-priv.h"
>  #include "iommufd_private.h"
>  
> +static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
> +{
> + int ret;
> +
> + if (idev->iopf_enabled)
> + return 0;
> +
> + ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
> + if (ret)
> + return ret;
> +
> + idev->iopf_enabled = true;
> +
> + return 0;
> +}

I would greatly prefer we remove this from the drivers :\ I guess it
is Ok for now

Doesn't this need a counter? We can have many fault capable PASIDs?
That will get changed in the PASID series?

Jason




Re: [PATCH v5 5/9] iommufd: Add iommufd fault object

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:06PM +0800, Lu Baolu wrote:
> +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> + struct iommufd_fault *fault = filep->private_data;
> + struct iommu_hwpt_pgfault data;
> + struct iommufd_device *idev;
> + struct iopf_group *group;
> + struct iopf_fault *iopf;
> + size_t done = 0;
> + int rc;
> +
> + if (*ppos || count % fault_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->deliver) && count > done) {
> + group = list_first_entry(&fault->deliver,
> +  struct iopf_group, node);
> +
> + if (list_count_nodes(&group->faults) * fault_size > count - 
> done)
> + break;

Can this list_count be precomputed when we build the fault group?

> +
> + idev = group->attach_handle->idev;
> + if (!idev)
> + break;

This check should be done before adding the fault to the linked
list. See my other note about the race.

> +
> + rc = xa_alloc(&idev->faults, &group->cookie, group,
> +   xa_limit_32b, GFP_KERNEL);
> + if (rc)
> + break;

This error handling is not quite right, if done == 0 then this should
return rc.


> +
> + list_for_each_entry(iopf, &group->faults, list) {
> + iommufd_compose_fault_message(&iopf->fault,
> +   &data, idev,
> +   group->cookie);
> + rc = copy_to_user(buf + done, &data, fault_size);
> + if (rc) {
> + xa_erase(&idev->faults, group->cookie);
> + break;

Same here

(same comment on the write side too)

Jason



Re: [PATCH v5 7/9] iommufd: Associate fault object with iommufd_hw_pgtable

2024-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2024 at 10:57:08PM +0800, Lu Baolu wrote:
>  /**
> @@ -412,6 +415,9 @@ enum iommu_hwpt_data_type {
>   * @data_type: One of enum iommu_hwpt_data_type
>   * @data_len: Length of the type specific data
>   * @data_uptr: User pointer to the type specific data
> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> + *IOMMU_HWPT_FAULT_ID_VALID is set.
> + * @__reserved2: Padding to 64-bit alignment. Must be 0.
>   *
>   * Explicitly allocate a hardware page table object. This is the same object
>   * type that is returned by iommufd_device_attach() and represents the
> @@ -442,6 +448,8 @@ struct iommu_hwpt_alloc {
>   __u32 data_type;
>   __u32 data_len;
>   __aligned_u64 data_uptr;
> + __u32 fault_id;
> + __u32 __reserved2;
>  };
>  #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)

[..]

> @@ -359,7 +359,7 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] 
> = {
>   IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
>__reserved),
>   IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
> -  __reserved),
> +  __reserved2),

This is not how the back compat mechanism works. The value here is the
absolute minimum size, it should never increase. The first __reserved
is always the right value.

If you change it then old userspace that doesn't include the fault_id
will stop working.

Jason



Re: [PATCH v5 3/9] iommu: Add attachment handle to struct iopf_group

2024-05-10 Thread Jason Gunthorpe
On Fri, May 10, 2024 at 11:14:20AM +0800, Baolu Lu wrote:
> On 5/8/24 8:04 AM, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 10:57:04PM +0800, Lu Baolu wrote:
> > > @@ -206,8 +197,11 @@ void iommu_report_device_fault(struct device *dev, 
> > > struct iopf_fault *evt)
> > >   if (group == &abort_group)
> > >   goto err_abort;
> > > - group->domain = get_domain_for_iopf(dev, fault);
> > > - if (!group->domain)
> > > + if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ||
> > > + get_attach_handle_for_iopf(dev, fault->prm.pasid, group))
> > > + get_attach_handle_for_iopf(dev, IOMMU_NO_PASID, group);
> > That seems a bit weird looking?
> 
> Agreed.
> 
> > get_attach_handle_for_iopf(dev,
> > (fault->prm.flags &
> > IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ? fault->prm.pasid : 
> > IOMMU_NO_PASID,
> > group);
> 
> The logic here is that it tries the PASID domain and if it doesn't
> exist, then tries the RID domain as well. I explained this in the commit
> message:
> 
> "
> ... if the pasid table of a device is wholly managed by user space,
> there is no domain attached to the PASID of the device ...
> "

Okay, it needs a comment in the code, and the RID fallback should be
based around checking for a NESTING domain type which includes the
PASID table. (ie ARM and AMD not Intel)

We shouldn't just elevate a random PASID to the RID if it isn't
appropriate.
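
Roughly, as a sketch (the capability name matches what later revisions of
the series use; the helper returns non-zero on failure as in the quoted
patch):

	if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) {
		if (get_attach_handle_for_iopf(dev, fault->prm.pasid, group)) {
			/* Only fall back to the RID attachment when the PASID
			 * table is user-managed (ARM/AMD style nesting). */
			if (!device_iommu_capable(dev, IOMMU_CAP_USER_IOASID_TABLE))
				goto err_abort;
			get_attach_handle_for_iopf(dev, IOMMU_NO_PASID, group);
		}
	} else {
		get_attach_handle_for_iopf(dev, IOMMU_NO_PASID, group);
	}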

Jason



Re: [PATCH v5 6/9] iommufd: Fault-capable hwpt attach/detach/replace

2024-05-10 Thread Jason Gunthorpe
On Fri, May 10, 2024 at 11:20:01AM +0800, Baolu Lu wrote:
> On 5/8/24 8:18 AM, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 10:57:07PM +0800, Lu Baolu wrote:
> > > diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> > > index 13125c0feecb..6357229bf3b4 100644
> > > --- a/drivers/iommu/iommufd/fault.c
> > > +++ b/drivers/iommu/iommufd/fault.c
> > > @@ -15,6 +15,124 @@
> > >   #include "../iommu-priv.h"
> > >   #include "iommufd_private.h"
> > > +static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
> > > +{
> > > + int ret;
> > > +
> > > + if (idev->iopf_enabled)
> > > + return 0;
> > > +
> > > + ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + idev->iopf_enabled = true;
> > > +
> > > + return 0;
> > > +}
> > I would greatly prefer we remove this from the drivers :\ I guess it
> > is Ok for now
> > 
> > Doesn't this need a counter? We can have many fault capable PASIDs?
> > That will get changed in the PASID series?
> 
> Okay, let's design this more gracefully after the PASID interfaces are
> landed. For now, we assume that the device driver will do this.

Well, for now to work the device drivers do still need these calls.

I'm trying to get them into NOPs in the drivers so we can remove this.

Jason



Re: [PATCH v5 3/9] iommu: Add attachment handle to struct iopf_group

2024-05-10 Thread Jason Gunthorpe
On Fri, May 10, 2024 at 10:30:10PM +0800, Baolu Lu wrote:

> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 35ae9a6f73d3..09b4e671dcee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -173,6 +173,8 @@ struct iommu_domain_geometry {
> 
>  #define __IOMMU_DOMAIN_NESTED  (1U << 6)  /* User-managed address space
> nested
>   on a stage-2 translation
> */
> +#define __IOMMU_DOMAIN_PASID   (1U << 7)  /* User-managed domain for a
> + specific PASID*/
> 
>  #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
>  /*
> @@ -204,6 +206,9 @@ struct iommu_domain_geometry {
>  #define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SVA)
>  #define IOMMU_DOMAIN_PLATFORM  (__IOMMU_DOMAIN_PLATFORM)
>  #define IOMMU_DOMAIN_NESTED(__IOMMU_DOMAIN_NESTED)
> +#define IOMMU_DOMAIN_NESTED_PASID  \
> +   (__IOMMU_DOMAIN_NESTED |\
> +__IOMMU_DOMAIN_PASID)

Yeah, something like that, I don't know about naming though..

How about

DOMAIN_NESTED = Attachment to RID or PASID
DOMAIN_NESTED_PASID_TABLE = RID and the entire PASID space

?

Jason



Re: [PATCH v5 4/9] iommufd: Add fault and response message definitions

2024-05-24 Thread Jason Gunthorpe
On Mon, May 20, 2024 at 04:59:18AM +, Tian, Kevin wrote:
> > From: Baolu Lu 
> > Sent: Monday, May 20, 2024 11:33 AM
> > 
> > On 5/20/24 11:24 AM, Tian, Kevin wrote:
> > >> From: Baolu Lu 
> > >> Sent: Sunday, May 19, 2024 10:38 PM
> > >>
> > >> On 2024/5/15 15:43, Tian, Kevin wrote:
> >  From: Lu Baolu 
> >  Sent: Tuesday, April 30, 2024 10:57 PM
> > 
> >  + * @length: a hint of how much data the requestor is expecting to
> > fetch.
> > >> For
> >  + *  example, if the PRI initiator knows it is going to do a 
> >  10MB
> >  + *  transfer, it could fill in 10MB and the OS could 
> >  pre-fault in
> >  + *  10MB of IOVA. It's default to 0 if there's no such hint.
> > >>>
> > >>> This is not clear to me and I don't remember PCIe spec defines such
> > >>> mechanism.
> > >>
> > >> This came up in a previous discussion. While it's not currently part of
> > >
> > > Can you provide a link to that discussion?
> > 
> > https://lore.kernel.org/linux-iommu/20240322170410.gh66...@ziepe.ca/
> > 
> 
> We can always extend uAPI for new usages, e.g. having a new flag
> bit to indicate the additional filed for carrying the number of pages.
> But requiring the user to handle non-zero length now (though trivial)
> is unnecessary burden.

It is tricky to extend this stuff since it comes out in read().. We'd
have to have userspace negotiate a new format most likely.

> Do we want the response message to also carry a length field i.e.
> allowing the user to partially fix the fault? 

No, the device will discover this when it gets another fault  :)

Jason



Re: [PATCH v5 5/9] iommufd: Add iommufd fault object

2024-05-24 Thread Jason Gunthorpe
On Mon, May 20, 2024 at 09:24:09AM +0800, Baolu Lu wrote:
> On 5/15/24 4:37 PM, Tian, Kevin wrote:
> > > +static ssize_t iommufd_fault_fops_write(struct file *filep, const char 
> > > __user
> > > *buf,
> > > + size_t count, loff_t *ppos)
> > > +{
> > > + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> > > + struct iommufd_fault *fault = filep->private_data;
> > > + struct iommu_hwpt_page_response response;
> > > + struct iommufd_device *idev = NULL;
> > > + struct iopf_group *group;
> > > + size_t done = 0;
> > > + int rc;
> > > +
> > > + if (*ppos || count % response_size)
> > > + return -ESPIPE;
> > > +
> > > + mutex_lock(&fault->mutex);
> > > + while (count > done) {
> > > + rc = copy_from_user(&response, buf + done, response_size);
> > > + if (rc)
> > > + break;
> > > +
> > > + if (!idev || idev->obj.id != response.dev_id)
> > > + idev = container_of(iommufd_get_object(fault->ictx,
> > > +response.dev_id,
> > > +
> > > IOMMUFD_OBJ_DEVICE),
> > > + struct iommufd_device, obj);
> > > + if (IS_ERR(idev))
> > > + break;
> > > +
> > > + group = xa_erase(&idev->faults, response.cookie);
> > > + if (!group)
> > > + break;
> > is 'continue' better?
> 
> If we can't find a matched iopf group here, it means userspace provided
> something wrong. The current logic is that we stop here and tell
> userspace that only part of the faults have been responded to and it
> should retry the remaining responses with the right message.

The usual fd-ish error handling here should be to return a short write
(success) and then userspace will retry with the failing entry at the
start of the buffer and collect the errno.
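
From the VMM side that convention looks roughly like this (a hypothetical
userspace helper, not part of the series):

	/* push responses to the fault FD, retrying after a short write */
	static int send_responses(int fault_fd,
				  struct iommu_hwpt_page_response *resp, size_t n)
	{
		size_t done = 0;

		while (done < n) {
			ssize_t ret = write(fault_fd, resp + done,
					    (n - done) * sizeof(*resp));
			if (ret < 0)
				return -errno;	/* resp[done] is the failing entry */
			done += ret / sizeof(*resp);
		}
		return 0;
	}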

Jason



Re: [PATCH v5 7/9] iommufd: Associate fault object with iommufd_hw_pgtable

2024-05-24 Thread Jason Gunthorpe
On Mon, May 20, 2024 at 03:39:54AM +, Tian, Kevin wrote:
> > From: Baolu Lu 
> > Sent: Monday, May 20, 2024 10:19 AM
> > 
> > On 5/15/24 4:50 PM, Tian, Kevin wrote:
> > >> From: Lu Baolu 
> > >> Sent: Tuesday, April 30, 2024 10:57 PM
> > >>
> > >> @@ -308,6 +314,19 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd
> > >> *ucmd)
> > >>  goto out_put_pt;
> > >>  }
> > >>
> > >> +if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> > >> +struct iommufd_fault *fault;
> > >> +
> > >> +fault = iommufd_get_fault(ucmd, cmd->fault_id);
> > >> +if (IS_ERR(fault)) {
> > >> +rc = PTR_ERR(fault);
> > >> +goto out_hwpt;
> > >> +}
> > >> +hwpt->fault = fault;
> > >> +hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> > >> +hwpt->domain->fault_data = hwpt;
> > >> +}
> > >
> > > this is nesting specific. why not moving it to the nested_alloc()?
> > 
> > Nesting is currently a use case for userspace I/O page faults, but this
> > design should be general enough to support other scenarios as well.
> 
> Do we allow user page table w/o nesting?
> 
> What would be a scenario in which the user doesn't manage the
> page table but still want to handle the I/O page fault? The fault
> should always be delivered to the owner managing the page table...

userspace always manages the page table, either it updates the IOPTE
directly in a nest or it calls iommufd map operations.

Ideally the driver will allow PRI on normal cases, although it will
probably never be used.

Jason



Re: [PATCH v6 02/10] iommu: Remove sva handle list

2024-06-12 Thread Jason Gunthorpe
On Fri, Jun 07, 2024 at 09:35:23AM +, Tian, Kevin wrote:
> > From: Baolu Lu 
> > Sent: Thursday, June 6, 2024 2:07 PM
> > 
> > On 6/5/24 4:15 PM, Tian, Kevin wrote:
> > >> From: Lu Baolu 
> > >> Sent: Monday, May 27, 2024 12:05 PM
> > >>
> > >> -list_for_each_entry(handle, &mm->iommu_mm->sva_handles,
> > >> handle_item) {
> > >> -if (handle->dev == dev) {
> > >> -refcount_inc(&handle->users);
> > >> -mutex_unlock(&iommu_sva_lock);
> > >> -return handle;
> > >> -}
> > >> +/* A bond already exists, just take a reference`. */
> > >> +attach_handle = iommu_attach_handle_get(group, iommu_mm-
> > >>> pasid, IOMMU_DOMAIN_SVA);
> > >> +if (!IS_ERR(attach_handle)) {
> > >> +handle = container_of(attach_handle, struct iommu_sva,
> > >> handle);
> > >> +refcount_inc(&handle->users);
> > >> +mutex_unlock(&iommu_sva_lock);
> > >> +return handle;
> > >>  }
> > >
> > > It's counter-intuitive to move forward when an error is returned.
> > >
> > > e.g. if it's -EBUSY indicating the pasid already used for another type 
> > > then
> > > following attempts shouldn't been tried.

Yes, it looks like iommu_sva_bind_device() should fail with EBUSY if
the PASID is already in use and is not exactly the same SVA as being
asked for here.

It will eventually do this naturally when iommu_attach_device_pasid()
is called with an in-use PASID, but may as well do it here for
clarity.

Also, is there a missing test for the same mm too?

I'd maybe change iommu_attach_handle_get() to return NULL if there is no
handle and then write it like:

if (IS_ERR(attach_handle) && PTR_ERR(attach_handle) != -ENOENT) {
ret = PTR_ERR(attach_handle);
goto out_unlock;
}

if (!IS_ERR(attach_handle) && attach_handle->domain->mm == mm) {
   /* Can re-use the existing SVA attachment */
}

> > > Does it suggest that having the caller to always provide a handle
> > > makes more sense?

I was thinking no just to avoid memory allocation.. But how does the
caller not provide a handle? My original draft of this concept used an
XA_MARK to indicate if the xarray pointed to a handle or a domain

This seems to require the handle:

-   curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
-   if (curr) {
-   ret = xa_err(curr) ? : -EBUSY;
+   ret = xa_insert(&group->pasid_array, pasid, handle, GFP_KERNEL);

Confused.

> > I'm neutral on this since only sva bind and iopf path delivery currently
> > require an attach handle.
> 
> let's hear Jason's opinion.

At least iommu_attach_handle_get() should not return NULL if there is
no handle, it should return EBUSY as though it couldn't match the
type.

Jason



Re: [PATCH v6 01/10] iommu: Introduce domain attachment handle

2024-06-12 Thread Jason Gunthorpe
On Thu, Jun 06, 2024 at 01:33:29PM +0800, Baolu Lu wrote:

> > But if certain path (other than iopf) in the iommu core needs to know
> > the exact domain pointer then this change breaks it.
> 
> The iommu core should not fetch the domain pointer in paths other than
> attach/detach/replace. There is currently no reference counter for an
> iommu domain, hence fetching the domain for other purposes will
> potentially lead to a use-after-free issue.

If you are doing this then we should get rid of
iommu_get_domain_for_dev_pasid, and always return the handle
there. Making it clear that all those flows require handles to work.

But is this really OK? What happened to the patch fixing the error
unwind in attach_pasid? Doesn't it always need the domain to undo a
failed attach?

Jason



Re: [PATCH v6 05/10] iommufd: Add fault and response message definitions

2024-06-12 Thread Jason Gunthorpe
On Fri, Jun 07, 2024 at 09:38:38AM +, Tian, Kevin wrote:
> > From: Baolu Lu 
> > Sent: Thursday, June 6, 2024 2:28 PM
> > 
> > On 6/5/24 4:28 PM, Tian, Kevin wrote:
> > >> From: Lu Baolu 
> > >> Sent: Monday, May 27, 2024 12:05 PM
> > >>
> > >> +
> > >> +/**
> > >> + * struct iommu_hwpt_page_response - IOMMU page fault response
> > >> + * @size: sizeof(struct iommu_hwpt_page_response)
> > >> + * @flags: Must be set to 0
> > >> + * @dev_id: device ID of target device for the response
> > >> + * @pasid: Process Address Space ID
> > >> + * @grpid: Page Request Group Index
> > >> + * @code: One of response code in enum
> > iommufd_page_response_code.
> > >> + * @cookie: The kernel-managed cookie reported in the fault message.
> > >> + */
> > >> +struct iommu_hwpt_page_response {
> > >> +__u32 size;
> > >> +__u32 flags;
> > >> +__u32 dev_id;
> > >> +__u32 pasid;
> > >> +__u32 grpid;
> > >> +__u32 code;
> > >> +__u32 cookie;
> > >> +__u32 reserved;
> > >> +};
> > >
> > > with the response queue per fault object we don't need all fields here,
> > > e.g. dev_id, pasid, etc. Cookie is sufficient.

Wait, why did we make it per object? The fault FD is supposed to be
sharable across HWPTs.

> > I prefer not to mess the definition of user API data and the kernel
> > driver implementation. The kernel driver may change in the future, but
> > the user API will remain stable for a long time.
> 
> sure it remains stable for reasonable reason. Here we defined some
> fields but they are even not used and checked in the kernel. IMHO it
> suggests redundant definition. If there is value to keep them, do we
> need to at least verify them same as the completion record?

They are not here for the kernel, they are here for userspace.

A single HWPT and a single fault queue can be attached to multiple
devices we need to return the dev_id so that userspace can know which
device initiated the PRI. Same with PASID.

The only way we could remove them is if we are sure that no vIOMMU
requires RID or PASID in the virtual fault queue PRI fault message.. I
don't think that is true?

Cookie is not a replacement, cookie is an opaque value for the kernel
to use to match a response to a request.

Jason



Re: [PATCH v6 06/10] iommufd: Add iommufd fault object

2024-06-12 Thread Jason Gunthorpe
On Fri, Jun 07, 2024 at 09:17:28AM +, Tian, Kevin wrote:
> > From: Lu Baolu 
> > Sent: Monday, May 27, 2024 12:05 PM
> > 
> > +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user 
> > *buf,
> > +  size_t count, loff_t *ppos)
> > +{
> > +   size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> > +   struct iommufd_fault *fault = filep->private_data;
> > +   struct iommu_hwpt_pgfault data;
> > +   struct iommufd_device *idev;
> > +   struct iopf_group *group;
> > +   struct iopf_fault *iopf;
> > +   size_t done = 0;
> > +   int rc = 0;
> > +
> > +   if (*ppos || count % fault_size)
> > +   return -ESPIPE;
> 
> the man page says:
> 
> "If count is zero, read() returns zero and has no  other  results."

The above does that? 0 % X == 0

> > +
> > +   mutex_lock(&fault->mutex);
> > +   while (!list_empty(&fault->deliver) && count > done) {
> > +   group = list_first_entry(&fault->deliver,
> > +struct iopf_group, node);
> > +
> > +   if (group->fault_count * fault_size > count - done)
> > +   break;
> > +
> > +   rc = xa_alloc(&fault->response, &group->cookie, group,
> > + xa_limit_32b, GFP_KERNEL);
> > +   if (rc)
> > +   break;
> > +
> > +   idev = to_iommufd_handle(group->attach_handle)->idev;
> > +   list_for_each_entry(iopf, &group->faults, list) {
> > +   iommufd_compose_fault_message(&iopf->fault,
> > + &data, idev,
> > + group->cookie);
> > +   rc = copy_to_user(buf + done, &data, fault_size);
> > +   if (rc) {
> 
> 'rc' should be converted to -EFAULT.

Yes


> > +   xa_erase(&fault->response, group->cookie);
> > +   break;
> > +   }
> > +   done += fault_size;
> > +   }
> > +
> > +   list_del(&group->node);
> > +   }
> > +   mutex_unlock(&fault->mutex);
> > +
> > +   return done == 0 ? rc : done;
> 
> again this doesn't match the manual:
> 
> "On error, -1 is returned, and errno is set appropriately. "
> 
> it doesn't matter whether 'done' is 0.

It is set up so that once the list_del() below happens it is guaranteed
that the system call will return a positive result so that the
list_del'd items are always returned to userspace.

If we hit any fault here on the Nth item we should still return the
prior items and ignore the fault.

If we hit a fault on the first item then we should return the fault.

Jason



Re: [PATCH v6 06/10] iommufd: Add iommufd fault object

2024-06-12 Thread Jason Gunthorpe
On Sat, Jun 08, 2024 at 05:58:34PM +0800, Baolu Lu wrote:

> > > +static int iommufd_fault_fops_release(struct inode *inode, struct file 
> > > *filep)
> > > +{
> > > + struct iommufd_fault *fault = filep->private_data;
> > > +
> > > + iommufd_ctx_put(fault->ictx);
> > > + refcount_dec(&fault->obj.users);
> > > + return 0;


This is in the wrong order, dec users before ctx_put.

> > hmm this doesn't sound correct. the context and refcount are
> > acquired in iommufd_fault_alloc() but here they are reverted when
> > the fd is closed...
> 
> These two refcounts were requested when the fault object was installed
> to the fault FD.
> 
>filep = anon_inode_getfile("[iommufd-pgfault]", &iommufd_fault_fops,
>fault, O_RDWR);
> if (IS_ERR(filep)) {
> rc = PTR_ERR(filep);
> goto out_abort;
> }
> 
> refcount_inc(&fault->obj.users);
> iommufd_ctx_get(fault->ictx);
> fault->filep = filep;
> 
> These refcounts must then be released when the FD is released.

Yes

The ctx refcount is to avoid destroying the ctx FD, which can't work,
while the fault FD has an object pinned. This is also why the above
order is backwards.

Jason



Re: [PATCH v6 03/10] iommu: Add attach handle to struct iopf_group

2024-06-12 Thread Jason Gunthorpe
On Mon, May 27, 2024 at 12:05:10PM +0800, Lu Baolu wrote:
> @@ -206,20 +182,49 @@ void iommu_report_device_fault(struct device *dev, 
> struct iopf_fault *evt)
>   if (group == &abort_group)
>   goto err_abort;
>  
> - group->domain = get_domain_for_iopf(dev, fault);
> - if (!group->domain)
> + if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) {
> + group->attach_handle = iommu_attach_handle_get(dev->iommu_group,
> +fault->prm.pasid,
> +0);
> + if (IS_ERR(group->attach_handle)) {
> + if (!device_iommu_capable(dev, 
> IOMMU_CAP_USER_IOASID_TABLE))
> + goto err_abort;

I'm not excited about calling a function pointer on every fault. Let's
just add a constant flag to iommu_ops?

Jason



Re: [PATCH v6 04/10] iommu: Extend domain attach group with handle support

2024-06-12 Thread Jason Gunthorpe
On Mon, May 27, 2024 at 12:05:11PM +0800, Lu Baolu wrote:
> Unlike the SVA case where each PASID of a device has an SVA domain
> attached to it, the I/O page faults are handled by the fault handler
> of the SVA domain. The I/O page faults for a user page table might
> be handled by the domain attached to RID or the domain attached to
> the PASID, depending on whether the PASID table is managed by user
> space or kernel. As a result, there is a need for the domain attach
> group interfaces to have attach handle support. The attach handle
> will be forwarded to the fault handler of the user domain.
> 
> Add some variants of the domain attaching group interfaces so that they
> could support the attach handle and export them for use in IOMMUFD.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu-priv.h |  8 +++
>  drivers/iommu/iommu.c  | 99 ++
>  2 files changed, 107 insertions(+)

I don't have an objection to it like this, but I wonder if it would be
smaller to teach iommu_attach_device_pasid to use IOMMU_NO_PASID to
attach the handle to the RID?

It would have an if there to call __iommu_attach_group() instead
of the pasid attach?
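
Something like this, purely as a sketch of the idea (the group-attach
helper name and the final signature are made up):

	int iommu_attach_device_pasid(struct iommu_domain *domain,
				      struct device *dev, ioasid_t pasid,
				      struct iommu_attach_handle *handle)
	{
		struct iommu_group *group = dev->iommu_group;

		if (pasid == IOMMU_NO_PASID)
			/* RID attach: reuse the group path, then record the
			 * handle under IOMMU_NO_PASID in the pasid_array */
			return __iommu_attach_group_handle(group, domain, handle);

		/* regular per-PASID attach, storing the handle in pasid_array */
		return __iommu_attach_device_pasid(domain, dev, pasid, handle);
	}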

Jason



Re: [PATCH v6 05/10] iommufd: Add fault and response message definitions

2024-06-12 Thread Jason Gunthorpe
On Wed, Jun 12, 2024 at 10:19:46AM -0300, Jason Gunthorpe wrote:
> > > I prefer not to mess the definition of user API data and the kernel
> > > driver implementation. The kernel driver may change in the future, but
> > > the user API will remain stable for a long time.
> > 
> > sure it remains stable for reasonable reason. Here we defined some
> > fields but they are even not used and checked in the kernel. IMHO it
> > suggests redundant definition. If there is value to keep them, do we
> > need to at least verify them same as the completion record?
> 
> They are not here for the kernel, they are here for userspace.
> 
> A single HWPT and a single fault queue can be attached to multiple
> devices we need to return the dev_id so that userspace can know which
> device initiated the PRI. Same with PASID.
> 
> The only way we could remove them is if we are sure that no vIOMMU
> requires RID or PASID in the virtual fault queue PRI fault message.. I
> don't think that is true?
> 
> Cookie is not a replacement, cookie is an opaque value for the kernel
> to use to match a response to a request.

Oh I got this wrong, the above is the response; yeah we can ditch
everything but the cookie and code, right?

struct iommu_hwpt_page_response {
	__u32 cookie;
	__u32 code;
};

What is the workflow here? We expect the VMM will take the vIOMMU
response and match it to a request cookie for the kernel to complete?

Jason



Re: [PATCH v6 05/10] iommufd: Add fault and response message definitions

2024-06-12 Thread Jason Gunthorpe
On Mon, May 27, 2024 at 12:05:12PM +0800, Lu Baolu wrote:
> +/**
> + * struct iommu_hwpt_pgfault - iommu page fault data
> + * @size: sizeof(struct iommu_hwpt_pgfault)
> + * @flags: Combination of enum iommu_hwpt_pgfault_flags
> + * @dev_id: id of the originated device
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @perm: Combination of enum iommu_hwpt_pgfault_perm
> + * @addr: Fault address
> + * @length: a hint of how much data the requestor is expecting to fetch. For
> + *  example, if the PRI initiator knows it is going to do a 10MB
> + *  transfer, it could fill in 10MB and the OS could pre-fault in
> + *  10MB of IOVA. It's default to 0 if there's no such hint.
> + * @cookie: kernel-managed cookie identifying a group of fault messages. The
> + *  cookie number encoded in the last page fault of the group should
> + *  be echoed back in the response message.
> + */
> +struct iommu_hwpt_pgfault {
> + __u32 size;

Given we fail the system call if size is not exactly the right value
we should probably drop it here.

The ioctl to get the FD can someday specify the format of the fault
messages if we need to upgrade. If we want to change it down the road
then the old FD will be exactly as it is now, and the user will
request a new format FD that only works in whatever the new way is.

Jason



Re: [PATCH v6 00/10] IOMMUFD: Deliver IO page faults to user space

2024-06-12 Thread Jason Gunthorpe
On Mon, May 27, 2024 at 12:05:07PM +0800, Lu Baolu wrote:
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework. One feasible use case is the
> nested translation. Nested translation is a hardware feature that
> supports two-stage translation tables for IOMMU. The second-stage
> translation table is managed by the host VMM, while the first-stage
> translation table is owned by user space. This allows user space to
> control the IOMMU mappings for its devices.

This looks pretty close, will you post a v7 with the minor changes?

Thanks,
Jason



Re: [PATCH v6 03/10] iommu: Add attach handle to struct iopf_group

2024-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2024 at 12:23:17PM +0800, Baolu Lu wrote:

>  struct iommu_ops {
> bool (*capable)(struct device *dev, enum iommu_cap);
> @@ -600,6 +598,7 @@ struct iommu_ops {
> struct iommu_domain *blocked_domain;
> struct iommu_domain *release_domain;
> struct iommu_domain *default_domain;
> +   bool user_pasid_table;
>  };

Yeah, maybe spell it 

  u8 user_pasid_table : 1;

Jason



Re: [PATCH v7 04/10] iommu: Extend domain attach group with handle support

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:49PM +0800, Lu Baolu wrote:

> +int iommu_replace_group_handle(struct iommu_group *group,
> +struct iommu_domain *new_domain,
> +struct iommu_attach_handle *handle)
> +{
> + struct iommu_domain *old_domain = group->domain;
> + void *curr;
> + int ret;
> +
> + if (!new_domain)
> + return -EINVAL;
> +
> + mutex_lock(&group->mutex);
> + ret = __iommu_group_set_domain(group, new_domain);
> + if (ret)
> + goto err_unlock;
> + xa_erase(&group->pasid_array, IOMMU_NO_PASID);
> + if (handle) {
> + curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, 
> GFP_KERNEL);
> + if (xa_err(curr)) {
> + ret = xa_err(curr);
> + goto err_restore;

But this error unwind doesn't work because the xa_erase() already
happened and there may have been a handle there that we don't put
back.

Something like this - store to a reserved entry cannot fail:

int iommu_replace_group_handle(struct iommu_group *group,
   struct iommu_domain *new_domain,
   struct iommu_attach_handle *handle)
{
void *curr;
int ret;

if (!new_domain)
return -EINVAL;

mutex_lock(&group->mutex);
if (handle) {
ret = xa_reserve(&group->pasid_array, IOMMU_NO_PASID,
 GFP_KERNEL);
if (ret)
goto err_unlock;
}

ret = __iommu_group_set_domain(group, new_domain);
if (ret)
goto err_release;

curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle,
GFP_KERNEL);
WARN_ON(xa_is_err(curr));

mutex_unlock(&group->mutex);

return 0;
err_release:
xa_release(&group->pasid_array, IOMMU_NO_PASID);
err_unlock:
mutex_unlock(&group->mutex);
return ret;
}

Jason



Re: [PATCH v7 07/10] iommufd: Fault-capable hwpt attach/detach/replace

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:52PM +0800, Lu Baolu wrote:
> +static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
> +{
> + struct device *dev = idev->dev;
> + int ret;
> +
> + /*
> +  * Once we turn on PCI/PRI support for VF, the response failure code
> +  * should not be forwarded to the hardware due to PRI being a shared
> +  * resource between PF and VFs. There is no coordination for this
> +  * shared capability. This waits for a vPRI reset to recover.
> +  */
> + if (dev_is_pci(dev) && to_pci_dev(dev)->is_virtfn)
> + return -EINVAL;

I don't quite get this remark, isn't not supporting PRI on VFs kind of
useless? What is the story here?

Jason



Re: [PATCH v7 01/10] iommu: Introduce domain attachment handle

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:46PM +0800, Lu Baolu wrote:
> Currently, when attaching a domain to a device or its PASID, domain is
> stored within the iommu group. It could be retrieved for use during the
> window between attachment and detachment.
> 
> With new features introduced, there's a need to store more information
> than just a domain pointer. This information essentially represents the
> association between a domain and a device. For example, the SVA code
> already has a custom struct iommu_sva which represents a bond between
> sva domain and a PASID of a device. Looking forward, the IOMMUFD needs
> a place to store the iommufd_device pointer in the core, so that the
> device object ID could be quickly retrieved in the critical fault handling
> path.
> 
> Introduce domain attachment handle that explicitly represents the
> attachment relationship between a domain and a device or its PASID.
> 
> Co-developed-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Lu Baolu 
> ---
>  include/linux/iommu.h | 18 +++---
>  drivers/dma/idxd/init.c   |  2 +-
>  drivers/iommu/iommu-sva.c | 13 -
>  drivers/iommu/iommu.c | 26 ------
>  4 files changed, 40 insertions(+), 19 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v7 08/10] iommufd: Associate fault object with iommufd_hw_pgtable

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:53PM +0800, Lu Baolu wrote:

> @@ -308,13 +315,29 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>   goto out_put_pt;
>   }
>  
> + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> + struct iommufd_fault *fault;
> +
> + fault = iommufd_get_fault(ucmd, cmd->fault_id);
> + if (IS_ERR(fault)) {
> + rc = PTR_ERR(fault);
> + goto out_hwpt;
> + }
> + hwpt->fault = fault;
> + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> + hwpt->domain->fault_data = hwpt;

This is not the right refcounting for a longterm reference... The PT
above shows the pattern:

pt_obj = iommufd_get_object(ucmd->ictx, cmd->pt_id, IOMMUFD_OBJ_ANY);
hwpt_paging = iommufd_hwpt_paging_alloc()
refcount_inc(&ioas->obj.users);

iommufd_put_object(ucmd->ictx, pt_obj);

Which is to say you need to incr users and then do the put object. And
iommufd_object_abort_and_destroy() will always destroy the ref on the
fault if the fault is non-null so the error handling will double free.

fail_nth is intended to catch this, but you have to add enough inputs
to cover the new cases when you add them, it seems like that is
missing in this series. ie add a fault object and hwpt alloc to a
fail_nth test and see we execute the iommufd_ucmd_respond() failure
path.

--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -14,7 +14,7 @@ static void __iommufd_hwpt_destroy(struct iommufd_hw_pagetable *hwpt)
iommu_domain_free(hwpt->domain);
 
if (hwpt->fault)
-   iommufd_put_object(hwpt->fault->ictx, &hwpt->fault->obj);
+   refcount_dec(&hwpt->fault->obj.users);
 }
 
 void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
@@ -326,18 +326,17 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
hwpt->fault = fault;
hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
hwpt->domain->fault_data = hwpt;
+   refcount_inc(&fault->obj.users);
+   iommufd_put_object(ucmd->ictx, &fault->obj);
}
 
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
-   goto out_put_fault;
+   goto out_hwpt;
iommufd_object_finalize(ucmd->ictx, &hwpt->obj);
goto out_unlock;
 
-out_put_fault:
-   if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID)
-   iommufd_put_object(ucmd->ictx, &hwpt->fault->obj);
 out_hwpt:
iommufd_object_abort_and_destroy(ucmd->ictx, &hwpt->obj);
 out_unlock:



Re: [PATCH v7 00/10] IOMMUFD: Deliver IO page faults to user space

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:45PM +0800, Lu Baolu wrote:

> Lu Baolu (10):
>   iommu: Introduce domain attachment handle
>   iommu: Remove sva handle list
>   iommu: Add attach handle to struct iopf_group
>   iommu: Extend domain attach group with handle support
>   iommufd: Add fault and response message definitions
>   iommufd: Add iommufd fault object
>   iommufd: Fault-capable hwpt attach/detach/replace
>   iommufd: Associate fault object with iommufd_hw_pgtable
>   iommufd/selftest: Add IOPF support for mock device
>   iommufd/selftest: Add coverage for IOPF test

Small remarks aside this all looks fine to me, send a v8 and I'll try
to pick it up next week

Thanks,
Jason



Re: [PATCH v7 03/10] iommu: Add attach handle to struct iopf_group

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:48PM +0800, Lu Baolu wrote:
> Previously, the domain that a page fault targets is stored in an
> iopf_group, which represents a minimal set of page faults. With the
> introduction of attach handle, replace the domain with the handle
> so that the fault handler can obtain more information as needed
> when handling the faults.
> 
> iommu_report_device_fault() is currently used for SVA page faults,
> which handles the page fault in an internal cycle. The domain is retrieved
> with iommu_get_domain_for_dev_pasid() if the pasid in the fault message
> is valid. This doesn't work in IOMMUFD case, where if the pasid table of
> a device is wholly managed by user space, there is no domain attached to
> the PASID of the device, and all page faults are forwarded through a
> NESTING domain attaching to RID.
> 
> Add a new IOMMU capability flag, IOMMU_CAP_USER_IOASID_TABLE, which
> indicates if the IOMMU driver supports user-managed PASID tables. In the
> iopf deliver path, if no attach handle found for the iopf PASID, roll
> back to RID domain when the IOMMU driver supports this capability.

This remark is out of date since it is now:

> @@ -547,6 +547,10 @@ static inline int __iommu_copy_struct_from_user_array(
>   * @default_domain: If not NULL this will always be set as the default domain.
>   *                  This should be an IDENTITY/BLOCKED/PLATFORM domain.
>   *                  Do not use in new drivers.
> + * @user_pasid_table: IOMMU driver supports user-managed PASID table. There is
> + *                    no user domain for each PASID and the I/O page faults are
> + *                    forwarded through the user domain attached to the device
> + *                    RID.
>   */
>  struct iommu_ops {
>   bool (*capable)(struct device *dev, enum iommu_cap);
> @@ -590,6 +594,7 @@ struct iommu_ops {
>   struct iommu_domain *blocked_domain;
>   struct iommu_domain *release_domain;
>   struct iommu_domain *default_domain;
> + u8 user_pasid_table:1;
>  };

Otherwise looks good

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v7 02/10] iommu: Remove sva handle list

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:47PM +0800, Lu Baolu wrote:
> The struct sva_iommu represents an association between an SVA domain and
> a PASID of a device. It's stored in the iommu group's pasid array and also
> tracked by a list in the per-mm data structure. Removes duplicate tracking
> of sva_iommu by eliminating the list.
> 
> Signed-off-by: Lu Baolu 
> ---
>  include/linux/iommu.h  |  2 --
>  drivers/iommu/iommu-priv.h |  3 +++
>  drivers/iommu/iommu-sva.c  | 30 --
>  drivers/iommu/iommu.c  | 31 +++
>  4 files changed, 54 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v7 06/10] iommufd: Add iommufd fault object

2024-06-28 Thread Jason Gunthorpe
On Sun, Jun 16, 2024 at 02:11:51PM +0800, Lu Baolu wrote:
> @@ -6,6 +6,7 @@ iommufd-y := \
>   ioas.o \
>   main.o \
>   pages.o \
> + fault.o \
>   vfio_compat.o

Keep sorted

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v8 06/10] iommufd: Add iommufd fault object

2024-07-08 Thread Jason Gunthorpe
On Fri, Jul 05, 2024 at 12:49:10AM +, Tian, Kevin wrote:

> > > > > > > +enum iommu_fault_type {
> > > > > > > + IOMMU_FAULT_TYPE_HWPT_IOPF,
> > > > > > > + IOMMU_FAULT_TYPE_VIOMMU_IRQ,
> > > > > > > +};
> > > > > > >
> > > > > > >struct iommu_fault_alloc {
> > > > > > >__u32 size;
> > > > > > >__u32 flags;
> > > > > > > + __u32 type;  /* enum iommu_fault_type */
> > > > > > >__u32 out_fault_id;
> > > > > > >__u32 out_fault_fd;
> > > > and need a new reserved field for alignment.
> > 
> > Hmm, what's the reason for enforcing a 64-bit alignment to an
> > all-u32 struct though? I thought we need a reserved field only
> > for padding. The struct iommu_ioas_alloc has three u32 members
> > for example?
> 
> yeah please ignore this comment.

Sometimes I encourage it so that people notice if the structure is
changed later. Almost all structs here are 8 byte aligned. It is OK
like this too.

Jason



Re: [PATCH v8 06/10] iommufd: Add iommufd fault object

2024-07-08 Thread Jason Gunthorpe
On Wed, Jul 03, 2024 at 04:06:15PM -0700, Nicolin Chen wrote:

> I learned that this hwpt->fault is exclusively for IOPF/PRI. And
> Jason suggested me to add a different one for VIOMMU. Yet, after
> taking a closer look, I found the fault object in this series is
> seemingly quite generic at the uAPI level: its naming/structure,
> and the way how it's allocated and passed to hwpt, despite being
> highly correlated with IOPF in its fops code. So, I feel that we
> might have a chance of reusing it for different fault types:
> 
> +enum iommu_fault_type {
> + IOMMU_FAULT_TYPE_HWPT_IOPF,
> + IOMMU_FAULT_TYPE_VIOMMU_IRQ,
> +};
> 
>  struct iommu_fault_alloc {
>   __u32 size;
>   __u32 flags;
> + __u32 type;  /* enum iommu_fault_type */
>   __u32 out_fault_id;
>   __u32 out_fault_fd;
>  };

I think I would just add the type at the end of the struct and rely on
our existing 0 is backwards compat mechanism. 0 means HWPT_IOPF. ie no
need to do anything now.
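
ie roughly this, where only the field placement matters (a sketch, not
final uAPI):

struct iommu_fault_alloc {
        __u32 size;
        __u32 flags;
        __u32 out_fault_id;
        __u32 out_fault_fd;
        __u32 type;     /* new trailing field, 0 keeps today's HWPT_IOPF behaviour */
};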

It would make some sense to call this a "report" object rather than "fault"
if we are going to use it for different things. We could probably
rename it without much trouble. There is also not a significant issue
with having two alloc commands for FDs.

I'd also think VIOMMU_IRQ is probably not the right abstraction;
likely it makes more sense to push driver-specific event messages, sort
of like IOPF, where one of the messages can indicate an arm-smmu-v3
VCMDQ interrupt and other messages could indicate BAD_CD and similar
sorts of events we might want to capture and forward.

So, I'm inclined to just take this series as-is

Jason



Re: [PATCH v8 06/10] iommufd: Add iommufd fault object

2024-07-09 Thread Jason Gunthorpe
On Mon, Jul 08, 2024 at 11:36:57AM -0700, Nicolin Chen wrote:
> Maybe something like this?
> 
> struct iommu_viommu_event_arm_smmuv3 {
>   u64 evt[4];
> };
> 
> struct iommu_viommu_event_tegra241_cmdqv {
>   u64 vcmdq_err_map[2];
> };
> 
> enum iommu_event_type {
>   IOMMM_HWPT_EVENT_TYPE_IOPF,
>   IOMMU_VIOMMU_EVENT_TYPE_SMMUv3,
>   IOMMU_VIOMMU_EVENT_TYPE_TEGRA241_CMDQV,
> };
> 
> struct iommu_event_alloc {
>   __u32 size;
>   __u32 flags;
>   __u32 out_event_id;
>   __u32 out_event_fd;
>   __u32 type;
>   __u32 _reserved;
> };
> 
> It can be "report" if you prefer.

Yeah, something like that makes sense to me. The other question is if
you want to multiplex the SMMUv3 and CMDQV on the same FD?

Or multiplex physical smmus on the same FD.

We are potentially talking about 5-10 physical smmus and 2-3 FDs per
physical? Does that scare anyone?

Jason



Re: [PATCH v8 00/10] IOMMUFD: Deliver IO page faults to user space

2024-07-09 Thread Jason Gunthorpe
On Thu, Jul 04, 2024 at 03:18:57PM +0100, Will Deacon wrote:
> On Tue, 02 Jul 2024 14:34:34 +0800, Lu Baolu wrote:
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework. One feasible use case is the
> > nested translation. Nested translation is a hardware feature that
> > supports two-stage translation tables for IOMMU. The second-stage
> > translation table is managed by the host VMM, while the first-stage
> > translation table is owned by user space. This allows user space to
> > control the IOMMU mappings for its devices.
> 
> Applied first four to iommu (iommufd/attach-handles), thanks!
> 
> [01/10] iommu: Introduce domain attachment handle
> https://git.kernel.org/iommu/c/14678219cf40
> [02/10] iommu: Remove sva handle list
> https://git.kernel.org/iommu/c/3e7f57d1ef3f
> [03/10] iommu: Add attach handle to struct iopf_group
> https://git.kernel.org/iommu/c/06cdcc32d657
> [04/10] iommu: Extend domain attach group with handle support
> https://git.kernel.org/iommu/c/8519e689834a
> 
> Jason -- feel free to use this branch as a base on which you can take
> the iommufd parts.

Done, the whole series is in iommufd for-next

Thanks,
Jason



Re: [PATCH v7 07/10] iommufd: Fault-capable hwpt attach/detach/replace

2024-07-09 Thread Jason Gunthorpe
On Mon, Jul 01, 2024 at 01:55:12PM +0800, Baolu Lu wrote:
> On 2024/6/29 5:17, Jason Gunthorpe wrote:
> > On Sun, Jun 16, 2024 at 02:11:52PM +0800, Lu Baolu wrote:
> > > +static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
> > > +{
> > > + struct device *dev = idev->dev;
> > > + int ret;
> > > +
> > > + /*
> > > +  * Once we turn on PCI/PRI support for VF, the response failure code
> > > +  * should not be forwarded to the hardware due to PRI being a shared
> > > +  * resource between PF and VFs. There is no coordination for this
> > > +  * shared capability. This waits for a vPRI reset to recover.
> > > +  */
> > > + if (dev_is_pci(dev) && to_pci_dev(dev)->is_virtfn)
> > > + return -EINVAL;
> > I don't quite get this remark, isn't not supporting PRI on VFs kind of
> > useless? What is the story here?
> 
> This remark is trying to explain why attaching an iopf-capable hwpt to a
> VF is not supported for now. The PCI sepc (section 10.4.2.1) states that
> a response failure will disable the PRI on the function. But for PF/VF
> case, the PRI is a shared resource, therefore a response failure on a VF
> might cause iopf on other VFs to malfunction. So, we start from simple
> by not allowing it.

You are talking about IOMMU_PAGE_RESP_FAILURE?

But this is bad already, something like SVA could trigger
IOMMU_PAGE_RESP_FAILURE on a VF without iommufd today. Due to memory
allocation failure in iommu_report_device_fault()

And then we pass in code from userspace and blindly cast it to
enum iommu_page_response_code ?

Probably we should just only support IOMMU_PAGE_RESP_SUCCESS/INVALID
from userspace and block FAILURE entirely. Probably the VMM should
emulate FAILURE by disabling PRI on by changing to a non PRI domain.

And this subtle uABI leak needs a fix:

iopf_group_response(group, response.code);

response.code and enum iommu_page_response_code are different
enums, and there is no range check. Need a static assert at least and
a range check. Send a followup patch please
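
Something along these lines in the response path would cover it (a
sketch, the exact enum spellings need checking):

        static_assert((int)IOMMUFD_PAGE_RESP_SUCCESS == (int)IOMMU_PAGE_RESP_SUCCESS);
        static_assert((int)IOMMUFD_PAGE_RESP_INVALID == (int)IOMMU_PAGE_RESP_INVALID);

        if (response.code != IOMMUFD_PAGE_RESP_SUCCESS &&
            response.code != IOMMUFD_PAGE_RESP_INVALID)
                return -EINVAL;

        iopf_group_response(group, response.code);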

Jason



Re: [PATCH v8 06/10] iommufd: Add iommufd fault object

2024-07-12 Thread Jason Gunthorpe
On Tue, Jul 09, 2024 at 10:33:42AM -0700, Nicolin Chen wrote:

> > We are potentially talking about 5-10 physical smmus and 2-3 FDs per
> > physical? Does that scare anyone?
> 
> I think we can share the same FD by adding a viommu_id somewhere
> to indicate what the data/event belongs to. Yet, it seemed that
> you had a counter-argument that a shared FD (queue) might have a
> security concern as well?
> https://lore.kernel.org/linux-iommu/20240522232833.gh20...@nvidia.com/

That was for the physical HW queue not so much the FD.

We need to be mindful that these events can't DOS the hypervisor, that
constrains how we track pending events in the kernel, not how they get
marshaled to FDs to deliver to user space.

Thinking more, it makes sense that a FD would tie 1:1 with a queue in
the VM.

That way backpressure on one queue will not cause head-of-line blocking
for other queues, as it would if they were all multiplexed onto a single FD.

Jason



Re: [PATCH 0/4] driver_core: Auxiliary drvdata helper cleanup

2021-12-21 Thread Jason Gunthorpe
On Tue, Dec 21, 2021 at 03:58:48PM -0800, David E. Box wrote:
> Depends on "driver core: auxiliary bus: Add driver data helpers" patch [1].
> Applies the helpers to all auxiliary device drivers using
> dev_(get/set)_drvdata. Drivers were found using the following search:
> 
> grep -lr "struct auxiliary_device" $(grep -lr "drvdata" .)
> 
> Changes were build tested using the following configs:
> 
> vdpa/mlx5:   CONFIG_MLX5_VDPA_NET
> net/mlx53:   CONFIG_MLX5_CORE_EN
> soundwire/intel: CONFIG_SOUNDWIRE_INTEL
> RDAM/irdma:  CONFIG_INFINIBAND_IRDMA
>  CONFIG_MLX5_INFINIBAND
> 
> [1] https://www.spinics.net/lists/platform-driver-x86/msg29940.html 

I have to say I don't really find this to be a big readability
improvement.

Also, what use is 'to_auxiliary_dev()'? I didn't see any users added.

Jason


Re: [PATCH 0/4] driver_core: Auxiliary drvdata helper cleanup

2021-12-22 Thread Jason Gunthorpe
On Tue, Dec 21, 2021 at 04:48:17PM -0800, David E. Box wrote:
> On Tue, 2021-12-21 at 20:09 -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 21, 2021 at 03:58:48PM -0800, David E. Box wrote:
> > > Depends on "driver core: auxiliary bus: Add driver data helpers" patch 
> > > [1].
> > > Applies the helpers to all auxiliary device drivers using
> > > dev_(get/set)_drvdata. Drivers were found using the following search:
> > > 
> > > grep -lr "struct auxiliary_device" $(grep -lr "drvdata" .)
> > > 
> > > Changes were build tested using the following configs:
> > > 
> > > vdpa/mlx5:   CONFIG_MLX5_VDPA_NET
> > > net/mlx53:   CONFIG_MLX5_CORE_EN
> > > soundwire/intel: CONFIG_SOUNDWIRE_INTEL
> > > RDAM/irdma:  CONFIG_INFINIBAND_IRDMA
> > >  CONFIG_MLX5_INFINIBAND
> > > 
> > > [1] https://www.spinics.net/lists/platform-driver-x86/msg29940.html 
> > 
> > I have to say I don't really find this to be a big readability
> > improvement.
> 
> I should have referenced the thread [1] discussing the benefit of this change
> since the question was asked and answered already. The idea is that drivers
> shouldn't have to touch the device API directly if they are already using a
> higher level core API (auxiliary bus) that can do that on its behalf.

Driver writers should rarely use the auxiliary device type directly; they
should always immediately container_of it to their proper derived
type.
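
ie the expected shape is just this (a sketch, my_adev/my_probe are
made-up names):

struct my_adev {                        /* defined by whoever creates the aux device */
        struct auxiliary_device adev;
        /* ... state the parent wants to hand to the driver ... */
};

static int my_probe(struct auxiliary_device *adev,
                    const struct auxiliary_device_id *id)
{
        struct my_adev *mdev = container_of(adev, struct my_adev, adev);

        /* work with mdev from here on, not the raw auxiliary_device */
        return 0;
}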

> > Also, what use is 'to_auxiliary_dev()' ? I didn't see any users added..
>
> This was not added by that patch.

It was added by the referenced patch, and seems totally pointless cut
and paste, again because nothing should be using the auxiliary_device
type for anything more than container_of'ing to their own type.

We've been ripping out bus specific APIs in favour of generic ones
(see the work on the DMA API for instance) so this whole concept seems
regressive, particularly when applied to auxiliary bus which does not
have an API of its own.

Jason


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 12:42:15PM -0700, Nick Desaulniers wrote:
> On Sat, Oct 17, 2020 at 10:43 PM Greg KH  wrote:
> >
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > From: Tom Rix 
> > >
> > > This is a upcoming change to clean up a new warning treewide.
> > > I am wondering if the change could be one mega patch (see below) or
> > > normal patch per file about 100 patches or somewhere half way by 
> > > collecting
> > > early acks.
> >
> > Please break it up into one-patch-per-subsystem, like normal, and get it
> > merged that way.
> >
> > Sending us a patch, without even a diffstat to review, isn't going to
> > get you very far...
> 
> Tom,
> If you're able to automate this cleanup, I suggest checking in a
> script that can be run on a directory.  Then for each subsystem you
> can say in your commit "I ran scripts/fix_whatever.py on this subdir."
>  Then others can help you drive the tree wide cleanup.  Then we can
> enable -Wunreachable-code-break either by default, or W=2 right now
> might be a good idea.

I remember using clang-modernize in the past to fix issues very
similar to this, if clang machinery can generate the warning, can't
something like clang-tidy directly generate the patch?
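
For reference, the pattern the warning flags is just the dead break
after a return/goto, e.g. (FOO/do_foo() made up):

        switch (cmd) {
        case FOO:
                return do_foo();
                break;          /* unreachable, flagged by -Wunreachable-code-break */
        default:
                return -EINVAL;
        }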

You can send me a patch for drivers/infiniband/* as well

Thanks,
Jason


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-03 Thread Jason Gunthorpe
On Sun, Nov 01, 2020 at 10:15:37PM +0200, Leon Romanovsky wrote:
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 6c218b47b9f1..5316e51e72d4 100644
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1,18 +1,27 @@
>  // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>  /* Copyright (c) 2020 Mellanox Technologies Ltd. */
> 
> +#include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> -#include 
>  #include 
> -#include "mlx5_vnet.h"
>  #include "mlx5_vdpa.h"
> 
> +MODULE_AUTHOR("Eli Cohen ");
> +MODULE_DESCRIPTION("Mellanox VDPA driver");
> +MODULE_LICENSE("Dual BSD/GPL");
> +
> +#define to_mlx5_vdpa_ndev(__mvdev) container_of(__mvdev, struct 
> mlx5_vdpa_net, mvdev)
>  #define to_mvdev(__vdev) container_of((__vdev), struct mlx5_vdpa_dev, vdev)
> 
>  #define VALID_FEATURES_MASK  
>   \
> @@ -159,6 +168,11 @@ static bool mlx5_vdpa_debug;
>   mlx5_vdpa_info(mvdev, "%s\n", #_status);
>\
>   } while (0)
> 
> +static inline u32 mlx5_vdpa_max_qps(int max_vqs)
> +{
> + return max_vqs / 2;
> +}
> +
>  static void print_status(struct mlx5_vdpa_dev *mvdev, u8 status, bool set)
>  {
>   if (status & ~VALID_STATUS_MASK)
> @@ -1928,8 +1942,11 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev)
>   }
>  }
> 
> -void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
> +static int mlx5v_probe(struct auxiliary_device *adev,
> +const struct auxiliary_device_id *id)
>  {
> + struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
> + struct mlx5_core_dev *mdev = madev->mdev;
>   struct virtio_net_config *config;
>   struct mlx5_vdpa_dev *mvdev;
>   struct mlx5_vdpa_net *ndev;
> @@ -1943,7 +1960,7 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>   ndev = vdpa_alloc_device(struct mlx5_vdpa_net, mvdev.vdev, 
> mdev->device, &mlx5_vdpa_ops,
>2 * mlx5_vdpa_max_qps(max_vqs));
>   if (IS_ERR(ndev))
> - return ndev;
> + return PTR_ERR(ndev);
> 
>   ndev->mvdev.max_vqs = max_vqs;
>   mvdev = &ndev->mvdev;
> @@ -1972,7 +1989,8 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>   if (err)
>   goto err_reg;
> 
> - return ndev;
> + dev_set_drvdata(&adev->dev, ndev);
> + return 0;
> 
>  err_reg:
>   free_resources(ndev);
> @@ -1981,10 +1999,29 @@ void *mlx5_vdpa_add_dev(struct mlx5_core_dev *mdev)
>  err_mtu:
>   mutex_destroy(&ndev->reslock);
>   put_device(&mvdev->vdev.dev);
> - return ERR_PTR(err);
> + return err;
>  }
> 
> -void mlx5_vdpa_remove_dev(struct mlx5_vdpa_dev *mvdev)
> +static int mlx5v_remove(struct auxiliary_device *adev)
>  {
> + struct mlx5_vdpa_dev *mvdev = dev_get_drvdata(&adev->dev);
> +
>   vdpa_unregister_device(&mvdev->vdev);
> + return 0;
>  }
> +
> +static const struct auxiliary_device_id mlx5v_id_table[] = {
> + { .name = MLX5_ADEV_NAME ".vnet", },
> + {},
> +};
> +
> +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> +
> +static struct auxiliary_driver mlx5v_driver = {
> + .name = "vnet",
> + .probe = mlx5v_probe,
> + .remove = mlx5v_remove,
> + .id_table = mlx5v_id_table,
> +};

It is hard to see from the diff, but when this patch is applied the
vdpa module looks like I imagined things would look with the auxiliary
bus. It is very similar in structure to a PCI driver with the probe()
function cleanly registering with its subsystem. This is what I'd like
to see from the new Intel RDMA driver.

Greg, I think this patch is the best clean usage example.

I've looked over this series and it has the right idea and
parts. There is definitely more that can be done to improve mlx5 in
this area, but this series is well scoped and cleans a good part of
it.

Jason


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-05 Thread Jason Gunthorpe
On Thu, Nov 05, 2020 at 08:33:02AM +0100, gregkh wrote:
> > Were there any additional changes you wanted to see happen? I'll go
> > give the final set another once over, but David has been diligently
> > fixing up all the declared major issues so I expect to find at most
> > minor incremental fixups.
> 
> This is in my to-review pile, along with a load of other stuff at the
> moment:
>   $ ~/bin/mdfrm -c ~/mail/todo/
>   1709 messages in /home/gregkh/mail/todo/
> 
> So give me a chance.  There is no rush on my side for this given the
> huge delays that have happened here on the authorship side many times in
> the past :)

On the other hand Leon and his team did invest a lot of time and
effort, very quickly, to build and QA this large mlx5 series here to
give a better/second example as you requested only a few weeks ago.

> If you can review it, or anyone else, that is always most appreciated.

Dan, Leon, myself and others have looked at the auxiliary bus patch
more than a few times now. Leon in particular went over it very
carefully and a number of bugs were fixed while he developed this
series.

There seems to be nothing fundamentally wrong with it, so long as
people are fine with the colour of the shed...

Jason


Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-27 Thread Jason Gunthorpe
On Mon, Jun 27, 2022 at 08:27:37PM +0200, Daniel Borkmann wrote:
> On 6/27/22 8:04 PM, Gustavo A. R. Silva wrote:
> > There is a regular need in the kernel to provide a way to declare
> > having a dynamically sized set of trailing elements in a structure.
> > Kernel code should always use “flexible array members”[1] for these
> > cases. The older style of one-element or zero-length arrays should
> > no longer be used[2].
> > 
> > This code was transformed with the help of Coccinelle:
> > (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file 
> > script.cocci --include-headers --dir . > output.patch)
> > 
> > @@
> > identifier S, member, array;
> > type T1, T2;
> > @@
> > 
> > struct S {
> >...
> >T1 member;
> >T2 array[
> > - 0
> >];
> > };
> > 
> > -fstrict-flex-arrays=3 is coming and we need to land these changes
> > to prevent issues like these in the short future:
> > 
> > ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; 
> > destination buffer has size 0,
> > but the source string has length 2 (including NUL byte) [-Wfortify-source]
> > strcpy(de3->name, ".");
> > ^
> > 
> > Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If
> > this breaks anything, we can use a union with a new member name.
> > 
> > [1] https://en.wikipedia.org/wiki/Flexible_array_member
> > [2] 
> > https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays
> > 
> > Link: https://github.com/KSPP/linux/issues/78
> > Build-tested-by: 
> > https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> > Hi all!
> > 
> > JFYI: I'm adding this to my -next tree. :)
> 
> Fyi, this breaks BPF CI:
> 
> https://github.com/kernel-patches/bpf/runs/7078719372?check_suite_focus=true
> 
>   [...]
>   progs/map_ptr_kern.c:314:26: error: field 'trie_key' with variable sized 
> type 'struct bpf_lpm_trie_key' not at the end of a struct or class is a GNU 
> extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
>   struct bpf_lpm_trie_key trie_key;
>   ^

This will break the rdma-core userspace as well, with a similar
error:

/usr/bin/clang-13 -DVERBS_DEBUG -Dibverbs_EXPORTS -Iinclude 
-I/usr/include/libnl3 -I/usr/include/drm -g -O2 -fdebug-prefix-map=/__w/1/s=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
-D_FORTIFY_SOURCE=2 -Wmissing-prototypes -Wmissing-declarations -Wwrite-strings 
-Wformat=2 -Wcast-function-type -Wformat-nonliteral -Wdate-time 
-Wnested-externs -Wshadow -Wstrict-prototypes -Wold-style-definition -Werror 
-Wredundant-decls -g -fPIC   -std=gnu11 -MD -MT 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o -MF 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o.d -o 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o   -c ../libibverbs/cmd_flow.c
In file included from ../libibverbs/cmd_flow.c:33:
In file included from include/infiniband/cmd_write.h:36:
In file included from include/infiniband/cmd_ioctl.h:41:
In file included from include/infiniband/verbs.h:48:
In file included from include/infiniband/verbs_api.h:66:
In file included from include/infiniband/ib_user_ioctl_verbs.h:38:
include/rdma/ib_user_verbs.h:436:34: error: field 'base' with variable sized 
type 'struct ib_uverbs_create_cq_resp' not at the end of a struct or class is a 
GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
struct ib_uverbs_create_cq_resp base;
^
include/rdma/ib_user_verbs.h:644:34: error: field 'base' with variable sized 
type 'struct ib_uverbs_create_qp_resp' not at the end of a struct or class is a 
GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
struct ib_uverbs_create_qp_resp base;

Which is why I gave up trying to change these..

Though maybe we could just switch off -Wgnu-variable-sized-type-not-at-end
during configuration?

Jason

Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-28 Thread Jason Gunthorpe
On Tue, Jun 28, 2022 at 04:21:29AM +0200, Gustavo A. R. Silva wrote:

> > > Though maybe we could just switch off 
> > > -Wgnu-variable-sized-type-not-at-end  during configuration ?

> We need to think in a different strategy.

I think we will need to switch off the warning in userspace - this is
doable for rdma-core.

On the other hand, if the goal is to enable the array size check
compiler warning I would suggest focusing only on those structs that
actually hit that warning in the kernel. IIRC infiniband doesn't
trigger it because it just pointer casts the flex array to some other
struct.

It isn't actually an array it is a placeholder for a trailing
structure, so it is never indexed.

This is also why we hit the warning because the convenient way for
userspace to compose the message is to squash the header and trailer
structs together in a super struct on the stack, then invoke the
ioctl.
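
ie userspace composes something like this (the drv_* names are made
up), and that 'base' member is exactly where
-Wgnu-variable-sized-type-not-at-end fires once the trailing [0]
becomes []:

struct drv_cq_resp_data {                       /* driver-specific trailer */
        __u32 cqn;
        __u32 reserved;
};

struct drv_create_cq_resp {
        struct ib_uverbs_create_cq_resp base;   /* ends in a flex array -> warning here */
        struct drv_cq_resp_data drv;
};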

Jason 


Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-28 Thread Jason Gunthorpe
On Tue, Jun 28, 2022 at 10:54:58AM -0700, Kees Cook wrote:

 
> which must also be assuming it's a header. So probably better to just
> drop the driver_data field? I don't see anything using it (that I can
> find) besides as a sanity-check that the field exists and is at the end
> of the struct.

The field is guaranteeing alignment of the following structure. IIRC
there are a few cases where we don't have a u64 already to force this.
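
ie something like this, where the zero-size array carries no storage
but still forces the driver's trailing data onto an 8 byte boundary (a
made-up example, not an existing struct):

struct example_resp_hdr {
        __u32 a;
        __u32 b;
        __u32 c;                        /* 12 bytes of fixed fields ... */
        __aligned_u64 driver_data[];    /* ... tail gets padded/aligned to 8 */
};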

Jason


Re: [PATCH v6 10/21] RDMA/umem: Prepare to dynamic dma-buf locking specification

2022-10-10 Thread Jason Gunthorpe
On Sun, Oct 09, 2022 at 03:08:56AM +0300, Dmitry Osipenko wrote:
> On 9/28/22 22:15, Dmitry Osipenko wrote:
> > Prepare InfiniBand drivers to the common dynamic dma-buf locking
> > convention by starting to use the unlocked versions of dma-buf API
> > functions.
> > 
> > Acked-by: Christian König 
> > Signed-off-by: Dmitry Osipenko 
> > ---
> >  drivers/infiniband/core/umem_dmabuf.c | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/infiniband/core/umem_dmabuf.c 
> > b/drivers/infiniband/core/umem_dmabuf.c
> > index 04c04e6d24c3..43b26bc12288 100644
> > --- a/drivers/infiniband/core/umem_dmabuf.c
> > +++ b/drivers/infiniband/core/umem_dmabuf.c
> > @@ -26,7 +26,8 @@ int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf 
> > *umem_dmabuf)
> > if (umem_dmabuf->sgt)
> > goto wait_fence;
> >  
> > -   sgt = dma_buf_map_attachment(umem_dmabuf->attach, DMA_BIDIRECTIONAL);
> > +   sgt = dma_buf_map_attachment_unlocked(umem_dmabuf->attach,
> > + DMA_BIDIRECTIONAL);
> > if (IS_ERR(sgt))
> > return PTR_ERR(sgt);
> >  
> > @@ -102,8 +103,8 @@ void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf 
> > *umem_dmabuf)
> > umem_dmabuf->last_sg_trim = 0;
> > }
> >  
> > -   dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
> > -DMA_BIDIRECTIONAL);
> > +   dma_buf_unmap_attachment_unlocked(umem_dmabuf->attach, umem_dmabuf->sgt,
> > +     DMA_BIDIRECTIONAL);
> >  
> > umem_dmabuf->sgt = NULL;
> >  }
> 
> Jason / Leon,
> 
> Could you please ack this patch?

You probably don't need it, for something so simple, but sure

Acked-by: Jason Gunthorpe 

Jason

Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Jason Gunthorpe
On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> On Thu, 11 Apr 2019 14:01:54 +0300
> Yuval Shaia  wrote:
> 
> > Data center backends use more and more RDMA or RoCE devices and more and
> > more software runs in virtualized environment.
> > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > 
> > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > technology and also because the Virtio specification
> > allows Hardware Vendors to support Virtio protocol natively in order to
> > achieve bare metal performance.
> > 
> > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > Virtio Specification and a look forward on possible implementation
> > techniques.
> > 
> > Open issues/Todo list:
> > List is huge, this is only start point of the project.
> > Anyway, here is one example of item in the list:
> > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> >   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
> >   that this is reasonable so one option is to have one for all and
> >   multiplex the traffic on it. This is not good approach as by design it
> >   introducing an optional starvation. Another approach would be multi
> >   queues and round-robin (for example) between them.
> > 
> > Expectations from this posting:
> > In general, any comment is welcome, starting from hey, drop this as it is a
> > very bad idea, to yeah, go ahead, we really want it.
> > Idea here is that since it is not a minor effort i first want to know if
> > there is some sort interest in the community for such device.
> 
> My first reaction is: Sounds sensible, but it would be good to have a
> spec for this :)

I'm unclear why you'd want to have a virtio queue for anything other
than some kind of command channel.

I'm not sure a QP or CQ benefits from this??

Jason


Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Jason Gunthorpe
On Thu, Apr 11, 2019 at 08:34:20PM +0300, Yuval Shaia wrote:
> On Thu, Apr 11, 2019 at 05:24:08PM +0000, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > Yuval Shaia  wrote:
> > > 
> > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > more software runs in virtualized environment.
> > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > 
> > > > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > > > technology and also because the Virtio specification
> > > > allows Hardware Vendors to support Virtio protocol natively in order to
> > > > achieve bare metal performance.
> > > > 
> > > > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > > > Virtio Specification and a look forward on possible implementation
> > > > techniques.
> > > > 
> > > > Open issues/Todo list:
> > > > List is huge, this is only start point of the project.
> > > > Anyway, here is one example of item in the list:
> > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means 
> > > > that
> > > >   in order to support for example 32K QPs we will need 64K VirtQ. Not 
> > > > sure
> > > >   that this is reasonable so one option is to have one for all and
> > > >   multiplex the traffic on it. This is not good approach as by design it
> > > >   introducing an optional starvation. Another approach would be multi
> > > >   queues and round-robin (for example) between them.
> > > > 
> > > > Expectations from this posting:
> > > > In general, any comment is welcome, starting from hey, drop this as it 
> > > > is a
> > > > very bad idea, to yeah, go ahead, we really want it.
> > > > Idea here is that since it is not a minor effort i first want to know if
> > > > there is some sort interest in the community for such device.
> > > 
> > > My first reaction is: Sounds sensible, but it would be good to have a
> > > spec for this :)
> > 
> > I'm unclear why you'd want to have a virtio queue for anything other
> > that some kind of command channel.
> > 
> > I'm not sure a QP or CQ benefits from this??
> 
> Virtqueue is a standard mechanism to pass data from guest to host. By
> saying that - it really sounds like QP send and recv rings. So my thought
> is to use a standard way for rings. As i've learned this is how it is used
> by other virtio devices ex virtio-net.

I doubt you can use virtio queues from userspace securely? Usually
needs a dedicated page for each user space process

Jason


Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-04-22 Thread Jason Gunthorpe
On Fri, Apr 19, 2019 at 01:16:06PM +0200, Hannes Reinecke wrote:
> On 4/15/19 12:35 PM, Yuval Shaia wrote:
> > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > Yuval Shaia  wrote:
> > > 
> > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > more software runs in virtualized environment.
> > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > 
> > > > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > > > technology and also because the Virtio specification
> > > > allows Hardware Vendors to support Virtio protocol natively in order to
> > > > achieve bare metal performance.
> > > > 
> > > > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > > > Virtio Specification and a look forward on possible implementation
> > > > techniques.
> > > > 
> > > > Open issues/Todo list:
> > > > List is huge, this is only start point of the project.
> > > > Anyway, here is one example of item in the list:
> > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means 
> > > > that
> > > >in order to support for example 32K QPs we will need 64K VirtQ. Not 
> > > > sure
> > > >that this is reasonable so one option is to have one for all and
> > > >multiplex the traffic on it. This is not good approach as by design 
> > > > it
> > > >introducing an optional starvation. Another approach would be multi
> > > >queues and round-robin (for example) between them.
> > > > 
> Typically there will be a one-to-one mapping between QPs and CPUs (on the
> guest). 

Er we are really overloading words here.. The typical expectation is
that a 'RDMA QP' will have thousands and thousands of instances on a
system.

Most likely I think mapping 1:1 a virtio queue to a 'RDMA QP, CQ, SRQ,
etc' is a bad idea...

> However, I'm still curious about the overall intent of this driver. Where
> would the I/O be routed _to_ ?
> It's nice that we have a virtualized driver, but this driver is
> intended to do I/O (even if it doesn't _do_ any I/O ATM :-)
> And this I/O needs to be send to (and possibly received from)
> something.

As yet I have never heard of public RDMA HW that could be coupled to a
virtio scheme. All HW defines their own queue ring buffer formats
without standardization.

> If so, wouldn't it be more efficient to use vfio, either by using SR-IOV or
> by using virtio-mdev?

Using PCI pass through means the guest has to have drivers for the
device. A generic, perhaps slower, virtio path has some appeal in some
cases.

> If so, how would we route the I/O from one guest to the other?
> Shared memory? Implementing a full-blown RDMA switch in qemu?

RoCE rides over the existing ethernet switching layer that qemu plugs
into.

So if you built a shared memory, local host only, virtio-rdma then
you'd probably run through the ethernet switch upon connection
establishment to match the participating VMs.

Jason


Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-05-07 Thread Jason Gunthorpe
On Tue, Apr 30, 2019 at 08:13:54PM +0300, Yuval Shaia wrote:
> On Mon, Apr 22, 2019 at 01:45:27PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 19, 2019 at 01:16:06PM +0200, Hannes Reinecke wrote:
> > > On 4/15/19 12:35 PM, Yuval Shaia wrote:
> > > > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > > > Yuval Shaia  wrote:
> > > > > 
> > > > > > Data center backends use more and more RDMA or RoCE devices and 
> > > > > > more and
> > > > > > more software runs in virtualized environment.
> > > > > > There is a need for a standard to enable RDMA/RoCE on Virtual 
> > > > > > Machines.
> > > > > > 
> > > > > > Virtio is the optimal solution since is the de-facto 
> > > > > > para-virtualizaton
> > > > > > technology and also because the Virtio specification
> > > > > > allows Hardware Vendors to support Virtio protocol natively in 
> > > > > > order to
> > > > > > achieve bare metal performance.
> > > > > > 
> > > > > > This RFC is an effort to addresses challenges in defining the 
> > > > > > RDMA/RoCE
> > > > > > Virtio Specification and a look forward on possible implementation
> > > > > > techniques.
> > > > > > 
> > > > > > Open issues/Todo list:
> > > > > > List is huge, this is only start point of the project.
> > > > > > Anyway, here is one example of item in the list:
> > > > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This 
> > > > > > means that
> > > > > >in order to support for example 32K QPs we will need 64K VirtQ. 
> > > > > > Not sure
> > > > > >that this is reasonable so one option is to have one for all and
> > > > > >multiplex the traffic on it. This is not good approach as by 
> > > > > > design it
> > > > > >introducing an optional starvation. Another approach would be 
> > > > > > multi
> > > > > >queues and round-robin (for example) between them.
> > > > > > 
> > > Typically there will be a one-to-one mapping between QPs and CPUs (on the
> > > guest). 
> > 
> > Er we are really overloading words here.. The typical expectation is
> > that a 'RDMA QP' will have thousands and thousands of instances on a
> > system.
> > 
> > Most likely I think mapping 1:1 a virtio queue to a 'RDMA QP, CQ, SRQ,
> > etc' is a bad idea...
> 
> We have three options, no virtqueue for QP, 1 to 1 or multiplexing. What
> would be your vote on that?
> I think you are for option #1, right? but in this case there is actually no
> use of having a virtio-driver, isn't it?

The virtio driver is supposed to be a standard, like a hardware
standard, for doing the operation.

It doesn't mean that every single element under the driver needs to
use the virtio format QP.

Jason


Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-07-31 Thread Jason Gunthorpe
On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
> 
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
> 
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. 

You just described a seqlock.

We've been talking about providing this as some core service from mmu
notifiers because nearly every use of this API needs it.

IMHO this gets the whole thing backwards; the common pattern is to
protect the 'shadow pte' data with a seqlock (usually open coded),
such that the mmu notifier side has the write side of that lock and
the read side is consumed by the thread accessing or updating the SPTE.


> Reported-by: Michael S. Tsirkin 
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual 
> address")
> Signed-off-by: Jason Wang 
>  drivers/vhost/vhost.c | 145 ++
>  drivers/vhost/vhost.h |   7 +-
>  2 files changed, 94 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue 
> *vq)
>  
>   spin_lock(&vq->mmu_lock);
>   for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - map[i] = rcu_dereference_protected(vq->maps[i],
> -   lockdep_is_held(&vq->mmu_lock));
> + map[i] = vq->maps[i];
>   if (map[i]) {
>   vhost_set_map_dirty(vq, map[i], i);
> - rcu_assign_pointer(vq->maps[i], NULL);
> + vq->maps[i] = NULL;
>   }
>   }
>   spin_unlock(&vq->mmu_lock);
>  
> - /* No need for synchronize_rcu() or kfree_rcu() since we are
> -  * serialized with memory accessors (e.g vq mutex held).
> + /* No need for synchronization since we are serialized with
> +  * memory accessors (e.g vq mutex held).
>*/
>  
>   for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr 
> *uaddr,
>   return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>  
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);

Is a lock supposed to be held for this, or is it single threaded?

> +
> + smp_store_release(&vq->ref, ref + 1);
> + /* Make sure ref counter is visible before accessing the map */
> + smp_load_acquire(&vq->ref);

release/acquire semantics are intended to protect blocks of related
data, so reading something with acquire and throwing away the result
is nonsense.

> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);

If the write to vq->ref is not locked this algorithm won't work, if it
is locked the READ_ONCE is not needed.

> + /* Make sure vq access is done before increasing ref counter */
> + smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> + int ref;
> +
> + /* Make sure map change was done before checking ref counter */
> + smp_mb();

This is probably smp_rmb after reading ref, and if you are setting ref
with smp_store_release then this should be smp_load_acquire() without
an explicit mb.

> + ref = READ_ONCE(vq->ref);
> + if (ref & 0x1) {
> + /* When ref change, we are sure no reader can see
> +  * previous map */
> + while (READ_ONCE(vq->ref) == ref) {
> + set_current_state(TASK_RUNNING);
> + schedule();
> + }
> + }

This is basically 'read_seqcount_begin()' with a schedule instead of
cpu_relax()


> + /* Make sure ref counter was checked before any other
> +  * operations that was dene on map. */
> + smp_mb();

should be in a smp_load_acquire()

> +}
> +
>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> int index,
> unsigned l

Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()

2019-07-31 Thread Jason Gunthorpe
On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote:
> The vhost_set_vring_num_addr() could be called in the middle of
> invalidate_range_start() and invalidate_range_end(). If we don't reset
> invalidate_count after the un-registering of MMU notifier, the
> invalidate_cont will run out of sync (e.g never reach zero). This will
> in fact disable the fast accessor path. Fixing by reset the count to
> zero.
> 
> Reported-by: Michael S. Tsirkin 

Did Michael report this as well?

> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual 
> address")
> Signed-off-by: Jason Wang 
>  drivers/vhost/vhost.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2a3154976277..2a7217c33668 100644
> +++ b/drivers/vhost/vhost.c
> @@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev 
> *d,
>   d->has_notifier = false;
>   }
>  
> + /* reset invalidate_count in case we are in the middle of
> +  * invalidate_start() and invalidate_end().
> +  */
> + vq->invalidate_count = 0;
>   vhost_uninit_vq_maps(vq);
>  #endif
>  


Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()

2019-07-31 Thread Jason Gunthorpe
On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:41, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote:
> > > The vhost_set_vring_num_addr() could be called in the middle of
> > > invalidate_range_start() and invalidate_range_end(). If we don't reset
> > > invalidate_count after the un-registering of MMU notifier, the
> > > invalidate_cont will run out of sync (e.g never reach zero). This will
> > > in fact disable the fast accessor path. Fixing by reset the count to
> > > zero.
> > > 
> > > Reported-by: Michael S. Tsirkin 
> > Did Michael report this as well?
> 
> 
> Correct me if I was wrong. I think it's point 4 described in
> https://lkml.org/lkml/2019/7/21/25.

I'm not sure what that is talking about

But this fixes what I described:

https://lkml.org/lkml/2019/7/22/554

Jason

Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-07-31 Thread Jason Gunthorpe
On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:39, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > We used to use RCU to synchronize MMU notifier with worker. This leads
> > > calling synchronize_rcu() in invalidate_range_start(). But on a busy
> > > system, there would be many factors that may slow down the
> > > synchronize_rcu() which makes it unsuitable to be called in MMU
> > > notifier.
> > > 
> > > A solution is SRCU but its overhead is obvious with the expensive full
> > > memory barrier. Another choice is to use seqlock, but it doesn't
> > > provide a synchronization method between readers and writers. The last
> > > choice is to use vq mutex, but it need to deal with the worst case
> > > that MMU notifier must be blocked and wait for the finish of swap in.
> > > 
> > > So this patch switches use a counter to track whether or not the map
> > > was used. The counter was increased when vq try to start or finish
> > > uses the map. This means, when it was even, we're sure there's no
> > > readers and MMU notifier is synchronized. When it was odd, it means
> > > there's a reader we need to wait it to be even again then we are
> > > synchronized.
> > You just described a seqlock.
> 
> 
> Kind of, see my explanation below.
> 
> 
> > 
> > We've been talking about providing this as some core service from mmu
> > notifiers because nearly every use of this API needs it.
> 
> 
> That would be very helpful.
> 
> 
> > 
> > IMHO this gets the whole thing backwards, the common pattern is to
> > protect the 'shadow pte' data with a seqlock (usually open coded),
> > such that the mmu notififer side has the write side of that lock and
> > the read side is consumed by the thread accessing or updating the SPTE.
> 
> 
> Yes, I've considered something like that. But the problem is, mmu notifier
> (writer) need to wait for the vhost worker to finish the read before it can
> do things like setting dirty pages and unmapping page.  It looks to me
> seqlock doesn't provide things like this.  

The seqlock is usually used to prevent a 2nd thread from accessing the
VA while it is being changed by the mm. ie you use something seqlocky
instead of the ugly mmu_notifier_unregister/register cycle.

You are supposed to use something simple like a spinlock or mutex
inside the invalidate_range_start to serialize tear-down of the SPTEs
with their accessors.
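As a deliberately minimal sketch of that pattern -- the lock, the maps[]
array and VHOST_ADDR_USED follow the posted code, but the helper names and
embedding the notifier in the vq are made up here for illustration:

/* Accessor side (vhost worker): the map may only be used while the
 * lock is held, so invalidate_range_start() can block it out. */
static bool vq_access_used(struct vhost_virtqueue *vq)
{
        bool ok = false;

        spin_lock(&vq->mmu_lock);
        if (vq->maps[VHOST_ADDR_USED]) {
                /* read/write through vq->maps[VHOST_ADDR_USED]->addr here */
                ok = true;
        }
        spin_unlock(&vq->mmu_lock);
        return ok;
}

/* MMU notifier side: tearing the maps down under the same lock means
 * no accessor can still be dereferencing them when we return. */
static int vq_invalidate_range_start(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range)
{
        struct vhost_virtqueue *vq = container_of(mn, struct vhost_virtqueue, mn);

        spin_lock(&vq->mmu_lock);
        /* mark dirty and NULL out any vq->maps[] entry overlapping 'range' */
        spin_unlock(&vq->mmu_lock);
        return 0;
}

The datapath cost is one uncontended lock/unlock pair per access, which is
the comparison point against copy_from_user() raised later in the thread.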

> write_seqcount_begin()
> 
> map = vq->map[X]
> 
> write or read through map->addr directly
> 
> write_seqcount_end()
> 
> 
> There's no rmb() in write_seqcount_begin(), so map could be read before
> write_seqcount_begin(), but it looks to me now that this doesn't harm at
> all, maybe we can try this way.

That is because it is a write side lock, not a read lock. IIRC
seqlocks have weaker barriers because the write side needs to be
serialized in some other way.

The requirement I see is you need invalidate_range_start to block
until another thread exits its critical section (ie stops accessing
the SPTEs). 

That is a spinlock/mutex.

You just can't invent a faster spinlock by open coding something with
barriers, it doesn't work.

Jason

Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-01 Thread Jason Gunthorpe
On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang wrote:
> 
> On 2019/8/1 上午3:30, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> > > On 2019/7/31 下午8:39, Jason Gunthorpe wrote:
> > > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > > > We used to use RCU to synchronize the MMU notifier with the worker. This
> > > > > leads to calling synchronize_rcu() in invalidate_range_start(). But on a
> > > > > busy system there are many factors that may slow down synchronize_rcu(),
> > > > > which makes it unsuitable to be called in an MMU notifier.
> > > > >
> > > > > A solution is SRCU, but its overhead is obvious with the expensive full
> > > > > memory barrier. Another choice is to use a seqlock, but it doesn't
> > > > > provide a synchronization method between readers and writers. The last
> > > > > choice is to use the vq mutex, but then we need to deal with the worst
> > > > > case where the MMU notifier must block and wait for swap-in to finish.
> > > > >
> > > > > So this patch switches to using a counter to track whether or not the map
> > > > > is in use. The counter is increased when the vq starts or finishes using
> > > > > the map. This means that when it is even, we're sure there are no readers
> > > > > and the MMU notifier is synchronized. When it is odd, there is a reader
> > > > > and we need to wait for the counter to become even again, at which point
> > > > > we are synchronized.
> > > > You just described a seqlock.
> > > 
> > > Kind of, see my explanation below.
> > > 
> > > 
> > > > We've been talking about providing this as some core service from mmu
> > > > notifiers because nearly every use of this API needs it.
> > > 
> > > That would be very helpful.
> > > 
> > > 
> > > > IMHO this gets the whole thing backwards, the common pattern is to
> > > > protect the 'shadow pte' data with a seqlock (usually open coded),
> > > > such that the mmu notifier side has the write side of that lock and
> > > > the read side is consumed by the thread accessing or updating the SPTE.
> > > 
> > > Yes, I've considered something like that. But the problem is, the mmu
> > > notifier (writer) needs to wait for the vhost worker to finish reading
> > > before it can do things like setting dirty pages and unmapping pages.
> > > It looks to me like a seqlock doesn't provide this.
> > The seqlock is usually used to prevent a 2nd thread from accessing the
> > VA while it is being changed by the mm. ie you use something seqlocky
> > instead of the ugly mmu_notifier_unregister/register cycle.
> 
> 
> Yes, so we have two mappings:
> 
> [1] vring address to VA
> [2] VA to PA
> 
> And have several readers and writers
> 
> 1) set_vring_num_addr(): writer of both [1] and [2]
> 2) MMU notifier: reader of [1] writer of [2]
> 3) GUP: reader of [1] writer of [2]
> 4) memory accessors: reader of [1] and [2]
> 
> Fortunately, 1), 3) and 4) are already synchronized through vq->mutex. We
> only need to deal with synchronization between 2) and each of the rest:
> Sync between 1) and 2): For mapping [1], I do
> mmu_notifier_unregister/register. This helps avoid holding any lock to do
> the overlap check.

I suspect you could have done this with an RCU technique instead of
register/unregister.

> Sync between 2) and 4): For mapping [1], both are readers, so no
> synchronization is needed. For mapping [2], synchronize through RCU (or
> something similar to a seqlock).

You can't really use a seqlock; seqlocks are collision-retry locks,
and the semantic here is that invalidate_range_start *MUST* not
continue until the thread doing #4 above is guaranteed to no longer be
touching the memory.

This must be a proper barrier, like a spinlock, mutex, or
synchronize_rcu.

And, again, you can't re-invent a spinlock with open coding and get
something better.
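For contrast, this is the collision-retry shape a seqcount gives you
(assuming a seqcount_t vq->seq; the other names are illustrative). Nothing
on the reader side ever makes the writer wait, which is the opposite of the
guarantee invalidate_range_start needs:

        unsigned int seq;
        struct vhost_map *map;

        /* Reader: simply re-runs its critical section if a writer
         * intervened; it has no way to hold the writer off. */
        do {
                seq = read_seqcount_begin(&vq->seq);
                map = vq->maps[VHOST_ADDR_USED];
                /* any use of map->addr here can race with teardown */
        } while (read_seqcount_retry(&vq->seq, seq));

        /* Writer (e.g. invalidate_range_start), serialized externally:
         * bumps the sequence so readers retry, but cannot tell when a
         * reader that already loaded the old map has stopped using it. */
        write_seqcount_begin(&vq->seq);
        vq->maps[VHOST_ADDR_USED] = NULL;
        write_seqcount_end(&vq->seq);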

Jason

Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-02 Thread Jason Gunthorpe
On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > This must be a proper barrier, like a spinlock, mutex, or
> > synchronize_rcu.
> 
> 
> I start with synchronize_rcu() but both you and Michael raise some
> concern.

I've also idly wondered if calling synchronize_rcu() under the various
mm locks is a deadlock situation.

> Then I try spinlock and mutex:
> 
> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> improvement.

I think the topic here is correctness not performance improvement

> 2) SRCU: a full memory barrier is required on srcu_read_lock(), which
> still leads to little performance improvement

> 3) mutex: a possible issue is the need to wait for the page to be swapped
> in (is this unacceptable?); another issue is that we need to hold the vq
> lock during the range overlap check.

I have a feeling that mmu notifiers cannot safely become dependent on
the progress of swap without causing deadlock. You probably should avoid
this.

> > And, again, you can't re-invent a spinlock with open coding and get
> > something better.
> 
> So the question is whether waiting for swap is considered unsuitable for
> MMU notifiers. If not, it would simplify the code. If so, we still need to
> figure out a possible solution.
> 
> Btw, I came up with another idea: disable preemption when the vhost thread
> needs to access the memory. Then register a preempt notifier, and if the
> vhost thread is preempted, we're sure no one will access the memory and we
> can do the cleanup.

I think you should use the spinlock so at least the code is obviously
functionally correct and worry about designing some properly justified
performance change after.

Jason


Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-02 Thread Jason Gunthorpe
On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > synchronize_rcu.
> > > 
> > > 
> > > I start with synchronize_rcu() but both you and Michael raise some
> > > concern.
> > 
> > I've also idly wondered if calling synchronize_rcu() under the various
> > mm locks is a deadlock situation.
> > 
> > > Then I try spinlock and mutex:
> > > 
> > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > improvement.
> > 
> > I think the topic here is correctness not performance improvement
> 
> The topic is whether we should revert
> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual 
> address")
> 
> or keep it in. The only reason to keep it is performance.

Yikes, I'm not sure you can ever win against copy_from_user using
mmu_notifiers?  The synchronization requirements are likely always
more expensive unless large and scattered copies are being done..

The rcu is about the only simple approach that could be less
expensive, and that gets back to the question of whether you can block
an invalidate_range_start in synchronize_rcu or not..
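For reference, the rcu shape in question is roughly this (a sketch of the
approach, not the exact code in that commit):

        /* datapath (vhost worker) */
        rcu_read_lock();
        map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
        if (map)
                /* ... access map->addr ... */;
        rcu_read_unlock();

        /* invalidate_range_start() */
        rcu_assign_pointer(vq->maps[VHOST_ADDR_USED], NULL);
        synchronize_rcu();      /* wait for every reader to finish; the open
                                 * question is whether blocking like this is
                                 * safe under the mm locks held by the caller */
        /* only now is it safe to mark dirty and let the unmap proceed */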

So, frankly, I'd revert it until someone could prove the rcu solution is
OK..

BTW, how do you get copy_from_user to work outside a syscall?

Also, why can't this just permanently GUP the pages? In fact, where
does it put_page them anyhow? Worrying that 7f466 adds a get_user page
but does not add a put_page??
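For reference, the usual pairing is something like this (a generic sketch,
not vhost code; 'uaddr' and the page count are placeholders):

        struct page *pages[8];
        int i, got;

        /* setup: pin the user pages once, up front */
        got = get_user_pages_fast(uaddr, ARRAY_SIZE(pages), FOLL_WRITE, pages);
        if (got <= 0)
                return got;

        /* ... access the memory via page_address()/kmap() ... */

        /* teardown: every pinned page must get a matching put_page() */
        for (i = 0; i < got; i++) {
                set_page_dirty_lock(pages[i]);  /* if the memory was written */
                put_page(pages[i]);
        }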

Jason


Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-03 Thread Jason Gunthorpe
On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > > synchronize_rcu.
> > > > > 
> > > > > 
> > > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > > concern.
> > > > 
> > > > I've also idly wondered if calling synchronize_rcu() under the various
> > > > mm locks is a deadlock situation.
> > > > 
> > > > > Then I try spinlock and mutex:
> > > > > 
> > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 
> > > > > performance
> > > > > improvement.
> > > > 
> > > > I think the topic here is correctness not performance improvement
> > > 
> > > The topic is whether we should revert
> > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual 
> > > address")
> > > 
> > > or keep it in. The only reason to keep it is performance.
> > 
> > Yikes, I'm not sure you can ever win against copy_from_user using
> > mmu_notifiers?
> 
> Ever since copy_from_user started playing with flags (for SMAP) and
> added speculation barriers there's a chance we can win by accessing
> memory through the kernel address.

You think copy_to_user will be more expensive than the minimum two
atomics required to synchronize with another thread?

> > Also, why can't this just permanently GUP the pages? In fact, where
> > does it put_page them anyhow? Worrying that 7f466 adds a get_user page
> > but does not add a put_page??

You didn't answer this.. Why not just use GUP?

Jason


Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-06 Thread Jason Gunthorpe
On Sun, Aug 04, 2019 at 04:07:17AM -0400, Michael S. Tsirkin wrote:
> > > > Also, why can't this just permanently GUP the pages? In fact, where
> > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page
> > > > but does not add a put_page??
> > 
> > You didn't answer this.. Why not just use GUP?
> > 
> > Jason
> 
> Sorry I misunderstood the question. Permanent GUP breaks lots of
> functionality we need such as THP and numa balancing.

Really? It doesn't look like that many pages are involved..

Jason


Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-06 Thread Jason Gunthorpe
On Mon, Aug 05, 2019 at 12:20:45PM +0800, Jason Wang wrote:
> 
> On 2019/8/2 下午8:46, Jason Gunthorpe wrote:
> > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > synchronize_rcu.
> > > 
> > > I start with synchronize_rcu() but both you and Michael raise some
> > > concern.
> > I've also idly wondered if calling synchronize_rcu() under the various
> > mm locks is a deadlock situation.
> 
> 
> Maybe, that's why I suggest using vhost_work_flush(), which is much more
> lightweight and can achieve the same function. It can guarantee all previous
> work has been processed after vhost_work_flush() returns.

If things are already running in a work, then yes, you can piggyback
on the existing spinlocks inside the workqueue and be Ok

However, if that work is doing any copy_from_user, then the flush
becomes dependent on swap and it won't work again...

> > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > improvement.
> > I think the topic here is correctness, not performance improvement
 
> But the whole series is to speed up vhost.

So? Starting with a whole bunch of crazy, possibly broken, locking and
claiming a performance win is not reasonable.

> Spinlock is correct but makes the whole series meaningless considering it
> won't bring any performance improvement.

You can't invent a faster spinlock by opencoding some wild
scheme. There is nothing special about the usage here, it needs a
blocking lock, plain and simple.

Jason

Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-06 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 09:36:58AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 06, 2019 at 08:53:17AM -0300, Jason Gunthorpe wrote:
> > On Sun, Aug 04, 2019 at 04:07:17AM -0400, Michael S. Tsirkin wrote:
> > > > > > Also, why can't this just permanently GUP the pages? In fact, where
> > > > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user 
> > > > > > page
> > > > > > but does not add a put_page??
> > > > 
> > > > You didn't answer this.. Why not just use GUP?
> > > > 
> > > > Jason
> > > 
> > > Sorry I misunderstood the question. Permanent GUP breaks lots of
> > > functionality we need such as THP and numa balancing.
> > 
> > Really? It doesn't look like that many pages are involved..
> > 
> > Jason
> 
> Yea. But they just might happen to be heavily accessed ones

Maybe you can solve the numa balance problem some other way and use
normal GUP..

Jason 


Re: [PATCH V4 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

2019-08-07 Thread Jason Gunthorpe
On Wed, Aug 07, 2019 at 03:06:15AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize the MMU notifier with the worker. This
> leads to calling synchronize_rcu() in invalidate_range_start(). But on a
> busy system there are many factors that may slow down synchronize_rcu(),
> which makes it unsuitable to be called in an MMU notifier.
> 
> So this patch switches to using a seqlock counter to track whether or not
> the map is in use. The counter is increased when the vq starts or finishes
> using the map. This means that when it is even, we're sure there are no
> readers and the MMU notifier is synchronized. When it is odd, there is a
> reader and we need to wait for the counter to become even again, at which
> point we are synchronized. Considering the read critical section is pretty
> small, the synchronization should be done very fast.
> 
> Reported-by: Michael S. Tsirkin 
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang 
>  drivers/vhost/vhost.c | 141 ++
>  drivers/vhost/vhost.h |   7 ++-
>  2 files changed, 90 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..57bfbb60d960 100644
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>  
>   spin_lock(&vq->mmu_lock);
>   for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - map[i] = rcu_dereference_protected(vq->maps[i],
> -   lockdep_is_held(&vq->mmu_lock));
> + map[i] = vq->maps[i];
>   if (map[i]) {
>   vhost_set_map_dirty(vq, map[i], i);
> - rcu_assign_pointer(vq->maps[i], NULL);
> + vq->maps[i] = NULL;
>   }
>   }
>   spin_unlock(&vq->mmu_lock);
>  
> - /* No need for synchronize_rcu() or kfree_rcu() since we are
> -  * serialized with memory accessors (e.g vq mutex held).
> + /* No need for synchronization since we are serialized with
> +  * memory accessors (e.g vq mutex held).
>*/
>  
>   for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,40 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>   return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>  
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> + write_seqcount_begin(&vq->seq);
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> + write_seqcount_end(&vq->seq);
> +}

The write side of a seqlock only provides write barriers. Access to

map = vq->maps[VHOST_ADDR_USED];

Still needs a read side barrier, and then I think this will be no
better than a normal spinlock.

It also doesn't seem like this algorithm even needs a seqlock, as this
is just a one bit flag

atomic_set_bit(using map)
smp_mb__after_atomic()
.. maps [...]
atomic_clear_bit(using map)


map = NULL;
smp_mb__before_atomic();
while (atomic_read_bit(using map))
   relax()

Again, not clear this could be faster than a spinlock when the
barriers are correct...
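Spelled out with the ordinary bitop helpers, that sketch is roughly the
following (the flag and field names are invented; the point is where the
barriers have to sit):

        /* accessor (vhost worker) */
        set_bit(VQ_MAP_IN_USE, &vq->map_state);
        smp_mb__after_atomic();         /* flag visible before the map is touched */
        if (vq->maps[VHOST_ADDR_USED])
                /* ... use the map ... */;
        smp_mb__before_atomic();        /* map accesses done before clearing */
        clear_bit(VQ_MAP_IN_USE, &vq->map_state);

        /* invalidate_range_start() */
        vq->maps[VHOST_ADDR_USED] = NULL;
        smp_mb();                       /* NULL store ordered before polling the flag */
        while (test_bit(VQ_MAP_IN_USE, &vq->map_state))
                cpu_relax();            /* spin until the accessor leaves its section */

That is two atomics plus full barriers on every datapath access, which is
why it is hard to see it beating a plain spinlock.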

Jason