RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 28, 2021 1:12 AM
> 
[...] 
> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding a "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>same group #10 is in use"
> 
> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 

Currently all the discussion is around implicit vs. explicit uAPI semantics
for the group restriction. However, if we look beyond groups, the implicit
semantics might be inevitable when dealing with incompatible IOMMU domains.
An existing example of IOMMU incompatibility is IOMMU_CACHE. In the future
there could be other incompatibilities, such as whether nested translation
is supported. In the end userspace has to do some due diligence on
understanding the IOMMU topology and attributes to decide how many VFIO
containers or ioasid FDs should be created. It does push some burden onto
userspace, but it's difficult to define a group-like kernel object to
enforce such a restriction for IOMMU compatibility. Then the best the
kernel can do is to return an informative error message when an
incompatible device is attached to an existing domain. If this is the
perceived way to move forward anyway, I feel that removing the explicit
group FD from the uAPI doesn't make userspace any worse off...

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, April 26, 2021 8:38 PM
> 
[...]
> > Want to hear your opinion for one open here. There is no doubt that
> > an ioasid represents a HW page table when the table is constructed by
> > userspace and then linked to the IOMMU through the bind/unbind
> > API. But I'm not very sure about whether an ioasid should represent
> > the exact pgtable or the mapping metadata when the underlying
> > pgtable is indirectly constructed through the map/unmap API. VFIO does
> > it the latter way, which is why it allows multiple incompatible domains
> > in a single container that all share the same mapping metadata.
> 
> I think VFIO's map/unmap is way too complex and we know it has bad
> performance problems.

Can you or Alex elaborate on where the complexity and performance problems
lie in VFIO map/unmap? We'd like to understand it in more detail and see how
to avoid it in the new interface.

> 
> If /dev/ioasid is single HW page table only then I would focus on that
> implementation and leave it to userspace to span different
> /dev/ioasids if needed.
> 
> > OK, now I see where the disconnect comes from. In my context ioasid
> > is the identifier that is actually used on the wire, but it seems you
> > treat it as a sw-defined namespace purely for representing page tables.
> > We should clear up this concept first before further discussing other
> > details. 😊
> 
> There is no general HW requirement that every IO page table be
> referred to by the same PASID and this API would have to support

Yes, but what is the value of allowing multiple PASIDs to refer to
the same I/O page table (except in the nesting pgtable case)? Doesn't it
lead to a poor IOTLB efficiency issue, similar to multiple IOMMU domains
referring to the same page table?

> non-PASID IO page tables as well. So I'd keep the two things
> separated in the uAPI - even though the kernel today has a global
> PASID pool.

For non-PASID usages the allocated PASID will be wasted if we don't
separate ioasid from pasid. But it may be worthwhile given the 1M available
PASIDs and the simplification in the uAPI, which then only needs to care
about one ID space.

> 
> > Then following your proposal, does it mean that we need another
> > interface for allocating a PASID? And since ioasid means a different
> > thing in the uAPI and the in-kernel API, possibly a new name is
> > required to avoid confusion?
> 
> I would suggest have two ways to control the PASID
> 
>  1) Over /dev/ioasid allocate a PASID for an IOASID. All future PASID
> based usages of the IOASID will use that global PASID
> 
>  2) Over the device FD, when the IOASID is bound return the PASID that
> was selected. If the IOASID does not have a global PASID then the
> kernel is free to make something up. In this mode a single IOASID
> can have multiple PASIDs.
> 
> Simple things like DPDK can use #2 and potentially have better PASID
> limits. hypervisors will most likely have to use #1, but it depends on
> how their vIOMMU interface works.

Can you elaborate on why DPDK would want to use #2, i.e. not use a global
PASID?
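
(For concreteness, a rough sketch of the two modes as hypothetical uAPI
structures; every name below is invented for illustration and is not part
of any proposal:)

  /* Mode #1: hypothetical ioctl on /dev/ioasid -- one global PASID
   * that all future PASID-based usages of this IOASID would use */
  struct ioasid_alloc_pasid {
          __u32 ioasid;   /* which IOASID to assign a PASID to */
          __u32 pasid;    /* out: the global PASID */
  };

  /* Mode #2: hypothetical bind over the device FD -- the kernel is
   * free to pick a PASID per bind, so one IOASID can end up with
   * several PASIDs */
  struct device_bind_ioasid {
          __u32 ioasid;   /* IOASID to bind this device to */
          __u32 pasid;    /* out: the PASID the kernel selected */
  };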

> 
> I think the name IOASID is fine for the uAPI, the kernel version can
> be called ioasid_id or something.

ioasid already contains "id", so ioasid_id just adds confusion. Another
point is that ioasid is currently used to represent both the PCI PASID and
the ARM substream ID in the kernel. It implies that if we want to separate
ioasid and pasid in the uAPI, the 'pasid' also needs to be replaced with
another general term usable for the substream ID. Are we making the
terms too confusing here?

> 
> (also looking at ioasid.c, why do we need such a thin and odd wrapper
> around xarray?)
> 
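
(For readers without the source at hand: the wrapper being questioned is
roughly an xarray plus an ID allocator. A simplified schematic follows,
not the actual ioasid.c:)

  static DEFINE_XARRAY_ALLOC(ioasid_xa);

  ioasid_t ioasid_alloc_sketch(void *private, ioasid_t min, ioasid_t max)
  {
          u32 id;

          /* allocate an unused ID in [min, max] and store private there */
          if (xa_alloc(&ioasid_xa, &id, private,
                       XA_LIMIT(min, max), GFP_KERNEL))
                  return INVALID_IOASID;
          return id;
  }

  void ioasid_free_sketch(ioasid_t ioasid)
  {
          xa_erase(&ioasid_xa, ioasid);
  }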

I'll leave it to Jean and Jacob.

Thanks
Kevin

Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-04-27 Thread Mikko Perttunen

On 4/28/21 8:51 AM, Dmitry Osipenko wrote:

23.04.2021 19:32, Thierry Reding wrote:

Note that there will be no new releases of the bootloader for earlier
devices, so adding support for these new DT bindings will not be
practical. The bootloaders on those devices do pass information about
the active framebuffer via the kernel command-line, so we may want to
add code to create reserved regions in the IOMMU based on that.


Since this change requires a bootloader update anyway, why isn't it
possible to fix the bootloader properly, making it stop all DMA
activity before jumping into the kernel?



That is not desirable, as then we couldn't have a seamless
bootloader-to-kernel boot splash transition.


Mikko

Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-04-27 Thread Dmitry Osipenko
23.04.2021 19:32, Thierry Reding wrote:
> Note that an earlier proposal was to use the existing simple-framebuffer
> device tree bindings to transport this information. Unfortunately there
> are cases where this is not enough. On Tegra SoCs, for example, the
> bootloader will also set up a color space correction lookup table in the
> system memory that the display controller will access during boot,
> alongside the framebuffer. The simple-framebuffer DT bindings have no
> way of describing this (and I guess one could argue that this particular
> setup no longer is a "simple" framebuffer), so the above, more flexible
> proposal was implemented.

Will simple-framebuffer be able to use that reserved region
transparently? Or will it require a custom simple-framebuffer driver?

Could we make simple-framebuffer support a part of this series?

Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-04-27 Thread Dmitry Osipenko
23.04.2021 19:32, Thierry Reding wrote:
> Note that there will be no new releases of the bootloader for earlier
> devices, so adding support for these new DT bindings will not be
> practical. The bootloaders on those devices do pass information about
> the active framebuffer via the kernel command-line, so we may want to
> add code to create reserved regions in the IOMMU based on that.

Since this change requires a bootloader update anyway, why isn't it
possible to fix the bootloader properly, making it stop all DMA
activity before jumping into the kernel?

Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-04-27 Thread Dmitry Osipenko
27.04.2021 21:30, Krishna Reddy wrote:
>> Is it always safe to enable the SMMU ASID in the middle of a DMA request
>> made by a memory client?
> 
> From the MC point of view, it is safe to enable, and it has been this way
> for many years in downstream code for the display engine.
> It doesn't impact transactions that have already bypassed the SMMU before
> the SMMU ASID is enabled.
> Transactions that are yet to pass the SMMU stage would go through the SMMU
> once the SMMU ASID is enabled and visible.

Hello,

Thank you for the answer. Could you please give more information about:

1) Are you on the software or the hardware team, or both?

2) Is the SMMU a third-party IP or developed in-house?

3) Do you have direct access to the HDL sources? Are you 100% sure that
the hardware does what you say?

4) What happens when the CPU writes to the ASID register? Does the SMMU state
machine latch the ASID status (making it visible) only at a single "safe" point?

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > Starting from a BDF the general pseudo code is:
> > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > >  device_fd = open("/dev/vfio/"+device_name)
> > >  ioasidfd = open("/dev/ioasid")
> > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > 
> > This line is the problem.
> > 
> > [Historical aside: Alex's early drafts for the VFIO interface looked
> > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > was a bad idea, and groups were developed instead.  I still think it's
> > a bad idea, and not just for POWER]
> 
> Spawning the VFIO device FD from the group FD is incredibly gross from
> a kernel design perspective. Since that was done the struct
> vfio_device missed out on a sysfs presence and doesn't have the
> typical 'struct device' member or dedicated char device you'd expect a
> FD based subsystem to have.
> 
> This basically traded normal usage of the driver core for something
> that doesn't serve a technical usage. Given that we are now nearly 10 years
> later and see that real, widely deployed applications are not doing
> anything special with the group FD, it makes me question the wisdom of
> this choice.

I'm really not sure what "anything special" would constitute here.

> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.  
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding a "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>same group #10 is in use"

Um.. is there a sane way to return strings from an ioctl()?
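
(One plausible pattern, sketched here as an assumption rather than an
existing interface: the ioctl argument struct reserves a user buffer that
the kernel fills with a message on failure. All names are invented:)

  struct ioasid_join_device {
          __u32 argsz;
          __u32 flags;
          __s32 ioasid_fd;
          __u64 err_msg;      /* userspace pointer to a char buffer */
          __u32 err_msg_len;  /* size of that buffer */
  };

  /* On failure the kernel would copy_to_user() a NUL-terminated
   * explanation into err_msg before returning the errno. */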

> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 
> > But IMO, the success path on a multi-device group is kind of worse:
> > you've now made a meaningful and visible change to the setup of
> > devices which are not mentioned in this line *at all*.  
> 
> I don't think spawning a single device_fd from the group clearly says
> there are repercussions outside that immediate, single, device.

It's not the fact that the device fds are spawned from the group fd.
It's the fact that the "attach" operation - binding the group to the
container now, binding the whatever to the iosasid in future -
explicitly takes a group.  That's an operation that affects a group,
so the interface should reflect that.

Getting the device fds from the group fd kind of follows, because it's
unsafe to do basically anything on the device unless you already
control the group (which in this case means attaching it to a
container/ioasid).  I'm entirely open to ways of doing that that are
less inelegant from a sysfs integration point of view, but the point
is you must manage the group before you can do anything at all with
individual devices.

> That comes from understanding what the ioctls are doing, and reading
> the documentation. The same applies to some non-group FD world.
> 
> > Yes, it makes set up more of a pain, but it's necessary complexity to
> > actually understand what's going on here.
> 
> There is a real technical problem here - the VFIO group is the thing
> that spawns the device_fd and that is incompatible with the idea to
> centralize the group security logic in drivers/iommu/ and share it
> with multiple subsystems.

I don't see why.  I mean, sure, you don't want explicitly the *vfio*
group as such.  But IOMMU group is already a cross-subsystem concept
and you can explicitly expose that in a different way.

> We also don't have an obvious clean way to incorporate a group FD into
> other subsystems (nor would I want to).

If you don't have a group concept in other subsystems, there's a fair
chance they are broken.  There are a bunch of operations that are
inherently per-group.  Well.. per container/IOASID, but the
granularity of membership for that is the group.

> One option is VFIO can keep its group FD but nothing else will have
> anything like it. However I don't much like the idea that VFIO will
> have a special and unique programming model to do the same things
> other subsystems will do. That will make it harder for userspace to
> implement.

Again, I really think this is necessary complexity.  You're right that
far too little of the userspace properly understands group
restrictions.. but these come from real hardware limitations, and I
don't feel like making it *less* obvious in the interface is going to
help that.

> But again, lets see what the draft ioasid proposal looks like and
> maybe someone will see a different solution.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 02:24:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:
> 
> > > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > > APIs for allocating multiple tables and other things. I would be
> > > > interested to hear someone from IBM talk about what it is doing and
> > > > how it doesn't fit into today's IOMMU API.
> > 
> > Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
> > backend represents a single DMA address space which all devices in the
> > container will see at all times.  The difference is that there can be
> > multiple (well, 2) "windows" of valid IOVAs within that address space.
> > Each window can have a different TCE (page table) layout.  For kernel
> > drivers, a smallish translated window at IOVA 0 is used for 32-bit
> > devices, and a large direct mapped (no page table) window is created
> > at a high IOVA for better performance with 64-bit DMA capable devices.
> >
> > With the VFIO backend we create (but don't populate) a similar
> > smallish 32-bit window, userspace can create its own secondary window
> > if it likes, though obviously for userspace use there will always be a
> > page table.  Userspace can choose the total size (but not address),
> > page size and to an extent the page table format of the created
> > window.  Note that the TCE page table format is *not* the same as the
> > POWER CPU core's page table format.  Userspace can also remove the
> > default small window and create its own.
> 
> So what do you need from the generic API? I'd suggest if userspace
> passes in the required IOVA range it would benefit all the IOMMU
> drivers to setup properly sized page tables and PPC could use that to
> drive a single window. I notice this is all DPDK did to support TCE.

Yes.  My proposed model for a unified interface would be that when you
create a new container/IOASID, *no* IOVAs are valid.  Before you can
map anything you would have to create a window with specified base,
size, pagesize (probably some flags for extension, too).  That could
fail if the backend IOMMU can't handle that IOVA range, it could be a
backend no-op if the requested window lies within a fixed IOVA range
the backend supports, or it could actually reprogram the back end for
the new window (such as for POWER TCEs).  Regardless of the hardware,
attempts to map outside the created window(s) would be rejected by
software.

I expect we'd need some kind of query operation to expose limitations
on the number of windows, addresses for them, available pagesizes etc.
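
(To make the shape of that concrete, a hypothetical create-window call
could look like the sketch below. The field names are invented here and
are not part of any proposal:)

  struct ioasid_create_window {
          __u64 iova_base;    /* requested base of the window */
          __u64 size;         /* window size in bytes */
          __u64 pagesize;     /* IOMMU page size for this window */
          __u32 flags;        /* room for extensions */
  };
  /* Fails if the backend IOMMU cannot cover [iova_base, iova_base + size);
   * a no-op if the range lies within a fixed range the backend already
   * supports; otherwise reprograms the backend, e.g. POWER TCE windows.
   * Maps outside all created windows are rejected in software. */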

> > The second wrinkle is pre-registration.  That lets userspace register
> > certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> > allowed to be mapped into the IOMMU.  This is a performance
> > optimization, because on pre-registration we also pre-account memory
> > that will be effectively locked by DMA mappings, rather than doing it
> > at DMA map and unmap time.
> 
> This feels like nesting IOASIDs to me, much like a vPASID.
> 
> The pre-registered VA range would be the root of the tree and the
> vIOMMU created ones would be children of the tree. This could allow
> the map operations of the child to refer to already prepped physical
> memory held in the root IOASID avoiding the GUP/etc cost.

Huh... I never thought of it that way, but yeah, that sounds like it
could work.  More elegantly than the current system in fact.

> Seems fairly genericish, though I'm not sure about the kvm linkage..

I think it should be doable.  We'd basically need to give KVM a handle
on the parent AS, and the child AS, and the guest side handle (what
PAPR calls a "Logical IO Bus Number" - liobn).  KVM would then
translate H_PUT_TCE etc. hypercalls on that liobn into calls into the
IOMMU subsystem to map bits of the parent AS into the child.  We'd
probably have to have some requirements that either parent AS is
identity-mapped to a subset of the userspace AS (effectively what we
have now) or that parent AS is the same as guest physical address.
Not sure which would work better.

> > I like the idea of a common DMA/IOMMU handling system across
> > platforms.  However in order to be efficiently usable for POWER it
> > will need to include multiple windows, allowing the user to change
> > those windows and something like pre-registration to amortize
> > accounting costs for heavy vIOMMU load.
> 
> I have a feeling /dev/ioasid is going to end up with some HW specific
> escape hatch to create some HW specific IOASID types and operate on
> them in a HW specific way.
> 
> However, what I would like to see is that something simple like DPDK
> can have a single implementation - POWER should implement the standard
> operations and map them to something that will work for it.
> 
> As an ideal, only things like the HW specific qemu vIOMMU driver
> should be reaching for all the special stuff.

I'm hoping we can even avoid that, usually.  With the explicit

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 01:39:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:11:25PM +1000, David Gibson wrote:
> 
> > > So your proposal sort of moves the entire container/group/domain
> > > management into /dev/ioasid and then leaves vfio to only provide
> > > device-specific uAPI. An ioasid represents a page table (address
> > > space), and thus is equivalent to the scope of a VFIO container.
> > 
> > Right.  I don't really know how /dev/ioasid is supposed to work, and
> > so far I don't see how it conceptually differs from a container.  What
> > is it adding?
> 
> There are three motivating topics:
>  1) /dev/vfio/vfio is only usable by VFIO and we have many interesting
> use cases now where we need the same thing usable outside VFIO
>  2) /dev/vfio/vfio does not support modern stuff like PASID and
> updating to support that is going to be a big change, like adding
> multiple IOASIDs so they can be modeled as a tree inside a
> single FD
>  3) I understand there is some desire to revise the uAPI here a bit,
> ie Alex mentioned the poor mapping performance.
> 
> I would say it is not conceptually different from what VFIO calls a
> container, it is just a different uAPI with the goal to be cross
> subsystem.

Ok, that makes sense.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-04-27 Thread Jason Gunthorpe
On Tue, Apr 27, 2021 at 04:55:45PM -0600, Logan Gunthorpe wrote:

> > Also, I see only 8 users of this function. How about just fix them all
> > to support negative returns and use this as the p2p API instead of
> > adding new API?
> 
> > Well, there might be 8 users of dma_map_sg_attrs(), but there are a very
> > large number of dma_map_sg() users. Seems odd to me to single out the
> > former as requiring these changes, but leave the latter.

At a high level I'm OK with it. dma_map_sg_attrs() is the extended
version of dma_map_sg(); it already has a different signature, so a
different return code is not out of the question.

dma_map_sg() is just the simple easy to use interface that can't do
advanced stuff.
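
(For reference, in current kernels the simple interface is essentially the
attrs variant with attrs fixed to 0, roughly:)

  /* the simple interface delegates to the extended one with no attrs */
  #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)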

> I'm not that opposed to this. But it will make this series a fair bit
> longer to change the 8 map_sg_attrs() usages.

Yes, but the result seems much nicer to not grow the DMA API further.

Jason


Re: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma

2021-04-27 Thread Logan Gunthorpe



On 2021-04-27 1:47 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:21AM -0600, Logan Gunthorpe wrote:
>> Ensure the dma operations support p2pdma before using the RDMA
>> device for P2PDMA. This allows switching the RDMA driver from
>> pci_p2pdma_map_sg() to dma_map_sg_p2pdma().
>>
>> Signed-off-by: Logan Gunthorpe 
>>  drivers/nvme/target/rdma.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>> index 6c1f3ab7649c..3ec7e77e5416 100644
>> +++ b/drivers/nvme/target/rdma.c
>> @@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
>> *ndev,
>>  if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
>>  goto out_free_rsp;
>>  
>> -if (!ib_uses_virt_dma(ndev->device))
>> +if (!ib_uses_virt_dma(ndev->device) &&
>> +dma_pci_p2pdma_supported(&ndev->device->dev))
> 
> ib_uses_virt_dma() should not be called by nvme and this is using the
> wrong device pointer to query for DMA related properties.
> 
> I suspect this wants a ib_dma_pci_p2p_dma_supported() wrapper like
> everything else.

Makes sense. Will add for v2.

Logan


Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-04-27 Thread Logan Gunthorpe



On 2021-04-27 1:43 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:18AM -0600, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
> 
> I expect P2P to work with systems that use ATS, so we'd want to see
> those systems have the IOMMU programmed with the bus address.

Oh, the paragraph you quote isn't quite as clear as it could be. The bus
address is only used in specific circumstances depending on how the
P2PDMA core code figures the addresses should be mapped (see the
documentation for upstream_bridge_distance()). The P2PDMA code
currently doesn't have any provisions for ATS (I haven't had access to
any such hardware) but I'm sure it wouldn't be too hard to add.

Logan


Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-04-27 Thread Logan Gunthorpe



On 2021-04-27 1:40 p.m., Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 04:33:51PM -0300, Jason Gunthorpe wrote:
>> On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
>>> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
>>> PCI P2PDMA pages directly without a hack in the callers. This allows
>>> for heterogeneous SGLs that contain both P2PDMA and regular pages.
>>>
>>> SGL segments that contain PCI bus addresses are marked with
>>> sg_mark_pci_p2pdma() and are ignored when unmapped.
>>>
>>> Signed-off-by: Logan Gunthorpe 
>>>  kernel/dma/direct.c | 25 ++---
>>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>>> index 002268262c9a..108dfb4ecbd5 100644
>>> +++ b/kernel/dma/direct.c
>>> @@ -13,6 +13,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include "direct.h"
>>>  
>>>  /*
>>> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct 
>>> scatterlist *sgl,
>>> struct scatterlist *sg;
>>> int i;
>>>  
>>> -   for_each_sg(sgl, sg, nents, i)
>>> +   for_each_sg(sgl, sg, nents, i) {
>>> +   if (sg_is_pci_p2pdma(sg)) {
>>> +   sg_unmark_pci_p2pdma(sg);
>>
>> This doesn't seem nice, the DMA layer should only alter the DMA
>> portion of the SG, not the other portions. Is it necessary?
> 
> Oh, I got it completely wrong what this is for.
> 
> This should be named sg_dma_mark_pci_p2p() and similar for other
> functions to make it clear it is part of the DMA side of the SG
> interface (eg it is like sg_dma_address, sg_dma_len, etc)

Fair point. Yes, I'll rename this for the next version.

Logan


Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-04-27 Thread Logan Gunthorpe



On 2021-04-27 1:31 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
>> +/*
>> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>> + * It should never return a value < 0.
>> + */
> 
> Also, it is weird that a function that can never return a value < 0 is
> returning an int type

Yes, Christoph mentioned in the last series that this should probably
change to an unsigned but I wasn't really sure if that change should be
a part of the P2PDMA series.

>> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>> +enum dma_data_direction dir, unsigned long attrs)
>> +{
>> +int ents;
>> +
>> +ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
>>  BUG_ON(ents < 0);
> 
> if (WARN_ON(ents < 0))
>  return 0;
> 
> instead of bug on?

It was BUG_ON in the original code. So I felt I should leave it.

> Also, I see only 8 users of this function. How about just fix them all
> to support negative returns and use this as the p2p API instead of
> adding new API?

Well there might be 8 users of dma_map_sg_attrs() but there are a very
large number of dma_map_sg(). Seems odd to me to single out the first as
requiring these changes, but leave the latter.

> Add the opposite logic flag, 'DMA_ATTRS_NO_ERROR' and pass it through
> the other api entry callers that can't handle it?

I'm not that opposed to this. But it will make this series a fair bit
longer to change the 8 map_sg_attrs() usages.

Logan


Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-04-27 Thread Logan Gunthorpe



On 2021-04-27 1:22 p.m., Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
>> dma_map_sg() either returns a positive number indicating the number
>> of entries mapped or zero indicating that resources were not available
>> to create the mapping. When zero is returned, it is always safe to retry
>> the mapping later once resources have been freed.
>>
>> Once P2PDMA pages are mixed into the SGL there may be pages that may
>> never be successfully mapped with a given device because that device may
>> not actually be able to access those pages. Thus, multiple error
>> conditions will need to be distinguished to determine whether a retry
>> is safe.
>>
>> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
>> convention from dma_map_sg(). The function will return a positive
>> integer on success or a negative errno on failure.
>>
>> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
>> indicate that a P2PDMA page is not mappable.
>>
>> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
>> level implementations that P2PDMA pages are allowed and to warn if a
>> caller introduces them into the regular dma_map_sg() interface.
> 
> So this new API is all about being able to return an error code
> because auditing the old API is basically terrifying?
> 
> OK, but why name everything new P2PDMA? It seems nicer to give this
> some generic name and have some general program to gradually deprecate
> normal non-error-capable dma_map_sg() ?
> 
> I think that will raise less questions when subsystem people see the
> changes, as I was wondering why RW was being moved to use what looked
> like a p2pdma only API.
> 
> dma_map_sg_or_err() would have been clearer
> 
> The flag is also clearer as to the purpose if it is named
> __DMA_ATTR_ERROR_ALLOWED

I'm not opposed to these names. I can use them for v2 if there are no
other opinions.

Logan


Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

2021-04-27 Thread Dan Williams
On Tue, Apr 27, 2021 at 1:22 PM John Hubbard  wrote:
>
> On 4/27/21 12:28 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
> >> Hi,
> >>
> >> This patchset continues my work to add P2PDMA support to the common
> >> dma map operations. This allows for creating SGLs that have both P2PDMA
> >> and regular pages which is a necessary step to allowing P2PDMA pages in
> >> userspace.
> >>
> >> The earlier RFC[1] generated a lot of great feedback and I heard no
> >> show-stopping objections. Thus, I've incorporated all the feedback and have
> >> decided to post this as a proper patch series with hopes of eventually
> >> getting it in mainline.
> >>
> >> I'm happy to do a few more passes if anyone has any further feedback
> >> or better ideas.
> >
> > For the user of the DMA API the idea seems reasonable enough, the next
> > steps to integrate with pin_user_pages() seem fairly straightforward
> > too
> >
> > Was there no feedback on this at all?
> >
>
> oops, I meant to review this a lot sooner, because this whole p2pdma thing is
> actually very interesting and important...somehow it slipped but I'll take
> a look now.

Still in my queue as well behind Joao's memmap consolidation series,
and a recent copy_mc_to_iter() fix series from Al.


Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

2021-04-27 Thread John Hubbard

On 4/27/21 12:28 PM, Jason Gunthorpe wrote:

On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:

Hi,

This patchset continues my work to add P2PDMA support to the common
dma map operations. This allows for creating SGLs that have both P2PDMA
and regular pages which is a necessary step to allowing P2PDMA pages in
userspace.

The earlier RFC[1] generated a lot of great feedback and I heard no
show-stopping objections. Thus, I've incorporated all the feedback and have
decided to post this as a proper patch series with hopes of eventually
getting it in mainline.

I'm happy to do a few more passes if anyone has any further feedback
or better ideas.


For the user of the DMA API the idea seems reasonable enough, the next
steps to integrate with pin_user_pages() seem fairly straightforward
too

Was there no feedback on this at all?



oops, I meant to review this a lot sooner, because this whole p2pdma thing is
actually very interesting and important...somehow it slipped but I'll take
a look now.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 14/16] nvme-rdma: Ensure dma support when using p2pdma

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:21AM -0600, Logan Gunthorpe wrote:
> Ensure the dma operations support p2pdma before using the RDMA
> device for P2PDMA. This allows switching the RDMA driver from
> pci_p2pdma_map_sg() to dma_map_sg_p2pdma().
> 
> Signed-off-by: Logan Gunthorpe 
>  drivers/nvme/target/rdma.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 6c1f3ab7649c..3ec7e77e5416 100644
> +++ b/drivers/nvme/target/rdma.c
> @@ -414,7 +414,8 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
> *ndev,
>   if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
>   goto out_free_rsp;
>  
> - if (!ib_uses_virt_dma(ndev->device))
> + if (!ib_uses_virt_dma(ndev->device) &&
> + dma_pci_p2pdma_supported(&ndev->device->dev))

ib_uses_virt_dma() should not be called by nvme and this is using the
wrong device pointer to query for DMA related properties.

I suspect this wants a ib_dma_pci_p2p_dma_supported() wrapper like
everything else.
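
(A sketch of what such a wrapper could look like, following the pattern of
the existing ib_dma_* helpers; the exact name and placement here are an
assumption:)

  static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
  {
          /* virt-DMA devices have no DMA layer p2p support to query */
          if (ib_uses_virt_dma(dev))
                  return false;
          /* query the device actually used for DMA, not the ib core dev */
          return dma_pci_p2pdma_supported(dev->dma_device);
  }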

Jason


Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:18AM -0600, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.

I expect P2P to work with systems that use ATS, so we'd want to see
those systems have the IOMMU programmed with the bus address.

Is it OK like this because the other logic prohibits all PCI cases
that would lean on the IOMMU, like ATS, hairpinning through the root
port, or transiting the root complex?

If yes, the code deserves a big comment explaining this is incomplete,
and I'd want to know we can finish this to include ATS at least based
on this series.

Jason


Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-04-27 Thread Jason Gunthorpe
On Tue, Apr 27, 2021 at 04:33:51PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
> > Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> > PCI P2PDMA pages directly without a hack in the callers. This allows
> > for heterogeneous SGLs that contain both P2PDMA and regular pages.
> > 
> > SGL segments that contain PCI bus addresses are marked with
> > sg_mark_pci_p2pdma() and are ignored when unmapped.
> > 
> > Signed-off-by: Logan Gunthorpe 
> >  kernel/dma/direct.c | 25 ++---
> >  1 file changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > index 002268262c9a..108dfb4ecbd5 100644
> > +++ b/kernel/dma/direct.c
> > @@ -13,6 +13,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "direct.h"
> >  
> >  /*
> > @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct 
> > scatterlist *sgl,
> > struct scatterlist *sg;
> > int i;
> >  
> > -   for_each_sg(sgl, sg, nents, i)
> > +   for_each_sg(sgl, sg, nents, i) {
> > +   if (sg_is_pci_p2pdma(sg)) {
> > +   sg_unmark_pci_p2pdma(sg);
> 
> This doesn't seem nice, the DMA layer should only alter the DMA
> portion of the SG, not the other portions. Is it necessary?

Oh, I got it completely wrong what this is for.

This should be named sg_dma_mark_pci_p2p() and similar for other
functions to make it clear it is part of the DMA side of the SG
interface (eg it is like sg_dma_address, sg_dma_len, etc)

Jason


Re: [PATCH 09/16] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:16AM -0600, Logan Gunthorpe wrote:
> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> PCI P2PDMA pages directly without a hack in the callers. This allows
> for heterogeneous SGLs that contain both P2PDMA and regular pages.
> 
> SGL segments that contain PCI bus addresses are marked with
> sg_mark_pci_p2pdma() and are ignored when unmapped.
> 
> Signed-off-by: Logan Gunthorpe 
>  kernel/dma/direct.c | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 002268262c9a..108dfb4ecbd5 100644
> +++ b/kernel/dma/direct.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "direct.h"
>  
>  /*
> @@ -387,19 +388,37 @@ void dma_direct_unmap_sg(struct device *dev, struct 
> scatterlist *sgl,
>   struct scatterlist *sg;
>   int i;
>  
> - for_each_sg(sgl, sg, nents, i)
> + for_each_sg(sgl, sg, nents, i) {
> + if (sg_is_pci_p2pdma(sg)) {
> + sg_unmark_pci_p2pdma(sg);

This doesn't seem nice, the DMA layer should only alter the DMA
portion of the SG, not the other portions. Is it necessary?

Jason


Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
> +/*
> + * dma_maps_sg_attrs returns 0 on error and > 0 on success.
> + * It should never return a value < 0.
> + */

Also, it is weird that a function that can never return a value < 0 is
returning an int type

> +int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
> + enum dma_data_direction dir, unsigned long attrs)
> +{
> + int ents;
> +
> + ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
>   BUG_ON(ents < 0);

if (WARN_ON(ents < 0))
 return 0;

instead of bug on?

Also, I see only 8 users of this function. How about just fix them all
to support negative returns and use this as the p2p API instead of
adding new API?

Add the opposite logic flag, 'DMA_ATTRS_NO_ERROR' and pass it through
the other api entry callers that can't handle it?

Jason


Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
> Hi,
> 
> This patchset continues my work to add P2PDMA support to the common
> dma map operations. This allows for creating SGLs that have both P2PDMA
> and regular pages which is a necessary step to allowing P2PDMA pages in
> userspace.
> 
> The earlier RFC[1] generated a lot of great feedback and I heard no
> show-stopping objections. Thus, I've incorporated all the feedback and have
> decided to post this as a proper patch series with hopes of eventually
> getting it in mainline.
>
> I'm happy to do a few more passes if anyone has any further feedback
> or better ideas.

For the user of the DMA API the idea seems reasonable enough, the next
steps to integrate with pin_user_pages() seem fairly straightforward
too

Was there no feedback on this at all?

Jason


Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-04-27 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:01:12AM -0600, Logan Gunthorpe wrote:
> dma_map_sg() either returns a positive number indicating the number
> of entries mapped or zero indicating that resources were not available
> to create the mapping. When zero is returned, it is always safe to retry
> the mapping later once resources have been freed.
> 
> Once P2PDMA pages are mixed into the SGL there may be pages that may
> never be successfully mapped with a given device because that device may
> not actually be able to access those pages. Thus, multiple error
> conditions will need to be distinguished to determine whether a retry
> is safe.
> 
> Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
> convention from dma_map_sg(). The function will return a positive
> integer on success or a negative errno on failure.
> 
> ENOMEM will be used to indicate a resource failure and EREMOTEIO to
> indicate that a P2PDMA page is not mappable.
> 
> The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
> level implementations that P2PDMA pages are allowed and to warn if a
> caller introduces them into the regular dma_map_sg() interface.

So this new API is all about being able to return an error code
because auditing the old API is basically terrifying?

OK, but why name everything new P2PDMA? It seems nicer to give this
some generic name and have some general program to gradually deprecate
normal non-error-capable dma_map_sg() ?

I think that will raise less questions when subsystem people see the
changes, as I was wondering why RW was being moved to use what looked
like a p2pdma only API.

dma_map_sg_or_err() would have been clearer

The flag is also clearer as to the purpose if it is named
__DMA_ATTR_ERROR_ALLOWED
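
(Whatever the final name, the caller side of the convention described in
the patch would look roughly like this sketch:)

  int rc = dma_map_sg_p2pdma(dev, sgl, nents, DMA_TO_DEVICE);
  if (rc < 0) {
          if (rc == -EREMOTEIO)
                  /* a P2PDMA page is not mappable for this device;
                   * retrying can never succeed */
                  return rc;
          /* e.g. -ENOMEM: resource failure, safe to retry later */
          return rc;
  }
  /* rc > 0: number of entries mapped */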

Jason


RE: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-04-27 Thread Krishna Reddy
> Is it always safe to enable the SMMU ASID in the middle of a DMA request
> made by a memory client?

From the MC point of view, it is safe to enable, and it has been this way
for many years in downstream code for the display engine.
It doesn't impact transactions that have already bypassed the SMMU before
the SMMU ASID is enabled.
Transactions that are yet to pass the SMMU stage would go through the SMMU
once the SMMU ASID is enabled and visible.

-KR



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread Jason Gunthorpe
On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:

> > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > APIs for allocating multiple tables and other things. I would be
> > > interested to hear someone from IBM talk about what it is doing and
> > > how it doesn't fit into today's IOMMU API.
> 
> Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
> backend represents a single DMA address space which all devices in the
> container will see at all times.  The difference is that there can be
> multiple (well, 2) "windows" of valid IOVAs within that address space.
> Each window can have a different TCE (page table) layout.  For kernel
> drivers, a smallish translated window at IOVA 0 is used for 32-bit
> devices, and a large direct mapped (no page table) window is created
> at a high IOVA for better performance with 64-bit DMA capable devices.
>
> With the VFIO backend we create (but don't populate) a similar
> smallish 32-bit window, userspace can create its own secondary window
> if it likes, though obviously for userspace use there will always be a
> page table.  Userspace can choose the total size (but not address),
> page size and to an extent the page table format of the created
> window.  Note that the TCE page table format is *not* the same as the
> POWER CPU core's page table format.  Userspace can also remove the
> default small window and create its own.

So what do you need from the generic API? I'd suggest if userspace
passes in the required IOVA range it would benefit all the IOMMU
drivers to setup properly sized page tables and PPC could use that to
drive a single window. I notice this is all DPDK did to support TCE.

> The second wrinkle is pre-registration.  That lets userspace register
> certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> allowed to be mapped into the IOMMU.  This is a performance
> optimization, because on pre-registration we also pre-account memory
> that will be effectively locked by DMA mappings, rather than doing it
> at DMA map and unmap time.

This feels like nesting IOASIDs to me, much like a vPASID.

The pre-registered VA range would be the root of the tree and the
vIOMMU created ones would be children of the tree. This could allow
the map operations of the child to refer to already prepped physical
memory held in the root IOASID avoiding the GUP/etc cost.
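
(A hypothetical flow for the nesting idea, in the same pseudo-code style
used earlier in the thread; every ioctl name below is invented for
illustration:)

  /* root IOASID holds the pre-registered VA range, pinned and
   * accounted once */
  root = ioctl(ioasid_fd, IOASID_ALLOC, ...)
  ioctl(ioasid_fd, IOASID_REGISTER_VA, { root, uva, len })

  /* child IOASID is the vIOMMU-created address space */
  child = ioctl(ioasid_fd, IOASID_ALLOC_NESTED, { root })

  /* child maps refer to memory already prepped in the root,
   * avoiding GUP/accounting on the hot path */
  ioctl(ioasid_fd, IOASID_MAP, { child, iova, uva, len })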

Seems fairly genericish, though I'm not sure about the kvm linkage..

> I like the idea of a common DMA/IOMMU handling system across
> platforms.  However in order to be efficiently usable for POWER it
> will need to include multiple windows, allowing the user to change
> those windows and something like pre-registration to amortize
> accounting costs for heavy vIOMMU load.

I have a feeling /dev/ioasid is going to end up with some HW specific
escape hatch to create some HW specific IOASID types and operate on
them in a HW specific way.

However, what I would like to see is that something simple like DPDK
can have a single implementation - POWER should implement the standard
operations and map them to something that will work for it.

As an ideal, only things like the HW specific qemu vIOMMU driver
should be reaching for all the special stuff.

In this way the kernel IOMMU driver and the qemu user vIOMMU driver
would form something of a classical split user/kernel driver pattern.

Jason



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread Jason Gunthorpe
On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > Starting from a BDF the general pseudo code is:
> >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> >  device_fd = open("/dev/vfio/"+device_name)
> >  ioasidfd = open("/dev/ioasid")
> >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> 
> This line is the problem.
> 
> [Historical aside: Alex's early drafts for the VFIO interface looked
> quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> was a bad idea, and groups were developed instead.  I still think it's
> a bad idea, and not just for POWER]

Spawning the VFIO device FD from the group FD is incredibly gross from
a kernel design perspective. Since that was done the struct
vfio_device missed out on a sysfs presence and doesn't have the
typical 'struct device' member or dedicated char device you'd expect a
FD based subsystem to have.

This basically traded normal usage of the driver core for something
that doesn't serve a technical usage. Given we are now nearly 10 years
later and see that real widely deployed applications are not doing
anything special with the group FD it makes me question the wisdom of
this choice.

> As Alex says, if this line fails because of the group restrictions,
> that's not great because it's not very obvious what's gone wrong.  

Okay, that is fair, but let's solve that problem directly. For
instance netlink has been going in the direction of adding a "extack"
from the kernel which is a descriptive error string. If the failing
ioctl returned the string:

  "cannot join this device to the IOASID because device XXX in the
   same group #10 is in use"

Would you agree it is now obvious what has gone wrong? In fact would
you agree this is a lot better user experience than what applications
do today even though they have the group FD?

> But IMO, the success path on a multi-device group is kind of worse:
> you've now made a meaningful and visible change to the setup of
> devices which are not mentioned in this line *at all*.  

I don't think spawning a single device_fd from the group clearly says
there are repercussions outside that immediate, single, device.

That comes from understanding what the ioctls are doing, and reading
the documentation. The same applies to some non-group FD world.

> Yes, it makes set up more of a pain, but it's necessary complexity to
> actually understand what's going on here.

There is a real technical problem here - the VFIO group is the thing
that spawns the device_fd and that is incompatible with the idea to
centralize the group security logic in drivers/iommu/ and share it
with multiple subsystems.

We also don't have an obvious clean way to incorporate a group FD into
other subsystems (nor would I want to).

One option is VFIO can keep its group FD but nothing else will have
anything like it. However I don't much like the idea that VFIO will
have a special and unique programming model to do the same things
other subsystems will do. That will make it harder for userspace to
implement.

But again, lets see what the draft ioasid proposal looks like and
maybe someone will see a different solution.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread Jason Gunthorpe
On Tue, Apr 27, 2021 at 03:11:25PM +1000, David Gibson wrote:

> > So your proposal sort of moves the entire container/group/domain
> > management into /dev/ioasid and then leaves vfio to only provide
> > device-specific uAPI. An ioasid represents a page table (address
> > space), and thus is equivalent to the scope of a VFIO container.
> 
> Right.  I don't really know how /dev/ioasid is supposed to work, and
> so far I don't see how it conceptually differs from a container.  What
> is it adding?

There are three motivating topics:
 1) /dev/vfio/vfio is only usable by VFIO and we have many interesting
use cases now where we need the same thing usable outside VFIO
 2) /dev/vfio/vfio does not support modern stuff like PASID and
updating to support that is going to be a big change, like adding
multiple IOASIDs so they can be modeled as a tree inside a
single FD
 3) I understand there is some desire to revise the uAPI here a bit,
ie Alex mentioned the poor mapping performance.

I would say it is not conceptually different from what VFIO calls a
container, it is just a different uAPI with the goal to be cross
subsystem.

Jason


Re: dma-api debugfs directory is not created since debugfs is not initialized

2021-04-27 Thread Corentin Labbe
On Tue, Apr 27, 2021 at 01:39:16PM +0200, Greg KH wrote:
> On Tue, Apr 27, 2021 at 01:34:27PM +0200, Corentin Labbe wrote:
> > Hello
> > 
> > I'm trying to debug some DMA problem on next-20210427, and so I have
> > enabled CONFIG_DMA_API_DEBUG=y.
> > But the dma-api directory does not show up in debugfs, although lots of
> > other directories exist in it.
> 
> Does it show up properly in 5.12?
> 

No (Tested on a qemu x86_64)


Re: dma-api debugfs directory is not created since debugfs is not initialized

2021-04-27 Thread Greg KH
On Tue, Apr 27, 2021 at 01:32:50PM +0100, Robin Murphy wrote:
> On 2021-04-27 12:39, Greg KH wrote:
> > On Tue, Apr 27, 2021 at 01:34:27PM +0200, Corentin Labbe wrote:
> > > Hello
> > > 
> > > I'm trying to debug some DMA problem on next-20210427, and so I have
> > > enabled CONFIG_DMA_API_DEBUG=y.
> > > But the dma-api directory does not show up in debugfs, although lots of
> > > other directories exist in it.
> > 
> > Does it show up properly in 5.12?
> > 
> > > After debugging, it seems due to commit 56348560d495 ("debugfs: do not
> > > attempt to create a new file before the filesystem is initalized").
> > > Reverting the commit allows the "dma-api" debugfs directory to be found
> > > (but that seems not to be the right way to fix it).
> > 
> > We have had some odd start-up ordering issues that the above commit has
> > caused to show.  Given that this commit is now in stable kernels, with
> > no report of this issue so far, I'm worried that maybe this is a dma
> > subsystem ordering issue?
> 
> Both debugfs_init() and dma_debug_init() do run at core_initcall level, and
> disassembling the vmlinux from my current working tree does indeed suggest
> that they somehow end up in the wrong relative order:
> 
> [...]
> 80001160d0c8 <__initcall__kmod_debug__325_918_dma_debug_init1>:
> 80001160d0c8:   feb0d528.word   0xfeb0d528
> 
> [...]
> 
> 80001160d108 <__initcall__kmod_debugfs__357_848_debugfs_init1>:
> 80001160d108:   fff4326c.word   0xfff4326c
> [...]
> 
> 
> I always had the impression that initcall ordering tended to work out
> roughly alphabetical, such that entries from fs/* might come before
> kernel/*, but I guess it's at the whims of the linker in the end :/

Initcall ordering happens from link ordering.

> Perhaps the easiest thing to do is split out dma_debug_fs_init() and run
> that at a later level? We do want the dma-debug infrastructure itself to be
> up as early as possible, but I think the debugfs view of its internals can
> happily wait until closer to the time that there's actually a userspace to
> be able to look at it.

That seems like a better idea here, there's no need for "special
treatment" of debugfs.

thanks,

greg k-h


Re: dma-api debugfs directory is not created since debugfs is not initialized

2021-04-27 Thread Robin Murphy

On 2021-04-27 12:39, Greg KH wrote:

On Tue, Apr 27, 2021 at 01:34:27PM +0200, Corentin Labbe wrote:

Hello

I'm trying to debug some DMA problem on next-20210427, and so I have enabled
CONFIG_DMA_API_DEBUG=y.
But the dma-api directory does not show up in debugfs, although lots of other
directories exist in it.


Does it show up properly in 5.12?


After debugging, it seems due to commit 56348560d495 ("debugfs: do not attempt
to create a new file before the filesystem is initalized").
Reverting the commit allows the "dma-api" debugfs directory to be found (but
that seems not to be the right way to fix it).


We have had some odd start-up ordering issues that the above commit has
caused to show.  Given that this commit is now in stable kernels, with
no report of this issue so far, I'm worried that maybe this is a dma
subsystem ordering issue?


Both debugfs_init() and dma_debug_init() do run at core_initcall level, 
and disassembling the vmlinux from my current working tree does indeed 
suggest that they somehow end up in the wrong relative order:


[...]
80001160d0c8 <__initcall__kmod_debug__325_918_dma_debug_init1>:
80001160d0c8:   feb0d528.word   0xfeb0d528

[...]

80001160d108 <__initcall__kmod_debugfs__357_848_debugfs_init1>:
80001160d108:   fff4326c.word   0xfff4326c
[...]


I always had the impression that initcall ordering tended to work out 
roughly alphabetical, such that entries from fs/* might come before 
kernel/*, but I guess it's at the whims of the linker in the end :/


Perhaps the easiest thing to do is split out dma_debug_fs_init() and run 
that at a later level? We do want the dma-debug infrastructure itself to 
be up as early as possible, but I think the debugfs view of its 
internals can happily wait until closer to the time that there's 
actually a userspace to be able to look at it.


Robin.


Re: dma-api debugfs directory is not created since debugfs is not initialized

2021-04-27 Thread Greg KH
On Tue, Apr 27, 2021 at 01:34:27PM +0200, Corentin Labbe wrote:
> Hello
> 
> I'm trying to debug some DMA problem on next-20210427, and so I have
> enabled CONFIG_DMA_API_DEBUG=y.
> But the dma-api directory does not show up in debugfs, although lots of
> other directories exist in it.

Does it show up properly in 5.12?

> After debugging, it seems due to commit 56348560d495 ("debugfs: do not
> attempt to create a new file before the filesystem is initalized").
> Reverting the commit allows the "dma-api" debugfs directory to be found
> (but that seems not to be the right way to fix it).

We have had some odd start-up ordering issues that the above commit has
caused to show.  Given that this commit is now in stable kernels, with
no report of this issue so far, I'm worried that maybe this is a dma
subsystem ordering issue?

thanks,

greg k-h


dma-api debugfs directory is not created since debugfs is not initialized

2021-04-27 Thread Corentin Labbe
Hello

I'm trying to debug some DMA problem on next-20210427, and so I have enabled
CONFIG_DMA_API_DEBUG=y.
But the dma-api directory does not show up in debugfs, although lots of other
directories exist in it.

After debugging, it seems due to commit 56348560d495 ("debugfs: do not attempt
to create a new file before the filesystem is initalized").
Reverting the commit allows the "dma-api" debugfs directory to be found (but
that seems not to be the right way to fix it).

Regards