Re: [PATCH 0/4] Free cached iovas when rmmod the driver of the last device in group

2021-06-07 Thread Robin Murphy

On 2021-06-07 03:42, chenxiang wrote:

From: Xiang Chen 

When the driver of the last device in a group is removed with rmmod, the
cached iovas are no longer used, and it is better to free them to save
memory. Also export the functions free_rcache_cached_iovas() and
iommu_domain_to_iova().


How common is it to use a device a significant amount, then unbind its 
driver and not use it for long enough to care about? Off-hand I can't 
think of a particularly realistic use-case which looks like that - the 
closest I can imagine is unbinding a default kernel driver to replace it 
with VFIO, but I would expect the set of devices intended for assignment 
to be distinct from the set of devices used by the host itself, and thus 
the host driver wouldn't have actually done much to generate a lot of 
DMA mappings in that initial period. Is my expectation there wrong?


If there is such a case, how much memory does this actually save in 
practice? The theoretical worst-case should be roughly 4 * 128 * 6 * 
sizeof(struct iova) bytes per CPU, which is around 192KB assuming 
64-byte kmem cache alignment. However it seems rather unlikely in 
practice to have every possible cache entry of every size used, so if 
saving smaller amounts of memory is a concern wouldn't you also want to 
explicitly drain the flush queues (16KB per CPU) and maybe look at 
trying to reclaim the unused pagetable pages from the domain itself - 
that ~192KB worth of cached IOVAs represents ~32K pages worth of IOVA 
space, which for an implementation like io-pgtable-arm with the 4KB 
granule means ~256KB worth of non-leaf pagetables left behind.
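
For reference, the arithmetic behind those numbers, assuming IOVA_MAG_SIZE
= 128, six rcache sizes and a 64-byte struct iova:

   entries per CPU:     4 * 128 * 6 = 3072
   memory per CPU:      3072 * 64B = 196608B ~= 192KB
   IOVA space covered:  (4 * 128) * (1+2+4+8+16+32) pages ~= 32K pages
   non-leaf pagetables: ~32K pages / 512 PTEs per 4KB table ~= 64 tables = 256KB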


I'm not against the idea of having a mechanism to "compact" an idle DMA 
domain if there are convincing numbers to back it up, but the actual 
implementation would need to be better than this as well - having the 
IOMMU core code poking directly into the internals of the iommu-dma 
abstraction is not the way to go, and exporting anything to modules, 
which the IOMMU core is not, smells suspicious.


Robin.


Xiang Chen (4):
   iommu/iova: add a function to free all rcached iovas and export it
   iommu/iova: use function free_rcache_cached_iovas() to free all rcached iovas
   dma-iommu: add an interface to get iova_domain from iommu domain
   iommu: free cached iovas when rmmod the driver of the last device in the group

  drivers/iommu/dma-iommu.c |  7 +++
  drivers/iommu/iommu.c |  7 +++
  drivers/iommu/iova.c  | 17 -
  include/linux/dma-iommu.h |  6 ++
  include/linux/iova.h  |  5 +
  5 files changed, 37 insertions(+), 5 deletions(-)




[PATCH] iommu/dma: Fix compile warning in 32-bit builds

2021-06-07 Thread Joerg Roedel
From: Joerg Roedel 

Compiling the recent dma-iommu changes under 32-bit x86 triggers this
compile warning:

drivers/iommu/dma-iommu.c:249:5: warning: format ‘%llx’ expects argument of 
type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ {aka 
‘unsigned int’} [-Wformat=]

The reason is that %llx is used to print a variable of type
phys_addr_t. Fix it by using the correct %pa format specifier for
phys_addr_t.

Cc: Srinath Mannam 
Cc: Robin Murphy 
Cc: Oza Pawandeep 
Fixes: aadad097cd46 ("iommu/dma: Reserve IOVA for PCIe inaccessible DMA 
address")
Signed-off-by: Joerg Roedel 
---
 drivers/iommu/dma-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 95e7349ac3f1..5d96fcc45fec 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -246,8 +246,8 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
} else if (end < start) {
/* dma_ranges list should be sorted */
dev_err(&dev->dev,
-   "Failed to reserve IOVA [%#010llx-%#010llx]\n",
-   start, end);
+   "Failed to reserve IOVA [%pa-%pa]\n",
+   &start, &end);
return -EINVAL;
}
 
-- 
2.31.1


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Enrico Weigelt, metux IT consult

On 02.06.21 19:21, Jason Gunthorpe wrote:

Hi,


Not really, once one thing in an application uses a large number of FDs the
entire application is affected. If any open() can return 'very big
number' then nothing in the process is allowed to ever use select.


isn't that a bug in select()?

--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Liu, Yi L
> From: Shenming Lu 
> Sent: Friday, June 4, 2021 10:03 AM
> 
> On 2021/6/4 2:19, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
> 
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
>  The drivers register per page table fault handlers to /dev/ioasid which
>  will then register itself to iommu core to listen and route the per-
>  device I/O page faults.
> >>>
> >>> I'm still confused why drivers need fault handlers at all?
> >>
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where the fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> > anyway.
> 
> Yeah, I also proposed this before, registering the handler in the
> BIND_PGTABLE ioctl does make sense for the guest page faults. :-)
> 
> But how about the page faults from the GPA address space (its page table is
> mapped through the MAP_DMA ioctl)? From your point of view, it seems
> that we should register the handler for the GPA address space in the (first)
> MAP_DMA ioctl.

Under the new proposal, I think the page fault handler is also registered
per ioasid object. The difference compared with the guest page table case
is that there is no need to inject the fault into the VM.
 
Regards,
Yi Liu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe  wrote:

> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:  
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > > I don't want a security proof myself; I want to trust VFIO to make
> > > > > > the right judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > > 
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a
> > > > > > slightly more roundabout way to do stuff that is accessible to bare
> > > > > > metal; not a way to gain extra privilege.
> > > > > 
> > > > > Okay, fine, lets turn the question on its head then.
> > > > > 
> > > > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > > 
> > > > > So, under what conditions do we want to allow VFIO to give a process
> > > > > elevated access to the CPU:
> > > > 
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal
> > > > (was it yours or Kevin's?) was to keep #2 and add #1 with an
> > > > enable/disable ioctl, which then would be on VFIO and not on KVM.
> > > 
> > > At the end of the day we need an ioctl with two arguments:
> > >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > >  - The KVM FD to control wbinvd support on
> > > 
> > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > lives, but we have these obnoxious cross module dependencies to
> > > consider.. 
> > > 
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > > 
> > > Alex, how about a more fleshed out suggestion:
> > > 
> > >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > > it communicates its no-snoop configuration:  
> > 
> > Communicates to whom?  
> 
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
> 
> > >  - 0 enable, allow WBINVD
> > >  - 1 automatic disable, block WBINVD if the platform
> > >IOMMU can police it (what we do today)
> > >  - 2 force disable, do not allow WBINVD ever
> > 
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").
> 
> Here I am outlining the choices and also imagining we might want an
> admin knob to select among the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy?  ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives
there.  I essentially described the "enable" behavior in my previous
reply, user has access to wbinvd if owning a non-coherent capable
device managed in a non-coherent IOASID.  Yes, the user IOASID
configuration controls the latter half of this.

What then is "automatic" mode?  The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking?  Do they get a failure?  Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.

> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").  
> 
> There are two parts here, like Paolo was leading to. If the process
> has access to WBINVD and then if such an allowed process tells KVM to
> turn on WBINVD in the guest.
> 
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
> 
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has 

Re: [PATCH] iommu/amd: Tidy up DMA ops init

2021-06-07 Thread Joerg Roedel
On Fri, Jun 04, 2021 at 06:35:17PM +0100, Robin Murphy wrote:
> For the sake of justifying this as "fix" rather than "cleanup", you may as
> well use the flush queue commit cited in the patch log - I maintain there's
> nothing technically wrong with that commit itself, but it is the point at
> which the underlying issue becomes worth fixing due to how they interact ;)

Okay :) I added:

Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE")

Regards,

Joerg


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:

> Note that the fact that no drivers call these things doesn't mean it was
> not supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop.

no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NICs expect
packets to be processed by the CPU.

"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here.
 
Jason


Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices

2021-06-07 Thread Xingang Wang

On 2021/6/5 3:04, Bjorn Helgaas wrote:

[+cc John, who tested 6bf6c24720d3]

On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote:

From: Xingang Wang 

When booting with devicetree, pci_request_acs() is called after the
enumeration and initialization of PCI devices, thus ACS is not
enabled. ACS should be enabled when an IOMMU is detected for the
PCI host bridge, so add a check for the IOMMU before probing the PCI host
and call pci_request_acs() to make sure ACS will be enabled when
enumerating PCI devices.


I'm happy to apply this, but I'm a little puzzled about 6bf6c24720d3
("iommu/of: Request ACS from the PCI core when configuring IOMMU
linkage").  It was tested and fixed a problem, but I don't understand
how.

6bf6c24720d3 added the call to pci_request_acs() in
of_iommu_configure() so it currently looks like this:

   of_iommu_configure(dev, ...)
   {
 if (dev_is_pci(dev))
   pci_request_acs();

pci_request_acs() sets pci_acs_enable, which tells us to enable ACS
when enumerating PCI devices in the future.  But we only call
pci_request_acs() if we already *have* a PCI device.

So maybe 6bf6c24720d3 fixed a problem for *some* PCI devices, but not
all?  E.g., did we call of_iommu_configure() for one PCI device before
enumerating the rest?


I tested the kernel on an arm platform with qemu:

qemu-system-aarch64 \
 -cpu host \
 -kernel arch/arm64/boot/Image \
 -enable-kvm \
 -m 8G \
 -smp 2,sockets=2,cores=1,threads=1 \
 -machine virt,kernel_irqchip=on,gic-version=3,iommu=smmuv3\
 -initrd rootfs.cpio.gz \
 -nographic \
 -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900 nokaslr" \
 -device pcie-root-port,port=0x1,chassis=1,id=pci.1,addr=0x8 \
 -netdev user,id=hostnet0 \
 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=08:13:3a:5a:22:5b,bus=pci.1,addr=0x0 \


And I find that of_iommu_configure() is called after the enumeration of
the pcie-root-port. This might only affect the first device; when
enumerating the rest of the devices, pci_acs_enable has already been set.

But to make sure that pci_acs_enable is always set before any PCI
devices are enumerated, it would be better to set it during the
initialization of the PCI bridges.

Thanks

Xingang


Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when configuring IOMMU linkage")
Signed-off-by: Xingang Wang 
---
  drivers/iommu/of_iommu.c | 1 -
  drivers/pci/of.c | 8 +++-
  2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index a9d2df001149..54a14da242cc 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -205,7 +205,6 @@ const struct iommu_ops *of_iommu_configure(struct device *dev,
.np = master_np,
};
  
-		pci_request_acs();

err = pci_for_each_dma_alias(to_pci_dev(dev),
 of_pci_iommu_init, &info);
} else {
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index da5b414d585a..2313c3f848b0 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -581,9 +581,15 @@ static int pci_parse_request_of_pci_ranges(struct device *dev,
  
  int devm_of_pci_bridge_init(struct device *dev, struct pci_host_bridge *bridge)
  {
-   if (!dev->of_node)
+   struct device_node *node = dev->of_node;
+
+   if (!node)
return 0;
  
+	/* Detect IOMMU and make sure ACS will be enabled */
+   if (of_property_read_bool(node, "iommu-map"))
+   pci_request_acs();
+
bridge->swizzle_irq = pci_common_swizzle;
bridge->map_irq = of_irq_parse_and_map_pci;
  
--
2.19.1




Re: [RFC PATCH V3 11/11] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-06-07 Thread Tianyu Lan




On 6/7/2021 2:46 PM, Christoph Hellwig wrote:

On Sun, May 30, 2021 at 11:06:28AM -0400, Tianyu Lan wrote:

+   for (i = 0; i < request->hvpg_count; i++)
+   dma_unmap_page(&device->device,
+  request->dma_range[i].dma,
+  request->dma_range[i].mapping_size,
+  request->vstor_packet.vm_srb.data_in
+  == READ_TYPE ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   kfree(request->dma_range);


Unreadably long lines.  You probably want to factor the quoted code into
a little readable helper and do the same for the map side.


Good suggestion. Will update.

Thanks.
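
For illustration, a minimal sketch of such a helper built from the fields in
the quoted patch (the helper name is made up, not from the actual series):

static void storvsc_dma_unmap(struct device *dev,
			      struct storvsc_cmd_request *request)
{
	int i;

	/* Undo the per-Hyper-V-page DMA mappings set up on the map side. */
	for (i = 0; i < request->hvpg_count; i++)
		dma_unmap_page(dev, request->dma_range[i].dma,
			       request->dma_range[i].mapping_size,
			       request->vstor_packet.vm_srb.data_in == READ_TYPE ?
			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
	kfree(request->dma_range);
	request->dma_range = NULL;
}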


Re: [RFC PATCH V3 08/11] swiotlb: Add bounce buffer remap address setting function

2021-06-07 Thread Tianyu Lan



On 6/7/2021 2:43 PM, Christoph Hellwig wrote:

On Sun, May 30, 2021 at 11:06:25AM -0400, Tianyu Lan wrote:

From: Tianyu Lan 

For a Hyper-V isolation VM with AMD SEV-SNP, the bounce buffer (shared memory)
needs to be accessed via an extra address space (e.g. addresses above bit 39).
Hyper-V code may remap the extra address space outside of swiotlb.
swiotlb_bounce() needs to use the remapped virtual address to copy data
from/to the bounce buffer. Add a new interface swiotlb_set_bounce_remap() to
do that.


Why can't you use the bus_dma_region ranges to remap to your preferred
address?



Thanks for your suggestion.

These addresses in the extra address space work as a system memory mirror.
The shared memory with the host in an Isolation VM needs to be accessed via
the extra address space, which is above the shared gpa boundary. When
initializing the swiotlb bounce buffer pool, only addresses below the shared
gpa boundary can be accepted by the swiotlb API, because they are treated as
system memory and managed by memory management. This is why the Hyper-V
swiotlb bounce buffer pool needs to be allocated in Hyper-V code and mapped
at the associated physical addresses in the extra address space. The patch's
target is to add a new interface to set the start virtual address of the
bounce buffer pool and let the swiotlb bounce buffer copy function use the
right virtual address for the extra address space.

bus_dma_region translates a cpu physical address to a dma address.
It can't modify the virtual address of the bounce buffer pool or let the
swiotlb code copy data with the right address. If something is missed,
please correct me.
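
Sketching the shape such an interface could take (a module-level remap base
consulted by swiotlb_bounce(); illustrative, not the exact patch):

static unsigned char *swiotlb_bounce_remap_addr;

/*
 * Record the virtual address at which Hyper-V remapped the bounce
 * buffer pool in the extra (above shared gpa boundary) address space.
 */
void swiotlb_set_bounce_remap(unsigned char *vaddr)
{
	swiotlb_bounce_remap_addr = vaddr;
}

/*
 * swiotlb_bounce() would then copy through
 *   swiotlb_bounce_remap_addr + (tlb_addr - <pool start phys>)
 * instead of phys_to_virt(tlb_addr) whenever the remap address is set.
 */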

Thanks.


Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices

2021-06-07 Thread Xingang Wang

On 2021/6/4 23:36, Joerg Roedel wrote:

On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote:

From: Xingang Wang 

When booting with devicetree, pci_request_acs() is called after the
enumeration and initialization of PCI devices, thus ACS is not
enabled. ACS should be enabled when an IOMMU is detected for the
PCI host bridge, so add a check for the IOMMU before probing the PCI host
and call pci_request_acs() to make sure ACS will be enabled when
enumerating PCI devices.

Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when configuring IOMMU linkage")
Signed-off-by: Xingang Wang 
---
  drivers/iommu/of_iommu.c | 1 -
  drivers/pci/of.c | 8 +++-
  2 files changed, 7 insertions(+), 2 deletions(-)


Should probably go through the PCI tree, so

Acked-by: Joerg Roedel 

Thanks


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 11:10:53PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Friday, June 4, 2021 8:09 PM
> > 
> > On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > > >
> > > > > > 3) Device accepts any PASIDs from the guest. No
> > > > > >vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > > 4) Device accepts any PASID from the guest and has an
> > > > > >internal vPASID/pPASID translation (enhanced vfio_pci)
> > > > >
> > > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > > which doesn't support ENQCMD
> > > >
> > > > This is a vfio_pci that mediates some element of the device interface
> > > > to communicate the vPASID/pPASID table to the device, using Max's
> > > > series for vfio_pci drivers to inject itself into VFIO.
> > > >
> > > > For instance a device might send a message through the PF that the VF
> > > > has a certain vPASID/pPASID translation table. This would be useful
> > > > for devices that cannot use ENQCMD but still want to support migration
> > > > and thus need vPASID.
> > >
> > > I still don't quite get it. If it's a PCI device why is PASID translation
> > > required?
> > > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > > vPASID space is just straightforward.
> > 
> > This is only possible if we get rid of the global pPASID allocation
> > (honestly it's my preference as it makes the HW a lot simpler)
> > 
> 
> In this proposal global vs. per-RID allocation is a per-device policy.
> for vfio-pci it can always use per-RID (regardless of whether the
> device is partially mediated or not) and no vPASID/pPASID conversion. 
> Even for mdev if no ENQCMD we can still do per-RID conversion.
> only for mdev which has ENQCMD we need global pPASID allocation.
> 
> I think this is the motivation you explained earlier that it's not good
> to have one global PASID allocator in the kernel. per-RID vs. global
> should be selected per device.

I thought we concluded this wasn't possible because the guest could
choose to bind the same vPASID to a RID and to an ENQCMD device and
then we run into trouble? Or are you saying that a RID device gets a
complete dedicated table and can always have vPASID == pPASID?

In any event it needs a clear explanation in the next RFC

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 15:18:58 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> > You're calling this an admin knob, which to me suggests a global module
> > option, so are you trying to implement both an administrator and a user
> > policy?  ie. the user can create scenarios where access to wbinvd might
> > be justified by hardware/IOMMU configuration, but can be limited by the
> > admin?  
> 
> Could be a per-device sysfs too. I'm not really sure what is useful
> here.
> 
> > For example I proposed that the ioasidfd would bear the responsibility
> > of a wbinvd ioctl and therefore validate the user's access to enable
> > wbinvd emulation w/ KVM, so I'm assuming this module option lives
> > there.
> 
> Right, this is what I was thinking
> 
> > What then is "automatic" mode?  The user cannot create a non-coherent
> > IOASID with a non-coherent device if the IOMMU supports no-snoop
> > blocking?  Do they get a failure?  Does it get silently promoted to
> > coherent?  
> 
> "automatic" was just a way to keep the API the same as today. Today if
> the IOMMU can block no-snoop then vfio disables wbinvd. To get the
> same level of security automatic mode would detect that vfio would
> have blocked wbinvd because the IOMMU can do it, and then always block
> it.
> 
> It makes sense if there is an admin knob, as the admin could then move
> > to an explicit enable/disable to get functionality they can't get
> today.
> 
> > In "disable" mode, I think we're just narrowing the restriction
> > further, a non-coherent capable device cannot be used except in a
> > forced coherent IOASID.  
> 
> I wouldn't say "cannot be used" - just you can't get access to
> wbinvd. 
> 
> It is up to qemu if it wants to proceed or not. There is no issue with
> allowing the use of no-snoop and blocking wbinvd, other than some
> drivers may malfunction. If the user is certain they don't have
> malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device.  There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default.  If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

> The current vfio arrangement (automatic) maximized compatibility. The
> enable/disable options provide for max performance and max security as
> alternative targets.
> 
> > > It is the strength of Paolo's model that KVM should not be able to do
> > > optionally less, not more than the process itself can do.  
> > 
> > I think my previous reply was working towards those guidelines.  I feel
> > like we're mostly in agreement, but perhaps reading past each other.  
> 
> Yes, I think I said we were agreeing :)
> 
> > Nothing here convinced me against my previous proposal that the
> > ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> > and therefore the equivalent KVM access.  Whether wbinvd is allowed or
> > no-op'd when the user has access to a non-coherent device in a
> > configuration where the IOMMU prevents non-coherent DMA is maybe still
> > a matter of personal preference.  
> 
> I think it makes the software design much simpler if the security
> check is very simple. Possessing a suitable device in an ioasid fd
> container is enough to flip on the feature and we don't need to track
> changes from that point on. We don't need to revoke wbinvd if the
> ioasid fd changes, for instance. Better to keep the kernel very simple
> in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device?  I suspect that like we do today,
we'll want to re-evaluate the need for wbinvd on most device changes.
I think this is why the kvm-vfio device holds a vfio group reference;
to make sure a given group can't elevate privileges for multiple
processes.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> You're calling this an admin knob, which to me suggests a global module
> option, so are you trying to implement both an administrator and a user
> policy?  ie. the user can create scenarios where access to wbinvd might
> be justified by hardware/IOMMU configuration, but can be limited by the
> admin?

Could be a per-device sysfs too. I'm not really sure what is useful
here.

> For example I proposed that the ioasidfd would bear the responsibility
> of a wbinvd ioctl and therefore validate the user's access to enable
> wbinvd emulation w/ KVM, so I'm assuming this module option lives
> there.  

Right, this is what I was thinking

> What then is "automatic" mode?  The user cannot create a non-coherent
> IOASID with a non-coherent device if the IOMMU supports no-snoop
> blocking?  Do they get a failure?  Does it get silently promoted to
> coherent?

"automatic" was just a way to keep the API the same as today. Today if
the IOMMU can block no-snoop then vfio disables wbinvd. To get the
same level of security automatic mode would detect that vfio would
have blocked wbinvd because the IOMMU can do it, and then always block
it.

It makes sense if there is an admin knob, as the admin could then move
to an explicit enable/disable to get functionality they can't get
today.

> In "disable" mode, I think we're just narrowing the restriction
> further, a non-coherent capable device cannot be used except in a
> forced coherent IOASID.

I wouldn't say "cannot be used" - just you can't get access to
wbinvd. 

It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

The current vfio arrangement (automatic) maximized compatibility. The
enable/disable options provide for max performance and max security as
alternative targets.

> > It is the strength of Paolo's model that KVM should not be able to do
> > optionally less, not more than the process itself can do.
> 
> I think my previous reply was working towards those guidelines.  I feel
> like we're mostly in agreement, but perhaps reading past each other.

Yes, I think I said we were agreeing :)

> Nothing here convinced me against my previous proposal that the
> ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> and therefore the equivalent KVM access.  Whether wbinvd is allowed or
> no-op'd when the user has access to a non-coherent device in a
> configuration where the IOMMU prevents non-coherent DMA is maybe still
> a matter of personal preference.

I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

Seems agreeable enough that there is something here to explore in
patches when the time comes

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 03:30:21PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> On 02.06.21 19:21, Jason Gunthorpe wrote:
> 
> Hi,
> 
> > Not really, once one thing in an application uses a large number of FDs
> > the entire application is affected. If any open() can return 'very big
> > number' then nothing in the process is allowed to ever use select.
> 
> isn't that a bug in select()?

It is what it is, select has a fixed-size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - it can't be
fixed.
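
A self-contained illustration (FD_SETSIZE is part of the glibc ABI; calling
FD_SET() on an fd at or above it writes out of bounds):

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
	/* The fd_set bitmap has room for exactly FD_SETSIZE (1024 on
	 * glibc) descriptors. A process holding any fd >= FD_SETSIZE
	 * can no longer use select() safely on that fd; poll()/epoll
	 * have no such limit. */
	printf("FD_SETSIZE = %d\n", FD_SETSIZE);
	return 0;
}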

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 08:51:42AM +0200, Paolo Bonzini wrote:
> On 07/06/21 05:25, Tian, Kevin wrote:
> > Per Intel SDM wbinvd is a privileged instruction. A process on the
> > host has no privilege to execute it.
> 
> (Half of) the point of the kernel is to do privileged tasks on the
> processes' behalf.  There are good reasons why a process that uses VFIO
> (without KVM) could want to use wbinvd, so VFIO lets them do it with a ioctl
> and adequate checks around the operation.

Yes, exactly.

You cannot write a correct VFIO application for hardware that uses the
no-snoop bit without access to wbinvd.

KVM or not does not matter.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:

> > It is up to qemu if it wants to proceed or not. There is no issue with
> > allowing the use of no-snoop and blocking wbinvd, other than some
> > drivers may malfunction. If the user is certain they don't have
> > malfunctioning drivers then no issue to go ahead.
> 
> A driver that knows how to use the device in a coherent way can
> certainly proceed, but I suspect that's not something we can ask of
> QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
> to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> the device.  There might be an experimental ("x-" prefixed) QEMU device
> option to allow user override, but QEMU should disallow the possibility
> of malfunctioning drivers by default.  If we have devices that probe as
> supporting no-snoop, but actually can't generate such traffic, we might
> need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM, for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

> > I think it makes the software design much simpler if the security
> > check is very simple. Possessing a suitable device in an ioasid fd
> > container is enough to flip on the feature and we don't need to track
> > changes from that point on. We don't need to revoke wbinvd if the
> > ioasid fd changes, for instance. Better to keep the kernel very simple
> > in this regard.
> 
> You're suggesting that a user isn't forced to give up wbinvd emulation
> if they lose access to their device?  

Sure, why do we need to be stricter? It is the same logic I gave
earlier, once an attacker process has access to wbinvd an attacker can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

For KVM qemu can turn on/off on hot plug events as it requires to give
VM security. It doesn't need to rely on the kernel to control this.

But I think it is all fine tuning, the basic idea seems like it could
work, so we are not blocked here on kvm interactions.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Sat, Jun 05, 2021 at 08:22:27AM +0200, Paolo Bonzini wrote:
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> >   4) The KVM interface is the very simple enable/disable WBINVD.
> >  Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> >  to enable WBINVD at KVM.
> 
> The KVM interface is the same kvm-vfio device that exists already.  The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD
> functionality (see kvm_vfio_update_coherency).

The problem is we are talking about adding a new /dev/ioasid FD and it
won't fit into the existing KVM VFIO FD interface. There are lots of
options here, one is to add new ioctls that specifically use the new
FD, the other is to somehow use VFIO as a proxy to carry things to the
/dev/ioasid FD code.

> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> seems useless complication compared to just using what we have now, at least
> while VMs only use IOASIDs via VFIO.

The simplest is KVM_ENABLE_WBINVD() and be done
with it.

I don't need to keep track things in KVM, just flip one flag on/off
under user control.

> Either way, there should be no policy attached to the add/delete operations.
> KVM users want to add the VFIO (or IOASID) file descriptors to the device
> independent of WBINVD.  If userspace wants/needs to apply its own policy on
> whether to enable WBINVD or not, they can do it on the VFIO/IOASID side:

Why does KVM need to know about IOASIDs? I don't think it can do
anything with this general information.

> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> >  - 0 enable, allow WBINVD
> >  - 1 automatic disable, block WBINVD if the platform
> >IOMMU can police it (what we do today)
> >  - 2 force disable, do not allow WBINVD ever
> 
> Though, like Alex, it's also not clear to me whether force-disable is
> useful.  Instead userspace can query the IOMMU or the device to ensure it's
> not enabled.

"force disable" would be a way for the device to signal to whatever
query you imagine that it is not enabled. Maybe I should have called
it "no-snoop is never used"

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:
> 
> > > It is up to qemu if it wants to proceed or not. There is no issue with
> > > allowing the use of no-snoop and blocking wbinvd, other than some
> > > drivers may malfunction. If the user is certain they don't have
> > > malfunctioning drivers then no issue to go ahead.  
> > 
> > A driver that knows how to use the device in a coherent way can
> > certainly proceed, but I suspect that's not something we can ask of
> > QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
> > to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> > the device.  There might be an experimental ("x-" prefixed) QEMU device
> > option to allow user override, but QEMU should disallow the possibility
> > of malfunctioning drivers by default.  If we have devices that probe as
> > supporting no-snoop, but actually can't generate such traffic, we might
> > need a quirk list somewhere.  
> 
> Compatibility is important, but when I look in the kernel code I see
> very few places that call wbinvd(). Basically all DRM, for something
> relevant to qemu.
> 
> That tells me that the vast majority of PCI devices do not generate
> no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default.  I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.
It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.

> > > I think it makes the software design much simpler if the security
> > > check is very simple. Possessing a suitable device in an ioasid fd
> > > container is enough to flip on the feature and we don't need to track
> > > changes from that point on. We don't need to revoke wbinvd if the
> > > ioasid fd changes, for instance. Better to keep the kernel very simple
> > > in this regard.  
> > 
> > You're suggesting that a user isn't forced to give up wbinvd emulation
> > if they lose access to their device?
> 
> Sure, why do we need to be stricter? It is the same logic I gave
> earlier, once an attacker process has access to wbinvd an attacker can
> just keep its access indefinitely.
> 
> The main use case for revocation assumes that qemu would be
> compromised after a device is hot-unplugged and you want to block off
> wbinvd. But I have a hard time seeing that as useful enough to justify
> all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance.  If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.

> For KVM qemu can turn on/off on hot plug events as it requires to give
> VM security. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it.  Releasing the device and
retaining the privilege gained by it seems wrong.  Thanks,

Alex



Re: [PATCH] iommu/dma: Fix compile warning in 32-bit builds

2021-06-07 Thread Joerg Roedel
On Mon, Jun 07, 2021 at 02:49:05PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> Compiling the recent dma-iommu changes under 32-bit x86 triggers this
> compile warning:
> 
> drivers/iommu/dma-iommu.c:249:5: warning: format ‘%llx’ expects argument of 
> type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ {aka 
> ‘unsigned int’} [-Wformat=]
> 
> The reason is that %llx is used to print a variable of type
> phys_addr_t. Fix it by using the correct %pa format specifier for
> phys_addr_t.
> 
> Cc: Srinath Mannam 
> Cc: Robin Murphy 
> Cc: Oza Pawandeep 
> Fixes: aadad097cd46 ("iommu/dma: Reserve IOVA for PCIe inaccessible DMA 
> address")

This tag is wrong, it should be:

Fixes: 9561dd434860 ("iommu/dma: Fix IOVA reserve dma ranges")

> Signed-off-by: Joerg Roedel 
> ---
>  drivers/iommu/dma-iommu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 95e7349ac3f1..5d96fcc45fec 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -246,8 +246,8 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
>   } else if (end < start) {
>   /* dma_ranges list should be sorted */
>   dev_err(&dev->dev,
> - "Failed to reserve IOVA [%#010llx-%#010llx]\n",
> - start, end);
> + "Failed to reserve IOVA [%pa-%pa]\n",
> + &start, &end);
>   return -EINVAL;
>   }
>  
> -- 
> 2.31.1

Re: [RFC PATCH V3 10/11] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-06-07 Thread Tianyu Lan




On 6/7/2021 2:50 PM, Christoph Hellwig wrote:

On Sun, May 30, 2021 at 11:06:27AM -0400, Tianyu Lan wrote:

+   if (hv_isolation_type_snp()) {
+   pfns = kcalloc(buf_size / HV_HYP_PAGE_SIZE, sizeof(unsigned long),
+  GFP_KERNEL);
+   for (i = 0; i < buf_size / HV_HYP_PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn(net_device->recv_buf + i * HV_HYP_PAGE_SIZE) +
+   (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
+
+   vaddr = vmap_pfn(pfns, buf_size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
+   kfree(pfns);
+   if (!vaddr)
+   goto cleanup;
+   net_device->recv_original_buf = net_device->recv_buf;
+   net_device->recv_buf = vaddr;
+   }


This probably wnats a helper to make the thing more readable.  But who
came up with this fucked up communication protocol where the host needs
to map random pfns into a contigous range?  Sometime I really have to
wonder what crack the hyper-v people take when comparing this to the
relatively sane approach others take.


Agree. Will add a helper function.
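
For illustration, the quoted mapping code could fold into a helper along
these lines (the name and exact placement are made up):

static void *netvsc_remap_buf(void *buf, unsigned long buf_size)
{
	unsigned long *pfns;
	void *vaddr;
	int i;

	pfns = kcalloc(buf_size / HV_HYP_PAGE_SIZE, sizeof(*pfns),
		       GFP_KERNEL);
	if (!pfns)
		return NULL;

	/* Offset each pfn above the shared gpa boundary so the buffer is
	 * accessed through the extra address space, then map the pfn
	 * array into one contiguous kernel VA range. */
	for (i = 0; i < buf_size / HV_HYP_PAGE_SIZE; i++)
		pfns[i] = virt_to_hvpfn(buf + i * HV_HYP_PAGE_SIZE) +
			  (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, buf_size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
	kfree(pfns);
	return vaddr;
}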



+   for (i = 0; i < page_count; i++)
+   dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
+packet->dma_range[i].mapping_size,
+DMA_TO_DEVICE);
+
+   kfree(packet->dma_range);


Any reason this isn't simply using a struct scatterlist?


I will have a look. Thanks for the reminder about scatterlist.




+   for (i = 0; i < page_count; i++) {
+   char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
+   + pb[i].offset);
+   u32 len = pb[i].len;
+
+   dma = dma_map_single(&hv_dev->device, src, len,
+DMA_TO_DEVICE);


dma_map_single can only be used on page baked memory, and if this is
using page backed memory you wouldn't need to do thee phys_to_virt
tricks.  Can someone explain the mess here in more detail?


Sorry, could you elaborate on the issue? These pages in the pb array are
not allocated by the DMA API, and using dma_map_single() here is to map
these pages' addresses to bounce buffer physical addresses.





struct rndis_device *dev = nvdev->extension;
struct rndis_request *request = NULL;
+   struct hv_device *hv_dev = ((struct net_device_context *)
+   netdev_priv(ndev))->device_ctx;


Why not use a net_device_context local variable instead of this cast
galore?



OK. I will update.


Thanks.


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 09:57:12AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> > On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > 
> > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > > > > VF (mdev device == VF device), then that mdev device has the same iommu
> > > > > > protection scope as the VF associated to it.
> > > > > 
> > > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > It's only fake if you start with a narrow view of what a group is. 
> > > 
> > > A group is connected to drivers/iommu. A group object without *any*
> > > relation to drivers/iommu is just a complete fiction, IMHO.
> > 
> > That might be where we differ.  As I've said, my group I'm primarily
> > meaning the fundamental hardware unit of isolation.  *Usually* that's
> > determined by the capabilities of an IOMMU, but in some cases it might
> > not be.  In either case, the boundaries still matter.
> 
> As in my other email we absolutely need a group concept, it is just a
> question of how the user API is designed around it.
> 
> > > The group mdev implicitly creates is just a fake proxy that comes
> > > along with mdev API. It doesn't do anything and it doesn't mean
> > > anything.
> > 
> > But.. the case of multiple mdevs managed by a single PCI device with
> > an internal IOMMU also exists, and then the mdev groups are *not*
> > proxies but true groups independent of the parent device.  Which means
> > that the group structure of mdevs can vary, which is an argument *for*
> > keeping it, not against.
> 
> If VFIO becomes more "vfio_device" centric then the vfio_device itself
> has some properties. One of those can be "is it inside a drivers/iommu
> group, or not?".
> 
> If the vfio_device is not using a drivers/iommu IOMMU interface then
> it can just have no group at all - no reason to lie. This would mean
> that the device has perfect isolation.

When you say "not using a drivers/iommu IOMMU interface" do you
basically mean the device doesn't do DMA?  I can see some benefit to
that, but some drawbacks too.  The *main* form of isolation (or lack
thereof) that groups is about the IOMMU, but groups can also represent
other forms of isolation failure: e.g. a multifunction device, where
function 0 has some debug registers which affect other functions.
That's relevant whether or not any of those functions use DMA.

Now, we could represent those different sorts of isolation separately,
but at the time our thinking was that we should group together devices
that can't be safely isolated for *any* reason, since the practical
upshot is the same: you can't safely split those devices between
different owners.

> What I don't like is forcing certain things depending on how the
> vfio_device was created - for instance forcing a IOMMU group as part
> and forcing an ugly "SW IOMMU" mode in the container only as part of
> mdev_device.

I don't really see how this is depending on how the device is created.
The current VFIO model is that every device always belongs to a group
- but that group might be a singleton.  That seems less complicated to
me that some devices do and some don't have a group.

> These should all be properties of the vfio_device itself.
> 
> Again this is all about the group fd - and how to fit in with the
> /dev/ioasid proposal from Kevin:
> 
> https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/
> 
> Focusing on vfio_device and skipping the group fd smooths out some
> rough edges.
> 
> Code wise we are not quite there, but I have mapped out eliminating
> the group from the vfio_device centric API and a few other places it
> has crept in.
> 
> The group can exist in the background to enforce security without
> being a cornerstone of the API design.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 06:49:20AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode'
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> > 
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required.  Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> > 
> 
> Isn't the typical TLB semantic that non-present entries are not
> cached, thus invalidation is not required when making non-present
> entries present?

Usually, but not necessarily.

> It's true to both CPU TLB and IOMMU TLB.

I don't think it's entirely true of the CPU TLB on all ppc MMU models
(of which there are far too many).

> In reality
> I feel there are more usages built on hardware nesting than software
> nesting thus making default following hardware TLB behavior makes
> more sense...

I'm arguing for always-require-invalidate because it's strictly more
general.  Requiring the invalidate will support models that don't
require it in all cases; we just make the invalidate a no-op.  The
reverse is not true, so we should tackle the general case first, then
optimize.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 04:22:25PM -0600, Alex Williamson wrote:
> On Tue, 1 Jun 2021 07:01:57 +
> "Tian, Kevin"  wrote:
> > 
> > I summarized five opens here, about:
> > 
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> > 
> ...
> > 
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is
> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.
> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.

To clarify that's a guest side logical vIOMMU page table which is
partially managed by KVM.  This is an optimization - things can work
without it, but it means guest iomap/unmap becomes a hot path because
each map/unmap hypercall has to go
guest -> KVM -> qemu -> VFIO

So there are multiple context transitions.

> > Because KVM deduces some policy based on the fact of assigned device, 
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC. 
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.
> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> consideration. Another strange thing is about PPC. It seems it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
> to know how to handle those for in-kernel acceleration.  Thanks,

For PAPR guests, which is the common case, yes.  Bare metal POWER
hosts have their own page table format.  And probably some of the
newer embedded ppc models have some different IOMMU model entirely,
but I'm not familiar with it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



[PATCH v3 1/6] iommu/amd: Selective flush on unmap

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on AMD IOMMU but
neglected to adapt amd_iommu_iotlb_sync() to use the selective flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues such as those that the Intel IOMMU driver encountered
recently, flush the page-walk caches by always setting the "pde"
parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..3e40f6610b6a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2059,12 +2059,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2167,7 +2172,13 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1
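
[To make the intended flow concrete, here is a minimal caller-side sketch
(not part of the patch; it only illustrates how the core API batches
invalidations so the reworked amd_iommu_iotlb_sync() can issue a single
selective flush):]

#include <linux/iommu.h>

/* Sketch, assuming a valid domain and a mapped [iova, iova + size) range:
 * unmap without flushing, gather the range, then sync exactly once. */
static size_t unmap_and_sync(struct iommu_domain *domain,
			     unsigned long iova, size_t size)
{
	struct iommu_iotlb_gather gather;
	size_t unmapped;

	iommu_iotlb_gather_init(&gather);
	/* the driver's ->unmap collects the affected pages in "gather" */
	unmapped = iommu_unmap_fast(domain, iova, size, &gather);
	/* with this patch, AMD flushes just the gathered range here */
	iommu_iotlb_sync(domain, &gather);

	return unmapped;
}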



[PATCH v3 2/6] iommu/amd: Do not use flush-queue when NpCache is on

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

Do not use flush-queue on virtualized environments, where the NpCache
capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d and a detailed
explanation as for the rationale is described in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..ba3b76ed776d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v3 4/6] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

Refactor iommu_iotlb_gather_add_page() and factor out the logic that
detects whether the IOTLB gather range and a new range are disjoint. It
will be used by the next patch, which implements different gathering
logic for AMD.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 include/linux/iommu.h | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f254c62f3720..b5a2bfc68fb0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
+ *
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to check whether a new range and the gathered
+ * range are disjoint. For many IOMMUs, flushing the IOMMU in this case is
+ * better than merging the two, which might lead to unnecessary invalidations.
+ */
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   return gather->end != 0 &&
+   (end + 1 < gather->start || start > gather->end + 1);
+}
+
+
 /**
  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
  * @gather: TLB gather data
@@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
-
/*
 * If the new page is disjoint from the current range or is mapped at
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
-   end + 1 < gather->start || start > gather->end + 1) {
-   if (gather->pgsize)
-   iommu_iotlb_sync(domain, gather);
-   gather->pgsize = size;
-   }
+   if ((gather->pgsize && gather->pgsize != size) ||
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
 
+   gather->pgsize = size;
iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
@@ -730,6 +748,13 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
 {
 }
 
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   return false;
+}
+
 static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
unsigned long iova, size_t size)
 {
-- 
2.25.1



[PATCH v3 3/6] iommu: Improve iommu_iotlb_gather helpers

2021-06-07 Thread Nadav Amit
From: Robin Murphy 

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Robin Murphy 

---

Changes from Robin's version:
* Added iommu_iotlb_gather_add_range() stub for !CONFIG_IOMMU_API
* Use iommu_iotlb_gather_add_range() in iommu_iotlb_gather_add_page()
---
 drivers/iommu/mtk_iommu.c |  5 +----
 include/linux/iommu.h | 43 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..0af4a91ac7da 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -523,10 +523,7 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
unsigned long end = iova + size - 1;
 
-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_add_range(gather, iova, size);
return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..f254c62f3720 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long end = iova + size - 1;
+
+   if (gather->start > iova)
+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size)
@@ -515,11 +547,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
gather->pgsize = size;
}
 
-   if (gather->end < end)
-   gather->end = end;
-
-   if (gather->start > start)
-   gather->start = start;
+   iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
 /* PCI device grouping function */
@@ -702,6 +730,11 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
 {
 }
 
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+}
+
 static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
return 0;
-- 
2.25.1
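
[To illustrate the address-based behaviour the kerneldoc describes, a
small sketch with made-up addresses, not part of the patch:]

#include <linux/iommu.h>
#include <linux/sizes.h>

static void gather_range_example(void)
{
	struct iommu_iotlb_gather gather;

	iommu_iotlb_gather_init(&gather);	/* start = ULONG_MAX, end = 0 */

	iommu_iotlb_gather_add_range(&gather, 0x4000, SZ_4K);
	/* gather now covers [0x4000, 0x4fff] */

	iommu_iotlb_gather_add_range(&gather, 0x2000, SZ_4K);
	/* gather now covers [0x2000, 0x4fff]: only the bounding range is
	 * kept, including the untouched hole at [0x3000, 0x3fff] */
}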



[PATCH v3 6/6] iommu/amd: Sync once for scatter-gather operations

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

On virtual machines, software must flush the IOTLB after each page table
entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element in the scatter-gather list,
which calls into the vendor IOMMU driver through the iommu_ops callback.
As a result, a single sg mapping may lead to multiple IOTLB flushes.

Fix this by adding an amd_iommu_iotlb_sync_map() callback and flushing
once at this point, after all sg mappings are set.

This commit follows and is inspired by commit 933fcd01e97e2
("iommu/vt-d: Add iotlb_sync_map callback").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 128f2e889ced..dd23566f1db8 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2027,6 +2027,16 @@ static int amd_iommu_attach_device(struct iommu_domain *dom,
return ret;
 }
 
+static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
+unsigned long iova, size_t size)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+
+   if (ops->map)
+   domain_flush_np_cache(domain, iova, size);
+}
+
 static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 phys_addr_t paddr, size_t page_size, int iommu_prot,
 gfp_t gfp)
@@ -2045,10 +2055,8 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
if (iommu_prot & IOMMU_WRITE)
prot |= IOMMU_PROT_IW;
 
-   if (ops->map) {
+   if (ops->map)
ret = ops->map(ops, iova, paddr, page_size, prot, gfp);
-   domain_flush_np_cache(domain, iova, page_size);
-   }
 
return ret;
 }
@@ -2249,6 +2257,7 @@ const struct iommu_ops amd_iommu_ops = {
.attach_dev = amd_iommu_attach_device,
.detach_dev = amd_iommu_detach_device,
.map = amd_iommu_map,
+   .iotlb_sync_map = amd_iommu_iotlb_sync_map,
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.probe_device = amd_iommu_probe_device,
-- 
2.25.1
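
[For context, a simplified sketch of the core map_sg flow this callback
hooks into. Illustrative only: the real core code maps each element via
an internal helper, for which iommu_map_atomic() stands in here:]

static int map_sg_sketch(struct iommu_domain *domain, unsigned long iova,
			 struct scatterlist *sg, unsigned int nents, int prot)
{
	const struct iommu_ops *ops = domain->ops;
	struct scatterlist *s;
	size_t mapped = 0;
	unsigned int i;
	int ret;

	for_each_sg(sg, s, nents, i) {
		/* map one element of the scatter-gather list */
		ret = iommu_map_atomic(domain, iova + mapped, sg_phys(s),
				       s->length, prot);
		if (ret)
			return ret;
		mapped += s->length;
	}

	/* one flush for the whole mapped range, via the new callback */
	if (ops->iotlb_sync_map)
		ops->iotlb_sync_map(domain, iova, mapped);

	return 0;
}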



[PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

AMD's IOMMU can flush any range efficiently (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs that have a limit on
the number of pages that can be flushed in a single flush.  In addition,
AMD's IOMMU does not care about the page-size, so changes of the page
size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to a disjoint range or page-size
changes is not needed for AMD. Yet, vIOMMUs require the hypervisor to
synchronize the virtualized IOMMU's PTEs with the physical ones. This
process induces overhead, so it is better not to cause unnecessary
flushes, i.e., flushes of PTEs that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead of the
generic iommu_iotlb_gather_add_page(). Ignore page-size changes and
disjoint regions unless the "non-present cache" feature is reported by
the IOMMU capabilities, as this is an indication we are running on a
physical IOMMU. A similar indication is used by VT-d (see "caching
mode"). The new logic retains the same flushing behavior that we had
before the introduction of page-selective IOTLB flushes for AMD.

On virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are. Also check whether
the new region would cause an IOTLB invalidation of a large region that
would include unmodified PTEs. The latter check is done according to the
"order" of the IOTLB flush.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3e40f6610b6a..128f2e889ced 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2053,6 +2053,48 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
return ret;
 }
 
+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+   struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   /*
+* AMD's IOMMU can flush as many pages as necessary in a single flush.
+* Unless we run in a virtual machine, which can be inferred according
+* to whether "non-present cache" is on, it is probably best to prefer
+* (potentially) too extensive TLB flushing (i.e., more misses) over
+* multiple TLB flushes (i.e., more flushes). For virtual machines the
+* hypervisor needs to synchronize the host IOMMU PTEs with those of
+* the guest, and the trade-off is different: unnecessary TLB flushes
+* should be avoided.
+*/
+   if (amd_iommu_np_cache && gather->end != 0) {
+   unsigned long start = iova, end = start + size - 1;
+
+   if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) {
+   /*
+* If the new page is disjoint from the current range,
+* flush.
+*/
+   iommu_iotlb_sync(domain, gather);
+   } else {
+   /*
+* If the order of TLB flushes increases by more than
+* 1, it means that we would have to flush PTEs that
+* were not modified. In this case, flush.
+*/
+   unsigned long new_start = min(gather->start, start);
+   unsigned long new_end = max(gather->end, end);
+   int msb_diff = fls64(gather->end ^ gather->start);
+   int new_msb_diff = fls64(new_end ^ new_start);
+
+   if (new_msb_diff > msb_diff + 1)
+   iommu_iotlb_sync(domain, gather);
+   }
+   }
+
+   iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
  struct iommu_iotlb_gather *gather)
@@ -2067,7 +2109,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 
r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
 
-   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
 
return r;
 }
-- 
2.25.1
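
[A worked example of the "order" heuristic above, with made-up numbers;
fls64(end ^ start) approximates the order of the flush command that
would cover the range:]

#include <linux/bitops.h>

static void order_check_example(void)
{
	u64 start = 0x0000, end = 0x0fff;	/* gathered so far */

	/* current order: fls64(0x0fff ^ 0x0000) = 12 */
	int msb_diff = fls64(end ^ start);

	/* merging an adjacent 4KB page grows the range to [0x0000, 0x1fff]:
	 * fls64(0x1fff) = 13 = 12 + 1, so keep gathering */
	int grow_ok = fls64(0x1fffULL ^ start);

	/* merging a page at 0x10000 grows it to [0x0000, 0x10fff]:
	 * fls64(0x10fff) = 17 > 12 + 1, so sync first, then gather */
	int grow_bad = fls64(0x10fffULL ^ start);

	(void)msb_diff; (void)grow_ok; (void)grow_bad;
}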



[PATCH v3 0/6] iommu/amd: Enable page-selective flushes

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

Besides the bug that was already fixed by commit a017c567915f
("iommu/amd: Fix wrong parentheses on page-specific invalidations")
there are several remaining matters to enable and benefit from
page-selective IOTLB flushes on AMD:

1. Enable selective flushes on unmap (patch 1)
2. Avoid using flush-queue on vIOMMUs (patch 2)
3. Relaxed flushes when gathering, excluding vIOMMUs (patches 3-5)
4. Syncing once on scatter-gather map operations (patch 6)

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

--- 

v2->v3:
* Rebase on v5.13-rc5
* Refactoring (patches 4-5) [Robin]
* Rework flush logic (patch 5): more relaxed on native
* Syncing once on scatter-gather operations (patch 6)

v1->v2:
* Rebase on v5.13-rc3

Nadav Amit (5):
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not use flush-queue when NpCache is on
  iommu: Factor iommu_iotlb_gather_is_disjoint() out
  iommu/amd: Tailored gather logic for AMD
  iommu/amd: Sync once for scatter-gather operations

Robin Murphy (1):
  iommu: Improve iommu_iotlb_gather helpers

 drivers/iommu/amd/init.c  |  7 +++-
 drivers/iommu/amd/iommu.c | 72 ++---
 drivers/iommu/mtk_iommu.c |  5 +--
 include/linux/iommu.h | 84 +--
 4 files changed, 145 insertions(+), 23 deletions(-)

-- 
2.25.1



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> 
> > > We can still consider it a single "address space" from the IOMMU
> > > perspective. What has happened is that the address table is not just a
> > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > 
> > True.  This does complexify how we represent what IOVA ranges are
> > valid, though.  I'll bet you most implementations don't actually
> > implement a full 64-bit IOVA, which means we effectively have a large
> > number of windows from (0..max IOVA) for each valid pasid.  This adds
> > another reason I don't think my concept of IOVA windows is just a
> > power specific thing.
> 
> Yes
> 
> Things rapidly get into weird hardware specific stuff though, the
> request will be for things like:
>   "ARM PASID page table format from SMMU IP block vXX"

So, I'm happy enough for picking a user-managed pagetable format to
imply the set of valid IOVA ranges (though a query might be nice).

I'm mostly thinking of representing (and/or choosing) valid IOVA
ranges as something for the kernel-managed pagetable style
(MAP/UNMAP).
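
[A purely hypothetical strawman of the kind of range query meant here;
none of these names exist in any kernel:]

#include <linux/types.h>

struct ioasid_iova_range {
	__u64	start;
	__u64	last;		/* inclusive */
};

struct ioasid_iova_ranges {
	__u32	argsz;
	__u32	nr_ranges;
	struct ioasid_iova_range ranges[];
};

/* e.g.: ioctl(ioasid_fd, IOASID_GET_IOVA_RANGES, &ranges); */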

> Which may have a bunch of (possibly very weird!) format specific data
> to describe and/or configure it.
> 
> The uAPI needs to be suitably general here. :(
> 
> > > If we are already going in the direction of having the IOASID specify
> > > the page table format and other details, specifying that the page
> > > tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> > > step.
> > 
> > Well, rather I think userspace needs to request what page table format
> > it wants and the kernel tells it whether it can oblige or not.
> 
> Yes, this is what I meant.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Fri, Jun 04, 2021 at 09:30:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult 
> wrote:
> > On 02.06.21 19:24, Jason Gunthorpe wrote:
> > 
> > Hi,
> > 
> > >> If I understand this correctly, /dev/ioasid is a kind of "common
> > supplier"
> > >> to other APIs / devices. Why can't the fd be acquired by the
> > >> consumer APIs (eg. kvm, vfio, etc) ?
> > >
> > > /dev/ioasid would be similar to /dev/vfio, and everything already
> > > deals with exposing /dev/vfio and /dev/vfio/N together
> > >
> > > I don't see it as a problem, just more work.
> > 
> > One of the problems I'm seeing is in container environments: when
> > passing in an vfio device, we now also need to pass in /dev/ioasid,
> > thus increasing the complexity in container setup (or orchestration).
> 
> Containers already needed to do this today. Container orchestration is
> hard.

Right to use VFIO a container already needs both /dev/vfio and one or
more /dev/vfio/NNN group devices.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


在 2021/6/3 上午1:21, Jason Gunthorpe 写道:

On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:

在 2021/6/2 上午1:31, Jason Gunthorpe 写道:

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:

We can open up to ~0U file descriptors; I don't see why we need to restrict
it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.


I don't see much difference between IOASID and other types of fds. People can
choose to use poll or epoll.

Not really, once one thing in an application uses a large number of FDs
the entire application is affected. If any open() can return a 'very big
number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for


And with the current proposal (assuming there's an N:1 ioasid to ioasid
fd mapping), I wonder how select can work for the specific ioasid.

pagefault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary. Multifds
just complicate this too

Jason



Well, this sounds like a re-invention of io_uring which has already 
worked for multifds.


Thanks
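
[For reference, a small userspace illustration of the select() limitation
discussed above; the fd number is hypothetical:]

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
	int fd = 5000;	/* hypothetical "very big" fd number */

	/* select() can only watch fds below FD_SETSIZE (typically 1024);
	 * FD_SET() on a larger fd is undefined behaviour, which is why one
	 * high fd effectively bans select() for the whole process. */
	printf("FD_SETSIZE = %d\n", FD_SETSIZE);
	if (fd >= FD_SETSIZE)
		fprintf(stderr, "fd %d is unusable with select()\n", fd);

	return 0;
}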



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 20:03:53 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:
> 
> > > Compatibility is important, but when I look in the kernel code I see
> > > very few places that call wbinvd(). Basically all DRM for something
> > > relevant to qemu.  
> > > 
> > > That tells me that the vast majority of PCI devices do not generate
> > > no-snoop traffic.  
> > 
> > Unfortunately, even just looking at devices across a couple laptops
> > most devices do support and have NoSnoop+ set by default.
> 
> Yes, mine too, but that doesn't mean the device is issuing nosnoop
> transactions, it just means the OS is allowing it to do so if it wants.
> 
> As I said, without driver support the feature cannot be used, and
> there is no driver support in Linux outside DRM, unless it is
> hidden.. Certainly I've never run into it..
> 
> Even mlx5 is setting the nosnoop bit, but I have a fairly high
> confidence that we don't set the TLP bit for anything Linux does.
> 
> > It's not safe for QEMU to make an assumption that only GPUs will
> > actually make use of it.  
> 
> Not 100% safe, but if you know you are running Linux OS in the VM you
> can look at the drivers the devices need and make a determination.

QEMU doesn't know what guest it's running or what driver the guest is
using.  QEMU can only create safe configurations by default, the same
as done now using vfio.  Anything outside of that scope would require
experimental opt-in support by the user or a guarantee from the device
vendor that the device cannot ever (not just for the existing drivers)
create non-coherent TLPs. Thanks,

Alex

> > Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> > privilege that the device grants it.  Releasing the device and
> > retaining the privileged gained by it seems wrong.  Thanks,  
> 
> It is not completely ideal, but it is such a simplification, and I
> can't really see a drawback..
> 
> Jason
> 



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:

> > Compatibility is important, but when I look in the kernel code I see
> > very few places that call wbinvd(). Basically all DRM for something
> > relevant to qemu.
> > 
> > That tells me that the vast majority of PCI devices do not generate
> > no-snoop traffic.
> 
> Unfortunately, even just looking at devices across a couple laptops
> most devices do support and have NoSnoop+ set by default.  

Yes, mine too, but that doesn't mean the device is issuing nosnoop
transactions, it just means the OS is allowing it to do so if it wants.

As I said, without driver support the feature cannot be used, and
there is no driver support in Linux outside DRM, unless it is
hidden.. Certainly I've never run into it..

Even mlx5 is setting the nosnoop bit, but I have a fairly high
confidence that we don't set the TLP bit for anything Linux does.

> It's not safe for QEMU to make an assumption that only GPUs will
> actually make use of it.

Not 100% safe, but if you know you are running Linux OS in the VM you
can look at the drivers the devices need and make a determination.

> Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> privilege that the device grants it.  Releasing the device and
> retaining the privileged gained by it seems wrong.  Thanks,

It is not completely ideal, but it is such a simplification, and I
can't really see a drawback..

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


在 2021/6/8 上午3:41, Alex Williamson 写道:

On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe  wrote:


On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:


It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device.  There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default.  If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default.  I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.



I wonder whether or not it was done via ACPI:

"

6.2.17 _CCA (Cache Coherency Attribute)

The _CCA object returns whether or not a bus-master device supports
hardware managed cache coherency. Expected values are 0 to indicate it
is not supported, and 1 to indicate that it is supported. All other
values are reserved.


...

On Intel platforms, if the _CCA object is not supplied, the OSPM will 
assume the devices are hardware cache coherent.


"

Thanks



It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.


I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device?

Sure, why do we need to be stricter? It is the same logic I gave
earlier: once an attacker process has access to wbinvd, it can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance.  If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.


For KVM, qemu can turn this on/off on hot plug events as required to
keep the VM secure. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it.  Releasing the device and
retaining the privileged gained by it seems wrong.  Thanks,

Alex




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


在 2021/6/7 下午10:14, Jason Gunthorpe 写道:

On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:


Note that no drivers call these things doesn't meant it was not
supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop.



Just to make sure we are on the same page: what I meant is, if DMA 
behavior like no-snoop is device specific, there's no need to mandate 
a general virtio attribute. We can describe it per device. That the devices 
implemented in the current spec do not use non-coherent DMA doesn't 
mean any future devices won't do that. The driver could choose a 
transport (e.g. PCI), platform (ACPI) or device specific (general virtio 
command) way to detect and flush the cache when necessary.





no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NICs expect
packets to be processed by the CPU.



For a NIC, yes. But virtio is more than just NICs. We've already supported 
GPU and crypto devices. In those cases, no-snoop will be useful since the 
data is not necessarily expected to be processed by the CPU.


And a lot of other type of devices are being proposed.

Thanks




"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here at all.
  
Jason





Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Shenming Lu
On 2021/6/7 20:19, Liu, Yi L wrote:
>> From: Shenming Lu 
>> Sent: Friday, June 4, 2021 10:03 AM
>>
>> On 2021/6/4 2:19, Jacob Pan wrote:
>>> Hi Shenming,
>>>
>>> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
>> 
>>> wrote:
>>>
 On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults.
>
> I'm still confused why drivers need fault handlers at all?

 Essentially it is the userspace that needs the fault handlers,
 one case is to deliver the faults to the vIOMMU, and another
 case is to enable IOPF on the GPA address space for on-demand
 paging, it seems that both could be specified in/through the
 IOASID_ALLOC ioctl?

>>> I would think IOASID_BIND_PGTABLE is where fault handler should be
>>> registered. There wouldn't be any IO page fault without the binding
>> anyway.
>>
>> Yeah, I also proposed this before, registering the handler in the
>> BIND_PGTABLE
>> ioctl does make sense for the guest page faults. :-)
>>
>> But how about the page faults from the GPA address space (it's page table is
>> mapped through the MAP_DMA ioctl)? From your point of view, it seems
>> that we should register the handler for the GPA address space in the (first)
>> MAP_DMA ioctl.
> 
> under the new proposal, I think the page fault handler is also registered
> per ioasid object. The difference compared with guest page table case
> is there is no need to inject the fault to VM.

Yeah.  And there are some issues specific to the GPA address space case
which have been discussed with Alex.  Thanks,

Shenming
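
[To make the alternative concrete, a hypothetical sketch of registering
fault delivery at IOASID allocation time; every name here is invented
purely for illustration:]

#include <linux/types.h>

struct ioasid_alloc_data {
	__u32	argsz;
	__u32	flags;
#define IOASID_ALLOC_ENABLE_IOPF	(1U << 0)
	__s32	fault_eventfd;	/* signalled when a fault is queued */
	__u32	pad;
};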


Re: [RFC PATCH V3 01/11] x86/HV: Initialize GHCB page in Isolation VM

2021-06-07 Thread Christoph Hellwig
On Sun, May 30, 2021 at 11:06:18AM -0400, Tianyu Lan wrote:
> +	if (ms_hyperv.ghcb_base) {
> +		rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
> +
> +		ghcb_va = ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
> +		if (!ghcb_va)
> +			return -ENOMEM;

Can you explain this a bit more?  We've very much deprecated
ioremap_cache in favor of memremap.  Why do you need a __iomem address
here?  Why do we need the remap here at all?

Does the data structure at this address not have any types that we
could use a struct for?

> +
> +	rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
> +	ghcb_va = ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
> +	if (!ghcb_va) {

This seems to duplicate the above code.

> +bool hv_isolation_type_snp(void)
> +{
> +	return static_branch_unlikely(&isolation_type_snp);
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);

This probably wants a kerneldoc explaining when it should be used.
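
[A sketch of the memremap()-based alternative being suggested; struct
hv_ghcb is the type mentioned later in this thread, and error handling
is elided:]

#include <linux/io.h>

/* memremap() yields a plain pointer, so no __iomem annotation and no
 * ioremap_cache() are needed for what is ordinary cacheable memory. */
static struct hv_ghcb *map_ghcb_sketch(u64 ghcb_gpa)
{
	return memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
}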


Re: [RFC PATCH V3 09/11] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-06-07 Thread Christoph Hellwig
Honestly, we really need to do away with the concept of hypervisor-
specific swiotlb allocations and just add a hypervisor hook to remap the
"main" buffer. That should remove a lot of code and confusion not just
for Xen but also any future addition like hyperv.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Paolo Bonzini

On 07/06/21 05:25, Tian, Kevin wrote:

Per Intel SDM wbinvd is a privileged instruction. A process on the
host has no privilege to execute it.


(Half of) the point of the kernel is to do privileged tasks on the 
processes' behalf.  There are good reasons why a process that uses VFIO 
(without KVM) could want to use wbinvd, so VFIO lets them do it with an 
ioctl and adequate checks around the operation.


Paolo



Re: [RFC PATCH V3 08/11] swiotlb: Add bounce buffer remap address setting function

2021-06-07 Thread Christoph Hellwig
On Sun, May 30, 2021 at 11:06:25AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> For Hyper-V isolation VM with AMD SEV SNP, the bounce buffer(shared memory)
> needs to be accessed via extra address space(e.g address above bit39).
> Hyper-V code may remap extra address space outside of swiotlb. swiotlb_
> bounce() needs to use remap virtual address to copy data from/to bounce
> buffer. Add new interface swiotlb_set_bounce_remap() to do that.

Why can't you use the bus_dma_region ranges to remap to your preferred
address?
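
[For reference, a sketch of the bus_dma_region mechanism being pointed
at, with a hypothetical window above bit 39; after this, the DMA API
applies the offset itself and no driver-side remapping of the bounce
buffer is needed:]

#include <linux/dma-map-ops.h>

static int set_isolated_dma_offset_sketch(struct device *dev)
{
	/* hypothetical: CPU address 0 maps to bus address at the boundary */
	return dma_direct_set_offset(dev, 0x0,
				     0x8000000000ULL,	/* dma_start */
				     0x8000000000ULL);	/* window size */
}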


Re: [RFC PATCH V3 11/11] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-06-07 Thread Christoph Hellwig
On Sun, May 30, 2021 at 11:06:28AM -0400, Tianyu Lan wrote:
> +	for (i = 0; i < request->hvpg_count; i++)
> +		dma_unmap_page(&device->device,
> +			       request->dma_range[i].dma,
> +			       request->dma_range[i].mapping_size,
> +			       request->vstor_packet.vm_srb.data_in == READ_TYPE ?
> +			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
> +	kfree(request->dma_range);

Unreadably long lines.  You probably want to factor the quoted code into
a little readable helper and do the same for the map side.


Re: [RFC PATCH V3 10/11] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-06-07 Thread Christoph Hellwig
On Sun, May 30, 2021 at 11:06:27AM -0400, Tianyu Lan wrote:
> +	if (hv_isolation_type_snp()) {
> +		pfns = kcalloc(buf_size / HV_HYP_PAGE_SIZE, sizeof(unsigned long),
> +			       GFP_KERNEL);
> +		for (i = 0; i < buf_size / HV_HYP_PAGE_SIZE; i++)
> +			pfns[i] = virt_to_hvpfn(net_device->recv_buf +
> +						i * HV_HYP_PAGE_SIZE) +
> +				(ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
> +
> +		vaddr = vmap_pfn(pfns, buf_size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
> +		kfree(pfns);
> +		if (!vaddr)
> +			goto cleanup;
> +		net_device->recv_original_buf = net_device->recv_buf;
> +		net_device->recv_buf = vaddr;
> +	}

This probably wants a helper to make the thing more readable.  But who
came up with this fucked up communication protocol where the host needs
to map random pfns into a contiguous range?  Sometimes I really have to
wonder what crack the hyper-v people take when comparing this to the
relatively sane approach others take.

> +	for (i = 0; i < page_count; i++)
> +		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
> +				 packet->dma_range[i].mapping_size,
> +				 DMA_TO_DEVICE);
> +
> +	kfree(packet->dma_range);

Any reason this isn't simply using a struct scatterlist?

> +	for (i = 0; i < page_count; i++) {
> +		char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
> +					 + pb[i].offset);
> +		u32 len = pb[i].len;
> +
> +		dma = dma_map_single(&hv_dev->device, src, len,
> +				     DMA_TO_DEVICE);

dma_map_single can only be used on page backed memory, and if this is
using page backed memory you wouldn't need to do these phys_to_virt
tricks.  Can someone explain the mess here in more detail?

>   struct rndis_device *dev = nvdev->extension;
>   struct rndis_request *request = NULL;
> +	struct hv_device *hv_dev = ((struct net_device_context *)
> +				    netdev_priv(ndev))->device_ctx;

Why not use a net_device_context local variable instead of this cast
galore?


Re: [RFC PATCH V3 01/11] x86/HV: Initialize GHCB page in Isolation VM

2021-06-07 Thread Tianyu Lan

Hi Christoph:
Thanks for your review.

On 6/7/2021 2:41 PM, Christoph Hellwig wrote:

On Sun, May 30, 2021 at 11:06:18AM -0400, Tianyu Lan wrote:

+   if (ms_hyperv.ghcb_base) {
+   rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
+
+   ghcb_va = ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
+   if (!ghcb_va)
+   return -ENOMEM;


Can you explain this a bit more?  We've very much deprecated
ioremap_cache in favor of memremap.  Why do you need a __iomem address
here?  Why do we need the remap here at all?


The GHCB physical address is an address in the extra address space, which 
is above the shared gpa boundary reported by Hyper-V CPUID. Addresses below
the shared gpa boundary are treated as encrypted and the ones above are 
treated as decrypted. System memory is remapped in the extra address space, 
starting from the boundary. Memory shared with the host needs to use an 
address in the extra address space (pa + shared_gpa_boundary) in the Linux guest.


This is to map the GHCB page for the communication operations with the 
hypervisor (e.g., hypercalls and MSR reads/writes).


memremap() will go through the iomem_resource list, and the address in the 
extra address space will not be in the list. So I used ioremap_cache(). I 
will use memremap() instead of ioremap() here.


Does the data structure at this address not have any types that we
could use a struct for?


The struct will be added in the following patch. I will refresh the 
following patch and use the struct hv_ghcb for the mapped pointer.



+
+   rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
+   ghcb_va = ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
+   if (!ghcb_va) {


This seems to duplicate the above code.


The above is to map the GHCB for the BSP and here it does the same thing 
for the APs. I will add a new function to avoid the duplication.




+bool hv_isolation_type_snp(void)
+{
+   return static_branch_unlikely(&isolation_type_snp);
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_snp);


This probably wants a kerneldoc explaining when it should be used.


OK. I will add.



[GIT PULL] memory: Tegra memory controller for v5.14

2021-06-07 Thread Krzysztof Kozlowski
Hi Olof and Arnd,

Tegra memory controller driver changes with necessary dependency from Thierry
(which you will also get from him):
1. Dmitry's power domain work on Tegra MC drivers,
2. Necessary clock and regulator dependencies for Dmitry's work.


Hi Thierry and Will,

This is the pull for you to base further SMMU patches on (preventing early
SMMU faults).

Best regards,
Krzysztof


The following changes since commit 6efb943b8616ec53a5e444193dccf1af9ad627b5:

  Linux 5.13-rc1 (2021-05-09 14:17:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl.git 
tags/memory-controller-drv-tegra-5.14

for you to fetch changes up to 393d66fd2cacba3e6aa95d7bb38790bfb7b1cc3a:

  memory: tegra: Implement SID override programming (2021-06-03 21:50:43 +0200)


Memory controller drivers for v5.14 - Tegra SoC

1. Enable compile testing of Tegra memory controller drivers.
2. Unify Tegra memory controller drivers. From Thierry Reding:
   "This set of patches converges the feature sets of the pre-Tegra186
   and the post-Tegra186 memory controller drivers such that newer chips
   can take advantage of some features that were previously only
   implemented on earlier chips."
3. Implement SID override programming as part of work to prevent early
   SMMU faults.
4. Some simplifications, e.g. use devm-helpers.

This pulls dedicated tag from Thierry Reding with necessary changes to
Tegra memory controller drivers, as a pre-requisite to series applied
here.  The changes from Thierry's tree also include their own
dependencies: regulator and clock driver changes.


Dmitry Osipenko (18):
  clk: tegra30: Use 300MHz for video decoder by default
  clk: tegra: Fix refcounting of gate clocks
  clk: tegra: Ensure that PLLU configuration is applied properly
  clk: tegra: Halve SCLK rate on Tegra20
  clk: tegra: Don't allow zero clock rate for PLLs
  clk: tegra: cclk: Handle thermal DIV2 CPU frequency throttling
  clk: tegra: Mark external clocks as not having reset control
  clk: tegra: Don't deassert reset on enabling clocks
  regulator: core: Add regulator_sync_voltage_rdev()
  soc/tegra: regulators: Bump voltages on system reboot
  soc/tegra: Add stub for soc_is_tegra()
  soc/tegra: Add devm_tegra_core_dev_init_opp_table()
  soc/tegra: fuse: Add stubs needed for compile-testing
  clk: tegra: Add stubs needed for compile-testing
  memory: tegra: Fix compilation warnings on 64bit platforms
  memory: tegra: Enable compile testing for all drivers
  memory: tegra20-emc: Use devm_tegra_core_dev_init_opp_table()
  memory: tegra30-emc: Use devm_tegra_core_dev_init_opp_table()

Krzysztof Kozlowski (1):
  Merge tag 'tegra-for-5.14-memory' of 
https://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into 
for-v5.14/tegra-mc

Thierry Reding (16):
  Merge branch 'for-5.14/regulator' into for-5.14/soc
  Merge branch 'for-5.14/clk' into for-5.14/memory
  Merge branch 'for-5.14/soc' into for-5.14/memory
  memory: tegra: Consolidate register fields
  memory: tegra: Unify struct tegra_mc across SoC generations
  memory: tegra: Introduce struct tegra_mc_ops
  memory: tegra: Push suspend/resume into SoC drivers
  memory: tegra: Make per-SoC setup more generic
  memory: tegra: Extract setup code into callback
  memory: tegra: Parameterize interrupt handler
  memory: tegra: Make IRQ support optional
  memory: tegra: Only initialize reset controller if available
  memory: tegra: Unify drivers
  memory: tegra: Add memory client IDs to tables
  memory: tegra: Split Tegra194 data into separate file
  memory: tegra: Implement SID override programming

 drivers/clk/tegra/clk-periph-gate.c  |   80 +-
 drivers/clk/tegra/clk-periph.c   |   11 +
 drivers/clk/tegra/clk-pll.c  |   12 +-
 drivers/clk/tegra/clk-tegra-periph.c |6 +-
 drivers/clk/tegra/clk-tegra-super-cclk.c |   16 +-
 drivers/clk/tegra/clk-tegra20.c  |6 +-
 drivers/clk/tegra/clk-tegra30.c  |6 +-
 drivers/clk/tegra/clk.h  |4 -
 drivers/iommu/tegra-smmu.c   |   16 +-
 drivers/memory/tegra/Kconfig |   18 +-
 drivers/memory/tegra/Makefile|6 +-
 drivers/memory/tegra/mc.c|  321 +++---
 drivers/memory/tegra/mc.h|   25 +
 drivers/memory/tegra/tegra114.c  | 1245 --
 drivers/memory/tegra/tegra124-emc.c  |4 +-
 drivers/memory/tegra/tegra124.c  | 1306 ---
 drivers/memory/tegra/tegra186.c  | 1679 +-
 drivers/memory/tegra/tegra194.c  | 1351 
 drivers/memory/tegra/tegra20-emc.c   |   48 +-