Re: [PATCH v2 2/5] iommu/mediatek: Always check runtime PM status in tlb flush range callback

2021-12-12 Thread Yong Wu
On Wed, 2021-12-08 at 14:07 +0200, Dafna Hirschfeld wrote:
> From: Sebastian Reichel 
> 
> In case of v4l2_reqbufs() it is possible that a TLB flush is done
> without runtime PM being enabled. In that case the "Partial TLB flush
> timed out, falling back to full flush" warning is printed.
> 
> Commit c0b57581b73b ("iommu/mediatek: Add power-domain operation")
> introduced has_pm as an optimization to avoid checking runtime PM
> when there is no power domain attached. But even without a PM domain
> there is still the device driver's runtime PM suspend handler, which
> disables the clock. Thus flushing should also be skipped while the
> device is runtime-suspended, even when no PM domain is involved.
> 
> Signed-off-by: Sebastian Reichel 
> Reviewed-by: Dafna Hirschfeld 

Reviewed-by: Yong Wu 

> ---
>  drivers/iommu/mtk_iommu.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 342aa562ab6a..dd2c08c54df4 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -225,16 +225,13 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
>  size_t granule,
>  struct mtk_iommu_data *data)
>  {
> - bool has_pm = !!data->dev->pm_domain;
>   unsigned long flags;
>   int ret;
>   u32 tmp;
>  
>   for_each_m4u(data) {
> - if (has_pm) {
> - if (pm_runtime_get_if_in_use(data->dev) <= 0)
> - continue;
> - }
> + if (pm_runtime_get_if_in_use(data->dev) <= 0)
> + continue;
>  
>   spin_lock_irqsave(&data->tlb_lock, flags);
>   writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
> @@ -259,8 +256,7 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
>   writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
>   spin_unlock_irqrestore(&data->tlb_lock, flags);
>  
> - if (has_pm)
> - pm_runtime_put(data->dev);
> + pm_runtime_put(data->dev);
>   }
>  }
>  
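For reference, the guard pattern the patch converges on, as a minimal
sketch (a hypothetical helper, not the driver code): pm_runtime_get_if_in_use()
takes a usage-count reference only if the device is already runtime-active,
so the hardware is never touched while its clocks may be gated.

#include <linux/pm_runtime.h>

/* Hypothetical helper illustrating the guard used in the patch. */
static void flush_if_powered(struct device *dev)
{
	/*
	 * <= 0: the device is suspended (0) or runtime PM is disabled
	 * (< 0); either way, skip touching the hardware.
	 */
	if (pm_runtime_get_if_in_use(dev) <= 0)
		return;

	/* ... safe to access the TLB flush registers here ... */

	pm_runtime_put(dev);	/* drop the reference taken above */
}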


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, December 13, 2021 7:37 AM
> 
> On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> > Kevin,
> >
> > On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> > >> From: Thomas Gleixner 
> > >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> > >> vIR related there.
> > >
> > > Well, virtio-iommu is a para-virtualized vIOMMU implementation.
> > >
> > > In reality there are also fully emulated vIOMMU implementations (e.g.
> > > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > > the IR logic in the existing iommu drivers just applies:
> > >
> > >   drivers/iommu/intel/irq_remapping.c
> > >   drivers/iommu/amd/iommu.c
> >
> > thanks for the explanation. So that's a full IOMMU emulation. I was more
> > expecting a paravirtualized lightweight one.
> 
> Kevin can you explain what on earth vIR is for and how does it work??
> 
> Obviously we don't expose the IR machinery to userspace, so at best
> this is somehow changing what the MSI trap does?
> 

Initially it was introduced to support more than 255 vCPUs. Because it
is a full emulation, this capability can certainly support the other
vIR usages observed on bare metal.

vIR doesn't rely on the presence of physical IR.

First, if the guest doesn't have a vfio device then the physical
capability doesn't matter.

Even with a vfio device, IR by definition is just about remapping
instead of injection (more on this part later). The interrupts are
always routed to the host handler first (vfio_msihandler() in this
case), which then triggers irqfd to call the virtual interrupt
injection handler (irqfd_wakeup()) in kvm.
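
For reference, the host-side handler is tiny; a sketch matching the
shape of the vfio-pci handler named above:

#include <linux/eventfd.h>
#include <linux/interrupt.h>

/* Sketch of vfio's MSI handler: it only kicks the eventfd; kvm's
 * irqfd_wakeup() then performs the actual virtual injection. */
static irqreturn_t vfio_msihandler_sketch(int irq, void *arg)
{
	struct eventfd_ctx *trigger = arg;

	eventfd_signal(trigger, 1);
	return IRQ_HANDLED;
}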

This suggests a clear role split between vfio and kvm:

  - vfio is responsible for irq allocation/startup as it is the device driver;
  - kvm takes care of virtual interrupt injection, being a VMM;

The two are connected via irqfd.

Following this split, vIR information is completely hidden in
userspace. Qemu calculates the routing information between vGSI and
vCPU (with or without vIR, and for whatever trapped interrupt storage)
and then registers it with kvm.

When kvm receives a notification via irqfd, it follows irqfd->vGSI->vCPU
and injects a virtual interrupt into the target vCPU.
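
For reference, the userspace side of this plumbing is small. A minimal
sketch (not Qemu code) using the stock KVM uAPI, assuming an existing
VM fd, with error handling omitted:

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Wire an eventfd to a vGSI so that signaling the eventfd injects a
 * virtual MSI into the guest. Returns the eventfd, which would then
 * be handed to vfio via VFIO_DEVICE_SET_IRQS. */
static int wire_gsi_to_eventfd(int vm_fd, unsigned int gsi,
			       unsigned long long vmsi_addr,
			       unsigned int vmsi_data)
{
	/* 1) Register the vGSI -> vCPU routing (here: one MSI route).
	 * NB: KVM_SET_GSI_ROUTING replaces the whole table; a real VMM
	 * passes every route it knows about. */
	struct {
		struct kvm_irq_routing hdr;
		struct kvm_irq_routing_entry entries[1];
	} routing;

	memset(&routing, 0, sizeof(routing));
	routing.hdr.nr = 1;
	routing.entries[0].gsi = gsi;
	routing.entries[0].type = KVM_IRQ_ROUTING_MSI;
	routing.entries[0].u.msi.address_lo = (unsigned int)vmsi_addr;
	routing.entries[0].u.msi.address_hi = (unsigned int)(vmsi_addr >> 32);
	routing.entries[0].u.msi.data = vmsi_data;
	if (ioctl(vm_fd, KVM_SET_GSI_ROUTING, &routing) < 0)
		return -1;

	/* 2) Connect an eventfd to that vGSI; vfio's host interrupt
	 * handler signals this eventfd. */
	int efd = eventfd(0, 0);
	struct kvm_irqfd irqfd;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = efd;
	irqfd.gsi = gsi;
	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
		return -1;

	return efd;
}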

Then comes an interesting scenario: IOMMU posted interrupts (PI).
This capability allows the IR engine to convert a physical interrupt
directly into a virtual one and inject it into the guest, in effect
offloading the virtual routing information into the hardware.

This is currently achieved via the IRQ bypass manager, which connects
vfio (IRQ producer) to kvm (IRQ consumer) around a specific Linux irq
number. Once the connection is established, kvm calls
irq_set_vcpu_affinity() to update the IRTE with virtual routing
information for that irq number.
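
The producer side of that connection is tiny; a sketch loosely
following what vfio-pci does per MSI vector (the helper name is
hypothetical):

#include <linux/eventfd.h>
#include <linux/irqbypass.h>

/*
 * Hypothetical producer registration, loosely modeled on vfio-pci.
 * The eventfd context doubles as the match token: when kvm registers
 * an IRQ bypass consumer with the same token (from the irqfd), the
 * bypass manager connects the two, and kvm can then program the IRTE
 * with virtual routing info via irq_set_vcpu_affinity().
 */
static int register_msi_producer(struct irq_bypass_producer *prod,
				 struct eventfd_ctx *trigger, int irq)
{
	prod->token = trigger;	/* matched against kvm's irqfd token */
	prod->irq = irq;	/* Linux irq number of this MSI vector */
	return irq_bypass_register_producer(prod);
}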

With that design Qemu doesn't know whether IR or PI is enabled
physically. It always talks to vfio to have IRQ resources allocated
and to kvm to register virtual routing information.

Then adding the new hypercall machinery into this picture:

  1) The hypercall needs to carry all necessary virtual routing
     information due to no-trap;

  2) Before querying the IRTE addr/data pair, Qemu needs to complete
     the same operations as today to have the IRTE ready:

	a) Register irqfd and the related GSI routing info with kvm;
	b) Allocate/start up IRQs via vfio;

     When PI is enabled, the IRTE is ready only after both are completed.

  3) Qemu gets the IRTE addr/data pair from the kernel and returns it
     to the guest.

Thanks
Kevin


[PATCH V7 5/5] net: netvsc: Add Isolation VM support for netvsc driver

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via a hvcall. vmbus_establish_gpadl() has already
done this for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA
API to map/unmap this memory when sending/receiving packets; the
Hyper-V swiotlb bounce buffer dma address will be returned. The
swiotlb bounce buffer has been marked visible to the host during boot.

The rx/tx ring buffers are allocated via vzalloc() and need to be
mapped into the unencrypted address space (above vTOM) before being
shared with the host and accessed. Add hv_map/unmap_memory() to
map/unmap the rx/tx ring buffers.
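
The netvsc.c hunk implementing the page-buffer mapping is only
partially quoted below, so here is a condensed sketch of what the
description above refers to (illustrative, assuming the hv_dma_range
type added by this series; not the full netvsc_dma_map()):

#include <linux/dma-mapping.h>
#include <linux/hyperv.h>
#include <asm/mshyperv.h>

/* Map one page buffer: swiotlb substitutes a host-visible bounce
 * page and the hv page buffer is rewritten to point at it. */
static int map_one_pb(struct device *dev, struct hv_page_buffer *pb,
		      struct hv_dma_range *range)
{
	void *src = phys_to_virt((pb->pfn << HV_HYP_PAGE_SHIFT) + pb->offset);
	dma_addr_t dma = dma_map_single(dev, src, pb->len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	range->dma = dma;			/* kept for dma_unmap_single() */
	range->mapping_size = pb->len;
	pb->pfn = dma >> HV_HYP_PAGE_SHIFT;	/* host now sees the bounce page */
	pb->offset = offset_in_hvpage(dma);
	return 0;
}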

Signed-off-by: Tianyu Lan 
---
Change since v3:
   * Replace HV_HYP_PAGE_SIZE with PAGE_SIZE and virt_to_hvpfn()
 with vmalloc_to_pfn() in the hv_map_memory()

Change since v2:
   * Add hv_map/unmap_memory() to map/umap rx/tx ring buffer.
---
 arch/x86/hyperv/ivm.c |  28 ++
 drivers/hv/hv_common.c|  11 +++
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c   | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/asm-generic/mshyperv.h|   2 +
 include/linux/hyperv.h|   5 ++
 8 files changed, 187 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 69c7a57f3307..2b994117581e 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visible)
kfree(pfn_array);
return ret;
 }
+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+void *hv_map_memory(void *addr, unsigned long size)
+{
+   unsigned long *pfns = kcalloc(size / PAGE_SIZE,
+ sizeof(unsigned long), GFP_KERNEL);
+   void *vaddr;
+   int i;
+
+   if (!pfns)
+   return NULL;
+
+   for (i = 0; i < size / PAGE_SIZE; i++)
+   pfns[i] = vmalloc_to_pfn(addr + i * PAGE_SIZE) +
+   (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
+
+   vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
+void hv_unmap_memory(void *addr)
+{
+   vunmap(addr);
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 7be173a99f27..3c5cb1f70319 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
return HV_STATUS_INVALID_PARAMETER;
 }
 EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
+
+void __weak *hv_map_memory(void *addr, unsigned long size)
+{
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(hv_map_memory);
+
+void __weak hv_unmap_memory(void *addr)
+{
+}
+EXPORT_SYMBOL_GPL(hv_unmap_memory);
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 315278a7cf88..cf69da0e296c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
u32 total_bytes;
u32 send_buf_index;
u32 total_data_buflen;
+   struct hv_dma_range *dma_range;
 };
 
 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,7 @@ struct netvsc_device {
 
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
+   void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1082,6 +1084,7 @@ struct netvsc_device {
 
/* Send buffer allocated by us */
void *send_buf;
+   void *send_original_buf;
u32 send_buf_size;
struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
@@ -1731,4 +1734,6 @@ struct rndis_message {
 #define RETRY_US_HI	10000
 #define RETRY_MAX	2000	/* >10 sec */
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 396bc1c204e6..b7ade735a806 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
int i;
 
kfree(nvdev->extension);
-   vfree(nvdev->recv_buf);
-   vfree(nvdev->send_buf);
+
+   if (nvdev->recv_original_buf) {
+   hv_unmap_memory(nvdev->recv_buf);
+   vfree(nvdev->recv_original_buf);
+   } else {
+   vfree(nvdev->recv_buf);
+   }
+
+   if (nvdev->send_original_buf) {
+   hv_unmap_memory(nvdev->send_buf);
+   vfree(nvdev->send_original_buf);
+   } else {
+   vfree(nvdev->send_buf);
+   }
+

[PATCH V7 4/5] scsi: storvsc: Add Isolation VM support for storvsc driver

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via a hvcall. vmbus_establish_gpadl() has already
done this for the storvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_mpb_desc() still need to be handled. Use the DMA API
(scsi_dma_map/unmap) to map this memory when sending/receiving packets
and return the swiotlb bounce buffer dma address. In Isolation VM, the
swiotlb bounce buffer is marked visible to the host and swiotlb force
mode is enabled.

Set the device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order
to keep the original data offset in the bounce buffer.
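
The min-align-mask hunk itself is not among those quoted below; a
hypothetical wrapper sketching the probe-time call it describes:

#include <linux/dma-mapping.h>
#include <asm/mshyperv.h>

/* Make swiotlb preserve a buffer's offset within a Hyper-V page, so
 * device and host agree on where the data starts in the bounce page. */
static int storvsc_set_min_align(struct device *dev)
{
	return dma_set_min_align_mask(dev, HV_HYP_PAGE_SIZE - 1);
}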

Signed-off-by: Tianyu Lan 
---
 drivers/hv/vmbus_drv.c |  4 
 drivers/scsi/storvsc_drv.c | 37 +
 include/linux/hyperv.h |  1 +
 3 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 392c1ac4f819..ae6ec503399a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "hyperv_vmbus.h"
 
@@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
return child_device_obj;
 }
 
+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
 /*
  * vmbus_device_register - Register the child device
  */
@@ -2118,6 +2120,8 @@ int vmbus_device_register(struct hv_device *child_device_obj)
}
hv_debug_add_dev_dir(child_device_obj);
 
+	child_device_obj->device.dma_mask = &vmbus_dma_mask;
+	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
return 0;
 
 err_kset_unregister:
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 20595c0ba0ae..ae293600d799 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+
 #include 
 #include 
 #include 
@@ -1336,6 +1338,7 @@ static void storvsc_on_channel_callback(void *context)
continue;
}
 		request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
+   scsi_dma_unmap(scmnd);
}
 
storvsc_on_receive(stor_device, packet, request);
@@ -1749,7 +1752,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
struct hv_host_device *host_dev = shost_priv(host);
struct hv_device *dev = host_dev->dev;
struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
-   int i;
struct scatterlist *sgl;
unsigned int sg_count;
struct vmscsi_request *vm_srb;
@@ -1831,10 +1833,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload_sz = sizeof(cmd_request->mpb);
 
if (sg_count) {
-   unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
-   u64 hvpfn;
+   struct scatterlist *sg;
+   unsigned long hvpfn, hvpfns_to_add;
+   int j, i = 0;
 
if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
 
@@ -1848,21 +1851,22 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
 
+   sg_count = scsi_dma_map(scmnd);
+   if (sg_count < 0)
+   return SCSI_MLQUEUE_DEVICE_BUSY;
 
-   for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
+   for_each_sg(sgl, sg, sg_count, j) {
/*
-* Init values for the current sgl entry. hvpgoff
-* and hvpfns_to_add are in units of Hyper-V size
-* pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
-* case also handles values of sgl->offset that are
-* larger than PAGE_SIZE. Such offsets are handled
-* even on other than the first sgl entry, provided
-* they are a multiple of PAGE_SIZE.
+* Init values for the current sgl entry. hvpfns_to_add
+* is in units of Hyper-V size pages. Handling the
+* PAGE_SIZE != HV_HYP_PAGE_SIZE case also handles
+* values of sgl->offset that are larger than PAGE_SIZE.
+* Such offsets are handled even on other than the first
+* sgl entry, provided they are a multiple of PAGE_SIZE.
 */
-   hvpgoff = HVPFN_DOWN(sgl->offset);
-   hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
-   

[PATCH V7 3/5] hyper-v: Enable swiotlb bounce buffer for Isolation VM

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

A hyperv Isolation VM requires bounce buffer support to copy
data from/to encrypted memory, so enable swiotlb force
mode to use the swiotlb bounce buffer for DMA transactions.

In an Isolation VM with AMD SEV, the bounce buffer needs to be
accessed via an extra address space which is above shared_gpa_boundary
(e.g. the 39-bit address line) reported by the Hyper-V CPUID leaf
ISOLATION_CONFIG. The accessed physical address will be the original
physical address + shared_gpa_boundary. The shared_gpa_boundary in
the AMD SEV-SNP spec is called the virtual top of memory (vTOM).
Memory addresses below vTOM are automatically treated as private
while memory above vTOM is treated as shared.
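
The address math is simple; a hypothetical helper makes it concrete:

#include <linux/types.h>

/* Hypothetical helper: form the shared (host-visible) alias of a
 * guest physical address in an SEV-SNP Isolation VM with vTOM. */
static inline phys_addr_t to_shared_gpa(phys_addr_t pa, u64 shared_gpa_boundary)
{
	/* Below vTOM: private (encrypted). The alias above vTOM is
	 * the same memory, treated as shared with the host. */
	return pa + shared_gpa_boundary;
}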

The swiotlb bounce buffer code calls set_memory_decrypted()
to mark the bounce buffer visible to the host and maps it in the
extra address space via memremap(). Populate the shared_gpa_boundary
(vTOM) via the swiotlb_unencrypted_base variable.

memremap() can't work that early in boot
(e.g. in ms_hyperv_init_platform()), so call
swiotlb_update_mem_attributes() in hyperv_init() instead.

Signed-off-by: Tianyu Lan 
---
Change since v6:
* Fix compile error when swiotlb is not enabled.

Change since v4:
* Remove Hyper-V IOMMU IOMMU_INIT_FINISH related functions
  and set SWIOTLB_FORCE and swiotlb_unencrypted_base in the
  ms_hyperv_init_platform(). Call swiotlb_update_mem_attributes()
  in the hyperv_init().

Change since v3:
* Add comment in pci-swiotlb-xen.c to explain why add
  dependency between hyperv_swiotlb_detect() and pci_
  xen_swiotlb_detect().
* Return directly when fails to allocate Hyper-V swiotlb
  buffer in the hyperv_iommu_swiotlb_init().
---
 arch/x86/hyperv/hv_init.c  | 12 
 arch/x86/kernel/cpu/mshyperv.c | 15 ++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 24f4a06ac46a..749906a8e068 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int hyperv_init_cpuhp;
 u64 hv_current_partition_id = ~0ull;
@@ -502,6 +503,17 @@ void __init hyperv_init(void)
 
/* Query the VMs extended capability once, so that it can be cached. */
hv_query_ext_cap(0);
+
+#ifdef CONFIG_SWIOTLB
+   /*
+* Swiotlb bounce buffer needs to be mapped in extra address
+* space. Map function doesn't work in the early place and so
+* call swiotlb_update_mem_attributes() here.
+*/
+   if (hv_is_isolation_supported())
+   swiotlb_update_mem_attributes();
+#endif
+
return;
 
 clean_guest_os_id:
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 4794b716ec79..e3a240c5e4f5 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -319,8 +320,20 @@ static void __init ms_hyperv_init_platform(void)
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 
0x%x\n",
ms_hyperv.isolation_config_a, 
ms_hyperv.isolation_config_b);
 
-		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
+		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
 			static_branch_enable(&isolation_type_snp);
+#ifdef CONFIG_SWIOTLB
+			swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
+#endif
+		}
+
+#ifdef CONFIG_SWIOTLB
+   /*
+* Enable swiotlb force mode in Isolation VM to
+* use swiotlb bounce buffer for dma transaction.
+*/
+   swiotlb_force = SWIOTLB_FORCE;
+#endif
}
 
if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
-- 
2.25.1



[PATCH V7 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides Isolation VM for confidential computing support, and
guest memory is encrypted in it. Places checking cc_platform_has()
with the GUEST_MEM_ENCRYPT attr should return "true" in an Isolation
VM, e.g. the swiotlb bounce buffer size needs to be adjusted according
to the memory size in sev_setup_arch(). Add a GUEST_MEM_ENCRYPT check
for Hyper-V Isolation VM.

Signed-off-by: Tianyu Lan 
---
Change since v6:
* Change the order in the cc_platform_has() and check sev first.

Change since v3:
* Change code style of checking GUEST_MEM attribute in the
  hyperv_cc_platform_has().
---
 arch/x86/kernel/cc_platform.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 03bb2f343ddb..6cb3a675e686 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
@@ -58,12 +59,19 @@ static bool amd_cc_platform_has(enum cc_attr attr)
 #endif
 }
 
+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+   return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
+}
 
 bool cc_platform_has(enum cc_attr attr)
 {
if (sme_me_mask)
return amd_cc_platform_has(attr);
 
+   if (hv_is_isolation_supported())
+   return hyperv_cc_platform_has(attr);
+
return false;
 }
 EXPORT_SYMBOL_GPL(cc_platform_has);
-- 
2.25.1



[PATCH V7 0/5] x86/Hyper-V: Add Hyper-V Isolation VM support(Second part)

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host-visibility hvcall, and
the guest needs to call this hvcall to mark memory visible to the host
before sharing it with the host. For security, network/storage stack
memory should not be shared with the host, so bounce buffers are
required.

The vmbus channel ring buffer already plays the bounce buffer role,
because all data from/to the host needs to be copied between the ring
buffer and IO stack memory. So mark the vmbus channel ring buffer
visible.

For an SNP Isolation VM, the guest needs to access the shared memory
via an extra address space which is specified by the Hyper-V CPUID
leaf HYPERV_CPUID_ISOLATION_CONFIG. The accessed physical address of
the shared memory should be the bounce buffer memory GPA plus the
shared_gpa_boundary reported by CPUID.

This patchset enables the swiotlb bounce buffer for the netvsc/storvsc
drivers in Isolation VM.

Change since v6:
* Fix compile error in hv_init.c and mshyperv.c when swiotlb
  is not enabled.
* Change the order in the cc_platform_has() and check sev first. 

Change since v5:
* Modify "Swiotlb" to "swiotlb" in commit log.
* Remove CONFIG_HYPERV check in the hyperv_cc_platform_has()

Change since v4:
* Remove Hyper-V IOMMU IOMMU_INIT_FINISH related functions
  and set SWIOTLB_FORCE and swiotlb_unencrypted_base in the
  ms_hyperv_init_platform(). Call swiotlb_update_mem_attributes()
  in the hyperv_init().

Change since v3:
* Fix boot up failure on the host with mem_encrypt=on.
  Move calling of set_memory_decrypted() back from
  swiotlb_init_io_tlb_mem to swiotlb_late_init_with_tbl()
  and rmem_swiotlb_device_init().
* Change code style of checking GUEST_MEM attribute in the
  hyperv_cc_platform_has().
* Add comment in pci-swiotlb-xen.c to explain why add
  dependency between hyperv_swiotlb_detect() and pci_
  xen_swiotlb_detect().
* Return directly when fails to allocate Hyper-V swiotlb
  buffer in the hyperv_iommu_swiotlb_init().

Change since v2:
* Remove Hyper-V dma ops and dma_alloc/free_noncontiguous. Add
  hv_map/unmap_memory() to map/umap netvsc rx/tx ring into extra
  address space.
* Leave mem->vaddr in swiotlb code with phys_to_virt(mem->start)
  when fail to remap swiotlb memory.

Change since v1:
* Add Hyper-V Isolation support check in the cc_platform_has()
  and return true for guest memory encrypt attr.
* Remove hv isolation check in the sev_setup_arch()

Tianyu Lan (5):
  swiotlb: Add swiotlb bounce buffer remap function for HV IVM
  x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
  hyper-v: Enable swiotlb bounce buffer for Isolation VM
  scsi: storvsc: Add Isolation VM support for storvsc driver
  net: netvsc: Add Isolation VM support for netvsc driver

 arch/x86/hyperv/hv_init.c |  12 +++
 arch/x86/hyperv/ivm.c |  28 ++
 arch/x86/kernel/cc_platform.c |   8 ++
 arch/x86/kernel/cpu/mshyperv.c|  15 +++-
 drivers/hv/hv_common.c|  11 +++
 drivers/hv/vmbus_drv.c|   4 +
 drivers/net/hyperv/hyperv_net.h   |   5 ++
 drivers/net/hyperv/netvsc.c   | 136 +-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 drivers/scsi/storvsc_drv.c|  37 
 include/asm-generic/mshyperv.h|   2 +
 include/linux/hyperv.h|   6 ++
 include/linux/swiotlb.h   |   6 ++
 kernel/dma/swiotlb.c  |  43 +-
 15 files changed, 294 insertions(+), 22 deletions(-)

-- 
2.25.1



[PATCH V7 1/5] swiotlb: Add swiotlb bounce buffer remap function for HV IVM

2021-12-12 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space which is above shared_gpa_boundary (e.g.
the 39-bit address line) reported by the Hyper-V CPUID leaf
ISOLATION_CONFIG. The accessed physical address will be the original
physical address + shared_gpa_boundary. The shared_gpa_boundary in
the AMD SEV-SNP spec is called the virtual top of memory (vTOM).
Memory addresses below vTOM are automatically treated as private
while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset; the platform calls swiotlb_update_mem_attributes()
to remap the swiotlb memory to the unencrypted address space.
memremap() can not be called at the early boot stage, so put the
remapping code into swiotlb_update_mem_attributes(). Store the remap
address and use it to copy data from/to the swiotlb bounce buffer.
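
Concretely, once the remap succeeds, bounce copies resolve slot
addresses through the remapped alias instead of phys_to_virt(); a
condensed sketch of that lookup (simplified from the swiotlb_bounce()
change in this patch):

#include <linux/swiotlb.h>

/* Simplified from the swiotlb_bounce() hunk of this patch. */
static unsigned char *slot_vaddr(struct io_tlb_mem *mem, phys_addr_t tlb_addr)
{
	/* mem->vaddr is the (possibly remapped) base of the pool. */
	return (unsigned char *)mem->vaddr + (tlb_addr - mem->start);
}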

Acked-by: Christoph Hellwig 
Signed-off-by: Tianyu Lan 
---
Change since v3:
* Fix boot up failure on the host with mem_encrypt=on.
  Move calling of set_memory_decrypted() back from
  swiotlb_init_io_tlb_mem to swiotlb_late_init_with_tbl()
  and rmem_swiotlb_device_init().

Change since v2:
* Leave mem->vaddr with phys_to_virt(mem->start) when fail
  to remap swiotlb memory.

Change since v1:
* Rework comment in the swiotlb_init_io_tlb_mem()
* Make swiotlb_init_io_tlb_mem() back to return void.
---
 include/linux/swiotlb.h |  6 ++
 kernel/dma/swiotlb.c| 43 +++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:   The end address of the swiotlb memory pool. Used to do a quick
  * range check to see if the memory was in fact allocated by this
  * API.
+ * @vaddr: The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ * may be remapped in the memory encrypted case and store virtual
+ * address for bounce buffer operation.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
  * @end. For default swiotlb, this is command line adjustable via
  * setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
 struct io_tlb_mem {
phys_addr_t start;
phys_addr_t end;
+   void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
 }
 #endif /* CONFIG_DMA_RESTRICTED_POOL */
 
+extern phys_addr_t swiotlb_unencrypted_base;
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..34e6ade4f73c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
 
 struct io_tlb_mem io_tlb_default_mem;
 
+phys_addr_t swiotlb_unencrypted_base;
+
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }
 
+/*
+ * Remap swiotlb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Failed to map the unencrypted memory %llx size 
%lx.\n",
+  paddr, bytes);
+   }
+
+   return vaddr;
+}
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations.  This function allows the architecture to
@@ -172,7 +196,12 @@ void __init swiotlb_update_mem_attributes(void)
vaddr = phys_to_virt(mem->start);
bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
-   memset(vaddr, 0, bytes);
+
+   mem->vaddr = swiotlb_mem_remap(mem, bytes);
+   if (!mem->vaddr)
+   mem->vaddr = vaddr;
+
+   memset(mem->vaddr, 0, bytes);
 }
 
 static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
@@ -196,7 +225,17 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
mem->slots[i].alloc_size 

Re: [PATCH 1/4] dt-bindings: memory: mediatek: Correct the minItems of clk for larbs

2021-12-12 Thread Yong Wu
On Fri, 2021-12-03 at 17:34 -0600, Rob Herring wrote:
> On Fri, 03 Dec 2021 14:40:24 +0800, Yong Wu wrote:
> > If a platform's larbs support gals, some larbs will have one more
> > "gals" clock while the others still need only the "apb"/"smi"
> > clocks. Then the minItems is 2 and the maxItems is 3.
> > 
> > Fixes: 27bb0e42855a ("dt-bindings: memory: mediatek: Convert SMI to
> > DT schema")
> > Signed-off-by: Yong Wu 
> > ---
> >  .../bindings/memory-controllers/mediatek,smi-larb.yaml | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> 
> Running 'make dtbs_check' with the schema in this patch gives the
> following warnings. Consider if they are expected or the schema is
> incorrect. These may not be new warnings.
> 
> Note that it is not yet a requirement to have 0 warnings for
> dtbs_check.
> This will change in the future.
> 
> Full log is available here: 
> https://patchwork.ozlabs.org/patch/1563127
> 
> 
> larb@14016000: 'mediatek,larb-id' is a required property
>   arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

I will fix this in the next version. This property is not needed on mt8167.

> 
> larb@14017000: clock-names: ['apb', 'smi'] is too short
>   arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml

Some larbs only have two clocks (apb/smi) on mt8183, thus this looks
reasonable to me. I won't fix this in the next version.

Please tell me if I missed something.
Thanks.

> 
> larb@15001000: 'mediatek,larb-id' is a required property
>   arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml
> 
> larb@1601: clock-names: ['apb', 'smi'] is too short
>   arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml
> 
> larb@1601: 'mediatek,larb-id' is a required property
>   arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml
> 
> larb@1701: clock-names: ['apb', 'smi'] is too short
>   arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
>   arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml

Re: [patch V3 34/35] soc: ti: ti_sci_inta_msi: Get rid of ti_sci_inta_msi_get_virq()

2021-12-12 Thread Vinod Koul
On 10-12-21, 23:19, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> Just use the core function msi_get_virq().

Acked-By: Vinod Koul 

-- 
~Vinod


Re: [patch V3 35/35] dmaengine: qcom_hidma: Cleanup MSI handling

2021-12-12 Thread Vinod Koul
On 10-12-21, 23:19, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> There is no reason to walk the MSI descriptors to retrieve the interrupt
> number for a device. Use msi_get_virq() instead.

Acked-By: Vinod Koul 

-- 
~Vinod


Re: [patch V3 29/35] dmaengine: mv_xor_v2: Get rid of msi_desc abuse

2021-12-12 Thread Vinod Koul
On 10-12-21, 23:19, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> Storing a pointer to the MSI descriptor just to keep track of the Linux
> interrupt number is daft. Use msi_get_virq() instead.

Acked-By: Vinod Koul 

-- 
~Vinod


Re: [PATCH v3 04/18] driver core: platform: Add driver dma ownership management

2021-12-12 Thread Lu Baolu

On 12/10/21 9:23 AM, Lu Baolu wrote:

Hi Greg, Jason and Christoph,

On 12/9/21 9:20 AM, Lu Baolu wrote:

On 12/7/21 9:16 PM, Jason Gunthorpe wrote:

On Tue, Dec 07, 2021 at 10:57:25AM +0800, Lu Baolu wrote:

On 12/6/21 11:06 PM, Jason Gunthorpe wrote:

On Mon, Dec 06, 2021 at 06:36:27AM -0800, Christoph Hellwig wrote:

I really hate the amount of boilerplate code that having this in each
bus type causes.

+1

I liked the first version of this series better with the code near
really_probe().

Can we go back to that with some device_configure_dma() wrapper
conditionally called by really_probe() as we discussed?


[...]



diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f949daa..68ca5a579eb1 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -538,6 +538,32 @@ static int call_driver_probe(struct device *dev, struct device_driver *drv)
 	return ret;
 }
 
+static int device_dma_configure(struct device *dev, struct device_driver *drv)
+{
+	int ret;
+
+	if (!dev->bus->dma_configure)
+		return 0;
+
+	ret = dev->bus->dma_configure(dev);
+	if (ret)
+		return ret;
+
+	if (!drv->suppress_auto_claim_dma_owner)
+		ret = iommu_device_set_dma_owner(dev, DMA_OWNER_DMA_API, NULL);
+
+	return ret;
+}
+
+static void device_dma_cleanup(struct device *dev, struct device_driver *drv)
+{
+	if (!dev->bus->dma_configure)
+		return;
+
+	if (!drv->suppress_auto_claim_dma_owner)
+		iommu_device_release_dma_owner(dev, DMA_OWNER_DMA_API);
+}
+
 static int really_probe(struct device *dev, struct device_driver *drv)
 {
 	bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
@@ -574,11 +600,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	if (ret)
 		goto pinctrl_bind_failed;
 
-	if (dev->bus->dma_configure) {
-		ret = dev->bus->dma_configure(dev);
-		if (ret)
-			goto probe_failed;
-	}
+	if (device_dma_configure(dev, drv))
+		goto pinctrl_bind_failed;
 
 	ret = driver_sysfs_add(dev);
 	if (ret) {
@@ -660,6 +683,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
+
+	device_dma_cleanup(dev, drv);
 pinctrl_bind_failed:
 	device_links_no_driver(dev);
 	devres_release_all(dev);
@@ -1204,6 +1229,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 	else if (drv->remove)
 		drv->remove(dev);
 
+	device_dma_cleanup(dev, drv);
 	device_links_driver_cleanup(dev);
 
 	devres_release_all(dev);
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index a498ebcf4993..374a3c2cc10d 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -100,6 +100,7 @@ struct device_driver {
 	const char	*mod_name;	/* used for built-in modules */
 
 	bool suppress_bind_attrs;	/* disables bind/unbind via sysfs */
+	bool suppress_auto_claim_dma_owner;
 	enum probe_type probe_type;
 
 	const struct of_device_id	*of_match_table;


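If this direction is taken, a driver that needs to claim DMA ownership
itself (the vfio-style case) would opt out of the automatic claim via
the new flag. A hypothetical example (names illustrative, assuming the
flag proposed above):

#include <linux/module.h>
#include <linux/platform_device.h>

static int demo_probe(struct platform_device *pdev)
{
	/*
	 * A vfio-style driver would claim ownership explicitly here,
	 * e.g. via iommu_device_set_dma_owner(), before exposing the
	 * device to userspace.
	 */
	return 0;
}

static struct platform_driver demo_user_dma_driver = {
	.probe = demo_probe,
	.driver = {
		.name = "demo-user-dma",
		.suppress_auto_claim_dma_owner = true,
	},
};
module_platform_driver(demo_user_dma_driver);
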
Does this work for you? Can I work towards this in the next version?


A gentle ping ... Is this heading in the right direction? I need your
advice to move ahead. :-)

Best regards,
baolu

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 01:12:05AM +0100, Thomas Gleixner wrote:

>  PCI/MSI and PCI/MSI-X are just implementations of IMS
> 
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.

And the mess we have is that virtualization broke this
design. Virtualization made MSI/MSI-X special!

I am wondering if we just need to bite the bullet and force the
introduction of a new ACPI flag for the APIC that says one of:
- message addr/data pairs work correctly (baremetal)
- creating message addr/data pairs needs to use a hypercall protocol
- property not defined so assume only MSI/MSI-X/etc work.

Intel was originally trying to do this with the 'IMS enabled' PCI
Capability block, but a per PCI device capability is in the wrong
layer.

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> Kevin,
> 
> On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> >> vIR related there.
> >
> > Well, virtio-iommu is a para-virtualized vIOMMU implementation.
> >
> > In reality there are also fully emulated vIOMMU implementations (e.g.
> > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > the IR logic in the existing iommu drivers just applies:
> >
> > drivers/iommu/intel/irq_remapping.c
> > drivers/iommu/amd/iommu.c
> 
> thanks for the explanation. So that's a full IOMMU emulation. I was more
> expecting a paravirtualized lightweight one.

Kevin can you explain what on earth vIR is for and how does it work??

Obviously we don't expose the IR machinery to userspace, so at best
this is somehow changing what the MSI trap does?

Jason


Re: [PATCH 1/4] ioasid: Reserve a global PASID for in-kernel DMA

2021-12-12 Thread Jason Gunthorpe via iommu
On Sat, Dec 11, 2021 at 08:39:12AM +, Tian, Kevin wrote:

> Uniqueness is not the main argument for using global PASIDs for
> SWQ, since a PASID can be defined either in the per-RID or in the
> global PASID space. No SVA architecture can allow two processes to
> use the same PASID to submit work unless they share the mm!
> 
> IMO the real reason is that SWQ for user SVA must be accessed 
> via ENQCMD instruction which fetches the PASID from a CPU MSR

This really should have been inside a comment in the struct mm

"pasid is the value used by x86 ENQCMD"

(and if we phrase it that way I wonder why it is in a struct mm not
some process or task related struct, since it has nothing to do with
page tables)

And, IMHO, the IOMMU part of the code should avoid using this
field. IOMMU should be able to create arbitrarily many "SVA"
iommu_domains for use by PASID even if they can't be used with
ENQCMD. Such is proper layering.

Jason

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 08:44:46AM +0200, Mika Penttilä wrote:

> > /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> > mmapped
> >   * which allows direct access to non-MSIX registers which happened to be 
> > within
> >   * the same system page.
> >   *
> >   * Even though the userspace gets direct access to the MSIX data, the 
> > existing
> >   * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX 
> > configuration.
> >   */
> > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE  3
> > 
> > IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> > other MMIO registers. Trapping MSI-X leads to performance downgrade on
> > accesses to adjacent registers. MSI-X can be mapped by userspace because
> > PPC already uses a hypercall mechanism for interrupts. Though unclear
> > about the detail, it sounds like a similar usage to the one proposed here.
> > 
> > Thanks
> > Kevin
>
> I see  VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set so if msix table is
> in its own bar, qemu never traps/emulates the access. 

It is some backwards compat: the kernel always sets it to indicate a
new kernel; that doesn't mean qemu doesn't trap.

As the comment says, "VFIO_DEVICE_SET_IRQS interface must still be
used for MSIX configuration", so there is no way qemu can meet that
without either trapping the MSI page or using a special hypercall
(ppc).

Jason

RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Thomas Gleixner
Kevin,

On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
>> From: Thomas Gleixner 
>> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
>> vIR related there.
>
> Well, virtio-iommu is a para-virtualized vIOMMU implementation.
>
> In reality there are also fully emulated vIOMMU implementations (e.g.
> Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> the IR logic in the existing iommu drivers just applies:
>
>   drivers/iommu/intel/irq_remapping.c
>   drivers/iommu/amd/iommu.c

thanks for the explanation. So that's a full IOMMU emulation. I was more
expecting a paravirtualized lightweight one.

Thanks,

tglx


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Thomas Gleixner
Kevin,

On Sun, Dec 12 2021 at 02:14, Kevin Tian wrote:
>> From: Thomas Gleixner 
> I just continue the thought exercise along that direction to see what
> the host flow will be like (step by step). Looking at the current
> implementation is just one necessary step in that exercise to
> help refine the picture. When I find something which may be
> worth aligning on, I share it to avoid following a wrong direction
> too far.
>
> If both of your think it simply adds noise to this discussion, I can
> surely hold back and focus on 'concept' only.

All good. We _want_ your participation for sure. Comparing and
contrasting it to the existing flow is fine.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Mika Penttilä



On 10.12.2021 9.36, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Friday, December 10, 2021 4:59 AM

On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:

On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:

On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
MSI code needs to write an addr/data pair compatible with the emulation
and the IMS code needs to write an addr/data pair from the
hypercall. Seems like this scenario is best avoided!

 From this perspective I haven't connected how virtual interrupt
remapping helps in the guest? Is this a way to provide the hypercall
I'm imagining above?

That was my thought to avoid having different mechanisms.

The address/data pair is computed in two places:

   1) Activation of an interrupt
   2) Affinity setting on an interrupt

Both configure the IRTE when interrupt remapping is in place.

In both cases a vector is allocated in the vector domain and based on
the resulting target APIC / vector number pair the IRTE is
(re)configured.

So putting the hypercall into the vIRTE update is the obvious
place. Both activation and affinity setting can fail and propagate an
error code down to the originating caller.

Hmm?

Okay, I think I get it. Would be nice to have someone from intel
familiar with the vIOMMU protocols and qemu code remark what the
hypervisor side can look like.

There is a bit more work here, we'd have to change VFIO to somehow
entirely disconnect the kernel IRQ logic from the MSI table and
directly pass control of it to the guest after the hypervisor IOMMU IR
secures it, i.e. directly mmap the msi-x table into the guest.


It's supported already:

/*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
  * the same system page.
  *
  * Even though the userspace gets direct access to the MSIX data, the existing
  * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
  */
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE  3

IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
other MMIO registers. Trapping MSI-X leads to performance downgrade on
accesses to adjacent registers. MSI-X can be mapped by userspace because
PPC already uses a hypercall mechanism for interrupts. Though unclear
about the detail, it sounds like a similar usage to the one proposed here.

Thanks
Kevin


I see VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set, so if the msix
table is in its own bar, qemu never traps/emulates the access. On the
other hand, qemu is said to depend on emulating masking. So how is this
supposed to work in case the table is not in the config bar?


Thanks,
Mika

