Re: [PATCH] net: move from strlcpy with unused retval to strscpy

2022-08-19 Thread Tom Lendacky via Virtualization

On 8/18/22 16:00, Wolfram Sang wrote:

Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: 
https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=v6a6g1ouzcprm...@mail.gmail.com/
Signed-off-by: Wolfram Sang 
---



diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
index 6ceb1cdf6eba..6e83ff59172a 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
@@ -402,8 +402,8 @@ static void xgbe_get_drvinfo(struct net_device *netdev,
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	struct xgbe_hw_features *hw_feat = &pdata->hw_feat;
 
-	strlcpy(drvinfo->driver, XGBE_DRV_NAME, sizeof(drvinfo->driver));
-	strlcpy(drvinfo->bus_info, dev_name(pdata->dev),
+	strscpy(drvinfo->driver, XGBE_DRV_NAME, sizeof(drvinfo->driver));
+	strscpy(drvinfo->bus_info, dev_name(pdata->dev),
 		sizeof(drvinfo->bus_info));
 	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "%d.%d.%d",
 		 XGMAC_GET_BITS(hw_feat->version, MAC_VR, USERVER),
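
For context on the "1:1" claim in the commit message: strlcpy() and strscpy()
both truncate and NUL-terminate, they only disagree on the return value, so
dropping that value makes the replacement mechanical. A small illustrative
sketch, not part of the patch:

  char buf[8];

  strlcpy(buf, "0123456789", sizeof(buf)); /* buf = "0123456", returns strlen(src) == 10 */
  strscpy(buf, "0123456789", sizeof(buf)); /* buf = "0123456", returns -E2BIG on truncation */

xgbe_get_drvinfo() ignores the return value, so the two calls behave
identically here; strscpy() additionally never reads past an unterminated
source.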


For drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c

Acked-by: Tom Lendacky 



Re: [PATCH v3 00/10] x86/sev: KEXEC/KDUMP support for SEV-ES guests

2022-04-29 Thread Tom Lendacky via Virtualization

On 4/29/22 04:06, Tao Liu wrote:

On Thu, Jan 27, 2022 at 11:10:34AM +0100, Joerg Roedel wrote:




Hi Joerg,

I tried the patch set with 5.17.0-rc1 kernel, and I have a few questions:

1) Is it a bug, or should qemu-kvm 6.2.0 be patched with a specific patch?
I found it will exit with 0 when I tried to reboot the VM with sev-es
enabled. However, with only sev enabled, the VM can reboot with no problem:


Qemu was specifically patched to exit on reboot with SEV-ES guests. Qemu 
performs a reboot by resetting the vCPU state, which can't be done with an 
SEV-ES guest because the vCPU state is encrypted.




[root@dell-per7525-03 ~]# virsh start TW-SEV-ES --console

Fedora Linux 35 (Server Edition)
Kernel 5.17.0-rc1 on an x86_64 (ttyS0)

[root@fedora ~]# reboot
.
[   48.077682] reboot: Restarting system
[   48.078109] reboot: machine restart
^^^ guest vm reached restart
[root@dell-per7525-03 ~]# echo $?
0
^^^ qemu-kvm exit with 0, no reboot back to normal VM kernel
[root@dell-per7525-03 ~]#

2) With sev-es enabled and the 2 patch sets applied, A) [PATCH v3 00/10]
x86/sev: KEXEC/KDUMP support for SEV-ES guests and B) [PATCH v6 0/7] KVM:
SVM: Add initial GHCB protocol version 2 support, I can enable kdump and
have a vmcore generated:

[root@fedora ~]# dmesg|grep -i sev
[0.030600] SEV: Hypervisor GHCB protocol version support: min=1 max=2
[0.030602] SEV: Using GHCB protocol version 2
[0.296144] AMD Memory Encryption Features active: SEV SEV-ES
[0.450991] SEV: AP jump table Blob successfully set up
[root@fedora ~]# kdumpctl status
kdump: Kdump is operational

However, without the 2 patch sets, I can also enable kdump and have a
vmcore generated:

[root@fedora ~]# dmesg|grep -i sev
[0.295754] AMD Memory Encryption Features active: SEV SEV-ES
^^^ patch sets A & B not applied, so only this string is present.
[root@fedora ~]# echo c > /proc/sysrq-trigger
...
[2.759403] kdump[549]: saving vmcore-dmesg.txt to 
/sysroot/var/crash/127.0.0.1-2022-04-18-05:58:50/
[2.804355] kdump[555]: saving vmcore-dmesg.txt complete
[2.806915] kdump[557]: saving vmcore
^ vmcore can still be generated
...
[7.068981] reboot: Restarting system
[7.069340] reboot: machine restart

[root@dell-per7525-03 ~]# echo $?
0
^^^ same exit issue as question 1.

I don't have a complete technical background on the patch set, but isn't
this the issue which this patch set is trying to solve? Or did I miss something?


The main goal of this patch set is really to enable the ability to
perform a kexec. I would expect kdump to work, since kdump shuts down all
but the executing vCPU and performs its operations before "rebooting"
(which will exit Qemu, as I mentioned above). But kexec requires the APs
to be restarted from within the guest after they have been stopped.
That requires specific support and actions on the part of the guest kernel
in how the APs are stopped and restarted.


Thanks,
Tom



Thanks,
Tao Liu
  



Re: [PATCH 2/2] virtio: replace arch_has_restricted_virtio_memory_access()

2022-04-27 Thread Tom Lendacky via Virtualization

On 4/27/22 07:37, Juergen Gross wrote:

On 27.04.22 14:28, Borislav Petkov wrote:

On Wed, Apr 27, 2022 at 08:37:31AM +0200, Juergen Gross wrote:

On 26.04.22 19:35, Borislav Petkov wrote:

On Tue, Apr 26, 2022 at 03:40:21PM +0200, Juergen Gross wrote:

   /* protected virtualization */
   static void pv_init(void)
   {
   if (!is_prot_virt_guest())
   return;
+    platform_set_feature(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);


Kinda long-ish for my taste. I'll probably call it:

platform_set()

as it is implicit that it sets a feature bit.


Okay, fine with me.



diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c

index b43bc24d2bb6..6043ba6cd17d 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -40,6 +40,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -566,6 +567,10 @@ void __init sme_enable(struct boot_params *bp)
   } else {
   /* SEV state cannot be controlled by a command line option */
   sme_me_mask = me_mask;
+
+    /* Set restricted memory access for virtio. */
+    platform_set_feature(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);


This is way early in the boot, but it appears that marking the platform
feature bitmap as __read_mostly puts it in the .data section, which avoids
the issue of the .bss being cleared.
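
As a rough illustration of the section-placement point (a sketch with assumed
names, not the actual patch):

  /* __read_mostly lands the bitmap in .data (.data..read_mostly) rather than
   * .bss, so a bit set this early in boot is not wiped when .bss is cleared. */
  static unsigned long platform_features __read_mostly;

  void platform_set(unsigned int feature)
  {
          platform_features |= BIT(feature);
  }

  bool platform_has(unsigned int feature)
  {
          return platform_features & BIT(feature);
  }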


TDX support also uses the arch_has_restricted_virtio_memory_access() 
function and will need to be updated.


It seems like a lot of changes; I just wonder if the arch_has...()
function couldn't be updated to also include a Xen check?


Thanks,
Tom



Huh, what does that have to do with SME?


I picked the function where sev_status is being set, as this seemed to be
the correct place to set the feature bit.


What I don't understand is what does restricted memory access have to do
with AMD SEV and how does play together with what you guys are trying to
do?

The big picture pls.


Ah, okay.

For virtio support with Xen we want to not only support virtio devices
as KVM does, but also use grants to let the guest decide which pages
are allowed to be mapped by the backend (dom0).

Instead of guest physical addresses, the guest will use grant-ids (plus
an offset). In order to be able to handle this at the basic virtio level
instead of in the individual virtio device drivers, we need to use
dedicated dma-ops. And those will be used by virtio only if the
"restricted virtio memory access" flag is set, which is used by SEV, too.
In order to let virtio set this flag, we need a way to communicate to
virtio that the running system is either a SEV guest or a Xen guest.

HTH,


Juergen
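
For reference, the direction sketched above could end up looking roughly like
this in the virtio core (the helper and feature-bit names are assumptions, not
the actual patch); SEV, TDX and Xen would each set the same platform feature
bit:

  static int virtio_features_ok(struct virtio_device *dev)
  {
          if (platform_has(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS) &&
              !virtio_has_feature(dev, VIRTIO_F_ACCESS_PLATFORM)) {
                  dev_warn(&dev->dev,
                           "device must provide VIRTIO_F_ACCESS_PLATFORM\n");
                  return -ENODEV;
          }

          return 0;
  }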


Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/20/21 11:50 AM, Sathyanarayanan Kuppuswamy wrote:



On 10/20/21 9:39 AM, Tom Lendacky wrote:

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov" 

Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with VMM must be shared
explicitly. The same rule applies for any DMA to and from TDX guest.
All DMA pages have to be marked as shared pages. A generic way to achieve
this without any changes to device drivers is to use the SWIOTLB
framework.

This method of handling is similar to AMD SEV. So extend this support
for the TDX guest as well. Also, since there is some common code between
AMD SEV and TDX guests in mem_encrypt_init(), move it to
mem_encrypt_common.c and call the AMD-specific init function from it.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Kuppuswamy Sathyanarayanan 


---

Changes since v4:
  * Replaced prot_guest_has() with cc_guest_has().

Changes since v3:
  * Rebased on top of Tom Lendacky's protected guest
    changes 
(https://lore.kernel.org/patchwork/cover/1468760/)



Changes since v1:
  * Removed sme_me_mask check for amd_mem_encrypt_init() in 
mem_encrypt_init().


  arch/x86/include/asm/mem_encrypt_common.h |  3 +++
  arch/x86/kernel/tdx.c |  2 ++
  arch/x86/mm/mem_encrypt.c |  5 +
  arch/x86/mm/mem_encrypt_common.c  | 14 ++
  4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h 
b/arch/x86/include/asm/mem_encrypt_common.h

index 697bc40a4e3d..bc90e565bce4 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,14 @@
  #ifdef CONFIG_AMD_MEM_ENCRYPT
  bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
  #else /* CONFIG_AMD_MEM_ENCRYPT */
  static inline bool amd_force_dma_unencrypted(struct device *dev)
  {
  return false;
  }
+
+static inline void amd_mem_encrypt_init(void) {}
  #endif /* CONFIG_AMD_MEM_ENCRYPT */
  #endif
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 433f366ca25c..ce8e3019b812 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -12,6 +12,7 @@
  #include 
  #include 
  #include  /* force_sig_fault() */
+#include 
  /* TDX Module call Leaf IDs */
  #define TDX_GET_INFO    1
@@ -577,6 +578,7 @@ void __init tdx_early_init(void)
  pv_ops.irq.halt = tdx_halt;
  legacy_pic = &null_legacy_pic;
+    swiotlb_force = SWIOTLB_FORCE;
  cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
    NULL, tdx_cpu_offline_prepare);
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5d7fbed73949..8385bc4565e9 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
  }
  /* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
  {
  if (!sme_me_mask)
  return;
-    /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-    swiotlb_update_mem_attributes();
-
  /*
   * With SEV, we need to unroll the rep string I/O instructions,
   * but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c 
b/arch/x86/mm/mem_encrypt_common.c

index 119a9056efbb..6fe44c6cb753 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
  #include 
  #include 
  #include 
+#include 
  /* Override for DMA direct allocation check - 
ARCH_HAS_FORCE_DMA_UNENCRYPTED */

  bool force_dma_unencrypted(struct device *dev)
@@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
  return false;
  }
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+    /*
+ * For TDX guest or SEV/SME, call into SWIOTLB to update
+ * the SWIOTLB DMA buffers
+ */
+    if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))


Can't you just make this:

 if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))

SEV will return true if sme_me_mask is not zero and TDX should only 
return true if it is TDX guest, right?


Yes. It can be simplified.

But where shall we leave this function cc_platform.c or here?


Either one works...  all depends on how the maintainers feel about 
creating/using mem_encrypt_common.c or using cc_platform.c.


Thanks,
Tom





Thanks,
Tom


+    swiotlb_update_mem_attributes();

Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/20/21 11:45 AM, Sathyanarayanan Kuppuswamy wrote:

On 10/20/21 9:33 AM, Tom Lendacky wrote:

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:


...


  bool force_dma_unencrypted(struct device *dev)
  {
-    return amd_force_dma_unencrypted(dev);
+    if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
+    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+    return true;
+
+    if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+    cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+    return amd_force_dma_unencrypted(dev);
+
+    return false;


Assuming the original force_dma_unencrypted() function was moved here or to
cc_platform.c, then you shouldn't need any changes. Both SEV and TDX 
require true be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And 
then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.



For non TDX case, in CC_ATTR_HOST_MEM_ENCRYPT, we should still call
amd_force_dma_unencrypted() right?


What I'm saying is that you wouldn't have amd_force_dma_unencrypted(). I 
think the whole force_dma_unencrypted() can exist as-is in a different 
file, whether that's cc_platform.c or mem_encrypt_common.c.


It will return true for an SEV or TDX guest, true for an SME host based on 
the DMA mask or else false. That should work just fine for TDX.
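
A minimal sketch of such a combined helper, assuming the cc_platform
attributes discussed in this thread and the existing SME DMA-mask logic
(illustrative only, not the actual patch):

  bool force_dma_unencrypted(struct device *dev)
  {
          /* SEV and TDX guests: DMA has to use unencrypted/shared pages. */
          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
                  return true;

          /* SME host: only force unencrypted DMA when the device cannot
           * address memory above the C-bit position. */
          if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
                  u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
                  u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
                                                  dev->bus_dma_limit);

                  if (dma_dev_mask <= dma_enc_mask)
                          return true;
          }

          return false;
  }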


Thanks,
Tom






  }
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 527957586f3c..6c531d5cb5fd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -30,6 +30,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "../mm_internal.h"
@@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int 
numpages)

  __pgprot(_PAGE_GLOBAL), 0);
  }
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool 
enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool 
protect)

  {
+    pgprot_t mem_protected_bits, mem_plain_bits;
+    enum tdx_map_type map_type;
  struct cpa_data cpa;
  int ret;
@@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long 
addr, int numpages, bool enc)

  memset(&cpa, 0, sizeof(cpa));
  cpa.vaddr = &addr;
  cpa.numpages = numpages;
-    cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-    cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+    if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
+    mem_protected_bits = __pgprot(0);
+    mem_plain_bits = pgprot_cc_shared_mask();


How about having generic versions for both shared and private that return
the proper value for SEV or TDX? Then this remains looking similar to how
it does now, just replacing the __pgprot() calls with the appropriate
pgprot_cc_{shared,private}_mask().


Makes sense.



Thanks,
Tom


+    } else {
+    mem_protected_bits = __pgprot(_PAGE_ENC);
+    mem_plain_bits = __pgprot(0);
+    }
+
+    if (protect) {
+    cpa.mask_set = mem_protected_bits;
+    cpa.mask_clr = mem_plain_bits;
+    map_type = TDX_MAP_PRIVATE;
+    } else {
+    cpa.mask_set = mem_plain_bits;
+    cpa.mask_clr = mem_protected_bits;
+    map_type = TDX_MAP_SHARED;
+    }
+
  cpa.pgd = init_mm.pgd;
  /* Must avoid aliasing mappings in the highmem code */
@@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long 
addr, int numpages, bool enc)

  vm_unmap_aliases();
  /*
- * Before changing the encryption attribute, we need to flush caches.
+ * Before changing the encryption attribute, flush caches.
+ *
+ * For TDX, guest is responsible for flushing caches on 
private->shared

+ * transition. VMM is responsible for flushing on shared->private.
   */
-    cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+    if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+    if (map_type == TDX_MAP_SHARED)
+    cpa_flush(&cpa, 1);
+    } else {
+    cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+    }
  ret = __change_page_attr_set_clr(&cpa, 1);
@@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long 
addr, int numpages, bool enc)

   */
  cpa_flush(&cpa, 0);
+    if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+    ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
  return ret;
  }
  int set_memory_encrypted(unsigned long addr, int numpages)
  {
-    return __set_memory_enc_dec(addr, numpages, true);
+    return __set_memory_protect(addr, numpages, true);
  }
  EXPORT_SYMBOL_GPL(set_memory_encrypted);
  int set_memory_decrypted(unsigned long addr, int numpages)
  {
-    return __set_memory_enc_dec(addr, numpages, false);
+    return __set_memory_protect(addr, numpages, false);
  }
  EXPORT_SYMBOL_GPL(set_memory_decrypted);





Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov" 

Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with VMM must be shared
explicitly. The same rule applies for any DMA to and from TDX guest.
All DMA pages have to be marked as shared pages. A generic way to achieve
this without any changes to device drivers is to use the SWIOTLB
framework.

This method of handling is similar to AMD SEV. So extend this support
for the TDX guest as well. Also, since there is some common code between
AMD SEV and TDX guests in mem_encrypt_init(), move it to
mem_encrypt_common.c and call the AMD-specific init function from it.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---

Changes since v4:
  * Replaced prot_guest_has() with cc_guest_has().

Changes since v3:
  * Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
  * Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().

  arch/x86/include/asm/mem_encrypt_common.h |  3 +++
  arch/x86/kernel/tdx.c |  2 ++
  arch/x86/mm/mem_encrypt.c |  5 +
  arch/x86/mm/mem_encrypt_common.c  | 14 ++
  4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h 
b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..bc90e565bce4 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,14 @@
  
  #ifdef CONFIG_AMD_MEM_ENCRYPT

  bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
  #else /* CONFIG_AMD_MEM_ENCRYPT */
  static inline bool amd_force_dma_unencrypted(struct device *dev)
  {
return false;
  }
+
+static inline void amd_mem_encrypt_init(void) {}
  #endif /* CONFIG_AMD_MEM_ENCRYPT */
  
  #endif

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 433f366ca25c..ce8e3019b812 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -12,6 +12,7 @@
  #include 
  #include 
  #include  /* force_sig_fault() */
+#include 
  
  /* TDX Module call Leaf IDs */

  #define TDX_GET_INFO  1
@@ -577,6 +578,7 @@ void __init tdx_early_init(void)
pv_ops.irq.halt = tdx_halt;
  
	legacy_pic = &null_legacy_pic;

+   swiotlb_force = SWIOTLB_FORCE;
  
  	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",

  NULL, tdx_cpu_offline_prepare);
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5d7fbed73949..8385bc4565e9 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
  }
  
  /* Architecture __weak replacement functions */

-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
  {
if (!sme_me_mask)
return;
  
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */

-   swiotlb_update_mem_attributes();
-
/*
 * With SEV, we need to unroll the rep string I/O instructions,
 * but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 119a9056efbb..6fe44c6cb753 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */

  bool force_dma_unencrypted(struct device *dev)
@@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
  
  	return false;

  }
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+   /*
+* For TDX guest or SEV/SME, call into SWIOTLB to update
+* the SWIOTLB DMA buffers
+*/
+   if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))


Can't you just make this:

if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))

SEV will return true if sme_me_mask is not zero and TDX should only return 
true if it is TDX guest, right?
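
I.e., roughly (a sketch of the suggested simplification, not the actual patch):

  /* Architecture __weak replacement function */
  void __init mem_encrypt_init(void)
  {
          if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
                  return;

          /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
          swiotlb_update_mem_attributes();

          amd_mem_encrypt_init();
  }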


Thanks,
Tom


+   swiotlb_update_mem_attributes();
+
+   amd_mem_encrypt_init();
+}




Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov" 

Just like MKTME, TDX reassigns bits of the physical address for
metadata.  MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, following up with the relevant TDX hypercall.

Also, do TDX_ACCEPT_PAGE on every 4k page after mapping the GPA range
when converting memory to private. The 4k page size limit is due
to a current TDX spec restriction. Also, if the GPA (range) was
already mapped as an active, private page, the host VMM may remove
the private page from the TD by following the “Removing TD Private
Pages” sequence in the Intel TDX-module specification [1] to safely
block the mapping(s), flush the TLB and cache, and remove the
mapping(s).

BUG() if TDX_ACCEPT_PAGE fails (except for the "previously accepted
page" case), as the guest is completely hosed if it can't access memory.

[1] 
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Tested-by: Kai Huang 
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Kuppuswamy Sathyanarayanan 



...


diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index f063c885b0a5..119a9056efbb 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,9 +9,18 @@
  
  #include 

  #include 
+#include 
  
  /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */

  bool force_dma_unencrypted(struct device *dev)
  {
-   return amd_force_dma_unencrypted(dev);
+   if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
+   cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+   return true;
+
+   if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+   cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+   return amd_force_dma_unencrypted(dev);
+
+   return false;


Assuming the original force_dma_unencrypted() function was moved here or to
cc_platform.c, then you shouldn't need any changes. Both SEV and TDX 
require true be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And 
then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.



  }
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 527957586f3c..6c531d5cb5fd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -30,6 +30,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "../mm_internal.h"
  
@@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int numpages)

__pgprot(_PAGE_GLOBAL), 0);
  }
  
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
  {
+   pgprot_t mem_protected_bits, mem_plain_bits;
+   enum tdx_map_type map_type;
struct cpa_data cpa;
int ret;
  
@@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

	memset(&cpa, 0, sizeof(cpa));
	cpa.vaddr = &addr;
cpa.numpages = numpages;
-   cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-   cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+   if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
+   mem_protected_bits = __pgprot(0);
+   mem_plain_bits = pgprot_cc_shared_mask();


How about having generic versions for both shared and private that return
the proper value for SEV or TDX? Then this remains looking similar to how
it does now, just replacing the __pgprot() calls with the appropriate
pgprot_cc_{shared,private}_mask().
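
A sketch of what such generic helpers could look like (the names come from the
suggestion above; the SEV/TDX selection logic is an assumption):

  static inline pgprot_t pgprot_cc_shared_mask(void)
  {
          if (cc_platform_has(CC_ATTR_GUEST_TDX))
                  return __pgprot(tdx_shared_mask());

          return __pgprot(0);             /* SEV/SME: shared == C-bit clear */
  }

  static inline pgprot_t pgprot_cc_private_mask(void)
  {
          if (cc_platform_has(CC_ATTR_GUEST_TDX))
                  return __pgprot(0);     /* TDX: private == shared bit clear */

          return __pgprot(_PAGE_ENC);     /* SEV/SME: private == C-bit set */
  }

__set_memory_protect() could then fill cpa.mask_set/cpa.mask_clr from these
two helpers without any cc_platform_has() special-casing of its own.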


Thanks,
Tom


+   } else {
+   mem_protected_bits = __pgprot(_PAGE_ENC);
+   mem_plain_bits = __pgprot(0);
+   }
+
+   if (protect) {
+   cpa.mask_set = mem_protected_bits;
+   cpa.mask_clr = mem_plain_bits;
+   map_type = TDX_MAP_PRIVATE;
+   } else {
+   cpa.mask_set = mem_plain_bits;
+   cpa.mask_clr = mem_protected_bits;
+   map_type = TDX_MAP_SHARED;
+   }
+
cpa.pgd = init_mm.pgd;
  
  	/* Must avoid aliasing mappings in the highmem code */

@@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long addr, 

Re: [PATCH v5 01/16] x86/mm: Move force_dma_unencrypted() to common code

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov" 

Intel TDX doesn't allow VMM to access guest private memory. Any memory
that is required for communication with VMM must be shared explicitly
by setting the bit in page table entry. After setting the shared bit,
the conversion must be completed with MapGPA hypercall. Details about
MapGPA hypercall can be found in [1], sec 3.2.

The call informs VMM about the conversion between private/shared
mappings. The shared memory is similar to unencrypted memory in AMD
SME/SEV terminology but the underlying process of sharing/un-sharing
the memory is different for Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted" physical
addresses without the C-bit set. In order for the CPU to interact with
this memory, the CPU needs a decrypted mapping. To add this support,
AMD SME code forces force_dma_unencrypted() to return true for
platforms that support AMD SEV feature. It will be used for DMA memory
allocation API to trigger set_memory_decrypted() for platforms that
support AMD SEV feature.

TDX is similar. So, to communicate with I/O devices, related pages need
to be marked as shared. As mentioned above, shared memory in the TDX
architecture is similar to decrypted memory in AMD SME/SEV. So, similar
to AMD SEV, force_dma_unencrypted() has to be forced to return true. This
support is added in other patches in this series.

So move force_dma_unencrypted() out of AMD specific code and call AMD
specific (amd_force_dma_unencrypted()) initialization function from it.
force_dma_unencrypted() will be modified by later patches to include
Intel TDX guest platform specific initialization.

Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features. This will be selected
by both AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in DMA code and it has no
functional change.


Can force_dma_unencrypted() be moved to arch/x86/kernel/cc_platform.c, 
instead of creating a new file? It might fit better with patch #6.


Thanks,
Tom



[1] - 
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---

Changes since v4:
  * Removed used we/you from commit log.

Change since v3:
  * None

Changes since v1:
  * Removed sev_active(), sme_active() checks in force_dma_unencrypted().

  arch/x86/Kconfig  |  8 ++--
  arch/x86/include/asm/mem_encrypt_common.h | 18 ++
  arch/x86/mm/Makefile  |  2 ++
  arch/x86/mm/mem_encrypt.c |  3 ++-
  arch/x86/mm/mem_encrypt_common.c  | 17 +
  5 files changed, 45 insertions(+), 3 deletions(-)
  create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
  create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af49ad084919..37b27412f52e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1519,16 +1519,20 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.
  
+config X86_MEM_ENCRYPT_COMMON

+   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select DYNAMIC_PHYSICAL_MASK
+   def_bool n
+
  config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
-   select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
-   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
select ARCH_HAS_CC_PLATFORM
+   select X86_MEM_ENCRYPT_COMMON
help
  Say yes to enable support for the encryption of system memory.
  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/mem_encrypt_common.h 
b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index ..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include 
+#include 
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+   return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ 

Re: [PATCH v5 04/16] x86/tdx: Make pages shared in ioremap()

2021-10-20 Thread Tom Lendacky via Virtualization

On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov" 

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure from AMD SEV code.

Note that DMA code doesn't use ioremap() to convert memory to shared, as
DMA buffers are backed by normal memory. DMA code makes buffers shared with
set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---

Changes since v4:
  * Renamed "protected_guest" to "cc_guest".
  * Replaced use of prot_guest_has() with cc_guest_has()
  * Modified the patch to adapt to latest version cc_guest_has()
changes.

Changes since v3:
  * Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
  * Fixed format issues in commit log.

  arch/x86/include/asm/pgtable.h |  4 
  arch/x86/mm/ioremap.c  |  8 ++--
  include/linux/cc_platform.h| 13 +
  3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..ecefccbdf2e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
  #define pgprot_encrypted(prot)__pgprot(__sme_set(pgprot_val(prot)))
  #define pgprot_decrypted(prot)__pgprot(__sme_clr(pgprot_val(prot)))
  
+/* Make the page accessible by VMM for confidential guests */

+#define pgprot_cc_guest(prot) __pgprot(pgprot_val(prot) |  \
+ tdx_shared_mask())


So this is only for TDX guests, so maybe use a name that is less generic.

Alternatively, you could create pgprot_private()/pgprot_shared() functions 
that check for SME/SEV or TDX and do the proper thing.


Then you can redefine pgprot_encrypted()/pgprot_decrypted() to the new 
functions?
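
A sketch of that alternative (assumed implementation; only the names come from
the suggestion above):

  #define pgprot_private(prot)    pgprot_encrypted(prot)

  #define pgprot_shared(prot)                                              \
          (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT) ?            \
                  __pgprot(pgprot_val(prot) | tdx_shared_mask()) :         \
                  pgprot_decrypted(prot))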



+
  #ifndef __ASSEMBLY__
  #include 
  #include 
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..83daa3f8f39c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -17,6 +17,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -26,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "physaddr.h"
  
@@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct resource *res)

  }
  
  /*

- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
   */
  static unsigned int __ioremap_check_encrypted(struct resource *res)
  {
@@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long 
size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+   else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+   prot = pgprot_cc_guest(prot);


And if you do the new functions this could be:

if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_private(prot);
else
prot = pgprot_shared(prot);

Thanks,
Tom

  
  	switch (pcm) {

case _PAGE_CACHE_MODE_UC:
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 7728574d7783..edb1d7a2f6af 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -81,6 +81,19 @@ enum cc_attr {
 * Examples include TDX Guest.
 */
CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+   /**
+* @CC_ATTR_GUEST_SHARED_MAPPING_INIT: IO Remapped memory is marked
+* as shared.
+*
+* The platform/OS is running as a guest/virtual machine and
+* initializes all IO remapped memory as shared.
+*
+* Examples include TDX Guest (SEV marks all pages as shared by default
+* so this feature cannot be enabled for it).
+*/
+   CC_ATTR_GUEST_SHARED_MAPPING_INIT,
+
  };
  
  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM





Re: [PATCH 04/12] x86/sev: Do not hardcode GHCB protocol version

2021-07-21 Thread Tom Lendacky via Virtualization
On 7/21/21 9:20 AM, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> Introduce the sev_get_ghcb_proto_ver() which will return the negotiated
> GHCB protocol version and use it to set the version field in the GHCB.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/boot/compressed/sev.c | 5 +
>  arch/x86/kernel/sev-shared.c   | 5 -
>  arch/x86/kernel/sev.c  | 5 +
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> index 1a2e49730f8b..101e08c67296 100644
> --- a/arch/x86/boot/compressed/sev.c
> +++ b/arch/x86/boot/compressed/sev.c
> @@ -119,6 +119,11 @@ static enum es_result vc_read_mem(struct es_em_ctxt 
> *ctxt,
>  /* Include code for early handlers */
>  #include "../../kernel/sev-shared.c"
>  
> +static u64 sev_get_ghcb_proto_ver(void)
> +{
> + return GHCB_PROTOCOL_MAX;
> +}
> +
>  static bool early_setup_sev_es(void)
>  {
>   if (!sev_es_negotiate_protocol(NULL))
> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
> index 73eeb5897d16..36eaac2773ed 100644
> --- a/arch/x86/kernel/sev-shared.c
> +++ b/arch/x86/kernel/sev-shared.c
> @@ -28,6 +28,9 @@ struct sev_ghcb_protocol_info {
>   unsigned int vm_proto;
>  };
>  
> +/* Returns the negotiated GHCB Protocol version */
> +static u64 sev_get_ghcb_proto_ver(void);
> +
>  static bool __init sev_es_check_cpu_features(void)
>  {
>   if (!has_cpuflag(X86_FEATURE_RDRAND)) {
> @@ -122,7 +125,7 @@ static enum es_result sev_es_ghcb_hv_call(struct ghcb 
> *ghcb,
>   enum es_result ret;
>  
>   /* Fill in protocol and format specifiers */
> - ghcb->protocol_version = GHCB_PROTOCOL_MAX;
> + ghcb->protocol_version = sev_get_ghcb_proto_ver();

So this probably needs better clarification in the spec, but the GHCB
version field is for the GHCB structure layout. So if you don't plan to
use the XSS field that was added for version 2 of the layout, then you
should report the GHCB structure version as 1.
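
In other words (a sketch; the helper name is an assumption), the
MSR-negotiated GHCB protocol version and the structure-layout version written
into the GHCB page are two different things:

  /* Layout v2 only adds the XSS field; report layout v1 while it is unused. */
  static u64 sev_get_ghcb_layout_ver(void)
  {
          return 1;
  }

  /* ... and in sev_es_ghcb_hv_call(): */
          ghcb->protocol_version = sev_get_ghcb_layout_ver();
          ghcb->ghcb_usage       = GHCB_DEFAULT_USAGE;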

Thanks,
Tom

>   ghcb->ghcb_usage   = GHCB_DEFAULT_USAGE;
>  
>   ghcb_set_sw_exit_code(ghcb, exit_code);
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 8084bfd7cce1..5d3422e8b25e 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -498,6 +498,11 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb 
> *ghcb, struct es_em_ctxt
>  /* Negotiated GHCB protocol version */
>  static struct sev_ghcb_protocol_info ghcb_protocol_info __ro_after_init;
>  
> +static u64 sev_get_ghcb_proto_ver(void)
> +{
> + return ghcb_protocol_info.vm_proto;
> +}
> +
>  static noinstr void __sev_put_ghcb(struct ghcb_state *state)
>  {
>   struct sev_es_runtime_data *data;
> 


Re: [PATCH 2/3] x86/sev: Add defines for GHCB version 2 MSR protocol requests

2021-06-23 Thread Tom Lendacky via Virtualization
On 6/23/21 4:32 AM, Borislav Petkov wrote:
> On Wed, Jun 23, 2021 at 08:40:00AM +0200, Joerg Roedel wrote:
>> From: Brijesh Singh 
>>
> 
> Ok, so I took a critical look at this and it doesn't make sense to have
> a differently named define each time you need the [63:12] slice of
> GHCBData. So you can simply use GHCB_DATA(msr_value) instead, see below.
> 
> Complaints?

None from me.

Thanks,
Tom


Re: [PATCH 2/3] x86/sev: Add defines for GHCB version 2 MSR protocol requests

2021-06-22 Thread Tom Lendacky via Virtualization
On 6/22/21 9:48 AM, Joerg Roedel wrote:
> From: Brijesh Singh 
> 
> Add the necessary defines for supporting the GHCB version 2 protocol.
> This includes defines for:
> 
>   - MSR-based AP hlt request/response
>   - Hypervisor Feature request/response
> 
> This is the bare minimum of requests that need to be supported by a GHCB
> version 2 implementation. There are more requests in the specification,
> but those depend on Secure Nested Paging support being available.
> 
> These defines are shared between SEV host and guest support, so they are
> submitted as an individual patch without users yet to avoid merge
> conflicts in the future.
> 
> Signed-off-by: Brijesh Singh 
> Co-developed-by: Tom Lendacky 
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/include/asm/sev-common.h | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/x86/include/asm/sev-common.h 
> b/arch/x86/include/asm/sev-common.h
> index 1cc9e7dd8107..4e6c4c7cb294 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -47,6 +47,21 @@
>   (((unsigned long)reg & GHCB_MSR_CPUID_REG_MASK) << 
> GHCB_MSR_CPUID_REG_POS) | \
>   (((unsigned long)fn) << GHCB_MSR_CPUID_FUNC_POS))
>  
> +/* AP Reset Hold */
> +#define GHCB_MSR_AP_RESET_HOLD_REQ   0x006
> +#define GHCB_MSR_AP_RESET_HOLD_RESP  0x007
> +#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS12
> +#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK   GENMASK_ULL(51, 0)
> +
> +/* GHCB Hypervisor Feature Request/Response */
> +#define GHCB_MSR_HV_FT_REQ   0x080
> +#define GHCB_MSR_HV_FT_RESP  0x081
> +#define GHCB_MSR_HV_FT_POS   12
> +#define GHCB_MSR_HV_FT_MASK  GENMASK_ULL(51, 0)
> +
> +#define GHCB_MSR_HV_FT_RESP_VAL(v)   \
> + (((unsigned long)((v) & GHCB_MSR_HV_FT_MASK) >> GHCB_MSR_HV_FT_POS))

This should shift down first and then mask or else the mask should be from
12 to 63.
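
I.e., either of these would be consistent (sketch, pick one):

  /* Shift down first, then mask the 52-bit result ... */
  #define GHCB_MSR_HV_FT_RESP_VAL(v)                              \
          (((u64)(v) >> GHCB_MSR_HV_FT_POS) & GHCB_MSR_HV_FT_MASK)

  /* ... or keep the original order but make the mask cover bits 63:12. */
  #define GHCB_MSR_HV_FT_MASK     GENMASK_ULL(63, 12)
  #define GHCB_MSR_HV_FT_RESP_VAL(v)                              \
          (((u64)((v) & GHCB_MSR_HV_FT_MASK)) >> GHCB_MSR_HV_FT_POS)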

Thanks,
Tom

> +
>  #define GHCB_MSR_TERM_REQ0x100
>  #define GHCB_MSR_TERM_REASON_SET_POS 12
>  #define GHCB_MSR_TERM_REASON_SET_MASK0xf
> 


Re: [PATCH v3 2/8] x86/sev: Do not require Hypervisor CPUID bit for SEV guests

2021-03-17 Thread Tom Lendacky
On 3/12/21 6:38 AM, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> A malicious hypervisor could disable the CPUID intercept for an SEV or
> SEV-ES guest and trick it into the no-SEV boot path, where it could
> potentially reveal secrets. This is not an issue for SEV-SNP guests,
> as the CPUID intercept can't be disabled for those.
> 
> Remove the Hypervisor CPUID bit check from the SEV detection code to
> protect against this kind of attack and add a Hypervisor bit equals
> zero check to the SME detection path to prevent non-SEV guests from
> trying to enable SME.
> 
> This handles the following cases:
> 
>   1) SEV(-ES) guest where CPUID intercept is disabled. The guest
>  will still see leaf 0x801f and the SEV bit. It can
>  retrieve the C-bit and boot normally.
> 
>   2) Non-SEV guests with intercepted CPUID will check SEV_STATUS
>  MSR and find it 0 and will try to enable SME. This will
>  fail when the guest finds MSR_K8_SYSCFG to be zero, as it
>  is emulated by KVM. But we can't rely on that, as there
>  might be other hypervisors which return this MSR with bit
>  23 set. The Hypervisor bit check will prevent that the
>  guest tries to enable SME in this case.
> 
>   3) Non-SEV guests on SEV capable hosts with CPUID intercept
>  disabled (by a malicious hypervisor) will try to boot into
>  the SME path. This will fail, but it is also not considered
>  a problem because non-encrypted guests have no protection
>  against the hypervisor anyway.
> 
> Signed-off-by: Joerg Roedel 

Acked-by: Tom Lendacky 

> ---
>  arch/x86/boot/compressed/mem_encrypt.S |  6 -
>  arch/x86/kernel/sev-es-shared.c|  6 +
>  arch/x86/mm/mem_encrypt_identity.c | 35 ++
>  3 files changed, 20 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/mem_encrypt.S 
> b/arch/x86/boot/compressed/mem_encrypt.S
> index aa561795efd1..a6dea4e8a082 100644
> --- a/arch/x86/boot/compressed/mem_encrypt.S
> +++ b/arch/x86/boot/compressed/mem_encrypt.S
> @@ -23,12 +23,6 @@ SYM_FUNC_START(get_sev_encryption_bit)
>   push%ecx
>   push%edx
>  
> - /* Check if running under a hypervisor */
> - movl$1, %eax
> - cpuid
> - bt  $31, %ecx   /* Check the hypervisor bit */
> - jnc .Lno_sev
> -
>   movl $0x80000000, %eax   /* CPUID to check the highest leaf */
>   cpuid
>   cmpl $0x8000001f, %eax   /* See if 0x8000001f is available */
> diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
> index cdc04d091242..387b71669818 100644
> --- a/arch/x86/kernel/sev-es-shared.c
> +++ b/arch/x86/kernel/sev-es-shared.c
> @@ -186,7 +186,6 @@ void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned 
> long exit_code)
>* make it accessible to the hypervisor.
>*
>* In particular, check for:
> -  *  - Hypervisor CPUID bit
>*  - Availability of CPUID leaf 0x8000001f
>*  - SEV CPUID bit.
>*
> @@ -194,10 +193,7 @@ void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned 
> long exit_code)
>* can't be checked here.
>*/
>  
> - if ((fn == 1 && !(regs->cx & BIT(31
> - /* Hypervisor bit */
> - goto fail;
> - else if (fn == 0x80000000 && (regs->ax < 0x8000001f))
> + if (fn == 0x80000000 && (regs->ax < 0x8000001f))
>   /* SEV leaf check */
>   goto fail;
>   else if ((fn == 0x8000001f && !(regs->ax & BIT(1
> diff --git a/arch/x86/mm/mem_encrypt_identity.c 
> b/arch/x86/mm/mem_encrypt_identity.c
> index 6c5eb6f3f14f..a19374d26101 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -503,14 +503,10 @@ void __init sme_enable(struct boot_params *bp)
>  
>  #define AMD_SME_BIT  BIT(0)
>  #define AMD_SEV_BIT  BIT(1)
> - /*
> -  * Set the feature mask (SME or SEV) based on whether we are
> -  * running under a hypervisor.
> -  */
> - eax = 1;
> - ecx = 0;
> - native_cpuid(, , , );
> - feature_mask = (ecx & BIT(31)) ? AMD_SEV_BIT : AMD_SME_BIT;
> +
> + /* Check the SEV MSR whether SEV or SME is enabled */
> + sev_status   = __rdmsr(MSR_AMD64_SEV);
> + feature_mask = (sev_status & MSR_AMD64_SEV_ENABLED) ? AMD_SEV_BIT : 
> AMD_SME_BIT;
>  
>   /*
>* Check for the SME/SEV feature:
> @@ -530,19 +526,26 @@ void __init sme_enable(struct boot_params *bp)
>  
>   /* Ch

Re: [PATCH 6/7] x86/boot/compressed/64: Check SEV encryption in 32-bit boot-path

2021-02-10 Thread Tom Lendacky

On 2/10/21 10:47 AM, Dave Hansen wrote:

On 2/10/21 2:21 AM, Joerg Roedel wrote:

+   /* Store to memory and keep it in the registers */
+   movl%eax, rva(sev_check_data)(%ebp)
+   movl%ebx, rva(sev_check_data+4)(%ebp)
+
+   /* Enable paging to see if encryption is active */
+   movl%cr0, %edx  /* Backup %cr0 in %edx */
+   movl$(X86_CR0_PG | X86_CR0_PE), %ecx /* Enable Paging and Protected 
mode */
+   movl%ecx, %cr0
+
+   cmpl%eax, rva(sev_check_data)(%ebp)
+   jne 3f
+   cmpl%ebx, rva(sev_check_data+4)(%ebp)
+   jne 3f


Also, I know that turning paging on is a *BIG* barrier.  But, I didn't
think it has any effect on the caches.

I would expect that the underlying physical address of 'sev_check_data'
would change when paging gets enabled because paging sets the C bit.
So, how does the write of 'sev_check_data' get out of the caches and
into memory where it can be read back with the new physical address?

I think there's some bit of the SEV architecture that I'm missing.


Non-paging memory accesses are always considered private (APM Volume 2, 
15.34.4) and thus are cached with the C bit set.


Thanks,
Tom






Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-02 Thread Tom Lendacky
On 2/2/21 10:37 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
>> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
>>>> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
>>>>> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
>>>>>> The size of the buffer being bounced is not checked if it happens
>>>>>> to be larger than the size of the mapped buffer. Because the size
>>>>>> can be controlled by a device, as it's the case with virtio devices,
>>>>>> this can lead to memory corruption.
>>>>>>
>>>>>
>>>>> I'm really worried about all these hodge podge hacks for not trusted
>>>>> hypervisors in the I/O stack.  Instead of trying to harden protocols
>>>>> that are fundamentally not designed for this, how about instead coming
>>>>> up with a new paravirtualized I/O interface that is specifically
>>>>> designed for use with an untrusted hypervisor from the start?
>>>>
>>>> Your comment makes sense but then that would require the cooperation
>>>> of these vendors and the cloud providers to agree on something meaningful.
>>>> I am also not sure whether the end result would be better than hardening
>>>> this interface to catch corruption. There is already some validation in
>>>> unmap path anyway.
>>>>
>>>> Another possibility is to move this hardening to the common virtio code,
>>>> but I think the code may become more complicated there since it would
>>>> require tracking both the dma_addr and length for each descriptor.
>>>
>>> Christoph,
>>>
>>> I've been wrestling with the same thing - this is specific to busted
>>> drivers. And in reality you could do the same thing with a hardware
>>> virtio device (see example in 
>>> http://thunderclap.io/) - where the
>>> mitigation is 'enable the IOMMU to do its job.'.
>>>
>>> AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
>>> and while that is great in the future, SEV without IOMMU is now here.
>>>
>>> Doing a full circle here, this issue can be exploited with virtio
>>> but you could say do that with real hardware too if you hacked the
>>> firmware, so if you say used Intel SR-IOV NIC that was compromised
>>> on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
>>> of the guest would be SWIOTLB code. Last line of defense against
>>> bad firmware to say.
>>>
>>> As such I am leaning towards taking this code, but I am worried
>>> about the performance hit .. but perhaps I shouldn't as if you
>>> are using SWIOTLB=force already you are kind of taking a
>>> performance hit?
>>>
>>
>> I have not measured the performance degradation. This will hit all AMD SEV,
>> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to
>> be large since there are only a few added operations per hundreds of copied
>> bytes. I could try to measure the performance hit by running some benchmark
>> with virtio-net/virtio-blk/virtio-rng.
>>
>> Earlier I said:
>>>> Another possibility is to move this hardening to the common virtio code,
>>>> but I think the code may become more complicated there since it would
>>>> require tracking both the dma_addr and length for each descriptor.
>>
>> Unfortunately, this doesn't make sense. Even if there's validation for
>> the size in the common virtio layer, there will be some other device
>> which controls a dma_addr and length passed to dma_unmap* in the
>> corresponding driver. The device can target a specific dma-mapped private
>> buffer by changing the dma_addr and set a good length to overwrite buffers
>> following it.
>>
>> So, instead of doing the check in every driver and hitting a performance
>> cost even when swiotlb is not used, it's probably better to fix it in
>> swiotlb.
>>
>> @Tom Lendacky, do you think that it makes sense to h
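
For reference, the kind of check being discussed would sit in the bounce path
and look roughly like this; this is only a sketch with assumed parameter
names, not the actual patch:

  static void swiotlb_bounce(struct device *dev, phys_addr_t orig_addr,
                             unsigned char *vaddr, size_t size,
                             size_t alloc_size, enum dma_data_direction dir)
  {
          /* Clamp to the size recorded at map time, so a device-supplied
           * length cannot overwrite neighbouring bounce slots. */
          if (size > alloc_size) {
                  dev_WARN_ONCE(dev, 1, "bounce size %zu > mapped %zu\n",
                                size, alloc_size);
                  size = alloc_size;
          }

          if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                  memcpy(vaddr, phys_to_virt(orig_addr), size);
          else
                  memcpy(phys_to_virt(orig_addr), vaddr, size);
  }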

Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

2020-09-09 Thread Tom Lendacky

On 9/9/20 7:44 AM, Laszlo Ersek wrote:

On 09/09/20 10:27, Ard Biesheuvel wrote:

(adding Laszlo and Brijesh)

On Tue, 8 Sep 2020 at 20:46, Borislav Petkov  wrote:


+ Ard so that he can ack the efi bits.

On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:

From: Tom Lendacky 

Calling down to EFI runtime services can result in the firmware
performing VMGEXIT calls. The firmware is likely to use the GHCB of
the OS (e.g., for setting EFI variables),


I've had to stare at this for a while.

Because, normally a VMGEXIT is supposed to occur like this:

- guest does something privileged
- resultant non-automatic exit (NAE) injects a #VC exception
- exception handler figures out what that privileged thing was
- exception handler submits request to hypervisor via GHCB contents plus
   VMGEXIT instruction

Point being, the agent that "owns" the exception handler is supposed to
pre-allocate or otherwise provide the GHCB too, for information passing.

So... what is the particular NAE that occurs during the execution of
UEFI runtime services (at OS runtime)?

And assuming it occurs, I'm unsure if the exception handler (IDT) at
that point is owned (temporarily?) by the firmware.

- If the #VC handler comes from the firmware, then I don't know why it
would use the OS's GHCB.

- If the #VC handler comes from the OS, then I don't understand why the
commit message says "firmware performing VMGEXIT", given that in this
case it would be the OS's #VC handler executing VMGEXIT.

So, I think the above commit message implies a VMGEXIT *without* a NAE /
#VC context. (Because, I fail to interpret the commit message in a NAE /
#VC context in any way; see above.)


Correct.



OK, so let's see where the firmware performs a VMGEXIT *outside* of an
exception handler, *while* at OS runtime. There seems to be one, in file
"OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":


Again, correct. Basically this is what is invoked when setting UEFI variables.




VOID
QemuFlashPtrWrite (
   IN volatile UINT8    *Ptr,
   IN UINT8             Value
   )
{
   if (MemEncryptSevEsIsEnabled ()) {
 MSR_SEV_ES_GHCB_REGISTER  Msr;
 GHCB  *Ghcb;

 Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
 Ghcb = Msr.Ghcb;

 //
 // Writing to flash is emulated by the hypervisor through the use of write
 // protection. This won't work for an SEV-ES guest because the write won't
 // be recognized as a true MMIO write, which would result in the required
 // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
 // to perform the update.
 //
 VmgInit (Ghcb);
 Ghcb->SharedBuffer[0] = Value;
 Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
 VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
 VmgDone (Ghcb);
   } else {
 *Ptr = Value;
   }
}


This function *does* run at OS runtime (as a part of non-volatile UEFI
variable writes).

And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
as GHCB.

If the guest kernel allocates its own GHCB and writes the allocation
address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
of the OS.

I reviewed edk2 commit 437eb3f7a8db
("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
changing MSR_SEV_ES_GHCB. I'm sorry about that.

As long as this driver is running before OS runtime (i.e., during the
DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
set in "OvmfPkg/PlatformPei/AmdSev.c":


STATIC
VOID
AmdSevEsInitialize (
   VOID
   )
{
   VOID  *GhcbBase;
   PHYSICAL_ADDRESS  GhcbBasePa;
   UINTN GhcbPageCount, PageCount;
   RETURN_STATUS PcdStatus, DecryptStatus;
   IA32_DESCRIPTOR   Gdtr;
   VOID  *Gdt;

   if (!MemEncryptSevEsIsEnabled ()) {
 return;
   }

   PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
   ASSERT_RETURN_ERROR (PcdStatus);

   //
   // Allocate GHCB and per-CPU variable pages.
   //   Since the pages must survive across the UEFI to OS transition
   //   make them reserved.
   //
   GhcbPageCount = mMaxCpuCount * 2;
   GhcbBase = AllocateReservedPages (GhcbPageCount);
   ASSERT (GhcbBase != NULL);

   GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;

   //
   // Each vCPU gets two consecutive pages, the first is the GHCB and the
   // second is the per-CPU variable page. Loop through the allocation and
   // only clear the encryption mask for the GHCB pages.
   //
   for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
 DecryptStatus = MemEncryptSevClearPageEncMask (
   0,
   GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
   1,
   TRUE
   );
 ASSERT_RETURN_ERROR (DecryptStatus);
   }

   ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));

   PcdStatus

Re: [PATCH v3 1/1] s390: virtio: let arch accept devices without IOMMU feature

2020-06-17 Thread Tom Lendacky
On 6/17/20 5:43 AM, Pierre Morel wrote:
> An architecture protecting the guest memory against unauthorized host
> access may want to enforce VIRTIO I/O device protection through the
> use of VIRTIO_F_IOMMU_PLATFORM.
> 
> Let's give the architecture a chance to accept or reject devices
> without VIRTIO_F_IOMMU_PLATFORM.
> 
> Signed-off-by: Pierre Morel 
> Acked-by: Jason Wang 
> Acked-by: Christian Borntraeger 
> ---
>  arch/s390/mm/init.c |  6 ++
>  drivers/virtio/virtio.c | 22 ++
>  include/linux/virtio.h  |  2 ++
>  3 files changed, 30 insertions(+)
> 
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 6dc7c3b60ef6..215070c03226 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -45,6 +45,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(.bss..swapper_pg_dir);
>  
> @@ -161,6 +162,11 @@ bool force_dma_unencrypted(struct device *dev)
>   return is_prot_virt_guest();
>  }
>  
> +int arch_needs_virtio_iommu_platform(struct virtio_device *dev)
> +{
> + return is_prot_virt_guest();
> +}
> +
>  /* protected virtualization */
>  static void pv_init(void)
>  {
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index a977e32a88f2..aa8e01104f86 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -167,6 +167,21 @@ void virtio_add_status(struct virtio_device *dev, 
> unsigned int status)
>  }
>  EXPORT_SYMBOL_GPL(virtio_add_status);
>  
> +/*
> + * arch_needs_virtio_iommu_platform - provide arch specific hook when 
> finalizing
> + * features for VIRTIO device dev
> + * @dev: the VIRTIO device being added
> + *
> + * Permits the platform to provide architecture specific functionality when
> + * devices features are finalized. This is the default implementation.
> + * Architecture implementations can override this.
> + */
> +
> +int __weak arch_needs_virtio_iommu_platform(struct virtio_device *dev)
> +{
> + return 0;
> +}
> +
>  int virtio_finalize_features(struct virtio_device *dev)
>  {
>   int ret = dev->config->finalize_features(dev);
> @@ -179,6 +194,13 @@ int virtio_finalize_features(struct virtio_device *dev)
>   if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
>   return 0;
>  
> + if (arch_needs_virtio_iommu_platform(dev) &&
> + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) {

Just a style nit, but the "!virtio..." should be directly under the
"arch_...", not tabbed out.

Thanks,
Tom

> + dev_warn(&dev->dev,
> +  "virtio: device must provide VIRTIO_F_IOMMU_PLATFORM\n");
> + return -ENODEV;
> + }
> +
>   virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
>   status = dev->config->get_status(dev);
>   if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index a493eac08393..e8526ae3463e 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -195,4 +195,6 @@ void unregister_virtio_driver(struct virtio_driver *drv);
>  #define module_virtio_driver(__virtio_driver) \
>   module_driver(__virtio_driver, register_virtio_driver, \
>   unregister_virtio_driver)
> +
> +int arch_needs_virtio_iommu_platform(struct virtio_device *dev);
>  #endif /* _LINUX_VIRTIO_H */
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 59/75] x86/sev-es: Handle MONITOR/MONITORX Events

2020-06-11 Thread Tom Lendacky




On 6/11/20 12:13 PM, Sean Christopherson wrote:

On Thu, Jun 11, 2020 at 03:10:45PM +0200, Joerg Roedel wrote:

On Tue, May 19, 2020 at 11:38:45PM -0700, Sean Christopherson wrote:

On Tue, Apr 28, 2020 at 05:17:09PM +0200, Joerg Roedel wrote:

+static enum es_result vc_handle_monitor(struct ghcb *ghcb,
+   struct es_em_ctxt *ctxt)
+{
+   phys_addr_t monitor_pa;
+   pgd_t *pgd;
+
+   pgd = __va(read_cr3_pa());
+   monitor_pa = vc_slow_virt_to_phys(ghcb, ctxt->regs->ax);
+
+   ghcb_set_rax(ghcb, monitor_pa);
+   ghcb_set_rcx(ghcb, ctxt->regs->cx);
+   ghcb_set_rdx(ghcb, ctxt->regs->dx);
+
+   return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MONITOR, 0, 0);


Why?  If SVM has the same behavior as VMX, the MONITOR will be disarmed on
VM-Enter, i.e. the VMM can't do anything useful for MONITOR/MWAIT.  I
assume that's the case given that KVM emulates MONITOR/MWAIT as NOPs on
SVM.


Not sure if it is disarmed on VMRUN, but the MONITOR/MWAIT instructions
are part of the GHCB spec, so they are implemented here.


Even if MONITOR/MWAIT somehow works across VMRUN I'm not sure it's something
the guest should enable by default as it leaks GPAs to the untrusted host,
with no benefit to the guest except in specific configurations.  Yeah, the
VMM can muck with page tables to trace guest to the some extent, but the
guest shouldn't be unnecessarily volunteering information to the host.

If MONITOR/MWAIT is effectively a NOP then removing this code is a no
brainer.

Can someone from AMD chime in?


I don't think there is any guarantee that MONITOR/MWAIT would work within 
a guest (I'd have to dig some more on that to get a definitive answer, but 
probably not necessary to do). As you say, if KVM emulates it as a NOP, 
there's no sense in exposing the GPA - make it a NOP in the handler. I 
just need to poke some other hypervisor vendors and hear what they have to 
say.
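
For reference, the "make it a NOP in the handler" alternative sketched here
would look roughly like this (illustrative only, not the exact code that was
eventually merged):

static enum es_result vc_handle_monitor(struct ghcb *ghcb,
					struct es_em_ctxt *ctxt)
{
	/*
	 * Treat MONITOR/MONITORX as a NOP: do not hand the monitored GPA to
	 * the hypervisor, which (on KVM at least) emulates it as a NOP anyway.
	 */
	return ES_OK;
}

static enum es_result vc_handle_mwait(struct ghcb *ghcb,
				      struct es_em_ctxt *ctxt)
{
	/* Same reasoning as MONITOR: nothing useful the hypervisor could do */
	return ES_OK;
}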


Thanks,
Tom





___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 69/75] x86/realmode: Setup AP jump table

2020-05-29 Thread Tom Lendacky




On 5/29/20 4:02 AM, Borislav Petkov wrote:

On Tue, Apr 28, 2020 at 05:17:19PM +0200, Joerg Roedel wrote:

From: Tom Lendacky 

Setup the AP jump table to point to the SEV-ES trampoline code so that
the APs can boot.


Tom, in his laconic way, doesn't want to explain to us why is this even
needed...

:)


Looks like some of the detail was lost during the patch shuffling. 
Originally (on GitHub) this was the text:


 As part of the GHCB specification, the booting of APs under SEV-ES
 requires an AP jump table when transitioning from one layer of code to
 another (e.g. when going from UEFI to the OS). As a result, each layer
 that parks an AP must provide the physical address of an AP jump table
 to the next layer using the GHCB MSR.

 Upon booting of the kernel, read the GHCB MSR and save the address of
 the AP jump table. Under SEV-ES, APs are started using the INIT-SIPI-SIPI
 sequence. Before issuing the first SIPI request for an AP, the start eip
 is programmed into the AP jump table. Upon issuing the SIPI request, the
 AP will awaken and jump to the start eip address programmed into the AP
 jump table.

It needs to change "GHCB MSR" to "VMGEXIT MSR protocol", but should cover 
what you're looking for.
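
For readers following along, a condensed sketch of how the OS side retrieves
the AP jump table address over the GHCB; helper and constant names follow
this patch series and may differ slightly from what was finally merged:

static phys_addr_t sev_es_get_ap_jump_table_addr(struct ghcb *ghcb)
{
	phys_addr_t addr = 0;

	vc_ghcb_invalidate(ghcb);		/* clear the valid bitmap */
	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_JUMP_TABLE);
	ghcb_set_sw_exit_info_1(ghcb, SVM_VMGEXIT_GET_AP_JUMP_TABLE);
	ghcb_set_sw_exit_info_2(ghcb, 0);

	sev_es_wr_ghcb_msr(__pa(ghcb));
	VMGEXIT();

	/* On success the hypervisor returns the address in sw_exit_info_2 */
	if (ghcb_is_valid_sw_exit_info_1(ghcb) &&
	    ghcb_is_valid_sw_exit_info_2(ghcb))
		addr = ghcb->save.sw_exit_info_2;

	return addr;
}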


Thanks,
Tom



/me reads the code

/me reads the GHCB spec

aha, it gets it from the HV. And it can be set by the guest too...

So how about expanding that commit message as to why this is done, why
needed, etc?

Thx.


diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 262f83cad355..1c5cbfd102d5 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -9,6 +9,7 @@
  #include 
  #include 
  #include 
+#include 
  
  struct real_mode_header *real_mode_header;

  u32 *trampoline_cr4_features;
@@ -107,6 +108,11 @@ static void __init setup_real_mode(void)
if (sme_active())
trampoline_header->flags |= TH_FLAGS_SME_ACTIVE;
  
+	if (sev_es_active()) {

+   if (sev_es_setup_ap_jump_table(real_mode_header))
+   panic("Failed to update SEV-ES AP Jump Table");
+   }
+


So this function gets slowly sprinkled with

if (sev-something)
bla

Please wrap at least those last two into a

sev_setup_real_mode()

or so.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance

2020-05-27 Thread Tom Lendacky

On 4/28/20 10:17 AM, Joerg Roedel wrote:

From: Mike Stunes 

To avoid a future VMEXIT for a subsequent CPUID function, cache the
results returned by CPUID into an xarray.

  [tl: coding standard changes, register zero extension]

Signed-off-by: Mike Stunes 
Signed-off-by: Tom Lendacky 
[ jroe...@suse.de: - Wrapped cache handling into vc_handle_cpuid_cached()
- Used lower_32_bits() where applicable
   - Moved cache_index out of struct es_em_ctxt ]
Co-developed-by: Joerg Roedel 
Signed-off-by: Joerg Roedel 
---
  arch/x86/kernel/sev-es-shared.c |  12 ++--
  arch/x86/kernel/sev-es.c| 119 +++-
  2 files changed, 124 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index 5bfc1f3030d4..cfdafe12da4f 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -427,8 +427,8 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
u32 cr4 = native_read_cr4();
enum es_result ret;
  
-	ghcb_set_rax(ghcb, regs->ax);

-   ghcb_set_rcx(ghcb, regs->cx);
+   ghcb_set_rax(ghcb, lower_32_bits(regs->ax));
+   ghcb_set_rcx(ghcb, lower_32_bits(regs->cx));
  
  	if (cr4 & X86_CR4_OSXSAVE)

/* Safe to read xcr0 */
@@ -447,10 +447,10 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
  ghcb_is_valid_rdx(ghcb)))
return ES_VMM_ERROR;
  
-	regs->ax = ghcb->save.rax;

-   regs->bx = ghcb->save.rbx;
-   regs->cx = ghcb->save.rcx;
-   regs->dx = ghcb->save.rdx;
+   regs->ax = lower_32_bits(ghcb->save.rax);
+   regs->bx = lower_32_bits(ghcb->save.rbx);
+   regs->cx = lower_32_bits(ghcb->save.rcx);
+   regs->dx = lower_32_bits(ghcb->save.rdx);
  
  	return ES_OK;

  }
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 03095bc7b563..0303834d4811 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -33,6 +34,16 @@
  
  #define DR7_RESET_VALUE        0x400
  
+struct sev_es_cpuid_cache_entry {

+   unsigned long eax;
+   unsigned long ebx;
+   unsigned long ecx;
+   unsigned long edx;
+};
+
+static struct xarray sev_es_cpuid_cache;
+static bool __ro_after_init sev_es_cpuid_cache_initialized;
+
  /* For early boot hypervisor communication in SEV-ES enabled guests */
  static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  
@@ -463,6 +474,9 @@ void __init sev_es_init_vc_handling(void)

sev_es_setup_vc_stack(cpu);
}
  
+	xa_init_flags(&sev_es_cpuid_cache, XA_FLAGS_LOCK_IRQ);

+   sev_es_cpuid_cache_initialized = true;
+
init_vc_stack_names();
  }
  
@@ -744,6 +758,91 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb,

return ret;
  }
  
+static unsigned long sev_es_get_cpuid_cache_index(struct es_em_ctxt *ctxt)

+{
+   unsigned long hi, lo;
+
+   /* Don't attempt to cache until the xarray is initialized */
+   if (!sev_es_cpuid_cache_initialized)
+   return ULONG_MAX;
+
+   lo = lower_32_bits(ctxt->regs->ax);
+
+   /*
+* CPUID 0x0000000d requires both RCX and XCR0, so it can't be
+* cached.
+*/
+   if (lo == 0x0000000d)
+   return ULONG_MAX;
+
+   /*
+* Some callers of CPUID don't always set RCX to zero for CPUID
+* functions that don't require RCX, which can result in excessive
+* cached values, so RCX needs to be manually zeroed for use as part
+* of the cache index. Future CPUID values may need RCX, but since
+* they can't be known, they must not be cached.
+*/
+   if (lo > 0x80000020)
+   return ULONG_MAX;
+
+   switch (lo) {
+   case 0x00000007:
+   case 0x0000000b:
+   case 0x0000000f:
+   case 0x00000010:
+   case 0x8000001d:
+   case 0x80000020:
+   hi = ctxt->regs->cx << 32;
+   break;
+   default:
+   hi = 0;
+   }
+
+   return hi | lo;
+}
+
+static bool sev_es_check_cpuid_cache(struct es_em_ctxt *ctxt,
+unsigned long cache_index)
+{
+   struct sev_es_cpuid_cache_entry *cache_entry;
+
+   if (cache_index == ULONG_MAX)
+   return false;
+
+   cache_entry = xa_load(&sev_es_cpuid_cache, cache_index);
+   if (!cache_entry)
+   return false;
+
+   ctxt->regs->ax = cache_entry->eax;
+   ctxt->regs->bx = cache_entry->ebx;
+   ctxt->regs->cx = cache_entry->ecx;
+   ctxt->regs->dx = cache_entry->edx;
+
+   return true;
+}
+
+static void sev_es_add_cpuid_cache(struct es_em_ctxt *ctxt,
+  unsigned long cach

Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance

2020-05-27 Thread Tom Lendacky

On 5/26/20 4:19 AM, Borislav Petkov wrote:

On Tue, May 19, 2020 at 10:16:37PM -0700, Sean Christopherson wrote:

The whole cache on-demand approach seems like overkill.  The number of CPUID
leaves that are invoked after boot with any regularity can probably be counted
on one hand.   IIRC glibc invokes CPUID to gather TLB/cache info, XCR0-based
features, and one or two other leafs.  A statically sized global array that's
arbitrarily index a la x86_capability would be just as simple and more
performant.  It would also allow fancier things like emulating CPUID 0xD in
the guest if you want to go down that road.


And before we do any of that "caching" or whatnot, I'd like to see
numbers justifying its existence. Because if it is only a couple of
CPUID invocations and the boot delay is immeasurable, then it's not
worth the effort.


I added some rudimentary stats code to see how many times there was a 
CPUID cache hit on a 64-vCPU guest during a kernel build (make clean 
followed by make with -j 64):


  SEV-ES CPUID cache statistics
0x00000000/0x00000000:   220,384
0x00000007/0x00000000:   213,306
0x80000000/0x00000000: 1,054,642
0x80000001/0x00000000:   213,306
0x80000005/0x00000000:   210,334
0x80000006/0x00000000:   420,668
0x80000007/0x00000000:   210,334
0x80000008/0x00000000:   420,684

2,963,658 cache hits

So it is significant in quantity, but I'm not sure what the overall 
performance difference is. If I can find some more time I'll try to 
compare the kernel builds with and without the caching to see if it is 
noticeable.


Thanks,
Tom




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance

2020-05-06 Thread Tom Lendacky




On 5/6/20 1:08 PM, Mike Stunes wrote:




On Apr 28, 2020, at 8:17 AM, Joerg Roedel  wrote:

From: Mike Stunes 

To avoid a future VMEXIT for a subsequent CPUID function, cache the
results returned by CPUID into an xarray.

[tl: coding standard changes, register zero extension]

Signed-off-by: Mike Stunes 
Signed-off-by: Tom Lendacky 
[ jroe...@suse.de: - Wrapped cache handling into vc_handle_cpuid_cached()
   - Used lower_32_bits() where applicable
   - Moved cache_index out of struct es_em_ctxt ]
Co-developed-by: Joerg Roedel 
Signed-off-by: Joerg Roedel 
---
arch/x86/kernel/sev-es-shared.c |  12 ++--
arch/x86/kernel/sev-es.c| 119 +++-
2 files changed, 124 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 03095bc7b563..0303834d4811 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -744,6 +758,91 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb,
return ret;
}

+static unsigned long sev_es_get_cpuid_cache_index(struct es_em_ctxt *ctxt)
+{
+   unsigned long hi, lo;
+
+   /* Don't attempt to cache until the xarray is initialized */
+   if (!sev_es_cpuid_cache_initialized)
+   return ULONG_MAX;
+
+   lo = lower_32_bits(ctxt->regs->ax);
+
+   /*
+* CPUID 0x0000000d requires both RCX and XCR0, so it can't be
+* cached.
+*/
+   if (lo == 0x0000000d)
+   return ULONG_MAX;
+
+   /*
+* Some callers of CPUID don't always set RCX to zero for CPUID
+* functions that don't require RCX, which can result in excessive
+* cached values, so RCX needs to be manually zeroed for use as part
+* of the cache index. Future CPUID values may need RCX, but since
+* they can't be known, they must not be cached.
+*/
+   if (lo > 0x80000020)
+   return ULONG_MAX;


If the cache is shared across CPUs, do we also need to exclude function 0x1 
because it contains the LAPIC ID? (Or is the cache per-CPU?)


It's currently not a per-CPU cache, but given what you pointed out, it 
should be if we're going to keep function 0x1 in there. The question will 
be how often is that CPUID issued, as that would determine if (not) 
caching it matters. Or even how often CPUID is issued and whether the 
xarray lock could be under contention if the cache is not per-CPU.
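
A rough sketch of the per-CPU variant hinted at above (it reuses struct
sev_es_cpuid_cache_entry and struct es_em_ctxt from the patch; each per-CPU
xarray would still need an xa_init() during boot, omitted here):

#include <linux/percpu.h>
#include <linux/xarray.h>

static DEFINE_PER_CPU(struct xarray, sev_es_cpuid_cache);

static bool sev_es_check_cpuid_cache(struct es_em_ctxt *ctxt,
				     unsigned long cache_index)
{
	struct sev_es_cpuid_cache_entry *entry;

	if (cache_index == ULONG_MAX)
		return false;

	/* Per-CPU lookup: no shared lock and no cross-CPU LAPIC-ID leakage */
	entry = xa_load(this_cpu_ptr(&sev_es_cpuid_cache), cache_index);
	if (!entry)
		return false;

	ctxt->regs->ax = entry->eax;
	ctxt->regs->bx = entry->ebx;
	ctxt->regs->cx = entry->ecx;
	ctxt->regs->dx = entry->edx;

	return true;
}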


Thanks,
Tom




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-04-27 Thread Tom Lendacky




On 4/27/20 12:37 PM, Andy Lutomirski wrote:

On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski  wrote:


On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel  wrote:


On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:

I assume the race you mean is:

#VC
Immediate NMI before IST gets shifted
#VC

Kaboom.

How are you dealing with this?  Ultimately, I think that NMI will need
to turn off IST before engaging in any funny business. Let me ponder
this a bit.


Right, I dealt with that by unconditionally shifting/unshifting the #VC IST 
entry
in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
one of the IST stacks to be unused during nesting, but that is fine. The
stack memory for #VC is only allocated when SEV-ES is active (in an
SEV-ES VM).


Blech.  It probably works, but still, yuck.  It's a bit sad that we
seem to be growing more and more poorly designed happens-anywhere
exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
#HV, and #DB.  This doesn't really scale.


I have a somewhat serious question: should we use IST for #VC at all?
As I understand it, Rome and Naples make it mandatory for hypervisors
to intercept #DB, which means that, due to the MOV SS mess, it's sort
of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
if we're running under a sufficiently sensible hypervisor, we don't
need IST for #VC.

So I think we have two choices:

1. Use IST for #VC and deal with all the mess that entails.

2. Say that SEV-ES client support on Rome and Naples is for
development only and do a quick boot-time check for whether #DB is
intercepted.  (Just set TF and see what vector we get.)  If #DB is
intercepted, print a very loud warning and refuse to boot unless some
special sev_es.insecure_development_mode or similar option is set.

#2 results in simpler and more robust entry code.  #1 is more secure.

So my question is: will anyone actually use SEV-ES in production on
Rome or Naples?  As I understand it, it's not really ready for prime
time on those chips.  And do we care if the combination of a malicious


Naples was limited in the number of encryption keys available for guests 
(15), but Rome increased that significantly (509). SEV-ES is ready on 
those chips - Rome more so with the increased key count given the 
requirement that SEV and SEV-ES guests have non-overlapping ASID ranges 
(which corresponds to key usage).


Thanks,
Tom


hypervisor and malicious guest userspace on Milan can compromise the
guest kernel?  I don't think SEV-ES is really mean to resist a
concerted effort by the hypervisor to compromise the guest.

--Andy


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] Allow RDTSC and RDTSCP from userspace

2020-04-24 Thread Tom Lendacky

On 4/24/20 4:24 PM, Dave Hansen wrote:

On 4/24/20 2:03 PM, Mike Stunes wrote:

I needed to allow RDTSC(P) from userspace and in early boot in order to
get userspace started properly. Patch below.

---
SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
during early boot. Move the rdtsc(p) #VC handler into common code and
extend the #VC handlers.


Do SEV-ES guests _always_ #VC on rdtsc(p)?


Only if the hypervisor is intercepting those instructions.
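
For context, the common rdtsc/rdtscp #VC handling being requested ends up
looking roughly like this (a sketch in the style of this series; exact helper
names may differ):

static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
				      struct es_em_ctxt *ctxt,
				      unsigned long exit_code)
{
	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
	enum es_result ret;

	/* Ask the hypervisor to execute RDTSC(P) on the guest's behalf */
	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
	if (ret != ES_OK)
		return ret;

	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
	      (!rdtscp || ghcb_is_valid_rcx(ghcb))))
		return ES_VMM_ERROR;

	ctxt->regs->ax = ghcb->save.rax;
	ctxt->regs->dx = ghcb->save.rdx;
	if (rdtscp)
		ctxt->regs->cx = ghcb->save.rcx;	/* TSC_AUX */

	return ES_OK;
}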

Thanks,
Tom




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler

2020-04-14 Thread Tom Lendacky

On 4/14/20 3:16 PM, Tom Lendacky wrote:



On 4/14/20 3:12 PM, Dave Hansen wrote:

On 4/14/20 1:04 PM, Tom Lendacky wrote:

set_memory_decrypted needs to check the return value. I see it
consistently return ENOMEM. I've traced that back to split_large_page
in arch/x86/mm/pat/set_memory.c.


At that point the guest won't be able to communicate with the
hypervisor, too. Maybe we should BUG() here to terminate further
processing?


Escalating an -ENOMEM into a crashed kernel seems a bit extreme.
Granted, the guest may be in an unrecoverable state, but the host
doesn't need to be too.



The host wouldn't be. This only happens in a guest, so it would be just 
causing the guest kernel to panic early in the boot.


And I should add that it would only impact an SEV-ES guest.

Thanks,
Tom



Thanks,
Tom


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler

2020-04-14 Thread Tom Lendacky




On 4/14/20 3:12 PM, Dave Hansen wrote:

On 4/14/20 1:04 PM, Tom Lendacky wrote:

set_memory_decrypted needs to check the return value. I see it
consistently return ENOMEM. I've traced that back to split_large_page
in arch/x86/mm/pat/set_memory.c.


At that point the guest won't be able to communicate with the
hypervisor, too. Maybe we should BUG() here to terminate further
processing?


Escalating an -ENOMEM into a crashed kernel seems a bit extreme.
Granted, the guest may be in an unrecoverable state, but the host
doesn't need to be too.



The host wouldn't be. This only happens in a guest, so it would be just 
causing the guest kernel to panic early in the boot.


Thanks,
Tom

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler

2020-04-14 Thread Tom Lendacky

On 4/14/20 2:03 PM, Mike Stunes wrote:

On Mar 19, 2020, at 2:13 AM, Joerg Roedel  wrote:


From: Tom Lendacky 

The runtime handler needs a GHCB per CPU. Set them up and map them
unencrypted.

Signed-off-by: Tom Lendacky 
Signed-off-by: Joerg Roedel 
---
arch/x86/include/asm/mem_encrypt.h |  2 ++
arch/x86/kernel/sev-es.c   | 28 +++-
arch/x86/kernel/traps.c|  3 +++
3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index c17980e8db78..4bf5286310a0 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -197,6 +203,26 @@ static bool __init sev_es_setup_ghcb(void)
return true;
}

+void sev_es_init_ghcbs(void)
+{
+   int cpu;
+
+   if (!sev_es_active())
+   return;
+
+   /* Allocate GHCB pages */
+   ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE);
+
+   /* Initialize per-cpu GHCB pages */
+   for_each_possible_cpu(cpu) {
+   struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu);
+
+   set_memory_decrypted((unsigned long)ghcb,
+sizeof(*ghcb) >> PAGE_SHIFT);
+   memset(ghcb, 0, sizeof(*ghcb));
+   }
+}
+


set_memory_decrypted needs to check the return value. I see it
consistently return ENOMEM. I've traced that back to split_large_page
in arch/x86/mm/pat/set_memory.c.


At that point the guest won't be able to communicate with the hypervisor, 
too. Maybe we should BUG() here to terminate further processing?
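
That is, roughly (a sketch based on the hunk quoted above, changed only to
check the return values and fail loudly, since a guest without a usable GHCB
cannot make progress anyway):

void sev_es_init_ghcbs(void)
{
	int cpu;

	if (!sev_es_active())
		return;

	/* Allocate GHCB pages */
	ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE);
	if (!ghcb_page)
		panic("SEV-ES: failed to allocate per-CPU GHCBs");

	/* Initialize per-cpu GHCB pages */
	for_each_possible_cpu(cpu) {
		struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu);
		int ret;

		ret = set_memory_decrypted((unsigned long)ghcb,
					   sizeof(*ghcb) >> PAGE_SHIFT);
		if (ret)
			panic("SEV-ES: cannot map GHCB unencrypted: %d", ret);

		memset(ghcb, 0, sizeof(*ghcb));
	}
}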


Thanks,
Tom




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected

2020-02-21 Thread Tom Lendacky
On 2/21/20 7:12 AM, Halil Pasic wrote:
> On Thu, 20 Feb 2020 15:55:14 -0500
> "Michael S. Tsirkin"  wrote:
> 
>> On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote:
>>> Currently the advanced guest memory protection technologies (AMD SEV,
>>> powerpc secure guest technology and s390 Protected VMs) abuse the
>>> VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which
>>> is in turn necessary, to make IO work with guest memory protection.
>>>
>>> But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a
>>> different beast: with virtio devices whose implementation runs on an SMP
>>> CPU we are still fine with doing all the usual optimizations, it is just
>>> that we need to make sure that the memory protection mechanism does not
>>> get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the
>>> side of the guest (and possibly he host side as well) than we actually
>>> need.
>>>
>>> An additional benefit of teaching the guest to make the right decision
>>> (and use DMA API) on it's own is: removing the need, to mandate special
>>> VM configuration for guests that may run with protection. This is
>>> especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all
>>> the virtio control structures into the first 2G of guest memory:
>>> something we don't necessarily want to do per-default.
>>>
>>> Signed-off-by: Halil Pasic 
>>> Tested-by: Ram Pai 
>>> Tested-by: Michael Mueller 
>>
>> This might work for you but it's fragile, since without
>> VIRTIO_F_ACCESS_PLATFORM hypervisor assumes it gets
>> GPA's, not DMA addresses.
>>
> 
> Thanks for your constructive approach. I do want the hypervisor to
> assume it gets GPA's. My train of thought was that the guys that need
> to use IOVA's that are not GPA's when force_dma_unencrypted() will have
> to specify VIRTIO_F_ACCESS_PLATFORM (at the device) anyway, because
> otherwise it won't work. But I see your point: in case of a
> mis-configuration and provided the DMA API returns IOVA's one could end
> up trying to touch wrong memory locations. But this should be similar to
> what would happen if DMA ops are not used, and memory is not made accessible.
> 
>>
>>
>> IOW this looks like another iteration of:
>>
>>  virtio: Support encrypted memory on powerpc secure guests
>>
>> which I was under the impression was abandoned as unnecessary.
> 
> Unnecessary for powerpc because they do normal PCI. In the context of
> CCW there are only guest physical addresses (CCW I/O has no concept of
> IOMMU or IOVAs).
> 
>>
>>
>> To summarize, the necessary conditions for a hack along these lines
>> (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that:
>>
>>   - secure guest mode is enabled - so we know that since we don't share
>> most memory regular virtio code won't
>> work, even though the buggy hypervisor didn't set 
>> VIRTIO_F_ACCESS_PLATFORM
> 
> force_dma_unencrypted(&vdev->dev) is IMHO exactly about this.
> 
>>   - DMA API is giving us addresses that are actually also physical
>> addresses
> 
> In case of s390 this is given. I talked with the power people before
> posting this, and they ensured me they can are willing to deal with
> this. I was hoping to talk abut this with the AMD SEV people here (hence
> the cc).

Yes, physical addresses are fine for SEV - the key is that the DMA API is
used so that an address for unencrypted, or shared, memory is returned.
E.g. for a dma_alloc_coherent() call this is an allocation that has had
set_memory_decrypted() called or for a dma_map_page() call this is an
address from SWIOTLB, which was mapped shared during boot, where the data
will be bounce-buffered.
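
To make that concrete, a toy snippet for a hypothetical driver: as long as
I/O goes through the DMA API, both SEV cases above are handled transparently
(decrypted coherent memory, SWIOTLB bounce for streaming mappings):

#include <linux/dma-mapping.h>
#include <linux/slab.h>

static int example_dma(struct device *dev)
{
	dma_addr_t coherent_dma, stream_dma;
	void *coherent;
	char *buf;
	int ret = -ENOMEM;

	/* Decrypted/shared allocation, directly visible to the device */
	coherent = dma_alloc_coherent(dev, PAGE_SIZE, &coherent_dma, GFP_KERNEL);
	if (!coherent)
		return ret;

	/* Regular (encrypted) memory: dma_map_single() bounces via SWIOTLB */
	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	if (!buf)
		goto free_coherent;

	stream_dma = dma_map_single(dev, buf, PAGE_SIZE, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, stream_dma))
		goto free_buf;

	/* ... program the device with coherent_dma / stream_dma ... */

	dma_unmap_single(dev, stream_dma, PAGE_SIZE, DMA_TO_DEVICE);
	ret = 0;
free_buf:
	kfree(buf);
free_coherent:
	dma_free_coherent(dev, PAGE_SIZE, coherent, coherent_dma);
	return ret;
}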

We don't currently support an emulated IOMMU in our SEV guest because that
would require a lot of support in the driver to make IOMMU data available
to the hypervisor (I/O page tables, etc.). We would need hardware support
to really make this work easily in the guest.

Thanks,
Tom

> 
>>   - Hypervisor is buggy and didn't enable VIRTIO_F_ACCESS_PLATFORM
>>
> 
> I don't get this point. The argument where the hypervisor is buggy is a
> bit hard to follow for me. If hypervisor is buggy we have already lost
> anyway most of the time, or?
>  
>> I don't see how this patch does this.
> 
> I do get your point. I don't know of a good way to check that DMA API
> is giving us addresses that are actually physical addresses, and the
> situation you describe definitely has some risk to it.
> 
> Let me comment on other ideas that came up. I would be very happy to go
> with the best one. Thank you very much.
> 
> Regards,
> Halil
> 
>>
>>
>>> ---
>>>  drivers/virtio/virtio_ring.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 867c7ebd3f10..fafc8f924955 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device 
>>> *vdev)

Re: [PATCH 1/1] virtio_ring: fix return code on DMA mapping fails

2019-11-23 Thread Tom Lendacky

On 11/22/19 7:08 AM, Halil Pasic wrote:

Thanks Michael!

Actually I also hoped to start a discussion on virtio with encrypted
memory.

I assume the AMD folks have the most experience with this, and I very
much like to understand how do they master the challenges we are all
facing.

My understanding of IO in the context of AMD SEV is that the user
is responsible for choosing the swiotlb command line parameter of the
guest kernel so, that the guest never runs out of swiotlb. And that
not doing so may have fatal consequences with regards to the guest. [1]

The swiotlb being a guest global resource, to choose such a size, one
would fist need to know the maximal swiotlb footprint of each device,
and then apply some heuristics regarding fragmentation.

Honestly, if somebody asked me how to calculate the max swiotlb
footprint of the most common virtio devices, I would feel very
uncomfortable.

But maybe I got it all wrong. @Tom can you help me understand how this
works?


Yes, SWIOTLB sizing is hard. It really depends on the workload and the
associated I/O load that the guest will be performing. We've been looking
at a simple patch to increase the default SWIOTLB size if SEV is active.
But what size do you choose? Do you base it on the overall guest size? And
you're limited because it must reside low in memory.

Ideally, having a pool of shared pages for DMA, outside of standard
SWIOTLB, might be a good thing.  On x86, SWIOTLB really seems geared
towards devices that don't support 64-bit DMA. If a device supports 64-bit
DMA then it can use shared pages that reside anywhere to perform the DMA
and bounce buffering. I wonder if the SWIOTLB support can be enhanced to
support something like this, using today's low SWIOTLB buffers if the DMA
mask necessitates it, otherwise using a dynamically sized pool of shared
pages that can live anywhere.
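
As a sketch of the "increase the default SWIOTLB size if SEV is active" idea
(the hook name here is made up, and swiotlb_adjust_size() only exists in
later kernels; older ones would have to tweak io_tlb_nslabs or the swiotlb=
command line instead):

#include <linux/swiotlb.h>
#include <linux/memblock.h>
#include <linux/mem_encrypt.h>

void __init sev_setup_swiotlb_size(void)	/* hypothetical hook */
{
	unsigned long size;

	if (!sev_active())
		return;

	/* Heuristic only: reserve ~6% of guest memory for bounce buffers */
	size = memblock_phys_mem_size() * 6 / 100;
	swiotlb_adjust_size(size);
}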

Thanks,
Tom



In any case, we s390 protected virtualization folks are concerned about
the things laid out above. The goal of this patch is to make the swiotlb
full condition less grave, but it is by no means a full solution.

I would like to work on improving on this situation. Obviously we have
done some thinking about what can be done, but I would very much like to
collect the opinions, of the people in the community that AFAICT face
same problem. One of the ideas is to try to prevent it from happening by
making swiotlb sizing dynamic. Another idea is to make the system deal
with the failures gracefully. Both ideas come with a bag of problems of
their own (AFAICT).

According to my research the people I need to talk to are Tom (AMD), and
Ram and Thiago (Power) and of course the respective maintainers. Have I
missed anybody?

Regards,
Halil

--

[1] https://github.com/AMDESE/AMDSEV#faq-4

On Tue, 19 Nov 2019 08:04:29 -0500
"Michael S. Tsirkin"  wrote:


Will be in the next pull request.

On Tue, Nov 19, 2019 at 12:10:22PM +0100, Halil Pasic wrote:

ping

On Thu, 14 Nov 2019 13:46:46 +0100
Halil Pasic  wrote:


Commit 780bc7903a32 ("virtio_ring: Support DMA APIs")  makes
virtqueue_add() return -EIO when we fail to map our I/O buffers. This is
a very realistic scenario for guests with encrypted memory, as swiotlb
may run out of space, depending on its size and the I/O load.

The virtio-blk driver interprets -EIO from virtqueue_add() as an IO
error, despite the fact that swiotlb full is in absence of bugs a
recoverable condition.

Let us change the return code to -ENOMEM, and make the block layer
recover from these failures when virtio-blk encounters the condition
described above.

Fixes: 780bc7903a32 ("virtio_ring: Support DMA APIs")
Signed-off-by: Halil Pasic 
Tested-by: Michael Mueller 
---


[..]


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Tom Lendacky

On 10/12/2017 10:34 AM, Thomas Garnier wrote:

On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky <thomas.lenda...@amd.com> wrote:

On 10/11/2017 3:30 PM, Thomas Garnier wrote:

Changes:
   - patch v1:
 - Simplify ftrace implementation.
 - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
   - rfc v3:
 - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
   mapped memory. It also simplifies the relocation process.
 - Move the start the module section next to the kernel. Remove the need for
   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
 - Support for XEN PVH as 32-bit relocations can be ignored with
   --emit-relocs.
 - Support for GOT relocations previously done automatically with -pie.
 - Remove need for dynamic PLT in modules.
 - Support dymamic GOT for modules.
   - rfc v2:
 - Add support for global stack cookie while compiler default to fs without
   mcmodel=kernel
 - Change patch 7 to correctly jump out of the identity mapping on kexec 
load
   preserve.

These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. It allows to optionally extend the
KASLR randomization range from 1G to 3G.


Hi Thomas,

I've applied your patches so that I can verify that SME works with PIE.
Unfortunately, I'm running into build warnings and errors when I enable
PIE.

With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:

   drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: 
call without frame pointer save/setup

Disabling CONFIG_STACK_VALIDATION suppresses those.


I ran into that, I plan to fix it in the next iteration.



But near the end of the build, I receive errors like this:

   arch/x86/kernel/setup.o: In function `dump_kernel_offset':
   .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: 
R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1
   .
   . about 10 more of the above type messages
   .
   make: *** [vmlinux] Error 1
   Error building kernel, exiting

Are there any config options that should or should not be enabled when
building with PIE enabled?  Is there a compiler requirement for PIE (I'm
using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?


I never ran into these ones and I tested compilers older and newer.
What was your exact configuration?


I'll send you the config in a separate email.

Thanks,
Tom





Thanks,
Tom



Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
feedback for using -pie versus --emit-relocs and details on compiler code
generation.

The patches:
   - 1-3, 5-13, 17-18: Change in assembly code to be PIE compliant.
   - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
   - 14: Adapt percpu design to work correctly when PIE is enabled.
   - 15: Provide an option to default visibility to hidden except for key 
symbols.
 It removes errors between compilation units.
   - 16: Adapt relocation tool to handle PIE binary correctly.
   - 19: Add support for global cookie.
   - 20: Support ftrace with PIE (used on Ubuntu config).
   - 21: Fix incorrect address marker on dump_pagetables.
   - 22: Add option to move the module section just after the kernel.
   - 23: Adapt module loading to support PIE with dynamic GOT.
   - 24: Make the GOT read-only.
   - 25: Add the CONFIG_X86_PIE option (off by default).
   - 26: Adapt relocation tool to generate a 64-bit relocation table.
   - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
 from 1G to 3G (off by default).

Performance/Size impact:

Size of vmlinux (Default configuration):
   File size:
   - PIE disabled: +0.31%
   - PIE enabled: -3.210% (less relocations)
   .text section:
   - PIE disabled: +0.000644%
   - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
   File size:
   - PIE disabled: -0.201%
   - PIE enabled: -0.082%
   .text section:
   - PIE disabled: same
   - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
   File size:
   - PIE enabled: -3.167%
   .text section:
   - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
   File size:
   - PIE enabled: -3.167%
   .text section:
   - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide a better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
   - PIE disabled: no significant change (avg +0.1% on latest test).
   - PIE enabled: between -0.50% to +0.86% in average (default 

Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Tom Lendacky
On 10/11/2017 3:30 PM, Thomas Garnier wrote:
> Changes:
>   - patch v1:
> - Simplify ftrace implementation.
> - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
>   - rfc v3:
> - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
>   mapped memory. It also simplifies the relocation process.
> - Move the start the module section next to the kernel. Remove the need 
> for
>   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
> - Support for XEN PVH as 32-bit relocations can be ignored with
>   --emit-relocs.
> - Support for GOT relocations previously done automatically with -pie.
> - Remove need for dynamic PLT in modules.
> - Support dymamic GOT for modules.
>   - rfc v2:
> - Add support for global stack cookie while compiler default to fs without
>   mcmodel=kernel
> - Change patch 7 to correctly jump out of the identity mapping on kexec 
> load
>   preserve.
> 
> These patches make the changes necessary to build the kernel as Position
> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
> the top 2G of the virtual address space. It allows to optionally extend the
> KASLR randomization range from 1G to 3G.

Hi Thomas,

I've applied your patches so that I can verify that SME works with PIE.
Unfortunately, I'm running into build warnings and errors when I enable
PIE.

With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:

  drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: 
call without frame pointer save/setup

Disabling CONFIG_STACK_VALIDATION suppresses those.

But near the end of the build, I receive errors like this:

  arch/x86/kernel/setup.o: In function `dump_kernel_offset':
  .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: 
R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1
  .
  . about 10 more of the above type messages
  .
  make: *** [vmlinux] Error 1
  Error building kernel, exiting

Are there any config options that should or should not be enabled when
building with PIE enabled?  Is there a compiler requirement for PIE (I'm
using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?

Thanks,
Tom

> 
> Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
> changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
> feedback for using -pie versus --emit-relocs and details on compiler code
> generation.
> 
> The patches:
>   - 1-3, 5-13, 17-18: Change in assembly code to be PIE compliant.
>   - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
>   - 14: Adapt percpu design to work correctly when PIE is enabled.
>   - 15: Provide an option to default visibility to hidden except for key 
> symbols.
> It removes errors between compilation units.
>   - 16: Adapt relocation tool to handle PIE binary correctly.
>   - 19: Add support for global cookie.
>   - 20: Support ftrace with PIE (used on Ubuntu config).
>   - 21: Fix incorrect address marker on dump_pagetables.
>   - 22: Add option to move the module section just after the kernel.
>   - 23: Adapt module loading to support PIE with dynamic GOT.
>   - 24: Make the GOT read-only.
>   - 25: Add the CONFIG_X86_PIE option (off by default).
>   - 26: Adapt relocation tool to generate a 64-bit relocation table.
>   - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation 
> range
> from 1G to 3G (off by default).
> 
> Performance/Size impact:
> 
> Size of vmlinux (Default configuration):
>   File size:
>   - PIE disabled: +0.31%
>   - PIE enabled: -3.210% (less relocations)
>   .text section:
>   - PIE disabled: +0.000644%
>   - PIE enabled: +0.837%
> 
> Size of vmlinux (Ubuntu configuration):
>   File size:
>   - PIE disabled: -0.201%
>   - PIE enabled: -0.082%
>   .text section:
>   - PIE disabled: same
>   - PIE enabled: +1.319%
> 
> Size of vmlinux (Default configuration + ORC):
>   File size:
>   - PIE enabled: -3.167%
>   .text section:
>   - PIE enabled: +0.814%
> 
> Size of vmlinux (Ubuntu configuration + ORC):
>   File size:
>   - PIE enabled: -3.167%
>   .text section:
>   - PIE enabled: +1.26%
> 
> The size increase is mainly due to not having access to the 32-bit signed
> relocation that can be used with mcmodel=kernel. A small part is due to 
> reduced
> optimization for PIE code. This bug [1] was opened with gcc to provide a 
> better
> code generation for kernel PIE.
> 
> Hackbench (50% and 1600% on thread/process for pipe/sockets):
>   - PIE disabled: no significant change (avg +0.1% on latest test).
>   - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu 
> config).
> 
> slab_test (average of 10 runs):
>   - PIE disabled: no significant change (-2% on latest run, likely noise).
>   - PIE enabled: between -1% and +0.8% on latest runs.
> 
> Kernbench (average of 10 Half and Optimal runs):
>   

Re: RFT: virtio_net: limit xmit polling

2011-06-21 Thread Tom Lendacky
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
 OK, different people seem to test different trees.  In the hope to get
 everyone on the same page, I created several variants of this patch so
 they can be compared. Whoever's interested, please check out the
 following, and tell me how these compare:

I'm in the process of testing these patches.  Base and v0 are complete
and v1 is near complete with v2 to follow.  I'm testing with a variety
of TCP_RR and TCP_STREAM/TCP_MAERTS tests involving local guest-to-guest
tests and remote host-to-guest tests.  I'll post the results in the next
day or two when the tests finish.

Thanks,
Tom

 
 kernel:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
 
 virtio-net-limit-xmit-polling/base - this is net-next baseline to test
 against virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
 virtio-net-limit-xmit-polling/v1 - previous revision of the patch
   this does xmit,free,xmit,2*free,free
 virtio-net-limit-xmit-polling/v2 - new revision of the patch
   this does free,xmit,2*free,free
 
 There's also this on top:
 virtio-net-limit-xmit-polling/v3 - don't delay avail index update
 I don't think it's important to test this one, yet
 
 Userspace to use: event index work is not yet merged upstream
 so the revision to use is still this:
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git
 virtio-net-event-idx-v3
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-17 Thread Tom Lendacky

On Monday, May 16, 2011 02:12:21 AM Rusty Russell wrote:
 On Sun, 15 May 2011 16:55:41 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
  On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
   On Wed, 4 May 2011 23:51:47 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
Use the new avail_event feature to reduce the number
of exits from the guest.
   
   Figures here would be nice :)
  
  You mean ASCII art in comments?
 
 I mean benchmarks of some kind.

I'm working on getting some benchmark results for the patches.  I should 
hopefully have something in the next day or two.

Tom
 
@@ -228,6 +237,12 @@ add_head:
 * new available array entries. */

virtio_wmb();
vq->vring.avail->idx++;

+   /* If the driver never bothers to kick in a very long while,
+* avail index might wrap around. If that happens, invalidate
+* kicked_avail index we stored. TODO: make sure all drivers
+* kick at least once in 2^16 and remove this. */
+   if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
+   vq->kicked_avail_valid = true;
   
   If they don't, they're already buggy.  Simply do:
   WARN_ON(vq->vring.avail->idx == vq->kicked_avail);
  
  Hmm, but does it say that somewhere?
 
 AFAICT it's a corollary of:
 1) You have a finite ring of size = 2^16.
 2) You need to kick the other side once you've done some work.
 
@@ -482,6 +517,8 @@ void vring_transport_features(struct
virtio_device *vdev)

break;

case VIRTIO_RING_F_USED_EVENT_IDX:
break;

+   case VIRTIO_RING_F_AVAIL_EVENT_IDX:
+   break;

default:
/* We don't understand this bit. */
clear_bit(i, vdev->features);
   
   Does this belong in a prior patch?
   
   Thanks,
   Rusty.
  
  Well if we don't support the feature in the ring we should not
  ack the feature, right?
 
 Ah, you're right.
 
 Thanks,
 Rusty.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 05/18] virtio: used event index interface

2011-05-05 Thread Tom Lendacky
On Wednesday, May 04, 2011 03:51:09 PM Michael S. Tsirkin wrote:
 Define a new feature bit for the guest to utilize a used_event index
 (like Xen) instead if a flag bit to enable/disable interrupts.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  include/linux/virtio_ring.h |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
 index e4d144b..f5c1b75 100644
 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -29,6 +29,10 @@
  /* We support indirect buffer descriptors */
  #define VIRTIO_RING_F_INDIRECT_DESC  28
 
 +/* The Guest publishes the used index for which it expects an interrupt
 + * at the end of the avail ring. Host should ignore the avail->flags field. */
 +#define VIRTIO_RING_F_USED_EVENT_IDX   29
 +
  /* Virtio ring descriptors: 16 bytes.  These can chain together via
 next. */ struct vring_desc {
   /* Address (guest-physical). */
 @@ -83,6 +87,7 @@ struct vring {
   *   __u16 avail_flags;
   *   __u16 avail_idx;
   *   __u16 available[num];
 + *   __u16 used_event_idx;
   *
   *   // Padding to the next align boundary.
   *   char pad[];
 @@ -93,6 +98,10 @@ struct vring {
   *   struct vring_used_elem used[num];
   * };
   */
 +/* We publish the used event index at the end of the available ring.
 + * It is at the end for backwards compatibility. */
 +#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
 +
  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 unsigned long align)
  {

You should update the vring_size procedure to account for the extra field at 
the end of the available ring by changing the (2 + num) to (3 + num):
return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
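
That is, the adjusted helper would read (sketch; the used-ring term would get
the same treatment once an avail_event field is added at the end of the used
ring by the companion patch):

static inline unsigned vring_size(unsigned int num, unsigned long align)
{
	/* avail ring now carries flags, idx, num entries and used_event */
	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
		 + align - 1) & ~(align - 1))
		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
}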

Tom
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-05 Thread Tom Lendacky

On Wednesday, May 04, 2011 03:51:47 PM Michael S. Tsirkin wrote:
 Use the new avail_event feature to reduce the number
 of exits from the guest.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/virtio/virtio_ring.c |   39
 ++- 1 files changed, 38 insertions(+),
 1 deletions(-)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index 3a3ed75..262dfe6 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -82,6 +82,15 @@ struct vring_virtqueue
   /* Host supports indirect buffers */
   bool indirect;
 
 + /* Host publishes avail event idx */
 + bool event;
 +
 + /* Is kicked_avail below valid? */
 + bool kicked_avail_valid;
 +
 + /* avail idx value we already kicked. */
 + u16 kicked_avail;
 +
   /* Number of free buffers */
   unsigned int num_free;
   /* Head of free buffer list. */
 @@ -228,6 +237,12 @@ add_head:
* new available array entries. */
   virtio_wmb();
 	vq->vring.avail->idx++;
 + /* If the driver never bothers to kick in a very long while,
 +  * avail index might wrap around. If that happens, invalidate
 +  * kicked_avail index we stored. TODO: make sure all drivers
 +  * kick at least once in 2^16 and remove this. */
 + if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
 + vq->kicked_avail_valid = true;

vq->kicked_avail_valid should be set to false here.
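
In other words, the corrected hunk would be (sketch):

	/* If the driver never bothers to kick in a very long while,
	 * avail index might wrap around. If that happens, the stored
	 * kicked_avail no longer identifies a real kick: invalidate it. */
	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
		vq->kicked_avail_valid = false;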

Tom

 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
   END_USE(vq);
 @@ -236,6 +251,23 @@ add_head:
  }
  EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
 +
 +static bool vring_notify(struct vring_virtqueue *vq)
 +{
 + u16 old, new;
 + bool v;
 + if (!vq->event)
 + return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
 +
 + v = vq->kicked_avail_valid;
 + old = vq->kicked_avail;
 + new = vq->kicked_avail = vq->vring.avail->idx;
 + vq->kicked_avail_valid = true;
 + if (unlikely(!v))
 + return true;
 + return vring_need_event(vring_avail_event(&vq->vring), new, old);
 +}
 +
  void virtqueue_kick(struct virtqueue *_vq)
  {
   struct vring_virtqueue *vq = to_vvq(_vq);
 @@ -244,7 +276,7 @@ void virtqueue_kick(struct virtqueue *_vq)
   /* Need to update avail index before checking if we should notify */
   virtio_mb();
 
 - if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
 + if (vring_notify(vq))
   /* Prod other side to tell it about changes. */
 	vq->notify(&vq->vq);
 
 @@ -437,6 +469,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
 + vq->kicked_avail_valid = false;
 + vq->kicked_avail = 0;
 	vq->last_used_idx = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
  #ifdef DEBUG
 @@ -444,6 +478,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
  #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 + vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_AVAIL_EVENT_IDX);
 
   /* No callback?  Tell other side not to bother us. */
   if (!callback)
 @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device
 *vdev) break;
   case VIRTIO_RING_F_USED_EVENT_IDX:
   break;
 + case VIRTIO_RING_F_AVAIL_EVENT_IDX:
 + break;
   default:
   /* We don't understand this bit. */
 	clear_bit(i, vdev->features);
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization