Re: [PATCH] net: move from strlcpy with unused retval to strscpy
On 8/18/22 16:00, Wolfram Sang wrote:

Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=v6a6g1ouzcprm...@mail.gmail.com/
Signed-off-by: Wolfram Sang
---

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
index 6ceb1cdf6eba..6e83ff59172a 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
@@ -402,8 +402,8 @@ static void xgbe_get_drvinfo(struct net_device *netdev,
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	struct xgbe_hw_features *hw_feat = &pdata->hw_feat;

-	strlcpy(drvinfo->driver, XGBE_DRV_NAME, sizeof(drvinfo->driver));
-	strlcpy(drvinfo->bus_info, dev_name(pdata->dev),
+	strscpy(drvinfo->driver, XGBE_DRV_NAME, sizeof(drvinfo->driver));
+	strscpy(drvinfo->bus_info, dev_name(pdata->dev),
 		sizeof(drvinfo->bus_info));
 	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "%d.%d.%d",
 		 XGMAC_GET_BITS(hw_feat->version, MAC_VR, USERVER),

For drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c

Acked-by: Tom Lendacky

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 00/10] x86/sev: KEXEC/KDUMP support for SEV-ES guests
On 4/29/22 04:06, Tao Liu wrote:
On Thu, Jan 27, 2022 at 11:10:34AM +0100, Joerg Roedel wrote:

Hi Joerg,

I tried the patch set with the 5.17.0-rc1 kernel, and I have a few questions:

1) Is it a bug, or should qemu-kvm 6.2.0 be patched with a specific patch? Because I found it will exit with 0 when I tried to reboot the VM with sev-es enabled. However, with only sev enabled, the VM can reboot with no problem:

Qemu was specifically patched to exit on reboot with SEV-ES guests. Qemu performs a reboot by resetting the vCPU state, which can't be done with an SEV-ES guest because the vCPU state is encrypted.

[root@dell-per7525-03 ~]# virsh start TW-SEV-ES --console
Fedora Linux 35 (Server Edition)
Kernel 5.17.0-rc1 on an x86_64 (ttyS0)
[root@fedora ~]# reboot
...
[   48.077682] reboot: Restarting system
[   48.078109] reboot: machine restart
^^^ guest vm reached restart
[root@dell-per7525-03 ~]# echo $?
0
^^^ qemu-kvm exits with 0, no reboot back to a normal VM kernel
[root@dell-per7525-03 ~]#

2) With sev-es enabled and the 2 patch sets applied: A) [PATCH v3 00/10] x86/sev: KEXEC/KDUMP support for SEV-ES guests, and B) [PATCH v6 0/7] KVM: SVM: Add initial GHCB protocol version 2 support, I can enable kdump and have a vmcore generated:

[root@fedora ~]# dmesg|grep -i sev
[0.030600] SEV: Hypervisor GHCB protocol version support: min=1 max=2
[0.030602] SEV: Using GHCB protocol version 2
[0.296144] AMD Memory Encryption Features active: SEV SEV-ES
[0.450991] SEV: AP jump table Blob successfully set up
[root@fedora ~]# kdumpctl status
kdump: Kdump is operational

However, without the 2 patch sets, I can also enable kdump and have a vmcore generated:

[root@fedora ~]# dmesg|grep -i sev
[0.295754] AMD Memory Encryption Features active: SEV SEV-ES
^ patch sets A & B not applied, so only this line appears
[root@fedora ~]# echo c > /proc/sysrq-trigger
...
[2.759403] kdump[549]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2022-04-18-05:58:50/
[2.804355] kdump[555]: saving vmcore-dmesg.txt complete
[2.806915] kdump[557]: saving vmcore
^ vmcore can still be generated
...
[7.068981] reboot: Restarting system
[7.069340] reboot: machine restart
[root@dell-per7525-03 ~]# echo $?
0
^^^ same exit issue as question 1

I don't have a complete technical background on the patch set, but isn't this the issue this patch set is trying to solve? Or did I miss something?

The main goal of this patch set is really to enable performing a kexec. I would expect kdump to work, since kdump shuts down all but the executing vCPU and performs its operations before "rebooting" (which will exit Qemu, as I mentioned above). But kexec requires restarting the APs from within the guest after they have been stopped. That requires specific support and actions on the part of the guest kernel in how the APs are stopped and restarted.

Thanks,
Tom

Thanks,
Tao Liu
Re: [PATCH 2/2] virtio: replace arch_has_restricted_virtio_memory_access()
On 4/27/22 07:37, Juergen Gross wrote:
On 27.04.22 14:28, Borislav Petkov wrote:
On Wed, Apr 27, 2022 at 08:37:31AM +0200, Juergen Gross wrote:
On 26.04.22 19:35, Borislav Petkov wrote:
On Tue, Apr 26, 2022 at 03:40:21PM +0200, Juergen Gross wrote:

 /* protected virtualization */
 static void pv_init(void)
 {
 	if (!is_prot_virt_guest())
 		return;
+	platform_set_feature(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);

Kinda long-ish for my taste. I'll probably call it platform_set(), as it is implicit that it sets a feature bit.

Okay, fine with me.

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index b43bc24d2bb6..6043ba6cd17d 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -566,6 +567,10 @@ void __init sme_enable(struct boot_params *bp)
 	} else {
 		/* SEV state cannot be controlled by a command line option */
 		sme_me_mask = me_mask;
+
+		/* Set restricted memory access for virtio. */
+		platform_set_feature(PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS);

This is way early in the boot, but it appears that marking the platform feature bitmap as __read_mostly puts it in the .data section, so it avoids the issue of the BSS being cleared.

TDX support also uses the arch_has_restricted_virtio_memory_access() function and will need to be updated. Seems like a lot of changes; I just wonder if the arch_has...() function couldn't be updated to also include a Xen check?

Thanks,
Tom

Huh, what does that have to do with SME?

I picked the function where sev_status is being set, as this seemed to be the correct place to set the feature bit.

What I don't understand is what restricted memory access has to do with AMD SEV, and how it plays together with what you guys are trying to do? The big picture pls.

Ah, okay.
For virtio support with Xen we want to not only support virtio devices as KVM does, but also to use grants for letting the guest decide which pages are allowed to be mapped by the backend (dom0). Instead of physical guest addresses, the guest will use grant-ids (plus an offset).

In order to be able to handle this at the basic virtio level instead of in the individual virtio device drivers, we need to use dedicated dma-ops. And those will be used by virtio only if the "restricted virtio memory request" flag is set, which is used by SEV, too. In order to let virtio set this flag, we need a way to communicate to virtio that the running system is either a SEV guest or a Xen guest.

HTH,
Juergen
Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest
On 10/20/21 11:50 AM, Sathyanarayanan Kuppuswamy wrote:
On 10/20/21 9:39 AM, Tom Lendacky wrote:
On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov"

Intel TDX doesn't allow the VMM to directly access guest private memory. Any memory that is required for communication with the VMM must be shared explicitly. The same rule applies for any DMA to and from a TDX guest. All DMA pages have to be marked as shared pages. A generic way to achieve this without any changes to device drivers is to use the SWIOTLB framework. This method of handling is similar to AMD SEV, so extend this support for TDX guests as well. Also, since there is some common code between AMD SEV and TDX guests in mem_encrypt_init(), move it to mem_encrypt_common.c and call the AMD-specific init function from it.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kuppuswamy Sathyanarayanan
---
Changes since v4:
* Replaced prot_guest_has() with cc_guest_has().
Changes since v3:
* Rebased on top of Tom Lendacky's protected guest changes (https://lore.kernel.org/patchwork/cover/1468760/)
Changes since v1:
* Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().
arch/x86/include/asm/mem_encrypt_common.h | 3 +++
arch/x86/kernel/tdx.c | 2 ++
arch/x86/mm/mem_encrypt.c | 5 +
arch/x86/mm/mem_encrypt_common.c | 14 ++
4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..bc90e565bce4 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,14 @@
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
 #else /* CONFIG_AMD_MEM_ENCRYPT */
 static inline bool amd_force_dma_unencrypted(struct device *dev)
 {
 	return false;
 }
+
+static inline void amd_mem_encrypt_init(void) {}
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 #endif

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 433f366ca25c..ce8e3019b812 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include	/* force_sig_fault() */
+#include

 /* TDX Module call Leaf IDs */
 #define TDX_GET_INFO 1
@@ -577,6 +578,7 @@ void __init tdx_early_init(void)
 	pv_ops.irq.halt = tdx_halt;
 	legacy_pic = &null_legacy_pic;
+	swiotlb_force = SWIOTLB_FORCE;
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug", NULL, tdx_cpu_offline_prepare);

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5d7fbed73949..8385bc4565e9 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
 }

 /* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
 {
 	if (!sme_me_mask)
 		return;

-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
 	/*
 	 * With SEV, we need to unroll the rep string I/O instructions,
 	 * but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 119a9056efbb..6fe44c6cb753 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include

 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
 	return false;
 }
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	/*
+	 * For TDX guest or SEV/SME, call into SWIOTLB to update
+	 * the SWIOTLB DMA buffers
+	 */
+	if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))

Can't you just make this:

	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))

SEV will return true if sme_me_mask is not zero, and TDX should only return true if it is a TDX guest, right?

Yes. It can be simplified. But where shall we leave this function, cc_platform.c or here?

Either one works... it all depends on how the maintainers feel about creating/using mem_encrypt_common.c or using cc_platform.c.

Thanks,
Tom

Thanks,
Tom

+	swiotlb_update_mem_attributes();
+
+	amd_mem_encrypt_init();
+}
Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared
On 10/20/21 11:45 AM, Sathyanarayanan Kuppuswamy wrote:
On 10/20/21 9:33 AM, Tom Lendacky wrote:
On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
...

 bool force_dma_unencrypted(struct device *dev)
 {
-	return amd_force_dma_unencrypted(dev);
+	if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
+	    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		return true;
+
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+	    cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+		return amd_force_dma_unencrypted(dev);
+
+	return false;

Assuming the original force_dma_unencrypted() function was moved here or to cc_platform.c, then you shouldn't need any changes. Both SEV and TDX require true to be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.

For the non-TDX case, in CC_ATTR_HOST_MEM_ENCRYPT, we should still call amd_force_dma_unencrypted(), right?

What I'm saying is that you wouldn't have amd_force_dma_unencrypted(). I think the whole force_dma_unencrypted() can exist as-is in a different file, whether that's cc_platform.c or mem_encrypt_common.c. It will return true for an SEV or TDX guest, true for an SME host based on the DMA mask, or else false. That should work just fine for TDX.
Thanks,
Tom

 }

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 527957586f3c..6c531d5cb5fd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include

 #include "../mm_internal.h"
@@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int numpages)
 		__pgprot(_PAGE_GLOBAL), 0);
 }

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
+	enum tdx_map_type map_type;
 	struct cpa_data cpa;
 	int ret;
@@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = pgprot_cc_shared_mask();

How about having generic versions for both shared and private that return the proper value for SEV or TDX. Then this remains looking similar to how it does now, just replacing the __pgprot() calls with the appropriate pgprot_cc_{shared,private}_mask().

Makes sense.

Thanks,
Tom

+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;

 	/* Must avoid aliasing mappings in the highmem code */
@@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	vm_unmap_aliases();

 	/*
-	 * Before changing the encryption attribute, we need to flush caches.
+	 * Before changing the encryption attribute, flush caches.
+	 *
+	 * For TDX, guest is responsible for flushing caches on private->shared
+	 * transition. VMM is responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}

 	ret = __change_page_attr_set_clr(&cpa, 1);
@@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);

+	if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+		ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
 	return ret;
 }

 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);

 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest
On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov"

Intel TDX doesn't allow the VMM to directly access guest private memory. Any memory that is required for communication with the VMM must be shared explicitly. The same rule applies for any DMA to and from a TDX guest. All DMA pages have to be marked as shared pages. A generic way to achieve this without any changes to device drivers is to use the SWIOTLB framework. This method of handling is similar to AMD SEV, so extend this support for TDX guests as well. Also, since there is some common code between AMD SEV and TDX guests in mem_encrypt_init(), move it to mem_encrypt_common.c and call the AMD-specific init function from it.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kuppuswamy Sathyanarayanan
---
Changes since v4:
* Replaced prot_guest_has() with cc_guest_has().
Changes since v3:
* Rebased on top of Tom Lendacky's protected guest changes (https://lore.kernel.org/patchwork/cover/1468760/)
Changes since v1:
* Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().
arch/x86/include/asm/mem_encrypt_common.h | 3 +++
arch/x86/kernel/tdx.c | 2 ++
arch/x86/mm/mem_encrypt.c | 5 +
arch/x86/mm/mem_encrypt_common.c | 14 ++
4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..bc90e565bce4 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,14 @@
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
 #else /* CONFIG_AMD_MEM_ENCRYPT */
 static inline bool amd_force_dma_unencrypted(struct device *dev)
 {
 	return false;
 }
+
+static inline void amd_mem_encrypt_init(void) {}
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 #endif

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 433f366ca25c..ce8e3019b812 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include	/* force_sig_fault() */
+#include

 /* TDX Module call Leaf IDs */
 #define TDX_GET_INFO 1
@@ -577,6 +578,7 @@ void __init tdx_early_init(void)
 	pv_ops.irq.halt = tdx_halt;
 	legacy_pic = &null_legacy_pic;
+	swiotlb_force = SWIOTLB_FORCE;
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug", NULL, tdx_cpu_offline_prepare);

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5d7fbed73949..8385bc4565e9 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
 }

 /* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
 {
 	if (!sme_me_mask)
 		return;

-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
 	/*
 	 * With SEV, we need to unroll the rep string I/O instructions,
 	 * but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 119a9056efbb..6fe44c6cb753 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include

 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
 	return false;
 }
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	/*
+	 * For TDX guest or SEV/SME, call into SWIOTLB to update
+	 * the SWIOTLB DMA buffers
+	 */
+	if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))

Can't you just make this:

	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))

SEV will return true if sme_me_mask is not zero, and TDX should only return true if it is a TDX guest, right?

Thanks,
Tom

+	swiotlb_update_mem_attributes();
+
+	amd_mem_encrypt_init();
+}
Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared
On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov"

Just like MKTME, TDX reassigns bits of the physical address for metadata. MKTME used several bits for an encryption KeyID. TDX uses a single bit in guests to communicate whether a physical page should be protected by TDX as private memory (bit set to 0) or unprotected and shared with the VMM (bit set to 1).

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit accordingly, following up with the relevant TDX hypercall.

Also, do TDX_ACCEPT_PAGE on every 4k page after mapping the GPA range when converting memory to private. The 4k page size limit is due to a current TDX spec restriction.

Also, if the GPA (range) was already mapped as an active, private page, the host VMM may remove the private page from the TD by following the "Removing TD Private Pages" sequence in the Intel TDX-module specification [1] to safely block the mapping(s), flush the TLB and cache, and remove the mapping(s).

BUG() if TDX_ACCEPT_PAGE fails (except the "previously accepted page" case), as the guest is completely hosed if it can't access memory.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Tested-by: Kai Huang
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Sean Christopherson
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kuppuswamy Sathyanarayanan
...
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index f063c885b0a5..119a9056efbb 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,9 +9,18 @@
 #include
 #include
+#include

 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
-	return amd_force_dma_unencrypted(dev);
+	if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
+	    cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		return true;
+
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+	    cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+		return amd_force_dma_unencrypted(dev);
+
+	return false;

Assuming the original force_dma_unencrypted() function was moved here or to cc_platform.c, then you shouldn't need any changes. Both SEV and TDX require true to be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.

 }

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 527957586f3c..6c531d5cb5fd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include

 #include "../mm_internal.h"
@@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int numpages)
 		__pgprot(_PAGE_GLOBAL), 0);
 }

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
+	enum tdx_map_type map_type;
 	struct cpa_data cpa;
 	int ret;
@@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ?
__pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = pgprot_cc_shared_mask();

How about having generic versions for both shared and private that return the proper value for SEV or TDX. Then this remains looking similar to how it does now, just replacing the __pgprot() calls with the appropriate pgprot_cc_{shared,private}_mask().

Thanks,
Tom

+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;

 	/* Must avoid aliasing mappings in the highmem code */
@@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long addr,
Re: [PATCH v5 01/16] x86/mm: Move force_dma_unencrypted() to common code
On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov"

Intel TDX doesn't allow the VMM to access guest private memory. Any memory that is required for communication with the VMM must be shared explicitly by setting the bit in the page table entry. After setting the shared bit, the conversion must be completed with the MapGPA hypercall. Details about the MapGPA hypercall can be found in [1], sec 3.2. The call informs the VMM about the conversion between private/shared mappings.

The shared memory is similar to unencrypted memory in AMD SME/SEV terminology, but the underlying process of sharing/un-sharing the memory is different for the Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted" physical addresses without the C-bit set. In order for the CPU to interact with this memory, the CPU needs a decrypted mapping. To add this support, AMD SME code forces force_dma_unencrypted() to return true for platforms that support the AMD SEV feature. It will be used by the DMA memory allocation API to trigger set_memory_decrypted() for platforms that support the AMD SEV feature.

TDX is similar. So, to communicate with I/O devices, related pages need to be marked as shared. As mentioned above, shared memory in the TDX architecture is similar to decrypted memory in AMD SME/SEV. So, similar to AMD SEV, force_dma_unencrypted() has to be forced to return true. This support is added in other patches in this series.

So move force_dma_unencrypted() out of AMD-specific code and call the AMD-specific (amd_force_dma_unencrypted()) initialization function from it. force_dma_unencrypted() will be modified by later patches to include Intel TDX guest platform-specific initialization.

Also, introduce a new config option X86_MEM_ENCRYPT_COMMON that has to be selected by all x86 memory encryption features. This will be selected by both the AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in DMA code and it has no functional change.
Can force_dma_unencrypted() be moved to arch/x86/kernel/cc_platform.c, instead of creating a new file? It might fit better with patch #6.

Thanks,
Tom

[1] - https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kuppuswamy Sathyanarayanan
---
Changes since v4:
* Removed used we/you from commit log.
Change since v3:
* None
Changes since v1:
* Removed sev_active(), sme_active() checks in force_dma_unencrypted().

arch/x86/Kconfig | 8 ++--
arch/x86/include/asm/mem_encrypt_common.h | 18 ++
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 3 ++-
arch/x86/mm/mem_encrypt_common.c | 17 +
5 files changed, 45 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af49ad084919..37b27412f52e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1519,16 +1519,20 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
 	select ARCH_HAS_CC_PLATFORM
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
	  This requires an AMD processor that supports Secure Memory

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index ..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include
+#include
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+	return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@
Re: [PATCH v5 04/16] x86/tdx: Make pages shared in ioremap()
On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:

From: "Kirill A. Shutemov"

All ioremap()ed pages that are not backed by normal memory (NONE or RESERVED) have to be mapped as shared. Reuse the infrastructure from AMD SEV code.

Note that DMA code doesn't use ioremap() to convert memory to shared, as DMA buffers are backed by normal memory. DMA code makes buffers shared with set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Kuppuswamy Sathyanarayanan
---
Changes since v4:
* Renamed "protected_guest" to "cc_guest".
* Replaced use of prot_guest_has() with cc_guest_has()
* Modified the patch to adapt to latest version cc_guest_has() changes.
Changes since v3:
* Rebased on top of Tom Lendacky's protected guest changes (https://lore.kernel.org/patchwork/cover/1468760/)
Changes since v1:
* Fixed format issues in commit log.

arch/x86/include/asm/pgtable.h | 4
arch/x86/mm/ioremap.c | 8 ++--
include/linux/cc_platform.h | 13 +
3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..ecefccbdf2e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
 #define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))

+/* Make the page accessible by VMM for confidential guests */
+#define pgprot_cc_guest(prot)	__pgprot(pgprot_val(prot) | \
+					 tdx_shared_mask())

So this is only for TDX guests, so maybe a name that is less generic. Alternatively, you could create pgprot_private()/pgprot_shared() functions that check for SME/SEV or TDX and do the proper thing. Then you can redefine pgprot_encrypted()/pgprot_decrypted() to the new functions?
+ #ifndef __ASSEMBLY__ #include #include diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 026031b3b782..83daa3f8f39c 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include "physaddr.h" @@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct resource *res) } /* - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because - * there the whole memory is already encrypted. + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or + * private in TDX case) because there the whole memory is already encrypted. */ static unsigned int __ioremap_check_encrypted(struct resource *res) { @@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size, prot = PAGE_KERNEL_IO; if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted) prot = pgprot_encrypted(prot); + else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) + prot = pgprot_cc_guest(prot); And if you do the new functions this could be: if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted) prot = pgprot_private(prot); else prot = pgprot_shared(prot); Thanks, Tom switch (pcm) { case _PAGE_CACHE_MODE_UC: diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h index 7728574d7783..edb1d7a2f6af 100644 --- a/include/linux/cc_platform.h +++ b/include/linux/cc_platform.h @@ -81,6 +81,19 @@ enum cc_attr { * Examples include TDX Guest. */ CC_ATTR_GUEST_UNROLL_STRING_IO, + + /** +* @CC_ATTR_GUEST_SHARED_MAPPING_INIT: IO Remapped memory is marked +* as shared. +* +* The platform/OS is running as a guest/virtual machine and +* initializes all IO remapped memory as shared. +* +* Examples include TDX Guest (SEV marks all pages as shared by default +* so this feature cannot be enabled for it). 
+*/ + CC_ATTR_GUEST_SHARED_MAPPING_INIT, + }; #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
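Tom's pgprot_private()/pgprot_shared() suggestion can be sketched roughly as below. The helper names come from the review itself, but the bodies and the bit positions are hypothetical: in real kernels the SME C-bit and the TDX shared-bit positions are discovered at boot, not hardcoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only -- the real masks come from the CPU
 * (C-bit position from CPUID 0x8000001f, TDX shared bit from the TDX
 * module), not from fixed constants like these. */
#define SME_C_BIT	(1ULL << 47)
#define TDX_SHARED_BIT	(1ULL << 51)

/* One pair of helpers covers both technologies, so a caller such as
 * __ioremap_caller() needs no TDX-specific branch: "private" means
 * set the C-bit (SME/SEV) or clear the shared bit (TDX), and
 * "shared" is the inverse. */
static uint64_t pgprot_private_val(uint64_t prot, bool is_tdx)
{
	return is_tdx ? (prot & ~TDX_SHARED_BIT)	/* TDX: clear shared bit */
		      : (prot | SME_C_BIT);		/* SME/SEV: set C-bit */
}

static uint64_t pgprot_shared_val(uint64_t prot, bool is_tdx)
{
	return is_tdx ? (prot | TDX_SHARED_BIT)		/* TDX: set shared bit */
		      : (prot & ~SME_C_BIT);		/* SME/SEV: clear C-bit */
}
```

With helpers like these, the existing pgprot_encrypted()/pgprot_decrypted() macros could be redefined on top of them, as the review suggests.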
Re: [PATCH 04/12] x86/sev: Do not hardcode GHCB protocol version
On 7/21/21 9:20 AM, Joerg Roedel wrote: > From: Joerg Roedel > > Introduce the sev_get_ghcb_proto_ver() which will return the negotiated > GHCB protocol version and use it to set the version field in the GHCB. > > Signed-off-by: Joerg Roedel > --- > arch/x86/boot/compressed/sev.c | 5 + > arch/x86/kernel/sev-shared.c | 5 - > arch/x86/kernel/sev.c | 5 + > 3 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c > index 1a2e49730f8b..101e08c67296 100644 > --- a/arch/x86/boot/compressed/sev.c > +++ b/arch/x86/boot/compressed/sev.c > @@ -119,6 +119,11 @@ static enum es_result vc_read_mem(struct es_em_ctxt > *ctxt, > /* Include code for early handlers */ > #include "../../kernel/sev-shared.c" > > +static u64 sev_get_ghcb_proto_ver(void) > +{ > + return GHCB_PROTOCOL_MAX; > +} > + > static bool early_setup_sev_es(void) > { > if (!sev_es_negotiate_protocol(NULL)) > diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c > index 73eeb5897d16..36eaac2773ed 100644 > --- a/arch/x86/kernel/sev-shared.c > +++ b/arch/x86/kernel/sev-shared.c > @@ -28,6 +28,9 @@ struct sev_ghcb_protocol_info { > unsigned int vm_proto; > }; > > +/* Returns the negotiated GHCB Protocol version */ > +static u64 sev_get_ghcb_proto_ver(void); > + > static bool __init sev_es_check_cpu_features(void) > { > if (!has_cpuflag(X86_FEATURE_RDRAND)) { > @@ -122,7 +125,7 @@ static enum es_result sev_es_ghcb_hv_call(struct ghcb > *ghcb, > enum es_result ret; > > /* Fill in protocol and format specifiers */ > - ghcb->protocol_version = GHCB_PROTOCOL_MAX; > + ghcb->protocol_version = sev_get_ghcb_proto_ver(); So this probably needs better clarification in the spec, but the GHCB version field is for the GHCB structure layout. So if you don't plan to use the XSS field that was added for version 2 of the layout, then you should report the GHCB structure version as 1. 
Thanks, Tom > ghcb->ghcb_usage = GHCB_DEFAULT_USAGE; > > ghcb_set_sw_exit_code(ghcb, exit_code); > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c > index 8084bfd7cce1..5d3422e8b25e 100644 > --- a/arch/x86/kernel/sev.c > +++ b/arch/x86/kernel/sev.c > @@ -498,6 +498,11 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb > *ghcb, struct es_em_ctxt > /* Negotiated GHCB protocol version */ > static struct sev_ghcb_protocol_info ghcb_protocol_info __ro_after_init; > > +static u64 sev_get_ghcb_proto_ver(void) > +{ > + return ghcb_protocol_info.vm_proto; > +} > + > static noinstr void __sev_put_ghcb(struct ghcb_state *state) > { > struct sev_es_runtime_data *data; >
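The negotiation that produces the value returned by sev_get_ghcb_proto_ver() can be sketched as below. This is a hypothetical standalone version, not the kernel's code: the guest reads the hypervisor's supported [min, max] range via the GHCB MSR protocol and settles on the highest version both sides understand, which is what later lands in ghcb->protocol_version.

```c
#include <assert.h>

#define GHCB_PROTOCOL_MIN	1
#define GHCB_PROTOCOL_MAX	2

/* Pick the highest protocol version in the intersection of the
 * guest's [GHCB_PROTOCOL_MIN, GHCB_PROTOCOL_MAX] range and the
 * hypervisor's advertised [hv_min, hv_max] range. */
static int negotiate_ghcb_version(unsigned int hv_min, unsigned int hv_max,
				  unsigned int *vm_proto)
{
	if (hv_min > GHCB_PROTOCOL_MAX || hv_max < GHCB_PROTOCOL_MIN)
		return -1;	/* no common version: guest must terminate */

	*vm_proto = hv_max < GHCB_PROTOCOL_MAX ? hv_max : GHCB_PROTOCOL_MAX;
	return 0;
}
```

Note this is separate from Tom's point above: the negotiated *protocol* version and the GHCB *structure layout* version written into the GHCB need not be the same number.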
Re: [PATCH 2/3] x86/sev: Add defines for GHCB version 2 MSR protocol requests
On 6/23/21 4:32 AM, Borislav Petkov wrote: > On Wed, Jun 23, 2021 at 08:40:00AM +0200, Joerg Roedel wrote: >> From: Brijesh Singh >> > > Ok, so I took a critical look at this and it doesn't make sense to have > a differently named define each time you need the [63:12] slice of > GHCBData. So you can simply use GHCB_DATA(msr_value) instead, see below. > > Complaints? None from me. Thanks, Tom
Re: [PATCH 2/3] x86/sev: Add defines for GHCB version 2 MSR protocol requests
On 6/22/21 9:48 AM, Joerg Roedel wrote: > From: Brijesh Singh > > Add the necessary defines for supporting the GHCB version 2 protocol. > This includes defines for: > > - MSR-based AP hlt request/response > - Hypervisor Feature request/response > > This is the bare minimum of requests that need to be supported by a GHCB > version 2 implementation. There are more requests in the specification, > but those depend on Secure Nested Paging support being available. > > These defines are shared between SEV host and guest support, so they are > submitted as an individual patch without users yet to avoid merge > conflicts in the future. > > Signed-off-by: Brijesh Singh > Co-developed-by: Tom Lendacky > Signed-off-by: Tom Lendacky > Signed-off-by: Joerg Roedel > --- > arch/x86/include/asm/sev-common.h | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/arch/x86/include/asm/sev-common.h > b/arch/x86/include/asm/sev-common.h > index 1cc9e7dd8107..4e6c4c7cb294 100644 > --- a/arch/x86/include/asm/sev-common.h > +++ b/arch/x86/include/asm/sev-common.h > @@ -47,6 +47,21 @@ > (((unsigned long)reg & GHCB_MSR_CPUID_REG_MASK) << > GHCB_MSR_CPUID_REG_POS) | \ > (((unsigned long)fn) << GHCB_MSR_CPUID_FUNC_POS)) > > +/* AP Reset Hold */ > +#define GHCB_MSR_AP_RESET_HOLD_REQ 0x006 > +#define GHCB_MSR_AP_RESET_HOLD_RESP 0x007 > +#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS12 > +#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0) > + > +/* GHCB Hypervisor Feature Request/Response */ > +#define GHCB_MSR_HV_FT_REQ 0x080 > +#define GHCB_MSR_HV_FT_RESP 0x081 > +#define GHCB_MSR_HV_FT_POS 12 > +#define GHCB_MSR_HV_FT_MASK GENMASK_ULL(51, 0) > + > +#define GHCB_MSR_HV_FT_RESP_VAL(v) \ > + (((unsigned long)((v) & GHCB_MSR_HV_FT_MASK) >> GHCB_MSR_HV_FT_POS)) This should shift down first and then mask or else the mask should be from 12 to 63. 
Thanks, Tom > + > #define GHCB_MSR_TERM_REQ0x100 > #define GHCB_MSR_TERM_REASON_SET_POS 12 > #define GHCB_MSR_TERM_REASON_SET_MASK0xf >
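The mask/shift ordering bug Tom flags can be demonstrated with a small standalone sketch (GENMASK_ULL is re-created here for illustration; GHCB_MSR_HV_FT_POS is from the patch): masking bits [51:0] of the raw MSR value *before* shifting silently drops the top 12 bits of the GHCBData field.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

#define GHCB_MSR_HV_FT_POS	12

/* Form from the patch: mask [51:0] of the raw MSR value, then shift --
 * the top 12 bits of the 52-bit feature field are lost. */
static uint64_t hv_ft_resp_buggy(uint64_t msr)
{
	return (msr & GENMASK_ULL(51, 0)) >> GHCB_MSR_HV_FT_POS;
}

/* Form the review asks for: shift GHCBData down first, then mask the
 * 52-bit result (equivalently, mask with GENMASK_ULL(63, 12) before
 * shifting). */
static uint64_t hv_ft_resp_fixed(uint64_t msr)
{
	return (msr >> GHCB_MSR_HV_FT_POS) & GENMASK_ULL(51, 0);
}
```

With an MSR whose whole 52-bit data field is set, the buggy form returns only 40 bits of it, while the fixed form returns all 52.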
Re: [PATCH v3 2/8] x86/sev: Do not require Hypervisor CPUID bit for SEV guests
On 3/12/21 6:38 AM, Joerg Roedel wrote: > From: Joerg Roedel > > A malicious hypervisor could disable the CPUID intercept for an SEV or > SEV-ES guest and trick it into the no-SEV boot path, where it could > potentially reveal secrets. This is not an issue for SEV-SNP guests, > as the CPUID intercept can't be disabled for those. > > Remove the Hypervisor CPUID bit check from the SEV detection code to > protect against this kind of attack and add a Hypervisor bit equals > zero check to the SME detection path to prevent non-SEV guests from > trying to enable SME. > > This handles the following cases: > > 1) SEV(-ES) guest where CPUID intercept is disabled. The guest > will still see leaf 0x801f and the SEV bit. It can > retrieve the C-bit and boot normally. > > 2) Non-SEV guests with intercepted CPUID will check SEV_STATUS > MSR and find it 0 and will try to enable SME. This will > fail when the guest finds MSR_K8_SYSCFG to be zero, as it > is emulated by KVM. But we can't rely on that, as there > might be other hypervisors which return this MSR with bit > 23 set. The Hypervisor bit check will prevent that the > guest tries to enable SME in this case. > > 3) Non-SEV guests on SEV capable hosts with CPUID intercept > disabled (by a malicious hypervisor) will try to boot into > the SME path. This will fail, but it is also not considered > a problem because non-encrypted guests have no protection > against the hypervisor anyway. 
> > Signed-off-by: Joerg Roedel Acked-by: Tom Lendacky > --- > arch/x86/boot/compressed/mem_encrypt.S | 6 - > arch/x86/kernel/sev-es-shared.c| 6 + > arch/x86/mm/mem_encrypt_identity.c | 35 ++ > 3 files changed, 20 insertions(+), 27 deletions(-) > > diff --git a/arch/x86/boot/compressed/mem_encrypt.S > b/arch/x86/boot/compressed/mem_encrypt.S > index aa561795efd1..a6dea4e8a082 100644 > --- a/arch/x86/boot/compressed/mem_encrypt.S > +++ b/arch/x86/boot/compressed/mem_encrypt.S > @@ -23,12 +23,6 @@ SYM_FUNC_START(get_sev_encryption_bit) > push%ecx > push%edx > > - /* Check if running under a hypervisor */ > - movl$1, %eax > - cpuid > - bt $31, %ecx /* Check the hypervisor bit */ > - jnc .Lno_sev > - > movl$0x8000, %eax /* CPUID to check the highest leaf */ > cpuid > cmpl$0x801f, %eax /* See if 0x801f is available */ > diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c > index cdc04d091242..387b71669818 100644 > --- a/arch/x86/kernel/sev-es-shared.c > +++ b/arch/x86/kernel/sev-es-shared.c > @@ -186,7 +186,6 @@ void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned > long exit_code) >* make it accessible to the hypervisor. >* >* In particular, check for: > - * - Hypervisor CPUID bit >* - Availability of CPUID leaf 0x801f >* - SEV CPUID bit. >* > @@ -194,10 +193,7 @@ void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned > long exit_code) >* can't be checked here. 
>*/ > > - if ((fn == 1 && !(regs->cx & BIT(31 > - /* Hypervisor bit */ > - goto fail; > - else if (fn == 0x8000 && (regs->ax < 0x801f)) > + if (fn == 0x8000 && (regs->ax < 0x801f)) > /* SEV leaf check */ > goto fail; > else if ((fn == 0x801f && !(regs->ax & BIT(1 > diff --git a/arch/x86/mm/mem_encrypt_identity.c > b/arch/x86/mm/mem_encrypt_identity.c > index 6c5eb6f3f14f..a19374d26101 100644 > --- a/arch/x86/mm/mem_encrypt_identity.c > +++ b/arch/x86/mm/mem_encrypt_identity.c > @@ -503,14 +503,10 @@ void __init sme_enable(struct boot_params *bp) > > #define AMD_SME_BIT BIT(0) > #define AMD_SEV_BIT BIT(1) > - /* > - * Set the feature mask (SME or SEV) based on whether we are > - * running under a hypervisor. > - */ > - eax = 1; > - ecx = 0; > - native_cpuid(, , , ); > - feature_mask = (ecx & BIT(31)) ? AMD_SEV_BIT : AMD_SME_BIT; > + > + /* Check the SEV MSR whether SEV or SME is enabled */ > + sev_status = __rdmsr(MSR_AMD64_SEV); > + feature_mask = (sev_status & MSR_AMD64_SEV_ENABLED) ? AMD_SEV_BIT : > AMD_SME_BIT; > > /* >* Check for the SME/SEV feature: > @@ -530,19 +526,26 @@ void __init sme_enable(struct boot_params *bp) > > /* Ch
Re: [PATCH 6/7] x86/boot/compressed/64: Check SEV encryption in 32-bit boot-path
On 2/10/21 10:47 AM, Dave Hansen wrote: On 2/10/21 2:21 AM, Joerg Roedel wrote: + /* Store to memory and keep it in the registers */ + movl%eax, rva(sev_check_data)(%ebp) + movl%ebx, rva(sev_check_data+4)(%ebp) + + /* Enable paging to see if encryption is active */ + movl%cr0, %edx /* Backup %cr0 in %edx */ + movl$(X86_CR0_PG | X86_CR0_PE), %ecx /* Enable Paging and Protected mode */ + movl%ecx, %cr0 + + cmpl%eax, rva(sev_check_data)(%ebp) + jne 3f + cmpl%ebx, rva(sev_check_data+4)(%ebp) + jne 3f Also, I know that turning paging on is a *BIG* barrier. But, I didn't think it has any effect on the caches. I would expect that the underlying physical address of 'sev_check_data' would change when paging gets enabled because paging sets the C bit. So, how does the write of 'sev_check_data' get out of the caches and into memory where it can be read back with the new physical address? I think there's some bit of the SEV architecture that I'm missing. Non-paging memory accesses are always considered private (APM Volume 2, 15.34.4) and thus are cached with the C bit set. Thanks, Tom
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On 2/2/21 10:37 AM, Konrad Rzeszutek Wilk wrote: > On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote: >> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote: >>> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote: >>>> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote: >>>>> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote: >>>>>> The size of the buffer being bounced is not checked if it happens >>>>>> to be larger than the size of the mapped buffer. Because the size >>>>>> can be controlled by a device, as it's the case with virtio devices, >>>>>> this can lead to memory corruption. >>>>>> >>>>> >>>>> I'm really worried about all these hodge podge hacks for not trusted >>>>> hypervisors in the I/O stack. Instead of trying to harden protocols >>>>> that are fundamentally not designed for this, how about instead coming >>>>> up with a new paravirtualized I/O interface that is specifically >>>>> designed for use with an untrusted hypervisor from the start? >>>> >>>> Your comment makes sense but then that would require the cooperation >>>> of these vendors and the cloud providers to agree on something meaningful. >>>> I am also not sure whether the end result would be better than hardening >>>> this interface to catch corruption. There is already some validation in >>>> unmap path anyway. >>>> >>>> Another possibility is to move this hardening to the common virtio code, >>>> but I think the code may become more complicated there since it would >>>> require tracking both the dma_addr and length for each descriptor. >>> >>> Christoph, >>> >>> I've been wrestling with the same thing - this is specific to busted >>> drivers. 
And in reality you could do the same thing with a hardware >>> virtio device (see example in >>> http://thunderclap.io) >>> - where the >>> mitigation is 'enable the IOMMU to do its job.'. >>> >>> AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP).. >>> and while that is great in the future, SEV without IOMMU is now here. >>> >>> Doing a full circle here, this issue can be exploited with virtio >>> but you could say do that with real hardware too if you hacked the >>> firmware, so if you say used Intel SR-IOV NIC that was compromised >>> on an AMD SEV machine, and plumbed in the guest - the IOMMU inside >>> of the guest would be SWIOTLB code. Last line of defense against >>> bad firmware to say. >>> >>> As such I am leaning towards taking this code, but I am worried >>> about the performance hit .. but perhaps I shouldn't as if you >>> are using SWIOTLB=force already you are kind of taking a >>> performance hit? >>> >> >> I have not measured the performance degradation. This will hit all AMD SEV, >> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to >> be large since there are only few added operations per hundreads of copied >> bytes. I could try to measure the performance hit by running some benchmark >> with virtio-net/virtio-blk/virtio-rng. >> >> Earlier I said: >>>> Another possibility is to move this hardening to the common virtio code, >>>> but I think the code may become more complicated there since it would >>>> require tracking both the dma_addr and length for each descriptor. >> >> Unfortunately, this doesn't make sense.
Even if there's validation for >> the size in the common virtio layer, there will be some other device >> which controls a dma_addr and length passed to dma_unmap* in the >> corresponding driver. The device can target a specific dma-mapped private >> buffer by changing the dma_addr and set a good length to overwrite buffers >> following it. >> >> So, instead of doing the check in every driver and hitting a performance >> cost even when swiotlb is not used, it's probably better to fix it in >> swiotlb. >> >> @Tom Lendacky, do you think that it makes sense to h
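The hardening under discussion boils down to one invariant, sketched below in a hypothetical, heavily simplified form (struct and function names are invented for illustration): swiotlb records the size of each mapping, and the sync/unmap path refuses a device-supplied size that exceeds it, since copying more bytes back would overwrite memory adjacent to the original buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One bounce-buffer slot; in reality swiotlb tracks this per tlb_addr. */
struct bounce_slot {
	size_t orig_size;	/* recorded when the buffer was mapped */
};

/* Validate a size coming from the (untrusted) device side before
 * copying between the bounce buffer and the original buffer. */
static bool bounce_copy_allowed(const struct bounce_slot *slot, size_t size)
{
	return size <= slot->orig_size;
}
```

Doing the check once in swiotlb, as the thread concludes, avoids repeating it in every driver and costs nothing when swiotlb is not in use.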
Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active
On 9/9/20 7:44 AM, Laszlo Ersek wrote: On 09/09/20 10:27, Ard Biesheuvel wrote: (adding Laszlo and Brijesh) On Tue, 8 Sep 2020 at 20:46, Borislav Petkov wrote: + Ard so that he can ack the efi bits. On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote: From: Tom Lendacky Calling down to EFI runtime services can result in the firmware performing VMGEXIT calls. The firmware is likely to use the GHCB of the OS (e.g., for setting EFI variables), I've had to stare at this for a while. Because, normally a VMGEXIT is supposed to occur like this: - guest does something privileged - resultant non-automatic exit (NAE) injects a #VC exception - exception handler figures out what that privileged thing was - exception handler submits request to hypervisor via GHCB contents plus VMGEXIT instruction Point being, the agent that "owns" the exception handler is supposed to pre-allocate or otherwise provide the GHCB too, for information passing. So... what is the particular NAE that occurs during the execution of UEFI runtime services (at OS runtime)? And assuming it occurs, I'm unsure if the exception handler (IDT) at that point is owned (temporarily?) by the firmware. - If the #VC handler comes from the firmware, then I don't know why it would use the OS's GHCB. - If the #VC handler comes from the OS, then I don't understand why the commit message says "firmware performing VMGEXIT", given that in this case it would be the OS's #VC handler executing VMGEXIT. So, I think the above commit message implies a VMGEXIT *without* a NAE / #VC context. (Because, I fail to interpret the commit message in a NAE / #VC context in any way; see above.) Correct. OK, so let's see where the firmware performs a VMGEXIT *outside* of an exception handler, *while* at OS runtime. There seems to be one, in file "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c": Again, correct. Basically this is what is invoked when setting UEFI variables. 
VOID QemuFlashPtrWrite ( INvolatile UINT8*Ptr, INUINT8 Value ) { if (MemEncryptSevEsIsEnabled ()) { MSR_SEV_ES_GHCB_REGISTER Msr; GHCB *Ghcb; Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB); Ghcb = Msr.Ghcb; // // Writing to flash is emulated by the hypervisor through the use of write // protection. This won't work for an SEV-ES guest because the write won't // be recognized as a true MMIO write, which would result in the required // #VC exception. Instead, use the the VMGEXIT MMIO write support directly // to perform the update. // VmgInit (Ghcb); Ghcb->SharedBuffer[0] = Value; Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer; VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1); VmgDone (Ghcb); } else { *Ptr = Value; } } This function *does* run at OS runtime (as a part of non-volatile UEFI variable writes). And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used as GHCB. If the guest kernel allocates its own GHCB and writes the allocation address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB of the OS. I reviewed edk2 commit 437eb3f7a8db ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with SEV-ES", 2020-08-17), but I admit I never thought of the guest OS changing MSR_SEV_ES_GHCB. I'm sorry about that. As long as this driver is running before OS runtime (i.e., during the DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we set in "OvmfPkg/PlatformPei/AmdSev.c": STATIC VOID AmdSevEsInitialize ( VOID ) { VOID *GhcbBase; PHYSICAL_ADDRESS GhcbBasePa; UINTN GhcbPageCount, PageCount; RETURN_STATUS PcdStatus, DecryptStatus; IA32_DESCRIPTOR Gdtr; VOID *Gdt; if (!MemEncryptSevEsIsEnabled ()) { return; } PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE); ASSERT_RETURN_ERROR (PcdStatus); // // Allocate GHCB and per-CPU variable pages. // Since the pages must survive across the UEFI to OS transition // make them reserved. 
// GhcbPageCount = mMaxCpuCount * 2; GhcbBase = AllocateReservedPages (GhcbPageCount); ASSERT (GhcbBase != NULL); GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase; // // Each vCPU gets two consecutive pages, the first is the GHCB and the // second is the per-CPU variable page. Loop through the allocation and // only clear the encryption mask for the GHCB pages. // for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) { DecryptStatus = MemEncryptSevClearPageEncMask ( 0, GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount), 1, TRUE ); ASSERT_RETURN_ERROR (DecryptStatus); } ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount)); PcdStatus
Re: [PATCH v3 1/1] s390: virtio: let arch accept devices without IOMMU feature
On 6/17/20 5:43 AM, Pierre Morel wrote: > An architecture protecting the guest memory against unauthorized host > access may want to enforce VIRTIO I/O device protection through the > use of VIRTIO_F_IOMMU_PLATFORM. > > Let's give a chance to the architecture to accept or not devices > without VIRTIO_F_IOMMU_PLATFORM. > > Signed-off-by: Pierre Morel > Acked-by: Jason Wang > Acked-by: Christian Borntraeger > --- > arch/s390/mm/init.c | 6 ++ > drivers/virtio/virtio.c | 22 ++ > include/linux/virtio.h | 2 ++ > 3 files changed, 30 insertions(+) > > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c > index 6dc7c3b60ef6..215070c03226 100644 > --- a/arch/s390/mm/init.c > +++ b/arch/s390/mm/init.c > @@ -45,6 +45,7 @@ > #include > #include > #include > +#include > > pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(.bss..swapper_pg_dir); > > @@ -161,6 +162,11 @@ bool force_dma_unencrypted(struct device *dev) > return is_prot_virt_guest(); > } > > +int arch_needs_virtio_iommu_platform(struct virtio_device *dev) > +{ > + return is_prot_virt_guest(); > +} > + > /* protected virtualization */ > static void pv_init(void) > { > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index a977e32a88f2..aa8e01104f86 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -167,6 +167,21 @@ void virtio_add_status(struct virtio_device *dev, > unsigned int status) > } > EXPORT_SYMBOL_GPL(virtio_add_status); > > +/* > + * arch_needs_virtio_iommu_platform - provide arch specific hook when > finalizing > + * features for VIRTIO device dev > + * @dev: the VIRTIO device being added > + * > + * Permits the platform to provide architecture specific functionality when > + * devices features are finalized. This is the default implementation. > + * Architecture implementations can override this. 
> + */ > > +int __weak arch_needs_virtio_iommu_platform(struct virtio_device *dev) > +{ > + return 0; > +} > + > int virtio_finalize_features(struct virtio_device *dev) > { > int ret = dev->config->finalize_features(dev); > @@ -179,6 +194,13 @@ int virtio_finalize_features(struct virtio_device *dev) > if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) > return 0; > > + if (arch_needs_virtio_iommu_platform(dev) && > + !virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM)) { Just a style nit, but the "!virtio..." should be directly under the "arch_...", not tabbed out. Thanks, Tom > + dev_warn(>dev, > + "virtio: device must provide > VIRTIO_F_IOMMU_PLATFORM\n"); > + return -ENODEV; > + } > + > virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); > status = dev->config->get_status(dev); > if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index a493eac08393..e8526ae3463e 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -195,4 +195,6 @@ void unregister_virtio_driver(struct virtio_driver *drv); > #define module_virtio_driver(__virtio_driver) \ > module_driver(__virtio_driver, register_virtio_driver, \ > unregister_virtio_driver) > + > +int arch_needs_virtio_iommu_platform(struct virtio_device *dev); > #endif /* _LINUX_VIRTIO_H */ >
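The check the patch adds to virtio_finalize_features() can be reduced to a hypothetical standalone predicate (the feature-bit value matches the virtio spec; the function itself is invented for illustration): an architecture running a protected guest vetoes any device that did not offer VIRTIO_F_IOMMU_PLATFORM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_IOMMU_PLATFORM	33	/* virtio feature bit number */

/* Mirror of the patch's logic: if the arch hook says the platform
 * protects guest memory, a device lacking VIRTIO_F_IOMMU_PLATFORM is
 * rejected (the kernel code returns -ENODEV at that point). */
static bool virtio_features_acceptable(uint64_t features,
				       bool arch_needs_iommu_platform)
{
	if (arch_needs_iommu_platform &&
	    !(features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
		return false;

	return true;
}
```

On s390 the arch hook returns is_prot_virt_guest(), so only protected-virtualization guests enforce the veto; the default `__weak` implementation keeps every other architecture permissive.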
Re: [PATCH v3 59/75] x86/sev-es: Handle MONITOR/MONITORX Events
On 6/11/20 12:13 PM, Sean Christopherson wrote: On Thu, Jun 11, 2020 at 03:10:45PM +0200, Joerg Roedel wrote: On Tue, May 19, 2020 at 11:38:45PM -0700, Sean Christopherson wrote: On Tue, Apr 28, 2020 at 05:17:09PM +0200, Joerg Roedel wrote: +static enum es_result vc_handle_monitor(struct ghcb *ghcb, + struct es_em_ctxt *ctxt) +{ + phys_addr_t monitor_pa; + pgd_t *pgd; + + pgd = __va(read_cr3_pa()); + monitor_pa = vc_slow_virt_to_phys(ghcb, ctxt->regs->ax); + + ghcb_set_rax(ghcb, monitor_pa); + ghcb_set_rcx(ghcb, ctxt->regs->cx); + ghcb_set_rdx(ghcb, ctxt->regs->dx); + + return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MONITOR, 0, 0); Why? If SVM has the same behavior as VMX, the MONITOR will be disarmed on VM-Enter, i.e. the VMM can't do anything useful for MONITOR/MWAIT. I assume that's the case given that KVM emulates MONITOR/MWAIT as NOPs on SVM. Not sure if it is disarmed on VMRUN, but the MONITOR/MWAIT instructions are part of the GHCB spec, so they are implemented here. Even if MONITOR/MWAIT somehow works across VMRUN I'm not sure it's something the guest should enable by default as it leaks GPAs to the untrusted host, with no benefit to the guest except in specific configurations. Yeah, the VMM can muck with page tables to trace guest to the some extent, but the guest shouldn't be unnecessarily volunteering information to the host. If MONITOR/MWAIT is effectively a NOP then removing this code is a no brainer. Can someone from AMD chime in? I don't think there is any guarantee that MONITOR/MWAIT would work within a guest (I'd have to dig some more on that to get a definitive answer, but probably not necessary to do). As you say, if KVM emulates it as a NOP, there's no sense in exposing the GPA - make it a NOP in the handler. I just need to poke some other hypervisor vendors and hear what they have to say. Thanks, Tom
Re: [PATCH v3 69/75] x86/realmode: Setup AP jump table
On 5/29/20 4:02 AM, Borislav Petkov wrote: On Tue, Apr 28, 2020 at 05:17:19PM +0200, Joerg Roedel wrote: From: Tom Lendacky Setup the AP jump table to point to the SEV-ES trampoline code so that the APs can boot. Tom, in his laconic way, doesn't want to explain to us why is this even needed... :) Looks like some of the detail was lost during the patch shuffling. Originally (on GitHub) this was the text: As part of the GHCB specification, the booting of APs under SEV-ES requires an AP jump table when transitioning from one layer of code to another (e.g. when going from UEFI to the OS). As a result, each layer that parks an AP must provide the physical address of an AP jump table to the next layer using the GHCB MSR. Upon booting of the kernel, read the GHCB MSR and save the address of the AP jump table. Under SEV-ES, APs are started using the INIT-SIPI-SIPI sequence. Before issuing the first SIPI request for an AP, the start eip is programmed into the AP jump table. Upon issuing the SIPI request, the AP will awaken and jump to the start eip address programmed into the AP jump table. It needs to change "GHCB MSR" to "VMGEXIT MSR protocol", but should cover what you're looking for. Thanks, Tom /me reads the code /me reads the GHCB spec aha, it gets it from the HV. And it can be set by the guest too... So how about expanding that commit message as to why this is done, why needed, etc? Thx. 
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 262f83cad355..1c5cbfd102d5 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -9,6 +9,7 @@ #include #include #include +#include struct real_mode_header *real_mode_header; u32 *trampoline_cr4_features; @@ -107,6 +108,11 @@ static void __init setup_real_mode(void) if (sme_active()) trampoline_header->flags |= TH_FLAGS_SME_ACTIVE; + if (sev_es_active()) { + if (sev_es_setup_ap_jump_table(real_mode_header)) + panic("Failed to update SEV-ES AP Jump Table"); + } + So this function gets slowly sprinkled with if (sev-something) bla Please wrap at least those last two into a sev_setup_real_mode() or so.
Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance
On 4/28/20 10:17 AM, Joerg Roedel wrote: From: Mike Stunes To avoid a future VMEXIT for a subsequent CPUID function, cache the results returned by CPUID into an xarray. [tl: coding standard changes, register zero extension] Signed-off-by: Mike Stunes Signed-off-by: Tom Lendacky [ jroe...@suse.de: - Wrapped cache handling into vc_handle_cpuid_cached() - Used lower_32_bits() where applicable - Moved cache_index out of struct es_em_ctxt ] Co-developed-by: Joerg Roedel Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev-es-shared.c | 12 ++-- arch/x86/kernel/sev-es.c| 119 +++- 2 files changed, 124 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c index 5bfc1f3030d4..cfdafe12da4f 100644 --- a/arch/x86/kernel/sev-es-shared.c +++ b/arch/x86/kernel/sev-es-shared.c @@ -427,8 +427,8 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb, u32 cr4 = native_read_cr4(); enum es_result ret; - ghcb_set_rax(ghcb, regs->ax); - ghcb_set_rcx(ghcb, regs->cx); + ghcb_set_rax(ghcb, lower_32_bits(regs->ax)); + ghcb_set_rcx(ghcb, lower_32_bits(regs->cx)); if (cr4 & X86_CR4_OSXSAVE) /* Safe to read xcr0 */ @@ -447,10 +447,10 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb, ghcb_is_valid_rdx(ghcb))) return ES_VMM_ERROR; - regs->ax = ghcb->save.rax; - regs->bx = ghcb->save.rbx; - regs->cx = ghcb->save.rcx; - regs->dx = ghcb->save.rdx; + regs->ax = lower_32_bits(ghcb->save.rax); + regs->bx = lower_32_bits(ghcb->save.rbx); + regs->cx = lower_32_bits(ghcb->save.rcx); + regs->dx = lower_32_bits(ghcb->save.rdx); return ES_OK; } diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 03095bc7b563..0303834d4811 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -33,6 +34,16 @@ #define DR7_RESET_VALUE0x400 +struct sev_es_cpuid_cache_entry { + unsigned long eax; + unsigned long ebx; + unsigned long ecx; + unsigned long 
edx; +}; + +static struct xarray sev_es_cpuid_cache; +static bool __ro_after_init sev_es_cpuid_cache_initialized; + /* For early boot hypervisor communication in SEV-ES enabled guests */ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE); @@ -463,6 +474,9 @@ void __init sev_es_init_vc_handling(void) sev_es_setup_vc_stack(cpu); } + xa_init_flags(_es_cpuid_cache, XA_FLAGS_LOCK_IRQ); + sev_es_cpuid_cache_initialized = true; + init_vc_stack_names(); } @@ -744,6 +758,91 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb, return ret; } +static unsigned long sev_es_get_cpuid_cache_index(struct es_em_ctxt *ctxt) +{ + unsigned long hi, lo; + + /* Don't attempt to cache until the xarray is initialized */ + if (!sev_es_cpuid_cache_initialized) + return ULONG_MAX; + + lo = lower_32_bits(ctxt->regs->ax); + + /* +* CPUID 0x000d requires both RCX and XCR0, so it can't be +* cached. +*/ + if (lo == 0x000d) + return ULONG_MAX; + + /* +* Some callers of CPUID don't always set RCX to zero for CPUID +* functions that don't require RCX, which can result in excessive +* cached values, so RCX needs to be manually zeroed for use as part +* of the cache index. Future CPUID values may need RCX, but since +* they can't be known, they must not be cached. 
+*/ + if (lo > 0x80000020) + return ULONG_MAX; + + switch (lo) { + case 0x00000007: + case 0x0000000b: + case 0x0000000f: + case 0x00000010: + case 0x8000001d: + case 0x80000020: + hi = ctxt->regs->cx << 32; + break; + default: + hi = 0; + } + + return hi | lo; +} + +static bool sev_es_check_cpuid_cache(struct es_em_ctxt *ctxt, +unsigned long cache_index) +{ + struct sev_es_cpuid_cache_entry *cache_entry; + + if (cache_index == ULONG_MAX) + return false; + + cache_entry = xa_load(&sev_es_cpuid_cache, cache_index); + if (!cache_entry) + return false; + + ctxt->regs->ax = cache_entry->eax; + ctxt->regs->bx = cache_entry->ebx; + ctxt->regs->cx = cache_entry->ecx; + ctxt->regs->dx = cache_entry->edx; + + return true; +} + +static void sev_es_add_cpuid_cache(struct es_em_ctxt *ctxt, + unsigned long cach
Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance
On 5/26/20 4:19 AM, Borislav Petkov wrote: On Tue, May 19, 2020 at 10:16:37PM -0700, Sean Christopherson wrote: The whole cache on-demand approach seems like overkill. The number of CPUID leaves that are invoked after boot with any regularity can probably be counted on one hand. IIRC glibc invokes CPUID to gather TLB/cache info, XCR0-based features, and one or two other leaves. A statically sized global array that's arbitrarily indexed a la x86_capability would be just as simple and more performant. It would also allow fancier things like emulating CPUID 0xD in the guest if you want to go down that road. And before we do any of that "caching" or whatnot, I'd like to see numbers justifying its existence. Because if it is only a couple of CPUID invocations and the boot delay is immeasurable, then it's not worth the effort. I added some rudimentary stats code to see how many times there was a CPUID cache hit on a 64-vCPU guest during a kernel build (make clean followed by make with -j 64):

SEV-ES CPUID cache statistics
  0x00000000/0x00000000:   220,384
  0x00000007/0x00000000:   213,306
  0x80000000/0x00000000: 1,054,642
  0x80000001/0x00000000:   213,306
  0x80000005/0x00000000:   210,334
  0x80000006/0x00000000:   420,668
  0x80000007/0x00000000:   210,334
  0x80000008/0x00000000:   420,684
  2,963,658 cache hits

So it is significant in quantity, but I'm not sure what the overall performance difference is. If I can find some more time I'll try to compare the kernel builds with and without the caching to see if it is noticeable. Thanks, Tom ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 64/75] x86/sev-es: Cache CPUID results for improved performance
On 5/6/20 1:08 PM, Mike Stunes wrote: On Apr 28, 2020, at 8:17 AM, Joerg Roedel wrote: From: Mike Stunes To avoid a future VMEXIT for a subsequent CPUID function, cache the results returned by CPUID into an xarray. [tl: coding standard changes, register zero extension] Signed-off-by: Mike Stunes Signed-off-by: Tom Lendacky [ jroe...@suse.de: - Wrapped cache handling into vc_handle_cpuid_cached() - Used lower_32_bits() where applicable - Moved cache_index out of struct es_em_ctxt ] Co-developed-by: Joerg Roedel Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev-es-shared.c | 12 ++-- arch/x86/kernel/sev-es.c | 119 +++- 2 files changed, 124 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 03095bc7b563..0303834d4811 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -744,6 +758,91 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb, return ret; } +static unsigned long sev_es_get_cpuid_cache_index(struct es_em_ctxt *ctxt) +{ + unsigned long hi, lo; + + /* Don't attempt to cache until the xarray is initialized */ + if (!sev_es_cpuid_cache_initialized) + return ULONG_MAX; + + lo = lower_32_bits(ctxt->regs->ax); + + /* +* CPUID 0x0000000d requires both RCX and XCR0, so it can't be +* cached. +*/ + if (lo == 0x0000000d) + return ULONG_MAX; + + /* +* Some callers of CPUID don't always set RCX to zero for CPUID +* functions that don't require RCX, which can result in excessive +* cached values, so RCX needs to be manually zeroed for use as part +* of the cache index. Future CPUID values may need RCX, but since +* they can't be known, they must not be cached. +*/ + if (lo > 0x80000020) + return ULONG_MAX; If the cache is shared across CPUs, do we also need to exclude function 0x1 because it contains the LAPIC ID? (Or is the cache per-CPU?) It's currently not a per-CPU cache, but given what you pointed out, it should be if we're going to keep function 0x1 in there. 
The question will be how often that CPUID leaf is issued, as that would determine whether (not) caching it matters. There is also the broader question of how often CPUID is issued at all, and whether the xarray lock could come under contention if the cache is not per-CPU. Thanks, Tom
Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
On 4/27/20 12:37 PM, Andy Lutomirski wrote: On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski wrote: On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel wrote: On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote: I assume the race you mean is: #VC Immediate NMI before IST gets shifted #VC Kaboom. How are you dealing with this? Ultimately, I think that NMI will need to turn off IST before engaging in any funny business. Let me ponder this a bit. Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry in do_nmi() (thanks to David Kaplan for the idea). It might cause one of the IST stacks to be unused during nesting, but that is fine. The stack memory for #VC is only allocated when SEV-ES is active (in an SEV-ES VM). Blech. It probably works, but still, yuck. It's a bit sad that we seem to be growing more and more poorly designed happens-anywhere exception types at an alarming rate. We seem to have #NM, #MC, #VC, #HV, and #DB. This doesn't really scale. I have a somewhat serious question: should we use IST for #VC at all? As I understand it, Rome and Naples make it mandatory for hypervisors to intercept #DB, which means that, due to the MOV SS mess, it's sort of mandatory to use IST for #VC. But Milan fixes the #DB issue, so, if we're running under a sufficiently sensible hypervisor, we don't need IST for #VC. So I think we have two choices: 1. Use IST for #VC and deal with all the mess that entails. 2. Say that our SEV-ES client support on Rome and Naples is for development only and do a quick boot-time check for whether #DB is intercepted. (Just set TF and see what vector we get.) If #DB is intercepted, print a very loud warning and refuse to boot unless some special sev_es.insecure_development_mode or similar option is set. #2 results in simpler and more robust entry code. #1 is more secure. So my question is: will anyone actually use SEV-ES in production on Rome or Naples? 
As I understand it, it's not really ready for prime time on those chips. And do we care if the combination of a malicious hypervisor and malicious guest userspace on Milan can compromise the guest kernel? I don't think SEV-ES is really meant to resist a concerted effort by the hypervisor to compromise the guest. --Andy

Naples was limited in the number of encryption keys available for guests (15), but Rome increased that significantly (509). SEV-ES is ready on those chips - Rome more so with the increased key count given the requirement that SEV and SEV-ES guests have non-overlapping ASID ranges (which corresponds to key usage). Thanks, Tom
Re: [PATCH] Allow RDTSC and RDTSCP from userspace
On 4/24/20 4:24 PM, Dave Hansen wrote: On 4/24/20 2:03 PM, Mike Stunes wrote: I needed to allow RDTSC(P) from userspace and in early boot in order to get userspace started properly. Patch below. --- SEV-ES guests will need to execute rdtsc and rdtscp from userspace and during early boot. Move the rdtsc(p) #VC handler into common code and extend the #VC handlers. Do SEV-ES guests _always_ #VC on rdtsc(p)? Only if the hypervisor is intercepting those instructions. Thanks, Tom ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
On 4/14/20 3:16 PM, Tom Lendacky wrote: On 4/14/20 3:12 PM, Dave Hansen wrote: On 4/14/20 1:04 PM, Tom Lendacky wrote: set_memory_decrypted needs to check the return value. I see it consistently return ENOMEM. I've traced that back to split_large_page in arch/x86/mm/pat/set_memory.c. At that point the guest won't be able to communicate with the hypervisor, too. Maybe we should BUG() here to terminate further processing? Escalating an -ENOMEM into a crashed kernel seems a bit extreme. Granted, the guest may be in an unrecoverable state, but the host doesn't need to be too. The host wouldn't be. This only happens in a guest, so it would be just causing the guest kernel to panic early in the boot. And I should add that it would only impact an SEV-ES guest. Thanks, Tom Thanks, Tom ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
On 4/14/20 2:03 PM, Mike Stunes wrote: On Mar 19, 2020, at 2:13 AM, Joerg Roedel wrote: From: Tom Lendacky The runtime handler needs a GHCB per CPU. Set them up and map them unencrypted. Signed-off-by: Tom Lendacky Signed-off-by: Joerg Roedel --- arch/x86/include/asm/mem_encrypt.h | 2 ++ arch/x86/kernel/sev-es.c | 28 +++- arch/x86/kernel/traps.c| 3 +++ 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index c17980e8db78..4bf5286310a0 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -197,6 +203,26 @@ static bool __init sev_es_setup_ghcb(void) return true; } +void sev_es_init_ghcbs(void) +{ + int cpu; + + if (!sev_es_active()) + return; + + /* Allocate GHCB pages */ + ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE); + + /* Initialize per-cpu GHCB pages */ + for_each_possible_cpu(cpu) { + struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu); + + set_memory_decrypted((unsigned long)ghcb, +sizeof(*ghcb) >> PAGE_SHIFT); + memset(ghcb, 0, sizeof(*ghcb)); + } +} + set_memory_decrypted needs to check the return value. I see it consistently return ENOMEM. I've traced that back to split_large_page in arch/x86/mm/pat/set_memory.c. At that point the guest won't be able to communicate with the hypervisor, too. Maybe we should BUG() here to terminate further processing? Thanks, Tom ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected
On 2/21/20 7:12 AM, Halil Pasic wrote: > On Thu, 20 Feb 2020 15:55:14 -0500 > "Michael S. Tsirkin" wrote: > >> On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote: >>> Currently the advanced guest memory protection technologies (AMD SEV, >>> powerpc secure guest technology and s390 Protected VMs) abuse the >>> VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which >>> is in turn necessary to make IO work with guest memory protection. >>> >>> But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a >>> different beast: with virtio devices whose implementation runs on an SMP >>> CPU we are still fine with doing all the usual optimizations, it is just >>> that we need to make sure that the memory protection mechanism does not >>> get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the >>> side of the guest (and possibly the host side as well) than we actually >>> need. >>> >>> An additional benefit of teaching the guest to make the right decision >>> (and use DMA API) on its own is: removing the need to mandate special >>> VM configuration for guests that may run with protection. This is >>> especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all >>> the virtio control structures into the first 2G of guest memory: >>> something we don't necessarily want to do per-default. >>> >>> Signed-off-by: Halil Pasic >>> Tested-by: Ram Pai >>> Tested-by: Michael Mueller >> >> This might work for you but it's fragile, since without >> VIRTIO_F_ACCESS_PLATFORM hypervisor assumes it gets >> GPA's, not DMA addresses. >> > > Thanks for your constructive approach. I do want the hypervisor to > assume it gets GPA's. My train of thought was that the guys that need > to use IOVA's that are not GPA's when force_dma_unencrypted() will have > to specify VIRTIO_F_ACCESS_PLATFORM (at the device) anyway, because > otherwise it won't work. 
But I see your point: in case of a > mis-configuration and provided the DMA API returns IOVA's one could end > up trying to touch wrong memory locations. But this should be similar to > what would happen if DMA ops are not used, and memory is not made accessible. > >> >> >> IOW this looks like another iteration of: >> >> virtio: Support encrypted memory on powerpc secure guests >> >> which I was under the impression was abandoned as unnecessary. > > Unnecessary for powerpc because they do normal PCI. In the context of > CCW there are only guest physical addresses (CCW I/O has no concept of > IOMMU or IOVAs). > >> >> >> To summarize, the necessary conditions for a hack along these lines >> (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that: >> >> - secure guest mode is enabled - so we know that since we don't share >> most memory regular virtio code won't >> work, even though the buggy hypervisor didn't set >> VIRTIO_F_ACCESS_PLATFORM > > force_dma_unencrypted(&vdev->dev) is IMHO exactly about this. > >> - DMA API is giving us addresses that are actually also physical >> addresses > > In case of s390 this is given. I talked with the power people before > posting this, and they assured me they are willing to deal with > this. I was hoping to talk about this with the AMD SEV people here (hence > the cc). Yes, physical addresses are fine for SEV - the key is that the DMA API is used so that an address for unencrypted, or shared, memory is returned. E.g. for a dma_alloc_coherent() call this is an allocation that has had set_memory_decrypted() called or for a dma_map_page() call this is an address from SWIOTLB, which was mapped shared during boot, where the data will be bounce-buffered. We don't currently support an emulated IOMMU in our SEV guest because that would require a lot of support in the driver to make IOMMU data available to the hypervisor (I/O page tables, etc.). We would need hardware support to really make this work easily in the guest. 
Thanks, Tom > >> - Hypervisor is buggy and didn't enable VIRTIO_F_ACCESS_PLATFORM >> > > I don't get this point. The argument where the hypervisor is buggy is a > bit hard to follow for me. If hypervisor is buggy we have already lost > anyway most of the time, or? > >> I don't see how this patch does this. > > I do get your point. I don't know of a good way to check that DMA API > is giving us addresses that are actually physical addresses, and the > situation you describe definitely has some risk to it. > > Let me comment on other ideas that came up. I would be very happy to go > with the best one. Thank you very much. > > Regards, > Halil > >> >> >>> --- >>> drivers/virtio/virtio_ring.c | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c >>> index 867c7ebd3f10..fafc8f924955 100644 >>> --- a/drivers/virtio/virtio_ring.c >>> +++ b/drivers/virtio/virtio_ring.c >>> @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device >>> *vdev)
Re: [PATCH 1/1] virtio_ring: fix return code on DMA mapping fails
On 11/22/19 7:08 AM, Halil Pasic wrote: Thanks Michael! Actually I also hoped to start a discussion on virtio with encrypted memory. I assume the AMD folks have the most experience with this, and I would very much like to understand how they master the challenges we are all facing. My understanding of IO in the context of AMD SEV is that the user is responsible for choosing the swiotlb command line parameter of the guest kernel so that the guest never runs out of swiotlb. And that not doing so may have fatal consequences with regards to the guest. [1] The swiotlb being a guest global resource, to choose such a size, one would first need to know the maximal swiotlb footprint of each device, and then apply some heuristics regarding fragmentation. Honestly, if somebody asked me how to calculate the max swiotlb footprint of the most common virtio devices, I would feel very uncomfortable. But maybe I got it all wrong. @Tom can you help me understand how this works? Yes, SWIOTLB sizing is hard. It really depends on the workload and the associated I/O load that the guest will be performing. We've been looking at a simple patch to increase the default SWIOTLB size if SEV is active. But what size do you choose? Do you base it on the overall guest size? And you're limited because it must reside low in memory. Ideally, having a pool of shared pages for DMA, outside of standard SWIOTLB, might be a good thing. On x86, SWIOTLB really seems geared towards devices that don't support 64-bit DMA. If a device supports 64-bit DMA then it can use shared pages that reside anywhere to perform the DMA and bounce buffering. I wonder if the SWIOTLB support can be enhanced to support something like this, using today's low SWIOTLB buffers if the DMA mask necessitates it, otherwise using a dynamically sized pool of shared pages that can live anywhere. Thanks, Tom In any case, we s390 protected virtualization folks are concerned about the things laid out above. 
The goal of this patch is to make the swiotlb full condition less grave, but it is by no means a full solution. I would like to work on improving on this situation. Obviously we have done some thinking about what can be done, but I would very much like to collect the opinions of the people in the community that AFAICT face the same problem. One of the ideas is to try to prevent it from happening by making swiotlb sizing dynamic. Another idea is to make the system deal with the failures gracefully. Both ideas come with a bag of problems of their own (AFAICT). According to my research the people I need to talk to are Tom (AMD), and Ram and Thiago (Power) and of course the respective maintainers. Have I missed anybody? Regards, Halil -- [1] https://github.com/AMDESE/AMDSEV#faq-4 On Tue, 19 Nov 2019 08:04:29 -0500 "Michael S. Tsirkin" wrote: Will be in the next pull request. On Tue, Nov 19, 2019 at 12:10:22PM +0100, Halil Pasic wrote: ping On Thu, 14 Nov 2019 13:46:46 +0100 Halil Pasic wrote: Commit 780bc7903a32 ("virtio_ring: Support DMA APIs") makes virtqueue_add() return -EIO when we fail to map our I/O buffers. This is a very realistic scenario for guests with encrypted memory, as swiotlb may run out of space, depending on its size and the I/O load. The virtio-blk driver interprets -EIO from virtqueue_add() as an IO error, despite the fact that swiotlb full is, in absence of bugs, a recoverable condition. Let us change the return code to -ENOMEM, and make the block layer recover from these failures when virtio-blk encounters the condition described above. Fixes: 780bc7903a32 ("virtio_ring: Support DMA APIs") Signed-off-by: Halil Pasic Tested-by: Michael Mueller --- [..] 
Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
On 10/12/2017 10:34 AM, Thomas Garnier wrote: On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky <thomas.lenda...@amd.com> wrote: On 10/11/2017 3:30 PM, Thomas Garnier wrote: Changes: - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce dynamic relocation space on mapped memory. It also simplifies the relocation process. - Move the start the module section next to the kernel. Remove the need for -mcmodel=large on modules. Extends module space from 1 to 2G maximum. - Support for XEN PVH as 32-bit relocations can be ignored with --emit-relocs. - Support for GOT relocations previously done automatically with -pie. - Remove need for dynamic PLT in modules. - Support dymamic GOT for modules. - rfc v2: - Add support for global stack cookie while compiler default to fs without mcmodel=kernel - Change patch 7 to correctly jump out of the identity mapping on kexec load preserve. These patches make the changes necessary to build the kernel as Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space. It allows to optionally extend the KASLR randomization range from 1G to 3G. Hi Thomas, I've applied your patches so that I can verify that SME works with PIE. Unfortunately, I'm running into build warnings and errors when I enable PIE. With CONFIG_STACK_VALIDATION=y I receive lots of messages like this: drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: call without frame pointer save/setup Disabling CONFIG_STACK_VALIDATION suppresses those. I ran into that, I plan to fix it in the next iteration. But near the end of the build, I receive errors like this: arch/x86/kernel/setup.o: In function `dump_kernel_offset': .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1 . . 
about 10 more of the above type messages . make: *** [vmlinux] Error 1 Error building kernel, exiting Are there any config options that should or should not be enabled when building with PIE enabled? Is there a compiler requirement for PIE (I'm using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))? I never ran into these ones and I tested compilers older and newer. What was your exact configuration? I'll send you the config in a separate email. Thanks, Tom Thanks, Tom Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler changes, PIE support and KASLR in general. Thanks to Roland McGrath on his feedback for using -pie versus --emit-relocs and details on compiler code generation. The patches: - 1-3, 5-1#, 17-18: Change in assembly code to be PIE compliant. - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically. - 14: Adapt percpu design to work correctly when PIE is enabled. - 15: Provide an option to default visibility to hidden except for key symbols. It removes errors between compilation units. - 16: Adapt relocation tool to handle PIE binary correctly. - 19: Add support for global cookie. - 20: Support ftrace with PIE (used on Ubuntu config). - 21: Fix incorrect address marker on dump_pagetables. - 22: Add option to move the module section just after the kernel. - 23: Adapt module loading to support PIE with dynamic GOT. - 24: Make the GOT read-only. - 25: Add the CONFIG_X86_PIE option (off by default). - 26: Adapt relocation tool to generate a 64-bit relocation table. - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range from 1G to 3G (off by default). 
Performance/Size impact: Size of vmlinux (Default configuration): File size: - PIE disabled: +0.31% - PIE enabled: -3.210% (less relocations) .text section: - PIE disabled: +0.000644% - PIE enabled: +0.837% Size of vmlinux (Ubuntu configuration): File size: - PIE disabled: -0.201% - PIE enabled: -0.082% .text section: - PIE disabled: same - PIE enabled: +1.319% Size of vmlinux (Default configuration + ORC): File size: - PIE enabled: -3.167% .text section: - PIE enabled: +0.814% Size of vmlinux (Ubuntu configuration + ORC): File size: - PIE enabled: -3.167% .text section: - PIE enabled: +1.26% The size increase is mainly due to not having access to the 32-bit signed relocation that can be used with mcmodel=kernel. A small part is due to reduced optimization for PIE code. This bug [1] was opened with gcc to provide a better code generation for kernel PIE. Hackbench (50% and 1600% on thread/process for pipe/sockets): - PIE disabled: no significant change (avg +0.1% on latest test). - PIE enabled: between -0.50% to +0.86% in average (default
Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
On 10/11/2017 3:30 PM, Thomas Garnier wrote: > Changes: > - patch v1: > - Simplify ftrace implementation. > - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. > - rfc v3: > - Use --emit-relocs instead of -pie to reduce dynamic relocation space on > mapped memory. It also simplifies the relocation process. > - Move the start the module section next to the kernel. Remove the need > for > -mcmodel=large on modules. Extends module space from 1 to 2G maximum. > - Support for XEN PVH as 32-bit relocations can be ignored with > --emit-relocs. > - Support for GOT relocations previously done automatically with -pie. > - Remove need for dynamic PLT in modules. > - Support dymamic GOT for modules. > - rfc v2: > - Add support for global stack cookie while compiler default to fs without > mcmodel=kernel > - Change patch 7 to correctly jump out of the identity mapping on kexec > load > preserve. > > These patches make the changes necessary to build the kernel as Position > Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below > the top 2G of the virtual address space. It allows to optionally extend the > KASLR randomization range from 1G to 3G. Hi Thomas, I've applied your patches so that I can verify that SME works with PIE. Unfortunately, I'm running into build warnings and errors when I enable PIE. With CONFIG_STACK_VALIDATION=y I receive lots of messages like this: drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: call without frame pointer save/setup Disabling CONFIG_STACK_VALIDATION suppresses those. But near the end of the build, I receive errors like this: arch/x86/kernel/setup.o: In function `dump_kernel_offset': .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1 . . about 10 more of the above type messages . 
make: *** [vmlinux] Error 1 Error building kernel, exiting Are there any config options that should or should not be enabled when building with PIE enabled? Is there a compiler requirement for PIE (I'm using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))? Thanks, Tom > > Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler > changes, PIE support and KASLR in general. Thanks to Roland McGrath on his > feedback for using -pie versus --emit-relocs and details on compiler code > generation. > > The patches: > - 1-3, 5-1#, 17-18: Change in assembly code to be PIE compliant. > - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically. > - 14: Adapt percpu design to work correctly when PIE is enabled. > - 15: Provide an option to default visibility to hidden except for key > symbols. > It removes errors between compilation units. > - 16: Adapt relocation tool to handle PIE binary correctly. > - 19: Add support for global cookie. > - 20: Support ftrace with PIE (used on Ubuntu config). > - 21: Fix incorrect address marker on dump_pagetables. > - 22: Add option to move the module section just after the kernel. > - 23: Adapt module loading to support PIE with dynamic GOT. > - 24: Make the GOT read-only. > - 25: Add the CONFIG_X86_PIE option (off by default). > - 26: Adapt relocation tool to generate a 64-bit relocation table. > - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation > range > from 1G to 3G (off by default). 
> > Performance/Size impact: > > Size of vmlinux (Default configuration): > File size: > - PIE disabled: +0.31% > - PIE enabled: -3.210% (less relocations) > .text section: > - PIE disabled: +0.000644% > - PIE enabled: +0.837% > > Size of vmlinux (Ubuntu configuration): > File size: > - PIE disabled: -0.201% > - PIE enabled: -0.082% > .text section: > - PIE disabled: same > - PIE enabled: +1.319% > > Size of vmlinux (Default configuration + ORC): > File size: > - PIE enabled: -3.167% > .text section: > - PIE enabled: +0.814% > > Size of vmlinux (Ubuntu configuration + ORC): > File size: > - PIE enabled: -3.167% > .text section: > - PIE enabled: +1.26% > > The size increase is mainly due to not having access to the 32-bit signed > relocation that can be used with mcmodel=kernel. A small part is due to > reduced > optimization for PIE code. This bug [1] was opened with gcc to provide a > better > code generation for kernel PIE. > > Hackbench (50% and 1600% on thread/process for pipe/sockets): > - PIE disabled: no significant change (avg +0.1% on latest test). > - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu > config). > > slab_test (average of 10 runs): > - PIE disabled: no significant change (-2% on latest run, likely noise). > - PIE enabled: between -1% and +0.8% on latest runs. > > Kernbench (average of 10 Half and Optimal runs): >
Re: RFT: virtio_net: limit xmit polling
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote: OK, different people seem to test different trees. In the hope to get everyone on the same page, I created several variants of this patch so they can be compared. Whoever's interested, please check out the following, and tell me how these compare: I'm in the process of testing these patches. Base and v0 are complete and v1 is near complete with v2 to follow. I'm testing with a variety of TCP_RR and TCP_STREAM/TCP_MAERTS tests involving local guest-to-guest tests and remote host-to-guest tests. I'll post the results in the next day or two when the tests finish. Thanks, Tom kernel: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git virtio-net-limit-xmit-polling/base - this is net-next baseline to test against virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity virtio-net-limit-xmit-polling/v1 - previous revision of the patch this does xmit,free,xmit,2*free,free virtio-net-limit-xmit-polling/v2 - new revision of the patch this does free,xmit,2*free,free There's also this on top: virtio-net-limit-xmit-polling/v3 - don't delay avail index update I don't think it's important to test this one, yet Userspace to use: event index work is not yet merged upstream so the revision to use is still this: git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git virtio-net-event-idx-v3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 09/18] virtio: use avail_event index
On Monday, May 16, 2011 02:12:21 AM Rusty Russell wrote:
> On Sun, 15 May 2011 16:55:41 +0300, Michael S. Tsirkin m...@redhat.com wrote:
> > On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
> > > On Wed, 4 May 2011 23:51:47 +0300, Michael S. Tsirkin m...@redhat.com wrote:
> > > > Use the new avail_event feature to reduce the number of exits
> > > > from the guest.
> > >
> > > Figures here would be nice :)
> >
> > You mean ASCII art in comments?
>
> I mean benchmarks of some kind.

I'm working on getting some benchmark results for the patches. I should
hopefully have something in the next day or two.

Tom

> > > > @@ -228,6 +237,12 @@ add_head:
> > > >  	 * new available array entries. */
> > > >  	virtio_wmb();
> > > >  	vq->vring.avail->idx++;
> > > > +	/* If the driver never bothers to kick in a very long while,
> > > > +	 * avail index might wrap around. If that happens, invalidate
> > > > +	 * kicked_avail index we stored. TODO: make sure all drivers
> > > > +	 * kick at least once in 2^16 and remove this. */
> > > > +	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> > > > +		vq->kicked_avail_valid = true;
> > >
> > > If they don't, they're already buggy. Simply do:
> > > 	WARN_ON(vq->vring.avail->idx == vq->kicked_avail);
> >
> > Hmm, but does it say that somewhere?
>
> AFAICT it's a corollary of:
> 1) You have a finite ring of size <= 2^16.
> 2) You need to kick the other side once you've done some work.
>
> > > > @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
> > > >  		break;
> > > >  	case VIRTIO_RING_F_USED_EVENT_IDX:
> > > >  		break;
> > > > +	case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> > > > +		break;
> > > >  	default:
> > > >  		/* We don't understand this bit. */
> > > >  		clear_bit(i, vdev->features);
> > >
> > > Does this belong in a prior patch?
> > >
> > > Thanks,
> > > Rusty.
> >
> > Well if we don't support the feature in the ring we should not ack the
> > feature, right?
>
> Ah, you're right.
>
> Thanks,
> Rusty.
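For context, the 2^16 wrap-around concern debated above can be modeled outside the kernel. The sketch below is hypothetical stand-alone C (the names `avail_idx` and `kicked_avail` are borrowed from the patch, the helpers are not kernel code); it shows that a free-running 16-bit index aliases its stored kick point after exactly 65536 additions without a kick:

```c
#include <stdint.h>

/* Model of the patch's free-running 16-bit avail index and the stored
 * kick point. Returns nonzero when avail_idx has come back around to
 * kicked_avail, at which point "we already kicked at this index" can
 * no longer be trusted. */
static int kicked_avail_aliased(uint16_t avail_idx, uint16_t kicked_avail)
{
    return avail_idx == kicked_avail;
}

/* Advance the index by n buffer additions; uint16_t arithmetic wraps
 * mod 2^16, just like the vring's free-running index. */
static uint16_t advance(uint16_t avail_idx, unsigned int n)
{
    return (uint16_t)(avail_idx + n);
}
```

A driver that kicks at least once per 2^16 additions can never reach the aliased state, which is why Rusty argues the invalidation branch can be a plain WARN_ON.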
Re: [PATCH 05/18] virtio: used event index interface
On Wednesday, May 04, 2011 03:51:09 PM Michael S. Tsirkin wrote:
> Define a new feature bit for the guest to utilize a used_event index
> (like Xen) instead of a flag bit to enable/disable interrupts.
>
> Signed-off-by: Michael S. Tsirkin m...@redhat.com
> ---
>  include/linux/virtio_ring.h | 9 +
>  1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index e4d144b..f5c1b75 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -29,6 +29,10 @@
>  /* We support indirect buffer descriptors */
>  #define VIRTIO_RING_F_INDIRECT_DESC	28
>
> +/* The Guest publishes the used index for which it expects an interrupt
> + * at the end of the avail ring. Host should ignore the avail->flags field. */
> +#define VIRTIO_RING_F_USED_EVENT_IDX	29
> +
>  /* Virtio ring descriptors: 16 bytes. These can chain together via next. */
>  struct vring_desc {
>  	/* Address (guest-physical). */
> @@ -83,6 +87,7 @@ struct vring {
>   *	__u16 avail_flags;
>   *	__u16 avail_idx;
>   *	__u16 available[num];
> + *	__u16 used_event_idx;
>   *
>   *	// Padding to the next align boundary.
>   *	char pad[];
> @@ -93,6 +98,10 @@ struct vring {
>   *	struct vring_used_elem used[num];
>   * };
>   */
> +/* We publish the used event index at the end of the available ring.
> + * It is at the end for backwards compatibility. */
> +#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
> +
>  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
>  			      unsigned long align)
>  {

You should update the vring_size procedure to account for the extra field at
the end of the available ring by changing the (2 + num) to (3 + num):

	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)

Tom
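Tom's (3 + num) suggestion can be checked in isolation. The following is a hypothetical user-space sketch of vring_size() with that fix applied, using stdint types in place of the kernel's __u16 and minimal struct definitions matching the 16-byte descriptor and 8-byte used element from the header:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal layouts matching include/linux/virtio_ring.h:
 * descriptors are 16 bytes, used elements are 8 bytes. */
struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_used_elem { uint32_t id; uint32_t len; };

/* vring_size() with the suggested fix: (3 + num) covers avail_flags,
 * avail_idx, available[num] and the new used_event_idx word at the end
 * of the available ring; the used-ring side is unchanged by this patch. */
static size_t vring_size(unsigned int num, unsigned long align)
{
    return ((sizeof(struct vring_desc) * num + sizeof(uint16_t) * (3 + num)
             + align - 1) & ~(align - 1))
           + sizeof(uint16_t) * 2 + sizeof(struct vring_used_elem) * num;
}
```

With the old (2 + num) factor, small alignments would under-count by the two bytes of used_event_idx; with a large align the rounding can mask the bug, which is why it is easy to miss.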
Re: [PATCH 09/18] virtio: use avail_event index
On Wednesday, May 04, 2011 03:51:47 PM Michael S. Tsirkin wrote:
> Use the new avail_event feature to reduce the number of exits from the
> guest.
>
> Signed-off-by: Michael S. Tsirkin m...@redhat.com
> ---
>  drivers/virtio/virtio_ring.c | 39 ++-
>  1 files changed, 38 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 3a3ed75..262dfe6 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -82,6 +82,15 @@ struct vring_virtqueue
>  	/* Host supports indirect buffers */
>  	bool indirect;
>
> +	/* Host publishes avail event idx */
> +	bool event;
> +
> +	/* Is kicked_avail below valid? */
> +	bool kicked_avail_valid;
> +
> +	/* avail idx value we already kicked. */
> +	u16 kicked_avail;
> +
>  	/* Number of free buffers */
>  	unsigned int num_free;
>  	/* Head of free buffer list. */
> @@ -228,6 +237,12 @@ add_head:
>  	 * new available array entries. */
>  	virtio_wmb();
>  	vq->vring.avail->idx++;
> +	/* If the driver never bothers to kick in a very long while,
> +	 * avail index might wrap around. If that happens, invalidate
> +	 * kicked_avail index we stored. TODO: make sure all drivers
> +	 * kick at least once in 2^16 and remove this. */
> +	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> +		vq->kicked_avail_valid = true;

vq->kicked_avail_valid should be set to false here.
Tom

>  	pr_debug("Added buffer head %i to %p\n", head, vq);
>  	END_USE(vq);
> @@ -236,6 +251,23 @@ add_head:
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
>
> +static bool vring_notify(struct vring_virtqueue *vq)
> +{
> +	u16 old, new;
> +	bool v;
> +	if (!vq->event)
> +		return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
> +
> +	v = vq->kicked_avail_valid;
> +	old = vq->kicked_avail;
> +	new = vq->kicked_avail = vq->vring.avail->idx;
> +	vq->kicked_avail_valid = true;
> +	if (unlikely(!v))
> +		return true;
> +	return vring_need_event(vring_avail_event(&vq->vring), new, old);
> +}
> +
>  void virtqueue_kick(struct virtqueue *_vq)
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -244,7 +276,7 @@ void virtqueue_kick(struct virtqueue *_vq)
>  	/* Need to update avail index before checking if we should notify */
>  	virtio_mb();
>
> -	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
> +	if (vring_notify(vq))
>  		/* Prod other side to tell it about changes. */
>  		vq->notify(&vq->vq);
>
> @@ -437,6 +469,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
>  	vq->vq.name = name;
>  	vq->notify = notify;
>  	vq->broken = false;
> +	vq->kicked_avail_valid = false;
> +	vq->kicked_avail = 0;
>  	vq->last_used_idx = 0;
>  	list_add_tail(&vq->vq.list, &vdev->vqs);
> #ifdef DEBUG
> @@ -444,6 +478,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> #endif
>
>  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> +	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_AVAIL_EVENT_IDX);
>
>  	/* No callback? Tell other side not to bother us. */
>  	if (!callback)
> @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
>  		break;
>  	case VIRTIO_RING_F_USED_EVENT_IDX:
>  		break;
> +	case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> +		break;
>  	default:
>  		/* We don't understand this bit. */
>  		clear_bit(i, vdev->features);
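vring_notify() in the patch above delegates its decision to vring_need_event(), a wrap-safe window test defined elsewhere in this series. A user-space rendering for illustration (uint16_t standing in for the kernel's __u16; the comparison is the standard modulo-2^16 idiom):

```c
#include <stdint.h>

/* Notify iff event_idx lies in the window [old, new_idx) of avail
 * entries published since the last kick. Both subtractions are done
 * in 16-bit arithmetic, so the test stays correct when the
 * free-running avail index wraps past 0xffff. */
static int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}
```

This is what lets the host suppress most guest exits: the guest only kicks when its new avail index has actually crossed the event index the host published in avail_event.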