Re: SEV guest regression in 4.18
On Fri, Aug 24, 2018 at 01:47:10PM -0500, Brijesh Singh wrote:
> I am more inclined towards creating a new section, PMD aligned and
> sized. This section will contain the decrypted data. In early boot
> code we will update the mapping with C=0. If a caller wants to create
> a shared variable then it can do so with:
>
> static int foo __decrypted;

Right, and keeping SEV-ES's GHCB in mind, you could make that section
extensible so that the GHCB's 4K page can land there too. Maybe
something like a PMD-aligned range of 4K pages which are fully defined
and which hypervisor and guest can share and can be used for all kinds
of communication in the future...

--
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
Re: SEV guest regression in 4.18
On 08/24/2018 11:24 AM, Sean Christopherson wrote:
> On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote:
>> On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
>>> On 23/08/2018 17:29, Sean Christopherson wrote:
>>>> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>>>>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>>>> Yes, this is one of the approaches I have in mind. It will avoid
>>>>>> splitting the larger pages; I am thinking that early in boot code
>>>>>> we can look up this special section and decrypt it in place and
>>>>>> probably map it with C=0. The only downside is that it will
>>>>>> increase the data section footprint a bit because we need to
>>>>>> align this section to PMD_SIZE.
>>>>>
>>>>> If you can ensure it doesn't span a PMD, maybe it does not need to
>>>>> be aligned; you could establish a C=0 mapping of the whole 2M
>>>>> around it.
>>>>
>>>> Wouldn't that result in exposing/leaking whatever code/data
>>>> happened to reside on the same 2M page (or corrupting it if the
>>>> entire page isn't decrypted)? Or are you suggesting that we'd also
>>>> leave the encrypted mapping intact?
>>>
>>> Yes, exactly the latter, because...
>>
>> Hardware does not enforce coherency between the encrypted and
>> unencrypted mappings of the same physical page, so creating two
>> mappings of the same physical address can lead to data corruption.
>
> But couldn't we avoid corruption by ensuring data accessed via the
> unencrypted mapping is cache line aligned and sized? The CPU could
> speculatively bring the encrypted version into the cache but it
> should never get into a modified state (barring a software bug, but
> that would be a problem regardless of encryption).

Yes, if we can ensure that accesses are cache line aligned and sized
then we should be fine.
Re: SEV guest regression in 4.18
On 08/24/2018 10:50 AM, Paolo Bonzini wrote:
> On 24/08/2018 17:41, Brijesh Singh wrote:
>>>> Wouldn't that result in exposing/leaking whatever code/data
>>>> happened to reside on the same 2M page (or corrupting it if the
>>>> entire page isn't decrypted)? Or are you suggesting that we'd also
>>>> leave the encrypted mapping intact?
>>>
>>> Yes, exactly the latter, because...
>>
>> Hardware does not enforce coherency between the encrypted and
>> unencrypted mappings of the same physical page, so creating two
>> mappings of the same physical address can lead to data corruption.
>>
>> Note, SME creates two mappings of the same physical address to
>> perform in-place encryption of the kernel and initrd images; this is
>> a special case and the APM documents the steps for doing it.
>
> Ah, so that's what I was thinking about. But a single cache line
> would never be used both encrypted and unencrypted, would it?

No. If we create two mappings of the same physical address and ensure
that accesses are cache line aligned and sized, then I think we will be
safe from the cache point of view.

I have not tried an early remap from kvmclock_init() yet and am not
sure it will work that early. The downside of this approach is that it
puts too many constraints on the caller; it would need to ensure that
variables are cache line aligned and sized, never accessed with C=1,
and that a new mapping is created before accessing them.

I am more inclined towards creating a new section, PMD aligned and
sized. This section will contain the decrypted data. In early boot
code we will update the mapping with C=0. If a caller wants to create
a shared variable then it can do so with:

static int foo __decrypted;

I will submit a patch for review.

thanks
Re: SEV guest regression in 4.18
On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote:
> On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
>> On 23/08/2018 17:29, Sean Christopherson wrote:
>>> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>>>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>>> Yes, this is one of the approaches I have in mind. It will avoid
>>>>> splitting the larger pages; I am thinking that early in boot code
>>>>> we can look up this special section and decrypt it in place and
>>>>> probably map it with C=0. The only downside is that it will
>>>>> increase the data section footprint a bit because we need to
>>>>> align this section to PMD_SIZE.
>>>>
>>>> If you can ensure it doesn't span a PMD, maybe it does not need to
>>>> be aligned; you could establish a C=0 mapping of the whole 2M
>>>> around it.
>>>
>>> Wouldn't that result in exposing/leaking whatever code/data
>>> happened to reside on the same 2M page (or corrupting it if the
>>> entire page isn't decrypted)? Or are you suggesting that we'd also
>>> leave the encrypted mapping intact?
>>
>> Yes, exactly the latter, because...
>
> Hardware does not enforce coherency between the encrypted and
> unencrypted mappings of the same physical page, so creating two
> mappings of the same physical address can lead to data corruption.

But couldn't we avoid corruption by ensuring data accessed via the
unencrypted mapping is cache line aligned and sized? The CPU could
speculatively bring the encrypted version into the cache but it should
never get into a modified state (barring a software bug, but that would
be a problem regardless of encryption).

> Note, SME creates two mappings of the same physical address to perform
> in-place encryption of the kernel and initrd images; this is a special
> case and the APM documents the steps for doing it.
>
>>> Does hardware include the C-bit in the cache tag?
>>
>> ... the C-bit is effectively part of the physical address and hence
>> of the cache tag. The kernel is already relying on this to properly
>> encrypt/decrypt pages, if I remember correctly.
>>
>> Paolo
Re: SEV guest regression in 4.18
On 24/08/2018 17:41, Brijesh Singh wrote:
>>> Wouldn't that result in exposing/leaking whatever code/data happened
>>> to reside on the same 2M page (or corrupting it if the entire page
>>> isn't decrypted)? Or are you suggesting that we'd also leave the
>>> encrypted mapping intact?
>>
>> Yes, exactly the latter, because...
>
> Hardware does not enforce coherency between the encrypted and
> unencrypted mappings of the same physical page, so creating two
> mappings of the same physical address can lead to data corruption.
>
> Note, SME creates two mappings of the same physical address to perform
> in-place encryption of the kernel and initrd images; this is a special
> case and the APM documents the steps for doing it.

Ah, so that's what I was thinking about. But a single cache line would
never be used both encrypted and unencrypted, would it?

Paolo
Re: SEV guest regression in 4.18
On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
> On 23/08/2018 17:29, Sean Christopherson wrote:
>> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>> Yes, this is one of the approaches I have in mind. It will avoid
>>>> splitting the larger pages; I am thinking that early in boot code
>>>> we can look up this special section and decrypt it in place and
>>>> probably map it with C=0. The only downside is that it will
>>>> increase the data section footprint a bit because we need to align
>>>> this section to PMD_SIZE.
>>>
>>> If you can ensure it doesn't span a PMD, maybe it does not need to
>>> be aligned; you could establish a C=0 mapping of the whole 2M
>>> around it.
>>
>> Wouldn't that result in exposing/leaking whatever code/data happened
>> to reside on the same 2M page (or corrupting it if the entire page
>> isn't decrypted)? Or are you suggesting that we'd also leave the
>> encrypted mapping intact?
>
> Yes, exactly the latter, because...

Hardware does not enforce coherency between the encrypted and
unencrypted mappings of the same physical page, so creating two
mappings of the same physical address can lead to data corruption.

Note, SME creates two mappings of the same physical address to perform
in-place encryption of the kernel and initrd images; this is a special
case and the APM documents the steps for doing it.

>> Does hardware include the C-bit in the cache tag?
>
> ... the C-bit is effectively part of the physical address and hence of
> the cache tag. The kernel is already relying on this to properly
> encrypt/decrypt pages, if I remember correctly.
>
> Paolo
Re: SEV guest regression in 4.18
On 23/08/2018 17:29, Sean Christopherson wrote:
> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>> Yes, this is one of the approaches I have in mind. It will avoid
>>> splitting the larger pages; I am thinking that early in boot code we
>>> can look up this special section and decrypt it in place and
>>> probably map it with C=0. The only downside is that it will increase
>>> the data section footprint a bit because we need to align this
>>> section to PMD_SIZE.
>>
>> If you can ensure it doesn't span a PMD, maybe it does not need to be
>> aligned; you could establish a C=0 mapping of the whole 2M around it.
>
> Wouldn't that result in exposing/leaking whatever code/data happened
> to reside on the same 2M page (or corrupting it if the entire page
> isn't decrypted)? Or are you suggesting that we'd also leave the
> encrypted mapping intact?

Yes, exactly the latter, because...

> Does hardware include the C-bit in the cache tag?

... the C-bit is effectively part of the physical address and hence of
the cache tag. The kernel is already relying on this to properly
encrypt/decrypt pages, if I remember correctly.

Paolo
Re: SEV guest regression in 4.18
On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
> On 22/08/2018 22:11, Brijesh Singh wrote:
>> Yes, this is one of the approaches I have in mind. It will avoid
>> splitting the larger pages; I am thinking that early in boot code we
>> can look up this special section and decrypt it in place and probably
>> map it with C=0. The only downside is that it will increase the data
>> section footprint a bit because we need to align this section to
>> PMD_SIZE.
>
> If you can ensure it doesn't span a PMD, maybe it does not need to be
> aligned; you could establish a C=0 mapping of the whole 2M around it.

Wouldn't that result in exposing/leaking whatever code/data happened
to reside on the same 2M page (or corrupting it if the entire page
isn't decrypted)? Or are you suggesting that we'd also leave the
encrypted mapping intact? If it's the latter...

Does hardware include the C-bit in the cache tag? I.e. are the C=0 and
C=1 variations of the same PA treated as different cache lines? If so,
we could also treat the unencrypted variation as a separate PA by
defining it to be (ACTUAL_PA | (1 << x86_phys_bits)), (re)adjusting
x86_phys_bits if necessary to get the kernel to allow the address.
init_memory_mapping() could then alias every PA with an unencrypted VA
mapping, which would allow the kernel to access any PA unencrypted by
using virt_to_phys() and phys_to_virt() to translate an encrypted VA to
an unencrypted VA. It would mean doubling INIT_PGD_PAGE_COUNT, but
that'd be a one-time cost regardless of how many pages needed to be
accessed with C=0.

> Paolo
Re: SEV guest regression in 4.18
On 22/08/2018 22:11, Brijesh Singh wrote:
> Yes, this is one of the approaches I have in mind. It will avoid
> splitting the larger pages; I am thinking that early in boot code we
> can look up this special section and decrypt it in place and probably
> map it with C=0. The only downside is that it will increase the data
> section footprint a bit because we need to align this section to
> PMD_SIZE.

If you can ensure it doesn't span a PMD, maybe it does not need to be
aligned; you could establish a C=0 mapping of the whole 2M around it.

Paolo
Re: SEV guest regression in 4.18
Hi Sean,

On 08/22/2018 10:00 AM, Sean Christopherson wrote:
> On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote:
>> Dropping Pavel as it bounces.
>>
>> On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
>>> The tsc_early_init() is called before setup_arch() ->
>>> init_mem_mapping.
>>
>> Ok, I see it, thanks for explaining.
>>
>> So back to your original ideas - I'm wondering whether we should
>> define a chunk of memory which the hypervisor and guest can share and
>> thus communicate over... Something a la SEV-ES, also with a strictly
>> defined layout, and put all those variables there. And then the guest
>> can map it decrypted.
>
> What about creating a data section specifically for shared memory?
> The section would be PMD aligned and sized so that it could be mapped
> appropriately without having to fracture the page. Then define a
> macro to easily declare data in the new section, a la __read_mostly.

Yes, this is one of the approaches I have in mind. It will avoid
splitting the larger pages; I am thinking that early in boot code we
can look up this special section and decrypt it in place and probably
map it with C=0. The only downside is that it will increase the data
section footprint a bit because we need to align this section to
PMD_SIZE.

>> There might be something similar though, I dunno.
>>
>> Maybe Paolo has a better idea...
>>
>> --
>> Regards/Gruss,
>>     Boris.
>>
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nürnberg)
Re: SEV guest regression in 4.18
On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote:
> Dropping Pavel as it bounces.
>
> On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
>> The tsc_early_init() is called before setup_arch() ->
>> init_mem_mapping.
>
> Ok, I see it, thanks for explaining.
>
> So back to your original ideas - I'm wondering whether we should define
> a chunk of memory which the hypervisor and guest can share and thus
> communicate over... Something a la SEV-ES, also with a strictly defined
> layout, and put all those variables there. And then the guest can map
> it decrypted.

What about creating a data section specifically for shared memory? The
section would be PMD aligned and sized so that it could be mapped
appropriately without having to fracture the page. Then define a macro
to easily declare data in the new section, a la __read_mostly.

> There might be something similar though, I dunno.
>
> Maybe Paolo has a better idea...
>
> --
> Regards/Gruss,
>     Boris.
>
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
Re: SEV guest regression in 4.18
Dropping Pavel as it bounces.

On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
> The tsc_early_init() is called before setup_arch() -> init_mem_mapping.

Ok, I see it, thanks for explaining.

So back to your original ideas - I'm wondering whether we should define
a chunk of memory which the hypervisor and guest can share and thus
communicate over... Something a la SEV-ES, also with a strictly defined
layout, and put all those variables there. And then the guest can map
it decrypted.

There might be something similar though, I dunno.

Maybe Paolo has a better idea...

--
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
Re: SEV guest regression in 4.18
On 08/21/2018 10:19 AM, Borislav Petkov wrote:
> On Tue, Aug 21, 2018 at 09:37:56AM -0500, Brijesh Singh wrote:
>> Those variables are accessed immediately by the tsc calibration code
>> path, hence we will not be able to delay the allocation.
>
> If you mean check_tsc_sync_source/_target(), those are called pretty
> late, AFAICT.
>
> And I see a setup_arch() -> init_mem_mapping -> ... ->
> kernel_physical_mapping_init() call which happens much earlier.
> Again, AFAICT. You probably should try it though, to make sure.

Here is a typical flow:

setup_arch()
  init_hypervisor_platform()
    kvmclock_init()
      kvm_register_clock("primary cpu clock")
        /* This shares the physical address of the variable with the
         * HV. Ideally we want the C-bit to be cleared before we give
         * the address to the HV, because the HV may start using the
         * variable as soon as we issue wrmsrl(MSR_KVM_SYSTEM_TIME).
         *
         * We may be able to relax the rule that the variable must be
         * mapped with C=0 before the wrmsrl(), since the HV will write
         * the data with C=0 anyway. The important thing is that when
         * the guest reads it, it must map the address with C=0. */
      /* The later part of kvmclock_init() registers hooks for tsc
       * calibration. */
  tsc_early_init()
    determine_cpu_tsc_frequencies()
      /* This calls the tsc calibration hooks registered by
       * kvmclock_init(). The hooks basically read those variables.
       * The variables must be mapped with C=0, otherwise we will not
       * be able to see the data written by the hypervisor. */
  ...
  trim_low_memory_range()
  init_mem_mapping()
  ...

The tsc_early_init() is called before setup_arch() ->
init_mem_mapping().

-Brijesh
Re: SEV guest regression in 4.18
On Tue, Aug 21, 2018 at 09:37:56AM -0500, Brijesh Singh wrote:
> Those variables are accessed immediately by the tsc calibration code
> path, hence we will not be able to delay the allocation.

If you mean check_tsc_sync_source/_target(), those are called pretty
late, AFAICT.

And I see a setup_arch() -> init_mem_mapping -> ... ->
kernel_physical_mapping_init() call which happens much earlier. Again,
AFAICT. You probably should try it though, to make sure.

--
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
Re: SEV guest regression in 4.18
Hi Boris,

On 08/21/2018 03:39 AM, Borislav Petkov wrote:
> On Mon, Aug 20, 2018 at 05:11:53PM -0500, Brijesh Singh wrote:
>> Hi All,
>>
>> The following commit
>>
>> "
>> x86/kvmclock: Remove memblock dependency
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
>> "
>>
>> broke SEV support in 4.18.
>>
>> Since the guest physical addresses holding the wall_clock and
>> vcpu_time_info are shared with the hypervisor, they must be mapped
>> "decrypted" when SEV is active. To clear the C-bit we use
>> kernel_physical_mapping_init() to split the large pages into 4K
>> before changing the C-bit. Now kernel_physical_mapping_init() is
>> failing to allocate memory because it is called very early.
>>
>> [0.00] Call Trace:
>> [0.00]  ? dump_stack+0x5c/0x80
>> [0.00]  ? panic+0xe7/0x247
>> [0.00]  ? alloc_low_pages+0x130/0x130
>> [0.00]  ? kernel_physical_mapping_init+0xe0/0x204
>> [0.00]  ? early_set_memory_enc_dec+0x123/0x174
>> [0.00]  ? 0xae00
>> [0.00]  ? kvmclock_init+0x60/0x1ea
>> [0.00]  ? kvm_init_platform+0xa/0x16
>> [0.00]  ? setup_arch+0x434/0xce9
>> [0.00]  ? start_kernel+0x67/0x52e
>> [0.00]  ? load_ucode_bsp+0x76/0x12e
>> [0.00]  ? secondary_startup_64+0xa4/0xb0
>
> Silly question: can we delay the call to
> kernel_physical_mapping_init() to right before those variables are
> accessed by the guest for the first time, assuming we can allocate
> memory at that point?

Those variables are accessed immediately by the tsc calibration code
path, hence we will not be able to delay the allocation. Before the
above patch series, kvmclock_init() and early tsc calibration were
called a bit later in setup_arch(), hence it all worked fine.

-Brijesh
Re: SEV guest regression in 4.18
On Mon, Aug 20, 2018 at 05:11:53PM -0500, Brijesh Singh wrote:
> Hi All,
>
> The following commit
>
> "
> x86/kvmclock: Remove memblock dependency
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
> "
>
> broke SEV support in 4.18.
>
> Since the guest physical addresses holding the wall_clock and
> vcpu_time_info are shared with the hypervisor, they must be mapped
> "decrypted" when SEV is active. To clear the C-bit we use
> kernel_physical_mapping_init() to split the large pages into 4K before
> changing the C-bit. Now kernel_physical_mapping_init() is failing to
> allocate memory because it is called very early.
>
> [0.00] Call Trace:
> [0.00]  ? dump_stack+0x5c/0x80
> [0.00]  ? panic+0xe7/0x247
> [0.00]  ? alloc_low_pages+0x130/0x130
> [0.00]  ? kernel_physical_mapping_init+0xe0/0x204
> [0.00]  ? early_set_memory_enc_dec+0x123/0x174
> [0.00]  ? 0xae00
> [0.00]  ? kvmclock_init+0x60/0x1ea
> [0.00]  ? kvm_init_platform+0xa/0x16
> [0.00]  ? setup_arch+0x434/0xce9
> [0.00]  ? start_kernel+0x67/0x52e
> [0.00]  ? load_ucode_bsp+0x76/0x12e
> [0.00]  ? secondary_startup_64+0xa4/0xb0

Silly question: can we delay the call to kernel_physical_mapping_init()
to right before those variables are accessed by the guest for the first
time, assuming we can allocate memory at that point?

--
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
SEV guest regression in 4.18
Hi All,

The following commit

"
x86/kvmclock: Remove memblock dependency

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
"

broke SEV support in 4.18.

Since the guest physical addresses holding the wall_clock and
vcpu_time_info are shared with the hypervisor, they must be mapped
"decrypted" when SEV is active. To clear the C-bit we use
kernel_physical_mapping_init() to split the large pages into 4K before
changing the C-bit. Now kernel_physical_mapping_init() is failing to
allocate memory because it is called very early.

[0.00] Call Trace:
[0.00]  ? dump_stack+0x5c/0x80
[0.00]  ? panic+0xe7/0x247
[0.00]  ? alloc_low_pages+0x130/0x130
[0.00]  ? kernel_physical_mapping_init+0xe0/0x204
[0.00]  ? early_set_memory_enc_dec+0x123/0x174
[0.00]  ? 0xae00
[0.00]  ? kvmclock_init+0x60/0x1ea
[0.00]  ? kvm_init_platform+0xa/0x16
[0.00]  ? setup_arch+0x434/0xce9
[0.00]  ? start_kernel+0x67/0x52e
[0.00]  ? load_ucode_bsp+0x76/0x12e
[0.00]  ? secondary_startup_64+0xa4/0xb0

I don't have a proper solution to fix it. I have the following two
approaches in mind:

1) - reserve a few pages in head_64.S
   - pass a flag to kernel_physical_mapping_init() to tell it to use
     the preallocated pages instead of alloc_low_pages()

or

2) - make wall_clock and hv_clock_boot PMD_SIZE aligned
   - when the variables are PMD_SIZE aligned we do not need to split
     the pages, hence we avoid the allocation issue. Since the
     variables are static, we also need to update their contents to
     match the updated C-bit. Currently we use sme_early_decrypt() to
     perform the in-place decrypt, but this routine cannot be called
     before pat_init(), hence we probably need to come up with some
     other approach for this as well.

Any suggestions/recommendations?

Thanks
Brijesh