Re: SEV guest regression in 4.18

2018-08-24 Thread Borislav Petkov
On Fri, Aug 24, 2018 at 01:47:10PM -0500, Brijesh Singh wrote:
> I am more inclined towards creating a new section that is PMD aligned
> and sized. This section will contain the decrypted data. In early
> boot code we will update the mapping with C=0. If a caller wants to
> create a shared variable then it can do so with:
> 
> static int foo __decrypted;

Right, and keeping SEV-ES's GHCB in mind, you could make that section
extensible so that the GHCB's 4K page can land there too. Maybe
something like a PMD-aligned range of 4K pages which are fully defined
and which the hypervisor and guest can share, and which can be used for
all kinds of communication in the future...
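
Roughly like this, say (just a sketch: __decrypted is the macro you
propose above, and the GHCB slot is hypothetical since SEV-ES support
doesn't exist yet):

/* Both live in the PMD-aligned ".data..decrypted" section which early
 * boot code maps with C=0. */
static int foo __decrypted;		/* small shared variable */

/* A page-aligned 4K slot carved out of the same C=0 range, e.g. for a
 * future per-CPU SEV-ES GHCB; name and layout are illustrative only. */
static u8 ghcb_page[PAGE_SIZE] __decrypted __aligned(PAGE_SIZE);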

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: SEV guest regression in 4.18

2018-08-24 Thread Brijesh Singh




On 08/24/2018 11:24 AM, Sean Christopherson wrote:
> On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote:
>> On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
>>> On 23/08/2018 17:29, Sean Christopherson wrote:
>>>> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>>>>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>>>> Yes, this is one of the approaches I have in mind. It will avoid
>>>>>> splitting the larger pages; I am thinking that early in the boot
>>>>>> code we can look up this special section, decrypt it in place,
>>>>>> and probably map it with C=0. The only downside is that it will
>>>>>> increase the data section footprint a bit because we need to
>>>>>> align this section to PMD_SIZE.
>>>>>
>>>>> If you can ensure it doesn't span a PMD, maybe it does not need to
>>>>> be aligned; you could establish a C=0 mapping of the whole 2M
>>>>> around it.
>>>>
>>>> Wouldn't that result in exposing/leaking whatever code/data
>>>> happened to reside on the same 2M page (or corrupting it if the
>>>> entire page isn't decrypted)?  Or are you suggesting that we'd also
>>>> leave the encrypted mapping intact?
>>>
>>> Yes, exactly the latter, because...
>>
>> Hardware does not enforce coherency between the encrypted and
>> unencrypted mappings of the same physical page. So creating two
>> mappings of the same physical address can lead to data corruption.
>
> But couldn't we avoid corruption by ensuring data accessed via the
> unencrypted mapping is cache line aligned and sized?  The CPU could
> speculatively bring the encrypted version into the cache but it
> should never get into a modified state (barring a software bug, but
> that would be a problem regardless of encryption).

Yes, if we can ensure that accesses are cache line aligned and sized,
then we should be fine.



Re: SEV guest regression in 4.18

2018-08-24 Thread Brijesh Singh




On 08/24/2018 10:50 AM, Paolo Bonzini wrote:
> On 24/08/2018 17:41, Brijesh Singh wrote:
>>>> Wouldn't that result in exposing/leaking whatever code/data
>>>> happened to reside on the same 2M page (or corrupting it if the
>>>> entire page isn't decrypted)?  Or are you suggesting that we'd also
>>>> leave the encrypted mapping intact?
>>>
>>> Yes, exactly the latter, because...
>>
>> Hardware does not enforce coherency between the encrypted and
>> unencrypted mappings of the same physical page. So creating two
>> mappings of the same physical address can lead to data corruption.
>>
>> Note, SME creates two mappings of the same physical address to
>> perform in-place encryption of the kernel and initrd images; this is
>> a special case, and the APM documents the steps for doing this.
>
> Ah, so that's what I was thinking about.  But a single cache line
> would never be used both encrypted and unencrypted, would it?

No. If we create two mappings of the same physical address and ensure
that accesses are cache line aligned and sized, then I think we will be
safe from the cache's point of view.

I have not tried an early remap from kvmclock_init() yet and am not
sure it will work that early. The downside of this approach is that it
puts too many constraints on the caller: variables must be cache line
aligned and sized, must never be accessed with C=1, and a new mapping
must be created before accessing them.
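
To illustrate those constraints, a caller would have to do something
like this (sketch only; early_memremap_decrypted() is the existing
early remap helper, everything else is made up for the example and
assumes the helper is usable that early):

struct shared_data {
	u64 val;
} __aligned(SMP_CACHE_BYTES);

/* Cache line aligned and, via the struct alignment, cache line sized. */
static struct shared_data sdata __aligned(SMP_CACHE_BYTES);

static u64 __init read_shared_data(void)
{
	struct shared_data *p;
	u64 val;

	/* Never touch 'sdata' through its normal (C=1) mapping; always
	 * go through a decrypted mapping of its physical address. */
	p = early_memremap_decrypted(__pa(&sdata), sizeof(sdata));
	val = p->val;
	early_memunmap(p, sizeof(sdata));

	return val;
}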


I am more inclined towards creating a new section that is PMD aligned
and sized. This section will contain the decrypted data. In early
boot code we will update the mapping with C=0. If a caller wants to
create a shared variable then it can do so with:

static int foo __decrypted;

I will submit a patch for review.
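
As a rough sketch (the final names and placement may differ in the
actual patch), the macro would live in a header, a la __read_mostly:

#define __decrypted __attribute__((__section__(".data..decrypted")))

/*
 * In vmlinux.lds.S the section would be made PMD aligned and PMD
 * sized, so that clearing the C-bit never has to split a large page,
 * roughly:
 *
 *	. = ALIGN(PMD_SIZE);
 *	*(.data..decrypted)
 *	. = ALIGN(PMD_SIZE);
 */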

thanks


Re: SEV guest regression in 4.18

2018-08-24 Thread Sean Christopherson
On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote:
> 
> 
> On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
> >On 23/08/2018 17:29, Sean Christopherson wrote:
> >>On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
> >>>On 22/08/2018 22:11, Brijesh Singh wrote:
> 
> >>>>Yes, this is one of the approaches I have in mind. It will avoid
> >>>>splitting the larger pages; I am thinking that early in the boot code
> >>>>we can look up this special section, decrypt it in place, and probably
> >>>>map it with C=0. The only downside is that it will increase the data
> >>>>section footprint a bit because we need to align this section to
> >>>>PMD_SIZE.
> >>>
> >>>If you can ensure it doesn't span a PMD, maybe it does not need to be
> >>>aligned; you could establish a C=0 mapping of the whole 2M around it.
> >>
> >>Wouldn't that result in exposing/leaking whatever code/data happened
> >>to reside on the same 2M page (or corrupting it if the entire page
> >>isn't decrypted)?  Or are you suggesting that we'd also leave the
> >>encrypted mapping intact?
> >
> >Yes, exactly the latter, because...
> 
> 
> Hardware does not enforce coherency between the encrypted and
> unencrypted mappings of the same physical page. So creating two
> mappings of the same physical address can lead to data corruption.

But couldn't we avoid corruption by ensuring data accessed via the
unencrypted mapping is cache line aligned and sized?  The CPU could
speculatively bring the encrypted version into the cache but it
should never get into a modified state (barring a software bug, but
that would be a problem regardless of encryption).

> Note, SME creates two mappings of the same physical address to perform
> in-place encryption of the kernel and initrd images; this is a special
> case, and the APM documents the steps for doing this.
> 
> 
> >
> >>Does hardware include the C-bit in the cache tag?
> >
> >... the C-bit is effectively part of the physical address and hence of
> >the cache tag.  The kernel is already relying on this to properly
> >encrypt/decrypt pages, if I remember correctly.
> >
> >Paolo
> >


Re: SEV guest regression in 4.18

2018-08-24 Thread Paolo Bonzini
On 24/08/2018 17:41, Brijesh Singh wrote:
>>>
>>> Wouldn't that result in exposing/leaking whatever code/data happened
>>> to reside on the same 2M page (or corrupting it if the entire page
>>> isn't decrypted)?  Or are you suggesting that we'd also leave the
>>> encrypted mapping intact?
>>
>> Yes, exactly the latter, because...
> 
> 
> Hardware does not enforce coherency between the encrypted and
> unencrypted mappings of the same physical page. So creating two
> mappings of the same physical address can lead to data corruption.
> 
> Note, SME creates two mappings of the same physical address to perform
> in-place encryption of the kernel and initrd images; this is a special
> case, and the APM documents the steps for doing this.

Ah, so that's what I was thinking about.  But a single cache line would
never be used both encrypted and unencrypted, would it?

Paolo


Re: SEV guest regression in 4.18

2018-08-24 Thread Brijesh Singh




On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
> On 23/08/2018 17:29, Sean Christopherson wrote:
>> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>> Yes, this is one of the approaches I have in mind. It will avoid
>>>> splitting the larger pages; I am thinking that early in the boot
>>>> code we can look up this special section, decrypt it in place, and
>>>> probably map it with C=0. The only downside is that it will
>>>> increase the data section footprint a bit because we need to align
>>>> this section to PMD_SIZE.
>>>
>>> If you can ensure it doesn't span a PMD, maybe it does not need to
>>> be aligned; you could establish a C=0 mapping of the whole 2M around
>>> it.
>>
>> Wouldn't that result in exposing/leaking whatever code/data happened
>> to reside on the same 2M page (or corrupting it if the entire page
>> isn't decrypted)?  Or are you suggesting that we'd also leave the
>> encrypted mapping intact?
>
> Yes, exactly the latter, because...

Hardware does not enforce coherency between the encrypted and
unencrypted mappings of the same physical page. So creating two
mappings of the same physical address can lead to data corruption.

Note, SME creates two mappings of the same physical address to perform
in-place encryption of the kernel and initrd images; this is a special
case, and the APM documents the steps for doing this.

>> Does hardware include the C-bit in the cache tag?
>
> ... the C-bit is effectively part of the physical address and hence of
> the cache tag.  The kernel is already relying on this to properly
> encrypt/decrypt pages, if I remember correctly.
>
> Paolo



Re: SEV guest regression in 4.18

2018-08-23 Thread Paolo Bonzini
On 23/08/2018 17:29, Sean Christopherson wrote:
> On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
>> On 22/08/2018 22:11, Brijesh Singh wrote:
>>>
>>> Yes, this is one of the approaches I have in mind. It will avoid
>>> splitting the larger pages; I am thinking that early in the boot code
>>> we can look up this special section, decrypt it in place, and probably
>>> map it with C=0. The only downside is that it will increase the data
>>> section footprint a bit because we need to align this section to
>>> PMD_SIZE.
>>
>> If you can ensure it doesn't span a PMD, maybe it does not need to be
>> aligned; you could establish a C=0 mapping of the whole 2M around it.
> 
> Wouldn't that result in exposing/leaking whatever code/data happened
> to reside on the same 2M page (or corrupting it if the entire page
> isn't decrypted)?  Or are you suggesting that we'd also leave the
> encrypted mapping intact?

Yes, exactly the latter, because...

> Does hardware include the C-bit in the cache tag?

... the C-bit is effectively part of the physical address and hence of
the cache tag.  The kernel is already relying on this to properly
encrypt/decrypt pages, if I remember correctly.

Paolo


Re: SEV guest regression in 4.18

2018-08-23 Thread Sean Christopherson
On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
> On 22/08/2018 22:11, Brijesh Singh wrote:
> > 
> > Yes, this is one of the approaches I have in mind. It will avoid
> > splitting the larger pages; I am thinking that early in the boot code
> > we can look up this special section, decrypt it in place, and probably
> > map it with C=0. The only downside is that it will increase the data
> > section footprint a bit because we need to align this section to
> > PMD_SIZE.
> 
> If you can ensure it doesn't span a PMD, maybe it does not need to be
> aligned; you could establish a C=0 mapping of the whole 2M around it.

Wouldn't that result in exposing/leaking whatever code/data happened
to reside on the same 2M page (or corrupting it if the entire page
isn't decrypted)?  Or are you suggesting that we'd also leave the
encrypted mapping intact?  If it's the latter...

Does hardware include the C-bit in the cache tag?  I.e. are the C=0 and
C=1 variations of the same PA treated as different cache lines?  If
so, we could also treat the unencrypted variation as a separate PA by
defining it to be (ACTUAL_PA | (1 << x86_phys_bits)), (re)adjusting
x86_phys_bits if necessary to get the kernel to allow the address.
init_memory_mapping() could then alias every PA with an unencrypted
VA mapping, which would allow the kernel to access any PA unencrypted
by using virt_to_phys() and phys_to_virt() to translate an encrypted
VA to an unencrypted VA.  It would mean doubling INIT_PGD_PAGE_COUNT,
but that'd be a one-time cost regardless of how many pages needed to
be accessed with C=0.
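
Very rough sketch of what I have in mind (purely illustrative: none of
these helpers exist, and the x86_phys_bits adjustment is hand-waved):

/* Treat C=0 as a separate physical address space: a bit beyond the
 * real PA width selects the unencrypted alias.  This only works if
 * init_memory_mapping() creates the C=0 alias of the direct map. */
#define DECRYPTED_PA_BIT	(1ULL << boot_cpu_data.x86_phys_bits)

static inline void *phys_to_virt_decrypted(phys_addr_t pa)
{
	return phys_to_virt(pa | DECRYPTED_PA_BIT);
}

/* Take any encrypted VA and bounce through its PA to get a VA that
 * accesses the same memory with C=0. */
static inline void *virt_to_virt_decrypted(void *vaddr)
{
	return phys_to_virt_decrypted(virt_to_phys(vaddr));
}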

> Paolo


Re: SEV guest regression in 4.18

2018-08-23 Thread Paolo Bonzini
On 22/08/2018 22:11, Brijesh Singh wrote:
> 
> Yes, this is one of the approaches I have in mind. It will avoid
> splitting the larger pages; I am thinking that early in the boot code
> we can look up this special section, decrypt it in place, and probably
> map it with C=0. The only downside is that it will increase the data
> section footprint a bit because we need to align this section to
> PMD_SIZE.

If you can ensure it doesn't span a PMD, maybe it does not need to be
aligned; you could establish a C=0 mapping of the whole 2M around it.

Paolo


Re: SEV guest regression in 4.18

2018-08-22 Thread Brijesh Singh

Hi Sean,


On 08/22/2018 10:00 AM, Sean Christopherson wrote:
> On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote:
>> Dropping Pavel as it bounces.
>>
>> On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
>>> The tsc_early_init() is called before setup_arch() -> init_mem_mapping.
>>
>> Ok, I see it, thanks for explaining.
>>
>> So back to your original ideas - I'm wondering whether we should define
>> a chunk of memory which the hypervisor and guest can share and thus
>> communicate over... something à la SEV-ES, also with a strictly defined
>> layout, and put all those variables there. And then the guest can map it
>> decrypted.
>
> What about creating a data section specifically for shared memory?
> The section would be PMD aligned and sized so that it could be mapped
> appropriately without having to fracture the page.  Then define a
> macro to easily declare data in the new section, a la __read_mostly.

Yes, this is one of the approaches I have in mind. It will avoid
splitting the larger pages; I am thinking that early in the boot code we
can look up this special section, decrypt it in place, and probably map
it with C=0. The only downside is that it will increase the data section
footprint a bit because we need to align this section to PMD_SIZE.
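
E.g. something like this in early boot code (sketch only; the
__start/__end symbols are made up and would come from the linker
script, while sev_active() and early_set_memory_decrypted() are the
existing SEV helpers):

extern char __start_data_decrypted[], __end_data_decrypted[];

static void __init setup_decrypted_section(void)
{
	unsigned long vaddr = (unsigned long)__start_data_decrypted;
	unsigned long size = __end_data_decrypted - __start_data_decrypted;

	if (!sev_active())
		return;

	/* The section is PMD aligned and sized, so no page splitting
	 * (and hence no allocation) should be needed here. */
	early_set_memory_decrypted(vaddr, size);
}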




>> There might be something similar though, I dunno.
>>
>> Maybe Paolo has a better idea...
>>
>> --
>> Regards/Gruss,
>> Boris.
>>
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nürnberg)


Re: SEV guest regression in 4.18

2018-08-22 Thread Sean Christopherson
On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote:
> Dropping Pavel as it bounces.
> 
> On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
> > The tsc_early_init() is called before setup_arch() -> init_mem_mapping.
> 
> Ok, I see it, thanks for explaining.
> 
> So back to your original ideas - I'm wondering whether we should define
> a chunk of memory which the hypervisor and guest can share and thus
> communicate over... something à la SEV-ES, also with a strictly defined
> layout, and put all those variables there. And then the guest can map
> it decrypted.

What about creating a data section specifically for shared memory?
The section would be PMD aligned and sized so that it could be mapped
appropriately without having to fracture the page.  Then define a
macro to easily declare data in the new section, a la __read_mostly.
 
> There might be something similar though, I dunno.
> 
> Maybe Paolo has a better idea...
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 


Re: SEV guest regression in 4.18

2018-08-22 Thread Borislav Petkov
Dropping Pavel as it bounces.

On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
> The tsc_early_init() is called before setup_arch() -> init_mem_mapping.

Ok, I see it, thanks for explaining.

So back to your original ideas - I'm wondering whether we should define
a chunk of memory which the hypervisor and guest can share and thus
communicate over... something à la SEV-ES, also with a strictly defined
layout, and put all those variables there. And then the guest can map
it decrypted.

There might be something similar though, I dunno.

Maybe Paolo has a better idea...

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: SEV guest regression in 4.18

2018-08-21 Thread Brijesh Singh




On 08/21/2018 10:19 AM, Borislav Petkov wrote:
> On Tue, Aug 21, 2018 at 09:37:56AM -0500, Brijesh Singh wrote:
>> Those variables are accessed immediately by the tsc calibration code
>> path, hence we will not be able to delay the allocation.
>
> If you mean, check_tsc_sync_source/_target(), those are called pretty
> late, AFAICT.
>
> And I see a setup_arch() -> init_mem_mapping -> ... ->
> kernel_physical_mapping_init() call which happens much earlier.
>
> Again, AFAICT. You probably should try it though, to make sure.

Here is a typical flow:

setup_arch()
  ...
  init_hypervisor_platform()
    kvmclock_init()
      kvm_register_clock("primary cpu clock")
        /* This shares the physical address of the variable with the HV.
         * Ideally you want the C-bit to be cleared before we give the
         * address to the HV, because the HV may start using the variable
         * as soon as we issue the wrmsrl(MSR_KVM_SYSTEM_TIME).
         *
         * We may be able to relax the rule that the variable must be
         * mapped with C=0 before making the wrmsrl(), since the HV will
         * write the data with C=0 anyway. The important thing is that
         * when the guest reads it, it must map the address with C=0.
         */
      The later part of kvmclock_init() registers hooks for tsc
      calibration.

  tsc_early_init()
    determine_cpu_tsc_frequencies()
      /* This calls the tsc calibration hooks registered by
       * kvmclock_init(). The hooks basically read those variables.
       * These variables must be mapped with C=0, otherwise we will not
       * be able to read the data written by the hypervisor.
       */

  ...

  trim_low_memory_range()
  init_mem_mapping()

  ...

The tsc_early_init() is called before setup_arch() -> init_mem_mapping().
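
For reference, the registration step looks roughly like this
(simplified sketch of the 4.18 kvmclock code, details trimmed):

/* This is where the guest hands the physical address of the shared
 * pvclock structure to the HV; bit 0 of the MSR value is the enable
 * bit.  From this point on the HV may write the structure at any
 * time, always with C=0. */
static void kvm_register_clock(char *txt)
{
	int cpu = smp_processor_id();
	u64 pa;

	pa = slow_virt_to_phys(&hv_clock_boot[cpu].pvti) | 0x01ULL;
	wrmsrl(msr_kvm_system_time, pa);
	pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
}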

-Brijesh


Re: SEV guest regression in 4.18

2018-08-21 Thread Borislav Petkov
On Tue, Aug 21, 2018 at 09:37:56AM -0500, Brijesh Singh wrote:
> Those variables are accessed immediately by the tsc calibration code
> path, hence we will not be able to delay the allocation.

If you mean, check_tsc_sync_source/_target(), those are called pretty
late, AFAICT.

And I see a setup_arch() -> init_mem_mapping -> ... ->
kernel_physical_mapping_init() call which happens much earlier.

Again, AFAICT. You probably should try it though, to make sure.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: SEV guest regression in 4.18

2018-08-21 Thread Brijesh Singh

Hi Boris,


On 08/21/2018 03:39 AM, Borislav Petkov wrote:
> On Mon, Aug 20, 2018 at 05:11:53PM -0500, Brijesh Singh wrote:
>> Hi All,
>>
>> The following commit
>>
>> "
>> x86/kvmclock: Remove memblock dependency
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
>>
>> "
>> broke the SEV support in 4.18.
>>
>> Since the guest physical addresses holding the wall_clock and
>> vcpu_time_info structures are shared with the hypervisor, they must
>> be mapped as "decrypted" when SEV is active. To clear the C-bit we
>> use kernel_physical_mapping_init() to split the large pages into 4K
>> pages before changing the C-bit. Now kernel_physical_mapping_init()
>> is failing to allocate memory because it is called very early.
>>
>> [0.00] Call Trace:
>> [0.00]  ? dump_stack+0x5c/0x80
>> [0.00]  ? panic+0xe7/0x247
>> [0.00]  ? alloc_low_pages+0x130/0x130
>> [0.00]  ? kernel_physical_mapping_init+0xe0/0x204
>> [0.00]  ? early_set_memory_enc_dec+0x123/0x174
>> [0.00]  ? 0xae00
>> [0.00]  ? kvmclock_init+0x60/0x1ea
>> [0.00]  ? kvm_init_platform+0xa/0x16
>> [0.00]  ? setup_arch+0x434/0xce9
>> [0.00]  ? start_kernel+0x67/0x52e
>> [0.00]  ? load_ucode_bsp+0x76/0x12e
>> [0.00]  ? secondary_startup_64+0xa4/0xb0
>
> Silly question: can we delay the call to kernel_physical_mapping_init()
> to right before those variables are accessed by the guest for the first
> time, assuming we can allocate memory at that point?

Those variables are accessed immediately by the tsc calibration code
path, hence we will not be able to delay the allocation. Before the
above patch series, kvmclock_init() and the early tsc calibration were
called a bit later in setup_arch(), hence it all worked fine.

-Brijesh


Re: SEV guest regression in 4.18

2018-08-21 Thread Borislav Petkov
On Mon, Aug 20, 2018 at 05:11:53PM -0500, Brijesh Singh wrote:
> Hi All,
> 
> The following commit
> 
> "
> x86/kvmclock: Remove memblock dependency
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
> 
> "
> broke the SEV support in 4.18.
> 
> Since the guest physical addresses holding the wall_clock and
> vcpu_time_info structures are shared with the hypervisor, they must be
> mapped as "decrypted" when SEV is active. To clear the C-bit we use
> kernel_physical_mapping_init() to split the large pages into 4K pages
> before changing the C-bit. Now kernel_physical_mapping_init() is
> failing to allocate memory because it is called very early.
> 
> [0.00] Call Trace:
> [0.00]  ? dump_stack+0x5c/0x80
> [0.00]  ? panic+0xe7/0x247
> [0.00]  ? alloc_low_pages+0x130/0x130
> [0.00]  ? kernel_physical_mapping_init+0xe0/0x204
> [0.00]  ? early_set_memory_enc_dec+0x123/0x174
> [0.00]  ? 0xae00
> [0.00]  ? kvmclock_init+0x60/0x1ea
> [0.00]  ? kvm_init_platform+0xa/0x16
> [0.00]  ? setup_arch+0x434/0xce9
> [0.00]  ? start_kernel+0x67/0x52e
> [0.00]  ? load_ucode_bsp+0x76/0x12e
> [0.00]  ? secondary_startup_64+0xa4/0xb0

Silly question: can we delay the call to kernel_physical_mapping_init()
to right before those variables are accessed by the guest for the first
time, assuming we can allocate memory at that point?

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


SEV guest regression in 4.18

2018-08-20 Thread Brijesh Singh

Hi All,

The following commit

"
x86/kvmclock: Remove memblock dependency

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343

"
broke the SEV support in 4.18.

Since the guest physical addresses holding the wall_clock and
vcpu_time_info structures are shared with the hypervisor, they must be
mapped as "decrypted" when SEV is active. To clear the C-bit we use
kernel_physical_mapping_init() to split the large pages into 4K pages
before changing the C-bit. Now kernel_physical_mapping_init() is
failing to allocate memory because it is called very early.

[0.00] Call Trace:
[0.00]  ? dump_stack+0x5c/0x80
[0.00]  ? panic+0xe7/0x247
[0.00]  ? alloc_low_pages+0x130/0x130
[0.00]  ? kernel_physical_mapping_init+0xe0/0x204
[0.00]  ? early_set_memory_enc_dec+0x123/0x174
[0.00]  ? 0xae00
[0.00]  ? kvmclock_init+0x60/0x1ea
[0.00]  ? kvm_init_platform+0xa/0x16
[0.00]  ? setup_arch+0x434/0xce9
[0.00]  ? start_kernel+0x67/0x52e
[0.00]  ? load_ucode_bsp+0x76/0x12e
[0.00]  ? secondary_startup_64+0xa4/0xb0

I don't have a proper solution to fix it yet. I have the following two
approaches in mind:

1)
 - reserve a few pages in head_64.S
 - pass a flag to kernel_physical_mapping_init() to tell it to use
   the preallocated pages instead of alloc_low_pages().

or

2)
 - update hv_clock_boot and wall_clock to be PMD_SIZE aligned
 - when the variables are PMD_SIZE aligned, we do not need to
   split the pages and hence avoid the allocation issue.

Since the variables are static, we also need to update their contents
to match the updated C-bit. Currently we use sme_early_decrypt() to
perform the in-place decryption, but this routine cannot be called
before pat_init(), hence we probably need to come up with some other
approach for this as well.
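
For approach 2, the change would be something like this (sketch only;
the wasted padding and the in-place update of the contents are exactly
the open problems mentioned above):

/* PMD_SIZE-align the shared variables so that clearing the C-bit
 * never needs to split a large page and hence never allocates. */
static struct pvclock_vsyscall_time_info
	hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PMD_SIZE);
static struct pvclock_wall_clock wall_clock __aligned(PMD_SIZE);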


Any suggestions/recommendations ?


Thanks
Brijesh