Re: [PATCH v2 2/2] ACPI: HMAT: Fix initiator registration for single-initiator systems

2022-11-16 Thread Kirill A. Shutemov
On Wed, Nov 16, 2022 at 04:37:37PM -0700, Vishal Verma wrote:
> In a system with a single initiator node, and one or more memory-only
> 'target' nodes, the memory-only node(s) would fail to register their
> initiator node correctly. i.e. in sysfs:
> 
>   # ls /sys/devices/system/node/node0/access0/targets/
>   node0
> 
> Whereas the correct behavior should be:
> 
>   # ls /sys/devices/system/node/node0/access0/targets/
>   node0 node1
> 
> This happened because hmat_register_target_initiators() uses list_sort()
> to sort the initiator list, but the sort comparison function
> (initiator_cmp()) is overloaded to also set the node mask's bits.
> 
> In a system with a single initiator, the list is singular, and list_sort
> elides the comparison helper call. Thus the node mask never gets set,
> and the subsequent search for the best initiator comes up empty.
> 
> Add a new helper to consume the sorted initiator list, and generate the
> nodemask, decoupling it from the overloaded initiator_cmp() comparison
> callback. This prevents the singular list corner case naturally, and
> makes the code easier to follow as well.
> 
> Cc: 
> Cc: Rafael J. Wysocki 
> Cc: Liu Shixin 
> Cc: Dan Williams 
> Cc: Kirill A. Shutemov 
> Reported-by: Chris Piper 
> Signed-off-by: Vishal Verma 

Acked-by: Kirill A. Shutemov 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 2/2] ACPI: HMAT: Fix initiator registration for single-initiator systems

2022-11-16 Thread Kirill A. Shutemov
On Wed, Nov 16, 2022 at 12:57:36AM -0700, Vishal Verma wrote:
> In a system with a single initiator node, and one or more memory-only
> 'target' nodes, the memory-only node(s) would fail to register their
> initiator node correctly. i.e. in sysfs:
> 
>   # ls /sys/devices/system/node/node0/access0/targets/
>   node0
> 
> Whereas the correct behavior should be:
> 
>   # ls /sys/devices/system/node/node0/access0/targets/
>   node0 node1
> 
> This happened because hmat_register_target_initiators() uses list_sort()
> to sort the initiator list, but the sort comparison function
> (initiator_cmp()) is overloaded to also set the node mask's bits.
> 
> In a system with a single initiator, the list is singular, and list_sort
> elides the comparison helper call. Thus the node mask never gets set,
> and the subsequent search for the best initiator comes up empty.
> 
> Add a new helper to sort the initiator list, and handle the singular
> list corner case by setting the node mask for that explicitly.
> 
> Reported-by: Chris Piper 
> Cc: 
> Cc: Rafael J. Wysocki 
> Cc: Liu Shixin 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/numa/hmat.c | 32 ++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 144a84f429ed..cd20b0e9cdfa 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -573,6 +573,30 @@ static int initiator_cmp(void *priv, const struct list_head *a,
>   return ia->processor_pxm - ib->processor_pxm;
>  }
>  
> +static int initiators_to_nodemask(unsigned long *p_nodes)
> +{
> + /*
> +  * list_sort doesn't call @cmp (initiator_cmp) for 0 or 1 sized lists.
> +  * For a single-initiator system with other memory-only nodes, this
> +  * means an empty p_nodes mask, since that is set by initiator_cmp().
> +  * Special case the singular list, and make sure the node mask gets set
> +  * appropriately.
> +  */
> + if (list_empty(&initiators))
> + return -ENXIO;
> +
> + if (list_is_singular(&initiators)) {
> + struct memory_initiator *initiator = list_first_entry(
> + &initiators, struct memory_initiator, node);
> +
> + set_bit(initiator->processor_pxm, p_nodes);
> + return 0;
> + }
> +
> + list_sort(p_nodes, &initiators, initiator_cmp);
> + return 0;
> +}
> +

Hm. I think it indicates that these set_bit()s do not belong to
initiator_cmp().

Maybe remove both set_bit() from the compare helper and walk the list
separately to initialize the node mask? I think it will be easier to
follow.
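
For illustration, a sketch of that decoupled approach (assuming the static
'initiators' list and struct memory_initiator from hmat.c; this is a sketch,
not an actual patch):

static void initiators_to_nodemask(unsigned long *p_nodes)
{
	struct memory_initiator *initiator;

	/* Sort once, with a comparator that only compares. */
	list_sort(NULL, &initiators, initiator_cmp);

	/* Then walk the sorted list to populate the node mask. */
	list_for_each_entry(initiator, &initiators, node)
		set_bit(initiator->processor_pxm, p_nodes);
}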

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH 1/2] ACPI: HMAT: remove unnecessary variable initialization

2022-11-16 Thread Kirill A. Shutemov
On Wed, Nov 16, 2022 at 12:57:35AM -0700, Vishal Verma wrote:
> In hmat_register_target_initiators(), the variable 'best' gets
> initialized in the outer per-locality-type for loop. The initialization
> just before setting up 'Access 1' targets was unnecessary. Remove it.
> 
> Cc: Rafael J. Wysocki 
> Cc: Liu Shixin 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 

Acked-by: Kirill A. Shutemov 


-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 08:09:13PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 06:09:29PM +, Sean Christopherson wrote:
> > > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > > > But fundamentally the private pages, are well, private.  They can't 
> > > > > be shared
> > > > > across processes, so I think we could (should?) require the VMA to 
> > > > > always be
> > > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. 
> > > > > is that
> > > > > enough to prevent userspace and unaware kernel code from acquiring a 
> > > > > reference
> > > > > to the underlying page?
> > > > 
> > > > Shared pages should be fine too (you folks wanted tmpfs support).
> > > 
> > > Is that a conflict though?  If the private->shared conversion request is 
> > > kicked
> > > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, 
> > > no?
> > > 
> > > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't 
> > > be
> > > shared, and dirty data can't be written back to the file.
> > 
> > It can be remapped, but faulting in the page would produce hwpoison entry.
> 
> It sounds like you're thinking the whole tmpfs file is poisoned.

No. The file is not poisoned. Pages get poisoned when they are faulted into a
flagged VMA. Different VMs can use the same file as long as the offsets do not
overlap.

> My thought is that userspace would need to do something like for guest
> private memory:
> 
>   mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | 
> MAP_GUEST_ONLY, fd, 0);
> 
> The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
> only point at private/poisoned memory, e.g. on fault, the associated PFN would
> be tagged with PG_hwpoison or whatever.  @fd in this case could point at 
> tmpfs,
> but I don't think it's a hard requirement.
> 
> On conversion to shared, userspace could then do:
> 
>   munmap(, )
>   mmap(, , PROT_READ|PROT_WRITE, MAP_SHARED | 
> MAP_FIXED_NOREPLACE, fd, );
> 
> or
> 
>   mmap(, , PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 
> );
> 

I played with this variant before, but initiated from the kernel. It should
work fine.

> or
> 
>   ioctl(kvm, KVM_SET_USER_MEMORY_REGION, );
>   mmap(NULL, , PROT_READ|PROT_WRITE, MAP_SHARED, fd, );
>   ioctl(kvm, KVM_SET_USER_MEMORY_REGION, );
> 
> Combinations would also work, e.g. unmap the private range and move the 
> memslot.
> The private and shared memory regions could also be backed differently, e.g.
> tmpfs for shared memory, anonymous for private memory.

Right. The kernel has to be flexible enough to provide any of these schemes.
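
As a rough userspace sketch of the flow above (MAP_GUEST_ONLY is only a flag
proposed in this thread, not an existing one; the value below is a placeholder
and the helpers are purely illustrative):

#include <sys/types.h>
#include <sys/mman.h>

#define MAP_GUEST_ONLY	0x80000		/* hypothetical flag, illustration only */

/* Map the whole guest as private, poison-on-fault memory. */
static void *map_guest_private(int fd, size_t guest_size)
{
	return mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_GUEST_ONLY, fd, 0);
}

/*
 * Convert one range back to shared: drop the private mapping and re-map the
 * same file offset MAP_SHARED at the same host virtual address.
 */
static int convert_to_shared(void *hva, size_t len, int fd, off_t offset)
{
	if (munmap(hva, len))
		return -1;
	if (mmap(hva, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fd, offset) == MAP_FAILED)
		return -1;
	return 0;
}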

> 
> > I don't see other way to make Google's use-case with tmpfs-backed guest
> > memory work.
> 
> The underlying use-case is to be able to access guest memory from more than 
> one
> process, e.g. so that communication with the guest isn't limited to the VMM
> process associated with the KVM instances.  By definition, guest private 
> memory
> can't be accessed by the host; I don't see how anyone, Google included, can 
> have
> any real requirements about
> 
> > > > The poisoned pages must be useless outside of the process with the 
> > > > blessed
> > > > struct kvm. See kvm_pfn_map in the patch.
> > > 
> > > The big requirement for kernel TDX support is that the pages are useless 
> > > in the
> > > host.  Regarding the guest, for TDX, the TDX Module guarantees that at 
> > > most a
> > > single KVM guest can have access to a page at any given time.  I believe 
> > > the RMP
> > > provides the same guarantees for SEV-SNP.
> > > 
> > > SEV/SEV-ES could still end up with corruption if multiple guests map the 
> > > same
> > > private page, but that's obviously not the end of the world since it's 
> > > the status
> > > quo today.  Living with that shortcoming might be a worthy tradeoff if 
> > > punting
> > > mutual exclusion between guests to firmware/hardware allows us to 
> > > simplify the
> > > kernel implementation.
> > 
> > The critical question is whether we ever need to translate hva->pfn after
> > the page is added to the guest private memory. I believe we do, but I
> > never checked. And that's the reason we need to keep

Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 06:09:29PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > But fundamentally the private pages, are well, private.  They can't be 
> > > shared
> > > across processes, so I think we could (should?) require the VMA to always 
> > > be
> > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is 
> > > that
> > > enough to prevent userspace and unaware kernel code from acquiring a 
> > > reference
> > > to the underlying page?
> > 
> > Shared pages should be fine too (you folks wanted tmpfs support).
> 
> Is that a conflict though?  If the private->shared conversion request is 
> kicked
> out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> 
> Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> shared, and dirty data can't be written back to the file.

It can be remapped, but faulting in the page would produce hwpoison entry.
I don't see other way to make Google's use-case with tmpfs-backed guest
memory work.

> > The poisoned pages must be useless outside of the process with the blessed
> > struct kvm. See kvm_pfn_map in the patch.
> 
> The big requirement for kernel TDX support is that the pages are useless in 
> the
> host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> single KVM guest can have access to a page at any given time.  I believe the 
> RMP
> provides the same guarantees for SEV-SNP.
> 
> SEV/SEV-ES could still end up with corruption if multiple guests map the same
> private page, but that's obviously not the end of the world since it's the 
> status
> quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> mutual exclusion between guests to firmware/hardware allows us to simplify the
> kernel implementation.

The critical question is whether we ever need to translate hva->pfn after
the page is added to the guest private memory. I believe we do, but I
never checked. And that's the reason we need to keep hwpoison entries
around, which encode pfn.

If we don't, it would simplify the solution: kvm_pfn_map is not needed.
Single bit-per page would be enough.

> > > >  - Add a new GUP flag to retrive such pages from the userspace mapping.
> > > >Used only for private mapping population.
> > > 
> > > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > > >guest.
> > > > 
> > > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > > >would lead to SIGBUS.
> > > > 
> > > > So far it looks pretty straight-forward.
> > > > 
> > > > The only thing that I don't understand is at what point the page gets 
> > > > tied
> > > > to the KVM instance. Currently we do it just before populating shadow
> > > > entries, but it would not work with the new scheme: as we poison pages
> > > > on fault in, they may never get inserted into shadow entries. That's not
> > > > good as we rely on the info to unpoison page on free.
> > > 
> > > Can you elaborate on what you mean by "unpoison"?  If the page is never 
> > > actually
> > > mapped into the guest, then its poisoned status is nothing more than a 
> > > software
> > > flag, i.e. nothing extra needs to be done on free.
> > 
> > Normally, the poisoned flag is preserved for freed pages as it usually
> > indicates a hardware issue. In this case we need to return the page to
> > normal circulation, so we need a way to differentiate the two kinds of page
> > poison. The current patch does this by adding the page's pfn to kvm_pfn_map.
> > But this will not work if we uncouple poisoning from adding to the shadow PTE.
> 
> Why use PG_hwpoison then?

Page flags are scarce. I don't want to occupy a new one until I'm sure I must.

And we can re-use existing infrastructure to SIGBUS on access to such
pages.

-- 
 Kirill A. Shutemov


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Fri, Apr 16, 2021 at 05:30:30PM +, Sean Christopherson wrote:
> > > I like the idea of using "special" PTE value to denote guest private 
> > > memory,
> > > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved 
> > > in the
> > > manipulation of the special flag/value.
> > > 
> > > Today, userspace owns the gfn->hva translations and the kernel 
> > > effectively owns
> > > the hva->pfn translations (with input from userspace).  KVM just connects 
> > > the
> > > dots.
> > > 
> > > Having KVM own the shared/private transitions means KVM is now part owner 
> > > of the
> > > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary 
> > > MMU
> > > and a co-owner of the primary MMU.  This creates locking madness, e.g. 
> > > KVM taking
> > > mmap_sem for write, mmu_lock under page lock, etc..., and also takes 
> > > control away
> > > from userspace.  E.g. userspace strategy could be to use a separate 
> > > backing/pool
> > > for shared memory and change the gfn->hva translation (memslots) in 
> > > reaction to
> > > a shared/private conversion.  Automatically swizzling things in KVM takes 
> > > away
> > > that option.
> > > 
> > > IMO, KVM should be entirely "passive" in this process, e.g. the guest 
> > > shares or
> > > protects memory, userspace calls into the kernel to change state, and the 
> > > kernel
> > > manages the page tables to prevent bad actors.  KVM simply does the 
> > > plumbing for
> > > the guest page tables.
> > 
> > That's a new perspective for me. Very interesting.
> > 
> > Let's see how it could look:
> > 
> >  - KVM only allows poisoned pages (or whatever flag we end up using for
> >protection) in the private mappings. SIGBUS otherwise.
> > 
> >  - Poisoned pages must be tied to the KVM instance to be allowed in the
> >private mappings. Like kvm->id in the current prototype. SIGBUS
> >otherwise.
> > 
> >  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> > 
> >  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
> >access such pages.
> > 
> >  - Poisoned pages produced this way get unpoisoned on free.
> > 
> >  - The new VMA flag set by userspace. mprotect(2)?
> 
> Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
> notion of the page being private is tied to the PFN, which would suggest 
> "struct
> page" needs to be involved.

PG_hwpoison will be set on the page, so it's tied to pfn.

> But fundamentally the private pages, are well, private.  They can't be shared
> across processes, so I think we could (should?) require the VMA to always be
> MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> enough to prevent userspace and unaware kernel code from acquiring a reference
> to the underlying page?

Shared pages should be fine too (you folks wanted tmpfs support).

The poisoned pages must be useless outside of the process with the blessed
struct kvm. See kvm_pfn_map in the patch.

> >  - Add a new GUP flag to retrive such pages from the userspace mapping.
> >Used only for private mapping population.
> 
> >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> >guest.
> > 
> >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> >would lead to SIGBUS.
> > 
> > So far it looks pretty straight-forward.
> > 
> > The only thing that I don't understand is at what point the page gets tied
> > to the KVM instance. Currently we do it just before populating shadow
> > entries, but it would not work with the new scheme: as we poison pages
> > on fault in, they may never get inserted into shadow entries. That's not
> > good as we rely on the info to unpoison page on free.
> 
> Can you elaborate on what you mean by "unpoison"?  If the page is never 
> actually
> mapped into the guest, then its poisoned status is nothing more than a 
> software
> flag, i.e. nothing extra needs to be done on free.

Normally, the poisoned flag is preserved for freed pages as it usually
indicates a hardware issue. In this case we need to return the page to normal
circulation, so we need a way to differentiate the two kinds of page poison.
The current patch does this by adding the page's pfn to kvm_pfn_map. But this
will not work if we uncouple poisoning from adding to the shadow PTE.

-- 
 Kirill A. Shutemov


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Fri, Apr 16, 2021 at 05:30:30PM +, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b404e4d7dd8..f8183386abe7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > kvm_sched_yield(vcpu->kvm, a0);
> > ret = 0;
> > break;
> > +   case KVM_HC_ENABLE_MEM_PROTECTED:
> > +   ret = kvm_protect_memory(vcpu->kvm);
> > +   break;
> > +   case KVM_HC_MEM_SHARE:
> > +   ret = kvm_share_memory(vcpu->kvm, a0, a1);
> 
> Can you take a look at a proposed hypercall interface for SEV live migration 
> and
> holler if you (or anyone else) thinks it will have extensibility issues?
> 
> https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com

Will look closer. Thanks.

> > @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool 
> > *async, bool write_fault,
> > flags |= FOLL_WRITE;
> > if (async)
> > flags |= FOLL_NOWAIT;
> > +   if (kvm->mem_protected)
> > +   flags |= FOLL_ALLOW_POISONED;
> 
> This is unsafe, only the flows that are mapping the PFN into the guest should
> use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

That's true for TDX. I prototyped with pure KVM with minimal modifications to
the guest. We had to be more permissive for that reason. It will go away for
TDX.

> > -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -void *data, int offset, int len)
> > +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int 
> > len)
> > +{
> > +   int offset = offset_in_page(hva);
> > +   struct page *page;
> > +   int npages, seg;
> > +   void *vaddr;
> > +
> > +   if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> > +   !kvm->mem_protected) {
> > +   return __copy_from_user(data, (void __user *)hva, len);
> > +   }
> > +
> > +   might_fault();
> > +   kasan_check_write(data, len);
> > +   check_object_size(data, len, false);
> > +
> > +   while ((seg = next_segment(len, offset)) != 0) {
> > +   npages = get_user_pages_unlocked(hva, 1, &page,
> > +FOLL_ALLOW_POISONED);
> > +   if (npages != 1)
> > +   return -EFAULT;
> > +
> > +   if (!kvm_page_allowed(kvm, page))
> > +   return -EFAULT;
> > +
> > +   vaddr = kmap_atomic(page);
> > +   memcpy(data, vaddr + offset, seg);
> > +   kunmap_atomic(vaddr);
> 
> Why is KVM allowed to access a poisoned page?  I would expect shared pages to
> _not_ be poisoned.  Except for pure software emulation of SEV, KVM can't 
> access
> guest private memory.

Again, it's not going to be in the TDX implementation.


> I like the idea of using "special" PTE value to denote guest private memory,
> e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> manipulation of the special flag/value.
> 
> Today, userspace owns the gfn->hva translations and the kernel effectively 
> owns
> the hva->pfn translations (with input from userspace).  KVM just connects the
> dots.
> 
> Having KVM own the shared/private transitions means KVM is now part owner of 
> the
> entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM 
> taking
> mmap_sem for write, mmu_lock under page lock, etc..., and also takes control 
> away
> from userspace.  E.g. userspace strategy could be to use a separate 
> backing/pool
> for shared memory and change the gfn->hva translation (memslots) in reaction 
> to
> a shared/private conversion.  Automatically swizzling things in KVM takes away
> that option.
> 
> IMO, KVM should be entirely "passive" in this process, e.g. the guest shares 
> or
> protects memory, userspace calls into the kernel to change state, and the 
> kernel
> manages the page tables to prevent bad actors.  KVM simply does the plumbing 
> for
> the guest page tables.

That's a new perspective for me. Very interesting.

Let's see how it could look:

 - KVM only allows poisoned pages (or whatever flag we end up using for
   protection) in the private mappings. SIGBUS otherwise.

 - Poisoned pages must be tied to the KVM instance to b

Re: [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature

2021-04-19 Thread Kirill A. Shutemov
On Fri, Apr 16, 2021 at 06:10:30PM +0200, Borislav Petkov wrote:
> On Fri, Apr 16, 2021 at 06:40:55PM +0300, Kirill A. Shutemov wrote:
> > Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> > 
> > Host side doesn't provide the feature yet, so it is a dead code for now.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  arch/x86/include/asm/cpufeatures.h   |  1 +
> >  arch/x86/include/asm/kvm_para.h  |  5 +
> >  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >  arch/x86/kernel/kvm.c| 17 +
> >  include/uapi/linux/kvm_para.h|  3 ++-
> >  5 files changed, 27 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h 
> > b/arch/x86/include/asm/cpufeatures.h
> > index 84b887825f12..d8f3d2619913 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -238,6 +238,7 @@
> >  #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers 
> > VMMCALL hypercall instruction */
> >  #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted 
> > Virtualization - Encrypted State */
> >  #define X86_FEATURE_VM_PAGE_FLUSH  ( 8*32+21) /* "" VM Page Flush MSR is 
> > supported */
> > +#define X86_FEATURE_KVM_MEM_PROTECTED  ( 8*32+22) /* "" KVM memory 
> > protection extension */
> 
> That feature bit is still unused.

Sorry. Will drop it.

-- 
 Kirill A. Shutemov


Re: [PATCH v7 02/28] mm: Introduce struct folio

2021-04-19 Thread Kirill A. Shutemov
On Fri, Apr 16, 2021 at 04:55:16PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 09, 2021 at 07:50:39PM +0100, Matthew Wilcox (Oracle) wrote:
> > A struct folio is a new abstraction to replace the venerable struct page.
> > A function which takes a struct folio argument declares that it will
> > operate on the entire (possibly compound) page, not just PAGE_SIZE bytes.
> > In return, the caller guarantees that the pointer it is passing does
> > not point to a tail page.
> > +++ b/include/linux/mm_types.h
> [...]
> > +static inline struct folio *page_folio(struct page *page)
> > +{
> > +   unsigned long head = READ_ONCE(page->compound_head);
> > +
> > +   if (unlikely(head & 1))
> > +   return (struct folio *)(head - 1);
> > +   return (struct folio *)page;
> > +}
> 
> I'm looking at changing this for the next revision, and basing it on
> my recent patch to make compound_head() const-preserving:
> 
> +#define page_folio(page)   _Generic((page),\
> +   const struct page *:(const struct folio *)_compound_head(page), \
> +   struct page *:  (struct folio *)_compound_head(page))
> 
> I've also noticed an awkward pattern occurring that I think this makes
> less awkward:
> 
> +/**
> + * folio_page - Return a page from a folio.
> + * @folio: The folio.
> + * @n: The page number to return.
> + *
> + * @n is relative to the start of the folio.  It should be between
> + * 0 and folio_nr_pages(@folio) - 1, but this is not checked for.
> + */
> +#define folio_page(folio, n)   nth_page(&(folio)->page, n)
> 
> That lets me simplify folio_next():
> 
> +static inline struct folio *folio_next(struct folio *folio)
> +{
> +   return (struct folio *)folio_page(folio, folio_nr_pages(folio));
> +}
> 
> (it occurs to me this should also be const-preserving, but it's not clear
> that's needed yet)

Are we risking that we would need to replace inline functions with macros
all the way down? Not sure const-preserving is worth it.

-- 
 Kirill A. Shutemov


[RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-16 Thread Kirill A. Shutemov
TDX architecture aims to provide resiliency against confidentiality and
integrity attacks. Towards this goal, the TDX architecture helps enforce
the enabling of memory integrity for all TD-private memory.

The CPU memory controller computes the integrity check value (MAC) for
the data (cache line) during writes, and it stores the MAC with the
memory as meta-data. A 28-bit MAC is stored in the ECC bits.

Checking of memory integrity is performed during memory reads. If the
integrity check fails, the CPU poisons the cache line.

On a subsequent consumption (read) of the poisoned data by software,
there are two possible scenarios:

 - Core determines that the execution can continue and it treats
   poison with exception semantics signaled as a #MCE

 - Core determines execution cannot continue, and it does an unbreakable
   shutdown

For more details, see Chapter 14 of Intel TDX Module EAS[1]

As some integrity check failures may lead to a system shutdown, the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped into
host userspace (e.g. QEMU).

This patch aims to start discussion on how we can approach the issue.

For now I intentionally keep TDX out of picture here and try to find a
generic way to unmap KVM guest memory from host userspace. Hopefully, it
makes the patch more approachable. And anyone can try it out.

To the proposal:

Looking into existing codepaths I've discovered that we already have
semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
entries in page tables:

  - If an application touches a page mapped with the SWP_HWPOISON, it will
get SIGBUS.

  - GUP will fail with -EFAULT;

Accessing the poisoned memory via the page cache doesn't match the required
semantics right now, but it shouldn't be too hard to make it work:
access to poisoned dirty pages should give -EIO or -EHWPOISON.

My idea is that we can mark the page as poisoned when we make it TD-private
and replace all PTEs that map the page with SWP_HWPOISON entries.
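
For reference, a sketch of that replacement using the existing helpers from
<linux/swapops.h> (illustration only, not a hunk from this series):

static void map_pte_as_hwpoison(struct vm_area_struct *vma, unsigned long addr,
				pte_t *ptep, struct page *page)
{
	swp_entry_t entry = make_hwpoison_entry(page);

	/* Any later fault on this PTE takes the SWP_HWPOISON path (SIGBUS). */
	set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));
}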

TODO: THP support is missing.

[1] 
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf

Not-signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kvm/Kconfig   |   1 +
 arch/x86/kvm/cpuid.c   |   3 +-
 arch/x86/kvm/mmu/mmu.c |  12 ++-
 arch/x86/kvm/mmu/paging_tmpl.h |  10 +-
 arch/x86/kvm/x86.c |   6 ++
 include/linux/kvm_host.h   |  19 
 include/uapi/linux/kvm_para.h  |   1 +
 mm/page_alloc.c|   7 ++
 virt/Makefile  |   2 +-
 virt/kvm/Kconfig   |   4 +
 virt/kvm/Makefile  |   1 +
 virt/kvm/kvm_main.c| 163 -
 virt/kvm/mem_protected.c   | 110 ++
 13 files changed, 311 insertions(+), 28 deletions(-)
 create mode 100644 virt/kvm/Makefile
 create mode 100644 virt/kvm/mem_protected.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7ac592664c52..b7db1c455e7c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_VFIO
select SRCU
+   select HAVE_KVM_PROTECTED_MEMORY
help
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 38172ca627d3..1457692c1080 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -796,7 +796,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 (1 << KVM_FEATURE_PV_SEND_IPI) |
 (1 << KVM_FEATURE_POLL_CONTROL) |
 (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-(1 << KVM_FEATURE_ASYNC_PF_INT);
+(1 << KVM_FEATURE_ASYNC_PF_INT) |
+(1 << KVM_FEATURE_MEM_PROTECTED);
 
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b97342af026..3fa76693edcd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2758,7 +2759,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
if (sp->role.level > PG_LEVEL_4K)
return;
 
-   __direct_pte_prefetch(vcpu, sp, sptep);
+   if (!vcpu->kvm->mem_protected)
+   __direct_pte_prefetch(vcpu, sp, sptep);
 }
 
 static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -3725,6 +3727,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, 
gpa_t gpa, u32 error_code,
if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn,

[RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow()

2021-04-16 Thread Kirill A. Shutemov
Make the struct kvm pointer available within hva_to_pfn_slow(). It is
preparation for the next patch.

Signed-off-by: Kirill A. Shutemov 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c |  8 +++--
 include/linux/kvm_host.h   | 12 ---
 virt/kvm/kvm_main.c| 49 ++
 5 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 38ea396a23d6..86781ff76fcb 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -589,7 +589,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
write_ok = true;
} else {
/* Call KVM generic code to do the slow-path check */
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+   pfn = __gfn_to_pfn_memslot(kvm, memslot, gfn, false, NULL,
   writing, &write_ok);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bb35490400e9..319a1a99153f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -821,7 +821,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long pfn;
 
/* Call KVM generic code to do the slow-path check */
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+   pfn = __gfn_to_pfn_memslot(kvm, memslot, gfn, false, NULL,
   writing, upgrade_p);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..3b97342af026 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2687,7 +2687,7 @@ static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu 
*vcpu, gfn_t gfn,
if (!slot)
return KVM_PFN_ERR_FAULT;
 
-   return gfn_to_pfn_memslot_atomic(slot, gfn);
+   return gfn_to_pfn_memslot_atomic(vcpu->kvm, slot, gfn);
 }
 
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -3672,7 +3672,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
prefault, gfn_t gfn,
}
 
async = false;
-   *pfn = __gfn_to_pfn_memslot(slot, gfn, false, , write, writable);
+   *pfn = __gfn_to_pfn_memslot(vcpu->kvm, slot, gfn, false,
+   &async, write, writable);
if (!async)
return false; /* *pfn has correct page already */
 
@@ -3686,7 +3687,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
prefault, gfn_t gfn,
return true;
}
 
-   *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+   *pfn = __gfn_to_pfn_memslot(vcpu->kvm, slot, gfn, false,
+   NULL, write, writable);
return false;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..fadaccb95a4c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -725,11 +725,13 @@ void kvm_set_page_accessed(struct page *page);
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
  bool *writable);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-  bool atomic, bool *async, bool write_fault,
-  bool *writable);
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
+struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
+  gfn_t gfn, bool atomic, bool *async,
+  bool write_fault, bool *writable);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8367d88ce39b..3e2dbaef7979 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2009,9 +2009,9 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool 
atomic, bool *async,
return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-  bool atomic, bool *async, bool write_fault,
-  bool *writable)
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm *kvm, struct kvm_memory_s

[RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host

2021-04-16 Thread Kirill A. Shutemov
hvclock is shared between the guest and the hypervisor. It has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..3d004b278dba 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -252,7 +252,7 @@ static void __init kvmclock_init_mem(void)
 * hvclock is shared between the guest and the hypervisor, must
 * be mapped decrypted.
 */
-   if (sev_active()) {
+   if (sev_active() || kvm_mem_protected()) {
r = set_memory_decrypted((unsigned long) hvclock_mem,
 1UL << order);
if (r) {
-- 
2.26.3



[RFCv2 10/13] mm: Keep page reference for hwpoison entries

2021-04-16 Thread Kirill A. Shutemov
On migration we drop the reference to a page once its PTE gets replaced with
a migration entry. The migration code keeps the last reference on the page,
which gets dropped once migration is complete.

We do the same for hwpoison entries, but it doesn't work: nobody keeps the
reference once the page is remapped with hwpoison entries.

Keep the page referenced when setting up hwpoison entries, copy the reference
on fork(), and return it on zap().

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 6 ++
 mm/rmap.c   | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b15b0c582186 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -767,6 +767,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
+   } else if (is_hwpoison_entry(entry)) {
+   page = hwpoison_entry_to_page(entry);
+   get_page(page);
}
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
@@ -1305,6 +1308,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
page = migration_entry_to_page(entry);
rss[mm_counter(page)]--;
+
+   } else if (is_hwpoison_entry(entry)) {
+   put_page(hwpoison_entry_to_page(entry));
}
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
diff --git a/mm/rmap.c b/mm/rmap.c
index 08c56aaf72eb..f08d1fc28522 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1575,7 +1575,7 @@ static bool try_to_unmap_one(struct page *page, struct 
vm_area_struct *vma,
dec_mm_counter(mm, mm_counter(page));
set_pte_at(mm, address, pvmw.pte, pteval);
}
-
+   get_page(page);
} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
/*
 * The guest indicated that the page content is of no
-- 
2.26.3



[RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned

2021-04-16 Thread Kirill A. Shutemov
If the page got unpoisoned, we can replace the hwpoison entry with a present
PTE on page fault instead of delivering SIGBUS.

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b15b0c582186..56f93e8e98f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3280,7 +3280,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = device_private_entry_to_page(entry);
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
} else if (is_hwpoison_entry(entry)) {
-   ret = VM_FAULT_HWPOISON;
+   page = hwpoison_entry_to_page(entry);
+
+   locked = lock_page_or_retry(page, vma->vm_mm, 
vmf->flags);
+   if (!locked) {
+   ret = VM_FAULT_RETRY;
+   goto out;
+   }
+
+   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+  vmf->address, &vmf->ptl);
+
+   if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+   ret = 0;
+   } else if (PageHWPoison(page)) {
+   ret = VM_FAULT_HWPOISON;
+   } else {
+   /*
+* The page is unpoisoned. Replace hwpoison
+* entry with a present PTE.
+*/
+
+   inc_mm_counter(vma->vm_mm, mm_counter(page));
+   pte = mk_pte(page, vma->vm_page_prot);
+
+   if (PageAnon(page)) {
+   page_add_anon_rmap(page, vma,
+  vmf->address, false);
+   } else {
+   page_add_file_rmap(page, false);
+   }
+
+   set_pte_at(vma->vm_mm, vmf->address,
+  vmf->pte, pte);
+   }
+
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   unlock_page(page);
} else {
print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
-- 
2.26.3



[RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages

2021-04-16 Thread Kirill A. Shutemov
Forbid access to poisoned pages.

TODO: Probably a more fine-grained approach is needed. It should be
allowed to fault in these pages as hwpoison entries.

Not-Signed-off-by: Kirill A. Shutemov 
---
 mm/shmem.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..d29a0c9be19c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1832,6 +1832,13 @@ static int shmem_getpage_gfp(struct inode *inode, 
pgoff_t index,
 
if (page)
hindex = page->index;
+
+   if (page && PageHWPoison(page)) {
+   unlock_page(page);
+   put_page(page);
+   return -EIO;
+   }
+
if (page && sgp == SGP_WRITE)
mark_page_accessed(page);
 
-- 
2.26.3



[RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled

2021-04-16 Thread Kirill A. Shutemov
If KVM memory protection is active, the trampoline area will need to be
in shared memory.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/realmode/init.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 22fda7d99159..f3b54b5da693 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -75,11 +76,11 @@ static void __init setup_real_mode(void)
base = (unsigned char *)real_mode_header;
 
/*
-* If SME is active, the trampoline area will need to be in
-* decrypted memory in order to bring up other processors
+* If SME or KVM memory protection is active, the trampoline area will
+* need to be in decrypted memory in order to bring up other processors
 * successfully. This is not needed for SEV.
 */
-   if (sme_active())
+   if (sme_active() || kvm_mem_protected())
set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
memcpy(base, real_mode_blob, size);
-- 
2.26.3



[RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page()

2021-04-16 Thread Kirill A. Shutemov
Add helpers to convert hwpoison swap entry to pfn and page.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/swapops.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..520589b12fb3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,16 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+   return pfn_to_page(hwpoison_entry_to_pfn(entry));
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
atomic_long_inc(&num_poisoned_pages);
@@ -345,6 +355,16 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
return 0;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+   return 0;
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+   return NULL;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 }
-- 
2.26.3
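
An illustrative (not from the series) use of the new helpers, mirroring how
the later patches consume hwpoison entries:

static struct page *pte_to_poisoned_page(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);

	if (!is_hwpoison_entry(entry))
		return NULL;
	return hwpoison_entry_to_page(entry);
}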



[RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED

2021-04-16 Thread Kirill A. Shutemov
The new flag allows bypassing the poison check and getting a reference
on the page.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 29 ++---
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..378a94481fd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2802,6 +2802,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_ALLOW_POISONED 0x10 /* bypass poison check */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..dd3b79b03eb5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -384,22 +384,29 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
if (!pte_present(pte)) {
-   swp_entry_t entry;
+   swp_entry_t entry = pte_to_swp_entry(pte);
+
+   if (pte_none(pte))
+   goto no_page;
+
/*
 * KSM's break_ksm() relies upon recognizing a ksm page
 * even while it is being migrated, so for that case we
 * need migration_entry_wait().
 */
-   if (likely(!(flags & FOLL_MIGRATION)))
-   goto no_page;
-   if (pte_none(pte))
-   goto no_page;
-   entry = pte_to_swp_entry(pte);
-   if (!is_migration_entry(entry))
-   goto no_page;
-   pte_unmap_unlock(ptep, ptl);
-   migration_entry_wait(mm, pmd, address);
-   goto retry;
+   if (is_migration_entry(entry) && (flags & FOLL_MIGRATION)) {
+   pte_unmap_unlock(ptep, ptl);
+   migration_entry_wait(mm, pmd, address);
+   goto retry;
+   }
+
+   if (is_hwpoison_entry(entry) && (flags & FOLL_ALLOW_POISONED)) {
+   page = hwpoison_entry_to_page(entry);
+   get_page(page);
+   goto out;
+   }
+
+   goto no_page;
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
-- 
2.26.3
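
A hedged sketch of a caller for the new flag (patch 13 in this series does
something similar in copy_from_guest()):

static struct page *gup_one_possibly_poisoned(unsigned long hva)
{
	struct page *page;

	/* Takes a reference even if the page is mapped via an hwpoison entry. */
	if (get_user_pages_unlocked(hva, 1, &page, FOLL_ALLOW_POISONED) != 1)
		return NULL;
	return page;
}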



[RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection

2021-04-16 Thread Kirill A. Shutemov
Mirroring SEV, always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/mem_encrypt.h |  7 +++--
 arch/x86/kernel/kvm.c  |  2 ++
 arch/x86/kernel/pci-swiotlb.c  |  3 +-
 arch/x86/mm/mem_encrypt.c  | 44 ---
 arch/x86/mm/mem_encrypt_common.c   | 48 ++
 6 files changed, 57 insertions(+), 48 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d197b3beb904..c51d14db5620 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -812,6 +812,7 @@ config KVM_GUEST
select ARCH_CPUIDLE_HALTPOLL
select X86_HV_CALLBACK_VECTOR
select X86_MEM_ENCRYPT_COMMON
+   select SWIOTLB
default y
help
  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/include/asm/mem_encrypt.h 
b/arch/x86/include/asm/mem_encrypt.h
index 31c4df123aa0..a748b30c2f23 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, 
unsigned long size);
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void __init sev_es_init_vc_handling(void);
+
 bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
@@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
 /*
  * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
  * writing to or comparing values from the cr3 register.  Having the
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aed6034fcac1..ba179f5ca198 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -765,6 +766,7 @@ static void __init kvm_init_platform(void)
pr_info("KVM memory protection enabled\n");
mem_protected = true;
setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+   swiotlb_force = SWIOTLB_FORCE;
}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
-   if (sme_active())
+   if (sme_active() || kvm_mem_protected())
swiotlb = 1;
 
return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9ca477b9b8ba..3478f20fb46f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-   pr_info("AMD Memory Encryption Features active:");
-
-   /* Secure Memory Encryption */
-   if (sme_active()) {
-   /*
-* SME is mutually exclusive with any of the SEV
-* features below.
-*/
-   pr_cont(" SME\n");
-   return;
-   }
-
-   /* Secure Encrypted Virtualization */
-   if (sev_active())
-   pr_cont(" SEV");
-
-   /* Encrypted Register State */
-   if (sev_es_active())
-   pr_cont(" SEV-ES");
-
-   pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-   if (!sme_me_mask)
-   return;
-
-   /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-   swiotlb_update_mem_attributes();
-
-   /*
-* With SEV, we need to unroll the rep string I/O instructions.
-*/
-   if (sev_active())
-   static_branch_enable(&sev_enable_key);
-
-   print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 6bf0718bb72a..351b77361a5d 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -37,3 +38,50 @@ bool force_dma_unencrypted(struct device *dev)
 
return false;
 }
+
+static void print_mem_encrypt_feature_in

[RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature

2021-04-16 Thread Kirill A. Shutemov
Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.

Host side doesn't provide the feature yet, so it is a dead code for now.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/kvm_para.h  |  5 +
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c| 17 +
 include/uapi/linux/kvm_para.h|  3 ++-
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..d8f3d2619913 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers 
VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted 
Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH  ( 8*32+21) /* "" VM Page Flush MSR is 
supported */
+#define X86_FEATURE_KVM_MEM_PROTECTED  ( 8*32+22) /* "" KVM memory protection 
extension */
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE   ( 9*32+ 0) /* RDFSBASE, WRFSBASE, 
RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..74aea18f3130 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -11,11 +11,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+   return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..8d32c41861c9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,11 +28,12 @@
 #define KVM_FEATURE_PV_UNHALT  7
 #define KVM_FEATURE_PV_TLB_FLUSH   9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT10
-#define KVM_FEATURE_PV_SEND_IPI11
+#define KVM_FEATURE_PV_SEND_IPI11
 #define KVM_FEATURE_POLL_CONTROL   12
 #define KVM_FEATURE_PV_SCHED_YIELD 13
 #define KVM_FEATURE_ASYNC_PF_INT   14
 #define KVM_FEATURE_MSI_EXT_DEST_ID15
+#define KVM_FEATURE_MEM_PROTECTED  16
 
 #define KVM_HINTS_REALTIME  0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..aed6034fcac1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,13 @@
 #include 
 #include 
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+   return mem_protected;
+}
+
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
@@ -749,6 +756,16 @@ static void __init kvm_init_platform(void)
 {
kvmclock_init();
x86_platform.apic_post_init = kvm_apic_init;
+
+   if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+   if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+   panic("Failed to enable KVM memory protection");
+   }
+
+   pr_info("KVM memory protection enabled\n");
+   mem_protected = true;
+   setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+   }
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 #define KVM_HC_CLOCK_PAIRING   9
-#define KVM_HC_SEND_IPI10
+#define KVM_HC_SEND_IPI10
 #define KVM_HC_SCHED_YIELD 11
+#define KVM_HC_ENABLE_MEM_PROTECTED12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



[RFCv2 00/13] TDX and guest memory unmapping

2021-04-16 Thread Kirill A. Shutemov
TDX integrity check failures may lead to a system shutdown, so the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped into
host userspace (e.g. QEMU).

This patchset aims to start discussion on how we can approach the issue.

The core of the change is in the last patch. Please see a more detailed
description of the issue and the proposed solution there.

TODO:
  - THP support
  - Clarify semantics wrt. page cache

v2:
  - Unpoison page on free
  - Authorize access to the page: only the KVM instance that poisoned the
    page is allowed to touch it
  - FOLL_ALLOW_POISONED is implemented

The patchset can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git kvm-unmapped-poison

Kirill A. Shutemov (13):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page()
  mm/gup: Add FOLL_ALLOW_POISONED
  shmem: Fail shmem_getpage_gfp() on poisoned pages
  mm: Keep page reference for hwpoison entries
  mm: Replace hwpoison entry with present PTE if page got unpoisoned
  KVM: passdown struct kvm to hva_to_pfn_slow()
  KVM: unmap guest memory using poisoned pages

 arch/powerpc/kvm/book3s_64_mmu_hv.c|   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/x86/Kconfig   |   9 +-
 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/io.h  |   4 +-
 arch/x86/include/asm/kvm_para.h|   5 +
 arch/x86/include/asm/mem_encrypt.h |   7 +-
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c  |  19 +++
 arch/x86/kernel/kvmclock.c |   2 +-
 arch/x86/kernel/pci-swiotlb.c  |   3 +-
 arch/x86/kvm/Kconfig   |   1 +
 arch/x86/kvm/cpuid.c   |   3 +-
 arch/x86/kvm/mmu/mmu.c |  20 ++-
 arch/x86/kvm/mmu/paging_tmpl.h |  10 +-
 arch/x86/kvm/x86.c |   6 +
 arch/x86/mm/Makefile   |   2 +
 arch/x86/mm/mem_encrypt.c  |  74 -
 arch/x86/mm/mem_encrypt_common.c   |  87 ++
 arch/x86/mm/pat/set_memory.c   |  10 ++
 arch/x86/realmode/init.c   |   7 +-
 include/linux/kvm_host.h   |  31 +++-
 include/linux/mm.h |   1 +
 include/linux/swapops.h|  20 +++
 include/uapi/linux/kvm_para.h  |   5 +-
 mm/gup.c   |  29 ++--
 mm/memory.c|  44 -
 mm/page_alloc.c|   7 +
 mm/rmap.c  |   2 +-
 mm/shmem.c |   7 +
 virt/Makefile  |   2 +-
 virt/kvm/Kconfig   |   4 +
 virt/kvm/Makefile  |   1 +
 virt/kvm/kvm_main.c| 212 +++--
 virt/kvm/mem_protected.c   | 110 +
 35 files changed, 593 insertions(+), 159 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c
 create mode 100644 virt/kvm/Makefile
 create mode 100644 virt/kvm/mem_protected.c

-- 
2.26.3



[RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code

2021-04-16 Thread Kirill A. Shutemov
force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig |  7 +-
 arch/x86/include/asm/io.h|  4 +++-
 arch/x86/mm/Makefile |  2 ++
 arch/x86/mm/mem_encrypt.c| 30 -
 arch/x86/mm/mem_encrypt_common.c | 38 
 5 files changed, 49 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..2b4ce1722dbd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1520,14 +1520,19 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select DYNAMIC_PHYSICAL_MASK
+   def_bool n
+
 config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
-   select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+   select X86_MEM_ENCRYPT_COMMON
help
  Say yes to enable support for the encryption of system memory.
  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..6dc51b31cb0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include 
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)+= 
pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)   += mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c3d5f0236f35..9ca477b9b8ba 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
 
 #include 
 #include 
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-   /*
-* For SEV, all DMA must be to unencrypted addresses.
-*/
-   if (sev_active())
-   return true;
-
-   /*
-* For SME, all DMA must be to unencrypted addresses if the
-* device does not support DMA to addresses that include the
-* encryption mask.
-*/
-   if (sme_active()) {
-   u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-   u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-   dev->bus_dma_limit);
-
-   if (dma_dev_mask <= dma_enc_mask)
-   return true;
-   }
-
-   return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index ..dd791352f73f
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky 
+ */
+
+#include 
+#include 
+#include 
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+   /*
+* For SEV, all DMA must be to unencrypted/shared addresses.
+*/
+   if (sev_active())
+   return true;
+
+   /*
+* For SME, all DMA must be to unencrypted addresses if the
+* device does not support DMA to addresses that include the
+* encryption mask.
+*/
+   if (sme_active()) {
+

[RFCv2 03/13] x86/kvm: Make DMA pages shared

2021-04-16 Thread Kirill A. Shutemov
Make force_dma_unencrypted() return true for KVM to get DMA pages mapped
as shared.

__set_memory_enc_dec() now informs the host via hypercall if the state
of the page has changed from shared to private or back.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig |  1 +
 arch/x86/mm/mem_encrypt_common.c |  5 +++--
 arch/x86/mm/pat/set_memory.c | 10 ++
 include/uapi/linux/kvm_para.h|  1 +
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2b4ce1722dbd..d197b3beb904 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -811,6 +811,7 @@ config KVM_GUEST
select PARAVIRT_CLOCK
select ARCH_CPUIDLE_HALTPOLL
select X86_HV_CALLBACK_VECTOR
+   select X86_MEM_ENCRYPT_COMMON
default y
help
  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index dd791352f73f..6bf0718bb72a 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
/*
-* For SEV, all DMA must be to unencrypted/shared addresses.
+* For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 */
-   if (sev_active())
+   if (sev_active() || kvm_mem_protected())
return true;
 
/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..4b312d80033d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1977,6 +1978,15 @@ static int __set_memory_enc_dec(unsigned long addr, int 
numpages, bool enc)
struct cpa_data cpa;
int ret;
 
+   if (kvm_mem_protected()) {
+   /* TODO: Unsharing memory back */
+   if (WARN_ON_ONCE(enc))
+   return -ENOSYS;
+
+   return kvm_hypercall2(KVM_HC_MEM_SHARE,
+ __pa(addr) >> PAGE_SHIFT, numpages);
+   }
+
/* Nothing to do if memory encryption is not active */
if (!mem_encrypt_active())
return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..09d36683ee0a 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
 #define KVM_HC_SEND_IPI10
 #define KVM_HC_SCHED_YIELD 11
 #define KVM_HC_ENABLE_MEM_PROTECTED12
+#define KVM_HC_MEM_SHARE   13
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



Re: [PATCH v24 22/30] x86/cet/shstk: Add user-mode shadow stack support

2021-04-09 Thread Kirill A. Shutemov
> + size = round_up(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G),
> PAGE_SIZE);
> + addr = alloc_shstk(size, 0);
> + if (IS_ERR_VALUE(addr))
> + return PTR_ERR((void *)addr);
> +
> + cet->shstk_base = addr;
> + cet->shstk_size = size;
> +
> + start_update_msrs();
> + wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
> + end_update_msrs();
> + return 0;
> +}
> +
> +void shstk_free(struct task_struct *tsk)
> +{
> + struct cet_status *cet = &tsk->thread.cet;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> + !cet->shstk_size ||
> + !cet->shstk_base)
> + return;
> +
> + if (!tsk->mm)
> + return;
> +
> + while (1) {
> + int r;
> +
> + r = vm_munmap(cet->shstk_base, cet->shstk_size);
> +
> + /*
> +  * vm_munmap() returns -EINTR when mmap_lock is held by
> +  * something else, and that lock should not be held for a
> +  * long time.  Retry it for the case.
> +  */

Hm, no. -EINTR is not about the lock being held by somebody else. The task
got a signal and needs to return to userspace.

I have not looked at the rest of the patches yet, but why do you need a
special free path for shadow stack? Why doesn't the normal unmap route work
for you?
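
For reference, vm_munmap() only returns -EINTR when taking mmap_lock was
interrupted by a fatal signal. Roughly (simplified sketch of __vm_munmap()
around v5.12, not the patch under review):

	if (mmap_write_lock_killable(mm))
		return -EINTR;	/* fatal signal pending, not lock contention */

	ret = __do_munmap(mm, start, len, &uf, downgrade);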

> + if (r == -EINTR) {
> + cond_resched();
> + continue;
> + }
> + break;
> + }
> +
> + cet->shstk_base = 0;
> + cet->shstk_size = 0;
> +}
> +
> +void shstk_disable(void)
> +{
> + struct cet_status *cet = &current->thread.cet;
> + u64 msr_val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> + !cet->shstk_size ||
> + !cet->shstk_base)
> + return;
> +
> + start_update_msrs();
> + rdmsrl(MSR_IA32_U_CET, msr_val);
> + wrmsrl(MSR_IA32_U_CET, msr_val & ~CET_SHSTK_EN);
> + wrmsrl(MSR_IA32_PL3_SSP, 0);
> + end_update_msrs();
> +
> + shstk_free(current);
> +}
> -- 
> 2.21.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 20/30] mm/mprotect: Exclude shadow stack from preserve_write

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:54PM -0700, Yu-cheng Yu wrote:
> In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
> preserved to avoid the additional write fault after the NUMA hinting fault.
> However, pte_write() now includes both normal writable and shadow stack
> (RW=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no need
> to preserve it.
> 
> Exclude shadow stack from preserve_write test, and apply the same change to
> change_huge_pmd().
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kirill A. Shutemov 
> ---
> v24:
> - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
> 
>  mm/huge_memory.c | 7 ++-
>  mm/mprotect.c| 9 -
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 65fc0aedd577..1d41138c4f74 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1812,12 +1812,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>   bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>   bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> + bool shstk = is_shadow_stack_mapping(vma->vm_flags);
>  
>   ptl = __pmd_trans_huge_lock(pmd, vma);
>   if (!ptl)
>   return 0;
>  
> - preserve_write = prot_numa && pmd_write(*pmd);
> + /*
> +  * Preserve only normal writable huge PMD, but not shadow
> +  * stack (RW=0, Dirty=1).
> +  */
> + preserve_write = prot_numa && pmd_write(*pmd) && !shstk;

The new variable seems unnecessary. What about just:

	if (is_shadow_stack_mapping(vma->vm_flags))
		preserve_write = false;

?
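
For illustration, the change_huge_pmd() hunk would then look roughly like
this (a sketch, using the names from the quoted patch):

	preserve_write = prot_numa && pmd_write(*pmd);
	/* Shadow stack PMDs (RW=0, Dirty=1) must not keep write preserved */
	if (is_shadow_stack_mapping(vma->vm_flags))
		preserve_write = false;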

>   ret = 1;
>  
>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c1ce78d688b6..550448dc5ff1 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -75,7 +75,14 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   oldpte = *pte;
>   if (pte_present(oldpte)) {
>   pte_t ptent;
> - bool preserve_write = prot_numa && pte_write(oldpte);
> + bool shstk = is_shadow_stack_mapping(vma->vm_flags);
> + bool preserve_write;
> +
> + /*
> +  * Preserve only normal writable PTE, but not shadow
> +  * stack (RW=0, Dirty=1).
> +  */
> + preserve_write = prot_numa && pte_write(oldpte) && 
> !shstk;

Ditto.

>  
>   /*
>* Avoid trapping faults against the zero or KSM
> -- 
> 2.21.0
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 19/30] mm: Update can_follow_write_pte() for shadow stack

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:53PM -0700, Yu-cheng Yu wrote:
> Can_follow_write_pte() ensures a read-only page is COWed by checking the
> FOLL_COW flag, and uses pte_dirty() to validate the flag is still valid.
> 
> Like a writable data page, a shadow stack page is writable, and becomes
> read-only during copy-on-write, but it is always dirty.  Thus, in the
> can_follow_write_pte() check, it belongs to the writable page case and
> should be excluded from the read-only page pte_dirty() check.  Apply
> the same changes to can_follow_write_pmd().
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> Cc: Kirill A. Shutemov 
> ---
> v24:
> - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
> 
>  mm/gup.c | 8 +---
>  mm/huge_memory.c | 8 +---
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..c313cc988865 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -356,10 +356,12 @@ static int follow_pfn_pte(struct vm_area_struct *vma, 
> unsigned long address,
>   * FOLL_FORCE can write to even unwritable pte's, but only
>   * after we've gone through a COW cycle and they are dirty.
>   */
> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
> + struct vm_area_struct *vma)
>  {
>   return pte_write(pte) ||
> - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte) &&
> +   !is_shadow_stack_mapping(vma->vm_flags));

It's getting too ugly. I think it deserves to be rewritten. What about:

	if (pte_write(pte))
		return true;
	if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
		return false;
	if (!pte_dirty(pte))
		return false;
	if (is_shadow_stack_mapping(vma->vm_flags))
		return false;
	return true;

?
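
For illustration, folded back into the helper from the quoted patch, the
whole thing would read roughly like this (a sketch, not the final form):

static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
					struct vm_area_struct *vma)
{
	if (pte_write(pte))
		return true;
	if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
		return false;
	if (!pte_dirty(pte))
		return false;
	/* Shadow stack PTEs are dirty, but must not be FOLL_FORCE-written */
	if (is_shadow_stack_mapping(vma->vm_flags))
		return false;
	return true;
}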

-- 
 Kirill A. Shutemov


Re: [PATCH v24 18/30] mm/mmap: Add shadow stack pages to memory accounting

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:52PM -0700, Yu-cheng Yu wrote:
> Account shadow stack pages to stack memory.
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> Cc: Kirill A. Shutemov 
> ---
> v24:
> - Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
> - Change VM_SHSTK to VM_SHADOW_STACK.
> 
>  arch/x86/mm/pgtable.c   |  7 +++
>  include/linux/pgtable.h | 11 +++
>  mm/mmap.c   |  5 +
>  3 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index e778dbbef3d8..212a8c1fe5ba 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -897,3 +897,10 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>  
>  #endif /* CONFIG_X86_64 */
>  #endif   /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +
> +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> +bool is_shadow_stack_mapping(vm_flags_t vm_flags)
> +{
> + return (vm_flags & VM_SHADOW_STACK);
> +}

No, just define it as you have here in linux/mm.h. It will always be false
for !CONFIG_ARCH_HAS_SHADOW_STACK, as VM_SHADOW_STACK is 0 there.

This maze of #ifdefs is unneeded.
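
For illustration, a single unconditional helper would do (a sketch for
include/linux/mm.h; it relies on VM_SHADOW_STACK being defined as 0 when the
architecture doesn't select the feature):

static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
{
	/* VM_SHADOW_STACK is 0 for !CONFIG_ARCH_HAS_SHADOW_STACK */
	return vm_flags & VM_SHADOW_STACK;
}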

-- 
 Kirill A. Shutemov


Re: [PATCH v24 17/30] mm: Add guard pages around a shadow stack.

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:51PM -0700, Yu-cheng Yu wrote:
> INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> first and the last elements in the range, effectively touches those memory
> areas.
> 
> The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> 255 * 4 = 1020 bytes by INCSSPD.  Both ranges are far from PAGE_SIZE.
> Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> CALL, and RET from going beyond.
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> Cc: Kirill A. Shutemov 
> ---
> v24:
> - Instead changing vm_*_gap(), create x86-specific versions.
> 
>  arch/x86/include/asm/page_types.h | 17 +++
>  arch/x86/mm/mmap.c| 36 +++
>  include/linux/mm.h|  4 
>  3 files changed, 57 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page_types.h 
> b/arch/x86/include/asm/page_types.h
> index a506a411474d..3a5529bcfd76 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -73,6 +73,23 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned 
> long end_pfn);
>  
>  extern void initmem_init(void);
>  
> +/*
> + * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).  INCSSPQ
> + * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
> + * touches the first and the last element in the range, which triggers a
> + * page fault if the range is not in a shadow stack.  Because of this,
> + * creating 4-KB guard pages around a shadow stack prevents these
> + * instructions from going beyond.
> + */
> +#define SHADOW_STACK_GUARD_GAP PAGE_SIZE

The define is only used within arch/x86/mm/mmap.c. Maybe move it there?

Otherwise:

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 16/30] mm: Fixup places that call pte_mkwrite() directly

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:50PM -0700, Yu-cheng Yu wrote:
> When serving a page fault, maybe_mkwrite() makes a PTE writable if it is in
> a writable vma.  A shadow stack vma is writable, but its PTEs need
> _PAGE_DIRTY to be set to become writable.  For this reason, maybe_mkwrite()
> has been updated.
> 
> There are a few places that call pte_mkwrite() directly, but have the
> same result as from maybe_mkwrite().  These sites need to be updated for
> shadow stack as well.  Thus, change them to maybe_mkwrite():
> 
> - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
>   and call pte_mkwrite(), which is the same as maybe_mkwrite().  Change
>   them to maybe_mkwrite().
> 
> - In do_numa_page(), if the numa entry was writable, then pte_mkwrite()
>   is called directly.  Fix it by doing maybe_mkwrite().
> 
> - In change_pte_range(), pte_mkwrite() is called directly.  Replace it with
>   maybe_mkwrite().
> 
>   A shadow stack vma is writable but has different vma
> flags, and handled accordingly in maybe_mkwrite().
> 

Have you checked the THP side? It looks like at least do_huge_pmd_numa_page()
needs adjustment, no?
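
For reference, the spot in question reads roughly like this (from
do_huge_pmd_numa_page() around v5.12, quoted from memory), and would
presumably want the same maybe_pmd_mkwrite() treatment:

	pmd = pmd_modify(pmd, vma->vm_page_prot);
	pmd = pmd_mkyoung(pmd);
	if (was_writable)
		pmd = pmd_mkwrite(pmd);	/* -> maybe_pmd_mkwrite(pmd, vma)? */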

-- 
 Kirill A. Shutemov


Re: [PATCH v24 15/30] x86/mm: Update maybe_mkwrite() for shadow stack

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:49PM -0700, Yu-cheng Yu wrote:
> When serving a page fault, maybe_mkwrite() makes a PTE writable if its vma
> has VM_WRITE.
> 
> A shadow stack vma has VM_SHADOW_STACK.  Its PTEs have _PAGE_DIRTY, but not
> _PAGE_WRITE.  In fork(), _PAGE_DIRTY is cleared to cause copy-on-write,
> and in the page fault handler, _PAGE_DIRTY is restored and the shadow stack
> page is writable again.
> 
> Introduce an x86 version of maybe_mkwrite(), which sets proper PTE bits
> according to VM flags.
> 
> Apply the same changes to maybe_pmd_mkwrite().
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> Cc: Kirill A. Shutemov 
> ---
> v24:
> - Instead of doing arch_maybe_mkwrite(), overwrite maybe*_mkwrite() with x86
>   versions.
> - Change VM_SHSTK to VM_SHADOW_STACK.
> 
>  arch/x86/include/asm/pgtable.h |  8 
>  arch/x86/mm/pgtable.c  | 20 
>  include/linux/mm.h |  2 ++
>  mm/huge_memory.c   |  2 ++
>  4 files changed, 32 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 46d9394b884f..51cdf14488b7 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1686,6 +1686,14 @@ static inline bool arch_faults_on_old_pte(void)
>   return false;
>  }
>  
> +#define maybe_mkwrite maybe_mkwrite
> +extern pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define maybe_pmd_mkwrite maybe_pmd_mkwrite
> +extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

Move it next to the other THP-dependent stuff.

Otherwise looks good to me:

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 14/30] x86/mm: Shadow Stack page fault error checking

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:48PM -0700, Yu-cheng Yu wrote:
> Shadow stack accesses are those that are performed by the CPU where it
> expects to encounter a shadow stack mapping.  These accesses are performed
> implicitly by CALL/RET at the site of the shadow stack pointer.  These
> accesses are made explicitly by shadow stack management instructions like
> WRUSSQ.
> 
> Shadow stacks accesses to shadow-stack mapping can see faults in normal,
> valid operation just like regular accesses to regular mappings.  Shadow
> stacks need some of the same features like delayed allocation, swap and
> copy-on-write.
> 
> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping.
> 
> In handling a shadow stack page fault, verify it occurs within a shadow
> stack mapping.  It is always an error otherwise.  For valid shadow stack
> accesses, set FAULT_FLAG_WRITE to effect copy-on-write.  Because clearing
> _PAGE_DIRTY (vs. _PAGE_RW) is used to trigger the fault, shadow stack read
> fault and shadow stack write fault are not differentiated and both are
> handled as a write access.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 13/30] mm: Introduce VM_SHADOW_STACK for shadow stack memory

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:47PM -0700, Yu-cheng Yu wrote:
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set.  However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages.  These
> two cases are handled differently for page faults.  Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v24 12/30] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 03:10:46PM -0700, Yu-cheng Yu wrote:
> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
> shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].
> 
> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE.  However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
> 
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> Cc: Kirill A. Shutemov 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-09 Thread Kirill A. Shutemov
On Fri, Apr 09, 2021 at 03:50:42PM +0200, David Hildenbrand wrote:
> > > 3. Allow selected users to still grab the pages (esp. KVM to fault them 
> > > into
> > > the page tables).
> > 
> > As long as fault leads to non-present PTEs we are fine. Userspace still may
> > want to mlock() some of guest memory. There's no reason to prevent this.
> 
> I'm curious, even get_user_pages() will lead to a present PTE as is, no? So
> that will need modifications I assume. (although I think it fundamentally
> differs to the way get_user_pages() works - trigger a fault first, then
> lookup the PTE in the page tables).

For now, the patch does two-step poisoning: the page is first faulted in,
then poisoned when it is added to the shadow PTE. By the time the VM has a
chance to use the page, it is poisoned and unmapped from the host userspace.
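
Very roughly, the flow in the RFC is (pseudocode, hypothetical shape only;
the real code is spread over the hva_to_pfn()/FOLL_ALLOW_POISONED patches):

	/* 1. fault the page into the host userspace mapping (GUP) */
	pfn = hva_to_pfn(...);
	/* 2. when installing it into the shadow page table:       */
	SetPageHWPoison(page);		/* mark it off-limits for the host */
	try_to_unmap(page, ...);	/* replace userspace PTEs with
					   SWP_HWPOISON swap entries       */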

-- 
 Kirill A. Shutemov


Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature

2021-04-09 Thread Kirill A. Shutemov
On Thu, Apr 08, 2021 at 11:52:35AM +0200, Borislav Petkov wrote:
> On Fri, Apr 02, 2021 at 06:26:40PM +0300, Kirill A. Shutemov wrote:
> > Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> > 
> > Host side doesn't provide the feature yet, so it is dead code for now.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  arch/x86/include/asm/cpufeatures.h   |  1 +
> >  arch/x86/include/asm/kvm_para.h  |  5 +
> >  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >  arch/x86/kernel/kvm.c| 18 ++
> >  include/uapi/linux/kvm_para.h|  3 ++-
> >  5 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h 
> > b/arch/x86/include/asm/cpufeatures.h
> > index 84b887825f12..5b6f23e6edc4 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -238,6 +238,7 @@
> >  #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers 
> > VMMCALL hypercall instruction */
> >  #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted 
> > Virtualization - Encrypted State */
> >  #define X86_FEATURE_VM_PAGE_FLUSH  ( 8*32+21) /* "" VM Page Flush MSR is 
> > supported */
> > +#define X86_FEATURE_KVM_MEM_PROTECTED  ( 8*32+22) /* KVM memory 
> > protection extenstion */
>   
> ^^
> What's that feature bit for?

The patchset is still in the path-finding stage. I'll be more specific once
we settle on how the feature works.
 
> Also, use a spellchecker pls: "extenstion".

Ouch. Thanks.

-- 
 Kirill A. Shutemov


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-09 Thread Kirill A. Shutemov
On Wed, Apr 07, 2021 at 04:55:54PM +0200, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > TDX architecture aims to provide resiliency against confidentiality and
> > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > the enabling of memory integrity for all TD-private memory.
> > 
> > The CPU memory controller computes the integrity check value (MAC) for
> > the data (cache line) during writes, and it stores the MAC with the
> > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > 
> > Checking of memory integrity is performed during memory reads. If
> > integrity check fails, CPU poisones cache line.
> > 
> > On a subsequent consumption (read) of the poisoned data by software,
> > there are two possible scenarios:
> > 
> >   - Core determines that the execution can continue and it treats
> > poison with exception semantics signaled as a #MCE
> > 
> >   - Core determines execution cannot continue,and it does an unbreakable
> > shutdown
> > 
> > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > 
> > As some of integrity check failures may lead to system shutdown host
> > kernel must not allow any writes to TD-private memory. This requirment
> > clashes with KVM design: KVM expects the guest memory to be mapped into
> > host userspace (e.g. QEMU).
> > 
> > This patch aims to start discussion on how we can approach the issue.
> > 
> > For now I intentionally keep TDX out of picture here and try to find a
> > generic way to unmap KVM guest memory from host userspace. Hopefully, it
> > makes the patch more approachable. And anyone can try it out.
> > 
> > To the proposal:
> > 
> > Looking into existing codepaths I've discovered that we already have
> > semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
> > entries in page tables:
> > 
> >- If an application touches a page mapped with the SWP_HWPOISON, it will
> >  get SIGBUS.
> > 
> >- GUP will fail with -EFAULT;
> > 
> > Access the poisoned memory via page cache doesn't match required
> > semantics right now, but it shouldn't be too hard to make it work:
> > access to poisoned dirty pages should give -EIO or -EHWPOISON.
> > 
> > My idea is that we can mark page as poisoned when we make it TD-private
> > and replace all PTEs that map the page with SWP_HWPOISON.
> 
> It looks quite hacky (well, what did I expect from an RFC :) ) you can no
> longer distinguish actually poisoned pages from "temporarily poisoned"
> pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous -  "I want
> to read/write a poisoned page, trust me, I know what I am doing".
> 
> Storing the state for each individual page initially sounded like the right
> thing to do, but I wonder if we couldn't handle this on a per-VMA level. You
> can just remember the handful of shared ranges internally like you do right
> now AFAIU.

Per-VMA tracking would not fly for file-backed (e.g. tmpfs) memory. We may
need to combine PG_hwpoison with a VMA flag. Maybe per-inode tracking would
also be required. Or per-memslot. I don't know. Need more experiments.

Note, I use PG_hwpoison now, but if we find a show-stopper issue where we
would see confusion with a real poison, we can switch to new flags and
a new swap_type(). I have not seen a reason yet.

> From what I get, you want a way to
> 
> 1. Unmap pages from the user space page tables.

Plain unmap would not work for some use-cases. Some CSPs want to
preallocate memory in a specific way. It's a way to provide a fine-grained
NUMA policy.

The existing mapping has to be converted.

> 2. Disallow re-faulting of the protected pages into the page tables. On user
> space access, you want to deliver some signal (e.g., SIGBUS).

Note that the userspace mapping is the only source of pfns for the VM's
shadow mapping. The fault should be allowed, but lead to a non-present PTE
that still encodes the pfn.
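
A sketch of what such an entry provides, along the lines of the
hwpoison_entry_to_pfn()/hwpoison_entry_to_page() helpers added earlier in
the series (exact form assumed here):

static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
{
	return swp_offset(entry);
}

static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
{
	return pfn_to_page(swp_offset(entry));
}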

> 3. Allow selected users to still grab the pages (esp. KVM to fault them into
> the page tables).

As long as fault leads to non-present PTEs we are fine. Userspace still may
want to mlock() some of guest memory. There's no reason to prevent this.

> 4. Allow access to currently shared specific pages from user space.
> 
> Right now, you achieve
> 
> 1. Via try_to_unmap()
> 2. TestSetPageHWPoison
> 3. TBD (e.g., FOLL_ALLOW_POISONED)
> 4. ClearPageHWPoison()
> 
> 
> If we could bounce all writes to shared pages through the kernel, things
> could end up a little easier. Some very rough idea:
> 
> We could let user space setup VM memory as
> mprotect(PROT_REA

Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-07 Thread Kirill A. Shutemov
On Wed, Apr 07, 2021 at 04:09:35PM +0200, David Hildenbrand wrote:
> On 07.04.21 15:16, Kirill A. Shutemov wrote:
> > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
> > > On 06.04.21 16:33, Dave Hansen wrote:
> > > > On 4/6/21 12:44 AM, David Hildenbrand wrote:
> > > > > On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > > > > > TDX architecture aims to provide resiliency against confidentiality 
> > > > > > and
> > > > > > integrity attacks. Towards this goal, the TDX architecture helps 
> > > > > > enforce
> > > > > > the enabling of memory integrity for all TD-private memory.
> > > > > > 
> > > > > > The CPU memory controller computes the integrity check value (MAC) 
> > > > > > for
> > > > > > the data (cache line) during writes, and it stores the MAC with the
> > > > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > > > > > 
> > > > > > Checking of memory integrity is performed during memory reads. If
> > > > > > integrity check fails, CPU poisones cache line.
> > > > > > 
> > > > > > On a subsequent consumption (read) of the poisoned data by software,
> > > > > > there are two possible scenarios:
> > > > > > 
> > > > > >     - Core determines that the execution can continue and it treats
> > > > > >       poison with exception semantics signaled as a #MCE
> > > > > > 
> > > > > >     - Core determines execution cannot continue,and it does an 
> > > > > > unbreakable
> > > > > >       shutdown
> > > > > > 
> > > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > > > > > 
> > > > > > As some of integrity check failures may lead to system shutdown host
> > > > > > kernel must not allow any writes to TD-private memory. This 
> > > > > > requirment
> > > > > > clashes with KVM design: KVM expects the guest memory to be mapped 
> > > > > > into
> > > > > > host userspace (e.g. QEMU).
> > > > > 
> > > > > So what you are saying is that if QEMU would write to such memory, it
> > > > > could crash the kernel? What a broken design.
> > > > 
> > > > IMNHO, the broken design is mapping the memory to userspace in the first
> > > > place.  Why the heck would you actually expose something with the MMU to
> > > > a context that can't possibly meaningfully access or safely write to it?
> > > 
> > > I'd say the broken design is being able to crash the machine via a simple
> > > memory write, instead of only crashing a single process in case you're 
> > > doing
> > > something nasty. From the evaluation of the problem it feels like this 
> > > was a
> > > CPU design workaround: instead of properly cleaning up when it gets tricky
> > > within the core, just crash the machine. And that's a CPU "feature", not a
> > > kernel "feature". Now we have to fix broken HW in the kernel - once again.
> > > 
> > > However, you raise a valid point: it does not make too much sense to to 
> > > map
> > > this into user space. Not arguing against that; but crashing the machine 
> > > is
> > > just plain ugly.
> > > 
> > > I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
> > > like: for hacking support for that memory type into QEMU/KVM.
> > > 
> > > This all feels wrong, but I cannot really tell how it could be better. 
> > > That
> > > memory can really only be used (right now?) with hardware virtualization
> > > from some point on. From that point on (right from the start?), there 
> > > should
> > > be no VMA/mmap/page tables for user space anymore.
> > > 
> > > Or am I missing something? Is there still valid user space access?
> > 
> > There is. For IO (e.g. virtio) the guest marks a range of memory as shared
> > (or unencrypted for AMD SEV). The range is not pre-defined.
> > 
> 
> Ah right, rings a bell. One obvious alternative would be to let user space
> only explicitly map what is shared and can be safely accessed, instead of
> doing it the other way around. But that obviously requires more thought/work
> and clashes with future MM changes you discuss

Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-07 Thread Kirill A. Shutemov
On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
> On 06.04.21 16:33, Dave Hansen wrote:
> > On 4/6/21 12:44 AM, David Hildenbrand wrote:
> > > On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > > > TDX architecture aims to provide resiliency against confidentiality and
> > > > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > > > the enabling of memory integrity for all TD-private memory.
> > > > 
> > > > The CPU memory controller computes the integrity check value (MAC) for
> > > > the data (cache line) during writes, and it stores the MAC with the
> > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > > > 
> > > > Checking of memory integrity is performed during memory reads. If
> > > > integrity check fails, CPU poisones cache line.
> > > > 
> > > > On a subsequent consumption (read) of the poisoned data by software,
> > > > there are two possible scenarios:
> > > > 
> > > >    - Core determines that the execution can continue and it treats
> > > >      poison with exception semantics signaled as a #MCE
> > > > 
> > > >    - Core determines execution cannot continue,and it does an 
> > > > unbreakable
> > > >      shutdown
> > > > 
> > > > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > > > 
> > > > As some of integrity check failures may lead to system shutdown host
> > > > kernel must not allow any writes to TD-private memory. This requirment
> > > > clashes with KVM design: KVM expects the guest memory to be mapped into
> > > > host userspace (e.g. QEMU).
> > > 
> > > So what you are saying is that if QEMU would write to such memory, it
> > > could crash the kernel? What a broken design.
> > 
> > IMNHO, the broken design is mapping the memory to userspace in the first
> > place.  Why the heck would you actually expose something with the MMU to
> > a context that can't possibly meaningfully access or safely write to it?
> 
> I'd say the broken design is being able to crash the machine via a simple
> memory write, instead of only crashing a single process in case you're doing
> something nasty. From the evaluation of the problem it feels like this was a
> CPU design workaround: instead of properly cleaning up when it gets tricky
> within the core, just crash the machine. And that's a CPU "feature", not a
> kernel "feature". Now we have to fix broken HW in the kernel - once again.
> 
> However, you raise a valid point: it does not make too much sense to to map
> this into user space. Not arguing against that; but crashing the machine is
> just plain ugly.
> 
> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
> like: for hacking support for that memory type into QEMU/KVM.
> 
> This all feels wrong, but I cannot really tell how it could be better. That
> memory can really only be used (right now?) with hardware virtualization
> from some point on. From that point on (right from the start?), there should
> be no VMA/mmap/page tables for user space anymore.
> 
> Or am I missing something? Is there still valid user space access?

There is. For IO (e.g. virtio) the guest marks a range of memory as shared
(or unencrypted for AMD SEV). The range is not pre-defined.

> > This started with SEV.  QEMU creates normal memory mappings with the SEV
> > C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
> > those are instantiated, they have the C-bit set.  So, we have mismatched
> > mappings.  Where does that lead?  The two mappings not only differ in
> > the encryption bit, causing one side to read gibberish if the other
> > writes: they're not even cache coherent.
> > 
> > That's the situation *TODAY*, even ignoring TDX.
> > 
> > BTW, I'm pretty sure I know the answer to the "why would you expose this
> > to userspace" question: it's what QEMU/KVM did alreadhy for
> > non-encrypted memory, so this was the quickest way to get SEV working.
> > 
> 
> Yes, I guess so. It was the fastest way to "hack" it into QEMU.
> 
> Would we ever even want a VMA/mmap/process page tables for that memory? How
> could user space ever do something *not so nasty* with that memory (in the
> current context of VMs)?

In the future, the memory should still be manageable by the host MM:
migration, swapping, etc. But it's a long way there. For now, the guest
memory is effectively pinned on the host.

-- 
 Kirill A. Shutemov


Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code

2021-04-06 Thread Kirill A. Shutemov
On Tue, Apr 06, 2021 at 09:11:25AM -0700, Dave Hansen wrote:
> On 4/6/21 8:37 AM, Kirill A. Shutemov wrote:
> > On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
> >> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> >>> From: "Kirill A. Shutemov" 
> >>>
> >>> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> >>> required for communication with VMM suppose to be shared explicitly by
> >>
> >> s/suppose to/must/
> > 
> > Right.
> > 
> >>> setting the bit in page table entry. The shared memory is similar to
> >>> unencrypted memory in AMD SME/SEV terminology.
> >>
> >> In addition to setting the page table bit, there's also a dance to go
> >> through to convert the memory.  Please mention the procedure here at
> >> least.  It's very different from SME.
> > 
> > "
> >   After setting the shared bit, the conversion must be completed with
> >   MapGPA TDVMCALL. The call informs VMM about the conversion and makes it
> >   remove the GPA from the S-EPT mapping.
> > "
> 
> Where does the TDX module fit in here?

VMM must go through the TLB Tracking Sequence, which involves a bunch of
SEAMCALLs. See 3.3.1.2 "Dynamic Page Removal (Private to Shared
Conversion)" of the TDX Module spec.
> 
> >>> force_dma_unencrypted() has to return true for TDX guest. Move it out of
> >>> AMD SME code.
> >>
> >> You lost me here.  What does force_dma_unencrypted() have to do with
> >> host/guest shared memory?
> > 
> > "
> >   AMD SEV makes force_dma_unencrypted() return true which triggers
> >   set_memory_decrypted() calls on all DMA allocations. TDX will use the
> >   same code path to make DMA allocations shared.
> > "
> 
> SEV assumes that I/O devices can only do DMA to "decrypted" physical
> addresses without the C-bit set.  In order for the CPU to interact with
> this memory, the CPU needs a decrypted mapping.
> 
> TDX is similar.  TDX architecturally prevents access to private guest
> memory by anything other than the guest itself.  This means that any DMA
> buffers must be shared.
> 
> Right?

Yes.


-- 
 Kirill A. Shutemov


Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared

2021-04-06 Thread Kirill A. Shutemov
> > +   if (!enc || !is_tdx_guest())
> > +   cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> 
> That "!enc" looks wrong to me.  Caches would need to be flushed whenever
> encryption attributes *change*, not just when they are set.
> 
> Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
> not need TLB flushes?

I will double-check everything, but I think we can skip *both* cpa_flush()
calls for the private->shared conversion. The VMM and the TDX module will
take care of TLB and cache flushing in response to the MapGPA TDVMCALL.

> > ret = __change_page_attr_set_clr(&cpa, 1);
> >  
> > @@ -2012,6 +2020,11 @@ static int __set_memory_enc_dec(unsigned long addr, 
> > int numpages, bool enc)
> >  */
> > cpa_flush(&cpa, 0);
> >  
> > +   if (!ret && is_tdx_guest()) {
> > +   ret = tdx_map_gpa(__pa(addr), numpages, enc);
> > +   // XXX: need to undo on error?
> > +   }
> 
> Time to fix this stuff up if you want folks to take this series more
> seriously.

My bad, will fix it.

-- 
 Kirill A. Shutemov


Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()

2021-04-06 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 01:26:23PM -0700, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > All ioremap()ed paged that are not backed by normal memory (NONE or
> > RESERVED) have to be mapped as shared.
> 
> s/paged/pages/
> 
> 
> > +/* Make the page accesable by VMM */
> > +#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | 
> > tdx_shared_mask())
> > +
> >  #ifndef __ASSEMBLY__
> >  #include 
> >  #include 
> > diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> > index 9e5ccc56f8e0..a0ba760866d4 100644
> > --- a/arch/x86/mm/ioremap.c
> > +++ b/arch/x86/mm/ioremap.c
> > @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource 
> > *res)
> >  }
> >  
> >  /*
> > - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
> > - * there the whole memory is already encrypted.
> > + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted 
> > (or
> > + * private in TDX case) because there the whole memory is already 
> > encrypted.
> >   */
> 
> But doesn't this mean that we can't ioremap() normal memory?

It's not allowed anyway: see the (io_desc.flags & IORES_MAP_SYSTEM_RAM)
check in __ioremap_caller().
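
For reference, the check in question is roughly (sketch from
arch/x86/mm/ioremap.c, quoted from memory):

	/* Don't allow anybody to remap normal RAM that we're using. */
	if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
		WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
			  &phys_addr, &last_addr);
		return NULL;
	}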


> I was somehow expecting that we would need to do this for some
> host<->guest communication pages.

It goes through the DMA API, not ioremap().

-- 
 Kirill A. Shutemov


Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

2021-04-06 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 01:13:16PM -0700, Dave Hansen wrote:
> > @@ -56,6 +61,9 @@ static void tdx_get_info(void)
> >  
> > td_info.gpa_width = rcx & GENMASK(5, 0);
> > td_info.attributes = rdx;
> > +
> > +   /* Exclude Shared bit from the __PHYSICAL_MASK */
> > +   physical_mask &= ~tdx_shared_mask();
> >  }
> 
> I wish we had all of these 'physical_mask' manipulations in a single
> spot.  Can we consolidate these instead of having TDX and SME poke at
> them individually?

SME has to do it very early -- from __startup_64() -- as it sets the bit
on all memory, except what is used for communication. TDX can postpone it,
as we don't need any shared mappings very early in boot.

Basically, to do it from the same place we would need to move TDX
enumeration earlier into boot. It's risky: everything is more fragile
there.

I would rather keep it as is. We should be fine as long as we only allow
clearing bits from the mask.

-- 
 Kirill A. Shutemov


Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code

2021-04-06 Thread Kirill A. Shutemov
On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> > required for communication with VMM suppose to be shared explicitly by
> 
> s/suppose to/must/

Right.

> > setting the bit in page table entry. The shared memory is similar to
> > unencrypted memory in AMD SME/SEV terminology.
> 
> In addition to setting the page table bit, there's also a dance to go
> through to convert the memory.  Please mention the procedure here at
> least.  It's very different from SME.

"
  After setting the shared bit, the conversion must be completed with
  MapGPA TDVMCALL. The call informs VMM about the conversion and makes it
  remove the GPA from the S-EPT mapping.
"

> > force_dma_unencrypted() has to return true for TDX guest. Move it out of
> > AMD SME code.
> 
> You lost me here.  What does force_dma_unencrypted() have to do with
> host/guest shared memory?

"
  AMD SEV makes force_dma_unencrypted() return true which triggers
  set_memory_decrypted() calls on all DMA allocations. TDX will use the
  same code path to make DMA allocations shared.
"

> > Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
> > selected by all x86 memory encryption features.
> 
> Please also mention what will set it.  I assume TDX guest support will
> set this option.  It's probably also worth a sentence to say that
> force_dma_unencrypted() will have TDX-specific code added to it.  (It
> will, right??)

"
  Only AMD_MEM_ENCRYPT uses the option now. TDX will be the second one.
"

> > This is preparation for TDX changes in DMA code.
> 
> Probably best to also mention that this effectively just moves code
> around.  This patch should have no functional changes at runtime.

Isn't that what the subject says? :P

> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 0374d9f262a5..8fa654d61ac2 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
> >   helps to determine the effectiveness of preserving large and huge
> >   page mappings when mapping protections are changed.
> >  
> > +config X86_MEM_ENCRYPT_COMMON
> > +   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> > +   select DYNAMIC_PHYSICAL_MASK
> > +   def_bool n
> > +
> >  config AMD_MEM_ENCRYPT
> > bool "AMD Secure Memory Encryption (SME) support"
> > depends on X86_64 && CPU_SUP_AMD
> > select DMA_COHERENT_POOL
> > -   select DYNAMIC_PHYSICAL_MASK
> > select ARCH_USE_MEMREMAP_PROT
> > -   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> > select INSTRUCTION_DECODER
> > +   select X86_MEM_ENCRYPT_COMMON
> > help
> >   Say yes to enable support for the encryption of system memory.
> >   This requires an AMD processor that supports Secure Memory
> > diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> > index 30a3b30395ad..95e534cffa99 100644
> > --- a/arch/x86/include/asm/io.h
> > +++ b/arch/x86/include/asm/io.h
> > @@ -257,10 +257,12 @@ static inline void slow_down_io(void)
> >  
> >  #endif
> >  
> > -#ifdef CONFIG_AMD_MEM_ENCRYPT
> >  #include 
> >  
> >  extern struct static_key_false sev_enable_key;
> 
> This _looks_ odd.  sev_enable_key went from being under
> CONFIG_AMD_MEM_ENCRYPT to being unconditionally referenced.

Not referenced, but declared.

> Could you explain a bit more?
> 
> I would have expected it tot at *least* be tied to the new #ifdef.

Looks like a fixup got folded into the wrong patch. It was supposed to be in
"x86/kvm: Use bounce buffers for TD guest".

This declaration allows us to get away without any #ifdefs in
mem_encrypt_init() when !CONFIG_AMD_MEM_ENCRYPT: sev_active() is
false at compile time and sev_enable_key is never referenced.
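
I.e., roughly (a sketch of how this compiles with !CONFIG_AMD_MEM_ENCRYPT):

	/* in mem_encrypt_init(): */
	if (sev_active())	/* static inline stub, constant false */
		static_branch_enable(&sev_enable_key);	/* dead code; the
							   extern declaration
							   is enough to build */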

Sathya, could you move it to the right patch?

> > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > +
> >  static inline bool sev_key_active(void)
> >  {
> > return static_branch_unlikely(&sev_enable_key);
> > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> > index 5864219221ca..b31cb52bf1bd 100644
> > --- a/arch/x86/mm/Makefile
> ...

-- 
 Kirill A. Shutemov


Re: [PATCH v6 01/27] mm: Introduce struct folio

2021-04-06 Thread Kirill A. Shutemov
On Tue, Apr 06, 2021 at 01:48:07PM +0100, Matthew Wilcox wrote:
> Now, maybe we should put this optimisation into the definition of nth_page?

Sounds like a good idea to me.

> > > +struct folio {
> > > + /* private: don't document the anon union */
> > > + union {
> > > + struct {
> > > + /* public: */
> > > + unsigned long flags;
> > > + struct list_head lru;
> > > + struct address_space *mapping;
> > > + pgoff_t index;
> > > + unsigned long private;
> > > + atomic_t _mapcount;
> > > + atomic_t _refcount;
> > > +#ifdef CONFIG_MEMCG
> > > + unsigned long memcg_data;
> > > +#endif
> > 
> > As Christoph, I'm not a fan of this :/
> 
> What would you prefer?

I liked the earlier approach with only struct page here. Once we know a field
should never be referenced from a raw struct page, we can move it here.

But feel free to ignore my suggestion. It's not a show-stopper for me, and
reverting it back isn't worth it.

I went through the patchset and it looks good. You can use my

  Acked-by: Kirill A. Shutemov 

on all of them.

Thanks a lot for doing this.

-- 
 Kirill A. Shutemov


Re: [PATCH v6 01/27] mm: Introduce struct folio

2021-04-06 Thread Kirill A. Shutemov
On Wed, Mar 31, 2021 at 07:47:02PM +0100, Matthew Wilcox (Oracle) wrote:
> +/**
> + * folio_next - Move to the next physical folio.
> + * @folio: The folio we're currently operating on.
> + *
> + * If you have physically contiguous memory which may span more than
> + * one folio (eg a  bio_vec), use this function to move from one
> + * folio to the next.  Do not use it if the memory is only virtually
> + * contiguous as the folios are almost certainly not adjacent to each
> + * other.  This is the folio equivalent to writing ``page++``.
> + *
> + * Context: We assume that the folios are refcounted and/or locked at a
> + * higher level and do not adjust the reference counts.
> + * Return: The next struct folio.
> + */
> +static inline struct folio *folio_next(struct folio *folio)
> +{
> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> + return (struct folio *)nth_page(&folio->page, folio_nr_pages(folio));
> +#else
> + return folio + folio_nr_pages(folio);
> +#endif

Do we really need the #if here?

From a quick look at nth_page() and memory_model.h, the compiler should be
able to simplify the calculation for FLATMEM or SPARSEMEM_VMEMMAP to what you
do in the #else. No?
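
I.e., just (sketch):

static inline struct folio *folio_next(struct folio *folio)
{
	/* nth_page() collapses to plain pointer arithmetic for
	   FLATMEM and SPARSEMEM_VMEMMAP, so no #if is needed */
	return (struct folio *)nth_page(&folio->page, folio_nr_pages(folio));
}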

> @@ -224,6 +224,71 @@ struct page {
>  #endif
>  } _struct_page_alignment;
>  
> +/**
> + * struct folio - Represents a contiguous set of bytes.
> + * @flags: Identical to the page flags.
> + * @lru: Least Recently Used list; tracks how recently this folio was used.
> + * @mapping: The file this page belongs to, or refers to the anon_vma for
> + *anonymous pages.
> + * @index: Offset within the file, in units of pages.  For anonymous pages,
> + *this is the index from the beginning of the mmap.
> + * @private: Filesystem per-folio data (see attach_folio_private()).
> + *Used for swp_entry_t if FolioSwapCache().
> + * @_mapcount: How many times this folio is mapped to userspace.  Use
> + *folio_mapcount() to access it.
> + * @_refcount: Number of references to this folio.  Use folio_ref_count()
> + *to read it.
> + * @memcg_data: Memory Control Group data.
> + *
> + * A folio is a physically, virtually and logically contiguous set
> + * of bytes.  It is a power-of-two in size, and it is aligned to that
> + * same power-of-two.  It is at least as large as %PAGE_SIZE.  If it is
> + * in the page cache, it is at a file offset which is a multiple of that
> + * power-of-two.
> + */
> +struct folio {
> + /* private: don't document the anon union */
> + union {
> + struct {
> + /* public: */
> + unsigned long flags;
> + struct list_head lru;
> + struct address_space *mapping;
> + pgoff_t index;
> + unsigned long private;
> + atomic_t _mapcount;
> + atomic_t _refcount;
> +#ifdef CONFIG_MEMCG
> + unsigned long memcg_data;
> +#endif

As Christoph, I'm not a fan of this :/

> + /* private: the union with struct page is transitional */
> + };
> + struct page page;
> + };
> +};

-- 
 Kirill A. Shutemov


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-06 Thread Kirill A. Shutemov
On Tue, Apr 06, 2021 at 09:44:07AM +0200, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > TDX architecture aims to provide resiliency against confidentiality and
> > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > the enabling of memory integrity for all TD-private memory.
> > 
> > The CPU memory controller computes the integrity check value (MAC) for
> > the data (cache line) during writes, and it stores the MAC with the
> > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > 
> > Checking of memory integrity is performed during memory reads. If
> > integrity check fails, CPU poisones cache line.
> > 
> > On a subsequent consumption (read) of the poisoned data by software,
> > there are two possible scenarios:
> > 
> >   - Core determines that the execution can continue and it treats
> > poison with exception semantics signaled as a #MCE
> > 
> >   - Core determines execution cannot continue,and it does an unbreakable
> > shutdown
> > 
> > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > 
> > As some of integrity check failures may lead to system shutdown host
> > kernel must not allow any writes to TD-private memory. This requirment
> > clashes with KVM design: KVM expects the guest memory to be mapped into
> > host userspace (e.g. QEMU).
> 
> So what you are saying is that if QEMU would write to such memory, it could
> crash the kernel? What a broken design.

Cannot disagree. #MCE for an integrity check failure is very questionable.
But I'm not a CPU engineer.

> "As some of integrity check failures may lead to system shutdown host" --
> usually we expect to recover from an MCE by killing the affected process,
> which would be the right thing to do here.

In most cases that's what happens.

> How can it happen that "Core determines execution cannot continue,and it
> does an unbreakable shutdown". Who is "Core"? CPU "core", MM "core" ?

CPU core.

> And why would it decide to do a shutdown instead of just killing the
> process?



If the CPU handles a long-flow instruction (one that involves microcode and
multiple memory accesses), consuming poison somewhere in the middle leaves
the CPU unable to get back into a sane state, and the only option is a
system shutdown.

-- 
 Kirill A. Shutemov


[RFCv1 5/7] x86/kvmclock: Share hvclock memory with the host

2021-04-02 Thread Kirill A. Shutemov
hvclock is shared between the guest and the hypervisor. It has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..3d004b278dba 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -252,7 +252,7 @@ static void __init kvmclock_init_mem(void)
 * hvclock is shared between the guest and the hypervisor, must
 * be mapped decrypted.
 */
-   if (sev_active()) {
+   if (sev_active() || kvm_mem_protected()) {
r = set_memory_decrypted((unsigned long) hvclock_mem,
 1UL << order);
if (r) {
-- 
2.26.3



[RFCv1 4/7] x86/kvm: Use bounce buffers for KVM memory protection

2021-04-02 Thread Kirill A. Shutemov
Mirroring SEV, always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/mem_encrypt.h |  7 +++--
 arch/x86/kernel/kvm.c  |  2 ++
 arch/x86/kernel/pci-swiotlb.c  |  3 +-
 arch/x86/mm/mem_encrypt.c  | 44 ---
 arch/x86/mm/mem_encrypt_common.c   | 48 ++
 6 files changed, 57 insertions(+), 48 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d197b3beb904..c51d14db5620 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -812,6 +812,7 @@ config KVM_GUEST
select ARCH_CPUIDLE_HALTPOLL
select X86_HV_CALLBACK_VECTOR
select X86_MEM_ENCRYPT_COMMON
+   select SWIOTLB
default y
help
  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/include/asm/mem_encrypt.h 
b/arch/x86/include/asm/mem_encrypt.h
index 31c4df123aa0..a748b30c2f23 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, 
unsigned long size);
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void __init sev_es_init_vc_handling(void);
+
 bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
@@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
 /*
  * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
  * writing to or comparing values from the cr3 register.  Having the
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6989e1b74eb..45aee29e4294 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -766,6 +767,7 @@ static void __init kvm_init_platform(void)
pr_info("KVM memory protection enabled\n");
mem_protected = true;
setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+   swiotlb_force = SWIOTLB_FORCE;
}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
-   if (sme_active())
+   if (sme_active() || kvm_mem_protected())
swiotlb = 1;
 
return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9ca477b9b8ba..3478f20fb46f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-   pr_info("AMD Memory Encryption Features active:");
-
-   /* Secure Memory Encryption */
-   if (sme_active()) {
-   /*
-* SME is mutually exclusive with any of the SEV
-* features below.
-*/
-   pr_cont(" SME\n");
-   return;
-   }
-
-   /* Secure Encrypted Virtualization */
-   if (sev_active())
-   pr_cont(" SEV");
-
-   /* Encrypted Register State */
-   if (sev_es_active())
-   pr_cont(" SEV-ES");
-
-   pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-   if (!sme_me_mask)
-   return;
-
-   /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-   swiotlb_update_mem_attributes();
-
-   /*
-* With SEV, we need to unroll the rep string I/O instructions.
-*/
-   if (sev_active())
-   static_branch_enable(&sev_enable_key);
-
-   print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 6bf0718bb72a..351b77361a5d 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -37,3 +38,50 @@ bool force_dma_unencrypted(struct device *dev)
 
return false;
 }
+
+static void print_mem_encrypt_feature_in

[RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature

2021-04-02 Thread Kirill A. Shutemov
Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.

The host side doesn't provide the feature yet, so it is dead code for now.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/kvm_para.h  |  5 +
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c| 18 ++
 include/uapi/linux/kvm_para.h|  3 ++-
 5 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..5b6f23e6edc4 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers 
VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted 
Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH  ( 8*32+21) /* "" VM Page Flush MSR is 
supported */
+#define X86_FEATURE_KVM_MEM_PROTECTED  ( 8*32+22) /* KVM memory protection 
extension */
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE   ( 9*32+ 0) /* RDFSBASE, WRFSBASE, 
RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..74aea18f3130 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -11,11 +11,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+   return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..8d32c41861c9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,11 +28,12 @@
 #define KVM_FEATURE_PV_UNHALT  7
 #define KVM_FEATURE_PV_TLB_FLUSH   9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT10
-#define KVM_FEATURE_PV_SEND_IPI11
+#define KVM_FEATURE_PV_SEND_IPI11
 #define KVM_FEATURE_POLL_CONTROL   12
 #define KVM_FEATURE_PV_SCHED_YIELD 13
 #define KVM_FEATURE_ASYNC_PF_INT   14
 #define KVM_FEATURE_MSI_EXT_DEST_ID15
+#define KVM_FEATURE_MEM_PROTECTED  16
 
 #define KVM_HINTS_REALTIME  0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..e6989e1b74eb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,13 @@
 #include 
 #include 
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+   return mem_protected;
+}
+
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
@@ -749,6 +756,17 @@ static void __init kvm_init_platform(void)
 {
kvmclock_init();
x86_platform.apic_post_init = kvm_apic_init;
+
+   if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+   if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+   pr_err("Failed to enable KVM memory protection\n");
+   return;
+   }
+
+   pr_info("KVM memory protection enabled\n");
+   mem_protected = true;
+   setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+   }
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 #define KVM_HC_CLOCK_PAIRING   9
-#define KVM_HC_SEND_IPI10
+#define KVM_HC_SEND_IPI10
 #define KVM_HC_SCHED_YIELD 11
+#define KVM_HC_ENABLE_MEM_PROTECTED12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



[RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code

2021-04-02 Thread Kirill A. Shutemov
force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig |  7 +-
 arch/x86/include/asm/io.h|  4 +++-
 arch/x86/mm/Makefile |  2 ++
 arch/x86/mm/mem_encrypt.c| 30 -
 arch/x86/mm/mem_encrypt_common.c | 38 
 5 files changed, 49 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..2b4ce1722dbd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1520,14 +1520,19 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select DYNAMIC_PHYSICAL_MASK
+   def_bool n
+
 config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
-   select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+   select X86_MEM_ENCRYPT_COMMON
help
  Say yes to enable support for the encryption of system memory.
  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..6dc51b31cb0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include 
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)+= 
pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)   += mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c3d5f0236f35..9ca477b9b8ba 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
 
 #include 
 #include 
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-   /*
-* For SEV, all DMA must be to unencrypted addresses.
-*/
-   if (sev_active())
-   return true;
-
-   /*
-* For SME, all DMA must be to unencrypted addresses if the
-* device does not support DMA to addresses that include the
-* encryption mask.
-*/
-   if (sme_active()) {
-   u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-   u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-   dev->bus_dma_limit);
-
-   if (dma_dev_mask <= dma_enc_mask)
-   return true;
-   }
-
-   return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index ..dd791352f73f
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky 
+ */
+
+#include 
+#include 
+#include 
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+   /*
+* For SEV, all DMA must be to unencrypted/shared addresses.
+*/
+   if (sev_active())
+   return true;
+
+   /*
+* For SME, all DMA must be to unencrypted addresses if the
+* device does not support DMA to addresses that include the
+* encryption mask.
+*/
+   if (sme_active()) {
+

[RFCv1 3/7] x86/kvm: Make DMA pages shared

2021-04-02 Thread Kirill A. Shutemov
Make force_dma_unencrypted() return true for KVM to get DMA pages mapped
as shared.

__set_memory_enc_dec() now informs the host via hypercall if the state
of the page has changed from shared to private or back.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig |  1 +
 arch/x86/mm/mem_encrypt_common.c |  5 +++--
 arch/x86/mm/pat/set_memory.c | 10 ++
 include/uapi/linux/kvm_para.h|  1 +
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2b4ce1722dbd..d197b3beb904 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -811,6 +811,7 @@ config KVM_GUEST
select PARAVIRT_CLOCK
select ARCH_CPUIDLE_HALTPOLL
select X86_HV_CALLBACK_VECTOR
+   select X86_MEM_ENCRYPT_COMMON
default y
help
  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index dd791352f73f..6bf0718bb72a 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
/*
-* For SEV, all DMA must be to unencrypted/shared addresses.
+* For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 */
-   if (sev_active())
+   if (sev_active() || kvm_mem_protected())
return true;
 
/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..4b312d80033d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1977,6 +1978,15 @@ static int __set_memory_enc_dec(unsigned long addr, int 
numpages, bool enc)
struct cpa_data cpa;
int ret;
 
+   if (kvm_mem_protected()) {
+   /* TODO: Unsharing memory back */
+   if (WARN_ON_ONCE(enc))
+   return -ENOSYS;
+
+   return kvm_hypercall2(KVM_HC_MEM_SHARE,
+ __pa(addr) >> PAGE_SHIFT, numpages);
+   }
+
/* Nothing to do if memory encryption is not active */
if (!mem_encrypt_active())
return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..09d36683ee0a 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
 #define KVM_HC_SEND_IPI10
 #define KVM_HC_SCHED_YIELD 11
 #define KVM_HC_ENABLE_MEM_PROTECTED12
+#define KVM_HC_MEM_SHARE   13
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



[RFCv1 0/7] TDX and guest memory unmapping

2021-04-02 Thread Kirill A. Shutemov
TDX integrity check failures may lead to system shutdown, so the host kernel
must not allow any writes to TD-private memory. This requirement clashes with
the KVM design: KVM expects the guest memory to be mapped into host userspace
(e.g. QEMU).

This patchset aims to start discussion on how we can approach the issue.

The core of the change is in the last patch. Please see a more detailed
description of the issue and the proposed solution there.

The patchset can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git kvm-unmapped-poison

Kirill A. Shutemov (7):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  KVM: unmap guest memory using poisoned pages

 arch/x86/Kconfig |   9 +-
 arch/x86/include/asm/cpufeatures.h   |   1 +
 arch/x86/include/asm/io.h|   4 +-
 arch/x86/include/asm/kvm_para.h  |   5 +
 arch/x86/include/asm/mem_encrypt.h   |   7 +-
 arch/x86/include/uapi/asm/kvm_para.h |   3 +-
 arch/x86/kernel/kvm.c|  20 
 arch/x86/kernel/kvmclock.c   |   2 +-
 arch/x86/kernel/pci-swiotlb.c|   3 +-
 arch/x86/kvm/Kconfig |   1 +
 arch/x86/kvm/cpuid.c |   3 +-
 arch/x86/kvm/mmu/mmu.c   |  15 ++-
 arch/x86/kvm/mmu/paging_tmpl.h   |  10 +-
 arch/x86/kvm/x86.c   |   6 +
 arch/x86/mm/Makefile |   2 +
 arch/x86/mm/mem_encrypt.c|  74 
 arch/x86/mm/mem_encrypt_common.c |  87 ++
 arch/x86/mm/pat/set_memory.c |  10 ++
 arch/x86/realmode/init.c |   7 +-
 include/linux/kvm_host.h |  12 ++
 include/linux/swapops.h  |  20 
 include/uapi/linux/kvm_para.h|   5 +-
 mm/gup.c |  31 +++--
 mm/memory.c  |  45 +++-
 mm/page_vma_mapped.c |   8 +-
 mm/rmap.c|   2 +-
 mm/shmem.c   |   7 ++
 virt/kvm/Kconfig |   3 +
 virt/kvm/kvm_main.c  | 164 ---
 29 files changed, 442 insertions(+), 124 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.26.3



[RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-02 Thread Kirill A. Shutemov
TDX architecture aims to provide resiliency against confidentiality and
integrity attacks. Towards this goal, the TDX architecture helps enforce
the enabling of memory integrity for all TD-private memory.

The CPU memory controller computes the integrity check value (MAC) for
the data (cache line) during writes, and it stores the MAC with the
memory as meta-data. A 28-bit MAC is stored in the ECC bits.

Checking of memory integrity is performed during memory reads. If the
integrity check fails, the CPU poisons the cache line.

On a subsequent consumption (read) of the poisoned data by software,
there are two possible scenarios:

 - Core determines that the execution can continue and it treats
   poison with exception semantics signaled as a #MCE

 - Core determines execution cannot continue, and it does an unbreakable
   shutdown

For more details, see Chapter 14 of Intel TDX Module EAS[1]

As some integrity check failures may lead to system shutdown, the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped into
host userspace (e.g. QEMU).

This patch aims to start discussion on how we can approach the issue.

For now I intentionally keep TDX out of picture here and try to find a
generic way to unmap KVM guest memory from host userspace. Hopefully, it
makes the patch more approachable. And anyone can try it out.

To the proposal:

Looking into existing codepaths I've discovered that we already have
semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
entries in page tables:

  - If an application touches a page mapped with the SWP_HWPOISON, it will
get SIGBUS.

  - GUP will fail with -EFAULT;

Accessing the poisoned memory via the page cache doesn't match the required
semantics right now, but it shouldn't be too hard to make it work: access to
poisoned dirty pages should give -EIO or -EHWPOISON.

My idea is that we can mark page as poisoned when we make it TD-private
and replace all PTEs that map the page with SWP_HWPOISON.
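Roughly, the idea translates into something like the sketch below. This is
illustrative only, not what the patch does verbatim: kvm_protect_page() is a
made-up name, and locking, error handling and the KVM plumbing are omitted.
It simply leans on the existing hwpoison machinery: once PG_hwpoison is set,
try_to_unmap() installs SWP_HWPOISON entries, so later userspace access gets
SIGBUS and GUP gets -EFAULT.

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/rmap.h>

/* Sketch: make a page TD-private by poisoning it and unmapping it. */
static void kvm_protect_page(struct page *page)
{
	lock_page(page);
	/* Future faults on this page resolve to SWP_HWPOISON -> SIGBUS */
	SetPageHWPoison(page);
	/* Replace all existing PTEs mapping the page with hwpoison entries */
	try_to_unmap(page, TTU_IGNORE_MLOCK);
	unlock_page(page);
}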

The patch is proof-of-concept and has known issues:

  - Limited to swap-backed pages for now: anon or tmpfs/shmem

  - No THP support

  - Need a new FOLL_XXX flags to access such pages from KVM code.

  - Page unpoisoning is not implemented. It proved to be more difficult
    than I expected. I'm looking into a solution.

  - Poisoned pages must be tied to the KVM instance and another KVM must not
    be able to map the page into a guest.

[1] 
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf

Not-signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kvm/Kconfig   |   1 +
 arch/x86/kvm/cpuid.c   |   3 +-
 arch/x86/kvm/mmu/mmu.c |  15 ++-
 arch/x86/kvm/mmu/paging_tmpl.h |  10 +-
 arch/x86/kvm/x86.c |   6 ++
 include/linux/kvm_host.h   |  12 +++
 include/linux/swapops.h|  20 
 include/uapi/linux/kvm_para.h  |   1 +
 mm/gup.c   |  31 ---
 mm/memory.c|  45 -
 mm/page_vma_mapped.c   |   8 +-
 mm/rmap.c  |   2 +-
 mm/shmem.c |   7 ++
 virt/kvm/Kconfig   |   3 +
 virt/kvm/kvm_main.c| 164 +
 15 files changed, 290 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7ac592664c52..b7db1c455e7c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_VFIO
select SRCU
+   select HAVE_KVM_PROTECTED_MEMORY
help
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 38172ca627d3..1457692c1080 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -796,7 +796,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 (1 << KVM_FEATURE_PV_SEND_IPI) |
 (1 << KVM_FEATURE_POLL_CONTROL) |
 (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-(1 << KVM_FEATURE_ASYNC_PF_INT);
+(1 << KVM_FEATURE_ASYNC_PF_INT) |
+(1 << KVM_FEATURE_MEM_PROTECTED);
 
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..53a69c8c59f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2758,7 +2759,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
if (sp->role.level > PG_LEVEL_4K)
retu

[RFCv1 6/7] x86/realmode: Share trampoline area if KVM memory protection enabled

2021-04-02 Thread Kirill A. Shutemov
If KVM memory protection is active, the trampoline area will need to be
in shared memory.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/realmode/init.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 22fda7d99159..f3b54b5da693 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -75,11 +76,11 @@ static void __init setup_real_mode(void)
base = (unsigned char *)real_mode_header;
 
/*
-* If SME is active, the trampoline area will need to be in
-* decrypted memory in order to bring up other processors
+* If SME or KVM memory protection is active, the trampoline area will
+* need to be in decrypted memory in order to bring up other processors
 * successfully. This is not needed for SEV.
 */
-   if (sme_active())
+   if (sme_active() || kvm_mem_protected())
set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
memcpy(base, real_mode_blob, size);
-- 
2.26.3



Re: [PATCH v23 18/28] mm/mmap: Add shadow stack pages to memory accounting

2021-03-29 Thread Kirill A. Shutemov
On Fri, Mar 26, 2021 at 08:46:30AM -0700, Yu, Yu-cheng wrote:
> On 3/22/2021 3:57 AM, Kirill A. Shutemov wrote:
> > On Tue, Mar 16, 2021 at 08:10:44AM -0700, Yu-cheng Yu wrote:
> > > Account shadow stack pages to stack memory.
> > > 
> > > Signed-off-by: Yu-cheng Yu 
> > > Reviewed-by: Kees Cook 
> > > ---
> > >   arch/x86/mm/pgtable.c   |  7 +++
> > >   include/linux/pgtable.h | 11 +++
> > >   mm/mmap.c   |  5 +
> > >   3 files changed, 23 insertions(+)
> > > 
> > > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > > index 0f4fbf51a9fc..948d28c29964 100644
> > > --- a/arch/x86/mm/pgtable.c
> > > +++ b/arch/x86/mm/pgtable.c
> > > @@ -895,3 +895,10 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
> > >   #endif /* CONFIG_X86_64 */
> > >   #endif  /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> > > +
> > > +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> > > +bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
> > > +{
> > > + return (vm_flags & VM_SHSTK);
> > > +}
> > > +#endif
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index cbd98484c4f1..487c08df4365 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -1470,6 +1470,17 @@ static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t 
> > > pmd, struct vm_area_struct *vma
> > >   #endif /* CONFIG_ARCH_MAYBE_MKWRITE */
> > >   #endif /* CONFIG_MMU */
> > > +#ifdef CONFIG_MMU
> > > +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> > > +bool arch_shadow_stack_mapping(vm_flags_t vm_flags);
> > > +#else
> > > +static inline bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
> > > +{
> > > + return false;
> > > +}
> > > +#endif /* CONFIG_ARCH_HAS_SHADOW_STACK */
> > > +#endif /* CONFIG_MMU */
> > > +
> > >   /*
> > >* Architecture PAGE_KERNEL_* fallbacks
> > >*
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 3f287599a7a3..2ac67882ace2 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1718,6 +1718,9 @@ static inline int accountable_mapping(struct file 
> > > *file, vm_flags_t vm_flags)
> > >   if (file && is_file_hugepages(file))
> > >   return 0;
> > > + if (arch_shadow_stack_mapping(vm_flags))
> > > + return 1;
> > > +
> > 
> > What's wrong with testing (vm_flags & VM_SHSTK) here? VM_SHSTK is 0 on
> > non-x86.
> > 
> > >   return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == 
> > > VM_WRITE;
> > >   }
> > > @@ -3387,6 +3390,8 @@ void vm_stat_account(struct mm_struct *mm, 
> > > vm_flags_t flags, long npages)
> > >   mm->stack_vm += npages;
> > >   else if (is_data_mapping(flags))
> > >   mm->data_vm += npages;
> > > + else if (arch_shadow_stack_mapping(flags))
> > > + mm->stack_vm += npages;
> > 
> > Ditto.
> > 
> 
> The thought was, here all testing is done with helpers, e.g.
> is_data_mapping(), so creating a helper for shadow stack is more in line with
> the existing code.  Or, maybe we can call it is_shadow_stack_mapping()?
> And, since we have a helper, use it in accountable_mapping() as well.  Or do
> you have other suggestions?

is_shadow_stack_mapping() sounds reasonable.

My point is that we already have an #ifdef around the #define of VM_SHSTK.
There is no need to duplicate it for the helper.
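I.e. something as simple as this would do (untested sketch; it relies on
VM_SHSTK being defined as VM_NONE, i.e. 0, on architectures without shadow
stack, so the helper compiles away there without any extra #ifdef):

static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
{
	return vm_flags & VM_SHSTK;
}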

-- 
 Kirill A. Shutemov


Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2021-03-23 Thread Kirill A. Shutemov
 89 de
> > > 4c 89 ef e8 10 d7 ff ff 48 8b 15 59 36 9b 00 4c 89
> > > > [0.145129] RSP: 0018:990b80013dc0 EFLAGS: 00010282
> > > > [0.145129] RAX:  RBX: 972d474ada80 RCX:
> > > 
> > > > [0.145129] RDX: 0001 RSI: 0092 RDI:
> > > 8e1dd32c
> > > > [0.145129] RBP: 972d47425680 R08: 990b80013c7d R09:
> > > 00fc
> > > > [0.145129] R10: 990b80013c78 R11: 990b80013c7d R12:
> > > 972dc74ada80
> > > > [0.145129] R13: 972d474038c0 R14:  R15:
> > > 
> > > > [0.145129] FS:  () GS:972d47a0()
> > > knlGS:
> > > > [0.145129] CS:  0010 DS:  ES:  CR0: 80050033
> > > > [0.145129] CR2:  CR3: 0660a000 CR4:
> > > 003406f0
> > > > [0.145129] Call Trace:
> > > > [0.145129]  acpi_os_release_object+0x5/0x10
> > > > [0.145129]  acpi_ns_delete_children+0x46/0x59
> > > > [0.145129]  acpi_ns_delete_namespace_subtree+0x5c/0x79
> > > > [0.145129]  ? acpi_sleep_proc_init+0x1f/0x1f
> > > > [0.145129]  acpi_ns_terminate+0xc/0x31
> > > > [0.145129]  acpi_ut_subsystem_shutdown+0x45/0xa3
> > > > [0.145129]  ? acpi_sleep_proc_init+0x1f/0x1f
> > > > [0.145129]  acpi_terminate+0x5/0xf
> > > > [0.145129]  acpi_init+0x27b/0x308
> > > > [0.145129]  ? video_setup+0x79/0x79
> > > > [0.145129]  do_one_initcall+0x7b/0x160
> > > > [0.145129]  kernel_init_freeable+0x190/0x1f2
> > > > [0.145129]  ? rest_init+0x9a/0x9a
> > > > [0.145129]  kernel_init+0x5/0xf6
> > > > [0.145129]  ret_from_fork+0x22/0x30
> > > > [0.145129] ---[ end trace 574554fca7bd06bb ]---
> > > > [0.145133] INFO: Allocated in acpi_ns_root_initialize+0xb6/0x2d1 
> > > > age=58
> > > cpu=0 pid=0
> > > > [0.145881]  kmem_cache_alloc_trace+0x1a9/0x1c0
> > > > [0.146132]  acpi_ns_root_initialize+0xb6/0x2d1
> > > > [0.146578]  acpi_initialize_subsystem+0x65/0xa8
> > > > [0.147024]  acpi_early_init+0x5d/0xd1
> > > > [0.147132]  start_kernel+0x45b/0x518
> > > > [0.147491]  secondary_startup_64+0xb6/0xc0
> > > > [0.147897] [ cut here ]
> > > > 
> > > > And it seems ACPI is allocating an object via kmalloc() and then
> > > > freeing it via kmem_cache_free(<"Acpi-Namespace" kmem_cache>) which
> > > is wrong.
> > > > > ./scripts/faddr2line vmlinux 'acpi_ns_root_initialize+0xb6'
> > > > acpi_ns_root_initialize+0xb6/0x2d1:
> > > > kmalloc at include/linux/slab.h:555
> > > > (inlined by) kzalloc at include/linux/slab.h:669 (inlined by)
> > > > acpi_os_allocate_zeroed at include/acpi/platform/aclinuxex.h:57
> > > > (inlined by) acpi_ns_root_initialize at
> > > > drivers/acpi/acpica/nsaccess.c:102
> > > > 
> > Hi Vegard,
> > 
> > > That's it :-) This fixes it for me:
> > We'll take this patch for ACPICA and it will be in the next release.
> > 
> > Rafael, do you want to take this as a part of the next rc?
> 
> Yes, I do.

Folks, what happened to the patch? I don't see it in current upstream.

Looks like it got reported again:

https://lore.kernel.org/r/a1461e21-c744-767d-6dfc-6641fd3e3...@siemens.com

-- 
 Kirill A. Shutemov


Re: [PATCH] khugepaged: Replace barrier() with READ_ONCE() for a selective variable

2021-03-23 Thread Kirill A. Shutemov
On Tue, Mar 23, 2021 at 05:27:30PM +0800, yanfei...@windriver.com wrote:
> From: Yanfei Xu 
> 
> READ_ONCE() is more selective and lightweight. It is more appropriate to
> use READ_ONCE() for the specific variable to prevent the compiler from
> reordering.
> 
> Signed-off-by: Yanfei Xu 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 12/28] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

2021-03-22 Thread Kirill A. Shutemov
On Mon, Mar 22, 2021 at 11:46:21AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 22, 2021 at 01:15:02PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Mar 16, 2021 at 08:10:38AM -0700, Yu-cheng Yu wrote:
> 
> > > + pte_t old_pte, new_pte;
> > > +
> > > + old_pte = READ_ONCE(*ptep);
> > > + do {
> > > + new_pte = pte_wrprotect(old_pte);
> > > + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> > 
> > I think this is wrong. You need to update old_pte on every loop iteration,
> > otherwise you can get into an endless loop.
> 
> It is correct, please consider why the old argument is a pointer.

Ah, right. Sorry for the noise.

-- 
 Kirill A. Shutemov


Re: [PATCH v23 13/28] mm: Introduce VM_SHSTK for shadow stack memory

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:39AM -0700, Yu-cheng Yu wrote:
> +#ifdef CONFIG_X86_CET
> +# define VM_SHSTKVM_HIGH_ARCH_5
> +#else
> +# define VM_SHSTKVM_NONE
> +#endif
> +

Why not VM_SHADOW_STACK? Random reader may think SH stands for SHARED or
something.

-- 
 Kirill A. Shutemov


Re: [PATCH v23 11/28] x86/mm: Update pte_modify for _PAGE_COW

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:37AM -0700, Yu-cheng Yu wrote:
> The read-only and Dirty PTE has been used to indicate copy-on-write pages.
> However, newer x86 processors also regard a read-only and Dirty PTE as a
> shadow stack page.  In order to separate the two, the software-defined
> _PAGE_COW is created to replace _PAGE_DIRTY for the copy-on-write case, and
> pte_*() are updated.
> 
> Pte_modify() changes a PTE to 'newprot', but it doesn't use the pte_*().
> Introduce fixup_dirty_pte(), which sets a dirty PTE, based on _PAGE_RW,
> to either _PAGE_DIRTY or _PAGE_COW.
> 
> Apply the same changes to pmd_modify().
> 
> Signed-off-by: Yu-cheng Yu 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 10/28] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:36AM -0700, Yu-cheng Yu wrote:
> After the introduction of _PAGE_COW, a modified page's PTE can have either
> _PAGE_DIRTY or _PAGE_COW.  Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> Cc: David Airlie 
> Cc: Joonas Lahtinen 
> Cc: Jani Nikula 
> Cc: Daniel Vetter 
> Cc: Rodrigo Vivi 
> Cc: Zhenyu Wang 
> Cc: Zhi Wang 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 09/28] x86/mm: Introduce _PAGE_COW

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:35AM -0700, Yu-cheng Yu wrote:
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux).  That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0, Dirty=1.
> 
> The reason it's lightly used is that Dirty=1 is normally set by hardware
> and cannot normally be set by hardware on a Write=0 PTE.  Software must
> normally be involved to create one of these PTEs, so software can simply
> opt to not create them.
> 
> In places where Linux normally creates Write=0, Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY.  In other
> words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
> Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1.  This
> clearly separates shadow stack from other data, and results in the
> following:
> 
> (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
> (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
> The user page is in a R/O VMA, and get_user_pages() needs a writable
> copy.  The page fault handler creates a copy of the page and sets
> the new copy's PTE as Write=0 and Cow=1.
> (c) A shadow stack PTE: (Write=0, Dirty=1)
> (d) A shared shadow stack PTE: (Write=0, Cow=1)
> When a shadow stack page is being shared among processes (this happens
> at fork()), its PTE is made Dirty=0, so the next shadow stack access
> causes a fault, and the page is duplicated and Dirty=1 is set again.
> This is the COW equivalent for shadow stack pages, even though it's
> copy-on-access rather than copy-on-write.
> (e) A page where the processor observed a Write=1 PTE, started a write, set
> Dirty=1, but then observed a Write=0 PTE.  That's possible today, but
> will not happen on processors that support shadow stack.
> 
> Define _PAGE_COW and update pte_*() helpers and apply the same changes to
> pmd and pud.
> 
> After this, there are six free bits left in the 64-bit PTE, and no more
> free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not
> implemented for the 32-bit kernel.
> 
> Signed-off-by: Yu-cheng Yu 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 08/28] x86/mm: Move pmd_write(), pud_write() up in the file

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:34AM -0700, Yu-cheng Yu wrote:
> To prepare the introduction of _PAGE_COW, move pmd_write() and
> pud_write() up in the file, so that they can be used by other
> helpers below.
> 
> Signed-off-by: Yu-cheng Yu 

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 28/28] mm: Introduce PROT_SHSTK for shadow stack

2021-03-22 Thread Kirill A. Shutemov
define MAP_32BIT0x40/* only give out 32bit addresses */
>  
> +#define PROT_SHSTK   0x10/* shadow stack pages */
>  
>  #include 
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e178be052419..40c4b0fe7cc4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -342,6 +342,7 @@ extern unsigned int kobjsize(const void *objp);
>  
>  #if defined(CONFIG_X86)
>  # define VM_PAT  VM_ARCH_1   /* PAT reserves whole VMA at 
> once (x86) */
> +# define VM_ARCH_CLEAR   VM_SHSTK
>  #elif defined(CONFIG_PPC)
>  # define VM_SAO  VM_ARCH_1   /* Strong Access Ordering 
> (powerpc) */
>  #elif defined(CONFIG_PARISC)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 99077171010b..934cb3cbe952 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1481,6 +1481,12 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   struct inode *inode = file_inode(file);
>   unsigned long flags_mask;
>  
> + /*
> +  * Call stack cannot be backed by a file.
> +  */
> + if (vm_flags & VM_SHSTK)
> + return -EINVAL;
> +
>   if (!file_mmap_ok(file, inode, pgoff, len))
>   return -EOVERFLOW;
>  
> @@ -1545,7 +1551,7 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   } else {
>   switch (flags & MAP_TYPE) {
>   case MAP_SHARED:
> - if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> + if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP|VM_SHSTK))
>   return -EINVAL;
>   /*
>* Ignore pgoff.
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 18/28] mm/mmap: Add shadow stack pages to memory accounting

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:44AM -0700, Yu-cheng Yu wrote:
> Account shadow stack pages to stack memory.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> ---
>  arch/x86/mm/pgtable.c   |  7 +++
>  include/linux/pgtable.h | 11 +++
>  mm/mmap.c   |  5 +
>  3 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 0f4fbf51a9fc..948d28c29964 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -895,3 +895,10 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>  
>  #endif /* CONFIG_X86_64 */
>  #endif   /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +
> +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> +bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
> +{
> + return (vm_flags & VM_SHSTK);
> +}
> +#endif
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cbd98484c4f1..487c08df4365 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1470,6 +1470,17 @@ static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, 
> struct vm_area_struct *vma
>  #endif /* CONFIG_ARCH_MAYBE_MKWRITE */
>  #endif /* CONFIG_MMU */
>  
> +#ifdef CONFIG_MMU
> +#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
> +bool arch_shadow_stack_mapping(vm_flags_t vm_flags);
> +#else
> +static inline bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
> +{
> + return false;
> +}
> +#endif /* CONFIG_ARCH_HAS_SHADOW_STACK */
> +#endif /* CONFIG_MMU */
> +
>  /*
>   * Architecture PAGE_KERNEL_* fallbacks
>   *
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f287599a7a3..2ac67882ace2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1718,6 +1718,9 @@ static inline int accountable_mapping(struct file 
> *file, vm_flags_t vm_flags)
>   if (file && is_file_hugepages(file))
>   return 0;
>  
> + if (arch_shadow_stack_mapping(vm_flags))
> + return 1;
> +

What's wrong with testing (vm_flags & VM_SHSTK) here? VM_SHSTK is 0 on
non-x86.

>   return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
>  }
>  
> @@ -3387,6 +3390,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t 
> flags, long npages)
>   mm->stack_vm += npages;
>   else if (is_data_mapping(flags))
>   mm->data_vm += npages;
> + else if (arch_shadow_stack_mapping(flags))
> + mm->stack_vm += npages;

Ditto.

>  }
>  
>  static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 17/28] mm: Add guard pages around a shadow stack.

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:43AM -0700, Yu-cheng Yu wrote:
> INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> first and the last elements in the range, effectively touches those memory
> areas.
> 
> The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> 255 * 4 = 1020 bytes by INCSSPD.  Both ranges are far from PAGE_SIZE.
> Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> CALL, and RET from going beyond.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> ---
>  arch/x86/include/asm/page_64_types.h | 10 ++
>  include/linux/mm.h   | 24 
>  2 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/page_64_types.h 
> b/arch/x86/include/asm/page_64_types.h
> index 64297eabad63..23e3d880ce6c 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -115,4 +115,14 @@
>  #define KERNEL_IMAGE_SIZE(512 * 1024 * 1024)
>  #endif
>  
> +/*
> + * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).  INCSSPQ
> + * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
> + * touches the first and the last element in the range, which triggers a
> + * page fault if the range is not in a shadow stack.  Because of this,
> + * creating 4-KB guard pages around a shadow stack prevents these
> + * instructions from going beyond.
> + */
> +#define ARCH_SHADOW_STACK_GUARD_GAP PAGE_SIZE
> +
>  #endif /* _ASM_X86_PAGE_64_DEFS_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af805ffde48e..9890e9f5a5e0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2619,6 +2619,10 @@ extern vm_fault_t filemap_page_mkwrite(struct vm_fault 
> *vmf);
>  int __must_check write_one_page(struct page *page);
>  void task_dirty_inc(struct task_struct *tsk);
>  
> +#ifndef ARCH_SHADOW_STACK_GUARD_GAP
> +#define ARCH_SHADOW_STACK_GUARD_GAP 0
> +#endif
> +
>  extern unsigned long stack_guard_gap;
>  /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
>  extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> @@ -2651,9 +2655,15 @@ static inline struct vm_area_struct * 
> find_vma_intersection(struct mm_struct * m
>  static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
>  {
>   unsigned long vm_start = vma->vm_start;
> + unsigned long gap = 0;
>  
> - if (vma->vm_flags & VM_GROWSDOWN) {
> - vm_start -= stack_guard_gap;
> + if (vma->vm_flags & VM_GROWSDOWN)
> + gap = stack_guard_gap;
> + else if (vma->vm_flags & VM_SHSTK)
> + gap = ARCH_SHADOW_STACK_GUARD_GAP;

Looks too x86-centric for generic code.

Maybe we can have a helper that would return the gap for the given VMA?
The generic version of the helper would only return stack_guard_gap for
VM_GROWSDOWN. Arch code would override it to handle the VM_SHSTK case too.

Similar can be done in vm_end_gap().
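Untested sketch of what I mean (stack_guard_start_gap() is just an
illustrative name):

static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GROWSDOWN)
		return stack_guard_gap;
	return 0;
}

static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
	unsigned long gap = stack_guard_start_gap(vma);
	unsigned long vm_start = vma->vm_start - gap;

	if (vm_start > vma->vm_start)
		vm_start = 0;
	return vm_start;
}

x86 could then provide its own stack_guard_start_gap() that also returns
ARCH_SHADOW_STACK_GUARD_GAP for VM_SHSTK VMAs, and the generic code would
stay unaware of shadow stacks.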

> +
> + if (gap != 0) {
> + vm_start -= gap;
>   if (vm_start > vma->vm_start)
>   vm_start = 0;
>   }
> @@ -2663,9 +2673,15 @@ static inline unsigned long vm_start_gap(struct 
> vm_area_struct *vma)
>  static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
>  {
>   unsigned long vm_end = vma->vm_end;
> + unsigned long gap = 0;
> +
> + if (vma->vm_flags & VM_GROWSUP)
> + gap = stack_guard_gap;
> + else if (vma->vm_flags & VM_SHSTK)
> + gap = ARCH_SHADOW_STACK_GUARD_GAP;
>  
> - if (vma->vm_flags & VM_GROWSUP) {
> - vm_end += stack_guard_gap;
> + if (gap != 0) {
> + vm_end += gap;
>   if (vm_end < vma->vm_end)
>   vm_end = -PAGE_SIZE;
>   }
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 15/28] x86/mm: Update maybe_mkwrite() for shadow stack

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:41AM -0700, Yu-cheng Yu wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a6c18c5752d6..af805ffde48e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -997,6 +997,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
> vm_area_struct *vma)
>  {
>   if (likely(vma->vm_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
> + else
> + pte = arch_maybe_mkwrite(pte, vma);
>   return pte;
>  }

I think it would be cleaner to allow arch code to override maybe_mkwrite()
and maybe_pmd_mkwrite() altogether. Wrap it into #ifndef maybe_mkwrite here
and provide a VM_SHSTK-aware version from <asm/pgtable.h>.
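Roughly like this (untested sketch, modulo header-ordering details;
pte_mkwrite_shstk() stands in for whatever shadow-stack-aware mkwrite helper
the series provides):

/* include/linux/mm.h */
#ifndef maybe_mkwrite
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (likely(vma->vm_flags & VM_WRITE))
		pte = pte_mkwrite(pte);
	return pte;
}
#endif

/* x86 override, e.g. from <asm/pgtable.h> */
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (likely(vma->vm_flags & VM_WRITE))
		pte = pte_mkwrite(pte);
	else if (vma->vm_flags & VM_SHSTK)
		pte = pte_mkwrite_shstk(pte);
	return pte;
}
#define maybe_mkwrite maybe_mkwrite

Same trick for maybe_pmd_mkwrite(), and core mm code never learns about
VM_SHSTK.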

-- 
 Kirill A. Shutemov


Re: [PATCH v23 14/28] x86/mm: Shadow Stack page fault error checking

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:40AM -0700, Yu-cheng Yu wrote:
> Shadow stack accesses are those that are performed by the CPU where it
> expects to encounter a shadow stack mapping.  These accesses are performed
> implicitly by CALL/RET at the site of the shadow stack pointer.  These
> accesses are made explicitly by shadow stack management instructions like
> WRUSSQ.
> 
> Shadow stacks accesses to shadow-stack mapping can see faults in normal,
> valid operation just like regular accesses to regular mappings.  Shadow
> stacks need some of the same features like delayed allocation, swap and
> copy-on-write.
> 
> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping.
> 
> In handling a shadow stack page fault, verify it occurs within a shadow
> stack mapping.  It is always an error otherwise.  For valid shadow stack
> accesses, set FAULT_FLAG_WRITE to effect copy-on-write.  Because clearing
> _PAGE_DIRTY (vs. _PAGE_RW) is used to trigger the fault, shadow stack read
> fault and shadow stack write fault are not differentiated and both are
> handled as a write access.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> ---
>  arch/x86/include/asm/trap_pf.h |  2 ++
>  arch/x86/mm/fault.c| 19 +++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..afa524325e55 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -11,6 +11,7 @@
>   *   bit 3 ==1: use of reserved bit detected
>   *   bit 4 ==1: fault was an instruction 
> fetch
>   *   bit 5 ==1: protection keys block access
> + *   bit 6 ==1: shadow stack access fault
>   *   bit 15 ==   1: SGX MMU page-fault
>   */
>  enum x86_pf_error_code {
> @@ -20,6 +21,7 @@ enum x86_pf_error_code {
>   X86_PF_RSVD =   1 << 3,
>   X86_PF_INSTR=   1 << 4,
>   X86_PF_PK   =   1 << 5,
> + X86_PF_SHSTK=   1 << 6,
>   X86_PF_SGX  =   1 << 15,
>  };
>  
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a73347e2cdfc..4316732a18c6 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1100,6 +1100,17 @@ access_error(unsigned long error_code, struct 
> vm_area_struct *vma)
>  (error_code & X86_PF_INSTR), foreign))
>   return 1;
>  
> + /*
> +  * Verify a shadow stack access is within a shadow stack VMA.
> +  * It is always an error otherwise.  Normal data access to a
> +  * shadow stack area is checked in the case followed.
> +  */
> + if (error_code & X86_PF_SHSTK) {
> + if (!(vma->vm_flags & VM_SHSTK))
> + return 1;
> + return 0;

Any reason to return 0 here? I would rather keep the single return 0 in
the function, after all checks are done.
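I mean something along these lines (untested sketch, written against what
access_error() looks like upstream):

	if (error_code & X86_PF_SHSTK) {
		/* A shadow stack access must hit a shadow stack VMA */
		if (!(vma->vm_flags & VM_SHSTK))
			return 1;
	} else if (error_code & X86_PF_WRITE) {
		/* write, present and write, not present: */
		if (unlikely(!(vma->vm_flags & VM_WRITE)))
			return 1;
	} else {
		/* read, present: */
		if (unlikely(error_code & X86_PF_PROT))
			return 1;
		/* read, not present: */
		if (unlikely(!vma_is_accessible(vma)))
			return 1;
	}

	return 0;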

> + }
> +
>   if (error_code & X86_PF_WRITE) {
>   /* write, present and write, not present: */
>   if (unlikely(!(vma->vm_flags & VM_WRITE)))
> @@ -1293,6 +1304,14 @@ void do_user_addr_fault(struct pt_regs *regs,
>  
>   perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
> + /*
> +  * Clearing _PAGE_DIRTY is used to detect shadow stack access.
> +  * This method cannot distinguish shadow stack read vs. write.
> +  * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
> +  * copy-on-write.
> +  */
> + if (error_code & X86_PF_SHSTK)
> + flags |= FAULT_FLAG_WRITE;
>   if (error_code & X86_PF_WRITE)
>   flags |= FAULT_FLAG_WRITE;
>   if (error_code & X86_PF_INSTR)
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 12/28] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:38AM -0700, Yu-cheng Yu wrote:
> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
> shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].
> 
> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE.  However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
> 
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu 
> Reviewed-by: Kees Cook 
> ---
>  arch/x86/include/asm/pgtable.h | 36 ++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e1739f590ca6..46d9394b884f 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1306,6 +1306,24 @@ static inline pte_t ptep_get_and_clear_full(struct 
> mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
>  {
> + /*
> +  * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
> +  * to _PAGE_COW (see comments at pte_wrprotect()).
> +  * When a thread reads a RW=1, Dirty=0 PTE and before changing it
> +  * to RW=0, Dirty=0, another thread could have written to the page
> +  * and the PTE is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
> +  * PTE changes and update old_pte, then try again.
> +  */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pte_t old_pte, new_pte;
> +
> + old_pte = READ_ONCE(*ptep);
> + do {
> + new_pte = pte_wrprotect(old_pte);
> + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));

I think this is wrong. You need to update old_pte on every loop iteration,
otherwise you can get into an endless loop.

The same issue for pmdp_set_wrprotect().

> +
> + return;
> + }
>   clear_bit(_PAGE_BIT_RW, (unsigned long *)>pte);
>  }
>  
> @@ -1350,6 +1368,24 @@ static inline pud_t pudp_huge_get_and_clear(struct 
> mm_struct *mm,
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pmd_t *pmdp)
>  {
> + /*
> +  * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
> +  * to _PAGE_COW (see comments at pmd_wrprotect()).
> +  * When a thread reads a RW=1, Dirty=0 PMD and before changing it
> +  * to RW=0, Dirty=0, another thread could have written to the page
> +  * and the PMD is RW=1, Dirty=1 now.  Use try_cmpxchg() to detect
> +  * PMD changes and update old_pmd, then try again.
> +  */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pmd_t old_pmd, new_pmd;
> +
> + old_pmd = READ_ONCE(*pmdp);
> + do {
> + new_pmd = pmd_wrprotect(old_pmd);
> +     } while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, 
> pmd_val(new_pmd)));
> +
> + return;
> + }
>   clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
>  }
>  
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v23 07/28] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

2021-03-22 Thread Kirill A. Shutemov
On Tue, Mar 16, 2021 at 08:10:33AM -0700, Yu-cheng Yu wrote:
> The x86 family of processors do not directly create read-only and Dirty
> PTEs.  These PTEs are created by software.  One such case is that kernel
> read-only pages are historically setup as Dirty.
> 
> New processors that support Shadow Stack regard read-only and Dirty PTEs as
> shadow stack pages.  This results in ambiguity between shadow stack and
> kernel read-only pages.  To resolve this, remove Dirty from kernel read-
> only pages.
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: "H. Peter Anvin" 
> Cc: Kees Cook 
> Cc: Thomas Gleixner 
> Cc: Dave Hansen 
> Cc: Christoph Hellwig 
> Cc: Andy Lutomirski 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Peter Zijlstra 

Looks good to me.

Reviewed-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-18 Thread Kirill A. Shutemov
> > [16247.536520]  fuse_open_common+0x1a8/0x1b0 [fuse]
> > [16247.536530]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
> > [16247.536540]  do_dentry_open+0x14e/0x380
> > [16247.536547]  path_openat+0xaf6/0x10a0
> > [16247.536555]  do_filp_open+0x88/0x130
> > [16247.536560]  ? security_prepare_creds+0x6d/0x90
> > [16247.536566]  ? __kmalloc+0x157/0x2e0
> > [16247.536575]  do_open_execat+0x6d/0x1a0
> > [16247.536581]  bprm_execve+0x128/0x660
> > [16247.536587]  do_execveat_common+0x192/0x1c0
> > [16247.536593]  __x64_sys_execve+0x39/0x50
> > [16247.536599]  do_syscall_64+0x33/0x80
> > [16247.536606]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [16247.536614] RIP: 0033:0x7f97c0efec37
> > [16247.536621] Code: Unable to access opcode bytes at RIP 0x7f97c0efec0d.
> > [16247.536625] RSP: 002b:7ffdc2fdea68 EFLAGS: 0202 ORIG_RAX: 
> > 003b
> > [16247.536631] RAX: ffda RBX: 7f97c17176a0 RCX: 
> > 7f97c0efec37
> > [16247.536635] RDX: 00ea42c0 RSI: 00ea5848 RDI: 
> > 00ea5d00
> > [16247.536639] RBP: 0001 R08:  R09: 
> > 
> > [16247.536643] R10: 7ffdc2fdde60 R11: 0202 R12: 
> > 
> > [16247.536647] R13: 0001 R14: 00ea5d00 R15: 
> > 
> > [16247.536653] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 
> > dns_resolver nfsv3 nfs fscache iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs 
> > intel_rapl_msr intel_rapl_common joydev isst_if_common ipmi_ssif i40iw 
> > ib_uverbs iTCO_wdt intel_pmc_bxt skx_edac ib_core hid_generic 
> > iTCO_vendor_support nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp 
> > coretemp kvm_intel acpi_ipmi kvm usbhid i2c_i801 mei_me i40e irqbypass 
> > efi_pstore pcspkr ipmi_si ioatdma i2c_smbus lpc_ich mei intel_pch_thermal 
> > dca ipmi_devintf ipmi_msghandler tiny_power_button acpi_pad button 
> > nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl auth_rpcgss lockd grace 
> > sunrpc fuse configfs nfs_ssc ast i2c_algo_bit drm_vram_helper 
> > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core 
> > drm_ttm_helper ttm xhci_pci xhci_pci_renesas drm xhci_hcd crct10dif_pclmul 
> > crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd 
> > cryptd usbcore wmi sg br_netfilter bridge stp llc dm_multipath dm_mod 
> > scsi_dh_rdac scsi_dh_emc
> > [16247.536758]  scsi_dh_alua msr efivarfs
> > [16247.536800] ---[ end trace e1493f55bf5b3a34 ]---
> > [16247.544126] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550
> > [16247.544140] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 01 
> > 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 00 
> > <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff
> > [16247.544148] RSP: :a18cb0af7a40 EFLAGS: 00010246
> > [16247.544153] RAX: 0036 RBX: 000d RCX: 
> > 8ef13fc9a748
> > [16247.544158] RDX:  RSI: 0027 RDI: 
> > 8ef13fc9a740
> > [16247.544162] RBP: 8eb2f9a02ef8 R08: 8ef23ffb48a8 R09: 
> > 0004fffb
> > [16247.544166] R10:  R11: 3fff R12: 
> > 1400
> > [16247.544170] R13: 8eb2f9a02f00 R14:  R15: 
> > d651b1978000
> > [16247.544175] FS:  7f97c1717740() GS:8ef13fc8() 
> > knlGS:
> > [16247.544180] CS:  0010 DS:  ES:  CR0: 80050033
> > [16247.544184] CR2: 7f97c0efec0d CR3: 0040aa3ac006 CR4: 
> > 007706e0
> > [16247.544188] DR0:  DR1:  DR2: 
> > 
> > [16247.544191] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [16247.544194] PKRU: 5554
> > [16247.546763] BUG: Bad rss-counter state mm:060c94f4 
> > type:MM_ANONPAGES val:8
> >
> >

-- 
 Kirill A. Shutemov


Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread Kirill A. Shutemov
On Mon, Mar 15, 2021 at 01:25:40PM +0100, David Hildenbrand wrote:
> On 15.03.21 13:22, Kirill A. Shutemov wrote:
> > On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:
> > > + case -EHWPOISON: /* Skip over any poisoned pages. */
> > > + start += PAGE_SIZE;
> > > + continue;
> > 
> > Why is it a good approach? It's not obvious to me.
> 
> My main motivation was to simplify return code handling. I don't want to
> return -EHWPOISON to user space

Why? Hiding the problem under the rug doesn't help anybody. SIGBUS later
is not better than an error upfront.

-- 
 Kirill A. Shutemov


Re: [PATCH RFCv2] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-15 Thread Kirill A. Shutemov
On Mon, Mar 08, 2021 at 05:45:20PM +0100, David Hildenbrand wrote:
> + case -EHWPOISON: /* Skip over any poisoned pages. */
> + start += PAGE_SIZE;
> + continue;

Why is it a good approach? It's not obvious to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v3] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-15 Thread Kirill A. Shutemov
On Thu, Mar 11, 2021 at 07:57:12PM -0500, Zi Yan wrote:
> From: Zi Yan 
> 
> We do not have a direct user interface of splitting the compound page
> backing a THP

But we do. You expand it.

> and there is no need unless we want to expose the THP
> implementation details to users. Make <debugfs>/split_huge_pages accept
> a new command to do that.
> 
> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
> <debugfs>/split_huge_pages, THPs within the given virtual address range
> from the process with the given pid are split. It is used to test
> split_huge_page function. In addition, a selftest program is added to
> tools/testing/selftests/vm to utilize the interface by splitting
> PMD THPs and PTE-mapped THPs.
> 

Okay, makes sense.

But it doesn't cover non-mapped THPs. tmpfs may have a file backed by a THP
that is mapped nowhere. Do we want to cover this case too?

Maybe have PID:<pid>,<vaddr_start>,<vaddr_end> and
FILE:<path>,<off_start>,<off_end> ?
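For example, the debugfs write handler could dispatch on the prefix roughly
like this (sketch only; split_huge_pages_pid() is a made-up helper name, and
the FILE: case would be parsed along the same lines):

static ssize_t split_huge_pages_write(struct file *file,
				      const char __user *ubuf,
				      size_t count, loff_t *ppos)
{
	char buf[256];
	unsigned long start, end;
	int pid, ret;

	if (count >= sizeof(buf))
		return -EINVAL;
	if (copy_from_user(buf, ubuf, count))
		return -EFAULT;
	buf[count] = '\0';

	if (!strncmp(buf, "PID:", 4)) {
		if (sscanf(buf + 4, "%d,0x%lx,0x%lx", &pid, &start, &end) != 3)
			return -EINVAL;
		ret = split_huge_pages_pid(pid, start, end);
		return ret ? ret : count;
	}

	/* FILE:<path>,<off_start>,<off_end> handled similarly */
	return -EINVAL;
}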

-- 
 Kirill A. Shutemov


Re: [PATCH v4 00/25] Page folios

2021-03-15 Thread Kirill A. Shutemov
On Sat, Mar 13, 2021 at 07:09:01PM -0800, Hugh Dickins wrote:
> On Sat, 13 Mar 2021, Andrew Morton wrote:
> > On Fri,  5 Mar 2021 04:18:36 + "Matthew Wilcox (Oracle)" 
> >  wrote:
> > 
> > > Our type system does not currently distinguish between tail pages and
> > > head or single pages.  This is a problem because we call compound_head()
> > > multiple times (and the compiler cannot optimise it out), bloating the
> > > kernel.  It also makes programming hard as it is often unclear whether
> > > a function operates on an individual page, or an entire compound page.
> > > 
> > > This patch series introduces the struct folio, which is a type that
> > > represents an entire compound page.  This initial set reduces the kernel
> > > size by approximately 6kB, although its real purpose is adding
> > > infrastructure to enable further use of the folio.
> > 
> > Geeze it's a lot of noise.  More things to remember and we'll forever
> > have a mismash of `page' and `folio' and code everywhere converting
> > from one to the other.  Ongoing addition of folio
> > accessors/manipulators to overlay the existing page
> > accessors/manipulators, etc.
> > 
> > It's unclear to me that it's all really worth it.  What feedback have
> > you seen from others?
> 
> My own feeling and feedback have been much like yours.
> 
> I don't get very excited by type safety at this level; and although
> I protested back when all those compound_head()s got tucked into the
> *PageFlag() functions, the text size increase was not very much, and
> I never noticed any adverse performance reports.
> 
> To me, it's distraction, churn and friction, ongoing for years; but
> that's just me, and I'm resigned to the possibility that it will go in.
> Matthew is not alone in wanting to pursue it: let others speak.

I'm with Matthew on this. I would really like to reduce the number of places
where we call compound_head(). I hope we can get rid of the page flag
policy hack I made.

-- 
 Kirill A. Shutemov


Re: [PATCH 0/5] Cleanup and fixup for khugepaged

2021-03-05 Thread Kirill A. Shutemov
On Thu, Mar 04, 2021 at 07:30:08AM -0500, Miaohe Lin wrote:
> Hi all,
> This series contains cleanups to remove unneeded return value, use
> helper function and so on. And there is one fix to correct the wrong
> result value for trace_mm_collapse_huge_page_isolate().
> 
> More details can be found in the respective changelogs. Thanks!
> 
> Miaohe Lin (5):
>   khugepaged: remove unneeded return value of
> khugepaged_collapse_pte_mapped_thps()
>   khugepaged: reuse the smp_wmb() inside __SetPageUptodate()
>   khugepaged: use helper khugepaged_test_exit() in __khugepaged_enter()
>   khugepaged: remove unnecessary mem_cgroup_uncharge() in
> collapse_[file|huge_page]
>   khugepaged: fix wrong result value for
> trace_mm_collapse_huge_page_isolate()
> 
>  mm/khugepaged.c | 47 ---
>  1 file changed, 20 insertions(+), 27 deletions(-)

Apart from patch 4/5, looks fine. For the rest, you can use:

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 4/5] khugepaged: remove unnecessary mem_cgroup_uncharge() in collapse_[file|huge_page]

2021-03-05 Thread Kirill A. Shutemov
On Thu, Mar 04, 2021 at 07:30:12AM -0500, Miaohe Lin wrote:
> Since commit 7ae88534cdd9 ("mm: move mem_cgroup_uncharge out of
> __page_cache_release()"), the mem_cgroup will be uncharged when hpage is
> freed. Uncharge mem_cgroup here is harmless but it looks confusing and
> buggy: if mem_cgroup charge failed, we will call mem_cgroup_uncharge()
> uncorrectly in error path because hpage is not IS_ERR_OR_NULL().
> 
> Signed-off-by: Miaohe Lin 

Hm. I'm not sure about this patch.

For !NUMA the page gets allocated and freed very early: in
khugepaged_do_scan(), and with this change mem_cgroup_charge() may get
called twice for two different mm_structs.

Is it safe?

-- 
 Kirill A. Shutemov


Re: [PATCH v2 3/3] mm: use PF_ONLY_HEAD for PG_active and PG_unevictable

2021-03-01 Thread Kirill A. Shutemov
On Fri, Feb 26, 2021 at 12:13:14PM +, Matthew Wilcox wrote:
> On Fri, Feb 26, 2021 at 02:17:18AM -0700, Yu Zhao wrote:
> > All places but one test, set or clear PG_active and PG_unevictable on
> > small or head pages. Use compound_head() explicitly for that singleton
> > so the rest can rid of redundant compound_head().
> 
> How do you know it's only one place?  I really wish you'd work with me
> on folios.  They make the compiler prove that it's not a tail page.

+1 to this.

The problem with compound_head() is systemic, and an ad-hoc solution for a few
page flags will only complicate the picture.

-- 
 Kirill A. Shutemov


Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE

2021-02-07 Thread Kirill A. Shutemov
On Sun, Feb 07, 2021 at 08:01:50AM -0800, Dave Hansen wrote:
> On 2/7/21 6:13 AM, Kirill A. Shutemov wrote:
> >>> +   /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM 
> >>> */
> >>> +   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
> >>> +
> >>> +   asm volatile(TDCALL
> >>> +   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), 
> >>> "=r"(r13),
> >>> + "=r"(r14), "=r"(r15)
> >>> +   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), 
> >>> "r"(r12),
> >>> + "r"(r13)
> >>> +   : );
> >> Some "+" constraints would make this simpler.  But I think you should
> >> factor the TDCALL helper out into its own function.
> > Factoring TDCALL out into a helper is tricky: different TDCALLs have
> > different lists of registers passed to the VMM.
> 
> Couldn't you just have one big helper that takes *all* the registers
> that get used in any TDVMCALL and sets all the rcx bits?  The users
> could just pass 0's for the things they don't use.
> 
> Then you've got the ugly inline asm in one place.  It also makes it
> harder to screw up the 'rcx' mask and end up passing registers you
> didn't want into a malicious VMM.

For now we only pass down R10-R15, but the interface allows passing down a
much wider set of registers, including XMM. How far do we want to take it?
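
To make the trade-off concrete, here is an untested sketch of the "one
big helper" variant, limited to R10-R15 and using "+r" constraints. It
assumes the TDCALL, TDVMCALL and TDVMCALL_STANDARD definitions from this
series:

struct tdvmcall_output {
	u64 r10, r11, r12, r13, r14, r15;
};

static long tdvmcall(u64 fn, u64 a12, u64 a13, u64 a14, u64 a15,
		     struct tdvmcall_output *out)
{
	register long r10 asm("r10") = TDVMCALL_STANDARD;
	register long r11 asm("r11") = fn;
	register long r12 asm("r12") = a12;
	register long r13 asm("r13") = a13;
	register long r14 asm("r14") = a14;
	register long r15 asm("r15") = a15;
	/* Expose only R10-R15 to the VMM */
	register long rcx asm("rcx") = BIT(10) | BIT(11) | BIT(12) |
				       BIT(13) | BIT(14) | BIT(15);
	long ret;

	asm volatile(TDCALL
		     : "=a"(ret), "+r"(r10), "+r"(r11), "+r"(r12),
		       "+r"(r13), "+r"(r14), "+r"(r15)
		     : "a"(TDVMCALL), "r"(rcx)
		     : "memory");

	if (out) {
		out->r10 = r10;
		out->r11 = r11;
		out->r12 = r12;
		out->r13 = r13;
		out->r14 = r14;
		out->r15 = r15;
	}
	return ret;
}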

-- 
 Kirill A. Shutemov


Re: [PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390

2021-02-07 Thread Kirill A. Shutemov
On Fri, Feb 05, 2021 at 05:06:20PM -0600, Seth Forshee wrote:
> This feature requires ino_t be 64-bits, which is true for every
> 64-bit architecture but s390, so prevent this option from being
> selected there.

A quick grep suggests the same for alpha. Am I wrong?

-- 
 Kirill A. Shutemov


Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE

2021-02-07 Thread Kirill A. Shutemov
On Fri, Feb 05, 2021 at 03:42:01PM -0800, Andy Lutomirski wrote:
> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
>  wrote:
> >
> > From: "Kirill A. Shutemov" 
> >
> > TDX has three classes of CPUID leaves: some CPUID leaves
> > are always handled by the CPU, others are handled by the TDX module,
> > and some others are handled by the VMM. Since the VMM cannot directly
> > intercept the instruction these are reflected with a #VE exception
> > to the guest, which then converts it into a TDCALL to the VMM,
> > or handled directly.
> >
> > The TDX module EAS has a full list of CPUID leaves which are handled
> > natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by
> > the #VE method. In practice this typically only applies to the
> > hypervisor specific CPUIDs unknown to the native CPU.
> >
> > Therefore there is no risk of causing this in early CPUID code which
> > runs before the #VE handler is set up because it will never access
> > those exotic CPUID leaves.
> >
> > Signed-off-by: Kirill A. Shutemov 
> > Reviewed-by: Andi Kleen 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
> > ---
> >  arch/x86/kernel/tdx.c | 32 
> >  1 file changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> > index 5d961263601e..e98058c048b5 100644
> > --- a/arch/x86/kernel/tdx.c
> > +++ b/arch/x86/kernel/tdx.c
> > @@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, 
> > unsigned int low,
> > return ret || r10 ? -EIO : 0;
> >  }
> >
> > +static void tdx_handle_cpuid(struct pt_regs *regs)
> > +{
> > +   register long r10 asm("r10") = TDVMCALL_STANDARD;
> > +   register long r11 asm("r11") = EXIT_REASON_CPUID;
> > +   register long r12 asm("r12") = regs->ax;
> > +   register long r13 asm("r13") = regs->cx;
> > +   register long r14 asm("r14");
> > +   register long r15 asm("r15");
> > +   register long rcx asm("rcx");
> > +   long ret;
> > +
> > +   /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
> > +   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
> > +
> > +   asm volatile(TDCALL
> > +   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), 
> > "=r"(r13),
> > + "=r"(r14), "=r"(r15)
> > +   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), 
> > "r"(r12),
> > + "r"(r13)
> > +   : );
> 
> Some "+" constraints would make this simpler.  But I think you should
> factor the TDCALL helper out into its own function.

Factoring TDCALL out into a helper is tricky: different TDCALLs have
different lists of registers passed to the VMM.

-- 
 Kirill A. Shutemov


Re: [RFC 0/9] Linear Address Masking enabling

2021-02-07 Thread Kirill A. Shutemov
On Sun, Feb 07, 2021 at 09:24:23AM +0100, Dmitry Vyukov wrote:
> On Fri, Feb 5, 2021 at 4:16 PM Kirill A. Shutemov
>  wrote:
> >
> > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > 64-bit linear addresses, allowing software to use of the untranslated
> > address bits for metadata.
> >
> > The patchset brings support for LAM for userspace addresses.
> >
> > The most sensitive part of enabling is change in tlb.c, where CR3 flags
> > get set. Please take a look that what I'm doing makes sense.
> >
> > The patchset is RFC quality and the code requires more testing before it
> > can be applied.
> >
> > The userspace API is not finalized yet. The patchset extends API used by
> > ARM64: PR_GET/SET_TAGGED_ADDR_CTRL. The API is adjusted to not imply ARM
> > TBI: it now allows to request a number of bits of metadata needed and
> > report where these bits are located in the address.
> >
> > There's an alternative proposal[2] for the API based on Intel CET
> > interface. Please let us know if you prefer one over another.
> >
> > The feature competes for bits with 5-level paging: LAM_U48 makes it
> > impossible to map anything above 47 bits. The patchset makes these
> > capabilities mutually exclusive: whichever is used first wins. LAM_U57 can
> > be combined with mappings above 47 bits.
> >
> > I include a QEMU patch in case somebody wants to play with the feature.
> 
> Exciting! Do you plan to send the QEMU patch to QEMU?

Sure. After more testing, once I'm sure it conforms to the hardware.

-- 
 Kirill A. Shutemov


Re: [RFC 9/9] x86/mm: Implement PR_SET/GET_TAGGED_ADDR_CTRL with LAM

2021-02-07 Thread Kirill A. Shutemov
On Sun, Feb 07, 2021 at 09:07:02AM +0100, Dmitry Vyukov wrote:
> On Fri, Feb 5, 2021 at 4:43 PM H.J. Lu  wrote:
> >
> > On Fri, Feb 5, 2021 at 7:16 AM Kirill A. Shutemov
> >  wrote:
> > >
> > > Provide a prctl() interface to enable LAM for user addresses. Depending
> > > on how many tag bits are requested, it may result in enabling LAM_U57 or
> > > LAM_U48.
> >
> > I prefer the alternate kernel interface based on CET arch_prctl interface 
> > which
> > is implemented in glibc on users/intel/lam/master branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/intel/lam/master
> >
> > and in GCC on users/intel/lam/master branch:
> >
> > https://gitlab.com/x86-gcc/gcc/-/tree/users/intel/lam/master
> 
> Hi Kirill, H.J.,
> 
> I don't have strong preference for PR_SET/GET_TAGGED_ADDR_CTRL vs
> ARCH_X86_FEATURE_1_ENABLE itself, but tying LAM to ELF and
> GNU_PROPERTY in the second option looks strange. LAM can be used
> outside of ELF/GNU, right?

Sure. In both cases it's still a syscall.

-- 
 Kirill A. Shutemov


[QEMU] x86: Implement Linear Address Masking support

2021-02-05 Thread Kirill A. Shutemov
The Linear Address Masking feature makes the CPU ignore some bits of the
virtual address. These bits can be used to encode metadata.

The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].

CR3.LAM_U57[bit 62] allows encoding 6 bits of metadata in bits 62:57 of
user pointers.

CR3.LAM_U48[bit 61] allows encoding 15 bits of metadata in bits 62:48
of user pointers.

CR4.LAM_SUP[bit 28] allows encoding metadata in supervisor pointers.
If 5-level paging is in use, 6 bits of metadata can be encoded in bits 62:57.
For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.

QEMU strips the metadata bits from the address and brings it to canonical
shape before handling the memory access. This has to be done very early,
before the TLB lookup.
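
For user pointers (bit 63 clear) the stripping amounts to clearing the
metadata bits. A standalone illustration, with hypothetical names and
constants that are not part of the clean_addr()/do_clean_addr() hook
below:

#include <stdbool.h>
#include <stdint.h>

/* Illustration only: canonicalize a user pointer under LAM. */
static uint64_t lam_clean_user_addr(uint64_t addr, bool lam_u57, bool lam_u48)
{
    if (lam_u57)
        return addr & ~(0x3fULL << 57);   /* drop metadata bits 62:57 */
    if (lam_u48)
        return addr & ~(0x7fffULL << 48); /* drop metadata bits 62:48 */
    return addr;
}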

Signed-off-by: Kirill A. Shutemov 
---
 accel/tcg/cputlb.c| 54 +++
 include/hw/core/cpu.h |  1 +
 target/i386/cpu.c |  5 ++--
 target/i386/cpu.h |  7 +
 target/i386/excp_helper.c | 28 +++-
 target/i386/helper.c  |  2 +-
 6 files changed, 71 insertions(+), 26 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 42ab79c1a582..f2d27134474f 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1271,6 +1271,17 @@ static inline ram_addr_t 
qemu_ram_addr_from_host_nofail(void *ptr)
 return ram_addr;
 }
 
+static vaddr clean_addr(CPUState *cpu, vaddr addr)
+{
+CPUClass *cc = CPU_GET_CLASS(cpu);
+
+if (cc->do_clean_addr) {
+addr = cc->do_clean_addr(cpu, addr);
+}
+
+return addr;
+}
+
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
@@ -1702,9 +1713,11 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, 
int mmu_idx,
 
 /* Probe for a read-modify-write atomic operation.  Do not allow unaligned
  * operations, or io operations to proceed.  Return the host address.  */
-static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
+static void *atomic_mmu_lookup(CPUArchState *env, target_ulong address,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
+CPUState *cpu = env_cpu(env);
+target_ulong addr = clean_addr(cpu, address);
 size_t mmu_idx = get_mmuidx(oi);
 uintptr_t index = tlb_index(env, mmu_idx, addr);
 CPUTLBEntry *tlbe = tlb_entry(env, mmu_idx, addr);
@@ -1720,8 +1733,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, 
target_ulong addr,
 /* Enforce guest required alignment.  */
 if (unlikely(a_bits > 0 && (addr & ((1 << a_bits) - 1 {
 /* ??? Maybe indicate atomic op to cpu_unaligned_access */
-cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
+cpu_unaligned_access(cpu, addr, MMU_DATA_STORE, mmu_idx, retaddr);
 }
 
 /* Enforce qemu required alignment.  */
@@ -1736,8 +1748,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, 
target_ulong addr,
 /* Check TLB entry and enforce page permissions.  */
 if (!tlb_hit(tlb_addr, addr)) {
 if (!VICTIM_TLB_HIT(addr_write, addr)) {
-tlb_fill(env_cpu(env), addr, 1 << s_bits, MMU_DATA_STORE,
- mmu_idx, retaddr);
+tlb_fill(cpu, addr, 1 << s_bits, MMU_DATA_STORE, mmu_idx, retaddr);
 index = tlb_index(env, mmu_idx, addr);
 tlbe = tlb_entry(env, mmu_idx, addr);
 }
@@ -1753,8 +1764,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, 
target_ulong addr,
 
 /* Let the guest notice RMW on a write-only page.  */
 if (unlikely(tlbe->addr_read != (tlb_addr & ~TLB_NOTDIRTY))) {
-tlb_fill(env_cpu(env), addr, 1 << s_bits, MMU_DATA_LOAD,
- mmu_idx, retaddr);
+tlb_fill(cpu, addr, 1 << s_bits, MMU_DATA_LOAD, mmu_idx, retaddr);
 /* Since we don't support reads and writes to different addresses,
and we do have the proper page loaded for write, this shouldn't
ever return.  But just in case, handle via stop-the-world.  */
@@ -1764,14 +1774,14 @@ static void *atomic_mmu_lookup(CPUArchState *env, 
target_ulong addr,
 hostaddr = (void *)((uintptr_t)addr + tlbe->addend);
 
 if (unlikely(tlb_addr & TLB_NOTDIRTY)) {
-notdirty_write(env_cpu(env), addr, 1 << s_bits,
+notdirty_write(cpu, addr, 1 << s_bits,
_tlb(env)->d[mmu_idx].iotlb[index], retaddr);
 }
 
 return hostaddr;
 
  stop_the_world:
-cpu_loop_exit_atomic(env_cpu(env), retaddr);
+cpu_loop_exit_atomic(cpu, retaddr);
 }
 
 /*
@@ -1810,10 +1820,12 @@ load_memop(const void *haddr, MemOp op)
 }
 
 static inline uint64_t QEMU_ALWAYS_INLINE
-load_helper(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
+load_helper(CPUArchState *env, target_ulong address, TCGMemOpIdx oi,
 uin

[RFC 8/9] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive

2021-02-05 Thread Kirill A. Shutemov
LAM_U48 steals the bits above bit 47 for tags and makes it impossible for
userspace to use the full address space on a 5-level paging machine.

Make these features mutually exclusive: whichever gets enabled first
blocks the other one.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/elf.h |  3 ++-
 arch/x86/include/asm/mmu_context.h | 13 +
 arch/x86/kernel/sys_x86_64.c   |  5 +++--
 arch/x86/mm/hugetlbpage.c  |  6 --
 arch/x86/mm/mmap.c |  9 -
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index b9a5d488f1a5..8ba405dfb861 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -326,7 +326,8 @@ static inline int mmap_is_ia32(void)
 extern unsigned long task_size_32bit(void);
 extern unsigned long task_size_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
-extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
+extern bool mmap_address_hint_valid(struct mm_struct *mm,
+   unsigned long addr, unsigned long len);
 
 #ifdef CONFIG_X86_32
 
diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index d98016b83755..e338beb031b2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -214,4 +214,17 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
 
 unsigned long __get_current_cr3_fast(void);
 
+#ifdef CONFIG_X86_5LEVEL
+static inline bool full_va_allowed(struct mm_struct *mm)
+{
+   /* LAM_U48 steals VA bits above bit 47 for tags */
+   return mm->context.lam != LAM_U48;
+}
+#else
+static inline bool full_va_allowed(struct mm_struct *mm)
+{
+   return false;
+}
+#endif
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 504fa5425bce..b5d0afee9b0f 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -189,7 +190,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
/* requesting a specific address */
if (addr) {
addr &= PAGE_MASK;
-   if (!mmap_address_hint_valid(addr, len))
+   if (!mmap_address_hint_valid(mm, addr, len))
goto get_unmapped_area;
 
vma = find_vma(mm, addr);
@@ -210,7 +211,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
 * !in_32bit_syscall() check to avoid high addresses for x32
 * (and make it no op on native i386).
 */
-   if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
+   if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall() && 
full_va_allowed(mm))
info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
 
info.align_mask = 0;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index a0d023cb4292..9fdc8db42365 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if 0  /* This is just for testing */
 struct page *
@@ -103,6 +104,7 @@ static unsigned long 
hugetlb_get_unmapped_area_topdown(struct file *file,
unsigned long pgoff, unsigned long flags)
 {
struct hstate *h = hstate_file(file);
+   struct mm_struct *mm = current->mm;
struct vm_unmapped_area_info info;
 
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
@@ -114,7 +116,7 @@ static unsigned long 
hugetlb_get_unmapped_area_topdown(struct file *file,
 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
 * in the full address space.
 */
-   if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
+   if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall() && 
full_va_allowed(mm))
info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
 
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
@@ -161,7 +163,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long 
addr,
 
if (addr) {
addr &= huge_page_mask(h);
-   if (!mmap_address_hint_valid(addr, len))
+   if (!mmap_address_hint_valid(mm, addr, len))
goto get_unmapped_area;
 
vma = find_vma(mm, addr);
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f9ca824729de 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "physaddr.h"
 
@@ -35,6 +36,8 @@ unsigned long task_size_32bit(void)
 
 unsigned long task_size_64bit(int full_addr_space)

Re: [RFC 0/9] Linear Address Masking enabling

2021-02-05 Thread Kirill A. Shutemov
On Fri, Feb 05, 2021 at 04:49:05PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 05, 2021 at 06:16:20PM +0300, Kirill A. Shutemov wrote:
> > The feature competes for bits with 5-level paging: LAM_U48 makes it
> > impossible to map anything about 47-bits. The patchset made these
> > capability mutually exclusive: whatever used first wins. LAM_U57 can be
> > combined with mappings above 47-bits.
> 
> And I suppose we still can't switch between 4 and 5 level at runtime,
> using a CR3 bit?

No. And I can't imagine how would it work with 5-level on kernel side.

-- 
 Kirill A. Shutemov


[RFC 4/9] x86/mm: Introduce TIF_LAM_U57 and TIF_LAM_U48

2021-02-05 Thread Kirill A. Shutemov
The new thread flags indicate that the thread has Linear Address Masking
enabled.

switch_mm_irqs_off() now respects these flags and sets CR3 accordingly.

The active LAM mode gets recorded in the tlb_state.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/thread_info.h |  9 ++-
 arch/x86/include/asm/tlbflush.h|  5 ++
 arch/x86/mm/tlb.c  | 96 +-
 3 files changed, 93 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 44733a4bfc42..e2ae8fcb3492 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC  16  /* TSC is not accessible in userland */
 #define TIF_IA32   17  /* IA32 compatibility process */
 #define TIF_SLD18  /* Restore split lock detection 
on context switch */
+#define TIF_X3219  /* 32-bit native x86-64 binary 
*/
 #define TIF_MEMDIE 20  /* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
*/
 #define TIF_IO_BITMAP  22  /* uses I/O bitmap */
@@ -101,7 +102,9 @@ struct thread_info {
 #define TIF_LAZY_MMU_UPDATES   27  /* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
-#define TIF_X3230  /* 32-bit native x86-64 binary 
*/
+#define TIF_LAM_U5730  /* LAM for userspace pointers, 6 bits */
+#define TIF_LAM_U4831  /* LAM for userspace pointers, 15 bits 
*/
+
 
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -122,6 +125,7 @@ struct thread_info {
 #define _TIF_NOTSC (1 << TIF_NOTSC)
 #define _TIF_IA32  (1 << TIF_IA32)
 #define _TIF_SLD   (1 << TIF_SLD)
+#define _TIF_X32   (1 << TIF_X32)
 #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
@@ -129,7 +133,8 @@ struct thread_info {
 #define _TIF_LAZY_MMU_UPDATES  (1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
-#define _TIF_X32   (1 << TIF_X32)
+#define _TIF_LAM_U57   (1 << TIF_LAM_U57)
+#define _TIF_LAM_U48   (1 << TIF_LAM_U48)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE   \
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..7e124fd71a67 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,10 @@ void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL  -1UL
 
+#define LAM_NONE   0
+#define LAM_U571
+#define LAM_U482
+
 void cr4_update_irqsoff(unsigned long set, unsigned long clear);
 unsigned long cr4_read_shadow(void);
 
@@ -88,6 +92,7 @@ struct tlb_state {
 
u16 loaded_mm_asid;
u16 next_asid;
+   u8 lam;
 
/*
 * We can be in one of several states:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d57f55..138d4748aa97 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -145,17 +145,73 @@ static inline u16 user_pcid(u16 asid)
return ret;
 }
 
-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+#ifdef CONFIG_X86_64
+static inline unsigned long lam_to_cr3(u8 lam)
+{
+   switch (lam) {
+   case LAM_NONE:
+   return 0;
+   case LAM_U57:
+   return X86_CR3_LAM_U57;
+   case LAM_U48:
+   return X86_CR3_LAM_U48;
+   default:
+   WARN_ON_ONCE(1);
+   return 0;
+   }
+}
+
+static inline u8 cr3_to_lam(unsigned long cr3)
+{
+   if (cr3 & X86_CR3_LAM_U57)
+   return LAM_U57;
+   if (cr3 & X86_CR3_LAM_U48)
+   return LAM_U48;
+   return 0;
+}
+
+static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
+{
+   struct thread_info *ti = task_thread_info(tsk);
+   if (!tsk)
+   return LAM_NONE;
+
+   if (test_ti_thread_flag(ti, TIF_LAM_U57))
+   return LAM_U57;
+   if (test_ti_thread_flag(ti, TIF_LAM_U48))
+   return LAM_U48;
+   return LAM_NONE;
+}
+
+#else
+
+static inline unsigned long lam_to_cr3(u8 lam)
+{
+   return 0;
+}
+
+static inline u8 cr3_to_lam(unsigned long cr3)
+{
+   return LAM_NONE;
+}
+
+static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
+{
+  

[RFC 2/9] x86/mm: Fix CR3_ADDR_MASK

2021-02-05 Thread Kirill A. Shutemov
The mask must not include bits above the physical address mask. These bits
are reserved and can be used for other things. Bits 61 and 62 are used
for Linear Address Masking.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/processor-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor-flags.h 
b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..a7f3d9100adb 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,7 +35,7 @@
  */
 #ifdef CONFIG_X86_64
 /* Mask off the address space ID and SME encryption bits. */
-#define CR3_ADDR_MASK  __sme_clr(0x7000ull)
+#define CR3_ADDR_MASK  __sme_clr(PHYSICAL_PAGE_MASK)
 #define CR3_PCID_MASK  0xFFFull
 #define CR3_NOFLUSHBIT_ULL(63)
 
-- 
2.26.2



[RFC 6/9] x86/uaccess: Remove tags from the address before checking

2021-02-05 Thread Kirill A. Shutemov
The tags must not be included in the check of whether it's okay to access
the userspace address.

__chk_range_not_ok() is at the core of access_ok() and it's handy to
strip the tags there.

get_user() and put_user() don't use access_ok(), but check the access
against TASK_SIZE directly in assembly. Strip the tags before calling into
the assembly helper.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/uaccess.h | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index c9fa7be3df82..ee0a482b2f1f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -18,6 +18,9 @@
  */
 static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, 
unsigned long limit)
 {
+   /* Remove tags from the address before comparing to the limit */
+   addr = untagged_addr(addr);
+
/*
 * If we have used "sizeof()" for the size,
 * we know it won't overflow the limit (but
@@ -152,7 +155,12 @@ extern int __get_user_bad(void);
  * Return: zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define get_user(x,ptr) ({ might_fault(); do_get_user_call(get_user,x,ptr); })
+#define get_user(x,ptr)
\
+({ \
+   __typeof__(*(ptr)) *__ptr_clean = untagged_ptr(ptr);\
+   might_fault();  \
+   do_get_user_call(get_user,x,__ptr_clean);   \
+})
 
 /**
  * __get_user - Get a simple variable from user space, with less checking.
@@ -249,7 +257,11 @@ extern void __put_user_nocheck_8(void);
  *
  * Return: zero on success, or -EFAULT on error.
  */
-#define put_user(x, ptr) ({ might_fault(); do_put_user_call(put_user,x,ptr); })
+#define put_user(x, ptr) ({\
+   __typeof__(*(ptr)) *__ptr_clean = untagged_ptr(ptr);\
+   might_fault();  \
+   do_put_user_call(put_user,x,__ptr_clean);   \
+})
 
 /**
  * __put_user - Write a simple value into user space, with less checking.
-- 
2.26.2



[RFC 1/9] mm, arm64: Update PR_SET/GET_TAGGED_ADDR_CTRL interface

2021-02-05 Thread Kirill A. Shutemov
The interface for enabling tagged addresses is very inflexible. It
implies the tag size and tag shift implemented by ARM TBI.

Rework the interface to accommodate different shifts and tag sizes.

PR_SET_TAGGED_ADDR_CTRL now accepts two new arguments:

 - nr_bits is a pointer to int. The caller specifies the tag size it
   wants. The kernel updates the value with the actual tag size, which
   can be larger.

 - offset is a pointer to int. The kernel returns the shift of the tag
   within the address there.

The change doesn't break existing users of the interface: if any of
these pointers is NULL (as before the change), the user expects the
ARM TBI implementation: nr_bits == 8 && offset == 56, as was implied
before.

The initial implementation checked that these arguments are NULL, so the
change wouldn't break any legacy users.

If tagging is enabled, GET_TAGGED_ADDR_CTRL returns the tag size and
offset in the additional arguments.

If tagging is disabled, GET_TAGGED_ADDR_CTRL returns the maximum tag
size in nr_bits.

The selftest is updated accordingly and moved out of the arm64-specific
directory, as we are going to enable the interface on x86.

As an alternative to this approach we could introduce a totally new API
and leave the legacy one as is. But it would slow down adoption: a new
prctl(2) flag would need to propagate to the userspace headers.
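
For illustration, a userspace consumer of the extended interface could
look like the sketch below. The mapping of the prctl() arguments to
(flags, &nr_bits, &offset) is an assumption based on the
SET_TAGGED_ADDR_CTRL() macro in this patch, not a finalized ABI:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL	55
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#endif

int main(void)
{
	int nr_bits = 6, offset = 0;

	if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
		  &nr_bits, &offset, 0))
		return 1;

	/* The kernel may grant more bits than requested. */
	printf("got %d tag bits at offset %d\n", nr_bits, offset);
	return 0;
}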

Signed-off-by: Kirill A. Shutemov 
---
 arch/arm64/include/asm/processor.h| 12 ++--
 arch/arm64/kernel/process.c   | 45 ---
 arch/arm64/kernel/ptrace.c|  4 +-
 kernel/sys.c  | 14 +++--
 .../testing/selftests/arm64/tags/tags_test.c  | 31 --
 .../selftests/{arm64 => vm}/tags/.gitignore   |  0
 .../selftests/{arm64 => vm}/tags/Makefile |  0
 .../{arm64 => vm}/tags/run_tags_test.sh   |  0
 tools/testing/selftests/vm/tags/tags_test.c   | 57 +++
 9 files changed, 113 insertions(+), 50 deletions(-)
 delete mode 100644 tools/testing/selftests/arm64/tags/tags_test.c
 rename tools/testing/selftests/{arm64 => vm}/tags/.gitignore (100%)
 rename tools/testing/selftests/{arm64 => vm}/tags/Makefile (100%)
 rename tools/testing/selftests/{arm64 => vm}/tags/run_tags_test.sh (100%)
 create mode 100644 tools/testing/selftests/vm/tags/tags_test.c

diff --git a/arch/arm64/include/asm/processor.h 
b/arch/arm64/include/asm/processor.h
index fce8cbecd6bc..77b91e6d3c85 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -305,10 +305,14 @@ extern void __init minsigstksz_setup(void);
 
 #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
 /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
-long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg);
-long get_tagged_addr_ctrl(struct task_struct *task);
-#define SET_TAGGED_ADDR_CTRL(arg)  set_tagged_addr_ctrl(current, arg)
-#define GET_TAGGED_ADDR_CTRL() get_tagged_addr_ctrl(current)
+long set_tagged_addr_ctrl(struct task_struct *task, unsigned long flags,
+ int __user *nr_bits, int __user *offset);
+long get_tagged_addr_ctrl(struct task_struct *task,
+ int __user *nr_bits, int __user *offset);
+#define SET_TAGGED_ADDR_CTRL(flags, nr_bits, offset)   \
+   set_tagged_addr_ctrl(current, flags, nr_bits, offset)
+#define GET_TAGGED_ADDR_CTRL(nr_bits, offset)  \
+   get_tagged_addr_ctrl(current, nr_bits, offset)
 #endif
 
 /*
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index ed919f633ed8..a3007f80e889 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -630,15 +630,21 @@ void arch_setup_new_exec(void)
 }
 
 #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
+
+#define TBI_TAG_BITS   8
+#define TBI_TAG_SHIFT  56
+
 /*
  * Control the relaxed ABI allowing tagged user addresses into the kernel.
  */
 static unsigned int tagged_addr_disabled;
 
-long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg)
+long set_tagged_addr_ctrl(struct task_struct *task, unsigned long flags,
+ int __user *nr_bits, int __user *offset)
 {
unsigned long valid_mask = PR_TAGGED_ADDR_ENABLE;
struct thread_info *ti = task_thread_info(task);
+   int val;
 
if (is_compat_thread(ti))
return -EINVAL;
@@ -646,25 +652,41 @@ long set_tagged_addr_ctrl(struct task_struct *task, 
unsigned long arg)
if (system_supports_mte())
valid_mask |= PR_MTE_TCF_MASK | PR_MTE_TAG_MASK;
 
-   if (arg & ~valid_mask)
+   if (flags & ~valid_mask)
return -EINVAL;
 
+   if (nr_bits) {
+   if (get_user(val, nr_bits))
+   return -EFAULT;
+   if (val > TBI_TAG_BITS || val < 1)
+   return -EINVAL;
+   }
+
/*
 * Do not allow the enabling of the tagged address ABI if globally
 * disabled via sysctl abi.tagged_ad

[RFC 3/9] x86: CPUID and CR3/CR4 flags for Linear Address Masking

2021-02-05 Thread Kirill A. Shutemov
Enumerate Linear Address Masking and provide defines for CR3 and CR4
flags.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/cpufeatures.h  | 1 +
 arch/x86/include/uapi/asm/processor-flags.h | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..3ae25d88216e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -293,6 +293,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX512_BF16(12*32+ 5) /* AVX512 BFLOAT16 
instructions */
+#define X86_FEATURE_LAM(12*32+26) /* Linear Address 
Masking */
 
 /* AMD-defined CPU features, CPUID level 0x8008 (EBX), word 13 */
 #define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..f2a4a53308e2 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -82,6 +82,10 @@
 #define X86_CR3_PCID_BITS  12
 #define X86_CR3_PCID_MASK  (_AC((1UL << X86_CR3_PCID_BITS) - 1, UL))
 
+#define X86_CR3_LAM_U48_BIT61 /* Activate LAM for userspace, 62:48 bits 
masked */
+#define X86_CR3_LAM_U48_BITULL(X86_CR3_LAM_U48_BIT)
+#define X86_CR3_LAM_U57_BIT62 /* Activate LAM for userspace, 62:57 bits 
masked */
+#define X86_CR3_LAM_U57_BITULL(X86_CR3_LAM_U57_BIT)
 #define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
 #define X86_CR3_PCID_NOFLUSH_BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
@@ -130,6 +134,8 @@
 #define X86_CR4_SMAP   _BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT22 /* enable Protection Keys support */
 #define X86_CR4_PKE_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_LAM_SUP_BIT28 /* LAM for supervisor pointers */
+#define X86_CR4_LAM_SUP_BITUL(X86_CR4_LAM_SUP_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
-- 
2.26.2



[RFC 5/9] x86/mm: Provide untagged_addr() helper

2021-02-05 Thread Kirill A. Shutemov
The helper is used by the core-mm to strip the tag bits and bring the
address to canonical shape. It only handles userspace addresses.

For LAM, the address gets sanitized according to the thread flags.
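
For example (hypothetical values): with LAM_U57 and a tag of 0x2a placed
in bits 62:57, the pointer 0x00007fffd0001234 becomes 0x54007fffd0001234.
Bit 63 is clear, so untagged_addr() sign-extends from bit 56 (also clear
here) and the AND gives back 0x00007fffd0001234.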

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/page_32.h |  3 +++
 arch/x86/include/asm/page_64.h | 19 +++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 94dbd51df58f..a829c46ab977 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,9 @@ extern unsigned long __phys_addr(unsigned long);
 #define __phys_addr_symbol(x)  __phys_addr(x)
 #define __phys_reloc_hide(x)   RELOC_HIDE((x), 0)
 
+#define untagged_addr(addr)(addr)
+#define untagged_ptr(ptr)  (ptr)
+
 #ifdef CONFIG_FLATMEM
 #define pfn_valid(pfn) ((pfn) < max_mapnr)
 #endif /* CONFIG_FLATMEM */
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..67cb434efdf6 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,25 @@ static inline void clear_page(void *page)
 
 void copy_page(void *to, void *from);
 
+#define __untagged_addr(addr, n)   \
+   ((__force __typeof__(addr))sign_extend64((__force u64)(addr), n))
+
+#define untagged_addr(addr)({  \
+   u64 __addr = (__force u64)(addr);   \
+   if (__addr >> 63 == 0) {\
+   if (test_thread_flag(TIF_LAM_U57))  \
+   __addr &= __untagged_addr(__addr, 56);  \
+   else if (test_thread_flag(TIF_LAM_U48)) \
+   __addr &= __untagged_addr(__addr, 47);  \
+   }   \
+   (__force __typeof__(addr))__addr;   \
+})
+
+#define untagged_ptr(ptr)  ({  \
+   u64 __ptrval = (__force u64)(ptr);  \
+   __ptrval = untagged_addr(__ptrval); \
+   (__force __typeof__(*(ptr)) *)__ptrval; \
+})
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
-- 
2.26.2



[RFC 9/9] x86/mm: Implement PR_SET/GET_TAGGED_ADDR_CTRL with LAM

2021-02-05 Thread Kirill A. Shutemov
Provide a prctl() interface to enable LAM for user addresses. Depending
on how many tag bits are requested, it may result in enabling LAM_U57 or
LAM_U48.

If LAM_U48 is enabled, the process is no longer able to use the full
address space on a 5-level paging machine and gets limited to a 47-bit VA.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/processor.h |  10 +++
 arch/x86/kernel/process_64.c | 145 +++
 2 files changed, 155 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..49fac2cc4329 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -810,6 +810,16 @@ extern void start_thread(struct pt_regs *regs, unsigned 
long new_ip,
 extern int get_tsc_mode(unsigned long adr);
 extern int set_tsc_mode(unsigned int val);
 
+#ifdef CONFIG_X86_64
+long set_tagged_addr_ctrl(unsigned long flags,
+ int __user *nr_bits, int __user *offset);
+long get_tagged_addr_ctrl(int __user *nr_bits, int __user *offset);
+#define SET_TAGGED_ADDR_CTRL(flags, nr_bits, offset)   \
+   set_tagged_addr_ctrl(flags, nr_bits, offset)
+#define GET_TAGGED_ADDR_CTRL(nr_bits, offset)  \
+   get_tagged_addr_ctrl(nr_bits, offset)
+#endif
+
 DECLARE_PER_CPU(u64, msr_misc_features_shadow);
 
 #ifdef CONFIG_CPU_SUP_AMD
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index df342bedea88..99b87f0e1bc7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -837,3 +837,148 @@ unsigned long KSTK_ESP(struct task_struct *task)
 {
return task_pt_regs(task)->sp;
 }
+
+/*
+ * Control the relaxed ABI allowing tagged user addresses into the kernel.
+ */
+static unsigned int tagged_addr_disabled;
+
+static bool lam_u48_allowed(void)
+{
+   struct mm_struct *mm = current->mm;
+
+   if (!full_va_allowed(mm))
+   return true;
+
+   return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
+}
+
+#define LAM_U48_BITS 15
+#define LAM_U57_BITS 6
+
+long set_tagged_addr_ctrl(unsigned long flags,
+ int __user *nr_bits, int __user *offset)
+{
+   int val;
+
+   if (in_32bit_syscall())
+   return -EINVAL;
+   if (flags & ~PR_TAGGED_ADDR_ENABLE)
+   return -EINVAL;
+   if (!boot_cpu_has(X86_FEATURE_LAM))
+   return -ENOTSUPP;
+
+   /* Disable LAM */
+   if (!(flags & PR_TAGGED_ADDR_ENABLE)) {
+   clear_thread_flag(TIF_LAM_U48);
+   clear_thread_flag(TIF_LAM_U57);
+
+   /* Update CR3 */
+   switch_mm(current->mm, current->mm, current);
+
+   return 0;
+   }
+
+   /*
+* nr_bits == NULL || offset == NULL assumes ARM TBI (nr_bits == 8,
+* offset == 56). LAM cannot provide this.
+*/
+   if (!nr_bits || !offset)
+   return -EINVAL;
+
+   /*
+* Do not allow the enabling of the tagged address ABI if globally
+* disabled via sysctl abi.tagged_addr_disabled.
+*/
+   if (tagged_addr_disabled)
+   return -EINVAL;
+
+   if (get_user(val, nr_bits))
+   return -EFAULT;
+   if (val > LAM_U48_BITS || val < 1)
+   return -EINVAL;
+   if (val > LAM_U57_BITS && !lam_u48_allowed())
+   return -EINVAL;
+
+   val = val > LAM_U57_BITS ? LAM_U48_BITS : LAM_U57_BITS;
+   if (put_user(val, nr_bits) || put_user(63 - val, offset))
+   return -EFAULT;
+
+   if (val == LAM_U57_BITS) {
+   clear_thread_flag(TIF_LAM_U48);
+   set_thread_flag(TIF_LAM_U57);
+   if (current->mm->context.lam == LAM_NONE)
+   current->mm->context.lam = LAM_U57;
+   } else {
+   clear_thread_flag(TIF_LAM_U57);
+   set_thread_flag(TIF_LAM_U48);
+
+   /*
+* Do not allow creating a mapping above 47 bits.
+*
+* It's a one-way road: once a thread of the process has enabled
+* LAM_U48, no thread can ever create a mapping above 47 bits,
+* even if LAM gets disabled later.
+*/
+   current->mm->context.lam = LAM_U48;
+   }
+
+   /* Update CR3 */
+   switch_mm(current->mm, current->mm, current);
+
+   return 0;
+}
+
+long get_tagged_addr_ctrl(int __user *nr_bits, int __user *offset)
+{
+   if (in_32bit_syscall())
+   return -EINVAL;
+
+   if (test_thread_flag(TIF_LAM_U57)) {
+   if (nr_bits && put_user(LAM_U57_BITS, nr_bits))
+   return -EFAULT;
+   if (offset && put_user(63 - LAM_U57_BITS, offset))
+   return -EFAULT;
+   } else if (test_thread_flag(TIF_LAM_U48)) {
+   if (nr_bits && put_user(LA

[RFC 7/9] x86/mm: Handle tagged memory accesses from kernel threads

2021-02-05 Thread Kirill A. Shutemov
When a kernel thread performs memory accesses on behalf of a process (as
in async I/O, io_uring, etc.) it has to respect the tagging setup of the
process, since user addresses can include tags.

Normally, LAM setup is per-thread and recorded in thread flags, but for
this use case we also track the LAM setup per-mm. mm->context.lam
records the LAM mode that allows the most tag bits among the threads of
the mm.

This info is used by switch_mm_irqs_off() to construct CR3 if the task is
a kernel thread. The thread flags of the kernel thread get updated
according to mm->context.lam, which allows untagged_addr() to work
correctly.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/mmu.h |  1 +
 arch/x86/mm/tlb.c  | 28 
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 9257667d13c5..fb05d6a11538 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -35,6 +35,7 @@ typedef struct {
 #ifdef CONFIG_X86_64
/* True if mm supports a task running in 32 bit compatibility mode. */
unsigned short ia32_compat;
+   u8 lam;
 #endif
 
struct mutex lock;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 138d4748aa97..1f9749da12e4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -176,6 +176,34 @@ static u8 gen_lam(struct task_struct *tsk, struct 
mm_struct *mm)
if (!tsk)
return LAM_NONE;
 
+   if (tsk->flags & PF_KTHREAD) {
+   /*
+* For kernel thread use the most permissive LAM
+* used by the mm. It's required to handle kernel thread
+* memory accesses on behalf of a process.
+*
+* Adjust thread flags accordingly, so untagged_addr() would
+* work correctly.
+*/
+   switch (mm->context.lam) {
+   case LAM_NONE:
+   clear_thread_flag(TIF_LAM_U48);
+   clear_thread_flag(TIF_LAM_U57);
+   return LAM_NONE;
+   case LAM_U57:
+   clear_thread_flag(TIF_LAM_U48);
+   set_thread_flag(TIF_LAM_U57);
+   return LAM_U57;
+   case LAM_U48:
+   set_thread_flag(TIF_LAM_U48);
+   clear_thread_flag(TIF_LAM_U57);
+   return LAM_U48;
+   default:
+   WARN_ON_ONCE(1);
+   return LAM_NONE;
+   }
+   }
+
if (test_ti_thread_flag(ti, TIF_LAM_U57))
return LAM_U57;
if (test_ti_thread_flag(ti, TIF_LAM_U48))
-- 
2.26.2



[RFC 0/9] Linear Address Masking enabling

2021-02-05 Thread Kirill A. Shutemov
Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to make use of the untranslated
address bits for metadata.

The patchset brings support for LAM for userspace addresses.

The most sensitive part of the enabling is the change in tlb.c, where the
CR3 flags get set. Please take a look and check that what I'm doing makes
sense.

The patchset is RFC quality and the code requires more testing before it
can be applied.

The userspace API is not finalized yet. The patchset extends the API used
by ARM64: PR_GET/SET_TAGGED_ADDR_CTRL. The API is adjusted to not imply
ARM TBI: it now allows requesting the number of metadata bits needed and
reports where these bits are located in the address.

There's an alternative proposal[2] for the API based on Intel CET
interface. Please let us know if you prefer one over another.

The feature competes for bits with 5-level paging: LAM_U48 makes it
impossible to map anything above 47 bits. The patchset makes these
capabilities mutually exclusive: whichever is used first wins. LAM_U57 can
be combined with mappings above 47 bits.

I include a QEMU patch in case somebody wants to play with the feature.

The branch:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam

Any comments are welcome.

[1] ISE, Chapter 14. 
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2] 
https://github.com/hjl-tools/linux/commit/e85fa032e5b276ddf17edd056f92f599db9e8369

Kirill A. Shutemov (9):
  mm, arm64: Update PR_SET/GET_TAGGED_ADDR_CTRL interface
  x86/mm: Fix CR3_ADDR_MASK
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86/mm: Introduce TIF_LAM_U57 and TIF_LAM_U48
  x86/mm: Provide untagged_addr() helper
  x86/uaccess: Remove tags from the address before checking
  x86/mm: Handle tagged memory accesses from kernel threads
  x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  x86/mm: Implement PR_SET/GET_TAGGED_ADDR_CTRL with LAM

 arch/arm64/include/asm/processor.h|  12 +-
 arch/arm64/kernel/process.c   |  45 +-
 arch/arm64/kernel/ptrace.c|   4 +-
 arch/x86/include/asm/cpufeatures.h|   1 +
 arch/x86/include/asm/elf.h|   3 +-
 arch/x86/include/asm/mmu.h|   1 +
 arch/x86/include/asm/mmu_context.h|  13 ++
 arch/x86/include/asm/page_32.h|   3 +
 arch/x86/include/asm/page_64.h|  19 +++
 arch/x86/include/asm/processor-flags.h|   2 +-
 arch/x86/include/asm/processor.h  |  10 ++
 arch/x86/include/asm/thread_info.h|   9 +-
 arch/x86/include/asm/tlbflush.h   |   5 +
 arch/x86/include/asm/uaccess.h|  16 +-
 arch/x86/include/uapi/asm/processor-flags.h   |   6 +
 arch/x86/kernel/process_64.c  | 145 ++
 arch/x86/kernel/sys_x86_64.c  |   5 +-
 arch/x86/mm/hugetlbpage.c |   6 +-
 arch/x86/mm/mmap.c|   9 +-
 arch/x86/mm/tlb.c | 124 +--
 kernel/sys.c  |  14 +-
 .../testing/selftests/arm64/tags/tags_test.c  |  31 
 .../selftests/{arm64 => vm}/tags/.gitignore   |   0
 .../selftests/{arm64 => vm}/tags/Makefile |   0
 .../{arm64 => vm}/tags/run_tags_test.sh   |   0
 tools/testing/selftests/vm/tags/tags_test.c   |  57 +++
 26 files changed, 464 insertions(+), 76 deletions(-)
 delete mode 100644 tools/testing/selftests/arm64/tags/tags_test.c
 rename tools/testing/selftests/{arm64 => vm}/tags/.gitignore (100%)
 rename tools/testing/selftests/{arm64 => vm}/tags/Makefile (100%)
 rename tools/testing/selftests/{arm64 => vm}/tags/run_tags_test.sh (100%)
 create mode 100644 tools/testing/selftests/vm/tags/tags_test.c

-- 
2.26.2



Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-20 Thread Kirill A. Shutemov
On Mon, Jan 18, 2021 at 04:42:20PM -0500, Arvind Sankar wrote:
> AFAICT, MODULES_END is only relevant as being something that happens to
> be in the top 512GiB, and -1ul would be clearer.

I think you are right. But -1UL is not very self-descriptive. :/

-- 
 Kirill A. Shutemov


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-20 Thread Kirill A. Shutemov
On Fri, Jan 15, 2021 at 02:07:51PM -0500, Arvind Sankar wrote:
> On Thu, Jan 07, 2021 at 11:34:15PM +0100, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> > 
> > When 5-level page tables are enabled, clang triggers a BUILD_BUG_ON():
> > 
> > x86_64-linux-ld: arch/x86/platform/efi/efi_64.o: in function 
> > `efi_sync_low_kernel_mappings':
> > efi_64.c:(.text+0x22c): undefined reference to `__compiletime_assert_354'
> > 
> > Use the same method as in commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT
> > and PTRS_PER_P4D variable") and change it to MAYBE_BUILD_BUG_ON(),
> > so it only triggers for constant input.
> > 
> > Link: https://github.com/ClangBuiltLinux/linux/issues/256
> > Signed-off-by: Arnd Bergmann 
> > ---
> >  arch/x86/platform/efi/efi_64.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index e1e8d4e3a213..62bb1616b4a5 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -137,8 +137,8 @@ void efi_sync_low_kernel_mappings(void)
> >  * As with PGDs, we share all P4D entries apart from the one entry
> >  * that covers the EFI runtime mapping space.
> >  */
> > -   BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> > -   BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
> > +   MAYBE_BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> > +   MAYBE_BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & 
> > P4D_MASK));
> >  
> > pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
> > pgd_k = pgd_offset_k(EFI_VA_END);
> > -- 
> > 2.29.2
> > 
> 
> I think this needs more explanation as to why clang is triggering this.
> The issue mentions clang not inline p4d_index(), and I guess not
> performing inter-procedural analysis either?
> 
> For the second assertion there, everything is always constant AFAICT:
> EFI_VA_START, EFI_VA_END and P4D_MASK are all constants regardless of
> CONFIG_5LEVEL.

Back in the day, BUILD_BUG_ON() false-triggered on GCC-4.8 as well.

-- 
 Kirill A. Shutemov


Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-13 Thread Kirill A. Shutemov
On Tue, Jan 12, 2021 at 07:31:07PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox  wrote:
> >
> > The thing about the speculative page cache references is that they can
> > temporarily bump a refcount on a page which _used_ to be in the page
> > cache and has now been reallocated as some other kind of page.
> 
> Oh, and thinking about this made me think we might actually have a
> serious bug here, and it has nothing what-so-ever to do with COW, GUP,
> or even the page count itself.
> 
> It's unlikely enough that I think it's mostly theoretical, but tell me
> I'm wrong.
> 
> PLEASE tell me I'm wrong:
> 
> CPU1 does page_cache_get_speculative under RCU lock
> 
> CPU2 frees and re-uses the page
> 
> CPU1CPU2
> 
> 
> page = xas_load();
> if (!page_cache_get_speculative(page))
> goto repeat;
> .. succeeds ..
> 
> remove page from XA
> release page
> reuse for something else

How can it be reused if CPU1 holds a reference to it?

> 
> .. and then re-check ..
> if (unlikely(page != xas_reload())) {
> put_page(page);
> goto repeat;
> }
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/3] mm: Allow architectures to request 'old' entries when prefaulting

2021-01-11 Thread Kirill A. Shutemov
On Mon, Jan 11, 2021 at 02:37:42PM +, Will Deacon wrote:
> On Mon, Jan 11, 2021 at 05:25:33PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 08, 2021 at 05:15:16PM +, Will Deacon wrote:
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index c1f2dc89b8a7..0fb9d1714797 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3051,14 +3051,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> > >   if (!pte_none(*vmf->pte))
> > >   goto unlock;
> > >  
> > > + /* We're about to handle the fault */
> > > + if (vmf->address == address) {
> > > + vmf->flags &= ~FAULT_FLAG_PREFAULT;
> > > + ret = VM_FAULT_NOPAGE;
> > > + } else {
> > > + vmf->flags |= FAULT_FLAG_PREFAULT;
> > > + }
> > > +
> > 
> > Do we need to restore the original status of the bit once we are done?
> 
> I can certainly add that, although it doesn't look like we do that for
> vmf->pte, so it's hard to tell what the rules are here. It certainly feels
> odd to restore some fields but not others, as it looks like vmf->address
> will be out-of-whack with vmf->pte when filemap_map_pages() returns. Am I
> missing something?

Unlike vmf->flags or vmf->address, vmf->pte is not going to be reused.
finish_fault() will overwrite it.

Yeah, it's messy.

-- 
 Kirill A. Shutemov

