Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Dave Hansen
On 3/30/24 04:23, Jarkko Sakkinen wrote:
>>> I also wonder is cgroup-tools dependency absolutely required or could
>>> you just have a function that would interact with sysfs?
>> I should have checked email before hitting the send button for v10.
>>
>> It'd be more complicated and less readable to do all the stuff without
>> cgroup-tools, esp. cgexec. I checked the dependency; cgroup-tools only
>> depends on libc, so I hope this would not cause too much inconvenience.
> As for cgroup-tools, please prove this. It makes the job more
> complicated *for you*, and you are making the job more complicated
> for every possible person on the planet running any kernel QA.

I don't see any other use of cgroup-tools in testing/selftests.

I *DO* see a ton of /bin/bash use though.  I wouldn't go to too much
trouble to make the thing ash-compatible.

That said, the most important thing is to get some selftests in place.
If using cgroup-tools means we get actual, runnable tests in place,
that's a heck of a lot more important than making them perfect.
Remember, almost nobody uses SGX.  It's available on *VERY* few systems
from one CPU vendor and only in very specific hardware configurations.



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 14:34, Huang, Kai wrote:
> So I am trying to get the actual downside of doing per-cgroup reclaim or
> the full reason that we choose global reclaim.

Take the most extreme example:

while (hit_global_sgx_limit())
	reclaim_from_this(cgroup);

You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all.  Other users might cause
you to go over the global limit.  *They* should be paying part of the
cost, not just you and your cgroup.



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 14:24, Huang, Kai wrote:
> What is the downside of doing per-group reclaim when try_charge()
> succeeds for the enclave but it fails to allocate an EPC page?
> 
> Could you give a complete answer why you chose to use global reclaim
> for the above case?

There are literally two different limits at play.  There's the limit
that the cgroup imposes and then the actual physical limit.

Hitting the cgroup limit induces cgroup reclaim.

Hitting the physical limit induces global reclaim.

Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.
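To make the separation concrete, here is a minimal sketch of an
allocation path that keeps the two limits apart (the names and return
conventions are simplified and illustrative, not lifted from the series;
sgx_can_reclaim_global() in particular is a made-up guard):

	/*
	 * Illustrative sketch only: names, return types and error handling
	 * are simplified, and sgx_can_reclaim_global() is a made-up guard
	 * to avoid looping forever.
	 */
	static struct sgx_epc_page *sgx_alloc_epc_page_sketch(struct sgx_cgroup *cg)
	{
		struct sgx_epc_page *page;

		/* Cgroup limit: charge first, reclaiming within 'cg' if needed. */
		if (sgx_cgroup_try_charge(cg))
			return ERR_PTR(-ENOMEM);

		for (;;) {
			/* Physical limit: the global EPC free pool. */
			page = __sgx_alloc_epc_page();
			if (!IS_ERR(page))
				return page;

			if (!sgx_can_reclaim_global()) {
				sgx_cgroup_uncharge(cg);
				return ERR_PTR(-ENOMEM);
			}

			/* Over the physical limit: reclaim globally, not per cgroup. */
			sgx_reclaim_pages_global();
		}
	}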



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 13:48, Haitao Huang wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC
> capacity, if one group has a fault, and its usage is not above its own
> limit (try_charge() passes), yet total usage of the system has exceeded
> the capacity, whether we do global reclaim or just reclaim pages in the
> current faulting group.

I don't see _any_ reason to limit reclaim to the current faulting cgroup.

>> Last, what is the simplest (least amount of code) thing that the SGX
>> cgroup controller could implement here?
> 
> I still think the current approach of doing global reclaim is reasonable
> and simple: try_charge() checks the cgroup limit and reclaims within the
> group if needed, then does the EPC page allocation, reclaiming globally
> if the allocation fails because global usage has reached the capacity.
> 
> I'm not sure how not doing global reclaim in this case would bring
> any benefit.
I tend to agree.

Kai, I think your examples sound a little bit contrived.  Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 03:36, Huang, Kai wrote:
>> In case of overcommitting, even if we always reclaim from the same cgroup
>> for each fault, one group may still interfere with the other: e.g., consider an
>> extreme case in which group A used up almost all EPC at the time group B
>> has a fault; B has to fail the allocation and kill enclaves.
> If the admin allows group A to use almost all EPC, to me it's fair to say
> he/she doesn't want to run anything inside B at all and it is acceptable
> for enclaves in B to be killed.

Folks, I'm having a really hard time following this thread.  It sounds
like there's disagreement about when to do system-wide reclaim.  Could
someone remind me of the choices that we have?  (A proposed patch would
go a _long_ way to helping me understand)

Also, what does the core mm memcg code do?

Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?




Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters

2024-02-19 Thread Dave Hansen
On 2/19/24 07:39, Haitao Huang wrote:
> Remove all boolean parameters for 'reclaim' from the function
> sgx_alloc_epc_page() and its callers by making two versions of each
> function.
> 
> Also opportunistically remove non-static declaration of
> __sgx_alloc_epc_page() and a typo
> 
> Signed-off-by: Haitao Huang 
> Suggested-by: Jarkko Sakkinen 
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 56 +--
>  arch/x86/kernel/cpu/sgx/encl.h  |  6 ++-
>  arch/x86/kernel/cpu/sgx/ioctl.c | 23 ---
>  arch/x86/kernel/cpu/sgx/main.c  | 68 ++---
>  arch/x86/kernel/cpu/sgx/sgx.h   |  4 +-
>  arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
>  6 files changed, 115 insertions(+), 44 deletions(-)

Jarkko, did this turn out how you expected?

I think passing around a function pointer to *only* communicate 1 bit of
information is a _bit_ overkill here.

Simply replacing the bool with:

enum sgx_reclaim {
	SGX_NO_RECLAIM,
	SGX_DO_RECLAIM
};

would do the same thing.  Right?

Are you sure you want a function pointer for this?
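For comparison, call sites with the enum would read roughly like this
(just a sketch of the readability argument):

	/* Sketch: the intent is visible at the call site, no comment needed. */
	epc_page = sgx_alloc_epc_page(encl_page, SGX_DO_RECLAIM);
	...
	epc_page = sgx_alloc_epc_page(va_page, SGX_NO_RECLAIM);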



Re: [PATCH v9 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-02-16 Thread Dave Hansen
On 2/16/24 13:38, Haitao Huang wrote:
> On Fri, 16 Feb 2024 09:15:59 -0600, Dave Hansen 
> wrote:
...
>> Does this 'indirect' change any behavior other than whether it does a
>> search for an mm to find a place to charge the backing storage?
> 
> No.
> 
>> Instead of passing a flag around, why not just pass the mm?
>>
> There is no need to pass in mm. We could just check if current->mm ==
> NULL for the need of doing the search in the enclave mm list.
> 
> But you had a concern [1] that the purpose was not clear hence suggested
> current_is_ksgxd().

Right, because there was only one possible way that mm could be NULL but
it wasn't obvious from the code what that way was.

> Would it be OK if we replace current_is_ksgxd() with (current->flags &
> PF_KTHREAD)? That would express the real intent of checking if calling
> context is not in a user context.

No, I think that focuses on the symptom and not on the fundamental problem.

The fundamental problem is that you need an mm in order to charge your
allocations to the right group.  Indirect reclaim means you are not in a
context which is connected to the mm that should be charged while direct
reclaim is.

>> This refactoring out of 'indirect' or passing the mm around really wants
>> to be in its own patch anyway.
>>
> Looks like I could do:
> 1) refactoring of 'indirect' value/enum suggested above. This seems the
> most straightforward without depending on any assumptions of other
> kernel code.
> 2) replace current_is_ksgxd() with current->mm == NULL. This assumes
> kthreads have no mm.
> 3) replace current_is_ksgxd() with current->flags & PF_KTHREAD. This is
> direct use of the flag PF_KTHREAD, so it should be better than #2?
> 
> Any preference or further thoughts?

Pass around a:

struct mm_struct *charge_mm

Then, at the bottom do:

/*
 * Backing RAM allocations need to be charged to some mm and
 * associated cgroup.  If this context does not have an mm to
 * charge, search the enclave's mm_list to find some mm
 * associated with this enclave.
 */
if (!charge_mm)
	... do slow mm lookup
else
	return mm_to_cgroup_whatever(charge_mm);

Then just comment the call sites where the initial charge_mm comes in:


/* Indirect SGX reclaim, no mm to charge, so NULL: */
foo(..., NULL);


/* Direct SGX reclaim, charge current mm for allocations: */
foo(..., current->mm);
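Fleshed out a bit, the bottom of that helper could look something like
this (a sketch; the function name and structure are illustrative,
get_mem_cgroup_from_mm() is the existing memcg interface):

	static struct mem_cgroup *sgx_charge_mem_cgroup(struct sgx_encl *encl,
							struct mm_struct *charge_mm)
	{
		/*
		 * Backing RAM allocations need to be charged to some mm and
		 * its associated cgroup.  Use the caller-supplied mm when
		 * there is one (direct reclaim).  Otherwise fall back to the
		 * slow, imprecise search of the enclave's mm_list (indirect
		 * reclaim from ksgxd).
		 */
		if (charge_mm)
			return get_mem_cgroup_from_mm(charge_mm);

		return sgx_encl_get_mem_cgroup(encl);
	}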




Re: [PATCH v9 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-02-16 Thread Dave Hansen
On 2/5/24 13:06, Haitao Huang wrote:
> @@ -414,7 +416,7 @@ static void sgx_reclaim_pages_global(void)
>  void sgx_reclaim_direct(void)
>  {
>   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(false);
>  }
>  
>  static int ksgxd(void *p)
> @@ -437,7 +439,7 @@ static int ksgxd(void *p)
>sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>   if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(true);
>  
>   cond_resched();
>   }

First, I'm never a fan of random true/false or 0/1 arguments to
functions like this.  You end up having to go look at the called
function to make any sense of it.  You can either do an enum, or some
construct like this:

if (sgx_should_reclaim(SGX_NR_HIGH_PAGES)) {
	bool indirect = true;
	sgx_reclaim_pages_global(indirect);
}

Yeah, it takes a few more lines but it saves you having to comment the
thing.

Does this 'indirect' change any behavior other than whether it does a
search for an mm to find a place to charge the backing storage?  Instead
of passing a flag around, why not just pass the mm?

This refactoring out of 'indirect' or passing the mm around really wants
to be in its own patch anyway.



Re: [PATCH v9 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-02-15 Thread Dave Hansen
On 2/5/24 13:06, Haitao Huang wrote:
>  static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
>  {
> @@ -1003,14 +1001,6 @@ static struct mem_cgroup 
> *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
>   struct sgx_encl_mm *encl_mm;
>   int idx;
>  
> - /*
> -  * If called from normal task context, return the mem_cgroup
> -  * of the current task's mm. The remainder of the handling is for
> -  * ksgxd.
> -  */
> - if (!current_is_ksgxd())
> - return get_mem_cgroup_from_mm(current->mm);

Why is this being removed?

Searching the enclave mm list is a last resort.  It's expensive and
imprecise.

get_mem_cgroup_from_mm(current->mm), on the other hand, is fast and precise.



Re: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory

2024-01-05 Thread Dave Hansen
There's very little about how the LRU design came to be in this cover
letter.  Let's add some details.

How's this?

Writing this up, I'm a lot more convinced that this series is, in
general, taking the right approach.  I honestly don't see any other
alternatives.  As much as I'd love to do something stupidly simple like
just killing enclaves at the moment they hit the limit, that would be a
horrid experience for users _and_ a departure from what the existing
reclaim support does.

That said, there's still a lot of work to do to refactor this series.
It's in need of some love to make it more clear what is going on and to
make the eventual switch over to per-cgroup LRUs more gradual.  Each
patch in the series is still doing way too much, _especially_ in patch 10.

==

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed.  EPC
allocations can wait for reclaim.  The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the
same reasons: it's preferable to wait for memory to become available and
let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list.  That list
(and the existing scanning algorithm) is essentially useless for doing
reclaim when a cgroup hits its limit because the cgroup's pages are
scattered around that LRU.  It is unspeakably inefficient to scan a
linked list with millions of entries for what could be dozens of pages
from a cgroup that needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU.  It
would either need to hold the list locks for unreasonable amounts of
time, or be taught to scan the list in pieces, which has its own challenges.

tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory.  The only two choices to implement
that are nasty changes to the existing LRU scanning algorithm, or adding
new LRUs.  The result: add a new LRU for each cgroup and scan those
instead.  Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).
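In code terms, the end state is roughly this (a sketch of the idea;
field and config names are approximations, not the actual patch):

	static struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *page)
	{
	#ifdef CONFIG_CGROUP_SGX_EPC
		/* Each cgroup gets its own LRU; pages remember their cgroup. */
		if (page->sgx_cg)
			return &page->sgx_cg->lru;
	#endif
		/* Cgroup support compiled out: everything uses the one global LRU. */
		return &sgx_global_lru;
	}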



Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states

2024-01-05 Thread Dave Hansen
On 10/30/23 11:20, Haitao Huang wrote:
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> -		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		list_del(&page->list);
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
>  	}
> +
> +	list_del(&page->list);
> +	sgx_epc_page_reset_state(page);

I want to know how much of this series is basically line-for-line
abstraction shifting like:

-   page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+   sgx_epc_page_reset_state(page);

versus actually adding complexity.  That way, I might be able to offer
some advice on where this can be pared down.  That's really hard to do
with the current series.

Please don't just "introduce new page states".  This should have first
abstracted out the sgx_epc_page_reclaim_in_progress() operation, using
the list_empty() check as the implementation.

Then, in a separate patch, introduce the concept of the "reclaim in
progress" flag and finally flip the implementation over.

Ditto for the sgx_epc_page_reset_state() abstraction.  It should have
been introduced separately as a concept and then have the implementation
changed.
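As a concrete example of that staging, the first patch would only
introduce the predicate on top of today's list_empty() check, with no
behavior change (a sketch):

	/* Step 1: abstraction only, implemented with the existing list_empty() check. */
	static inline bool sgx_epc_page_reclaim_in_progress(struct sgx_epc_page *page)
	{
		return list_empty(&page->list);
	}

A later patch can then flip the implementation over to an explicit
page-state flag without touching the callers.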

Then on to patch 10 (which is much too big), which introduces the
sgx_lru_list() abstraction.



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Dave Hansen
On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merging
> quickly, I'm happy to do that.

How about we put some actual numbers behind this?  How much complexity
are we talking about here?  What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-03 Thread Dave Hansen
On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, Your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?
What's the benefit of doing reclaim per cgroup?  Is that worth the extra
complexity?

The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit.  Right?

If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?



Re: [PATCH v7 00/13] selftests/sgx: Fix compilation errors

2023-11-08 Thread Dave Hansen
On 11/8/23 12:31, Jo Van Bulck wrote:
> Just a kind follow-up: from what I can see, this series has not been
> merged into the x86/sgx branch of tip yet (assuming that's where it
> should go next)?
> 
> Apologies if I've overlooked anything, and please let me know if there's
> something on my end that can help!

Yes, you've missed something.  For your reading pleasure:

https://www.kernel.org/doc/html/latest/process/2.Process.html?highlight=merge%20window
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

I honestly didn't even think about applying this until Jarkko said
something on the 23rd.  By that point, it was far too late for it to get
sorted out for 6.7.  So, that puts it in the next merge window.
Specifically:

The release candidate -rc1 is the starting point for new
patches to be applied which are targeted for the next merge
window.

So wait for the next -rc1, and you'll hopefully see your series get merged.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Dave Hansen
On 10/18/23 08:26, Haitao Huang wrote:
> Maybe not in the sense of killing something. My understanding is that
> memory.reclaim does not necessarily invoke the OOM killer. But what I
> really intend to say is we can have a separate knob for the user to
> express the need for reducing the current usage explicitly and keep the
> non-preemptive semantics of 'misc.max' intact. When we implement that
> new knob, then we can define what kind of reclaim it does. Depending on
> the vEPC implementation, it may or may not involve killing VMs. But at
> least that semantics will be explicit for the user.

I'm really worried that you're going for "perfect" semantics here.  This
is SGX.  It's *NOT* getting heavy use and even fewer folks will ever
apply cgroup controls to it.

Can we please stick to simple, easily-coded rules here?  I honestly
don't think these corner cases matter all that much and there's been
*WAY* too much traffic in this thread for what is ultimately not that
complicated.  Focus on *ONE* thing:

1. Admin sets a limit
2. Enclave is created
3. Enclave hits limit, allocation fails

Nothing else matters.  What if the admin lowers the limit on an
already-created enclave?  Nobody cares.  Seriously.  What about inducing
reclaim?  Nobody cares.  What about vEPC?  Doesn't matter, an enclave
page is an enclave page.
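In code, that bare-bones behavior boils down to a single charge against
the misc controller's limit.  A sketch, assuming the MISC_CG_RES_SGX_EPC
resource type this series adds:

	/* Sketch: all the controller has to do for the simple case. */
	static int sgx_epc_cgroup_try_charge(struct misc_cg *cg)
	{
		/* Fails when the admin-set limit would be exceeded. */
		return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, cg, PAGE_SIZE);
	}

	static void sgx_epc_cgroup_uncharge(struct misc_cg *cg)
	{
		misc_cg_uncharge(MISC_CG_RES_SGX_EPC, cg, PAGE_SIZE);
	}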


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Dave Hansen
On 10/17/23 21:37, Haitao Huang wrote:
> Yes, we can introduce misc.reclaim to give the user a knob to forcefully
> reduce usage if that is really needed in practice. The semantics would
> make force-killing VMs explicit to the user.

Do any other controllers do something like this?  It seems odd.


Re: [PATCH v6] x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race

2023-09-28 Thread Dave Hansen
On 9/28/23 16:08, Reinette Chatre wrote:
> I'd like to check in on the status of this patch. This two month old
> patch looks to be a needed fix and has Jarkko and Kai's review tags,
> but I am not able to find it queued or merged in tip or upstream.
> Apologies if I did not look in the right spot, I just want to make
> sure it did not fall through the cracks if deemed needed.

It fell through the cracks.  Sorry about that.  It's in x86/urgent now.


Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists

2023-09-14 Thread Dave Hansen
On 9/14/23 03:31, Huang, Kai wrote:
>> Signed-off-by: Haitao Huang 
>> Cc: Sean Christopherson 
> You don't need 'Cc:' Sean if the patch has Sean's SoB.

It is a SoB for Sean's @intel address and cc's his @google address.

It is fine.


Re: [PATCH] x86/tdx: refactor deprecated strncpy

2023-09-11 Thread Dave Hansen
On 9/11/23 11:27, Justin Stitt wrote:
> `strncpy` is deprecated and we should prefer more robust string apis.

I dunno.  It actually seems like a pretty good fit here.

> In this case, `message.str` is not expected to be NUL-terminated as it
> is simply a buffer of characters residing in a union which allows for
> named fields representing 8 bytes each. There is only one caller of
> `tdx_panic()` and they use a 59-length string for `msg`:
> | const char *msg = "TD misconfiguration: SEPT_VE_DISABLE attribute must 
> be set.";

I'm not really following this logic.

We need to do the following:

1. Make sure not to write past the end of 'message'
2. Make sure not to read past the end of 'msg'
3. Make sure not to leak stack data into the hypercall registers
   in the case of short strings.

strncpy() does #1, #2 and #3 just fine.

The resulting string does *NOT* need to be NULL-terminated.  See the
comment:

/* VMM assumes '\0' in byte 65, if the message took all 64 bytes */

Are there cases where strncpy() doesn't NULL-terminate the string other
than when the buffer is full?

I actually didn't realize that strncpy() pads its output up to the full
size.  I wonder if Kirill used it intentionally or whether he got lucky
here. :)
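For reference, the padding behavior is easy to demonstrate with plain
ISO C in userspace (nothing kernel-specific here, just libc strncpy()):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char buf[16];

		/* Short source: strncpy() zero-pads the rest of the buffer. */
		strncpy(buf, "hi", sizeof(buf));
		printf("last byte: %d\n", buf[15]);	/* prints 0 */

		/* Source >= buffer size: no terminating NUL is written. */
		strncpy(buf, "0123456789abcdefXYZ", sizeof(buf));
		printf("first 16: %.16s\n", buf);
		return 0;
	}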


Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

2022-10-27 Thread Dave Hansen
On 10/25/22 13:05, Dan Williams wrote:
> Users must first call cpu_cache_has_invalidate_memregion() to know whether
> this functionality is available on the architecture. Only enable it on
> x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
> guests may trigger a virtualization exception and may need proper handling
> to recover. See:
That sentence doesn't quite parse correctly.  Does it need to be "and
may trigger..."?

> This global cache invalidation facility,
> cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
> and CXL support can be built as a module. However, the intent is that
> this facility is not available outside of specific "device-memory" use
> cases. To that end the API is scoped to a new "DEVMEM" module namespace
> that only applies to the NVDIMM and CXL subsystems.

Oh, thank $DEITY you're trying to limit the amount of code that has
access to this thing.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 67745ceab0db..b68661d0633b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -69,6 +69,7 @@ config X86
>   select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION  if X86_64

What is 64-bit only about this?

I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit
kernels, but it would be nice to remove this if it's not strictly
needed.  Or, to add a changelog nugget that says:

Restrict this to X86_64 kernels.  It would probably work on 32-
bit, but there is no practical reason to use 32-bit kernels and
no one is testing them.

>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLEif !X86_PAE
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 97342c42dda8..8650bb6481a8 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>  #endif
>  
> +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
> +bool cpu_cache_has_invalidate_memregion(void)
> +{
> + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> +}
> +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
> +
> +int cpu_cache_invalidate_memregion(int res_desc)
> +{
> + wbinvd_on_all_cpus();
> + return 0;
> +}

Does this maybe also deserve a:

WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion());

in case one of the cpu_cache_invalidate_memregion() paths missed a
cpu_cache_has_invalidate_memregion() check?
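i.e., something like this (a sketch layered on the hunk quoted above):

	int cpu_cache_invalidate_memregion(int res_desc)
	{
		/* Catch callers that skipped the availability check: */
		WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion());
		wbinvd_on_all_cpus();
		return 0;
	}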

> +/**
> + * cpu_cache_invalidate_memregion - drop any CPU cached data for
> + * memregions described by @res_desc
> + * @res_desc: one of the IORES_DESC_* types
> + *
> + * Perform cache maintenance after a memory event / operation that
> + * changes the contents of physical memory in a cache-incoherent manner.
> + * For example, device memory technologies like NVDIMM and CXL have
> + * device secure erase, or dynamic region provision features where such
> + * semantics.

s/where/with/ ?

> + * Limit the functionality to architectures that have an efficient way
> + * to writeback and invalidate potentially terabytes of memory at once.
> + * Note that this routine may or may not write back any dirty contents
> + * while performing the invalidation. It is only exported for the
> + * explicit usage of the NVDIMM and CXL modules in the 'DEVMEM' symbol
> + * namespace.
> + *
> + * Returns 0 on success or negative error code on a failure to perform
> + * the cache maintenance.
> + */

WBINVD is a scary beast.  But, there's also no better alternative in the
architecture.  I don't think any of my comments above are deal breakers,
so from the x86 side:

Acked-by: Dave Hansen 



Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Dave Hansen
On 4/20/21 4:12 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/20/21 12:59 PM, Dave Hansen wrote:
>> On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> approach is, it adds a few extra instructions for every
>>>>> TDCALL use case when compared to distributed checks. Although
>>>>> it's a bit less efficient, it's worth it to make the code more
>>>>> readable.
>>>>
>>>> What's a "distributed check"?
>>>
>>> It should be "distributed TDVMCALL/TDCALL inline assembly calls"
>>
>> It's still not clear to what that refers.
> 
> I am just comparing the performance cost of using generic 
> TDCALL()/TDVMCALL() function implementation with "usage specific"
> (GetQuote,MapGPA, HLT,etc) custom TDCALL()/TDVMCALL() inline assembly
> implementation.

So, I actually had an idea what you were talking about, but I have
*ZERO* idea what "distributed" means in this context.

I think you are trying to say something along the lines of:

Just like syscalls, not all TDVMCALL/TDCALLs use the same set
of argument registers.  The implementation here picks the
current worst-case scenario for TDCALL (4 registers).  For
TDCALLs with fewer than 4 arguments, there will end up being
a few superfluous (cheap) instructions.  But, this approach
maximizes code reuse.


>>>> This also doesn't talk at all about why this approach was
>>>> chosen versus inline assembly.  You're going to be asked "why
>>>> not use inline asm?"
>>> "To make the core more readable and less error prone." I have
>>> added this info in above paragraph. Do you think we need more
>>> argument to justify our approach?
>> 
>> Yes, you need much more justification.  That's pretty generic and 
>> non-specific.
> readability is one of the main motivation for not choosing inline 

I'm curious.  Is there a reason you are not choosing to use
capitalization in your replies?  I personally use capitalization as a
visual clue for where a reply starts.

I'm not sure whether this indicates that your keyboard is not
functioning properly, or that these replies are simply not important
enough to warrant the use of the Shift key.  Or, is it simply an
oversight?  Or, maybe I'm just being overly picky because I've been
working on these exact things with my third-grader a bit too much lately.

Either way, I personally would appreciate your attention to detail in
crafting writing that is easy to parse, since I'm the one that's going
to have to parse it.  Details show that you care about the content you
produce.  Even if you don't mean it, a lack of attention to detail (even
capital letters) can be perceived to mean that you do not care about
what you write.  If you don't care about it, why should the reader?

> assembly. Since number of lines of instructions (with comments) are 
> over 70, using inline assembly made it hard to read. Another reason 
> is, since we
> are using many registers (R8-R15, R[A-D]X)) in TDVMCAL/TDCALL 
> operation, we are not sure whether some older compiler can follow
> our specified inline assembly constraints.

As for the justification, that's much improved.  Please include that,
along with some careful review of the grammar.

>>>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>>>> +
>>>>> +    tdcall
>>>>> +
>>>>> +    /* Panic if TDCALL reports failure. */
>>>>> +    test %rax, %rax
>>>>> +    jnz 2f
>>>>
>>>> Why panic?
>>> As per spec, TDCALL should never fail. Note that it has nothing to do
>>> with TDVMCALL function specific failure (which is reported via R10).
>>
>> You've introduced two concepts here, without differentiating them.  You
>> need to work to differentiate these two kinds of failure somewhere.  You
>> can't simply refer to both as "failure".
> will clarify it. I have assumed that once the user reads the spec, its
> easier
> to understand.

Your code should be 100% self-supporting without the spec.  The spec can
be there in a supportive role to help resolve ambiguity or add fine
detail.  But, I think this is a major, repeated problem with this patch
set: it relies too much on reviewers spending quality time with the spec.

>>>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>>>> in the C wrapper?
>>> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
>>> so added
>>> it here.
>>
>> That's not a good reason.  You could just as easily have a C wrapper
>> which all uses of TDVMCALL go through.
> 

Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Dave Hansen
On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> approach is, it adds a few extra instructions for every
>>> TDCALL use case when compared to distributed checks. Although
>>> it's a bit less efficient, it's worth it to make the code more
>>> readable.
>>
>> What's a "distributed check"?
> 
> It should be "distributed TDVMCALL/TDCALL inline assembly calls"

It's still not clear to what that refers.

>> This also doesn't talk at all about why this approach was chosen versus
>> inline assembly.  You're going to be asked "why not use inline asm?"
> "To make the core more readable and less error prone." I have added this
> info in above paragraph. Do you think we need more argument to
> justify our approach?

Yes, you need much more justification.  That's pretty generic and
non-specific.

>>> +    /*
>>> + * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
>>> + * defined in the GHCI.  Note, RAX and RCX are consumed, but
>>> only by
>>> + * TDX-Module and so don't need to be listed in the mask.
>>> + */
>>
>> "GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
>> more context.
> ok. will add it. Do you want GHCI spec reference with section id here?

Absolutely not.  I dislike all of the section references as-is.  Doesn't
a comment like this say what you said above and also add context?

Expose every register currently used in the
guest-to-host communication interface (GHCI).

>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>> +
>>> +    tdcall
>>> +
>>> +    /* Panic if TDCALL reports failure. */
>>> +    test %rax, %rax
>>> +    jnz 2f
>>
>> Why panic?
> As per spec, TDCALL should never fail. Note that it has nothing to do
> with TDVMCALL function specific failure (which is reported via R10).

You've introduced two concepts here, without differentiating them.  You
need to work to differentiate these two kinds of failure somewhere.  You
can't simply refer to both as "failure".

>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>> in the C wrapper?
> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
> so added
> it here.

That's not a good reason.  You could just as easily have a C wrapper
which all uses of TDVMCALL go through.

>>> +    /* Propagate TDVMCALL success/failure to return value. */
>>> +    mov %r10, %rax
>>
>> You just said it panic's on failure.  How can this propagate failure?
> we use panic for TDCALL failure. But, R10 content used to identify
> whether given
> TDVMCALL function operation is successful or not.

As I said above, please endeavor to differentiate the two classes of
failures.

Also, if the spec is violated, do you *REALLY* want to blow up the whole
world with a panic?  I guess you can argue that it could have been
something security-related that failed.  But, either way, you owe a
description of why panic'ing is a good idea, not just blindly deferring
to "the spec says this can't happen".

>>> +    xor %r10d, %r10d
>>> +    xor %r11d, %r11d
>>> +    xor %r12d, %r12d
>>> +    xor %r13d, %r13d
>>> +    xor %r14d, %r14d
>>> +    xor %r15d, %r15d
>>> +
>>> +    pop %r12
>>> +    pop %r13
>>> +    pop %r14
>>> +    pop %r15
>>> +
>>> +    FRAME_END
>>> +    ret
>>> +2:
>>> +    ud2
>>> +.endm
>>> +
>>> +SYM_FUNC_START(__tdvmcall)
>>> +    xor %r10, %r10
>>> +    tdvmcall_core
>>> +SYM_FUNC_END(__tdvmcall)
>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 0d00dd50a6ff..1147e7e765d6 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -3,6 +3,36 @@
>>>     #include 
>>>   +/*
>>> + * Wrapper for the common case with standard output value (R10).
>>> + */
>>
>> ... and oddly enough there is no explicit mention of R10 anywhere.  Why?
> For Guest to Host call -> R10 holds TDCALL function id (which is 0 for
> TDVMCALL). so
> we don't need special argument.
> After TDVMCALL execution, R10 value is returned via RAX.

OK... so this is how it works.  But why mention R10 in the comment?  Why
is *THAT* worth mentioning?

>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>> +
>>> +    WARN_ON(err);
>>> +
>>> +    return err;
>>> +}
>>
>> Are there really *ZERO* reasons for a TDVMCALL to return an error?
> No. Its useful for debugging TDVMCALL failures.
>> Won't this let a malicious VMM spew endless warnings into the guest
>> console?
> As per GHCI spec, R10 will hold error code details which can be used to
> determine
> the type of TDVMCALL failure.

I would encourage you to stop citing the GHCI spec.  In all of these
conversations, you seem to rely on it without considering the underlying
reasons.  The fact that R10 is the error code is 100% irrelevant for
this conversation.

It's also entirely possible that the host would have bugs, or forget to
clear a bit somewhere, even if the spec says, "don't do it".

> More 

Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Dave Hansen
On 3/26/21 4:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Implement common helper functions to communicate with
> the TDX Module and VMM (using TDCALL instruction).

This is missing any kind of background.  I'd say:

Guests communicate with VMMs with hypercalls. Historically, these are
implemented using instructions that are known to cause VMEXITs like
.  However, with TDX, VMEXITs no longer expose guest
state from the host.  This prevents the old hypercall mechanisms from
working

... and then go on to talk about what you are introducing, why there are
two of them and so forth.

> __tdvmcall() function can be used to request services
> from VMM.

^ "from a VMM" or "from the VMM", please

> __tdcall() function can be used to communicate with the
> TDX Module.
> 
> Using common helper functions makes the code more readable
> and less error prone compared to distributed and use case
> specific inline assembly code. Only downside in using this

 ^ "The only downside..."

> approach is, it adds a few extra instructions for every
> TDCALL use case when compared to distributed checks. Although
> it's a bit less efficient, it's worth it to make the code more
> readable.

What's a "distributed check"?

This also doesn't talk at all about why this approach was chosen versus
inline assembly.  You're going to be asked "why not use inline asm?"

> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,35 @@
>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  #include 
> +#include 
> +
> +struct tdcall_output {
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + u64 r10;
> + u64 r11;
> +};
> +
> +struct tdvmcall_output {
> + u64 r11;
> + u64 r12;
> + u64 r13;
> + u64 r14;
> + u64 r15;
> +};
>  
>  /* Common API to check TDX support in decompression and common kernel code. 
> */
>  bool is_tdx_guest(void);
>  
>  void __init tdx_early_init(void);
>  
> +u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
> +
> +u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> +struct tdvmcall_output *out);

Some one-liner comments about what these do would be nice.

>  #else // !CONFIG_INTEL_TDX_GUEST
>  
>  static inline bool is_tdx_guest(void)
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)  += pvclock.o
>  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)+= jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST)+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST)+= tdcall.o tdx.o
>  
>  obj-$(CONFIG_EISA)   += eisa.o
>  obj-$(CONFIG_PCSPKR_PLATFORM)+= pcspeaker.o
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..72de0b49467e 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
>  #include 
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include 
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include "asm-offsets_32.c"
>  #else
> @@ -75,6 +79,24 @@ static void __used common(void)
>   OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + BLANK();
> + /* Offset for fields in tdcall_output */
> + OFFSET(TDCALL_rcx, tdcall_output, rcx);
> + OFFSET(TDCALL_rdx, tdcall_output, rdx);
> + OFFSET(TDCALL_r8, tdcall_output, r8);
> + OFFSET(TDCALL_r9, tdcall_output, r9);
> + OFFSET(TDCALL_r10, tdcall_output, r10);
> + OFFSET(TDCALL_r11, tdcall_output, r11);

 ^ vertically align this

> + /* Offset for fields in tdvmcall_output */
> + OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
> + OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
> + OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
> + OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
> + OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
> +#endif
> +
>   BLANK();
>   OFFSET(BP_scratch, boot_params, scratch);
>   OFFSET(BP_secure_boot, boot_params, secure_boot);
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index ..a73b67c0b407
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,163 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00

This looks like an undocumented magic number.

> +/*
> + * TDCALL instruction is newly added in TDX architecture,
> + * used by TD for requesting the host VMM to provide
> + * (untrusted) services. Supported in Binutils >= 2.36
> + */

Host VMM *AND* TD-module, right?

> +#define tdcall .byte 0x66,0x0f,0x01,0xcc

How well will the "newly added" comment age?

"host VMM" is redundant.

/*
 * TDX guests 

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Dave Hansen
On 4/19/21 11:10 AM, Andy Lutomirski wrote:
> I’m confused by this scenario. This should only affect physical pages
> that are in the 2M area that contains guest memory. But, if we have a
> 2M direct map PMD entry that contains kernel data and guest private
> memory, we’re already in a situation in which the kernel touching
> that memory would machine check, right?

Not machine check, but page fault.  Do machine checks even play a
special role in SEV-SNP?  I thought that was only TDX?

My point was just that you can't _easily_ do the 2M->4k kernel mapping
demotion in a page fault handler, like I think Borislav was suggesting.

> ISTM we should fully unmap any guest private page from the kernel and
> all host user pagetables before actually making it be a guest private
> page.

Yes, that sounds attractive.  Then, we'd actually know if the host
kernel was doing stray reads somehow because we'd get a fault there too.


Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Dave Hansen
On 4/19/21 10:46 AM, Brijesh Singh wrote:
> - guest wants to make gpa 0x1000 as a shared page. To support this, we
> need to psmash the large RMP entry into 512 4K entries. The psmash
> instruction breaks the large RMP entry into 512 4K entries without
> affecting the previous validation. Now the we need to force the host to
> use the 4K page level instead of the 2MB.
> 
> To my understanding, the Linux kernel fault handler does not build the page
> tables on demand for kernel addresses. All kernel addresses are
> pre-mapped at boot. Currently, I am proactively splitting the physmap
> to avoid running into a situation where the x86 page level is greater than
> the RMP page level.

In other words, if the host maps guest memory with 2M mappings, the
guest can induce page faults in the host.  The only way the host can
avoid this is to map everything with 4k mappings.

If the host does not avoid this, it could end up in the situation where
it gets page faults on access to kernel data structures.  Imagine if a
kernel stack page ended up in the same 2M mapping as a guest page.  I
*think* the next write to the kernel stack would end up double-faulting.


Re: [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled

2021-04-19 Thread Dave Hansen
On 4/16/21 8:40 AM, Kirill A. Shutemov wrote:
>   /*
> -  * If SME is active, the trampoline area will need to be in
> -  * decrypted memory in order to bring up other processors
> +  * If SME or KVM memory protection is active, the trampoline area will
> +  * need to be in decrypted memory in order to bring up other processors
>* successfully. This is not needed for SEV.
>*/
> - if (sme_active())
> + if (sme_active() || kvm_mem_protected())
>   set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);

Could you take a look at all the places you've added these:

if (foo() || kvm_mem_protected())
	bar();

spots and see if some refactoring is in order?

I suspect that some thought about what the high-level commonalities are,
plus some thoughtful helper function names would go a long way to making
this whole thing understandable.
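For example, something along these lines, where the helper name is
purely illustrative:

	/* One predicate, one concept: does this memory need to be shared/decrypted? */
	static inline bool mem_needs_host_sharing(void)
	{
		return sme_active() || kvm_mem_protected();
	}

	/* ...and then at the call sites: */
	if (mem_needs_host_sharing())
		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);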


set_memory_decrypted() as a name needs to go.  It almost needs to be
something like:

set_memory_sharing()

or something.  The "sharing" would be between the kernel and devices
(for SME), or the guest kernel and host kernel for protected memory.


Re: [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection

2021-04-16 Thread Dave Hansen
On 4/16/21 8:40 AM, Kirill A. Shutemov wrote:
> Mirror SEV, use SWIOTLB always if KVM memory protection is enabled.
...
>  arch/x86/mm/mem_encrypt.c  | 44 ---
>  arch/x86/mm/mem_encrypt_common.c   | 48 ++

The changelog needs to at least mention what's going on here.  It doesn't
prepare me at all for having code move around.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d197b3beb904..c51d14db5620 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -812,6 +812,7 @@ config KVM_GUEST
>   select ARCH_CPUIDLE_HALTPOLL
>   select X86_HV_CALLBACK_VECTOR
>   select X86_MEM_ENCRYPT_COMMON
> + select SWIOTLB
>   default y
>   help
> This option enables various optimizations for running under the KVM

So, this feature is always compiled in with KVM.  Could you say a couple
of things about that?  Why did you decide not to have a Kconfig option for it?

> diff --git a/arch/x86/include/asm/mem_encrypt.h 
> b/arch/x86/include/asm/mem_encrypt.h
> index 31c4df123aa0..a748b30c2f23 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, 
> unsigned long size);
>  
>  void __init mem_encrypt_free_decrypted_mem(void);
>  
> -/* Architecture __weak replacement functions */
> -void __init mem_encrypt_init(void);
> -
>  void __init sev_es_init_vc_handling(void);
> +
>  bool sme_active(void);
>  bool sev_active(void);
>  bool sev_es_active(void);
> @@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
>  
>  #endif   /* CONFIG_AMD_MEM_ENCRYPT */
>  
> +/* Architecture __weak replacement functions */
> +void __init mem_encrypt_init(void);

FWIW, I'd rather have the code movement in separate patches from the
functional changes.

>  /*
>   * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
>   * writing to or comparing values from the cr3 register.  Having the
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aed6034fcac1..ba179f5ca198 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -765,6 +766,7 @@ static void __init kvm_init_platform(void)
>   pr_info("KVM memory protection enabled\n");
>   mem_protected = true;
>   setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
> + swiotlb_force = SWIOTLB_FORCE;
>   }
>  }
>  
> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> index c2cfa5e7c152..814060a6ceb0 100644
> --- a/arch/x86/kernel/pci-swiotlb.c
> +++ b/arch/x86/kernel/pci-swiotlb.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  int swiotlb __read_mostly;
>  
> @@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
>* buffers are allocated and used for devices that do not support
>* the addressing range required for the encryption mask.
>*/
> - if (sme_active())
> + if (sme_active() || kvm_mem_protected())
>   swiotlb = 1;
>  
>   return swiotlb;

While I don't doubt you got it right, it would be nice to also explain
in the changelog why you manipulate both 'swiotlb_force' and 'swiotlb'.

> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 9ca477b9b8ba..3478f20fb46f 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
>  
>   free_init_pages("unused decrypted", vaddr, vaddr_end);
>  }
> -
> -static void print_mem_encrypt_feature_info(void)
> -{
> - pr_info("AMD Memory Encryption Features active:");
> -
> - /* Secure Memory Encryption */
> - if (sme_active()) {
> - /*
> -  * SME is mutually exclusive with any of the SEV
> -  * features below.
> -  */
> - pr_cont(" SME\n");
> - return;
> - }
> -
> - /* Secure Encrypted Virtualization */
> - if (sev_active())
> - pr_cont(" SEV");
> -
> - /* Encrypted Register State */
> - if (sev_es_active())
> - pr_cont(" SEV-ES");
> -
> - pr_cont("\n");
> -}
> -
> -/* Architecture __weak replacement functions */
> -void __init mem_encrypt_init(void)
> -{
> - if (!sme_me_mask)
> - return;
> -
> - /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> - swiotlb_update_mem_attributes();
> -
> - /*
> -  * With SEV, we need to unroll the rep string I/O instructions.
> -  */
> - if (sev_active())
> -		static_branch_enable(&sev_enable_key);
> -
> - print_mem_encrypt_feature_info();
> -}
> -
> diff --git a/arch/x86/mm/mem_encrypt_common.c 
> b/arch/x86/mm/mem_encrypt_common.c
> index 6bf0718bb72a..351b77361a5d 100644
> --- 

Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard

2021-04-16 Thread Dave Hansen
On 4/16/21 5:35 AM, Michal Hocko wrote:
>   I have to confess that I haven't grasped the initialization
>   completely. There is a nice comment explaining a 2 socket system with
>   3 different NUMA nodes attached to it with one node being terminal.
>   This is OK if the terminal node is PMEM but how that fits into usual
>   NUMA setups. E.g.
>   4 nodes each with its set of CPUs
>   node distances:
>   node   0   1   2   3
>   0:  10  20  20  20
>   1:  20  10  20  20
>   2:  20  20  10  20
>   3:  20  20  20  10
>   Do I get it right that Node 3 would be terminal?

Yes, I think Node 3 would end up being the terminal node in that setup.

That said, I'm not sure how much I expect folks to use this on
traditional, non-tiered setups.  It's also hard to argue what the
migration order *should* be when all the nodes are uniform.

> - The demotion is controlled by node_reclaim_mode but unlike other modes
>   it applies to both direct and kswapd reclaims.
>   I do not see that explained anywhere though.

That's an interesting observation.  Let me do a bit of research and I'll
update the Documentation/ and the changelog.

> - The demotion is implemented at shrink_page_list level which migrates
>   pages in the first round and then falls back to the regular reclaim
>   when migration fails. This means that the reclaim context
>   (PF_MEMALLOC) will allocate memory so it has access to full memory
>   reserves. Btw. I do not see __GFP_NO_MEMALLOC anywhere in the allocation
>   mask, which looks like a bug rather than an intention. Btw. using
>   GFP_NOWAIT in the allocation callback would make more things clear
>   IMO.

Yes, the lack of __GFP_NO_MEMALLOC is a bug.  I'll fix that up.

GFP_NOWAIT _seems_ like it will work.  I'll give it a shot.

> - Memcg reclaim is excluded from all this because it is not NUMA aware
>   which makes sense to me.
> - Anonymous pages are bit tricky because they can be demoted even when
>   they cannot be reclaimed due to no (or no available) swap storage.
>   Unless I have missed something the second round will try to reclaim
>   them even the later is true and I am not sure this is completely OK.

What we want is something like this:

Swap Space / Demotion OK  -> Can Reclaim
Swap Space / Demotion Off -> Can Reclaim
Swap Full  / Demotion OK  -> Can Reclaim
Swap Full  / Demotion Off -> No Reclaim

I *think* that's what can_reclaim_anon_pages() ends up doing.  Maybe I'm
misunderstanding what you are referring to, though.  By "second round"
did you mean when we do reclaim on a node which is a terminal node?
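Roughly, the logic I have in mind for that table looks like this (a
sketch with approximate names, not the exact function in the series):

	/* Sketch: when is reclaiming anonymous pages on this node worthwhile? */
	static bool can_reclaim_anon_pages_sketch(struct pglist_data *pgdat)
	{
		/* Swap space available (and not full): reclaim can make progress. */
		if (get_nr_swap_pages() > 0)
			return true;

		/* No usable swap: only reclaim if demotion is possible from here. */
		return next_demotion_node(pgdat->node_id) != NUMA_NO_NODE;
	}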

> I am still trying to digest the whole thing but at least jamming
> node_reclaim logic into kswapd seems strange to me. Need to think more
> about that though.

I'm entirely open to other ways to do the opt-in.  It seemed sane at the
time, but I also understand the kswapd concern.

> Btw. do you have any numbers from running this with some real work
> workload?

Yes, quite a bit.  Do you have a specific scenario in mind?  Folks seem
to come at this in two different ways:

Some want to know how much DRAM they can replace by buying some PMEM.
They tend to care about how much adding the (cheaper) PMEM slows them
down versus (expensive) DRAM.  They're making a cost-benefit call

Others want to repurpose some PMEM they already have.  They want to know
how much using PMEM in this way will speed them up.  They will basically
take any speedup they can get.

I ask because as a kernel developer with PMEM in my systems, I find the
"I'll take what I can get" case more personally appealing.  But, the
business folks are much more keen on the "DRAM replacement" use.  Do you
have any thoughts on what you would like to see?


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-15 Thread Dave Hansen
On 4/15/21 9:24 AM, Andy Lutomirski wrote:
> In the patches, *as submitted*, if you trip the XFD #NM *once* and you
> are the only thread on the system to do so, you will eat the cost of a
> WRMSR on every subsequent context switch.

I think you're saying: If a thread trips XFD #NM *once*, every switch to
and from *that* thread will incur the WRMSR cost.

The first time I read this, I thought you were saying that all threads
would incur a WRMSR cost on every context switch.  If that's the case, I
grossly misread the patches. :)


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-15 Thread Dave Hansen
On 4/14/21 9:07 PM, Wei Xu wrote:
> On Wed, Apr 14, 2021 at 1:08 AM Oscar Salvador  wrote:
>> Fast class/memory are pictured as those nodes with CPUs, while Slow 
>> class/memory
>> are PMEM, right?
>> Then, what stands for medium class/memory?
> 
> That is Dave's example.  I think David's guess makes sense (HBM - fast, DRAM -
> medium, PMEM - slow).  It may also be possible that we have DDR5 as fast,
> CXL-DDR4 as medium, and CXL-PMEM as slow.  But the most likely use cases for
> now should be just two tiers: DRAM vs PMEM or other types of slower
> memory devices.

Yes, it would be nice to apply this to fancier tiering systems.  But
DRAM/PMEM combos are out in the wild today and it's where I expect this
to be used first.

> This can help enable more flexible demotion policies to be
> configured, such as to allow a cgroup to allocate from all fast tier
> nodes, but only demote to a local slow tier node.  Such a policy can
> reduce memory stranding at the fast tier (compared to if memory
> hardwall is used) and still allow demotion from all fast tier nodes
> without incurring the expensive random accesses to the demoted pages
> if they were demoted to remote slow tier nodes.

Could you explain this stranding effect in a bit more detail?  I'm not
quite following.


Re: [PATCH v8] x86/sgx: Maintain encl->refcount for each encl->mm_list entry

2021-04-14 Thread Dave Hansen
On 4/14/21 8:51 AM, Sean Christopherson wrote:
>> Could this access to and kfree of encl_mm possibly be after the
>> kfree(encl_mm) noted above?
> No, the mmu_notifier_unregister() ensures that all in-progress notifiers 
> complete
> before it returns, i.e. SGX's notifier call back is not reachable after it's
> unregistered.
> 
>> Also is there a reason we do kfree(encl_mm) in notifier_free not directly in
>> notifier_release?
> Because encl_mm is the anchor to the enclave reference
> 
>   /* 'encl_mm' is going away, put encl_mm->encl reference: */
>   kref_put(&encl_mm->encl->refcount, sgx_encl_release);
> 
> as well as the mmu notifier reference (the mmu_notifier_put(mn) call chain).
> Freeing encl_mm immediately would prevent sgx_mmu_notifier_free() from 
> dropping
> the enclave reference.  And the mmu notifier reference need to be dropped in
> sgx_mmu_notifier_release() because the encl_mm has been taken off 
> encl->mm_list.

Haitao, I think you've highlighted that this locking scheme is woefully
under-documented.  Any patches to beef it up would be very welcome.


Re: [PATCH v2 2/3] soundwire: Intel: introduce DMI quirks for HP Spectre x360 Convertible

2021-04-12 Thread Dave Hansen
On 3/1/21 11:51 PM, Bard Liao wrote:
> +++ b/drivers/soundwire/dmi-quirks.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +// Copyright(c) 2021 Intel Corporation.

It looks like this is already in intel-next, so this may be moot.  But,
is there a specific reason this is dual licensed?  If so, can you please
include information about the license choice in the cover letter of any
future version?

If there is no specific reason for this contribution to be dual
licensed, please make it GPL-2.0 only.


Re: [PATCH v6 19/34] xlink-core: Add xlink core device tree bindings

2021-04-12 Thread Dave Hansen
On 2/12/21 2:22 PM, mgr...@linux.intel.com wrote:
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +# Copyright (c) Intel Corporation. All rights reserved.
> +%YAML 1.2
> +---
> +$id: "http://devicetree.org/schemas/misc/intel,keembay-xlink.yaml#"
> +$schema: "http://devicetree.org/meta-schemas/core.yaml#"
> +
> +title: Intel Keem Bay xlink

Is there a specific reason this is dual licensed?  If so, can you please
include information about the license choice in the next post's cover
letter?

If there is no specific reason for this contribution to be dual
licensed, please make it GPL-2.0 only.


Re: [PATCH v2 0/3] x86/sgx: eextend ioctl

2021-04-12 Thread Dave Hansen
On 4/12/21 8:58 AM, Jethro Beekman wrote:
> On 2021-04-12 17:36, Dave Hansen wrote:
>> On 4/12/21 1:59 AM, Raoul Strackx wrote:
>>> This patch set adds a new ioctl to enable userspace to execute EEXTEND
>>> leaf functions per 256 bytes of enclave memory. With this patch in place,
>>> Linux will be able to build all valid SGXv1 enclaves.
>> This didn't cover why we need a *NEW* ABI for this instead of relaxing
>> the page alignment rules in the existing one.
>>
> In executing the ECREATE, EADD, EEXTEND, EINIT sequence, you currently have 2 
> options for EADD/EEXTEND using the SGX_IOC_ENCLAVE_ADD_PAGES ioctl:
> - execute EADD on any address
> - execute EADD on any address followed by 16× EEXTEND for that address span

I think you forgot a key piece of the explanation here.  The choice as
to whether you just EADD or EADD+16xEEXTEND is governed by the addition
of the SGX_PAGE_MEASURE flag.

> Could you be more specific on how you're suggesting that the current ioctl is 
> modified to in addition support the following?
> - execute EEXTEND on any address

I'm still not convinced you *NEED* EEXTEND on arbitrary addresses.

Right now, we have (roughly):

 ioctl(ADD_PAGES, ptr, PAGE_SIZE, MEASURE)

which translates in the kernel to:

	__eadd(ptr, epc)
	if (flags & MEASURE) {
		for (i = 0; i < PAGE_SIZE/256; i++)
			__eextend(epc + i*256);
	}

Instead, we could allow add_arg.src and add_arg.offset to be
non-page-aligned.  Then, we still do the same __eadd(), but modify the
__eextend() loop to only cover the actual range referred to by 'add_arg'.
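
Roughly (a sketch only, reusing the pseudo-code names from above; 'offset'
and 'length' stand for the page-relative values taken from 'add_arg'):

	__eadd(ptr, epc);
	if (flags & MEASURE) {
		/* EEXTEND works on 256-byte chunks */
		start = round_down(offset, 256);
		end   = round_up(offset + length, 256);
		for (i = start; i < end; i += 256)
			__eextend(epc + i);
	}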

The downside is that you only get a single range of measured data per
page.  Let's say a 'X' means measured (EEXTEND'ed) and '_' means not.
You could have patterns like:

XXXXXXXXXXXXXXXX
or
XXXXXXXXXXXX____
or
____XXXXXXXXXXXX

but not:

_X_X_X_X_X_X_X_X
or
_XX____XXXXXXXX_


Is that a problem?


Re: [PATCH v2 0/3] x86/sgx: eextend ioctl

2021-04-12 Thread Dave Hansen
On 4/12/21 9:41 AM, Jethro Beekman wrote:
> Yes this still doesn't let one execute all possible ECREATE, EADD, EEXTEND, 
> EINIT sequences.

OK, so we're going in circles now.

I don't believe we necessarily *WANT* or need Linux to support "all
possible ECREATE, EADD, EEXTEND, EINIT sequences".  Yet, it's what is
being used to justify this series without any other justification.

It's going to be a different story if you bring me a real enclave that
*REALLY* wants to do this for good reasons.


Re: [PATCH v2 0/3] x86/sgx: eextend ioctl

2021-04-12 Thread Dave Hansen
On 4/12/21 1:59 AM, Raoul Strackx wrote:
> This patch set adds a new ioctl to enable userspace to execute EEXTEND
> leaf functions per 256 bytes of enclave memory. With this patch in place,
> Linux will be able to build all valid SGXv1 enclaves.

This didn't cover why we need a *NEW* ABI for this instead of relaxing
the page alignment rules in the existing one.



Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-09 Thread Dave Hansen
On 4/8/21 11:17 AM, Oscar Salvador wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   cc->nr_migratepages -= nr_reclaimed;
> 
>   ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> - NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> + NULL);
>   }
>   if (ret < 0) {
>   putback_movable_pages(&cc->migratepages);

I also considered passing NULL to mean "I don't care about
nr_succeeded".  I mostly avoided it to reduce churn.  But, looking at it
here, it does seem cleaner.

Any objections to moving over to Oscar's suggestion?


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-08 Thread Dave Hansen
On 4/8/21 1:26 AM, Oscar Salvador wrote:
> On Thu, Apr 01, 2021 at 11:32:19AM -0700, Dave Hansen wrote:
>> The protocol for node_demotion[] access and writing is not
>> standard.  It has no specific locking and is intended to be read
>> locklessly.  Readers must take care to avoid observing changes
>> that appear incoherent.  This was done so that node_demotion[]
> 
> It might be just me being dense here, but that reads odd.
> 
> "Readers must take care to avoid observing changes that appear
> incoherent" - I am not sure what is that supposed to mean.
> 
> I guess you mean readers of next_demotion_node()?
> And if so, how do they have to take care? And what would apply for
> "incoherent" terminology here?

I've fleshed out the description a bit.  I hope this helps?

> Readers of node_demotion[] (such as next_demotion_node() callers)
> must take care to avoid observing changes that appear incoherent.
> For instance, even though no demotion cycles are allowed, it's
> possible that a cycle could be observed.
> 
> Let's say that there are three nodes, A, B and C.  node_demotion[]
> is set up to have this path:
> 
> A -> B -> C
> 
> Suppose it was modified to instead represent this path:
> 
> A -> C -> B
> 
> There is nothing to stop a reader from seeing B->C and then a
> moment later seeing C->B.  That *appears* to be a cycle.  This
> can be avoided with RCU and will be implemented in a later patch.
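
For illustration, the reader ends up being something boring like this
(sketch only; the names follow the patch above, and the later RCU patch
is the authoritative version):

	int next_demotion_node(int node)
	{
		int target;

		rcu_read_lock();
		target = READ_ONCE(node_demotion[node]);
		rcu_read_unlock();

		return target;
	}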

...
>> +again:
>> +this_pass = next_pass;
>> +next_pass = NODE_MASK_NONE;
>> +/*
>> + * To avoid cycles in the migration "graph", ensure
>> + * that migration sources are not future targets by
>> + * setting them in 'used_targets'.  Do this only
>> + * once per pass so that multiple source nodes can
>> + * share a target node.
>> + *
>> + * 'used_targets' will become unavailable in future
>> + * passes.  This limits some opportunities for
>> + * multiple source nodes to share a destination.
>> + */
>> +nodes_or(used_targets, used_targets, this_pass);
>> +for_each_node_mask(node, this_pass) {
>> +int target_node = establish_migrate_target(node, &used_targets);
>> +
>> +if (target_node == NUMA_NO_NODE)
>> +continue;
>> +
>> +/* Visit targets from this pass in the next pass: */
>> +node_set(target_node, next_pass);
>> +}
>> +/* Is another pass necessary? */
>> +if (!nodes_empty(next_pass))
> 
> When I read this I was about puzzled and it took me a while to figure
> out how the passes were made.
> I think this could benefit from a better explanation on how the passes
> are being performed e.g: why next_pass should be empty before leaving.
> 
> Other than that looks good to me.
I've tried to flesh out those comments to elaborate on what is going on:

> /*
>  * Visit targets from this pass in the next pass.
>  * Eventually, every node will have been part of
>  * a pass, and will become set in 'used_targets'.
>  */
> node_set(target_node, next_pass);
> }
> /*
>  * 'next_pass' contains nodes which became migration
>  * targets in this pass.  Make additional passes until
>  * no more migrations targets are available.
>  */
> if (!nodes_empty(next_pass))
> goto again;
> }



Re: [PATCH 01/10] mm/numa: node demotion data structure and lookup

2021-04-08 Thread Dave Hansen
On 4/8/21 1:03 AM, Oscar Salvador wrote:
> I think this patch and patch#2 could be squashed
> 
> Reviewed-by: Oscar Salvador 

Yeah, that makes a lot of sense.  I'll do that for the next version.


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-08 Thread Dave Hansen


On 4/8/21 8:27 AM, Jethro Beekman wrote:
> But the native “executable format” for SGX is very clearly defined in
> the Intel SDM as a specific sequence of ECREATE, EADD, EEXTEND and
> EINIT calls. It's that sequence that's used for loading the enclave
> and it's that sequence that's used for code signing. So when I say
> save space I mean save space in the native format.
> 
> Not EEXTENDing unnecessarily also reduces enclave load time if
> you're looking for additional arguments.
I look forward to all of this being clearly explained in your resubmission.

> SGX defines lots of things and you may not see the use case for all
> of this immediately. No one has a usecase for creating enclaves with
> SECS.SSAFRAMESIZE = 1000 or TCS.NSSA = 3. Why did you not demand
> checks for this in the ECREATE/EADD ioctls?
There's a difference between adding code to support a feature and adding
code to *RESTRICT* use of a feature.


Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared

2021-04-06 Thread Dave Hansen
On 4/6/21 9:31 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 02:01:15PM -0700, Dave Hansen wrote:
>>> @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, 
>>> int numpages, bool enc)
>>> struct cpa_data cpa;
>>> int ret;
>>>  
>>> -   /* Nothing to do if memory encryption is not active */
>>> -   if (!mem_encrypt_active())
>>> +   /* Nothing to do if memory encryption and TDX are not active */
>>> +   if (!mem_encrypt_active() && !is_tdx_guest())
>>> return 0;
>>
>> So, this is starting to look like the "enc" naming is wrong, or at least
>> a little misleading.   Should we be talking about "protection" or
>> "guards" or something?
> 
> Are you talking about the function argument or function name too?

Yes, __set_memory_enc_dec() isn't really just doing "enc"ryption any more.

>>> /* Should not be working on unaligned addresses */
>>> @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, 
>>> int numpages, bool enc)
>>> memset(&cpa, 0, sizeof(cpa));
>>> cpa.vaddr = &addr;
>>> cpa.numpages = numpages;
>>> -   cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
>>> -   cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
>>> +   if (is_tdx_guest()) {
>>> +   cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
>>> +   cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
>>> +   } else {
>>> +   cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
>>> +   cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
>>> +   }
>>
>> OK, this is too hideous to live.  It sucks that the TDX and SEV/SME bits
>> are opposite polarity, but oh well.
>>
>> To me, this gets a lot clearer, and opens up room for commenting if you
>> do something like:
>>
>>  if (is_tdx_guest()) {
>>  mem_enc_bits   = 0;
>>  mem_plain_bits = tdx_shared_mask();
>>  } else {
>>  mem_enc_bits   = _PAGE_ENC;
>>  mem_plain_bits = 0
>>  }
>>
>>  if (enc) {
>>  cpa.mask_set = mem_enc_bits;
>>  cpa.mask_clr = mem_plain_bits;  // clear "plain" bits
>>  } else {
>>  
>>  cpa.mask_set = mem_plain_bits;
>>  cpa.mask_clr = mem_enc_bits;// clear encryption bits
>>  }
> 
> I'm not convinced that your approach it clearer. If you add the missing
> __pgprot() it going to as ugly as the original.
> 
> But if a maintainer wants... :)

Yes, please.  I think my version (with the added __pgprot() conversions)
clearly separates out the two thing that are going on.

>>> cpa.pgd = init_mm.pgd;
>>>  
>>> /* Must avoid aliasing mappings in the highmem code */
>>> @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, 
>>> int numpages, bool enc)
>>> /*
>>>  * Before changing the encryption attribute, we need to flush caches.
>>>  */
>>> -   cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>> +   if (!enc || !is_tdx_guest())
>>> +   cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>
>> That "!enc" looks wrong to me.  Caches would need to be flushed whenever
>> encryption attributes *change*, not just when they are set.
>>
>> Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
>> not need TLB flushes?
> 
> I will double-check everthing, but I think we can skip *both* cpa_flush()
> for private->shared conversion. VMM and TDX module will take care about
> TLB and cache flush in response to MapGPA TDVMCALL.

Oh, interesting.  You might also want to double check if there are any
more places where X86_FEATURE_SME_COHERENT and TDX have similar properties.


Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()

2021-04-06 Thread Dave Hansen
On 4/6/21 9:00 AM, Kirill A. Shutemov wrote:
>>> --- a/arch/x86/mm/ioremap.c
>>> +++ b/arch/x86/mm/ioremap.c
>>> @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource 
>>> *res)
>>>  }
>>>  
>>>  /*
>>> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
>>> - * there the whole memory is already encrypted.
>>> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted 
>>> (or
>>> + * private in TDX case) because there the whole memory is already 
>>> encrypted.
>>>   */
>> But doesn't this mean that we can't ioremap() normal memory?
> It's not allowed anyway: see (io_desc.flags & IORES_MAP_SYSTEM_RAM) in the
> __ioremap_caller().
> 
>> I was somehow expecting that we would need to do this for some
>> host<->guest communication pages.
> It goes though DMA API, not ioremap().

Ahh, got it.  Thanks for the clarification.

It would help to make mention of that stuff in the changelog to make it
more obvious going forward.


Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

2021-04-06 Thread Dave Hansen
On 4/6/21 8:54 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 01:13:16PM -0700, Dave Hansen wrote:
>>> @@ -56,6 +61,9 @@ static void tdx_get_info(void)
>>>  
>>> td_info.gpa_width = rcx & GENMASK(5, 0);
>>> td_info.attributes = rdx;
>>> +
>>> +   /* Exclude Shared bit from the __PHYSICAL_MASK */
>>> +   physical_mask &= ~tdx_shared_mask();
>>>  }
>> I wish we had all of these 'physical_mask' manipulations in a single
>> spot.  Can we consolidate these instead of having TDX and SME poke at
>> them individually?
> SME has to do it very early -- from __startup_64() -- as it sets the bit
> on all memory, except what used for communication. TDX can postpone as we
> don't need any shared mapping in very early boot.
> 
> Basically, to make it done from the same place we would need to move TDX
> enumeration earlier into boot. It's risky: everything is more fragile
> there.
> 
> I would rather keep it as is. We should be fine as long as we only allow
> to clear bits from the mask.

I'll buy that.  Could you mention it in the changelog, please?


Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code

2021-04-06 Thread Dave Hansen
On 4/6/21 8:37 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" 
>>>
>>> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
>>> required for communication with VMM suppose to be shared explicitly by
>>
>> s/suppose to/must/
> 
> Right.
> 
>>> setting the bit in page table entry. The shared memory is similar to
>>> unencrypted memory in AMD SME/SEV terminology.
>>
>> In addition to setting the page table bit, there's also a dance to go
>> through to convert the memory.  Please mention the procedure here at
>> least.  It's very different from SME.
> 
> "
>   After setting the shared bit, the conversion must be completed with
>   MapGPA TDVMCALL. The call informs VMM about the conversion and makes it
>   remove the GPA from the S-EPT mapping.
> "

Where does the TDX module fit in here?

>>> force_dma_unencrypted() has to return true for TDX guest. Move it out of
>>> AMD SME code.
>>
>> You lost me here.  What does force_dma_unencrypted() have to do with
>> host/guest shared memory?
> 
> "
>   AMD SEV makes force_dma_unencrypted() return true which triggers
>   set_memory_decrypted() calls on all DMA allocations. TDX will use the
>   same code path to make DMA allocations shared.
> "

SEV assumes that I/O devices can only do DMA to "decrypted" physical
addresses without the C-bit set.  In order for the CPU to interact with
this memory, the CPU needs a decrypted mapping.

TDX is similar.  TDX architecturally prevents access to private guest
memory by anything other than the guest itself.  This means that any DMA
buffers must be shared.

Right?

>>> Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
>>> selected by all x86 memory encryption features.
>>
>> Please also mention what will set it.  I assume TDX guest support will
>> set this option.  It's probably also worth a sentence to say that
>> force_dma_unencrypted() will have TDX-specific code added to it.  (It
>> will, right??)
> 
> "
>   Only AMD_MEM_ENCRYPT uses the option now. TDX will be the second one.
> "
> 
>>> This is preparation for TDX changes in DMA code.
>>
>> Probably best to also mention that this effectively just moves code
>> around.  This patch should have no functional changes at runtime.
> 
> Isn't it what the subject says? :P

Yes, but please mention it explicitly.


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-06 Thread Dave Hansen
On 4/6/21 12:44 AM, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>> TDX architecture aims to provide resiliency against confidentiality and
>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>> the enabling of memory integrity for all TD-private memory.
>>
>> The CPU memory controller computes the integrity check value (MAC) for
>> the data (cache line) during writes, and it stores the MAC with the
>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>
>> Checking of memory integrity is performed during memory reads. If
>> integrity check fails, CPU poisons cache line.
>>
>> On a subsequent consumption (read) of the poisoned data by software,
>> there are two possible scenarios:
>>
>>   - Core determines that the execution can continue and it treats
>>     poison with exception semantics signaled as a #MCE
>>
>>   - Core determines execution cannot continue, and it does an unbreakable
>>     shutdown
>>
>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>
>> As some of integrity check failures may lead to system shutdown host
>> kernel must not allow any writes to TD-private memory. This requirement
>> clashes with KVM design: KVM expects the guest memory to be mapped into
>> host userspace (e.g. QEMU).
> 
> So what you are saying is that if QEMU would write to such memory, it
> could crash the kernel? What a broken design.

IMNHO, the broken design is mapping the memory to userspace in the first
place.  Why the heck would you actually expose something with the MMU to
a context that can't possibly meaningfully access or safely write to it?

This started with SEV.  QEMU creates normal memory mappings with the SEV
C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
those are instantiated, they have the C-bit set.  So, we have mismatched
mappings.  Where does that lead?  The two mappings not only differ in
the encryption bit, causing one side to read gibberish if the other
writes: they're not even cache coherent.

That's the situation *TODAY*, even ignoring TDX.

BTW, I'm pretty sure I know the answer to the "why would you expose this
to userspace" question: it's what QEMU/KVM did alreadhy for
non-encrypted memory, so this was the quickest way to get SEV working.

So, I don't like the #MC either.  But, this series is a step in the
right direction for TDX *AND* SEV.


Re: [RFC 2/3] vmalloc: Support grouped page allocations

2021-04-05 Thread Dave Hansen
On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> +static void __dispose_pages(struct list_head *head)
> +{
> + struct list_head *cur, *next;
> +
> + list_for_each_safe(cur, next, head) {
> + list_del(cur);
> +
> + /* The list head is stored at the start of the page */
> + free_page((unsigned long)cur);
> + }
> +}

This is interesting.

While the page is in the allocator, you're using the page contents
themselves to store the list_head.  It took me a minute to figure out
what you were doing here because: "start of the page" is a bit
ambiguous.  It could mean:

 * the first 16 bytes in 'struct page'
or
 * the first 16 bytes in the page itself, aka *page_address(page)

The fact that this doesn't work on highmem systems makes this an OK thing
to do, but it is a bit weird.  It's also doubly susceptible to bugs
where there's a page_to_virt() or virt_to_page() screwup.

I was *hoping* there was still sufficient space in 'struct page' for
this second list_head in addition to page->lru.  I think there *should*
be.  That would at least make this allocator a bit more "normal" in not
caring about page contents while the page is free in the allocator.  If
you were able to do that you could do things like kmemcheck or page
alloc debugging while the page is in the allocator.

Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
But, if that doesn't work out, please comment the snot out of this thing
because it _is_ weird.
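
For reference, the disposal side would shrink to something like this if
a list_head can be carved out of 'struct page' (sketch only; I'm using
page->lru here, but whichever union member turns out to be free works
the same way):

	static void __dispose_pages(struct list_head *head)
	{
		struct page *page, *next;

		list_for_each_entry_safe(page, next, head, lru) {
			list_del(&page->lru);
			__free_page(page);
		}
	}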


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-04 Thread Dave Hansen
It occurred to me that I've been doing a lot of digging in the TDX spec
lately.  I think we can all agree that the "Architecture Specification"
is not the world's easiest, most disgestable reading.  It's hard to
figure out what the Linux relation to the spec is.

One bit of Documentation we need for TDX is a description of the memory
states.  For instance, it would be nice to spell out the different
classes of memory, how they are selected, who selects them, and who
enforces the selection.  What faults are generated on each type and who
can induce those?

For instance:

TD-Private memory is selected by the Shared/Private bit in Present=1
guest PTEs.  When the hardware page walker sees that bit, it walk the
secure EPT.  The secure EPT entries can only be written by the TDX
module, although they are written at the request of the VMM.  The TDX
module enforces rules like ensuring that the memory mapped by secure EPT
is not mapped multiple times.  The VMM can remove entries.  From the
guest perspective, all private memory accesses are either successful, or
result in a #VE.  Private memory access does not cause VMExits.

Would that be useful to folks?


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-03 Thread Dave Hansen
On 4/2/21 2:32 PM, Andi Kleen wrote:
>> If we go this route, what are the rules and restrictions?  Do we have to
>> say "no MMIO in #VE"?
> 
> All we have to say is "No MMIO in #VE before getting thd TDVEINFO arguments"
> After that it can nest without problems.

Well, not exactly.  You still can't do things that could cause an
unbounded recursive #VE.

It doesn't seem *that* far fetched to think that someone might try to
defer some work or dump data to the console.

> If you nest before that the TDX will cause a triple fault.
> 
> The code that cannot do it is a few lines in the early handler which
> runs with interrupts off.

>> Which brings up another related point: How do you debug TD guests?  Does
>> earlyprintk work?
> 
> Today it works actually because serial ports are allowed. But I expect it to
> be closed eventually because serial code is a lot of code to audit. 
> But you can always disable the filtering with a command line option and
> then it will always work for debugging.

Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
to implement.


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-02 Thread Dave Hansen
On 4/2/21 1:20 PM, Jethro Beekman wrote:
> On 2021-04-02 21:50, Dave Hansen wrote:
>> Again, how does this save space?
>>
>> Are you literally talking about the temporary cost of allocating *one* page?
> 
> No I'm talking about the amount of disk space/network traffic needed
> to distribute the application.

Am I just horribly confused about how executable formats work?

Executables go through an SGX loader that copies them into SGX memory
with the kernel's help.

That executable can have *ANY* format, really.

Then, a loader needs to read that format and turn it into data that can
be shoved into the kernel.  The simplest way to do this is to just
mmap() the executable and point the kernel ioctl()'s at the executable's
pages one-by-one.

The other way a loader *could* work is to copy the data to a temporary
location and then hand the temporary location to the SGX ioctl()s.

Let's say the kernel *REQUIRED* page-aligned and page-sized ioctl()
arguments forever.  If an executable had a 256-byte-sized chunk of data,
all the loader would have to do is allocate a page, copy the data in
there, and then pass that whole page into the ioctl().

In other words, the loading restrictions (page alignment) have little to
nothing to do with the format of the thing loading the executable.

But, in no way does a kernel page-size-based ABI *REQUIRE* that an
underlying binary format represent things only in page-sized chunks.
Look at how many page-sized executables there are in /bin.  Yet, they
can only be mapped into the address space in PAGE_SIZE increments.
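
In loader terms, the "temporary location" dance is tiny.  A sketch, with
every add-pages field other than src/length omitted, and where 'chunk'
and 'enclave_fd' are whatever the loader already has in hand:

	struct sgx_enclave_add_pages addp = {};
	void *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	memset(page, 0, PAGE_SIZE);
	memcpy(page, chunk, 256);		/* the 256 bytes of real data */

	addp.src    = (unsigned long)page;	/* page-aligned source */
	addp.length = PAGE_SIZE;		/* page-sized length */
	ioctl(enclave_fd, SGX_IOC_ENCLAVE_ADD_PAGES, &addp);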

>>>> Does any actual, real-world enclave want this functionality?  Why?
>>
>> I didn't see an answer on this one.
> 
> Yes, we have enclaves that use this functionality. They already exist
> so they can't be changed (without changing the measurement) and we'd
> like to stop using the out of tree driver as soon as possible.
> However, we are not able to load the enclaves.
OK, so please give this series another shot.  Please explain why you
*ACTUALLY* need it and what the goals are.  Please explain why you can't
just relax the restrictions of the existing add ioctl() to take


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-02 Thread Dave Hansen


On 4/2/21 12:38 PM, Jethro Beekman wrote:
> On 2021-04-02 20:42, Dave Hansen wrote:
>> On 4/2/21 11:31 AM, Jethro Beekman wrote:
>>> On 2021-04-02 17:53, Dave Hansen wrote:
>>>> But, why would an enclave loader application ever do this?
>>> 
>>> e.g. to save space
>> 
>> How does this save space, exactly?  What space does it save?
> 
> With the current driver interface, if you want to communicate an 
> application binary that has pages that are at least partially
> measured, you need to communicate the entire page (to ensure the same
> measurement for the entire page), even though most of  that page's contents
> are irrelevant to the application.

Again, how does this save space?

Are you literally talking about the temporary cost of allocating *one* page?

>> We don't blindly support CPU features in Linux.  They need to do
>> something *useful*.  I'm still missing what this does which is
>> useful.
> 
> Enclaves can only be loaded exactly as specified by the developer,
> this is the whole point of the measurement architecture. By not
> supporting arbitrary EADD/EEXTEND flows, the SGX application ecosystem
> is effectively split into two: SGX applications that run everywhere and
> SGX applications that run everywhere except on Linux. So, the "something
> useful" is being compatible. Linux has plenty of features that exist
> solely for compatibility with other systems, such as binfmt_misc.

That's a mildly compelling argument.  Is it theoretical or practical?
Are folks *practically* going to run the same enclave binaries on Linux
and Windows?

I guess the enclave never interacts with the OS directly, so this is
_possible_.  But, are enclaves really that divorced from the "runtimes"
which *are* OS-specific?

>> Does any actual, real-world enclave want this functionality?  Why?

I didn't see an answer on this one.

>> P.S. There are plenty of things you can do with the SGX
>> architecture that we probably won't ever implement in Linux.
> 
> How so? 

For example, the architecture allows swapping VA pages and guest enclave
pages.  But, we may never do either of those.


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-02 Thread Dave Hansen
On 4/2/21 11:31 AM, Jethro Beekman wrote:
> On 2021-04-02 17:53, Dave Hansen wrote:
>> On 4/2/21 1:38 AM, Jethro Beekman wrote:
>>>> So, we're talking here about pages that have been EEADDED, but for
>>>> which we do not want to include the entire contents of the page?
>>>> Do these contents always include the beginning of the page, or can
>>>> the holes be anywhere?
>>> Holes can be anywhere, and EEXTEND calls need not be sequential in
>>> memory address or even relate to the most recently EADDed page.
>>
>> I think you're referring to the SGX architecture itself here.  The
>> architecture permits this, right?
> 
> Yes.
> 
>> But, why would an enclave loader application ever do this? 
> 
> e.g. to save space

How does this save space, exactly?  What space does it save?

Let's say I want to add 4352 bytes of data to an enclave.  Today, I have
to page-align the beginning and end of that 4352 bytes, and call the add
ioctl() to add the two resulting pages.  It consumes two EPC pages.

With EEXTEND, if I want to add the data, I only need to page-align the
beginning of the data.  I call add_page on the first page, then eextend
on the 256 bytes.  It consumes two EPC pages.

I guess you can argue that this saves padding out the second page, which
would *theoretically* temporarily eat up one extra page of non-enclave
memory and the cost of a 256-byte memcpy.

>> Is this something we want to support in Linux?
> 
> Why not? Is there a good reason to not fully support this part of the
> CPU architecture?

We don't blindly support CPU features in Linux.  They need to do
something *useful*.  I'm still missing what this does which is useful.

Does any actual, real-world enclave want this functionality?  Why?

P.S. There are plenty of things you can do with the SGX architecture
that we probably won't ever implement in Linux.


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-02 Thread Dave Hansen
On 4/2/21 1:38 AM, Jethro Beekman wrote:
>> So, we're talking here about pages that have been EEADDED, but for
>> which we do not want to include the entire contents of the page?
>> Do these contents always include the beginning of the page, or can
>> the holes be anywhere?
> Holes can be anywhere, and EEXTEND calls need not be sequential in
> memory address or even relate to the most recently EADDed page.

I think you're referring to the SGX architecture itself here.  The
architecture permits this, right?

But, why would an enclave loader application ever do this?  Is this
something we want to support in Linux?


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-02 Thread Dave Hansen
On 4/1/21 7:48 PM, Andi Kleen wrote:
>> I've heard things like "we need to harden the drivers" or "we need to do
>> audits" and that drivers might be "whitelisted".
> 
> The basic driver allow listing patches are already in the repository,
> but not currently posted or complete:
> 
> https://github.com/intel/tdx/commits/guest

That lists exactly 8 ids:

>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000 }, /* Virtio NET */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001 }, /* Virtio block */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1003 }, /* Virtio console */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1009 }, /* Virtio FS */
> 
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041 }, /* Virtio 1.0 NET */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042 }, /* Virtio 1.0 block */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043 }, /* Virtio 1.0 console */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1049 }, /* Virtio 1.0 FS */

How many places do those 8 drivers touch MMIO?

>> Are there any "real" hardware drivers
>> involved like how QEMU emulates an e1000 or rtl8139 device?  
> 
> Not currently (but some later hypervisor might rely on one of those)
> 
>> What about
>> the APIC or HPET?
> 
> No IO-APIC, but the local APIC. No HPET.

Sean seemed worried about other x86-specific oddities.  Are there any
more, or is the local APIC the only non-driver MMIO?

>> Without something concrete, it's really hard to figure out if we should
>> go full-blown paravirtualized MMIO, or do something like the #VE
>> trapping that's in this series currently.
> 
> As Sean says the concern about MMIO is less drivers (which should
> be generally ok if they work on other architectures which require MMIO
> magic), but other odd code that only ran on x86 before.
> 
> I really don't understand your crusade against #VE. It really
> isn't that bad if we can avoid the few corner cases.

The problem isn't with #VE per se.  It's with posting a series that
masquerades as a full solution while *NOT* covering or even enumerating
the corner cases.  That's exactly what happened with #VE to start with:
it was implemented in a way that exposed the kernel to #VE during the
syscall gap (and the SWAPGS gap for that matter).

So, I'm pushing for a design that won't have corner cases.  If MMIO
itself is disallowed, then we can scream about *any* detected MMIO.
Then, there's no worry about #VE nesting.  No #VE, no #VE nesting.  We
don't even have to consider if #VE needs NMI-like semantics.

> For me it would seem wrong to force all MMIO for all drivers to some
> complicated paravirt construct, blowing up code side everywhere
> and adding complicated self modifying code, when it's only needed for very
> few drivers. But we also don't want to patch every MMIO to be special cased
> even those few drivers.
> 
> #VE based MMIO avoids all that cleanly while being nicely non intrusive.

But, we're not selling used cars here.  Using #VE has downsides.
Let's not pretend that it doesn't.

If we go this route, what are the rules and restrictions?  Do we have to
say "no MMIO in #VE"?

I'm really the most worried about the console.  Consoles and NMIs have
been a nightmare, IIRC.  Doesn't this just make it *WORSE* because now
the deepest reaches of the console driver are guaranteed to #VE?

Which brings up another related point: How do you debug TD guests?  Does
earlyprintk work?


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.

I want to hear a lot more about this driver model.

I've heard things like "we need to harden the drivers" or "we need to do
audits" and that drivers might be "whitelisted".

What are we talking about specifically?  Which drivers?  How many
approximately?  Just virtio?  Are there any "real" hardware drivers
involved like how QEMU emulates an e1000 or rtl8139 device?  What about
the APIC or HPET?

How broadly across the kernel is this going to go?

Without something concrete, it's really hard to figure out if we should
go full-blown paravirtualized MMIO, or do something like the #VE
trapping that's in this series currently.


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-01 Thread Dave Hansen
On 4/1/21 3:35 PM, Wei Xu wrote:
> A small suggestion: Given that migrate_pages() requires that
> *nr_succeeded should be initialized to 0 when it is called due to its
> use of *nr_succeeded in count_vm_events() and trace_mm_migrate_pages(),
> it would be less error-prone if migrate_pages() initializes
> *nr_succeeded itself.

That's a good point, especially if a caller made multiple
migrate_pages() calls without resetting it, which a number of callers
do.  That could really have caused some interesting problems.  Thanks
for catching that!

I'll do what you suggested.


Re: [PATCH 05/10] mm/migrate: demote pages during reclaim

2021-04-01 Thread Dave Hansen
On 4/1/21 1:01 PM, Yang Shi wrote:
> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  
> wrote:
>>
>>
>> From: Dave Hansen 
>>
>> This is mostly derived from a patch from Yang Shi:
>>
>> 
>> https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/
>>
>> Add code to the reclaim path (shrink_page_list()) to "demote" data
>> to another NUMA node instead of discarding the data.  This always
>> avoids the cost of I/O needed to read the page back in and sometimes
>> avoids the writeout cost when the pagee is dirty.
> 
> s/pagee/page
> 
>>
>> A second pass through shrink_page_list() will be made if any demotions
>> fail.  This essentally falls back to normal reclaim behavior in the
> 
> s/essentally/essentially
> 
>> case that demotions fail.  Previous versions of this patch may have
>> simply failed to reclaim pages which were eligible for demotion but
>> were unable to be demoted in practice.
>>
>> Note: This just adds the start of infratructure for migration. It is
> 
> s/infratructure/infrastructure

Thanks for finding those!  I somehow got really sloppy in this patch.  I
scanned the rest of the descriptions and don't see any more obvious
typos/misspellings.

>> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
>> +{
>> +   struct migration_target_control mtc = {
>> +   /*
>> +* Allocate from 'node', or fail the quickly and quietly.
>> +* When this happens, 'page; will likely just be discarded
> 
> s/'page;/'page'

Fixed.  Thanks for the review tag!



Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-04-01 Thread Dave Hansen
On 4/1/21 3:26 PM, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" 
>>>
>>> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
>>> exit reason.
>>>
>>> For now we only handle subset of instruction that kernel uses for MMIO
>>> operations. User-space access triggers SIGBUS.
>> ..
>>> +   case EXIT_REASON_EPT_VIOLATION:
>>> +   ve->instr_len = tdx_handle_mmio(regs, ve);
>>> +   break;
>>
>> Is MMIO literally the only thing that can cause an EPT violation for TDX
>> guests?
> 
> Any EPT Violation, or specifically EPT Violation #VE?  Any memory access can
> cause an EPT violation, but the VMM will get the ones that lead to VM-Exit.  
> The
> guest will only get the ones that cause #VE.

I'll rephrase: Is MMIO literally the only thing that can cause us to get
into the EXIT_REASON_EPT_VIOLATION case of the switch() here?

> Assuming you're asking about #VE... No, any shared memory access can take a 
> #VE
> since the VMM controls the shared EPT tables and can clear the SUPPRESS_VE 
> bit 
> at any time.  But, if the VMM is friendly, #VE should be limited to MMIO.

OK, but what are we doing in the case of unfriendly VMMs?  What does
*this* code do as-is, and where do we want to take it?

From the _looks_ of this patch, tdx_handle_mmio() is the be all end all
solution to all EXIT_REASON_EPT_VIOLATION events.

>> But for an OS where we have source for the *ENTIRE* thing, and where we
>> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
>> seems like an *AWFUL* idea to:
>> 1. Have the kernel set up special mappings for I/O memory
>> 2. Kernel generates special instructions to access that memory
>> 3. Kernel faults on that memory
>> 4. Kernel cracks its own special instructions to see what they were
>>doing
>> 5. Kernel calls up to host to do the MMIO
>>
>> Instead of doing 2/3/4, why not just have #2 call up to the host
>> directly?  This patch seems a very slow, roundabout way to do
>> paravirtualized MMIO.
>>
>> BTW, there's already some SEV special-casing in io.h.
> 
> I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
> code is floating around somewhere.  The gotcha is that there are nasty little
> pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
> APIC code likes to access MMIO via a struct overlay, so the compiler is free 
> to
> use any instruction that satisfies the constraint.

So, there aren't an infinite number of these.  It's also 100% possible
to add some tooling to the kernel today to help you find these.  You
could also have added tooling to KVM hosts to help find these.

Folks are *also* saying that we'll need a driver audit just to trust
that drivers aren't vulnerable to attacks from devices or from the host.
 This can quite easily be a part of that effort.

> The I/O APIC can and should be forced off, but dollars to donuts says there 
> are
> more special snowflakes lying in wait.  If the kernel uses an allowlist for
> drivers, then in theory it should be possible to hunt down all offenders.  But
> I think we'll want fallback logic to handle kernel MMIO #VEs, especially if 
> the
> kernel needs ISA cracking logic for userspace.  Without fallback logic, any 
> MMIO
> #VE from the kernel would be fatal, which is too harsh IMO since the behavior
> isn't so obviously wrong, e.g. versus the split lock #AC purge where there's 
> no
> legitimate reason for the kernel to generate a split lock.

I'll buy that this patch is convenient for *debugging*.  It helped folks
bootstrap the TDX support and get it going.

IMNHO, if a driver causes a #VE, it's a bug.  Just like if it goes off
the rails and touches bad memory and #GP's or #PF's.

Are there any printk's in the #VE handler?  Guess what those do.  Print
to the console.  Guess what consoles do.  MMIO.  You can't get away from
doing audits of the console drivers.  Sure, you can go make #VE special,
like NMIs, but that's not going to be fun.  At least the guest doesn't
have to deal with the fatality of a nested #VE, but it's still fatal.

I just don't like us pretending that we're Windows and have no control
over the code we run and throwing up our hands.


Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface

2021-04-01 Thread Dave Hansen
On 4/1/21 2:15 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/1/21 2:08 PM, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> +bool is_tdx_guest(void)
>>> +{
>>> +    return static_cpu_has(X86_FEATURE_TDX_GUEST);
>>> +}
>>
>> Why do you need is_tdx_guest() as opposed to calling
>> cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?
> 
> is_tdx_guest() is also implemented/used in compressed
> code (which uses native_cpuid calls). I don't think
> we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
> compressed code right? Also is_tdx_guest() looks easy
> to read and use.

OK, but how many of the is_tdx_guest() uses are in the compressed code?
 Why has its use spread beyond that?


Re: [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> TDX doesn't allow to perform DMA access to guest private memory.
> In order for DMA to work properly in TD guest, user SWIOTLB bounce
> buffers.
> 
> Move AMD SEV initialization into common code and adopt for TDX.

This would be best if it can draw a parallel between TDX and SEV.

>  arch/x86/kernel/pci-swiotlb.c|  2 +-
>  arch/x86/kernel/tdx.c|  3 +++
>  arch/x86/mm/mem_encrypt.c| 44 ---
>  arch/x86/mm/mem_encrypt_common.c | 45 
>  4 files changed, 49 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> index c2cfa5e7c152..020e13749758 100644
> --- a/arch/x86/kernel/pci-swiotlb.c
> +++ b/arch/x86/kernel/pci-swiotlb.c
> @@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
>* buffers are allocated and used for devices that do not support
>* the addressing range required for the encryption mask.
>*/
> - if (sme_active())
> + if (sme_active() || is_tdx_guest())
>   swiotlb = 1;
>  
>   return swiotlb;
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index f51a19168adc..ccb9401bd706 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include  /* force_sig_fault() */
> +#include 
>  
>  #ifdef CONFIG_KVM_GUEST
>  #include "tdx-kvm.c"
> @@ -472,6 +473,8 @@ void __init tdx_early_init(void)
>  
>   legacy_pic = _legacy_pic;
>  
> + swiotlb_force = SWIOTLB_FORCE;

Dumb question time.  But, what is the difference between

swiotlb = 1;

and

swiotlb_force = SWIOTLB_FORCE;

It would be nice of the patch to enable me to be a lazy reviewer.

>   cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
> NULL, tdx_cpu_offline_prepare);
>  
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 11a6a7b3af7e..7fbbb2f3d426 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c

Should we be renaming this to amd_mem_encrypt.c or something?

...
> -  */
> - if (sev_active())
> - static_branch_enable(&sev_enable_key);
> -
> - print_mem_encrypt_feature_info();
> -}
> -
> diff --git a/arch/x86/mm/mem_encrypt_common.c 
> b/arch/x86/mm/mem_encrypt_common.c
> index b6d93b0c5dcf..6f3d90d4d68e 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED 
> */
>  bool force_dma_unencrypted(struct device *dev)
> @@ -36,3 +37,47 @@ bool force_dma_unencrypted(struct device *dev)
>  
>   return false;
>  }
> +
> +static void print_mem_encrypt_feature_info(void)
> +{

This function is now named wrong IMNHO.  If it's about AMD only, it
needs AMD in the name.

> + pr_info("AMD Memory Encryption Features active:");
> +
> + /* Secure Memory Encryption */
> + if (sme_active()) {
> + /*
> +  * SME is mutually exclusive with any of the SEV
> +  * features below.
> +  */
> + pr_cont(" SME\n");
> + return;
> + }
> +
> + /* Secure Encrypted Virtualization */
> + if (sev_active())
> + pr_cont(" SEV");
> +
> + /* Encrypted Register State */
> + if (sev_es_active())
> + pr_cont(" SEV-ES");
> +
> + pr_cont("\n");
> +}

I'm really tempted to say this needs to be off in arch/x86/kernel/cpu/amd.c

> +/* Architecture __weak replacement functions */
> +void __init mem_encrypt_init(void)
> +{
> + if (!sme_me_mask && !is_tdx_guest())
> + return;

The direct check of sme_me_mask looks odd now.  What does this *MEAN*?
Are we looking to jump out of here if no memory encryption is enabled?

I'd much rather this look more like:

	if (!x86_memory_encryption())
		return;
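
The helper itself could be as dumb as this (a sketch; the name comes
from the suggestion above, and the exact set of checks is up for grabs):

	static inline bool x86_memory_encryption(void)
	{
		/* covers SME/SEV via sme_me_mask, plus TDX guests */
		return sme_me_mask || is_tdx_guest();
	}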

> + /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> + swiotlb_update_mem_attributes();
> + /*
> +  * With SEV, we need to unroll the rep string I/O instructions.
> +  */
> + if (sev_active())
> + static_branch_enable(&sev_enable_key);
> +
> + if (!is_tdx_guest())
> + print_mem_encrypt_feature_info();
> +}



Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> +bool is_tdx_guest(void)
> +{
> + return static_cpu_has(X86_FEATURE_TDX_GUEST);
> +}

Why do you need is_tdx_guest() as opposed to calling
cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?


Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared

2021-04-01 Thread Dave Hansen
> +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
> +{
> + int ret, i;
> +
> + ret = __tdx_map_gpa(gpa, numpages, private);
> + if (ret || !private)
> + return ret;
> +
> + for (i = 0; i < numpages; i++)
> + tdx_accept_page(gpa + i*PAGE_SIZE);
> +
> + return 0;
> +}

Please do something like this:

enum tdx_map_type {
	TDX_MAP_PRIVATE,
	TDX_MAP_SHARED
};

Then, your calls will look like:

tdx_map_gpa(gpa, nr, TDX_MAP_SHARED);

instead of:

tdx_map_gpa(gpa, nr, false);

>  static __cpuidle void tdx_halt(void)
>  {
>   register long r10 asm("r10") = TDVMCALL_STANDARD;
> diff --git a/arch/x86/mm/mem_encrypt_common.c 
> b/arch/x86/mm/mem_encrypt_common.c
> index 964e04152417..b6d93b0c5dcf 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -15,9 +15,9 @@
>  bool force_dma_unencrypted(struct device *dev)
>  {
>   /*
> -  * For SEV, all DMA must be to unencrypted/shared addresses.
> +  * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
>*/
> - if (sev_active())
> + if (sev_active() || is_tdx_guest())
>   return true;
>  
>   /*
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..6f23a9816ef0 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "../mm_internal.h"
>  
> @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int 
> numpages, bool enc)
>   struct cpa_data cpa;
>   int ret;
>  
> - /* Nothing to do if memory encryption is not active */
> - if (!mem_encrypt_active())
> + /* Nothing to do if memory encryption and TDX are not active */
> + if (!mem_encrypt_active() && !is_tdx_guest())
>   return 0;

So, this is starting to look like the "enc" naming is wrong, or at least
a little misleading.   Should we be talking about "protection" or
"guards" or something?

>   /* Should not be working on unaligned addresses */
> @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, 
> int numpages, bool enc)
>   memset(&cpa, 0, sizeof(cpa));
>   cpa.vaddr = &addr;
>   cpa.numpages = numpages;
> - cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
> - cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
> + if (is_tdx_guest()) {
> + cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
> + cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
> + } else {
> + cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
> + cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
> + }

OK, this is too hideous to live.  It sucks that the TDX and SEV/SME bits
are opposite polarity, but oh well.

To me, this gets a lot clearer, and opens up room for commenting if you
do something like:

	if (is_tdx_guest()) {
		mem_enc_bits   = 0;
		mem_plain_bits = tdx_shared_mask();
	} else {
		mem_enc_bits   = _PAGE_ENC;
		mem_plain_bits = 0;
	}

	if (enc) {
		cpa.mask_set = mem_enc_bits;
		cpa.mask_clr = mem_plain_bits;	// clear "plain" bits
	} else {
		cpa.mask_set = mem_plain_bits;
		cpa.mask_clr = mem_enc_bits;	// clear encryption bits
	}

>   cpa.pgd = init_mm.pgd;
>  
>   /* Must avoid aliasing mappings in the highmem code */
> @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int 
> numpages, bool enc)
>   /*
>* Before changing the encryption attribute, we need to flush caches.
>*/
> - cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> + if (!enc || !is_tdx_guest())
> + cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));

That "!enc" looks wrong to me.  Caches would need to be flushed whenever
encryption attributes *change*, not just when they are set.

Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
not need TLB flushes?

>   ret = __change_page_attr_set_clr(&cpa, 1);
>  
> @@ -2012,6 +2020,11 @@ static int __set_memory_enc_dec(unsigned long addr, 
> int numpages, bool enc)
>*/
>   cpa_flush(&cpa, 0);
>  
> + if (!ret && is_tdx_guest()) {
> + ret = tdx_map_gpa(__pa(addr), numpages, enc);
> + // XXX: need to undo on error?
> + }

Time to fix this stuff up if you want folks to take this series more
seriously.


Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> All ioremap()ed paged that are not backed by normal memory (NONE or
> RESERVED) have to be mapped as shared.

s/paged/pages/


> +/* Make the page accesable by VMM */
> +#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | 
> tdx_shared_mask())
> +
>  #ifndef __ASSEMBLY__
>  #include 
>  #include 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 9e5ccc56f8e0..a0ba760866d4 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource 
> *res)
>  }
>  
>  /*
> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
> - * there the whole memory is already encrypted.
> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted 
> (or
> + * private in TDX case) because there the whole memory is already encrypted.
>   */

But doesn't this mean that we can't ioremap() normal memory?  I was
somehow expecting that we would need to do this for some host<->guest
communication pages.

>  static unsigned int __ioremap_check_encrypted(struct resource *res)
>  {
> - if (!sev_active())
> + if (!sev_active() && !is_tdx_guest())
>   return 0;
>  
>   switch (res->desc) {
> @@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long 
> size,
>   prot = PAGE_KERNEL_IO;
>   if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
>   prot = pgprot_encrypted(prot);
> + else if (is_tdx_guest())
> + prot = pgprot_tdx_shared(prot);



Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> tdx_shared_mask() returns the mask that has to be set in page table
> entry to make page shared with VMM.

Needs to be either:

has to be set in a page table entry
or
has to be set in page table entries

Pick one, please.  But, the grammar is wrong as-is.


> index 8fa654d61ac2..f10a00c4ad7f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -875,6 +875,7 @@ config INTEL_TDX_GUEST
>   select PARAVIRT_XL
>   select X86_X2APIC
>   select SECURITY_LOCKDOWN_LSM
> + select X86_MEM_ENCRYPT_COMMON
>   help
> Provide support for running in a trusted domain on Intel processors
> equipped with Trusted Domain eXtenstions. TDX is an new Intel
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b46ae140e39b..9bbfe6520ea4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -104,5 +104,6 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long 
> p1, unsigned long p2,
>  long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>   unsigned long p3, unsigned long p4);
>  
> +phys_addr_t tdx_shared_mask(void);

I know it's redundant, but extern this, please.  Ditto for all the other
declarations in that header.

>  #endif
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index ae37498df981..9681f4a0b4e0 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -41,6 +41,11 @@ bool is_tdx_guest(void)
>  }
>  EXPORT_SYMBOL_GPL(is_tdx_guest);
>  
> +phys_addr_t tdx_shared_mask(void)
> +{
> + return 1ULL << (td_info.gpa_width - 1);
> +}

A comment would be helpful:

/* The highest bit of a guest physical address is the "sharing" bit */

>  static void tdx_get_info(void)
>  {
>   register long rcx asm("rcx");
> @@ -56,6 +61,9 @@ static void tdx_get_info(void)
>  
>   td_info.gpa_width = rcx & GENMASK(5, 0);
>   td_info.attributes = rdx;
> +
> + /* Exclude Shared bit from the __PHYSICAL_MASK */
> + physical_mask &= ~tdx_shared_mask();
>  }

I wish we had all of these 'physical_mask' manipulations in a single
spot.  Can we consolidate these instead of having TDX and SME poke at
them individually?


Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> required for communication with VMM suppose to be shared explicitly by

s/suppose to/must/

> setting the bit in page table entry. The shared memory is similar to
> unencrypted memory in AMD SME/SEV terminology.

In addition to setting the page table bit, there's also a dance to go
through to convert the memory.  Please mention the procedure here at
least.  It's very different from SME.

> force_dma_unencrypted() has to return true for TDX guest. Move it out of
> AMD SME code.

You lost me here.  What does force_dma_unencrypted() have to do with
host/guest shared memory?

> Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
> selected by all x86 memory encryption features.

Please also mention what will set it.  I assume TDX guest support will
set this option.  It's probably also worth a sentence to say that
force_dma_unencrypted() will have TDX-specific code added to it.  (It
will, right??)

> This is preparation for TDX changes in DMA code.

Probably best to also mention that this effectively just moves code
around.  This patch should have no functional changes at runtime.


> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0374d9f262a5..8fa654d61ac2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
> helps to determine the effectiveness of preserving large and huge
> page mappings when mapping protections are changed.
>  
> +config X86_MEM_ENCRYPT_COMMON
> + select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> + select DYNAMIC_PHYSICAL_MASK
> + def_bool n
> +
>  config AMD_MEM_ENCRYPT
>   bool "AMD Secure Memory Encryption (SME) support"
>   depends on X86_64 && CPU_SUP_AMD
>   select DMA_COHERENT_POOL
> - select DYNAMIC_PHYSICAL_MASK
>   select ARCH_USE_MEMREMAP_PROT
> - select ARCH_HAS_FORCE_DMA_UNENCRYPTED
>   select INSTRUCTION_DECODER
> + select X86_MEM_ENCRYPT_COMMON
>   help
> Say yes to enable support for the encryption of system memory.
> This requires an AMD processor that supports Secure Memory
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index 30a3b30395ad..95e534cffa99 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -257,10 +257,12 @@ static inline void slow_down_io(void)
>  
>  #endif
>  
> -#ifdef CONFIG_AMD_MEM_ENCRYPT
>  #include 
>  
>  extern struct static_key_false sev_enable_key;

This _looks_ odd.  sev_enable_key went from being under
CONFIG_AMD_MEM_ENCRYPT to being unconditionally referenced.

Could you explain a bit more?

I would have expected it to at *least* be tied to the new #ifdef.

> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +
>  static inline bool sev_key_active(void)
>  {
>   return static_branch_unlikely(&sev_enable_key);
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5864219221ca..b31cb52bf1bd 100644
> --- a/arch/x86/mm/Makefile
...


Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.
> 
> For now we only handle subset of instruction that kernel uses for MMIO
> oerations. User-space access triggers SIGBUS.
..
> + case EXIT_REASON_EPT_VIOLATION:
> + ve->instr_len = tdx_handle_mmio(regs, ve);
> + break;

Is MMIO literally the only thing that can cause an EPT violation for TDX
guests?

Forget userspace for a minute.  #VE's from userspace are annoying, but
fine.  We can't control what userspace does.  If an action it takes
causes a #VE in the TDX architecture, tough cookies, the kernel must
handle it and try to recover or kill the app.

The kernel is very different.  We know in advance (must know,
actually...) which instructions might cause exceptions of any kind.
That's why we have exception tables and copy_to/from_user().  That's why
we can handle kernel page faults on userspace, but not inside spinlocks.

Binary-dependent OSes are also very different.  It's going to be natural
for them to want to take existing, signed drivers and use them in TDX
guests.  They might want to do something like this.

But for an OS where we have source for the *ENTIRE* thing, and where we
have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
seems like an *AWFUL* idea to:
1. Have the kernel set up special mappings for I/O memory
2. Kernel generates special instructions to access that memory
3. Kernel faults on that memory
4. Kernel cracks its own special instructions to see what they were
   doing
5. Kernel calls up to host to do the MMIO

Instead of doing 2/3/4, why not just have #2 call up to the host
directly?  This patch seems a very slow, roundabout way to do
paravirtualized MMIO.

BTW, there's already some SEV special-casing in io.h.
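
To make the suggestion concrete, here is the kind of thing I mean for #2
-- a hand-wavy sketch, with a made-up hypercall helper, not a proposal
for the actual interface:

	/* somewhere behind the arch/x86/include/asm/io.h chokepoint */
	static inline u32 tdx_readl(const volatile void __iomem *addr)
	{
		/* tdg_mmio_read() is a made-up helper that does the TDCALL: */
		return tdg_mmio_read((unsigned long)addr, 4);
	}

	static inline u32 paravirt_readl(const volatile void __iomem *addr)
	{
		if (is_tdx_guest())
			return tdx_readl(addr);
		return __raw_readl(addr);
	}

No special mappings, no #VE, no instruction decoding.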


Re: [PATCH 2/2] x86/sgx: Add sgx_nr_{all, free}_pages to the debugfs

2021-04-01 Thread Dave Hansen
On 3/31/21 10:21 PM, Jarkko Sakkinen wrote:
> +#ifdef CONFIG_DEBUG_FS
> + debugfs_create_file("sgx_nr_all_pages", 0400, arch_debugfs_dir, NULL,
> + &sgx_nr_all_pages_fops);
> + debugfs_create_file("sgx_nr_free_pages", 0400, arch_debugfs_dir, NULL,
> + &sgx_nr_free_pages_fops);
> +#endif /* CONFIG_DEBUG_FS */


Why not make the types u64's and use debugfs_create_u64()?  That would
save a ton of code.  There's also debugfs_create_ulong().
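
Something like this (untested), assuming the counters can simply be
u64s:

	static u64 sgx_nr_all_pages;
	static u64 sgx_nr_free_pages;
	...
	debugfs_create_u64("sgx_nr_all_pages", 0400, arch_debugfs_dir,
			   &sgx_nr_all_pages);
	debugfs_create_u64("sgx_nr_free_pages", 0400, arch_debugfs_dir,
			   &sgx_nr_free_pages);

No file_operations, no open/read boilerplate.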


[PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list would
be to follow the page allocator fallback lists.  Those lists
already tell us if memory is full where to look next.  It would
also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to migration
cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.
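
For orientation, here is a deliberately simplified sketch of how such an
order can be derived from find_next_best_node().  This is not the exact
code in this series (the real update code arrives later and is more
careful about hotplug), just the idea:

	static void build_demotion_order(void)
	{
		nodemask_t used_targets = NODE_MASK_NONE;
		int node;

		for_each_online_node(node) {
			int target;

			/* A node never demotes to itself: */
			node_set(node, used_targets);

			/* find_next_best_node() marks its pick as used, too: */
			target = find_next_best_node(node, &used_targets);
			if (target == NUMA_NO_NODE)
				continue;

			node_demotion[node] = target;
		}
	}

Because every source and every chosen target lands in 'used_targets',
no node can be picked twice and no cycles can form.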

Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Changes from 20200122:
 * Add big node_demotion[] comment
Changes from 20210302:
 * Fix typo in node_demotion[] comment
---

 b/mm/internal.h   |5 +
 b/mm/migrate.c|  175 +-
 b/mm/page_alloc.c |2 
 3 files changed, 180 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware 
mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware 
2021-03-31 15:17:11.794000261 -0700
+++ b/mm/internal.h 2021-03-31 15:17:11.816000261 -0700
@@ -515,12 +515,17 @@ static inline void mminit_validate_memmo
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
unsigned int order)
 {
return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+   return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware 
mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware  
2021-03-31 15:17:11.798000261 -0700
+++ b/mm/migrate.c  2021-03-31 15:17:11.821000261 -0700
@@ -1163,6 +1163,44 @@ out:
return rc;
 }
 
+
+/*
+ * node_demotion[] example:
+ *
+ * Consider a system with two sockets.  Each socket has
+ * three classes of memory attached: fast, medium and slow.
+ * Each memory class is placed in its own NUMA node.  The
+ * CPUs are placed in the node with the "fast" memory.  The
+ * 6 NUMA nodes (0-5) might be split among the sockets like
+ * this:
+ *
+ * Socket A: 0, 1, 2
+ * Socket B: 3, 4, 5
+ *
+ * When Node 0 fills up, its memory should be migrated to
+ * Node 1.  When Node 1 fills up, it should be migrated to
+ * Node 2.  The migration path start on the nodes with the
+ * processors (since allocations default to this node) and
+ * fast memory, progress through medium and end with the
+ * slow memory:
+ *
+ * 0 -> 1 -> 2 -> stop
+ * 3 -> 4 -> 5 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ *
+ * {  1, // Node 0 migrates to 1
+ *2, // Node 1 migrates to 2
+ *   -1, // Node 2 does not migrate
+ *4, // Node 3 migrates to 4
+ *5, // Node 4 migrates to 5
+ *   -1} // Node 5 does not migrate
+ */
+
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
@@ -1177,7 +1215,13 @@ static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
-   return node_demotion[node];
+   /*
+* node_demotion[] is updated without excluding
+* this function from running.  READ_ONCE() avoids
+* reading multiple, inconsistent 'node' values
+* during an update.
+*/
+   return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3181,3 +3225,132 @@ void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_P

[PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology.  If the system changes, so must
the migration ordering.

The implementation is conceptually simple and entirely unoptimized.
On any memory or CPU hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that the
node_demotion[] array is always ready to be used in case the new
reclaim mode is enabled.

This recalculation is far from optimal; most glaringly, it does not even
attempt to figure out whether the hotplug event would have any *actual*
effect on the demotion order.  But, given the expected
paucity of hotplug events, this should be fine.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

terminal_node = start_node;
while (node_demotion[terminal_node] != NUMA_NO_NODE) {
terminal_node = node_demotion[terminal_node];
}

The initial values are:

node_demotion[0] = 1;
node_demotion[1] = NUMA_NO_NODE;

and are updated to:

node_demotion[0] = NUMA_NO_NODE;
node_demotion[1] = 0;

What guarantees that the loop did not observe:

node_demotion[0] = 1;
node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the
loop.  Since the write side does a synchronize_rcu(), the loop
that observed the old contents is known to be complete after the
synchronize_rcu() has completed.

RCU, combined with disable_all_migrate_targets(), ensures that
the old migration state is not visible by the time
__set_migration_target_nodes() is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering
successive reads of node_demotion[].  This ensures that any
updates are *eventually* observed.

Consider the above loop again.  The compiler could theoretically
read the entirety of node_demotion[] into local storage
(registers) and never go back to memory, and *permanently*
observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

https://lore.kernel.org/lkml/20150921204327.gh4...@linux.vnet.ibm.com/
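
Putting the two together, the reader side of that loop ends up looking
roughly like this:

	rcu_read_lock();
	terminal_node = start_node;
	for (;;) {
		int next = READ_ONCE(node_demotion[terminal_node]);

		if (next == NUMA_NO_NODE)
			break;
		terminal_node = next;
	}
	rcu_read_unlock();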

Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Changes since 20210302:
 * remove duplicate synchronize_rcu()
---

 b/mm/migrate.c |  152 -
 1 file changed, 129 insertions(+), 23 deletions(-)

diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion 2021-03-31 15:17:13.056000258 -0700
+++ b/mm/migrate.c  2021-03-31 15:17:13.062000258 -0700
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1198,8 +1199,12 @@ out:
  */
 
 /*
- * Writes to this array occur without locking.  READ_ONCE()
- * is recommended for readers to ensure consistent reads.
+ * Writes to this array occur without locking.  Cycles are
+ * not allowed: Node X demotes to Y which demotes to X...
+ *
+ * If multiple reads are performed, a single rcu_read_lock()
+ * must be held over all reads to ensure that no cycles are
+ * observed.
  */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
@@ -1215,13 +1220,22 @@ static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
+   int target;
+
/*
-* node_demotion[] is updated without excluding
-* this function from running.  READ_ONCE() avoids
-* reading multiple, inconsistent 'node' values
-* during an update.
+* node_demotion[] is updated without excluding this
+* function from running.  RCU doesn't provide any
+* compiler barriers, so the READ_ONCE() is required
+* to avoid compiler reordering or read merging.
+*
+* Make sure to use RCU over entire code blocks if
+* node_demotion[] reads need to be consistent.
 */
-   return READ_ONCE(node_demotion[node]);
+   rcu_read_lock();
+   target = READ_ONCE(node_demotion[node]);
+   rcu_read_unlock();
+
+   return target;
 }
 
 /*
@@ -3226,8 +3240,9 @@ void migrate_vma_finalize(struct migrate
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
 
+#if defined(CONFIG_MEMORY_HOTPLUG)
 /* Disable reclaim-based migration. */
-static void disable_all_migrate_targets(void)
+static void __disable_all_migrate_targets(void)
 {
int node;
 
@@ -3235,6 +3250,25 @@ static void disable_all_migrate_targets(
node_demotion[node] = NUMA_NO_NODE;
 }
 
+static void disable_all_migrate_targets(void)
+{
+   __disable_all_migrate_targets();
+
+   /*
+* Ensure that the "disable" is visible across the system.
+   

[PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard

2021-04-01 Thread Dave Hansen
I'm resending this because I forgot to cc the mailing lists on the
post yesterday.  Sorry for the noise.  Please reply to this series.

The full series is also available here:

https://github.com/hansendc/linux/tree/automigrate-20210331

which also includes some vm.zone_reclaim_mode sysctl ABI fixup
prerequisites:


https://github.com/hansendc/linux/commit/18daad8f0181a2da57cb43e595303c2ef5bd7b6e

https://github.com/hansendc/linux/commit/a873f3b6f250581072ab36f2735a3aa341e36705

There are no major changes since the last post.

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end
up in the slower persistent memory.  Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never
brings them back to fast DRAM.  Huang Ying has follow-on work which
repurposes autonuma to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/20190124231441.37a4a...@viggo.jf.intel.com

== Open Issues ==

 * Memory policies and cpusets that, for instance, restrict allocations
   to DRAM can have their memory demoted to PMEM whenever they opt in to
   this new mechanism.  A cgroup-level API to opt-in or opt-out of
   these migrations will likely be required as a follow-on.
 * Could be more aggressive about where anon LRU scanning occurs
   since it no longer necessarily involves I/O.  get_scan_count()
   for instance says: "If we have no swap space, do not bother
   scanning anon pages"

--

 Documentation/admin-guide/sysctl/vm.rst |  12 +
 include/linux/migrate.h |  14 +-
 include/linux/swap.h|   3 +-
 include/linux/vm_event_item.h   |   2 +
 include/trace/events/migrate.h  |   3 +-
 include/uapi/linux/mempolicy.h  |   1 +
 mm/compaction.c |   3 +-
 mm/gup.c|   3 +-
 mm/internal.h   |   5 +
 mm/memory-failure.c |   4 +-
 mm/memory_hotplug.c |   4 +-
 mm/mempolicy.c  |   8 +-
 mm/migrate.c| 315 +++-
 mm/page_alloc.c |  11 +-
 mm/vmscan.c | 158 +++-
 mm/vmstat.c |   2 +
 16 files changed, 520 insertions(+), 28 deletions(-)

--

Changes since (automigrate-20210304):
 * Add ack/review tags
 * Remove duplicate synchronize_rcu() call

Changes since (automigrate-20210122):
 * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE since pages *are*
   movable.
 * Separate out helpers that check for being able to reclaim anonymous
   pages versus being able to meaningfully scan the anon LRU.

Changes since (automigrate-20200818):
 * Fall back to normal reclaim when demotion fails
 * Fix some compile issues, when page migration and NUMA are off

Changes since (automigrate-20201007):
 * separate out checks for "can scan anon LRU" from "can actually
   swap anon pages right now".  Previous series conflated them
   and may have been overly aggressive scanning LRU
 * add MR_DEMOTION to tracepoint header
 * remove unnecessary hugetlb page check

Changes since (https://lwn.net/Articles/824830/):
 * Use higher-level migrate_pages() API approach from Yang Shi's
   earlier patches.
 * made sure to actually check node_reclaim_mode's new bit
 * disabled migration entirely before introducing RECLAIM_MIGRATE
 * Replace GFP_NOWAIT with explicit __GFP_KSWAPD_RECLAIM and
   comment why we want that.
 * Comment on effects of that keep multiple source nodes from
   sharing target nodes

Cc: Yang Shi 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 
Cc: Huang Ying 
Cc: Wei Xu 



[PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now really
node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).

Once this is enabled page demotion may move data to a NUMA node
that does not fall into the cpuset of the allocating process.
This could be construed to violate the guarantees of cpusets.
However, since this is an opt-in mechanism, the assumption is
that anyone enabling it is content to relax the guarantees.

Signed-off-by: Dave Hansen 
Cc: Wei Xu 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

Changes since 20200122:
 * Changelog material about relaxing cpuset constraints

Changes since 20210304:
 * Add Documentation/ material about relaxing cpuset constraints
---

 b/Documentation/admin-guide/sysctl/vm.rst |   12 
 b/include/linux/swap.h|3 ++-
 b/include/uapi/linux/mempolicy.h  |1 +
 b/mm/vmscan.c |6 --
 4 files changed, 19 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 
Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-31 
15:17:40.324000190 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-31 15:17:40.349000190 
-0700
@@ -976,6 +976,7 @@ This is value OR'ed together of
 1  Zone reclaim on
 2  Zone reclaim writes dirty pages out
 4  Zone reclaim swaps pages
+8  Zone reclaim migrates pages
 =  ===
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -1000,3 +1001,14 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.  It may move data to a NUMA node that does not
+fall into the cpuset of the allocating process which might be construed
+to violate the guarantees of cpusets.  This should not be enabled on
+systems which need strict cpuset location guarantees.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.331000190 
-0700
+++ b/include/linux/swap.h  2021-03-31 15:17:40.351000190 -0700
@@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
 static inline bool node_reclaim_enabled(void)
 {
/* Is any node_reclaim_mode bit set? */
-   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
+   RECLAIM_UNMAP|RECLAIM_MIGRATE);
 }
 
 extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 
include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE2021-03-31 
15:17:40.337000190 -0700
+++ b/include/uapi/linux/mempolicy.h2021-03-31 15:17:40.352000190 -0700
@@ -71,5 +71,6 @@ enum {
 #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
 #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3)  /* Migrate to other nodes during 
reclaim */
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE   2021-03-31 15:17:40.339000190 -0700
+++ b/mm/vmscan.c   2021-03-31 15:17:40.357000190 -0700
@@ -1074,6 +1074,9 @@ static bool migrate_demote_page_

[PATCH 06/10] mm/vmscan: add page demotion counter

2021-04-01 Thread Dave Hansen


From: Yang Shi 

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters shown in
/proc/vmstat.

[ daveh:
   - __count_vm_events() a bit, and made them look at the THP
 size directly rather than getting data from migrate_pages()
]

Signed-off-by: Yang Shi 
Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Changes since 202010:
 * remove unused scan-control 'demoted' field
---

 b/include/linux/vm_event_item.h |2 ++
 b/mm/vmscan.c   |5 +
 b/mm/vmstat.c   |2 ++
 3 files changed, 9 insertions(+)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 
include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 
2021-03-31 15:17:17.079000248 -0700
+++ b/include/linux/vm_event_item.h 2021-03-31 15:17:17.101000248 -0700
@@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
PGREUSE,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
+   PGDEMOTE_KSWAPD,
+   PGDEMOTE_DIRECT,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_DIRECT_THROTTLE,
diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter   2021-03-31 
15:17:17.081000248 -0700
+++ b/mm/vmscan.c   2021-03-31 15:17:17.109000248 -0700
@@ -1120,6 +1120,11 @@ static unsigned int demote_page_list(str
target_nid, MIGRATE_ASYNC, MR_DEMOTION,
_succeeded);
 
+   if (current_is_kswapd())
+   __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
+   else
+   __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+
return nr_succeeded;
 }
 
diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter   2021-03-31 
15:17:17.092000248 -0700
+++ b/mm/vmstat.c   2021-03-31 15:17:17.116000248 -0700
@@ -1259,6 +1259,8 @@ const char * const vmstat_text[] = {
"pgreuse",
"pgsteal_kswapd",
"pgsteal_direct",
+   "pgdemote_kswapd",
+   "pgdemote_direct",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_direct_throttle",
_


[PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-01 Thread Dave Hansen


From: Yang Shi 

The migrate_pages() returns the number of pages that were not migrated,
or an error code.  When returning an error code, there is no way to know
how many pages were migrated or not migrated.

In the following patch, migrate_pages() is used to demote pages to PMEM
node, we need account how many pages are reclaimed (demoted) since page
reclaim behavior depends on this.  Add *nr_succeeded parameter to make
migrate_pages() return how many pages are demoted successfully for all
cases.

Signed-off-by: Yang Shi 
Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Note: Yang Shi originally wrote the patch, thus the SoB.  There was
also a Reviewed-by provided since there were some modifications made
to this after the original work.

Changes since 20210302:
 * Fix definition of CONFIG_MIGRATION=n stub migrate_pages().  Its
   parameters were wrong, but oddly enough did not generate any
   compile errors.

Changes since 20200122:
 * Fix migrate_pages() to manipulate nr_succeeded *value*
   rather than the pointer.
---

 b/include/linux/migrate.h |5 +++--
 b/mm/compaction.c |3 ++-
 b/mm/gup.c|3 ++-
 b/mm/memory-failure.c |4 +++-
 b/mm/memory_hotplug.c |4 +++-
 b/mm/mempolicy.c  |8 ++--
 b/mm/migrate.c|   19 +++
 b/mm/page_alloc.c |9 ++---
 8 files changed, 36 insertions(+), 19 deletions(-)

diff -puN include/linux/migrate.h~migrate_pages-add-success-return 
include/linux/migrate.h
--- a/include/linux/migrate.h~migrate_pages-add-success-return  2021-03-31 
15:17:14.144000255 -0700
+++ b/include/linux/migrate.h   2021-03-31 15:17:14.182000255 -0700
@@ -40,7 +40,8 @@ extern int migrate_page(struct address_s
struct page *newpage, struct page *page,
enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
-   unsigned long private, enum migrate_mode mode, int reason);
+   unsigned long private, enum migrate_mode mode, int reason,
+   unsigned int *nr_succeeded);
 extern struct page *alloc_migration_target(struct page *page, unsigned long 
private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
@@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(str
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
-   int reason)
+   int reason, unsigned int *nr_succeeded)
{ return -ENOSYS; }
 static inline struct page *alloc_migration_target(struct page *page,
unsigned long private)
diff -puN mm/compaction.c~migrate_pages-add-success-return mm/compaction.c
--- a/mm/compaction.c~migrate_pages-add-success-return  2021-03-31 
15:17:14.146000255 -0700
+++ b/mm/compaction.c   2021-03-31 15:17:14.186000255 -0700
@@ -2247,6 +2247,7 @@ compact_zone(struct compact_control *cc,
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
+   unsigned int nr_succeeded = 0;
 
/*
 * These counters track activities during zone compaction.  Initialize
@@ -2364,7 +2365,7 @@ compact_zone(struct compact_control *cc,
 
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
-   MR_COMPACTION);
+   MR_COMPACTION, &nr_succeeded);
 
trace_mm_compaction_migratepages(cc->nr_migratepages, err,
&cc->migratepages);
diff -puN mm/gup.c~migrate_pages-add-success-return mm/gup.c
--- a/mm/gup.c~migrate_pages-add-success-return 2021-03-31 15:17:14.15255 
-0700
+++ b/mm/gup.c  2021-03-31 15:17:14.19255 -0700
@@ -1550,6 +1550,7 @@ static long check_and_migrate_cma_pages(
unsigned long i;
unsigned long step;
bool drain_allow = true;
+   unsigned int nr_succeeded = 0;
bool migrate_allow = true;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
@@ -1606,7 +1607,7 @@ check_again:
put_page(pages[i]);
 
if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
-   (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+   (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE,
&nr_succeeded)) {
/*
 * some of the pages failed migration. Do get_user_pages
 * without migration.
diff -puN mm/memory-failure.c~migrate_pages-add-success-re

Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-01 Thread Dave Hansen
On 4/1/21 10:49 AM, Raoul Strackx wrote:
> On 4/1/21 6:11 PM, Dave Hansen wrote:
>> On 4/1/21 7:56 AM, Raoul Strackx wrote:
>>> SOLUTION OF THIS PATCH
>>> This patch adds a new ioctl to enable userspace to execute EEXTEND leaf 
>>> functions per 256 bytes of enclave memory. This enables enclaves to be 
>>> build as specified by enclave providers.
>> I think tying the user ABI to the SGX architecture this closely is a
>> mistake.
>>
>> Do we need another ioctl() or can we just relax the existing add_pages
>> ioctl() to allow unaligned addresses?
>>
> I've considered this. In order to do an EEXTEND without an EADD, we'd
> need to add a flag DONT_ADD_PAGES flag to `add_pages` ioctl as well. Two
> separate ioctls, one for adding, another for extending made more sense
> to me.

So, we're talking here about pages that have been EADDed, but for which
we do not want to include the entire contents of the page?  Do these
contents always include the beginning of the page, or can the holes be
anywhere?


[PATCH 09/10] mm/vmscan: never demote for memcg reclaim

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes.  Migrating pages to another
node serves this purpose.

memcg reclaim is different.  Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes.  Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.

Signed-off-by: Dave Hansen 
Suggested-by: Yang Shi 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 
---

 b/mm/vmscan.c |   18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim2021-03-31 
15:17:20.476000239 -0700
+++ b/mm/vmscan.c   2021-03-31 15:17:20.487000239 -0700
@@ -288,7 +288,8 @@ static bool writeback_throttling_sane(st
 #endif
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
- int node_id)
+ int node_id,
+ struct scan_control *sc)
 {
if (memcg == NULL) {
/*
@@ -326,7 +327,7 @@ unsigned long zone_reclaimable_pages(str
 
nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -1064,7 +1065,8 @@ static enum page_references page_check_r
return PAGEREF_RECLAIM;
 }
 
-static bool migrate_demote_page_ok(struct page *page)
+static bool migrate_demote_page_ok(struct page *page,
+  struct scan_control *sc)
 {
int next_nid = next_demotion_node(page_to_nid(page));
 
@@ -1072,6 +1074,10 @@ static bool migrate_demote_page_ok(struc
VM_BUG_ON_PAGE(PageHuge(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);
 
+   /* It is pointless to do demotion in memcg reclaim */
+   if (cgroup_reclaim(sc))
+   return false;
+
if (next_nid == NUMA_NO_NODE)
return false;
if (PageTransHuge(page) && !thp_migration_supported())
@@ -1328,7 +1334,7 @@ retry:
 * Before reclaiming the page, try to relocate
 * its contents to another node.
 */
-   if (do_demote_pass && migrate_demote_page_ok(page)) {
+   if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
list_add(&page->lru, &demote_pages);
unlock_page(page);
continue;
@@ -2362,7 +2368,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;
 
/* If we have no swap space, do not bother scanning anon pages. */
-   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, 
sc)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2737,7 +2743,7 @@ static inline bool should_continue_recla
 */
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+   if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
return inactive_lru_pages > pages_for_compaction;
_


[PATCH 01/10] mm/numa: node demotion data structure and lookup

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user defined node migration table. This allows creating single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with no target is a "terminal node", so reclaim acts normally there.
The migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed.  This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
never be allowed.

Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

changes since 20200122:
 * Make node_demotion[] __read_mostly

changes in July 2020:
 - Remove loop from next_demotion_node() and get_online_mems().
   This means that the node returned by next_demotion_node()
   might now be offline, but the worst case is that the
   allocation fails.  That's fine since it is transient.
---

 b/mm/migrate.c |   17 +
 1 file changed, 17 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path 
mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path
2021-03-31 15:17:10.734000264 -0700
+++ b/mm/migrate.c  2021-03-31 15:17:10.742000264 -0700
@@ -1163,6 +1163,23 @@ out:
return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] __read_mostly =
+   {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+   return node_demotion[node];
+}
+
 /*
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
_


[PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Anonymous pages are kept on their own LRU(s).  These lists could
theoretically always be scanned and maintained.  But, without swap,
there is currently nothing the kernel can *do* with the results of a
scanned, sorted LRU for anonymous pages.

A check for '!total_swap_pages' currently serves as a valid check as
to whether anonymous LRUs should be maintained.  However, another
method will be added shortly: page demotion.

Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.

Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Reviewed-by: Greg Thelen 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 
---

 b/mm/vmscan.c |   28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged2021-03-31 15:17:18.325000245 
-0700
+++ b/mm/vmscan.c   2021-03-31 15:17:18.333000245 -0700
@@ -2508,6 +2508,26 @@ out:
}
 }
 
+/*
+ * Anonymous LRU management is a waste if there is
+ * ultimately no way to reclaim the memory.
+ */
+bool anon_should_be_aged(struct lruvec *lruvec)
+{
+   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+   /* Aging the anon LRU is valuable if swap is present: */
+   if (total_swap_pages > 0)
+   return true;
+
+   /* Also valuable if anon pages can be demoted: */
+   if (next_demotion_node(pgdat->node_id) >= 0)
+   return true;
+
+   /* No way to reclaim anon pages.  Should not age anon LRUs: */
+   return false;
+}
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
unsigned long nr[NR_LRU_LISTS];
@@ -2617,7 +2637,8 @@ static void shrink_lruvec(struct lruvec
 * Even if we did not try to evict anon pages at all, we want to
 * rebalance the anon lru active/inactive ratio.
 */
-   if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+   if (anon_should_be_aged(lruvec) &&
+   inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
   sc, LRU_ACTIVE_ANON);
 }
@@ -3446,10 +3467,11 @@ static void age_active_anon(struct pglis
struct mem_cgroup *memcg;
struct lruvec *lruvec;
 
-   if (!total_swap_pages)
+   lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+   if (!anon_should_be_aged(lruvec))
return;
 
-   lruvec = mem_cgroup_lruvec(NULL, pgdat);
if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
return;
 
_


[PATCH 05/10] mm/migrate: demote pages during reclaim

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

This is mostly derived from a patch from Yang Shi:


https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/

Add code to the reclaim path (shrink_page_list()) to "demote" data
to another NUMA node instead of discarding the data.  This always
avoids the cost of I/O needed to read the page back in and sometimes
avoids the writeout cost when the page is dirty.

A second pass through shrink_page_list() will be made if any demotions
fail.  This essentially falls back to normal reclaim behavior in the
case that demotions fail.  Previous versions of this patch may have
simply failed to reclaim pages which were eligible for demotion but
were unable to be demoted in practice.

Note: This just adds the start of infrastructure for migration.  It is
actually disabled next to the FIXME in migrate_demote_page_ok().

Signed-off-by: Dave Hansen 
Cc: Wei Xu 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: osalvador 

--
changes from 20210122:
 * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE (Ying)

changes from 202010:
 * add MR_NUMA_MISPLACED to trace MIGRATE_REASON define
 * make migrate_demote_page_ok() static, remove 'sc' arg until
   later patch
 * remove unnecessary alloc_demote_page() hugetlb warning
 * Simplify alloc_demote_page() gfp mask.  Depend on
   __GFP_NORETRY to make it lightweight instead of fancier
   stuff like leaving out __GFP_IO/FS.
 * Allocate migration page with alloc_migration_target()
   instead of allocating directly.
changes from 20200730:
 * Add another pass through shrink_page_list() when demotion
   fails.
changes from 20210302:
 * Use __GFP_THISNODE and revise the comment explaining the
   GFP mask construction
---

 b/include/linux/migrate.h|9 
 b/include/trace/events/migrate.h |3 -
 b/mm/vmscan.c|   82 +++
 3 files changed, 93 insertions(+), 1 deletion(-)

diff -puN include/linux/migrate.h~demote-with-migrate_pages 
include/linux/migrate.h
--- a/include/linux/migrate.h~demote-with-migrate_pages 2021-03-31 
15:17:15.842000251 -0700
+++ b/include/linux/migrate.h   2021-03-31 15:17:15.853000251 -0700
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+   MR_DEMOTION,
MR_TYPES
 };
 
@@ -196,6 +197,14 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int next_demotion_node(int node);
+
+#else /* CONFIG_MIGRATION disabled: */
+
+static inline int next_demotion_node(int node)
+{
+   return NUMA_NO_NODE;
+}
 
 #endif /* CONFIG_MIGRATION */
 
diff -puN include/trace/events/migrate.h~demote-with-migrate_pages 
include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~demote-with-migrate_pages  2021-03-31 
15:17:15.846000251 -0700
+++ b/include/trace/events/migrate.h2021-03-31 15:17:15.853000251 -0700
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset")\
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
-   EMe(MR_CONTIG_RANGE,"contig_range")
+   EM( MR_CONTIG_RANGE,"contig_range") \
+   EMe(MR_DEMOTION,"demotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
--- a/mm/vmscan.c~demote-with-migrate_pages 2021-03-31 15:17:15.848000251 
-0700
+++ b/mm/vmscan.c   2021-03-31 15:17:15.856000251 -0700
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1035,6 +1036,23 @@ static enum page_references page_check_r
return PAGEREF_RECLAIM;
 }
 
+static bool migrate_demote_page_ok(struct page *page)
+{
+   int next_nid = next_demotion_node(page_to_nid(page));
+
+   VM_BUG_ON_PAGE(!PageLocked(page), page);
+   VM_BUG_ON_PAGE(PageHuge(page), page);
+   VM_BUG_ON_PAGE(PageLRU(page), page);
+
+   if (next_nid == NUMA_NO_NODE)
+   return false;
+   if (PageTransHuge(page) && !thp_migration_supported())
+   return false;
+
+   // FIXME: actually enable this later in the series
+   return false;
+}
+
 /* Check if a page is dirty or under writeback */
 static void page_check_dirty_writeback(struct page *page,
   bool *dirty, bool *writeback)
@@ -1065,6 +1083,46 @@ static void page_check_dirty_writeback(s
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+   struct migration_target_control mtc = {
+   

[PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-04-01 Thread Dave Hansen


From: Keith Busch 

Reclaim anonymous pages if a migration path is available now that
demotion provides a non-swap recourse for reclaiming anon pages.

Note that this check is subtly different from the
anon_should_be_aged() checks.  This mechanism checks whether a
specific page in a specific context *can* actually be reclaimed, given
current swap space and cgroup limits.

anon_should_be_aged() is a much simpler and more preliminary check
which just says whether there is a possibility of future reclaim.

#Signed-off-by: Keith Busch 
Cc: Keith Busch 
Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Changes from Dave 10/2020:
 * remove 'total_swap_pages' modification

Changes from Dave 06/2020:
 * rename reclaim_anon_pages()->can_reclaim_anon_pages()

Note: Keith's Intel SoB is commented out because he is no
longer at Intel and his @intel.com mail will bounce.
---

 b/mm/vmscan.c |   35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 
mm/vmscan.c
--- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap  
2021-03-31 15:17:19.388000242 -0700
+++ b/mm/vmscan.c   2021-03-31 15:17:19.407000242 -0700
@@ -287,6 +287,34 @@ static bool writeback_throttling_sane(st
 }
 #endif
 
+static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
+ int node_id)
+{
+   if (memcg == NULL) {
+   /*
+* For non-memcg reclaim, is there
+* space in any swap device?
+*/
+   if (get_nr_swap_pages() > 0)
+   return true;
+   } else {
+   /* Is the memcg below its swap limit? */
+   if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+   return true;
+   }
+
+   /*
+* The page can not be swapped.
+*
+* Can it be reclaimed from this node via demotion?
+*/
+   if (next_demotion_node(node_id) >= 0)
+   return true;
+
+   /* No way to reclaim anon pages */
+   return false;
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
 
nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-   if (get_nr_swap_pages() > 0)
+   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -2323,6 +2351,7 @@ enum scan_balance {
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
   unsigned long *nr)
 {
+   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
int swappiness = mem_cgroup_swappiness(memcg);
@@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;
 
/* If we have no swap space, do not bother scanning anon pages. */
-   if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2708,7 +2737,7 @@ static inline bool should_continue_recla
 */
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-   if (get_nr_swap_pages() > 0)
+   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
return inactive_lru_pages > pages_for_compaction;
_


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-04-01 Thread Dave Hansen
On 4/1/21 7:56 AM, Raoul Strackx wrote:
> 
> SOLUTION OF THIS PATCH
> This patch adds a new ioctl to enable userspace to execute EEXTEND leaf 
> functions per 256 bytes of enclave memory. This enables enclaves to be 
> build as specified by enclave providers.

I think tying the user ABI to the SGX architecture this closely is a
mistake.

Do we need another ioctl() or can we just relax the existing add_pages
ioctl() to allow unaligned addresses?


Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Dave Hansen
On 3/31/21 8:28 PM, Andi Kleen wrote:
>> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
>> what is supported: CPUID.  If it screws up, and the guest gets an
>> unexpected #VE, so be it.
> The main reason for disabling stuff is actually that we don't need
> to harden it. All these things are potential attack paths.

Wait, MWAIT is an attack path?  If it were an attack path, wouldn't it
be an attack path that was created from the SEAM layer or the hardware
being broken?  Aren't those two things within the trust boundary?  Do we
harden against other things within the trust boundary?

>> We don't have all kinds of crazy handling in the kernel's #UD handler
>> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
>> to trust the underlying hardware to be sane.  If it isn't, we die a
>> horrible death as fast as possible.  Why should TDX be any different?
> That's what the original patch did -- no unnecessary checks -- but reviewers
> keep asking for the extra checks, so Sathya added more. We have the not
> unusual problem here that reviewers don't agree among themselves.

Getting consensus is a pain in the neck, eh?

It's too bad all the reviewers in the community aren't like all of the
engineers at big companies where everyone always agrees. :)


Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Dave Hansen
On 3/31/21 3:28 PM, Kuppuswamy, Sathyanarayanan wrote:
> 
> On 3/31/21 3:11 PM, Dave Hansen wrote:
>> On 3/31/21 3:06 PM, Sean Christopherson wrote:
>>> I've no objection to a nice message in the #VE handler.  What I'm
>>> objecting to
>>> is sanity checking the CPUID model provided by the TDX module.  If we
>>> don't
>>> trust the TDX module to honor the spec, then there are a huge pile of
>>> things
>>> that are far higher priority than MONITOR/MWAIT.
>>
>> In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
>> check it to comply with the spec.  If something doesn't comply, we'll
>> get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
>> programming here.
>>
>> That sounds sane to me.
> But I think there are cases (like MCE) where SEAM does not disable
> them because there will be future support for it. We should at-least
> suppress such features in kernel.

Specifics, please.

The hardware (and VMMs and SEAM) have ways of telling the guest kernel
what is supported: CPUID.  If it screws up, and the guest gets an
unexpected #VE, so be it.

We don't have all kinds of crazy handling in the kernel's #UD handler
just in case a CPU mis-enumerates a feature and we get a #UD.  We have
to trust the underlying hardware to be sane.  If it isn't, we die a
horrible death as fast as possible.  Why should TDX be any different?


Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Dave Hansen
On 3/31/21 3:06 PM, Sean Christopherson wrote:
> I've no objection to a nice message in the #VE handler.  What I'm objecting to
> is sanity checking the CPUID model provided by the TDX module.  If we don't
> trust the TDX module to honor the spec, then there are a huge pile of things
> that are far higher priority than MONITOR/MWAIT.

In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
check it to comply with the spec.  If something doesn't comply, we'll
get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
programming here.

That sounds sane to me.


Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Dave Hansen
On 3/31/21 2:53 PM, Sean Christopherson wrote:
> On Wed, Mar 31, 2021, Kuppuswamy Sathyanarayanan wrote:
>> Changes since v3:
>>  * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
> Why bother?  There are a whole pile of features that are dictated by the TDX
> module spec.  MONITOR/MWAIT is about as uninteresting as it gets, e.g. 
> absolute
> worst case scenario is the guest kernel crashes, whereas a lot of spec 
> violations
> would compromise the security of the guest.

So, what should we do?  In the #VE handler:

switch (exit_reason) {
case SOMETHING_WE_HANDLE:
blah();
break;
...
default:
pr_err("unhandled #VE, exit reason: %d\n", exit_reason);
BUG_ON(1);
}

?

Is this the *ONLY* one of these, or are we going to have another twenty?

If this is the only one, we might as well give a nice string error
message.  If there are twenty more, let's just dump the exit reason,
BUG() and move on with our lives.



Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Dave Hansen
On 3/31/21 2:09 PM, Kuppuswamy Sathyanarayanan wrote:
> As per Guest-Host Communication Interface (GHCI) Specification
> for Intel TDX, sec 2.4.1, TDX architecture does not support
> MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
> if MWAIT/MONITOR instructions are executed with CPL != 0 it will
> trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
> triggered. WBINVD instruction behavior is also similar to
> MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
> of #UD.

Could we give it a go to try this in plain English before jumping in and
quoting the exact spec section?  Also, the CPL language is nice and
precise for talking inside Intel, but it's generally easier for me to
read kernel descriptions when we just talk about the kernel.

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work.  If the guest kernel
uses these instructions, the hardware generates a #VE.

Which reminds me...  The SDM says: MWAIT will "#UD ... If
CPUID.01H:ECX.MONITOR[bit 3] = 0".  So, is this an architectural change?
 The guest is *supposed* to see that CPUID bit as 0, so shouldn't it
also get a #UD?  Or is this all so that if SEAM *forgets* to clear the
CPUID bit, the guest gets #VE?

What are we *actually* mitigating here?

Also, FWIW, MWAIT/MONITOR and WBINVD are pretty different beasts.  I
think this would all have been a lot more clear if this would have been
two patches instead of shoehorning them into one.

> To prevent TD guest from using these unsupported instructions,
> following measures are adapted:
> 
> 1. For MWAIT/MONITOR instructions, support for these instructions
> are already disabled by TDX module (SEAM). So CPUID flags for
> these instructions should be in disabled state. Also, just to be
> sure that these instructions are disabled, forcefully unset
> X86_FEATURE_MWAIT CPU cap in OS.
> 
> 2. For WBINVD instruction, we use audit to find the code that uses
> this instruction and disable them for TD.

Really?  Where are those patches?

> +static inline bool cpuid_has_mwait(void)
> +{
> + if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
> + return true;
> +
> + return false;
> +}
> +
>  bool is_tdx_guest(void)
>  {
>   return static_cpu_has(X86_FEATURE_TDX_GUEST);
> @@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct 
> ve_info *ve)
>   return insn.length;
>  }
>  
> +/* Initialize TDX specific CPU capabilities */
> +static void __init tdx_cpu_cap_init(void)
> +{
> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> +
> + if (cpuid_has_mwait()) {
> + WARN(1, "TDX Module failed to disable MWAIT\n");

WARN(1, "TDX guest enumerated support for MWAIT, disabling it").

> + /* MWAIT is not supported in TDX platform, so suppress it */
> + setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> + }
> +
> +}

Extra newline.

>  void __init tdx_early_init(void)
>  {
>   if (!cpuid_has_tdx_guest())
>   return;
>  
> - setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> + tdx_cpu_cap_init();
>  
>   tdg_get_info();
>  
> @@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs 
> *regs,
>   case EXIT_REASON_EPT_VIOLATION:
>   ve->instr_len = tdg_handle_mmio(regs, ve);
>   break;
> + case EXIT_REASON_WBINVD:
> + /*
> +  * TDX architecture does not support WBINVD instruction.
> +  * Currently, usage of this instruction is prevented by
> +  * disabling the drivers which uses it. So if we still
> +  * reach here, it needs user attention.
> +  */

This comment is awfully vague.  "TDX architecture..." what?  Any CPUs
supporting the TDX architecture?  TDX VMM's?  TDX Guests?

Let's also not waste bytes on stating the obvious.  If it didn't need
attention we wouldn't be warning about it, eh?

So, let's halve the size of the comment and say:

/*
 * WBINVD is not supported inside TDX guests.  All in-
 * kernel uses should have been disabled.
 */

> + pr_err("TD Guest used unsupported WBINVD instruction\n");
> + BUG();
> + break;
> + case EXIT_REASON_MONITOR_INSTRUCTION:
> + case EXIT_REASON_MWAIT_INSTRUCTION:
> + /*
> +  * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
> +  * and also re-suppressed in kernel by clearing
> +  * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
> +  * if TD guest still executes MWAIT/MONITOR instruction with
> +  * above suppression, it needs user attention.
> +  */

Again, let's trim this down:

/*
 * Something in the kernel used MONITOR or MWAIT despite
 * X86_FEATURE_MWAIT being cleared for TDX guests.
  

Re: [PATCH V5 08/10] x86/entry: Preserve PKRS MSR across exceptions

2021-03-31 Thread Dave Hansen
On 3/31/21 12:14 PM, ira.we...@intel.com wrote:
> + * To protect against exceptions having access to this memory we save the
> + * current running value and sets the PKRS value to be used during the
> + * exception.

This series seems to have grown some "we's".

The preexisting pkey code was not innocent in this regard, either.  But,
please fix this up across the series in the next submission.  Literally
removing "we" from this sentence doesn't change the meaning at all.


Re: [PATCH RESEND 0/3] x86/sgx: eextend ioctl

2021-03-31 Thread Dave Hansen
On 3/31/21 5:50 AM, Raoul Strackx wrote:
> The sgx driver can only load enclaves whose pages are fully measured.
> This may exclude existing enclaves from running. This patch adds a
> new ioctl to measure 256 byte chunks at a time.

The changelogs here are pretty sparse.  Could you explain in a bit more
detail what's going on?

A review of the relevant pieces of the SGX architecture would be
appreciated.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Dave Hansen
On 3/30/21 10:56 AM, Len Brown wrote:
> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski  wrote:
>>> On Mar 30, 2021, at 10:01 AM, Len Brown  wrote:
>>> Is it required (by the "ABI") that a user program has everything
>>> on the stack for user-space XSAVE/XRESTOR to get back
>>> to the state of the program just before receiving the signal?
>> The current Linux signal frame format has XSTATE in uncompacted format,
>> so everything has to be there.
>> Maybe we could have an opt in new signal frame format, but the details would 
>> need to be worked out.
>>
>> It is certainly the case that a signal should be able to be delivered, run 
>> “async-signal-safe” code,
>> and return, without corrupting register contents.
> And so an an acknowledgement:
> 
> We can't change the legacy signal stack format without breaking
> existing programs.  The legacy is uncompressed XSTATE.  It is a
> complete set of architectural state -- everything necessary to
> XRESTOR.  Further, the sigreturn flow allows the signal handler to
> *change* any of that state, so that it becomes active upon return from
> signal.

One nit with this: XRSTOR itself can work with the compacted format or
uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
explicit from the instruction itself, XRSTOR changes its behavior by
reading XCOMP_BV.  There's no XRSTORC.
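
For anyone who wants to check this from userspace, here is a rough,
untested sketch; the offsets come from the SDM's XSAVE header layout
and the function name is purely illustrative:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/*
 * The 64-byte XSAVE header starts at byte 512 of the XSAVE area:
 * XSTATE_BV at offset 512, XCOMP_BV at offset 520.  Bit 63 of
 * XCOMP_BV is the compaction-mode bit that XRSTOR keys off of.
 */
static bool xsave_area_is_compacted(const void *xsave_area)
{
        uint64_t xcomp_bv;

        memcpy(&xcomp_bv, (const char *)xsave_area + 520, sizeof(xcomp_bv));
        return xcomp_bv & (1ULL << 63);
}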

The issue with using the compacted format is when legacy software in the
signal handler needs to go access the state.  *That* is what can't
handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
or compacted/XSAVEC).


Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-30 Thread Dave Hansen
On 3/30/21 8:00 AM, Andi Kleen wrote:
>>> +   /* MWAIT is not supported in TDX platform, so suppress it */
>>> +   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
>> is enforced by SEAM module.
> Good point.
>> Do we still need to safeguard it by setup_clear_cpu_cap() here?
> I guess it doesn't hurt to do it explicitly.

If this MWAIT behavior (clearing the CPUID bit) is part of the guest
architecture, then it would also be appropriate to WARN() rather than
silently clearing the X86_FEATURE bit.
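
Something like this (untested sketch) in tdx_early_init() would make a
TDX module bug visible instead of papering over it:

        /*
         * The TDX module is supposed to have cleared the MWAIT CPUID
         * bit already; complain loudly if it did not, then clear the
         * feature flag anyway.
         */
        if (WARN(boot_cpu_has(X86_FEATURE_MWAIT),
                 "TDX module did not clear the MWAIT CPUID bit\n"))
                setup_clear_cpu_cap(X86_FEATURE_MWAIT);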


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 4:16 PM, Kuppuswamy Sathyanarayanan wrote:
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions
> appropriately.

This misses a key detail:

"are not supported" ... and other patches have prevented a guest
from using these instructions.

In other words, they're bad now, and we know it.  We tried to keep the
kernel from using them, but we failed.  Whoops.

> Since the impact of executing WBINVD instruction in non ring-0 mode
> can be heavy, use BUG() to report it. For others, raise a WARNING
> message.

"Heavy" makes it sound like there's a performance impact.



>   pv_ops.irq.safe_halt = tdg_safe_halt;
> @@ -362,6 +365,32 @@ int tdg_handle_virtualization_exception(struct pt_regs 
> *regs,
>   case EXIT_REASON_EPT_VIOLATION:
>   ve->instr_len = tdg_handle_mmio(regs, ve);
>   break;
> + case EXIT_REASON_WBINVD:
> + /*
> +  * WBINVD is a privileged instruction, can only be executed
> +  * in ring 0. Since we reached here, the kernel is in buggy
> +  * state.
> +  */
> + pr_err("WBINVD #VE Exception\n");

"#VE Exception" is basically wasted bytes.  It really tells us nothing.
 This, on the other hand:

"TDX guest used unsupported WBINVD instruction"

gives us the clues that TDX is involved and that the kernel used the
instruction.  The fact that #VE itself is involved is kinda irrelevant.

I also hate the comment.  Don't waste the bytes saying we're in a buggy
state.  That's kinda obvious from the BUG().

Again, it might be nice to mention in the changelog *WHY* we're so sure
that WBINVD won't be called from a TDX guest.  What did we do to
guarantee that?  How could we be sure that someone looking at the splat
that this generates and then reading the comment can go fix the bug in
their kernel?

> + BUG();
> + break;
> + case EXIT_REASON_MONITOR_INSTRUCTION:
> + /*
> +  * MONITOR is a privileged instruction, can only be executed
> +  * in ring 0. So we are not supposed to reach here. Raise a
> +  * warning message.
> +  */
> + WARN(1, "MONITOR unexpected #VE Exception\n");
> + break;
> + case EXIT_REASON_MWAIT_INSTRUCTION:
> + /*
> +  * MWAIT feature is suppressed in firmware and in
> +  * tdx_early_init() by clearing X86_FEATURE_MWAIT CPU feature
> +  * flag. Since we are not supposed to reach here, raise a
> +  * warning message and return -EFAULT.
> +  */
> + WARN(1, "MWAIT unexpected #VE Exception\n");
> + return -EFAULT;
>   default:
>   pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>   return -EFAULT;

This is more of the same.  Don't waste comment bytes telling me we're
not supposed to reach a BUG() or WARN().


Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> +    case EXIT_REASON_MWAIT_INSTRUCTION:
> +    /* MWAIT is supressed, not supposed to reach here. */
> +    WARN(1, "MWAIT unexpected #VE Exception\n");
> +    return -EFAULT;

 How is MWAIT "suppressed"?
>>> I am clearing the MWAIT feature flag in early init code. We should also
>>> disable this feature in firmware.
>>> setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>>
>> I'd be more explicit about that.  Maybe even reference the code that
>> clears the X86_FEATURE.
> This change is part of the same patch.

Right, but if someone goes and looks at the switch() statement in 10
years is it going to be obvious how MWAIT was "suppressed"?
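
Something along these lines would answer that question right at the
point of use (sketch, untested):

        case EXIT_REASON_MWAIT_INSTRUCTION:
                /*
                 * MWAIT is suppressed by the TDX module and by
                 * tdx_early_init(), which clears X86_FEATURE_MWAIT.
                 * Start there if this warning ever fires.
                 */
                WARN_ONCE(1, "TDX guest used unexpected MWAIT\n");
                return -EFAULT;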


Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> MONITOR is a privileged instruction, right?  So we can only end up in
>> here if the kernel screws up and isn't reading CPUID correctly, right?
>>
>> That doesn't seem to me like something we want to suppress.  This needs
>> a warning, at least.  I assume that having a MONITOR instruction
>> immediately return doesn't do any harm.
> Agree. Since we are not supposed to come here, I will use BUG.

"This is unexpected" is a WARN()able offense.

"This is unexpected and might be corrupting data" is where we want to
use BUG().

Does broken MONITOR risk data corruption?
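
In code, the split I'm describing looks roughly like this
(illustrative only):

        case EXIT_REASON_MONITOR_INSTRUCTION:
                /* Unexpected, but harmless if it just returns: warn. */
                WARN_ONCE(1, "TDX guest used unexpected MONITOR\n");
                break;
        case EXIT_REASON_WBINVD:
                /* A skipped cache flush can corrupt data: stop here. */
                pr_err("TDX guest used unsupported WBINVD\n");
                BUG();
                break;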

>>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>>> +    /* MWAIT is supressed, not supposed to reach here. */
>>> +    WARN(1, "MWAIT unexpected #VE Exception\n");
>>> +    return -EFAULT;
>>
>> How is MWAIT "suppressed"?
> I am clearing the MWAIT feature flag in early init code. We should also
> disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT);

I'd be more explicit about that.  Maybe even reference the code that
clears the X86_FEATURE.



Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Dave Hansen
On 3/29/21 10:45 AM, Marco Elver wrote:
> On Mon, 29 Mar 2021 at 19:32, Dave Hansen  wrote:
> Doing it to all CPUs is too expensive, and we can tolerate this being
> approximate (nothing bad will happen, KFENCE might just miss a bug and
> that's ok).
...
>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
>> being enabled.  That's probably why you don't see this everywhere.  We
>> should probably have unconditional preempt checks in there.
> 
> In which case I'll add a preempt_disable/enable() pair to
> kfence_protect_page() in arch/x86/include/asm/kfence.h.

That sounds sane to me.  I'd just plead that the special situation (not
needing deterministic TLB flushes) is obvious.  We don't want any folks
copying this code.

BTW, I know you want to avoid the cost of IPIs, but have you considered
any other low-cost ways to get quicker TLB flushes?  For instance, you
could loop over all CPUs and set cpu_tlbstate.invalidate_other=1.  That
would induce a TLB flush at the next context switch without needing
an IPI.
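
Roughly what I have in mind (untested; whether invalidate_other's
semantics are exactly what KFENCE needs would deserve a close look):

        int cpu;

        /*
         * Mark every CPU's other ASIDs as stale.  Each CPU then does
         * the flush lazily at its next context switch instead of
         * taking an IPI right now.
         */
        for_each_possible_cpu(cpu)
                per_cpu(cpu_tlbstate.invalidate_other, cpu) = true;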


Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Dave Hansen
On 3/29/21 9:40 AM, Marco Elver wrote:
> It looks like the code path from flush_tlb_one_kernel() to
> invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so
> probably some feature difference between systems where it triggers and
> it doesn't.
> 
> As far as I'm aware, there is no restriction on where
> flush_tlb_one_kernel() is called. We could of course guard it but I
> think that's wrong.
> 
> Other than that, I hope the x86 maintainers know what's going on here.
> 
> Just for reference, the stack traces in the above logs start with:
> 
> | <3> [31.556004] BUG: using smp_processor_id() in preemptible [] 
> code: dmesg/1075
> | <4> [31.556070] caller is invalidate_user_asid+0x13/0x50
> | <4> [31.556078] CPU: 6 PID: 1075 Comm: dmesg Not tainted 
> 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
> | <4> [31.556081] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, 
> BIOS 8.11 10/24/2012
> | <4> [31.556084] Call Trace:
> | <4> [31.556088]  dump_stack+0x7f/0xad
> | <4> [31.556097]  check_preemption_disabled+0xc8/0xd0
> | <4> [31.556104]  invalidate_user_asid+0x13/0x50
> | <4> [31.556109]  flush_tlb_one_kernel+0x5/0x20
> | <4> [31.556113]  kfence_protect+0x56/0x80
> | ...

Our naming here isn't great.

But, the "one" in flush_tlb_one_kernel() really refers to two "ones":
1. Flush one single address
2. Flush that address from one CPU's TLB

The reason preempt needs to be off is that it doesn't make any sense to
flush one TLB entry from a "random" CPU.  It only makes sense to flush
it when preempt is disabled and you *know* which CPU's TLB you're flushing.

I think kfence needs to be using flush_tlb_kernel_range().  That does
all the IPI fanciness to flush the TLBs on *ALL* CPUs, not just the
current one.
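
Concretely, I'd expect the flush in kfence_protect_page() to look
something like this (sketch):

        /*
         * Flush this one page from *every* CPU's TLB, not just the
         * local one.  flush_tlb_kernel_range() does the IPI dance.
         */
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);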

BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
being enabled.  That's probably why you don't see this everywhere.  We
should probably have unconditional preempt checks in there.


Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote:
> + /*
> +  * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
> +  * Domain Extensions (Intel TDX) specification, sec 2.4,
> +  * some instructions that unconditionally cause #VE (such as WBINVD,
> +  * MONITOR, MWAIT) do not have corresponding TDCALL
> +  * [TDG.VP.VMCALL ] leaves, since the TD has been designed
> +  * with no deterministic way to confirm the result of those operations
> +  * performed by the host VMM.  In those cases, the goal is for the TD
> +  * #VE handler to increment the RIP appropriately based on the VE
> +  * information provided via TDCALL.
> +  */

That's an awfully big comment.  Could you pare it down, please?  Maybe
focus on the fact that we should never get here and why, rather than
talking about some silly spec?

> + case EXIT_REASON_WBINVD:
> + pr_warn_once("WBINVD #VE Exception\n");

I actually think WBINVD in here should oops.  We use it for some really
important things.  If it can't be executed, and we're depending on it,
the kernel is in deep, deep trouble.

I think a noop here is dangerous.

> + case EXIT_REASON_MONITOR_INSTRUCTION:
> + /* Handle as nops. */
> + break;

MONITOR is a privileged instruction, right?  So we can only end up in
here if the kernel screws up and isn't reading CPUID correctly, right?

That doesn't seem to me like something we want to suppress.  This needs
a warning, at least.  I assume that having a MONITOR instruction
immediately return doesn't do any harm.

> + case EXIT_REASON_MWAIT_INSTRUCTION:
> + /* MWAIT is supressed, not supposed to reach here. */
> + WARN(1, "MWAIT unexpected #VE Exception\n");
> + return -EFAULT;

How is MWAIT "suppressed"?


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Dave Hansen
On 3/27/21 5:53 PM, Thomas Gleixner wrote:
> Making it solely depend on XCR0 and fault if not requested upfront is
> bringing you into the situation that you broke 'legacy code' which
> relied on the CPUID bit and that worked until now which gets you
> in the no-regression trap.

Trying to find the right place to jump into this thread... :)

I don't know what apps do in practice.  But, the enumeration of the
features in the SDM describes three steps:
1. Check for XGETBV support
2. Use XGETBV[0] to check that the OS is aware of the feature and is
   context-switching it
3. Detect the feature itself

So, apps *are* supposed to be checking XCR0 via XGETBV.  If they don't,
they run the risk of a feature being supported by the CPU and the
registers "working" but not being context-switched.
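
A self-contained userspace sketch of those three steps (illustrative
only; the function name is made up, it uses the GCC/Clang builtins, so
build with -mxsave):

#include <cpuid.h>
#include <immintrin.h>
#include <stdbool.h>

/* XCR0 bits 5-7: opmask, ZMM_Hi256, Hi16_ZMM */
#define XCR0_AVX512_MASK        (0x7ull << 5)

static bool avx512_usable(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* 1. CPUID.1:ECX.OSXSAVE[bit 27] -- XGETBV is available */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
                return false;

        /* 2. XCR0 -- is the OS context-switching the AVX-512 state? */
        if ((_xgetbv(0) & XCR0_AVX512_MASK) != XCR0_AVX512_MASK)
                return false;

        /* 3. The feature itself, e.g. CPUID.7.0:EBX.AVX512F[bit 16] */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return false;
        return ebx & (1u << 16);
}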

Zeroing out bits in XCR0 will have the effect of telling the app that
the OS isn't context-switching the state.  I think this means that apps
will see the same thing in both situations:
1. If they run an old (say pre-AVX-512) kernel on new AVX-512-enabled
   hardware, or
2. They run a new kernel with this fancy proposed XCR0-switching
   mechanism

I _think_ that gets us off the hook for an ABI break, at least for AVX-512.


Re: [PATCH v3 05/25] x86/sgx: Introduce virtual EPC for use by KVM guests

2021-03-26 Thread Dave Hansen
On 3/26/21 8:29 AM, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 08:17:38AM -0700, Dave Hansen wrote:
>> We're working on a cgroup controller just for enclave pages that will
>> apply to guest use and bare metal.  It would have been nice to have up
>> front, but we're trying to do things incrementally.  A cgroup controller
>> should solve the vast majority of these issues where users are quarreling
>> about who gets enclave memory.
> Maybe I'm missing something but why do you need a cgroup controller
> instead of controlling that resource sharing in the sgx core? Or the
> cgroup thing has additional functionality which is good to have anyway?

We could do it in the SGX core, but I think what we end up with will end
up looking a lot like a cgroup controller.  It seems like overkill, but
I think there's enough infrastructure to leverage that it's simpler to
do it with cgroups versus anything else.


Re: [PATCH v3 05/25] x86/sgx: Introduce virtual EPC for use by KVM guests

2021-03-26 Thread Dave Hansen
On 3/26/21 8:03 AM, Borislav Petkov wrote:
> Let's say all guests start using enclaves and baremetal cannot start any
> new ones anymore due to no more memory. Are we ok with that?

Yes, for now.

> What if baremetal creates a big fat enclave and starves guests all of a
> sudden. Are we ok with that either?

Actually, the baremetal enclave will get a large chunk of its resources
reclaimed and stolen from it.  The guests will probably start and the
baremetal will probably thrash until its allocations fail and it is
killed because it couldn't allocate enclave memory in a page fault.

> In general, having two disjoint things give out SGX resources separately
> sounds like trouble to me.

Yes, it's trouble as-is.

We're working on a cgroup controller just for enclave pages that will
apply to guest use and bare metal.  It would have been nice to have up
front, but we're trying to do things incrementally.  A cgroup controller
should solve the vast majority of these issues where users are quarreling
about who gets enclave memory.

BTW, we probably should have laid this out up front in the original
merge, but the plans in order were roughly:

1. Core SGX functionality (merged into 5.11)
2. NUMA and KVM work
3. cgroup controller for enclave pages
4. EDMM support (lets you add/remove pages and change permissions while
   enclave runs.  Current enclaves are stuck with the same memory they
   start with)

After that, things become less clear.  There's some debate whether we
need to rework the VA pages (enclave swapping metadata to prevent
replay) or improve ability to reclaim guest pages.


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-25 Thread Dave Hansen
On 3/25/21 3:59 PM, Len Brown wrote:
> We call AMX a "simple state feature" -- it actually requires NO KERNEL 
> ENABLING
> above the generic state save/restore to fully support userspace AMX
> applications.
> 
> While not all ISA extensions can be simple state features, we do expect
> future features to share this trait, and so we want to be sure that it is 
> simple
> to update the kernel to turn those features on (and when necessary, off).

From some IRC chats with Thomas and Andy, I think it's safe to say that
they're not comfortable blindly enabling even our "simple features".  I
think we're going to need at least some additional architecture to get
us to a point where everyone will be comfortable.

For instance, AMX might be "simple", but there are really only kludgy
ways to get it back to the init state.  Plus, it's *not* simple in that
state left in the registers can have permanent (as long as the state
remains) power and performance impact.

Also, we probably need to expand the "simple" architecture documentation
a bit.  For instance, we need to promise that things like pkeys which
can cause kernel exceptions will never be enumerated as "simple".


Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

2021-03-25 Thread Dave Hansen
On 3/25/21 8:24 AM, Brijesh Singh wrote:
> On 3/25/21 9:48 AM, Dave Hansen wrote:
>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>> the virtual address to a lower page level.
>>>
>>> e.g
>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>   can write to it.
>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>   will cause an RMP violation if one of the pages within the 2MB region
>>>   is a guest private page.
>>>
>>> The above RMP violation can be resolved by simply splitting the large
>>> page.
>> What if the large page is provided by hugetlbfs?
> I was not able to find a method to split the large pages in the
> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
> transparent hugepage or small pages.

That's really, really nasty.  Especially since it might not be evident
until long after boot, when the guest gets killed.

It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
memory.  It's physically contiguous, so it would keep you from having to
fracture the direct map all the way down to 4k, it also can't be
reclaimed (just like all SEV memory).

I think the minimal thing you can do here is to fail to add memory to
the RMP in the first place if you can't split it.  That way, users will
at least fail to _start_ their VM versus dying randomly for no good reason.

Even better would be to come up with a stronger contract between host
and guest.  I really don't think the host should be exposed to random
RMP faults on the direct map.  If the guest wants to share memory, then
it needs to tell the host and give the host an opportunity to move the
guest physical memory.  It might, for instance, sequester all the shared
pages in a single spot to minimize direct map fragmentation.

I'll let the other x86 folks chime in on this, but I really think this
needs a different approach than what's being proposed.


Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support

2021-03-25 Thread Dave Hansen
On 3/25/21 8:31 AM, Brijesh Singh wrote:
> 
> On 3/25/21 9:58 AM, Dave Hansen wrote:
>>> +static int __init mem_encrypt_snp_init(void)
>>> +{
>>> +   if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>>> +   return 1;
>>> +
>>> +   if (rmptable_init()) {
>>> +   setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>> +   return 1;
>>> +   }
>>> +
>>> +   static_branch_enable(&snp_enable_key);
>>> +
>>> +   return 0;
>>> +}
>> Could you explain a bit why 'snp_enable_key' is needed in addition to
>> X86_FEATURE_SEV_SNP?
> 
> 
> The X86_FEATURE_SEV_SNP indicates that hardware supports the feature --
> this does not necessary means that SEV-SNP is enabled in the host.

I think you're confusing the CPUID bit that initially populates
X86_FEATURE_SEV_SNP with the X86_FEATURE bit.  We clear X86_FEATURE bits
all the time for features that the kernel turns off, even while the
hardware supports it.

Look at what we do in init_ia32_feat_ctl() for SGX, for instance.  We
then go on to use X86_FEATURE_SGX at runtime to see if SGX was disabled,
even though the hardware supports it.
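
The shape of it, with a made-up helper name standing in for the real
feat_ctl logic:

        /* Hardware enumerates SGX, but the kernel decided not to use it: */
        if (!sgx_allowed_by_feat_ctl(c))
                clear_cpu_cap(c, X86_FEATURE_SGX);

        /* ...and every later check just asks the software feature bit: */
        if (!cpu_feature_enabled(X86_FEATURE_SGX))
                return;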

>> For a lot of features, we just use cpu_feature_enabled(), which does
>> both compile-time and static_cpu_has().  This whole series seems to lack
>> compile-time disables for the code that it adds, like the code it adds
>> to arch/x86/mm/fault.c or even mm/memory.c.
> 
> Noted, I will add the #ifdef  to make sure that its compiled out when
> the config does not have the AMD_MEM_ENCRYPTION enabled.

IS_ENABLED() tends to be nicer for these things.

Even better is if you coordinate these with your X86_FEATURE_SEV_SNP
checks.  Then, put X86_FEATURE_SEV_SNP in disabled-features.h, and you
can use cpu_feature_enabled(X86_FEATURE_SEV_SNP) as both a
(statically-patched) runtime *AND* compile-time check without an
explicit #ifdefs.
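
Concretely, the pattern looks like this (the bit also has to be wired
into the appropriate DISABLED_MASK word; the position shown is
illustrative):

/* arch/x86/include/asm/disabled-features.h */
#ifdef CONFIG_AMD_MEM_ENCRYPT
# define DISABLE_SEV_SNP        0
#else
# define DISABLE_SEV_SNP        (1 << (X86_FEATURE_SEV_SNP & 31))
#endif

/*
 * At the call sites: statically patched at runtime, and compiled out
 * entirely when CONFIG_AMD_MEM_ENCRYPT=n.
 */
        if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
                return 1;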


  1   2   3   4   5   6   7   8   9   10   >