from:"Borislav Petkov"

Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-28 Thread Borislav Petkov

On Thu, Mar 28, 2024 at 01:17:43AM -0500, Naik, Avadhut wrote:
> SOCKET -> Socket
> PROCESSOR -> Processor
> MICROCODE -> Microcode

SOCKET -> socket
PROCESSOR -> processor
MICROCODE -> microcode

And yeah, the acronyms need to obviously stay in all caps.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Borislav Petkov

On Wed, Mar 27, 2024 at 03:31:01PM -0700, Sohil Mehta wrote:
> On 3/27/2024 1:54 PM, Avadhut Naik wrote:
> 
> > -   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
> > +   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE 
> > REVISION: %x",
> 
> Nit: s/MICROCODE REVISION/MICROCODE/g
> 
> You could probably get rid of the word REVISION in the interest of
> brevity similar to __print_mce().

You *definitely* want to do that - good catch.

And TBH, all the screaming words aren't helping either... :)

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Borislav Petkov

On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote:
> Can this patchset be merged in? Or would you prefer me sending out
> another revision with Steven's "Reviewed-by:" tag?

First of all, please do not top-post.

Then, you were on Cc on the previous thread. Please summarize from it
and put in the commit message *why* it is good to have each field added.

And then, above the tracepoint, I'd like you to add a rule which
states what information can and should be added to the tracepoint. And
no, "just because" is not good enough. The previous thread has hints.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] tracepoints: Use WARN() and not WARN_ON() for warnings

2024-02-28 Thread Borislav Petkov

On Wed, Feb 28, 2024 at 01:31:12PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> There are two WARN_ON*() warnings in tracepoint.h that deal with RCU
> usage. But when they trigger, especially from using a TRACE_EVENT()
> macro, the information is not very helpful and is confusing:
> 
>  [ cut here ]
>  WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
> lock_acquire+0x2b2/0x2d0
> 
> Where the above warning takes you to:
> 
>  TRACE_EVENT(lock_acquire,  <<<--- line 24 in lock.h
> 
>   TP_PROTO(struct lockdep_map *lock, unsigned int subclass,
>   int trylock, int read, int check,
>   struct lockdep_map *next_lock, unsigned long ip),
>   [..]
> 
> Change the WARN_ON_ONCE() to WARN_ONCE() and add a string that allows
> someone to search for exactly where the bug happened.
> 
> Reported-by: Borislav Petkov 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  include/linux/tracepoint.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)

Tested-by: Borislav Petkov (AMD) 

meaning: tested that it really fires:

[1.196008] Running RCU Tasks wait API self tests
[1.200227] [ cut here ]
[1.203899] RCU not watching for tracepoint
[1.203899] WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
lock_acquire+0x2d3/0x300
...

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-27 Thread Borislav Petkov

On Fri, Jan 26, 2024 at 10:01:29PM +, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using). But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is waling across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode
  update schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk
to data center operators more but this sounds like microcode update
failing is such a common thing to happen so that we *absolutely* *must*
capture the microcode revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily.

Or maybe people should script around it and keep retrying, dunno.

In my world, microcode update just works in the vast majority of the
cases and if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision,
I look at /proc/cpuinfo.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov

On Fri, Jan 26, 2024 at 08:49:03PM +, Luck, Tony wrote:
> Every patch that adds new code or data structures adds to the kernel
> memory footprint. Each should be considered on its merits. The basic
> question being:
> 
>"Is the new functionality worth the cost?"
> 
> Where does it end? It would end if Linus declared:
> 
>   "Linux is now complete. Stop sending patches".
> 
> I.e. it is never going to end.

No, it's not that - it is the merit thing which determines.

> 1) PPIN
> Cost = 8 bytes.
> Benefit: Emdeds a system identifier into the trace record so there
> can be no ambiguity about which machine generated this error.
> Also definitively indicates which socket on a multi-socket system.
> 
> 2) MICROCODE
> Cost = 4 bytes
> Benefit: Certainty about the microcode version active on the core
> at the time the error was detected.
> 
> RAS = Reliability, Availability, Serviceability
> 
> These changes fall into the serviceability bucket. They make it
> easier to diagnose what went wrong.

So does dmesg. Let's add it to the tracepoint...

But no, that's not the right question to ask.

It is rather: which bits of information are very relevant to an error
record and which are transient enough so that they cannot be gathered
from a system by other means or only gathered in a difficult way, and
should be part of that record.

The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.

The microcode revision thing I still don't buy but it is already there
so whateva...

So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov

On Fri, Jan 26, 2024 at 07:15:50PM +, Luck, Tony wrote:
> If deployment of a microcode update across a fleet always went
> flawlessly, life would be simpler. But things can fail. And maybe the
> failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
> pushed the update to the fleet and so missed the deployment. Perhaps
> one core was already acting weird and the microcode update didn't get
> applied to that core.

Yes, and you go collect data from that box. You will have to anyway to
figure out why the microcode didn't update.

> Swapping a hard drive, or hot plugging a NIC isn't very likely
> to correlate with an error reported by the CPU in machine
> check banks.

Ofc it is - coherent probe timeoutting due to problematic insertion
could be reported with a MCE, and so on and so on.

> Is it so very different to add this to a trace record so that rasdaemon
> can have feature parity with mcelog(8)?

I knew you were gonna say that. When someone decides that it is
a splendid idea to add more stuff to struct mce then said someone would
want it in the tracepoint too.

And then we're back to my original question: 

"And where does it end? Stick full dmesg in the tracepoint too?"

Where do you draw the line in the sand and say, no more, especially
static, fields bloating the trace record should be added and from then
on, you should go collect the info from that box. Something which you're
supposed to do anyway.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov

On Fri, Jan 26, 2024 at 05:10:20PM +, Luck, Tony wrote:
> 12 extra bytes divided by (say) 64GB (a very small server these days, may 
> laptop has that much)
>= 0.0001746%
> 
> We will need 57000 changes like this one before we get to 0.001% :-)

You're forgetting that those 12 bytes repeat per MCE tracepoint logged.
And there's other code which adds more 0.01% here and there, well,
because we can.

> But the key there is keeping the details of the source machine attached to
> the error record. My first contact with machine check debugging is always
> just the raw error record (from mcelog, rasdaemon, or console log).

Yes, that is somewhat sensible reason to have the PPIN together with the
MCE record.

> Knowing which microcode version was loaded on a core *at the time of
> the error* is critical. 

So is the rest of the debug info you're going to need from that machine.
And yet we're not adding that to the tracepoint.

> You've spent enough time with Ashok and Thomas tweaking the Linux
> microcode driver to know that going back to the machine the next day
> to ask about microcode version has a bunch of ways to get a wrong
> answer.

Huh, what does that have to do with this?

IIUC, if someone changes something on the system, whether that is
updating microcode or swapping a harddrive or swapping memory or
whatever, that invalidates the errors reported, pretty much.

You can't put it all in the trace record, you just can't. 

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov

On Thu, Jan 25, 2024 at 07:19:22PM +, Luck, Tony wrote:
> 8 bytes for PPIN, 4 more for microcode.

I know, nothing leads to bloat like 0.01% here, 0.001% there...

> Number of recoverable machine checks per system  I hope the
> monthly rate should be countable on my fingers...

That's not the point. Rather, when you look at MCE reports, you pretty
much almost always go and collect additional information from the target
machine because you want to figure out what exactly is going on.

So what's stopping you from collecting all that static information
instead of parrotting it through the tracepoint with each error?

> PPIN is useful when talking to the CPU vendor about patterns of
> similar errors seen across a cluster.

I guess that is perhaps the only thing of the two that makes some sense
at least - the identifier uniquely describes which CPU the error comes
from...

> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.

But that, nah. See above.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-25 Thread Borislav Petkov

On Thu, Jan 25, 2024 at 12:48:55PM -0600, Avadhut Naik wrote:
> This patchset updates the mce_record tracepoint so that the recently added
> fields of struct mce are exported through it to userspace.
> 
> The first patch adds PPIN (Protected Processor Inventory Number) field to
> the tracepoint.
> 
> The second patch adds the microcode field (Microcode Revision) to the
> tracepoint.

This is a lot of static information to add to *every* MCE.

And where does it end? Stick full dmesg in the tracepoint too?

What is the real-life use case here?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] tracing: Include PPIN in mce_record tracepoint

2024-01-25 Thread Borislav Petkov

On Wed, Jan 24, 2024 at 09:09:08AM -0500, Steven Rostedt wrote:
> I don't think that's a worry anymore. The offsets can change based on
> kernel config. PowerTop needed to have the library ported to it because
> it use to hardcode the offsets but then it broke when running the 32bit
> version on a 64bit kernel.
> 
> > 
> > I guess no until we break some use case and then we will have to revert.
> > At least this is what we've done in the past...
> > 
> 
> But that revert was reverted when we converted PowerTop to use libtraceevent.

Ok, sounds like a good plan.

/me makes a mental note for the future.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] tracing: Include PPIN in mce_record tracepoint

2024-01-24 Thread Borislav Petkov

On Tue, Jan 23, 2024 at 08:38:53PM -0500, Steven Rostedt wrote:
> Yes, rasdaemon uses libtraceevent (or a copy of it internally) that
> reads the format file to find fields. You can safely add fields to the
> middle of the event structure and the parsing will be just fine.

Should we worry about tools who consume the event "blindly", without the
lib?

I guess no until we break some use case and then we will have to revert.
At least this is what we've done in the past...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 4/5] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-12-08 Thread Borislav Petkov

On Fri, Dec 08, 2023 at 12:53:47PM +0100, Juergen Gross wrote:
> Took me a while to find it. Patch 5 was repairing the issue again

Can't have that. Any patch in any set must build and boot successfully
for bisection reasons.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v4 4/5] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-11-21 Thread Borislav Petkov

On Mon, Oct 30, 2023 at 03:25:07PM +0100, Juergen Gross wrote:
> Instead of stacking alternative and paravirt patching, use the new
> ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
> handling.
> 
> This eliminates the need to be careful regarding the sequence of
> alternative and paravirt patching.
> 
> For call depth tracking callthunks_setup() needs to be adapted to patch
> calls at alternative patching sites instead of paravirt calls.

Why is this important so that it is called out explicitly in the commit
message? Is callthunks_setup() special somehow?

> 
> Signed-off-by: Juergen Gross 
> Acked-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/alternative.h|  5 +++--
>  arch/x86/include/asm/paravirt.h   |  9 ++---
>  arch/x86/include/asm/paravirt_types.h | 26 +-
>  arch/x86/kernel/callthunks.c  | 17 -
>  arch/x86/kernel/module.c  | 20 +---
>  5 files changed, 31 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/x86/include/asm/alternative.h 
> b/arch/x86/include/asm/alternative.h
> index 2a74a94bd569..07b17bc615a0 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -89,6 +89,8 @@ struct alt_instr {
>   u8  replacementlen; /* length of new instruction */
>  } __packed;
>  
> +extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
> +

arch/x86/include/asm/alternative.h:92:extern struct alt_instr 
__alt_instructions[], __alt_instructions_end[];
arch/x86/kernel/alternative.c:163:extern struct alt_instr __alt_instructions[], 
__alt_instructions_end[];

Zap the declaration from the .c file pls.

>  /*
>   * Debug flag that can be tested to see whether alternative
>   * instructions were patched in already:
> @@ -104,11 +106,10 @@ extern void apply_fineibt(s32 *start_retpoline, s32 
> *end_retpoine,
> s32 *start_cfi, s32 *end_cfi);
>  
>  struct module;
> -struct paravirt_patch_site;
>  
>  struct callthunk_sites {
>   s32 *call_start, *call_end;
> - struct paravirt_patch_site  *pv_start, *pv_end;
> + struct alt_instr*alt_start, *alt_end;
>  };
>  
>  #ifdef CONFIG_CALL_THUNKS
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 3749311d51c3..9c6c5cfa9fe2 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -740,20 +740,23 @@ void native_pv_lock_init(void) __init;
>  
>  #ifdef CONFIG_X86_64
>  #ifdef CONFIG_PARAVIRT_XXL
> +#ifdef CONFIG_DEBUG_ENTRY
>  
>  #define PARA_PATCH(off)  ((off) / 8)
>  #define PARA_SITE(ptype, ops)_PVSITE(ptype, ops, .quad, 8)
>  #define PARA_INDIRECT(addr)  *addr(%rip)
>  
> -#ifdef CONFIG_DEBUG_ENTRY
>  .macro PARA_IRQ_save_fl
>   PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),
> ANNOTATE_RETPOLINE_SAFE;
> call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);)
> + ANNOTATE_RETPOLINE_SAFE;
> + call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);
>  .endm
>  
> -#define SAVE_FLAGS   ALTERNATIVE "PARA_IRQ_save_fl;", "pushf; pop %rax;", \
> - ALT_NOT_XEN
> +#define SAVE_FLAGS ALTERNATIVE_2 "PARA_IRQ_save_fl;",
> \
> +  ALT_CALL_INSTR, ALT_CALL_ALWAYS,   \
> +  "pushf; pop %rax;", ALT_NOT_XEN

How is that supposed to work?

At build time for a PARAVIRT_XXL build it'll have that PARA_IRQ_save_fl
macro in there which issues a .parainstructions section and an indirect
call to

call *pv_ops+240(%rip);

then it'll always patch in "call BUG_func" due to X86_FEATURE_ALWAYS.

I guess this is your way of saying "this should always be patched, one
way or the other, depending on X86_FEATURE_XENPV, and this is a way to
catch unpatched locations...

Then on a pv build which doesn't set X86_FEATURE_XENPV during boot,
it'll replace the "call BUG_func" thing with the pushf;pop.

And if it does set X86_FEATURE_XENPV, it'll patch in the direct call to
 /me greps tree ... pv_native_save_fl I guess.

If anything, how those ALT_CALL_ALWAYS things are supposed to work,
should be documented there, over the macro definition and what the
intent is.

Because otherwise we'll have to swap in the whole machinery back into
our L1s each time we need to touch it.

And btw, this whole patching stuff becomes insanely non-trivial slowly.

:-\

> diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
> index faa9f2299848..200ea8087ddb 100644
> --- a/arch/x86/kernel/callthunks.c
> +++ b/arch/x86/kernel/callthunks.c
> @@ -238,14 +238,13 @@ patch_call_sites(s32 *start, s32 *end, const struct 
> core_text *ct)
>  }
>  
>  static __init_or_module void
> -patch_paravirt_call_sites(struct paravirt_patch_site *start,
> -   struct paravirt_patch_site *end,
> -   const

Re: [PATCH v4 3/5] x86/paravirt: introduce ALT_NOT_XEN

2023-11-14 Thread Borislav Petkov

On Mon, Oct 30, 2023 at 03:25:06PM +0100, Juergen Gross wrote:
> Introduce the macro ALT_NOT_XEN as a short form of
> ALT_NOT(X86_FEATURE_XENPV).

Not crazy about adding yet another macro indirection - at least with the
X86_FEATURE_ it is clear what this is. But ok, whatever.

Anyway, this patch can be the first one in the series.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v13 00/12] Add AMD SEV guest live migration support

2021-04-20 Thread Borislav Petkov

On Tue, Apr 20, 2021 at 09:08:26PM +0200, Paolo Bonzini wrote:
> Yup, for now it's all at kvm/queue and it will land in kvm/next tomorrow
> (hopefully).  The guest interface patches in KVM are very near the top.

Thx, I'll have a look tomorrow.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] docs: fix the invalid vt-d spec location

2021-04-20 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 04:05:12PM +0800, Steven Zhou wrote:
> Do you have any other suggestion about the link location please ?

Yeah, I'm working on it. We need to come up with a proper solution for
all those docs but it'll take time...

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v13 00/12] Add AMD SEV guest live migration support

2021-04-20 Thread Borislav Petkov

Hey Paolo,

On Tue, Apr 20, 2021 at 01:11:31PM +0200, Paolo Bonzini wrote:
> I have queued patches 1-6.
> 
> For patches 8 and 10 I will post my own version based on my review and
> feedback.

can you pls push that tree up to here to a branch somewhere so that ...
 
> For guest patches, please repost separately so that x86 maintainers will
> notice them and ack them.

... I can take a look at the guest bits in the full context of the
changes?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2] tools/power turbostat: Fix RAPL summary collection on AMD processors

2021-04-20 Thread Borislav Petkov

On Tue, Apr 20, 2021 at 09:15:41PM +0800, Chen Yu wrote:
> This patch was sent to Len and it is not in public repo yet.

If that thing were in a public repo, you'd have the advantage of
pointing people to it and they could be testing patches.

-- 
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG 
Nürnberg

Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code

2021-04-20 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:11PM -0500, Brijesh Singh wrote:

Btw, for all your patches where the subject prefix is only "x86:":

The tip tree preferred format for patch subject prefixes is
'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
'genirq/core:'. Please do not use file names or complete file paths as
prefix. 'git log path/to/file' should give you a reasonable hint in most
cases.

The condensed patch description in the subject line should start with a
uppercase letter and should be written in imperative tone.

Please go over them and fix that up.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-20 Thread Borislav Petkov

On Tue, Apr 20, 2021 at 07:46:26AM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> If you have any other suggestion, please let me know.

Looks almost ok...

> From: Tony Luck 
> Date: Tue, 20 Apr 2021 16:42:01 +0900
> Subject: [PATCH 1/3] mm/memory-failure: Use a mutex to avoid memory_failure()
>  races
> 
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Along with introducing an additional goto label, this patch also

... avoid having "This patch" or "This commit" in the commit message.
It is tautologically useless. Also, you don't have to explain what the
patch does - that's visible hopefully. :-)

Other than that, it makes sense and the "sandwitching" looks correct:

mutex_lock
lock_page

...

unlock_page
mutex_unlock

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-20 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 11:33:08AM -0700, Dave Hansen wrote:
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

Yeah, see my reply to Brijesh. Not in the #PF handler but when the guest
does update the RMP table on page allocation, we should split the kernel
mapping too, so that it corresponds to what's being changed in the RMP
table.

Dunno how useful it would be if we also do coalescing of 4K pages into
their corresponding 2M pages... I haven't looked at set_memory.c for a
long time and have forgotten about all details...

In any case, my main goal here is to keep the tables in sync so that we
don't have to do crazy splitting in unsafe contexts like #PF. I hope I'm
making sense...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-20 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 12:46:53PM -0500, Brijesh Singh wrote:
> - KVM calls  alloc_page() to allocate a VMSA page. The allocator returns
> 0xc820 (PFN 0x200, page-level=2M). The VMSA page is private
> page so KVM will call RMPUPDATE to add the page as a private page in the
> RMP table. While adding the RMP entry the KVM will use the page level=4K.

Right, and *here* we split the 2M page on the *host* so that there's no
discrepancy between the host pagetable and the RMP. I guess your patch
does exactly that. :)

And AFAIR, set_memory.c doesn't have the functionality to coalesce 4K
pages back into the corresponding 2M page. Which means, the RMP table is
in sync, more or less.

Thx and thanks for elaborating.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86, sched: Fix the AMD CPPC maximum perf on some specific generations

2021-04-20 Thread Borislav Petkov

On Tue, Apr 20, 2021 at 04:09:43PM +0800, Huang Rui wrote:
> Some AMD Ryzen generations has different calculation method on maximum
> perf. 255 is not for all asics, some specific generations used 166 as
> the maximum perf. This patch is to fix the different maximum perf value

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> of AMD CPPC.
> 
> Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD 
> systems")
> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost 
> frequencies")
> 
> Reported-by: Jason Bagavatsingham 
> Tested-by: Jason Bagavatsingham 
> Signed-off-by: Huang Rui 
> Cc: Alex Deucher 
> Cc: Nathan Fontenot 
> Cc: Rafael J. Wysocki 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> ---
>  arch/x86/kernel/smpboot.c  | 33 ++-
>  drivers/cpufreq/acpi-cpufreq.c | 41 ++
>  2 files changed, 73 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 02813a7f3a7c..705bc5ceb1ea 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -2033,6 +2033,37 @@ static bool intel_set_max_freq_ratio(void)
>  }
>  
>  #ifdef CONFIG_ACPI_CPPC_LIB
> +static u64 amd_get_highest_perf(void)
> +{

struct cpuinfo_x86 *c = _cpu_data;

and then you can use "c" everywhere.

> + u64 cppc_max_perf;

u64 for something which fits in a byte?

Also,
max_perf = 255;

and you can remove the else and default branches below.

> +
> + switch (boot_cpu_data.x86) {
> + case 0x17:
> + if ((boot_cpu_data.x86_model >= 0x30 &&
> +  boot_cpu_data.x86_model < 0x40) ||
> + (boot_cpu_data.x86_model >= 0x70 &&
> +  boot_cpu_data.x86_model < 0x80))
> + cppc_max_perf = 166;
> + else
> + cppc_max_perf = 255;
> + break;
> + case 0x19:
> + if ((boot_cpu_data.x86_model >= 0x20 &&
> +  boot_cpu_data.x86_model < 0x30) ||
> + (boot_cpu_data.x86_model >= 0x40 &&
> +  boot_cpu_data.x86_model < 0x70))
> + cppc_max_perf = 166;
> + else
> + cppc_max_perf = 255;
> + break;
> + default:
> + cppc_max_perf = 255;
> + break;
> + }
> +
> + return cppc_max_perf;
> +}

Why is this here and not in arch/x86/kernel/cpu/amd.c?

> +
>  static bool amd_set_max_freq_ratio(void)
>  {
>   struct cppc_perf_caps perf_caps;



> @@ -2046,8 +2077,8 @@ static bool amd_set_max_freq_ratio(void)
>   return false;
>   }
>  
> - highest_perf = perf_caps.highest_perf;
>   nominal_perf = perf_caps.nominal_perf;
> + highest_perf = amd_get_highest_perf();
>  
>   if (!highest_perf || !nominal_perf) {
>   pr_debug("Could not retrieve highest or nominal performance\n");
> diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> index d1bbc16fba4b..e5c03360db20 100644
> --- a/drivers/cpufreq/acpi-cpufreq.c
> +++ b/drivers/cpufreq/acpi-cpufreq.c
> @@ -630,6 +630,44 @@ static int acpi_cpufreq_blacklist(struct cpuinfo_x86 *c)
>  #endif
>  
>  #ifdef CONFIG_ACPI_CPPC_LIB
> +
> +static u64 get_amd_max_boost_ratio(unsigned int cpu, u64 nominal_perf)
> +{
> + u64 boost_ratio, cppc_max_perf;
> +
> + if (!nominal_perf)
> + return 0;
> +
> + switch (boot_cpu_data.x86) {
> + case 0x17:
> + if ((boot_cpu_data.x86_model >= 0x30 &&
> +  boot_cpu_data.x86_model < 0x40) ||
> + (boot_cpu_data.x86_model >= 0x70 &&
> +  boot_cpu_data.x86_model < 0x80))
> + cppc_max_perf = 166;
> + else
> + cppc_max_perf = 255;
> + break;
> + case 0x19:
> + if ((boot_cpu_data.x86_model >= 0x20 &&
> +  boot_cpu_data.x86_model < 0x30) ||
> + (boot_cpu_data.x86_model >= 0x40 &&
> +  boot_cpu_data.x86_model < 0x70))
> + cppc_max_perf = 166;
> + else
> + cppc_max_perf = 255;
> + break;
> + default:
> + cppc_max_perf = 255;
> + break;

This chunk is repeated here. Why?

-- 
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG 
Nürnberg

Re: [PATCH v2] tools/power turbostat: Fix RAPL summary collection on AMD processors

2021-04-20 Thread Borislav Petkov

On Tue, Apr 20, 2021 at 10:03:36AM +0800, Chen Yu wrote:
> On Mon, Apr 19, 2021 at 02:58:12PM -0500, Terry Bowman wrote:
> > Turbostat fails to correctly collect and display RAPL summary information
> > on Family 17h and 19h AMD processors. Running turbostat on these processors
> > returns immediately. If turbostat is working correctly then RAPL summary
> > data is displayed until the user provided command completes. If a command
> > is not provided by the user then turbostat is designed to continuously
> > display RAPL information until interrupted.
> > 
> > The issue is due to offset_to_idx() and idx_to_offset() missing support for
> > AMD MSR addresses/offsets. offset_to_idx()'s switch statement is missing
> > cases for AMD MSRs and idx_to_offset() does not include a path to return
> > AMD MSR(s) for any idx.
> > 
> > The solution is add AMD MSR support to offset_to_idx() and idx_to_offset().
> > These functions are split-out and renamed along architecture vendor lines
> > for supporting both AMD and Intel MSRs.
> > 
> > Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL 
> > display")
> > Signed-off-by: Terry Bowman 
> Thanks for fixing, Terry, and previously there was a patch for this from Bas 
> Nieuwenhuizen:
> https://lkml.org/lkml/2021/3/12/682
> and it is expected to have been merged in Len's branch already.

Expected?

So is it or is it not?

And can you folks agree on a patch already and give it to Artem for
testing (CCed) because he's triggering it too:

https://bugzilla.kernel.org/show_bug.cgi?id=212357

Thx.

-- 
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG 
Nürnberg

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-19 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 05:33:03PM -0400, Len Brown wrote:
> For this to happen, every thread would not only have to include/link-with
> code that uses AMX, but that code would have to *run*.

It looks like either I'm not expressing myself clearly enough or you're
not reading my text: the *library* does that decision automatically!

Which means *every* possible thread on the system.

Which means, *every* thread has a fat 8K buffer attached to it because
the library uses AMX on its behalf by *default*.

> I'm sure that the AI guys are super excited about matrix multiplication,
> but I have a hard time imagining why grep(1) would find a use for it.

It doesn't matter if you're imagining it or not - what matters is if the
decision whether the thread uses AMX or not is put in the hands of the
thread and *NOT* in the hands of the library.

Which means, majority of the threads should not allow AMX and only a
handful who do, will have to explicitly state that. And the library will
have to comply. Not the library decides for every thread itself because
the feature's there.

> Indeed, if anyone expected AMX to be used by every task, we would have
> never gone to the trouble of inventing the XFD hardware to support the
> kernel's lazy 8KB buffer allocation.

If it gives me fat-buffers-off-by-default and on only for a handful
of threads which really want it and *request* it *explicitly*, sure,
whatever gets the job done.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-19 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 02:18:51PM -0400, Len Brown wrote:
> Yes, we could invent a new system call and mandate that it be called
> between #2 and #3. However, we'd still do #4 in response, so I don't
> see any value to that system call.

Lemme refresh your memory: there was a time when the kernel did lazy FPU
switching because tasks which really wanted to do that, would use FPU
insns and from the first use onwards, the kernel would shuffle an FPU
state buffer back'n'forth for the task, for the length of its lifetime.

Then glibc decided to use FPU in memcpy or whatever, leading up to
*every* task using the FPU which practically made us remove all that
lazy FPU switching logic and do eager FPU.

Back then that state was what, dunno, 1-2 KB tops.

Now imagine the same lazy => eager switch but with AVX or AMX or .

All of a sudden you have *every* thread sporting a fat 8K buffer because
the library decided to use a fat buffer feature for it.

Nope, I don't want that to happen.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-19 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 07:43:18AM +0900, Naoya Horiguchi wrote:
> From: Tony Luck 
> 
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck 
> Signed-off-by: Naoya Horiguchi 
> ---
>  mm/memory-failure.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- v5.12-rc5/mm/memory-failure.c
> +++ v5.12-rc5_patched/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>   return -ENXIO;
>   }

So the locking patterns are done in two different ways, which are
confusing when following this code:

> + mutex_lock(_mutex);
> +
>  try_again:
> - if (PageHuge(p))
> - return memory_failure_hugetlb(pfn, flags);
> + if (PageHuge(p)) {
> + res = memory_failure_hugetlb(pfn, flags);
> + goto out2;
> + }

You have the goto to a label where you do the unlocking (btw, pls do
s/out2/out_unlock/g;)...

> +
>   if (TestSetPageHWPoison(p)) {
>   pr_err("Memory failure: %#lx: already hardware poisoned\n",
>   pfn);
> + mutex_unlock(_mutex);
>   return 0;

... and you have the other case where you unlock before returning.

Since you've added the label, I think *all* the unlocking should do
"goto out_unlock" instead of doing either/or.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Borislav Petkov

On Mon, Apr 19, 2021 at 10:25:01AM -0500, Brijesh Singh wrote:
> To my understanding, we don't group 512 4K entries into a 2M for the
> kernel address range. We do this for the userspace address through
> khugepage daemon. If page tables get out of sync then it will cause an
> RMP violation, the Patch #7 adds support to split the pages on demand.

Ok. So I haven't reviewed the whole thing but, is it possible to keep
the RMP table in sync so that you don't have to split the physmap like
you do in this patch?

I.e., if the physmap page is 2M, then you have a corresponding RMP entry
of 2M so that you don't have to split. And if you have 4K, then the
corresponding RMP entry is 4K. You get the idea...

IOW, when does that happen: "During the page table walk, we may get into
the situation where one of the pages within the large page is owned by
the guest (i.e assigned bit is set in RMP)." In which case is a 4K page
- as part of a 2M physmap mapping - owned by a guest?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-19 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 06:05:10PM -0400, Len Brown wrote:
> I'm not aware of any intent to transparently use AMX for bcopy, like
> what happened
> with AVX-512.  (didn't they undo that mistake?)

No clue, did they?

> Tasks are created without an 8KB AMX buffer.
> Tasks have to actually touch the AMX TILE registers for us to allocate
> one for them.

When tasks do that it doesn't matter too much - for the library it does!

If the library does that by default and the processes which comprise
of that pipe I mentioned earlier, get all 8K buffers because the
underlying library decided so and swinging those buffers around when
saving/restoring contexts turns out to be a performance penalty, then we
have lost.

Lost because if that thing goes upstream in this way of use of AMX is
allowed implicitly, there ain't fixing it anymore once it becomes an
ABI.

So, that library should ask the kernel whether it supports AMX and only
use it if has gotten a positive answer. And by default that answer
should be "no" because the majority of processes - that same pipe I keep
mentioning - don't need it.

I have no good idea yet how granulary that should be - per process, per
thread, whatever, but there should be a way for the kernel to control
whether the library uses AMX, AVX512 or whatever fat state is out there
available.

Then, if a process wants the library to use AMX on its behalf, then it
can say so and the library can do that but only after having asked for
explicitly.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to pages, hardware checks to ensures that the assigned
> bit in the RMP is zero (i.e page is shared). If the page table entry that
> gives the sPA indicates that the target page size is a large page, then
> all RMP entries for the 4KB constituting pages of the target must have the
> assigned bit 0.

Hmm, so this is important: I read this such that we can have a 2M
page table entry but the RMP table can contain 4K entries for the
corresponding 512 4K pages. Is that correct?

If so, then there's a certain discrepancy here and I'd expect that if
the page gets split/collapsed, depending on the result, the RMP table
should be updated too, so that it remains in sync.

For example:

* mm decides to group all 512 4K entries into a 2M entry, RMP table gets
updated in the end to reflect that

* mm decides to split a page, RMP table gets updated too, for the same
reason.

In this way, RMP table will be always in sync with the pagetables.

I know, I probably am missing something but that makes most sense to
me instead of noticing the discrepancy and getting to work then, when
handling the RMP violation.

Or?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [patch] x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-19 Thread Borislav Petkov

Here's an attempt to explain what this fixes:

---
From: Mike Galbraith 
Date: Fri, 16 Apr 2021 14:02:07 +0200
Subject: [PATCH] x86/crash: Fix crash_setup_memmap_entries() out-of-bounds
 access

Commit in Fixes: added support for kexec-ing a kernel on panic using a
new system call. As part of it, it does prepare a memory map for the new
kernel.

However, while doing so, it wrongly accesses memory it has not
allocated: it accesses the first element of the cmem->ranges[] array in
memmap_exclude_ranges() but it has not allocated the memory for it in
crash_setup_memmap_entries(). As KASAN reports:

  BUG: KASAN: vmalloc-out-of-bounds in crash_setup_memmap_entries+0x17e/0x3a0
  Write of size 8 at addr c9426008 by task kexec/1187

  (gdb) list *crash_setup_memmap_entries+0x17e
  0x8107cafe is in crash_setup_memmap_entries 
(arch/x86/kernel/crash.c:322).
  317  unsigned long long mend)
  318 {
  319 unsigned long start, end;
  320
  321 cmem->ranges[0].start = mstart;
  322 cmem->ranges[0].end = mend;
  323 cmem->nr_ranges = 1;
  324
  325 /* Exclude elf header region */
  326 start = image->arch.elf_load_addr;
  (gdb)

Make sure the ranges array becomes a single element allocated.

 [ bp: Write a proper commit message. ]

Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call")
Signed-off-by: Mike Galbraith 
Signed-off-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/725fa3dc1da2737f0f6188a1a9701bead257ea9d.ca...@gmx.de
---
 arch/x86/kernel/crash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index a8f3af257e26..b1deacbeb266 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -337,7 +337,7 @@ int crash_setup_memmap_entries(struct kimage *image, struct 
boot_params *params)
struct crash_memmap_data cmd;
struct crash_mem *cmem;
 
-   cmem = vzalloc(sizeof(struct crash_mem));
+   cmem = vzalloc(struct_size(cmem, ranges, 1));
if (!cmem)
return -ENOMEM;
 
-- 
2.29.2

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] docs: fix the invalid vt-d spec location

2021-04-18 Thread Borislav Petkov

On Sun, Apr 18, 2021 at 09:29:46AM -0700, Liang Zhou wrote:
> This patch fixes the invalid vt-d spec location.

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> Signed-off-by: Liang Zhou 
> ---
>  Documentation/x86/intel-iommu.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/x86/intel-iommu.rst 
> b/Documentation/x86/intel-iommu.rst
> index 099f13d..e95ee34 100644
> --- a/Documentation/x86/intel-iommu.rst
> +++ b/Documentation/x86/intel-iommu.rst
> @@ -4,7 +4,7 @@ Linux IOMMU Support
>  
>  The architecture spec can be obtained from the below location.
>  
> -http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
> +https://software.intel.com/content/dam/develop/external/us/en/documents-tps/vt-directed-io-spec.pdf

Those links are never stable.

Please open a bugzilla at bugzilla.kernel.org, upload that document
there, like this, for example:

https://bugzilla.kernel.org/show_bug.cgi?id=206537

and then add the *bugzilla* link to intel-iommu.rst so that it doesn't
get invalid again.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [tip: x86/urgent] x86/dma: Tear down DMA ops on driver unbind

2021-04-17 Thread Borislav Petkov

On Thu, Apr 15, 2021 at 09:00:57AM -, tip-bot2 for Jean-Philippe Brucker 
wrote:
> The following commit has been merged into the x86/urgent branch of tip:
> 
> Commit-ID: 9f8614f5567eb4e38579422d38a1bdfeeb648ffc
> Gitweb:
> https://git.kernel.org/tip/9f8614f5567eb4e38579422d38a1bdfeeb648ffc
> Author:Jean-Philippe Brucker 
> AuthorDate:Wed, 14 Apr 2021 10:26:34 +02:00
> Committer: Borislav Petkov 
> CommitterDate: Thu, 15 Apr 2021 10:27:29 +02:00
> 
> x86/dma: Tear down DMA ops on driver unbind
> 
> Since
> 
>   08a27c1c3ecf ("iommu: Add support to change default domain of an iommu 
> group")
> 
> a user can switch a device between IOMMU and direct DMA through sysfs.
> This doesn't work for AMD IOMMU at the moment because dev->dma_ops is
> not cleared when switching from a DMA to an identity IOMMU domain. The
> DMA layer thus attempts to use the dma-iommu ops on an identity domain,
> causing an oops:
> 
>   # echo :00:05.0 > /sys/sys/bus/pci/drivers/e1000e/unbind
>   # echo identity > /sys/bus/pci/devices/:00:05.0/iommu_group/type
>   # echo :00:05.0 > /sys/sys/bus/pci/drivers/e1000e/bind
>...
>   BUG: kernel NULL pointer dereference, address: 0028
>...
>Call Trace:
> iommu_dma_alloc
> e1000e_setup_tx_resources
> e1000e_open
> 
> Implement arch_teardown_dma_ops() on x86 to clear the device's dma_ops
> pointer during driver unbind.
> 
>  [ bp: Massage commit message. ]
> 
> Fixes: 08a27c1c3ecf ("iommu: Add support to change default domain of an iommu 
> group")
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Borislav Petkov 
> Link: 
> https://lkml.kernel.org/r/20210414082633.877461-1-jean-phili...@linaro.org
> ---
>  arch/x86/Kconfig  | 1 +
>  arch/x86/kernel/pci-dma.c | 7 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879..2c90f8d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -85,6 +85,7 @@ config X86
>   select ARCH_HAS_STRICT_MODULE_RWX
>   select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   select ARCH_HAS_SYSCALL_WRAPPER
> + select ARCH_HAS_TEARDOWN_DMA_OPSif IOMMU_DMA
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>   select ARCH_HAS_DEBUG_WX
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index de234e7..60a4ec2 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -154,3 +154,10 @@ static void via_no_dac(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
>   PCI_CLASS_BRIDGE_PCI, 8, via_no_dac);
>  #endif
> +
> +#ifdef CONFIG_ARCH_HAS_TEARDOWN_DMA_OPS
> +void arch_teardown_dma_ops(struct device *dev)
> +{
> + set_dma_ops(dev, NULL);
> +}
> +#endif

Nope, sorry, no joy. Zapping it from tip.

With that patch, it fails booting on my test box with messages like
(typing up from video I took):

...
ata: softreset failed (1st FIS failed)
ahci :03:00:1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=...]
ahci :03:00:1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=...]
<--- EOF

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [tip:sched/core 23/23] kernel/sched/topology.c:17:2: error: 'sched_debug_verbose' undeclared; did you mean 'sched_debug_setup'?

2021-04-17 Thread Borislav Petkov

On Sat, Apr 17, 2021 at 02:47:46PM +0800, kernel test robot wrote:
>kernel/sched/topology.c: In function 'sched_debug_setup':
> >> kernel/sched/topology.c:17:2: error: 'sched_debug_verbose' undeclared 
> >> (first use in this function); did you mean 'sched_debug_setup'?
>   17 |  sched_debug_verbose = true;

I guess it needs this:

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 55232db745ac..916bb96ff9a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2364,6 +2364,7 @@ extern struct sched_entity *__pick_last_entity(struct 
cfs_rq *cfs_rq);
 
 #ifdef CONFIG_SCHED_DEBUG
 extern bool sched_debug_enabled;
+extern bool sched_debug_verbose;
 
 extern void print_cfs_stats(struct seq_file *m, int cpu);
 extern void print_rt_stats(struct seq_file *m, int cpu);

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 03:06:17PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 16, 2021 at 3:03 PM Borislav Petkov  wrote:
> >
> > On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
> > > __nocfi only disables CFI checking in a function, the compiler still
> > > changes function addresses to point to the CFI jump table, which is
> > > why we need function_nocfi().
> >
> > So call it __func_addr() or get_function_addr() or so, so that at least
> > it is clear what this does.
> >
> 
> This seems backwards to me.  If I do:
> 
> extern void foo(some signature);
> 
> then I would, perhaps naively, expect foo to be the actual symbol that

I'm just reading the patch:

... The function_nocfi macro always returns the address of the
+ * actual function instead.
+ */
+#define function_nocfi(x) ({   \
+   void *addr; \
+   asm("leaq " __stringify(x) "(%%rip), %0\n\t" : "=r" (addr));\
+   addr;

so it does a rip-relative load into a reg which ends up with the function
address.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
> __nocfi only disables CFI checking in a function, the compiler still
> changes function addresses to point to the CFI jump table, which is
> why we need function_nocfi().

So call it __func_addr() or get_function_addr() or so, so that at least
it is clear what this does.

Also, am I going to get a steady stream of patches adding that wrapper
to function names or is this it? IOW, have you built an allyesconfig to
see how many locations need touching?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 01:38:34PM -0700, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function addresses in
> instrumented C code with jump table addresses. This change implements
> the function_nocfi() macro, which returns the actual function address
> instead.
> 
> Signed-off-by: Sami Tolvanen 
> ---
>  arch/x86/include/asm/page.h | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 7555b48803a8..5499a05c44fc 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -71,6 +71,20 @@ static inline void copy_user_page(void *to, void *from, 
> unsigned long vaddr,
>  extern bool __virt_addr_valid(unsigned long kaddr);
>  #define virt_addr_valid(kaddr)   __virt_addr_valid((unsigned long) 
> (kaddr))
>  
> +#ifdef CONFIG_CFI_CLANG

Almost every patch is talking about this magical config symbol but it
is nowhere to be found. How do I disable it, is there a Kconfig entry
somewhere, etc, etc?

> +/*
> + * With CONFIG_CFI_CLANG, the compiler replaces function address
> + * references with the address of the function's CFI jump table
> + * entry. The function_nocfi macro always returns the address of the
> + * actual function instead.
> + */
> +#define function_nocfi(x) ({ \
> + void *addr; \
> + asm("leaq " __stringify(x) "(%%rip), %0\n\t" : "=r" (addr));\
> + addr;   \
> +})

Also, eww.

It seems all these newfangled things we get, mean obfuscating code even
more. What's wrong with using __nocfi instead? That's nicely out of the
way so slap that in front of functions.

Also, did you even build this on, say, 32-bit allmodconfig?

Oh well.

In file included from ./include/linux/ftrace.h:22:0,
 from ./include/linux/init_task.h:9,
 from init/init_task.c:2:
./include/linux/ftrace.h: In function ‘ftrace_init_nop’:
./arch/x86/include/asm/ftrace.h:9:40: error: implicit declaration of function 
‘function_nocfi’ [-Werror=implicit-function-declaration]
 # define MCOUNT_ADDR  ((unsigned long)(function_nocfi(__fentry__)))
^
./include/linux/ftrace.h:671:35: note: in expansion of macro ‘MCOUNT_ADDR’
  return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
   ^~~
In file included from ./include/linux/ftrace.h:22:0,
 from ./include/linux/perf_event.h:49,
 from ./include/linux/trace_events.h:10,
 from ./include/trace/syscall.h:7,
 from ./include/linux/syscalls.h:85,
 from init/main.c:21:
./include/linux/ftrace.h: In function ‘ftrace_init_nop’:
./arch/x86/include/asm/ftrace.h:9:40: error: implicit declaration of function 
‘function_nocfi’ [-Werror=implicit-function-declaration]
 # define MCOUNT_ADDR  ((unsigned long)(function_nocfi(__fentry__)))
^
./include/linux/ftrace.h:671:35: note: in expansion of macro ‘MCOUNT_ADDR’
  return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
   ^~~
In file included from ./include/linux/ftrace.h:22:0,
 from ./include/linux/perf_event.h:49,
 from ./include/linux/trace_events.h:10,
 from ./include/trace/syscall.h:7,
 from ./include/linux/syscalls.h:85,
 from init/initramfs.c:10:
./include/linux/ftrace.h: In function ‘ftrace_init_nop’:
./arch/x86/include/asm/ftrace.h:9:40: error: implicit declaration of function 
‘function_nocfi’ [-Werror=implicit-function-declaration]
 # define MCOUNT_ADDR  ((unsigned long)(function_nocfi(__fentry__)))
^
./include/linux/ftrace.h:671:35: note: in expansion of macro ‘MCOUNT_ADDR’
  return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
   ^~~
In file included from ./include/linux/ftrace.h:22:0,
 from ./include/linux/perf_event.h:49,
 from arch/x86/events/core.c:15:
./include/linux/ftrace.h: In function ‘ftrace_init_nop’:
./arch/x86/include/asm/ftrace.h:9:40: error: implicit declaration of function 
‘function_nocfi’ [-Werror=implicit-function-declaration]
 # define MCOUNT_ADDR  ((unsigned long)(function_nocfi(__fentry__)))
...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 06:40:55PM +0300, Kirill A. Shutemov wrote:
> Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> 
> Host side doesn't provide the feature yet, so it is a dead code for now.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  arch/x86/include/asm/cpufeatures.h   |  1 +
>  arch/x86/include/asm/kvm_para.h  |  5 +
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kernel/kvm.c| 17 +
>  include/uapi/linux/kvm_para.h|  3 ++-
>  5 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..d8f3d2619913 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -238,6 +238,7 @@
>  #define X86_FEATURE_VMW_VMMCALL  ( 8*32+19) /* "" VMware prefers 
> VMMCALL hypercall instruction */
>  #define X86_FEATURE_SEV_ES   ( 8*32+20) /* AMD Secure Encrypted 
> Virtualization - Encrypted State */
>  #define X86_FEATURE_VM_PAGE_FLUSH( 8*32+21) /* "" VM Page Flush MSR is 
> supported */
> +#define X86_FEATURE_KVM_MEM_PROTECTED( 8*32+22) /* "" KVM memory 
> protection extension */

That feature bit is still unused.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [patch] x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 03:16:07PM +0200, Mike Galbraith wrote:
> On Fri, 2021-04-16 at 14:16 +0200, Borislav Petkov wrote:
> >
> > Please be more verbose and structure your commit message like this:
> 
> Hrmph, I thought it was too verbose for dinky one-liner if anything. 

Please look at how other commit messages in tip have free text - not
only tools output.

Also, this looks like a fix for some previous commit. Please dig out
which commit introduced the issue and put its commit ID in a Fixes: tag
above your S-o-B.

If you don't have time or desire to do that, you can say so and I'll do
it myself when I get a chance.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [patch] x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Borislav Petkov

On Fri, Apr 16, 2021 at 02:02:07PM +0200, Mike Galbraith wrote:
> [   15.428011] BUG: KASAN: vmalloc-out-of-bounds in 
> crash_setup_memmap_entries+0x17e/0x3a0
> [   15.428018] Write of size 8 at addr c9426008 by task kexec/1187
> 
> (gdb) list *crash_setup_memmap_entries+0x17e
> 0x8107cafe is in crash_setup_memmap_entries 
> (arch/x86/kernel/crash.c:322).
> 317  unsigned long long mend)
> 318 {
> 319 unsigned long start, end;
> 320
> 321 cmem->ranges[0].start = mstart;
> 322 cmem->ranges[0].end = mend;
> 323 cmem->nr_ranges = 1;
> 324
> 325 /* Exclude elf header region */
> 326 start = image->arch.elf_load_addr;
> (gdb)
> 
> Append missing struct crash_mem_range to cmem.

This is winning this year's contest for most laconic patch commit
message! :-)

Please be more verbose and structure your commit message like this:

Problem is A.

It happens because of B.

Fix it by doing C.

(Potentially do D).

For more detailed info, see
Documentation/process/submitting-patches.rst, Section "2) Describe your
changes".

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[tip: x86/cleanups] MAINTAINERS: Remove me from IDE/ATAPI section

2021-04-16 Thread tip-bot2 for Borislav Petkov

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: df448cdfc01ffc117702a494ef302e7fb76df78a
Gitweb:
https://git.kernel.org/tip/df448cdfc01ffc117702a494ef302e7fb76df78a
Author:Borislav Petkov 
AuthorDate:Mon, 12 Apr 2021 10:59:51 +02:00
Committer: Borislav Petkov 
CommitterDate: Fri, 16 Apr 2021 12:58:34 +02:00

MAINTAINERS: Remove me from IDE/ATAPI section

It has been years since I've touched this and "this" is going away
anyway... any day now. :-)

So remove me so that I do not get CCed on bugs/patches.

Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20210412090346.31213-1...@alien8.de
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9e87692..93215a6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8603,9 +8603,8 @@ F:drivers/ide/
 F: include/linux/ide.h
 
 IDE/ATAPI DRIVERS
-M:     Borislav Petkov 
 L: linux-...@vger.kernel.org
-S: Maintained
+S: Orphan
 F: Documentation/cdrom/ide-cd.rst
 F: drivers/ide/ide-cd*

Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers

2021-04-15 Thread Borislav Petkov

On Thu, Apr 15, 2021 at 01:08:09PM -0500, Brijesh Singh wrote:
> This is from Family 19h Model 01h Rev B01. The processor which
> introduces the SNP feature. Yes, I have already upload the PPR on the BZ.
> 
> The PPR is also available at AMD: https://www.amd.com/en/support/tech-docs

Please add the link in the bugzilla to the comments here - this is the
reason why stuff is being uploaded in the first place, because those
vendor sites tend to change and those links become stale with time.

> I guess I was trying to shorten the name. I am good with struct rmpentry;

Yes please - typedefs are used only in very specific cases.

> All those magic numbers are documented in the PPR.

We use defines - not magic numbers. For example

#define RMPTABLE_ENTRIES_OFFSET 0x4000

The 8 is probably

PAGE_SHIFT - RMPENTRY_SHIFT

because you have GPA bits [50:12] and an RMP entry is 16 bytes, i.e., 1 << 4.

With defines it is actually clear what the computation is doing - with
naked numbers not really.

> APM does not provide the offset of the entry inside the RMP table.

It does, kinda, but in the pseudocode of those new insns in APM v3. From
PVALIDATE pseudo:

RMP_ENTRY_PA = RMP_BASE + 0x4000 + (SYSTEM_PA / 0x1000) * 16

and that last

/ 0x1000 * 16

is actually

>> 12 - 4

i.e., the >> 8 shift.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction

2021-04-15 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:09PM -0500, Brijesh Singh wrote:
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 06394b6d56b2..7a0138cb3e17 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -644,3 +644,44 @@ rmpentry_t *lookup_page_in_rmptable(struct page *page, 
> int *level)
>   return entry;
>  }
>  EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
> +
> +int rmptable_psmash(struct page *page)

psmash() should be enough like all those other wrappers around insns.

> +{
> + unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> + int ret;
> +
> + if (!static_branch_unlikely(_enable_key))
> + return -ENXIO;
> +
> + /* Retry if another processor is modifying the RMP entry. */

Also, a comment here should say which binutils version supports the
insn mnemonic so that it can be converted to "psmash" later. Ditto for
rmpupdate below.

Looking at the binutils repo, it looks like since version 2.36.

/me rebuilds objdump...

> + do {
> + asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> +   : "=a"(ret)
> +   : "a"(spa)
> +   : "memory", "cc");
> + } while (ret == PSMASH_FAIL_INUSE);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(rmptable_psmash);
> +
> +int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)

rmpupdate()

> +{
> + unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> + bool flush = true;
> + int ret;
> +
> + if (!static_branch_unlikely(_enable_key))
> + return -ENXIO;
> +
> + /* Retry if another processor is modifying the RMP entry. */
> + do {
> + asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> +  : "=a"(ret)
> +  : "a"(spa), "c"((unsigned long)val), "d"(flush)
^^^

what's the cast for?

"d"(flush)?

There's nothing in the APM talking about RMPUPDATE taking an input arg
in %rdx?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers

2021-04-15 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c

Also, why is all this SNP stuff landing in this file instead of in sev.c
or so which is AMD-specific?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers

2021-04-15 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
> The lookup_page_in_rmptable() can be used by the host to read the RMP
> entry for a given page. The RMP entry format is documented in PPR
> section 2.1.5.2.

I see

Table 15-36. Fields of an RMP Entry

in the APM.

Which PPR do you mean? Also, you know where to put those documents,
right?

> +/* RMP table entry format (PPR section 2.1.5.2) */
> +struct __packed rmpentry {
> + union {
> + struct {
> + uint64_t assigned:1;
> + uint64_t pagesize:1;
> + uint64_t immutable:1;
> + uint64_t rsvd1:9;
> + uint64_t gpa:39;
> + uint64_t asid:10;
> + uint64_t vmsa:1;
> + uint64_t validated:1;
> + uint64_t rsvd2:1;
> + } info;
> + uint64_t low;
> + };
> + uint64_t high;
> +};
> +
> +typedef struct rmpentry rmpentry_t;

Eww, a typedef. Why?

struct rmpentry is just fine.

> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 39461b9cb34e..06394b6d56b2 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -34,6 +34,8 @@
>  
>  #include "mm_internal.h"
>  

<--- Needs a comment here to explain the magic 0x4000 and the magic
shift by 8.

> +#define rmptable_page_offset(x)  (0x4000 + (((unsigned long) x) >> 8))
> +
>  /*
>   * Since SME related variables are set early in the boot process they must
>   * reside in the .data section so as not to be zeroed out when the .bss
> @@ -612,3 +614,33 @@ static int __init mem_encrypt_snp_init(void)
>   * SEV-SNP must be enabled across all CPUs, so make the initialization as a 
> late initcall.
>   */
>  late_initcall(mem_encrypt_snp_init);
> +
> +rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)

snp_lookup_page_in_rmptable()

> +{
> + unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> + rmpentry_t *entry, *large_entry;
> + unsigned long vaddr;
> +
> + if (!static_branch_unlikely(_enable_key))
> + return NULL;
> +
> + vaddr = rmptable_start + rmptable_page_offset(phys);
> + if (WARN_ON(vaddr > rmptable_end))

Do you really want to spew a warn on splat for each wrong vaddr? What
for?

> + return NULL;
> +
> + entry = (rmpentry_t *)vaddr;
> +
> + /*
> +  * Check if this page is covered by the large RMP entry. This is needed 
> to get
> +  * the page level used in the RMP entry.
> +  *

No need for a new line in the comment and no need for the "e.g." thing
either.

Also, s/the large RMP entry/a large RMP entry/g.

> +  * e.g. if the page is covered by the large RMP entry then page size is 
> set in the
> +  *   base RMP entry.
> +  */
> + vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
> + large_entry = (rmpentry_t *)vaddr;
> + *level = rmpentry_pagesize(large_entry);
> +
> + return entry;
> +}
> +EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);

Exported for kvm?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] Documentation/submitting-patches: Document RESEND tag on patches

2021-04-15 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 03:02:21PM -0600, Jonathan Corbet wrote:
> For future installments, could you send them in their own thread as an
> ordinary patch so I don't need to edit in the changelog after applying
> them?

Ok, sure but I might not need to anymore because, AFAICT, what is left
is really tip-tree specific and that can finally be the tip-tree doc
file.

I'm pasting the whole thing below, if you think something's still
generic enough, lemme know and I'll carve it out.

Thx.

---
.. SPDX-License-Identifier: GPL-2.0

The tip tree handbook
=

What is the tip tree?
-

The tip tree is a collection of several subsystems and areas of
development. The tip tree is both a direct development tree and a
aggregation tree for several sub-maintainer trees. The tip tree gitweb URL
is: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

The tip tree contains the following subsystems:

   - **x86 architecture**

 The x86 architecture development takes place in the tip tree except
 for the x86 KVM and XEN specific parts which are maintained in the
 corresponding subsystems and routed directly to mainline from
 there. It's still good practice to Cc the x86 maintainers on
 x86-specific KVM and XEN patches.

 Some x86 subsystems have their own maintainers in addition to the
 overall x86 maintainers.  Please Cc the overall x86 maintainers on
 patches touching files in arch/x86 even when they are not called out
 by the MAINTAINER file.

 Note, that ``x...@kernel.org`` is not a mailing list. It is merely a
 mail alias which distributes mails to the x86 top-level maintainer
 team. Please always Cc the Linux Kernel mailing list (LKML)
 ``linux-kernel@vger.kernel.org``, otherwise your mail ends up only in
 the private inboxes of the maintainers.

   - **Scheduler**

 Scheduler development takes place in the -tip tree, in the
 sched/core branch - with occasional sub-topic trees for
 work-in-progress patch-sets.

   - **Locking and atomics**

 Locking development (including atomics and other synchronization
 primitives that are connected to locking) takes place in the -tip
 tree, in the locking/core branch - with occasional sub-topic trees
 for work-in-progress patch-sets.

   - **Generic interrupt subsystem and interrupt chip drivers**:

 - interrupt core development happens in the irq/core branch

 - interrupt chip driver development also happens in the irq/core
   branch, but the patches are usually applied in a separate maintainer
   tree and then aggregated into irq/core

   - **Time, timers, timekeeping, NOHZ and related chip drivers**:

 - timekeeping, clocksource core, NTP and alarmtimer development
   happens in the timers/core branch, but patches are usually applied in
   a separate maintainer tree and then aggregated into timers/core

 - clocksource/event driver development happens in the timers/core
   branch, but patches are mostly applied in a separate maintainer tree
   and then aggregated into timers/core

   - **Performance counters core, architecture support and tooling**:

 - perf core and architecture support development happens in the
   perf/core branch

 - perf tooling development happens in the perf tools maintainer
   tree and is aggregated into the tip tree.

   - **CPU hotplug core**

   - **RAS core**

   - **EFI core**

 EFI development in the efi git tree. The collected patches are
 aggregated in the tip efi/core branch.

   - **RCU**

 RCU development happens in the linux-rcu tree. The resulting changes
 are aggregated into the tip core/rcu branch.

   - **Various core code components**:

   - debugobjects

   - objtool

   - random bits and pieces

Patch submission notes
--

Selecting the tree/branch
^

In general, development against the head of the tip tree master branch is
fine, but for the subsystems which are maintained separately, have their
own git tree and are only aggregated into the tip tree, development should
take place against the relevant subsystem tree or branch.

Bug fixes which target mainline should always be applicable against the
mainline kernel tree. Potential conflicts against changes which are already
queued in the tip tree are handled by the maintainers.

Patch subject
^

The tip tree preferred format for patch subject prefixes is
'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
'genirq/core:'. Please do not use file names or complete file paths as
prefix. 'git log path/to/file' should give you a reasonable hint in most
cases.

The condensed patch description in the subject line should start with a
uppercase letter and should be written in imperative tone.

Changelog
^

The general rules about changelogs in the process documentation, see
:ref:`Documentation/process/ `,

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Borislav Petkov

On Thu, Apr 15, 2021 at 07:29:38AM +0200, Willy Tarreau wrote:
> What Len is saying is that not being interested in a feature is not an
> argument for rejecting its adoption,

Oh, I'm not rejecting its adoption - no, don't mean that.

> which I'm perfectly fine with. But conversely not being interested in
> a feature is also an argument for insisting that its adoption doesn't
> harm other use cases (generally speaking, not this specific case
> here).

Pretty much.

What I'd like to see is 0-overhead for current use cases and only
overhead for those who want to use it. If that can't be done
automagically, then users should request it explicitly. So basically you
blow up the xsave buffer only for processes which want to do AMX.

And this brings the question about libraries which, if they start using
AMX by default - which doesn't sound like they will want to because AMX
reportedly will have only a limited? set of users - if libraries start
using it by default, then it better be worth the handling of the 8kb
buffer per process.

If not, this should also be requestable per process so that a simple
pipe in Linux:

 | grep | awk | sed ...

and so on is not penalized to allocate and handle by default 8kb for
*each* process' buffer in that pipe just because each is linking against
glibc which has detected AMX support in CPUID and is using it too for
some weird reason like some microbenchmark saying so.

All AFAIU, ofc.

But my initial question was on the "establishing" part and was asking
where we have established anything wrt AMX.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Borislav Petkov

On Wed, Apr 14, 2021 at 05:57:22PM -0400, Len Brown wrote:
> I'm pretty sure that the "it isn't my use case of interest, so it
> doesn't matter" line of reasoning has long been established as -EINVAL
> ;-)

I have only a very faint idea what you're trying to say here. Please
explain properly and more verbosely what exactly has been established
where?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-14 Thread Borislav Petkov

On Wed, Apr 14, 2021 at 07:46:49AM -0700, Jue Wang wrote:
> I can see this is useful in other types of domains, e.g., on multi-tenant 
> cloud
> servers where many VMs are collocated on the same host,
> with proper recovery + live migration, a single MCE would only affect a single
> VM at most.
> 
> Another type of generic use case may be services that can tolerate
> abrupt crash,
> i.e., they periodically save checkpoints to persistent storage or are 
> stateless
> services in nature and are managed by some process manager to automatically
> restart and resume from where the work was left at when crashed.

Yap, thanks for those.

So I do see a disconnect between us doing those features in the kernel
and not really seeing how people use them. So this helps, I guess the VM
angle will become important real soon - if not already - so hopefully
we'll get more feedback in the future.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-14 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> This path is when EPT #PF finds accesses to a hwpoisoned page and
> sends SIGBUS to user space (KVM exits into user space) with the same
> semantic as if regular #PF found access to a hwpoisoned page.
> 
> The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
> 
> We are in process to launch a product with MCE recovery capability in
> a KVM based virtualization product and plan to expand the scope of the
> application of it in the near future.

Any pointers to code or is this all non-public? Any text on what that
product does with the MCEs?

> The in-memory database and analytical domain are definitely using it.
> A couple examples:
> SAP HANA - as we've tested and planned to launch as a strategic
> enterprise use case with MCE recovery capability in our product
> SQL server - 
> https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors

Aha, so they register callbacks for the processes to exec on a memory
error. Good to know, thanks for those.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-14 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 04:13:03PM +, Luck, Tony wrote:
> Even if no applications ever do anything with it, it is still useful to avoid
> crashing the whole system and just terminate one application/guest.

True.

> There's one more item on my long term TODO list. Add fixups so that
> copy_to_user() from poison in the page cache doesn't crash, but just
> checks to see if the page was clean .. .in which case re-read from the
> filesystem into a different physical page and retire the old page ... the
> read can now succeed. If the page is dirty, then fail the read (and retire
> the page ... need to make sure filesystem knows the data for the page
> was lost so subsequent reads return -EIO or something).

Makes sense.

> Page cache occupies enough memory that it is a big enough
> source of system crashes that could be avoided. I'm not sure
> if there are any other obvious cases after this ... it all gets into
> diminishing returns ... not really worth it to handle a case that
> only occupies 0.2% of memory.

Ack.

> See above. With core counts continuing to increase, the cloud service
> providers really want to see fewer events that crash the whole physical
> machine (taking down dozens, or hundreds, of guest VMs).

Yap.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-04-14 Thread Borislav Petkov

On Wed, Apr 14, 2021 at 01:30:43PM +0200, Florian Weimer wrote:
> Is this discussion about better behavior (at least diagnostics) for
> existing applications, without any code changes?  Or an alternative
> programming model?

Former.

> Does noavx512 acutally reduce the XSAVE size to AVX2 levels?

Yeah.

> Or would you need noxsave?

I don't think so.

> One possibility is that the sigaltstack size check prevents application
> from running which work just fine today because all they do is install a
> stack overflow handler, and stack overflow does not actually happen.

So sigaltstack(2) says in the NOTES:

   Functions  called  from  a signal handler executing on an alternate 
signal stack
   will also use the alternate signal stack.  (This also applies  to  any  
handlers
   invoked for other signals while the process is executing on the 
alternate signal
   stack.)  Unlike the standard stack, the system does not automatically 
extend the
   alternate  signal  stack.   Exceeding the allocated size of the 
alternate signal
   stack will lead to unpredictable results.

> So if sigaltstack fails and the application checks the result of the
> system call, it probably won't run at all. Shifting the diagnostic to
> the pointer where the signal would have to be delivered is perhaps the
> only thing that can be done.

So using the example from the same manpage:

   The most common usage of an alternate signal stack is to handle the 
SIGSEGV sig‐
   nal that is generated if the space available for the normal process 
stack is ex‐
   hausted: in this case, a signal handler for SIGSEGV cannot  be  invoked  
on  the
   process stack; if we wish to handle it, we must use an alternate signal 
stack.

and considering these "unpredictable results" would it make sense or
even be at all possible to return SIGFAIL from that SIGSEGV signal
handler which should run on the sigaltstack but that sigaltstack
overflows?

I think we wanna be able to tell the process through that previously
registered SIGSEGV handler which is supposed to run on the sigaltstack,
that that stack got overflowed.

Or is this use case obsolete and this is not what people do at all?

> As for SIGFAIL in particular, I don't think there are any leftover
> signal numbers.  It would need a prctl to assign the signal number, and
> I'm not sure if there is a useful programming model because signals do
> not really compose well even today.  SIGFAIL adds another point where
> libraries need to collaborate, and we do not have a mechanism for that.
> (This is about what Rich Felker termed “library-safe code”, proper
> maintenance of process-wide resources such as the current directory.)

Oh fun.

I guess if Linux goes and does something, people would adopt it and
it'll become standard. :-P

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-04-14 Thread Borislav Petkov

On Mon, Apr 12, 2021 at 10:30:23PM +, Bae, Chang Seok wrote:
> On Mar 26, 2021, at 03:30, Borislav Petkov  wrote:
> > On Thu, Mar 25, 2021 at 09:56:53PM -0700, Andy Lutomirski wrote:
> >> We really ought to have a SIGSIGFAIL signal that's sent, double-fault
> >> style, when we fail to send a signal.
> > 
> > Yeap, we should be able to tell userspace that we couldn't send a
> > signal, hohumm.
> 
> Hi Boris,
> 
> Let me clarify some details as preparing to include this in a revision.
> 
> So, IIUC, a number needs to be assigned for this new SIGFAIL. At a glance, not
> sure which one to pick there in signal.h -- 1-31 fully occupied and the rest
> for 33 different real-time signals.
> 
> Also, perhaps, force_sig(SIGFAIL) here, instead of return -1 -- to die with
> SIGSEGV.

I think this needs to be decided together with userspace people so that
they can act accordingly and whether it even makes sense to them.

Florian, any suggestions?

Subthread starts here:

https://lkml.kernel.org/r/calcetrxqzuvjqrhdmst6ppgtjxas_spk2jhwmimdnpunq45...@mail.gmail.com

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Borislav Petkov

On Wed, Apr 14, 2021 at 12:06:39PM +0200, Willy Tarreau wrote:
> And change jobs :-)

I think by the time that happens, we'll be ready to go to the eternal
vacation. Which means: not my problem.

:-)))

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote:
> AMX does the type of matrix multiplication that AI algorithms use. In
> the unlikely event that you or one of the libraries you call are doing
> the same, then you will be very happy with AMX. Otherwise, you'll
> probably not use it.

Which sounds to me like AMX is something which should not be enabled
automatically but explicitly requested. I don't see the majority of the
processes on the majority of the Linux machines out there doing AI with
AMX - at least not anytime soon. If it becomes ubiquitous later, we can
make it automatic then.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support

2021-04-14 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 12:04:07PM -0500, Brijesh Singh wrote:
> @@ -538,6 +540,10 @@
>  #define MSR_K8_SYSCFG0xc0010010
>  #define MSR_K8_SYSCFG_MEM_ENCRYPT_BIT23
>  #define MSR_K8_SYSCFG_MEM_ENCRYPTBIT_ULL(MSR_K8_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_K8_SYSCFG_SNP_EN_BIT 24
> +#define MSR_K8_SYSCFG_SNP_EN BIT_ULL(MSR_K8_SYSCFG_SNP_EN_BIT)
> +#define MSR_K8_SYSCFG_SNP_VMPL_EN_BIT25
> +#define MSR_K8_SYSCFG_SNP_VMPL_ENBIT_ULL(MSR_K8_SYSCFG_SNP_VMPL_EN_BIT)
>  #define MSR_K8_INT_PENDING_MSG   0xc0010055
>  /* C1E active bits in int pending message */
>  #define K8_INTP_C1E_ACTIVE_MASK  0x1800

Ok, I believe it is finally time to make this MSR architectural and drop
this silliness with "K8" in the name. If you wanna send me a prepatch which
converts all like this:

MSR_K8_SYSCFG -> MSR_AMD64_SYSCFG

I'll gladly take it. If you prefer me to do it, I'll gladly do it.

> @@ -44,12 +45,16 @@ u64 sev_check_data __section(".data") = 0;
>  EXPORT_SYMBOL(sme_me_mask);
>  DEFINE_STATIC_KEY_FALSE(sev_enable_key);
>  EXPORT_SYMBOL_GPL(sev_enable_key);
> +DEFINE_STATIC_KEY_FALSE(snp_enable_key);
> +EXPORT_SYMBOL_GPL(snp_enable_key);
>  
>  bool sev_enabled __section(".data");
>  
>  /* Buffer used for early in-place encryption by BSP, no locking needed */
>  static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);
>  
> +static unsigned long rmptable_start, rmptable_end;

__ro_after_init I guess.

> +
>  /*
>   * When SNP is active, this routine changes the page state from private to 
> shared before
>   * copying the data from the source to destination and restore after the 
> copy. This is required
> @@ -528,3 +533,82 @@ void __init mem_encrypt_init(void)
>   print_mem_encrypt_feature_info();
>  }
>  
> +static __init void snp_enable(void *arg)
> +{
> + u64 val;
> +
> + rdmsrl_safe(MSR_K8_SYSCFG, );

Why is this one _safe but the wrmsr isn't? Also, _safe returns a value -
check it pls and return early.

> +
> + val |= MSR_K8_SYSCFG_SNP_EN;
> + val |= MSR_K8_SYSCFG_SNP_VMPL_EN;
> +
> + wrmsrl(MSR_K8_SYSCFG, val);
> +}
> +
> +static __init int rmptable_init(void)
> +{
> + u64 rmp_base, rmp_end;
> + unsigned long sz;
> + void *start;
> + u64 val;
> +
> + rdmsrl_safe(MSR_AMD64_RMP_BASE, _base);
> + rdmsrl_safe(MSR_AMD64_RMP_END, _end);

Ditto, why _safe if you're checking CPUID?

> +
> + if (!rmp_base || !rmp_end) {
> + pr_info("SEV-SNP: Memory for the RMP table has not been 
> reserved by BIOS\n");
> + return 1;
> + }
> +
> + sz = rmp_end - rmp_base + 1;
> +
> + start = memremap(rmp_base, sz, MEMREMAP_WB);
> + if (!start) {
> + pr_err("SEV-SNP: Failed to map RMP table 0x%llx-0x%llx\n", 
> rmp_base, rmp_end);
^^^

That prefix is done by doing

#undef pr_fmt
#define pr_fmt(fmt) "SEV-SNP: " fmt

before the SNP-specific functions.

> + return 1;
> + }
> +
> + /*
> +  * Check if SEV-SNP is already enabled, this can happen if we are 
> coming from kexec boot.
> +  * Do not initialize the RMP table when SEV-SNP is already.
> +  */

comment can be 80 cols wide.

> + rdmsrl_safe(MSR_K8_SYSCFG, );

As above.

> + if (val & MSR_K8_SYSCFG_SNP_EN)
> + goto skip_enable;
> +
> + /* Initialize the RMP table to zero */
> + memset(start, 0, sz);
> +
> + /* Flush the caches to ensure that data is written before we enable the 
> SNP */
> + wbinvd_on_all_cpus();
> +
> + /* Enable the SNP feature */
> + on_each_cpu(snp_enable, NULL, 1);

What happens if you boot only a subset of the CPUs and then others get
hotplugged later? IOW, you need a CPU hotplug notifier which enables the
feature bit on newly arrived CPUs.

Which makes me wonder whether it makes sense to have this in an initcall
and not put it instead in init_amd(): the BSP will do the init work
and the APs coming in will see that it has been enabled and only call
snp_enable().

Which solves the hotplug thing automagically.

> +
> +skip_enable:
> + rmptable_start = (unsigned long)start;
> + rmptable_end = rmptable_start + sz;
> +
> + pr_info("SEV-SNP: RMP table physical address 0x%016llx - 0x%016llx\n", 
> rmp_base, rmp_end);

  "RMP table at ..."

also, why is this issued in skip_enable? You want to issue it only once,
on enable.

also, rmp_base and rmp_end look redundant - you can simply use
rmptable_start and rmptable_end.

Which reminds me - that function needs to check as the very first thing
on entry whether SNP is enabled and exit if so - there's no need to read
MSR_AMD64_RMP_BASE and MSR_AMD64_RMP_END unnecessarily.

> +
> + return 0;
> +}
> +
> +static int __init mem_encrypt_snp_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> + return 1;
> +
> + if (rmptable_init()) {
> +

Re: [PATCH] MAINTAINERS: Remove me from IDE/ATAPI section

2021-04-14 Thread Borislav Petkov

On Wed, Apr 14, 2021 at 07:39:25AM +0100, Christoph Hellwig wrote:
> On Mon, Apr 12, 2021 at 11:03:46AM +0200, Borislav Petkov wrote:
> >  IDE/ATAPI DRIVERS
> > -M: Borislav Petkov 
> >  L: linux-...@vger.kernel.org
> >  S: Maintained
> >  F: Documentation/cdrom/ide-cd.rst
> 
> The Maintained should also become Orphaned then.

Ok, will correct and push through tip this week.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [thermal] 9223d0dccb: stress-ng.msg.ops_per_sec -27.4% regression

2021-04-13 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 09:58:01PM +0800, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -27.4% regression of stress-ng.msg.ops_per_sec due to 
> commit:
> 
> 
> commit: 9223d0dccb8f8523754122f68316dd1a4f39f7f8 ("thermal: Move therm_throt 
> there from x86/mce")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Hmm, so I went and ran your reproducer, but simplified (see end of
mail), on a KBL box here. The kernel is tip:x86/urgent from last week:

5.12.0-rc6+
---
stress-ng: info:  [1430] dispatching hogs: 9 msg
stress-ng: info:  [1430] successful run completed in 60.01s (1 min, 0.01 secs)
stress-ng: info:  [1430] stressor   bogo ops real time  usr time  sys time  
 bogo ops/s   bogo ops/s
stress-ng: info:  [1430]   (secs)(secs)(secs)   
(real time) (usr+sys time)
stress-ng: info:  [1430] msg   237390147 60.01104.03255.85  
 3955872.56659636.95
stress-ng: info:  [1430] for a 60.01s run time:
stress-ng: info:  [1430] 360.08s available CPU time
stress-ng: info:  [1430] 104.11s user time   ( 28.91%)
stress-ng: info:  [1430] 255.93s system time ( 71.08%)
stress-ng: info:  [1430] 360.04s total time  ( 99.99%)
stress-ng: info:  [1430] load average: 8.47 3.71 1.48

Now the same kernel with

>   4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")
>   9223d0dccb ("thermal: Move therm_throt there from x86/mce")

reverted.

5.12.0-rc6-rev+
---
stress-ng: info:  [1246] dispatching hogs: 9 msg
stress-ng: info:  [1246] successful run completed in 60.02s (1 min, 0.02 secs)
stress-ng: info:  [1246] stressor   bogo ops real time  usr time  sys time  
 bogo ops/s   bogo ops/s
stress-ng: info:  [1246]   (secs)(secs)(secs)   
(real time) (usr+sys time)
stress-ng: info:  [1246] msg   215174467 60.01 99.64260.24  
 3585438.79597906.15
stress-ng: info:  [1246] for a 60.02s run time:
stress-ng: info:  [1246] 360.10s available CPU time
stress-ng: info:  [1246]  99.72s user time   ( 27.69%)
stress-ng: info:  [1246] 260.32s system time ( 72.29%)
stress-ng: info:  [1246] 360.04s total time  ( 99.98%)
stress-ng: info:  [1246] load average: 7.98 2.33 0.80

so if I'm reading this correctly, reverting the patches here brings the
*slow-down*.

What's up?

reproducer:
--

#!/usr/bin/bash

for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*
do
online_file="$cpu_dir"/online
[ -f "$online_file" ] && [ "$(cat "$online_file")" -eq 0 ] && continue

file="$cpu_dir"/cpufreq/scaling_governor
[ -f "$file" ] && echo "performance" > "$file"
done

stress-ng --timeout 60 --times --verify --metrics-brief --msg 9

-- 
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG 
Nürnberg

Re: Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Borislav Petkov

On Tue, Apr 13, 2021 at 08:54:55PM +0800, Kemeng Shi wrote:
> Yes. And NT stores should be better for copy_page especially copying a lot
> of pages as only partial memory of copied page will be access recently.

I thought "should be better" too last time when I measured rep; movs vs
NT stores but actual measurements showed no real difference.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[PATCH] Documentation/submitting-patches: Document RESEND tag on patches

2021-04-13 Thread Borislav Petkov

Hi Jon,

here's the next piece of documentation which should be generic enough.

Thx.

---
From: Borislav Petkov 
Date: Tue, 13 Apr 2021 13:26:29 +0200

Explain when a submitter should tag a patch or a patch series with the
"RESEND" tag.

This has been partially carved out from a tip subsystem handbook
patchset by Thomas Gleixner:

  https://lkml.kernel.org/r/20181107171010.421878...@linutronix.de

and incorporates follow-on comments.

Signed-off-by: Borislav Petkov 
---
 Documentation/process/submitting-patches.rst | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/process/submitting-patches.rst 
b/Documentation/process/submitting-patches.rst
index ab92d9ccd39a..9284735e0b34 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -346,6 +346,16 @@ that you have sent your patches to the right place.  Wait 
for a minimum of
 one week before resubmitting or pinging reviewers - possibly longer during
 busy times like merge windows.
 
+It's also ok to resend the patch or the patch series after a couple of
+weeks with the word "RESEND" added to the subject line::
+
+   [PATCH Vx RESEND] sub/sys: Condensed patch summary
+
+Don't add "RESEND" when you are submitting a modified version of your
+patch or patch series - "RESEND" only applies to resubmission of a
+patch or patch series which have not been modified in any way from the
+previous submission.
+
 
 Include PATCH in the subject
 -
-- 
2.29.2


-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Borislav Petkov

+ linux-nvdimm

Original mail at 
https://lkml.kernel.org/r/3f28adee-8214-fa8e-b368-eaf8b1934...@huawei.com

On Tue, Apr 13, 2021 at 02:25:58PM +0800, Kemeng Shi wrote:
> I'm using AEP with dax_kmem drvier, and AEP is export as a NUMA node in

What is AEP?

> my system. I will move cold pages from DRAM node to AEP node with
> move_pages system call. With old "rep movsq', it costs 2030ms to move
> 1 GB pages. With "movnti", it only cost about 890ms to move 1GB pages.

So there's __copy_user_nocache() which does NT stores.

> - ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
> + ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
> +  "jmp copy_page_nt", X86_FEATURE_XMM2

This makes every machine which has sse2 do NT stores now. Which means
*every* machine practically.

The folks on linux-nvdimm@ should be able to give you a better idea what
to do.

HTH.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-13 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 10:08:52AM -0700, Luck, Tony wrote:
> Also not clear to me either ... but sending a SIGBUS to a kthread isn't
> going to do anything useful. So avoiding doing that is another worthy
> goal.

Ack.

> With these patches nothing gets killed when kernel touches user poison.
> If this is in a regular system call then these patches will return EFAULT
> to the user (but now that I see EHWPOISON exists that looks like a better
> choice - so applications can distinguish the "I just used an invalid address 
> in
> a parameter to a syscall" from "This isn't my fault, the memory broke".

Yap, makes sense.

> KVM apparently passes a machine check into the guest.

Ah, there it is:

static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct 
*tsk)
{
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
}

> Though it seems to be misisng the MCG_STATUS information to tell
> the guest whether this is an "Action Required" machine check, or an
> "Action Optional" (i.e. whether the poison was found synchonously by
> execution of the current instruction, or asynchronously).

Yeah, someone needs to deal with that sooner or later, considering how
virt is becoming ubiquitous.

> There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing
> seems wrong in that, but could use some updates.

Ah yap, that.

So what I'm missing with all this fun is, yeah, sure, we have this
facility out there but who's using it? Is anyone even using it at all?

If so, does it even make sense, does it need improvements, etc?

Because from where I stand it all looks like we do all these fancy
recovery things but is userspace even paying attention or using them or
whatever...

> I don't know how much detail we might want to go into on recovery
> stratgies for applications.

Perhaps an example or two how userspace is supposed to use it...

> In terms of production s/w there was one ISV who prototyped recovery
> for their application but last time I checked didn't enable it in the
> production version.

See, one and it hasn't even enabled it. So it all feels like a lot of
wasted energy to me to do all those and keep 'em working. But maybe I'm
missing the big picture and someone would come and say, no, Boris, we
use this and this and that is our feedback...

> Essentially it boils down to:
> SIGBUS handler gets additional data giving virtual address that has gone away
> 
> 1) Can the application replace the lost page?
>   Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
>   and fill with replacement data. This case can return from SIGBUS
>   handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
>   Hunt down pointers to lost page and update structures to say
>   "this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
>   Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit

Right and this should probably start with "Hey userspace folks, here's
what you can do when you get a hwpoison signal and here's how we
envision this recovery to work" and then we can all discuss and converge
on an agreeable solution which is actually used and there will be
selftests and so on and so on...

But what the h*ll do I know.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: x86: report: write to unrecognized MSR

2021-04-13 Thread Borislav Petkov

On Mon, Apr 12, 2021 at 03:09:41PM -0700, Randy Dunlap wrote:
> 
> [   27.075563] msr: Write to unrecognized MSR 0x1b0 by x86_energy_perf (pid: 
> 1223).
> [   27.078979] msr: See 
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for details.
> 
> (aka x86_energy_perf_policy)
> 
> 
> on linux-next-20210409 / x86_64.

You're using an old x86_energy_perf tool on a new kernel. That's fixed
upstream.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Borislav Petkov

On Mon, Apr 12, 2021 at 04:38:15PM +0200, Florian Weimer wrote:
> Yes, that's why we have the XGETBV handshake.  I was imprecise.  It's
> CPUID + XGETBV of course.  Or even AT_HWCAP2 (for FSGSBASE).

Ok, that sounds better. So looking at glibc sources, I see something
like this:

init_cpu_features
|-> update_usable
 |-> CPU_FEATURE_SET_USABLE (cpu_features, XGETBV_ECX_1);

so I'm guessing this happens when the library gets loaded per process,
right?

Which means, once the detection has taken place and the library has
gotten XCR0, it is going to use it and won't re-ask the kernel or so?

I.e., I'm trying to imagine how a per-process thing would work at all.
If at all.

And this sounds especially "fun":

> Code that installs a signal handler often does not have control on
> which thread an asynchronous signal is delivered, or which code it
> interrupts.

In my simplistic approach I'm thinking about something along the lines
of:

Library: hey kernel, can you handle AVX512?
Kernel: yes
Library: ok, I will use that in the signal handler

And since kernel has said yes, kernel is going to take care of handling
AVX512 state and library can assume that.

All those old processes which cannot be recompiled, for them I guess the
kernel should have to say no.

Dunno how much sense that makes...

> > And the CPUID-faulting thing would solve stuff like that because then
> > the kernel can *actually* get involved into answering something where it
> > has a say in, too.
> 
> But why wouldn't we use a syscall or an entry in the auxiliary vector
> for that?  Why fault a potentially performance-critical instruction?

Oh sure, CPUID faulting was just an example. I think the intent is to
have this important aspect of userspace asking the kernel first what
kind of features it can handle and then do accordingly.

IOW, legacy stuff can work unchanged and new libraries and kernels can
support fancier features and bigger buffers.

Methinks.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Borislav Petkov

On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote:
> Maybe we could have done this in 2016 when I reported this for the first
> time.  Now it is too late, as more and more software is using
> CPUID-based detection for AVX-512.

So as I said on another mail today, I don't think a library should rely
solely on CPUID-based detection of features especially if those features
need kernel support too. IOW, it should ask whether the kernel can
handle those too, first.

And the CPUID-faulting thing would solve stuff like that because then
the kernel can *actually* get involved into answering something where it
has a say in, too.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 13/13] x86/kernel: add support to validate memory when changing C-bit

2021-04-12 Thread Borislav Petkov

On Mon, Apr 12, 2021 at 07:55:01AM -0500, Brijesh Singh wrote:
> The cur_entry is updated by the hypervisor. While building the psc
> buffer the guest sets the cur_entry=0 and the end_entry point to the
> last valid entry. The cur_entry is incremented by the hypervisor after
> it successfully processes one 4K page. As per the spec, the hypervisor
> could get interrupted in middle of the page state change and cur_entry
> allows the guest to resume the page state change from the point where it
> was interrupted.

This is non-obvious and belongs in a comment above it. Otherwise it
looks weird.

> Since we can get interrupted while executing the PSC so just to be safe
> I re-initialized the scratch scratch area with our buffer instead of
> relying on old values.

Ditto.

> As per the spec the caller must check that the cur_entry > end_entry to
> determine whether all the entries are processed. If not then retry the
> state change. The hypervisor will skip the previously processed entries.
> The snp_page_state_vmgexit() is implemented to return only after all the
> entries are changed.

Ditto.

This whole mechanism of what the guest does and what the HV does, needs
to be explained in a bigger comment somewhere around there so that it is
clear what's going on.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Borislav Petkov

On Sun, Apr 11, 2021 at 03:07:29PM -0400, Len Brown wrote:
> If it is the program, how does it know that the library wants to use
> what instructions?
> 
> If it is the library, then you have just changed XCR0 at run-time and
> you expose breakage of the thread library that has computed XSAVE size.

So, when old programs which cannot possibly know about the arch_prctl()
extension we're proposing here, link against that library, then that
library should not be allowed to go use "fat" states.

Unless the library can "tell" the process which links to it, that it
has dynamically enlarged the save state. If it can and the process can
handle that, then all is fine, save state gets enlarged dynamically and
it all continues merrily.

Also, in order for the library to use fat states, it needs to ask the
kernel for such support - not CPUID - because the kernel is doing the
state handling for everybody and doing all the CR4.OSXSAVE setup etc.

Which also means that the kernel can help here by telling the library:

- No, you cannot use fat states with this process because it hasn't
called arch_prctl() so it cannot handle them properly.

- Yes, this process allowes me to handle fat states for it so you can
use those states and thus those instructions when doing operations for
it.

So the kernel becomes the arbiter in all this - as it should be - and
then all should work fine.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 13/13] x86/kernel: add support to validate memory when changing C-bit

2021-04-12 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 11:44:24AM -0500, Brijesh Singh wrote:
> @@ -161,3 +162,108 @@ void __init early_snp_set_memory_shared(unsigned long 
> vaddr, unsigned long paddr
>/* Ask hypervisor to make the memory shared in the RMP table. */
>   early_snp_set_page_state(paddr, npages, SNP_PAGE_STATE_SHARED);
>  }
> +
> +static int snp_page_state_vmgexit(struct ghcb *ghcb, struct 
> snp_page_state_change *data)

That function name definitely needs changing. The
vmgexit_page_state_change() one too. They're currenty confusing as hell
and I can't know what each one does without looking at its function
body.

> +{
> + struct snp_page_state_header *hdr;
> + int ret = 0;
> +
> + hdr = >header;
> +
> + /*
> +  * The hypervisor can return before processing all the entries, the 
> loop below retries
> +  * until all the entries are processed.
> +  */
> + while (hdr->cur_entry <= hdr->end_entry) {

This doesn't make any sense: snp_set_page_state() builds a "set" of
pages to change their state in a loop and this one iterates *again* over
*something* which I'm not even clear on what.

Is something setting cur_entry to end_entry eventually?

In any case, why not issue those page state changes one-by-one in
snp_set_page_state() or is it possible that HV can do a couple of
them in one go so you have to poke it here until it sets cur_entry ==
end_entry?

> + ghcb_set_sw_scratch(ghcb, (u64)__pa(data));

Why do you have to call that here for every loop iteration...

> + ret = vmgexit_page_state_change(ghcb, data);

 and in that function too?!

> + /* Page State Change VMGEXIT can pass error code through 
> exit_info_2. */
> + if (ret || ghcb->save.sw_exit_info_2)
> + break;
> + }
> +
> + return ret;

You don't need that ret variable - just return value directly.

> +}
> +
> +static void snp_set_page_state(unsigned long paddr, unsigned int npages, int 
> op)
> +{
> + unsigned long paddr_end, paddr_next;
> + struct snp_page_state_change *data;
> + struct snp_page_state_header *hdr;
> + struct snp_page_state_entry *e;
> + struct ghcb_state state;
> + struct ghcb *ghcb;
> + int ret, idx;
> +
> + paddr = paddr & PAGE_MASK;
> + paddr_end = paddr + (npages << PAGE_SHIFT);
> +
> + ghcb = sev_es_get_ghcb();

That function can return NULL.

> + data = (struct snp_page_state_change *)ghcb->shared_buffer;
> + hdr = >header;
> + e = &(data->entry[0]);

So

e = data->entry;

?

> + memset(data, 0, sizeof (*data));
> +
> + for (idx = 0; paddr < paddr_end; paddr = paddr_next) {

As before, a while loop pls.

> + int level = PG_LEVEL_4K;

Why does this needs to happen on each loop iteration? It looks to me you
wanna do below:

e->pagesize = X86_RMP_PG_LEVEL(PG_LEVEL_4K);

instead.

> +
> + /* If we cannot fit more request then issue VMGEXIT before 
> going further.  */
   any more requests

No "we" pls.

> + if (hdr->end_entry == (SNP_PAGE_STATE_CHANGE_MAX_ENTRY - 1)) {
> + ret = snp_page_state_vmgexit(ghcb, data);
> + if (ret)
> + goto e_fail;

WARN

> +
> + idx = 0;
> + memset(data, 0, sizeof (*data));
> + e = &(data->entry[0]);
> + }

The order of the operations in this function looks weird: what you
should do is:

- clear stuff, memset etc.
- build shared buffer
- issue vmgexit

so that you don't have the memset and e reinit twice and the flow can
be more clear and you don't have two snp_page_state_vmgexit() function
calls when there's a trailing set of entries which haven't reached
SNP_PAGE_STATE_CHANGE_MAX_ENTRY.

Maybe a double-loop or so.

...

> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..19ee18ddbc37 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -27,6 +27,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include "../mm_internal.h"
>  
> @@ -2001,8 +2003,25 @@ static int __set_memory_enc_dec(unsigned long addr, 
> int numpages, bool enc)
>*/
>   cpa_flush(, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>  
> + /*
> +  * To maintain the security gurantees of SEV-SNP guest invalidate the 
> memory before
> +  * clearing the encryption attribute.
> +  */

Align that comment on 80 cols, just like the rest of the comments do in
this file. Below too.

> + if (sev_snp_active() && !enc) {

Push that sev_snp_active() inside the function. Below too.

> + ret = snp_set_memory_shared(addr, numpages);
> + if (ret)
> + return ret;
> + }
> +
>   ret = __change_page_attr_set_clr(, 1);
>  
> + /*
> +  * Now that memory is

Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-12 Thread Borislav Petkov

On Sun, Apr 11, 2021 at 04:21:21PM -0700, Andy Lutomirski wrote:
> https://bugs.winehq.org/show_bug.cgi?id=33275#c19
> 
> I sure hope no one is still doing this.

Aha, IRET with rFLAGS.NT set. At least it is only an ad-hoc program to
fix this particular issue and I hope too it hasn't propagated somewhere
else.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[PATCH] MAINTAINERS: Remove me from IDE/ATAPI section

2021-04-12 Thread Borislav Petkov

From: Borislav Petkov 

It has been years since I've touched this and "this" is going away
anyway... any day now. :-)

So remove me so that I do not get CCed on bugs/patches.

Signed-off-by: Borislav Petkov 
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0ec903a142b5..8de7af2d709f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8611,7 +8611,6 @@ F:drivers/ide/
 F: include/linux/ide.h
 
 IDE/ATAPI DRIVERS
-M:     Borislav Petkov 
 L: linux-...@vger.kernel.org
 S: Maintained
 F: Documentation/cdrom/ide-cd.rst
-- 
2.29.2

Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Borislav Petkov

On Sun, Apr 11, 2021 at 09:57:20AM -0700, Andy Lutomirski wrote:
> Working around a kernel bug. The workaround only worked on AMD
> systems. The correct solution was to fix the kernel bug, not poke
> MSRs.

Do you remember which program(s) and where I can get them to have a
look?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Borislav Petkov

On Sat, Apr 10, 2021 at 11:52:17AM -0700, Andi Kleen wrote:
> Have you ever seen any user programs actually write those MSRs?
> I don't see why they ever would, it's not that they have any motivation
> to do it (unlike SMM), and I don't know of any examples.

You'd be surprised how much motivation people have to poke at random
MSRs. Just browse some of those tools on github which think poking at
MSRs is ok.

> The whole MSR blocking seems more like a tilting at windmills
> type effort.

Nope, this is trying to salvage the current situation of people thinking
it is a good idea to poke at random MSRs and develop all kinds of tools
around it and then run those tools ontop of a kernel which pokes at
those same MSRs.

It is not really hard to realize that touching resources in an
unsynchronized way is a disaster waiting to happen. No matter how useful
and how wonderful those tools are.

> But on a non locked down system fully accessible MSRs are really
> useful for all kind of debugging and tuning, and anything that
> prevents that is bad.

If you wanna do that, you can just as well patch your kernel.

We're currently tainting the kernel on MSR writes and perhaps that's
good enough for now but if some tool starts doing dumb crap and pokes at
MSRs it should not be poking at and users start complaining because of
it, I'm committing that.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2] x86: pat: Do not compile stubbed functions when X86_PAT is off

2021-04-11 Thread Borislav Petkov

On Sun, Apr 11, 2021 at 11:12:32AM +0200, Jan Kiszka wrote:
> The patches are coming from downstream usage, yes,

Ok, for the future, it would be good to mention that in the commit
message so that a committer knows what they're for.

> but I think the solutions are relevant cleanups for upstream as well.

Yeah but you know that we generally don't do "preemptive" fixes like
that. But those are small enough so I don't see a problem...

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2] x86: pat: Do not compile stubbed functions when X86_PAT is off

2021-04-11 Thread Borislav Petkov

On Sun, Apr 11, 2021 at 10:22:19AM +0200, Jan Kiszka wrote:
> On 26.10.20 18:39, Jan Kiszka wrote:
> > From: Jan Kiszka 
> >
> > Those are already provided by linux/io.h as stubs.
> >
> > The conflict remains invisible until someone would pull linux/io.h into
> > memtype.c.
> >
> > Signed-off-by: Jan Kiszka 
> > ---
> >
> > Change in v2:
> >  - correct commit message
> >
> >  arch/x86/mm/pat/memtype.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > index 8f665c352bf0..41a4ac585af3 100644
> > --- a/arch/x86/mm/pat/memtype.c
> > +++ b/arch/x86/mm/pat/memtype.c
> > @@ -800,6 +800,7 @@ void memtype_free_io(resource_size_t start, 
> > resource_size_t end)
> > memtype_free(start, end);
> >  }
> >
> > +#ifdef CONFIG_X86_PAT
> >  int arch_io_reserve_memtype_wc(resource_size_t start, resource_size_t size)
> >  {
> > enum page_cache_mode type = _PAGE_CACHE_MODE_WC;
> > @@ -813,6 +814,7 @@ void arch_io_free_memtype_wc(resource_size_t start, 
> > resource_size_t size)
> > memtype_free_io(start, start + size);
> >  }
> >  EXPORT_SYMBOL(arch_io_free_memtype_wc);
> > +#endif
> >
> >  pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> > unsigned long size, pgprot_t vma_prot)
> >
> 
> What happened to this?

What is this patch and your other asm/proto.h thing fixing?

Looks like you're using kernel headers in something else and don't want
to include the whole world thus those fixes... so that you can include
upstream kernel headers without having to touch them...

Or am I way off base here?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[GIT PULL] x86/urgent for v5.12-rc7

2021-04-11 Thread Borislav Petkov

Hi Linus,

please pull two more urgent fixes (not gonna say "final" this time) from
x86-land.

Thx.

---

The following changes since commit e49d033bddf5b565044e2abe4241353959bc9120:

  Linux 5.12-rc6 (2021-04-04 14:15:36 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
tags/x86_urgent_for_v5.12-rc7

for you to fetch changes up to 632a1c209b8773cb0119fe3aada9f1db14fa357c:

  x86/traps: Correct exc_general_protection() and math_error() return paths 
(2021-04-09 13:45:09 +0200)


- Fix the vDSO exception handling return path to disable interrupts
again.

- A fix for the CE collector to return the proper return values to its
callers which are used to convey what the collector has done with the
error address.


Thomas Tai (1):
  x86/traps: Correct exc_general_protection() and math_error() return paths

William Roche (1):
  RAS/CEC: Correct ce_add_elem()'s returned values

 arch/x86/kernel/traps.c |  4 ++--
 drivers/ras/cec.c   | 15 ---
 2 files changed, 14 insertions(+), 5 deletions(-)

-- 
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG 
Nürnberg

Re: [RFC 1/2] x86/tsc: add a timer to make sure tsc_adjust is always checked

2021-04-10 Thread Borislav Petkov

On Sat, Apr 10, 2021 at 10:48:56PM +0800, Feng Tang wrote:
> And the bigger risk is still BIOS's writing to TSC_ADJUST MSR beneath
> kernel.

For that we need to do more persuasive work with hw guys. Needs a *lot*
of perseverance.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[PATCH -v1.1] x86/msr: Block writes to certain MSRs unconditionally

2021-04-10 Thread Borislav Petkov

On Sat, Apr 10, 2021 at 07:51:58AM -0700, Andy Lutomirski wrote:
> Can you add STAR, CSTAR, LSTAR, SYSENTER*, SYSCALL*, and EFER please?

Sure.

---

From: Borislav Petkov 
Date: Sat, 10 Apr 2021 14:08:13 +0200

There are a bunch of MSRs which luserspace has no business poking
at, whatsoever. Add a ban list and put the TSC-related MSRs and the
ring0-entry and features control MSRs in there. Issue a big juicy splat
to catch offenders.

Signed-off-by: Borislav Petkov 
---
 arch/x86/kernel/msr.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index ed8ac6bcbafb..2435a619cd9f 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -78,6 +78,21 @@ static ssize_t msr_read(struct file *file, char __user *buf,
return bytes ? bytes : err;
 }
 
+static const u32 msr_ban_list[] = {
+   MSR_IA32_TSC,
+   MSR_TSC_AUX,
+   MSR_IA32_TSC_ADJUST,
+   MSR_IA32_TSC_DEADLINE,
+   MSR_EFER,
+   MSR_STAR,
+   MSR_CSTAR,
+   MSR_LSTAR,
+   MSR_SYSCALL_MASK,
+   MSR_IA32_SYSENTER_CS,
+   MSR_IA32_SYSENTER_ESP,
+   MSR_IA32_SYSENTER_EIP,
+};
+
 static int filter_write(u32 reg)
 {
/*
@@ -89,6 +104,16 @@ static int filter_write(u32 reg)
 * avoid saturating the ring buffer.
 */
static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1);
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(msr_ban_list); i++) {
+   if (msr_ban_list[i] != reg)
+   continue;
+
+   WARN_ONCE(1, "Blocked write to MSR 0x%x.\n", reg);
+
+   return -EINVAL;
+   }
 
switch (allow_writes) {
case MSR_WRITES_ON:  return 0;
-- 
2.29.2

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-10 Thread Borislav Petkov

From: Borislav Petkov 
Date: Sat, 10 Apr 2021 14:08:13 +0200

There are a bunch of MSRs which luserspace has no business poking at,
whatsoever. Add a ban list and put the TSC-related MSRs in there. Issue
a big juicy splat to catch offenders.

Signed-off-by: Borislav Petkov 
---
 arch/x86/kernel/msr.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index ed8ac6bcbafb..574bd2c6d4f8 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -78,6 +78,13 @@ static ssize_t msr_read(struct file *file, char __user *buf,
return bytes ? bytes : err;
 }
 
+static const u32 msr_ban_list[] = {
+   MSR_IA32_TSC,
+   MSR_TSC_AUX,
+   MSR_IA32_TSC_ADJUST,
+   MSR_IA32_TSC_DEADLINE,
+};
+
 static int filter_write(u32 reg)
 {
/*
@@ -89,6 +96,16 @@ static int filter_write(u32 reg)
 * avoid saturating the ring buffer.
 */
static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1);
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(msr_ban_list); i++) {
+   if (msr_ban_list[i] != reg)
+   continue;
+
+   WARN_ONCE(1, "Blocked write to MSR 0x%x\n", reg);
+
+   return -EINVAL;
+   }
 
switch (allow_writes) {
case MSR_WRITES_ON:  return 0;
-- 
2.29.2


-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC 1/2] x86/tsc: add a timer to make sure tsc_adjust is always checked

2021-04-10 Thread Borislav Petkov

On Sat, Apr 10, 2021 at 11:27:11AM +0200, Thomas Gleixner wrote:
> On Tue, Mar 30 2021 at 16:25, Feng Tang wrote:
> > Normally the tsc_sync will be checked every time system enters idle state,
> > but there is still caveat that a system won't enter idle, either because
> > it's too busy or configured purposely to not enter idle. Setup a periodic
> > timer to make sure the check is always on.
> 
> Bah. I really hate the fact that we don't have a knob to disable writes
> to the TSC/TSC_ADJUST msrs. That would spare this business alltogether.

We have the MSR filtering and I'd *love* to add those MSRs to a
permanent ban list of MSRs which will never ever be written to from
luserspace.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v24 04/30] x86/cpufeatures: Introduce X86_FEATURE_CET and setup functions

2021-04-10 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 04:14:09PM -0700, Yu, Yu-cheng wrote:
> > @@ -53,6 +55,8 @@ static short xsave_cpuid_features[] __initdata = {
> > X86_FEATURE_INTEL_PT,
> > X86_FEATURE_PKU,
> > X86_FEATURE_ENQCMD,
> > +   X86_FEATURE_CET, /* XFEATURE_CET_USER */
> > +   X86_FEATURE_CET, /* XFEATURE_CET_KERNEL */
> > 
> > or what is the piece which becomes simpler?
> 
> Yes, this is it.

Those should be X86_FEATURE_SHSTK no?

> Signals, arch_prctl, and ELF header are three places that need to depend on
> either shadow stack or IBT is configured.  To remain simple, we can make all
> three depend on CONFIG_X86_SHADOW_STACK, and in Kconfig, make CONFIG_X86_IBT
> depend on CONFIG_X86_SHADOW_STACK.  Without shadow stack, IBT itself is not
> as useful anyway.

Makes sense to me.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/5] x86/cpufeatures: Define feature bits to support mitigation of PSF

2021-04-09 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 02:45:23PM -0500, Saripalli, RK wrote:
> Yes, these options should be fine for now.
> Like you said, if we get the need to add prctl and seccomp, I can always do 
> that later.
> 
> What do you think auto should default to?. 
> In SSBD case, I believe auto defaults to prctl or seccomp.
> Since we will not have that here, we should choose something for auto.

Or not add it yet. Just have "on" and "off" for now.

Which begs the question should this be controllable by the mitigations=
switch too?

I wanna say, let's have people evaluate and play with it first and
we can add it to that switch later. As long as we don't change the
user-visible controls - if anything we'll be extending them later,
potentially - we should be fine usage-wise and from user visibility POV.

> All the other mitigation x86 mitigation code goes into kernel/cpu/bugs.c.
> I think psf_cmdline() or equivalent also belongs there and not in 
> kernel/cpu/amd.c.

It being AMD-specific, it can dwell in amd.c initially.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/5] x86/cpufeatures: Define feature bits to support mitigation of PSF

2021-04-09 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 01:22:49PM -0500, Saripalli, RK wrote:
> > And I think you don't need this one either if we do a "light" controls
> > thing but lemme look at the rest first.

Ok, and what I mean with "lite" version is something like this below
which needs finishing and testing.

Initially, it could support the cmdline params:

predict_store_fwd={on,off,auto}

to give people the opportunity to experiment with the feature.

If it turns out that prctl and seccomp per-task toggling is needed then
sure, we can extend but I don't see the reason for a whole separate set
of options yet. Especially is ssbd already controls this.

AFAICT, of course and if I'm not missing some other aspect here.

Thx.

---
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 2d11384dc9ab..226b73700f88 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1165,3 +1165,22 @@ void set_dr_addr_mask(unsigned long mask, int dr)
break;
}
 }
+
+static int __init psf_cmdline(char *str)
+{
+   if (!boot_cpu_has(X86_FEATURE_PSFD))
+   return 0;
+
+   if (!str)
+   return -EINVAL;
+
+   if (!strcmp(str, "off")) {
+   x86_spec_ctrl_base |= SPEC_CTRL_PSFD;
+   setup_clear_cpu_cap(X86_FEATURE_PSFD);
+   }
+
+   return 0;
+}
+early_param("predict_store_fwd", psf_cmdline);
+
+

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/5] x86/cpufeatures: Define feature bits to support mitigation of PSF

2021-04-09 Thread Borislav Petkov

On Tue, Apr 06, 2021 at 10:50:00AM -0500, Ramakrishna Saripalli wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index cc96e26d69f7..21e7f8d0d7d9 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -201,7 +201,7 @@
>  #define X86_FEATURE_INVPCID_SINGLE   ( 7*32+ 7) /* Effectively INVPCID && 
> CR4.PCIDE=1 */
>  #define X86_FEATURE_HW_PSTATE( 7*32+ 8) /* AMD HW-PState */
>  #define X86_FEATURE_PROC_FEEDBACK( 7*32+ 9) /* AMD ProcFeedbackInterface 
> */
> -/* FREE!( 7*32+10) */
> +#define X86_FEATURE_PSFD ( 7*32+10) /* Predictive Store Forward 
> Disable */

You don't need this one...

>  #define X86_FEATURE_PTI  ( 7*32+11) /* Kernel Page Table 
> Isolation enabled */
>  #define X86_FEATURE_RETPOLINE( 7*32+12) /* "" Generic 
> Retpoline mitigation for Spectre variant 2 */
>  #define X86_FEATURE_RETPOLINE_AMD( 7*32+13) /* "" AMD Retpoline 
> mitigation for Spectre variant 2 */
> @@ -309,6 +309,7 @@
>  #define X86_FEATURE_AMD_SSBD (13*32+24) /* "" Speculative Store 
> Bypass Disable */
>  #define X86_FEATURE_VIRT_SSBD(13*32+25) /* Virtualized 
> Speculative Store Bypass Disable */
>  #define X86_FEATURE_AMD_SSB_NO   (13*32+26) /* "" Speculative 
> Store Bypass is fixed in hardware. */
> +#define X86_FEATURE_AMD_PSFD (13*32+28) /* "" Predictive Store 
> Forward Disable */

... when you have this one. And this one is AMD-specific so you can just
as well call it X86_FEATURE_PSFD and remove the "".

>  
>  /* Thermal and Power Management Leaf, CPUID level 0x0006 (EAX), word 14 
> */
>  #define X86_FEATURE_DTHERM   (14*32+ 0) /* Digital Thermal Sensor */
> @@ -428,5 +429,6 @@
>  #define X86_BUG_TAA  X86_BUG(22) /* CPU is affected by TSX 
> Async Abort(TAA) */
>  #define X86_BUG_ITLB_MULTIHITX86_BUG(23) /* CPU may incur 
> MCE during certain page attribute changes */
>  #define X86_BUG_SRBDSX86_BUG(24) /* CPU may leak RNG 
> bits if not mitigated */
> +#define X86_BUG_PSF  X86_BUG(25) /* CPU is affected by 
> Predictive Store Forwarding attack */

And I think you don't need this one either if we do a "light" controls
thing but lemme look at the rest first.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v24 04/30] x86/cpufeatures: Introduce X86_FEATURE_CET and setup functions

2021-04-09 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 08:52:52AM -0700, Yu, Yu-cheng wrote:
> Recall we had complicated code for the XSAVES features detection in
> xstate.c.  Dave Hansen proposed the solution and then the whole thing
> becomes simple.  Because of this flag, even when only the shadow stack is
> available, the code handles it nicely.

Is that what you mean?

@@ -53,6 +55,8 @@ static short xsave_cpuid_features[] __initdata = {
X86_FEATURE_INTEL_PT,
X86_FEATURE_PKU,
X86_FEATURE_ENQCMD,
+   X86_FEATURE_CET, /* XFEATURE_CET_USER */
+   X86_FEATURE_CET, /* XFEATURE_CET_KERNEL */

or what is the piece which becomes simpler?

> Would this equal to only CONFIG_X86_CET (one Kconfig option)?  In fact, when
> you proposed only CONFIG_X86_CET, things became much simpler.

When you use CONFIG_X86_SHADOW_STACK instead, it should remain same
simple no?

> Practically, IBT is not much in terms of code size.  Since we have already
> separated the two, why don't we leave it as-is.  When people start using it
> more, there will be more feedback, and we can decide if one Kconfig is
> better?

Because when we add stuff to the kernel, we add the simplest and
cleanest version possible and later, when we determine that additional
functionality is needed, *then* we add it. Not the other way around.

Our Kconfig symbol space is already an abomination so we can't just add
some more and decide later.

What happens in such situations usually is stuff gets added, it bitrots
and some poor soul - very likely a maintainer who has to mop up after
everybody - comes and cleans it up. I'd like to save myself that
cleaning up.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 11/13] x86/kernel: validate rom memory before accessing when SEV-SNP is active

2021-04-09 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 11:44:22AM -0500, Brijesh Singh wrote:
> + /*
> +  * The ROM memory is not part of the E820 system RAM and is not 
> prevalidated by the BIOS.
> +  * The kernel page table maps the ROM region as encrypted memory, the 
> SEV-SNP requires
> +  * the all the encrypted memory must be validated before the access.
> +  */
> + if (sev_snp_active()) {
> + unsigned long n, paddr;
> +
> + n = ((system_rom_resource.end + 1) - video_rom_resource.start) 
> >> PAGE_SHIFT;
> + paddr = video_rom_resource.start;
> + early_snp_set_memory_private((unsigned long)__va(paddr), paddr, 
> n);
> + }

I don't like this sprinkling of SNP-special stuff that needs to be done,
around the tree. Instead, pls define a function called

snp_prep_memory(unsigned long pa, unsigned int num_pages, enum 
operation);

or so which does all the manipulation needed and the callsites only
simply unconditionally call that function so that all detail is
extracted and optimized away when not config-enabled.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature

2021-04-09 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 04:36:01PM +0300, Kirill A. Shutemov wrote:
> The patchset is still in path-finding stage. I'll be more specific once we
> settle on how the feature works.

This is not why I'm asking: these feature bits are visible to userspace
in /proc/cpuinfo and if you don't have a use case to show this to
userspace, use the "" in the comment.

And if you don't need that feature bit at all (you're only setting it
but not querying it) you should not add it at all.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v5 0/4] KVM: cpuid: fix KVM_GET_EMULATED_CPUID implementation

2021-04-09 Thread Borislav Petkov

On Fri, Apr 09, 2021 at 02:54:19PM +0200, Emanuele Giuseppe Esposito wrote:
> This series aims to clarify the behavior of the KVM_GET_EMULATED_CPUID
> ioctl, and fix a corner case where -E2BIG is returned when
> the nent field of struct kvm_cpuid2 is matching the amount of
> emulated entries that kvm returns.
> 
> Patch 1 proposes the nent field fix to cpuid.c,
> patch 2 updates the ioctl documentation accordingly and
> patches 3 and 4 extend the x86_64/get_cpuid_test.c selftest to check
> the intended behavior of KVM_GET_EMULATED_CPUID.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
> v5:
> - Better comment in cpuid.c (patch 1)

How about you wait a couple of days so people have a chance to look at
your patches instead of sending a new revision every day?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v24 04/30] x86/cpufeatures: Introduce X86_FEATURE_CET and setup functions

2021-04-09 Thread Borislav Petkov

On Thu, Apr 01, 2021 at 03:10:38PM -0700, Yu-cheng Yu wrote:
> Introduce a software-defined X86_FEATURE_CET, which indicates either Shadow
> Stack or Indirect Branch Tracking (or both) is present.  Also introduce
> related cpu init/setup functions.
> 
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> ---
> v24:
> - Update #ifdef placement to reflect Kconfig changes of splitting shadow 
> stack and ibt.
> 
>  arch/x86/include/asm/cpufeatures.h  |  2 +-
>  arch/x86/include/asm/disabled-features.h|  9 -
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/cpu/common.c| 14 ++
>  arch/x86/kernel/cpu/intel.c |  3 +++
>  5 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index bf861fc89fef..d771e62677de 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -108,7 +108,7 @@
>  #define X86_FEATURE_EXTD_APICID  ( 3*32+26) /* Extended APICID 
> (8 bits) */
>  #define X86_FEATURE_AMD_DCM  ( 3*32+27) /* AMD multi-node processor 
> */
>  #define X86_FEATURE_APERFMPERF   ( 3*32+28) /* P-State hardware 
> coordination feedback capability (APERF/MPERF MSRs) */
> -/* free  ( 3*32+29) */
> +#define X86_FEATURE_CET  ( 3*32+29) /* Control-flow 
> enforcement */

Right, I know we talked about having this synthetic flag but now that we
are moving to CONFIG_X86_SHADOW_STACK and separate SHSTK and IBT feature
bits, that synthetic flag is not needed anymore.

For the cases where you wanna test whether any of the two are present,
we're probably better off adding a x86_cet_enabled() helper which tests
SHSTK and IBT bits.

I haven't gone through the whole thing yet but depending on the context
and the fact that AMD doesn't support IBT, that helper might need some
tweaking too. I'll see.

>  #define X86_FEATURE_NONSTOP_TSC_S3   ( 3*32+30) /* TSC doesn't stop in S3 
> state */
>  #define X86_FEATURE_TSC_KNOWN_FREQ   ( 3*32+31) /* TSC has known frequency */
>  
> diff --git a/arch/x86/include/asm/disabled-features.h 
> b/arch/x86/include/asm/disabled-features.h
> index e5c6ed9373e8..018cd7acd3e9 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -74,13 +74,20 @@
>  #define DISABLE_SHSTK(1 << (X86_FEATURE_SHSTK & 31))
>  #endif
>  
> +#ifdef CONFIG_X86_CET

And you don't need that config item either - AFAICT, you can use
CONFIG_X86_SHADOW_STACK everywhere.

Which would simplify that config space.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 0/5] Introduce support for PSF mitigation

2021-04-09 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 09:56:47AM -0500, Saripalli, RK wrote:
> It is possible most applications have been reviewed and scrubbed for
> SSB-type attacks but PSF-type issues may not have been looked at yet.
> This may be one of the cases where SSB is enabled but PSF is disabled
> until the application(s) are scrubbed for the same.

Right, and for that I think we could do a slimmer version of the psfd=
toggle - no prctl and seccomp stuff - just the cmdline disable thing to
keep this simpler.

Btw "psfd=" is maybe too short and cryptic. It would probably be more
user-friendly if it were called:

predict_store_fwd={on,off,...}

or so.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery

2021-04-08 Thread Borislav Petkov

On Thu, Mar 25, 2021 at 05:02:35PM -0700, Tony Luck wrote:
...
> Expected worst case is two machine checks before moving on (e.g. one user
> access with page faults disabled, then a repeat to the same addrsss with
> page faults enabled). Just in case there is some code that loops forever
> enforce a limit of 10.
> 
> Signed-off-by: Tony Luck 
> ---
>  arch/x86/kernel/cpu/mce/core.c | 40 ++
>  include/linux/sched.h  |  1 +
>  2 files changed, 32 insertions(+), 9 deletions(-)

What I'm still unclear on, does this new version address that
"mysterious" hang or panic which the validation team triggered or you
haven't checked yet?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 09/13] x86/kernel: add support to validate memory in early enc attribute change

2021-04-08 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 11:44:20AM -0500, Brijesh Singh wrote:
> @@ -63,6 +63,10 @@ struct __packed snp_page_state_change {
>  #define GHCB_REGISTER_GPA_RESP   0x013UL
>  #define  GHCB_REGISTER_GPA_RESP_VAL(val) ((val) >> 12)
>  
> +/* Macro to convert the x86 page level to the RMP level and vice versa */
> +#define X86_RMP_PG_LEVEL(level)  (((level) == PG_LEVEL_4K) ? 
> RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
> +#define RMP_X86_PG_LEVEL(level)  (((level) == RMP_PG_SIZE_4K) ? 
> PG_LEVEL_4K : PG_LEVEL_2M)

Please add those with the patch which uses them for the first time.

Also, it seems to me the names should be

X86_TO_RMP_PG_LEVEL
RMP_TO_X86_PG_LEVEL

...

> @@ -56,3 +56,108 @@ void sev_snp_register_ghcb(unsigned long paddr)
>   /* Restore the GHCB MSR value */
>   sev_es_wr_ghcb_msr(old);
>  }
> +
> +static void sev_snp_issue_pvalidate(unsigned long vaddr, unsigned int 
> npages, bool validate)

pvalidate_pages() I guess.

> +{
> + unsigned long eflags, vaddr_end, vaddr_next;
> + int rc;
> +
> + vaddr = vaddr & PAGE_MASK;
> + vaddr_end = vaddr + (npages << PAGE_SHIFT);
> +
> + for (; vaddr < vaddr_end; vaddr = vaddr_next) {

Yuck, that vaddr_next gets initialized at the end of the loop. How about
using a while loop here instead?

while (vaddr < vaddr_end) {

...

vaddr += PAGE_SIZE;
}

then you don't need vaddr_next at all. Ditto for all the other loops in
this patch which iterate over pages.

> + rc = __pvalidate(vaddr, RMP_PG_SIZE_4K, validate, );

So this function gets only 4K pages to pvalidate?

> +

^ Superfluous newline.

> + if (rc) {
> + pr_err("Failed to validate address 0x%lx ret %d\n", 
> vaddr, rc);

You can combine the pr_err and dump_stack() below into a WARN() here:

WARN(rc, ...);

> + goto e_fail;
> + }
> +
> + /* Check for the double validation condition */
> + if (eflags & X86_EFLAGS_CF) {
> + pr_err("Double %salidation detected (address 0x%lx)\n",
> + validate ? "v" : "inv", vaddr);
> + goto e_fail;
> + }

As before - this should be communicated by a special retval from
__pvalidate().

> +
> + vaddr_next = vaddr + PAGE_SIZE;
> + }
> +
> + return;
> +
> +e_fail:
> + /* Dump stack for the debugging purpose */
> + dump_stack();
> +
> + /* Ask to terminate the guest */
> + sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);

Another termination reason to #define.

> +}
> +
> +static void __init early_snp_set_page_state(unsigned long paddr, unsigned 
> int npages, int op)
> +{
> + unsigned long paddr_end, paddr_next;
> + u64 old, val;
> +
> + paddr = paddr & PAGE_MASK;
> + paddr_end = paddr + (npages << PAGE_SHIFT);
> +
> + /* save the old GHCB MSR */
> + old = sev_es_rd_ghcb_msr();
> +
> + for (; paddr < paddr_end; paddr = paddr_next) {
> +
> + /*
> +  * Use the MSR protocol VMGEXIT to request the page state 
> change. We use the MSR
> +  * protocol VMGEXIT because in early boot we may not have the 
> full GHCB setup
> +  * yet.
> +  */
> + sev_es_wr_ghcb_msr(GHCB_SNP_PAGE_STATE_REQ_GFN(paddr >> 
> PAGE_SHIFT, op));
> + VMGEXIT();

Yeah, I know we don't always strictly adhere to 80 columns but there's
no real need not to fit that in 80 cols here so please shorten names and
comments. Ditto for the rest.

> +
> + val = sev_es_rd_ghcb_msr();
> +
> + /* Read the response, if the page state change failed then 
> terminate the guest. */
> + if (GHCB_SEV_GHCB_RESP_CODE(val) != 
> GHCB_SNP_PAGE_STATE_CHANGE_RESP)
> + sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);

if (...)
goto fail;

and add that fail label at the end so that all the error handling path
is out of the way.

> +
> + if (GHCB_SNP_PAGE_STATE_RESP_VAL(val) != 0) {

s/!= 0//

> + pr_err("Failed to change page state to '%s' paddr 0x%lx 
> error 0x%llx\n",
> + op == SNP_PAGE_STATE_PRIVATE ? 
> "private" : "shared",
> + paddr, 
> GHCB_SNP_PAGE_STATE_RESP_VAL(val));
> +
> + /* Dump stack for the debugging purpose */
> + dump_stack();

WARN as above.

> @@ -49,6 +50,27 @@ bool sev_enabled __section(".data");
>  /* Buffer used for early in-place encryption by BSP, no locking needed */
>  static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);
>  
> +/*
> + * When SNP is active, this routine changes the page state from private to 
> shared before
> + * copying the data from the source to destination and restore after the 
> copy. This is required
> + * because

Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature

2021-04-08 Thread Borislav Petkov

On Fri, Apr 02, 2021 at 06:26:40PM +0300, Kirill A. Shutemov wrote:
> Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> 
> Host side doesn't provide the feature yet, so it is a dead code for now.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  arch/x86/include/asm/cpufeatures.h   |  1 +
>  arch/x86/include/asm/kvm_para.h  |  5 +
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kernel/kvm.c| 18 ++
>  include/uapi/linux/kvm_para.h|  3 ++-
>  5 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..5b6f23e6edc4 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -238,6 +238,7 @@
>  #define X86_FEATURE_VMW_VMMCALL  ( 8*32+19) /* "" VMware prefers 
> VMMCALL hypercall instruction */
>  #define X86_FEATURE_SEV_ES   ( 8*32+20) /* AMD Secure Encrypted 
> Virtualization - Encrypted State */
>  #define X86_FEATURE_VM_PAGE_FLUSH( 8*32+21) /* "" VM Page Flush MSR is 
> supported */
> +#define X86_FEATURE_KVM_MEM_PROTECTED( 8*32+22) /* KVM memory 
> protection extenstion */

^^
What's that feature bit for?

Also, use a spellchecker pls: "extenstion".

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 2/2] x86/sgx: Add sgx_nr_{all, free}_pages to the debugfs

2021-04-08 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 12:13:21PM +0300, Jarkko Sakkinen wrote:
> Actually I think read-only sysctl attributes would be a better idea.

I still think debugfs is the right *start* for this: you play with them,
see what makes sense and what not, tweak them, etc, and then you cast
them in stone.

Not cast them in stone and see if anyone is even interested. So pls keep
them in debugfs for now - you can always do whatever, later, when it
turns out that those are useful.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 1/2] x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()

2021-04-08 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 12:22:56PM +0300, Jarkko Sakkinen wrote:
> They are not in the "free_page_list" before sanitization process has put
> them to there. So in that way the count is also better in sync with this
> fix.

This is the bit of information I was looking for. This needs to be in
the commit message.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 2/2] x86/sgx: Add sgx_nr_{all, free}_pages to the debugfs

2021-04-08 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 11:52:40AM +0300, Jarkko Sakkinen wrote:
> I think these attributes are quite useful information to have available so
> I'd go actually doing sysfs attributes and create
> Documentation/ABI/stable/sysfs-driver-sgx to document them.

  testing/
This directory documents interfaces that are felt to be stable,
as the main development of this interface has been completed.

This sounds better for a start. From Documentation/ABI/README.

> Given that they would go then to the sysfs directory of the driver, then
> probably the legit names for the attributes ought to be:
> 
> - nr_all_epc_pages
> - nr_free_epc_pages
> 
> What do you think?

Sounds ok to me.

> PS. One useful case that I forgot to mention is that I use these to give
> idea what I gave EPC size in the BIOS. Now my EPC is set to 32 MB, and
> these report 20 MB of EPC pages. It's because other metadata (e.g. EPCM
> containing page attributes) is also stored in this area.

Just remember to put yourself in the user's shoes and think whether they
make sense to her/him.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH v2 1/2] x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()

2021-04-08 Thread Borislav Petkov

On Thu, Apr 08, 2021 at 11:48:46AM +0300, Jarkko Sakkinen wrote:
> The regression is that the sgx_nr_free_pages is also incremented by
> sgx_free_epc_pages(), and thus it ends up having double the number of
> pages available.

So when you add a new EPC section with sgx_setup_epc_section(), those
new pages in "nr_pages" are initially not going to be accounted
anywhere? Or is that sgx_nr_all_pages? And you do that in your second
patch...

But those new pages coming in *are* free pages so they should be in the
free pages count too, IMHO.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-08 Thread Borislav Petkov

On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > hit poison in the kernel copying syscall parameters from user
> > > address space is not the right semantic.
> > 
> > What does that mean exactly?
> 
> Andy said that a task could check a memory range for poison by
> doing:
> 
>   ret = write(fd, buf, size);
>   if (ret == size) {
>   memory range is all good
>   }
> 
> That doesn't work if the kernel sends a SIGBUS.
> 
> It doesn't seem a likely scenario ... but Andy is correct that
> the above ought to work.

We need to document properly what this is aiming to fix. He said
something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
when a kthread "attaches" to an address space. I'm still unclear as to
how exactly that happens - there are only a handful of kthread_use_mm()
users in the tree...

> Yes. This is for kernel reading memory belongng to "current" task.

Provided "current" is really the task to which the poison page belongs.
That kthread_use_mm() thing sounded like the wrong task gets killed. But that
needs more details.

> Same in that the page gets unmapped. Different in that there
> is no SIGBUS if the kernel did the access for the user.

What is even the actual use case with sending tasks SIGBUS on poison
consumption? KVM? Others?

Are we documenting somewhere: "if your process gets a SIGBUS and this
and that, which means your page got offlined, you should do this and
that to recover"?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 08/13] x86/sev-es: register GHCB memory when SEV-SNP is active

2021-04-08 Thread Borislav Petkov

On Wed, Mar 24, 2021 at 11:44:19AM -0500, Brijesh Singh wrote:
> @@ -88,6 +89,13 @@ struct sev_es_runtime_data {
>* is currently unsupported in SEV-ES guests.
>*/
>   unsigned long dr7;
> +
> + /*
> +  * SEV-SNP requires that the GHCB must be registered before using it.
> +  * The flag below will indicate whether the GHCB is registered, if its
> +  * not registered then sev_es_get_ghcb() will perform the registration.
> +  */
> + bool ghcb_registered;

snp_ghcb_registered

because it is SNP-specific.

>  };
>  
>  struct ghcb_state {
> @@ -196,6 +204,12 @@ static __always_inline struct ghcb 
> *sev_es_get_ghcb(struct ghcb_state *state)
>   data->ghcb_active = true;
>   }
>  
> + /* SEV-SNP guest requires that GHCB must be registered before using it. 
> */
> + if (sev_snp_active() && !data->ghcb_registered) {
> + sev_snp_register_ghcb(__pa(ghcb));
> + data->ghcb_registered = true;

This needs to be set to true in the function itself, in the success
case.

> +static inline u64 sev_es_rd_ghcb_msr(void)
> +{
> + return __rdmsr(MSR_AMD64_SEV_ES_GHCB);
> +}
> +
> +static inline void sev_es_wr_ghcb_msr(u64 val)
> +{
> + u32 low, high;
> +
> + low  = (u32)(val);
> + high = (u32)(val >> 32);
> +
> + native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
> +}

Those copies will go away once you create the common sev.c

> +
> +/* Provides sev_es_terminate() */
> +#include "sev-common-shared.c"
> +
> +void sev_snp_register_ghcb(unsigned long paddr)
> +{
> + u64 pfn = paddr >> PAGE_SHIFT;
> + u64 old, val;
> +
> + /* save the old GHCB MSR */
> + old = sev_es_rd_ghcb_msr();
> +
> + /* Issue VMGEXIT */
> + sev_es_wr_ghcb_msr(GHCB_REGISTER_GPA_REQ_VAL(pfn));
> + VMGEXIT();
> +
> + val = sev_es_rd_ghcb_msr();
> +
> + /* If the response GPA is not ours then abort the guest */
> + if ((GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_REGISTER_GPA_RESP) ||
> + (GHCB_REGISTER_GPA_RESP_VAL(val) != pfn))
> + sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);
> +
> + /* Restore the GHCB MSR value */
> + sev_es_wr_ghcb_msr(old);
> +}

This is an almost identical copy of the version in compressed/. Move to
the shared file?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [RFC Part1 PATCH 07/13] x86/compressed: register GHCB memory when SNP is active

2021-04-08 Thread Borislav Petkov

On Wed, Apr 07, 2021 at 12:34:59PM -0500, Brijesh Singh wrote:
> The feature is part of the GHCB version 2 and is enforced by the
> hypervisor. I guess it can be extended for the ES. Since this feature
> was not available in GHCB version 1 (base ES) so it should be presented
> as an optional for the ES ?

Yeah, it probably is not worth the effort. If an attacker controls the
guest kernel, then it can re-register a new GHCB so it doesn't really
matter.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 23602 matches

Mail list logo