Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-12-02 Thread Andy Lutomirski

On 11/12/21 16:50, Ira Weiny wrote:

On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:

From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]  However, Andy
came up with a way to hide additional values on the stack which could be
accessed as "extended_pt_regs".[3]


Andy,

I'm preparing to send V8 of this PKS work.  But I have not seen any feedback
since I originally implemented this in V4[1].

Does this meet your expectations?  Are there any issues you can see with this
code?


I think I'm generally okay with the approach to allocating space.  All 
of Thomas' comments still apply, though.  (Sorry, I'm horribly behind.)
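
For illustration, the "extended_pt_regs" idea amounts to reserving a few extra
words on the entry stack directly below pt_regs, so C code can reach the saved
value from a pt_regs pointer without changing pt_regs itself.  A minimal sketch,
with invented field and helper names (not the ones in the series):

/*
 * Sketch only -- not the layout from the PKS series.  The entry code
 * pushes the extra slot immediately below pt_regs on the exception stack.
 */
struct extended_pt_regs {
	unsigned long saved_pkrs;	/* PKRS to restore on exception exit */
	struct pt_regs regs;		/* the normal pt_regs follows */
};

static inline struct extended_pt_regs *extended_regs(struct pt_regs *regs)
{
	return container_of(regs, struct extended_pt_regs, regs);
}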




Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Andy Lutomirski



> On Apr 19, 2021, at 11:33 AM, Dave Hansen  wrote:
> 
> On 4/19/21 11:10 AM, Andy Lutomirski wrote:
>> I’m confused by this scenario. This should only affect physical pages
>> that are in the 2M area that contains guest memory. But, if we have a
>> 2M direct map PMD entry that contains kernel data and guest private
>> memory, we’re already in a situation in which the kernel touching
>> that memory would machine check, right?
> 
> Not machine check, but page fault.  Do machine checks even play a
> special role in SEV-SNP?  I thought that was only TDX?

Brain fart.

> 
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

We are certainly toast if this hits the stack.  Or if it hits a page table or 
the GDT or IDT :). The latter delightful choices would be triple faults.

I sure hope the code we use to split a mapping is properly NMI safe.

> 
>> ISTM we should fully unmap any guest private page from the kernel and
>> all host user pagetables before actually making it be a guest private
>> page.
> 
> Yes, that sounds attractive.  Then, we'd actually know if the host
> kernel was doing stray reads somehow because we'd get a fault there too.




Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Andy Lutomirski



> On Apr 19, 2021, at 10:58 AM, Dave Hansen  wrote:
> 
> On 4/19/21 10:46 AM, Brijesh Singh wrote:
> >> - guest wants to make gpa 0x1000 a shared page. To support this, we
>> need to psmash the large RMP entry into 512 4K entries. The psmash
>> instruction breaks the large RMP entry into 512 4K entries without
> >> affecting the previous validation. Now we need to force the host to
> >> use the 4K page level instead of 2MB.
>> 
> >> To my understanding, the Linux kernel fault handler does not build the page
> >> tables on demand for kernel addresses. All kernel addresses are
> >> pre-mapped at boot. Currently, I am proactively splitting the physmap
> >> to avoid running into a situation where the x86 page level is greater than the
>> RMP page level.
> 
> In other words, if the host maps guest memory with 2M mappings, the
> guest can induce page faults in the host.  The only way the host can
> avoid this is to map everything with 4k mappings.
> 
> If the host does not avoid this, it could end up in the situation where
> it gets page faults on access to kernel data structures.  Imagine if a
> kernel stack page ended up in the same 2M mapping as a guest page.  I
> *think* the next write to the kernel stack would end up double-faulting.

I’m confused by this scenario. This should only affect physical pages that are 
in the 2M area that contains guest memory. But, if we have a 2M direct map PMD 
entry that contains kernel data and guest private memory, we’re already in a 
situation in which the kernel touching that memory would machine check, right?

ISTM we should fully unmap any guest private page from the kernel and all host 
user pagetables before actually making it be a guest private page.
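
As an aside, the kernel already has primitives that can express "drop this page
from the kernel direct map"; a rough sketch of what "fully unmap before
conversion" could build on (illustrative only, not code from this series):

/*
 * Illustrative sketch: remove a page from the kernel direct map before
 * the RMP entry is flipped to guest-private.  Removing it from host
 * userspace page tables would be a separate step (e.g. unmapping the
 * backing mapping).
 */
static int sketch_make_page_guest_private(struct page *page)
{
	unsigned long vaddr = (unsigned long)page_address(page);
	int ret;

	/* Splits the 2M direct-map entry if necessary and unmaps this 4K page. */
	ret = set_direct_map_invalid_noflush(page);
	if (ret)
		return ret;

	flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);

	/* ... now update the RMP entry to mark the page guest-owned ... */
	return 0;
}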

Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-19 Thread Andy Lutomirski


> On Apr 19, 2021, at 8:26 AM, David Laight  wrote:
> 
> From: Andy Lutomirski
>> Sent: 18 April 2021 01:12
> ..
>> Slightly more complicated:
>> 
>> struct opaque_symbol;
>> extern struct opaque_symbol entry_SYSCALL_64;
>> 
>> The opaque_symbol variant avoids any possible confusion over the weird
>> status of arrays in C, and it's hard to misuse, since struct
>> opaque_symbol is an incomplete type.
> 
> Maybe:
> 
> s/opaque_symbol/entry_SYSCALL_64/
> 

Cute. OTOH, I’m not sure whether that has much benefit, and having a single 
type for all of this allows it to be declared just once.  I suppose the magic 
could be wrapped in a macro, though.
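
Such a macro could be as small as the following sketch (names invented here,
not proposed verbatim in the thread):

/* Sketch only: one shared incomplete type for all asm-only symbols. */
struct opaque_symbol;
#define DECLARE_OPAQUE_SYMBOL(sym) extern struct opaque_symbol sym

DECLARE_OPAQUE_SYMBOL(entry_SYSCALL_64);

static void sketch_setup_syscall_msr(void)
{
	/* Taking the address works; calling or dereferencing cannot compile. */
	wrmsrl(MSR_LSTAR, (unsigned long)&entry_SYSCALL_64);
}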

>David
> 


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-18 Thread Andy Lutomirski
On Sun, Apr 18, 2021 at 9:17 AM Thomas Gleixner  wrote:
>
> On Sat, Apr 17 2021 at 17:11, Andy Lutomirski wrote:
> > On Sat, Apr 17, 2021 at 4:53 PM Thomas Gleixner  wrote:
> >> which works for
> >>
> >>   foo = function_nocfi(bar);
> >
> > I agree in general.  But right now, we have, in asm/proto.h:
> >
> > void entry_SYSCALL_64(void);
> >
> > and that's pure nonsense.  Depending on your point of view,
> > entry_SYSCALL_64 is a symbol that resolves to an integer or it's an
> > array of bytes containing instructions, but it is most definitely not
> > a function void (void).  So, regardless of any CFI stuff, I propose
> > that we standardize our handling of prototypes of symbols that are
> > opaque to the C compiler.  Here are a couple of choices:
> >
> > Easy one:
> >
> > extern u8 entry_SYSCALL_64[];
> >
> > Slightly more complicated:
> >
> > struct opaque_symbol;
> > extern struct opaque_symbol entry_SYSCALL_64;
> >
> > The opaque_symbol variant avoids any possible confusion over the weird
> > status of arrays in C, and it's hard to misuse, since struct
> > opaque_symbol is an incomplete type.
>
> Makes sense.

Sami, do you want to do this as part of your series or should I write a patch?


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-17 Thread Andy Lutomirski
On Sat, Apr 17, 2021 at 4:53 PM Thomas Gleixner  wrote:
>
> On Sat, Apr 17 2021 at 16:19, Andy Lutomirski wrote:
> > On Fri, Apr 16, 2021 at 4:40 PM Kees Cook  wrote:
> >> Okay, you're saying you want __builtin_gimme_body_p() to be a constant
> >> expression for the compiler, not inline asm?
> >
> > Yes.
> >
> > I admit that, in the trivial case where the asm code is *not* a
> > C-ABI-compliant function, giving a type that doesn't fool the compiler
> > into thinking that it might be is probably the best fix.  Maybe we
> > should standardize something, e.g.:
> >
> > struct raw_symbol;  /* not defined anywhere */
> > #define DECLARE_RAW_SYMBOL(x) struct raw_symbol x[]
> >
> > and then we write this:
> >
> > DECLARE_RAW_SYMBOL(entry_SYSCALL_64);
> >
> > wrmsrl(..., (unsigned long)entry_SYSCALL_64);
> >
> > It would be a bit nifty if we didn't need a forward declaration, but
> > I'm not immediately seeing a way to do this without hacks that we'll
> > probably regret;
> >
> > But this doesn't help the case in which the symbol is an actual
> > C-callable function and we want to be able to call it, too.
>
> The right way to solve this is that the compiler provides a builtin
>
>  function_nocfi() +/- the naming bikeshed
>
> which works for
>
>   foo = function_nocfi(bar);

I agree in general.  But right now, we have, in asm/proto.h:

void entry_SYSCALL_64(void);

and that's pure nonsense.  Depending on your point of view,
entry_SYSCALL_64 is a symbol that resolves to an integer or it's an
array of bytes containing instructions, but it is most definitely not
a function void (void).  So, regardless of any CFI stuff, I propose
that we standardize our handling of prototypes of symbols that are
opaque to the C compiler.  Here are a couple of choices:

Easy one:

extern u8 entry_SYSCALL_64[];

Slightly more complicated:

struct opaque_symbol;
extern struct opaque_symbol entry_SYSCALL_64;

The opaque_symbol variant avoids any possible confusion over the weird
status of arrays in C, and it's hard to misuse, since struct
opaque_symbol is an incomplete type.

--Andy


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-17 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 4:40 PM Kees Cook  wrote:
>

> > 1. I defined a function in asm.  I want to tell clang that this
> > function is defined in asm, and for clang to behave accordingly:
> >
> > .globl func
> > func:
> >  ; do stuff
> >
> > later:
> >
> > extern void func(void) [something here];
> >
> > There really should be a way to write this correctly such that it
> > works regardless of the setting of
> > -fsanitize-cfi-canonical-jump-tables.  This should not bypass CFI.  It
> > should *work*, with CFI enforced.  If I read all the various things
> > you linked correctly, this would be something like __cfi_noncanonical,
> > and I reserve the right to think that this is a horrible name.
>
> Yes, I find the name confusing too. Without noncanonical, we'd need
> C call wrappers for every single .S function that had its address
> taken. This is very common in crypto, for example. That level of extra
> code seemed like a total non-starter. Instead, we just get a few places
> we have to mark.

The patch you linked doesn't have a noncanonical attribute, though.
So I'm not sure how to reliably call into asm from C.

(The more I think about it, the less I like "canonical".  What is
"canonical"?  The symbol?  The function body?  Something else?)

>
> > 2. I need a raw function pointer, thank you very much.  I would like
> > to be honest about it, and I don't really want to bypass CFI, but I
> > need the actual bits in the actual symbol.
> >
> > translation unit 1 defines func.  Maybe it's C with
> > -fsanitize-cfi-canonical-jump-tables, maybe it's C with
> > -fno-sanitize-cfi-canonical-jump-tables or however it's spelled, and
> > maybe it's plain asm.  Now translation unit 2 does:
> >
> > 2a. Uses a literal symbol, because it's going to modify function text
> > or poke an MSR or whatever:
> >
> > wrmsrl(MSR_WHATEVER, func);
> >
> > clang needs to give us *some* way to have a correct declaration of
> > func such that we can, without resorting to inline asm kludges, get
> > the actual bit pattern of the actual symbol.
>
> We don't want version of a global symbol alias of func that points to
> the function body, though; it's only very specific cases where this
> should be stripped (MSR, ftrace, etc).
>
> So, if there were some Clang-specific syntax for this, it would still be
> used on a case-by-case basis. It would still be something like:
>
> wrmsrl(MSR_WAT, __builtin_gimme_body_p(func));
>

> Okay, you're saying you want __builtin_gimme_body_p() to be a constant
> expression for the compiler, not inline asm?

Yes.

I admit that, in the trivial case where the asm code is *not* a
C-ABI-compliant function, giving a type that doesn't fool the compiler
into thinking that it might be is probably the best fix.  Maybe we
should standardize something, e.g.:

struct raw_symbol;  /* not defined anywhere */
#define DECLARE_RAW_SYMBOL(x) struct raw_symbol x[]

and then we write this:

DECLARE_RAW_SYMBOL(entry_SYSCALL_64);

wrmsrl(..., (unsigned long)entry_SYSCALL_64);

It would be a bit nifty if we didn't need a forward declaration, but
I'm not immediately seeing a way to do this without hacks that we'll
probably regret;

But this doesn't help the case in which the symbol is an actual
C-callable function and we want to be able to call it, too.


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-17 Thread Andy Lutomirski



> On Apr 17, 2021, at 7:20 AM, David Laight  wrote:
> 
> From: Kees Cook
>> Sent: 16 April 2021 23:28
>> 
>>> On Fri, Apr 16, 2021 at 03:06:17PM -0700, Andy Lutomirski wrote:
>>> On Fri, Apr 16, 2021 at 3:03 PM Borislav Petkov  wrote:
>>>> 
>>>> On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
>>>>> __nocfi only disables CFI checking in a function, the compiler still
>>>>> changes function addresses to point to the CFI jump table, which is
>>>>> why we need function_nocfi().
>>>> 
>>>> So call it __func_addr() or get_function_addr() or so, so that at least
>>>> it is clear what this does.
>>>> 
>>> 
>>> This seems backwards to me.  If I do:
>>> 
>>> extern void foo(some signature);
>>> 
>>> then I would, perhaps naively, expect foo to be the actual symbol that
>>> gets called
>> 
>> Yes.
>> 
>>> and for the ABI to be changed to do the CFI checks.
>> 
>> Uh, no? There's no ABI change -- indirect calls are changed to do the
>> checking.
>> 
>>> The
>>> foo symbol would point to whatever magic is needed.
>> 
>> No, the symbol points to the jump table entry. Direct calls get minimal
>> overhead and indirect calls can add the "is this function in the right
>> table" checking.
> 
> 
> Isn't this a bit like one of the PPC? ABI where function addresses
> aren't (always) the entry point.
> IIRC it causes all sorts of obscure grief.
> 
> I'd also like to know how indirect calls are actually expected to work.
> The whole idea is that the potential targets might be in a kernel module
> that is loaded at run time.
> 
> Scanning a list of possible targets has to be a bad design decision.
> 
> If you are trying to check that the called function has the right
> prototype, stashing a hash of the prototype (or a known random number)
> before the entry point would give most of the benefits without most
> of the costs.
> The linker could be taught to do the same test (much like name mangling
> but without the crap user experience).
> 
> That scheme only has the downside of a data cache miss and (hopefully)
> statically predicted correct branch (hmm - except you don't want to
> speculatively execute the wrong function) and polluting the data cache
> with code.

I admit I was quite surprised by the actual CFI implementation. I would have 
expected a CFI’d function pointer to actually point to a little descriptor that 
contains a hash of the function’s type.  The whole bit vector thing seems quite 
inefficient.

> 
> This all seems like a ploy to force people to buy faster cpus.
> 
>David
> 


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 3:28 PM Kees Cook  wrote:
>
> On Fri, Apr 16, 2021 at 03:06:17PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 16, 2021 at 3:03 PM Borislav Petkov  wrote:
> > >
> > > On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
> > > > __nocfi only disables CFI checking in a function, the compiler still
> > > > changes function addresses to point to the CFI jump table, which is
> > > > why we need function_nocfi().
> > >
> > > So call it __func_addr() or get_function_addr() or so, so that at least
> > > it is clear what this does.
> > >
> >
> > This seems backwards to me.  If I do:
> >
> > extern void foo(some signature);
> >
> > then I would, perhaps naively, expect foo to be the actual symbol that
> > gets called
>
> Yes.
>
> > and for the ABI to be changed to do the CFI checks.
>
> Uh, no? There's no ABI change -- indirect calls are changed to do the
> checking.

Maybe ABI is the wrong word, or maybe I'm not fully clued in.  But, if I do:

extern void call_it(void (*ptr)(void));

and I define call_it in one translation unit and call it from another,
the ABI effectively changed, right?  Because ptr is (depending on the
"canonical" mode) no longer a regular function pointer.

> > char entry_whatever[];
> > wrmsrl(..., (unsigned long)entry_whatever);
>
> This is just casting. It'll still resolve to the jump table entry.

How?  As far as clang is concerned, entry_whatever isn't a function at
all.  What jump table entry?

>
> > or, alternatively,
> >
> > extern void func() __attribute__((nocfi));
>
> __nocfi says func() should not perform checking of correct jump table
> membership for indirect calls.
>
> But we don't want a global marking for a function to be ignored by CFI;
> we don't want functions to escape CFI -- we want specific _users_ to
> either not check CFI for indirect calls (__nocfi) or we want specific
> passed addresses to avoid going through the jump table
> (function_nocfi()).

Maybe I spelled it badly, and maybe I requested the wrong thing.
Here are actual required use cases.

1. I defined a function in asm.  I want to tell clang that this
function is defined in asm, and for clang to behave accordingly:

.globl func
func:
 ; do stuff

later:

extern void func(void) [something here];

There really should be a way to write this correctly such that it
works regardless of the setting of
-fsanitize-cfi-canonical-jump-tables.  This should not bypass CFI.  It
should *work*, with CFI enforced.  If I read all the various things
you linked correctly, this would be something like __cfi_noncanonical,
and I reserve the right to think that this is a horrible name.

2. I need a raw function pointer, thank you very much.  I would like
to be honest about it, and I don't really want to bypass CFI, but I
need the actual bits in the actual symbol.

translation unit 1 defines func.  Maybe it's C with
-fsanitize-cfi-canonical-jump-tables, maybe it's C with
-fno-sanitize-cfi-canonical-jump-tables or however it's spelled, and
maybe it's plain asm.  Now translation unit 2 does:

2a. Uses a literal symbol, because it's going to modify function text
or poke an MSR or whatever:

wrmsrl(MSR_WHATEVER, func);

clang needs to give us *some* way to have a correct declaration of
func such that we can, without resorting to inline asm kludges, get
the actual bit pattern of the actual symbol.

2b. Maybe optional: convert from function pointer to bit pattern of
actual symbol.

If someone gives me a real, correctly typed C pointer representing a
function pointer, I want a way to find the address of the body of the
function.  Then we can use it for things that aren't *calling* it per
se, e.g. disassembling it.  This is not necessarily a fully formed
thought right now, but I think something along these lines might be
needed.

The reverse of 2b (converting from actual symbol to function pointer)
might be equivalent to 1, or it might not.  I suppose there are some
subtleties here.

Be that as it may, it sounds like right now clang has some issues
interoperating with asm when CFI is enabled.  If so, clang needs to be
improved.

(The unsigned long hack is not necessarily good enough.  I should be able to do:

.global func
func:
 ; C ABI compliant code here

extern void func(void) [attribute as in 1]

unsigned long actual_address = [something clang actually understands](func);

If this thing refuses to work when fed a nonconstant function pointer
because of some genuinely good reason, fine.  But, if 'func' is an
actual literal symbol name, this thing needs to be a compile-time
constant expression.)

>
> So, instead of a cast, a wrapper is used to bypass instrumentation in
> the very few cases its needed.

NAK.  The compiler needs to cooperate IMO.


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 3:14 PM Borislav Petkov  wrote:
>
> On Fri, Apr 16, 2021 at 03:06:17PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 16, 2021 at 3:03 PM Borislav Petkov  wrote:
> > >
> > > On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
> > > > __nocfi only disables CFI checking in a function, the compiler still
> > > > changes function addresses to point to the CFI jump table, which is
> > > > why we need function_nocfi().
> > >
> > > So call it __func_addr() or get_function_addr() or so, so that at least
> > > it is clear what this does.
> > >
> >
> > This seems backwards to me.  If I do:
> >
> > extern void foo(some signature);
> >
> > then I would, perhaps naively, expect foo to be the actual symbol that
>
> I'm just reading the patch:
>
> ... The function_nocfi macro always returns the address of the
> + * actual function instead.
> + */
> +#define function_nocfi(x) ({   \
> +   void *addr; \
> +   asm("leaq " __stringify(x) "(%%rip), %0\n\t" : "=r" (addr));\
> +   addr;
>
> so it does a rip-relative load into a reg which ends up with the function
> address.

This is horrible.

We made a mistake adapting the kernel to GCC's nonsensical stack
protector ABI, especially on 32-bit, instead of making GCC fix it.
Let's not repeat this with clang please.

Sami, I'm assuming that:

extern void func(void);

results in anything that takes a pointer to func getting a pointer to
some special magic descriptor instead of to func, so that:

void (*ptr)(void);
ptr = func;
ptr();

does the right thing.  Then void (*)(void) is no longer a raw pointer.  Fine.

But obviously there is code that needs real function pointers.  How
about making this a first-class feature, or at least hacking around it
more cleanly.  For example, what does this do:

char entry_whatever[];
wrmsrl(..., (unsigned long)entry_whatever);

or, alternatively,

extern void func() __attribute__((nocfi));

void (*ptr)(void);
ptr = func;  /* probably fails to compile -- invalid conversion */

(unsigned long)func /* returns the actual pointer */

func();  /* works like normal */

And maybe allow this too:

void (*ptr)(void) __attribute__((nocfi));
ptr = func;
ptr();  /* emits an unchecked call.  maybe warns, too.  anyone who
does this needs to be extremely careful. */

--Andy


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 3:11 PM Len Brown  wrote:
>
> > I get it.  That does not explain why LDMXCSR and VLDMXCSR cause
> > pipeline stalls.
>
> Sorry, I thought this thread was about AMX.
> I don't know the answer to your LDMXCSR and VLDMXCSR question.

My point is that every single major math extension since the original
XMM extensions (SSE, etc) has come with performance gotchas.  Given
Intel's general unwillingness to document the gotchas in hardware that
is actually shipping, I'm sceptical that AMX is as delightfully
gotcha-free as you are making it out to be.

Is there any authoritative guidance at all on what actually happens,
performance-wise, when someone does AMX math?


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 3:03 PM Borislav Petkov  wrote:
>
> On Fri, Apr 16, 2021 at 02:49:23PM -0700, Sami Tolvanen wrote:
> > __nocfi only disables CFI checking in a function, the compiler still
> > changes function addresses to point to the CFI jump table, which is
> > why we need function_nocfi().
>
> So call it __func_addr() or get_function_addr() or so, so that at least
> it is clear what this does.
>

This seems backwards to me.  If I do:

extern void foo(some signature);

then I would, perhaps naively, expect foo to be the actual symbol that
gets called and for the ABI to be changed to do the CFI checks.  The
foo symbol would point to whatever magic is needed.  I assume I'm
missing something.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Andy Lutomirski
On Fri, Apr 16, 2021 at 2:54 PM Len Brown  wrote:
>
> On Thu, Apr 15, 2021 at 12:24 PM Andy Lutomirski  wrote:
> > On Wed, Apr 14, 2021 at 2:48 PM Len Brown  wrote:
>
> > > > ... the transition penalty into and out of AMX code
>
> The concept of 'transition' exists between AVX and SSE instructions
> because it is possible to mix both instruction sets and touch different
> parts of the same registers.  The "unused" parts of those registers
> need to be tracked to assure that data is not lost when mixing.

I get it.  That does not explain why LDMXCSR and VLDMXCSR cause
pipeline stalls.

>
> This concept is moot with AMX, which has its own dedicated registers.
>
> > What is the actual impact of a trivial function that initializes the
> > tile config, does one tiny math op, and then does TILERELEASE?


 "does one tiny math op"


AVX-512 *also* has sort-of-dedicated registers: ZMM16 and up.  I still
can't find any conclusive evidence as to whether that avoids the
performance hit.

Intel's track record at actually explaining what operations cause what
particular performance disasters is poor, and your explanation is not
helping the situation.  Sorry.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-15 Thread Andy Lutomirski


> On Apr 15, 2021, at 10:00 AM, Dave Hansen  wrote:
> 
> On 4/15/21 9:24 AM, Andy Lutomirski wrote:
>> In the patches, *as submitted*, if you trip the XFD #NM *once* and you
>> are the only thread on the system to do so, you will eat the cost of a
>> WRMSR on every subsequent context switch.
> 
> I think you're saying: If a thread trips XFD #NM *once*, every switch to
> and from *that* thread will incur the WRMSR cost.

Indeed.  My sentence was missing a few words at the end.

> 
> The first time I read this, I thought you were saying that all threads
> would incur a WRMSR cost on every context switch.  If that's the case, I
> grossly misread the patches. :)


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-15 Thread Andy Lutomirski
On Wed, Apr 14, 2021 at 2:48 PM Len Brown  wrote:
>

>
> > Then I take the transition penalty into and out of AMX code (I'll
> > believe there is no penalty when I see it -- we've had a penalty with
> > VEX and with AVX-512) and my program runs *slower*.
>
> If you have a clear definition of what "transition penalty" is, please share 
> it.

Given the generally awful state of Intel's documentation about these
issues, it's quite hard to tell for real.  But here are some examples.

VEX: Figures 11-1 ("AVX-SSE Transitions in the Broadwell, and Prior
Generation Microarchitectures") and 11-2 ("AVX-SSE Transitions in the
Skylake Microarchitecture").  We *still* have a performance regression
in the upstream kernel because, despite all common sense, the CPUs
consider LDMXCSR to be an SSE instruction and VLDMXCSR to be an AVX
instruction despite the fact that neither one of them touch the XMM or
YMM state at all.

AVX-512:

https://lore.kernel.org/linux-crypto/calcetru06cuvuf5ndsm8--dy3dokxyq88cgwaakoque4vkz...@mail.gmail.com/

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html





>
> Lacking one, I'll assume you are referring to the
> impact on turbo frequency of using AMX hardware?
>
> Again...
>
> On the hardware that supports AMX, there is zero impact on frequency
> due to the presence of AMX state, whether modified or unmodified.
>
> We resolved on another thread that Linux will never allow entry
> into idle with modified AMX state, and so AMX will have zero impact
> on the ability of the process to enter deep power-saving C-states.
>
> It is true that AMX activity is considered when determining max turbo.
> (as it must be)
> However, the *release* of the turbo credits consumed by AMX is
> "several orders of magnitude" faster on this generation
> than it was for AVX-512 on pre-AMX hardware.

What is the actual impact of a trivial function that initializes the
tile config, does one tiny math op, and then does TILERELEASE?

> Yes, the proposal, and the working patch set on the list, context
> switches XFD -- which is exactly what that hardware was designed to do.
> If the old and new tasks have the same value of XFD, the MSR write is skipped.
>
> I'm not aware of any serious proposal to context-switch XCR0,
> as it would break the current programming model, where XCR0
> advertises what the OS supports.  It would also impact performance,
> as every write to XCR0 necessarily provokes a VMEXIT.

You're arguing against a nonsensical straw man.

In the patches, *as submitted*, if you trip the XFD #NM *once* and you
are the only thread on the system to do so, you will eat the cost of a
WRMSR on every subsequent context switch.  This is not free.  If we
use XCR0 (I'm not saying we will -- I'm just mentioning it as a
possibility), then the penalty is presumably worse due to the VMX
issue.
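
For concreteness, the switch-time behavior being described boils down to
something like the sketch below (the field name is hypothetical; the series
keeps the per-task value in its fpu state):

/*
 * Sketch only: skip the expensive WRMSR when the outgoing and incoming
 * tasks want the same XFD value.  A task that has tripped the XFD #NM
 * once carries a different XFD value from then on, so every switch to
 * or from it pays for the write.
 */
static void switch_xfd(struct task_struct *prev, struct task_struct *next)
{
	u64 prev_xfd = prev->thread.fpu.xfd;	/* hypothetical field */
	u64 next_xfd = next->thread.fpu.xfd;

	if (prev_xfd != next_xfd)
		wrmsrl(MSR_IA32_XFD, next_xfd);
}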


Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-13 Thread Andy Lutomirski
On Tue, Apr 13, 2021 at 12:05 PM  wrote:
>
> From: Kan Liang 
>
> The counter value of a perf task may leak to another RDPMC task.
> For example, a perf stat task as below is running on CPU 0.
>
> perf stat -e 'branches,cycles' -- taskset -c 0 ./workload

I assume this doesn't fix the leak if the sensitive counter is systemwide?

Could Intel please add proper security and ideally virtualization for
this?  Ideally RDPMC permission would be a bitmask for all RDPMC-able
counters, not just a single on/off switch.

-Andy


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-13 Thread Andy Lutomirski
On Tue, Apr 13, 2021 at 3:47 PM Len Brown  wrote:
>
> On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski  wrote:
> >
> > On Mon, Apr 12, 2021 at 4:46 PM Len Brown  wrote:
> > >
> > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski  wrote:
> > >
> > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a
> > > > microbenchmark.  Do it once and you permanently allocate 8kB (is that
> > > > even a constant?  can it grow in newer parts?), potentially hurts all
> > > > future context switches, and does who-knows-what to Turbo licenses and
> > > > such.
> > >
> > > Intel expects that AMX will be extremely valuable to key workloads.
> > > It is true that you may never run that kind of workload on the machine
> > > in front of you,
> > > and so you have every right to be doubtful about the value of AMX.
> >
> > I fully believe that AMX will be amazing when used for the right
> > workload.  The problem is that a library may have no way to tell
> > whether a workload is the type of computationally intensive workload
> > for which it makes sense.  Imagine you have a little function:
> >
> > int matrix_times_vector(int dim, float *out, const float *matrix,
> > const float *vector);
> >
> > A clever library might use AMX for this.  If dim == 4 and the caller
> > is planning to call it in a long, tight loop, maybe this even makes
> > sense.  If dim == 4 and it's being called once, AMX is probably a
> > losing proposition.  With previous technologies, at least the impact
> > was limited to the function itself and maybe once per call to the
> > caller.  But now, with AMX, the program that invoked this takes a
> > performance and memory hit *forever* if it uses AMX once.
>
> Again...
>
> As this is a "clever" library, built with a clever toolchain, the result is
> that TILERELEASE was properly issued at the end of the computation.
> Thus the hardware knows that the  (volatile) AMX registers are no longer live.

My argument has *nothing* to do with TILERELEASE.  Let me try again.

Suppose I write some user code and call into a library that uses AMX
because the library authors benchmarked it and determined that using
AMX is faster when called in a loop.  But I don't call it in a loop.
Then I take the transition penalty into and out of AMX code (I'll
believe there is no penalty when I see it -- we've had a penalty with
VEX and with AVX-512) and my program runs *slower*.  And, to top it
off, I've just permanently allocated 8kB of extra FPU state buffer,
*and* I'm taking either an XCR0 or an XFD write penalty on every
future context switch.

Someone or something needs to make a decision as to whether AMX should
actually be used for a given algorithm.  The user library community
has swept this under the rug by declaring that libraries should use
the best-in-a-tight-loop code for the entire existence of extensions
beyond XMM, and the cost keeps getting higher.

> > Beyond that, we have the signal handling issue.
>
> I'm unaware of any unresolved feedback on the signal handling series
> other than a wistful "wouldn't a new SIGFAIL be more clear (for future apps)
> than the existing SIGSEGV?"  I agree with this sentiment, but I don't
> think we should hold up a patch to prevent corrupting user data
> because a new signal number to describe the scenario doesn't exist.
> Particularly since the new code that knows about the new SIGFAIL
> will also be new code that has been compiled with the new glibc
> that for most cases will prevent this scenario in the first place...
>
> > One solution, going
> > off of what Willy mentioned, is:
> >
> > bool amx_begin(void *signal_save_buffer);
> > void amx_end();
> >
> > In the amx_begin() region, if you get a signal, the AMX state is saved
> > in the buffer.  Outside the region, if you get a signal and AMX is in
> > use, the kernel will either unceremoniously kill the task or will
> > deliver SIGYOUBLEWIT. [0]
>
> I think it is clear that if a new signal ABI is going to be invented,
> that it should be opt-in on state, so that it can run fast on machines
> far into the future by not choosing to opt-in on anything.
>
> It isn't clear that changing the signal save state around critical regions
> (in multiple threads) so that a single (per process definition) of a signal
> handler gets a different result at different times is going to make that
> (new) signal handler author especially happy.  More likely they
> either always want the state, or they do not.

Perhaps some form of decision should be reached before AMX lands?
Landing AMX in its current form is a decision, and we should make a
credible effort to decide if it's the right one.

--Andy


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-13 Thread Andy Lutomirski
On Mon, Apr 12, 2021 at 4:46 PM Len Brown  wrote:
>
> On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski  wrote:
>
> > AMX: Multiplying a 4x4 matrix probably looks *great* in a
> > microbenchmark.  Do it once and you permanently allocate 8kB (is that
> > even a constant?  can it grow in newer parts?), potentially hurts all
> > future context switches, and does who-knows-what to Turbo licenses and
> > such.
>
> Intel expects that AMX will be extremely valuable to key workloads.
> It is true that you may never run that kind of workload on the machine
> in front of you,
> and so you have every right to be doubtful about the value of AMX.

I fully believe that AMX will be amazing when used for the right
workload.  The problem is that a library may have no way to tell
whether a workload is the type of computationally intensive workload
for which it makes sense.  Imagine you have a little function:

int matrix_times_vector(int dim, float *out, const float *matrix,
const float *vector);

A clever library might use AMX for this.  If dim == 4 and the caller
is planning to call it in a long, tight loop, maybe this even makes
sense.  If dim == 4 and it's being called once, AMX is probably a
losing proposition.  With previous technologies, at least the impact
was limited to the function itself and maybe once per call to the
caller.  But now, with AMX, the program that invoked this takes a
performance and memory hit *forever* if it uses AMX once.

Beyond that, we have the signal handling issue.  One solution, going
off of what Willy mentioned, is:

bool amx_begin(void *signal_save_buffer);
void amx_end();

In the amx_begin() region, if you get a signal, the AMX state is saved
in the buffer.  Outside the region, if you get a signal and AMX is in
use, the kernel will either unceremoniously kill the task or will
deliver SIGYOUBLEWIT. [0]

I'm really hoping some userspace people can chime in.


[0] We really ought to have a SIGSIGNALFAILURE or something for the
case where normal signal delivery fails.  This is the userspace
equivalent of #DF.  SIGYOUBLEWIT could be folded in.  There would be a
flag in the signal frame saying "don't even try to sigreturn".
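
To make the amx_begin()/amx_end() idea concrete, library usage might look
roughly like the following; everything below is hypothetical API sketched from
the description above, nothing here exists today:

#include <stdbool.h>

/* Hypothetical API from the proposal above. */
bool amx_begin(void *signal_save_buffer);
void amx_end(void);

/* Hypothetical non-AMX fallback provided by the library. */
void fallback_matmul(float *out, const float *a, const float *b, int n);

void library_matmul(float *out, const float *a, const float *b, int n)
{
	static __thread char amx_sigsave[8192];	/* size is a guess */

	if (!amx_begin(amx_sigsave)) {
		fallback_matmul(out, a, b, n);
		return;
	}

	/* ... LDTILECFG / TMUL work goes here ... */

	amx_end();
}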


Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread Andy Lutomirski
On Mon, Apr 12, 2021 at 2:52 AM Baoquan He  wrote:
>
> On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >
> >
> > > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > >
> > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > >> Why don't we do this unconditionally? At the very best we gain half a 
> > >> megabyte of memory (except the trampoline, which has to live there, but 
> > >> it is only a few kilobytes.)
> > >
> > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > make code simpler. Then the specific caring of real mode in
> > > efi_free_boot_services() can be removed too.
> > >
> >
> > This whole situation makes me think that the code is buggy before and buggy 
> > after.
> >
> > The issue here (I think) is that various pieces of code want to reserve 
> > specific pieces of otherwise-available low memory for their own nefarious 
> > uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
> > too much.
>
> The kdump kernel also needs to go through the real-mode code path during bootup. It
> is no different from the normal kernel except that it skips the firmware
> resetting. So the kdump kernel needs the low 1M as system RAM just as the normal
> kernel does. Here we reserve the whole low 1M with memblock_reserve()
> to avoid any later kernel or driver data residing in this area. Otherwise,
> we would need to dump the content of this area to vmcore. As we know, when a crash
> happens, the old memory of the 1st kernel should stay untouched until vmcore
> dumping has read out its content. Meanwhile, the kdump kernel needs to reuse the low 1M.
> In the past, we used a back up region to copy out the low 1M area, and
> map the back up region into the low 1M area in vmcore elf file. In
> 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> option is specified"), we changed to lock the whole low 1M to avoid
> writing any kernel data into it; this way we can skip this area when
> dumping vmcore.
>
> Above is why we try to memblock reserve the whole low 1M. We don't want
> to use it; we just don't want anyone to use it in the 1st kernel.
>
> >
> > I propose that the right solution is to give low-memory-reserving code 
> > paths two chances to do what they need: once at the very beginning and once 
> > after EFI boot services are freed.
> >
> > Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
> > then release it right after releasing boot services, and then invoke the 
> > special cases exactly once.
>
> I am not sure if I got both suggested ways clearly. They look a little
> complicated in our case. As I explained above, we want the whole low
> 1M locked up, not one piece or some pieces of it.

My second suggestion is probably the better one.  Here it is, concretely:

The early (pre-free_efi_boot_services) code just reserves all
available sub-1M memory unconditionally, but it specially marks it as
reserved-but-available-later.  We stop allocating the trampoline page
at this stage.

In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
stick it in the pile of reserved memory created in the early step.
This may involve splitting a block, kind of like how the current
trampoline late allocation works.

Then, *after* free_efi_boot_services(), we run a single block of code
that lets everything that wants sub-1M code claim some.  This means
that the trampoline gets allocated and, if crashkernel wants to claim
everything else, it can.  After that, everything still unclaimed gets
freed.

Does that make sense?

--Andy
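
Spelled out as pseudo-kernel code, the proposed ordering is roughly the
following; the helper names are invented for the sketch, and
efi_free_boot_services() is simply the existing function being modified:

/* Sketch of the proposed ordering; helper names are invented. */

void __init early_reserve_low_memory(void)
{
	/* Step 1, early boot: park all available sub-1M RAM so nothing
	 * else can allocate from it. */
	reserve_all_free_sub_1M_ram();
}

void __init efi_free_boot_services(void)
{
	/* Step 2: boot-services memory below 1M joins the parked pile
	 * instead of being freed (may require splitting a block). */
	move_sub_1M_boot_services_to_parked_pile();
}

void __init late_claim_low_memory(void)
{
	/* Step 3: runs exactly once, after efi_free_boot_services(). */
	allocate_real_mode_trampoline();
	if (crashkernel_wants_low_1M())
		claim_remaining_sub_1M_for_crashkernel();
	else
		free_unclaimed_sub_1M_ram();
}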


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Andy Lutomirski
On Mon, Apr 12, 2021 at 7:19 AM Florian Weimer  wrote:
>
> * Andy Lutomirski:
>
> Maybe we could have done this in 2016 when I reported this for the first
> time.  Now it is too late, as more and more software is using
> CPUID-based detection for AVX-512.  Our users have been using AVX-512
> hardware for quite some time now, and I haven't seen *that* many issues
> resulting from the context size.  That isn't to say that problems do not
> exist, but they are more of the kind that the increased stack usage
> means that areas of the stack that used to be zero no longer are, so
> users encounter different side effects from uninitialized-variable bugs.
>
> How much software depends on the signal handler data layout?  The %zmm
> state does not appear to be exposed today, so perhaps some savings could
> be had there.

The fact that including  is barely functional in glibc
probably helps keep software from touching the state. :)

>
> The suggestion to make CPUID trap doesn't sound workable to me.  At
> least in the past, it's been suggested as a serializing instruction to
> be used alongside RDTSC, which makes it rather time-critical for some
> applications.
>
> Even today, signal handlers do not really compose well in the sense that
> multiple libraries can use them and collaborate without being aware of
> each other (like they can divide up TLS memory with the help of the
> dynamic linker, or carve out address space using mmap).  Proposals to
> set special process-wide flags only make that situation worse.  Code
> that installs a signal handler often does not have control on which
> thread an asynchronous signal is delivered, or which code it interrupts.
> A single process-wide flag cannot capture that accurately, even if it is
> per signal number.

I think this would want to be a per-signal-handler flag, not per
process.  It's entirely possible to write a signal handler callback
that doesn't touch AVX512 or AMX state, even if the toolchain may make
it annoying.  That specific handler could set the "make me fast" flag.
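
Concretely, such a per-handler opt-in might look like the following from
userspace; the SA_* flag name is purely hypothetical:

#include <signal.h>

/* Hypothetical flag, invented here: "this handler touches no AVX-512/AMX
 * state, so the kernel may skip saving that state in the signal frame." */
#define SA_NOEXTENDEDSTATE 0x01000000

static void fast_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)info; (void)ctx;
	/* GPR-only work here; no vector/matrix state touched. */
}

static void install_fast_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction = fast_handler,
		.sa_flags = SA_SIGINFO | SA_NOEXTENDEDSTATE,
	};

	sigemptyset(&sa.sa_mask);
	sigaction(SIGUSR1, &sa, NULL);
}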

>
> The rseq extension might work conceptually, but it requires to make
> operations idempotent, with periodic checkpoint, and of course
> inline/flatten all calls.  And it requires compiler work, the present
> model based on inline asm hacks doesn't look workable.  Maybe that works
> for AMX.  I have not checked if there is yet any public documentation of
> the programming model.

I tend to think that the rseq model will be unworkable.  People trying
to use the new instructions will hate it.

>
> I think someone expressed the sentiment (maybe on another thread) that
> the current CPU feature enablement process does not work.  I do not
> agree.  Currently it is only necessary to upgrade the kernel and maybe
> glibc (but not in all cases), and then you are good to go.  You can keep
> using your old libraries, old compilers, and even old assemblers if you
> are okay with .byte hacks.  You do not need special userspace libraries,
> new compilers for different languages, special firmware or binary blobs.
> Overall, it just works.
>
> On x86, we are really bad about actually using CPU features pervasively,
> but that is a different story.
>

"Just works" is different from "is a good idea", though.  With SSE2
and other non-VEX xmm extensions, just using them in userspace seems
quite reasonable.  If a function could run faster using xmm, then it
might as well use xmm.  But this model starts to break down with newer
features:

VEX: ymm (AFAIK) performs just fine, at least on most CPUs, except
that mixing VEX and non-VEX code has big penalties.  Copying that
64-bit data structure using ymm is not necessarily wise even if it
microbenchmarks well.  Heck, mixing translation units using normal C
floating point code that were compiled with different flags can be
quite slow.

AVX-512: Intel has still not responded to my requests for detailed
documentation of the performance issues.  The internet is full of
various reports and various voodoo ideas.  VZEROALL does not do what
one would naively expect, and the implications are unclear.  AVX-512
code, even used just once, is likely to permanently bloat the signal
state.  Even ignoring the unknowns here, on most current non-Xeon-phi
parts AFAICT, using small bits of AVX-512 code has *huge* performance
impacts.  Libraries automatically using AVX-512 just because it's
there is not necessarily a good idea, even if it microbenchmarks well.

AMX: Multiplying a 4x4 matrix probably looks *great* in a
microbenchmark.  Do it once and you permanently allocate 8kB (is that
even a constant?  can it grow in newer parts?), potentially hurts all
future context switches, and does who-knows-what to Turbo licenses and
such.

Even putting aside all kernel and ABI issues, is it actually a good
idea for user libraries to transparently use these new features?  I'm
not really convinced.  I think that serious discussion among userspace
people is needed.

--Andy


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Andy Lutomirski



> On Apr 12, 2021, at 7:38 AM, Florian Weimer  wrote:
> 
> * Borislav Petkov:
> 
>>> On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote:
>>> Maybe we could have done this in 2016 when I reported this for the first
>>> time.  Now it is too late, as more and more software is using
>>> CPUID-based detection for AVX-512.
>> 
>> So as I said on another mail today, I don't think a library should rely
>> solely on CPUID-based detection of features especially if those features
>> need kernel support too. IOW, it should ask whether the kernel can
>> handle those too, first.
> 
> Yes, that's why we have the XGETBV handshake.  I was imprecise.  It's
> CPUID + XGETBV of course.  Or even AT_HWCAP2 (for FSGSBASE).
> 
>> And the CPUID-faulting thing would solve stuff like that because then
>> the kernel can *actually* get involved into answering something where it
>> has a say in, too.
> 
> But why wouldn't we use a syscall or an entry in the auxiliary vector
> for that?  Why fault a potentially performance-critical instruction?
> 

CPUID is horrifically slow in various virt scenarios. If user code needs to 
serialize, use IRET or SERIALIZE.
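
For reference, the CPUID + XGETBV handshake mentioned above is the usual
userspace sequence for checking both CPU support and OS (XCR0) enablement; a
minimal sketch for AVX:

#include <cpuid.h>
#include <stdbool.h>

/* Sketch of the usual CPUID + XGETBV handshake for AVX (illustrative). */
static bool os_and_cpu_support_avx(void)
{
	unsigned int eax, ebx, ecx, edx;
	unsigned int xcr0_lo, xcr0_hi;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	if (!(ecx & bit_OSXSAVE) || !(ecx & bit_AVX))
		return false;

	/* XGETBV with ECX=0 reads XCR0; bits 1 and 2 are SSE and AVX state. */
	__asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
	return (xcr0_lo & 0x6) == 0x6;
}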

Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-11 Thread Andy Lutomirski



> On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> 
> On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
>> Why don't we do this unconditionally? At the very best we gain half a 
>> megabyte of memory (except the trampoline, which has to live there, but it 
>> is only a few kilobytes.)
> 
> This is a great suggestion, thanks. I think we can fix it in this way to
> make code simpler. Then the specific caring of real mode in
> efi_free_boot_services() can be removed too.
> 

This whole situation makes me think that the code is buggy before and buggy 
after.

The issue here (I think) is that various pieces of code want to reserve 
specific pieces of otherwise-available low memory for their own nefarious uses. 
I don’t know *why* crash kernel needs this, but that doesn’t matter too much.

I propose that the right solution is to give low-memory-reserving code paths 
two chances to do what they need: once at the very beginning and once after EFI 
boot services are freed.

Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then 
release it right after releasing boot services, and then invoke the special 
cases exactly once.

In either case, the result is that the crashkernel mess gets unified with the 
trampoline mess.  One way, the result is called twice and needs to be more 
careful; the other way, it's called only once.

Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
services, and skipping that is incorrect, I think. And it seems to result in a 
bogus memory map in which the system thinks that some crashkernel memory is EFI 
memory instead.

Let’s please just fix the problem instead of papering over it with more hacks.

> Thanks
> Baoquan
> 


Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Andy Lutomirski
On Sun, Apr 11, 2021 at 10:04 AM Borislav Petkov  wrote:
>
> On Sun, Apr 11, 2021 at 09:57:20AM -0700, Andy Lutomirski wrote:
> > Working around a kernel bug. The workaround only worked on AMD
> > systems. The correct solution was to fix the kernel bug, not poke
> > MSRs.
>
> Do you remember which program(s) and where I can get them to have a
> look?
>

https://bugs.winehq.org/show_bug.cgi?id=33275#c19

I sure hope no one is still doing this.

--Andy


Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Andy Lutomirski




> On Apr 11, 2021, at 9:43 AM, Andi Kleen  wrote:
> 
> 
>> 
>> I have actually seen real user programs poke MSR_SYSCALL_MASK.
> 
> Hmm, what was the use case?
> 
> 

Working around a kernel bug.  The workaround only worked on AMD systems.  The 
correct solution was to fix the kernel bug, not poke MSRs.

Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Andy Lutomirski
On Sat, Apr 10, 2021 at 11:52 AM Andi Kleen  wrote:
>
> Borislav Petkov  writes:
>
> > From: Borislav Petkov 
> > Date: Sat, 10 Apr 2021 14:08:13 +0200
> >
> > There are a bunch of MSRs which luserspace has no business poking at,
> > whatsoever. Add a ban list and put the TSC-related MSRs in there. Issue
> > a big juicy splat to catch offenders.
>
> Have you ever seen any user programs actually write those MSRs?
> I don't see why they ever would, it's not that they have any motivation
> to do it (unlike SMM), and I don't know of any examples.

I have actually seen real user programs poke MSR_SYSCALL_MASK.


Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-10 Thread Andy Lutomirski



> On Apr 10, 2021, at 5:11 AM, Borislav Petkov  wrote:
> 
> From: Borislav Petkov 
> Date: Sat, 10 Apr 2021 14:08:13 +0200
> 
> There are a bunch of MSRs which luserspace has no business poking at,
> whatsoever. Add a ban list and put the TSC-related MSRs in there. Issue
> a big juicy splat to catch offenders.

Ack!

Can you add STAR, CSTAR, LSTAR, SYSENTER*, SYSCALL*, and EFER please? 

> 
> Signed-off-by: Borislav Petkov 
> ---
> arch/x86/kernel/msr.c | 17 +
> 1 file changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
> index ed8ac6bcbafb..574bd2c6d4f8 100644
> --- a/arch/x86/kernel/msr.c
> +++ b/arch/x86/kernel/msr.c
> @@ -78,6 +78,13 @@ static ssize_t msr_read(struct file *file, char __user 
> *buf,
>return bytes ? bytes : err;
> }
> 
> +static const u32 msr_ban_list[] = {
> +MSR_IA32_TSC,
> +MSR_TSC_AUX,
> +MSR_IA32_TSC_ADJUST,
> +MSR_IA32_TSC_DEADLINE,
> +};
> +
> static int filter_write(u32 reg)
> {
>/*
> @@ -89,6 +96,16 @@ static int filter_write(u32 reg)
> * avoid saturating the ring buffer.
> */
>static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1);
> +int i;
> +
> +for (i = 0; i < ARRAY_SIZE(msr_ban_list); i++) {
> +if (msr_ban_list[i] != reg)
> +continue;
> +
> +WARN_ONCE(1, "Blocked write to MSR 0x%x\n", reg);
> +
> +return -EINVAL;
> +}
> 
>switch (allow_writes) {
>case MSR_WRITES_ON:  return 0;
> -- 
> 2.29.2
> 
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
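
For reference, the additions Andy asks for would just grow the array above; a
sketch using the existing MSR_* constant names from
arch/x86/include/asm/msr-index.h (whether each one belongs on the list is of
course the maintainers' call):

/* Sketch of the requested additions, using existing MSR_* constants. */
static const u32 msr_ban_list[] = {
	MSR_IA32_TSC,
	MSR_TSC_AUX,
	MSR_IA32_TSC_ADJUST,
	MSR_IA32_TSC_DEADLINE,
	MSR_STAR,
	MSR_CSTAR,
	MSR_LSTAR,
	MSR_SYSCALL_MASK,
	MSR_IA32_SYSENTER_CS,
	MSR_IA32_SYSENTER_ESP,
	MSR_IA32_SYSENTER_EIP,
	MSR_EFER,
};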


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-09 Thread Andy Lutomirski
On Fri, Apr 9, 2021 at 1:53 PM Len Brown  wrote:
>
> On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski  wrote:
> >
> > On Wed, Mar 31, 2021 at 3:28 PM Len Brown  wrote:
> >
> > > We added compiler annotation for user-level interrupt handlers.
> > > I'm not aware of it failing, or otherwise being confused.
> >
> > I followed your link and found nothing. Can you elaborate?  In the
> > kernel, we have noinstr, and gcc gives approximately no help toward
> > catching problems.
>
> A search for the word "interrupt" on this page
> https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> comes to the description of this attribute:
>
> __attribute__ ((interrupt))
>

I read that and I see no mention of anything saying "this will
generate code that does not touch extended state".  Instead I see,
paraphrasing, "this will generate code with an ABI that is completely
inappropriate for use in a user space signal handler".  Am I missing
something?

> > > dynamic XCR0 breaks the installed base, I thought we had established that.
> >
> > I don't think this is at all established.  If some code thinks it
> > knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> > is not necessarily a showstopper.
>
> My working assumption is that crashing applications actually *is* a 
> showstopper.
> Please clarify.

I think you're presuming that some program actually does this.  If no
program does this, it's not an ABI break.

More relevantly, this can only happen in a process that uses XSAVE and
thinks it knows the size that *also* does the prctl to change XCR0.
By construction, existing programs can't break unless they load new
dynamic libraries that break them.

>
> > > We've also established that when running in a VMM, every update to
> > > XCR0 causes a VMEXIT.
> >
> > This is true, it sucks, and Intel could fix it going forward.
>
> What hardware fix do you suggest?
> If a guest is permitted to set XCR0 bits without notifying the VMM,
> what happens when it sets bits that the VMM doesn't know about?

The VM could have a mask of allowed XCR0 bits that don't exist.

TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
TDX.  Surely plain VMX could fix it too.

>
> > > I thought the goal was to allow new programs to have fast signal handlers.
> > > By default, those fast signal handlers would have a stable state
> > > image, and would
> > > not inherit large architectural state on their stacks, and could thus
> > > have minimal overhead on all hardware.
> >
> > That is *a* goal, but not necessarily the only goal.
>
> I fully support coming up with a scheme for fast future-proof signal handlers,
> and I'm willing to back that up by putting work into it.
>
> I don't see any other goals articulated in this thread.

Before we get too carried away with *fast* signal handlers, something
that works with existing programs is also a pretty strong goal.  Right
now AVX-512 breaks existing programs, even if they don't use AVX-512.
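
Much of that breakage is the signal-stack sizing problem: with AVX-512 enabled
by the OS, the xsave area in every signal frame grows past the historical
MINSIGSTKSZ, so a program using a minimum-sized alternate stack can overflow it
without executing a single AVX-512 instruction itself.  An illustrative snippet
of the fragile pattern (not from this thread):

#include <signal.h>

/* Classic fragile pattern: MINSIGSTKSZ is 2048 on x86-64, but an xsave-based
 * signal frame with AVX-512 state enabled is larger than that by itself. */
static char altstack[MINSIGSTKSZ];

static void use_minimal_altstack(void)
{
	stack_t ss = {
		.ss_sp = altstack,
		.ss_size = sizeof(altstack),
		.ss_flags = 0,
	};

	sigaltstack(&ss, NULL);
}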


Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

2021-04-07 Thread Andy Lutomirski
On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> The page table tree is walked with local irqs disabled, which prevents
> page table reclamation (similarly to what fast GUP does). The logic is
> otherwise similar to the non-speculative path, but with additional
> restrictions: in the speculative path, we do not handle huge pages or
> wiring new pages tables.

Not on most architectures.  Quoting the actual comment in mm/gup.c:

>  * Before activating this code, please be aware that the following assumptions
>  * are currently made:
>  *
>  *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is 
> used to
>  *  free pages containing page tables or TLB flushing requires IPI broadcast.

On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
assumption that it is safe to dereference a pointer in a page table just
because irqs are off.  You need RCU protection, too.

You have the same error in the cover letter.

--Andy
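
A minimal sketch of the point being made (illustrative only, not the patch's
code): on configs where page tables are freed via RCU rather than by IPI
broadcast, the speculative walker needs an RCU read-side section in addition to
having interrupts off.

/*
 * Illustrative only.  With MMU_GATHER_RCU_TABLE_FREE, page-table pages are
 * freed after an RCU grace period rather than after an IPI broadcast, so
 * irqs-off alone does not keep the tables from disappearing under the walker.
 */
static void speculative_walk(struct mm_struct *mm, unsigned long addr)
{
	local_irq_disable();	/* blocks the IPI-based free path */
	rcu_read_lock();	/* blocks the RCU-based free path  */

	/* ... walk pgd/p4d/pud/pmd/pte for @addr here ... */

	rcu_read_unlock();
	local_irq_enable();
}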


Re: [PATCH v24 25/30] x86/cet/shstk: Handle signals for shadow stack

2021-04-06 Thread Andy Lutomirski
On Thu, Apr 1, 2021 at 3:11 PM Yu-cheng Yu  wrote:
>
> When shadow stack is enabled, a task's shadow stack states must be saved
> along with the signal context and later restored in sigreturn.  However,
> currently there is no systematic facility for extending a signal context.
>
> Introduce a signal context extension struct 'sc_ext', which is used to save
> shadow stack restore token address and WAIT_ENDBR status[1].  The extension
> is located above the fpu states, plus alignment.
>
> Introduce routines for the allocation, save, and restore for sc_ext:
> - fpu__alloc_sigcontext_ext(),
> - save_extra_state_to_sigframe(),
> - get_extra_state_from_sigframe(),
> - restore_extra_state().
>
> [1] WAIT_ENDBR will be introduced later in the Indirect Branch Tracking
> series, but add that into sc_ext now to keep the struct stable in case
> the IBT series is applied later.

Please don't.  Instead, please figure out how that structure gets
extended for real, and organize your patches to demonstrate that the
extension works.

>
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> ---
> v24:
> - Split out shadow stack token routines to a separate patch.
> - Put signal frame save/restore routines to fpu/signal.c and re-name 
> accordingly.
>
>  arch/x86/ia32/ia32_signal.c|  16 +++
>  arch/x86/include/asm/cet.h |   2 +
>  arch/x86/include/asm/fpu/internal.h|   2 +
>  arch/x86/include/uapi/asm/sigcontext.h |   9 ++
>  arch/x86/kernel/fpu/signal.c   | 143 +
>  arch/x86/kernel/signal.c   |   9 ++
>  6 files changed, 181 insertions(+)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 5e3d9b7fd5fb..96b87c5f0bbe 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -205,6 +205,7 @@ static void __user *get_sigframe(struct ksignal *ksig, 
> struct pt_regs *regs,
>  void __user **fpstate)
>  {
> unsigned long sp, fx_aligned, math_size;
> +   void __user *restorer = NULL;
>
> /* Default to using normal stack */
> sp = regs->sp;
> @@ -218,8 +219,23 @@ static void __user *get_sigframe(struct ksignal *ksig, 
> struct pt_regs *regs,
>  ksig->ka.sa.sa_restorer)
> sp = (unsigned long) ksig->ka.sa.sa_restorer;
>
> +   if (ksig->ka.sa.sa_flags & SA_RESTORER) {
> +   restorer = ksig->ka.sa.sa_restorer;
> +   } else if (current->mm->context.vdso) {
> +   if (ksig->ka.sa.sa_flags & SA_SIGINFO)
> +   restorer = current->mm->context.vdso +
> +   vdso_image_32.sym___kernel_rt_sigreturn;
> +   else
> +   restorer = current->mm->context.vdso +
> +   vdso_image_32.sym___kernel_sigreturn;
> +   }
> +

Why do we need another copy of this logic?  You're trying to push the
correct return address for the signal handler function onto the stack.
Please calculate that return address once and then use it here.

> sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
> *fpstate = (struct _fpstate_32 __user *) sp;
> +
> +   if (save_extra_state_to_sigframe(1, *fpstate, (unsigned 
> long)restorer))
> +   return (void __user *)-1L;
> +
> if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
>  math_size) < 0)
> return (void __user *) -1L;
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index ef6155213b7e..5e66919bd2fe 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -6,6 +6,8 @@
>  #include 
>
>  struct task_struct;
> +struct sc_ext;
> +
>  /*
>   * Per-thread CET status
>   */
> diff --git a/arch/x86/include/asm/fpu/internal.h 
> b/arch/x86/include/asm/fpu/internal.h
> index 8d33ad80704f..eb01eb6ea55d 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -443,6 +443,8 @@ static inline void copy_kernel_to_fpregs(union 
> fpregs_state *fpstate)
> __copy_kernel_to_fpregs(fpstate, -1);
>  }
>
> +extern int save_extra_state_to_sigframe(int ia32, void __user *fp,
> +   unsigned long restorer);
>  extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int 
> size);
>
>  /*
> diff --git a/arch/x86/include/uapi/asm/sigcontext.h 
> b/arch/x86/include/uapi/asm/sigcontext.h
> index 844d60eb1882..cf2d55db3be4 100644
> --- a/arch/x86/include/uapi/asm/sigcontext.h
> +++ b/arch/x86/include/uapi/asm/sigcontext.h
> @@ -196,6 +196,15 @@ struct _xstate {
> /* New processor state extensions go here: */
>  };
>
> +/*
> + * Located at the end of sigcontext->fpstate, aligned to 8.
> + */
> +struct sc_ext {
> +   unsigned long total_size;
> +   unsigned long ssp;
> +   unsigned long wait_endbr;
> +};

We need some proper 

Re: [PATCH v24 24/30] x86/cet/shstk: Introduce shadow stack token setup/verify routines

2021-04-06 Thread Andy Lutomirski
On Thu, Apr 1, 2021 at 3:12 PM Yu-cheng Yu  wrote:
>
> A shadow stack restore token marks a restore point of the shadow stack, and
> the address in a token must point directly above the token, which is within
> the same shadow stack.  This is distinctively different from other pointers
> on the shadow stack, since those pointers point to executable code area.
>
> The restore token can be used as an extra protection for signal handling.
> To deliver a signal, create a shadow stack restore token and put the token
> and the signal restorer address on the shadow stack.  In sigreturn, verify
> the token and restore from it the shadow stack pointer.
>
> Introduce token setup and verify routines.  Also introduce WRUSS, which is
> a kernel-mode instruction but writes directly to user shadow stack.  It is
> used to construct user signal stack as described above.
>
> Signed-off-by: Yu-cheng Yu 
> Cc: Kees Cook 
> ---
>  arch/x86/include/asm/cet.h   |   9 ++
>  arch/x86/include/asm/special_insns.h |  32 +++
>  arch/x86/kernel/shstk.c  | 126 +++
>  3 files changed, 167 insertions(+)
>
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 8b83ded577cc..ef6155213b7e 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -20,6 +20,10 @@ int shstk_setup_thread(struct task_struct *p, unsigned 
> long clone_flags,
>unsigned long stack_size);
>  void shstk_free(struct task_struct *p);
>  void shstk_disable(void);
> +int shstk_setup_rstor_token(bool ia32, unsigned long rstor,
> +   unsigned long *token_addr, unsigned long 
> *new_ssp);
> +int shstk_check_rstor_token(bool ia32, unsigned long token_addr,
> +   unsigned long *new_ssp);
>  #else
>  static inline int shstk_setup(void) { return 0; }
>  static inline int shstk_setup_thread(struct task_struct *p,
> @@ -27,6 +31,11 @@ static inline int shstk_setup_thread(struct task_struct *p,
>  unsigned long stack_size) { return 0; }
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline void shstk_disable(void) {}
> +static inline int shstk_setup_rstor_token(bool ia32, unsigned long rstor,
> + unsigned long *token_addr,
> + unsigned long *new_ssp) { return 0; 
> }
> +static inline int shstk_check_rstor_token(bool ia32, unsigned long 
> token_addr,
> + unsigned long *new_ssp) { return 0; 
> }
>  #endif
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/include/asm/special_insns.h 
> b/arch/x86/include/asm/special_insns.h
> index 1d3cbaef4bb7..c41c371f6c7d 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -234,6 +234,38 @@ static inline void clwb(volatile void *__p)
> : [pax] "a" (p));
>  }
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> +#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)

u32 __user *addr?

> +{
> +   asm_volatile_goto("1: wrussd %1, (%0)\n"
> + _ASM_EXTABLE(1b, %l[fail])
> + :: "r" (addr), "r" (val)
> + :: fail);
> +   return 0;
> +fail:
> +   return -EPERM;

-EFAULT?

> +}
> +#else
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
> +{
> +   WARN_ONCE(1, "%s used but not supported.\n", __func__);
> +   return -EFAULT;
> +}
> +#endif
> +
> +static inline int write_user_shstk_64(unsigned long addr, unsigned long val)

u64 __user *addr, perhaps?

> +{
> +   asm_volatile_goto("1: wrussq %1, (%0)\n"
> + _ASM_EXTABLE(1b, %l[fail])
> + :: "r" (addr), "r" (val)

Can you use the modern [addr] "r" (addr) syntax?
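
I.e., with the type and the error code fixed up as well, something like
this (sketch of what I have in mind, not tested):

static inline int write_user_shstk_64(u64 __user *addr, u64 val)
{
        asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
                          _ASM_EXTABLE(1b, %l[fail])
                          : : [addr] "r" (addr), [val] "r" (val)
                          : : fail);
        return 0;
fail:
        return -EFAULT;
}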


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Andy Lutomirski
On Wed, Mar 31, 2021 at 3:28 PM Len Brown  wrote:
>
> On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski  wrote:
>
> > But this whole annotation thing will require serious compiler support.
> > We already have problems with compilers inlining functions and getting 
> > confused about attributes.
>
> We added compiler annotation for user-level interrupt handlers.
> I'm not aware of it failing, or otherwise being confused.

I followed your link and found nothing. Can you elaborate?  In the
kernel, we have noinstr, and gcc gives approximately no help toward
catching problems.

>
> Why would compiler support for fast-signals be any more "serious"?
>
> > An API like:
> >
> > if (get_amx()) {
> >  use AMX;
> > } else {
> >  don’t;
> > }
> >
> > Avoids this problem. And making XCR0 dynamic, for all its faults, at least 
> > helps force a degree of discipline on user code.
>
> dynamic XCR0 breaks the installed base, I thought we had established that.

I don't think this is at all established.  If some code thinks it
knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
is not necessarily a showstopper.

>
> We've also established that when running in a VMM, every update to
> XCR0 causes a VMEXIT.

This is true, it sucks, and Intel could fix it going forward.

>
> I thought the goal was to allow new programs to have fast signal handlers.
> By default, those fast signal handlers would have a stable state
> image, and would
> not inherit large architectural state on their stacks, and could thus
> have minimal overhead on all hardware.

That is *a* goal, but not necessarily the only goal.

--Andy


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Andy Lutomirski


> On Mar 31, 2021, at 9:31 AM, Len Brown  wrote:
> 
> On Tue, Mar 30, 2021 at 6:01 PM David Laight  wrote:
> 
>>> Can we leave it in live registers?  That would be the speed-of-light
>>> signal handler approach.  But we'd need to teach the signal handler to
>>> not clobber it.  Perhaps that could be part of the contract that a
>>> fast signal handler signs?  INIT=0 AMX state could simply sit
>>> patiently in the AMX registers for the duration of the signal handler.
>>> You can't get any faster than doing nothing :-)
>>> 
>>> Of course part of the contract for the fast signal handler is that it
>>> knows that it can't possibly use XRESTOR of the stuff on the stack to
>>> necessarily get back to the state of the signaled thread (assuming we
>>> even used XSTATE format on the fast signal handler stack, it would
>>> forget the contents of the AMX registers, in this example)
>> 
>> gcc will just use the AVX registers for 'normal' code within
>> the signal handler.
>> So it has to have its own copy of all the registers.
>> (Well, maybe you could make the TMX instructions fault,
>> but that would need a nested signal delivered.)
> 
> This is true, by default, but it doesn't have to be true.
> 
> Today, gcc has an annotation for user-level interrupts
> https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> 
> An analogous annotation could be created for fast signals.
> gcc can be told exactly what registers and instructions it can use for
> that routine.
> 
> Of course, this begs the question about what routines that handler calls,
> and that would need to be constrained too.
> 
> Today signal-safety(7) advises programmers to limit what legacy signal 
> handlers
> can call.  There is no reason that a fast-signal-safety(7) could not be 
> created
> for the fast path.
> 
>> There is also the register save buffer that you need in order
>> to long-jump out of a signal handler.
>> Unfortunately that is required to work.
>> I'm pretty sure the original setjmp/longjmp just saved the stack
>> pointer - but that really doesn't work any more.
>> 
>> OTOH most signal handlers don't care - but there isn't a flag
>> to sigset() (etc) so ask for a specific register layout.
> 
> Right, the idea is to optimize for *most* signal handlers,
> since making any changes to *all* signal handlers is intractable.
> 
> So the idea is that opting-in to a fast signal handler would opt-out
> of some legacy signal capibilities.  Complete state is one of them,
> and thus long-jump is not supported, because the complete state
> may not automatically be available.

Long jump is probably the easiest problem of all: sigsetjmp() is a *function*, 
following ABI, so sigsetjmp() is expected to clobber most or all of the 
extended state.

But this whole annotation thing will require serious compiler support. We 
already have problems with compilers inlining functions and getting confused 
about attributes.

An API like:

if (get_amx()) {
 use AMX;
} else {
 don’t;
}

Avoids this problem. And making XCR0 dynamic, for all its faults, at least 
helps force a degree of discipline on user code.


> 
> thanks,
> Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Andy Lutomirski


> On Mar 30, 2021, at 12:12 PM, Dave Hansen  wrote:
> 
> On 3/30/21 10:56 AM, Len Brown wrote:
>> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski  wrote:
>>>> On Mar 30, 2021, at 10:01 AM, Len Brown  wrote:
>>>> Is it required (by the "ABI") that a user program has everything
>>>> on the stack for user-space XSAVE/XRESTOR to get back
>>>> to the state of the program just before receiving the signal?
>>> The current Linux signal frame format has XSTATE in uncompacted format,
>>> so everything has to be there.
>>> Maybe we could have an opt in new signal frame format, but the details 
>>> would need to be worked out.
>>> 
>>> It is certainly the case that a signal should be able to be delivered, run 
>>> “async-signal-safe” code,
>>> and return, without corrupting register contents.
>> And so an an acknowledgement:
>> 
>> We can't change the legacy signal stack format without breaking
>> existing programs.  The legacy is uncompressed XSTATE.  It is a
>> complete set of architectural state -- everything necessary to
>> XRESTOR.  Further, the sigreturn flow allows the signal handler to
>> *change* any of that state, so that it becomes active upon return from
>> signal.
> 
> One nit with this: XRSTOR itself can work with the compacted format or
> uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
> explicit from the instruction itself, XRSTOR changes its behavior by
> reading XCOMP_BV.  There's no XRSTORC.
> 
> The issue with using the compacted format is when legacy software in the
> signal handler needs to go access the state.  *That* is what can't
> handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
> or compacted/XSAVEC).

The compacted format isn’t compact enough anyway. If we want to keep AMX and 
AVX512 enabled in XCR0 then we need to further muck with the format to omit the 
not-in-use features. I *think* we can pull this off in a way that still does 
the right thing wrt XRSTOR.

If we go this route, I think we want a way for sigreturn to understand a 
pointer to the state instead of inline state to allow programs to change the 
state.  Or maybe just to have a way to ask sigreturn to skip the restore 
entirely.

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Andy Lutomirski



> On Mar 30, 2021, at 10:01 AM, Len Brown  wrote:
> 
> Andy,
> 
> I agree, completely, with your description of the challenge,
> thank you for focusing the discussion on that problem statement.
> 
> Question:
> 
> Is it required (by the "ABI") that a user program has everything
> on the stack for user-space XSAVE/XRESTOR to get back
> to the state of the program just before receiving the signal?

The current Linux signal frame format has XSTATE in uncompacted format, so 
everything has to be there. Maybe we could have an opt in new signal frame 
format, but the details would need to be worked out.

It is certainly the case that a signal should be able to be delivered, run 
“async-signal-safe” code, and return, without corrupting register contents.

Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-30 Thread Andy Lutomirski


> On Mar 30, 2021, at 8:14 AM, Sean Christopherson  wrote:
> 
> On Mon, Mar 29, 2021, Andy Lutomirski wrote:
>> 
>>>> On Mar 29, 2021, at 7:04 PM, Andi Kleen  wrote:
>>> 
>>> 
>>>> 
>>>>> No, if these instructions take a #VE then they were executed at CPL=0.  
>>>>> MONITOR
>>>>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
>>>> 
>>>> Dare I ask about XSETBV?
>>> 
>>> XGETBV does not cause a #VE, it just works normally. The guest has full
>>> AVX capabilities.
>>> 
>> 
>> X *SET* BV
> 
> Heh, XSETBV also works normally, relative to the features enumerated in CPUID.
> XSAVES/XRSTORS support is fixed to '1' in the virtual CPU model.  A subset of
> the features managed by XSAVE can be hidden by the VMM, but attempting to 
> enable
> unsupported features will #GP (either from hardware or injected by TDX 
> Module),
> not #VE.

"Normally" in non-root mode means that every XSETBV results in a VM exit and, 
IIUC, there’s a buglet in that this happens even if CPL==3.  Does something 
special happen in TDX or does the exit get reflected back to the guest as a #VE?

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Andy Lutomirski
On Mon, Mar 29, 2021 at 3:38 PM Len Brown  wrote:
>
> On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski  wrote:
> >

> Hi Andy,
>
> Can you provide a concise definition of the exact problemI(s) this thread
> is attempting to address?

The AVX-512 state, all by itself, is more than 2048 bytes.  Quoting
the POSIX sigaltstack page (man 3p sigaltstack):

   The  value  SIGSTKSZ is a system default specifying the number of bytes
   that would be used to cover the usual case when manually allocating  an
   alternate  stack area. The value MINSIGSTKSZ is defined to be the mini‐
   mum stack size for a signal handler. In computing  an  alternate  stack
   size, a program should add that amount to its stack requirements to al‐
   low for the system implementation overhead. The  constants  SS_ONSTACK,
    SS_DISABLE, SIGSTKSZ, and MINSIGSTKSZ are defined in <signal.h>.

arch/x86/include/uapi/asm/signal.h:#define MINSIGSTKSZ 2048
arch/x86/include/uapi/asm/signal.h:#define SIGSTKSZ 8192

Regrettably, the Linux signal frame format is the uncompacted format
and, also regrettably, the uncompacted format has the nasty property
that its format depends on XCR0 but not on the set of registers that
are actually used or wanted, so, with the current ABI, the signal
frame is stuck being quite large for all programs on a machine that
supports avx512 and has it enabled by the kernel.  And it's even
larger for AMX and violates SIGSTKSZ as well as MINSIGSTKSZ.

There are apparently real programs that break as a result.  We need to
find a way to handle new, large extended states without breaking user
ABI.  We should also find a way to handle them without consuming silly
amounts of stack space for programs that don't use them.
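
For anyone who wants to see the mismatch on their own machine, here is a
quick user-space check (illustration only; CPUID leaf 0xD, subleaf 0, EBX
is the XSAVE area size for the currently enabled XCR0, which is roughly
what ends up in the signal frame):

#include <signal.h>
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
                return 1;

        /* EBX: bytes needed for XSAVE of everything enabled in XCR0. */
        printf("XSAVE area for current XCR0: %u bytes\n", ebx);
        printf("MINSIGSTKSZ: %ld, SIGSTKSZ: %ld\n",
               (long)MINSIGSTKSZ, (long)SIGSTKSZ);
        return 0;
}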

Sadly, if the solution we settle on involves context switching XCR0,
performance on first-generation hardware will suffer because VMX does
not have any way to allow guests to write XCR0 without exiting.  I
don't consider this to be a showstopper -- if we end up having this
problem, fixing it in subsequent CPUs is straightforward.

>
> Thanks ahead-of-time for excluding "blow up power consumption",
> since that paranoia is not grounded in fact.
>

I will gladly exclude power consumption from this discussion, since
that's a separate issue that has nothing to do with the user<->kernel
ABI.


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 7:04 PM, Andi Kleen  wrote:
> 
> 
>> 
>>> No, if these instructions take a #VE then they were executed at CPL=0.  
>>> MONITOR
>>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
>> 
>> Dare I ask about XSETBV?
> 
> XGETBV does not cause a #VE, it just works normally. The guest has full
> AVX capabilities.
> 

X *SET* BV

Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Andy Lutomirski
On Mon, Mar 29, 2021 at 4:42 PM Sean Christopherson  wrote:
>
> On Mon, Mar 29, 2021, Kuppuswamy, Sathyanarayanan wrote:
> >
> >
> > On 3/29/21 4:23 PM, Andy Lutomirski wrote:
> > >
> > > > On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan 
> > > >  wrote:
> > > >
> > > > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > > > are not supported. So handle #VE due to these instructions
> > > > appropriately.
> > >
> > > Is there something I missed elsewhere in the code that checks CPL?
> > We don't check for CPL explicitly. But if we are reaching here, then we
> > executing these instructions with wrong CPL.
>
> No, if these instructions take a #VE then they were executed at CPL=0.  
> MONITOR
> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.

Dare I ask about XSETBV?


Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 2:55 PM, Marco Elver  wrote:
> 
> On Mon, 29 Mar 2021 at 23:47, Andy Lutomirski  wrote:
>> 
>> 
>>>> On Mar 29, 2021, at 2:34 PM, Marco Elver  wrote:
>>> 
>>> On Mon, 29 Mar 2021 at 23:03, Dave Hansen  wrote:
>>>>> On 3/29/21 10:45 AM, Marco Elver wrote:
>>>>>> On Mon, 29 Mar 2021 at 19:32, Dave Hansen  wrote:
>>>>> Doing it to all CPUs is too expensive, and we can tolerate this being
>>>>> approximate (nothing bad will happen, KFENCE might just miss a bug and
>>>>> that's ok).
>>>> ...
>>>>>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
>>>>>> being enabled.  That's probably why you don't see this everywhere.  We
>>>>>> should probably have unconditional preempt checks in there.
>>>>> 
>>>>> In which case I'll add a preempt_disable/enable() pair to
>>>>> kfence_protect_page() in arch/x86/include/asm/kfence.h.
>>>> 
>>>> That sounds sane to me.  I'd just plead that the special situation (not
>>>> needing deterministic TLB flushes) is obvious.  We don't want any folks
>>>> copying this code.
>>>> 
>>>> BTW, I know you want to avoid the cost of IPIs, but have you considered
>>>> any other low-cost ways to get quicker TLB flushes?  For instance, you
>>>> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1.  That
>>>> would induce a context switch at the next context switch without needing
>>>> an IPI.
>>> 
>>> This is interesting. And it seems like it would work well for our
>>> usecase. Ideally we should only flush entries related to the page we
>>> changed. But it seems invalidate_other would flush the entire TLB.
>>> 
>>> With PTI, flush_tlb_one_kernel() already does that for the current
>>> CPU, but now we'd flush entire TLBs for all CPUs and even if PTI is
>>> off.
>>> 
>>> Do you have an intuition for how much this would affect large
>>> multi-socket systems? I currently can't quite say, and would err on
>>> the side of caution.
>> 
>> Flushing the kernel TLB for all addresses
>> Is rather pricy. ISTR 600 cycles on Skylake, not to mention the cost of 
>> losing the TLB.  How common is this?
> 
> AFAIK, invalidate_other resets the asid, so it's not explicit and
> perhaps cheaper?
> 
> In any case, if we were to do this, it'd be based on the sample
> interval of KFENCE, which can be as low as 1ms. But this is a
> production debugging feature, so the target machines are not test
> machines. For those production deployments we'd be looking at every
> ~500ms. But I know of other deployments that use <100ms.
> 
> Doesn't sound like much, but as you say, I also worry a bit about
> losing the TLB across >100 CPUs even if it's every 500ms.

On non-PTI, the only way to zap kernel mappings is to do a global flush, either 
via INVPCID (expensive) or CR4 (extra expensive). In PTI mode, it’s plausible 
that the implicit flush is good enough, and I’d be happy to review the patch, 
but it’s a PTI only thing.  Much less expensive in PTI mode, too, because it 
only needs to flush kernel mappings.

If this is best-effort, it might be better to have some work in the exit to 
usermode path or a thread or similar that periodically does targeted 
single-page zaps.
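
Rough shape of that last idea (sketch only; kfence_deferred_addr,
kfence_queue_flush() and the 10 ms delay are made up for illustration):

#include <linux/workqueue.h>
#include <linux/mm.h>
#include <asm/tlbflush.h>

static unsigned long kfence_deferred_addr;

/* Best effort: do the kernel-range flush later, off the hot path. */
static void kfence_deferred_flush(struct work_struct *work)
{
        unsigned long addr = READ_ONCE(kfence_deferred_addr);

        if (addr)
                flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
static DECLARE_DELAYED_WORK(kfence_flush_work, kfence_deferred_flush);

/* Called from kfence_protect_page() instead of an immediate IPI. */
static void kfence_queue_flush(unsigned long addr)
{
        WRITE_ONCE(kfence_deferred_addr, addr);
        schedule_delayed_work(&kfence_flush_work, msecs_to_jiffies(10));
}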


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan 
>  wrote:
> 
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions
> appropriately.

Is there something I missed elsewhere in the code that checks CPL?


Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 2:34 PM, Marco Elver  wrote:
> 
> On Mon, 29 Mar 2021 at 23:03, Dave Hansen  wrote:
>>> On 3/29/21 10:45 AM, Marco Elver wrote:
>>>> On Mon, 29 Mar 2021 at 19:32, Dave Hansen  wrote:
>>> Doing it to all CPUs is too expensive, and we can tolerate this being
>>> approximate (nothing bad will happen, KFENCE might just miss a bug and
>>> that's ok).
>> ...
 BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
 being enabled.  That's probably why you don't see this everywhere.  We
 should probably have unconditional preempt checks in there.
>>> 
>>> In which case I'll add a preempt_disable/enable() pair to
>>> kfence_protect_page() in arch/x86/include/asm/kfence.h.
>> 
>> That sounds sane to me.  I'd just plead that the special situation (not
>> needing deterministic TLB flushes) is obvious.  We don't want any folks
>> copying this code.
>> 
>> BTW, I know you want to avoid the cost of IPIs, but have you considered
>> any other low-cost ways to get quicker TLB flushes?  For instance, you
>> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1.  That
>> would induce a context switch at the next context switch without needing
>> an IPI.
> 
> This is interesting. And it seems like it would work well for our
> usecase. Ideally we should only flush entries related to the page we
> changed. But it seems invalidate_other would flush the entire TLB.
> 
> With PTI, flush_tlb_one_kernel() already does that for the current
> CPU, but now we'd flush entire TLBs for all CPUs and even if PTI is
> off.
> 
> Do you have an intuition for how much this would affect large
> multi-socket systems? I currently can't quite say, and would err on
> the side of caution.

Flushing the kernel TLB for all addresses
Is rather pricy. ISTR 600 cycles on Skylake, not to mention the cost of losing 
the TLB.  How common is this?

> 
> Thanks,
> -- Marco


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 8:47 AM, Len Brown  wrote:
> 
> On Sat, Mar 27, 2021 at 5:58 AM Greg KH  wrote:
>>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
>>> Hi Andy,
>>> Say a mainline links with a math library that uses AMX without the
>>> knowledge of the mainline.
> 
> sorry for the confusion.
> 
> mainline = main().
> 
> ie. the part of the program written by you, and not the library you linked 
> with.
> 
> In particular, the library may use instructions that main() doesn't know 
> exist.

If we pretend for a bit that AMX were a separate device instead of a part of 
the CPU, this would be a no brainer: something would be responsible for opening 
a device node or otherwise requesting access to the device. 

Real AMX isn’t so different. Programs acquire access either by syscall or by a 
fault, they use it, and (hopefully) they release it again using TILERELEASE. 
The only thing special about it is that, supposedly, acquiring and releasing 
access (at least after the first time) is quite fast.  But holding access is 
*not* free — despite all your assertions to the contrary, the kernel *will* 
correctly context switch it to avoid blowing up power consumption, and this 
will have overhead.

We’ve seen the pattern of programs thinking that, just because something is a 
CPU insn, it’s free and no thought is needed before using it. This happened 
with AVX and AVX512, and it will happen again with AMX. We *still* have a 
significant performance regression in the kernel due to screwing up the AVX 
state machine, and the only way I know about any of the details is that I wrote 
silly test programs to try to reverse engineer the nonsensical behavior of the 
CPUs.

I might believe that Intel has figured out how to make a well behaved XSTATE 
feature after Intel demonstrates at least once that it’s possible.  That means 
full documentation of all the weird issues, no new special cases, and the 
feature actually making sense in the context of XSTATE.  This has not happened. 
 Let’s list all of them:

- SSE.  Look for all the MXCSR special cases in the pseudocode and tell me with 
a straight face that this one works sensibly.

- AVX.  Also has special cases in the pseudocode. And has transition issues 
that are still problems and still not fully documented.

- AVX2.  Horrible undocumented performance issues.  Otherwise maybe okay?

- MPX: maybe the best example, but the compat mode part got flubbed and it’s 
MPX.

- PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a 
major mistake, now unfixable, that seriously limits the usefulness of the whole 
feature.  I suppose Intel could release PKRU2 with a better ISA and deprecate 
the original PKRU, but I’m not holding my breath.)

- AVX512: Yet more uarch-dependent horrible performance issues, and Intel has 
still not responded about documentation.  The web is full of people speculating 
differently about when, exactly, using AVX512 breaks performance. This is 
NAKked in kernel until docs arrive. Also, it broke old user programs.  If we 
had noticed a few years ago, AVX512 enablement would have been reverted.

- AMX: This mess.

The current system of automatic user enablement does not work. We need 
something better.

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Andy Lutomirski



> On Mar 29, 2021, at 9:39 AM, Len Brown  wrote:
> 
> 
>> 
>> In particular, the library may use instructions that main() doesn't know 
>> exist.
> 
> And so I'll ask my question another way.
> 
> How is it okay to change the value of XCR0 during the run time of a program?
> 
> I submit that it is not, and that is a deal-killer for a request/release API.
> 
> eg.  main() doesn't know that the math library wants to use AMX,
> and neither does the threading library.  So main() doesn't know to
> call the API before either library is invoked.  The threading library starts 
> up
> and creates user-space threads based on the initial value from XCR0.
> Then the math library calls the API, which adds bits to XCRO,
> and then the user-space context switch in the threading library corrupts data
> because the new XCR0 size doesn't match the initial size.
> 

In the most extreme case, userspace could require that every loaded DSO be 
tagged with a new ELF note indicating support for dynamic XCR0 before changing 
XCR0.

I would like to remind everyone that kernel enablement of AVX512 *already* 
broke old userspace. AMX will further break something. At least with dynamic 
XCR0 we can make the breakage opt-in.

The ISA could have helped here by allowing the non-compacted XSTATE format to 
be frozen even in the face of changing XCR0.  But it didn’t.  At the end of the 
day, we are faced with the fact that XSTATE is a poor design, and we have to 
make the best of it.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 9:06 AM, Len Brown  wrote:
> 
> On Mon, Mar 29, 2021 at 11:43 AM Len Brown  wrote:
>> 
>> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:
>> 
>>>> I found the author of this passage, and he agreed to revise it to say this
>>>> was targeted primarily at VMMs.
>>> 
>>> Why would this only be a problem for VMMs?
>> 
>> VMMs may have to emulate different hardware for different guest OS's,
>> and they would likely "context switch" XCR0 to achieve that.
>> 
>> As switching XCR0 at run-time would confuse the heck out of user-space,
>> it was not imagined that a bare-metal OS would do that.
> 
> to clarify...
> *switching* XCR0 on context switch is slow, but perfectly legal.

How slow is it?  And how slow is switching XFD?  XFD is definitely serializing?


Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-27 Thread Andy Lutomirski




> On Mar 26, 2021, at 8:40 PM, Kuppuswamy, Sathyanarayanan 
>  wrote:
> 
> 
> 
> On 3/26/21 7:40 PM, Andy Lutomirski wrote:
>>>> On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan 
>>>>  wrote:
>>> 
>>> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
>>> are not supported. So handle #VE due to these instructions as no ops.
>> These should at least be WARN.
> I will change it to WARN.
>> Does TDX send #UD if these instructions have the wrong CPL?  
> No, TDX does not trigger #UD for these instructions.
>> If the #VE came from user mode, we should send an appropriate signal instead.
> It will be mapped into #GP(0) fault. This should be enough notification right?

Yes. And I did mean #GP, not #UD.



Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-26 Thread Andy Lutomirski



> On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan 
>  wrote:
> 
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions as no ops.

These should at least be WARN.

Does TDX send #UD if these instructions have the wrong CPL?  If the #VE came 
from user mode, we should send an appropriate signal instead.
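
Something along these lines near the top of the handler is what I mean
(sketch; the exact signal choice is the open question above):

        /* Only treat WBINVD/MONITOR/MWAIT #VEs as kernel no-ops at CPL 0. */
        if (user_mode(regs)) {
                force_sig(SIGSEGV);
                return -EFAULT;
        }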

> 
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Andi Kleen 
> ---
> 
> Changes since previous series:
> * Suppressed MWAIT feature as per Andi's comment.
> * Added warning debug log for MWAIT #VE exception.
> 
> arch/x86/kernel/tdx.c | 23 +++
> 1 file changed, 23 insertions(+)
> 
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e936b2f88bf6..fb7d22b846fc 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
> 
>setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> 
> +/* MWAIT is not supported in TDX platform, so suppress it */
> +setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> +
>tdg_get_info();
> 
>pv_ops.irq.safe_halt = tdg_safe_halt;
> @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs 
> *regs,
>case EXIT_REASON_EPT_VIOLATION:
>ve->instr_len = tdg_handle_mmio(regs, ve);
>break;
> +/*
> + * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
> + * Domain Extensions (Intel TDX) specification, sec 2.4,
> + * some instructions that unconditionally cause #VE (such as WBINVD,
> + * MONITOR, MWAIT) do not have corresponding TDCALL
> + * [TDG.VP.VMCALL ] leaves, since the TD has been designed
> + * with no deterministic way to confirm the result of those operations
> + * performed by the host VMM.  In those cases, the goal is for the TD
> + * #VE handler to increment the RIP appropriately based on the VE
> + * information provided via TDCALL.
> + */
> +case EXIT_REASON_WBINVD:
> +pr_warn_once("WBINVD #VE Exception\n");
> +case EXIT_REASON_MONITOR_INSTRUCTION:
> +/* Handle as nops. */
> +break;
> +case EXIT_REASON_MWAIT_INSTRUCTION:
> +/* MWAIT is supressed, not supposed to reach here. */
> +pr_warn("MWAIT unexpected #VE Exception\n");
> +return -EFAULT;
>default:
>pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>return -EFAULT;
> -- 
> 2.25.1
> 


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-26 Thread Andy Lutomirski
Sigh, cc linux-api, not linux-abi.

On Fri, Mar 26, 2021 at 4:12 PM Andy Lutomirski  wrote:
>
> Hi all-
>
> After some discussion on IRC, I have a proposal for a Linux ABI for
> using Intel AMX and other similar features.  It works like this:
>
> First, we make XCR0 dynamic.  This looks a lot like Keno's patch but
> with a different API, outlined below.  Different tasks can have
> different XCR0 values.  The default XCR0 for new tasks does not
> include big features like AMX.  XMM and YMM are still there.  The AVX2
> states are debatable -- see below.
>
> To detect features and control XCR0, we add some new arch_prctls:
>
> arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...);
>
> returns the set of XCR0 bits supported on the current kernel.
>
> arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...);
>
> returns 0.  See below.
>
> arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states,
> sigclear_states, 0);
>
> Sets xcr0.  All states are preallocated except that states in
> lazy_states may be unallocated in the kernel until used.  (Not
> supported at all in v1.  lazy_states & ~xcr0 != 0 is illegal.)  States
> in sigsave_states are saved in the signal frame.  States in
> sigclear_states are reset to the init state on signal delivery.
> States in sigsave_states are restored by sigreturn, and states not in
> sigsave_states are left alone by sigreturn.
>
> Optionally we do not support PKRU at all in XCR0 -- it doesn't make
> that much sense as an XSAVE feature, and I'm not convinced that trying
> to correctly context switch XINUSE[PKRU] is worthwhile.  I doubt we
> get it right today.
>
> Optionally we come up with a new format for new features in the signal
> frame, since the current format is showing its age.  Taking 8kB for a
> signal with AMX is one thing.  Taking another 8kB for a nested signal
> if AMX is not in use is worse.
>
> Optionally we make AVX-512 also default off, which fixes what is
> arguably a serious ABI break with AVX-512: lots of programs, following
> POSIX (!), seem to think that they know how much space to allocate
> for sigaltstack().   AVX-512 is too big.
>
> Thoughts?
>
> --Andy


Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-26 Thread Andy Lutomirski
Hi all-

After some discussion on IRC, I have a proposal for a Linux ABI for
using Intel AMX and other similar features.  It works like this:

First, we make XCR0 dynamic.  This looks a lot like Keno's patch but
with a different API, outlined below.  Different tasks can have
different XCR0 values.  The default XCR0 for new tasks does not
include big features like AMX.  XMM and YMM are still there.  The AVX2
states are debatable -- see below.

To detect features and control XCR0, we add some new arch_prctls:

arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...);

returns the set of XCR0 bits supported on the current kernel.

arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...);

returns 0.  See below.

arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states,
sigclear_states, 0);

Sets xcr0.  All states are preallocated except that states in
lazy_states may be unallocated in the kernel until used.  (Not
supported at all in v1.  lazy_states & ~xcr0 != 0 is illegal.)  States
in sigsave_states are saved in the signal frame.  States in
sigclear_states are reset to the init state on signal delivery.
States in sigsave_states are restored by sigreturn, and states not in
sigsave_states are left alone by sigreturn.
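
To make the intended usage concrete, a sketch from the application side
(none of these ARCH_* request codes exist and their values are invented;
bits 17/18 of XCR0 are the AMX tile bits):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_GET_XCR0_SUPPORT   0x1051          /* hypothetical */
#define ARCH_SET_XCR0           0x1052          /* hypothetical */

#define XCR0_AMX_BITS           (3UL << 17)     /* XTILECFG | XTILEDATA */

int main(void)
{
        long supported = syscall(SYS_arch_prctl, ARCH_GET_XCR0_SUPPORT, 0);

        if (supported < 0)
                return 1;

        if (supported & XCR0_AMX_BITS) {
                /*
                 * Enable AMX with no lazy states; save/restore the AMX
                 * state across signals and hand the handler the init state.
                 */
                if (syscall(SYS_arch_prctl, ARCH_SET_XCR0,
                            (unsigned long)supported, 0UL,
                            XCR0_AMX_BITS, XCR0_AMX_BITS, 0UL))
                        return 1;
        }

        printf("XCR0 support mask: %#lx\n", (unsigned long)supported);
        return 0;
}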

Optionally we do not support PKRU at all in XCR0 -- it doesn't make
that much sense as an XSAVE feature, and I'm not convinced that trying
to correctly context switch XINUSE[PKRU] is worthwhile.  I doubt we
get it right today.

Optionally we come up with a new format for new features in the signal
frame, since the current format is showing its age.  Taking 8kB for a
signal with AMX is one thing.  Taking another 8kB for a nested signal
if AMX is not in use is worse.

Optionally we make AVX-512 also default off, which fixes what is
arguably a serious ABI break with AVX-512: lots of programs, following
POSIX (!), seem to think that they know how much space to allocate
for sigaltstack().   AVX-512 is too big.

Thoughts?

--Andy


Re: Why does glibc use AVX-512?

2021-03-26 Thread Andy Lutomirski



> On Mar 26, 2021, at 2:11 PM, Florian Weimer  wrote:
> 
> * Andy Lutomirski:
> 
>>> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer  wrote:
>>> 
>>> I mean the immense slowdown you get if you use %xmm registers after
>>> their %ymm counterparts (doesn't have to be %zmm, that issue is
>>> present starting with AVX) and you have not issued VZEROALL or
>>> VZEROUPPER between the two uses.

It turns out that it’s not necessary to access the registers in question to 
trigger this behavior. You just need to make the CPU think it should penalize 
you. For example, LDMXCSR appears to be a legacy SSE insn for this purpose, and 
VLDMXCSR is an AVX insn for this purpose. I wouldn’t trust that using ymm9 
would avoid the penalty just because common sense says it should.

>> What kind of system has that problem?
> 
> It's a standard laptop after a suspend/resume cycle.  It's either a
> kernel or firmware bug.

What kernel version?  I think fixing the kernel makes more sense than fixing 
glibc.



Re: Why does glibc use AVX-512?

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer  wrote:
>
> * Andy Lutomirski:
>
> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
> >>   x86: Sporadic failures in tst-cpu-features-cpuinfo
> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
> >
> > It's worth noting that recent microcode updates have made RTM
> > considerably less likely to actually work on many parts.  It's
> > possible you should just disable it. :(
>
> Sorry, I'm not sure who should disable it.
>
> Let me sum up the situation:
>
> We have a request for a performance enhancement in glibc, so that
> applications can use it on server parts where RTM actually works.
>
> For CPUs that support AVX-512, we may be able to meet that with a
> > change that uses the new 256-bit registers, to avoid the %xmm
> transition penalty.  (This is the easy case, hopefully—there shouldn't
> be any frequency issues associated with that, and if the kernel
> doesn't optimize the context switch today, that's a nonissue as well.)

I would make sure that the transition penalty actually works the way
you think it does.  My general experience with the transition
penalties is that the CPU is rather more aggressive about penalizing
you than makes sense.

>
> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
> need a dynamic run-time check whether the string function is invoked
> in a transaction.  In that case, we need to use VZEROALL instead of
> VZEROUPPER.  (It's apparently too costly to issue VZEROALL
> unconditionally.)

So VZEROALL works in a transaction and VZEROUPPER doesn't?  That's bizarre.
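
(For reference, the dynamic check described above presumably boils down to
something like this, built with -mavx2 -mrtm -- sketch only:)

#include <immintrin.h>

/*
 * Clean up the upper vector state on the way out of an AVX2 string
 * function: per the bug above, VZEROUPPER aborts an RTM transaction, so
 * fall back to VZEROALL there; otherwise use the cheaper VZEROUPPER.
 */
static inline void zero_upper_vec_registers(void)
{
        if (_xtest())
                _mm256_zeroall();
        else
                _mm256_zeroupper();
}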


> All this needs to work transparently without user intervention.  We
> cannot require firmware upgrades to fix the incorrect RTM reporting
> issue (the bug I referenced).  I think we can require software updates
> which tell glibc when to use RTM-enabled string functions if the
> dynamic selection does not work (either for performance reasons, or
> because of the RTM reporting bug).
>
> I want to avoid a situation where one in eight processes fail to work
> correctly because the CPUID checks ran on CPU 0, where RTM is reported
> as available, and then we trap when executing XTEST on other CPUs.

What kind of system has that problem?  If RTM reports as available,
then it should work in the sense of not trapping.  (There is no
guarantee that transactions will *ever* complete, and that part is no
joke.)


Re: Why does glibc use AVX-512?

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
>
> * Andy Lutomirski:
>
> >> > AVX-512 cleared, and programs need to explicitly request enablement.
> >> > This would allow programs to opt into not saving/restoring across
> >> > signals or to save/restore in buffers supplied when the feature is
> >> > enabled.
> >>
> >> Isn't XSAVEOPT already able to handle that?
> >>
> >
> > Yes, but we need a place to put the data, and we need to acknowledge
> > that, with the current save-everything-on-signal model, the amount of
> > time and memory used is essentially unbounded.  This isn't great.
>
> The size has to have a known upper bound, but the save amount can be
> dynamic, right?
>
> How was the old lazy FPU initialization support for i386 implemented?
>
> >> Assuming you can make XSAVEOPT work for you on the kernel side, my
> >> instincts tell me that we should have markup for RTM, not for AVX-512.
> >> This way, we could avoid use of the AVX-512 registers and keep using
> >> VZEROUPPER, without run-time transaction checks, and deal with other
> >> idiosyncrasies needed for transaction support that users might
> >> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
> >> issues is currently stuck in some internal process issue on my end (or
> >> two, come to think of it), which I hope to untangle next month.
> >
> > Can you elaborate on the issue?
>
> This is the bug:
>
>   vzeroupper use in AVX2 multiarch string functions cause HTM aborts
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27457>
>
> Unfortunately we have a bug (outside of glibc) that makes me wonder if
> we can actually roll out RTM transaction checks (or any RTM
> instruction) on a large scale:
>
>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>

It's worth noting that recent microcode updates have made RTM
considerably less likely to actually work on many parts.  It's
possible you should just disable it. :(


Re: Why does glibc use AVX-512?

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 3:08 AM Borislav Petkov  wrote:
>
> On Thu, Mar 25, 2021 at 09:38:24PM -0700, Andy Lutomirski wrote:
> > I think we should seriously consider solutions in which, for new
> > tasks, XCR0 has new giant features (e.g. AMX) and possibly even
> > AVX-512 cleared, and programs need to explicitly request enablement.
>
> I totally agree with making this depend on an explicit user request,
> but...
>
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled.  This has all kinds of pros and cons, and I'm not sure it's a
> > great idea.  But, in the absence of some change to the ABI, the
> > default outcome is that, on AMX-enabled kernels on AMX-enabled
> > hardware, the signal frame will be more than 8kB, and this will affect
> > *every* signal regardless of whether AMX is in use.
>
> ... what's stopping the library from issuing that new ABI call before it
> starts the app and get  automatically enabled
> for everything by default?
>
> And then we'll get the lazy FPU thing all over again.

At the end of the day, it's not the kernel's job to make userspace be
sane or to make users or programmers make the right decisions.  But it
is our job to make sure that it's even possible to make the system
work well, and we are responsible for making sure that old binaries
continue to work, preferably well, on new kernels and new hardware.


Re: Why does glibc use AVX-512?

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 5:12 AM Florian Weimer  wrote:
>
> * Andy Lutomirski-alpha:
>
> > glibc appears to use AVX512F for memcpy by default.  (Unless
> > Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> > case.  I did some searching.)  The commit adding it refers to a 2016
> > email saying that it's 30% on KNL.
>
> As far as I know, glibc only does that on KNL, and there it is
> actually beneficial.  The relevant code is:
>
>   /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
>  if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
>  frequency if AVX512ER isn't available.  */
>   if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
> cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
>   |= bit_arch_Prefer_No_VZEROUPPER;
>   else
> cpu_features->preferred[index_arch_Prefer_No_AVX512]
>   |= bit_arch_Prefer_No_AVX512;
>
> So it's not just about Prefer_ERMS.

Phew.

>
> > AVX-512 cleared, and programs need to explicitly request enablement.
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled.
>
> Isn't XSAVEOPT already able to handle that?
>

Yes, but we need a place to put the data, and we need to acknowledge
that, with the current save-everything-on-signal model, the amount of
time and memory used is essentially unbounded.  This isn't great.

>
> There is a discussion about using the higher (AVX-512-only) %ymm
> registers, to avoid the %xmm transition penalty without the need for
> VZEROUPPER.  (VZEROUPPER is incompatible with RTM from a performance
> point of view.)  That would perhaps negatively impact XSAVEOPT.
>
> Assuming you can make XSAVEOPT work for you on the kernel side, my
> instincts tell me that we should have markup for RTM, not for AVX-512.
> This way, we could avoid use of the AVX-512 registers and keep using
> VZEROUPPER, without run-time transaction checks, and deal with other
> idiosyncrasies needed for transaction support that users might
> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
> issues is currently stuck in some internal process issue on my end (or
> two, come to think of it), which I hope to untangle next month.

Can you elaborate on the issue?


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 10:54 AM Len Brown  wrote:
>
> On Fri, Mar 26, 2021 at 11:48 AM Andy Lutomirski  wrote:
>
> > > I submit, that after the generic XFD support is in place,
> > > there is exactly 1 bit that needs to be flipped to enable
> > > user applications to benefit from AMX.
> >
> > The TILERELEASE opcode itself is rather longer than one bit, and the
> > supporting code to invoke it at the right time, to avoid corrupting
> > user state, and avoid causing performance regressions merely by
> > existing will be orders of magnitude more than 1 bit.  Of course, all
> > of this is zero bits in the current series because the code is
> > missing entirely.
>
> Please explain why the kernel must know about the TILERELEASE
> instruction in order for an AMX application to run properly.

I'm just repeating things already said, and this is getting
ridiculous.  TILERELEASE isn't needed for an AMX application to run
properly -- it's needed for the rest of the system to run properly, at
least according to Intel's published docs.  Quoting the current ISE
document:

3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE

System software may disable use of Intel AMX by clearing XCR0[18:17],
by clearing CR4.OSXSAVE, or by setting
IA32_XFD[18]. It is recommended that system software initialize AMX
state (e.g., by executing TILERELEASE)
before doing so. This is because maintaining AMX state in a
non-initialized state may have negative power and
performance implications.

Since you reviewed the patch set, I assume you are familiar with how
Linux manages XSTATE.  Linux does *not* eagerly load XSTATE on context
switch.  Instead, Linux loads XSTATE when the kernel needs it loaded
or before executing user code.  This means that the kernel can (and
does, and it's a performance win) execute kernel thread code and/or go
idle, *including long-term deep idle*, with user XSTATE loaded.
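
For concreteness, the missing piece starts with something like this
(sketch, not code from the series; the .byte form is only there in case
the assembler doesn't know the mnemonic yet):

/*
 * Put the AMX tile state back into its init state, per the ISE
 * recommendation quoted above, before leaving the task's XSTATE live
 * across idle or before changing XCR0[18:17] / IA32_XFD[18].
 */
static inline void tile_release(void)
{
        /* TILERELEASE: VEX.128.NP.0F38.W0 49 C0 */
        asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
}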


>
> > This isn't just about validation.  There's also ABI, performance, and
> > correctness.
>
> Thank you for agreeing that this is not about unvalidated features.
>
> > ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
> > told anyone in the kernel community until about 5 years after the
> > fact, and it's a bit late to revert AVX-512.  But we don't want to
> > enable AMX until the ABI has a reasonable chance of being settled.
> > Ditto for future features.  As it stands, if you xstate.enable some
> > 16MB feature, the system may well simply fail to boot as too many user
> > processes explode.
>
> At Dave's suggestion, we had a 64 *KB* sanity check on this path.
> Boris forced us to remove it, because we could not tell him
> how we chose the number 64.
>
> I would be delighted to see a check for 64 KB restored, and that
> it be a rejection, rather than warning.  At this point, as there is no way
> go down that path without manually modifying the kernel, it would
> devolve into a sanity check for a hardware (CPUID) bug.

This is nuts.  The ABI is ALREADY BROKEN.  How does picking a random
number quantifying additional breakage help?  We do not have a good
design for AVX-512 in Linux, we don't have a good design for AMX in
Linux, and we absolutely don't have a good design for the secret
feature we don't know about yet in Linux.


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Andy Lutomirski
On Fri, Mar 26, 2021 at 8:34 AM Len Brown  wrote:
>
> On Thu, Mar 25, 2021 at 9:42 PM Andy Lutomirski  wrote:
>
> > Regardless of what you call AMX, AMX requires kernel enabling.
>
> I submit, that after the generic XFD support is in place,
> there is exactly 1 bit that needs to be flipped to enable
> user applications to benefit from AMX.

The TILERELEASE opcode itself is rather longer than one bit, and the
supporting code to invoke it at the right time, to avoid corrupting
user state, and avoid causing performance regressions merely by
existing will be orders of magnitude more than 1 bit.  Of course, all
of this is zero bits in the current series because the code is
missing entirely.

To avoid email thread blowup:

> If there is a new requirement that the kernel cmdline not allow anything
> that a distro didn't explicitly validate, then about 99.9% of the kernel 
> cmdline
> options that exist today would need to be removed.
>
> Does such a requirement exist, or does it not?

This isn't just about validation.  There's also ABI, performance, and
correctness:

ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
told anyone in the kernel community until about 5 years after the
fact, and it's a bit late to revert AVX-512.  But we don't want to
enable AMX until the ABI has a reasonable chance of being settled.
Ditto for future features.  As it stands, if you xstate.enable some
16MB feature, the system may well simply fail to boot as too many user
processes explode.

Performance:

We *still* don't know the performance implications of leaving the AMX
features in use inappropriately.  Does it completely destroy idle?
Will it literally operate CPUs out of spec such that Intel's
reliability estimates will be invalidated?  (We had that with NVMe
APST.  Let's not repeat this with XSTATE.)  The performance impacts
and transitions for AVX-512 are, to put it charitably, forthcoming.

Correctness: PKRU via the kernel's normal XSAVE path would simply be
incorrect.  Do we really trust that this won't get repeated?  Also,
frankly, a command line option that may well break lots of userspace
but that we fully expect Intel to recommend setting is not a good
thing.

--Andy


Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-03-25 Thread Andy Lutomirski
I forgot to mention why I cc'd all you fine Xen folk:

On Thu, Mar 25, 2021 at 11:13 AM Andy Lutomirski  wrote:

>
> > } else if (IS_ENABLED(CONFIG_X86_32) &&
> >!onsigstack &&
> >regs->ss != __USER_DS &&

This bit here seems really dubious on Xen PV.  Honestly it seems
dubious everywhere, but especially on Xen PV.

--Andy


Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-03-25 Thread Andy Lutomirski
On Thu, Mar 25, 2021 at 11:54 AM Borislav Petkov  wrote:
>
> On Thu, Mar 25, 2021 at 11:13:12AM -0700, Andy Lutomirski wrote:
> > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> > index ea794a083c44..53781324a2d3 100644
> > --- a/arch/x86/kernel/signal.c
> > +++ b/arch/x86/kernel/signal.c
> > @@ -237,7 +237,8 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> > *regs, size_t frame_size,
> >   unsigned long math_size = 0;
> >   unsigned long sp = regs->sp;
> >   unsigned long buf_fx = 0;
> > - int onsigstack = on_sig_stack(sp);
> > + bool already_onsigstack = on_sig_stack(sp);
> > + bool entering_altstack = false;
> >   int ret;
> >
> >   /* redzone */
> > @@ -246,15 +247,25 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> > *regs, size_t frame_size,
> >
> >   /* This is the X/Open sanctioned signal stack switching.  */
> >   if (ka->sa.sa_flags & SA_ONSTACK) {
> > - if (sas_ss_flags(sp) == 0)
> > + /*
> > +  * This checks already_onsigstack via sas_ss_flags().
> > +  * Sensible programs use SS_AUTODISARM, which disables
> > +  * that check, and programs that don't use
> > +  * SS_AUTODISARM get compatible but potentially
> > +  * bizarre behavior.
> > +  */
> > + if (sas_ss_flags(sp) == 0) {
> >   sp = current->sas_ss_sp + current->sas_ss_size;
> > + entering_altstack = true;
> > + }
> >   } else if (IS_ENABLED(CONFIG_X86_32) &&
> > -!onsigstack &&
> > +!already_onsigstack &&
> >  regs->ss != __USER_DS &&
> >  !(ka->sa.sa_flags & SA_RESTORER) &&
> >  ka->sa.sa_restorer) {
> >   /* This is the legacy signal stack switching. */
> >   sp = (unsigned long) ka->sa.sa_restorer;
> > + entering_altstack = true;
> >   }
>
> What a mess this whole signal handling is. I need a course in signal
> handling to understand what's going on here...
>
> >
> >   sp = fpu__alloc_mathframe(sp, IS_ENABLED(CONFIG_X86_32),
> > @@ -267,8 +278,16 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> > *regs, size_t frame_size,
> >* If we are on the alternate signal stack and would overflow it, 
> > don't.
> >* Return an always-bogus address instead so we will die with SIGSEGV.
> >*/
> > - if (onsigstack && !likely(on_sig_stack(sp)))
> > + if (unlikely(entering_altstack &&
> > +  (sp <= current->sas_ss_sp ||
> > +   sp - current->sas_ss_sp > current->sas_ss_size))) {
>
> You could've simply done
>
> if (unlikely(entering_altstack && !on_sig_stack(sp)))
>
> here.

Nope.  on_sig_stack() is a horrible kludge and won't work here.  We
could have something like __on_sig_stack() or sp_is_on_sig_stack() or
something, though.
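
Roughly this, as an untested sketch (the name is made up):

static bool sp_is_on_sig_stack(unsigned long sp)
{
        /*
         * Pure bounds check against the registered altstack, without the
         * SS_AUTODISARM/SS_DISABLE logic that on_sig_stack() drags in.
         */
        return sp > current->sas_ss_sp &&
               sp - current->sas_ss_sp <= current->sas_ss_size;
}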

>
>
> > + if (show_unhandled_signals && printk_ratelimit()) {
> > + pr_info("%s[%d] overflowed sigaltstack",
> > + tsk->comm, task_pid_nr(tsk));
> > + }
>
> Why do you even wanna issue that? It looks like callers will propagate
> an error value up and people don't look at dmesg all the time.

I figure that the people whose programs spontaneously crash should get
a hint why if they look at dmesg.  Maybe the message should say
"overflowed sigaltstack -- try noavx512"?

We really ought to have a SIGSIGFAIL signal that's sent, double-fault
style, when we fail to send a signal.


Why does glibc use AVX-512?

2021-03-25 Thread Andy Lutomirski
Hi all-

glibc appears to use AVX512F for memcpy by default.  (Unless
Prefer_ERMS is default-on, but I genuinely can't tell if this is the
case.  I did some searching.)  The commit adding it refers to a 2016
email saying that it's a 30% win on KNL.  Unfortunately, AVX-512 is now
available in normal hardware, and the overhead from switching between
normal and AVX-512 code appears to vary from bad to genuinely
horrible.  And, once anything has used the high parts of YMM and/or
ZMM, those states tend to get stuck with XINUSE=1.
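
For what it's worth, the stuck-XINUSE effect is easy to observe from
user space with XGETBV(1) (available when CPUID.(EAX=0DH,ECX=1):EAX[2]
is set).  A rough, untested sketch; bits 5-7 are the AVX-512 state
components (opmask, ZMM_Hi256, Hi16_ZMM):

static inline unsigned long long xgetbv(unsigned int index)
{
        unsigned int eax, edx;

        asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (index));
        return eax | ((unsigned long long)edx << 32);
}

/* Nonzero once anything in this task has dirtied AVX-512 state. */
static inline int avx512_state_in_use(void)
{
        return !!(xgetbv(1) & 0xe0);
}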

I'm wondering whether glibc should stop using AVX-512 by default.

Meanwhile, some of you may have noticed a little ABI break we have.
On AVX-512 hardware, the size of a signal frame is unreasonably large,
and this is causing problems even for existing software that doesn't
use AVX-512.  Do any of you have any clever ideas for how to fix it?
We have some kernel patches around to try to fail more cleanly, but we
still fail.

I think we should seriously consider solutions in which, for new
tasks, XCR0 has new giant features (e.g. AMX) and possibly even
AVX-512 cleared, and programs need to explicitly request enablement.
This would allow programs to opt into not saving/restoring across
signals or to save/restore in buffers supplied when the feature is
enabled.  This has all kinds of pros and cons, and I'm not sure it's a
great idea.  But, in the absence of some change to the ABI, the
default outcome is that, on AMX-enabled kernels on AMX-enabled
hardware, the signal frame will be more than 8kB, and this will affect
*every* signal regardless of whether AMX is in use.

--Andy


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-25 Thread Andy Lutomirski
On Thu, Mar 25, 2021 at 3:59 PM Len Brown  wrote:
>
> On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner  wrote:
>
> > We won't enable features which are unknown ever. Keep that presilicon
> > test gunk where it belongs: In the Intel poison cabinet along with the
> > rest of the code which nobody ever want's to see.
>
> I agree, it would be irresponsible to enable unvalidated features by default,
> and pre-silicon "test gunk" should be kept out of the upstream kernel.
>
> This patch series is intended solely to enable fully validated
> hardware features,
> with product quality kernel support.
>
> The reason that the actual AMX feature isn't mentioned until the 16th
> patch in this series
> is because all of the patches before it are generic state save/restore 
> patches,
> that are not actually specific to AMX.
>
> We call AMX a "simple state feature" -- it actually requires NO KERNEL 
> ENABLING
> above the generic state save/restore to fully support userspace AMX
> applications.

Regardless of what you call AMX, AMX requires kernel enabling.
Specifically, it appears that leaving AMX in use in the XINUSE sense
degrades system performance and/or power.  And the way to handle that
in kernel (TILERELEASE) cannot possibly be construed as generic.
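
For reference, that handling boils down to an AMX-specific instruction;
as a sketch, spelled with raw opcode bytes since older assemblers don't
know the mnemonic:

/* Force the tile registers back into their init state. */
static inline void tilerelease(void)
{
        asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
}
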
Here's a little summary of XSTATE features that have failed to be
simple:

 - XMM: seemed simple, but the performance issues switching between
legacy and VEX are still unresolved.  And they affect the kernel, and
people have noticed and complained.

 - ZMM and the high parts of X/YMM: Intel *still* hasn't documented
the actual performance rules.  Reports from people trying to reverse
engineer it suggest that it's horrible on all but the very newest
chips.  For some reason, glibc uses it.  And it broke sigaltstack.  I
have NAKked in-kernel AVX-512 usage until Intel answers a long list of
questions.  No progress yet.

 - PKRU: makes no sense as an XSAVE feature.

 - AMX: XFD, as I understand it, has virtualization problems.  And the
TILERELEASE issue is unresolved.

Intel's track record here is poor.  If you want the kernel to trust
Intel going forward, Intel needs to build trust first.

> So after the generic state management support, the kernel enabling of AMX
> is not actually required to run applications.  Just like when a new 
> instruction
> is added that re-uses existing state -- the application or library can check
> CPUID and just use it.  It is a formality (perhaps an obsolete one), that
> we add every feature flag to /proc/cpuinfo for the "benefit" of userspace.

Even this isn't true.  AVX-512 already Broke ABI (tm).  Sorry for the
big evil words, but existing programs that worked on Linux stopped
working due to kernel enablement of AVX-512.  AMX has the same
problem, except more than an order of magnitude worse.  No credible
resolution has shown up, and the only remotely credible idea anyone
has mentioned is to actually mask AMX in XCR0 until an application
opts in to an as-yet-undetermined new ABI.

--Andy


Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-03-25 Thread Andy Lutomirski
On Mon, Mar 15, 2021 at 11:57 PM Chang S. Bae  wrote:
>
> The kernel pushes context on to the userspace stack to prepare for the
> user's signal handler. When the user has supplied an alternate signal
> stack, via sigaltstack(2), it is easy for the kernel to verify that the
> stack size is sufficient for the current hardware context.
>
> Check if writing the hardware context to the alternate stack will exceed
> its size. If yes, then instead of corrupting user-data and proceeding with
> the original signal handler, an immediate SIGSEGV signal is delivered.
>
> Instead of calling on_sig_stack(), directly check whether the new stack
> pointer is within the bounds.
>
> While the kernel allows new source code to discover and use a sufficient
> alternate signal stack size, this check is still necessary to protect
> binaries with insufficient alternate signal stack size from data
> corruption.

This patch results in excessively complicated control and data flow.

> -   int onsigstack = on_sig_stack(sp);
> +   bool onsigstack = on_sig_stack(sp);

Here onsigstack means "we were already using the altstack".

> int ret;
>
> /* redzone */
> @@ -251,8 +251,11 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> *regs, size_t frame_size,
>
> /* This is the X/Open sanctioned signal stack switching.  */
> if (ka->sa.sa_flags & SA_ONSTACK) {
> -   if (sas_ss_flags(sp) == 0)
> +   if (sas_ss_flags(sp) == 0) {
> sp = current->sas_ss_sp + current->sas_ss_size;
> +   /* On the alternate signal stack */
> +   onsigstack = true;
> +   }

But now onsigstack is also true if we are using the legacy path to
*enter* the altstack.  So now it's (was on altstack) || (entering
altstack via legacy path).

> } else if (IS_ENABLED(CONFIG_X86_32) &&
>!onsigstack &&
>regs->ss != __USER_DS &&
> @@ -272,7 +275,8 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> *regs, size_t frame_size,
>  * If we are on the alternate signal stack and would overflow it, 
> don't.
>  * Return an always-bogus address instead so we will die with SIGSEGV.
>  */
> -   if (onsigstack && !likely(on_sig_stack(sp)))
> +   if (onsigstack && unlikely(sp <= current->sas_ss_sp ||
> +  sp - current->sas_ss_sp > 
> current->sas_ss_size))

And now we fail if ((was on altstack) || (entering altstack via legacy
path)) && (new sp is out of bounds).


The condition we actually want is that, if we are entering the
altstack and we don't fit, we should fail.  This is tricky because of
the autodisarm stuff and the possibility of nonlinear stack segments,
so it's not even clear to me exactly what we should be doing.  I
propose:




> return (void __user *)-1L;

Can we please log something (if (show_unhandled_signals ||
printk_ratelimit()) that says that we overflowed the altstack?

How about:

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ea794a083c44..53781324a2d3 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -237,7 +237,8 @@ get_sigframe(struct k_sigaction *ka, struct
pt_regs *regs, size_t frame_size,
 unsigned long math_size = 0;
 unsigned long sp = regs->sp;
 unsigned long buf_fx = 0;
-int onsigstack = on_sig_stack(sp);
+bool already_onsigstack = on_sig_stack(sp);
+bool entering_altstack = false;
 int ret;

 /* redzone */
@@ -246,15 +247,25 @@ get_sigframe(struct k_sigaction *ka, struct
pt_regs *regs, size_t frame_size,

 /* This is the X/Open sanctioned signal stack switching.  */
 if (ka->sa.sa_flags & SA_ONSTACK) {
-if (sas_ss_flags(sp) == 0)
+/*
+ * This checks already_onsigstack via sas_ss_flags().
+ * Sensible programs use SS_AUTODISARM, which disables
+ * that check, and programs that don't use
+ * SS_AUTODISARM get compatible but potentially
+ * bizarre behavior.
+ */
+if (sas_ss_flags(sp) == 0) {
 sp = current->sas_ss_sp + current->sas_ss_size;
+entering_altstack = true;
+}
 } else if (IS_ENABLED(CONFIG_X86_32) &&
-   !onsigstack &&
+   !already_onsigstack &&
regs->ss != __USER_DS &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer) {
 /* This is the legacy signal stack switching. */
 sp = (unsigned long) ka->sa.sa_restorer;
+entering_altstack = true;
 }

 sp = fpu__alloc_mathframe(sp, IS_ENABLED(CONFIG_X86_32),
@@ -267,8 +278,16 @@ get_sigframe(struct k_sigaction *ka, struct
pt_regs *regs, size_t frame_size,
  * If we are on the alternate signal stack and would overflow it, don't.
  * Return an always-bogus address instead so we will die with SIGSEGV.
  */
-if (onsigstack && !likely(on_sig_stack(sp)))

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski


> On Mar 24, 2021, at 2:58 PM, Dave Hansen  wrote:
> 
> On 3/24/21 2:42 PM, Andy Lutomirski wrote:
>>>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>>>> 
>>>> There is no reason we have to do this for new states. Arguably we
>>>> shouldn’t for AMX to avoid yet another altstack explosion.
>>> The thing that's worried me is that the list of OS-enabled states is
>>> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
>>> think that apps will see AMX enabled with XGETBV and then assume that
>>> it's on the signal stack.
>>> 
>>> Please tell me I'm being too paranoid.  If we can break this
>>> assumption, it would get rid of a lot of future pain.
>> There are no AMX apps. I sure hope that there are no apps that
>> enumerate xfeatures with CPUID and try to decode the mess in the
>> signal stack.
> 
> I don't think they quite need to decode it in order to be screwed over a
> bit.  For instance, I don't think it's too crazy if someone did:
> 
>xcr0 = xgetbv(0);
>xrstor(xcr0, _stack[something]);
>// change some registers
>xsave(xcr0, _stack[something]);
> 
> The XRSTOR would work fine, but the XSAVE would overflow the stack
> because it would save the AMX state.  It also *looks* awfully benign.
> This is true even if the silly signal handler didn't know about AMX at
> *ALL*.
> 
> A good app would have checked that the xfeatures field in the header
> matched xcr0.

Ugh.

On the other hand, if we require a syscall to flip the AMX bit in XCR0, we 
could maybe get away with saying that programs that flip the bit and don’t 
understand the new ABI get to keep both pieces.

I don’t love futzing with the ABI like this, but AMX is really only barely 
compatible with everything before it.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski


> On Mar 24, 2021, at 2:30 PM, Dave Hansen  wrote:
> 
> On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>> 
>> There is no reason we have to do this for new states. Arguably we
>> shouldn’t for AMX to avoid yet another altstack explosion.
> 
> The thing that's worried me is that the list of OS-enabled states is
> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
> think that apps will see AMX enabled with XGETBV and then assume that
> it's on the signal stack.
> 
> Please tell me I'm being too paranoid.  If we can break this assumption,
> it would get rid of a lot of future pain.

There are no AMX apps. I sure hope that there are no apps that enumerate 
xfeatures with CPUID and try to decode the mess in the signal stack.

I do think we need to save AMX state *somewhere* if a signal happens unless 
userspace opts out, but I don’t think it needs to be in the nominally expected 
spot.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski




> On Mar 24, 2021, at 2:09 PM, Len Brown  wrote:
> 
> On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2  
> wrote:
> 
>>> IMO, the problem with AVX512 state
>>> is that we guaranteed it will be zero for XINUSE=0.
>>> That means we have to write 0's on saves.
> 
>> why "we have to write 0's on saves" when XINUSE=0.
>> 
>> Since, per the SDM, if XINUSE=0, XSAVES will *not* save the data and
>> the xstate_bv bit is 0; if XSAVE is used, it needs to save the state but
>> the xstate_bv bit is also 0.
>>>  It would be better
>>> to be able to skip the write -- even if we can't save the space
>>> we can save the data transfer.  (This is what we did for AMX).
>> With the XFD feature, when XFD=1, the XSAVE command still has to save INIT state
>> to the area. So it seems that with XINUSE=0 and XFD=1, the XSAVE(S) commands
>> behave the same, in that both can help save the data transfer.
> 
> Hi Jing, Good observation!
> 
> There are 3 cases.
> 
> 1. Task context switch save into the context switch buffer.
> Here we use XSAVES, and as you point out, XSAVES includes
> the compaction optimization feature tracked by XINUSE.
> So when AMX is enabled, but clean, XSAVES doesn't write zeros.
> Further, it omits the buffer space for AMX in the destination altogether!
> However, since XINUSE=1 is possible, we have to *allocate* a buffer
> large enough to handle the dirty data for when XSAVES can not
> employ that optimization.
> 
> 2. Entry into user signal handler saves into the user space sigframe.
> Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
> and for AVX512, we save neither time nor space.
> 
> My understanding is that for application compatibility, we can *not* compact
> the destination buffer that user-space sees.  This is because existing code
> may have adopted fixed size offsets.  (which is unfortunate).
> 
> And so, for AVX512, we both reserve the space, and we write zeros
> for clean AVX512 state.
> 
> For AMX, we must still reserve the space, but we are not going to write zeros
> for clean state.  We do this in software by checking XINUSE=0, and clearing
> the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
> writing the zeros, even though we can't compress the space.

Why?

> 
> 3. user space always uses fully uncompacted XSAVE buffers.
> 

There is no reason we have to do this for new states. Arguably we shouldn’t for 
AMX to avoid yet another altstack explosion.

Re: Is s390's new generic-using syscall code actually correct?

2021-03-24 Thread Andy Lutomirski
On Wed, Mar 24, 2021 at 10:39 AM Vasily Gorbik  wrote:
>
> Hi Andy,
>
> On Sat, Mar 20, 2021 at 08:48:34PM -0700, Andy Lutomirski wrote:
> > Hi all-
> >
> > I'm working on my kentry patchset, and I encountered:
> >
> > commit 56e62a73702836017564eaacd5212e4d0fa1c01d
> > Author: Sven Schnelle 
> > Date:   Sat Nov 21 11:14:56 2020 +0100
> >
> > s390: convert to generic entry
> >
> > As part of this work, I was cleaning up the generic syscall helpers,
> > and I encountered the goodies in do_syscall() and __do_syscall().
> >
> > I'm trying to wrap my head around the current code, and I'm rather confused.
> >
> > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just
> > the syscall exit work.  So a do_syscall() that gets called twice will
> > do the loopy part of the exit work (e.g. signal handling) twice.  Is
> > this intentional?  If so, why?
> >
> > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed
> > to work.  Looking at the code in Linus' tree, if a signal is pending
> > and a syscall returns -ERESTARTSYS, the syscall will return back to
> > do_syscall().  The work (as in (1)) gets run, calling do_signal(),
> > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART.
> > Presumably it will also push the signal frame onto the stack and aim
> > the return address at the svc instruction mentioned in the commit
> > message from "s390: convert to generic entry".  Then __do_syscall()
> > will turn interrupts back on and loop right back into do_syscall().
> > That seems incorrect.
> >
> > Can you enlighten me?  My WIP tree is here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry
> >
>
> For all the details to that change we'd have to wait for Sven, who is back
> next week.
>
> > Here are my changes to s390, and I don't think they're really correct:
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c
>
> Couple of things: syscall_exit_to_user_mode_prepare is static,
> and there is another code path in arch/s390/kernel/traps.c using
> enter_from_user_mode/exit_to_user_mode.
>
> Anyhow I gave your branch a spin and got few new failures on strace test
> suite, in particular on restart_syscall test. I'll try to find time to
> look into details.

I refreshed the branch, but I confess I haven't compile tested it. :)

I would guess that the new test case failures are a result of the
buggy syscall restart logic.  I think that all of the "restart" cases
except execve() should just be removed.  Without my patch, I suspect
that signal delivery with -ERESTARTSYS would create the signal frame,
do an accidental "restarted" syscall that was a no-op, and then
deliver the signal.  With my patch, it may simply repeat the original
interrupted syscall forever.

--Andy


Re: [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF

2021-03-24 Thread Andy Lutomirski
On Wed, Mar 24, 2021 at 10:04 AM Brijesh Singh  wrote:
>
> If hardware detects an RMP violation, it will raise a page-fault exception
> with the RMP bit set. To help with debugging, dump the RMP entry of the faulting
> address.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Joerg Roedel 
> Cc: "H. Peter Anvin" 
> Cc: Tony Luck 
> Cc: Dave Hansen 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Paolo Bonzini 
> Cc: Tom Lendacky 
> Cc: David Rientjes 
> Cc: Sean Christopherson 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/mm/fault.c | 75 +
>  1 file changed, 75 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f39b551f89a6..7605e06a6dd9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -31,6 +31,7 @@
>  #include  /* VMALLOC_START, ...   */
>  #include   /* kvm_handle_async_pf  */
>  #include   /* fixup_vdso_exception()   */
> +#include/* lookup_rmpentry ...  */
>
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -147,6 +148,76 @@ is_prefetch(struct pt_regs *regs, unsigned long 
> error_code, unsigned long addr)
>  DEFINE_SPINLOCK(pgd_lock);
>  LIST_HEAD(pgd_list);
>
> +static void dump_rmpentry(struct page *page, rmpentry_t *e)
> +{
> +   unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;
> +
> +   pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d 
> gpa=0x%lx asid=%d "
> +   "vmsa=%d validated=%d]\n", paddr, rmpentry_assigned(e), 
> rmpentry_immutable(e),
> +   rmpentry_pagesize(e), rmpentry_gpa(e), rmpentry_asid(e), 
> rmpentry_vmsa(e),
> +   rmpentry_validated(e));
> +   pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", paddr, e->high, 
> e->low);
> +}
> +
> +static void show_rmpentry(unsigned long address)
> +{
> +   struct page *page = virt_to_page(address);

This is an error path, and I don't think you have any particular
guarantee that virt_to_page(address) is valid.  Please add appropriate
validation or use one of the slow lookup helpers.
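
One option, as an untested sketch (the rest of the function unchanged):

static void show_rmpentry(unsigned long address)
{
        struct page *page;

        /*
         * The faulting address is untrusted; make sure it really is a
         * valid direct-map address before handing it to virt_to_page().
         */
        if (!virt_addr_valid(address))
                return;

        page = virt_to_page(address);
        /* ... look up and dump the RMP entry as before ... */
}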


Is s390's new generic-using syscall code actually correct?

2021-03-20 Thread Andy Lutomirski
Hi all-

I'm working on my kentry patchset, and I encountered:

commit 56e62a73702836017564eaacd5212e4d0fa1c01d
Author: Sven Schnelle 
Date:   Sat Nov 21 11:14:56 2020 +0100

s390: convert to generic entry

As part of this work, I was cleaning up the generic syscall helpers,
and I encountered the goodies in do_syscall() and __do_syscall().

I'm trying to wrap my head around the current code, and I'm rather confused.

1. syscall_exit_to_user_mode_work() does *all* the exit work, not just
the syscall exit work.  So a do_syscall() that gets called twice will
do the loopy part of the exit work (e.g. signal handling) twice.  Is
this intentional?  If so, why?

2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed
to work.  Looking at the code in Linus' tree, if a signal is pending
and a syscall returns -ERESTARTSYS, the syscall will return back to
do_syscall().  The work (as in (1)) gets run, calling do_signal(),
which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART.
Presumably it will also push the signal frame onto the stack and aim
the return address at the svc instruction mentioned in the commit
message from "s390: convert to generic entry".  Then __do_syscall()
will turn interrupts back on and loop right back into do_syscall().
That seems incorrect.

Can you enlighten me?  My WIP tree is here:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry

Here are my changes to s390, and I don't think they're really correct:


https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-20 Thread Andy Lutomirski
On Sat, Mar 20, 2021 at 3:13 PM Thomas Gleixner  wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return;
> > +
> > + if (unlikely(prev->state_mask != next->state_mask))
> > + xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>system software initialize AMX state (e.g., by executing TILERELEASE)
>before doing so. This is because maintaining AMX state in a
>non-initialized state may have negative power and performance
>implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I have an obnoxious question: do we really want to use the XFD mechanism?

Right now, glibc, and hence most user space code, blindly uses
whatever random CPU features are present for no particularly good
reason, which means that all these features get stuck in the XINUSE=1
state, even if there is no code whatsoever in the process that
benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
be much worse if this happens.

We *could* instead use XCR0 and require an actual syscall to enable
it.  We could even then play games like requiring whomever enables the
feature to allocate memory for the state save area for signals, and
signal delivery could save the state and disable the feature, thus
preventing the signal frame from blowing up to 8 or 12 or who knows
how many kB.
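
To make the shape of that concrete, the opt-in could look something
like this from the application side (the command number and names are
entirely made up -- nothing like this exists today):

#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_XSTATE_ENABLE      0x1040          /* hypothetical arch_prctl command */
#define XFEATURE_MASK_AMX       (3ULL << 17)    /* XTILECFG + XTILEDATA in XCR0 */

/*
 * Ask the kernel to set the AMX bits in XCR0 for this process and, in
 * this scheme, take responsibility for the larger signal state.
 */
static int enable_amx(void)
{
        return syscall(SYS_arch_prctl, ARCH_XSTATE_ENABLE, XFEATURE_MASK_AMX);
}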

--Andy


Re: [PATCH] Document that PF_KTHREAD _is_ ABI

2021-03-20 Thread Andy Lutomirski
> On Mar 20, 2021, at 9:31 AM, Alexey Dobriyan  wrote:
>
> PF_KTHREAD value is visible via field number 9 of /proc/*/stat
>
>$ sudo cat /proc/2/stat
>2 (kthreadd) S 0 0 0 0 -1 2129984 0 ...
>  ^^^
>
> It is used by at least systemd to check for kernel-threadness:
> https://github.com/systemd/systemd/blob/main/src/basic/process-util.c#L354
> src/basic/process-util.c:is_kernel_thread()

Eww.

Could we fix it differently and more permanently by modifying the proc
code to display the values systemd expects?
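
Something like this is what I have in mind -- a sketch only, with a
made-up helper, not a real patch against fs/proc/array.c:

/*
 * Report only the PF_* bits that have become de-facto ABI in
 * /proc/<pid>/stat field 9, translated to fixed output values, so the
 * internal flag bits can keep being shuffled at whim.
 */
static unsigned int stat_visible_flags(const struct task_struct *task)
{
        unsigned int out = 0;

        if (task->flags & PF_KTHREAD)
                out |= 0x00200000;      /* the value userspace has learned to expect */

        return out;
}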


>
> It means that the value can't be changed despite the perceived notion that
> task_struct flags are internal to the kernel and can be shuffled at whim.
>
> Formally, _all_ struct task_struct::flags PF_* values are kernel ABI
> which is a disaster.
>
> I hope we can mask everything but a few flags and hope for the best :^)
>
> Note for beginner Linux programmers:
> every other way you find on the interwebs and Stack Overflow
> for checking if a pid belongs to a kernel thread is broken one way or
> another.
>
> Signed-off-by: Alexey Dobriyan 
> ---
>
> include/linux/sched.h |3 +++
> 1 file changed, 3 insertions(+)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1566,6 +1566,9 @@ extern struct pid *cad_pid;
> #define PF_MEMALLOC_NOIO    0x00080000  /* All allocation requests will inherit GFP_NOIO */
> #define PF_LOCAL_THROTTLE   0x00100000  /* Throttle writes only against the bdi I write to,
>                                          * I am cleaning dirty pages from some other bdi. */
> +/*
> + * PF_KTHREAD is part of kernel ABI, visible via value #9 in /proc/$pid/stat
> + */
> #define PF_KTHREAD          0x00200000  /* I am a kernel thread */
> #define PF_RANDOMIZE        0x00400000  /* Randomize virtual address space */
> #define PF_SWAPWRITE        0x00800000  /* Allowed to write to swap */


[tip: x86/misc] selftests/x86: Add a missing .note.GNU-stack section to thunks_32.S

2021-03-18 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/misc branch of tip:

Commit-ID: f706bb59204ba1c47e896b456c97977fc97b7964
Gitweb:
https://git.kernel.org/tip/f706bb59204ba1c47e896b456c97977fc97b7964
Author:Andy Lutomirski 
AuthorDate:Thu, 04 Mar 2021 09:01:55 -08:00
Committer: Borislav Petkov 
CommitterDate: Thu, 18 Mar 2021 11:05:14 +01:00

selftests/x86: Add a missing .note.GNU-stack section to thunks_32.S

test_syscall_vdso_32 ended up with an executable stack because the asm
was missing the annotation that says that it is modern and doesn't need
an executable stack. Add the annotation.

This was missed in commit aeaaf005da1d ("selftests/x86: Add missing
.note.GNU-stack sections").

Fixes: aeaaf005da1d ("selftests/x86: Add missing .note.GNU-stack sections")
Signed-off-by: Andy Lutomirski 
Signed-off-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/487ed5348a43c031b816fa7e9efedb75dc324299.1614877299.git.l...@kernel.org
---
 tools/testing/selftests/x86/thunks_32.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/x86/thunks_32.S 
b/tools/testing/selftests/x86/thunks_32.S
index a71d92d..f3f56e6 100644
--- a/tools/testing/selftests/x86/thunks_32.S
+++ b/tools/testing/selftests/x86/thunks_32.S
@@ -45,3 +45,5 @@ call64_from_32:
ret
 
 .size call64_from_32, .-call64_from_32
+
+.section .note.GNU-stack,"",%progbits


[PATCH v4 8/9] kentry: Check that syscall entries and syscall exits match

2021-03-17 Thread Andy Lutomirski
If arch code calls the wrong kernel entry helpers, syscall entries and
exits can get out of sync.  Add a new field to task_struct to track the
syscall state and validate that it transitions correctly.

Signed-off-by: Andy Lutomirski 
---
 include/linux/sched.h |  4 
 init/init_task.c  |  8 
 kernel/entry/common.c | 24 +++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..95d6d8686d98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1368,6 +1368,10 @@ struct task_struct {
struct llist_head   kretprobe_instances;
 #endif
 
+#ifdef CONFIG_DEBUG_ENTRY
+   bool kentry_in_syscall;
+#endif
+
/*
 * New fields for task_struct should be added above here, so that
 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 8a992d73e6fb..de4fdaac4e8b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -212,6 +212,14 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
.seccomp= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEBUG_ENTRY
+   /*
+* The init task, and kernel threads in general, are considered
+* to be "in a syscall".  This way they can execve() and then exit
+* the supposed syscall that they were in to go to user mode.
+*/
+   .kentry_in_syscall = true,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8ba774184e00..152e7546be16 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -141,11 +141,21 @@ static long syscall_trace_enter(struct pt_regs *regs, 
long syscall,
 
 long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
-   unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+   unsigned long work;
+
+   if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+   DEBUG_ENTRY_WARN_ONCE(
+   current->kentry_in_syscall,
+   "entering syscall %ld while already in a syscall",
+   syscall);
+   current->kentry_in_syscall = true;
+   }
 
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
lockdep_assert_irqs_enabled();
 
+   work = READ_ONCE(current_thread_info()->syscall_work);
+
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);
 
@@ -156,11 +166,16 @@ long kentry_syscall_begin(struct pt_regs *regs, long 
syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
instrumentation_begin();
+
 #ifdef CONFIG_DEBUG_ENTRY
DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
  "__exit_to_user_mode called at wrong kentry cpu 
depth (%u)",
  this_cpu_read(kentry_cpu_depth));
+
+   DEBUG_ENTRY_WARN_ONCE(current->kentry_in_syscall,
+ "exiting to user mode while in syscall context");
 #endif
+
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
instrumentation_end();
@@ -317,6 +332,13 @@ void kentry_syscall_end(struct pt_regs *regs)
 */
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
+
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(!current->kentry_in_syscall,
+ "exiting syscall %lu without entering first", nr);
+
+   current->kentry_in_syscall = 0;
+#endif
 }
 
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
-- 
2.30.2



[PATCH v4 9/9] kentry: Verify kentry state in instrumentation_begin/end()

2021-03-17 Thread Andy Lutomirski
Calling instrumentation_begin() and instrumentation_end() when kentry
thinks the CPU is in user mode is an error.  Verify the kentry state when
instrumentation_begin/end() are called.

Add _nocheck() variants to skip verification to avoid WARN() generating
extra kentry warnings.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/traps.c |  4 ++--
 include/asm-generic/bug.h   |  8 
 include/linux/instrumentation.h | 25 -
 kernel/entry/common.c   |  7 +++
 4 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index be924180005a..983e4be5fdcb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -229,7 +229,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
/*
 * All lies, just get the WARN/BUG out.
 */
-   instrumentation_begin();
+   instrumentation_begin_nocheck();
/*
 * Since we're emulating a CALL with exceptions, restore the interrupt
 * state to what it was at the exception site.
@@ -242,7 +242,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
}
if (regs->flags & X86_EFLAGS_IF)
raw_local_irq_disable();
-   instrumentation_end();
+   instrumentation_end_nocheck();
 
return handled;
 }
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 76a10e0dca9f..fc360c463a99 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -85,18 +85,18 @@ void warn_slowpath_fmt(const char *file, const int line, 
unsigned taint,
   const char *fmt, ...);
 #define __WARN()   __WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {  \
-   instrumentation_begin();\
+   instrumentation_begin_nocheck();\
warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);  \
-   instrumentation_end();  \
+   instrumentation_end_nocheck();  \
} while (0)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()   __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {  \
-   instrumentation_begin();\
+   instrumentation_begin_nocheck();\
__warn_printk(arg); \
__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
-   instrumentation_end();  \
+   instrumentation_end_nocheck();  \
} while (0)
 #define WARN_ON_ONCE(condition) ({ \
int __ret_warn_on = !!(condition);  \
diff --git a/include/linux/instrumentation.h b/include/linux/instrumentation.h
index 93e2ad67fc10..cdf80454f92a 100644
--- a/include/linux/instrumentation.h
+++ b/include/linux/instrumentation.h
@@ -4,14 +4,21 @@
 
 #if defined(CONFIG_DEBUG_ENTRY) && defined(CONFIG_STACK_VALIDATION)
 
+extern void kentry_assert_may_instrument(void);
+
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({ \
-   asm volatile("%c0: nop\n\t" 
\
+#define instrumentation_begin_nocheck() ({ \
+   asm volatile("%c0: nop\n\t" \
 ".pushsection .discard.instr_begin\n\t"\
 ".long %c0b - .\n\t"   \
 ".popsection\n\t" : : "i" (__COUNTER__));  \
 })
 
+#define instrumentation_begin() ({ \
+   instrumentation_begin_nocheck();\
+   kentry_assert_may_instrument(); \
+})
+
 /*
  * Because instrumentation_{begin,end}() can nest, objtool validation considers
  * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
@@ -43,15 +50,23 @@
  * To avoid this, have _end() be a NOP instruction, this ensures it will be
  * part of the condition block and does not escape.
  */
-#define instrumentation_end() ({   \
+#define instrumentation_end_nocheck() ({   \
asm volatile("%c0: nop\n\t" \
 ".pushsection .discard.instr_end\n\t"  \
 ".long %c0b - .\n\t"   \
   

[PATCH v4 7/9] kentry: Add debugging checks for proper kentry API usage

2021-03-17 Thread Andy Lutomirski
It's quite easy to mess up kentry calls.  Add debugging checks that kentry
transitions to and from user mode match up and that kentry_nmi_enter() and
kentry_nmi_exit() match up.

Checking full matching of kentry_enter() with kentry_exit() needs
per-task state.

Signed-off-by: Andy Lutomirski 
---
 kernel/entry/common.c | 81 ++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fbe1c109877..8ba774184e00 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,9 +11,65 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+/*
+ * kentry_cpu_depth is 0 in user mode, 1 in normal kernel mode, and
+ * 1 + n * KENTRY_DEPTH_NMI in kentry_nmi_enter() mode.  We can't
+ * use a percpu variable to match up kentry_enter() from kernel mode
+ * with the corresponding kentry_exit() because tasks may schedule inside
+ * nested kentry_enter() regions.
+ */
+#define KENTRY_CPU_DEPTH_NMI   1024UL
+
+#ifdef CONFIG_DEBUG_ENTRY
+
+/*
+ * Extra safe WARN_ONCE.  Per-arch optimized WARN_ONCE() implementations
+ * might go through the low-level entry and kentry code even before noticing
+ * that the warning already fired, which could result in recursive warnings.
+ * This carefully avoids any funny business once a given warning has fired.
+ */
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) ({ \
+   static bool __section(".data.once") __warned;   \
+   int __ret_warn_once = !!(condition);\
+   \
+   if (unlikely(__ret_warn_once && !READ_ONCE(__warned))) {\
+   WRITE_ONCE(__warned, true); \
+   WARN(1, format);\
+   }   \
+   unlikely(__ret_warn_once);  \
+})
+
+
+static DEFINE_PER_CPU(unsigned int, kentry_cpu_depth) = 1UL;
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n)
+{
+   this_cpu_add(kentry_cpu_depth, n);
+}
+
+static void kentry_cpu_depth_check(unsigned int n)
+{
+   DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) < n, "kentry 
depth underflow");
+}
+
+static __always_inline void kentry_cpu_depth_sub(unsigned int n)
+{
+   this_cpu_sub(kentry_cpu_depth, n);
+}
+#else
+
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
+static void kentry_cpu_depth_check(unsigned int n) {}
+static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
+
+#endif
+
 /* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
+   kentry_cpu_depth_add(1);
arch_check_user_regs(regs);
lockdep_hardirqs_off(CALLER_ADDR0);
 
@@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct 
pt_regs *regs)
 
instrumentation_begin();
trace_hardirqs_off_finish();
+
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(
+   this_cpu_read(kentry_cpu_depth) != 1,
+   "kentry: __enter_from_user_mode() called while kentry thought 
the CPU was in the kernel (%u)",
+   this_cpu_read(kentry_cpu_depth));
+#endif
+
instrumentation_end();
 }
 
@@ -92,6 +156,11 @@ long kentry_syscall_begin(struct pt_regs *regs, long 
syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
instrumentation_begin();
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
+ "__exit_to_user_mode called at wrong kentry cpu 
depth (%u)",
+ this_cpu_read(kentry_cpu_depth));
+#endif
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
instrumentation_end();
@@ -99,6 +168,7 @@ static __always_inline void __exit_to_user_mode(void)
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
+   kentry_cpu_depth_sub(1);
 }
 
 /* Workaround to allow gradual conversion of architecture code */
@@ -346,7 +416,12 @@ noinstr void kentry_exit(struct pt_regs *regs, 
kentry_state_t state)
/* Check whether this returns to user mode */
if (user_mode(regs)) {
kentry_exit_to_user_mode(regs);
-   } else if (!regs_irqs_disabled(regs)) {
+   return;
+   }
+
+   kentry_cpu_depth_check(1);
+
+   if (!regs_irqs_disabled(regs)) {
/*
 * If RCU was not watching on entry this needs to be done
 * carefully and needs the same ordering of lockdep/tracing
@@ -385,6 +460,8 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs 
*regs)
 
irq_state.lockdep = lockdep

[PATCH v4 4/9] kentry: Simplify the common syscall API

2021-03-17 Thread Andy Lutomirski
The new common syscall API had a large and confusing API surface.  Simplify
it.  Now there is exactly one way to use it.  It's a bit more verbose than
the old way for the simple x86_64 native case, but it's much easier to use
right, and the diffstat should speak for itself.

Acked-by: Mark Rutland 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/common.c  | 57 +++-
 include/linux/entry-common.h | 86 ++--
 kernel/entry/common.c| 57 +++-
 3 files changed, 54 insertions(+), 146 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ef1c65938a6b..8710b2300b8d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,9 +38,12 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-   nr = syscall_enter_from_user_mode(regs, nr);
-
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
instrumentation_begin();
+
+   nr = kentry_syscall_begin(regs, nr);
+
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
@@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, 
struct pt_regs *regs)
regs->ax = x32_sys_call_table[nr](regs);
 #endif
}
+
+   kentry_syscall_end(regs);
+
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
 }
 #endif
 
@@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs 
*regs)
 {
unsigned int nr = syscall_32_enter(regs);
 
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
+   instrumentation_begin();
+
/*
 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 * orig_ax, the unsigned int return value truncates it.  This may
 * or may not be necessary, but it matches the old asm behavior.
 */
-   nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
-   instrumentation_begin();
-
+   nr = (unsigned int)kentry_syscall_begin(regs, nr);
do_syscall_32_irqs_on(regs, nr);
+   kentry_syscall_end(regs);
 
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
unsigned int nr = syscall_32_enter(regs);
+   bool ret;
int res;
 
-   /*
-* This cannot use syscall_enter_from_user_mode() as it has to
-* fetch EBP before invoking any of the syscall entry work
-* functions.
-*/
-   syscall_enter_from_user_mode_prepare(regs);
-
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
instrumentation_begin();
+
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
@@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
*regs)
if (res) {
/* User code screwed up. */
regs->ax = -EFAULT;
-
-   instrumentation_end();
-   local_irq_disable();
-   irqentry_exit_to_user_mode(regs);
-   return false;
+   ret = false;
+   goto out;
}
 
/* The case truncates any ptrace induced syscall nr > 2^32 -1 */
-   nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+   nr = (unsigned int)kentry_syscall_begin(regs, nr);
 
/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
 
+   kentry_syscall_end(regs);
+   ret = true;
+
+out:
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
return true;
 }
 
@@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct 
*prev,
user_regs->ax = 0;
}
 
+   kentry_syscall_end(user_regs);
+
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(user_regs);
+   kentry_exit_to_user_mode(user_regs);
 }
 
 #ifdef CONFIG_XEN_PV
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fd2d7c35670a..5287c6c15a66 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -119,31 +119,12 @@ static inline __must_check int 
arch_syscall_enter_tracehook(struct pt_regs *regs
 void enter_from_user_mode(struct pt_regs *regs);
 
 /**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs:  Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-in

[PATCH v4 6/9] entry: Make CONFIG_DEBUG_ENTRY available outside x86

2021-03-17 Thread Andy Lutomirski
In principle, the generic entry code is generic, and the goal is to use it
in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
to the generic config so that it can be used in the generic entry code and
not just in arch/x86.

This currently depends on GENERIC_ENTRY.  If an architecture wants to use
DEBUG_ENTRY without using GENERIC_ENTRY, this could be changed.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/Kconfig.debug | 10 --
 lib/Kconfig.debug  | 11 +++
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7f4947..a5a52133730c 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -170,16 +170,6 @@ config CPA_DEBUG
help
  Do change_page_attr() self-tests every 30 seconds.
 
-config DEBUG_ENTRY
-   bool "Debug low-level entry code"
-   depends on DEBUG_KERNEL
-   help
- This option enables sanity checks in x86's low-level entry code.
- Some of these sanity checks may slow down kernel entries and
- exits or otherwise impact performance.
-
- If unsure, say N.
-
 config DEBUG_NMI_SELFTEST
bool "NMI Selftest"
depends on DEBUG_KERNEL && X86_LOCAL_APIC
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..927d913fd471 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
 
 endmenu # lock debugging
 
+config DEBUG_ENTRY
+   bool "Debug low-level entry code"
+   depends on DEBUG_KERNEL
+   depends on GENERIC_ENTRY
+   help
+ This option enables sanity checks in the low-level entry code.
+ Some of these sanity checks may slow down kernel entries and
+ exits or otherwise impact performance.
+
+ If unsure, say N.
+
 config TRACE_IRQFLAGS
depends on TRACE_IRQFLAGS_SUPPORT
bool
-- 
2.30.2



[PATCH v4 5/9] kentry: Remove enter_from/exit_to_user_mode()

2021-03-17 Thread Andy Lutomirski
enter_from_user_mode() and exit_to_user_mode() do part, but not all, of the
entry and exit work.  Fortunately, they have no callers, so there's no need
to try to give the precise semantics.  Delete them.

This doesn't affect arm64.  arm64 has identically named functions
in arch/arm64, and kentry isn't built on arm64.  So this patch also
reduces confusion for people who grep the tree and might previously
have thought that arm64 used this code.

Signed-off-by: Andy Lutomirski 
---
 include/linux/entry-common.h | 41 
 kernel/entry/common.c| 10 -
 2 files changed, 51 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 5287c6c15a66..ba4e8509a20f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -97,27 +97,6 @@ static inline __must_check int 
arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
-/**
- * enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and interrupts are still
- * disabled. The subsequent functions can be instrumented.
- *
- * This is invoked when there is architecture specific functionality to be
- * done between establishing state and enabling interrupts. The caller must
- * enable interrupts before invoking syscall_enter_from_user_mode_work().
- */
-void enter_from_user_mode(struct pt_regs *regs);
-
 /**
  * kentry_syscall_begin - Prepare to invoke a syscall handler
  * @regs:  Pointer to currents pt_regs
@@ -261,26 +240,6 @@ static inline void arch_syscall_exit_tracehook(struct 
pt_regs *regs, bool step)
 }
 #endif
 
-/**
- * exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall/interrupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *mitigations, etc.: arch_exit_to_user_mode()
- * 4) Tell lockdep that interrupts are enabled
- *
- * Invoked from architecture specific code when syscall_exit_to_user_mode()
- * is not suitable as the last step before returning to userspace. Must be
- * invoked with interrupts disabled and the caller must be
- * non-instrumentable.
- * The caller has to invoke syscall_exit_to_user_mode_work() before this.
- */
-void exit_to_user_mode(void);
-
 /**
  * kentry_syscall_end - Finish syscall processing
  * @regs:  Pointer to currents pt_regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 800ad406431b..6fbe1c109877 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -25,11 +25,6 @@ static __always_inline void __enter_from_user_mode(struct 
pt_regs *regs)
instrumentation_end();
 }
 
-void noinstr enter_from_user_mode(struct pt_regs *regs)
-{
-   __enter_from_user_mode(regs);
-}
-
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
if (unlikely(audit_context())) {
@@ -106,11 +101,6 @@ static __always_inline void __exit_to_user_mode(void)
lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-void noinstr exit_to_user_mode(void)
-{
-   __exit_to_user_mode();
-}
-
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { 
}
 
-- 
2.30.2



[PATCH v4 3/9] x86/entry: Convert ret_from_fork to C

2021-03-17 Thread Andy Lutomirski
ret_from_fork is written in asm, slightly differently, for x86_32 and
x86_64.  Convert it to C.

This is a straight conversion without any particular cleverness.  As a
further cleanup, the code that sets up the ret_from_fork argument registers
could be adjusted to put the arguments in the correct registers.

This will cause the ORC unwinder to find pt_regs even for kernel threads on
x86_64.  This seems harmless.

The 32-bit comment above the now-deleted schedule_tail_wrapper was
obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
same problem more cleanly.

Cc: Josh Poimboeuf 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/common.c  | 23 ++
 arch/x86/entry/entry_32.S| 51 +---
 arch/x86/entry/entry_64.S| 33 +
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/kernel/process.c|  2 +-
 arch/x86/kernel/process_32.c |  2 +-
 arch/x86/kernel/unwind_orc.c |  2 +-
 7 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 95776f16c1cb..ef1c65938a6b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+void ret_from_fork(struct task_struct *prev,
+  int (*kernel_thread_fn)(void *),
+  void *kernel_thread_arg,
+  struct pt_regs *user_regs);
+
+__visible void noinstr ret_from_fork(struct task_struct *prev,
+int (*kernel_thread_fn)(void *),
+void *kernel_thread_arg,
+struct pt_regs *user_regs)
+{
+   instrumentation_begin();
+
+   schedule_tail(prev);
+
+   if (kernel_thread_fn) {
+   kernel_thread_fn(kernel_thread_arg);
+   user_regs->ax = 0;
+   }
+
+   instrumentation_end();
+   syscall_exit_to_user_mode(user_regs);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index df8c017e6161..7113d259727f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -805,26 +805,6 @@ SYM_CODE_START(__switch_to_asm)
 SYM_CODE_END(__switch_to_asm)
 .popsection
 
-/*
- * The unwinder expects the last frame on the stack to always be at the same
- * offset from the end of the page, which allows it to validate the stack.
- * Calling schedule_tail() directly would break that convention because its an
- * asmlinkage function so its argument has to be pushed on the stack.  This
- * wrapper creates a proper "end of stack" frame header before the call.
- */
-.pushsection .text, "ax"
-SYM_FUNC_START(schedule_tail_wrapper)
-   FRAME_BEGIN
-
-   pushl   %eax
-   callschedule_tail
-   popl%eax
-
-   FRAME_END
-   ret
-SYM_FUNC_END(schedule_tail_wrapper)
-.popsection
-
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -833,29 +813,14 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * edi: kernel thread arg
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-   callschedule_tail_wrapper
-
-   testl   %ebx, %ebx
-   jnz 1f  /* kernel threads are uncommon */
-
-2:
-   /* When we fork, we trace the syscall return in the child, too. */
-   movl%esp, %eax
-   callsyscall_exit_to_user_mode
-   jmp .Lsyscall_32_done
-
-   /* kernel thread */
-1: movl%edi, %eax
-   CALL_NOSPEC ebx
-   /*
-* A kernel thread is allowed to return here after successfully
-* calling kernel_execve().  Exit to userspace to complete the execve()
-* syscall.
-*/
-   movl$0, PT_EAX(%esp)
-   jmp 2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_START(asm_ret_from_fork)
+   movl%ebx, %edx
+   movl%edi, %ecx
+   pushl   %esp
+   callret_from_fork
+   addl$4, %esp
+   jmp .Lsyscall_32_done
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..0f7df8861ac1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -273,35 +273,18 @@ SYM_FUNC_END(__switch_to_asm)
  * rax: prev task we switched from
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
+ * rbp: encoded frame pointer for the fp unwinder
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-   UNWIND_HINT_EMPTY
-   movq%rax, %rdi
-   callschedule_tail   /* rdi: 'prev' task parameter */
-
-   testq   %rbx, %rbx  /* from kernel_thread? */
-   jnz 1f  /* kernel threads are uncommon 

[PATCH v4 1/9] x86/dumpstack: Remove unnecessary range check fetching opcode bytes

2021-03-17 Thread Andy Lutomirski
copy_from_user_nmi() validates that the pointer is in the user range,
so there is no need for an extra check in copy_code().

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/dumpstack.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 299c20f0a38b..55cf3c8325c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -81,12 +81,6 @@ static int copy_code(struct pt_regs *regs, u8 *buf, unsigned 
long src,
/* The user space code from other tasks cannot be accessed. */
if (regs != task_pt_regs(current))
return -EPERM;
-   /*
-* Make sure userspace isn't trying to trick us into dumping kernel
-* memory by pointing the userspace instruction pointer at it.
-*/
-   if (__chk_range_not_ok(src, nbytes, TASK_SIZE_MAX))
-   return -EINVAL;
 
/*
 * Even if named copy_from_user_nmi() this can be invoked from
-- 
2.30.2



[PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads

2021-03-17 Thread Andy Lutomirski
For kernel threads, task_pt_regs is currently all zeros, a valid user state
(if kernel_execve() has been called), or some combination thereof during
execution of kernel_execve().  If a stack trace is printed, the unwinder
might get confused and treat task_pt_regs as a kernel state.  Indeed,
forcing a stack dump results in an attempt to display _kernel_ code bytes
from a bogus address at the very top of kernel memory.

Fix the confusion by setting cs=3 so that user_mode(task_pt_regs(...)) ==
true for kernel threads.

Also improve the error when failing to fetch code bytes to show which type
of fetch failed.  This makes it much easier to understand what's happening.

Cc: Josh Poimboeuf 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/dumpstack.c |  4 ++--
 arch/x86/kernel/process.c   | 13 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 55cf3c8325c6..9b7b69bb12e5 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -128,8 +128,8 @@ void show_opcodes(struct pt_regs *regs, const char *loglvl)
/* No access to the user space stack of other tasks. Ignore. */
break;
default:
-   printk("%sCode: Unable to access opcode bytes at RIP 0x%lx.\n",
-  loglvl, prologue);
+   printk("%sCode: Unable to access %s opcode bytes at RIP 
0x%lx.\n",
+  loglvl, user_mode(regs) ? "user" : "kernel", prologue);
break;
}
 }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 145a7ac0c19a..f6f16df04cb9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long 
sp, unsigned long arg,
/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
memset(childregs, 0, sizeof(struct pt_regs));
+
+   /*
+* Even though there is no real user state here, these
+* are where user regs belong, and kernel_execve() will
+* overwrite them with user regs.  Put an obviously
+* invalid value that nonetheless causes user_mode(regs)
+* to return true in CS.
+*
+* This also prevents the unwinder from thinking that there
+* is invalid kernel state at the top of the stack.
+*/
+   childregs->cs = 3;
+
kthread_frame_init(frame, sp, arg);
return 0;
}
-- 
2.30.2
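
For reference, user_mode() on x86_64 boils down to checking the privilege-level
bits of regs->cs, which is why cs = 3 is enough to make it return true for these
kernel-thread regs.  A simplified sketch (the real helper in
arch/x86/include/asm/ptrace.h also handles VM86 mode on 32-bit kernels):

	/* Simplified sketch of user_mode(); not the 32-bit/VM86-aware version. */
	static __always_inline int user_mode(struct pt_regs *regs)
	{
		return !!(regs->cs & 3);	/* RPL != 0 => regs describe user state */
	}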



[PATCH v4 0/9] kentry: A stable bugfix and a bunch of improvements

2021-03-17 Thread Andy Lutomirski
I noticed a little bug in fast compat syscalls.  I got a bit carried away
fixing it.  This renames the irqentry stuff to kentry, improves (IMNSHO)
the API, and adds lots of debugging.

It also tweaks the unwinder wrt ret_from_fork and rewrites ret_from_fork
in C.  I did this because the kentry work involved a small change to
ret_from_fork, and adjusting the asm is a mess.  So C it is.

Changes from v3: Get rid of arm64 special cases

Changes from v1 and v2: Complete rewrite

Andy Lutomirski (9):
  x86/dumpstack: Remove unnecessary range check fetching opcode bytes
  x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
  x86/entry: Convert ret_from_fork to C
  kentry: Simplify the common syscall API
  kentry: Remove enter_from/exit_to_user_mode()
  entry: Make CONFIG_DEBUG_ENTRY available outside x86
  kentry: Add debugging checks for proper kentry API usage
  kentry: Check that syscall entries and syscall exits match
  kentry: Verify kentry state in instrumentation_begin/end()

 arch/x86/Kconfig.debug   |  10 --
 arch/x86/entry/common.c  |  78 ++
 arch/x86/entry/entry_32.S|  51 ++---
 arch/x86/entry/entry_64.S|  33 ++
 arch/x86/include/asm/switch_to.h |   2 +-
 arch/x86/kernel/dumpstack.c  |  10 +-
 arch/x86/kernel/process.c|  15 ++-
 arch/x86/kernel/process_32.c |   2 +-
 arch/x86/kernel/traps.c  |   4 +-
 arch/x86/kernel/unwind_orc.c |   2 +-
 include/asm-generic/bug.h|   8 +-
 include/linux/entry-common.h | 127 +++
 include/linux/instrumentation.h  |  25 -
 include/linux/sched.h|   4 +
 init/init_task.c |   8 ++
 kernel/entry/common.c| 173 ---
 lib/Kconfig.debug|  11 ++
 17 files changed, 267 insertions(+), 296 deletions(-)

-- 
2.30.2



Re: [PATCH -tip 0/3] x86/kprobes: Remoev single-step trap from x86 kprobes

2021-03-17 Thread Andy Lutomirski
On Wed, Mar 17, 2021 at 9:27 AM Peter Zijlstra  wrote:
>
> On Wed, Mar 17, 2021 at 11:55:22PM +0900, Masami Hiramatsu wrote:
> > Hi Andy,
> >
> > Would you think you still need this series to remove iret to kernel?
>
> They're an improvement anyway, let me queue them so that they don't get
> lost.
>
> I'll line them up for tip/x86/core unless anybody else thinks of a
> better place for them and tells me in time ;-)

I agree.  The new code is quite nice.  I'm still not quite sure what
I'm doing with IRET, but thank you very much for the kprobe
improvements.


Re: [PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only

2021-03-13 Thread Andy Lutomirski
On Mon, Mar 8, 2021 at 2:06 AM Mark Rutland  wrote:
>
> On Thu, Mar 04, 2021 at 11:06:00AM -0800, Andy Lutomirski wrote:
> > exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
> > It's used only by arm64, and arm64 should stop using it (hint, hint!).
> > Compile it out on other architectures to minimize the chance of error.
>
> For context, I do plan to move over, but there's a reasonable amount of
> preparatory work needed first (e.g. factoring out the remaining asm
> entry points, and reworking our pseudo-NMI management to play nicely
> with the common entry code).
>
> > enter_from_user_mode() is a legacy way to spell
> > kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
> > the same treatment.
>
> I think you can remove these entirely, no ifdeferry necessary.
>
> Currently arm64 cannot select CONFIG_GENERIC_ENTRY, so we open-code
> copies of these in arch/arm64/kernel/entry-common.c, and doesn't use
> these common versions at all. When we move over to the common code we
> can move directly to the kentry_* versions. If we are relying on the
> prototypes anywhere, that's a bug as of itself.

I got faked out!  Fixed for the next version.


Re: [syzbot] WARNING in handle_mm_fault

2021-03-11 Thread Andy Lutomirski
Your warning is odd, but I see the bug.  It's in KVM.

On Thu, Mar 11, 2021 at 4:37 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:05a59d79 Merge git://git.kernel.org:/pub/scm/linux/kernel/..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16f493ead0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=750735fdbc630971
> dashboard link: https://syzkaller.appspot.com/bug?extid=7d7013084f0a806f3786
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+7d7013084f0a806f3...@syzkaller.appspotmail.com
>
> [ cut here ]
> raw_local_irq_restore() called with IRQs enabled
> WARNING: CPU: 0 PID: 8412 at kernel/locking/irqflag-debug.c:10 
> warn_bogus_irq_restore+0x1d/0x20 kernel/locking/irqflag-debug.c:10
> Modules linked in:
> CPU: 0 PID: 8412 Comm: syz-fuzzer Not tainted 5.12.0-rc2-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:warn_bogus_irq_restore+0x1d/0x20 kernel/locking/irqflag-debug.c:10

The above makes sense, but WTH is the below:

> Code: be ff cc cc cc cc cc cc cc cc cc cc cc 80 3d 11 d1 ad 04 00 74 01 c3 48 
> c7 c7 20 79 6b 89 c6 05 00 d1 ad 04 01 e8 75 5b be ff <0f> 0b c3 48 39 77 10 
> 0f 84 97 00 00 00 66 f7 47 22 f0 ff 74 4b 48
> RSP: :c9000185fac8 EFLAGS: 00010282
> RAX:  RBX: 8880194268a0 RCX: 
> RBP: 0200 R08:  R09: 
> R13: ed1003284d14 R14: 0001 R15: 8880b9c36000
> FS:  00c2ec90() GS:8880b9c0() knlGS:
> Call Trace:
>  handle_mm_fault+0x1bc/0x7e0 mm/memory.c:4549
> Code: 48 8d 05 97 25 3e 00 48 89 44 24 08 e8 6d 54 ea ff 90 e8 07 a1 ed ff eb 
> a5 cc cc cc cc cc 8b 44 24 10 48 8b 4c 24 08 89 41 24  cc cc cc cc cc cc 
> cc cc cc cc cc cc cc cc cc cc cc cc cc 48 8b
> RAX: 47f6 RBX: 47f6 RCX: 00d6
> RDX: 4c00 RSI: 00d6 RDI: 0181cad0
> RBP: 00c000301890 R08: 47f5 R09: 0059c5a0
> R10: 00c0004e2000 R11: 0020 R12: 00fa
> R13: 00aa R14: 0093f064 R15: 0038
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 0 PID: 8412 Comm: syz-fuzzer Not tainted 5.12.0-rc2-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:

Now we start reading here:

>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  panic+0x306/0x73d kernel/panic.c:231
>  __warn.cold+0x35/0x44 kernel/panic.c:605
>  report_bug+0x1bd/0x210 lib/bug.c:195
>  handle_bug+0x3c/0x60 arch/x86/kernel/traps.c:239
>  exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:259
>  asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:575
> RIP: 0010:warn_bogus_irq_restore+0x1d/0x20 kernel/locking/irqflag-debug.c:10
> Code: be ff cc cc cc cc cc cc cc cc cc cc cc 80 3d 11 d1 ad 04 00 74 01 c3 48 
> c7 c7 20 79 6b 89 c6 05 00 d1 ad 04 01 e8 75 5b be ff <0f> 0b c3 48 39 77 10 
> 0f 84 97 00 00 00 66 f7 47 22 f0 ff 74 4b 48
> RSP: :c9000185fac8 EFLAGS: 00010282
> RAX:  RBX: 8880194268a0 RCX: 
> RDX: 88802f7b2400 RSI: 815b4435 RDI: f5200030bf4b
> RBP: 0200 R08:  R09: 
> R10: 815ad19e R11:  R12: 0003
> R13: ed1003284d14 R14: 0001 R15: 8880b9c36000
>  kvm_wait arch/x86/kernel/kvm.c:860 [inline]

and there's the bug:

/*
 * halt until it's our turn and kicked. Note that we do safe halt
 * for irq enabled case to avoid hang when lock info is overwritten
 * in irq spinlock slowpath and no spurious interrupt occur to save us.
 */
if (arch_irqs_disabled_flags(flags))
halt();
else
safe_halt();

out:
local_irq_restore(flags);
}

The safe_halt path is bogus.  It should just return instead of
restoring the IRQ flags.

--Andy
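
One possible shape of a fix for the kvm_wait() path quoted above -- a sketch, not
the actual patch that went in; the local_irq_save() and early READ_ONCE() check
are reconstructed from context:

	static void kvm_wait(u8 *ptr, u8 val)
	{
		unsigned long flags;

		if (in_nmi())
			return;

		local_irq_save(flags);

		if (READ_ONCE(*ptr) != val)
			goto out;

		if (arch_irqs_disabled_flags(flags)) {
			halt();
		} else {
			/* safe_halt() returns with IRQs enabled... */
			safe_halt();
			/* ...so just return instead of "restoring" the saved flags. */
			return;
		}

	out:
		local_irq_restore(flags);
	}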


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-10 Thread Andy Lutomirski
On Wed, Mar 10, 2021 at 5:19 PM Aili Yao  wrote:
>
> On Mon, 8 Mar 2021 11:00:28 -0800
> Andy Lutomirski  wrote:
>
> > > On Mar 8, 2021, at 10:31 AM, Luck, Tony  wrote:
> > >
> > > 
> > >>
> > >> Can you point me at that SIGBUS code in a current kernel?
> > >
> > > It is in kill_me_maybe().  mce_vaddr is setup when we disassemble 
> > > whatever get_user()
> > > or copy from user variant was in use in the kernel when the poison memory 
> > > was consumed.
> > >
> > >if (p->mce_vaddr != (void __user *)-1l) {
> > >force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> >
> > Hmm. On the one hand, no one has complained yet. On the other hand, 
> > hardware that supports this isn’t exactly common.
> >
> > We may need some actual ABI design here. We also need to make sure that 
> > things like io_uring accesses or, more generally, anything using the use_mm 
> > / use_temporary_mm ends up either sending no signal or sending a signal to 
> > the right target.
> >
> > >
> > > Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> > > siginfo?
> >
> > Dunno.
>
> I have one thought here but don't know if it's proper:
>
> The previous patch used force_sig_mceerr() to signal the user process in such
> a scenario; with this method the SIGBUS can't be ignored, as
> force_sig_mceerr() is designed to ensure.
>
> If the user process doesn't want this signal, will it set its signal config to 
> ignore?
> Maybe we can use a send_sig_mceerr() instead of force_sig_mceerr(), if 
> the process wants to
> ignore the SIGBUS, then it will ignore that, or it can also process the 
> SIGBUS?

I don't think the signal blocking mechanism makes sense for this.
Blocking a signal is for saying that, if another process sends the
signal (or an async event like ctrl-C), then the process doesn't want
it.  Blocking doesn't block synchronous things like faults.

I think we need to at least fix the existing bug before we add more
signals.  AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
threads.

--Andy
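
For reference, the force_* vs. send_* distinction being discussed above, as a
hypothetical illustration (not the proposed patch; the call signatures are the
ones in kernel/signal.c):

	/*
	 * Cannot be blocked or ignored, and always targets current --
	 * fits a synchronous, action-required (AR) fault.
	 */
	force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);

	/*
	 * Respects the task's signal disposition -- closer to an
	 * action-optional (AO) notification.
	 */
	send_sig_mceerr(BUS_MCEERR_AO, p->mce_vaddr, PAGE_SHIFT, p);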


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Andy Lutomirski



> On Mar 8, 2021, at 10:31 AM, Luck, Tony  wrote:
> 
> 
>> 
>> Can you point me at that SIGBUS code in a current kernel?
> 
> It is in kill_me_maybe().  mce_vaddr is setup when we disassemble whatever 
> get_user()
> or copy from user variant was in use in the kernel when the poison memory was 
> consumed.
> 
>if (p->mce_vaddr != (void __user *)-1l) {
>force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);

Hmm. On the one hand, no one has complained yet. On the other hand, hardware 
that supports this isn’t exactly common.

We may need some actual ABI design here. We also need to make sure that things 
like io_uring accesses or, more generally, anything using the use_mm / 
use_temporary_mm ends up either sending no signal or sending a signal to the 
right target.

> 
> Would it be any better if we used the BUS_MCEERR_AO code that goes into 
> siginfo?

Dunno.

> 
> That would make it match up better with what happens when poison is found
> asynchronously by the patrol scrubber. I.e. the semantics are:
> 
> AR: You just touched poison at this address and need to do something about 
> that.
> AO: Just letting you know that you have some poison at the address in siginfo.
> 
> -Tony


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Andy Lutomirski


> On Mar 8, 2021, at 1:49 AM, Aili Yao  wrote:
> 
> On Sun, 7 Mar 2021 11:16:24 -0800
> Andy Lutomirski  wrote:
> 
>>>>>>> Some programs may use read(2), write(2), etc as ways to check if
>>>>>>> memory is valid without getting a signal.  They might not want
>>>>>>> signals, which means that this feature might need to be configurable.  
>>>>>> 
>>>>>> That sounds like an appalling hack. If users need such a mechanism
>>>>>> we should create some better way to do that.
>>>>>> 
>>>>> 
>>>>> Appalling hack or not, it works. So, if we’re going to send a signal to 
>>>>> user code that looks like it originated from a bona fide architectural 
>>>>> recoverable fault, it needs to be recoverable.  A load from a failed 
>>>>> NVDIMM page is such a fault. A *kernel* load is not. So we need to 
>>>>> distinguish it somehow.  
>>>> 
>>>> Sorry for my previous mis-understanding, and i have some questions:
>>>> if programs use read, write to check if memory is valid, does it really 
>>>> want to cover the poison case?  
>> 
>> I don't know.
>> 
>>>> When for such a case, an error is returned,  can the program realize it's 
>>>> hwpoison issue not other software error and process correctly?  
>> 
>> Again, I don't know.  But changing the API like this seems potentially
>> dangerous and needs to be done with care.
>> 
>>>> 
>>>> if this is the proper action, the original poison flow in current code 
>>>> from read and write need to change too.
>>>> 
>>> 
>>> Sorry, another question:
>>>  When programs use read(2), write(2) as ways to check if memory is valid, 
>>> does it really want to check if the user page the program provided is 
>>> valid, not the destination or disk space valid?  
>> 
>> They may well be trying to see if their memory is valid.
> 
> Thanks for your reply, and I don't know what to do.
> For current code, if user program write to a block device(maybe a test try) 
> and if its user copy page corrupt when in kernel copy, the process is killed 
> with a SIGBUS.
> And for the page fault case in this thread, the process is error returned.

Can you point me at that SIGBUS code in a current kernel?

> 
> -- 
> Thanks!
> Aili Yao


[tip: x86/core] x86/stackprotector/32: Make the canary into a regular percpu variable

2021-03-08 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/core branch of tip:

Commit-ID: 3fb0fdb3bbe7aed495109b3296b06c2409734023
Gitweb:
https://git.kernel.org/tip/3fb0fdb3bbe7aed495109b3296b06c2409734023
Author:Andy Lutomirski 
AuthorDate:Sat, 13 Feb 2021 11:19:44 -08:00
Committer: Borislav Petkov 
CommitterDate: Mon, 08 Mar 2021 13:19:05 +01:00

x86/stackprotector/32: Make the canary into a regular percpu variable

On 32-bit kernels, the stackprotector canary is quite nasty -- it is
stored at %gs:(20), which is nasty because 32-bit kernels use %fs for
percpu storage.  It's even nastier because it means that whether %gs
contains userspace state or kernel state while running kernel code
depends on whether stackprotector is enabled (this is
CONFIG_X86_32_LAZY_GS), and this setting radically changes the way
that segment selectors work.  Supporting both variants is a
maintenance and testing mess.

Merely rearranging so that percpu and the stack canary
share the same segment would be messy as the 32-bit percpu address
layout isn't currently compatible with putting a variable at a fixed
offset.

Fortunately, GCC 8.1 added options that allow the stack canary to be
accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary
percpu variable.  This lets us get rid of all of the code to manage the
stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess.

(That name is special.  We could use any symbol we want for the
 %fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any
 name other than __stack_chk_guard.)

Forcibly disable stackprotector on older compilers that don't support
the new options and turn the stack canary into a percpu variable. The
"lazy GS" approach is now used for all 32-bit configurations.

Also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels,
it loads the GS selector and updates the user GSBASE accordingly. (This
is unchanged.) On 32-bit kernels, it loads the GS selector and updates
GSBASE, which is now always the user base. This means that the overall
effect is the same on 32-bit and 64-bit, which avoids some ifdeffery.

 [ bp: Massage commit message. ]

Signed-off-by: Andy Lutomirski 
Signed-off-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.l...@kernel.org
---
 arch/x86/Kconfig  |  7 +--
 arch/x86/Makefile |  8 ++-
 arch/x86/entry/entry_32.S | 56 +---
 arch/x86/include/asm/processor.h  | 15 +---
 arch/x86/include/asm/ptrace.h |  5 +-
 arch/x86/include/asm/segment.h| 30 ++--
 arch/x86/include/asm/stackprotector.h | 79 --
 arch/x86/include/asm/suspend_32.h |  6 +--
 arch/x86/kernel/asm-offsets_32.c  |  5 +-
 arch/x86/kernel/cpu/common.c  |  5 +-
 arch/x86/kernel/doublefault_32.c  |  4 +-
 arch/x86/kernel/head_32.S | 18 +-
 arch/x86/kernel/setup_percpu.c|  1 +-
 arch/x86/kernel/tls.c |  8 +--
 arch/x86/lib/insn-eval.c  |  4 +-
 arch/x86/platform/pvh/head.S  | 14 +
 arch/x86/power/cpu.c  |  6 +--
 arch/x86/xen/enlighten_pv.c   |  1 +-
 scripts/gcc-x86_32-has-stack-protector.sh |  6 +-
 19 files changed, 60 insertions(+), 218 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879..10cc619 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -360,10 +360,6 @@ config X86_64_SMP
def_bool y
depends on X86_64 && SMP
 
-config X86_32_LAZY_GS
-   def_bool y
-   depends on X86_32 && !STACKPROTECTOR
-
 config ARCH_SUPPORTS_UPROBES
def_bool y
 
@@ -386,7 +382,8 @@ config CC_HAS_SANE_STACKPROTECTOR
default $(success,$(srctree)/scripts/gcc-x86_32-has-stack-protector.sh 
$(CC))
help
   We have to make sure stack protector is unconditionally disabled if
-  the compiler produces broken code.
+  the compiler produces broken code or if it does not let us control
+  the segment on 32-bit kernels.
 
 menu "Processor type and features"
 
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d6d5a2..952f534 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -79,6 +79,14 @@ ifeq ($(CONFIG_X86_32),y)
 
 # temporary until string.h is fixed
 KBUILD_CFLAGS += -ffreestanding
+
+   ifeq ($(CONFIG_STACKPROTECTOR),y)
+   ifeq ($(CONFIG_SMP),y)
+   KBUILD_CFLAGS += -mstack-protector-guard-reg=fs 
-mstack-protector-guard-symbol=__stack_chk_guard
+   else
+   KBUILD_CFLAGS += -mstack-protector-guard=global
+   endif
+   endif
 else
 BITS := 64
 UTS_MACHINE := x86_64
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index df8c017.
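
For reference, with the compiler options added in the Makefile hunk above, the
canary can be declared like any other percpu variable; a sketch of the shape of
the declaration (the exact placement and guards may differ):

	/* arch/x86/kernel/cpu/common.c (sketch) */
	#ifdef CONFIG_STACKPROTECTOR
	DEFINE_PER_CPU(unsigned long, __stack_chk_guard);
	EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);
	#endif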

[tip: x86/core] x86/entry/32: Remove leftover macros after stackprotector cleanups

2021-03-08 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/core branch of tip:

Commit-ID: d0962f2b24c99889a386f0658c71535f56358f77
Gitweb:
https://git.kernel.org/tip/d0962f2b24c99889a386f0658c71535f56358f77
Author:Andy Lutomirski 
AuthorDate:Sat, 13 Feb 2021 11:19:45 -08:00
Committer: Borislav Petkov 
CommitterDate: Mon, 08 Mar 2021 13:27:31 +01:00

x86/entry/32: Remove leftover macros after stackprotector cleanups

Now that nonlazy-GS mode is gone, remove the macros from entry_32.S
that obfuscated^Wabstracted GS handling.  The assembled output is
identical before and after this patch.

Signed-off-by: Andy Lutomirski 
Signed-off-by: Borislav Petkov 
Link: 
https://lkml.kernel.org/r/b1543116f0f0e68f1763d90d5f7fcec27885dff5.1613243844.git.l...@kernel.org
---
 arch/x86/entry/entry_32.S | 43 +-
 1 file changed, 2 insertions(+), 41 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index eb0cb66..bee9101 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -53,35 +53,6 @@
 
 #define PTI_SWITCH_MASK (1 << PAGE_SHIFT)
 
-/*
- * User gs save/restore
- *
- * This is leftover junk from CONFIG_X86_32_LAZY_GS.  A subsequent patch
- * will remove it entirely.
- */
- /* unfortunately push/pop can't be no-op */
-.macro PUSH_GS
-   pushl   $0
-.endm
-.macro POP_GS pop=0
-   addl$(4 + \pop), %esp
-.endm
-.macro POP_GS_EX
-.endm
-
- /* all the rest are no-op */
-.macro PTGS_TO_GS
-.endm
-.macro PTGS_TO_GS_EX
-.endm
-.macro GS_TO_REG reg
-.endm
-.macro REG_TO_PTGS reg
-.endm
-.macro SET_KERNEL_GS reg
-.endm
-
-
 /* Unconditionally switch to user cr3 */
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
@@ -234,7 +205,7 @@
 .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 unwind_espfix=0
cld
 .if \skip_gs == 0
-   PUSH_GS
+   pushl   $0
 .endif
pushl   %fs
 
@@ -259,9 +230,6 @@
movl$(__USER_DS), %edx
movl%edx, %ds
movl%edx, %es
-.if \skip_gs == 0
-   SET_KERNEL_GS %edx
-.endif
/* Switch to kernel stack if necessary */
 .if \switch_stacks > 0
SWITCH_TO_KERNEL_STACK
@@ -300,7 +268,7 @@
 1: popl%ds
 2: popl%es
 3: popl%fs
-   POP_GS \pop
+   addl$(4 + \pop), %esp   /* pop the unused "gs" slot */
IRET_FRAME
 .pushsection .fixup, "ax"
 4: movl$0, (%esp)
@@ -313,7 +281,6 @@
_ASM_EXTABLE(1b, 4b)
_ASM_EXTABLE(2b, 5b)
_ASM_EXTABLE(3b, 6b)
-   POP_GS_EX
 .endm
 
 .macro RESTORE_ALL_NMI cr3_reg:req pop=0
@@ -928,7 +895,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
movlPT_EIP(%esp), %edx  /* pt_regs->ip */
movlPT_OLDESP(%esp), %ecx   /* pt_regs->sp */
 1: mov PT_FS(%esp), %fs
-   PTGS_TO_GS
 
popl%ebx/* pt_regs->bx */
addl$2*4, %esp  /* skip pt_regs->cx and pt_regs->dx */
@@ -964,7 +930,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
jmp 1b
 .popsection
_ASM_EXTABLE(1b, 2b)
-   PTGS_TO_GS_EX
 
 .Lsysenter_fix_flags:
pushl   $X86_EFLAGS_FIXED
@@ -1106,11 +1071,7 @@ SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
ENCODE_FRAME_POINTER
 
-   /* fixup %gs */
-   GS_TO_REG %ecx
movlPT_GS(%esp), %edi   # get the function address
-   REG_TO_PTGS %ecx
-   SET_KERNEL_GS %ecx
 
/* fixup orig %eax */
movlPT_ORIG_EAX(%esp), %edx # get the error code


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-07 Thread Andy Lutomirski
On Wed, Mar 3, 2021 at 4:51 AM Aili Yao  wrote:
>
> On Wed, 3 Mar 2021 20:24:02 +0800
> Aili Yao  wrote:
>
> > On Mon, 1 Mar 2021 11:09:36 -0800
> > Andy Lutomirski  wrote:
> >
> > > > On Mar 1, 2021, at 11:02 AM, Luck, Tony  wrote:
> > > >
> > > > 
> > > >>
> > > >> Some programs may use read(2), write(2), etc as ways to check if
> > > >> memory is valid without getting a signal.  They might not want
> > > >> signals, which means that this feature might need to be configurable.
> > > >
> > > > That sounds like an appalling hack. If users need such a mechanism
> > > > we should create some better way to do that.
> > > >
> > >
> > > Appalling hack or not, it works. So, if we’re going to send a signal to 
> > > user code that looks like it originated from a bona fide architectural 
> > > recoverable fault, it needs to be recoverable.  A load from a failed 
> > > NVDIMM page is such a fault. A *kernel* load is not. So we need to 
> > > distinguish it somehow.
> >
> > Sorry for my previous mis-understanding, and i have some questions:
> > if programs use read, write to check if memory is valid, does it really 
> > want to cover the poison case?

I don't know.

> > When for such a case, an error is returned,  can the program realize it's 
> > hwpoison issue not other software error and process correctly?

Again, I don't know.  But changing the API like this seems potentially
dangerous and needs to be done with care.

> >
> > if this is the proper action, the original poison flow in current code from 
> > read and write need to change too.
> >
>
> Sorry, another question:
>   When programs use read(2), write(2) as ways to check if memory is valid, 
> does it really want to check if the user page the program provided is valid, 
> not the destination or disk space valid?

They may well be trying to see if their memory is valid.

>   the patch will not affect this purpose as it's only valid for user page 
> which the program provides to write, or some similar syscall parameter
>
> --
> Thanks!
> Aili Yao



-- 
Andy Lutomirski
AMA Capital Management, LLC


[tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

2021-03-06 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 5d5675df792ff67e74a500c4c94db0f99e6a10ef
Gitweb:
https://git.kernel.org/tip/5d5675df792ff67e74a500c4c94db0f99e6a10ef
Author:Andy Lutomirski 
AuthorDate:Thu, 04 Mar 2021 11:05:54 -08:00
Committer: Borislav Petkov 
CommitterDate: Sat, 06 Mar 2021 13:10:06 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

$ ./tools/testing/selftests/x86/syscall_arg_fault_32
...
strace: pid 264258: entering, ptrace_syscall_info.op == 2
...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Cc: sta...@vger.kernel.org
Link: 
https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.l...@kernel.org
---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
*regs)
regs->ax = -EFAULT;
 
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   local_irq_disable();
+   irqentry_exit_to_user_mode(regs);
return false;
}
 


[tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

2021-03-06 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: e59ba7bf71a09e474198741563e0e587ae43d1c7
Gitweb:
https://git.kernel.org/tip/e59ba7bf71a09e474198741563e0e587ae43d1c7
Author:Andy Lutomirski 
AuthorDate:Thu, 04 Mar 2021 11:05:54 -08:00
Committer: Borislav Petkov 
CommitterDate: Sat, 06 Mar 2021 11:37:00 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

$ ./tools/testing/selftests/x86/syscall_arg_fault_32
...
strace: pid 264258: entering, ptrace_syscall_info.op == 2
...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: 
https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.l...@kernel.org

---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
*regs)
regs->ax = -EFAULT;
 
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   local_irq_disable();
+   irqentry_exit_to_user_mode(regs);
return false;
}
 


[tip: x86/urgent] x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

2021-03-05 Thread tip-bot2 for Andy Lutomirski
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: dabf017539988a9bfc40a38dbafd35c501bacc44
Gitweb:
https://git.kernel.org/tip/dabf017539988a9bfc40a38dbafd35c501bacc44
Author:Andy Lutomirski 
AuthorDate:Thu, 04 Mar 2021 11:05:54 -08:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 05 Mar 2021 11:10:13 +01:00

x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls

On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work.  This confuses audit and ptrace.  For example:

$ ./tools/testing/selftests/x86/syscall_arg_fault_32
...
strace: pid 264258: entering, ptrace_syscall_info.op == 2
...

This is a minimal fix intended for ease of backporting.  A more
complete cleanup is coming.

Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: 
https://lore.kernel.org/r/8c82296ddf803b91f8d1e5eac89e5803ba54ab0e.1614884673.git.l...@kernel.org

---
 arch/x86/entry/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae..4efd39a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
*regs)
regs->ax = -EFAULT;
 
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   local_irq_disable();
+   irqentry_exit_to_user_mode(regs);
return false;
}
 


[PATCH v3 10/11] kentry: Check that syscall entries and syscall exits match

2021-03-04 Thread Andy Lutomirski
If arch code calls the wrong kernel entry helpers, syscall entries and
exits can get out of sync.  Add a new field to task_struct to track the
syscall state and validate that it transitions correctly.

Signed-off-by: Andy Lutomirski 
---
 include/linux/sched.h |  4 
 init/init_task.c  |  8 
 kernel/entry/common.c | 24 +++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..95d6d8686d98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1368,6 +1368,10 @@ struct task_struct {
struct llist_head   kretprobe_instances;
 #endif
 
+#ifdef CONFIG_DEBUG_ENTRY
+   bool kentry_in_syscall;
+#endif
+
/*
 * New fields for task_struct should be added above here, so that
 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 8a992d73e6fb..de4fdaac4e8b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -212,6 +212,14 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP
.seccomp= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEBUG_ENTRY
+   /*
+* The init task, and kernel threads in general, are considered
+* to be "in a syscall".  This way they can execve() and then exit
+* the supposed syscall that they were in to go to user mode.
+*/
+   .kentry_in_syscall = true,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f62934d761e3..9dea411de3b0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -148,11 +148,21 @@ static long syscall_trace_enter(struct pt_regs *regs, 
long syscall,
 
 long kentry_syscall_begin(struct pt_regs *regs, long syscall)
 {
-   unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+   unsigned long work;
+
+   if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
+   DEBUG_ENTRY_WARN_ONCE(
+   current->kentry_in_syscall,
+   "entering syscall %ld while already in a syscall",
+   syscall);
+   current->kentry_in_syscall = true;
+   }
 
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
lockdep_assert_irqs_enabled();
 
+   work = READ_ONCE(current_thread_info()->syscall_work);
+
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);
 
@@ -163,11 +173,16 @@ long kentry_syscall_begin(struct pt_regs *regs, long 
syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
instrumentation_begin();
+
 #ifdef CONFIG_DEBUG_ENTRY
DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
  "__exit_to_user_mode called at wrong kentry cpu 
depth (%u)",
  this_cpu_read(kentry_cpu_depth));
+
+   DEBUG_ENTRY_WARN_ONCE(current->kentry_in_syscall,
+ "exiting to user mode while in syscall context");
 #endif
+
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
instrumentation_end();
@@ -331,6 +346,13 @@ void kentry_syscall_end(struct pt_regs *regs)
 */
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
+
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(!current->kentry_in_syscall,
+ "exiting syscall %lu without entering first", nr);
+
+   current->kentry_in_syscall = 0;
+#endif
 }
 
 noinstr void kentry_enter_from_user_mode(struct pt_regs *regs)
-- 
2.29.2



[PATCH v3 07/11] kentry: Make entry/exit_to_user_mode() arm64-only

2021-03-04 Thread Andy Lutomirski
exit_to_user_mode() does part, but not all, of the exit-to-user-mode work.
It's used only by arm64, and arm64 should stop using it (hint, hint!).
Compile it out on other architectures to minimize the chance of error.

enter_from_user_mode() is a legacy way to spell
kentry_enter_from_user_mode().  It's also only used by arm64.  Give it
the same treatment.

Signed-off-by: Andy Lutomirski 
---
 include/linux/entry-common.h | 34 ++
 kernel/entry/common.c|  4 
 2 files changed, 10 insertions(+), 28 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 5287c6c15a66..a75374f87258 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -97,26 +97,15 @@ static inline __must_check int 
arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
+#ifdef CONFIG_ARM64
 /**
  * enter_from_user_mode - Establish state when coming from user mode
  *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ * Legacy variant of kentry_enter_from_user_mode().  Used only by arm64.
  *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and interrupts are still
- * disabled. The subsequent functions can be instrumented.
- *
- * This is invoked when there is architecture specific functionality to be
- * done between establishing state and enabling interrupts. The caller must
- * enable interrupts before invoking syscall_enter_from_user_mode_work().
  */
 void enter_from_user_mode(struct pt_regs *regs);
+#endif
 
 /**
  * kentry_syscall_begin - Prepare to invoke a syscall handler
@@ -261,25 +250,14 @@ static inline void arch_syscall_exit_tracehook(struct 
pt_regs *regs, bool step)
 }
 #endif
 
+#ifdef CONFIG_ARM64
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
- * Syscall/interrupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *mitigations, etc.: arch_exit_to_user_mode()
- * 4) Tell lockdep that interrupts are enabled
- *
- * Invoked from architecture specific code when syscall_exit_to_user_mode()
- * is not suitable as the last step before returning to userspace. Must be
- * invoked with interrupts disabled and the caller must be
- * non-instrumentable.
- * The caller has to invoke syscall_exit_to_user_mode_work() before this.
+ * Does the latter part of irqentry_exit_to_user_mode().  Only used by arm64.
  */
 void exit_to_user_mode(void);
+#endif
 
 /**
  * kentry_syscall_end - Finish syscall processing
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 800ad406431b..4ba82c684189 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -25,10 +25,12 @@ static __always_inline void __enter_from_user_mode(struct 
pt_regs *regs)
instrumentation_end();
 }
 
+#ifdef CONFIG_ARM64
 void noinstr enter_from_user_mode(struct pt_regs *regs)
 {
__enter_from_user_mode(regs);
 }
+#endif
 
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
@@ -106,10 +108,12 @@ static __always_inline void __exit_to_user_mode(void)
lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+#ifdef CONFIG_ARM64
 void noinstr exit_to_user_mode(void)
 {
__exit_to_user_mode();
 }
+#endif
 
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { 
}
-- 
2.29.2



[PATCH v3 11/11] kentry: Verify kentry state in instrumentation_begin/end()

2021-03-04 Thread Andy Lutomirski
Calling instrumentation_begin() and instrumentation_end() when kentry
thinks the CPU is in user mode is an error.  Verify the kentry state when
instrumentation_begin/end() are called.

Add _nocheck() variants to skip verification to avoid WARN() generating
extra kentry warnings.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/traps.c |  4 ++--
 include/asm-generic/bug.h   |  8 
 include/linux/instrumentation.h | 25 -
 kernel/entry/common.c   |  7 +++
 4 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index be924180005a..983e4be5fdcb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -229,7 +229,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
/*
 * All lies, just get the WARN/BUG out.
 */
-   instrumentation_begin();
+   instrumentation_begin_nocheck();
/*
 * Since we're emulating a CALL with exceptions, restore the interrupt
 * state to what it was at the exception site.
@@ -242,7 +242,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)
}
if (regs->flags & X86_EFLAGS_IF)
raw_local_irq_disable();
-   instrumentation_end();
+   instrumentation_end_nocheck();
 
return handled;
 }
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 76a10e0dca9f..fc360c463a99 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -85,18 +85,18 @@ void warn_slowpath_fmt(const char *file, const int line, 
unsigned taint,
   const char *fmt, ...);
 #define __WARN()   __WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {  \
-   instrumentation_begin();\
+   instrumentation_begin_nocheck();\
warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);  \
-   instrumentation_end();  \
+   instrumentation_end_nocheck();  \
} while (0)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()   __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {  \
-   instrumentation_begin();\
+   instrumentation_begin_nocheck();\
__warn_printk(arg); \
__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
-   instrumentation_end();  \
+   instrumentation_end_nocheck();  \
} while (0)
 #define WARN_ON_ONCE(condition) ({ \
int __ret_warn_on = !!(condition);  \
diff --git a/include/linux/instrumentation.h b/include/linux/instrumentation.h
index 93e2ad67fc10..cdf80454f92a 100644
--- a/include/linux/instrumentation.h
+++ b/include/linux/instrumentation.h
@@ -4,14 +4,21 @@
 
 #if defined(CONFIG_DEBUG_ENTRY) && defined(CONFIG_STACK_VALIDATION)
 
+extern void kentry_assert_may_instrument(void);
+
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({ \
-   asm volatile("%c0: nop\n\t" 
\
+#define instrumentation_begin_nocheck() ({ \
+   asm volatile("%c0: nop\n\t" \
 ".pushsection .discard.instr_begin\n\t"\
 ".long %c0b - .\n\t"   \
 ".popsection\n\t" : : "i" (__COUNTER__));  \
 })
 
+#define instrumentation_begin() ({ \
+   instrumentation_begin_nocheck();\
+   kentry_assert_may_instrument(); \
+})
+
 /*
  * Because instrumentation_{begin,end}() can nest, objtool validation considers
  * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
@@ -43,15 +50,23 @@
  * To avoid this, have _end() be a NOP instruction, this ensures it will be
  * part of the condition block and does not escape.
  */
-#define instrumentation_end() ({   \
+#define instrumentation_end_nocheck() ({   \
asm volatile("%c0: nop\n\t" \
 ".pushsection .discard.instr_end\n\t"  \
 ".long %c0b - .\n\t"   \
   

[PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86

2021-03-04 Thread Andy Lutomirski
In principle, the generic entry code is generic, and the goal is to use it
in many architectures once it settles down more.  Move CONFIG_DEBUG_ENTRY
to the generic config so that it can be used in the generic entry code and
not just in arch/x86.

Disable it on arm64.  arm64 uses some but not all of the kentry
code, and trying to debug the resulting state machine will be painful.
arm64 can turn it back on when it starts using the entire generic
path.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/Kconfig.debug | 10 --
 lib/Kconfig.debug  | 11 +++
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7f4947..a5a52133730c 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -170,16 +170,6 @@ config CPA_DEBUG
help
  Do change_page_attr() self-tests every 30 seconds.
 
-config DEBUG_ENTRY
-   bool "Debug low-level entry code"
-   depends on DEBUG_KERNEL
-   help
- This option enables sanity checks in x86's low-level entry code.
- Some of these sanity checks may slow down kernel entries and
- exits or otherwise impact performance.
-
- If unsure, say N.
-
 config DEBUG_NMI_SELFTEST
bool "NMI Selftest"
depends on DEBUG_KERNEL && X86_LOCAL_APIC
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..76549c8afa8a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1411,6 +1411,17 @@ config CSD_LOCK_WAIT_DEBUG
 
 endmenu # lock debugging
 
+config DEBUG_ENTRY
+   bool "Debug low-level entry code"
+   depends on DEBUG_KERNEL
+   depends on !ARM64
+   help
+ This option enables sanity checks in the low-level entry code.
+ Some of these sanity checks may slow down kernel entries and
+ exits or otherwise impact performance.
+
+ If unsure, say N.
+
 config TRACE_IRQFLAGS
depends on TRACE_IRQFLAGS_SUPPORT
bool
-- 
2.29.2



[PATCH v3 09/11] kentry: Add debugging checks for proper kentry API usage

2021-03-04 Thread Andy Lutomirski
It's quite easy to mess up kentry calls.  Add debugging checks that kentry
transitions to and from user mode match up and that kentry_nmi_enter() and
kentry_nmi_exit() match up.

Checking full matching of kentry_enter() with kentry_exit() needs
per-task state.

Signed-off-by: Andy Lutomirski 
---
 kernel/entry/common.c | 81 ++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 4ba82c684189..f62934d761e3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,9 +11,65 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+/*
+ * kentry_cpu_depth is 0 in user mode, 1 in normal kernel mode, and
+ * 1 + n * KENTRY_CPU_DEPTH_NMI in kentry_nmi_enter() mode.  We can't
+ * use a percpu variable to match up kentry_enter() from kernel mode
+ * with the corresponding kentry_exit() because tasks may schedule inside
+ * nested kentry_enter() regions.
+ */
+#define KENTRY_CPU_DEPTH_NMI   1024UL
+
+#ifdef CONFIG_DEBUG_ENTRY
+
+/*
+ * Extra safe WARN_ONCE.  Per-arch optimized WARN_ONCE() implementations
+ * might go through the low-level entry and kentry code even before noticing
+ * that the warning already fired, which could result in recursive warnings.
+ * This carefully avoids any funny business once a given warning has fired.
+ */
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) ({ \
+   static bool __section(".data.once") __warned;   \
+   int __ret_warn_once = !!(condition);\
+   \
+   if (unlikely(__ret_warn_once && !READ_ONCE(__warned))) {\
+   WRITE_ONCE(__warned, true); \
+   WARN(1, format);\
+   }   \
+   unlikely(__ret_warn_once);  \
+})
+
+
+static DEFINE_PER_CPU(unsigned int, kentry_cpu_depth) = 1UL;
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n)
+{
+   this_cpu_add(kentry_cpu_depth, n);
+}
+
+static void kentry_cpu_depth_check(unsigned int n)
+{
+   DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) < n, "kentry 
depth underflow");
+}
+
+static __always_inline void kentry_cpu_depth_sub(unsigned int n)
+{
+   this_cpu_sub(kentry_cpu_depth, n);
+}
+#else
+
+#define DEBUG_ENTRY_WARN_ONCE(condition, format...) do {} while (0)
+
+static __always_inline void kentry_cpu_depth_add(unsigned int n) {}
+static void kentry_cpu_depth_check(unsigned int n) {}
+static __always_inline void kentry_cpu_depth_sub(unsigned int n) {}
+
+#endif
+
 /* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
+   kentry_cpu_depth_add(1);
arch_check_user_regs(regs);
lockdep_hardirqs_off(CALLER_ADDR0);
 
@@ -22,6 +78,14 @@ static __always_inline void __enter_from_user_mode(struct 
pt_regs *regs)
 
instrumentation_begin();
trace_hardirqs_off_finish();
+
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(
+   this_cpu_read(kentry_cpu_depth) != 1,
+   "kentry: __enter_from_user_mode() called while kentry thought 
the CPU was in the kernel (%u)",
+   this_cpu_read(kentry_cpu_depth));
+#endif
+
instrumentation_end();
 }
 
@@ -99,6 +163,11 @@ long kentry_syscall_begin(struct pt_regs *regs, long 
syscall)
 static __always_inline void __exit_to_user_mode(void)
 {
instrumentation_begin();
+#ifdef CONFIG_DEBUG_ENTRY
+   DEBUG_ENTRY_WARN_ONCE(this_cpu_read(kentry_cpu_depth) != 1,
+ "__exit_to_user_mode called at wrong kentry cpu 
depth (%u)",
+ this_cpu_read(kentry_cpu_depth));
+#endif
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
instrumentation_end();
@@ -106,6 +175,7 @@ static __always_inline void __exit_to_user_mode(void)
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
+   kentry_cpu_depth_sub(1);
 }
 
 #ifdef CONFIG_ARM64
@@ -360,7 +430,12 @@ noinstr void kentry_exit(struct pt_regs *regs, 
kentry_state_t state)
/* Check whether this returns to user mode */
if (user_mode(regs)) {
kentry_exit_to_user_mode(regs);
-   } else if (!regs_irqs_disabled(regs)) {
+   return;
+   }
+
+   kentry_cpu_depth_check(1);
+
+   if (!regs_irqs_disabled(regs)) {
/*
 * If RCU was not watching on entry this needs to be done
 * carefully and needs the same ordering of lockdep/tracing
@@ -399,6 +474,8 @@ kentry_state_t noinstr kentry_nmi_enter(struct pt_regs 
*regs)
 
irq_state.lockdep = lockdep_hardirqs_enabled();
 
+   kentry_cpu_dep

[PATCH v3 05/11] x86/entry: Convert ret_from_fork to C

2021-03-04 Thread Andy Lutomirski
ret_from_fork is written in asm, slightly differently, for x86_32 and
x86_64.  Convert it to C.

This is a straight conversion without any particular cleverness.  As a
further cleanup, the code that sets up the ret_from_fork argument registers
could be adjusted to put the arguments in the correct registers.

This will cause the ORC unwinder to find pt_regs even for kernel threads on
x86_64.  This seems harmless.

The 32-bit comment above the now-deleted schedule_tail_wrapper was
obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
same problem more cleanly.

Cc: Josh Poimboeuf 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/common.c  | 23 ++
 arch/x86/entry/entry_32.S| 51 +---
 arch/x86/entry/entry_64.S| 33 +
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/kernel/process.c|  2 +-
 arch/x86/kernel/process_32.c |  2 +-
 arch/x86/kernel/unwind_orc.c |  2 +-
 7 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 95776f16c1cb..ef1c65938a6b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+void ret_from_fork(struct task_struct *prev,
+  int (*kernel_thread_fn)(void *),
+  void *kernel_thread_arg,
+  struct pt_regs *user_regs);
+
+__visible void noinstr ret_from_fork(struct task_struct *prev,
+int (*kernel_thread_fn)(void *),
+void *kernel_thread_arg,
+struct pt_regs *user_regs)
+{
+   instrumentation_begin();
+
+   schedule_tail(prev);
+
+   if (kernel_thread_fn) {
+   kernel_thread_fn(kernel_thread_arg);
+   user_regs->ax = 0;
+   }
+
+   instrumentation_end();
+   syscall_exit_to_user_mode(user_regs);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index df8c017e6161..7113d259727f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -805,26 +805,6 @@ SYM_CODE_START(__switch_to_asm)
 SYM_CODE_END(__switch_to_asm)
 .popsection
 
-/*
- * The unwinder expects the last frame on the stack to always be at the same
- * offset from the end of the page, which allows it to validate the stack.
- * Calling schedule_tail() directly would break that convention because its an
- * asmlinkage function so its argument has to be pushed on the stack.  This
- * wrapper creates a proper "end of stack" frame header before the call.
- */
-.pushsection .text, "ax"
-SYM_FUNC_START(schedule_tail_wrapper)
-   FRAME_BEGIN
-
-   pushl   %eax
-   callschedule_tail
-   popl%eax
-
-   FRAME_END
-   ret
-SYM_FUNC_END(schedule_tail_wrapper)
-.popsection
-
 /*
  * A newly forked process directly context switches into this address.
  *
@@ -833,29 +813,14 @@ SYM_FUNC_END(schedule_tail_wrapper)
  * edi: kernel thread arg
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-   callschedule_tail_wrapper
-
-   testl   %ebx, %ebx
-   jnz 1f  /* kernel threads are uncommon */
-
-2:
-   /* When we fork, we trace the syscall return in the child, too. */
-   movl%esp, %eax
-   callsyscall_exit_to_user_mode
-   jmp .Lsyscall_32_done
-
-   /* kernel thread */
-1: movl%edi, %eax
-   CALL_NOSPEC ebx
-   /*
-* A kernel thread is allowed to return here after successfully
-* calling kernel_execve().  Exit to userspace to complete the execve()
-* syscall.
-*/
-   movl$0, PT_EAX(%esp)
-   jmp 2b
-SYM_CODE_END(ret_from_fork)
+SYM_CODE_START(asm_ret_from_fork)
+   movl%ebx, %edx
+   movl%edi, %ecx
+   pushl   %esp
+   callret_from_fork
+   addl$4, %esp
+   jmp .Lsyscall_32_done
+SYM_CODE_END(asm_ret_from_fork)
 .popsection
 
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..0f7df8861ac1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -273,35 +273,18 @@ SYM_FUNC_END(__switch_to_asm)
  * rax: prev task we switched from
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
+ * rbp: encoded frame pointer for the fp unwinder
  */
 .pushsection .text, "ax"
-SYM_CODE_START(ret_from_fork)
-   UNWIND_HINT_EMPTY
-   movq%rax, %rdi
-   callschedule_tail   /* rdi: 'prev' task parameter */
-
-   testq   %rbx, %rbx  /* from kernel_thread? */
-   jnz 1f  /* kernel threads are uncommon 

[PATCH v3 06/11] kentry: Simplify the common syscall API

2021-03-04 Thread Andy Lutomirski
The new common syscall API had a large and confusing API surface.  Simplify
it.  Now there is exactly one way to use it.  It's a bit more verbose than
the old way for the simple x86_64 native case, but it's much easier to use
right, and the diffstat should speak for itself.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/common.c  | 57 +++-
 include/linux/entry-common.h | 86 ++--
 kernel/entry/common.c| 57 +++-
 3 files changed, 54 insertions(+), 146 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ef1c65938a6b..8710b2300b8d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,9 +38,12 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
-   nr = syscall_enter_from_user_mode(regs, nr);
-
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
instrumentation_begin();
+
+   nr = kentry_syscall_begin(regs, nr);
+
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
@@ -52,8 +55,12 @@ __visible noinstr void do_syscall_64(unsigned long nr, 
struct pt_regs *regs)
regs->ax = x32_sys_call_table[nr](regs);
 #endif
}
+
+   kentry_syscall_end(regs);
+
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
 }
 #endif
 
@@ -83,33 +90,34 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs 
*regs)
 {
unsigned int nr = syscall_32_enter(regs);
 
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
+   instrumentation_begin();
+
/*
 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 * orig_ax, the unsigned int return value truncates it.  This may
 * or may not be necessary, but it matches the old asm behavior.
 */
-   nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
-   instrumentation_begin();
-
+   nr = (unsigned int)kentry_syscall_begin(regs, nr);
do_syscall_32_irqs_on(regs, nr);
+   kentry_syscall_end(regs);
 
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
 }
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
unsigned int nr = syscall_32_enter(regs);
+   bool ret;
int res;
 
-   /*
-* This cannot use syscall_enter_from_user_mode() as it has to
-* fetch EBP before invoking any of the syscall entry work
-* functions.
-*/
-   syscall_enter_from_user_mode_prepare(regs);
-
+   kentry_enter_from_user_mode(regs);
+   local_irq_enable();
instrumentation_begin();
+
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
@@ -126,21 +134,23 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs 
*regs)
if (res) {
/* User code screwed up. */
regs->ax = -EFAULT;
-
-   instrumentation_end();
-   local_irq_disable();
-   irqentry_exit_to_user_mode(regs);
-   return false;
+   ret = false;
+   goto out;
}
 
/* The case truncates any ptrace induced syscall nr > 2^32 -1 */
-   nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+   nr = (unsigned int)kentry_syscall_begin(regs, nr);
 
/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
 
+   kentry_syscall_end(regs);
+   ret = true;
+
+out:
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(regs);
+   kentry_exit_to_user_mode(regs);
return true;
 }
 
@@ -233,8 +243,11 @@ __visible void noinstr ret_from_fork(struct task_struct 
*prev,
user_regs->ax = 0;
}
 
+   kentry_syscall_end(user_regs);
+
+   local_irq_disable();
instrumentation_end();
-   syscall_exit_to_user_mode(user_regs);
+   kentry_exit_to_user_mode(user_regs);
 }
 
 #ifdef CONFIG_XEN_PV
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fd2d7c35670a..5287c6c15a66 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -119,31 +119,12 @@ static inline __must_check int 
arch_syscall_enter_tracehook(struct pt_regs *regs
 void enter_from_user_mode(struct pt_regs *regs);
 
 /**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs:  Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function
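
For reference, the calling pattern the simplified API expects in an arch syscall
entry path, distilled from the do_syscall_64() hunk above (a sketch; the
function name and the dispatch step are placeholders):

	__visible noinstr void arch_do_syscall(unsigned long nr, struct pt_regs *regs)
	{
		kentry_enter_from_user_mode(regs);	/* noinstr: establish kernel state */
		local_irq_enable();
		instrumentation_begin();

		nr = kentry_syscall_begin(regs, nr);	/* entry work: seccomp/ptrace/audit */
		/* ... dispatch through the syscall table ... */
		kentry_syscall_end(regs);		/* exit work: ptrace/audit/tracepoints */

		local_irq_disable();
		instrumentation_end();
		kentry_exit_to_user_mode(regs);		/* back to noinstr, return to user */
	}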

[PATCH v3 03/11] x86/dumpstack: Remove unnecessary range check fetching opcode bytes

2021-03-04 Thread Andy Lutomirski
copy_from_user_nmi() validates that the pointer is in the user range,
so there is no need for an extra check in copy_code().

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/dumpstack.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 299c20f0a38b..55cf3c8325c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -81,12 +81,6 @@ static int copy_code(struct pt_regs *regs, u8 *buf, unsigned 
long src,
/* The user space code from other tasks cannot be accessed. */
if (regs != task_pt_regs(current))
return -EPERM;
-   /*
-* Make sure userspace isn't trying to trick us into dumping kernel
-* memory by pointing the userspace instruction pointer at it.
-*/
-   if (__chk_range_not_ok(src, nbytes, TASK_SIZE_MAX))
-   return -EINVAL;
 
/*
 * Even if named copy_from_user_nmi() this can be invoked from
-- 
2.29.2



[PATCH v3 02/11] kentry: Rename irqentry to kentry

2021-03-04 Thread Andy Lutomirski
The common entry functions are mostly named irqentry, and this is
confusing.  They are used for syscalls, exceptions, NMIs and, yes, IRQs.
Call them kentry instead, since they are supposed to be usable for any
entry to the kernel.

This patch doesn't touch the .irqentry section -- someone can contemplate
changing that later.  That code predates the common entry code.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/common.c |  8 ++---
 arch/x86/include/asm/idtentry.h | 28 +++
 arch/x86/kernel/cpu/mce/core.c  | 10 +++---
 arch/x86/kernel/kvm.c   |  6 ++--
 arch/x86/kernel/nmi.c   |  6 ++--
 arch/x86/kernel/traps.c | 28 +++
 arch/x86/mm/fault.c |  6 ++--
 include/linux/entry-common.h| 60 -
 kernel/entry/common.c   | 26 +++---
 9 files changed, 89 insertions(+), 89 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8fdb4cb27efe..95776f16c1cb 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -264,9 +264,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   kentry_state_t state;
 
-   state = irqentry_enter(regs);
+   state = kentry_enter(regs);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -278,11 +278,11 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
instrumentation_begin();
-   irqentry_exit_cond_resched();
+   kentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   kentry_exit(regs, state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index f656aabd1545..3ac805d24289 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -40,8 +40,8 @@
  * The macro is written so it acts as function definition. Append the
  * body with a pair of curly brackets.
  *
- * irqentry_enter() contains common code which has to be invoked before
- * arbitrary code in the body. irqentry_exit() contains common code
+ * kentry_enter() contains common code which has to be invoked before
+ * arbitrary code in the body. kentry_exit() contains common code
  * which has to run before returning to the low level assembly code.
  */
 #define DEFINE_IDTENTRY(func)  \
@@ -49,12 +49,12 @@ static __always_inline void __##func(struct pt_regs *regs); 
\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   kentry_state_t state = kentry_enter(regs);  \
\
instrumentation_begin();\
__##func (regs);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   kentry_exit(regs, state);   \
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +96,12 @@ static __always_inline void __##func(struct pt_regs *regs,  
\
 __visible noinstr void func(struct pt_regs *regs,  \
unsigned long error_code)   \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   kentry_state_t state = kentry_enter(regs);  \
\
instrumentation_begin();\
__##func (regs, error_code);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   kentry_exit(regs, state);   \
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs, \
@@ -156,
