Re: [RFC][PATCH] objtool: STAC/CLAC validation

2019-02-22 Thread hpa
On February 22, 2019 2:26:35 PM PST, Peter Zijlstra  
wrote:
>On Fri, Feb 22, 2019 at 07:10:34PM +0100, Thomas Gleixner wrote:
>
>> But correct :)
>
>> I agree, that a function which is doing the actual copy should be
>callable,
>> but random other functions? NO!
>
>So find the below patch -- which spotted fail in ptrace.c
>
>It has an AC_SAFE(func) annotation which allows marking specific
>functions as safe to call. The patch includes 2 instances which were
>required to make arch/x86 'build':
>
>arch/x86/ia32/ia32_signal.o: warning: objtool: ia32_restore_sigcontext()+0x3d: call to native_load_gs_index() with AC set
>arch/x86/kernel/ptrace.o: warning: objtool: genregs_get()+0x8e: call to getreg() with AC set
>
>It also screams (provided one has CONFIG_FUNCTION_TRACER=y) about the
>lack of notrace annotations on functions marked AC_SAFE():
>
>arch/x86/kernel/ptrace.o: warning: objtool: getreg()+0x0: call to __fentry__() with AC set
>
>It builds arch/x86 relatively clean; it only complains about some
>redundant CLACs in entry_64.S because it doesn't understand interrupts
>and I've not bothered creating an annotation for them yet.
>
>arch/x86/entry/entry_64.o: warning: objtool: .altinstr_replacement+0x4d: redundant CLAC
>arch/x86/entry/entry_64.o: warning: objtool: .altinstr_replacement+0x5a: redundant CLAC
>  ...
>arch/x86/entry/entry_64.o: warning: objtool: .altinstr_replacement+0xb1: redundant CLAC
>
>Also, I realized we don't need special annotations for preempt_count;
>preempt_disable() emits a CALL instruction which should readily trigger
>the warnings added here.
>
>The VDSO thing is a bit of a hack, but I couldn't quickly find anything
>better.
>
>Comments?
>
>---
> arch/x86/include/asm/special_insns.h |  2 ++
> arch/x86/kernel/ptrace.c             |  3 +-
> include/linux/frame.h                | 23 ++
> tools/objtool/arch.h                 |  4 ++-
> tools/objtool/arch/x86/decode.c      | 14 -
> tools/objtool/check.c                | 59
> tools/objtool/check.h                |  3 +-
> tools/objtool/elf.h                  |  1 +
> 8 files changed, 105 insertions(+), 4 deletions(-)
>
>diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
>index 43c029cdc3fe..cd31e4433f4c 100644
>--- a/arch/x86/include/asm/special_insns.h
>+++ b/arch/x86/include/asm/special_insns.h
>@@ -5,6 +5,7 @@
> 
> #ifdef __KERNEL__
> 
>+#include <linux/frame.h>
> #include 
> 
> /*
>@@ -135,6 +136,7 @@ static inline void native_wbinvd(void)
> }
> 
> extern asmlinkage void native_load_gs_index(unsigned);
>+AC_SAFE(native_load_gs_index);
> 
> static inline unsigned long __read_cr4(void)
> {
>diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>index 4b8ee05dd6ad..e278b4055a8b 100644
>--- a/arch/x86/kernel/ptrace.c
>+++ b/arch/x86/kernel/ptrace.c
>@@ -420,7 +420,7 @@ static int putreg(struct task_struct *child,
>   return 0;
> }
> 
>-static unsigned long getreg(struct task_struct *task, unsigned long offset)
>+static notrace unsigned long getreg(struct task_struct *task, unsigned long offset)
> {
>   switch (offset) {
>   case offsetof(struct user_regs_struct, cs):
>@@ -444,6 +444,7 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
> 
>   return *pt_regs_access(task_pt_regs(task), offset);
> }
>+AC_SAFE(getreg);
> 
> static int genregs_get(struct task_struct *target,
>  const struct user_regset *regset,
>diff --git a/include/linux/frame.h b/include/linux/frame.h
>index 02d3ca2d9598..5d354cf42a56 100644
>--- a/include/linux/frame.h
>+++ b/include/linux/frame.h
>@@ -21,4 +21,27 @@
> 
> #endif /* CONFIG_STACK_VALIDATION */
> 
>+#if defined(CONFIG_STACK_VALIDATION) && !defined(BUILD_VDSO) && !defined(BUILD_VDSO32)
>+/*
>+ * This macro marks functions as AC-safe, that is, it is safe to call from
>+ * an EFLAGS.AC enabled region (typically user_access_begin() /
>+ * user_access_end()).
>+ *
>+ * These functions in turn will only call AC-safe functions themselves
>+ * (which precludes tracing, including __fentry__, and scheduling,
>+ * including preempt_enable()).
>+ *
>+ * AC-safe functions will obviously also not change EFLAGS.AC themselves.
>+ *
>+ * Since STAC/CLAC are CPL-0 only, this is all irrelevant for VDSO builds
>+ * (and the generated symbol reference will in fact cause link failures).
>+ */
>+#define AC_SAFE(func) \
>+  static void __used __section(.discard.ac_safe) \
>+  *__func_ac_safe_##func = func
>+
>+#else
>+#define AC_SAFE(func)
>+#endif
>+
> #endif /* _LINUX_FRAME_H */
>diff --git a/tools/objtool/arch.h b/tools/objtool/arch.h
>index b0d7dc3d71b5..48327099466d 100644
>--- a/tools/objtool/arch.h
>+++ b/tools/objtool/arch.h
>@@ -33,7 +33,9 @@
> #define INSN_STACK		8
> #define INSN_BUG		9
> #define INSN_NOP		10
>-#define INSN_OTHER		11
>+#define INSN_STAC		11
>+#define INSN_CLAC
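
To make the annotation concrete, here is a minimal sketch of the kind of call site the checker is meant to accept, assuming the AC_SAFE() macro from the patch above; the helper name and body are made up for illustration:

#include <linux/frame.h>	/* AC_SAFE(), as introduced by the patch */
#include <linux/uaccess.h>

/* Hypothetical helper: notrace, and must itself only call AC-safe code */
static notrace unsigned long mask_low16(unsigned long v)
{
	return v & 0xffff;
}
AC_SAFE(mask_low16);

static unsigned long read_masked(unsigned long __user *uptr)
{
	unsigned long v = 0;

	if (!user_access_begin(uptr, sizeof(*uptr)))
		return 0;
	unsafe_get_user(v, uptr, out);	/* runs with EFLAGS.AC set */
	v = mask_low16(v);		/* allowed: annotated AC_SAFE() */
out:
	user_access_end();
	return v;
}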

Re: [PATCH] sched/x86: Save [ER]FLAGS on context switch

2019-02-19 Thread hpa
On February 19, 2019 1:04:09 AM PST, Peter Zijlstra  
wrote:
>On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote:
>> On 2/16/19 2:30 AM, Peter Zijlstra wrote:
>> > On Fri, Feb 15, 2019 at 08:06:56PM -0800, h...@zytor.com wrote:
>> >> This implies we invoke schedule -- a restricted operation
>(consider
>> >> might_sleep) during execution of STAC-enabled code, but *not* as an
>> >> exception or interrupt, since those preserve the flags.
>> > 
>> > Meet preempt_enable().
>> 
>> I believe this falls under "doctor, it hurts when I do that." And it
>hurts for
>> very good reasons. See below.
>
>I disagree; the basic rule is that if you're preemptible you must also
>be able to schedule and vice-versa. These AC regions violate this.
>
>And, like illustrated, it is _very_ easy to get all sorts inside that
>AC
>crap.
>
>> >> I have serious concerns about this. This is more or less saying
>that
>> >> we have left an unlimited gap and have had AC escape.
>> > 
>> > Yes; by allowing normal C in between STAC and CLAC this is so.
>> > 
>> >> Does *anyone* see a need to allow this? I got a question at LPC
>from
>> >> someone about this, and what they were trying to do once all the
>> >> layers had been unwound was so far down the wrong track for a root
>> >> problem that actually has a very simple solution.
>> > 
>> > Have you read the rest of the thread?
>> > 
>> > All it takes for this to explode is a call to a C function that has
>> > tracing on it in between the user_access_begin() and
>user_access_end()
>> > things. That is a _very_ easy thing to do.
>> > 
>> > Heck, GCC is allowed to generate that broken code compiling
>> > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user
>(esp.
>> > with CONFIG_OPTIMIZE_INLINING), and making that a function call
>would
>> > get us fentry hooks and ftrace and *BOOM*.
>> > 
>> > (Now, of course, its a static function with a single caller, and
>GCC
>> > isn't _THAT_ stupid, but it could, if it wanted to)
>> > 
>> > Since the bar is _that_ low for mistakes, I figure we'd better fix
>it.
>> > 
>> 
>> The question is what "fix it" means. 
>
>It means that when we do schedule, the next task doesn't have AC set,
>and when we schedule back, we'll restore our AC when we had it. Unlike
>the current situation, where the next task will run with AC set.
>
>IOW, it confines AC to the task context where it was set.
>
>> I'm really concerned about AC escapes,
>> and everyone else should be, too.
>
>Then _that_ should be asserted.
>
>> For an issue specific to tracing, it would be more appropriate to do
>the
>> appropriate things in the tracing entry/exit than in schedule.
>Otherwise, we
>> don't want to silently paper over mistakes which could mean that we
>run a
>> large portion of the kernel without protection we thought we had.
>> 
>> In that sense, calling schedule() with AC set is in fact a great
>place to have
>> a WARN() or even BUG(), because such an event really could imply that
>the
>> kernel has been security compromised.
>
>It is not specific to tracing, tracing is just one of the most trivial
>and non-obvious ways to make it go splat.
>
>There's lot of fairly innocent stuff that does preempt_disable() /
>preempt_enable(). And just a warning in schedule() isn't sufficient,
>you'd have to actually trigger a reschedule before you know your code
>is
>bad.
>
>And like I said; the invariant is: if you're preemptible you can
>schedule and v.v.
>
>Now, clearly you don't want to mark these whole regions as
>!preemptible,
>because then you can also not take faults, but somehow you're not
>worried about the whole fault handler, but you are worried about the
>scheduler ?!? How does that work? The fault handler can touch a _ton_
>more code. Heck, the fault handler can schedule.
>
>So either pre-fault, and run the whole AC crap with preemption disabled
>and retry, or accept that we have to schedule.
>
>> Does that make more sense?
>
>It appears to me you're going about it backwards.

I'm not worried about the fault handler, because AC is always 
preserved/disabled on exception entry; otherwise I would of course be very 
concerned.

To be clear: I'm not worried about the scheduler itself. I'm worried about how 
much code we have gone through to get there. Perhaps the scheduler itself is 
not the right point to probe for it, but we do need to catch things that have 
gone wrong, rather than just leaving the door wide open.

I would personally be far more comfortable saying you have to disable 
preemption in user access regions. It may be an unnecessary constraint for x86 
and ARM64, but it is safe.

And Julien, yes, it is "somewhat arbitrary", but so are many engineering 
tradeoffs. Not everything has a very sharply definable line.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
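
Concretely, the stricter discipline proposed here would look something like the sketch below; this is illustrative only (current kernels do not mandate it) and assumes the usual user-access helpers:

#include <linux/preempt.h>
#include <linux/types.h>
#include <linux/uaccess.h>

static int put_val(u32 __user *uptr, u32 val)
{
	int err = -EFAULT;

	preempt_disable();	/* no synchronous schedule() while AC is set */
	if (user_access_begin(uptr, sizeof(*uptr))) {
		unsafe_put_user(val, uptr, out);
		err = 0;
out:
		user_access_end();
	}
	preempt_enable();
	return err;
}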


Re: [PATCH] sched/x86: Save [ER]FLAGS on context switch

2019-02-15 Thread hpa
On February 14, 2019 2:14:29 AM PST, Peter Zijlstra  
wrote:
>On Wed, Feb 13, 2019 at 02:49:47PM -0800, Andy Lutomirski wrote:
>
>> Do we need to backport this thing?
>
>Possibly, just to be safe.
>
>> The problem can’t be too widespread or we would have heard of it
>before.
>
>Yes, so far we've been lucky.
>
>---
>Subject: sched/x86: Save [ER]FLAGS on context switch
>From: Peter Zijlstra 
>Date: Thu Feb 14 10:30:52 CET 2019
>
>Effectively revert commit:
>
>  2c7577a75837 ("sched/x86_64: Don't save flags on context switch")
>
>Specifically because SMAP uses FLAGS.AC which invalidates the claim
>that the kernel has clean flags.
>
>In particular; while preemption from interrupt return is fine (the
>IRET frame on the exception stack contains FLAGS) it breaks any code
>that does synchronous scheduling, including preempt_enable().
>
>This has become a significant issue ever since commit:
>
>5b24a7a2aa20 ("Add 'unsafe' user access functions for batched
>accesses")
>
>provided a means of having 'normal' C code between STAC / CLAC,
>exposing the FLAGS.AC state. So far this hasn't led to trouble,
>however fix it before it comes apart.
>
>Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched
>accesses")
>Acked-by: Andy Lutomirski 
>Reported-by: Julien Thierry 
>Signed-off-by: Peter Zijlstra (Intel) 
>---
> arch/x86/entry/entry_32.S|2 ++
> arch/x86/entry/entry_64.S|2 ++
> arch/x86/include/asm/switch_to.h |1 +
> 3 files changed, 5 insertions(+)
>
>--- a/arch/x86/entry/entry_32.S
>+++ b/arch/x86/entry/entry_32.S
>@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
>   pushl   %ebx
>   pushl   %edi
>   pushl   %esi
>+  pushfl
> 
>   /* switch stack */
>   movl%esp, TASK_threadsp(%eax)
>@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
> #endif
> 
>   /* restore callee-saved registers */
>+  popfl
>   popl%esi
>   popl%edi
>   popl%ebx
>--- a/arch/x86/entry/entry_64.S
>+++ b/arch/x86/entry/entry_64.S
>@@ -291,6 +291,7 @@ ENTRY(__switch_to_asm)
>   pushq   %r13
>   pushq   %r14
>   pushq   %r15
>+  pushfq
> 
>   /* switch stack */
>   movq%rsp, TASK_threadsp(%rdi)
>@@ -313,6 +314,7 @@ ENTRY(__switch_to_asm)
> #endif
> 
>   /* restore callee-saved registers */
>+  popfq
>   popq%r15
>   popq%r14
>   popq%r13
>--- a/arch/x86/include/asm/switch_to.h
>+++ b/arch/x86/include/asm/switch_to.h
>@@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
>  * order of the fields must match the code in __switch_to_asm().
>  */
> struct inactive_task_frame {
>+  unsigned long flags;
> #ifdef CONFIG_X86_64
>   unsigned long r15;
>   unsigned long r14;

This implies we invoke schedule -- a restricted operation (consider might_sleep) 
during execution of STAC-enabled code, but *not* as an exception or interrupt, 
since those preserve the flags.

I have serious concerns about this. This is more or less saying that we have 
left an unlimited gap and have had AC escape.

Does *anyone* see a need to allow this? I got a question at LPC from someone 
about this, and what they were trying to do once all the layers had been 
unwound was so far down the wrong track for a root problem that actually has a 
very simple solution.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
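
For reference, the save/restore semantics that the pushfq/popfq pair gives the context switch can be demonstrated with a small user-space analogue; this is a sketch of the mechanism only, not kernel code:

#include <stdio.h>

static unsigned long save_flags(void)
{
	unsigned long f;

	/* same idea as pushfq in __switch_to_asm: capture RFLAGS */
	asm volatile("pushfq; popq %0" : "=r"(f));
	return f;
}

static void restore_flags(unsigned long f)
{
	/* counterpart of popfq on the switch-in path */
	asm volatile("pushq %0; popfq" : : "r"(f) : "cc", "memory");
}

int main(void)
{
	unsigned long f = save_flags();
	/* ... anything here may clobber the arithmetic flags ... */
	restore_flags(f);
	printf("saved RFLAGS = %#lx\n", f);
	return 0;
}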


Re: [PATCH v5 01/13] taint: Introduce a new taint flag (insecure)

2019-02-05 Thread hpa
On February 5, 2019 2:46:11 PM PST, Randy Dunlap  wrote:
>On 2/5/19 1:21 PM, Andrew Morton wrote:
>> On Fri, 1 Feb 2019 18:42:29 -0800 Andy Lutomirski 
>wrote:
>> 
>>> On Fri, Feb 1, 2019 at 12:54 PM Chang S. Bae
> wrote:

 For testing (or root-only) purposes, the new flag will serve to tag
>the
 kernel taint accurately.

 When adding a new feature support, patches need to be incrementally
 applied and tested with temporal parameters. Currently, there is no
>flag
 for this usage.
>>>
>>> I think this should be reviewed by someone like akpm.  akpm, for
>>> background, this is part of an x86 patch series.  If only part of
>the
>>> series is applied, the kernel will be blatantly insecure (but still
>>> functional and useful for testing and bisection), and this taint
>flag
>>> will be set if this kernel is booted.  With the whole series
>applied,
>>> there are no users of the taint flag in the kernel.
>>>
>>> Do you think this is a good idea?
>> 
>> What does "temporal parameters" mean?  A complete description of this
>> testing process would help.
>> 
>> It sounds a bit strange.  You mean it assumes that people will
>partially
>> apply the series to test its functionality?  That would be
>inconvenient.
>
>Ack.  I don't think we need to (or should) worry about that kind of
>muckup.
>
>> - Can the new and now-unused taint flag be removed again at
>>   end-of-series?
>> 
>> - It would be a lot more convenient if we had some means of testing
>>   after the whole series is applied, on a permanent basis - some
>>   debugfs flag, perhaps?
>> 

I would like to see this taint flag, though, because sometimes it is useful to 
write test modules (e.g. when I was testing SMAP) which are dangerous even if 
out of tree.

In case an escape or pilot error gets it into the wrong kernel, it is a very 
good thing to have the kernel flagged.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
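
A test module along those lines might look like the sketch below; TAINT_INSECURE is the flag proposed by this series, not an existing upstream constant:

#include <linux/kernel.h>
#include <linux/module.h>

static int __init smap_test_init(void)
{
	/* Flag the kernel so an escaped test build is identifiable */
	add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);
	pr_warn("smap_test: kernel tainted as insecure\n");
	return 0;
}
module_init(smap_test_init);

MODULE_LICENSE("GPL");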


Re: [PATCH v5 00/13] x86: Enable FSGSBASE instructions

2019-02-04 Thread hpa
On February 1, 2019 6:43:25 PM PST, Andy Lutomirski  wrote:
>Hi hpa-
>
>A while back, you were working on some patches to make modify_ldt()
>play better with this series.  What happened to them?
>
>--Andy

Looks like I need to dig them out...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-25 Thread hpa
On January 25, 2019 11:15:52 AM PST, Daniel Colascione  
wrote:
>On Fri, Jan 25, 2019 at 11:01 AM  wrote:
>>
>> On January 24, 2019 12:59:29 PM PST, Joel Fernandes
> wrote:
>> >On Thu, Jan 24, 2019 at 07:57:26PM +0100, Karim Yaghmour wrote:
>> >>
>> >> On 1/23/19 11:37 PM, Daniel Colascione wrote:
>> >[..]
>> >> > > Personally I advocated a more aggressive approach with Joel in
>> >private:
>> >> > > just put the darn headers straight into the kernel image, it's
>> >the
>> >> > > *only* artifact we're sure will follow the Android device
>> >whatever
>> >> > > happens to it (like built-in ftrace).
>> >> >
>> >> > I was thinking along similar lines. Ordinarily, we make loadable
>> >> > kernel modules. What we kind of want here is a non-loadable
>kernel
>> >> > module --- or a non-loadable section in the kernel image proper.
>> >I'm
>> >> > not familiar with early-stage kernel loader operation: I know
>it's
>> >> > possible to create discardable sections in the kernel image, but
>> >can
>> >> > we create sections that are never slurped into memory in the
>first
>> >> > place? If not, maybe loading and immediately discarding the
>header
>> >> > section is good enough.
>> >>
>> >> Interesting. Maybe just append it to the image but have it not
>loaded
>> >and
>> >> have a kernel parameter that enables a "/proc/kheaders" driver to
>> >know where
>> >> to fetch the appended headers from storage at runtime. There
>would
>> >be no
>> >> RAM loading whatsoever of the headers, just some sort of
>> >> "kheaders=/dev/foobar:offset:size" parameter. If you turn the
>option
>> >on, you
>> >> get a fatter kernel image size to store on permanent storage, but
>no
>> >impact
>> >> on what's loaded at boot time.
>> >
>> >Embedding anything into the kernel image does impact boot time
>though
>> >because
>> >it increase the time spent by bootloader. A module OTOH would not
>have
>> >such
>> >overhead.
>> >
>> >Also a kernel can be booted in any number of ways other than mass
>> >storage so
>> >it is not a generic Linux-wide solution to have a kheaders= option
>like
>> >that.
>> >If the option is forgotten, then the running system can't use the
>> >feature.
>> >The other issue is it requires a kernel command line option /
>> >bootloader
>> >changes for that which adds more configuration burden, which would not be
>> >needed
>> >with a module.
>> >
>> >> > Would such a thing really do better than LZMA? LZMA already has
>> >very
>> >> > clever techniques for eliminating long-range redundancies in
>> >> > compressible text, including redundancies at the sub-byte level.
>I
>> >can
>> >> > certainly understand the benefit of stripping comments, since
>> >removing
>> >> > comments really does decrease the total amount of information
>the
>> >> > compressor has to preserve, but I'm not sure how much the
>encoding
>> >> > scheme you propose below would help, since it reminds me of the
>> >> > encoding scheme that LZMA would discover automatically.
>> >>
>> >> I'm no compression algorithm expert. If you say LZMA would do the
>> >> same/better than what I suggested then I have no reason to contest
>> >that. My
>> >> goal is to see the headers as part of the kernel image that's
>> >distributed on
>> >> devices so that they don't have to be chased around. I'm just
>trying
>> >to make
>> >> it as palatable as possible.
>> >
>> >I believe LZMA is really good at that sort of thing too.
>> >
>> >Also at 3.3MB of module size, I think we are really good size-wise.
>But
>> >Dan
>> >is helping look at possibly reducing further if he gets time. Many
>> >modules in
>> >my experience are much bigger. amdgpu.ko on my Linux machine is
>6.1MB.
>> >
>> >I really think making it a module is the best way to make sure this
>is
>> >bundled with the kernel on the widest number of Android and other
>Linux
>> >systems, without incurring boot time overhead, or any other command
>> >line
>> >configuration burden.
>> >
>> >I spoke to so many people at LPC personally with other kernel
>> >contributors,
>> >and many folks told me one word - MODULE :D.  Even though I
>hesitated
>> >at
>> >first, now it seems the right solution.
>> >
>> >If no one seriously objects, I'll clean this up and post a v2 and
>with
>> >the
>> >RFC tag taken off. Thank you!
>> >
>> > - Joel
>>
>> So let me throw in a different notion.
>>
>> A kernel module really is nothing other than a kernel build system
>artifact stored in the filesystem.
>>
>> I really don't see any reason whatsoever why this is different from just
>producing an archive and putting it in the module directory, except
>that the latter is far simpler.
>>
>> I see literally *no* problem, social or technical, you are solving by
>actually making it a kernel ELF object.
>
>Joel does have a point. Suppose we're on Android and we're testing a
>random local kernel we've built. We can command the device to boot
>into the bootloader, then send our locally-built kernel to the device
>with "fastboot boot mykernelz". Having booted the device this 

Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-25 Thread hpa
On January 24, 2019 12:59:29 PM PST, Joel Fernandes  
wrote:
>On Thu, Jan 24, 2019 at 07:57:26PM +0100, Karim Yaghmour wrote:
>> 
>> On 1/23/19 11:37 PM, Daniel Colascione wrote:
>[..]
>> > > Personally I advocated a more aggressive approach with Joel in
>private:
>> > > just put the darn headers straight into the kernel image, it's
>the
>> > > *only* artifact we're sure will follow the Android device
>whatever
>> > > happens to it (like built-in ftrace).
>> > 
>> > I was thinking along similar lines. Ordinarily, we make loadable
>> > kernel modules. What we kind of want here is a non-loadable kernel
>> > module --- or a non-loadable section in the kernel image proper.
>I'm
>> > not familiar with early-stage kernel loader operation: I know it's
>> > possible to create discardable sections in the kernel image, but
>can
>> > we create sections that are never slurped into memory in the first
>> > place? If not, maybe loading and immediately discarding the header
>> > section is good enough.
>> 
>> Interesting. Maybe just append it to the image but have it not loaded
>and
>> have a kernel parameter that enables a "/proc/kheaders" driver to
>know where
>> to fetch the appended headers from storage at runtime. There would
>be no
>> RAM loading whatsoever of the headers, just some sort of
>> "kheaders=/dev/foobar:offset:size" parameter. If you turn the option
>on, you
>> get a fatter kernel image size to store on permanent storage, but no
>impact
>> on what's loaded at boot time.
>
>Embedding anything into the kernel image does impact boot time though
>because
>it increase the time spent by bootloader. A module OTOH would not have
>such
>overhead.
>
>Also a kernel can be booted in any number of ways other than mass
>storage so
>it is not a generic Linux-wide solution to have a kheaders= option like
>that.
>If the option is forgotten, then the running system can't use the
>feature.
>The other issue is it requires a kernel command line option /
>bootloader
>changes for that which adds more configuration burden, which would not be
>needed
>with a module.
>
>> > Would such a thing really do better than LZMA? LZMA already has
>very
>> > clever techniques for eliminating long-range redundancies in
>> > compressible text, including redundancies at the sub-byte level. I
>can
>> > certainly understand the benefit of stripping comments, since
>removing
>> > comments really does decrease the total amount of information the
>> > compressor has to preserve, but I'm not sure how much the encoding
>> > scheme you propose below would help, since it reminds me of the
>> > encoding scheme that LZMA would discover automatically.
>> 
>> I'm no compression algorithm expert. If you say LZMA would do the
>> same/better than what I suggested then I have no reason to contest
>that. My
>> goal is to see the headers as part of the kernel image that's
>distributed on
>> devices so that they don't have to be chased around. I'm just trying
>to make
>> it as palatable as possible.
>
>I believe LZMA is really good at that sort of thing too.
>
>Also at 3.3MB of module size, I think we are really good size-wise. But
>Dan
>is helping look at possibly reducing further if he gets time. Many
>modules in
>my experience are much bigger. amdgpu.ko on my Linux machine is 6.1MB.
>
>I really think making it a module is the best way to make sure this is
>bundled with the kernel on the widest number of Android and other Linux
>systems, without incurring boot time overhead, or any other command
>line
>configuration burden.
>
>I spoke to so many people at LPC personally with other kernel
>contributors,
>and many folks told me one word - MODULE :D.  Even though I hesitated
>at
>first, now it seems the right solution.
>
>If no one seriously objects, I'll clean this up and post a v2 and with
>the
>RFC tag taken off. Thank you!
>
> - Joel

So let me throw in a different notion.

A kernel module really is nothing other than a kernel build system artifact 
stored in the filesystem.

I really don't see any reason whatsoever why this is different from just producing 
an archive and putting it in the module directory, except that the latter is 
far simpler.

I see literally *no* problem, social or technical, you are solving by actually 
making it a kernel ELF object.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-21 Thread hpa
On January 20, 2019 8:10:03 AM PST, Joel Fernandes  
wrote:
>On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote:
>> On January 19, 2019 2:36:06 AM PST, Greg KH
> wrote:
>> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig wrote:
>> >> This seems like a pretty horrible idea and waste of kernel memory.
>> >
>> >It's only a waste if you want it to be a waste, i.e. if you load the
>> >kernel module.
>> >
>> >This really isn't any different from how /proc/config.gz works.
>> >
>> >> Just add support to kbuild to store a compressed archive in
>initramfs
>> >> and unpack it in the right place.
>> >
>> >I think the issue is that some devices do not use initramfs, or
>switch
>> >away from it after init happens or something like that.  Joel has
>all
>> >of
>> >the looney details that he can provide.
>> >
>> >thanks,
>> >
>> >greg k-h
>> 
>> Yeah, well... but it is kind of a losing game... the more in-kernel
>stuff there is the less likely things are to actually be supported.
>
>It is better than nothing, and if this makes things a bit easier and
>solves
>real-world issues people have been having, and is optional, then I
>don't see
>why not.
>
>> Modularizing is it in some ways even crazier in the sense that at
>that point you are relying on the filesystem containing the module,
>which has to be loaded into the kernel by a root user. One could even
>wonder if a better way to do this would be to have "make
>modules_install" park an archive file – or even a directory as opposed
>to a symlink – with this stuff in /lib/modules. We could even provide a
>tmpfs shim which autoloads such an archive via the firmware loader;
>this might even be generically useful, who knows.
>
>All this seems to assume where the modules are located. In Android, we
>don't
>have /lib/modules. This patch generically fits into the grand scheme
>things
>and I think is just better made a part of the kernel since it is not
>that
>huge once compressed, as Dan also pointed. The more complex, and the
>more
>assumptions we make, the less likely people writing tools will get it
>right
>and be able to easily use it.
>
>> 
>> Note also that initramfs contents can be built into the kernel.
>Extracting such content into a single-instance tmpfs would again be a
>possibility
>
>Such an approach would bloat the kernel image size though, which may
>not work
>for everyone. The module based approach, on the other hand, gives an
>option
>to the user to enable the feature, but not have it loaded into memory
>or used
>until it is really needed.
>
>thanks,
>
> - Joel

Well, where are the modules? They must exist in the filesystem.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-21 Thread hpa
On January 20, 2019 5:45:53 PM PST, Joel Fernandes  
wrote:
>On Sun, Jan 20, 2019 at 01:58:15PM -0800, h...@zytor.com wrote:
>> On January 20, 2019 8:10:03 AM PST, Joel Fernandes
> wrote:
>> >On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote:
>> >> On January 19, 2019 2:36:06 AM PST, Greg KH
>> > wrote:
>> >> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig
>wrote:
>> >> >> This seems like a pretty horrible idea and waste of kernel
>memory.
>> >> >
>> >> >It's only a waste if you want it to be a waste, i.e. if you load
>the
>> >> >kernel module.
>> >> >
>> >> >This really isn't any different from how /proc/config.gz works.
>> >> >
>> >> >> Just add support to kbuild to store a compressed archive in
>> >initramfs
>> >> >> and unpack it in the right place.
>> >> >
>> >> >I think the issue is that some devices do not use initramfs, or
>> >switch
>> >> >away from it after init happens or something like that.  Joel has
>> >all
>> >> >of
>> >> >the looney details that he can provide.
>> >> >
>> >> >thanks,
>> >> >
>> >> >greg k-h
>> >> 
>> >> Yeah, well... but it is kind of a losing game... the more
>in-kernel
>stuff there is the less likely things are to actually be supported.
>> >
>> >It is better than nothing, and if this makes things a bit easier and
>> >solves
>> >real-world issues people have been having, and is optional, then I
>> >don't see
>> >why not.
>> >
>> >> Modularizing is it in some ways even crazier in the sense that at
>> >that point you are relying on the filesystem containing the module,
>> >which has to be loaded into the kernel by a root user. One could
>even
>> >wonder if a better way to do this would be to have "make
>> >modules_install" park an archive file – or even a directory as
>opposed
>> >to a symlink – with this stuff in /lib/modules. We could even
>provide a
>> >tmpfs shim which autoloads such an archive via the firmware loader;
>> >this might even be generically useful, who knows.
>> >
>> >All this seems to assume where the modules are located. In Android,
>we
>> >don't
>> >have /lib/modules. This patch generically fits into the grand scheme of
>> >things
>> >and I think is just better made a part of the kernel since it is not
>> >that
>> >huge once compressed, as Dan also pointed out. The more complex, and the
>> >more
>> >assumptions we make, the less likely people writing tools will get
>it
>> >right
>> >and be able to easily use it.
>> >
>> >> 
>> >> Note also that initramfs contents can be built into the kernel.
>> >Extracting such content into a single-instance tmpfs would again be
>a
>> >possibility
>> >
>> >Such an approach would bloat the kernel image size though, which may
>> >not work
>> >for everyone. The module based approach, on the other hand, gives an
>> >option
>> >to the user to enable the feature, but not have it loaded into
>memory
>> >or used
>> >until it is really needed.
>> >
>> >thanks,
>> >
>> > - Joel
>> 
>> Well, where are the modules? They must exist in the filesystem.
>
>The scheme of loading a module doesn't depend on _where_ the module is
>on the
>filesystem. As long as a distro knows how to load a module in its own
>way (by
>looking into whichever paths it cares about), that's all that matters. 
>And
>the module contains compressed headers which saves space, vs storing it
>uncompressed on the file system.
>
>To remove complete reliance on the filesystem, there is an option of
>not
>building it as a module, and making it a built-in.
>
>I think I see your point now - you're saying if its built-in, then it
>becomes kernel memory that is lost and unswappable. Did I get that
>right?
>I am saying that if that's a major concern, then:
>1. Don't make it a built-in, make it a module.
>2. Don't enable it at for your distro, and use a linux-headers package
>or
>whatever else you have been using so far that works for you.
>
>thanks,
>
> - Joel

My point is that if we're going to actually solve a problem, we need to make it 
so that the distro won't just disable it anyway, and it ought to be something 
scalable; otherwise nothing is gained.

I am *not* disagreeing with the problem statement!

Now, /proc isn't something that will autoload modules. A filesystem *will*, 
although you need to be able to mount it; furthermore, it makes it trivial to 
extend it (and the firmware interface provides an easy way to feed the data 
to such a filesystem without having to muck with anything magic.)

Heck, we could even make it a squashfs image that can just be mounted.

So, first of all, where does Android keep its modules, and what is actually 
included? Is /sbin/modprobe used to load the modules, as is normal? We might 
even be able to address this with some fairly trivial enhancements to modprobe; 
specifically to search in the module paths for something that isn't a module 
per se.

The best scenario would be if we could simply have the tools find the location 
equivalent of /lib/modules/$version/source...
-- 
Sent from my Android device 

Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-19 Thread hpa
On January 19, 2019 2:36:06 AM PST, Greg KH  wrote:
>On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig wrote:
>> This seems like a pretty horrible idea and waste of kernel memory.
>
>It's only a waste if you want it to be a waste, i.e. if you load the
>kernel module.
>
>This really isn't any different from how /proc/config.gz works.
>
>> Just add support to kbuild to store a compressed archive in initramfs
>> and unpack it in the right place.
>
>I think the issue is that some devices do not use initramfs, or switch
>away from it after init happens or something like that.  Joel has all
>of
>the looney details that he can provide.
>
>thanks,
>
>greg k-h

Yeah, well... but it is kind of a losing game... the more in-kernel stuff there 
is the less likely things are to actually be supported.

Modularizing is it in some ways even crazier in the sense that at that point 
you are relying on the filesystem containing the module, which has to be loaded 
into the kernel by a root user. One could even wonder if a better way to do 
this would be to have "make modules_install" park an archive file – or even a 
directory as opposed to a symlink – with this stuff in /lib/modules. We could 
even provide a tmpfs shim which autoloads such an archive via the firmware 
loader; this might even be generically useful, who knows.

Note also that initramfs contents can be built into the kernel. Extracting such 
content into a single-instance tmpfs would again be a possibility.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-19 Thread hpa
On January 19, 2019 3:25:03 PM PST, Joel Fernandes  
wrote:
>On Sat, Jan 19, 2019 at 12:43:35PM -0500, Daniel Colascione wrote:
>> On Sat, Jan 19, 2019 at 11:27 AM Joel Fernandes
> wrote:
>> >
>> > On Sat, Jan 19, 2019 at 09:25:32AM +0100, Greg KH wrote:
>> > > On Fri, Jan 18, 2019 at 05:55:43PM -0500, Joel Fernandes wrote:
>> > > > --- /dev/null
>> > > > +++ b/kernel/kheaders.c
>> 
>> Thanks a ton for this work. It'll make it much easier to do cool
>> things with BPF.
>
>You're welcome, thanks.
>
>> One question: I can imagine wanting to probe
>> structures that are defined, not in headers, but in random
>> implementation files. Would it be possible to optionally include
>*all*
>> kernel source files?
>
>That would be prohibitively large to justify keeping it in memory,
>even
>compressed. Arguably, such structures should be moved into include/ if
>modules or whatever is extending the kernel like eBPF needs them.
>
>> If not, what about a hash, so we could at least
>> do precise correlation between a candidate local tree and what's
>> actually on device?
>
>That would make a tool too difficult to write, wouldn't it, since then
>you have to
>correlate every possible hash and keep updating eBPF tools with new
>hashes -
>probably not scalable. I think what you want is to use the kernel
>version to
>assume what such internal structures look like although that's still
>not
>robust.
>
>> BTW, I'm not sure that the magic constants you've defined are long
>> enough.  I'd feel more comfortable with two UUIDs (16 bytes each).
>
>Ok, I'll expand it.
>
>> I'd also strongly consider LZMA compression: xz -9 on the kernel
>> headers (with comments) brings the size down to 5MB, compared to the
>> 7MB I get for gzip -9. Considering that this feature is optional, I
>> think it's okay to introduce a dependency on widespread modern
>> compression tools. (For comparison, bzip2 -9 gets us 6MB.)
>
>Ok, I'll look into LZMA. Thanks for checking the compression sizes.
>
>- Joel

Don't use lzma, use xz if you are going to do something.

However, it seems unlikely to me that someone not willing to spend the space in 
the filesystem will spend unswappable kernel memory.

It would seem that a far saner way to do this is to use inittmpfs or perhaps an 
auxiliary "ktmpfs" so it can at least be swapped out if you have swap.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
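
For context, the /proc/config.gz-style shape that keeps coming up in this thread would look roughly like this as a module; a sketch only, with kheaders_data/kheaders_size standing in for a build-time generated blob, and using the file_operations-based proc_create() of this era:

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/proc_fs.h>

/* Stand-ins for a blob generated at build time (assumed, not real names) */
extern const char kheaders_data[];
extern const unsigned long kheaders_size;

static ssize_t kheaders_read(struct file *file, char __user *buf,
			     size_t len, loff_t *off)
{
	/* serve the embedded archive read-only, like /proc/config.gz */
	return simple_read_from_buffer(buf, len, off,
				       kheaders_data, kheaders_size);
}

static const struct file_operations kheaders_fops = {
	.read = kheaders_read,
};

static int __init kheaders_init(void)
{
	if (!proc_create("kheaders.tar.xz", 0444, NULL, &kheaders_fops))
		return -ENOMEM;
	return 0;
}
module_init(kheaders_init);

MODULE_LICENSE("GPL");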


Re: [PATCH 01/17] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

2019-01-17 Thread hpa
On January 17, 2019 2:39:15 PM PST, Nadav Amit  wrote:
>> On Jan 17, 2019, at 1:15 PM, h...@zytor.com wrote:
>> 
>> On January 16, 2019 10:47:01 PM PST, Masami Hiramatsu
> wrote:
>>> On Wed, 16 Jan 2019 16:32:43 -0800
>>> Rick Edgecombe  wrote:
>>> 
 From: Nadav Amit 
 
 text_mutex is currently expected to be held before text_poke() is
 called, but kgdb does not take the mutex, and instead
>*supposedly*
 ensures the lock is not taken and will not be acquired by any other
>>> core
 while text_poke() is running.
 
 The reason for the "supposedly" comment is that it is not entirely
>>> clear
 that this would be the case if gdb_do_roundup is zero.
 
 This patch creates two wrapper functions, text_poke() and
 text_poke_kgdb() which do or do not run the lockdep assertion
 respectively.
 
 While we are at it, change the return code of text_poke() to
>>> something
 meaningful. One day, callers might actually respect it and the
>>> existing
 BUG_ON() when patching fails could be removed. For kgdb, the return
 value can actually be used.
>>> 
>>> Looks good to me.
>>> 
>>> Reviewed-by: Masami Hiramatsu 
>>> 
>>> Thank you,
>>> 
 Cc: Andy Lutomirski 
 Cc: Kees Cook 
 Cc: Dave Hansen 
 Cc: Masami Hiramatsu 
 Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex
>in
>>> text_poke*()")
 Suggested-by: Peter Zijlstra 
 Acked-by: Jiri Kosina 
 Signed-off-by: Nadav Amit 
 Signed-off-by: Rick Edgecombe 
 ---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 52
 arch/x86/kernel/kgdb.c               | 11 +++---
 3 files changed, 45 insertions(+), 19 deletions(-)
 
 diff --git a/arch/x86/include/asm/text-patching.h
>>> b/arch/x86/include/asm/text-patching.h
 index e85ff65c43c3..f8fc8e86cf01 100644
 --- a/arch/x86/include/asm/text-patching.h
 +++ b/arch/x86/include/asm/text-patching.h
 @@ -35,6 +35,7 @@ extern void *text_poke_early(void *addr, const
>void
>>> *opcode, size_t len);
  * inconsistent instruction while you patch.
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 +extern void *text_poke_kgdb(void *addr, const void *opcode, size_t
>>> len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t
>>> len, void *handler);
 extern int after_bootmem;
 diff --git a/arch/x86/kernel/alternative.c
>>> b/arch/x86/kernel/alternative.c
 index ebeac487a20c..c6a3a10a2fd5 100644
 --- a/arch/x86/kernel/alternative.c
 +++ b/arch/x86/kernel/alternative.c
 @@ -678,18 +678,7 @@ void *__init_or_module text_poke_early(void
>>> *addr, const void *opcode,
 return addr;
 }
 
 -/**
 - * text_poke - Update instructions on a live kernel
 - * @addr: address to modify
 - * @opcode: source of the copy
 - * @len: length to copy
 - *
 - * Only atomic text poke/set should be allowed when not doing
>early
>>> patching.
 - * It means the size must be writable atomically and the address
>>> must be aligned
 - * in a way that permits an atomic write. It also makes sure we
>fit
>>> on a single
 - * page.
 - */
 -void *text_poke(void *addr, const void *opcode, size_t len)
 +static void *__text_poke(void *addr, const void *opcode, size_t
>len)
 {
unsigned long flags;
char *vaddr;
 @@ -702,8 +691,6 @@ void *text_poke(void *addr, const void *opcode,
>>> size_t len)
  */
BUG_ON(!after_bootmem);
 
 -  lockdep_assert_held(&text_mutex);
 -
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 @@ -732,6 +719,43 @@ void *text_poke(void *addr, const void
>*opcode,
>>> size_t len)
 return addr;
 }
 
 +/**
 + * text_poke - Update instructions on a live kernel
 + * @addr: address to modify
 + * @opcode: source of the copy
 + * @len: length to copy
 + *
 + * Only atomic text poke/set should be allowed when not doing
>early
>>> patching.
 + * It means the size must be writable atomically and the address
>>> must be aligned
 + * in a way that permits an atomic write. It also makes sure we
>fit
>>> on a single
 + * page.
 + */
 +void *text_poke(void *addr, const void *opcode, size_t len)
 +{
 +  lockdep_assert_held(&text_mutex);
 +
 +  return __text_poke(addr, opcode, len);
 +}
 +
 +/**
 + * text_poke_kgdb - Update instructions on a live kernel by kgdb
 + * @addr: address to modify
 + * @opcode: source of the copy
 + * @len: length to copy
 + *
 + * Only atomic text poke/set should be allowed when not doing
>early
>>> patching.
 + * It means the size must be writable atomically 

Re: [PATCH 06/17] x86/alternative: use temporary mm for text poking

2019-01-17 Thread hpa
On January 17, 2019 1:43:54 PM PST, Nadav Amit  wrote:
>> On Jan 17, 2019, at 12:47 PM, Andy Lutomirski 
>wrote:
>> 
>> On Thu, Jan 17, 2019 at 12:27 PM Andy Lutomirski 
>wrote:
>>> On Wed, Jan 16, 2019 at 4:33 PM Rick Edgecombe
>>>  wrote:
 From: Nadav Amit 
 
 text_poke() can potentially compromise the security as it sets
>temporary
 PTEs in the fixmap. These PTEs might be used to rewrite the kernel
>code
 from other cores accidentally or maliciously, if an attacker gains
>the
 ability to write onto kernel memory.
>>> 
>>> i think this may be sufficient, but barely.
>>> 
 +   pte_clear(poking_mm, poking_addr, ptep);
 +
 +   /*
 +* __flush_tlb_one_user() performs a redundant TLB flush
>when PTI is on,
 +* as it also flushes the corresponding "user" address
>spaces, which
 +* does not exist.
 +*
 +* Poking, however, is already very inefficient since it
>does not try to
 +* batch updates, so we ignore this problem for the time
>being.
 +*
 +* Since the PTEs do not exist in other kernel
>address-spaces, we do
 +* not use __flush_tlb_one_kernel(), which when PTI is on
>would cause
 +* more unwarranted TLB flushes.
 +*
 +* There is a slight anomaly here: the PTE is a
>supervisor-only and
 +* (potentially) global and we use __flush_tlb_one_user()
>but this
 +* should be fine.
 +*/
 +   __flush_tlb_one_user(poking_addr);
 +   if (cross_page_boundary) {
 +   pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep
>+ 1);
 +   __flush_tlb_one_user(poking_addr + PAGE_SIZE);
 +   }
>>> 
>>> In principle, another CPU could still have the old translation. 
>Your
>>> mutex probably makes this impossible, but it makes me nervous.
>>> Ideally you'd use flush_tlb_mm_range(), but I guess you can't do
>that
>>> with IRQs off.  Hmm.  I think you should add an inc_mm_tlb_gen()
>here.
>>> Arguably, if you did that, you could omit the flushes, but maybe
>>> that's silly.
>>> 
>>> If we start getting new users of use_temporary_mm(), we should give
>>> some serious thought to the SMP semantics.
>>> 
>>> Also, you're using PAGE_KERNEL.  Please tell me that the global bit
>>> isn't set in there.
>> 
>> Much better solution: do unuse_temporary_mm() and *then*
>> flush_tlb_mm_range().  This is entirely non-sketchy and should be
>just
>> about optimal, too.
>
>This solution sounds nice and clean. The fact the global-bit was set
>didn’t
>matter before (since __flush_tlb_one_user would get rid of it no matter
>what), but would matter now, so I’ll change it too.
>
>Thanks!
>
>Nadav

You can just disable the global bit at the top level, obviously.

This approach also should make it far easier to do batching if desired.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
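
A sketch of the resulting ordering, using helpers assumed from this patch set (poking_mm, poking_addr, unuse_temporary_mm(), temp_mm_state_t) and the flush_tlb_mm_range() signature with stride_shift/freed_tables assumed for kernels of this era:

static void poke_cleanup(pte_t *ptep, bool cross_page_boundary,
			 temp_mm_state_t prev)
{
	pte_clear(poking_mm, poking_addr, ptep);
	if (cross_page_boundary)
		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);

	/* Switch away first, then do one ranged flush: no per-PTE
	 * __flush_tlb_one_user() calls, and no stale-translation window. */
	unuse_temporary_mm(prev);
	flush_tlb_mm_range(poking_mm, poking_addr,
			   poking_addr + 2 * PAGE_SIZE, PAGE_SHIFT, false);
}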


Re: [PATCH 01/17] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

2019-01-17 Thread hpa
On January 16, 2019 10:47:01 PM PST, Masami Hiramatsu  
wrote:
>On Wed, 16 Jan 2019 16:32:43 -0800
>Rick Edgecombe  wrote:
>
>> From: Nadav Amit 
>> 
>> text_mutex is currently expected to be held before text_poke() is
>> called, but kgdb does not take the mutex, and instead *supposedly*
>> ensures the lock is not taken and will not be acquired by any other
>core
>> while text_poke() is running.
>> 
>> The reason for the "supposedly" comment is that it is not entirely
>clear
>> that this would be the case if gdb_do_roundup is zero.
>> 
>> This patch creates two wrapper functions, text_poke() and
>> text_poke_kgdb() which do or do not run the lockdep assertion
>> respectively.
>> 
>> While we are at it, change the return code of text_poke() to
>something
>> meaningful. One day, callers might actually respect it and the
>existing
>> BUG_ON() when patching fails could be removed. For kgdb, the return
>> value can actually be used.
>
>Looks good to me.
>
>Reviewed-by: Masami Hiramatsu 
>
>Thank you,
>
>> 
>> Cc: Andy Lutomirski 
>> Cc: Kees Cook 
>> Cc: Dave Hansen 
>> Cc: Masami Hiramatsu 
>> Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in
>text_poke*()")
>> Suggested-by: Peter Zijlstra 
>> Acked-by: Jiri Kosina 
>> Signed-off-by: Nadav Amit 
>> Signed-off-by: Rick Edgecombe 
>> ---
>>  arch/x86/include/asm/text-patching.h |  1 +
>>  arch/x86/kernel/alternative.c        | 52
>>  arch/x86/kernel/kgdb.c               | 11 +++---
>>  3 files changed, 45 insertions(+), 19 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/text-patching.h
>b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..f8fc8e86cf01 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -35,6 +35,7 @@ extern void *text_poke_early(void *addr, const void
>*opcode, size_t len);
>>   * inconsistent instruction while you patch.
>>   */
>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>> +extern void *text_poke_kgdb(void *addr, const void *opcode, size_t
>len);
>>  extern int poke_int3_handler(struct pt_regs *regs);
>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t
>len, void *handler);
>>  extern int after_bootmem;
>> diff --git a/arch/x86/kernel/alternative.c
>b/arch/x86/kernel/alternative.c
>> index ebeac487a20c..c6a3a10a2fd5 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -678,18 +678,7 @@ void *__init_or_module text_poke_early(void
>*addr, const void *opcode,
>>  return addr;
>>  }
>>  
>> -/**
>> - * text_poke - Update instructions on a live kernel
>> - * @addr: address to modify
>> - * @opcode: source of the copy
>> - * @len: length to copy
>> - *
>> - * Only atomic text poke/set should be allowed when not doing early
>patching.
>> - * It means the size must be writable atomically and the address
>must be aligned
>> - * in a way that permits an atomic write. It also makes sure we fit
>on a single
>> - * page.
>> - */
>> -void *text_poke(void *addr, const void *opcode, size_t len)
>> +static void *__text_poke(void *addr, const void *opcode, size_t len)
>>  {
>>  unsigned long flags;
>>  char *vaddr;
>> @@ -702,8 +691,6 @@ void *text_poke(void *addr, const void *opcode,
>size_t len)
>>   */
>>  BUG_ON(!after_bootmem);
>>  
>> -lockdep_assert_held(&text_mutex);
>> -
>>  if (!core_kernel_text((unsigned long)addr)) {
>>  pages[0] = vmalloc_to_page(addr);
>>  pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
>> @@ -732,6 +719,43 @@ void *text_poke(void *addr, const void *opcode,
>size_t len)
>>  return addr;
>>  }
>>  
>> +/**
>> + * text_poke - Update instructions on a live kernel
>> + * @addr: address to modify
>> + * @opcode: source of the copy
>> + * @len: length to copy
>> + *
>> + * Only atomic text poke/set should be allowed when not doing early
>patching.
>> + * It means the size must be writable atomically and the address
>must be aligned
>> + * in a way that permits an atomic write. It also makes sure we fit
>on a single
>> + * page.
>> + */
>> +void *text_poke(void *addr, const void *opcode, size_t len)
>> +{
>> +lockdep_assert_held(&text_mutex);
>> +
>> +return __text_poke(addr, opcode, len);
>> +}
>> +
>> +/**
>> + * text_poke_kgdb - Update instructions on a live kernel by kgdb
>> + * @addr: address to modify
>> + * @opcode: source of the copy
>> + * @len: length to copy
>> + *
>> + * Only atomic text poke/set should be allowed when not doing early
>patching.
>> + * It means the size must be writable atomically and the address
>must be aligned
>> + * in a way that permits an atomic write. It also makes sure we fit
>on a single
>> + * page.
>> + *
>> + * Context: should only be used by kgdb, which ensures no other core
>is running,
>> + *  despite the fact it does not hold the text_mutex.
>> + */
>> +void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
>> +{
>> +

Re: [PATCH v3 0/6] Static calls

2019-01-14 Thread hpa
On January 14, 2019 3:27:55 PM PST, Andy Lutomirski  wrote:
>On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin  wrote:
>>
>> So I was already in the middle of composing this message when Andy
>posted:
>>
>> > I don't even think this is sufficient.  I think we also need
>everyone
>> > who clears the bit to check if all bits are clear and, if so,
>remove
>> > the breakpoint.  Otherwise we have a situation where, if you are in
>> > text_poke_bp() and you take an NMI (or interrupt or MCE or
>whatever)
>> > and that interrupt then hits the breakpoint, then you deadlock
>because
>> > no one removes the breakpoint.
>> >
>> > If we do this, and if we can guarantee that all CPUs make forward
>> > progress, then maybe the problem is solved. Can we guarantee
>something
>> > like all NMI handlers that might wait in a spinlock or for any
>other
>> > reason will periodically check if a sync is needed while they're
>> > spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can
>disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to
>deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the
>patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>> >>
>> >> static cpumask_t text_poke_cpumask;
>> >>
>> >> static void text_poke_sync(void)
>> >> {
>> >>  smp_wmb();
>> >>  text_poke_cpumask = cpu_online_mask;
>> >>  smp_wmb();  /* Should be optional on x86 */
>> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> >>  on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu,
>NULL, false);
>> >>  while (!cpumask_empty(&text_poke_cpumask)) {
>> >>  cpu_relax();
>> >>  smp_rmb();
>> >>  }
>> >> }
>> >>
>> >> static void text_poke_sync_cpu(void *dummy)
>> >> {
>> >>  (void)dummy;
>> >>
>> >>  smp_rmb();
>> >>  cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> >>  /*
>> >>   * We are guaranteed to return with an IRET, either from the
>> >>   * IPI or the #BP handler; this provides serialization.
>> >>   */
>> >> }
>> >>
>> >
>> > The invariants here are:
>> >
>> > 1. The patching routine must set each bit in the cpumask after each
>event
>> >that requires synchronization is complete.
>> > 2. The bit can be (atomically) cleared on the target CPU only, and
>only in a
>> >place that guarantees a synchronizing event (e.g. IRET) before
>it may
>> >reaching the poked instruction.
>> > 3. At a minimum the IPI handler and #BP handler needs to clear the
>bit. It
>> >*is* also possible to clear it in other places, e.g. the NMI
>handler, if
>> >necessary as long as condition 2 is satisfied.
>> >
>>
>> OK, so with interrupts enabled *on the processor doing the patching*
>we
>> still have a problem if it takes an interrupt which in turn takes a
>#BP.
>>  Disabling interrupts would not help, because an NMI and #MC
>could
>> still cause problems unless we can guarantee that no path which may
>be
>> invoked by NMI/#MC can do text_poke, which seems to be a very
>aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
>>
>> The easiest/sanest way to deal with this might be to switch the IDT
>(or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either
>roll
>> forward or revert. This is doable because the second sync we
>currently
>> do is not actually necessary per the hardware guys.
>
>This is IMO insanely complicated.  I much prefer the kind of
>complexity that is more or less deterministic and easy to test to the
>kind of complexity (like this) that only happens in corner cases.
>
>I see two solutions here:
>
>1. Just suck it up and emulate the CALL.  And find a way to write a
>test case so we know it works.
>
>2. Find a non-deadlocky way to make the breakpoint handler wait for
>the breakpoint to get removed, without any mucking at all with the
>entry code.  And find a way to write a test case so we know it works.
>(E.g. stick an actual static_call call site *in text_poke_bp()* that
>fires once on boot so that the really awful recursive case gets
>exercised all the time.)
>
>But if we're going to do any mucking with the entry code, let's just
>do the simple mucking to make emulating CALL work.
>
>--Andy

Ugh. So much for not really proofreading. Yes, I think the second solution is 
the right thing since I think I figured out how to do it without deadlock; see 
other mail.
-- 
Sent from my Android device with K-9 Mail. 
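
Sketched out, the #BP-handler half of the protocol from the earlier mail might look like this; names follow the text_poke_sync() sketch above and the whole thing is illustrative:

static int poke_int3_sync_check(struct pt_regs *regs)
{
	if (!cpumask_test_cpu(smp_processor_id(), &text_poke_cpumask))
		return 0;
	/* Clearing this CPU's bit is valid only because we exit via
	 * IRET, which is a serializing event (invariant 2 above). */
	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
	return 1;
}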

Re: [PATCH v3 0/6] Static calls

2019-01-12 Thread hpa
On January 11, 2019 11:34:34 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 11:24 AM  wrote:
>>
>> I still don't see why we can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>  interrupt has a static call
>*BP*
>  poke_int3_handler
> *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Because it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>Linus

Ok, I was thinking far more about spinning with an IRET and letting the 
exception be delivered. Patching with interrupts disabled has other 
problems... 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 0/6] Static calls

2019-01-12 Thread hpa
On January 11, 2019 11:34:34 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 11:24 AM  wrote:
>>
>> I still don't see why we can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
>  text_poke(addr, &int3, sizeof(int3));
>   *interrupt*
>  interrupt has a static call
>*BP*
>  poke_int3_handler
> *BOOM*
>
>Note how at BOOM we cannot just spin (or return) to wait for the
>'int3' to be switched back. Because it never will. Because we are
>interrupting the thing that would do that switch-back.
>
>So we'd have to do the 'text_poke_bp()' sequence with interrupts
>disabled. Which we can't do right now at least, because part of that
>sequence involves that on_each_cpu(do_sync_core) thing, which needs
>interrupts enabled.
>
>See?
>
>Or am I missing something?
>
>Linus

Let me think about it.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 0/6] Static calls

2019-01-11 Thread hpa
On January 11, 2019 11:03:30 AM PST, Linus Torvalds 
 wrote:
>On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf 
>wrote:
>>
>> >
>> > Now, in the int3 handler can you take the faulting RIP and search
>for it in
>> > the “static-calls” table, writing the RIP+5 (offset) into R10
>(return
>> > address) and the target into R11. You make the int3 handler to
>divert the
>> > code execution by changing pt_regs->rip to point to a new function
>that does:
>> >
>> >   push R10
>> >   jmp __x86_indirect_thunk_r11
>> >
>> > And then you are done. No?
>>
>> IIUC, that sounds pretty much like what Steven proposed:
>>
>>  
>https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home
>>
>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>> that case there's only one clobbered register to work with (rax).
>
>Actually, there's a much simpler model now that I think about it.
>
>The BP fixup just fixes up %rip to point to "bp_int3_handler".
>
>And that's just a random text address set up by "text_poke_bp()".
>
>So how about the static call rewriting simply do this:
>
> - for each static call:
>
> 1)   create a fixup code stub that does
>
>push $returnaddressforTHIScall
>jmp targetforTHIScall
>
> 2) do
>
>on_each_cpu(do_sync_core, NULL, 1);
>
> to make sure all CPU's see this generated code
>
>  3) do
>
>text_poke_bp(addr, newcode, newlen, generatedcode);
>
>Ta-daa! Done.
>
>In fact, it turns out that even the extra "do_sync_core()" in #2 is
>unnecessary, because taking the BP will be serializing on the CPU that
>takes it, so we can skip it.
>
>End result: the text_poke_bp() function will do the two do_sync_core
>IPI's that guarantee that by the time it returns, no other CPU is
>using the generated code any more, so it can be re-used for the next
>static call fixup.
>
>Notice? No odd emulation, no need to adjust the stack in the BP
>handler, just the regular "return to a different IP".
>
>Now, there is a nasty special case with that stub, though.
>
>So nasty thing with the whole "generate a stub for each call" case:
>because it's dynamic and because of the re-use of the stub, you could
>be in the situation where:
>
>  CPU1  CPU2
>  ----  ----
>
>  generate a stub
>  on_each_cpu(do_sync_core..)
>  text_poke_bp()
>  ...
>
>  rewrite to BP
>trigger the BP
>return to the stub
>run the first instruction of the stub
>*INTERRUPT causes rescheduling*
>
>  on_each_cpu(do_sync_core..)
>  rewrite to good instruction
>  on_each_cpu(do_sync_core..)
>
>  free or re-generate the stub
>
>!! The stub is still in use !!
>
>So that simple "just generate the stub dynamically" isn't so simple
>after all.
>
>But it turns out that that is really simple to handle too. How do we do
>that?
>
>We do that by giving the BP handler *two* code sequences, and we make
>the BP handler pick one depending on whether it is returning to a
>"interrupts disabled" or "interrupts enabled" case.
>
>So the BP handler does this:
>
> - if we're returning with interrupts disabled, pick the simple stub
>
> - if we're returning with interrupts enabled, clear IF in the return
>%rflags, and pick a *slightly* more complex stub:
>
>push $returnaddressforTHIScall
>sti
>jmp targetforTHIScall
>
>and now the STI shadow will mean that this sequence is uninterruptible.
>
>So we'd not do complex emulation of the call instruction at BP time,
>but we'd do that *trivial* change at BP time.
>
>This seems simple, doesn't need any temporary registers at all, and
>doesn't need any extra stack magic. It literally needs just a trivial
>sequence in poke_int3_handler().
>
>Then we'd change the end of poke_int3_handler() to do something like
>this instead:
>
>        void *newip = bp_int3_handler;
>        ..
>        if (newip == magic_static_call_bp_int3_handler) {
>                if (regs->flags & X86_EFLAGS_IF) {
>                        newip = magic_static_call_bp_int3_handler_sti;
>                        regs->flags &= ~X86_EFLAGS_IF;
>                }
>        }
>        regs->ip = (unsigned long) newip;
>        return 1;
>
>AAND now we're *really* done.
>
>Does anybody see any issues in this?
>
>  Linus

I still don't see why we can't simply spin in the #BP handler until the patch is 
complete.

We can't have the #BP handler do any additional patching, as previously 
discussed, but spinning should be perfectly safe. The simplest way to spin is 
to just IRET; that both serializes and will re-take the exception if the patch 
is still in progress.
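
For illustration, a minimal sketch of what such a spinning handler could look
like -- the name bp_patching_addr is hypothetical, and this is not the actual
poke_int3_handler() implementation:

static void *bp_patching_addr;	/* hypothetical: set by the patcher around text_poke() */

int poke_int3_handler(struct pt_regs *regs)
{
	/* On #BP entry, regs->ip points just past the int3 byte. */
	unsigned long ip = regs->ip - 1;

	if ((unsigned long)READ_ONCE(bp_patching_addr) != ip)
		return 0;	/* not our breakpoint */

	/*
	 * Back up onto the int3 and return: the IRET serializes this
	 * CPU, and if the patch is still in flight the int3 byte is
	 * still there, so we immediately re-fault and land back here.
	 * That re-take loop is the spin.
	 */
	regs->ip = ip;
	return 1;
}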

It requires exactly *no* awareness in the #BP handler, allows for the call to 
be replaced with inline code or a simple NOP if desired (or vice versa, as long 
as it is a single instruction.)

If I'm missing something, then please educate me or point me to previous 
discussion; I would greatly appreciate it.

-- 

Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible

2019-01-11 Thread hpa
On January 10, 2019 5:34:21 PM PST, Sean Christopherson 
 wrote:
>On Thu, Jan 10, 2019 at 04:59:55PM -0800, h...@zytor.com wrote:
>> On January 10, 2019 9:42:57 AM PST, Sean Christopherson
> wrote:
>> >On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
>> >> On Thu, 10 Jan 2019 11:20:04 -0600
>> >> Josh Poimboeuf  wrote:
>> >> 
>> >> 
>> >> > > While I can't find a reason for hypervisors to emulate this
>> >instruction,
>> >> > > smarter people might find ways to turn it into a security
>> >exploit.  
>> >> > 
>> >> > Interesting point... but I wonder if it's a realistic concern. 
>> >BTW,
>> >> > text_poke_bp() also relies on undocumented behavior.
>> >> 
>> >> But we did get an official OK from Intel that it will work. Took a
>> >bit
>> >> of arm twisting to get them to do so, but they did. And it really
>is
>> >> pretty robust.
>> >
>> >Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
>> >certain atomicity guarantees go out the window if WC memtype is
>used.
>> 
>> If you run code from non-WB memory, all bets are off and you better
>> not be doing cross-modifying code.
>
>I wasn't thinking of running code from non-WB, but rather running code
>in WB while doing a CMC write via WC.

Same thing. Don't do that.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 5/6] x86/alternative: Use a single access in text_poke() where possible

2019-01-10 Thread hpa
On January 10, 2019 9:42:57 AM PST, Sean Christopherson 
 wrote:
>On Thu, Jan 10, 2019 at 12:32:43PM -0500, Steven Rostedt wrote:
>> On Thu, 10 Jan 2019 11:20:04 -0600
>> Josh Poimboeuf  wrote:
>> 
>> 
>> > > While I can't find a reason for hypervisors to emulate this
>instruction,
>> > > smarter people might find ways to turn it into a security
>exploit.  
>> > 
>> > Interesting point... but I wonder if it's a realistic concern. 
>BTW,
>> > text_poke_bp() also relies on undocumented behavior.
>> 
>> But we did get an official OK from Intel that it will work. Took a
>bit
>> of arm twisting to get them to do so, but they did. And it really is
>> pretty robust.
>
>Did we (they?) list any caveats for this behavior?  E.g. I'm fairly
>certain atomicity guarantees go out the window if WC memtype is used.

If you run code from non-WB memory, all bets are off and you better not be 
doing cross-modifying code.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86/boot: drop memset from copy.S

2019-01-08 Thread hpa
On January 7, 2019 12:52:57 AM PST, Cao jin  wrote:
>Hi,
>
>On 1/7/19 3:59 PM, h...@zytor.com wrote:
>> On January 6, 2019 11:40:56 PM PST, Cao jin
> wrote:
>>> According to objdump output of setup, function memset is not used in
>>> setup code. Currently, all usage of memset in setup comes from the macro
>>> definition in string.h.
>>>
>>> Signed-off-by: Cao jin 
>>> ---
>>> Compiled and booted under x86_64; compiled under i386.
>>>
>>> Questions: now there are 2 definitions of memcpy: one lies in copy.S,
>>> the other lies in string.h, where it is mapped to the gcc builtin function.
>>> Do we still need 2 definitions? Could we move the content of copy.S to
>>> boot/string.c?
>>>
>>> At first glance, the usage of string.{c,h} in setup is kind of
>>> confusing;
>>> they are also used in compressed/ and purgatory/
>>>
>>> arch/x86/boot/copy.S | 15 ---
>>> 1 file changed, 15 deletions(-)
>>>
>>> diff --git a/arch/x86/boot/copy.S b/arch/x86/boot/copy.S
>>> index 15d9f74b0008..5157d08b0ff2 100644
>>> --- a/arch/x86/boot/copy.S
>>> +++ b/arch/x86/boot/copy.S
>>> @@ -33,21 +33,6 @@ GLOBAL(memcpy)
>>> retl
>>> ENDPROC(memcpy)
>>>
>>> -GLOBAL(memset)
>>> -   pushw   %di
>>> -   movw    %ax, %di
>>> -   movzbl  %dl, %eax
>>> -   imull   $0x01010101,%eax
>>> -   pushw   %cx
>>> -   shrw    $2, %cx
>>> -   rep; stosl
>>> -   popw    %cx
>>> -   andw    $3, %cx
>>> -   rep; stosb
>>> -   popw    %di
>>> -   retl
>>> -ENDPROC(memset)
>>> -
>>> GLOBAL(copy_from_fs)
>>> pushw   %ds
>>> pushw   %fs
>> 
>> This is dependent on both gcc version and flags.
>> 
>
>Thanks for your info, but I still don't quite get your point. All files
>that have a memset reference in setup also #include "string.h", so how
>would the gcc version and flags force the memset reference to resolve to
>the assembly function?  Hope to get a little more detail when convenient. :)

GCC will sometimes emit calls to certain functions including memcpy().
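
To illustrate -- a hypothetical example, not code from the setup tree: gcc may
lower large aggregate operations into library calls even when the source never
names them, depending on version and optimization flags.

struct big { char buf[256]; };

void clear_big(struct big *d)
{
	*d = (struct big){ 0 };		/* may compile to a memset() call */
}

void copy_big(struct big *d, const struct big *s)
{
	*d = *s;			/* may compile to a memcpy() call */
}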
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86/boot: drop memset from copy.S

2019-01-07 Thread hpa
On January 6, 2019 11:40:56 PM PST, Cao jin  wrote:
>According to objdump output of setup, function memset is not used in
>setup code. Currently, all usage of memset in setup comes from the macro
>definition in string.h.
>
>Signed-off-by: Cao jin 
>---
>Compiled and booted under x86_64; compiled under i386.
>
>Questions: now there are 2 definitions of memcpy: one lies in copy.S,
>the other lies in string.h, where it is mapped to the gcc builtin function. Do we
>still need 2 definitions? Could we move the content of copy.S to
>boot/string.c?
>
>At first glance, the usage of string.{c,h} in setup is kind of
>confusing;
>they are also used in compressed/ and purgatory/
>
> arch/x86/boot/copy.S | 15 ---
> 1 file changed, 15 deletions(-)
>
>diff --git a/arch/x86/boot/copy.S b/arch/x86/boot/copy.S
>index 15d9f74b0008..5157d08b0ff2 100644
>--- a/arch/x86/boot/copy.S
>+++ b/arch/x86/boot/copy.S
>@@ -33,21 +33,6 @@ GLOBAL(memcpy)
>   retl
> ENDPROC(memcpy)
> 
>-GLOBAL(memset)
>-  pushw   %di
>-  movw    %ax, %di
>-  movzbl  %dl, %eax
>-  imull   $0x01010101,%eax
>-  pushw   %cx
>-  shrw    $2, %cx
>-  rep; stosl
>-  popw    %cx
>-  andw    $3, %cx
>-  rep; stosb
>-  popw    %di
>-  retl
>-ENDPROC(memset)
>-
> GLOBAL(copy_from_fs)
>   pushw   %ds
>   pushw   %fs

This is dependent on both gcc version and flags.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC v2 1/6] x86: introduce kernel restartable sequence

2019-01-03 Thread hpa
On December 30, 2018 11:21:07 PM PST, Nadav Amit  wrote:
>It is sometimes beneficial to have a restartable sequence - very few
>instructions which, if they are preempted, jump to a predefined point.
>
>To provide such functionality on x86-64, we use an empty REX-prefix
>(opcode 0x40) as an indication that an instruction is in such a sequence.
>Before
>calling the schedule IRQ routine, if the "magic" prefix is found, we
>call a routine to adjust the instruction pointer.  It is expected that
>this opcode is not in common use.
>
>The following patch will make use of this function. Since there are no
>other users (yet?), the patch does not bother to create a general
>infrastructure and API that others can use for such sequences. Yet, it
>should not be hard to add such an extension later.
>
>Signed-off-by: Nadav Amit 
>---
> arch/x86/entry/entry_64.S| 16 ++--
> arch/x86/include/asm/nospec-branch.h | 12 
> arch/x86/kernel/traps.c  |  7 +++
> 3 files changed, 33 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>index 1f0efdb7b629..e144ff8b914f 100644
>--- a/arch/x86/entry/entry_64.S
>+++ b/arch/x86/entry/entry_64.S
>@@ -644,12 +644,24 @@ retint_kernel:
>   /* Interrupts are off */
>   /* Check if we need preemption */
>   btl $9, EFLAGS(%rsp)    /* were interrupts off? */
>-  jnc 2f
>+  jnc 2f
> 0: cmpl    $0, PER_CPU_VAR(__preempt_count)
>+  jnz 2f
>+
>+  /*
>+   * Allow to use restartable code sections in the kernel. Consider an
>+   * instruction with the first byte having REX prefix without any bits
>+   * set as an indication for an instruction in such a section.
>+   */
>+  movq    RIP(%rsp), %rax
>+  cmpb    $KERNEL_RESTARTABLE_PREFIX, (%rax)
>   jnz 1f
>+  mov     %rsp, %rdi
>+  call    restart_kernel_rseq
>+1:
>   call    preempt_schedule_irq
>   jmp     0b
>-1:
>+2:
> #endif
>   /*
>    * The iretq could re-enable interrupts:
>diff --git a/arch/x86/include/asm/nospec-branch.h
>b/arch/x86/include/asm/nospec-branch.h
>index dad12b767ba0..be4713ef0940 100644
>--- a/arch/x86/include/asm/nospec-branch.h
>+++ b/arch/x86/include/asm/nospec-branch.h
>@@ -54,6 +54,12 @@
>   jnz 771b;   \
>   add $(BITS_PER_LONG/8) * nr, sp;
> 
>+/*
>+ * An empty REX-prefix is an indication that this instruction is part
>of kernel
>+ * restartable sequence.
>+ */
>+#define KERNEL_RESTARTABLE_PREFIX (0x40)
>+
> #ifdef __ASSEMBLY__
> 
> /*
>@@ -150,6 +156,12 @@
> #endif
> .endm
> 
>+.macro restartable_seq_prefix
>+#ifdef CONFIG_PREEMPT
>+  .byte   KERNEL_RESTARTABLE_PREFIX
>+#endif
>+.endm
>+
> #else /* __ASSEMBLY__ */
> 
> #define ANNOTATE_NOSPEC_ALTERNATIVE   \
>diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>index 85cccadb9a65..b1e855bad5ac 100644
>--- a/arch/x86/kernel/traps.c
>+++ b/arch/x86/kernel/traps.c
>@@ -59,6 +59,7 @@
> #include 
> #include 
> #include 
>+#include 
> 
> #ifdef CONFIG_X86_64
> #include 
>@@ -186,6 +187,12 @@ int fixup_bug(struct pt_regs *regs, int trapnr)
>   return 0;
> }
> 
>+asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs)
>+{
>+  if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX)
>+  return;
>+}
>+
> static nokprobe_inline int
>do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
> struct pt_regs *regs, long error_code)

A 0x40 prefix is *not* a noop. It changes the interpretation of byte registers 
4 through 7 from ah, ch, dh, bh to spl, bpl, sil and dil (see the example below).

It may not matter in your application but:

a. You need to clarify that this is the case, and why;
b. Phrase it differently so others don't propagate the same misunderstanding.
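
To make the register-interpretation point concrete, compare the same MOV
encoding with and without the empty REX prefix in 64-bit mode (illustrative
disassembly):

   88 e0        mov %ah,%al     /* no REX: reg field 4 selects AH */
   40 88 e0     mov %spl,%al    /* "empty" REX 0x40: reg field 4 selects SPL */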
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Can we drop upstream Linux x32 support?

2018-12-10 Thread hpa
On December 10, 2018 5:40:33 PM PST, Linus Torvalds 
 wrote:
>On Mon, Dec 10, 2018 at 5:23 PM Andy Lutomirski 
>wrote:
>>
>> I'm seriously considering sending a patch to remove x32 support from
>> upstream Linux.  Here are some problems with it:
>
>I talked to Arnd (I think - we were talking about all the crazy ABI's,
>but maybe it was with somebody else) about exactly this in Edinburgh.
>
>Apparently the main real use case is for extreme benchmarking. It's
>the only use-case where the complexity of maintaining a whole
>development environment and distro is worth it, it seems. Apparently a
>number of Spec submissions have been done with the x32 model.
>
>I'm not opposed to trying to sunset the support, but let's see who
>complains..
>
>  Linus

The use case aside, I need to address the technical issues in this post; some 
of the behaviors that Andy is pointing out are quite intentional, even if they 
are perhaps somewhat confusing at first glance. That being said, some were due 
to tradeoffs that might have been wrong.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86/boot: clear rsdp address in boot_params for broken loaders

2018-12-03 Thread hpa
On December 3, 2018 2:38:11 AM PST, Juergen Gross  wrote:
>In case a broken boot loader doesn't clear its struct boot_params, clear
>rsdp_addr in sanitize_boot_params().
>
>This fixes commit e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP
>address from boot params if available") e.g. for the case of a boot via
>systemd-boot.
>
>Fixes: e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP address from boot
>params if available")
>Reported-by: Gunnar Krueger 
>Tested-by: Gunnar Krueger 
>Signed-off-by: Juergen Gross 
>---
> arch/x86/include/asm/bootparam_utils.h | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/arch/x86/include/asm/bootparam_utils.h
>b/arch/x86/include/asm/bootparam_utils.h
>index a07ffd23e4dd..f6f6ef436599 100644
>--- a/arch/x86/include/asm/bootparam_utils.h
>+++ b/arch/x86/include/asm/bootparam_utils.h
>@@ -36,6 +36,7 @@ static void sanitize_boot_params(struct boot_params
>*boot_params)
>*/
>   if (boot_params->sentinel) {
>   /* fields in boot_params are left uninitialized, clear them */
>+  boot_params->acpi_rsdp_addr = 0;
>   memset(&boot_params->ext_ramdisk_image, 0,
>  (char *)&boot_params->efi_info -
>   (char *)&boot_params->ext_ramdisk_image);

Isn't this already covered by the memset()? If not, we should extend the 
memset() to maximal coverage.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86/TSC: Use RDTSCP

2018-11-23 Thread hpa
On November 23, 2018 12:03:07 PM PST, Guenter Roeck  wrote:
>Hi,
>
>On Mon, Nov 19, 2018 at 07:45:56PM +0100, Borislav Petkov wrote:
>> From: Borislav Petkov 
>> 
>> Currently, the kernel uses
>> 
>>   [LM]FENCE; RDTSC
>> 
>> in the timekeeping code, to guarantee monotonicity of time where the
>> *FENCE is selected based on vendor.
>> 
>> Replace that sequence with RDTSCP which is faster or on-par and gives
>> the same guarantees.
>> 
>> A microbenchmark on Intel shows that the change is on-par.
>> 
>> On AMD, the change is either on-par with the current LFENCE-prefixed
>> RDTSC and some are slightly better with RDTSCP.
>> 
>> The comparison is done with the LFENCE-prefixed RDTSC (and not with
>the
>> MFENCE-prefixed one, as one would normally expect) because all modern
>> AMD families make LFENCE serializing and thus avoid the heavy MFENCE
>by
>> effectively enabling X86_FEATURE_LFENCE_RDTSC.
>> 
>> Co-developed-by: Thomas Gleixner 
>> Signed-off-by: Thomas Gleixner 
>> Signed-off-by: Borislav Petkov 
>> Cc: Andy Lutomirski 
>> Cc: "H. Peter Anvin" 
>> Cc: John Stultz 
>> Cc: Thomas Lendacky 
>> Cc: x...@kernel.org
>> ---
>>  arch/x86/include/asm/msr.h | 15 +--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>> index 91e4cf189914..f00f2b61d326 100644
>> --- a/arch/x86/include/asm/msr.h
>> +++ b/arch/x86/include/asm/msr.h
>> @@ -217,6 +217,8 @@ static __always_inline unsigned long long
>rdtsc(void)
>>   */
>>  static __always_inline unsigned long long rdtsc_ordered(void)
>>  {
>> +DECLARE_ARGS(val, low, high);
>> +
>>  /*
>>   * The RDTSC instruction is not ordered relative to memory
>>   * access.  The Intel SDM and the AMD APM are both vague on this
>> @@ -227,9 +229,18 @@ static __always_inline unsigned long long
>rdtsc_ordered(void)
>>   * ordering guarantees as reading from a global memory location
>>   * that some other imaginary CPU is updating continuously with a
>>   * time stamp.
>> + *
>> + * Thus, use the preferred barrier on the respective CPU, aiming
>for
>> + * RDTSCP as the default.
>>   */
>> -barrier_nospec();
>> -return rdtsc();
>> +asm volatile(ALTERNATIVE_2("mfence; rdtsc",
>> +   "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
>> +   "rdtscp", X86_FEATURE_RDTSCP)
>> +: EAX_EDX_RET(val, low, high)
>> +/* RDTSCP clobbers ECX with MSR_TSC_AUX. */
>> +:: "ecx");
>> +
>> +return EAX_EDX_VAL(val, low, high);
>>  }
>>  
>
>This patch results in a crash with certain qemu emulations.
>
>[0.756869] hpet0: 3 comparators, 64-bit 100.00 MHz counter
>[0.762233] invalid opcode:  [#1] PTI
>[0.762435] CPU: 0 PID: 1 Comm: swapper Not tainted
>4.20.0-rc3-next-20181123 #2
>[0.762644] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>Ubuntu-1.8.2-1ubuntu1 04/01/2014
>[0.762832] EIP: read_tsc+0x4/0x10
>
>To reproduce:
>
>make ARCH=i386 defconfig
>echo "CONFIG_HIGHMEM64G=y" >> .config
>make ARCH=i386 olddefconfig
>make -j30 ARCH=i386
>
>qemu-system-i386 \
>   -kernel arch/x86/boot/bzImage \
>   -M q35 -cpu pentium3 -no-reboot -m 256 \
>   -initrd rootfs.cpio \
>   --append 'earlycon=uart8250,io,0x3f8,9600n8 rdinit=/sbin/init panic=-1
>console=ttyS0 console=tty' \
>   -nographic -monitor none
>
>initrd or root file system doesn't really matter since the code never
>gets there, but it is available from here:
>   
> https://github.com/groeck/linux-build-test/blob/master/rootfs/x86/rootfs.cpio.gz
>The qemu version does not matter either (I tried with 2.5 and 3.0).
>
>Reverting the patch fixes the problem.
>
>Guenter
>
>---
>crash log:
>
>...
>[0.756366] HPET: 3 timers in total, 0 timers will be used for
>per-cpu timer
>[0.756718] hpet0: at MMIO 0xfed0, IRQs 2, 8, 0
>[0.756869] hpet0: 3 comparators, 64-bit 100.00 MHz counter
>[0.762233] invalid opcode:  [#1] PTI
>[0.762435] CPU: 0 PID: 1 Comm: swapper Not tainted
>4.20.0-rc3-next-20181123
>#2
>[0.762644] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>Ubuntu-1.8.2-1ubuntu1 04/01/2014
>[0.762832] EIP: read_tsc+0x4/0x10
>[0.762832] Code: 00 01 00 eb 89 90 55 89 e5 5d c3 90 90 90 90 90 90
>90 90 90 90 90 55 a1 44 5a 8b c5 89 e5 5d c3 8d b6 00 00 00 00 55 89 e5
>57 <0f> ae f0b
>[0.762832] EAX: c58062c0 EBX: c58062c0 ECX: 0008 EDX: c4c1f0a0
>[0.762832] ESI: c58168c0 EDI: 00200086 EBP: cb495ed4 ESP: cb495ed0
>[0.762832] DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068 EFLAGS:
>0022
>[0.762832] CR0: 80050033 CR2:  CR3: 0597a000 CR4: 06f0
>[0.762832] Call Trace:
>[0.762832]  tk_setup_internals.constprop.10+0x3d/0x260
>[0.762832]  timekeeping_notify+0x56/0xc0
>[0.762832]  __clocksource_select+0xf3/0x150
>[0.762832]  ? boot_override_clock+0x42/0x42

Re: Sleeping in user_access section

2018-11-23 Thread hpa
On November 23, 2018 1:27:02 AM PST, Julien Thierry  
wrote:
>Hi,
>
>I made an attempt at implementing the 
>user_access_begin()/user_access_end() macros along with the 
>get/put_user_unsafe() for arm64 by toggling the status of PAN (more or 
>less similar to x86's STAC/CLAC).
>
>With a small mistake in my patch, we realized that directly calling 
>a function that could reschedule while in a user_access section could
>lead to:
>
>- scheduling another task keeping the user_access status enabled
>despite 
>the task never calling user_access_begin()
>
>- when re-scheduling the task that was mid user_access section, 
>user_access would be disabled and the task would fault on the next 
>get/put_user_unsafe.
>
>
>This is because __switch_to does not alter the user_access status when 
>switching from next to prev (at least on arm64 we currently don't, and 
>by looking at the x86 code I don't think this is done either).
>
>
> From my understanding, this is not an issue when the task in 
>user_access mode gets scheduled out/in as a result of an interrupt as 
>PAN and EFLAGS.AC get saved/restored on exception entry/exit (at least I
>
>know it is the case for PAN, I am less sure for the x86 side).
>
>
>So, the question is, should __switch_to take care of the user_access 
>status when scheduling new tasks? Or should there be a restriction
>about 
>scheduling out a task with user_access mode enabled and maybe add a 
>warning if we can detect this?
>
>(Or did we miss something and this is not an issue on x86?)
>
>Thanks,

You should never call a sleeping function from a user_access section. It is 
intended for very limited regions.
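
Roughly, the intended shape is a short, straight-line region -- a sketch only,
assuming the (pointer, size) form of user_access_begin() and the unsafe
accessors that take an error label:

long increment_user_counter(int __user *uptr)
{
	int val;

	if (!user_access_begin(uptr, sizeof(*uptr)))
		return -EFAULT;

	unsafe_get_user(val, uptr, efault);	/* no calls that might sleep */
	unsafe_put_user(val + 1, uptr, efault);

	user_access_end();
	return 0;

efault:
	user_access_end();
	return -EFAULT;
}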
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86/TSC: Use RDTSCP

2018-11-19 Thread hpa
On November 19, 2018 12:40:25 PM PST, Borislav Petkov  wrote:
>On Mon, Nov 19, 2018 at 12:17:35PM -0800, H. Peter Anvin wrote:
>> On 11/19/18 11:52 AM, Andy Lutomirski wrote:
>> > 
>> > I thought I benchmarked this on Intel at some point and found the
>> > LFENCE;RDTSC variant to be slightly faster.  But I believe you, so:
>> > 
>> > Acked-by: Andy Lutomirski 
>> > 
>> 
>> As long as the difference isn't significant, the simplicity would
>seem to be a
>> win.
>
>Right, I think by simplicity you mean RDTSCP. :)
>
>Also in the sense that you have a single instruction which gives you
>that barrier of waiting for all older insns to retire before reading
>the
>TSC vs two where you don't necessarily know what happens on the uarch
>level. I mean, nothing probably happens but RDTSCP is still simpler :)
>
>Also, hpa, aren't LFENCE; RDTSC and RDTSCP equivalent on Intel? In the
>sense that RDTSCP microcode practically adds an LFENCE before reading
>the TSC?
>
>Because SDM says:
>
>"The RDTSCP instruction is not a serializing instruction, but it does
>wait until all previous instructions have executed and all previous
>loads are globally visible."
>
>which sounds pretty much like an LFENCE to me:
>
>"LFENCE does not execute until all prior instructions have completed
>locally, and no later instruction begins execution until LFENCE
>completes."

I don't know the exact sequence of fencing operations it is equivalent to, but 
it is probably something quite similar.
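
In plain C, the two sequences being compared look roughly like this --
illustrative only, not the kernel's rdtsc_ordered():

static inline unsigned long long tsc_lfence(void)
{
	unsigned int lo, hi;

	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

static inline unsigned long long tsc_rdtscp(void)
{
	unsigned int lo, hi, aux;

	/* RDTSCP also writes IA32_TSC_AUX into %ecx. */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	return ((unsigned long long)hi << 32) | lo;
}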
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: PLEASE REVERT URGENTLY: Re: [PATCH v5 2/3] x86/boot: add acpi rsdp address to setup_header

2018-11-11 Thread hpa
On November 10, 2018 7:22:29 AM PST, Juergen Gross  wrote:
>On 09/11/2018 23:23, H. Peter Anvin wrote:
>> I just noticed this patch -- I missed it because the cover message
>> seemed far more harmless so I didn't notice this change.
>> 
>> THIS PATCH IS FATALLY WRONG AND NEEDS TO BE IMMEDIATELY REVERTED
>BEFORE
>> ANYONE STARTS RELYING ON IT; IT HAS THE POTENTIAL OF BREAKING THE
>> BOOTLOADER PROTOCOL FOR ALL FUTURE.
>> 
>> It seems to be based on fundamental misconceptions about the various
>> data structures in the protocol, and does so in a way that completely
>> breaks the way the protocol is designed to work.
>> 
>> The protocol is specifically designed such that fields are not
>version
>> dependencies. The version number is strictly to inform the boot
>loader
>> about which capabilities the kernel has, so that the boot loader can
>> know if a certain data field is meaningful and/or honored.
>> 
>>> +Protocol 2.14: (Kernel 4.20) Added acpi_rsdp_addr holding the
>physical
>>> +   address of the ACPI RSDP table.
>>> +   The bootloader updates version with:
>>> +   0x8000 | min(kernel-version, bootloader-version)
>>> +   kernel-version being the protocol version supported by
>>> +   the kernel and bootloader-version the protocol version
>>> +   supported by the bootloader.
>> 
>> [...]
>> 
>>>   MEMORY LAYOUT
>>>
>>>  The traditional memory map for the kernel loader, used for Image or
>>> @@ -197,6 +209,7 @@ Offset  Proto   NameMeaning
>>>  0258/8 2.10+   pref_addressPreferred loading address
>>>  0260/4 2.10+   init_size   Linear memory required during 
>>> initialization
>>>  0264/4 2.11+   handover_offset Offset of handover entry point
>>> +0268/8 2.14+   acpi_rsdp_addr  Physical address of RSDP table
>> 
>> NO.
>> 
>> That is not how struct setup_header works, nor does this belong here.
>> 
>> struct setup_header contains *initialized data*, and has a length
>byte
>> at offset 0x201.  The bootloader is responsible for copying the full
>> structure into the appropriate offset (0x1f1) in struct boot_params.
>> 
>> The length byte isn't actually a requirement, since the maximum
>possible
>> size of this structure is 144 bytes, and the kernel will (obviously)
>not
>> look at the older fields anyway, but it is good practice. The kernel
>or
>> any other entity is free to zero out the bytes past this length
>pointer.
>> 
>> There are only 24 bytes left in this structure, and this would occupy
>8
>> of them for no valid reason.  The *only* valid reason to put a
>> zero-initialized field in struct setup_header is if it is used by the
>> 16-bit legacy BIOS boot, which is obviously not the case here.
>> 
>> This field thus belongs in struct boot_params, not struct
>setup_header.
>
>Would you be okay with putting acpi_rsdp_addr at offset 0x0cc (_pad4)?
>
>
>Juergen

I'd prefer if you used __pad3 offset 0x70 to keep the large block, and that way 
your field is also aligned.

However, if you have some specific reason to prefer __pad4 it's no big deal, 
although I'm curious what it would be.
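
For reference, the suggested placement would look roughly like this in struct
boot_params -- a sketch based on the field names in bootparam.h, not
necessarily the final upstream layout:

struct boot_params {
	struct screen_info screen_info;		/* 0x000 */
	struct apm_bios_info apm_bios_info;	/* 0x040 */
	__u8  _pad2[4];				/* 0x054 */
	__u64 tboot_addr;			/* 0x058 */
	struct ist_info ist_info;		/* 0x060 */
	__u64 acpi_rsdp_addr;			/* 0x070: carved out of __pad3 */
	__u8  _pad3[8];				/* 0x078: remainder of old __pad3[16] */
	/* ... */
} __attribute__((packed));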
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm

2018-11-07 Thread hpa
On November 7, 2018 1:43:39 PM PST, Logan Gunthorpe  wrote:
>
>
>On 2018-11-07 11:56 a.m., Nadav Amit wrote:
>> HPA indicated more than once that this is wrong (and that was indeed
>my
>> initial solution), since it is not guaranteed that the compiler would
>put
>> the macro assembly before its use.
>
>Hmm, that's very unfortunate. I don't really understand why the
>compiler
>would not put the macro assembly in the same order as it appears in the
>code, where it would be in the correct order.
>
>In any case, I've submitted a couple of issues to icecc[1] and
>distcc[2]
>to see if they have any ideas about supporting this on their sides.
>
>Thanks,
>
>Logan
>
>[1] https://github.com/icecc/icecream/issues/428
>[2] https://github.com/distcc/distcc/issues/312

Apparently gcc will treat them like basic blocks and possibly move them around.
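
A hypothetical example of the ordering hazard (names made up for
illustration): the macro definition and its use live in separate asm
statements, and nothing forces gcc to emit them in source order.

/* Definition: a top-level asm() statement. */
asm(".macro MY_NOP\n\t"
    "nop\n"
    ".endm");

static inline void do_nop(void)
{
	/* Use: if gcc emits this before the .macro, gas errors out. */
	asm volatile("MY_NOP");
}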
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH RFC] x86: Don't include '-Wa,-' when building with Clang

2018-10-23 Thread hpa
On October 23, 2018 3:53:10 PM PDT, Nick Desaulniers  
wrote:
>On Tue, Oct 23, 2018 at 3:44 PM Nathan Chancellor
> wrote:
>>
>> On Tue, Oct 23, 2018 at 03:08:53PM -0700, Nick Desaulniers wrote:
>> > On Tue, Oct 23, 2018 at 2:58 PM Nathan Chancellor
>> >  wrote:
>> > >
>> > > On Tue, Oct 23, 2018 at 01:01:22PM -0700, H. Peter Anvin wrote:
>> > > > On 10/23/18 11:40, Nick Desaulniers wrote:
>> > > > > On Mon, Oct 22, 2018 at 10:11 PM Nadav Amit
> wrote:
>> > > > >>
>> > > > >> at 5:37 PM, Nathan Chancellor 
>wrote:
>> > > > >>
>> > > > >> Commit 77b0bf55bc67 ("kbuild/Makefile: Prepare for using
>macros in
>> > > > >> inline assembly code to work around asm() related GCC
>inlining bugs")
>> > > > >> added this flag to KBUILD_CFLAGS, where it works perfectly
>fine with
>> > > > >> GCC. However, when building with Clang, all of the object
>files compile
>> > > > >> fine but the build hangs indefinitely at init/main.o, right
>before the
>> > > > >> linking stage. Don't include this flag when building with
>Clang.
>> > > > >>
>> > > > >> The kernel builds and boots to a shell in QEMU with both GCC
>and Clang
>> > > > >> with this patch applied.
>> > > > >>
>> > > > >> Link:
>https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FClangBuiltLinux%2Flinux%2Fissues%2F213data=02%7C01%7Cnamit%40vmware.com%7C871daebc2ca44947d28d08d638811fb5%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636758524579997650sdata=shuxW81QRrO3TSqbgf462wgZYdLeAKeQEdGRxmnUX30%3Dreserved=0
>> > > > >> Signed-off-by: Nathan Chancellor 
>> > > > >> ---
>> > > > >>
>> > > > >> The reason this patch is labeled RFC is while I can verify
>that this
>> > > > >> fixes the issue, I'm not entirely sure why the '-Wa,-' works
>for GCC
>> > > > >> and not Clang. I looked into what the flag means and I
>couldn't really
>> > > > >> find anything so I just assume it's taking input from stdin?
>The issue
>> > > > >> could stem from how GCC forks gas versus how Clang does it.
>If this
>> > > > >> isn't of concern and the maintainers are happy with this
>patch as is,
>> > > > >> feel free to take it.
>> > > > >>
>> > > >
>> > > > Perhaps someone could actually, you know, time the build and
>see how
>> > > > much -pipe actually matters, if at all?
>> > > >
>> > > >   -hpa
>> > > >
>> > >
>> > > Thank you for the suggestion! With the attached diff for removing
>> > > '-pipe' and 'make -j1' with defconfig (just to make sure any
>variance
>> > > would stand out), here are my results:
>> > >
>> > > -pipe (GCC):
>> > >
>> > > real15m55.202s
>> > > user14m17.748s
>> > > sys 1m47.496s
>> > >
>> > > No -pipe (GCC):
>> > >
>> > > real16m4.430s
>> > > user14m16.277s
>> > > sys 1m46.604s
>> > >
>> > > -pipe (Clang):
>> > >
>> > > real21m26.016s
>> > > user19m21.722s
>> > > sys 2m2.606s
>> > >
>> > > No -pipe (Clang):
>> > >
>> > > real21m27.822s
>> > > user19m22.092s
>> > > sys 2m4.151s
>> >
>> > Looks like Clang eats `-pipe`:
>> >
>https://github.com/llvm-mirror/clang/blob/391667a023f79287f9c40868f34f08c16156/lib/Driver/Driver.cpp#L962
>> > commit r110007 has the log:
>> > Driver: Start ripping out support for -pipe, which is worthless
>> > and complicates
>> > too many other things.
>> >
>>
>> In that case, we can either keep this change (I'll resend with the
>> explanation that Clang doesn't respect -pipe) or we can just rip out
>> -pipe for GCC too. Here are three separate results for GCC with my
>> normal jobs flag:
>>
>> -pipe (GCC):
>>
>> real3m40.813s
>> real3m44.449s
>> real3m39.648s
>>
>> No -pipe (GCC):
>>
>> real3m38.492s
>> real3m38.335s
>> real3m38.975s
>>
>> P

Re: [PATCH RFC] x86: Don't include '-Wa,-' when building with Clang

2018-10-23 Thread hpa
On October 23, 2018 2:58:11 PM PDT, Nathan Chancellor 
 wrote:
>On Tue, Oct 23, 2018 at 01:01:22PM -0700, H. Peter Anvin wrote:
>> On 10/23/18 11:40, Nick Desaulniers wrote:
>> > On Mon, Oct 22, 2018 at 10:11 PM Nadav Amit 
>wrote:
>> >>
>> >> at 5:37 PM, Nathan Chancellor  wrote:
>> >>
>> >> Commit 77b0bf55bc67 ("kbuild/Makefile: Prepare for using macros in
>> >> inline assembly code to work around asm() related GCC inlining
>bugs")
>> >> added this flag to KBUILD_CFLAGS, where it works perfectly fine
>with
>> >> GCC. However, when building with Clang, all of the object files
>compile
>> >> fine but the build hangs indefinitely at init/main.o, right before
>the
>> >> linking stage. Don't include this flag when building with Clang.
>> >>
>> >> The kernel builds and boots to a shell in QEMU with both GCC and
>Clang
>> >> with this patch applied.
>> >>
>> >> Link:
>https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FClangBuiltLinux%2Flinux%2Fissues%2F213data=02%7C01%7Cnamit%40vmware.com%7C871daebc2ca44947d28d08d638811fb5%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636758524579997650sdata=shuxW81QRrO3TSqbgf462wgZYdLeAKeQEdGRxmnUX30%3Dreserved=0
>> >> Signed-off-by: Nathan Chancellor 
>> >> ---
>> >>
>> >> The reason this patch is labeled RFC is while I can verify that
>this
>> >> fixes the issue, I'm not entirely sure why the '-Wa,-' works for
>GCC
>> >> and not Clang. I looked into what the flag means and I couldn't
>really
>> >> find anything so I just assume it's taking input from stdin? The
>issue
>> >> could stem from how GCC forks gas versus how Clang does it. If
>this
>> >> isn't of concern and the maintainers are happy with this patch as
>is,
>> >> feel free to take it.
>> >>
>> 
>> Perhaps someone could actually, you know, time the build and see how
>> much -pipe actually matters, if at all?
>> 
>>  -hpa
>> 
>
>Thank you for the suggestion! With the attached diff for removing
>'-pipe' and 'make -j1' with defconfig (just to make sure any variance
>would stand out), here are my results:
>
>-pipe (GCC):
>
>real15m55.202s
>user14m17.748s
>sys 1m47.496s
>
>No -pipe (GCC):
>
>real16m4.430s
>user14m16.277s
>sys 1m46.604s
>
>-pipe (Clang):
>
>real21m26.016s
>user19m21.722s
>sys 2m2.606s
>
>No -pipe (Clang):
>
>real21m27.822s
>user19m22.092s
>sys 2m4.151s
>
>Certainly seems like -pipe doesn't make a ton of difference. If this is
>a better fix, I am happy to draft up a proper commit message and send
>it out for review.
>
>==
>
>diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>index 73f4831283ac..672c689c1faa 100644
>--- a/arch/x86/Makefile
>+++ b/arch/x86/Makefile
>@@ -213,8 +213,6 @@ ifdef CONFIG_X86_64
> KBUILD_LDFLAGS += $(call ld-option, -z max-page-size=0x20)
> endif
> 
>-# Speed up the build
>-KBUILD_CFLAGS += -pipe
># Workaround for a gcc prelease that unfortunately was shipped in a
>suse release
> KBUILD_CFLAGS += -Wno-sign-compare
> #
>@@ -239,7 +237,7 @@ archheaders:
> archmacros:
>$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
> 
>-ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,-
>+ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s
> export ASM_MACRO_FLAGS
> KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)

make -j1 makes the test useless. You should aim for optimal performance for 
your machine.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH stable v2 1/2] termios, tty/tty_baudrate.c: fix buffer overrun

2018-10-23 Thread hpa
On October 23, 2018 7:53:51 AM PDT, Greg Kroah-Hartman 
 wrote:
>On Mon, Oct 22, 2018 at 09:19:04AM -0700, H. Peter Anvin (Intel) wrote:
>> From: "H. Peter Anvin" 
>> 
>> On architectures with CBAUDEX == 0 (Alpha and PowerPC), the code in
>tty_baudrate.c does
>> not do any limit checking on the tty_baudrate[] array, and in fact a
>> buffer overrun is possible on both architectures. Add a limit check
>to
>> prevent that situation.
>> 
>> This will be followed by a much bigger cleanup/simplification patch.
>> 
>> Signed-off-by: H. Peter Anvin (Intel) 
>> Requested-by: Johan Hovold 
>> Cc: Greg Kroah-Hartman 
>> Cc: Jiri Slaby 
>> Cc: Al Viro 
>> Cc: Richard Henderson 
>> Cc: Ivan Kokshaysky 
>> Cc: Matt Turner 
>> Cc: Thomas Gleixner 
>> Cc: Kate Stewart 
>> Cc: Philippe Ombredanne 
>> Cc: Greg Kroah-Hartman 
>> Cc: Eugene Syromiatnikov 
>> Cc: 
>> Cc: 
>> Cc: Alan Cox 
>> Cc: 
>> ---
>>  drivers/tty/tty_baudrate.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
>As I think Al's big termios cleanups are going to be hitting Linus's
>tree soon, do you know how these patches interact with that?
>
>This patch seems like it will not, so I'll be glad to queue that up
>after my first round of patches get merged to Linus later this week,
>but
>the second one worries me.
>
>thanks,
>
>greg k-h

I have been working with Al; we had approached much the same problem from 
different directions. Mine ended up being a bit more comprehensive as a result, 
so I think we're going to end up using my code with Al's reviews.

So bottom line is that it should be all good.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Insanely high baud rates

2018-10-11 Thread hpa
On October 11, 2018 5:31:34 AM PDT, Alan Cox  wrote:
>> I'm mostly wondering if it is worth future-proofing for new
>transports. It sounds like we can have a consensus on leaving the upper
>4 bits of the speed fields reserved, but leaving the details of
>implementation for the future?
>
>It seems reasonable, although I think the reality is that any future
>transport is not going to be a true serial link, but some kind of
>serial
>emulation layer. For those the speed really only matters to tell
>editors
>and the like not to bother being clever.
>
>I mean - what is the baud rate of a pty?
>
>Alan

Whatever the master wants it to be...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Insanely high baud rates

2018-10-10 Thread hpa
On October 10, 2018 1:17:17 PM PDT, Alan Cox  wrote:
>On Tue, 9 Oct 2018 12:19:04 -0700
>"H. Peter Anvin"  wrote:
>
>> [Resending to a wider audience]
>> 
>> In trying to get the termios2 interface actually implemented in
>glibc,
>> the question came up if we will ever care about baud rates in excess
>of
>> 4 Gbps, even in the relatively remote future.
>
>Even RS485 at 4MBits involves deep magic. I think we are fairly safe.
>Not
>only that but our entire tty layer isn't capable of sustaining anything
>even remotely in that range.
>
>I think it's a non-issue.
>
>Alan

I'm mostly wondering if it is worth future-proofing for new transports. It 
sounds like we can have a consensus on leaving the upper 4 bits of the speed 
fields reserved, but leaving the details of implementation for the future?
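
For concreteness: the fields in question are the 32-bit c_ispeed/c_ospeed 
members of struct termios2. A minimal Linux-only sketch of programming an 
arbitrary rate through them (the helper is hypothetical, not from this 
thread):

#include <sys/ioctl.h>
#include <asm/termbits.h>	/* struct termios2, TCGETS2/TCSETS2, BOTHER */

/* Set an arbitrary baud rate via the termios2 interface. */
static int set_arbitrary_baud(int fd, unsigned int baud)
{
	struct termios2 tio;

	if (ioctl(fd, TCGETS2, &tio))
		return -1;

	tio.c_cflag &= ~CBAUD;
	tio.c_cflag |= BOTHER;	/* take the rate from c_ispeed/c_ospeed */
	tio.c_ispeed = baud;	/* 32-bit fields; reserving the upper 4 bits */
	tio.c_ospeed = baud;	/* still leaves ~268 Mbaud of headroom */

	return ioctl(fd, TCSETS2, &tio);
}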
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions

2018-10-04 Thread hpa
On October 4, 2018 2:12:22 AM PDT, Ingo Molnar  wrote:
>
>* Nadav Amit  wrote:
>
>> I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>> perhaps I misunderstood).
>
>Well, tests are unlikely to show the overhead of extra lines of this
>magnitude, unless done very carefully, yet the added bloat exists and
>is not even
>mentioned by the changelog, it just says:
>
>Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline
>asm
>
>  Using macros for inline assembly improves both readability and
>compilation decisions that are distorted by big assembly blocks that
>use
>  alternative sections. Compile macros.S and use it to assemble all C
>  files. Currently, only x86 will use it.
>
>> I guess you are referring to the preprocessing of the assembler. Note that
>the C 
>> preprocessing of macros.S obviously happens only once. That’s the
>reason
>> I assumed it’s not that expensive.
>
>True - so first we build macros.s, and that gets included in every C
>file build, right?
>
>macros.s is smaller: 275 lines only in the distro test build I tried,
>which looks
>a lot better than my first 4,200 lines guesstimate.
>
>> Anyhow, I remember that we discussed at some point doing something
>like
>> ‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
>> remember why and don’t see any reason it is so. Unless I am missing
>> something, I think it is possible to take each individual header and
>> preprocess the assembly part of it into a separate .s file. Then we can
>put in
>> the C part of the header ‘asm(".include XXX.s”)’.
>> 
>> What do you think?
>
>Hm, this looks quite complex - macros.s is better I think. Also, 275
>straight assembly lines is 
>a lot better than 4,200.
>
>Another, separate question I wanted to ask: how do we ensure that the
>kernel stays fixed?
>I.e. is there some tooling we can use to actually measure whether
>there's bad inlining decisions 
>done, to detect all these bad patterns that cause bad GCC code
>generation?
>
>Thanks,
>
>   Ingo

The assembly output from GCC is quite voluminous; I doubt tacking a few hundred 
lines on will matter one iota.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions

2018-10-04 Thread hpa
On October 4, 2018 1:56:37 AM PDT, Nadav Amit  wrote:
>at 1:40 AM, h...@zytor.com wrote:
>
>> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar 
>wrote:
>>> * Ingo Molnar  wrote:
>>> 
>>>> I'm also somewhat annoyed at the fact that this series carries a
>>> boatload
>>>> of reviewed-by's and acked-by's, yet none of those reviewers found
>it
>>>> important to point out the large chasm that is gaping between
>>> description
>>>> and reality.
>>> 
>>> Another problem I just realized is that we now include
>>> arch/x86/kernel/macros.S in every 
>>> translation pass when building the kernel, right?
>>> 
>>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>>> header files:
>>> 
>>> $ make arch/x86/kernel/macros.s
>>> 
>>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort
>|
>>> uniq) | wc -l
>>> 4128
>>> 
>>> That's 4,100 extra lines of code to be preprocessed for every
>>> translation unit, of
>>> which there are tens of thousands. More if other pieces of code get
>>> macrofied in
>>> this fasion in the future.
>>> 
>>> If we assume that a typical distribution kernel build has ~20,000
>>> translation units
>>> then this change adds 82,560,000 more lines to be preprocessed, just
>to
>>> work around
>>> a stupid GCC bug?
>>> 
>>> I'm totally unhappy about that. Can we do this without adding
>macros.S?
>>> 
>>> It's also a pretty stupidly central file anyway that moves source
>code
>>> away
>>> from where it's used.
>>> 
>>> Thanks,
>>> 
>>> Ingo
>> 
>> It's not just for working around a stupid GCC bug, but it also has a
>huge
>> potential for cleaning up the inline asm in general.
>> 
>> I would like to know if there is an actual number for the build
>overhead
>> (an actual benchmark); I have asked for that once already.
>
>I can run some tests. (@hpa: I thought you asked about the -pipe
>overhead;
>perhaps I misunderstood).
>
>I guess you are referring to the preprocessing of the assembler. Note that the
>C 
>preprocessing of macros.S obviously happens only once. That’s the
>reason
>I assumed it’s not that expensive.
>
>Anyhow, I remember that we discussed at some point doing something like
>‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t
>remember why and don’t see any reason it is so. Unless I am missing
>something, I think it is possible to take each individual header and
>preprocess the assembly part of it into a separate .s file. Then we can
>put in
>the C part of the header ‘asm(".include XXX.s”)’.
>
>What do you think?

Ingo: I wasn't necessarily talking about the specifics of each bit, but rather 
the general concept of being able to use macros in inlines... I can send you 
something I have been working on in the background, but have been holding off 
on because of this; I'll send it in the morning, my time.

But that's no excuse for not doing it right.

Global asm() statements are problematic because there is no guarantee where in 
the assembly source they will end up. You'd almost need to intercept the 
assembly output, move all the includes to the top, and then process...
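
To make that concrete, a minimal sketch of the hazard (file and macro names 
invented for illustration): with GCC's default -ftoplevel-reorder, the 
file-scope asm() below may be emitted anywhere in the unit's assembly output, 
so nothing guarantees the .include lands before the first function that uses 
the macro.

/* File-scope asm: its position in the generated .s is not guaranteed. */
asm(".include \"macros.s\"");	/* hypothetical file defining SET_MARKER */

int marker;

static inline void set_marker(void)
{
	/* Assembles only if the .include was emitted above this point. */
	asm volatile("SET_MARKER %0" : "=m" (marker));
}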
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions

2018-10-04 Thread hpa
On October 4, 2018 1:33:33 AM PDT, Ingo Molnar  wrote:
>
>* Ingo Molnar  wrote:
>
>> I'm also somewhat annoyed at the fact that this series carries a
>boatload
>> of reviewed-by's and acked-by's, yet none of those reviewers found it
>> important to point out the large chasm that is gaping between
>description
>> and reality.
>
>Another problem I just realized is that we now include
>arch/x86/kernel/macros.S in every 
>translation pass when building the kernel, right?
>
>But arch/x86/kernel/macros.S expands to a pretty large hierarchy of
>header files:
>
>  $ make arch/x86/kernel/macros.s
>
>$ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort |
>uniq) | wc -l
>  4128
>
>That's 4,100 extra lines of code to be preprocessed for every
>translation unit, of
>which there are tens of thousands. More if other pieces of code get
>macrofied in
>this fashion in the future.
>
>If we assume that a typical distribution kernel build has ~20,000
>translation units
>then this change adds 82,560,000 more lines to be preprocessed, just to
>work around
>a stupid GCC bug?
>
>I'm totally unhappy about that. Can we do this without adding macros.S?
>
>It's also a pretty stupidly central file anyway that moves source code
>away
>from where it's used.
>
>Thanks,
>
>   Ingo

It's not just for working around a stupid GCC bug, but it also has a huge 
potential for cleaning up the inline asm in general.

I would like to know if there is an actual number for the build overhead (an 
actual benchmark); I have asked for that once already.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v8 02/10] Makefile: Prepare for using macros for inline asm

2018-10-03 Thread hpa
On October 2, 2018 3:59:52 AM PDT, Ingo Molnar  wrote:
>
>* Nadav Amit  wrote:
>
>> at 1:58 AM, Rasmus Villemoes  wrote:
>> 
>> > On 2018-09-18 23:28, Nadav Amit wrote:
>> > 
>> >> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>> >> index 8f6e7eb8ae9f..944fa3bc9376 100644
>> >> --- a/arch/x86/Makefile
>> >> +++ b/arch/x86/Makefile
>> >> @@ -214,8 +214,8 @@ ifdef CONFIG_X86_64
>> >> KBUILD_LDFLAGS += $(call ld-option, -z max-page-size=0x20)
>> >> endif
>> >> 
>> >> -# Speed up the build
>> >> -KBUILD_CFLAGS += -pipe
>> >> +# We cannot use -pipe flag since we give an additional .s file to
>the compiler
>> >> +#KBUILD_CFLAGS += -pipe
>> > 
>> > Is this really necessary? The gas manual says that one can use --
>to
>> > name stdin, though that's probably a typo and should just be - .
>Doing
>> > 
>> > gcc -pipe -Wa,foo.s -Wa,-
>> 
>> Good idea. I didn’t think of it. Yes, this can do the trick. I’ll do
>it in
>> v9.
>
>Ok, will wait for v9.
>
>Thanks,
>
>   Ingo

Does -pipe actually win anything these days?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v12 05/13] x86/sgx: architectural structures

2018-07-05 Thread hpa
On July 5, 2018 1:09:12 PM PDT, Jarkko Sakkinen 
 wrote:
>On Thu, Jul 05, 2018 at 08:31:42AM -0700, Dave Hansen wrote:
>> On 07/03/2018 11:19 AM, Jarkko Sakkinen wrote:
>> > +struct sgx_secs {
>> > +  uint64_t size;
>> > +  uint64_t base;
>> > +  uint32_t ssaframesize;
>> > +  uint32_t miscselect;
>> > +  uint8_t reserved1[SGX_SECS_RESERVED1_SIZE];
>> > +  uint64_t attributes;
>> > +  uint64_t xfrm;
>> > +  uint32_t mrenclave[8];
>> > +  uint8_t reserved2[SGX_SECS_RESERVED2_SIZE];
>> > +  uint32_t mrsigner[8];
>> > +  uint8_t reserved3[SGX_SECS_RESERVED3_SIZE];
>> > +  uint16_t isvprodid;
>> > +  uint16_t isvsvn;
>> > +  uint8_t reserved4[SGX_SECS_RESERVED4_SIZE];
>> > +} __packed __aligned(4096);
>> 
>> Why are the uint* versions in use here?  Those are for userspace ABI,
>> but this is entirely for in-kernel-use, right?
>> 
>> We've used u8/16/32/64 in arch code in a bunch of places.  They're at
>> least a bit more compact and easier to read.  It's this:
>> 
>>  u8  foo;
>>  u64 bar;
>> 
>> vs. this:
>> 
>>  uint8_t  foo;
>>  uint64_t bar;
>
>The reason was that with in-kernel LE these were in fact used by
>user space code. Now they can be changed to those that you
>suggested.
>
>/Jarkko

For things exported to user space use __u* and __s* types... the _t types would 
actually violate the C standard with respect to namespace pollution.
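
As a made-up illustration (not a structure from the SGX series): a 
user-visible header would be spelled with the __u* types, which live in the 
implementation namespace, rather than uint64_t and friends, which a kernel 
header may not inject into the application's namespace.

#include <linux/types.h>

struct example_ioctl_arg {	/* hypothetical UAPI structure */
	__u64	addr;
	__u64	size;
	__u32	flags;
	__u32	reserved;
} __attribute__((packed));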
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()

2018-06-28 Thread hpa
On June 28, 2018 1:33:24 PM PDT, Andy Lutomirski  wrote:
>On Wed, Jun 27, 2018 at 11:22 AM,   wrote:
>> On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski 
>wrote:
>>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski
>
>>>wrote:



> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
>>> wrote:
>
>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>
>> That RPL3 part is false.  The following program does:
>>
>> #include <stdio.h>
>>
>> int main()
>> {
>>unsigned short sel;
>>asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>sel &= ~3;
>>printf("Will write 0x%hx to GS\n", sel);
>>asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>printf("GS = 0x%hx\n", sel);
>>return 0;
>> }
>>
>> prints:
>>
>> Will write 0x28 to GS
>> GS = 0x28
>>
>> The x86 architecture is *insane*.
>>
>> Other than that, this patch seems generally sensible.  But my
>> objection that it's incorrect with FSGSBASE enabled for %fs and
>%gs
>> still applies.
>>
>
> Ugh, you're right... I misremembered.  The CPL simply overrides
>the
>>>RPL
> rather than trapping.
>
> We still need to give legacy applications which have zero idea
>about
>>>the
> separate bases that apply only to 64-bit mode a way to DTRT.
>>>Requiring
> these old crufty applications to do something new is not an
>option.

>
> As ugly as it is, I'm thinking the Right Thing is to simply make
>it
>>>a
> part of the Linux ABI that if the FS or GS selector registers
>point
>>>into
> the LDT then we will requalify them; if a 64-bit app does that
>then
>>>they
> get that behavior.  This isn't something that will happen
> asynchronously, and if a 64-bit process loads an LDT value into FS
>>>or
> GS, they are considered to have opted in to that behavior.

 But the old and crusty apps don’t depend on requalification because
>>>we never used to do it.

 I’m not convinced we ever need to refresh the base. In fact, we
>could
>>>start preserving the base of LDT-referencing FS/GS across context
>>>switches even without FSGSBASE at some minor performance cost, but I
>>>don’t really see the point. I still think my proposed semantics are
>>>easy to implement and preserve the ABI even if they have the sad
>>>property that the FSGSBASE behavior and the non-FSGSBASE behavior end
>>>up different.

>>>
>>>There's another reasonable solution: do exactly what your patch does,
>>>minus the bugs.  We would need to get the RPL != 3 case right (easy)
>>>and the case where there's a non-running thread using the selector in
>>>question.  The latter is probably best handled by adding a flag to
>>>thread_struct that says "fsbase needs reloading from the descriptor
>>>table" and only applies if the selector is in the LDT or TLS area. 
>Or
>>>we could hijack a high bit in the selector.  Then we'd need to update
>>>everything that uses the fields.
>>
>> Obviously fix the bugs.
>>
>> How would you control this bit?
>
>Sorry, I was wrong in my previous email.  Let me try this again:
>
>Notwithstanding the RPL thing, the reason I don't like your patch as
>is, and the reason I didn't write a similar patch myself, is that it
>will behave nondeterministically on an FSGSBASE kernel.  Suppose there
>are two threads, A and B, that share an mm.  A has %fs == 0x7 and
>FSBASE = 0.  The LDT has the base for entry 0 set to 0.
>
>Now thread B calls modify_ldt to change the base for entry 0 to 1.
>
>The Obviously Sane (tm) behavior is for task A's FSBASE to
>asynchronously change to 1.  This is the only deterministic behavior
>that is even possible on a 32-bit kernel, and it's the only
>not-totally-nutty behavior that is possible on a 64-bit non-FSGSBASE
>kernel, and it's still perfectly reasonable for FSGSBASE.  The problem
>is that it's not so easly to implement.
>
>With your patch, on an FSGSBASE kernel, we get the desired behavior if
>thread A is running while thread B calls modify_ldt().  But we get
>different behavior if thread A is stopped -- thread A's FSBASE will
>remain set to 0.
>
>With that in mind, my email was otherwise garbage, and the magic "bit"
>idea was total crap.
>
>I can see three vaguely credible ways to implement this.
>
>1. thread B walks all threads on the system, notices that thread A has
>the same mm, and asynchronously fixes it up.  The locking is a bit
>tricky, and the performance isn't exactly great.  Maybe that's okay.
>
>2. We finally add an efficient way to find all threads that share an
>mm and do (1) but faster.
>
>3. We add enough bookkeeping to thread_struct so that, the next time
>thread A runs or has ptrace try to read its FSBASE, we notice that
>FSBASE is stale and fix it up.
>
>(3) will perform the best, but the implementation is probably nasty.
>If we want modify_ldt() to only reset the base for the modified
>records, 


Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()

2018-06-27 Thread hpa
On June 27, 2018 11:22:14 AM PDT, h...@zytor.com wrote:
>On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski 
>wrote:
>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski
>
>>wrote:
>>>
>>>
>>>
 On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
>> wrote:

> On 06/22/18 07:24, Andy Lutomirski wrote:
>
> That RPL3 part is false.  The following program does:
>
> #include <stdio.h>
>
> int main()
> {
>unsigned short sel;
>asm volatile ("mov %%ss, %0" : "=rm" (sel));
>sel &= ~3;
>printf("Will write 0x%hx to GS\n", sel);
>asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>asm volatile ("mov %%gs, %0" : "=rm" (sel));
>printf("GS = 0x%hx\n", sel);
>return 0;
> }
>
> prints:
>
> Will write 0x28 to GS
> GS = 0x28
>
> The x86 architecture is *insane*.
>
> Other than that, this patch seems generally sensible.  But my
> objection that it's incorrect with FSGSBASE enabled for %fs and
>%gs
> still applies.
>

 Ugh, you're right... I misremembered.  The CPL simply overrides the
>>RPL
 rather than trapping.

 We still need to give legacy applications which have zero idea
>about
>>the
 separate bases that apply only to 64-bit mode a way to DTRT. 
>>Requiring
 these old crufty applications to do something new is not an option.
>>>

 As ugly as it is, I'm thinking the Right Thing is to simply make it
>>a
 part of the Linux ABI that if the FS or GS selector registers point
>>into
 the LDT then we will requalify them; if a 64-bit app does that then
>>they
 get that behavior.  This isn't something that will happen
 asynchronously, and if a 64-bit process loads an LDT value into FS
>>or
 GS, they are considered to have opted in to that behavior.
>>>
>>> But the old and crusty apps don’t depend on requalification because
>>we never used to do it.
>>>
>>> I’m not convinced we ever need to refresh the base. In fact, we
>could
>>start preserving the base of LDT-referencing FS/GS across context
>>switches even without FSGSBASE at some minor performance cost, but I
>>don’t really see the point. I still think my proposed semantics are
>>easy to implement and preserve the ABI even if they have the sad
>>property that the FSGSBASE behavior and the non-FSGSBASE behavior end
>>up different.
>>>
>>
>>There's another reasonable solution: do exactly what your patch does,
>>minus the bugs.  We would need to get the RPL != 3 case right (easy)
>>and the case where there's a non-running thread using the selector in
>>question.  The latter is probably best handled by adding a flag to
>>thread_struct that says "fsbase needs reloading from the descriptor
>>table" and only applies if the selector is in the LDT or TLS area.  Or
>>we could hijack a high bit in the selector.  Then we'd need to update
>>everything that uses the fields.
>
>Obviously fix the bugs.
>
>How would you control this bit?

I can personally think of these options:

1. A prctl() to disable requalification (a sketch follows below);
2. Make the new instructions trap until used.  This will add to the startup 
time of legitimate users of these instructions;
3. Either of these, but start out in "off" mode until one of the descriptor 
table system calls is called.
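
A rough sketch of what option 1 could look like from user space; the option 
name and number here are invented, as no such prctl() exists:

#include <sys/prctl.h>

#define PR_SET_SEG_REQUALIFY	0x53474251	/* hypothetical option number */

/* Opt this process out of FS/GS requalification on LDT changes. */
static inline int disable_requalification(void)
{
	return prctl(PR_SET_SEG_REQUALIFY, 0, 0, 0, 0);
}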
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()

2018-06-27 Thread hpa
On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski  wrote:
>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski 
>wrote:
>>
>>
>>
>>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
> wrote:
>>>
 On 06/22/18 07:24, Andy Lutomirski wrote:

 That RPL3 part is false.  The following program does:

 #include <stdio.h>

 int main()
 {
unsigned short sel;
asm volatile ("mov %%ss, %0" : "=rm" (sel));
sel &= ~3;
printf("Will write 0x%hx to GS\n", sel);
asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
asm volatile ("mov %%gs, %0" : "=rm" (sel));
printf("GS = 0x%hx\n", sel);
return 0;
 }

 prints:

 Will write 0x28 to GS
 GS = 0x28

 The x86 architecture is *insane*.

 Other than that, this patch seems generally sensible.  But my
 objection that it's incorrect with FSGSBASE enabled for %fs and %gs
 still applies.

>>>
>>> Ugh, you're right... I misremembered.  The CPL simply overrides the
>RPL
>>> rather than trapping.
>>>
>>> We still need to give legacy applications which have zero idea about
>the
>>> separate bases that apply only to 64-bit mode a way to DTRT. 
>Requiring
>>> these old crufty applications to do something new is not an option.
>>
>>>
>>> As ugly as it is, I'm thinking the Right Thing is to simply make it
>a
>>> part of the Linux ABI that if the FS or GS selector registers point
>into
>>> the LDT then we will requalify them; if a 64-bit app does that then
>they
>>> get that behavior.  This isn't something that will happen
>>> asynchronously, and if a 64-bit process loads an LDT value into FS
>or
>>> GS, they are considered to have opted in to that behavior.
>>
>> But the old and crusty apps don’t depend on requalification because
>we never used to do it.
>>
>> I’m not convinced we ever need to refresh the base. In fact, we could
>start preserving the base of LDT-referencing FS/GS across context
>switches even without FSGSBASE at some minor performance cost, but I
>don’t really see the point. I still think my proposed semantics are
>easy to implement and preserve the ABI even if they have the sad
>property that the FSGSBASE behavior and the non-FSGSBASE behavior end
>up different.
>>
>
>There's another reasonable solution: do exactly what your patch does,
>minus the bugs.  We would need to get the RPL != 3 case right (easy)
>and the case where there's a non-running thread using the selector in
>question.  The latter is probably best handled by adding a flag to
>thread_struct that says "fsbase needs reloading from the descriptor
>table" and only applies if the selector is in the LDT or TLS area.  Or
>we could hijack a high bit in the selector.  Then we'd need to update
>everything that uses the fields.

Obviously fix the bugs.

How would you control this bit?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86-64: use 32-bit XOR to zero registers

2018-06-25 Thread hpa
On June 25, 2018 9:33:35 AM PDT, Randy Dunlap  wrote:
>On 06/25/2018 03:25 AM, Jan Beulich wrote:
>> Some Intel CPUs don't recognize 64-bit XORs as zeroing idioms - use
>> 32-bit ones instead.
>
>Hmph.  Is that considered a bug (errata)?
>
>URL/references?
>
>Are these changes really only zeroing the lower 32 bits of the
>register?
>and that's all that the code cares about?
>
>thanks.
>
>> Signed-off-by: Jan Beulich 
>> ---
>>  arch/x86/crypto/aegis128-aesni-asm.S |2 +-
>>  arch/x86/crypto/aegis128l-aesni-asm.S|2 +-
>>  arch/x86/crypto/aegis256-aesni-asm.S |2 +-
>>  arch/x86/crypto/aesni-intel_asm.S|8 
>>  arch/x86/crypto/aesni-intel_avx-x86_64.S |4 ++--
>>  arch/x86/crypto/morus1280-avx2-asm.S |2 +-
>>  arch/x86/crypto/morus1280-sse2-asm.S |2 +-
>>  arch/x86/crypto/morus640-sse2-asm.S  |2 +-
>>  arch/x86/crypto/sha1_ssse3_asm.S |2 +-
>>  arch/x86/kernel/head_64.S|2 +-
>>  arch/x86/kernel/paravirt_patch_64.c  |2 +-
>>  arch/x86/lib/memcpy_64.S |2 +-
>>  arch/x86/power/hibernate_asm_64.S|2 +-
>>  13 files changed, 17 insertions(+), 17 deletions(-)
>> 
>> --- 4.18-rc2/arch/x86/crypto/aegis128-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128-aesni-asm.S
>> @@ -75,7 +75,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG, MSG
>>  
>>  mov LEN, %r8
>> --- 4.18-rc2/arch/x86/crypto/aegis128l-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128l-aesni-asm.S
>> @@ -66,7 +66,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG0, MSG0
>>  pxor MSG1, MSG1
>>  
>> --- 4.18-rc2/arch/x86/crypto/aegis256-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis256-aesni-asm.S
>> @@ -59,7 +59,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG, MSG
>>  
>>  mov LEN, %r8
>> --- 4.18-rc2/arch/x86/crypto/aesni-intel_asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_asm.S
>> @@ -258,7 +258,7 @@ ALL_F:  .octa 0x
>>  .macro GCM_INIT Iv SUBKEY AAD AADLEN
>>  mov \AADLEN, %r11
>>  mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  mov %r11, InLen(%arg2) # ctx_data.in_length = 0
>>  mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
>>  mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
>> @@ -286,7 +286,7 @@ ALL_F:  .octa 0x
>>  movdqu HashKey(%arg2), %xmm13
>>  add %arg5, InLen(%arg2)
>>  
>> -xor %r11, %r11 # initialise the data pointer offset as zero
>> +xor %r11d, %r11d # initialise the data pointer offset as zero
>>  PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation
>>  
>>  sub %r11, %arg5 # sub partial block data used
>> @@ -702,7 +702,7 @@ _no_extra_mask_1_\@:
>>  
>>  # GHASH computation for the last <16 Byte block
>>  GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
>> -xor %rax,%rax
>> +xor %eax, %eax
>>  
>>  mov %rax, PBlockLen(%arg2)
>>  jmp _dec_done_\@
>> @@ -737,7 +737,7 @@ _no_extra_mask_2_\@:
>>  
>>  # GHASH computation for the last <16 Byte block
>>  GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
>> -xor %rax,%rax
>> +xor %eax, %eax
>>  
>>  mov %rax, PBlockLen(%arg2)
>>  jmp _encode_done_\@
>> --- 4.18-rc2/arch/x86/crypto/aesni-intel_avx-x86_64.S
>> +++
>4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_avx-x86_64.S
>> @@ -463,7 +463,7 @@ _get_AAD_rest_final\@:
>>  
>>  _get_AAD_done\@:
>>  # initialize the data pointer offset as zero
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  
>>  # start AES for num_initial_blocks blocks
>>  mov arg5, %rax # rax = *Y0
>> @@ -1770,7 +1770,7 @@ _get_AAD_rest_final\@:
>>  
>>  _get_AAD_done\@:
>>  # initialize the data pointer offset as zero
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  
>>  # start AES for num_initial_blocks blocks
>>  mov arg5, %rax # rax = *Y0
>> --- 4.18-rc2/arch/x86/crypto/morus1280-avx2-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-avx2-asm.S
>> @@ -113,7 +113,7 @@ ENDPROC(__morus1280_update_zero)
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  vpxor MSG, MSG, MSG
>>  
>>  mov %rcx, %r8
>> --- 4.18-rc2/arch/x86/crypto/morus1280-sse2-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-sse2-asm.S
>> @@ -235,7 +235,7 @@ ENDPROC(__morus1280_update_zero)
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG_LO, MSG_LO
>>  pxor MSG_HI, MSG_HI
>>  
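
As to Randy's question: the code wants a fully zeroed register, and the 
32-bit form still delivers one, because on x86-64 a write to a 32-bit 
register zero-extends into the full 64-bit register. A stand-alone demo 
(invented for illustration, not from the thread) that can be compiled on 
x86-64 to confirm it:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t v = 0xdeadbeefcafef00dULL;

	/*
	 * "xor %r9d,%r9d" writes a 32-bit register, which zero-extends
	 * into all of %r9 -- and the 32-bit form is the one recognized
	 * as a dependency-breaking zeroing idiom on more CPUs.
	 */
	asm("mov %0, %%r9\n\t"
	    "xor %%r9d, %%r9d\n\t"
	    "mov %%r9, %0"
	    : "+r" (v) : : "r9");

	printf("v = %#llx\n", (unsigned long long)v);	/* prints v = 0 */
	return 0;
}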

Re: [PATCH] x86-64: use 32-bit XOR to zero registers

2018-06-25 Thread hpa
On June 25, 2018 9:33:35 AM PDT, Randy Dunlap  wrote:
>On 06/25/2018 03:25 AM, Jan Beulich wrote:
>> Some Intel CPUs don't recognize 64-bit XORs as zeroing idioms - use
>> 32-bit ones instead.
>
>Hmph.  Is that considered a bug (errata)?
>
>URL/references?
>
>Are these changes really only zeroing the lower 32 bits of the
>register?
>and that's all that the code cares about?
>
>thanks.
>
>> Signed-off-by: Jan Beulich 
>> ---
>>  arch/x86/crypto/aegis128-aesni-asm.S |2 +-
>>  arch/x86/crypto/aegis128l-aesni-asm.S|2 +-
>>  arch/x86/crypto/aegis256-aesni-asm.S |2 +-
>>  arch/x86/crypto/aesni-intel_asm.S|8 
>>  arch/x86/crypto/aesni-intel_avx-x86_64.S |4 ++--
>>  arch/x86/crypto/morus1280-avx2-asm.S |2 +-
>>  arch/x86/crypto/morus1280-sse2-asm.S |2 +-
>>  arch/x86/crypto/morus640-sse2-asm.S  |2 +-
>>  arch/x86/crypto/sha1_ssse3_asm.S |2 +-
>>  arch/x86/kernel/head_64.S|2 +-
>>  arch/x86/kernel/paravirt_patch_64.c  |2 +-
>>  arch/x86/lib/memcpy_64.S |2 +-
>>  arch/x86/power/hibernate_asm_64.S|2 +-
>>  13 files changed, 17 insertions(+), 17 deletions(-)
>> 
>> --- 4.18-rc2/arch/x86/crypto/aegis128-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128-aesni-asm.S
>> @@ -75,7 +75,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG, MSG
>>  
>>  mov LEN, %r8
>> --- 4.18-rc2/arch/x86/crypto/aegis128l-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128l-aesni-asm.S
>> @@ -66,7 +66,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG0, MSG0
>>  pxor MSG1, MSG1
>>  
>> --- 4.18-rc2/arch/x86/crypto/aegis256-aesni-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis256-aesni-asm.S
>> @@ -59,7 +59,7 @@
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG, MSG
>>  
>>  mov LEN, %r8
>> --- 4.18-rc2/arch/x86/crypto/aesni-intel_asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_asm.S
>> @@ -258,7 +258,7 @@ ALL_F:  .octa 0x
>>  .macro GCM_INIT Iv SUBKEY AAD AADLEN
>>  mov \AADLEN, %r11
>>  mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  mov %r11, InLen(%arg2) # ctx_data.in_length = 0
>>  mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0
>>  mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0
>> @@ -286,7 +286,7 @@ ALL_F:  .octa 0x
>>  movdqu HashKey(%arg2), %xmm13
>>  add %arg5, InLen(%arg2)
>>  
>> -xor %r11, %r11 # initialise the data pointer offset as zero
>> +xor %r11d, %r11d # initialise the data pointer offset as zero
>>  PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation
>>  
>>  sub %r11, %arg5 # sub partial block data used
>> @@ -702,7 +702,7 @@ _no_extra_mask_1_\@:
>>  
>>  # GHASH computation for the last <16 Byte block
>>  GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
>> -xor %rax,%rax
>> +xor %eax, %eax
>>  
>>  mov %rax, PBlockLen(%arg2)
>>  jmp _dec_done_\@
>> @@ -737,7 +737,7 @@ _no_extra_mask_2_\@:
>>  
>>  # GHASH computation for the last <16 Byte block
>>  GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
>> -xor %rax,%rax
>> +xor %eax, %eax
>>  
>>  mov %rax, PBlockLen(%arg2)
>>  jmp _encode_done_\@
>> --- 4.18-rc2/arch/x86/crypto/aesni-intel_avx-x86_64.S
>> +++
>4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_avx-x86_64.S
>> @@ -463,7 +463,7 @@ _get_AAD_rest_final\@:
>>  
>>  _get_AAD_done\@:
>>  # initialize the data pointer offset as zero
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  
>>  # start AES for num_initial_blocks blocks
>>  mov arg5, %rax # rax = *Y0
>> @@ -1770,7 +1770,7 @@ _get_AAD_rest_final\@:
>>  
>>  _get_AAD_done\@:
>>  # initialize the data pointer offset as zero
>> -xor %r11, %r11
>> +xor %r11d, %r11d
>>  
>>  # start AES for num_initial_blocks blocks
>>  mov arg5, %rax # rax = *Y0
>> --- 4.18-rc2/arch/x86/crypto/morus1280-avx2-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-avx2-asm.S
>> @@ -113,7 +113,7 @@ ENDPROC(__morus1280_update_zero)
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  vpxor MSG, MSG, MSG
>>  
>>  mov %rcx, %r8
>> --- 4.18-rc2/arch/x86/crypto/morus1280-sse2-asm.S
>> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-sse2-asm.S
>> @@ -235,7 +235,7 @@ ENDPROC(__morus1280_update_zero)
>>   *   %r9
>>   */
>>  __load_partial:
>> -xor %r9, %r9
>> +xor %r9d, %r9d
>>  pxor MSG_LO, MSG_LO
>>  pxor MSG_HI, MSG_HI
>>  
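
A note on Randy's question above: on x86-64, writing a 32-bit register
implicitly zero-extends into the full 64-bit register, so "xor %r9d,%r9d"
clears all of %r9 just like the 64-bit form, while also dropping the REX.W
prefix (one byte shorter) and being recognized as a dependency-breaking
zeroing idiom on more CPUs. A minimal user-space sketch (not part of the
patch) demonstrating the zero-extension:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t v = 0xdeadbeefcafef00dULL;

    /* xorl clears %eax and, by zero-extension, all 64 bits of %rax */
    asm("movq %1, %%rax\n\t"
        "xorl %%eax, %%eax\n\t"   /* same effect as xorq %rax, %rax */
        "movq %%rax, %0"
        : "=r" (v)
        : "r" (v)
        : "rax");

    printf("%#llx\n", (unsigned long long)v);   /* prints 0 */
    return 0;
}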

Re: [PATCH v3 7/7] x86/ldt,ptrace: provide regset access to the LDT

2018-06-22 Thread hpa
On June 22, 2018 7:49:13 AM PDT, Andy Lutomirski  wrote:
>On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
> wrote:
>>
>> From: "H. Peter Anvin" 
>>
>> Provide ptrace/regset access to the LDT, if one exists.  This
>> interface provides both read and write access. The write code is
>> unified with modify_ldt(); the read code doesn't have enough
>> similarity so it has been kept separate.
>
>For this and for the GDT, you've chosen to use struct user_desc as
>your format instead of using a native hardware descriptor format.  Any
>particular reason why?  If nothing else, it will bloat core files a
>bit more than needed.

I did because REGSET_TLS was implemented that way, and that is simply a subset 
of the GDT (which made the same code trivially applicable to both.) 
modify_ldt() does it *both* ways for extra fun (one for reading, and one for 
writing.)

->active is defined as "beyond this point the regset contains only the default 
value", which seemed appropriate in this case.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
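
For reference, the regset format under discussion is the long-standing
struct user_desc from the uapi header <asm/ldt.h>, shown here as in the
mainline header of the time (the lm bit exists only on 64-bit):

struct user_desc {
    unsigned int  entry_number;
    unsigned int  base_addr;
    unsigned int  limit;
    unsigned int  seg_32bit:1;
    unsigned int  contents:2;
    unsigned int  read_exec_only:1;
    unsigned int  limit_in_pages:1;
    unsigned int  seg_not_present:1;
    unsigned int  useable:1;
#ifdef __x86_64__
    unsigned int  lm:1;           /* long-mode (64-bit) code segment */
#endif
};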


Re: [PATCH v3 0/7] x86/ptrace: regset access to the GDT and LDT

2018-06-21 Thread hpa
On June 21, 2018 6:58:51 PM PDT, Ingo Molnar  wrote:
>
>* H. Peter Anvin, Intel  wrote:
>
>> From: "H. Peter Anvin" 
>> 
>> Give a debugger access to the visible part of the GDT and LDT.  This
>> allows a debugger to find out what a particular segment descriptor
>> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
>> 
>> v3:
>>  Requalify LDT segments for selectors that have actually changed.
>> 
>> v2:
>>  Add missing change to elf.h for the very last patch.
>> 
>> Cc: Ingo Molnar 
>> Cc: Thomas Gleixner 
>> Cc: Andy Lutomirski 
>> Cc: Chang S. Bae 
>> Cc: Markus T. Metzger 
>> Cc: H. Peter Anvin 
>> 
>>  arch/x86/Kconfig   |   4 +
>>  arch/x86/include/asm/desc.h|  24 +++-
>>  arch/x86/include/asm/ldt.h |  16 +++
>>  arch/x86/include/asm/segment.h |  10 ++
>>  arch/x86/kernel/Makefile   |   3 +-
>>  arch/x86/kernel/ldt.c  | 283
>-
>>  arch/x86/kernel/ptrace.c   | 103 ++-
>>  arch/x86/kernel/tls.c  | 102 +--
>>  arch/x86/kernel/tls.h  |   8 +-
>>  include/uapi/linux/elf.h   |   2 +
>>  10 files changed, 413 insertions(+), 142 deletions(-)
>
>Ok, this looks mostly good to me at a quick glance, but could you
>please do one 
>more iteration and more explicitly describe where you change/extend
>existing ABIs?
>
>I.e. instead of a terse and somewhat vague summary:
>
>> x86/tls,ptrace: provide regset access to the GDT
>>
>> Give a debugger access to the visible part of the GDT and LDT.  This
>> allows a debugger to find out what a particular segment descriptor
>> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
>
>Please make it really explicit how the ABI is affected, both in the
>title and in 
>the description, and also _explain_ how this helps us over what we had
>before, 
>plus also list limitations of the new ABI.
>
>I.e. something like:
>
>  x86/tls, ptrace: Extend the ptrace ABI with the new REGSET_GDT method
>
>   Add the new REGSET_GDT ptrace variant to PTRACE_{GET|SET}REGSET,
>   to give read and write access to the GDT:
>
>- Previously only certain parts of the GDT were visible, and only via
>random
>ABIs and instructions - there was no architectured access to all of it.
>
>- The SETREGSET variant is only allowed to change the TLS area of the
>GDT.
>
>(or so.)
>
>This is important not just for documentation and library support
>purposes, but 
>also to be able to review it properly, to match 'intent' to
>'implementation'.
>
>(It might also help later on in finding bugs/quirks, if any.)
>
>Please do this for all patches in the series that change the ABI.
>
>Thanks,
>
>   Ingo

ACK.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
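
Some context on the interface being extended: PTRACE_GETREGSET addresses a
regset by ELF note type plus an iovec, and ->active bounds how much of the
buffer carries non-default data. A sketch against an existing note type
(NT_PRSTATUS; the note numbers this series would add are not shown, since
the series was still under review):

#include <elf.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Read one regset from a stopped tracee.  The kernel clamps iov_len to
 * the regset's size, so on return it tells you how many bytes are valid
 * ("beyond this point the regset contains only the default value"). */
static long read_regset(pid_t pid, int nt_type, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    long ret = ptrace(PTRACE_GETREGSET, pid, (void *)(long)nt_type, &iov);

    if (!ret)
        printf("regset %d: %zu bytes\n", nt_type, iov.iov_len);
    return ret;
}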


Re: [PATCH v3 9/9] x86: jump-labels: use macros instead of inline assembly

2018-06-10 Thread hpa
On June 10, 2018 7:19:11 AM PDT, Nadav Amit  wrote:
>Use assembly macros for jump-labels and call them from inline assembly.
>This not only makes the code more readable, but also improves
>compilation decisions, specifically inlining decisions, which GCC bases
>on the number of new lines in inline assembly.
>
>As a result the code size is slightly increased.
>
>   text   data bss dec hex filename
>18163528 10226300 2957312 31347140 1de51c4 ./vmlinux before
>18163608 10227348 2957312 31348268 1de562c ./vmlinux after (+1128)
>
>And functions such as intel_pstate_adjust_policy_max(),
>kvm_cpu_accept_dm_intr(), kvm_register_read() are inlined.
>
>Cc: Thomas Gleixner 
>Cc: Ingo Molnar 
>Cc: "H. Peter Anvin" 
>Cc: x...@kernel.org
>Cc: Greg Kroah-Hartman 
>Cc: Kate Stewart 
>Cc: Philippe Ombredanne 
>
>Signed-off-by: Nadav Amit 
>---
> arch/x86/include/asm/jump_label.h | 65 ++-
> arch/x86/kernel/macros.S  |  1 +
> 2 files changed, 39 insertions(+), 27 deletions(-)
>
>diff --git a/arch/x86/include/asm/jump_label.h
>b/arch/x86/include/asm/jump_label.h
>index 8c0de4282659..ea0633a41122 100644
>--- a/arch/x86/include/asm/jump_label.h
>+++ b/arch/x86/include/asm/jump_label.h
>@@ -2,19 +2,6 @@
> #ifndef _ASM_X86_JUMP_LABEL_H
> #define _ASM_X86_JUMP_LABEL_H
> 
>-#ifndef HAVE_JUMP_LABEL
>-/*
>- * For better or for worse, if jump labels (the gcc extension) are
>missing,
>- * then the entire static branch patching infrastructure is compiled
>out.
>- * If that happens, the code in here will malfunction.  Raise a
>compiler
>- * error instead.
>- *
>- * In theory, jump labels and the static branch patching
>infrastructure
>- * could be decoupled to fix this.
>- */
>-#error asm/jump_label.h included on a non-jump-label kernel
>-#endif
>-
> #define JUMP_LABEL_NOP_SIZE 5
> 
> #ifdef CONFIG_X86_64
>@@ -28,18 +15,27 @@
> 
> #ifndef __ASSEMBLY__
> 
>+#ifndef HAVE_JUMP_LABEL
>+/*
>+ * For better or for worse, if jump labels (the gcc extension) are
>missing,
>+ * then the entire static branch patching infrastructure is compiled
>out.
>+ * If that happens, the code in here will malfunction.  Raise a
>compiler
>+ * error instead.
>+ *
>+ * In theory, jump labels and the static branch patching
>infrastructure
>+ * could be decoupled to fix this.
>+ */
>+#error asm/jump_label.h included on a non-jump-label kernel
>+#endif
>+
> #include 
> #include 
> 
>static __always_inline bool arch_static_branch(struct static_key *key,
>bool branch)
> {
>-  asm_volatile_goto("1:"
>-  ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
>-  ".pushsection __jump_table,  \"aw\" \n\t"
>-  _ASM_ALIGN "\n\t"
>-  _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
>-  ".popsection \n\t"
>-  : :  "i" (key), "i" (branch) : : l_yes);
>+  asm_volatile_goto("STATIC_BRANCH_GOTO l_yes=\"%l[l_yes]\" key=\"%c0\"
>"
>+"branch=\"%c1\""
>+  : :  "i" (key), "i" (branch) : : l_yes);
> 
>   return false;
> l_yes:
>@@ -48,13 +44,8 @@ static __always_inline bool
>arch_static_branch(struct static_key *key, bool bran
> 
>static __always_inline bool arch_static_branch_jump(struct static_key
>*key, bool branch)
> {
>-  asm_volatile_goto("1:"
>-  ".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
>-  "2:\n\t"
>-  ".pushsection __jump_table,  \"aw\" \n\t"
>-  _ASM_ALIGN "\n\t"
>-  _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
>-  ".popsection \n\t"
>+  asm_volatile_goto("STATIC_BRANCH_JUMP_GOTO l_yes=\"%l[l_yes]\"
>key=\"%c0\" "
>+"branch=\"%c1\""
>   : :  "i" (key), "i" (branch) : : l_yes);
> 
>   return false;
>@@ -108,6 +99,26 @@ struct jump_entry {
>   .popsection
> .endm
> 
>+.macro STATIC_BRANCH_GOTO l_yes:req key:req branch:req
>+1:
>+  .byte STATIC_KEY_INIT_NOP
>+  .pushsection __jump_table, "aw"
>+  _ASM_ALIGN
>+  _ASM_PTR 1b, \l_yes, \key + \branch
>+  .popsection
>+.endm
>+
>+.macro STATIC_BRANCH_JUMP_GOTO l_yes:req key:req branch:req
>+1:
>+  .byte 0xe9
>+  .long \l_yes - 2f
>+2:
>+  .pushsection __jump_table, "aw"
>+  _ASM_ALIGN
>+  _ASM_PTR 1b, \l_yes, \key + \branch
>+  .popsection
>+.endm
>+
> #endif /* __ASSEMBLY__ */
> 
> #endif
>diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
>index bf8b9c93e255..161c95059044 100644
>--- a/arch/x86/kernel/macros.S
>+++ b/arch/x86/kernel/macros.S
>@@ -13,3 +13,4 @@
> #include 
> #include 
> #include 
>+#include 

If the code size increases, do you have any metrics for improvement?

That being said, I do like the readability improvements and the ability to use 
gas macros in assembly as opposed to cpp macros wrapped in different ways for C 
and assembly.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
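
On the inlining point: GCC estimates the size of an asm statement roughly
by counting its lines, so a long inline-asm body makes the containing
function look expensive and can defeat inlining. Hiding the body behind a
gas macro makes the asm a single line again. A stand-alone toy of the idea
(the kernel's actual plumbing feeds macros.S to the assembler through the
build flags instead):

/* define an assembler macro once, at file scope ... */
asm(".macro NOP5_TOY\n\t"
    ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"   /* 5-byte NOP */
    ".endm");

/* ... so the inline asm GCC's heuristic sees is one line, although it
 * still assembles to the same five bytes as the long form */
static inline void nop5(void)
{
    asm volatile("NOP5_TOY");
}

int main(void)
{
    nop5();
    return 0;
}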


Re: [PATCH v2 7/8] x86/segments/32: Introduce CPU_NUMBER segment

2018-06-06 Thread hpa
On June 6, 2018 12:07:15 PM PDT, Brian Gerst  wrote:
>On Wed, Jun 6, 2018 at 1:16 PM, Andy Lutomirski 
>wrote:
>> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae
> wrote:
>>>
>>> The new entry will be equivalent to that of x86-64 which
>>> stores CPU number. The entry is placed in segment 23 in GDT
>>> by bumping down 23-28 by one, which are all kernel-internal
>>> segments and so have no impact on user space.
>>>
>>> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
>>> for the default (flat, initial) user space %ss.
>>
>> No, it won't :(  This is because, on Xen PV, user code very
>frequently
>> sees a different, Xen-supplied "flat" SS value.  This is definitely
>> true right now on 64-bit, and I'm reasonably confident it's also the
>> case on 32-bit.
>>
>> As it stands, as far as I can tell, we don't have a "cpu number"
>> segment on 32-bit kernels.  I see no compelling reason to add one,
>and
>> we should definitely not add one as part of the FSGSBASE series.  I
>> think the right solution is to rename the 64-bit segment to
>> "CPU_NUMBER" and then have the rearrangement of the initialization
>> code as a followup patch.  The goal is to make the patches
>> individually reviewable.  As it stands, this patch adds some #defines
>> without making them work, which is extra confusing.
>>
>> Given how many times we screwed it up, I really want the patch that
>> moves the initialization of the 64-bit CPU number to be obviously
>> correct and to avoid changing the semantics of anything except the
>> actual CPU number fields during boot.
>>
>> So NAK to this patch, at least as part of the FSGSBASE series.
>>
>> (My apologies -- a bunch of this is because I along with everyone
>else
>> misunderstood the existing code.)
>
>The sole purpose of this segment is for the getcpu() function in the
>VDSO.  No other userspace code can rely on its presence or location.
>
>--
>Brian Gerst

Unfortunately that is not true in reality :(
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
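
Background for this thread: the 64-bit vdso's getcpu() recovers the CPU
and node number from the segment limit of this special GDT entry via the
unprivileged LSL instruction, which is how user code outside the vdso has
come to depend on the selector. A sketch (the 0x7b selector value is an
assumption taken from 64-bit kernels of this era; check asm/segment.h):

#include <stdio.h>

int main(void)
{
    unsigned int p = 0, cpu, node;

    /* LSL loads the segment limit, into which the kernel encodes
     * cpu | (node << 12); if the selector is invalid, ZF is clear
     * and p is left untouched. */
    asm("lsl %1, %0" : "+r" (p) : "r" (0x7bU));
    cpu  = p & 0xfff;
    node = p >> 12;
    printf("cpu %u node %u\n", cpu, node);
    return 0;
}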


Re: Is this a kernel BUG? ///Re: [Question] Can we use SIGRTMIN when vdso disabled on X86?

2018-06-06 Thread hpa
On June 6, 2018 2:17:42 AM PDT, "Leizhen (ThunderTown)" 
 wrote:
>I found that glibc has already dealt with this case. So this issue must
>have been met before, should it be maintained by libc/user?
>
>   if (GLRO(dl_sysinfo_dso) == NULL)
>   {
>     kact.sa_flags |= SA_RESTORER;
>
>     kact.sa_restorer = ((act->sa_flags & SA_SIGINFO)
>                         ? &restore_rt : &restore);
>   }
>
>
>On 2018/6/6 15:52, Leizhen (ThunderTown) wrote:
>> 
>> 
>> On 2018/6/5 19:24, Leizhen (ThunderTown) wrote:
>>> After I executed "echo 0 > /proc/sys/abi/vsyscall32" to disable
>vdso, the rt_sigaction01 test case from ltp_2015 failed.
>>> The test case source code please refer to the attachment, and the
>output as below:
>>>
>>> -
>>> ./rt_sigaction01
>>> rt_sigaction01    0  TINFO  :  signal: 34
>>> rt_sigaction01    1  TPASS  :  rt_sigaction call succeeded: result =
>0
>>> rt_sigaction01    0  TINFO  :  sa.sa_flags = SA_RESETHAND|SA_SIGINFO
>>> rt_sigaction01    0  TINFO  :  Signal Handler Called with signal
>number 34
>>>
>>> Segmentation fault
>>> --
>>>
>>>
>>> Is this the desired result? In function ia32_setup_rt_frame, I found
>below code:
>>>
>>> if (ksig->ka.sa.sa_flags & SA_RESTORER)
>>> restorer = ksig->ka.sa.sa_restorer;
>>> else
>>> restorer = current->mm->context.vdso +
>>> vdso_image_32.sym___kernel_rt_sigreturn;
>>> put_user_ex(ptr_to_compat(restorer), &frame->pretcode);
>>>
>>> Because the vdso is disabled, current->mm->context.vdso is NULL,
>which causes the value of frame->pretcode to be invalid.
>>>
>>> I'm not sure whether this is a kernel bug or just an error of test
>case itself. Can anyone help me?
>>>
>> 

The use of signals without SA_RESTORER is considered obsolete, but it's 
somewhat surprising that the vdso isn't there; it should be mapped even for 
static binaries esp. on i386 since it is the preferred way to do system calls 
(you don't need to parse the ELF for that.) Are you explicitly disabling the 
VDSO? If so, Don't Do That.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
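
To make the fallback concrete: with no vdso, the caller must pass its own
SA_RESTORER trampoline whose only job is to invoke rt_sigreturn, which is
exactly what the quoted glibc branch arranges. A minimal x86-64 sketch of
that arrangement (struct layout and numbers follow the 64-bit kernel ABI;
my_restorer is a made-up name, and this is an illustration rather than a
portable program):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define KSA_RESTORER 0x04000000     /* SA_RESTORER from <asm/signal.h> */

/* restorer trampoline: the kernel "returns" here after the handler */
asm(".pushsection .text\n"
    ".globl my_restorer\n"
    "my_restorer:\n\t"
    "movq $15, %rax\n\t"            /* __NR_rt_sigreturn on x86-64 */
    "syscall\n"
    ".popsection");
extern void my_restorer(void);

/* the x86-64 kernel's view of struct sigaction */
struct ksigaction {
    void (*handler)(int);
    unsigned long flags;
    void (*restorer)(void);
    unsigned long mask;
};

static void handler(int sig) { (void)sig; }

int main(void)
{
    struct ksigaction act = {
        .handler  = handler,
        .flags    = KSA_RESTORER,
        .restorer = my_restorer,
    };

    /* the last argument is the kernel's sigsetsize, 8 on x86-64 */
    syscall(SYS_rt_sigaction, SIGRTMIN, &act, NULL, 8UL);
    raise(SIGRTMIN);
    printf("handler returned cleanly via my_restorer\n");
    return 0;
}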


Re: [PATCH V2 06/15] taint: Add taint for insecure

2018-05-31 Thread hpa
On May 31, 2018 1:25:39 PM PDT, Andy Lutomirski  wrote:
>On Thu, May 31, 2018 at 10:58 AM Chang S. Bae
> wrote:
>>
>> When adding new feature support, patches need to be
>> incrementally applied and tested with temporal parameters.
>> For such testing (or root-only) purposes, the new flag
>> will serve to tag the kernel taint state properly.
>
>I'm okay with this, I guess, but I'm not at all convinced we need it.

This was my idea. It isn't the only thing that may want it, and I think it is 
critical that we give the system a way to flag that the system contains 
experimental code that is known to break security. Sometimes that kind of 
experimental code is useful (I have written some myself, e.g. to test SMAP), 
but it is a good idea to be able to flag such a kernel permanently, even if 
it's a module that can be removed.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
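
For reference, raising such a flag goes through the existing taint
machinery; a kernel-side fragment in the shape the patch implies
(TAINT_INSECURE is the flag proposed by this series, not a mainline one):

#include <linux/kernel.h>

/* hypothetical: called when an experimental, known-insecure feature
 * (say, a SMAP-weakening test hook) is first enabled */
static void mark_kernel_insecure(void)
{
    pr_warn("enabling experimental insecure feature, tainting kernel\n");
    add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);
}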


Re: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately

2018-05-31 Thread hpa
On May 31, 2018 1:14:59 PM PDT, Andy Lutomirski  wrote:
>On Thu, May 31, 2018 at 10:58 AM Chang S. Bae
> wrote:
>>
>> From: Andy Lutomirski 
>>
>> ptrace can read FS/GS base using the register access API
>> (PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL.  Make both of these
>> mechanisms return the actual FS/GS base.
>>
>> This will improve debuggability by providing the correct information
>> to ptracer (GDB and etc).
>
>LGTM.

LGTM?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
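
From the debugger's side, the path being fixed looks like this; a sketch
for a tracee already stopped under ptrace (error handling elided):

#include <stddef.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* after the patch, PTRACE_PEEKUSER reports the live FS/GS base even if
 * the tracee changed it behind the kernel's back (e.g. via WRFSBASE) */
static void show_bases(pid_t pid)
{
    long fsbase = ptrace(PTRACE_PEEKUSER, pid,
                         offsetof(struct user_regs_struct, fs_base), 0);
    long gsbase = ptrace(PTRACE_PEEKUSER, pid,
                         offsetof(struct user_regs_struct, gs_base), 0);

    printf("fs_base=%#lx gs_base=%#lx\n", fsbase, gsbase);
}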


Re: [PATCH] x86, random: Fix get_random_bytes() warning in x86 start_kernel

2018-05-29 Thread hpa
On May 29, 2018 9:58:10 AM PDT, Prarit Bhargava  wrote:
>
>
>On 05/29/2018 12:07 PM, Theodore Y. Ts'o wrote:
>> On Tue, May 29, 2018 at 11:01:07AM -0400, Prarit Bhargava wrote:
>>> Kees, in early boot no pool is available so the stack canary is
>initialized from
>>> the TSC.  Later in boot, the stack canary will use the crng.
>>>
>>> ie) in early boot only TSC is okay, and late boot (when crng_ready()
>is true)
>>> the pool will be used.
>> 
>> But that means all of the kernel threads (e.g., workqueues, et. al)
>> would not be well protected by the stack canary.  That
>> seems rather unfortunate.
>
>Well, as stated the TSC is used as a source of entropy in early boot. 
>It's
>always been that way and get_random_bytes() AFAICT has always returned
>0.  CPUs
>added later on via hotplug do use get_random_bytes().
>
>Does anyone cc'd have a better idea on how to get another source of
>entropy this
>early in boot?
>
>P.
>
>> 
>>   - Ted
>> 

RDRAND/RDSEED for newer x86 processors.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
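
On the RDRAND/RDSEED suggestion: the instructions need no kernel state,
only CPUID support, so they work arbitrarily early. A user-space flavored
sketch of the retry idiom Intel's documentation recommends (build with
-mrdrnd; in the kernel the equivalent helper is arch_get_random_long()):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned long long v;

    /* RDRAND can transiently fail (CF=0); retry a bounded number of
     * times rather than looping forever */
    for (int i = 0; i < 10; i++) {
        if (_rdrand64_step(&v)) {
            printf("%#llx\n", v);
            return 0;
        }
    }
    fprintf(stderr, "no hardware entropy available\n");
    return 1;
}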


Re: [clang] stack protector and f1f029c7bf

2018-05-25 Thread hpa
On May 25, 2018 10:49:28 AM PDT, Nick Desaulniers  
wrote:
>On Fri, May 25, 2018 at 10:35 AM Tom Stellard 
>wrote:
>> On 05/25/2018 10:31 AM, Nick Desaulniers wrote:
>> > On Fri, May 25, 2018 at 9:53 AM  wrote:
>> >> On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers <
>ndesaulni...@google.com>
>> > wrote:
>> >>> On Fri, May 25, 2018 at 9:33 AM  wrote:
>>  On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers
>> >>>  wrote:
>> >>> When you say
>> >>>
>>  It still should be available as as inline, however, but now
>"extern
>> >>> inline".
>> >>>
>> >>> Am I understanding correctly that native_save_fl should be
>inlined
>into
>> >>> all
>> >>> call sites (modulo the problematic pv_irq_ops.save_fl case)? 
>Because
>> >>> for
>> >>> these two assembly implementations, it's not, but maybe there's
>> >>> something
>> >>> missing in my implementation?
>> >
>> >> Yes, that's what "extern inline" means. Maybe it needs a must
>inline
>> > annotation, but that's really messed up.
>> >
>
>> What about doing something like suggested here:
>> https://bugs.llvm.org/show_bug.cgi?id=37512#c17
>
>> This would keep the definition in C and make it easier for compilers
>> to inline.
>
>The GCC docs for __attribute__((naked)) seem to imply this is a machine
>specific constraint (of which x86 is not listed):
>https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html
>
>But let's try with source:
>https://godbolt.org/g/aJ4gZB
>
>Clang errors:
>:3:3: error: non-ASM statement in naked function is not
>supported
>   unsigned long flags;
>   ^
>
>Is it valid to use assembly to place the results in %rax and mark the c
>function somehow?
>
>gcc doesn't support this attribute until 4.9 (but we can add a feature
>test
>for attributes with gcc (unlike builtins)), but warns that:
>
>warning: ‘naked’ attribute directive ignored [-Wattributes]
>
>gcc 8.1 and trunk inserts a `ud2` instruction (what?!) (let me see if I
>can
>repro outside of godbolt, and will file a bug report).

No, we found that the paravirt code can do the wrong thing for a C 
implementation.

Nick, could you forward the list of problems so we all have it?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-25 Thread hpa
On May 25, 2018 10:31:51 AM PDT, Nick Desaulniers  
wrote:
>On Fri, May 25, 2018 at 9:53 AM  wrote:
>> On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers
>
>wrote:
>> >On Fri, May 25, 2018 at 9:33 AM  wrote:
>> >> On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers
>> > wrote:
>> >When you say
>> >
>> >> It still should be available as as inline, however, but now
>"extern
>> >inline".
>> >
>> >Am I understanding correctly that native_save_fl should be inlined
>into
>> >all
>> >call sites (modulo the problematic pv_irq_ops.save_fl case)? 
>Because
>> >for
>> >these two assembly implementations, it's not, but maybe there's
>> >something
>> >missing in my implementation?
>
>> Yes, that's what "extern inline" means. Maybe it needs a must inline
>annotation, but that's really messed up.
>
>I don't think it's possible to inline a function from an external
>translation unit without something like LTO.
>
>If I move the implementation of native_save_fl() to a separate .c (with
>out
>of line assembly) or .S, neither clang nor gcc will inline that
>assembly to
>any call sites, whether the declaration of native_save_fl() looks like:
>
>extern inline unsigned long native_save_fl(void);
>
>or
>
>__attribute__((always_inline))
>
>extern inline unsigned long native_save_fl(void);
>
>I think an external copy is the best approach for the paravirt code:
>
>diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c
>new file mode 100644
>index ..e173ba8bee7b
>--- /dev/null
>+++ b/arch/x86/kernel/irqflags.c
>@@ -0,0 +1,24 @@
>+#include 
>+
>+extern unsigned long native_save_fl_no_stack_protector(void);
>+extern void native_restore_fl_no_stack_protector(unsigned long flags);
>+
>+asm(
>+".pushsection .text;"
>+".global native_save_fl_no_stack_protector;"
>+".type native_save_fl_no_stack_protector, @function;"
>+"native_save_fl_no_stack_protector:"
>+"pushf;"
>+"pop %" _ASM_AX ";"
>+"ret;"
>+".popsection");
>+
>+asm(
>+".pushsection .text;"
>+".global native_restore_fl_no_stack_protector;"
>+".type native_restore_fl_no_stack_protector, @function;"
>+"native_restore_fl_no_stack_protector:"
>+"push %" _ASM_DI ";"
>+"popf;"
>+"ret;"
>+".popsection");
>diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>index 02d6f5cf4e70..8824d01c0c35 100644
>--- a/arch/x86/kernel/Makefile
>+++ b/arch/x86/kernel/Makefile
>@@ -61,6 +61,7 @@ obj-y += alternative.o i8253.o
>hw_breakpoint.o
>  obj-y  += tsc.o tsc_msr.o io_delay.o rtc.o
>  obj-y  += pci-iommu_table.o
>  obj-y  += resource.o
>+obj-y  += irqflags.o
>
>  obj-y  += process.o
>  obj-y  += fpu/
>--- a/arch/x86/kernel/paravirt.c
>+++ b/arch/x86/kernel/paravirt.c
>@@ -322,9 +322,12 @@ struct pv_time_ops pv_time_ops = {
> .steal_clock = native_steal_clock,
>  };
>
>+extern unsigned long native_save_fl_no_stack_protector(void);
>+extern void native_restore_fl_no_stack_protector(unsigned long flags);
>+
>  __visible struct pv_irq_ops pv_irq_ops = {
>-   .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>-   .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
>+   .save_fl =
>__PV_IS_CALLEE_SAVE(native_save_fl_no_stack_protector),
>+   .restore_fl =
>__PV_IS_CALLEE_SAVE(native_restore_fl_no_stack_protector),
> .irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
> .irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
> .safe_halt = native_safe_halt,
>
>Thoughts?

You need the extern inline in the .h file and the out-of-line .S file both.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
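
Putting hpa's two requirements together, the shape is: the header keeps an
inline-only definition, and a separate translation unit provides the real
symbol for the paravirt function pointer. A condensed sketch of that
pairing (gnu_inline is what gives "extern inline" its GNU89 meaning of
"inline only, emit no symbol" under clang and newer -std= levels):

/* irqflags.h -- inline-only definition; no symbol is emitted here */
extern inline __attribute__((gnu_inline))
unsigned long native_save_fl(void)
{
    unsigned long flags;

    asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
    return flags;
}

/* irqflags.c (or a .S file) -- the out-of-line copy; it clobbers only
 * %rax, so it is safe to wrap in __PV_IS_CALLEE_SAVE */
asm(".pushsection .text\n"
    ".globl native_save_fl\n"
    ".type native_save_fl, @function\n"
    "native_save_fl:\n\t"
    "pushf\n\t"
    "pop %rax\n\t"
    "ret\n"
    ".popsection");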


Re: [clang] stack protector and f1f029c7bf

2018-05-25 Thread hpa
On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers  
wrote:
>On Fri, May 25, 2018 at 9:33 AM  wrote:
>> On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers
>
>wrote:
>> >On Thu, May 24, 2018 at 3:43 PM  wrote:
>> >> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers
>> >
>> >wrote:
>> >> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin 
>> >wrote:
>> >> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r".
>> >That
>> >> >is
>> >> >> unequivocally a compiler bug.
>> >> >
>> >> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583
>> >> >
>> >> >> >> You are claiming it doesn't buy us anything, but you are
>only
>> >> >looking
>> >> >at
>> >> >> > the paravirt case which is kind of "special" (in the short
>bus
>> >kind
>> >> >of
>> >> >way),
>> >> >> >
>> >> >> > That's fair.  Is another possible solution to have paravirt
>> >maybe
>> >> >not
>> >> >use
>> >> >> > native_save_fl() then, but its own
>> >> >non-static-inline-without-m-constraint
>> >> >> > implementation?
>> >> >
>> >> >> KERNEL AR: change native_save_fl() to an extern inline with an
>> >> >assembly
>> >> >> out-of-line implementation, to satisfy the paravirt requirement
>> >that
>> >> >no
>> >> >> GPRs other than %rax are clobbered.
>> >> >
>> >> >i'm happy to add that, do you have a recommendation if it should
>go
>> >in
>> >> >an
>> >> >existing .S file or a new one (and if so where/what shall I call
>> >it?).
>> >
>> >> How about irqflags.c since that is what the .h file is called.
>> >
>> >> It should simply be:
>> >
>> >> push %rdi
>> >> popf
>> >> ret
>> >
>> >> pushf
>> >> pop %rax
>> >> ret
>> >
>> >> ... but with all the regular assembly decorations, of course.
>> >
>> >Something like the following?
>> >
>> >
>> >diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c
>> >new file mode 100644
>> >index ..59dc21bd3327
>> >--- /dev/null
>> >+++ b/arch/x86/kernel/irqflags.c
>> >@@ -0,0 +1,24 @@
>> >+#include 
>> >+
>> >+extern unsigned long native_save_fl(void);
>> >+extern void native_restore_fl(unsigned long flags);
>> >+
>> >+asm(
>> >+".pushsection .text;"
>> >+".global native_save_fl;"
>> >+".type native_save_fl, @function;"
>> >+"native_save_fl:"
>> >+"pushf;"
>> >+"pop %" _ASM_AX ";"
>> >+"ret;"
>> >+".popsection");
>> >+
>> >+asm(
>> >+".pushsection .text;"
>> >+".global native_restore_fl;"
>> >+".type native_restore_fl, @function;"
>> >+"native_restore_fl:"
>> >+"push %" _ASM_DI ";"
>> >+"popf;"
>> >+"ret;"
>> >+".popsection");
>> >
>> >And change the declaration in arch/x86/include/asm/irqflags.h to:
>> >+extern inline unsigned long native_save_fl(void);
>> >+extern inline void native_restore_fl(unsigned long flags);
>> >
>> >This seems to work, but
>> >1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is
>never
>> >defined (arch_local_save_flags() uses it).  Does that mean
>> >arch_local_save_flags(), and friends would also have to move to the
>> >newly
>> >created .c file as well?
>> >2. `extern inline` doesn't inline any instances (from what I can
>tell
>> >from
>> >disassembling vmlinux).  I think this is strictly worse. Don't we
>only
>> >want
>> >pv_irq_ops.save_fl to be non-inlined in a way that no stack
>protector
>> >can
>> >be added? If that's the case, should my assembly based
>implementation
>> >have
>> >a different identifier (`native_save_fl_paravirt` or something).
>That
>> >would
>> >also fix point #1 above. But now the paravirt code has its own copy
>of
>> >the
>> >function.
>
>> Sorry, I meant irqflags.S.
>
>> It still should be available as as inline, however, but now "extern
>inline".
>
>Heh, ok I was confused.  But in testing, I had also created:
>
>arch/x86/lib/irqflags.S
>/* SPDX-License-Identifier: GPL-2.0 */
>
>#include 
>#include 
>#include 
>
>/*
>  * unsigned long native_save_fl(void)
>  */
>ENTRY(native_save_fl)
>pushf
>pop %_ASM_AX
>ret
>ENDPROC(native_save_fl)
>EXPORT_SYMBOL(native_save_fl)
>
>/*
>  * void native_restore_fl(unsigned long flags)
>  * %rdi: flags
>  */
>ENTRY(native_restore_fl)
>push %_ASM_DI
>popf
>ret
>ENDPROC(native_restore_fl)
>EXPORT_SYMBOL(native_restore_fl)
>
>The issue is that this still has issues 1 & 2 listed earlier (and the
>disassembly has a lot more trailing nops added).
>
>When you say
>
>> It still should be available as as inline, however, but now "extern
>inline".
>
>Am I understanding correctly that native_save_fl should be inlined into
>all
>call sites (modulo the problematic pv_irq_ops.save_fl case)?  Because
>for
>these two assembly implementations, it's not, but maybe there's
>something
>missing in my implementation?

Yes, that's what "extern inline" means. Maybe it needs a must inline 
annotation, but that's really messed up.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-25 Thread hpa
On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers  
wrote:
>On Thu, May 24, 2018 at 3:43 PM  wrote:
>> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers
>
>wrote:
>> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin 
>wrote:
>> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r".
>That
>> >is
>> >> unequivocally a compiler bug.
>> >
>> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583
>> >
>> >> >> You are claiming it doesn't buy us anything, but you are only
>> >looking
>> >at
>> >> > the paravirt case which is kind of "special" (in the short bus
>kind
>> >of
>> >way),
>> >> >
>> >> > That's fair.  Is another possible solution to have paravirt
>maybe
>> >not
>> >use
>> >> > native_save_fl() then, but its own
>> >non-static-inline-without-m-constraint
>> >> > implementation?
>> >
>> >> KERNEL AR: change native_save_fl() to an extern inline with an
>> >assembly
>> >> out-of-line implementation, to satisfy the paravirt requirement
>that
>> >no
>> >> GPRs other than %rax are clobbered.
>> >
>> >i'm happy to add that, do you have a recommendation if it should go
>in
>> >an
>> >existing .S file or a new one (and if so where/what shall I call
>it?).
>
>> How about irqflags.c since that is what the .h file is called.
>
>> It should simply be:
>
>> push %rdi
>> popf
>> ret
>
>> pushf
>> pop %rax
>> ret
>
>> ... but with all the regular assembly decorations, of course.
>
>Something like the following?
>
>
>diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c
>new file mode 100644
>index ..59dc21bd3327
>--- /dev/null
>+++ b/arch/x86/kernel/irqflags.c
>@@ -0,0 +1,24 @@
>+#include 
>+
>+extern unsigned long native_save_fl(void);
>+extern void native_restore_fl(unsigned long flags);
>+
>+asm(
>+".pushsection .text;"
>+".global native_save_fl;"
>+".type native_save_fl, @function;"
>+"native_save_fl:"
>+"pushf;"
>+"pop %" _ASM_AX ";"
>+"ret;"
>+".popsection");
>+
>+asm(
>+".pushsection .text;"
>+".global native_restore_fl;"
>+".type native_restore_fl, @function;"
>+"native_restore_fl:"
>+"push %" _ASM_DI ";"
>+"popf;"
>+"ret;"
>+".popsection");
>
>And change the declaration in arch/x86/include/asm/irqflags.h to:
>+extern inline unsigned long native_save_fl(void);
>+extern inline void native_restore_fl(unsigned long flags);
>
>This seems to work, but
>1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is never
>defined (arch_local_save_flags() uses it).  Does that mean
>arch_local_save_flags(), and friends would also have to move to the
>newly
>created .c file as well?
>2. `extern inline` doesn't inline any instances (from what I can tell
>from
>disassembling vmlinux).  I think this is strictly worse. Don't we only
>want
>pv_irq_ops.save_fl to be non-inlined in a way that no stack protector
>can
>be added? If that's the case, should my assembly based implementation
>have
>a different identifier (`native_save_fl_paravirt` or something). That
>would
>also fix point #1 above. But now the paravirt code has its own copy of
>the
>function.

You keep compiling with paravirtualization enabled... that code will not do any 
inlining.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
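
hpa's point, restated: with CONFIG_PARAVIRT the flag accessors compile to
an indirect call through a table of function pointers, so no amount of
"extern inline" will inline them at those call sites. Structurally (a bare
sketch of the mechanism; names carry a _sketch suffix because they are not
the kernel's exact definitions):

/* simplified: what arch_local_save_flags() becomes under paravirt */
struct pv_irq_ops_sketch {
    unsigned long (*save_fl)(void);
};

extern struct pv_irq_ops_sketch pv_irq_ops_sketch;

static inline unsigned long arch_local_save_flags_sketch(void)
{
    /* always an indirect call; the callee must therefore be a real
     * out-of-line function with a known, minimal clobber set */
    return pv_irq_ops_sketch.save_fl();
}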


Re: [clang] stack protector and f1f029c7bf

2018-05-25 Thread hpa
On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers  
wrote:
>On Thu, May 24, 2018 at 3:43 PM  wrote:
>> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers
>
>wrote:
>> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin 
>wrote:
>> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r".
>That
>> >is
>> >> unequivocally a compiler bug.
>> >
>> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583
>> >
>> >> >> You are claiming it doesn't buy us anything, but you are only
>> >looking
>> >at
>> >> > the paravirt case which is kind of "special" (in the short bus
>kind
>> >of
>> >way),
>> >> >
>> >> > That's fair.  Is another possible solution to have paravirt
>maybe
>> >not
>> >use
>> >> > native_save_fl() then, but its own
>> >non-static-inline-without-m-constraint
>> >> > implementation?
>> >
>> >> KERNEL AR: change native_save_fl() to an extern inline with an
>> >assembly
>> >> out-of-line implementation, to satisfy the paravirt requirement
>that
>> >no
>> >> GPRs other than %rax are clobbered.
>> >
>> >I'm happy to add that, do you have a recommendation if it should go
>in
>> >an
>> >existing .S file or a new one (and if so where/what shall I call
>it?).
>
>> How about irqflags.c since that is what the .h file is called.
>
>> It should simply be:
>
>> push %rdi
>> popf
>> ret
>
>> pushf
>> pop %rax
>> ret
>
>> ... but with all the regular assembly decorations, of course.
>
>Something like the following?
>
>
>diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c
>new file mode 100644
>index ..59dc21bd3327
>--- /dev/null
>+++ b/arch/x86/kernel/irqflags.c
>@@ -0,0 +1,24 @@
>+#include 
>+
>+extern unsigned long native_save_fl(void);
>+extern void native_restore_fl(unsigned long flags);
>+
>+asm(
>+".pushsection .text;"
>+".global native_save_fl;"
>+".type native_save_fl, @function;"
>+"native_save_fl:"
>+"pushf;"
>+"pop %" _ASM_AX ";"
>+"ret;"
>+".popsection");
>+
>+asm(
>+".pushsection .text;"
>+".global native_restore_fl;"
>+".type native_restore_fl, @function;"
>+"native_restore_fl:"
>+"push %" _ASM_DI ";"
>+"popf;"
>+"ret;"
>+".popsection");
>
>And change the declaration in arch/x86/include/asm/irqflags.h to:
>+extern inline unsigned long native_save_fl(void);
>+extern inline void native_restore_fl(unsigned long flags);
>
>This seems to work, but
>1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is never
>defined (arch_local_save_flags() uses it).  Does that mean
>arch_local_save_flags(), and friends would also have to move to the
>newly
>created .c file as well?
>2. `extern inline` doesn't inline any instances (from what I can tell
>from
>disassembling vmlinux).  I think this is strictly worse. Don't we only
>want
>pv_irq_ops.save_fl to be non-inlined in a way that no stack protector
>can
>be added? If that's the case, should my assembly based implementation
>have
>a different identifier (`native_save_fl_paravirt` or something). That
>would
>also fix point #1 above. But now the paravirt code has its own copy of
>the
>function.

Sorry, I meant irqflags.S.

It still should be available as an inline, however, but now "extern inline".
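
Spelled out, the "extern inline" arrangement might look like the sketch below
(assuming gnu_inline semantics, which is what makes "extern inline" mean
"inline me where you can, but never emit an out-of-line copy"):

/* Header side: the body is available for inlining at call sites... */
extern inline __attribute__((gnu_inline))
unsigned long native_save_fl(void)
{
	unsigned long flags;

	asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
	return flags;
}

/* ...while the single out-of-line definition lives in irqflags.S, so the
 * symbol that paravirt ends up calling clobbers nothing but %rax. */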
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-24 Thread hpa
On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers  
wrote:
>On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin  wrote:
>> COMPILER AR: "=rm" should NEVER generate worse code than "=r". That
>is
>> unequivocally a compiler bug.
>
>Filed: https://bugs.llvm.org/show_bug.cgi?id=37583
>
>> >> You are claiming it doesn't buy us anything, but you are only
>looking
>at
>> > the paravirt case which is kind of "special" (in the short bus kind
>of
>way),
>> >
>> > That's fair.  Is another possible solution to have paravirt maybe
>not
>use
>> > native_save_fl() then, but its own
>non-static-inline-without-m-constraint
>> > implementation?
>
>> KERNEL AR: change native_save_fl() to an extern inline with an
>assembly
>> out-of-line implementation, to satisfy the paravirt requirement that
>no
>> GPRs other than %rax are clobbered.
>
>I'm happy to add that, do you have a recommendation if it should go in
>an
>existing .S file or a new one (and if so where/what shall I call it?).

How about irqflags.c since that is what the .h file is called.

It should simply be:

push %rdi
popf
ret

pushf
pop %rax
ret

... but with all the regular assembly decorations, of course.
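
With those decorations, the out-of-line file might look roughly like this (a
sketch using the kernel's linkage macros; per the correction elsewhere in the
thread, the file would be irqflags.S rather than irqflags.c):

#include <asm/asm.h>
#include <linux/linkage.h>

/* unsigned long native_save_fl(void) */
ENTRY(native_save_fl)
	pushf
	pop %_ASM_AX
	ret
ENDPROC(native_save_fl)

/* void native_restore_fl(unsigned long flags) */
ENTRY(native_restore_fl)
	push %_ASM_DI
	popf
	ret
ENDPROC(native_restore_fl)

(EXPORT_SYMBOL lines would be needed as well if modules call these.)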
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-24 Thread hpa
On May 24, 2018 1:52:16 PM PDT, Nick Desaulniers  
wrote:
>On Thu, May 24, 2018 at 1:26 PM Nick Desaulniers
>
>wrote:
>> On Thu, May 24, 2018 at 11:59 AM  wrote:
>> > Issue 3: Let's face it, reading and writing the flags should be
>builtins,
>> exactly because it has to do stack operations, which really means the
>> compiler should be involved.
>
>> I'm happy to propose that as a feature request to llvm+gcc.
>
>Oh, looks like both clang and gcc have:
>__builtin_ia32_readeflags_u64()
>
>https://godbolt.org/g/SwPjhq
>
>Maybe native_save_fl() and native_restore_fl() should be replaced in
>the
>kernel with
>__builtin_ia32_readeflags_u64() and __builtin_ia32_writeeflags_u64()?

It should, at least when the compiler supports them (we keep support in for old 
compilers back to about gcc 3.4 or 4.1 or so for x86, but they are intentionally 
special cases).

For the PV case, I think we should create it as an explicit assembly function, 
meaning it should be an extern inline in C code. That way we're fixing the 
kernel right.
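
For reference, a minimal user-space sketch of the builtins in question (both
gcc and clang provide them on x86-64; the write-back of the value just read is
a no-op, included only to exercise both directions):

#include <stdio.h>

int main(void)
{
	unsigned long long flags = __builtin_ia32_readeflags_u64();

	__builtin_ia32_writeeflags_u64(flags);	/* round trip, no-op */
	printf("eflags: %#llx\n", flags);
	return 0;
}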

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-24 Thread hpa
On May 24, 2018 12:49:56 PM PDT, Tom Stellard  wrote:
>On 05/24/2018 11:19 AM, h...@zytor.com wrote:
>> On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers
> wrote:
>>> H. Peter,
>>>
>>> It was reported [0] that compiling the Linux kernel with Clang +
>>> CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(),
>due
>>> to
>>> how GCC does not emit a stack guard for static inline functions (see
>>> Alistair's excellent report in [1]) but Clang does.
>>>
>>> When working with the LLVM release maintainers, Tom had suggested
>[2]
>>> changing the inline assembly constraint in native_save_fl() from
>'=rm'
>>> to
>>> '=r', and Alistair had verified the disassembly:
>>>
>>> (good) code generated w/o -fstack-protector-strong:
>>>
>>> native_save_fl:
>>>  pushfq
>>>  popq    -8(%rsp)
>>>  movq    -8(%rsp), %rax
>>>  retq
>>>
>>> (good) code generated w/ =r input constraint:
>>>
>>> native_save_fl:
>>>  pushfq
>>>  popq    %rax
>>>  retq
>>>
>>> (bad) code generated with -fstack-protector-strong:
>>>
>>> native_save_fl:
>>>  subq    $24, %rsp
>>>  movq    %fs:40, %rax
>>>  movq    %rax, 16(%rsp)
>>>  pushfq
>>>  popq    8(%rsp)
>>>  movq    8(%rsp), %rax
>>>  movq    %fs:40, %rcx
>>>  cmpq    16(%rsp), %rcx
>>>  jne     .LBB0_2
>>>  addq    $24, %rsp
>>>  retq
>>> .LBB0_2:
>>>  callq   __stack_chk_fail
>>>
>>> It looks like the suggestion is actually a revert of your commit:
>>> ab94fcf528d127fcb490175512a8910f37e5b346:
>>> x86: allow "=rm" in native_save_fl()
>>>
>>> It seemed like there was a question internally about why worry about
>>> pop
>>> adjusting the stack if the stack could be avoided altogether.
>>>
>>> I think Sedat can retest this, but I was curious as well about the
>>> commit
>>> message in ab94fcf528d: "[ Impact: performance ]", but Alistair's
>>> analysis
>>> of the disassembly seems to indicate there is no performance impact
>(in
>>> fact, looks better as there's one less mov).
>>>
>>> Is there a reason we should not revert ab94fcf528d12, or maybe a
>better
>>> approach?
>>>
>>> [0] https://lkml.org/lkml/2018/5/7/534
>>> [1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15
>>> [2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22
>> 
>> A stack canary on an *inlined* function? That's bound to break things
>elsewhere too sooner or later.
>> 
>> It feels like *once again* clang is asking for the Linux kernel to
>change to paper over technical or compatibility problems in clang/LLVM.
>This is not exactly helping the feeling that we should just rip out any
>and all clang hacks and call it a loss.
>> 
>
>In this case this fix is working-around a bug in the kernel.  The
>problem
>here is that the caller of native_save_fl() assumes that it won't
>clobber any of the general purpose registers even though the function
>is defined as a normal C function which tells the compiler that it's
>OK to clobber the registers.
>
>This is something that really needs to be fixed in the kernel.  Relying
>on heuristics of internal compiler algorithms in order for code to work
>correctly is always going to lead to problems like this.
>
>-Tom

Ok, yet another problem (this time with paravirtualization, no surprise 
there). Good, let's find the actual problems and fix them where they should be 
fixed.
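
The mismatch Tom describes can be sketched like this (simplified from the
kernel's callee-save paravirt plumbing; the names and the exact asm here are
illustrative only):

unsigned long native_save_fl(void);	/* built as ordinary C */

static inline unsigned long pv_save_fl(void)
{
	unsigned long f;

	/* The call site promises the compiler that only %rax is written.
	 * If native_save_fl() grows a stack-protector prologue that
	 * clobbers, say, %rcx, the caller's live registers are silently
	 * corrupted -- the crash that started this thread. */
	asm volatile("call native_save_fl" : "=a" (f) : : "memory");
	return f;
}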
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-24 Thread hpa
On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers  
wrote:
>H. Peter,
>
>It was reported [0] that compiling the Linux kernel with Clang +
>CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(), due
>to
>how GCC does not emit a stack guard for static inline functions (see
>Alistair's excellent report in [1]) but Clang does.
>
>When working with the LLVM release maintainers, Tom had suggested [2]
>changing the inline assembly constraint in native_save_fl() from '=rm'
>to
>'=r', and Alistair had verified the disassembly:
>
>(good) code generated w/o -fstack-protector-strong:
>
>native_save_fl:
>  pushfq
>  popq    -8(%rsp)
>  movq    -8(%rsp), %rax
>  retq
>
>(good) code generated w/ =r input constraint:
>
>native_save_fl:
>  pushfq
>  popq    %rax
>  retq
>
>(bad) code generated with -fstack-protector-strong:
>
>native_save_fl:
>  subq    $24, %rsp
>  movq    %fs:40, %rax
>  movq    %rax, 16(%rsp)
>  pushfq
>  popq    8(%rsp)
>  movq    8(%rsp), %rax
>  movq    %fs:40, %rcx
>  cmpq    16(%rsp), %rcx
>  jne     .LBB0_2
>  addq    $24, %rsp
>  retq
>.LBB0_2:
>  callq   __stack_chk_fail
>
>It looks like the suggestion is actually a revert of your commit:
>ab94fcf528d127fcb490175512a8910f37e5b346:
>x86: allow "=rm" in native_save_fl()
>
>It seemed like there was a question internally about why worry about
>pop
>adjusting the stack if the stack could be avoided altogether.
>
>I think Sedat can retest this, but I was curious as well about the
>commit
>message in ab94fcf528d: "[ Impact: performance ]", but Alistair's
>analysis
>of the disassembly seems to indicate there is no performance impact (in
>fact, looks better as there's one less mov).
>
>Is there a reason we should not revert ab94fcf528d12, or maybe a better
>approach?
>
>[0] https://lkml.org/lkml/2018/5/7/534
>[1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15
>[2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22

Ok, this is the *second* thing about LLVM-originated bug reports that drives me 
batty. When you *do* identify a real problem, you propose a paper-over and/or 
talk about it as an LLVM issue and don't consider the often far bigger picture.

Issue 1: Fundamentally, the compiler is doing The Wrong Thing if it generates 
worse code with a less constrained =rm than with =r. That is a compiler 
optimization bug, period. The whole point with the less constrained option is 
to give the compiler the freedom of action.

You are claiming it doesn't buy us anything, but you are only looking at the 
paravirt case which is kind of "special" (in the short bus kind of way), and 
only because the compiler in question makes an incredibly stupid decision.

Issue 2: What you are flagging seems to be a far more fundamental problem, 
which would affect *any* use of push/pop in inline assembly. If that is true, 
we need to pull in the gcc people too and create an interface to let the 
compiler know that inline assembly needs a certain number of stack slots. We do 
a lot of push/pop in assembly. The other option is to turn stack canary 
explicitly off for all such functions.
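
The per-function opt-out could look like the sketch below; note the attribute
is an assumption here (current clang and gcc 11+ spell it no_stack_protector;
older compilers only offer a per-file -fno-stack-protector):

__attribute__((no_stack_protector))
unsigned long save_fl_nocanary(void)	/* hypothetical name */
{
	unsigned long flags;

	/* No canary prologue, so the push/pop pair in the asm body can
	 * no longer collide with a stack-protector spill slot. */
	asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
	return flags;
}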

Issue 3: Let's face it, reading and writing the flags should be builtins, 
exactly because it has to do stack operations, which really means the compiler 
should be involved.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [clang] stack protector and f1f029c7bf

2018-05-24 Thread hpa
On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers  
wrote:
>H. Peter,
>
>It was reported [0] that compiling the Linux kernel with Clang +
>CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(), due
>to
>how GCC does not emit a stack guard for static inline functions (see
>Alistair's excellent report in [1]) but Clang does.
>
>When working with the LLVM release maintainers, Tom had suggested [2]
>changing the inline assembly constraint in native_save_fl() from '=rm'
>to
>'=r', and Alistair had verified the disassembly:
>
>(good) code generated w/o -fstack-protector-strong:
>
>native_save_fl:
>  pushfq
>  popq    -8(%rsp)
>  movq    -8(%rsp), %rax
>  retq
>
>(good) code generated w/ =r input constraint:
>
>native_save_fl:
>  pushfq
>  popq    %rax
>  retq
>
>(bad) code generated with -fstack-protector-strong:
>
>native_save_fl:
>  subq    $24, %rsp
>  movq    %fs:40, %rax
>  movq    %rax, 16(%rsp)
>  pushfq
>  popq    8(%rsp)
>  movq    8(%rsp), %rax
>  movq    %fs:40, %rcx
>  cmpq    16(%rsp), %rcx
>  jne     .LBB0_2
>  addq    $24, %rsp
>  retq
>.LBB0_2:
>  callq   __stack_chk_fail
>
>It looks like the suggestion is actually a revert of your commit:
>ab94fcf528d127fcb490175512a8910f37e5b346:
>x86: allow "=rm" in native_save_fl()
>
>It seemed like there was a question internally about why worry about
>pop
>adjusting the stack if the stack could be avoided altogether.
>
>I think Sedat can retest this, but I was curious as well about the
>commit
>message in ab94fcf528d: "[ Impact: performance ]", but Alistair's
>analysis
>of the disassembly seems to indicate there is no performance impact (in
>fact, looks better as there's one less mov).
>
>Is there a reason we should not revert ab94fcf528d12, or maybe a better
>approach?
>
>[0] https://lkml.org/lkml/2018/5/7/534
>[1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15
>[2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22

A stack canary on an *inlined* function? That's bound to break things elsewhere 
too sooner or later.

It feels like *once again* clang is asking for the Linux kernel to change to 
paper over technical or compatibility problems in clang/LLVM. This is not 
exactly helping the feeling that we should just rip out any and all clang hacks 
and call it a loss.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

