Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-24 Thread Eric W. Biederman
"Guilherme G. Piccoli"  writes:

> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, and it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may badly affect kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
>
> This patch proposes a major refactor of the panic path based on Petr's
> idea [0] - basically we split the notifier list in three, having a set
> of different call points in the panic path. Below is a list of the
> changes proposed in this patch, culminating in the panic notifier
> levels concept:
>
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up, we concentrated the console flushing functions in a helper.
>
> (b) As mentioned before, we split the panic notifier list in three,
> based on the purpose of the callback. The code contains good
> documentation in the form of comments, but a summary of the three
> lists follows:
>
> - the hypervisor list aims at low-risk procedures to inform hypervisors
> or firmware about the panic event; it also includes LED-related functions;
>
> - the informational list contains callbacks that provide more details,
> like the kernel offset or trace dump (if enabled), and also includes the
> callbacks aimed at reducing log pollution or warnings, like the RCU and
> hung task disable callbacks;
>
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related to the original list - it contains late architecture
> callbacks aimed at stopping the machine, for example.
>
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
>
> (c) But then there is the ordering problem of the notifiers against
> the crash_kexec() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" like "crash_kexec_post_notifiers"
> is not enough, hence we introduce here the concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) to 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
>
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, level 0 is exactly the same as the old
> behavior of notifiers, all running after kdump, and level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
>
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut down all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> the code was much simplified by that.
>
> A functional change is expected from this refactor, since we now
> call some notifiers by default before kdump, but the goal here,
> besides code clean-up, is to have a better panic path - more
> reliable and deterministic, but also very customizable.
>
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

I am late to this discussion.  My apologies.

Unfortunately I am also very grouchy.

Notifiers before kexec on panic are fundamentally broken.  So please
just remove crash_kexec_post_notifiers and be done with it.  Part of the
deep issue is that firmware always has a common and broken
implementation for anything that is not mission critical to
motherboards.

Notifiers in any sense on these paths are just bollocks.  Any kind of
notifier list is fundamentally fragile in the face of memory corruption
and very very difficult to review.

So I am going to refresh my ancient NACK on this.

I can certainly appreciate that there are pieces of the reboot paths
that can be improved.  I don't think making anything more featureful
or flexible is any kind of real improvement.

Eric



>
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>
> Special thanks to Petr and Baoquan for the suggestion and feedback in a
> previous email thread. There are some important design decisions worth
> mentioning and discussing:
>
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but 

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-20 Thread Eric W. Biederman
Baoquan He  writes:

> On 05/19/22 at 12:59pm, Eric W. Biederman wrote:
>> Baoquan He  writes:
>> 
>> > Hi Eric,
>> >
>> > On 05/18/22 at 04:59pm, Eric W. Biederman wrote:
>> >> "Naveen N. Rao"  writes:
>> >> 
>> >> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>> >> > symbols") [1], binutils (v2.36+) started dropping section symbols that
>> >> > it thought were unused.  This isn't an issue in general, but with
>> >> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
>> >> > separate .text.unlikely section and the section symbol ".text.unlikely"
>> >> > is being dropped. Due to this, recordmcount is unable to find a non-weak
>> >> > symbol in .text.unlikely to generate a relocation record against.
>> >> >
>> >> > Address this by dropping the weak attribute from these functions:
>> >> > - arch_kexec_apply_relocations() is not overridden by any architecture
>> >> >   today, so just drop the weak attribute.
>> >> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>> >> >   Retain the function prototype for those and move the weak
>> >> >   implementation into the header as a static inline for other
>> >> >   architectures.
>> >> >
>> >> > [1] 
>> >> > https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
>> >> 
>> >> Any chance you can also get machine_kexec_post_load,
>> >> crash_free_reserved_phys_range, arch_kexec_protect_crashkres,
>> >> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
>> >> arch_kimage_file_post_load_cleanup, arch_kexec_kernel_verify_sig,
>> >> and arch_kexec_locate_mem_hole as well.
>> >> 
>> >> That is everything in kexec that uses a __weak symbol.  If we can't
>> >> count on them working we might as well just get rid of the rest
>> >> preemptively.
>> >
>> > Is there a new rule that __weak is not suggested in kernel any more?
>> > Please help provide a pointer if yes, so that I can learn that.
>> >
>> > In my mind, __weak is very simple and clear as a mechanism to add
>> > ARCH related functionality.
>> 
>> You should be able to trace the conversation back for all of the details
>> but if you can't here is the summary.
>> 
>> There is a tool that some architectures use called recordmcount.  The
>> recordmcount looks for a symbol in a section, and ignores all weak
>> symbols.  In certain cases sections become so simple there are only weak
>> symbols.  At which point recordmcount fails.
>> 
>> Which means in practice __weak symbols are unreliable and don't work
>> to add ARCH related functionality.
>> 
>> Given that __weak symbols fail randomly I would much rather have simpler
>> code that doesn't fail.  It has never been the case that __weak symbols
>> have been very common in the kernel.  I expect they are something like
>> bool that has been gaining traction.  Still, given that __weak symbols
>> don't work, I don't want them.
>
> Thanks for the summary, Eric.
>
> From Naveen's reply, what I got is that llvm's recent change makes the
> section symbol of .text.unlikely get lost,

If I have read the thread correctly this change happened in both
llvm and binutils - so in both toolchains that are used to build the
kernel.

> but the section .text.unlikely
> still exists. The __weak symbol will be put in .text.unlikely partly,
> when arch_kexec_apply_relocations_add() includes the pr_err line, while
> removing the pr_err() line will put the __weak symbol
> arch_kexec_apply_relocations_add() in .text instead.

Yes.  Calling pr_err has some effect.  Either causing an mcount
entry to be omitted, or causing the symbols in the function to be
placed in .text.unlikely.

> Now the status is that not only recordmcount hit this problem; objtool
> met it too and got an appropriate fix - meaning objtool's fix doesn't
> need any adjustment in the kernel, while recordmcount needs the kernel
> to adjust because it lacks continuous support and development. Naveen
> also said that they are converting to objtool; just the old CI cases
> rely on recordmcount. In fact, if someone stands up to get an
> appropriate recordmcount fix too, the problem will be gone as well.

If the descriptions are correct I suspect recordmcount could just
decide to use the weak symbol, and not ignore it.

Unfortunately I looked at the code a

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-19 Thread Eric W. Biederman
Baoquan He  writes:

> Hi Eric,
>
> On 05/18/22 at 04:59pm, Eric W. Biederman wrote:
>> "Naveen N. Rao"  writes:
>> 
>> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>> > symbols") [1], binutils (v2.36+) started dropping section symbols that
>> > it thought were unused.  This isn't an issue in general, but with
>> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
>> > separate .text.unlikely section and the section symbol ".text.unlikely"
>> > is being dropped. Due to this, recordmcount is unable to find a non-weak
>> > symbol in .text.unlikely to generate a relocation record against.
>> >
>> > Address this by dropping the weak attribute from these functions:
>> > - arch_kexec_apply_relocations() is not overridden by any architecture
>> >   today, so just drop the weak attribute.
>> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>> >   Retain the function prototype for those and move the weak
>> >   implementation into the header as a static inline for other
>> >   architectures.
>> >
>> > [1] 
>> > https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
>> 
>> Any chance you can also get machine_kexec_post_load,
>> crash_free_reserved_phys_range, arch_kexec_protect_crashkres,
>> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
>> arch_kimage_file_post_load_cleanup, arch_kexec_kernel_verify_sig,
>> and arch_kexec_locate_mem_hole as well.
>> 
>> That is everything in kexec that uses a __weak symbol.  If we can't
>> count on them working we might as well just get rid of the rest
>> preemptively.
>
> Is there a new rule that __weak is not suggested in kernel any more?
> Please help provide a pointer if yes, so that I can learn that.
>
> In my mind, __weak is very simple and clear as a mechanism to add
> ARCH related functionality.

You should be able to trace the conversation back for all of the details
but if you can't here is the summary.

There is a tool that some architectures use called recordmcount.  The
recordmcount looks for a symbol in a section, and ignores all weak
symbols.  In certain cases sections become so simple there are only weak
symbols.  At which point recordmcount fails.

Which means in practice __weak symbols are unreliable and don't work
to add ARCH related functionality.

Given that __weak symbols fail randomly I would much rather have simpler
code that doesn't fail.  It has never been the case that __weak symbols
have been very common in the kernel.  I expect they are something like
bool that has been gaining traction.  Still, given that __weak symbols
don't work, I don't want them.

Eric


Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-18 Thread Eric W. Biederman
"Naveen N. Rao"  writes:

> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
> symbols") [1], binutils (v2.36+) started dropping section symbols that
> it thought were unused.  This isn't an issue in general, but with
> kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
> separate .text.unlikely section and the section symbol ".text.unlikely"
> is being dropped. Due to this, recordmcount is unable to find a non-weak
> symbol in .text.unlikely to generate a relocation record against.
>
> Address this by dropping the weak attribute from these functions:
> - arch_kexec_apply_relocations() is not overridden by any architecture
>   today, so just drop the weak attribute.
> - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>   Retain the function prototype for those and move the weak
>   implementation into the header as a static inline for other
>   architectures.
>
> [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1

Any chance you can also get machine_kexec_post_load,
crash_free_reserved_phys_range, arch_kexec_protect_crashkres,
arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
arch_kimage_file_post_load_cleanup, arch_kexec_kernel_verify_sig,
and arch_kexec_locate_mem_hole as well.

That is everything in kexec that uses a __weak symbol.  If we can't
count on them working we might as well just get rid of the rest
preemptively.

Could you also address Andrew's concerns by using a Kconfig symbol that
the architectures that implement the symbol can select.

I don't want to ask too much of a volunteer, but if you are willing,
addressing both of those would be a great help.

Eric

> Signed-off-by: Naveen N. Rao 
> ---
>  include/linux/kexec.h | 28 
>  kernel/kexec_file.c   | 19 +--
>  2 files changed, 25 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 58d1b58a971e34..e656f981f43a73 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -193,10 +193,6 @@ void *kexec_purgatory_get_symbol_addr(struct kimage 
> *image, const char *name);
>  int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> unsigned long buf_len);
>  void *arch_kexec_kernel_image_load(struct kimage *image);
> -int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
> -  Elf_Shdr *section,
> -  const Elf_Shdr *relsec,
> -  const Elf_Shdr *symtab);
>  int arch_kexec_apply_relocations(struct purgatory_info *pi,
>Elf_Shdr *section,
>const Elf_Shdr *relsec,
> @@ -229,6 +225,30 @@ extern int crash_exclude_mem_range(struct crash_mem *mem,
>  unsigned long long mend);
>  extern int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
>  void **addr, unsigned long *sz);
> +
> +#if defined(CONFIG_X86_64) || defined(CONFIG_S390)
> +int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
> +  Elf_Shdr *section,
> +  const Elf_Shdr *relsec,
> +  const Elf_Shdr *symtab);
> +#else
> +/*
> + * arch_kexec_apply_relocations_add - apply relocations of type RELA
> + * @pi:  Purgatory to be relocated.
> + * @section: Section relocations applying to.
> + * @relsec:  Section containing RELAs.
> + * @symtab:  Corresponding symtab.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +static inline int
> +arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr 
> *section,
> +  const Elf_Shdr *relsec, const Elf_Shdr *symtab)
> +{
> + pr_err("RELA relocation unsupported.\n");
> + return -ENOEXEC;
> +}
> +#endif /* CONFIG_X86_64 || CONFIG_S390 */
>  #endif /* CONFIG_KEXEC_FILE */
>  
>  #ifdef CONFIG_KEXEC_ELF
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 8347fc158d2b96..6bae253b4d315e 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -108,23 +108,6 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage 
> *image, void *buf,
>  }
>  #endif
>  
> -/*
> - * arch_kexec_apply_relocations_add - apply relocations of type RELA
> - * @pi:  Purgatory to be relocated.
> - * @section: Section relocations applying to.
> - * @relsec:  Section containing RELAs.
> - * @symtab:  Corresponding symtab.
> - *
> - * Return: 0 on success, negative errno on error.
> - */
> -int __weak
> -arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr 
> *section,
> -  const Elf_Shdr *relsec, const Elf_Shdr *symtab)
> -{
> - pr_err("RELA relocation unsupported.\n");
> - return -ENOEXEC;
> -}
> 

Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]

2022-05-18 Thread Eric W. Biederman
Michael Ellerman  writes:

> "Eric W. Biederman"  writes:
>> Looking at this the pr_err is absolutely needed.  If an unsupported case
>> winds up in the purgatory blob and the code can't handle it, things
>> will fail silently and much worse later.
>
> It won't fail later, it will fail the syscall.
>
> sys_kexec_file_load()
>   kimage_file_alloc_init()
> kimage_file_prepare_segments()
>   arch_kexec_kernel_image_load()
> kexec_image_load_default()
>   image->fops->load()
> elf64_load()# powerpc
> bzImage64_load()# x86
>   kexec_load_purgatory()
> kexec_apply_relocations()
>
> Which does:
>
>   if (relsec->sh_type == SHT_RELA)
>   ret = arch_kexec_apply_relocations_add(pi, section,
>  relsec, symtab);
>   else if (relsec->sh_type == SHT_REL)
>   ret = arch_kexec_apply_relocations(pi, section,
>  relsec, symtab);
>   if (ret)
>   return ret;
>
> And that error is bubbled all the way back up. So as long as
> arch_kexec_apply_relocations() returns an error the syscall will fail
> back to userspace and there'll be an error message at that level.
>
> It's true that having nothing printed in dmesg makes it harder to work
> out why the syscall failed. But it's a kernel bug if there are unhandled
> relocations in the kernel-supplied purgatory code, so a user really has
> no way to do anything about the error even if it is printed.

Good point.  I really hadn't noticed the error code in there when I
looked.

I still don't think changing the functionality of the code because of
a tool issue is the right solution.


>> "Naveen N. Rao"  writes:
>>
>>> Baoquan He wrote:
>>>> On 04/25/22 at 11:11pm, Naveen N. Rao wrote:
>>>>> kexec_load_purgatory() can fail for many reasons - there is no need to
>>>>> print an error when encountering unsupported relocations.
>>>>> This solves a build issue on powerpc with binutils v2.36 and newer [1].
>>>>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>>>>> symbols") [2], binutils started dropping section symbols that it thought
>>>> I am not familiar with binutils, while wondering if this exists in other
>>>> ARCHes besides ppc. Arm64 doesn't have the ARCH override either; do we
>>>> have a problem with it?
>>>
>>> I'm not aware of this specific file causing a problem on other 
>>> architectures -
>>> perhaps the config options differ enough. There are however more reports of
>>> similar issues affecting other architectures with the llvm integrated 
>>> assembler:
>>> https://github.com/ClangBuiltLinux/linux/issues/981
>>>
>>>>
>>>>> were unused.  This isn't an issue in general, but with kexec_file.c, gcc
>>>>> is placing kexec_arch_apply_relocations[_add] into a separate
>>>>> .text.unlikely section and the section symbol ".text.unlikely" is being
>>>>> dropped. Due to this, recordmcount is unable to find a non-weak symbol
>>>> But arch_kexec_apply_relocations_add is weak symbol on ppc.
>>>
>>> Yes. Note that it is just the section symbol that gets dropped. The section 
>>> is
>>> still present and will continue to hold the symbols for the functions
>>> themselves.
>>
>> So we have a case where binutils thinks it is doing something useful
>> and our kernel specific tool gets tripped up by it.
>
> It's not just binutils, the LLVM assembler has the same behavior.
>
>> Reading the recordmcount code it looks like it is finding any symbol
>> within a section but ignoring weak symbols.  So I suspect the only
>> remaining symbol in the section is __weak and that confuses
>> recordmcount.
>>
>> Does removing the __weak annotation on those functions fix the build
>> error?  If so we can restructure the kexec code to simply not use __weak
>> symbols.
>>
>> Otherwise the fix needs to be in recordmcount or binutils, and we should
>> loop whoever maintains recordmcount in to see what they can do.
>
> It seems that recordmcount is not really maintained anymore now that x86
> uses objtool?
>
> There've been several threads about fixing recordmcount, but none of
> them seem to have led to a solution.

That is unfortunate.

> These weak symbol vs recordmcount problems have been worked around going
> back as far as 2020:
>
>  

Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]

2022-05-17 Thread Eric W. Biederman



Looking at this the pr_err is absolutely needed.  If an unsupported case
winds up in the purgatory blob and the code can't handle it, things
will fail silently and much worse later.  So the proposed patch is
unfortunately the wrong direction.

"Naveen N. Rao"  writes:

> Baoquan He wrote:
>> On 04/25/22 at 11:11pm, Naveen N. Rao wrote:
>>> kexec_load_purgatory() can fail for many reasons - there is no need to
>>> print an error when encountering unsupported relocations.
>>> This solves a build issue on powerpc with binutils v2.36 and newer [1].
>>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>>> symbols") [2], binutils started dropping section symbols that it thought
>> I am not familiar with binutils, while wondering if this exists in other
>> ARCHes besides ppc. Arm64 doesn't have the ARCH override either; do we
>> have a problem with it?
>
> I'm not aware of this specific file causing a problem on other architectures -
> perhaps the config options differ enough. There are however more reports of
> similar issues affecting other architectures with the llvm integrated 
> assembler:
> https://github.com/ClangBuiltLinux/linux/issues/981
>
>> 
>>> were unused.  This isn't an issue in general, but with kexec_file.c, gcc
>>> is placing kexec_arch_apply_relocations[_add] into a separate
>>> .text.unlikely section and the section symbol ".text.unlikely" is being
>>> dropped. Due to this, recordmcount is unable to find a non-weak symbol
>> But arch_kexec_apply_relocations_add is weak symbol on ppc.
>
> Yes. Note that it is just the section symbol that gets dropped. The section is
> still present and will continue to hold the symbols for the functions
> themselves.

So we have a case where binutils thinks it is doing something useful
and our kernel specific tool gets tripped up by it.

Reading the recordmcount code it looks like it is finding any symbol
within a section but ignoring weak symbols.  So I suspect the only
remaining symbol in the section is __weak and that confuses
recordmcount.

Does removing the __weak annotation on those functions fix the build
error?  If so we can restructure the kexec code to simply not use __weak
symbols.

Otherwise the fix needs to be in recordmcount or binutils, and we should
loop whoever maintains recordmcount in to see what they can do.

Eric


Re: [PATCH v2 18/18] uaccess: drop remaining CONFIG_SET_FS users

2022-02-17 Thread Eric W. Biederman
Arnd Bergmann  writes:

> From: Arnd Bergmann 
>
> There are no remaining callers of set_fs(), so CONFIG_SET_FS
> can be removed globally, along with the thread_info field and
> any references to it.
>
> This turns access_ok() into a cheaper check against TASK_SIZE_MAX.
>
> With CONFIG_SET_FS gone, drop all remaining references to
> set_fs()/get_fs(), mm_segment_t and uaccess_kernel().

For the bits I have looked at recently, and think I understand.

Acked-by: "Eric W. Biederman" 

>
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/exec.c |  6 --
>  kernel/exit.c | 14 -
>  kernel/kthread.c  |  5 --
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..bc68a0c089ac 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1303,12 +1303,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   goto out_unlock;
>  
> - /*
> -  * Ensure that the uaccess routines can actually operate on userspace
> -  * pointers:
> -  */
> - force_uaccess_begin();
> -
>   if (me->flags & PF_KTHREAD)
>   free_kthread_struct(me);
>   me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> diff --git a/kernel/exit.c b/kernel/exit.c
> index b00a25bb4ab9..0884a75bc2f8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -737,20 +737,6 @@ void __noreturn do_exit(long code)
>  
>   WARN_ON(blk_needs_flush_plug(tsk));
>  
> - /*
> -  * If do_dead is called because this processes oopsed, it's possible
> -  * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> -  * continuing. Amongst other possible reasons, this is to prevent
> -  * mm_release()->clear_child_tid() from writing to a user-controlled
> -  * kernel address.
> -  *
> -  * On uptodate architectures force_uaccess_begin is a noop.  On
> -  * architectures that still have set_fs/get_fs in addition to handling
> -  * oopses handles kernel threads that run as set_fs(KERNEL_DS) by
> -  * default.
> -  */
> - force_uaccess_begin();
> -
>   kcov_task_exit(tsk);
>  
>   coredump_task_exit(tsk);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 38c6dd822da8..16c2275d4b50 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -55,7 +55,6 @@ struct kthread {
>   int result;
>   int (*threadfn)(void *);
>   void *data;
> - mm_segment_t oldfs;
>   struct completion parked;
>   struct completion exited;
>  #ifdef CONFIG_BLK_CGROUP
> @@ -1441,8 +1440,6 @@ void kthread_use_mm(struct mm_struct *mm)
>   mmdrop(active_mm);
>   else
>   smp_mb();
> -
> - to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
>  EXPORT_SYMBOL_GPL(kthread_use_mm);
>  
> @@ -1457,8 +1454,6 @@ void kthread_unuse_mm(struct mm_struct *mm)
>   WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
>   WARN_ON_ONCE(!tsk->mm);
>  
> - force_uaccess_end(to_kthread(tsk)->oldfs);
> -
>   task_lock(tsk);
>   /*
>* When a kthread stops operating on an address space, the loop


Re: [PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

2021-10-21 Thread Eric W. Biederman
Geert Uytterhoeven  writes:

> Hi Eric,
>
> Patch 21/20?

In reviewing another part of the patchset Linus asked if force_sigsegv
could go away.  It can't completely but I can get this far.

Given that it is just a cleanup it makes most sense to me as an
additional patch on top of what is already here.


> On Wed, Oct 20, 2021 at 11:52 PM Eric W. Biederman
>  wrote:
>> Now that force_fatal_sig exists it is unnecessary and a bit confusing
>> to use force_sigsegv in cases where the simpler force_fatal_sig is
>> wanted.  So change every instance we can to make the code clearer.
>>
>> Signed-off-by: "Eric W. Biederman" 
>
>>  arch/m68k/kernel/traps.c| 2 +-
>
> Acked-by: Geert Uytterhoeven 

Thank you.

Eric


[PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

2021-10-20 Thread Eric W. Biederman


Now that force_fatal_sig exists it is unnecessary and a bit confusing
to use force_sigsegv in cases where the simpler force_fatal_sig is
wanted.  So change every instance we can to make the code clearer.

Signed-off-by: "Eric W. Biederman" 
---
 arch/arc/kernel/process.c   | 2 +-
 arch/m68k/kernel/traps.c| 2 +-
 arch/powerpc/kernel/signal_32.c | 2 +-
 arch/powerpc/kernel/signal_64.c | 4 ++--
 arch/s390/kernel/traps.c| 2 +-
 arch/um/kernel/trap.c   | 2 +-
 arch/x86/kernel/vm86_32.c   | 2 +-
 fs/exec.c   | 2 +-
 8 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 3793876f42d9..8e90052f6f05 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -294,7 +294,7 @@ int elf_check_arch(const struct elf32_hdr *x)
eflags = x->e_flags;
if ((eflags & EF_ARC_OSABI_MSK) != EF_ARC_OSABI_CURRENT) {
pr_err("ABI mismatch - you need newer toolchain\n");
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
return 0;
}
 
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 5b19fcdcd69e..74045d164ddb 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1150,7 +1150,7 @@ asmlinkage void set_esp0(unsigned long ssp)
  */
 asmlinkage void fpsp040_die(void)
 {
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
 }
 
 #ifdef CONFIG_M68KFPU_EMU
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 666f3da41232..933ab95805a6 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1063,7 +1063,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * We kill the task with a SIGSEGV in this situation.
 */
if (do_setcontext(new_ctx, regs, 0)) {
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
return -EFAULT;
}
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d8de622c9e4a..8ead9b3f47c6 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -704,7 +704,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 */
 
if (__get_user_sigset(&set, &new_ctx->uc_sigmask)) {
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
return -EFAULT;
}
set_current_blocked(&set);
@@ -713,7 +713,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
return -EFAULT;
if (__unsafe_restore_sigcontext(current, NULL, 0, 
&new_ctx->uc_mcontext)) {
user_read_access_end();
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
return -EFAULT;
}
user_read_access_end();
diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 51729ea2cf8e..01a7c68dcfb6 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -84,7 +84,7 @@ static void default_trap_handler(struct pt_regs *regs)
 {
if (user_mode(regs)) {
report_user_fault(regs, SIGSEGV, 0);
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
} else
die(regs, "Unknown program exception");
 }
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 3198c4767387..c32efb09db21 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -158,7 +158,7 @@ static void bad_segv(struct faultinfo fi, unsigned long ip)
 
 void fatal_sigsegv(void)
 {
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
do_signal(&current->thread.regs);
/*
 * This is to tell gcc that we're not returning - do_signal
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 040fd01be8b3..7ff0f622abd4 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -159,7 +159,7 @@ void save_v86_state(struct kernel_vm86_regs *regs, int 
retval)
user_access_end();
 Efault:
pr_alert("could not access userspace vm86 info\n");
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
 }
 
 static int do_vm86_irq_handling(int subfunction, int irqnumber);
diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..ac7b51b51f38 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1852,7 +1852,7 @@ static int bprm_execve(struct linux_binprm *bprm,
 * SIGSEGV.
 */
if (bprm->point_of_no_return && !fatal_signal_pending(current))
-   force_sigsegv(SIGSEGV);
+   force_fatal_sig(SIGSEGV);
 
 out_unmark:
current->fs->in_exec = 0;
-- 
2.20.1



[PATCH 00/20] exit cleanups

2021-10-20 Thread Eric W. Biederman


While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.

This set of changes introduces force_fatal_sig a way of sending
a signal and not allowing it to be caught, and corrects the
misuse of the existing abstractions that I found.

A lot of the misuse of the existing abstractions is silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).

It is my plan, after sending all of these changes out for review, to
place them in a topic branch for sending to Linus.  This is especially
important for the changes that depend upon the new helper force_fatal_sig.

Eric W. Biederman (20):
  exit/doublefault: Remove apparently bogus comment about 
rewind_stack_do_exit
  exit: Remove calls of do_exit after noreturn versions of die
  reboot: Remove the unreachable panic after do_exit in reboot(2)
  signal/sparc32: Remove unreachable do_exit in do_sparc_fault
  signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
  signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
  signal/powerpc: On swapcontext failure force SIGSEGV
  signal/sparc: In setup_tsb_params convert open coded BUG into BUG
  signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
  signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
  signal/s390: Use force_sigsegv in default_trap_handler
  exit/kthread: Have kernel threads return instead of calling do_exit
  signal: Implement force_fatal_sig
  exit/syscall_user_dispatch: Send ordinary signals on failure
  signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer 
fails
  signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
  signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
  exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
  exit/rtl8712: Replace the macro thread_exit with a simple return 0
  exit/r8188eu: Replace the macro thread_exit with a simple return 0

 arch/mips/kernel/r2300_fpu.S   |  4 ++--
 arch/mips/kernel/syscall.c |  9 
 arch/nds32/kernel/traps.c  |  2 +-
 arch/nds32/mm/fault.c  |  6 +
 arch/openrisc/kernel/traps.c   |  2 +-
 arch/openrisc/mm/fault.c   |  4 +---
 arch/powerpc/kernel/signal_32.c|  6 +++--
 arch/powerpc/kernel/signal_64.c|  9 +---
 arch/s390/include/asm/kdebug.h |  2 +-
 arch/s390/kernel/dumpstack.c   |  2 +-
 arch/s390/kernel/traps.c   |  2 +-
 arch/s390/mm/fault.c   |  2 --
 arch/sh/kernel/cpu/fpu.c   | 10 +
 arch/sh/kernel/traps.c |  2 +-
 arch/sh/mm/fault.c |  2 --
 arch/sparc/kernel/signal_32.c  |  4 ++--
 arch/sparc/kernel/windows.c|  6 +++--
 arch/sparc/mm/fault_32.c   |  1 -
 arch/sparc/mm/tsb.c|  2 +-
 arch/x86/entry/vsyscall/vsyscall_64.c  |  3 ++-
 arch/x86/kernel/doublefault_32.c   |  3 ---
 arch/x86/kernel/signal.c   |  6 -
 arch/x86/kernel/vm86_32.c  |  8 +++
 arch/xtensa/kernel/traps.c |  2 +-
 arch/xtensa/mm/fault.c |  3 +--
 drivers/firmware/stratix10-svc.c   |  4 ++--
 drivers/soc/ti/wkup_m3_ipc.c   |  2 +-
 drivers/staging/r8188eu/core/rtw_cmd.c |  2 +-
 drivers/staging/r8188eu/core/rtw_mp.c  |  2 +-
 drivers/staging/r8188eu/include/osdep_service.h|  2 --
 drivers/staging/rtl8712/osdep_service.h|  1 -
 drivers/staging/rtl8712/rtl8712_cmd.c  |  2 +-
 drivers/staging/rtl8723bs/core/rtw_cmd.c   |  2 +-
 drivers/staging/rtl8723bs/core/rtw_xmit.c  |  2 +-
 drivers/staging/rtl8723bs/hal/rtl8723bs_xmit.c |  2 +-
 .../rtl8723bs/include/osdep_service_linux.h|  2 --
 fs/ocfs2/journal.c |  5 +
 include/linux/sched/signal.h   |  1 +
 kernel/entry/syscall_user_dispatch.c   | 12 ++
 kernel/kthread.c   |  2 +-
 kernel/reboot.c|  1 -
 kernel/signal.c| 26 ++
 net/batman-adv/tp_meter.c  |  2 +-
 43 files changed, 83 insertions(+), 91 deletions(-)

Eric


[PATCH 07/20] signal/powerpc: On swapcontext failure force SIGSEGV

2021-10-20 Thread Eric W. Biederman
If the register state may be partial and corrupted, instead of calling
do_exit, call force_sigsegv(SIGSEGV), which properly kills the
process with SIGSEGV and does not let any more userspace code execute,
instead of just killing one thread of the process and potentially
confusing everything.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system 
call.")
Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, 
from Ben Herrenschmidt")
Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/kernel/signal_32.c | 6 --
 arch/powerpc/kernel/signal_64.c | 9 ++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..666f3da41232 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1062,8 +1062,10 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * or if another thread unmaps the region containing the context.
 * We kill the task with a SIGSEGV in this situation.
 */
-   if (do_setcontext(new_ctx, regs, 0))
-   do_exit(SIGSEGV);
+   if (do_setcontext(new_ctx, regs, 0)) {
+   force_sigsegv(SIGSEGV);
+   return -EFAULT;
+   }
 
set_thread_flag(TIF_RESTOREALL);
return 0;
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..d8de622c9e4a 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -703,15 +703,18 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * We kill the task with a SIGSEGV in this situation.
 */
 
-   if (__get_user_sigset(&set, &new_ctx->uc_sigmask))
-   do_exit(SIGSEGV);
+   if (__get_user_sigset(&set, &new_ctx->uc_sigmask)) {
+   force_sigsegv(SIGSEGV);
+   return -EFAULT;
+   }
	set_current_blocked(&set);
 
if (!user_read_access_begin(new_ctx, ctx_size))
return -EFAULT;
if (__unsafe_restore_sigcontext(current, NULL, 0, 
&new_ctx->uc_mcontext)) {
user_read_access_end();
-   do_exit(SIGSEGV);
+   force_sigsegv(SIGSEGV);
+   return -EFAULT;
}
user_read_access_end();
 
-- 
2.20.1



Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()

2021-09-13 Thread Eric W. Biederman
Christophe Leroy  writes:

> Le 13/09/2021 à 18:21, Eric W. Biederman a écrit :
>> ebied...@xmission.com (Eric W. Biederman) writes:
>>
>>> Christophe Leroy  writes:
>>>
>>>> Use unsafe_copy_siginfo_to_user() in order to do the copy
>>>> within the user access block.
>>>>
>>>>> On an mpc 8321 (book3s/32) the improvement is about 5% on a process
>>>> sending a signal to itself.
>>
>> If you can't make function calls from an unsafe macro there is another
>> way to handle this that doesn't require everything to be inline.
>>
>>  From a safety perspective it is probably even a better approach.
>
> Yes but that's exactly what I wanted to avoid for the native ppc32 case: this
> double hop means useless pressure on the cache. The siginfo_t structure is 128
> bytes large, that means 8 lines of cache on powerpc 8xx.
>
> But maybe it is acceptable to do that only for the compat case. Let me think
> about it, it might be quite easy.

The places get_signal is called tend to be well known.  So I think we
are safe from a capacity standpoint.

I am not certain it makes a difference in capacity as there is a high
probability that the stack was deeper recently than it is now which
suggests the cache blocks might already be in the cache.

My sense is that it is worth benchmarking before optimizing out the
extra copy like that.

On the extreme side there is simply building the entire sigframe on the
stack and then copying it with a single copy_to_user.  As the stack cache
lines are likely to be hot, and copy_to_user is quite well optimized,
there is a real possibility that it is faster to build everything
on the kernel stack, and then copy it to the user space stack.

It is also possible that I am wrong and we may want to figure out how
far up we can push the conversion to the 32bit siginfo format.

If we could move the work into collect_signal we could guarantee there
would be no extra work.  That would require adjusting the sigframe
generation code on all of the architectures.

There is a lot we can do but we need benchmarking to tell if it is
worth it.

Eric







Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()

2021-09-13 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Christophe Leroy  writes:
>
>> Use unsafe_copy_siginfo_to_user() in order to do the copy
>> within the user access block.
>>
>> On an mpc 8321 (book3s/32) the improvement is about 5% on a process
>> sending a signal to itself.

If you can't make function calls from an unsafe macro there is another
way to handle this that doesn't require everything to be inline.

From a safety perspective it is probably even a better approach.

Eric

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..1b30bb78b863 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -205,6 +205,12 @@ struct sigframe {
int abigap[56];
 };
 
+#ifdef CONFIG_PPC64
+# define user_siginfo_t		compat_siginfo_t
+#else
+# define user_siginfo_t		siginfo_t
+#endif
+
 /*
  *  When we have rt signals to deliver, we set up on the
  *  user stack, going down from the original stack pointer:
@@ -217,11 +223,7 @@ struct sigframe {
  *
  */
 struct rt_sigframe {
-#ifdef CONFIG_PPC64
-   compat_siginfo_t info;
-#else
-   struct siginfo info;
-#endif
+   user_siginfo_t info;
struct ucontext uc;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
struct ucontext uc_transact;
@@ -712,7 +714,7 @@ static long restore_tm_user_regs(struct pt_regs *regs, 
struct mcontext __user *s
 
 #ifdef CONFIG_PPC64
 
-#define copy_siginfo_to_user   copy_siginfo_to_user32
+#define copy_siginfo_to_external   copy_siginfo_to_external32
 
 #endif /* CONFIG_PPC64 */
 
@@ -731,6 +733,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
struct pt_regs *regs = tsk->thread.regs;
/* Save the thread's msr before get_tm_stackpointer() changes it */
unsigned long msr = regs->msr;
+   user_siginfo_t uinfo;
 
/* Set up Signal Frame */
frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
@@ -743,6 +746,8 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
else
prepare_save_user_regs(1);
 
+   copy_siginfo_to_external(&uinfo, &ksig->info);
+
if (!user_access_begin(frame, sizeof(*frame)))
goto badframe;
 
@@ -778,12 +783,10 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
*oldset,
asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
}
	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
+   unsafe_copy_to_user(&frame->info, &uinfo, sizeof(uinfo), failed);
 
user_access_end();
 
-   if (copy_siginfo_to_user(&frame->info, &ksig->info))
-   goto badframe;
-
regs->link = tramp;
 
 #ifdef CONFIG_PPC_FPU_REGS

Eric


Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()

2021-09-13 Thread Eric W. Biederman
Christophe Leroy  writes:

> Use unsafe_copy_siginfo_to_user() in order to do the copy
> within the user access block.
>
> On an mpc 8321 (book3s/32) the improvement is about 5% on a process
> sending a signal to itself.
>
> Signed-off-by: Christophe Leroy 
> ---
> v3: Don't leave compat aside, use the new unsafe_copy_siginfo_to_user32()
> ---
>  arch/powerpc/kernel/signal_32.c | 8 +++-
>  arch/powerpc/kernel/signal_64.c | 5 +
>  2 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index ff101e2b3bab..3a2db8af2d65 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -710,9 +710,9 @@ static long restore_tm_user_regs(struct pt_regs *regs, 
> struct mcontext __user *s
>  }
>  #endif
>  
> -#ifdef CONFIG_PPC64
> +#ifndef CONFIG_PPC64
>  
> -#define copy_siginfo_to_user copy_siginfo_to_user32
> +#define unsafe_copy_siginfo_to_user32	unsafe_copy_siginfo_to_user
>  
>  #endif /* CONFIG_PPC64 */

Any particular reason to reverse the sense of this #ifdef?

Otherwise this change looks much cleaner.

Eric


>  
> @@ -779,15 +779,13 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
> *oldset,
>   asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
>   }
>   unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
> + unsafe_copy_siginfo_to_user32(&frame->info, &ksig->info, failed);
>  
>   /* create a stack frame for the caller of the handler */
>   unsafe_put_user(regs->gpr[1], newsp, failed);
>  
>   user_access_end();
>  
> - if (copy_siginfo_to_user(&frame->info, &ksig->info))
> - goto badframe;
> -
>   regs->link = tramp;
>  
>  #ifdef CONFIG_PPC_FPU_REGS
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index d80ff83cacb9..56c0c74aa28c 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
> *set,
>   }
>  
>   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
> badframe_block);
> + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block);
>   /* Allocate a dummy caller frame for the signal handler. */
>   unsafe_put_user(regs->gpr[1], newsp, badframe_block);
>  
>   user_write_access_end();
>  
> - /* Save the siginfo outside of the unsafe block. */
> - if (copy_siginfo_to_user(&frame->info, &ksig->info))
> - goto badframe;
> -
>   /* Make sure signal handler doesn't get spurious FP exceptions */
>   tsk->thread.fp_state.fpscr = 0;


Re: [PATCH RESEND v3 4/6] signal: Add unsafe_copy_siginfo_to_user32()

2021-09-13 Thread Eric W. Biederman
Christophe Leroy  writes:

> In the same spirit as commit fb05121fd6a2 ("signal: Add
> unsafe_get_compat_sigset()"), implement an 'unsafe' version of
> copy_siginfo_to_user32() in order to use it within user access blocks.
>
> To do so, we need inline version of copy_siginfo_to_external32() as we
> don't want any function call inside user access blocks.

I don't understand.  What is wrong with?

#define unsafe_copy_siginfo_to_user32(to, from, label)  do {\
struct compat_siginfo __user *__ucs_to = to;\
const struct kernel_siginfo *__ucs_from = from; \
struct compat_siginfo __ucs_new;\
\
copy_siginfo_to_external32(&__ucs_new, __ucs_from); \
unsafe_copy_to_user(__ucs_to, &__ucs_new,   \
sizeof(struct compat_siginfo), label);  \
} while (0)

Your replacement of "memset(to, 0, sizeof(*to))" with
"struct compat_siginfo __ucs_new = {0}" is actively unsafe, as the
compiler is free not to initialize any holes in the structure to 0 in
the latter case.

Is there something about the unsafe macros that I am not aware of that
makes it improper to actually call C functions?  Is that a requirement
for the instructions that change the user space access behavior?

From the looks of this change all that you are doing is making it so
that all of copy_siginfo_to_external32 is being inlined.  If that is not
a hard requirement of the instructions it seems like the wrong thing to
do here.  copy_siginfo_to_external32 has no failure cases, so it does
not need to be inlined in order to jump to the label.

Eric

>
> Signed-off-by: Christophe Leroy 
> ---
>  include/linux/compat.h |  83 +-
>  include/linux/signal.h |  58 +
>  kernel/signal.c| 114 +
>  3 files changed, 141 insertions(+), 114 deletions(-)
>
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index 8e0598c7d1d1..68823f4b86ee 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -412,6 +412,19 @@ int __copy_siginfo_to_user32(struct compat_siginfo 
> __user *to,
>  #ifndef copy_siginfo_to_user32
>  #define copy_siginfo_to_user32 __copy_siginfo_to_user32
>  #endif
> +
> +#ifdef CONFIG_COMPAT
> +#define unsafe_copy_siginfo_to_user32(to, from, label)   do {
> \
> + struct compat_siginfo __user *__ucs_to = to;\
> + const struct kernel_siginfo *__ucs_from = from; \
> + struct compat_siginfo __ucs_new = {0};  \
> + \
> + __copy_siginfo_to_external32(&__ucs_new, __ucs_from);   \
> + unsafe_copy_to_user(__ucs_to, &__ucs_new,   \
> + sizeof(struct compat_siginfo), label);  \
> +} while (0)
> +#endif
> +
>  int get_compat_sigevent(struct sigevent *event,
>   const struct compat_sigevent __user *u_event);
>  
> @@ -992,15 +1005,81 @@ static inline bool in_compat_syscall(void) { return 
> false; }
>   * appropriately converted them already.
>   */
>  #ifndef compat_ptr
> -static inline void __user *compat_ptr(compat_uptr_t uptr)
> +static __always_inline void __user *compat_ptr(compat_uptr_t uptr)
>  {
>   return (void __user *)(unsigned long)uptr;
>  }
>  #endif
>  
> -static inline compat_uptr_t ptr_to_compat(void __user *uptr)
> +static __always_inline compat_uptr_t ptr_to_compat(void __user *uptr)
>  {
>   return (u32)(unsigned long)uptr;
>  }
>  
> +static __always_inline void
> +__copy_siginfo_to_external32(struct compat_siginfo *to,
> +  const struct kernel_siginfo *from)
> +{
> + to->si_signo = from->si_signo;
> + to->si_errno = from->si_errno;
> + to->si_code  = from->si_code;
> + switch(__siginfo_layout(from->si_signo, from->si_code)) {
> + case SIL_KILL:
> + to->si_pid = from->si_pid;
> + to->si_uid = from->si_uid;
> + break;
> + case SIL_TIMER:
> + to->si_tid = from->si_tid;
> + to->si_overrun = from->si_overrun;
> + to->si_int = from->si_int;
> + break;
> + case SIL_POLL:
> + to->si_band = from->si_band;
> + to->si_fd   = from->si_fd;
> + break;
> + case SIL_FAULT:
> + to->si_addr = ptr_to_compat(from->si_addr);
> + break;
> + case SIL_FAULT_TRAPNO:
> + to->si_addr = ptr_to_compat(from->si_addr);
> + to->si_trapno = from->si_trapno;
> + break;
> + case SIL_FAULT_MCEERR:
> + to->si_addr = ptr_to_compat(from->si_addr);
> + to->si_addr_lsb = from->si_addr_lsb;
> + break;
> + case 

Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()

2021-09-11 Thread Eric W. Biederman
Christophe Leroy  writes:

> On 9/8/21 6:17 PM, Eric W. Biederman wrote:
>> Christophe Leroy  writes:
>>
>>> Le 02/09/2021 à 20:43, Eric W. Biederman a écrit :
>>>> Christophe Leroy  writes:
>>>>
>>>>> In the same spirit as commit fb05121fd6a2 ("signal: Add
>>>>> unsafe_get_compat_sigset()"), implement an 'unsafe' version of
>>>>> copy_siginfo_to_user() in order to use it within user access blocks.
>>>>>
>>>>> For that, also add an 'unsafe' version of clear_user().
>>>>
>>>> Looking at your use cases you need the 32bit compat version of this
>>>> as well.
>>>>
>>>> The 32bit compat version is too complicated to become a macro, so I
>>>> don't think you can make this work correctly for the 32bit compat case.
>>>
>>> When looking into patch 5/5 that you nacked, I think you missed the fact 
>>> that we
>>> keep using copy_siginfo_to_user32() as it for the 32 bit compat case.
>>
>> I did.  My mistake.
>>
>> However that mistake was so easy I think it mirrors the comments others
>> have made that this looks like a maintenance hazard.
>>
>> Is improving the performance of 32bit kernels interesting?
>
> Yes it is, and that's what this series do.
>
>> Is improving the performance of 32bit compat support interesting?
>
> For me this is a corner case, so I left it aside for now.
>
>>
>> If performance in either of those cases is interesting, it looks like
>> we already have copy_siginfo_to_external32, the factor you would need
>> to build unsafe_copy_siginfo_to_user32.
>
> I'm not sure I understand your saying here. What do you expect me to
> do with copy_siginfo_to_external32() ?

Implement unsafe_copy_siginfo_to_user32.

> copy_siginfo_to_user32() is for compat only.
>
> Native 32 bits powerpc use copy_siginfo_to_user()

What you implemented doubles the number of test cases necessary to
compile test the 32bit ppc signal code, and makes the code noticeably
harder to follow.

Having an unsafe_copy_siginfo_to_user32 at least would allow the
number of test cases to remain the same as the current code.

>> So I am not going to say impossible but please make something
>> maintainable.  I unified all of the compat 32bit siginfo logic because
>> it simply did not get enough love and attention when it was implemented
>> per architecture.
>
> Yes, and ? I didn't do any modification to the compat case, so what
> you did remains.

You undid the unification between the 32bit code and the 32bit compat
code.

>> In general I think that concern applies to this case as well.  We really
>> need an implementation that shares as much burden as possible with other
>> architectures.
>
> I think yes, that's the reason why I made a generic
> unsafe_copy_siginfo_to_user() and didn't make a powerpc dedicated
> change.
>
> Once this is merged any other architecture can use
> unsafe_copy_siginfo_to_user().
>
> Did I miss something ?

Not dealing with the compat case, and making the signal stack frame
code noticeably more complicated.

If this optimization profitably applies to other architectures we need
to figure out how to implement unsafe_copy_siginfo_to_user32 or risk
making them all much worse to maintain.

Eric


Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()

2021-09-08 Thread Eric W. Biederman
Christophe Leroy  writes:

> Le 02/09/2021 à 20:43, Eric W. Biederman a écrit :
>> Christophe Leroy  writes:
>>
>>> In the same spirit as commit fb05121fd6a2 ("signal: Add
>>> unsafe_get_compat_sigset()"), implement an 'unsafe' version of
>>> copy_siginfo_to_user() in order to use it within user access blocks.
>>>
>>> For that, also add an 'unsafe' version of clear_user().
>>
>> Looking at your use cases you need the 32bit compat version of this
>> as well.
>>
>> The 32bit compat version is too complicated to become a macro, so I
>> don't think you can make this work correctly for the 32bit compat case.
>
> When looking into patch 5/5 that you nacked, I think you missed the fact that 
> we
> keep using copy_siginfo_to_user32() as it for the 32 bit compat case.

I did.  My mistake.

However that mistake was so easy I think it mirrors the comments others
have made that this looks like a maintenance hazard.

Is improving the performance of 32bit kernels interesting?
Is improving the performance of 32bit compat support interesting?

If performance in either of those cases is interesting, it looks like
we already have copy_siginfo_to_external32, the factor you would need
to build unsafe_copy_siginfo_to_user32.

So I am not going to say impossible but please make something
maintainable.  I unified all of the compat 32bit siginfo logic because
it simply did not get enough love and attention when it was implemented
per architecture.

In general I think that concern applies to this case as well.  We really
need an implementation that shares as much burden as possible with other
architectures.

Eric


>> Probably-Not-by: "Eric W. Biederman" 
>>
>> Eric
>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   include/linux/signal.h  | 15 +++
>>>   include/linux/uaccess.h |  1 +
>>>   kernel/signal.c |  5 -
>>>   3 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/signal.h b/include/linux/signal.h
>>> index 3454c7ff0778..659bd43daf10 100644
>>> --- a/include/linux/signal.h
>>> +++ b/include/linux/signal.h
>>> @@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t 
>>> *to,
>>>   int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t 
>>> *from);
>>>   int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user 
>>> *from);
>>>   +static __always_inline char __user *si_expansion(const siginfo_t __user
>>> *info)
>>> +{
>>> +   return ((char __user *)info) + sizeof(struct kernel_siginfo);
>>> +}
>>> +
>>> +#define unsafe_copy_siginfo_to_user(to, from, label) do {  \
>>> +   siginfo_t __user *__ucs_to = to;\
>>> +   const kernel_siginfo_t *__ucs_from = from;  \
>>> +   char __user *__ucs_expansion = si_expansion(__ucs_to);  \
>>> +   \
>>> +   unsafe_copy_to_user(__ucs_to, __ucs_from,   \
>>> +   sizeof(struct kernel_siginfo), label);  \
>>> +   unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);   \
>>> +} while (0)
>>> +
>>>   enum siginfo_layout {
>>> SIL_KILL,
>>> SIL_TIMER,
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index c05e903cef02..37073caac474 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user 
>>> *unsafe_addr, long count);
>>>   #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
>>>   #define unsafe_copy_to_user(d,s,l,e) 
>>> unsafe_op_wrap(__copy_to_user(d,s,l),e)
>>>   #define unsafe_copy_from_user(d,s,l,e) 
>>> unsafe_op_wrap(__copy_from_user(d,s,l),e)
>>> +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
>>>   static inline unsigned long user_access_save(void) { return 0UL; }
>>>   static inline void user_access_restore(unsigned long flags) { }
>>>   #endif
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index a3229add4455..83b5971e4304 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -3261,11 +3261,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int 
>>> si_code)
>>> return layout;
>>>   }
>>>   -static inline char __user *si_expansion(const siginfo_t __user *info)
>>> -{
>>> -   return ((char __user *)info) + sizeof(struct kernel_siginfo);
>>> -}
>>> -
>>>   int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t 
>>> *from)
>>>   {
>>> char __user *expansion = si_expansion(to);


Re: [PATCH v2 5/5] powerpc/signal: Use unsafe_copy_siginfo_to_user()

2021-09-02 Thread Eric W. Biederman
Christophe Leroy  writes:

> Use unsafe_copy_siginfo_to_user() in order to do the copy
> within the user access block.
>
> On an mpc 8321 (book3s/32) the improvement is about 5% on a process
> sending a signal to itself.

Nacked-by: "Eric W. Biederman" 

copy_siginfo_to_user is not the same as copy_siginfo_to_user32.

As in this patch breaks 32bit userspace on powerpc.


> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/signal_32.c | 13 ++---
>  arch/powerpc/kernel/signal_64.c |  5 +
>  2 files changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index ff101e2b3bab..f9e16d108bc8 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -710,12 +710,6 @@ static long restore_tm_user_regs(struct pt_regs *regs, 
> struct mcontext __user *s
>  }
>  #endif
>  
> -#ifdef CONFIG_PPC64
> -
> -#define copy_siginfo_to_user copy_siginfo_to_user32
> -
> -#endif /* CONFIG_PPC64 */
> -
>  /*
>   * Set up a signal frame for a "real-time" signal handler
>   * (one which gets siginfo).
> @@ -779,14 +773,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t 
> *oldset,
>   asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
>   }
>   unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
> +#ifndef CONFIG_COMPAT
> + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed);
> +#endif
>  
>   /* create a stack frame for the caller of the handler */
>   unsafe_put_user(regs->gpr[1], newsp, failed);
>  
>   user_access_end();
>  
> - if (copy_siginfo_to_user(&frame->info, &ksig->info))
> +#ifdef CONFIG_COMPAT
> + if (copy_siginfo_to_user32(&frame->info, &ksig->info))
>   goto badframe;
> +#endif
>  
>   regs->link = tramp;
>  
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index 2cca6c8febe1..82b73fbd937d 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
> *set,
>   }
>  
>   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
> badframe_block);
> + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block);
>   /* Allocate a dummy caller frame for the signal handler. */
>   unsafe_put_user(regs->gpr[1], newsp, badframe_block);
>  
>   user_write_access_end();
>  
> - /* Save the siginfo outside of the unsafe block. */
> - if (copy_siginfo_to_user(&frame->info, &ksig->info))
> - goto badframe;
> -
>   /* Make sure signal handler doesn't get spurious FP exceptions */
>   tsk->thread.fp_state.fpscr = 0;


Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()

2021-09-02 Thread Eric W. Biederman
Christophe Leroy  writes:

> In the same spirit as commit fb05121fd6a2 ("signal: Add
> unsafe_get_compat_sigset()"), implement an 'unsafe' version of
> copy_siginfo_to_user() in order to use it within user access blocks.
>
> For that, also add an 'unsafe' version of clear_user().

Looking at your use cases you need the 32bit compat version of this
as well.

The 32bit compat version is too complicated to become a macro, so I
don't think you can make this work correctly for the 32bit compat case.

Probably-Not-by: "Eric W. Biederman" 

Eric

> Signed-off-by: Christophe Leroy 
> ---
>  include/linux/signal.h  | 15 +++
>  include/linux/uaccess.h |  1 +
>  kernel/signal.c |  5 -
>  3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 3454c7ff0778..659bd43daf10 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t *to,
>  int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
>  int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user 
> *from);
>  
> +static __always_inline char __user *si_expansion(const siginfo_t __user 
> *info)
> +{
> + return ((char __user *)info) + sizeof(struct kernel_siginfo);
> +}
> +
> +#define unsafe_copy_siginfo_to_user(to, from, label) do {\
> + siginfo_t __user *__ucs_to = to;\
> + const kernel_siginfo_t *__ucs_from = from;  \
> + char __user *__ucs_expansion = si_expansion(__ucs_to);  \
> + \
> + unsafe_copy_to_user(__ucs_to, __ucs_from,   \
> + sizeof(struct kernel_siginfo), label);  \
> + unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);   \
> +} while (0)
> +
>  enum siginfo_layout {
>   SIL_KILL,
>   SIL_TIMER,
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index c05e903cef02..37073caac474 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, 
> long count);
>  #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
>  #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
>  #define unsafe_copy_from_user(d,s,l,e) 
> unsafe_op_wrap(__copy_from_user(d,s,l),e)
> +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
>  static inline unsigned long user_access_save(void) { return 0UL; }
>  static inline void user_access_restore(unsigned long flags) { }
>  #endif
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229add4455..83b5971e4304 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3261,11 +3261,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int 
> si_code)
>   return layout;
>  }
>  
> -static inline char __user *si_expansion(const siginfo_t __user *info)
> -{
> - return ((char __user *)info) + sizeof(struct kernel_siginfo);
> -}
> -
>  int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from)
>  {
>   char __user *expansion = si_expansion(to);


Re: [PATCH v5 1/6] kexec: move locking into do_kexec_load

2021-07-28 Thread Eric W. Biederman
Arnd Bergmann  writes:

> From: Arnd Bergmann 
>
> The locking is the same between the native and compat version of
> sys_kexec_load(), so it can be done in the common implementation
> to reduce duplication.

Acked-by: "Eric W. Biederman" 

>
> Co-developed-by: Eric Biederman 
> Co-developed-by: Christoph Hellwig 
> Signed-off-by: Arnd Bergmann 
> ---
>  kernel/kexec.c | 44 
>  1 file changed, 16 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c82c6c06f051..9c7aef8f4bb6 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -110,6 +110,17 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>   unsigned long i;
>   int ret;
>  
> + /*
> +  * Because we write directly to the reserved memory region when loading
> +  * crash kernels we need a mutex here to prevent multiple crash kernels
> +  * from attempting to load simultaneously, and to prevent a crash kernel
> +  * from loading over the top of a in use crash kernel.
> +  *
> +  * KISS: always take the mutex.
> +  */
> + if (!mutex_trylock(&kexec_mutex))
> + return -EBUSY;
> +
>   if (flags & KEXEC_ON_CRASH) {
>   dest_image = &kexec_crash_image;
>   if (kexec_crash_image)
> @@ -121,7 +132,8 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>   if (nr_segments == 0) {
>   /* Uninstall image */
>   kimage_free(xchg(dest_image, NULL));
> - return 0;
> + ret = 0;
> + goto out_unlock;
>   }
>   if (flags & KEXEC_ON_CRASH) {
>   /*
> @@ -134,7 +146,7 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>  
>   ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
>   if (ret)
> - return ret;
> + goto out_unlock;
>  
>   if (flags & KEXEC_PRESERVE_CONTEXT)
>   image->preserve_context = 1;
> @@ -171,6 +183,8 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>   arch_kexec_protect_crashkres();
>  
>   kimage_free(image);
> +out_unlock:
> + mutex_unlock(&kexec_mutex);
>   return ret;
>  }
>  
> @@ -247,21 +261,8 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
>   return -EINVAL;
>  
> - /* Because we write directly to the reserved memory
> -  * region when loading crash kernels we need a mutex here to
> -  * prevent multiple crash  kernels from attempting to load
> -  * simultaneously, and to prevent a crash kernel from loading
> -  * over the top of a in use crash kernel.
> -  *
> -  * KISS: always take the mutex.
> -  */
> - if (!mutex_trylock(&kexec_mutex))
> - return -EBUSY;
> -
>   result = do_kexec_load(entry, nr_segments, segments, flags);
>  
> - mutex_unlock(&kexec_mutex);
> -
>   return result;
>  }
>  
> @@ -301,21 +302,8 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
>   return -EFAULT;
>   }
>  
> - /* Because we write directly to the reserved memory
> -  * region when loading crash kernels we need a mutex here to
> -  * prevent multiple crash  kernels from attempting to load
> -  * simultaneously, and to prevent a crash kernel from loading
> -  * over the top of a in use crash kernel.
> -  *
> -  * KISS: always take the mutex.
> -  */
> - if (!mutex_trylock(&kexec_mutex))
> - return -EBUSY;
> -
>   result = do_kexec_load(entry, nr_segments, ksegments, flags);
>  
> - mutex_unlock(&kexec_mutex);
> -
>   return result;
>  }
>  #endif


Re: [PATCH v5 2/6] kexec: avoid compat_alloc_user_space

2021-07-28 Thread Eric W. Biederman
Arnd Bergmann  writes:

> From: Arnd Bergmann 
>
> kimage_alloc_init() expects a __user pointer, so compat_sys_kexec_load()
> uses compat_alloc_user_space() to convert the layout and put it back
> onto the user space caller stack.
>
> Moving the user space access into the syscall handler directly actually
> makes the code simpler, as the conversion for compat mode can now be
> done on kernel memory.

Acked-by: "Eric W. Biederman" 

>
> Co-developed-by: Eric Biederman 
> Co-developed-by: Christoph Hellwig 
> Link: https://lore.kernel.org/lkml/ypbtsu4gx6pl7%2...@infradead.org/
> Link: https://lore.kernel.org/lkml/m1y2cbzmnw@fess.ebiederm.org/
> Signed-off-by: Arnd Bergmann 
> ---
>  kernel/kexec.c | 61 +-
>  1 file changed, 25 insertions(+), 36 deletions(-)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 9c7aef8f4bb6..b5e40f069768 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -19,26 +19,9 @@
>  
>  #include "kexec_internal.h"
>  
> -static int copy_user_segment_list(struct kimage *image,
> -   unsigned long nr_segments,
> -   struct kexec_segment __user *segments)
> -{
> - int ret;
> - size_t segment_bytes;
> -
> - /* Read in the segments */
> - image->nr_segments = nr_segments;
> - segment_bytes = nr_segments * sizeof(*segments);
> - ret = copy_from_user(image->segment, segments, segment_bytes);
> - if (ret)
> - ret = -EFAULT;
> -
> - return ret;
> -}
> -
>  static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
>unsigned long nr_segments,
> -  struct kexec_segment __user *segments,
> +  struct kexec_segment *segments,
>unsigned long flags)
>  {
>   int ret;
> @@ -58,10 +41,8 @@ static int kimage_alloc_init(struct kimage **rimage, 
> unsigned long entry,
>   return -ENOMEM;
>  
>   image->start = entry;
> -
> - ret = copy_user_segment_list(image, nr_segments, segments);
> - if (ret)
> - goto out_free_image;
> + image->nr_segments = nr_segments;
> + memcpy(image->segment, segments, nr_segments * sizeof(*segments));
>  
>   if (kexec_on_panic) {
>   /* Enable special crash kernel control page alloc policy. */
> @@ -104,7 +85,7 @@ static int kimage_alloc_init(struct kimage **rimage, 
> unsigned long entry,
>  }
>  
>  static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> - struct kexec_segment __user *segments, unsigned long flags)
> + struct kexec_segment *segments, unsigned long flags)
>  {
>   struct kimage **dest_image, *image;
>   unsigned long i;
> @@ -250,7 +231,8 @@ static inline int kexec_load_check(unsigned long 
> nr_segments,
>  SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
>   struct kexec_segment __user *, segments, unsigned long, flags)
>  {
> - int result;
> + struct kexec_segment *ksegments;
> + unsigned long result;
>  
>   result = kexec_load_check(nr_segments, flags);
>   if (result)
> @@ -261,7 +243,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
>   return -EINVAL;
>  
> - result = do_kexec_load(entry, nr_segments, segments, flags);
> + ksegments = memdup_user(segments, nr_segments * sizeof(ksegments[0]));
> + if (IS_ERR(ksegments))
> + return PTR_ERR(ksegments);
> +
> + result = do_kexec_load(entry, nr_segments, ksegments, flags);
> + kfree(ksegments);
>  
>   return result;
>  }
> @@ -273,7 +260,7 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
>  compat_ulong_t, flags)
>  {
>   struct compat_kexec_segment in;
> - struct kexec_segment out, __user *ksegments;
> + struct kexec_segment *ksegments;
>   unsigned long i, result;
>  
>   result = kexec_load_check(nr_segments, flags);
> @@ -286,24 +273,26 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, 
> entry,
>   if ((flags & KEXEC_ARCH_MASK) == KEXEC_ARCH_DEFAULT)
>   return -EINVAL;
>  
> - ksegments = compat_alloc_user_space(nr_segments * sizeof(out));
> + ksegments = kmalloc_array(nr_segments, sizeof(ksegments[0]),
> + GFP_KERNEL);
> + if (!ksegments)
> + return -ENOMEM;
> +
>   for (i = 0; i

Re: [PATCH 3/4] exec: simplify the compat syscall handling

2021-03-26 Thread Eric W. Biederman
Christoph Hellwig  writes:


> diff --git a/fs/exec.c b/fs/exec.c
> index 06e07278b456fa..b34c1eb9e7ad8e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -391,47 +391,34 @@ static int bprm_mm_init(struct linux_binprm *bprm)
>   return err;
>  }
>  
> -struct user_arg_ptr {
> -#ifdef CONFIG_COMPAT
> - bool is_compat;
> -#endif
> - union {
> - const char __user *const __user *native;
> -#ifdef CONFIG_COMPAT
> - const compat_uptr_t __user *compat;
> -#endif
> - } ptr;
> -};
> -
> -static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
> +static const char __user *
> +get_user_arg_ptr(const char __user *const __user *argv, int nr)
>  {
> - const char __user *native;
> -
> -#ifdef CONFIG_COMPAT
> - if (unlikely(argv.is_compat)) {
> + if (in_compat_syscall()) {
> + const compat_uptr_t __user *compat_argv =
> + compat_ptr((unsigned long)argv);

Ouch!  Passing a pointer around as the wrong type through the kernel!

Perhaps we should reduce everything to do_execveat and
do_execveat_compat.  Then there would be no need for anything
to do anything odd with the pointer types.

I think the big change would be to factor out a copy_string out
of copy_strings, that performs all of the work once we know the proper
pointer value.

Casting pointers from one type to another scares me as one mistake means
we are doing something wrong and probably exploitable.


Eric




>   compat_uptr_t compat;
>  
> - if (get_user(compat, argv.ptr.compat + nr))
> + if (get_user(compat, compat_argv + nr))
>   return ERR_PTR(-EFAULT);
> -
>   return compat_ptr(compat);
> - }
> -#endif
> -
> - if (get_user(native, argv.ptr.native + nr))
> - return ERR_PTR(-EFAULT);
> + } else {
> + const char __user *native;
>  
> - return native;
> + if (get_user(native, argv + nr))
> + return ERR_PTR(-EFAULT);
> + return native;
> + }
>  }
>  


Re: [PATCH RFC PKS/PMEM 51/58] kernel: Utilize new kmap_thread()

2020-10-09 Thread Eric W. Biederman
ira.we...@intel.com writes:

> From: Ira Weiny 
>
> This kmap() call is localized to a single thread.  To avoid the overhead
> of global PKRS updates, use the new kmap_thread() call.

Acked-by: "Eric W. Biederman" 

>
> Cc: Eric Biederman 
> Signed-off-by: Ira Weiny 
> ---
>  kernel/kexec_core.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..272a9920c0d6 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -815,7 +815,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   if (result < 0)
>   goto out;
>  
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   /* Start with a clear page */
>   clear_page(ptr);
>   ptr += maddr & ~PAGE_MASK;
> @@ -828,7 +828,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   memcpy(ptr, kbuf, uchunk);
>   else
>   result = copy_from_user(ptr, buf, uchunk);
> - kunmap(page);
> + kunmap_thread(page);
>   if (result) {
>   result = -EFAULT;
>   goto out;
> @@ -879,7 +879,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   goto out;
>   }
>   arch_kexec_post_alloc_pages(page_address(page), 1, 0);
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   ptr += maddr & ~PAGE_MASK;
>   mchunk = min_t(size_t, mbytes,
>   PAGE_SIZE - (maddr & ~PAGE_MASK));
> @@ -895,7 +895,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   else
>   result = copy_from_user(ptr, buf, uchunk);
>   kexec_flush_icache_page(page);
> - kunmap(page);
> + kunmap_thread(page);
>   arch_kexec_pre_free_pages(page_address(page), 1);
>   if (result) {
>   result = -EFAULT;


Re: [PATCH v2] All arch: remove system call sys_sysctl

2020-06-11 Thread Eric W. Biederman
Rich Felker  writes:

> On Thu, Jun 11, 2020 at 12:01:11PM -0500, Eric W. Biederman wrote:
>> Rich Felker  writes:
>> 
>> > On Thu, Jun 11, 2020 at 06:43:00AM -0500, Eric W. Biederman wrote:
>> >> Xiaoming Ni  writes:
>> >> 
>> >> > Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system 
>> >> > call"),
>> >> > sys_sysctl is actually unavailable: any input can only return an error.
>> >> >
>> >> > We have been warning about people using the sysctl system call for years
>> >> > and believe there are no more users.  Even if there are users of this
>> >> > interface if they have not complained or fixed their code by now they
>> >> > probably are not going to, so there is no point in warning them any
>> >> > longer.
>> >> >
>> >> > So completely remove sys_sysctl on all architectures.
>> >> 
>> >> 
>> >> 
>> >> >
>> >> > Signed-off-by: Xiaoming Ni 
>> >> >
>> >> > changes in v2:
>> >> >   According to Kees Cook's suggestion, completely remove sys_sysctl on 
>> >> > all arch
>> >> >   According to Eric W. Biederman's suggestion, update the commit log
>> >> >
>> >> > V1: 
>> >> > https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/
>> >> >   Delete the code of sys_sysctl and return -ENOSYS directly at the 
>> >> > function entry
>> >> > ---
>> >> >  include/uapi/linux/sysctl.h|  15 --
>> >> [snip]
>> >> 
>> >> > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
>> >> > index 27c1ed2..84b44c3 100644
>> >> > --- a/include/uapi/linux/sysctl.h
>> >> > +++ b/include/uapi/linux/sysctl.h
>> >> > @@ -27,21 +27,6 @@
>> >> >  #include 
>> >> >  #include 
>> >> >  
>> >> > -#define CTL_MAXNAME 10 /* how many path components do we allow 
>> >> > in a
>> >> > -  call to sysctl?   In other words, 
>> >> > what is
>> >> > -  the largest acceptable value for the 
>> >> > nlen
>> >> > -  member of a struct __sysctl_args to 
>> >> > have? */
>> >> > -
>> >> > -struct __sysctl_args {
>> >> > -   int __user *name;
>> >> > -   int nlen;
>> >> > -   void __user *oldval;
>> >> > -   size_t __user *oldlenp;
>> >> > -   void __user *newval;
>> >> > -   size_t newlen;
>> >> > -   unsigned long __unused[4];
>> >> > -};
>> >> > -
>> >> >  /* Define sysctl names first */
>> >> >  
>> >> >  /* Top-level names: */
>> >> [snip]
>> >> 
>> >> The uapi header change does not make sense.  The entire point of the
>> >> header is to allow userspace programs to be able to call sys_sysctl.
>> >> It either needs to all stay or all go.
>> >> 
>> >> As the concern with the uapi header is about userspace programs being
>> >> able to compile please leave the header for now.
>> >> 
>> >> We should leave auditing userspace and seeing if userspace code will
>> >> still compile if we remove this header for a separate patch.  The
>> >> concerns and justifications for the uapi header are completely different
>> than for removing the sys_sysctl implementation.
>> >> 
>> >> Otherwise
>> >> Acked-by: "Eric W. Biederman" 
>> >
>> > The UAPI header should be kept because it's defining an API not just
>> > for the kernel the headers are supplied with, but for all past
>> > kernels. In particular programs needing a failsafe CSPRNG source that
>> > works on old kernels may (do) use this as a fallback only if modern
>> > syscalls are missing. Removing the syscall is no problem since it
>> > won't be used, but if you remove the types/macros from the UAPI
>> > headers, they'll have to copy that into their own sources.
>> 
>> May we assume you know of at least one piece of userspace that will fail
>> to compile if this header file is remov

Re: [PATCH v2] All arch: remove system call sys_sysctl

2020-06-11 Thread Eric W. Biederman
Rich Felker  writes:

> On Thu, Jun 11, 2020 at 06:43:00AM -0500, Eric W. Biederman wrote:
>> Xiaoming Ni  writes:
>> 
>> > Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
>> > sys_sysctl is actually unavailable: any input can only return an error.
>> >
>> > We have been warning about people using the sysctl system call for years
>> > and believe there are no more users.  Even if there are users of this
>> > interface if they have not complained or fixed their code by now they
>> > probably are not going to, so there is no point in warning them any
>> > longer.
>> >
>> > So completely remove sys_sysctl on all architectures.
>> 
>> 
>> 
>> >
>> > Signed-off-by: Xiaoming Ni 
>> >
>> > changes in v2:
>> >   According to Kees Cook's suggestion, completely remove sys_sysctl on all 
>> > arch
>> >   According to Eric W. Biederman's suggestion, update the commit log
>> >
>> > V1: 
>> > https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/
>> >   Delete the code of sys_sysctl and return -ENOSYS directly at the 
>> > function entry
>> > ---
>> >  include/uapi/linux/sysctl.h|  15 --
>> [snip]
>> 
>> > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
>> > index 27c1ed2..84b44c3 100644
>> > --- a/include/uapi/linux/sysctl.h
>> > +++ b/include/uapi/linux/sysctl.h
>> > @@ -27,21 +27,6 @@
>> >  #include 
>> >  #include 
>> >  
>> > -#define CTL_MAXNAME 10/* how many path components do we allow 
>> > in a
>> > - call to sysctl?   In other words, what is
>> > - the largest acceptable value for the nlen
>> > - member of a struct __sysctl_args to have? */
>> > -
>> > -struct __sysctl_args {
>> > -  int __user *name;
>> > -  int nlen;
>> > -  void __user *oldval;
>> > -  size_t __user *oldlenp;
>> > -  void __user *newval;
>> > -  size_t newlen;
>> > -  unsigned long __unused[4];
>> > -};
>> > -
>> >  /* Define sysctl names first */
>> >  
>> >  /* Top-level names: */
>> [snip]
>> 
>> The uapi header change does not make sense.  The entire point of the
>> header is to allow userspace programs to be able to call sys_sysctl.
>> It either needs to all stay or all go.
>> 
>> As the concern with the uapi header is about userspace programs being
>> able to compile please leave the header for now.
>> 
>> We should leave auditing userspace and seeing if userspace code will
>> still compile if we remove this header for a separate patch.  The
>> concerns and justifications for the uapi header are completely different
>> than for removing the sys_sysctl implementation.
>> 
>> Otherwise
>> Acked-by: "Eric W. Biederman" 
>
> The UAPI header should be kept because it's defining an API not just
> for the kernel the headers are supplied with, but for all past
> kernels. In particular programs needing a failsafe CSPRNG source that
> works on old kernels may (do) use this as a fallback only if modern
> syscalls are missing. Removing the syscall is no problem since it
> won't be used, but if you remove the types/macros from the UAPI
> headers, they'll have to copy that into their own sources.

May we assume you know of at least one piece of userspace that will fail
to compile if this header file is removed?

Eric



Re: [PATCH v2] All arch: remove system call sys_sysctl

2020-06-11 Thread Eric W. Biederman
Xiaoming Ni  writes:

> Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
> sys_sysctl is actually unavailable: any input can only return an error.
>
> We have been warning about people using the sysctl system call for years
> and believe there are no more users.  Even if there are users of this
> interface if they have not complained or fixed their code by now they
> probably are not going to, so there is no point in warning them any
> longer.
>
> So completely remove sys_sysctl on all architectures.



>
> Signed-off-by: Xiaoming Ni 
>
> changes in v2:
>   According to Kees Cook's suggestion, completely remove sys_sysctl on all 
> arch
>   According to Eric W. Biederman's suggestion, update the commit log
>
> V1: 
> https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/
>   Delete the code of sys_sysctl and return -ENOSYS directly at the function 
> entry
> ---
>  include/uapi/linux/sysctl.h|  15 --
[snip]

> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index 27c1ed2..84b44c3 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -27,21 +27,6 @@
>  #include 
>  #include 
>  
> -#define CTL_MAXNAME 10   /* how many path components do we allow 
> in a
> -call to sysctl?   In other words, what is
> -the largest acceptable value for the nlen
> -member of a struct __sysctl_args to have? */
> -
> -struct __sysctl_args {
> - int __user *name;
> - int nlen;
> - void __user *oldval;
> - size_t __user *oldlenp;
> - void __user *newval;
> - size_t newlen;
> - unsigned long __unused[4];
> -};
> -
>  /* Define sysctl names first */
>  
>  /* Top-level names: */
[snip]

The uapi header change does not make sense.  The entire point of the
header is to allow userspace programs to be able to call sys_sysctl.
It either needs to all stay or all go.

As the concern with the uapi header is about userspace programs being
able to compile please leave the header for now.

We should leave auditing userspace and seeing if userspace code will
still compile if we remove this header for a separate patch.  The
concerns and justifications for the uapi header are completely different
than for removing the sys_sysctl implementation.

Otherwise
Acked-by: "Eric W. Biederman" 


Eric


Re: [PATCH 12/13] sysctl: add helper to register empty subdir

2020-05-29 Thread Eric W. Biederman
Luis Chamberlain  writes:

> The way to create a subdirectory from the base set of directories
> is a bit obscure, so provide a helper which makes this clear, and
> also helps remove boilerplate code required to do this work.

I agree calling:
register_sysctl("fs/binfmt_misc", sysctl_mount_point)
is a bit obscure but if you are going to make a wrapper
please make it the trivial one liner above.

Say something that looks like:
struct sysctl_header *register_sysctl_mount_point(const char *path)
{
return register_sysctl(path, sysctl_mount_point);
}

And yes please talk about a mount point and not an empty dir, as these
are permanently empty directories to serve as mount points.  There are
some subtle but important permission checks this allows in the case of
unprivileged mounts.

Further, code like this belongs in proc_sysctl.c next to all of the code
it is related to so that it is easier to see how to refactor the code if
necessary.

Eric

>
> Signed-off-by: Luis Chamberlain 
> ---
>  include/linux/sysctl.h |  7 +++
>  kernel/sysctl.c| 16 +---
>  2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 33a471b56345..89c92390e6de 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -208,6 +208,8 @@ extern void register_sysctl_init(const char *path, struct 
> ctl_table *table,
>  extern struct ctl_table_header *register_sysctl_subdir(const char *base,
>  const char *subdir,
>  struct ctl_table *table);
> +extern void register_sysctl_empty_subdir(const char *base, const char *subdir);
> +
>  void do_sysctl_args(void);
>  
>  extern int pwrsw_enabled;
> @@ -231,6 +233,11 @@ inline struct ctl_table_header 
> *register_sysctl_subdir(const char *base,
>   return NULL;
>  }
>  
> +static inline void register_sysctl_empty_subdir(const char *base,
> + const char *subdir)
> +{
> +}
> +
>  static inline struct ctl_table_header *register_sysctl_paths(
>   const struct ctl_path *path, struct ctl_table *table)
>  {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f9a35325d5d5..460532cd5ac8 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -3188,13 +3188,17 @@ struct ctl_table_header *register_sysctl_subdir(const 
> char *base,
>   { }
>   };
>  
> - if (!table->procname)
> + if (table != sysctl_mount_point && !table->procname)
>   goto out;
>  
>   hdr = register_sysctl_table(base_table);
>   if (unlikely(!hdr)) {
> - pr_err("failed when creating subdirectory sysctl %s/%s/%s\n",
> -base, subdir, table->procname);
> + if (table != sysctl_mount_point)
> + pr_err("failed when creating subdirectory sysctl %s/%s/%s\n",
> +base, subdir, table->procname);
> + else
> + pr_err("failed when creating empty subdirectory %s/%s\n",
> +base, subdir);
>   goto out;
>   }
>   kmemleak_not_leak(hdr);
> @@ -3202,6 +3206,12 @@ struct ctl_table_header *register_sysctl_subdir(const 
> char *base,
>   return hdr;
>  }
>  EXPORT_SYMBOL_GPL(register_sysctl_subdir);
> +
> +void register_sysctl_empty_subdir(const char *base,
> +   const char *subdir)
> +{
> + register_sysctl_subdir(base, subdir, sysctl_mount_point);
> +}
>  #endif /* CONFIG_SYSCTL */
>  /*
>   * No sense putting this after each symbol definition, twice,


Re: [PATCH 02/13] cdrom: use new sysctl subdir helper register_sysctl_subdir()

2020-05-29 Thread Eric W. Biederman
Luis Chamberlain  writes:

> This simplifies the code considerably. The following coccinelle

With register_sysctl the code would read:

cdrom_sysctl_header = register_sysctl("dev/cdrom", cdrom_table);

Please go that direction.  Thank you.

Eric


Re: [PATCH 01/13] sysctl: add new register_sysctl_subdir() helper

2020-05-29 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Luis Chamberlain  writes:
>
>> Often enough all we need to do is create a subdirectory so that
>> we can stuff sysctls underneath it. However, *if* that directory
>> was already created early in the boot sequence, we really have no
>> need to use the full boilerplate code for it; we can just use
>> local variables to help us guide sysctl to place the new leaf files.
>>
>> So use a helper to do precisely this.
>
> Reset restart.  This patch is total nonsense.
>
> - You are using register_sysctl_table which as I believe I have
>   mentioned is a deprecated compatibility wrapper.  The point of
>   spring house cleaning is to get off of the deprecated functions
>   isn't it?
>
> - You are using the old nasty form for creating directories instead
>   of just passing in a path.
>
> - None of this is even remotely necessary.  The directories
>   are created automatically if you just register their entries.

Oh.  *blink*  The poor naming threw me off.

This is a clumsy and poorly named version of register_sysctl();

Yes. This change is totally unnecessary.

Eric



Re: [PATCH 01/13] sysctl: add new register_sysctl_subdir() helper

2020-05-29 Thread Eric W. Biederman
Luis Chamberlain  writes:

> Often enough all we need to do is create a subdirectory so that
> we can stuff sysctls underneath it. However, *if* that directory
> was already created early in the boot sequence, we really have no
> need to use the full boilerplate code for it; we can just use
> local variables to help us guide sysctl to place the new leaf files.
>
> So use a helper to do precisely this.

Reset restart.  This patch is total nonsense.

- You are using register_sysctl_table which as I believe I have
  mentioned is a deprecated compatibility wrapper.  The point of
  spring house cleaning is to get off of the deprecated functions,
  isn't it?

- You are using the old nasty form for creating directories instead
  of just passing in a path.

- None of this is even remotely necessary.  The directories
  are created automatically if you just register their entries.

Eric


Re: remove set_fs calls from the coredump code v6

2020-05-06 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Tue, May 05, 2020 at 03:28:50PM -0500, Eric W. Biederman wrote:
>> We probably can.   After introducing a kernel_compat_siginfo that is
>> the size that userspace actually would need.
>> 
>> It isn't something I want to mess with until this code gets merged, as I
>> think the set_fs cleanups are more important.
>> 
>> 
>> Christoph made some good points about how ugly the #ifdefs are in
>> the generic copy_siginfo_to_user32 implementation.
>
> Take a look at the series you are replying to, the magic x86 ifdefs are
> entirely gone from the common code :)

Interesting.

That is a different way of achieving that, and I don't hate it.
  
I still want whatever you are doing to settle before I touch that code
again.  Removing the set_fs is important and I have other fish to fry
at the moment.

Eric





Re: remove set_fs calls from the coredump code v6

2020-05-05 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Tue, May 5, 2020 at 3:13 AM Christoph Hellwig  wrote:
>>
>> this series gets rid of playing with the address limit in the exec and
>> coredump code.  Most of this was fairly trivial, the biggest changes are
>> those to the spufs coredump code.
>
> Ack, nice, and looks good.
>
> The only part I dislike is how we have that 'struct compat_siginfo' on
> the stack, which is a huge waste (most of it is the nasty padding to
> 128 bytes).
>
> But that's not new, I only reacted to it because the code moved a bit.
> We cleaned up the regular siginfo to not have the padding in the
> kernel (and by "we" I mean "Eric Biederman did it after some prodding
> as part of his siginfo cleanups" - see commit 4ce5f9c9e754 "signal:
> Use a smaller struct siginfo in the kernel"),  and I wonder if we
> could do something similar with that compat thing.
>
> 128 bytes of wasted kernel stack isn't the end of the world, but it's
> sad when the *actual* data is only 32 bytes or so.

We probably can.   After introducing a kernel_compat_siginfo that is
the size that userspace actually would need.

It isn't something I want to mess with until this code gets merged, as I
think the set_fs cleanups are more important.


Christoph made some good points about how ugly the #ifdefs are in
the generic copy_siginfo_to_user32 implementation.

I am thinking the right fix is to introduce.
- TS_X32 as a companion to TS_COMPAT in the x86_64.
- Modify in_x32_syscall() to test TS_X32
- Implement x32_copy_siginfo_to_user32 that forces TS_X32 to be
  set. AKA:

x32_copy_siginfo_to_user32()
{
unsigned long state = current_thread_info()->state;
current_thread_info()->state |= TS_X32;
copy_siginfo_to_user32();
current_thread_info()->state = state;
}

That would make the #ifdefs go away, but I don't yet know what the x86
maintainers would say about that scheme.  I think it is a good path as
it would isolate the runtime cost of that weird SIGCHLD siginfo format
to just x32.  Then ia32 in compat mode would not need to pay.

Once I get that then it will be easier to introduce a yet another helper
of copy_siginfo_to_user32 that generates just the kernel_compat_siginfo
part, and the two visible derivatives can call memset and clear_user
to clear the unset parts.

I am assuming you don't mind having a full siginfo in
elf_note_info that ultimately gets copied into the core dump?

Eric


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Eric W. Biederman
David Hildenbrand  writes:

> On 30.04.20 18:33, Eric W. Biederman wrote:
>> David Hildenbrand  writes:
>> 
>>> On 30.04.20 17:38, Eric W. Biederman wrote:
>>>> David Hildenbrand  writes:
>>>>
>>>>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>>>>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>>>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>>>>> memory to the boot memmap of the kexec kernel.
>>>>>
>>>>> In fact, such memory is never exposed via the firmware memmap as System
>>>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>>>>> wrong:
>>>>>  "kexec needs the raw firmware-provided memory map to setup the
>>>>>   parameter segment of the kernel that should be booted with
>>>>>   kexec. Also, the raw memory map is useful for debugging. For
>>>>>   that reason, /sys/firmware/memmap is an interface that provides
>>>>>   the raw memory map to userspace." [1]
>>>>>
>>>>> We don't have to worry about firmware_map_remove() on the removal path.
>>>>> If there is no entry, it will simply return with -EINVAL.
>>>>>
>>>>> [1]
>>>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
>>>>
>>>>
>>>> You know what this justification is rubbish, and I have previously
>>>> explained why it is rubbish.
>>>
>>> Actually, no, I don't think it is rubbish. See patch #3 and the cover
>>> letter why this is the right thing to do *for special memory*, *not
>>> ordinary DIMMs*.
>>>
>>> And to be quite honest, I think your response is a little harsh. I don't
>>> recall you replying to my virtio-mem-related comments.
>>>
>>>>
>>>> Nacked-by: "Eric W. Biederman" 
>>>>
>>>> This needs to be based on whether the added memory is ultimately normal
>>>> ram or is something special.
>>>
>>> Yes, that's what the caller are expected to decide, see patch #3.
>>>
>>> kexec should try to be as closely as possible to a real reboot - IMHO.
>> 
>> That is very fuzzy in terms of hotplug memory.  The kexec'd kernel
>> should see the hotplugged memory assuming it is ordinary memory.
>> 
>> But kexec is not a reboot although it is quite similar.   Kexec is
>> swapping one running kernel and its state for another kernel without
>> rebooting.
>
> I agree (especially regarding the arm64 DIMM hotplug discussion).
> However, for the two cases
>
> a) dax/kmem
> b) virtio-mem
>
> We really want to let the driver take back control and figure out "what
> to do with the memory".

From reading your v1 cover letter (the description appears missing in
v2) I see what you are talking about with respect to virtio-mem.

So I will count virtio-mem as something different.

>>>> Justifying behavior by documentation that does not consider memory
>>>> hotplug is bad thinking.
>>>
>>> Are you maybe confusing this patch series with the arm64 approach? This
>>> is not about ordinary hotplugged DIMMs.
>> 
>> I think I am.
>> 
>> My challenge is that I don't see anything in the description that says
>> this isn't about ordinary hotplugged DIMMs.  All I saw was hotplug
>> memory.
>
> I'm sorry if that was confusing, I tried to stress that kmem and
> virtio-mem is special in the description.
>
> I squeezed a lot of that information into the cover letter and into
> patch #3.


>> If the class of memory is different then please by all means let's mark
>> it differently in struct resource so everyone knows it is different.
>> But that difference needs to be more than hotplug.
>> 
>> That difference needs to be the hypervisor loaned us memory and might
>> take it back at any time, or this memory is persistent and so it has
>> these different characteristics so don't use it as ordinary ram.
>
> Yes, and I think kmem took an excellent approach of explicitly putting
> that "System RAM" into a resource hierarchy. That "System RAM" won't
> show up as a root node under /proc/iomem (see patch #3), which already
> results in kexec-tools treating it in a special way. I am thinking about
> doing the same for virtio-mem.

Reading this and your patch cover letters again my concern is that
the justification seems to be letting the tail wag the dog.

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Eric W. Biederman
David Hildenbrand  writes:

> On 30.04.20 17:38, Eric W. Biederman wrote:
>> David Hildenbrand  writes:
>> 
>>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>>> memory to the boot memmap of the kexec kernel.
>>>
>>> In fact, such memory is never exposed via the firmware memmap as System
>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>>> wrong:
>>>  "kexec needs the raw firmware-provided memory map to setup the
>>>   parameter segment of the kernel that should be booted with
>>>   kexec. Also, the raw memory map is useful for debugging. For
>>>   that reason, /sys/firmware/memmap is an interface that provides
>>>   the raw memory map to userspace." [1]
>>>
>>> We don't have to worry about firmware_map_remove() on the removal path.
>>> If there is no entry, it will simply return with -EINVAL.
>>>
>>> [1]
>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
>> 
>> 
>> You know what, this justification is rubbish, and I have previously
>> explained why it is rubbish.
>
> Actually, no, I don't think it is rubbish. See patch #3 and the cover
> letter why this is the right thing to do *for special memory*, *not
> ordinary DIMMs*.
>
> And to be quite honest, I think your response is a little harsh. I don't
> recall you replying to my virtio-mem-related comments.
>
>> 
>> Nacked-by: "Eric W. Biederman" 
>> 
>> This needs to be based on whether the added memory is ultimately normal
>> ram or is something special.
>
> Yes, that's what the callers are expected to decide, see patch #3.
>
> kexec should try to be as close as possible to a real reboot - IMHO.

That is very fuzzy in terms of hotplug memory.  The kexec'd kernel
should see the hotplugged memory assuming it is ordinary memory.

But kexec is not a reboot although it is quite similar.   Kexec is
swapping one running kernel and its state for another kernel without
rebooting.

>> Justifying behavior by documentation that does not consider memory
>> hotplug is bad thinking.
>
> Are you maybe confusing this patch series with the arm64 approach? This
> is not about ordinary hotplugged DIMMs.

I think I am.

My challenge is that I don't see anything in the description that says
this isn't about ordinary hotplugged DIMMs.  All I saw was hotplug
memory.

If the class of memory is different then please by all means let's mark
it differently in struct resource so everyone knows it is different.
But that difference needs to be more than hotplug.

That difference needs to be the hypervisor loaned us memory and might
take it back at any time, or this memory is persistent and so it has
these different characteristics so don't use it as ordinary ram.

That information is also useful to other people looking at the system
and seeing what is going on.

Just please don't muddle the concepts, or assume that whatever subset of
hotplug memory you are dealing with is the only subset.

I didn't see that flag making the distinction about the kind of memory
it is.

Eric






Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Eric W. Biederman
David Hildenbrand  writes:

> Some devices/drivers that add memory via add_memory() and friends (e.g.,
> dax/kmem, but also virtio-mem in the future) don't want to create entries
> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
> memory to the boot memmap of the kexec kernel.
>
> In fact, such memory is never exposed via the firmware memmap as System
> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
> wrong:
>  "kexec needs the raw firmware-provided memory map to setup the
>   parameter segment of the kernel that should be booted with
>   kexec. Also, the raw memory map is useful for debugging. For
>   that reason, /sys/firmware/memmap is an interface that provides
>   the raw memory map to userspace." [1]
>
> We don't have to worry about firmware_map_remove() on the removal path.
> If there is no entry, it will simply return with -EINVAL.
>
> [1]
> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap


You know what, this justification is rubbish, and I have previously
explained why it is rubbish.

Nacked-by: "Eric W. Biederman" 

This needs to be based on whether the added memory is ultimately normal
ram or is something special.

At least when we are talking memory resources.  Keeping it out of the
firmware map that is fine.

If the hotplugged memory is the result of plugging a stick of ram
into the machine and can and should be used like any other memory
it should be treated like any normal memory.

If the hotplugged memory is something special it should be treated as
something special.

Justifying behavior by documentation that does not consider memory
hotplug is bad thinking.








> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Pankaj Gupta 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Eric Biederman 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/memory_hotplug.h | 8 
>  mm/memory_hotplug.c| 3 ++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0151fb935c09..4ca418a731eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -68,6 +68,14 @@ struct mhp_params {
>   pgprot_t pgprot;
>  };
>  
> +/* Flags used for add_memory() and friends. */
> +
> +/*
> + * Don't create entries in /sys/firmware/memmap/. The memory is detected and
> + * added via a device driver, not via the initial (firmware) memmap.
> + */
> +#define MHP_NO_FIRMWARE_MEMMAP   1
> +
>  /*
>   * Zone resizing functions
>   *
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c01be92693e3..e94ede9cad00 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1062,7 +1062,8 @@ int __ref add_memory_resource(int nid, struct resource *res,
>   BUG_ON(ret);
>  
>   /* create new memmap entry */
> - firmware_map_add_hotplug(start, start + size, "System RAM");
> + if (!(flags & MHP_NO_FIRMWARE_MEMMAP))
> + firmware_map_add_hotplug(start, start + size, "System RAM");
>  
>   /* device_online() will take the lock when calling online_pages() */
>   mem_hotplug_done();


Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread Eric W. Biederman
David Hildenbrand  writes:

>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer. If we
>>> don't pass the efi, it won't get the SRAT table correctly, if I remember
>>> correctly. Yeah, I remember kvm guests can get memory hotplugged with
>>> ACPI only, this won't happen on bare metal though. Need to check carefully.
>>> I have been using a kvm guest with uefi firmware recently.
>> 
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>> 
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>> 
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
>
> FWIW, I just gave virtio-mem and kexec/kdump a try.
>
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
>
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

This sounds like a bug.

> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
>
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

This sounds like correct behavior to me.  If you add memory to the
system it is treated as memory to the system.

If we need to make it a special kind of memory with special rules we can
have some kind of special marking for the memory.  But hotplugged is not
in itself a sufficient criterion to say don't use this as normal memory.

If I take a huge server and plug in an extra dimm it is just memory.

For a similarly huge server I might want memory that the system booted
with to be unpluggable, in case hardware error reporting notices
a dimm generating a lot of memory errors.

Now perhaps virtualization needs a special tier of memory that should
only be used for cases where the memory is easily movable.

I am not familiar with virtio-mem but my skim of the initial design
is that virtio-mem was not designed to be such a special tier of memory.
Perhaps something has changed?
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html


> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.

No.

> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

How does the kernel memory allocator treat this memory?

The logic is simple.  If the kernel memory allocator treats that memory
as ordinary memory available for all uses it should be presented as
ordinary memory available for all uses.

If the kernel memory allocator treats that memory as special memory,
only available for uses where we can easily free it later and give it
back to the system (AKA it is special and not ordinary memory), we
should mark it as such.

Eric

p.s.  Please excuse me for jumping in; I may be missing some important
context, but what I read when I saw this message in my inbox just seemed
very wrong.




Re: remove set_fs calls from the exec and coredump code v2

2020-04-19 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Fri, Apr 17, 2020 at 05:41:52PM -0500, Eric W. Biederman wrote:
>> > this series gets rid of playing with the address limit in the exec and
>> > coredump code.  Most of this was fairly trivial, the biggest changes are
>> > those to the spufs coredump code.
>> >
>> > Changes since v1:
>> >  - properly spell NUL
>> >  - properly handle the compat siginfo case in ELF coredumps
>> 
>> Quick question is exec from a kernel thread within the scope of what you
>> are looking at?
>> 
>> There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears
>> to be to allow exec from kernel threads.  Where the kernel threads
>> run with set_fs(KERNEL_DS) until they call exec.
>
> This series doesn't really look at that area.  But I don't think exec
> from a kernel thread makes any sense, and cleaning up how to set the
> initial USER_DS vs KERNEL_DS state is something I'll eventually get to,
> it seems like a major mess at the moment.

Fair enough.  I just wanted to make certain that it is on people's radar
that when the kernel exec's init the arguments are read from kernel
memory, and the set_fs(USER_DS) in flush_old_exec() makes that stop
working later.

It is subtle and easy to miss.  So I figured I would mention it since
I have been staring at the exec code a lot lately.

Eric


Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-18 Thread Eric W. Biederman
Christophe Leroy  writes:

> Le 17/04/2020 à 23:09, Eric W. Biederman a écrit :
>>
>> To remove the use of set_fs in the coredump code there needs to be a
>> way to convert a kernel siginfo to a userspace compat siginfo.
>>
>> Call that function copy_siginfo_to_compat and factor it out of
>> copy_siginfo_to_user32.
>
> I find it a pitty to do that.
>
> The existing function could have been easily converted to using
> user_access_begin() + user_access_end() and use unsafe_put_user() to copy to
> userspace to avoid copying through a temporary structure on the stack.
>
> With your change, it becomes impossible to do that.

I don't follow.  You don't like temporary structures in the coredump
code or temporary structures in copy_siginfo_to_user32?

A temporary structure in copy_siginfo_to_user is pretty much required
so that it can be zeroed to guarantee we don't pass a structure with
holes to userspace.

The implementation of copy_siginfo_to_user32 used to use the equivalent
of user_access_begin() and user_access_end() and the code was a mess
that was very difficult to reason about.  I recall their being holes
in the structure that were being copied to userspace.

Meanwhile if you are going to set all of the bytes a cache hot temporary
structure is quite cheap.

> Is that really an issue to use that set_fs() in the coredump code ?

Using set_fs() is pretty bad and something that we would like to remove
from the kernel entirely.  The fewer instances of set_fs() we have the
better.

I forget all of the details but set_fs() is both a type violation and an
attack point when people are attacking the kernel.  The existence of
set_fs() requires somethings that should be constants to be variables.
Something about that means that our current code is difficult to protect
from spectre style vulnerabilities.

There was a very good thread about it all in I think 2018 but
unfortunately I can't find it now.

Eric


Re: remove set_fs calls from the exec and coredump code v2

2020-04-17 Thread Eric W. Biederman
Christoph Hellwig  writes:

> Hi all,
>
> this series gets rid of playing with the address limit in the exec and
> coredump code.  Most of this was fairly trivial, the biggest changes are
> those to the spufs coredump code.
>
> Changes since v1:
>  - properly spell NUL
>  - properly handle the compat siginfo case in ELF coredumps

Quick question is exec from a kernel thread within the scope of what you
are looking at?

There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears
to be to allow exec from kernel threads.  Where the kernel threads
run with set_fs(KERNEL_DS) until they call exec.

Eric


[PATCH 2/2] signal: Remove the set_fs in binfmt_elf.c:fill_siginfo_note

2020-04-17 Thread Eric W. Biederman


The code in binfmt_elf.c is different from the rest of the code that
processes siginfo, as it sends siginfo from a kernel buffer to a file
rather than from kernel memory to userspace buffers.  To remove its
use of set_fs the code needs some different siginfo helpers.

Add the helper copy_siginfo_to_external to copy from the kernel's
internal siginfo layout to a buffer in the siginfo layout that
userspace expects.

Modify fill_siginfo_note to use copy_siginfo_to_external instead of
set_fs and copy_siginfo_to_user.

Update compat_binfmt_elf.c to use the previously added
copy_siginfo_to_external32 to handle the compat case.

Signed-off-by: "Eric W. Biederman" 
---
 fs/binfmt_elf.c| 5 +
 fs/compat_binfmt_elf.c | 2 +-
 include/linux/signal.h | 7 +++
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 13f25e241ac4..a1f57e20c3cf 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1556,10 +1556,7 @@ static void fill_auxv_note(struct memelfnote *note, struct mm_struct *mm)
 static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata,
      const kernel_siginfo_t *siginfo)
 {
-   mm_segment_t old_fs = get_fs();
-   set_fs(KERNEL_DS);
-   copy_siginfo_to_user((user_siginfo_t __user *) csigdata, siginfo);
-   set_fs(old_fs);
+   copy_siginfo_to_external(csigdata, siginfo);
fill_note(note, "CORE", NT_SIGINFO, sizeof(*csigdata), csigdata);
 }
 
diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
index aaad4ca1217e..fa0e24e1b726 100644
--- a/fs/compat_binfmt_elf.c
+++ b/fs/compat_binfmt_elf.c
@@ -39,7 +39,7 @@
  */
 #define user_long_tcompat_long_t
 #define user_siginfo_t compat_siginfo_t
-#define copy_siginfo_to_user   copy_siginfo_to_user32
+#define copy_siginfo_to_external   copy_siginfo_to_external32
 
 /*
 * The machine-dependent core note format types are defined in elfcore-compat.h,
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 05bacd2ab135..c1796321cadb 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -24,6 +24,13 @@ static inline void clear_siginfo(kernel_siginfo_t *info)
 
#define SI_EXPANSION_SIZE (sizeof(struct siginfo) - sizeof(struct kernel_siginfo))
 
+static inline void copy_siginfo_to_external(siginfo_t *to,
+   const kernel_siginfo_t *from)
+{
+   memcpy(to, from, sizeof(*from));
+   memset(((char *)to) + sizeof(struct kernel_siginfo), 0, SI_EXPANSION_SIZE);
+}
+
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
 int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from);
 
-- 
2.25.0



[PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-17 Thread Eric W. Biederman


To remove the use of set_fs in the coredump code there needs to be a
way to convert a kernel siginfo to a userspace compat siginfo.

Call that function copy_siginfo_to_compat and factor it out of
copy_siginfo_to_user32.

The existence of x32 complicates this code.  On x32 SIGCHLD uses 64bit
times for utime and stime.  As only SIGCHLD is affected and SIGCHLD
never causes a coredump I have avoided handling that case.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/compat.h |   1 +
 kernel/signal.c| 108 +++--
 2 files changed, 63 insertions(+), 46 deletions(-)

diff --git a/include/linux/compat.h b/include/linux/compat.h
index 0480ba4db592..4962b254e550 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -402,6 +402,7 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
    unsigned long bitmap_size);
 long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
    unsigned long bitmap_size);
+void copy_siginfo_to_external32(struct compat_siginfo *to, const struct kernel_siginfo *from);
 int copy_siginfo_from_user32(kernel_siginfo_t *to, const struct compat_siginfo __user *from);
 int copy_siginfo_to_user32(struct compat_siginfo __user *to, const kernel_siginfo_t *from);
 int get_compat_sigevent(struct sigevent *event,
diff --git a/kernel/signal.c b/kernel/signal.c
index e58a6c619824..578f196898cb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3235,90 +3235,106 @@ int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from)
 }
 
 #ifdef CONFIG_COMPAT
-int copy_siginfo_to_user32(struct compat_siginfo __user *to,
-  const struct kernel_siginfo *from)
-#if defined(CONFIG_X86_X32_ABI) || defined(CONFIG_IA32_EMULATION)
+void copy_siginfo_to_external32(struct compat_siginfo *to,
+   const struct kernel_siginfo *from)
 {
-   return __copy_siginfo_to_user32(to, from, in_x32_syscall());
-}
-int __copy_siginfo_to_user32(struct compat_siginfo __user *to,
-const struct kernel_siginfo *from, bool x32_ABI)
-#endif
-{
-   struct compat_siginfo new;
-   memset(&new, 0, sizeof(new));
+   /*
+* This function does not work properly for SIGCHLD on x32,
+* but it does not need to as SIGCHLD never causes a coredump.
+*/
+   memset(to, 0, sizeof(*to));
 
-   new.si_signo = from->si_signo;
-   new.si_errno = from->si_errno;
-   new.si_code  = from->si_code;
+   to->si_signo = from->si_signo;
+   to->si_errno = from->si_errno;
+   to->si_code  = from->si_code;
switch(siginfo_layout(from->si_signo, from->si_code)) {
case SIL_KILL:
-   new.si_pid = from->si_pid;
-   new.si_uid = from->si_uid;
+   to->si_pid = from->si_pid;
+   to->si_uid = from->si_uid;
break;
case SIL_TIMER:
-   new.si_tid = from->si_tid;
-   new.si_overrun = from->si_overrun;
-   new.si_int = from->si_int;
+   to->si_tid = from->si_tid;
+   to->si_overrun = from->si_overrun;
+   to->si_int = from->si_int;
break;
case SIL_POLL:
-   new.si_band = from->si_band;
-   new.si_fd   = from->si_fd;
+   to->si_band = from->si_band;
+   to->si_fd   = from->si_fd;
break;
case SIL_FAULT:
-   new.si_addr = ptr_to_compat(from->si_addr);
+   to->si_addr = ptr_to_compat(from->si_addr);
 #ifdef __ARCH_SI_TRAPNO
-   new.si_trapno = from->si_trapno;
+   to->si_trapno = from->si_trapno;
 #endif
break;
case SIL_FAULT_MCEERR:
-   new.si_addr = ptr_to_compat(from->si_addr);
+   to->si_addr = ptr_to_compat(from->si_addr);
 #ifdef __ARCH_SI_TRAPNO
-   new.si_trapno = from->si_trapno;
+   to->si_trapno = from->si_trapno;
 #endif
-   new.si_addr_lsb = from->si_addr_lsb;
+   to->si_addr_lsb = from->si_addr_lsb;
break;
case SIL_FAULT_BNDERR:
-   new.si_addr = ptr_to_compat(from->si_addr);
+   to->si_addr = ptr_to_compat(from->si_addr);
 #ifdef __ARCH_SI_TRAPNO
-   new.si_trapno = from->si_trapno;
+   to->si_trapno = from->si_trapno;
 #endif
-   new.si_lower = ptr_to_compat(from->si_lower);
-   new.si_upper = ptr_to_compat(from->si_upper);
+   to->si_lower = ptr_to_compat(from->si_lower);
+   to->si_upper = ptr_to_compat(from->si_upper);
   

Re: [PATCH 2/8] signal: clean up __copy_siginfo_to_user32

2020-04-17 Thread Eric W. Biederman
Christoph Hellwig  writes:

> Instead of an architecture specific calling convention in common code
> just pass a flags argument with architecture specific values.

This bothers me because it makes all architectures pay for the sins of
x32.  Further it starts burying the details of the what is happening in
architecture specific helpers.  Hiding the fact that there is only
one niche architecture that does anything weird.

I am very sensitive to hiding away signal handling details right now
because way too much of the signal handling code got hidden in
architecture specific files and was quite buggy as a result.

My general sense is putting all of the weird details up front and center
in kernel/signal.c is the only way this code will be looked at
and successfully maintained.

How about these patches to solve set_fs with binfmt_elf instead:

Eric W. Biederman (2):
  signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32
  signal: Remove the set_fs in binfmt_elf.c:fill_siginfo_note

 fs/binfmt_elf.c|   5 +
 fs/compat_binfmt_elf.c |   2 +-
 include/linux/compat.h |   1 +
 include/linux/signal.h |   7 +++
 kernel/signal.c| 108 ++--

Eric


Re: [PATCH 4/8] binfmt_elf: open code copy_siginfo_to_user to kernelspace buffer

2020-04-17 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Wed, Apr 15, 2020 at 10:20:11AM +0200, Arnd Bergmann wrote:
>> > I'd rather keep it out of this series and to
>> > an interested party.  Then again x32 doesn't seem to have a whole lot
>> > of interested parties..
>> 
>> Fine with me. It's on my mental list of things that we want to kill off
>> eventually as soon as the remaining users stop replying to questions
>> about it.
>> 
>> In fact I should really turn that into a properly maintained list in
>> Documentation/... that contains any options that someone has
>> asked about removing in the past, along with the reasons for keeping
>> it around and a time at which we should ask about it again.
>
> To the newly added x86 maintainers:  Arnd brought up the point that
> elf_core_dump writes the ABI siginfo format into the core dump. That
> format differs for i386 vs x32.  Is there any good way to find out
> which is the right format when are not in a syscall?
>
> As far a I can tell x32 vs i386 just seems to be based around what
> syscall table was used for the current syscall, but core dumps aren't
> always in syscall context.

I don't think this matters.  The i386 and x32 signal structures
only differ for SIGCHLD.  The SIGCHLD signal does not cause coredumps.
So as long as we get the 32bit vs 64bit distinction correct all should be
well.

Eric





Re: [PATCH 1/4] fs: always build llseek.

2019-08-28 Thread Eric W. Biederman
Michal Suchanek  writes:

> 64bit !COMPAT does not build because the llseek syscall is in the
> tables.

Do I read this right you have a 128 bit offset to llseek on ppc64?

Looking at the signature it does not appear to make sense to build this
function on any 64bit platform.

Perhaps the proper fix is to take llseek out of your syscall tables?

Eric

> Signed-off-by: Michal Suchanek 
> ---
>  fs/read_write.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 5bbf587f5bc1..9db56931eb26 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -331,7 +331,6 @@ COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, compat_off_t, offset, unsigned i
>  }
>  #endif
>  
> -#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT)
>  SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
>   unsigned long, offset_low, loff_t __user *, result,
>   unsigned int, whence)
> @@ -360,7 +359,6 @@ SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
>   fdput_pos(f);
>   return retval;
>  }
> -#endif
>  
>  int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t count)
>  {


Re: [PATCH 1/4] fs: always build llseek.

2019-08-28 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Michal Suchanek  writes:
>
>> 64bit !COMPAT does not build because the llseek syscall is in the
>> tables.
>
> Do I read this right you have a 128 bit offset to llseek on ppc64?
>
> Looking at the signature it does not appear to make sense to build this
> function on any 64bit platform.
>
> Perhaps the proper fix is to take llseek out of your syscall tables?

Sorry I missed seeing the 2 newer versions of the patchset in my inbox
it seems you have already made the change I am suggesting.

Never mind.

Eric



[REVIEW][PATCH 0/9] signal/powerpc: siginfo cleanups

2018-09-18 Thread Eric W. Biederman


This is the continuation of my work to sort out signaling of exceptions
with siginfo.  The old approach of passing siginfo around resulted in
many cases where fields of siginfo were not initialized and then passed
to userspace, and also resulted in callers getting confused and
initializing the wrong fields.  My remedy is to have specific functions
for sending each different kind of signal with siginfo.  Those functions
take the information needed to fill in siginfo and do the work
themselves.

This is my set of changes to update powerpc to use those functions.
Along with some refactoring so those functions can be cleanly used.

Folks please review and double check me.  I think I have kept these
changes simple and obviously correct but I am human and mess up
sometimes.

After these patches have had a chance to be reviewed I plan to merge
them via my siginfo tree.  If you would rather take them in the powerpc
tree let me know.   All of the prerequisites should have been merged
through Linus's tree several releases ago.

Eric W. Biederman (9):
  signal/powerpc: Use force_sig_mceerr as appropriate
  signal/powerpc: Remove pkey parameter from __bad_area
  signal/powerpc: Call _exception_pkey directly from bad_key_fault_exception
  signal/powerpc: Remove pkey parameter from __bad_area_nosemaphore
  signal/powerpc: Factor the common exception code into exception_common
  signal/powerpc: Call force_sig_fault from _exception
  signal/powerpc: Specialize _exception_pkey for handling pkey exceptions
  signal/powerpc: Simplify _exception_pkey by using force_sig_pkuerr
  signal/powerpc: Use force_sig_fault where appropriate

 arch/powerpc/include/asm/bug.h|  2 +-
 arch/powerpc/kernel/process.c |  9 +
 arch/powerpc/kernel/traps.c   | 27 ---
 arch/powerpc/mm/fault.c   | 55 +--
 arch/powerpc/platforms/cell/spu_base.c|  4 +--
 arch/powerpc/platforms/cell/spufs/fault.c | 26 +--
 6 files changed, 57 insertions(+), 66 deletions(-)

Eric


[REVIEW][PATCH 9/9] signal/powerpc: Use force_sig_fault where appropriate

2018-09-18 Thread Eric W. Biederman
Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/kernel/process.c |  9 +---
 arch/powerpc/mm/fault.c   |  9 +---
 arch/powerpc/platforms/cell/spu_base.c|  4 ++--
 arch/powerpc/platforms/cell/spufs/fault.c | 26 +++
 4 files changed, 12 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 913c5725cdb2..553a396e7fc1 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -620,8 +620,6 @@ void do_send_trap(struct pt_regs *regs, unsigned long address,
 void do_break (struct pt_regs *regs, unsigned long address,
unsigned long error_code)
 {
-   siginfo_t info;
-
current->thread.trap_nr = TRAP_HWBKPT;
if (notify_die(DIE_DABR_MATCH, "dabr_match", regs, error_code,
11, SIGSEGV) == NOTIFY_STOP)
@@ -634,12 +632,7 @@ void do_break (struct pt_regs *regs, unsigned long address,
hw_breakpoint_disable();
 
/* Deliver the signal to userspace */
-   clear_siginfo(&info);
-   info.si_signo = SIGTRAP;
-   info.si_errno = 0;
-   info.si_code = TRAP_HWBKPT;
-   info.si_addr = (void __user *)address;
-   force_sig_info(SIGTRAP, &info, current);
+   force_sig_fault(SIGTRAP, TRAP_HWBKPT, (void __user *)address, current);
 }
 #endif /* CONFIG_PPC_ADV_DEBUG_REGS */
 
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 406d0e0ef096..1697e903bbf2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -165,17 +165,10 @@ static noinline int bad_access(struct pt_regs *regs, unsigned long address)
 static int do_sigbus(struct pt_regs *regs, unsigned long address,
 vm_fault_t fault)
 {
-   siginfo_t info;
-
if (!user_mode(regs))
return SIGBUS;
 
current->thread.trap_nr = BUS_ADRERR;
-   clear_siginfo(&info);
-   info.si_signo = SIGBUS;
-   info.si_errno = 0;
-   info.si_code = BUS_ADRERR;
-   info.si_addr = (void __user *)address;
 #ifdef CONFIG_MEMORY_FAILURE
if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
unsigned int lsb = 0; /* shutup gcc */
@@ -194,7 +187,7 @@ static int do_sigbus(struct pt_regs *regs, unsigned long 
address,
}
 
 #endif
-   force_sig_info(SIGBUS, &info, current);
+   force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address, current);
return 0;
 }
 
diff --git a/arch/powerpc/platforms/cell/spu_base.c 
b/arch/powerpc/platforms/cell/spu_base.c
index 0c45cdbac4cf..7f12c7b78c0f 100644
--- a/arch/powerpc/platforms/cell/spu_base.c
+++ b/arch/powerpc/platforms/cell/spu_base.c
@@ -50,11 +50,11 @@ struct cbe_spu_info cbe_spu_info[MAX_NUMNODES];
 EXPORT_SYMBOL_GPL(cbe_spu_info);
 
 /*
- * The spufs fault-handling code needs to call force_sig_info to raise signals
+ * The spufs fault-handling code needs to call force_sig_fault to raise signals
  * on DMA errors. Export it here to avoid general kernel-wide access to this
  * function
  */
-EXPORT_SYMBOL_GPL(force_sig_info);
+EXPORT_SYMBOL_GPL(force_sig_fault);
 
 /*
  * Protects cbe_spu_info and spu->number.
diff --git a/arch/powerpc/platforms/cell/spufs/fault.c 
b/arch/powerpc/platforms/cell/spufs/fault.c
index 83cf58daaa79..971ac43b5d60 100644
--- a/arch/powerpc/platforms/cell/spufs/fault.c
+++ b/arch/powerpc/platforms/cell/spufs/fault.c
@@ -36,42 +36,32 @@
 static void spufs_handle_event(struct spu_context *ctx,
unsigned long ea, int type)
 {
-   siginfo_t info;
-
if (ctx->flags & SPU_CREATE_EVENTS_ENABLED) {
ctx->event_return |= type;
wake_up_all(&ctx->stop_wq);
return;
}
 
-   clear_siginfo(&info);
-
switch (type) {
case SPE_EVENT_INVALID_DMA:
-   info.si_signo = SIGBUS;
-   info.si_code = BUS_OBJERR;
+   force_sig_fault(SIGBUS, BUS_OBJERR, NULL, current);
break;
case SPE_EVENT_SPE_DATA_STORAGE:
-   info.si_signo = SIGSEGV;
-   info.si_addr = (void __user *)ea;
-   info.si_code = SEGV_ACCERR;
ctx->ops->restart_dma(ctx);
+   force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *)ea,
+   current);
break;
case SPE_EVENT_DMA_ALIGNMENT:
-   info.si_signo = SIGBUS;
/* DAR isn't set for an alignment fault :( */
-   info.si_code = BUS_ADRALN;
+   force_sig_fault(SIGBUS, BUS_ADRALN, NULL, current);
break;
case SPE_EVENT_SPE_ERROR:
-   info.si_signo = SIGILL;
-   info.si_addr = (void __user *)(unsigned long)
-   ctx->ops->npc_read(ctx) - 4;
-   info.si_code = ILL_ILLOPC;
+   force_sig_fault(
+   SIGILL, ILL_ILLOPC,
+   (void __user *)(unsigned long)
+   ctx->ops->npc_read(ctx) - 4, current);
break;

[REVIEW][PATCH 8/9] signal/powerpc: Simplify _exception_pkey by using force_sig_pkuerr

2018-09-18 Thread Eric W. Biederman
Call force_sig_pkuerr directly instead of rolling it by hand
in _exception_pkey.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/kernel/traps.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index e5ea69222459..ab1bd06d7c44 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -364,18 +364,10 @@ static bool exception_common(int signr, struct pt_regs 
*regs, int code,
 
 void _exception_pkey(struct pt_regs *regs, unsigned long addr, int key)
 {
-   siginfo_t info;
-
if (!exception_common(SIGSEGV, regs, SEGV_PKUERR, addr))
return;
 
-   clear_siginfo(&info);
-   info.si_signo = SIGSEGV;
-   info.si_code = SEGV_PKUERR;
-   info.si_addr = (void __user *) addr;
-   info.si_pkey = key;
-
-   force_sig_info(info.si_signo, &info, current);
+   force_sig_pkuerr((void __user *) addr, key);
 }
 
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
-- 
2.17.1



[REVIEW][PATCH 7/9] signal/powerpc: Specialize _exception_pkey for handling pkey exceptions

2018-09-18 Thread Eric W. Biederman
Now that _exception no longer calls _exception_pkey it is no longer
necessary to handle any signal with any si_code.  All pkey exceptions
are SIGSEGV paired with SEGV_PKUERR.  So just handle
that case and remove the now unnecessary parameters from _exception_pkey.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/include/asm/bug.h |  2 +-
 arch/powerpc/kernel/traps.c| 10 +-
 arch/powerpc/mm/fault.c|  2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index fd06dbe7d7d3..fed7e6241349 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -133,7 +133,7 @@ struct pt_regs;
 extern int do_page_fault(struct pt_regs *, unsigned long, unsigned long);
 extern void bad_page_fault(struct pt_regs *, unsigned long, int);
 extern void _exception(int, struct pt_regs *, int, unsigned long);
-extern void _exception_pkey(int, struct pt_regs *, int, unsigned long, int);
+extern void _exception_pkey(struct pt_regs *, unsigned long, int);
 extern void die(const char *, struct pt_regs *, long);
 extern bool die_will_crash(void);
 extern void panic_flush_kmsg_start(void);
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index c38bec51dd84..e5ea69222459 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -362,20 +362,20 @@ static bool exception_common(int signr, struct pt_regs 
*regs, int code,
return true;
 }
 
-void _exception_pkey(int signr, struct pt_regs *regs, int code, unsigned long 
addr, int key)
+void _exception_pkey(struct pt_regs *regs, unsigned long addr, int key)
 {
siginfo_t info;
 
-   if (!exception_common(signr, regs, code, addr))
+   if (!exception_common(SIGSEGV, regs, SEGV_PKUERR, addr))
return;
 
clear_siginfo(&info);
-   info.si_signo = signr;
-   info.si_code = code;
+   info.si_signo = SIGSEGV;
+   info.si_code = SEGV_PKUERR;
info.si_addr = (void __user *) addr;
info.si_pkey = key;
 
-   force_sig_info(signr, &info, current);
+   force_sig_info(info.si_signo, &info, current);
 }
 
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a84d06b7d50d..406d0e0ef096 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -152,7 +152,7 @@ static int bad_key_fault_exception(struct pt_regs *regs, 
unsigned long address,
if (!user_mode(regs))
return SIGSEGV;
 
-   _exception_pkey(SIGSEGV, regs, SEGV_PKUERR, address, pkey);
+   _exception_pkey(regs, address, pkey);
 
return 0;
 }
-- 
2.17.1



[REVIEW][PATCH 6/9] signal/powerpc: Call force_sig_fault from _exception

2018-09-18 Thread Eric W. Biederman
The callers of _exception don't need the pkey exception logic because
they are not processing a pkey exception.  So just call exception_common
directly and then call force_sig_fault to generate the appropriate siginfo
and deliver the appropriate signal.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/kernel/traps.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index f6c778b5144f..c38bec51dd84 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -380,7 +380,10 @@ void _exception_pkey(int signr, struct pt_regs *regs, int 
code, unsigned long ad
 
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
 {
-   _exception_pkey(signr, regs, code, addr, 0);
+   if (!exception_common(signr, regs, code, addr))
+   return;
+
+   force_sig_fault(signr, code, (void __user *)addr, current);
 }
 
 void system_reset_exception(struct pt_regs *regs)
-- 
2.17.1



[REVIEW][PATCH 5/9] signal/powerpc: Factor the common exception code into exception_common

2018-09-18 Thread Eric W. Biederman
It is brittle and wrong to populate si_pkey when there was not a pkey
exception.  The field does not exist for all si_codes and in some
cases another field exists in the same memory location.

So factor out the code that all exceptions handlers must run
into exception_common, leaving the individual exception handlers
to generate the signals themselves.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/kernel/traps.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index f651fa91cdc9..f6c778b5144f 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -338,14 +338,12 @@ static void show_signal_msg(int signr, struct pt_regs 
*regs, int code,
show_user_instructions(regs);
 }
 
-void _exception_pkey(int signr, struct pt_regs *regs, int code,
-unsigned long addr, int key)
+static bool exception_common(int signr, struct pt_regs *regs, int code,
+ unsigned long addr)
 {
-   siginfo_t info;
-
if (!user_mode(regs)) {
die("Exception in kernel mode", regs, signr);
-   return;
+   return false;
}
 
show_signal_msg(signr, regs, code, addr);
@@ -361,6 +359,16 @@ void _exception_pkey(int signr, struct pt_regs *regs, int 
code,
 */
thread_pkey_regs_save(&current->thread);
 
+   return true;
+}
+
+void _exception_pkey(int signr, struct pt_regs *regs, int code, unsigned long 
addr, int key)
+{
+   siginfo_t info;
+
+   if (!exception_common(signr, regs, code, addr))
+   return;
+
clear_siginfo(&info);
info.si_signo = signr;
info.si_code = code;
-- 
2.17.1



[REVIEW][PATCH 4/9] signal/powerpc: Remove pkey parameter from __bad_area_nosemaphore

2018-09-18 Thread Eric W. Biederman
Now that bad_key_fault_exception no longer calls __bad_area_nosemaphore
there is no reason for __bad_area_nosemaphore to handle pkeys.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/mm/fault.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5afc1ee55043..a84d06b7d50d 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -103,8 +103,7 @@ static bool store_updates_sp(unsigned int inst)
  */
 
 static int
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long address, int 
si_code,
-   int pkey)
+__bad_area_nosemaphore(struct pt_regs *regs, unsigned long address, int 
si_code)
 {
/*
 * If we are in kernel mode, bail out with a SEGV, this will
@@ -114,14 +113,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned 
long address, int si_code,
if (!user_mode(regs))
return SIGSEGV;
 
-   _exception_pkey(SIGSEGV, regs, si_code, address, pkey);
+   _exception(SIGSEGV, regs, si_code, address);
 
return 0;
 }
 
 static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long 
address)
 {
-   return __bad_area_nosemaphore(regs, address, SEGV_MAPERR, 0);
+   return __bad_area_nosemaphore(regs, address, SEGV_MAPERR);
 }
 
 static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code)
@@ -134,7 +133,7 @@ static int __bad_area(struct pt_regs *regs, unsigned long 
address, int si_code)
 */
up_read(&mm->mmap_sem);
 
-   return __bad_area_nosemaphore(regs, address, si_code, 0);
+   return __bad_area_nosemaphore(regs, address, si_code);
 }
 
 static noinline int bad_area(struct pt_regs *regs, unsigned long address)
-- 
2.17.1



[REVIEW][PATCH 3/9] signal/powerpc: Call _exception_pkey directly from bad_key_fault_exception

2018-09-18 Thread Eric W. Biederman
This removes the need for other code paths to deal with pkey exceptions.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/mm/fault.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index e5725fa96a48..5afc1ee55043 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -145,7 +145,17 @@ static noinline int bad_area(struct pt_regs *regs, 
unsigned long address)
 static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address,
int pkey)
 {
-   return __bad_area_nosemaphore(regs, address, SEGV_PKUERR, pkey);
+   /*
+* If we are in kernel mode, bail out with a SEGV, this will
+* be caught by the assembly which will restore the non-volatile
+* registers before calling bad_page_fault()
+*/
+   if (!user_mode(regs))
+   return SIGSEGV;
+
+   _exception_pkey(SIGSEGV, regs, SEGV_PKUERR, address, pkey);
+
+   return 0;
 }
 
 static noinline int bad_access(struct pt_regs *regs, unsigned long address)
-- 
2.17.1



[REVIEW][PATCH 2/9] signal/powerpc: Remove pkey parameter from __bad_area

2018-09-18 Thread Eric W. Biederman
There are no callers of __bad_area that pass in a pkey parameter so it makes
no sense to take one.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/mm/fault.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 22d7f8748cd7..e5725fa96a48 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -124,8 +124,7 @@ static noinline int bad_area_nosemaphore(struct pt_regs 
*regs, unsigned long add
return __bad_area_nosemaphore(regs, address, SEGV_MAPERR, 0);
 }
 
-static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code,
-   int pkey)
+static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code)
 {
struct mm_struct *mm = current->mm;
 
@@ -135,12 +134,12 @@ static int __bad_area(struct pt_regs *regs, unsigned long 
address, int si_code,
 */
up_read(&mm->mmap_sem);
 
-   return __bad_area_nosemaphore(regs, address, si_code, pkey);
+   return __bad_area_nosemaphore(regs, address, si_code, 0);
 }
 
 static noinline int bad_area(struct pt_regs *regs, unsigned long address)
 {
-   return __bad_area(regs, address, SEGV_MAPERR, 0);
+   return __bad_area(regs, address, SEGV_MAPERR);
 }
 
 static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address,
@@ -151,7 +150,7 @@ static int bad_key_fault_exception(struct pt_regs *regs, 
unsigned long address,
 
 static noinline int bad_access(struct pt_regs *regs, unsigned long address)
 {
-   return __bad_area(regs, address, SEGV_ACCERR, 0);
+   return __bad_area(regs, address, SEGV_ACCERR);
 }
 
 static int do_sigbus(struct pt_regs *regs, unsigned long address,
-- 
2.17.1



[REVIEW][PATCH 1/9] signal/powerpc: Use force_sig_mceerr as appropriate

2018-09-18 Thread Eric W. Biederman
In do_sigbus isolate the mceerr signaling code and call
force_sig_mceerr instead of falling through to the force_sig_info that
works for all of the other signals.

Signed-off-by: "Eric W. Biederman" 
---
 arch/powerpc/mm/fault.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index d51cf5f4e45e..22d7f8748cd7 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -158,7 +158,6 @@ static int do_sigbus(struct pt_regs *regs, unsigned long 
address,
 vm_fault_t fault)
 {
siginfo_t info;
-   unsigned int lsb = 0;
 
if (!user_mode(regs))
return SIGBUS;
@@ -171,17 +170,22 @@ static int do_sigbus(struct pt_regs *regs, unsigned long 
address,
info.si_addr = (void __user *)address;
 #ifdef CONFIG_MEMORY_FAILURE
if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
+   unsigned int lsb = 0; /* shutup gcc */
+
pr_err("MCE: Killing %s:%d due to hardware memory corruption 
fault at %lx\n",
current->comm, current->pid, address);
-   info.si_code = BUS_MCEERR_AR;
+
+   if (fault & VM_FAULT_HWPOISON_LARGE)
+   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
+   if (fault & VM_FAULT_HWPOISON)
+   lsb = PAGE_SHIFT;
+
+   force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb,
+current);
+   return 0;
}
 
-   if (fault & VM_FAULT_HWPOISON_LARGE)
-   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
-   if (fault & VM_FAULT_HWPOISON)
-   lsb = PAGE_SHIFT;
 #endif
-   info.si_addr_lsb = lsb;
force_sig_info(SIGBUS, &info, current);
return 0;
 }
-- 
2.17.1



[REVIEW][PATCH 15/17] signal: Add TRAP_UNK si_code for undiagnosted trap exceptions

2018-04-19 Thread Eric W. Biederman
Both powerpc and alpha have cases where they wrongly set si_code to 0
in combination with SIGTRAP and don't mean SI_USER.

About half the time this is because the architecture cannot report
accurately what kind of trap triggered the exception.
The other half of the time it looks like no one has bothered to
figure out an appropriate si_code.

For the cases where the architecture does not have enough information
or is too lazy to figure out exactly what kind of trap exception
it is, define TRAP_UNK.

Cc: linux-...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 arch/x86/kernel/signal_compat.c| 2 +-
 include/uapi/asm-generic/siginfo.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 14c057f29979..9ccbf0576cd0 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -29,7 +29,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGFPE  != 15);
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS  != 5);
-   BUILD_BUG_ON(NSIGTRAP != 4);
+   BUILD_BUG_ON(NSIGTRAP != 5);
BUILD_BUG_ON(NSIGCHLD != 6);
BUILD_BUG_ON(NSIGSYS  != 1);
 
diff --git a/include/uapi/asm-generic/siginfo.h 
b/include/uapi/asm-generic/siginfo.h
index 558b902f18d4..80e2a7227205 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -249,7 +249,8 @@ typedef struct siginfo {
 #define TRAP_TRACE 2   /* process trace trap */
 #define TRAP_BRANCH 3  /* process taken branch trap */
 #define TRAP_HWBKPT 4  /* hardware breakpoint/watchpoint */
-#define NSIGTRAP   4
+#define TRAP_UNK   5   /* undiagnosed trap */
+#define NSIGTRAP   5
 
 /*
  * There is an additional set of SIGTRAP si_codes used by ptrace
-- 
2.14.1



[REVIEW][PATCH 13/17] signal/powerpc: Replace FPE_FIXME with FPE_FLTUNK

2018-04-19 Thread Eric W. Biederman
Using an si_code of 0 that aliases with SI_USER is clearly the
wrong thing to do, and causes problems in interesting ways.

The newly defined FPE_FLTUNK semantically appears to fit the
bill so use it instead.

Cc: Paul Mackerras <pau...@samba.org>
Cc: Kumar Gala <kumar.g...@freescale.com>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc:  linuxppc-dev@lists.ozlabs.org
Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various 
exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 arch/powerpc/include/uapi/asm/siginfo.h | 7 ---
 arch/powerpc/kernel/traps.c | 6 +++---
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/siginfo.h 
b/arch/powerpc/include/uapi/asm/siginfo.h
index 9f142451a01f..0437afc9ef3c 100644
--- a/arch/powerpc/include/uapi/asm/siginfo.h
+++ b/arch/powerpc/include/uapi/asm/siginfo.h
@@ -15,13 +15,6 @@
 
#include <asm-generic/siginfo.h>
 
-/*
- * SIGFPE si_codes
- */
-#ifdef __KERNEL__
-#define FPE_FIXME  0   /* Broken dup of SI_USER */
-#endif /* __KERNEL__ */
-
 /*
  * SIGTRAP si_codes
  */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 087855caf6a9..fdf9400beec8 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1031,7 +1031,7 @@ static void emulate_single_step(struct pt_regs *regs)
 
 static inline int __parse_fpscr(unsigned long fpscr)
 {
-   int ret = FPE_FIXME;
+   int ret = FPE_FLTUNK;
 
/* Invalid operation */
if ((fpscr & FPSCR_VE) && (fpscr & FPSCR_VX))
@@ -1972,7 +1972,7 @@ void SPEFloatingPointException(struct pt_regs *regs)
extern int do_spe_mathemu(struct pt_regs *regs);
unsigned long spefscr;
int fpexc_mode;
-   int code = FPE_FIXME;
+   int code = FPE_FLTUNK;
int err;
 
flush_spe_to_thread(current);
@@ -2041,7 +2041,7 @@ void SPEFloatingPointRoundException(struct pt_regs *regs)
printk(KERN_ERR "unrecognized spe instruction "
   "in %s at %lx\n", current->comm, regs->nip);
} else {
-   _exception(SIGFPE, regs, FPE_FIXME, regs->nip);
+   _exception(SIGFPE, regs, FPE_FLTUNK, regs->nip);
return;
}
 }
-- 
2.14.1



[REVIEW][PATCH 17/17] signal/powerpc: Replace TRAP_FIXME with TRAP_UNK

2018-04-19 Thread Eric W. Biederman
Using an si_code of 0 that aliases with SI_USER is clearly the wrong
thing to do, and causes problems in interesting ways.

For use in unknown_exception the recently defined TRAP_UNK
semantically is a perfect fit.  For use in RunModeException it looks
like something more specific than TRAP_UNK could be used.  No one has
bothered to find a better fit than the broken si_code of 0 in all of
these years and I don't see an obvious better fit either, so switching
RunModeException to return TRAP_UNK is clearly an improvement.

Recent history suggests no one actually cares about crazy corner
cases of the kernel behavior like this so I don't expect any
regressions from changing this.  However if something does
happen this change is easy to revert.

Though I wonder if SIGKILL might not be a better fit.

Cc: Paul Mackerras <pau...@samba.org>
Cc: Kumar Gala <kumar.g...@freescale.com>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various 
exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 arch/powerpc/include/uapi/asm/siginfo.h | 8 
 arch/powerpc/kernel/traps.c | 4 ++--
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/siginfo.h 
b/arch/powerpc/include/uapi/asm/siginfo.h
index 0437afc9ef3c..1d51d9b88221 100644
--- a/arch/powerpc/include/uapi/asm/siginfo.h
+++ b/arch/powerpc/include/uapi/asm/siginfo.h
@@ -15,12 +15,4 @@
 
#include <asm-generic/siginfo.h>
 
-/*
- * SIGTRAP si_codes
- */
-#ifdef __KERNEL__
-#define TRAP_FIXME 0   /* Broken dup of SI_USER */
-#endif /* __KERNEL__ */
-
-
 #endif /* _ASM_POWERPC_SIGINFO_H */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index fdf9400beec8..0e17dcb48720 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -969,7 +969,7 @@ void unknown_exception(struct pt_regs *regs)
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
-   _exception(SIGTRAP, regs, TRAP_FIXME, 0);
+   _exception(SIGTRAP, regs, TRAP_UNK, 0);
 
exception_exit(prev_state);
 }
@@ -991,7 +991,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs)
 
 void RunModeException(struct pt_regs *regs)
 {
-   _exception(SIGTRAP, regs, TRAP_FIXME, 0);
+   _exception(SIGTRAP, regs, TRAP_UNK, 0);
 }
 
 void single_step_exception(struct pt_regs *regs)
-- 
2.14.1



Re: [PATCH v3 01/17] y2038: asm-generic: Extend sysvipc data structures

2018-04-19 Thread Eric W. Biederman
Arnd Bergmann <a...@arndb.de> writes:

> On Thu, Apr 19, 2018 at 5:20 PM, Arnd Bergmann <a...@arndb.de> wrote:
>> On Thu, Apr 19, 2018 at 4:59 PM, Eric W. Biederman <ebied...@xmission.com> 
>> wrote:
>>> I suspect you want to use __kernel_ulong_t here instead of a raw
>>> unsigned long.  If nothing else it seems inconsistent to use typedefs
>>> in one half of the structure and no typedefs in the other half.
>>
>> Good catch, there is definitely something wrong here, but I think using
>> __kernel_ulong_t for all members would also be wrong, as that
>> still changes the layout on x32, which effectively is
>>
>> struct msqid64_ds {
>>  ipc64_perm msg_perm;
>>  u64 msg_stime;
>>  u32 __unused1;
>>  /* 32 bit implict padding */
>>  u64 msg_rtime;
>>  u32 __unused2;
>>  /* 32 bit implict padding */
>>  u64 msg_ctime;
>>  u32 __unused3;
>>  /* 32 bit implict padding */
>>  __kernel_pid_t  shm_cpid;   /* pid of creator */
>>  __kernel_pid_t  shm_lpid;   /* pid of last operator */
>>  
>> };
>>
>> The choices here would be to either use a mix of
>> __kernel_ulong_t and unsigned long, or taking the x32
>> version back into arch/x86/include/uapi/asm/ so the
>> generic version at least makes some sense.
>>
>> I can't use __kernel_time_t for the lower half on 32-bit
>> since it really should be unsigned.
>
> After thinking about it some more, I conclude that the structure is simply
> incorrect on x32: The __kernel_ulong_t usage was introduced in 2013
> in commit b9cd5ca22d67 ("uapi: Use __kernel_ulong_t in struct
> msqid64_ds") and apparently was correct initially as __BITS_PER_LONG
> evaluated to 64, but it broke with commit f4b4aae18288 ("x86/headers/uapi:
> Fix __BITS_PER_LONG value for x32 builds") that changed the value
> of __BITS_PER_LONG and introduced the extra padding in 2015.
>
> The same change apparently also broke a lot of other definitions, e.g.
>
> $ echo "#include " | gcc -mx32 -E -xc - | grep -A3
> __kernel_size_t
> typedef unsigned int __kernel_size_t;
> typedef int __kernel_ssize_t;
> typedef int __kernel_ptrdiff_t;
>
> Those used to be defined as 'unsigned long long' and 'long long'
> respectively, so now all kernel interfaces using those on x32
> became incompatible!

That seems like a real mess.

Is this just for the uapi header as seen by userspace?  I expect we are
using a normal kernel interface with 64-bit longs and 64-bit pointers
when we build the kernel.

If this is just a header as seen from userspace mess it seems
unfortunate but fixable.

Eric


Re: [PATCH v3 01/17] y2038: asm-generic: Extend sysvipc data structures

2018-04-19 Thread Eric W. Biederman
Arnd Bergmann  writes:

> Most architectures now use the asm-generic copy of the sysvipc data
> structures (msqid64_ds, semid64_ds, shmid64_ds), which use 32-bit
> __kernel_time_t on 32-bit architectures but have padding behind them to
> allow extending the type to 64-bit.
>
> Unfortunately, that fails on all big-endian architectures, which have the
> padding on the wrong side. As so many of them get it wrong, we decided to
> not bother even trying to fix it up when we introduced the asm-generic
> copy. Instead we always use the padding word now to provide the upper
> 32 bits of the seconds value, regardless of the endianess.
>
> A libc implementation on a typical big-endian system can deal with
> this by providing its own copy of the structure definition to user
> space, and swapping the two 32-bit words before returning from the
> semctl/shmctl/msgctl system calls.
>
> ARM64 and s/390 are architectures that use these generic headers and
> also provide support for compat mode on 64-bit kernels, so we adapt
> their copies here as well.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  include/uapi/asm-generic/msgbuf.h | 17 -
>  include/uapi/asm-generic/sembuf.h | 26 --
>  include/uapi/asm-generic/shmbuf.h | 17 -
>  3 files changed, 32 insertions(+), 28 deletions(-)
>
> diff --git a/include/uapi/asm-generic/msgbuf.h 
> b/include/uapi/asm-generic/msgbuf.h
> index fb306ebdb36f..d2169cae93b8 100644
> --- a/include/uapi/asm-generic/msgbuf.h
> +++ b/include/uapi/asm-generic/msgbuf.h
> @@ -18,23 +18,22 @@
>   * On big-endian systems, the padding is in the wrong place.
>   *
>   * Pad space is left for:
> - * - 64-bit time_t to solve y2038 problem
>   * - 2 miscellaneous 32-bit values
>   */
>  
>  struct msqid64_ds {
>   struct ipc64_perm msg_perm;
> +#if __BITS_PER_LONG == 64
>   __kernel_time_t msg_stime;  /* last msgsnd time */
> -#if __BITS_PER_LONG != 64
> - unsigned long   __unused1;
> -#endif
>   __kernel_time_t msg_rtime;  /* last msgrcv time */
> -#if __BITS_PER_LONG != 64
> - unsigned long   __unused2;
> -#endif
>   __kernel_time_t msg_ctime;  /* last change time */
> -#if __BITS_PER_LONG != 64
> - unsigned long   __unused3;
> +#else
> + unsigned long   msg_stime;  /* last msgsnd time */
> + unsigned long   msg_stime_high;
> + unsigned long   msg_rtime;  /* last msgrcv time */
> + unsigned long   msg_rtime_high;
> + unsigned long   msg_ctime;  /* last change time */
> + unsigned long   msg_ctime_high;
>  #endif

I suspect you want to use __kernel_ulong_t here instead of a raw
unsigned long.  If nothing else it seems inconsistent to use typedefs
in one half of the structure and no typedefs in the other half.

>   __kernel_ulong_t msg_cbytes;/* current number of bytes on queue */
>   __kernel_ulong_t msg_qnum;  /* number of messages in queue */
> diff --git a/include/uapi/asm-generic/sembuf.h 
> b/include/uapi/asm-generic/sembuf.h
> index cbf9cfe977d6..0bae010f1b64 100644
> --- a/include/uapi/asm-generic/sembuf.h
> +++ b/include/uapi/asm-generic/sembuf.h
> @@ -13,23 +13,29 @@
>   * everyone just ended up making identical copies without specific
>   * optimizations, so we may just as well all use the same one.
>   *
> - * 64 bit architectures typically define a 64 bit __kernel_time_t,
> + * 64 bit architectures use a 64-bit __kernel_time_t here, while
> + * 32 bit architectures have a pair of unsigned long values.
>   * so they do not need the first two padding words.
> - * On big-endian systems, the padding is in the wrong place.
>   *
> - * Pad space is left for:
> - * - 64-bit time_t to solve y2038 problem
> - * - 2 miscellaneous 32-bit values
> + * On big-endian systems, the padding is in the wrong place for
> + * historic reasons, so user space has to reconstruct a time_t
> + * value using
> + *
> + * user_semid_ds.sem_otime = kernel_semid64_ds.sem_otime +
> + *   ((long long)kernel_semid64_ds.sem_otime_high << 32)
> + *
> + * Pad space is left for 2 miscellaneous 32-bit values
>   */
>  struct semid64_ds {
>   struct ipc64_perm sem_perm; /* permissions .. see ipc.h */
> +#if __BITS_PER_LONG == 64
>   __kernel_time_t sem_otime;  /* last semop time */
> -#if __BITS_PER_LONG != 64
> - unsigned long   __unused1;
> -#endif
>   __kernel_time_t sem_ctime;  /* last change time */
> -#if __BITS_PER_LONG != 64
> - unsigned long   __unused2;
> +#else
> + unsigned long   sem_otime;  /* last semop time */
> + unsigned long   sem_otime_high;
> + unsigned long   sem_ctime;  /* last change time */
> + unsigned long   sem_ctime_high;
>  #endif
>   unsigned long   sem_nsems;  /* no. of semaphores in array */
>   unsigned long   __unused3;
> diff --git a/include/uapi/asm-generic/shmbuf.h 
> b/include/uapi/asm-generic/shmbuf.h
> index 2b6c3bb97f97..602f1b5b462b 100644
> --- 

Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER

2018-04-19 Thread Eric W. Biederman
Dave Martin  writes:

> On Sun, Apr 15, 2018 at 11:16:04AM -0700, Linus Torvalds wrote:
>
> [...]
>
>> The other thing we should do is to get rid of the stupid padding.
>> Right now "struct siginfo" is pointlessly padded to 128 bytes. That is
>> completely insane, when it's always just zero in the kernel.
>
> Agreed, inside the kernel the padding achieves nothing.
>
>> So put that _pad[] thing inside #ifndef __KERNEL__, and make
>> copy_siginfo_to_user() write the padding zeroes when copying to user
>> space. The reason for the padding is "future expansion", so we do want
>> to tell the user space that it's maybe up to 128 bytes in size, but if
>> we don't fill it all, we shouldn't waste time and memory on clearing
>> the padding internally.
>> 
>> I'm certainly *hoping* nobody depends on the whole 128 bytes in
>> rt_sigqueueinfo(). In theory you can fill it all (as long as si_code
>> is negative), but the man-page only says "si_value", and the compat
>> function doesn't copy any more than that either, so any user that
>> tries to fill in more than si_value is already broken. In fact, it
>> might even be worth enforcing that in rt_sigqueueinfo(), just to see
>> if anybody plays any games..
>
> [...]
>
> Digression:
>
> Since we don't traditionally zero the tail-padding in the user sigframe,
> is there a reliable way for userspace to detect newly-added fields in
> siginfo other than by having an explicit sigaction sa_flags flag to
> request them?  I imagine something like [1] below from the userspace
> perspective.
>
> On a separate thread, the issue of how to report syndrome information
> for SIGSEGV came up [2] (such as whether the faulting instruction was a
> read or a write).  This information is useful (and used) by things like
> userspace sanitisers and qemu.  Currently, reporting this to userspace
> relies on arch-specific cruft in the sigframe.
>
> We're committed to maintaining what's already in each arch sigframe,
> but it would be preferable to have a portable way of adding information
> to siginfo in a generic way.  si_code doesn't really work for that,
> since si_codes are mutually exclusive: I can't see a way of adding
> supplementary information using si_code.
>
> Anyway, that would be a separate RFC in the future (if ever).

So far what I have seen done is ``registers'' added to sigcontext,
which it looks like you have done with esr.

Scrubbing information from faults where the addresses point
outside of the userspace mapping makes sense.

I think before I would pursue what you are talking about on a generic
level I would want to look at the fact that we handle unblockable faults
wrong.  While unlikely it is possible for someone to send a thread
specific signal at just the right time, and have that signal
delivered before the synchronous fault.

Then we could pass additional arguments through that new
``generic'' path, especially arguments such as
tsk->thread.fault_address and tsk->thread.fault_code.

We can do anything we like with a new SA_flag as that allows us to
change the format of the sigframe.

If we are very careful we can add generic fields after that crazy
anonymous union in the _sigfault case of struct siginfo.

The trick would be to find something that would be enough so that people
don't need to implement their own instruction decoder to see what is
going on.  Something that is applicable to every sigfault case not just
SIGSEGV.  Something that we can and want to implement on multiple
architectures.

The point being doing something generic can be a lot of work, even if it
is worth it in the end.


> [1]
>
> static volatile int have_extflags = 0;
>
> static void handler(int n, siginfo_t *si, void *uc)
> {
>   /* ... */
>
>   if (have_extflags) {
>   /* Check si->si_extflags */
>   } else {
>   /* fallback */
>   }
>
>   /* ... */
> }
>
> int main(void)
> {
>   /* ... */
>
>   struct sigaction sa;
>
>   /* ... */
>
>   sa.sa_flags = SA_SIGINFO | SA_SIGINFO_EXTFLAGS;
>   sa.sa_sigaction = handler;
>   if (!sigaction(SIGSEGV, &sa, NULL)) {
>   have_extflags = 1;
>   } else {
>   sa.sa_flags &= ~SA_SIGINFO_EXTFLAGS;
>   if (sigaction(SIGSEGV, &sa, NULL))
>   goto error;
>   }
>
>   /* ... */
> }
>
> [2] [RFC PATCH] arm64: fault: Don't leak data in ESR context for user fault 
> on kernel VA
> http://lists.infradead.org/pipermail/linux-arm-kernel/2018-April/571428.html

Eric


Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER

2018-04-18 Thread Eric W. Biederman
Linus Torvalds <torva...@linux-foundation.org> writes:

> (
>
> On Sun, Apr 15, 2018 at 8:56 AM, Eric W. Biederman
> <ebied...@xmission.com> wrote:

[snip bit about wanting what is effectively force_sig_fault instead of
clear_siginfo everywhere]

> The other thing we should do is to get rid of the stupid padding.
> Right now "struct siginfo" is pointlessly padded to 128 bytes. That is
> completely insane, when it's always just zero in the kernel.
>
> So put that _pad[] thing inside #ifndef __KERNEL__, and make
> copy_siginfo_to_user() write the padding zeroes when copying to user
> space. The reason for the padding is "future expansion", so we do want
> to tell the user space that it's maybe up to 128 bytes in size, but if
> we don't fill it all, we shouldn't waste time and memory on clearing
> the padding internally.
>
> I'm certainly *hoping* nobody depends on the whole 128 bytes in
> rt_sigqueueinfo(). In theory you can fill it all (as long as si_code
> is negative), but the man-page only says "si_value", and the compat
> function doesn't copy any more than that either, so any user that
> tries to fill in more than si_value is already broken. In fact, it
> might even be worth enforcing that in rt_sigqueueinfo(), just to see
> if anybody plays any games..


From my earlier looking I think this is all doable, except detecting
if userspace uses more siginfo bytes than the kernel fills in.

> On x86-64, without the pointless padding, the size of 'struct siginfo'
> inside the kernel would be 48 bytes. That's quite a big difference for
> something that is often allocated on the kernel stack.

From my earlier looking I can say that I know of no case where signals
are injected into the kernel that we need more bytes than the kernel
currently provides.

The two cases I know of are signal reinjection for checkpoint/restart
or something that just uses pid, uid and ptr.   Generally that is
enough.

If we just truncate siginfo to everything except the pad bytes in the
kernel there should be no problems.  What I don't see how to do is to
limit the size of struct siginfo the kernel accepts in a forward
compatible way.  Something that would fail if for some reason you used
more siginfo bytes.

Looking at userspace: glibc always memsets siginfo_t to 0.
Criu just uses whatever it captured with PTRACE_PEEKSIGINFO,
and criu uses uninitialized memory for the target of PTRACE_PEEKSIGINFO.

If we truncate siginfo in the kernel then we can check for a known
si_code, in which case we are safe to truncate.  If the si_code is
unknown then we should check to see if the extra bytes are 0.  That
should work with everything.  Plus it is a natural point to test to
see if userspace is using signals that the kernel does not currently
support.
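
A minimal userspace sketch of that rule, assuming a hypothetical
48-byte truncated in-kernel size (SI_KERNEL_SIZE,
siginfo_ok_to_truncate and the sizes here are illustrative, not the
real kernel definitions):

```c
#include <string.h>

#define SI_MAX_SIZE    128	/* userspace-visible siginfo size */
#define SI_KERNEL_SIZE  48	/* assumed truncated in-kernel size */

/*
 * Hypothetical acceptance check for rt_sigqueueinfo-style input:
 * a known si_code is safe to truncate outright; for an unknown
 * si_code the bytes beyond the kernel's truncated size must all
 * be zero, so nothing meaningful is lost by truncating.
 */
static int siginfo_ok_to_truncate(const unsigned char *info, int si_code_known)
{
	size_t i;

	if (si_code_known)
		return 1;
	for (i = SI_KERNEL_SIZE; i < SI_MAX_SIZE; i++)
		if (info[i] != 0)
			return 0;
	return 1;
}
```

The nice property is exactly the one noted above: a caller that only
fills in the bytes the kernel supports passes the check, and anything
else fails loudly instead of being silently dropped.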

I will put that in my queue.

> So I'm certainly willing to make those kinds of changes, but let's
> make them real *improvements* now, ok? Wasn't that the point of all
> the cleanups in the end?

Definitely.  With the strace test case causing people to talk about
regressions I was just looking to see if it would make sense to do
something minimal for -rc2 so reduce concerns about regressions.

Now I am going to focus on getting what I can ready for the next merge
window.

Eric






Re: [RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized

2018-04-18 Thread Eric W. Biederman
Dave Martin <dave.mar...@arm.com> writes:

> On Tue, Apr 17, 2018 at 02:37:38PM -0500, Eric W. Biederman wrote:
>> Dave Martin <dave.mar...@arm.com> writes:
>> 
>> > Hmmm
>> >
>> > memset()/clear_siginfo() may ensure that there are no uninitialised
>> > explicit fields except for those in inactive union members, but I'm not
>> > sure that this approach is guaranteed to sanitise the padding seen by
>> > userspace.
>> >
>> > Rationale below, though it's a bit theoretical...
>> >
>> > With this in mind, I tend to agree with Linus that hiding memset() calls
>> > from the maintainer may be a bad idea unless they are also hidden from
>> > the compiler.  If the compiler sees the memset() it may be able to
>> > optimise it in ways that wouldn't be possible for some other random
>> > external function call, including optimising all or part of the call
>> > out.
>> >
>> > As a result, the breakdown into individual put_user()s etc. in
>> > copy_siginfo_to_user() may still be valuable even if all paths have the
>> > memset().
>> 
>> The breakdown into individual put_user()s is known to be problematically
>> slow, and is actually wrong.
>
> Slowness certainly looked like a potential problem.
>
>> Even excluding the SI_USER duplication, in a small number of cases the
>> fields filled out in siginfo by architecture code are not the fields
>> that copy_siginfo_to_user is copying.  Which is much worse.  The code
>> looks safe but is not.
>> 
>> My intention is to leave 0 instances of clear_siginfo in the
>> architecture specific code.  Ideally struct siginfo will be limited to
>> kernel/signal.c but I am not certain I can quite get that far.
>> The function do_coredump appears to have a legit need for siginfo.
>
> So, you mean we can't detect that the caller didn't initialise all the
> members, or initialised the wrong union member?

Correct.  Even when we smuggled the union member in the upper bits
of si_code we got it wrong.  So an interface that helps out and does
more and is harder to misuse looks desirable.

> What would be the alternative?  Have a separate interface for each SIL_
> type, with only kernel/signal.c translating that into the siginfo_t that
> userspace sees?

Yes.  It really isn't bad as architecture specific code only generates
faults.  In general faults only take a pointer.  I have already merged
the needed helpers into kernel/signal.c

> Either way, I don't see how we force the caller to initilise the whole
> structure.

In general the plan is to convert the callers to call force_sig_fault,
and then there is no need to have siginfo in the architecture specific
code.  All of the necessary helpers are already merged into
kernel/signal.c.

>
>> > (Rationale for an arch/arm example:)
>> >
>> >> diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
>> >> index 4c375e11ae95..adda3fc2dde8 100644
>> >> --- a/arch/arm/vfp/vfpmodule.c
>> >> +++ b/arch/arm/vfp/vfpmodule.c
>> >> @@ -218,8 +218,7 @@ static void vfp_raise_sigfpe(unsigned int sicode, 
>> >> struct pt_regs *regs)
>> >>  {
>> >>   siginfo_t info;
>> >>  
>> >> - memset(&info, 0, sizeof(info));
>> >> -
>> >> + clear_siginfo(&info);
>> >>   info.si_signo = SIGFPE;
>> >
>> > /* by c11 (n1570) 6.2.6.1 para 6 [1], all padding bytes in info now take
>> >unspecified values */
>> >
>> >>   info.si_code = sicode;
>> >>   info.si_addr = (void __user *)(instruction_pointer(regs) - 4);
>> >
>> > /* by c11 (n1570) 6.2.6.1 para 7 [2], all bytes of the union info._sifields
>> >other than those corresponding to _sigfault take unspecified
>> >values */
>> >
>> > So I don't see why the compiler needs to ensure that any of the affected
>> > bytes are zero: it could potentially skip a lot of the memset() as a
>> > result, in theory.
>> >
>> > I've not seen a compiler actually take advantage of that, but I'm now
>> > not sure what forbids it.
>> 
>> I took a quick look at gcc-4.9 which I have handy.
>> 
>> The kernel passes -fno-strict-aliasing, which helps, and gcc actually
>> documents that if you access things through the union it will
>> not take advantage of C11.
>> 
>> gcc-4.9 documents it this way:
>> 
>> > -fstrict-aliasing'
>> >  Allow the compiler to assume the strictest aliasing rules
>> >  applicable to the language 

Re: [RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized

2018-04-17 Thread Eric W. Biederman
Dave Martin  writes:

> Hmmm
>
> memset()/clear_siginfo() may ensure that there are no uninitialised
> explicit fields except for those in inactive union members, but I'm not
> sure that this approach is guaranteed to sanitise the padding seen by
> userspace.
>
> Rationale below, though it's a bit theoretical...
>
> With this in mind, I tend to agree with Linus that hiding memset() calls
> from the maintainer may be a bad idea unless they are also hidden from
> the compiler.  If the compiler sees the memset() it may be able to
> optimise it in ways that wouldn't be possible for some other random
> external function call, including optimising all or part of the call
> out.
>
> As a result, the breakdown into individual put_user()s etc. in
> copy_siginfo_to_user() may still be valuable even if all paths have the
> memset().

The breakdown into individual put_user()s is known to be problematically
slow, and is actually wrong.

Even excluding the SI_USER duplication, in a small number of cases the
fields filled out in siginfo by architecture code are not the fields
that copy_siginfo_to_user is copying.  Which is much worse.  The code
looks safe but is not.

My intention is to leave 0 instances of clear_siginfo in the
architecture specific code.  Ideally struct siginfo will be limited to
kernel/signal.c but I am not certain I can quite get that far.
The function do_coredump appears to have a legit need for siginfo.


> (Rationale for an arch/arm example:)
>
>> diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
>> index 4c375e11ae95..adda3fc2dde8 100644
>> --- a/arch/arm/vfp/vfpmodule.c
>> +++ b/arch/arm/vfp/vfpmodule.c
>> @@ -218,8 +218,7 @@ static void vfp_raise_sigfpe(unsigned int sicode, struct 
>> pt_regs *regs)
>>  {
>>  siginfo_t info;
>>  
>> -memset(&info, 0, sizeof(info));
>> -
>> +clear_siginfo(&info);
>>  info.si_signo = SIGFPE;
>
> /* by c11 (n1570) 6.2.6.1 para 6 [1], all padding bytes in info now take
>unspecified values */
>
>>  info.si_code = sicode;
>>  info.si_addr = (void __user *)(instruction_pointer(regs) - 4);
>
> /* by c11 (n1570) 6.2.6.1 para 7 [2], all bytes of the union info._sifields
>other than those corresponding to _sigfault take unspecified
>values */
>
> So I don't see why the compiler needs to ensure that any of the affected
> bytes are zero: it could potentially skip a lot of the memset() as a
> result, in theory.
>
> I've not seen a compiler actually take advantage of that, but I'm now
> not sure what forbids it.

I took a quick look at gcc-4.9 which I have handy.

The kernel passes -fno-strict-aliasing, which helps, and gcc actually
documents that if you access things through the union it will
not take advantage of C11.

gcc-4.9 documents it this way:

> -fstrict-aliasing'
>  Allow the compiler to assume the strictest aliasing rules
>  applicable to the language being compiled.  For C (and C++), this
>  activates optimizations based on the type of expressions.  In
>  particular, an object of one type is assumed never to reside at the
>  same address as an object of a different type, unless the types are
>  almost the same.  For example, an 'unsigned int' can alias an
>  'int', but not a 'void*' or a 'double'.  A character type may alias
>  any other type.
> 
>  Pay special attention to code like this:
>   union a_union {
> int i;
> double d;
>   };
> 
>   int f() {
> union a_union t;
> t.d = 3.0;
> return t.i;
>   }
>  The practice of reading from a different union member than the one
>  most recently written to (called "type-punning") is common.  Even
>  with '-fstrict-aliasing', type-punning is allowed, provided the
>  memory is accessed through the union type.  So, the code above
>  works as expected.


> If this can happen, I only see two watertight workarounds:
>
> 1) Ensure that there is no implicit padding in any UAPI structure, e.g.
> aeb1f39d814b: ("arm64/ptrace: Avoid uninitialised struct padding in
> fpr_set()").  This would include tail-padding of any union member that
> is smaller than the containing union.
>
> It would be significantly more effort to ensure this for siginfo though.
>
> 2) Poke all values directly into allocated or user memory directly
> via pointers to paddingless types; never assign to objects on the kernel
> stack if you care what ends up in the padding, e.g., what your
> copy_siginfo_to_user() does prior to this series.
>
>
> If I'm not barking up the wrong tree, memset() cannot generally be
> used to determine the value of padding bytes, but it may still be
> useful for forcing otherwise uninitialised members to sane initial
> values.
>
> This likely affects many more things than just siginfo.

Unless gcc has changed its stance on type-punning through unions
or its semantics with -fno-strict-aliasing we should be good.
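
The padding point is easy to demonstrate in plain userspace C:
memset() is defined byte-wise, so it zeroes padding too, whereas a
brace initializer only guarantees the named members (a generic C
illustration, not kernel code):

```c
#include <string.h>

/* A struct with implicit padding between c and l on most ABIs. */
struct padded {
	char c;
	long l;
};

/*
 * memset is specified to store into every byte of the object, so the
 * padding bytes between c and l are zeroed along with the members.
 */
static void zero_all_bytes(struct padded *p)
{
	memset(p, 0, sizeof(*p));
}
```

This is why the conversion above insists on memset/clear_siginfo
rather than on structure initializers.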

Eric


Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER

2018-04-15 Thread Eric W. Biederman
Linus Torvalds <torva...@linux-foundation.org> writes:

> (
>
> On Sun, Apr 15, 2018 at 8:56 AM, Eric W. Biederman
> <ebied...@xmission.com> wrote:
>>
>> Would you consider the patchset below for -rc2?
>
> Ugh.

The point of this series is to squash the potential for regressions even
from the weird broken code that fills in fields for si_code 0 that do
not match SI_USER.

The copy_siginfo_to_user32 never handled that so we don't need to worry
about that.  All of the signals that abuse si_code 0 go through
force_sig_info so signalfd can not catch them.  Which means that only
copy_siginfo_to_user needs to be worried about.

Last time I was benchmarking I could not see a difference between
copying the entire siginfo and only a small part so I don't see
the point of optimizing now.

For an actual cleanup I intend to go farther than you are proposing and
convert everything to the set of helper functions I have added to
kernel/signal.c force_sig_fault, force_sig_mceerr, force_sig_bnderr,
force_sig_pkuerr.

When I am done there will be maybe 5 instances of clear_siginfo outside
of those helper functions.  At which point I won't care if we remove
clear_siginfo and just use memset.  That is a real improvement
that looks something like: "105 files changed, 933 insertions(+), 1698 deletions(-)".
That is my end goal with all of this.

You complained to me about regressions and you are right with the
current handling of FPE_FIXME TRAP_FIXME the code is not bug compatible
with old versions of linux, and people have noticed.

What this patchset represents is the bare minimum needed to convert
copy_siginfo_to_user to a copy_to_user, and remove the possibility
of regressions.

To that end I need to ensure every struct siginfo in the entire
kernel is memset to 0.  A structure initializer is absolutely
not enough.  So I don't mind people thinking clear_siginfo is necessary.


To that end clear_siginfo, which does not take the size of
struct siginfo as a parameter, is much less susceptible to typos,
and allows me to ensure every instance of struct siginfo is fully
initialized.
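
A sketch of why an object-taking helper is less typo-prone than an
open-coded memset (clear_modelinfo is an illustrative macro, not the
kernel's actual clear_siginfo):

```c
#include <string.h>

/*
 * Because the size is computed from the pointed-to object itself,
 * there is no sizeof argument to get wrong -- unlike an open-coded
 * memset(&info, 0, sizeof(info)), where a stray sizeof(&info) or a
 * copy-pasted sizeof of the wrong variable silently clears the
 * wrong number of bytes.
 */
#define clear_modelinfo(info) memset((info), 0, sizeof(*(info)))
```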

I fully intend to remove practically every instance of struct siginfo
from the architecture specific code, and use the aforementioned helpers.
The trouble is that some of the architecture code need refactoring
for that to happen.  Even your proposed init_siginfo can't be widely
used as a common pattern is to fill in siginfo partially in the
code with si_code and si_signo being filled in almost last.

So in short.  I intend to remove most of these clear_siginfo calls when
I remove the siginfos later.  This series is focused on making
copy_siginfo_to_user simple enough we don't need to use siginfo_layout,
and thus can be fully backwards compatibile with ourselves and we won't
need to worry about regressions.  I have aimed to keep this simple
enough we can merge this after -rc1 because we don't want regressions.

Eric

ps.  I intend to place this change first in my series whether or not it makes
it into -rc2 so that I can be certain we remove any possible regressions
in behavior on the buggy architectures.  Then I can take my time and
ensure the non-trivial changes of refactoring etc are done carefully so
I don't introduce other bugs.  I need that so I can sleep at night.

pps.  I can look at some of your other suggestions but cleverness leads
to regressions, and if you are going to complain at me harshly when I
have been being careful and taking things seriously I am not
particularly willing to take unnecessary chances.


Re: sparc/ppc/arm compat siginfo ABI regressions: sending SIGFPE via kill() returns wrong values in si_pid and si_uid

2018-04-15 Thread Eric W. Biederman
Russell King - ARM Linux  writes:

> On Fri, Apr 13, 2018 at 12:53:49PM -0700, Linus Torvalds wrote:
>> On Fri, Apr 13, 2018 at 11:45 AM, Dave Martin  wrote:
>> >
>> > Most uses I've seen do nothing more than use the FPE_xyz value to
>> > format diagnostic messages while dying.  I struggled to find code that
>> > made a meaningful functional decision based on the value, though that's
>> > not proof...
>> 
>> Yeah. I've seen code that cares about SIGFPE deeply, but it's almost
>> invariably about some emulated environment (eg Java VM, or CPU
>> emulation).
>> 
>> And the siginfo data is basically never good enough for those
>> environments anyway on its own, so they will go and look at the actual
>> instruction that caused the fault and the register state instead,
>> because they need *all* the information.
>> 
>> The cases that use si_code are the ones that just trapped signals in
>> order to give a more helpful abort message.
>> 
>> So I could certainly imagine that si_code is actually used by somebody
>> who then decides to actuall act differently on it, but aside from
>> perhaps printing out a different message, it sounds far-fetched.
>
> Okay, in that case let's just use FPE_FLTINV.  That makes the patch
> easily back-portable for stable kernels.

If we want to I don't think  backporting 266da65e9156 ("signal: Add
FPE_FLTUNK si_code for undiagnosable fp exceptions") would be at
all difficult.

What it is changing has been stable for quite a while.  The surroundings
might change and so it might require some trivial manual fixup but I
don't expect any problems.

Not that I want to derail the consensus but if we want to backport
similar fixes for arm64 or the other architectures that wind up using
FPE_FLTUNK for their fix we would need to backport 266da65e9156 anyway.

Eric



[RFC PATCH 3/3] signal: Stop special casing TRAP_FIXME and FPE_FIXME in siginfo_layout

2018-04-15 Thread Eric W. Biederman

After more experience with the cases where the si_code of 0 is used
both as a signal specific si_code and as SI_USER, it appears that
no one cares about the signal specific si_code case and the good
solution is to just fix the architectures by using a different si_code.

In none of the conversations has anyone even suggested that anything
depends on the signal specific redefinition of SI_USER.

There are at least test cases that care when a si_code of 0 does not
work as SI_USER.

So make things simple and keep the generic code from introducing
problems by removing the special casing of TRAP_FIXME and FPE_FIXME.
This will ensure the generic case of sending a signal with kill will
always set SI_USER and work.

The architecture specific, and signal specific overloads that set
si_code to 0 will now have problems with signalfd and the 32bit compat
versions of siginfo copying.  At least until they are fixed.

Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 kernel/signal.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index d56f4d496c89..fc82d2c0918f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2835,15 +2835,6 @@ enum siginfo_layout siginfo_layout(int sig, int si_code)
layout = SIL_POLL;
else if (si_code < 0)
layout = SIL_RT;
-   /* Tests to support buggy kernel ABIs */
-#ifdef TRAP_FIXME
-   if ((sig == SIGTRAP) && (si_code == TRAP_FIXME))
-   layout = SIL_FAULT;
-#endif
-#ifdef FPE_FIXME
-   if ((sig == SIGFPE) && (si_code == FPE_FIXME))
-   layout = SIL_FAULT;
-#endif
}
return layout;
 }
-- 
2.14.1



[RFC PATCH 2/3] signal: Reduce copy_siginfo_to_user to just copy_to_user

2018-04-15 Thread Eric W. Biederman

Now that every instance of struct siginfo is now initialized it is no
longer necessary to copy struct siginfo piece by piece to userspace
but instead the entire structure can be copied.

As well as making the code simpler and more efficient, this means that
copy_siginfo_to_user no longer cares which union member of struct
siginfo is in use.

In practice this means that all 32bit architectures that define
FPE_FIXME will properly send SI_USER when SIGFPE is sent with kill(),
while still performing their historic architectural brokenness when 0
is used as the si_code of a floating point signal.  This matches the
current behavior of 64bit architectures that define FPE_FIXME, which
get lucky: an overloaded SI_USER has continued to work through
copy_siginfo_to_user because the 8 byte si_addr occupies the same bytes
in struct siginfo as the 4 byte si_pid and the 4 byte si_uid.

Problematic architectures still need to fix their ABI so that signalfd
and 32bit compat code will work properly.
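
The layout overlap described above can be sketched with a model
(model_siginfo is illustrative, not the real kernel definition): both
union members start at the same offset, so on a 64bit ABI the 8 byte
pointer covers the storage of the two 4 byte integers:

```c
#include <stddef.h>

/*
 * Model of the overlap: the kill-style and fault-style members share
 * storage, so copying 8 bytes of si_addr happens to carry the 4 byte
 * si_pid and 4 byte si_uid along with it -- the "got lucky" case.
 */
struct model_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		struct {
			int si_pid;
			int si_uid;
		} kill;
		struct {
			void *si_addr;
		} fault;
	} fields;
};
```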

Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 kernel/signal.c | 84 ++---
 1 file changed, 2 insertions(+), 82 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index d4ccea599692..d56f4d496c89 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2850,89 +2850,9 @@ enum siginfo_layout siginfo_layout(int sig, int si_code)
 
 int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 {
-   int err;
-
-   if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t)))
+   if (copy_to_user(to, from, sizeof(struct siginfo)))
return -EFAULT;
-   if (from->si_code < 0)
-   return __copy_to_user(to, from, sizeof(siginfo_t))
-   ? -EFAULT : 0;
-   /*
-* If you change siginfo_t structure, please be sure
-* this code is fixed accordingly.
-* Please remember to update the signalfd_copyinfo() function
-* inside fs/signalfd.c too, in case siginfo_t changes.
-* It should never copy any pad contained in the structure
-* to avoid security leaks, but must copy the generic
-* 3 ints plus the relevant union member.
-*/
-   err = __put_user(from->si_signo, &to->si_signo);
-   err |= __put_user(from->si_errno, &to->si_errno);
-   err |= __put_user(from->si_code, &to->si_code);
-   switch (siginfo_layout(from->si_signo, from->si_code)) {
-   case SIL_KILL:
-   err |= __put_user(from->si_pid, &to->si_pid);
-   err |= __put_user(from->si_uid, &to->si_uid);
-   break;
-   case SIL_TIMER:
-   /* Unreached SI_TIMER is negative */
-   break;
-   case SIL_POLL:
-   err |= __put_user(from->si_band, &to->si_band);
-   err |= __put_user(from->si_fd, &to->si_fd);
-   break;
-   case SIL_FAULT:
-   err |= __put_user(from->si_addr, &to->si_addr);
-#ifdef __ARCH_SI_TRAPNO
-   err |= __put_user(from->si_trapno, &to->si_trapno);
-#endif
-#ifdef __ia64__
-   err |= __put_user(from->si_imm, &to->si_imm);
-   err |= __put_user(from->si_flags, &to->si_flags);
-   err |= __put_user(from->si_isr, &to->si_isr);
-#endif
-   /*
-* Other callers might not initialize the si_lsb field,
-* so check explicitly for the right codes here.
-*/
-#ifdef BUS_MCEERR_AR
-   if (from->si_signo == SIGBUS && from->si_code == BUS_MCEERR_AR)
-   err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);
-#endif
-#ifdef BUS_MCEERR_AO
-   if (from->si_signo == SIGBUS && from->si_code == BUS_MCEERR_AO)
-   err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);
-#endif
-#ifdef SEGV_BNDERR
-   if (from->si_signo == SIGSEGV && from->si_code == SEGV_BNDERR) {
-   err |= __put_user(from->si_lower, &to->si_lower);
-   err |= __put_user(from->si_upper, &to->si_upper);
-   }
-#endif
-#ifdef SEGV_PKUERR
-   if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
-   err |= __put_user(from->si_pkey, &to->si_pkey);
-#endif
-   break;
-   case SIL_CHLD:
-   err |= __put_user(from->si_pid, &to->si_pid);
-   err |= __put_user(from->si_uid, &to->si_uid);
-   err |= __put_user(from->si_status, &to->si_status);
-   err |= __put_user(from->si_utime, &to->si_utime);
-   err |= __put_user(from->si_stime, &to->si_stime);
-   break;
-   case SIL_RT:
-   err |= __put_user(from->si_pid, &to->si_pid);
-   err |= __put_user(from->si_uid, &to->si_uid);
-   

[RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized

2018-04-15 Thread Eric W. Biederman

Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.

Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.

The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.

In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.

Instances of using memset to initialize siginfo have been replaced
with calls to clear_siginfo for clarity.

Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 arch/alpha/kernel/osf_sys.c   |  1 +
 arch/alpha/kernel/signal.c|  2 ++
 arch/alpha/kernel/traps.c |  5 +
 arch/alpha/mm/fault.c |  2 ++
 arch/arc/mm/fault.c   |  2 ++
 arch/arm/kernel/ptrace.c  |  1 +
 arch/arm/kernel/swp_emulate.c |  1 +
 arch/arm/kernel/traps.c   |  5 +
 arch/arm/mm/alignment.c   |  1 +
 arch/arm/mm/fault.c   |  5 +
 arch/arm/vfp/vfpmodule.c  |  3 +--
 arch/arm64/kernel/fpsimd.c|  2 +-
 arch/arm64/kernel/sys_compat.c|  1 +
 arch/arm64/kernel/traps.c |  1 +
 arch/arm64/mm/fault.c | 18 --
 arch/c6x/kernel/traps.c   |  1 +
 arch/hexagon/kernel/traps.c   |  1 +
 arch/hexagon/mm/vm_fault.c|  1 +
 arch/ia64/kernel/brl_emu.c|  1 +
 arch/ia64/kernel/signal.c |  2 ++
 arch/ia64/kernel/traps.c  | 27 ---
 arch/ia64/kernel/unaligned.c  |  1 +
 arch/ia64/mm/fault.c  |  4 +++-
 arch/m68k/kernel/traps.c  |  2 ++
 arch/microblaze/kernel/exceptions.c   |  1 +
 arch/microblaze/mm/fault.c|  4 +++-
 arch/mips/mm/fault.c  |  1 +
 arch/nds32/kernel/traps.c |  6 +-
 arch/nds32/mm/fault.c |  1 +
 arch/nios2/kernel/traps.c |  1 +
 arch/openrisc/kernel/traps.c  |  5 -
 arch/openrisc/mm/fault.c  |  1 +
 arch/parisc/kernel/ptrace.c   |  1 +
 arch/parisc/kernel/traps.c|  2 ++
 arch/parisc/kernel/unaligned.c|  1 +
 arch/parisc/math-emu/driver.c |  1 +
 arch/parisc/mm/fault.c|  1 +
 arch/powerpc/kernel/process.c |  1 +
 arch/powerpc/kernel/traps.c   |  3 +--
 arch/powerpc/mm/fault.c   |  1 +
 arch/powerpc/platforms/cell/spufs/fault.c |  2 +-
 arch/riscv/kernel/traps.c |  1 +
 arch/s390/kernel/traps.c  |  5 -
 arch/s390/mm/fault.c  |  2 ++
 arch/sh/kernel/hw_breakpoint.c|  1 +
 arch/sh/kernel/traps_32.c |  2 ++
 arch/sh/math-emu/math.c   |  1 +
 arch/sh/mm/fault.c|  1 +
 arch/sparc/kernel/process_64.c|  1 +
 arch/sparc/kernel/sys_sparc_32.c  |  1 +
 arch/sparc/kernel/traps_32.c  | 10 ++
 arch/sparc/kernel/traps_64.c  | 14 ++
 arch/sparc/kernel/unaligned_32.c  |  1 +
 arch/sparc/mm/fault_32.c  |  1 +
 arch/sparc/mm/fault_64.c  |  1 +
 arch/um/kernel/trap.c |  2 ++
 arch/unicore32/kernel/fpu-ucf64.c |  2 +-
 arch/unicore32/mm/fault.c |  3 +++
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/kernel/ptrace.c  |  2 +-
 arch/x86/kernel/traps.c   |  3 +++
 arch/x86/kernel/umip.c|  1 +
 arch/x86/kvm/mmu.c|  1 +
 arch/x86/mm/fault.c   |  1 +
 arch/xtensa/kernel/traps.c|  1 +
 arch/xtensa/mm/fault.c|  1 +
 include/linux/ptrace.h|  1 -
 include/linux/tracehook.h |  1 +
 virt/kvm/arm/mmu.c|  1 +
 69 files changed, 163 insertions(+), 24 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 89faa6f4de47..8ad689d6a0e4 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -881,6 +881,7 @@ SYSCALL_DEFINE5(osf_setsysinfo, unsigned long, op, void 
__user *, buffer,
if (fex & IEEE_TRAP_ENABLE_DZE) si_code = FPE_FLTDIV;
if (fex & IEEE_TRAP_ENABLE_INV) si_code = FPE_FLTINV;
 
+

[RFC PATCH 0/3] Dealing with the aliases of SI_USER

2018-04-15 Thread Eric W. Biederman

Linus,

Would you consider the patchset below for -rc2?

Dealing with the aliases of SI_USER has been a challenge as we have had
a b0rked ABI in some cases since 2.5.

So far no one except myself has suggested that changing the si_code
from 0 to something else for those problematic aliases of SI_USER is
going to be a problem.  So it looks like just fixing the issue is a real
possibility.

Fixing the cases that do kill(SIGFPE, ...) seems important, because
at least test cases care.

The best fixes to backport appear to be the real architecture fixes that
remove the aliases for SI_USER as those are deep fixes that
fundamentally fix the problems, and are also very small changes.

I am not yet brave enough to merge architectural fixes like that without
arch maintainers' buy-in.  Getting at least an ack if nothing else takes
a little bit of time.

Still we have an arm fix upthread and David Miller has given his nod
to a sparc fix that uses FPE_FLTUNK.  So it appears real architecture
fixes are progressing.  Further I have looked and that leaves only
powerpc, parisc, ia64, and alpha.   The new si_code FPE_FLTUNK appears
to address most of those, and there is an untested patch for parisc.

So real progress appears possible.

The generic code can do better, and that is what this rfc patchset is
about.  It ensures siginfo is fully initialized and uses copy_to_user
to copy siginfo to userspace.  This takes siginfo_layout out of
the picture and so for non-compat non-signalfd siginfos the status
quo returns to what it was before I introduced siginfo_layout (AKA
regressions go bye-bye).

I believe, given the issues, these changes are a candidate for -rc2.
Otherwise I will keep these changes for the next merge window.

Eric W. Biederman (3):
  signal: Ensure every siginfo we send has all bits initialized
  signal: Reduce copy_siginfo_to_user to just copy_to_user
  signal: Stop special casing TRAP_FIXME and FPE_FIXME in siginfo_layout

 arch/alpha/kernel/osf_sys.c   |  1 +
 arch/alpha/kernel/signal.c|  2 +
 arch/alpha/kernel/traps.c |  5 ++
 arch/alpha/mm/fault.c |  2 +
 arch/arc/mm/fault.c   |  2 +
 arch/arm/kernel/ptrace.c  |  1 +
 arch/arm/kernel/swp_emulate.c |  1 +
 arch/arm/kernel/traps.c   |  5 ++
 arch/arm/mm/alignment.c   |  1 +
 arch/arm/mm/fault.c   |  5 ++
 arch/arm/vfp/vfpmodule.c  |  3 +-
 arch/arm64/kernel/fpsimd.c|  2 +-
 arch/arm64/kernel/sys_compat.c|  1 +
 arch/arm64/kernel/traps.c |  1 +
 arch/arm64/mm/fault.c | 18 --
 arch/c6x/kernel/traps.c   |  1 +
 arch/hexagon/kernel/traps.c   |  1 +
 arch/hexagon/mm/vm_fault.c|  1 +
 arch/ia64/kernel/brl_emu.c|  1 +
 arch/ia64/kernel/signal.c |  2 +
 arch/ia64/kernel/traps.c  | 27 -
 arch/ia64/kernel/unaligned.c  |  1 +
 arch/ia64/mm/fault.c  |  4 +-
 arch/m68k/kernel/traps.c  |  2 +
 arch/microblaze/kernel/exceptions.c   |  1 +
 arch/microblaze/mm/fault.c|  4 +-
 arch/mips/mm/fault.c  |  1 +
 arch/nds32/kernel/traps.c |  6 +-
 arch/nds32/mm/fault.c |  1 +
 arch/nios2/kernel/traps.c |  1 +
 arch/openrisc/kernel/traps.c  |  5 +-
 arch/openrisc/mm/fault.c  |  1 +
 arch/parisc/kernel/ptrace.c   |  1 +
 arch/parisc/kernel/traps.c|  2 +
 arch/parisc/kernel/unaligned.c|  1 +
 arch/parisc/math-emu/driver.c |  1 +
 arch/parisc/mm/fault.c|  1 +
 arch/powerpc/kernel/process.c |  1 +
 arch/powerpc/kernel/traps.c   |  3 +-
 arch/powerpc/mm/fault.c   |  1 +
 arch/powerpc/platforms/cell/spufs/fault.c |  2 +-
 arch/riscv/kernel/traps.c |  1 +
 arch/s390/kernel/traps.c  |  5 +-
 arch/s390/mm/fault.c  |  2 +
 arch/sh/kernel/hw_breakpoint.c|  1 +
 arch/sh/kernel/traps_32.c |  2 +
 arch/sh/math-emu/math.c   |  1 +
 arch/sh/mm/fault.c|  1 +
 arch/sparc/kernel/process_64.c|  1 +
 arch/sparc/kernel/sys_sparc_32.c  |  1 +
 arch/sparc/kernel/traps_32.c  | 10 
 arch/sparc/kernel/traps_64.c  | 14 +
 arch/sparc/kernel/unaligned_32.c  |  1 +
 arch/sparc/mm/fault_32.c  |  1 +
 arch/sparc/mm/fault_64.c  |  1 +
 arch/um/kernel/trap.c |  2 +
 arch/unicore32/kernel/fpu-ucf64.c |  2 +-
 arch/unicore32/mm/fault.c |  3 +
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/kernel/ptrace.c  |  2

Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7

2018-02-07 Thread Eric W. Biederman
Khalid Aziz  writes:

> On 02/07/2018 12:38 AM, ebied...@xmission.com wrote:
>> Khalid Aziz  writes:
>>
>>> On 02/01/2018 07:29 PM, ebied...@xmission.com wrote:
 Khalid Aziz  writes:

> V11 changes:
> This series is same as v10 and was simply rebased on 4.15 kernel. Can
> mm maintainers please review patches 2, 7, 8 and 9 which are arch
> independent, and include/linux/mm.h and mm/ksm.c changes in patch 10
> and ack these if everything looks good?

 I am a bit puzzled how this differs from the pkey's that other
 architectures are implementing to achieve a similar result.

 I am a bit mystified why you don't store the tag in a vma
 instead of inventing a new way to store data on page out.
>>>
>>> Hello Eric,
>>>
>>> As Steven pointed out, sparc sets tags per cacheline unlike pkey. This
>>> results in much finer granularity for tags than pkey and hence requires
>>> larger tag storage than what we can do in a vma.
>>
>> *Nod*   I am a bit mystified where you keep the information in memory.
>> I would think the tags would need to be stored per cacheline or per
>> tlb entry, in some kind of cache that could overflow.  So I would be
>> surprised if swapping is the only time this information needs to be stored
>> in memory.  Which makes me wonder if you have the proper data
>> structures.
>>
>> I would think an array per vma or something in the page tables would
>> tend to make sense.
>>
>> But perhaps I am missing something.
>
> The ADI tags are stored in spare bits in the RAM. ADI tag storage is
> managed entirely by the memory controller which maintains these tags per
> ADI block. An ADI block is the same size as a cacheline on M7. Tags for
> each ADI block are associated with the physical ADI block, not the
> virtual address. When a physical page is reused, the physical ADI tag
> storage for that page is overwritten with new ADI tags, hence we need
> to store away the tags when we swap out a page. The kernel updates the
> ADI tags for a physical page when it swaps a new page in. Each vma can
> cover a variable number of pages so it is best to store a pointer to the
> tag storage in the vma as opposed to actual tags in an array. Each 8K
> page can have 128 tags on it. Since each tag is 4 bits, we need 64 bytes
> per page to store the tags. That can add up for a large vma.

If the tags are already stored in RAM I can see why it does not make any
sense to store them except on page out.  Management wise this feels a
lot like the encrypted memory options I have been seeing on x86.

 Can you please use force_sig_fault to send these signals instead
 of force_sig_info.  Empirically I have found that it is very
 error prone to generate siginfo's by hand, especially on code
 paths where several different si_codes may apply.  So it helps
 to go through a helper function to ensure the fiddly bits are
 all correct.  AKA the unused bits all need to be set to zero before
 struct siginfo is copied to userspace.

>>>
>>> What you say makes sense. I followed the same code as other fault handlers 
>>> for
>>> sparc. I could change just the fault handlers for ADI related faults. Would 
>>> it
>>> make more sense to change all the fault handlers in a separate patch and 
>>> keep
>>> the code in arch/sparc/kernel/traps_64.c consistent? Dave M, do you have a
>>> preference?
>>
>> It is my intention post -rc1 to start sending out patches to get the
>> rest of not just sparc but all of the architectures using the new
>> helpers.  I have the code; I just ran out of time before the merge
>> window opened to ensure everything had a good thorough review.
>>
>> So if you can handle your new changes I expect I will handle the
>> rest.
>>
>
> I can add a patch at the end of my series to update all
> force_sig_info() in my patchset to force_sig_fault(). That will sync
> my patches up with your changes cleanly. Does that work for you? I can
> send an updated series with this change. Can you review and ack the
> patches after this change?

One additional patch would be fine.  I can certainly review and ack that
part.  You probably want to wait until post -rc1 so that you have a
clean base to work off of.

Eric




Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7

2018-02-07 Thread Eric W. Biederman
Khalid Aziz  writes:

> On 02/01/2018 07:29 PM, ebied...@xmission.com wrote:
>> Khalid Aziz  writes:
>>
>>> V11 changes:
>>> This series is same as v10 and was simply rebased on 4.15 kernel. Can
>>> mm maintainers please review patches 2, 7, 8 and 9 which are arch
>>> independent, and include/linux/mm.h and mm/ksm.c changes in patch 10
>>> and ack these if everything looks good?
>>
>> I am a bit puzzled how this differs from the pkey's that other
>> architectures are implementing to achieve a similar result.
>>
>> I am a bit mystified why you don't store the tag in a vma
>> instead of inventing a new way to store data on page out.
>
> Hello Eric,
>
> As Steven pointed out, sparc sets tags per cacheline unlike pkey. This results
> in much finer granularity for tags than pkey and hence requires larger tag
> storage than what we can do in a vma.

*Nod*   I am a bit mystified where you keep the information in memory.
I would think the tags would need to be stored per cacheline or per
tlb entry, in some kind of cache that could overflow.  So I would be
surprised if swapping is the only time this information needs to be stored
in memory.  Which makes me wonder if you have the proper data
structures.

I would think an array per vma or something in the page tables would
tend to make sense.

But perhaps I am missing something.

>> Can you please use force_sig_fault to send these signals instead
>> of force_sig_info.  Empirically I have found that it is very
>> error prone to generate siginfo's by hand, especially on code
>> paths where several different si_codes may apply.  So it helps
>> to go through a helper function to ensure the fiddly bits are
>> all correct.  AKA the unused bits all need to be set to zero before
>> struct siginfo is copied to userspace.
>>
>
> What you say makes sense. I followed the same code as other fault handlers for
> sparc. I could change just the fault handlers for ADI related faults. Would it
> make more sense to change all the fault handlers in a separate patch and keep
> the code in arch/sparc/kernel/traps_64.c consistent? Dave M, do you have a
> preference?

It is my intention post -rc1 to start sending out patches to get the
rest of not just sparc but all of the architectures using the new
helpers.  I have the code; I just ran out of time before the merge
window opened to ensure everything had a good thorough review.

So if you can handle your new changes I expect I will handle the
rest.

Eric


Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7

2018-02-01 Thread Eric W. Biederman
Khalid Aziz  writes:

> V11 changes:
> This series is same as v10 and was simply rebased on 4.15 kernel. Can
> mm maintainers please review patches 2, 7, 8 and 9 which are arch
> independent, and include/linux/mm.h and mm/ksm.c changes in patch 10
> and ack these if everything looks good?

I am a bit puzzled how this differs from the pkey's that other
architectures are implementing to achieve a similar result.

I am a bit mystified why you don't store the tag in a vma
instead of inventing a new way to store data on page out.

Can you please use force_sig_fault to send these signals instead
of force_sig_info.  Empirically I have found that it is very
error prone to generate siginfo's by hand, especially on code
paths where several different si_codes may apply.  So it helps
to go through a helper function to ensure the fiddly bits are
all correct.  AKA the unused bits all need to be set to zero before
struct siginfo is copied to userspace.

Eric



Re: [PATCH v10 27/27] mm: display pkey in smaps if arch_pkeys_enabled() is true

2018-01-19 Thread Eric W. Biederman
Ram Pai <linux...@us.ibm.com> writes:

> On Fri, Jan 19, 2018 at 10:09:41AM -0600, Eric W. Biederman wrote:
>> Ram Pai <linux...@us.ibm.com> writes:
>> 
>> > Currently the  architecture  specific code is expected to
>> > display  the  protection  keys  in  smap  for a given vma.
>> > This can lead to redundant code and possibly to divergent
>> > formats in which the key gets displayed.
>> >
>> > This  patch  changes  the implementation. It displays the
>> > pkey only if the architecture supports pkeys.
>> >
>> > x86 arch_show_smap() function is not needed anymore.
>> > Delete it.
>> >
>> > Signed-off-by: Ram Pai <linux...@us.ibm.com>
>> > ---
>> >  arch/x86/kernel/setup.c |8 
>> >  fs/proc/task_mmu.c  |   11 ++-
>> >  2 files changed, 6 insertions(+), 13 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> > index 8af2e8d..ddf945a 100644
>> > --- a/arch/x86/kernel/setup.c
>> > +++ b/arch/x86/kernel/setup.c
>> > @@ -1326,11 +1326,3 @@ static int __init 
>> > register_kernel_offset_dumper(void)
>> >return 0;
>> >  }
>> >  __initcall(register_kernel_offset_dumper);
>> > -
>> > -void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>> > -{
>> > -  if (!boot_cpu_has(X86_FEATURE_OSPKE))
>> > -  return;
>> > -
>> > -  seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
>> > -}
>> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> > index 0edd4da..4b39a94 100644
>> > --- a/fs/proc/task_mmu.c
>> > +++ b/fs/proc/task_mmu.c
>> > @@ -18,6 +18,7 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> >  
>> >  #include 
>> >  #include 
>> > @@ -728,10 +729,6 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned 
>> > long hmask,
>> >  }
>> >  #endif /* HUGETLB_PAGE */
>> >  
>> > -void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>> > -{
>> > -}
>> > -
>> >  static int show_smap(struct seq_file *m, void *v, int is_pid)
>> >  {
>> >struct proc_maps_private *priv = m->private;
>> > @@ -851,9 +848,13 @@ static int show_smap(struct seq_file *m, void *v, int 
>> > is_pid)
>> >   (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));
>> >  
>> >if (!rollup_mode) {
>> > -  arch_show_smap(m, vma);
>> > +#ifdef CONFIG_ARCH_HAS_PKEYS
>> > +  if (arch_pkeys_enabled())
>> > +  seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
>> > +#endif
>> 
>> Would it be worth it making vma_pkey a noop on architectures that don't
>> support protection keys so that we don't need the #ifdef here?
>
> You mean something like this?
>   #define vma_pkey(vma)  
> It will lead to a compilation error.
>
>
> I can make it
>   #define vma_pkey(vma)  0
>
> and that will work and get rid of the #ifdef

Yes the second is what I was thinking.

I don't know if it is worth it but #ifdefs can be problematic as the
result in code not being compile tested.

Eric



Re: [PATCH v10 27/27] mm: display pkey in smaps if arch_pkeys_enabled() is true

2018-01-19 Thread Eric W. Biederman
Ram Pai  writes:

> Currently the  architecture  specific code is expected to
> display  the  protection  keys  in  smap  for a given vma.
> This can lead to redundant code and possibly to divergent
> formats in which the key gets displayed.
>
> This  patch  changes  the implementation. It displays the
> pkey only if the architecture supports pkeys.
>
> x86 arch_show_smap() function is not needed anymore.
> Delete it.
>
> Signed-off-by: Ram Pai 
> ---
>  arch/x86/kernel/setup.c |8 
>  fs/proc/task_mmu.c  |   11 ++-
>  2 files changed, 6 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 8af2e8d..ddf945a 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1326,11 +1326,3 @@ static int __init register_kernel_offset_dumper(void)
>   return 0;
>  }
>  __initcall(register_kernel_offset_dumper);
> -
> -void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
> -{
> - if (!boot_cpu_has(X86_FEATURE_OSPKE))
> - return;
> -
> - seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> -}
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0edd4da..4b39a94 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -728,10 +729,6 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
> hmask,
>  }
>  #endif /* HUGETLB_PAGE */
>  
> -void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
> -{
> -}
> -
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>   struct proc_maps_private *priv = m->private;
> @@ -851,9 +848,13 @@ static int show_smap(struct seq_file *m, void *v, int 
> is_pid)
>  (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));
>  
>   if (!rollup_mode) {
> - arch_show_smap(m, vma);
> +#ifdef CONFIG_ARCH_HAS_PKEYS
> + if (arch_pkeys_enabled())
> + seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> +#endif

Would it be worth it making vma_pkey a noop on architectures that don't
support protection keys so that we don't need the #ifdef here?

Eric


>   show_smap_vma_flags(m, vma);
>   }
> +
>   m_cache_vma(m, vma);
>   return ret;
>  }


[PATCH 06/11] signal/powerpc: Document conflicts with SI_USER and SIGFPE and SIGTRAP

2018-01-11 Thread Eric W. Biederman
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER.  Posix and common sense requires
that SI_USER not be a signal specific si_code.  As such this use of 0
for the si_code is a pretty horribly broken ABI.

Further, use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL, with the result
that the uid and pid fields are copied, and the si_addr field might be
copied by accident but certainly not by design.  This makes for a very
flaky implementation.

Utilizing FPE_FIXME and TRAP_FIXME, siginfo_layout() will now return
SIL_FAULT and the appropriate fields will be reliably copied.

Possible ABI fixes include:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Paul Mackerras <pau...@samba.org>
Cc: Kumar Gala <kumar.g...@freescale.com>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc:  linuxppc-dev@lists.ozlabs.org
Ref: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Ref: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebied...@xmission.com>
---
 arch/powerpc/include/uapi/asm/siginfo.h | 15 +++
 arch/powerpc/kernel/traps.c | 10 +-
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/siginfo.h b/arch/powerpc/include/uapi/asm/siginfo.h
index 1a691141e49f..444ca6c9989a 100644
--- a/arch/powerpc/include/uapi/asm/siginfo.h
+++ b/arch/powerpc/include/uapi/asm/siginfo.h
@@ -18,4 +18,19 @@
 #undef NSIGTRAP
 #define NSIGTRAP   4
 
+/*
+ * SIGFPE si_codes
+ */
+#ifdef __KERNEL__
+#define FPE_FIXME  0   /* Broken dup of SI_USER */
+#endif /* __KERNEL__ */
+
+/*
+ * SIGTRAP si_codes
+ */
+#ifdef __KERNEL__
+#define TRAP_FIXME 0   /* Broken dup of SI_USER */
+#endif /* __KERNEL__ */
+
+
 #endif /* _ASM_POWERPC_SIGINFO_H */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index f3eb61be0d30..f2e6e1838952 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -917,7 +917,7 @@ void unknown_exception(struct pt_regs *regs)
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
-   _exception(SIGTRAP, regs, 0, 0);
+   _exception(SIGTRAP, regs, TRAP_FIXME, 0);
 
exception_exit(prev_state);
 }
@@ -939,7 +939,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs)
 
 void RunModeException(struct pt_regs *regs)
 {
-   _exception(SIGTRAP, regs, 0, 0);
+   _exception(SIGTRAP, regs, TRAP_FIXME, 0);
 }
 
 void single_step_exception(struct pt_regs *regs)
@@ -978,7 +978,7 @@ static void emulate_single_step(struct pt_regs *regs)
 
 static inline int __parse_fpscr(unsigned long fpscr)
 {
-   int ret = 0;
+   int ret = FPE_FIXME;
 
/* Invalid operation */
if ((fpscr & FPSCR_VE) && (fpscr & FPSCR_VX))
@@ -1929,7 +1929,7 @@ void SPEFloatingPointException(struct pt_regs *regs)
extern int do_spe_mathemu(struct pt_regs *regs);
unsigned long spefscr;
int fpexc_mode;
-   int code = 0;
+   int code = FPE_FIXME;
int err;
 
flush_spe_to_thread(current);
@@ -1998,7 +1998,7 @@ void SPEFloatingPointRoundException(struct pt_regs *regs)
printk(KERN_ERR "unrecognized spe instruction "
   "in %s at %lx\n", current->comm, regs->nip);
} else {
-   _exception(SIGFPE, regs, 0, regs->nip);
+   _exception(SIGFPE, regs, FPE_FIXME, regs->nip);
return;
}
 }
-- 
2.14.1



Re: [RFC v6 35/62] powerpc: Deliver SEGV signal on pkey violation

2017-08-19 Thread Eric W. Biederman
Ram Pai  writes:

> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index d4e545d..fe1e7c7 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -247,6 +248,15 @@ void user_single_step_siginfo(struct task_struct *tsk,
>   info->si_addr = (void __user *)regs->nip;
>  }
>  
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long 
> addr)
> +{
> + if (si_code != SEGV_PKUERR)
> + return;

Given that SEGV_PKUERR is a signal-specific si_code, this test is
insufficient to detect a pkey error.  You also need to check
that signr == SIGSEGV.

> + info->si_pkey = get_paca()->paca_pkey;
> +}
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>  void _exception(int signr, struct pt_regs *regs, int code, unsigned long 
> addr)
>  {
>   siginfo_t info;
> @@ -274,6 +284,11 @@ void _exception(int signr, struct pt_regs *regs, int 
> code, unsigned long addr)
>   info.si_signo = signr;
>   info.si_code = code;
>   info.si_addr = (void __user *) addr;
> +
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + fill_sig_info_pkey(code, , addr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>   force_sig_info(signr, , current);
>  }

Eric


Re: [PATCH v2 2/5] ia64: reuse append_elf_note() and final_note() functions

2016-12-02 Thread Eric W. Biederman
Hari Bathini  writes:

> Hi Dave,
>
>
> Thanks for the review.
>
>
> On Thursday 01 December 2016 10:26 AM, Dave Young wrote:
>> Hi Hari
>>
>> Personally I like V1 more, but split the patch 2 is easier for ia64
>> people to review.  I did basic x86 testing, it runs ok.
>>
>> On 11/25/16 at 05:24pm, Hari Bathini wrote:
>>> Get rid of multiple definitions of append_elf_note() & final_note()
>>> functions. Reuse these functions compiled under CONFIG_CRASH_CORE.
>>>
>>> Signed-off-by: Hari Bathini 
>>> ---
>>>   arch/ia64/kernel/crash.c   |   22 --
>>>   include/linux/crash_core.h |4 
>>>   kernel/crash_core.c|6 +++---
>>>   kernel/kexec_core.c|   28 
>>>   4 files changed, 7 insertions(+), 53 deletions(-)
>>>
>>> diff --git a/arch/ia64/kernel/crash.c b/arch/ia64/kernel/crash.c
>>> index 2955f35..75859a0 100644
>>> --- a/arch/ia64/kernel/crash.c
>>> +++ b/arch/ia64/kernel/crash.c
>>> @@ -27,28 +27,6 @@ static int kdump_freeze_monarch;
>>>   static int kdump_on_init = 1;
>>>   static int kdump_on_fatal_mca = 1;
>>>   -static inline Elf64_Word
>>> -*append_elf_note(Elf64_Word *buf, char *name, unsigned type, void *data,
>>> -   size_t data_len)
>>> -{
>>> -   struct elf_note *note = (struct elf_note *)buf;
>>> -   note->n_namesz = strlen(name) + 1;
>>> -   note->n_descsz = data_len;
>>> -   note->n_type   = type;
>>> -   buf += (sizeof(*note) + 3)/4;
>>> -   memcpy(buf, name, note->n_namesz);
>>> -   buf += (note->n_namesz + 3)/4;
>>> -   memcpy(buf, data, data_len);
>>> -   buf += (data_len + 3)/4;
>>> -   return buf;
>>> -}
>>> -
>>> -static void
>>> -final_note(void *buf)
>>> -{
>>> -   memset(buf, 0, sizeof(struct elf_note));
>>> -}
>>> -
>> The above IA64 version looks better than the functions in kexec_core.c
>> about the Elf64_Word type usage and the simpler final_note function.
>
> Hmmm.. Is void* better over Elf64_Word* to be agnostic of Elf32 or
> Elf64 type?

Both Elf64_Word and Elf32_Word result in a u32.  So I expect the right
solution is to add a definition of Elf_Word to include/linux/elf.h
and to make the buffer "Elf_Word *buf".

That way we preserve the alignment knowledge, while making the code
depend on 32bit or 64bit.

Eric


Re: [RFC] kexec_file: Add support for purgatory built as PIE

2016-11-04 Thread Eric W. Biederman
Baoquan He  writes:

> On 11/02/16 at 04:00am, Thiago Jung Bauermann wrote:
>> Hello,
>> 
>> The kexec_file code currently builds the purgatory as a partially linked 
>> object 
>> (using ld -r). Is there a particular reason to use that instead of a 
>> position 
>> independent executable (PIE)?
>
> It was originally treated as "-r" (relocatable) in the user-space kexec-tools too.
> I think Vivek just kept it the same when moving it into the kernel.

At least on x86 using just -r removed the need for a GOT and all of the
other nasty dynamic relocatable bits, which are not needed when you
don't want to share your text bits with the page cache.

I can see reasons for refactoring code but I expect PIE executables need
a GOT and all of that pain-in-the-neck stuff that can just be avoided by
building the code to run at an absolute address.

So far I have not seen ELF relocations that are difficult to process.

Eric


Re: [PATHC v2 0/9] ima: carry the measurement list across kexec

2016-09-29 Thread Eric W. Biederman
Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> Hello Eric,
>
> Am Dienstag, 20 September 2016, 11:07:29 schrieb Eric W. Biederman:
>> A semi-generic concept called a hand-over buffer seems to be a
>> construction of infrustructure for no actual reason that will just
>> result in confusion.  There are lots of things that are handed over, the
>> flattend device tree, ramdisks, bootparams on x86, etc, etc.  ima is not
>> special in this execpt for being perhaps the first addition that we are
>> going to want the option of including on most architectures.
>
> Ok, I understand. I decided to implement a generic concept because I thought 
> that proposing a feature that is more useful than what I need it for would  
> increase its chance of being accepted. It's interesting to see that it had 
> the opposite effect.

Yes.  In this case it was not clear that anyone else could use it, and
by being less generic you can tweak the code to ima's needs without
anyone having to worry about it.

So thank you very much for making the code more specific to the circumstances.

> I reworked and simplified the code and folded the hand-over buffer patches 
> into Mimi's patch series to carry the measurement list across kexec. The 
> kexec buffer code is in the following patches now:
>
> [PATCH v5 01/10] powerpc: ima: Get the kexec buffer passed by the previous 
> kernel
> [PATCH v5 05/10] powerpc: ima: Send the kexec buffer to the next
> kernel

That plus  [PATCH v5 06/10] ima: on soft reboot, save the measurement list

> Each patch has a changelog listing what I changed to make it specific to 
> IMA.

I am a little sad to see you needed to modify kexec_file.c to get where
you were going, but that isn't a huge issue either way.

Eric




Re: [PATCH v5 00/10] ima: carry the measurement list across kexec

2016-09-29 Thread Eric W. Biederman
Mimi Zohar <zo...@linux.vnet.ibm.com> writes:

> The TPM PCRs are only reset on a hard reboot.  In order to validate a
> TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
> of the running kernel must be saved and then restored on the subsequent
> boot, possibly of a different architecture.
>
> The existing securityfs binary_runtime_measurements file conveniently
> provides a serialized format of the IMA measurement list. This patch
> set serializes the measurement list in this format and restores it.
>
> Up to now, the binary_runtime_measurements was defined as architecture
> native format.  The assumption being that userspace could and would
> handle any architecture conversions.  With the ability of carrying the
> measurement list across kexec, possibly from one architecture to a
> different one, the per boot architecture information is lost and with it
> the ability of recalculating the template digest hash.  To resolve this
> problem, without breaking the existing ABI, this patch set introduces
> the boot command line option "ima_canonical_fmt", which is arbitrarily
> defined as little endian.
>
> The need for this boot command line option will be limited to the
> existing version 1 format of the binary_runtime_measurements.
> Subsequent formats will be defined as canonical format (eg. TPM 2.0
> support for larger digests).
>
> A simplified method of Thiago Bauermann's "kexec buffer handover" patch
> series for carrying the IMA measurement list across kexec is included
> in this patch set.  The simplified method requires all file measurements
> be taken prior to executing the kexec load, as subsequent measurements
> will not be carried across the kexec and restored.

So I just went through the kexec portions of this and I don't see
anything particularly worrying.

I have one thing that I think could be improved, but is not wrong.
Having both receiving and transmitting the ima measurements under
HAVE_IMA_KEXEC seems wrong.  There may be people who want to receive the
measurement list but don't want to support kexec'ing other kernels, or the
other way around.  I can very much see bootloaders that expect they will
be the first kernel not wanting to compile in the extra code for
receiving the measurement list.

But again that is a nit, and not a problem.

So for the series, from the kexec point of view.

Acked-by: "Eric W. Biederman" <ebied...@xmission.com>

>
> These patches can also be found in the next-kexec-restore branch of:
> git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>
> Changelog v5:
> - Included patches from Thiago Bauermann's "kexec buffer handover"
> patch series for carrying the IMA measurement list across kexec.
> - Added CONFIG_HAVE_IMA_KEXEC
> - Renamed functions to variations of ima_kexec_buffer instead of
> variations of kexec_handover_buffer
>
> Changelog v4:
> - Fixed "spinlock bad magic" BUG - reported by Dmitry Vyukov
> - Rebased on Thiago Bauermann's v5 patch set
> - Removed the skip_checksum initialization  
>
> Changelog v3:
> - Cleaned up the code for calculating the requested kexec segment size
> needed for the IMA measurement list, limiting the segment size to half
> of the totalram_pages.
> - Fixed kernel test robot reports as enumerated in the respective
> patch changelog.
>
> Changelog v2:
> - Canonical measurement list support added
> - Redefined the ima_kexec_hdr struct to use well defined sizes
>
> Andreas Steffen (1):
>   ima: platform-independent hash value
>
> Mimi Zohar (7):
>   ima: on soft reboot, restore the measurement list
>   ima: permit duplicate measurement list entries
>   ima: maintain memory size needed for serializing the measurement list
>   ima: on soft reboot, save the measurement list
>   ima: store the builtin/custom template definitions in a list
>   ima: support restoring multiple template formats
>   ima: define a canonical binary_runtime_measurements list format
>
> Thiago Jung Bauermann (2):
>   powerpc: ima: Get the kexec buffer passed by the previous kernel
>   powerpc: ima: Send the kexec buffer to the next kernel
>
>  Documentation/kernel-parameters.txt   |   4 +
>  arch/Kconfig  |   3 +
>  arch/powerpc/Kconfig  |   1 +
>  arch/powerpc/include/asm/ima.h|  29 +++
>  arch/powerpc/include/asm/kexec.h  |  16 +-
>  arch/powerpc/kernel/Makefile  |   4 +
>  arch/powerpc/kernel/ima_kexec.c   | 223 +++
>  arch/powerpc/kernel/kexec_elf_64.c|   2 +-
>  arch/powerpc/kernel/machine_kexec_64.c| 116 ++--
>  include/linux/ima.h   |  12 ++
>  kernel

Re: [PATHC v2 0/9] ima: carry the measurement list across kexec

2016-09-20 Thread Eric W. Biederman

Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> Am Samstag, 17 September 2016, 00:17:37 schrieb Eric W. Biederman:
>> Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:
>> > Hello Eric,
>> > 
>> > Am Freitag, 16 September 2016, 14:47:13 schrieb Eric W. Biederman:
>> >> I can see tracking to see if the list has changed at some
>> >> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.
>> > 
>> > Yes, that is an interesting feature that I can add using the checksum-
>> > verifying part of my code. I can submit a patch for that if there's
>> > interest, adding a reboot notifier that verifies the checksum and causes
>> > a regular reboot instead of a kexec reboot if the checksum fails.
>> 
>> I was thinking of an early failure instead of getting all of the way down
>> into a kernel and discovering the tpm/ima subsystem would not be
>> initialized.  But where that falls in the reboot pathway I don't expect
>> there is much value in it.
>
> I'm not sure I understand. What I described doesn't involve the tpm or ima. 
> I'm suggesting that if I take the parts of patch 4/5 in the kexec hand-over 
> buffer series that verify the image checksum, I can submit a separate patch 
> that checks the integrity of the kexec image early in kernel_kexec() and 
> reverts to a regular reboot if the check fails. This would be orthogonal to 
> ima carrying its measurement list across kexec.
>
> I think there is value in that, because if the kexec image is corrupted the 
> machine will just get stuck in the purgatory and (unless it's a platform 
> where the purgatory can print to the console) without even an error message 
> explaining what is going on. Whereas if we notice the corruption before 
> jumping into the purgatory we can switch to a regular reboot and the machine 
> will boot successfully.
>
> To have an early failure, when would the checksum verification be done?
> What I can think of is to have kexec_file_load accept a new flag 
> KEXEC_FILE_VERIFY_IMAGE, which userspace could use to request an integrity 
> check when it's about to start the reboot procedure. Then it can decide to 
> either reload the kernel or use a regular reboot if the image is corrupted.
>
> Is this what you had in mind?

Sort of.

I was just thinking that instead of having the boot path verify your ima
list matches what is in the tpm and halting the boot there, we could
test that on reboot.  Which would give a clean failure without the nasty
poking into a prepared image.  The downside is that we have already run
the shutdown scripts, so it wouldn't be much cleaner than triggering a
machine reboot from elsewhere.

But I don't think we should spend too much time on that.  It was a
passing thought.  We should focus on getting a non-poked ima buffer
cleanly into kexec and we can worry about the rest later.
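The reboot-time check described above can be sketched in userspace terms: record a digest over the loaded segments at kexec load time, re-verify it just before the jump, and fall back to a regular reboot on a mismatch. Below is a minimal Python sketch of that idea; the function names and the choice of SHA-256 are illustrative, not the kernel's actual interfaces.

```python
import hashlib

def load_segments(segments):
    # At "kexec load" time, record a digest over every segment.
    h = hashlib.sha256()
    for seg in segments:
        h.update(seg)
    return h.hexdigest()

def reboot(segments, stored_digest):
    # Just before jumping into the new kernel, re-hash the segments.
    # If the image was corrupted in memory, fall back to a regular
    # reboot instead of getting stuck in purgatory.
    h = hashlib.sha256()
    for seg in segments:
        h.update(seg)
    if h.hexdigest() != stored_digest:
        return "regular-reboot"
    return "kexec-jump"
```

The point of the sketch is only the ordering: verification happens before the jump, so a corrupt image produces a clean fallback rather than a silent hang.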

>> >> At least the common bootloader cases that I know of using kexec are
>> >> very
>> >> minimal distributions that live in a ramdisk and as such it should be
>> >> very straight forward to measure what is needed at or before
>> >> sys_kexec_load.  But that was completely dismissed as unrealistic so I
>> >> don't have a clue what actual problem you are trying to solve.
>> > 
>> > We are interested in solving the problem in a general way because it
>> > will be useful to us in the future for the case of an arbitrary number
>> > of kexecs (and thus not only a bootloader but also multiple full-blown
>> > distros may be involved in the chain).
>> > 
>> > But you are right that for the use case for which we currently need this
>> > feature it's feasible to measure everything upfront. We can cross the
>> > other bridge when we get there.
>> 
>> Then let's start there.  Passing the measurement list is something that
>> should not be controversial.
>
> Great!
>
>> >> If there is any way we can start small and not with this big scary
>> >> infrastructure change I would very much prefer it.
>> > 
>> > Sounds good. If we pre-measure everything then the following patches
>> > from my buffer hand-over series are enough:
>> > 
>> > [PATCH v5 2/5] kexec_file: Add buffer hand-over support for the next
>> > kernel [PATCH v5 3/5] powerpc: kexec_file: Add buffer hand-over support
>> > for the next kernel
>> > 
>> > Would you consider including those two?
>> > 
>> > And like I mentioned in the cover letter, patch 1/5 is an interesting
>> > improvement that is worth considering.
>> 
>> So from 10,000 feet I th

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> Hello Eric,
>
> On Friday, 16 September 2016, 14:47:13, Eric W. Biederman wrote:
>> Mimi Zohar <zo...@linux.vnet.ibm.com> writes:
>> > Hi Andrew,
>> > 
>> > On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>> >> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>> >> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar 
> <zo...@linux.vnet.ibm.com> wrote:
>> >> > > The TPM PCRs are only reset on a hard reboot.  In order to validate
>> >> > > a
>> >> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement
>> >> > > list
>> >> > > of the running kernel must be saved and then restored on the
>> >> > > subsequent
>> >> > > boot, possibly of a different architecture.
>> >> > > 
>> >> > > The existing securityfs binary_runtime_measurements file
>> >> > > conveniently
>> >> > > provides a serialized format of the IMA measurement list. This
>> >> > > patch
>> >> > > set serializes the measurement list in this format and restores it.
>> >> > > 
>> >> > > Up to now, the binary_runtime_measurements was defined as
>> >> > > architecture
>> >> > > native format.  The assumption being that userspace could and would
>> >> > > handle any architecture conversions.  With the ability of carrying
>> >> > > the
>> >> > > measurement list across kexec, possibly from one architecture to a
>> >> > > different one, the per boot architecture information is lost and
>> >> > > with it
>> >> > > the ability of recalculating the template digest hash.  To resolve
>> >> > > this
>> >> > > problem, without breaking the existing ABI, this patch set
>> >> > > introduces
>> >> > > the boot command line option "ima_canonical_fmt", which is
>> >> > > arbitrarily
>> >> > > defined as little endian.
>> >> > > 
>> >> > > The need for this boot command line option will be limited to the
>> >> > > existing version 1 format of the binary_runtime_measurements.
>> >> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>> >> > > support for larger digests).
>> >> > > 
>> >> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>> >> > > hand-over for the next kernel" patch set.
>> >> > > 
>> >> > > These patches can also be found in the next-kexec-restore branch
>> >> > > of:
>> >> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
>> >> > > .git
>> >> > 
>> >> > I'll merge these into -mm to get some linux-next exposure.  I don't
>> >> > know what your upstream merge plans will be?
>> >> 
>> >> Sounds good.  I'm hoping to get some review/comments on this patch set
>> >> as well.  At the moment, I'm chasing down a kernel test robot report
>> >> from this afternoon.
>> > 
>> > My concern about changing the canonical format as originally defined in
>> > patch 9/9 from big endian to little endian never materialized.  Andreas
>> > Steffan, the patch author, is happy either way.
>> > 
>> > We proposed two methods of addressing Eric Biederman's concerns of not
>> > including the IMA measurement list segment in the kexec hash as
>> > described in  https://lkml.org/lkml/2016/9/9/355.
>> > 
>> > - defer calculating and verifying the serialized IMA measurement list
>> > buffer hash to IMA
>> > - calculate the kexec hash on load, verify it on the kexec execute,
>> > before re-calculating and updating it.
>> 
>> I need to ask: How is this anticipated to interact with kexec on panic?
>> Because honestly I can't see this ever working in that case.  The
>> assumption is that the original kernel has gone crazy.  So from a
>> practical standpoint any trusted path should have been invalidated.
>
> We are not interested in carrying the measurement list in the case of kexec 
> on panic. I see that the code is adding a hand-over buffer to the crash 
> image as well, but that is a bug.
>
> The fix is to do nothing in ima_add

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Mimi Zohar <zo...@linux.vnet.ibm.com> writes:
>
>> Hi Andrew,
>>
>> On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>>> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>>> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar <zo...@linux.vnet.ibm.com> 
>>> > wrote:
>>> > 
>>> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
>>> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
>>> > > of the running kernel must be saved and then restored on the subsequent
>>> > > boot, possibly of a different architecture.
>>> > > 
>>> > > The existing securityfs binary_runtime_measurements file conveniently
>>> > > provides a serialized format of the IMA measurement list. This patch
>>> > > set serializes the measurement list in this format and restores it.
>>> > > 
>>> > > Up to now, the binary_runtime_measurements was defined as architecture
>>> > > native format.  The assumption being that userspace could and would
>>> > > handle any architecture conversions.  With the ability of carrying the
>>> > > measurement list across kexec, possibly from one architecture to a
>>> > > different one, the per boot architecture information is lost and with it
>>> > > the ability of recalculating the template digest hash.  To resolve this
>>> > > problem, without breaking the existing ABI, this patch set introduces
>>> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
>>> > > defined as little endian.
>>> > > 
>>> > > The need for this boot command line option will be limited to the
>>> > > existing version 1 format of the binary_runtime_measurements.
>>> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>>> > > support for larger digests).
>>> > > 
>>> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>>> > > hand-over for the next kernel" patch set. 
>>> > > 
>>> > > These patches can also be found in the next-kexec-restore branch of:
>>> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>>> > 
>>> > I'll merge these into -mm to get some linux-next exposure.  I don't
>>> > know what your upstream merge plans will be?
>>> 
>>> Sounds good.  I'm hoping to get some review/comments on this patch set
>>> as well.  At the moment, I'm chasing down a kernel test robot report
>>> from this afternoon.
>>
>> My concern about changing the canonical format as originally defined in
>> patch 9/9 from big endian to little endian never materialized.  Andreas
>> Steffan, the patch author, is happy either way.
>>
>> We proposed two methods of addressing Eric Biederman's concerns of not
>> including the IMA measurement list segment in the kexec hash as
>> described in  https://lkml.org/lkml/2016/9/9/355.
>>
>> - defer calculating and verifying the serialized IMA measurement list
>> buffer hash to IMA
>> - calculate the kexec hash on load, verify it on the kexec execute,
>> before re-calculating and updating it.
>
> I need to ask: How is this anticipated to interact with kexec on panic?
> Because honestly I can't see this ever working in that case.  The
> assumption is that the original kernel has gone crazy.  So from a
> practical standpoint any trusted path should have been invalidated.
>
> This entire idea of updating the kexec image makes me extremely
> extremely nervous.  It feels like sticking a screwdriver through the
> spokes of your bicycle tires while riding down the road.
>
> I can see tracking to see if the list has changed at some
> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.
>
> At least the common bootloader cases that I know of using kexec are very
> minimal distributions that live in a ramdisk and as such it should be
> very straight forward to measure what is needed at or before
> sys_kexec_load.  But that was completely dismissed as unrealistic so I
> don't have a clue what actual problem you are trying to solve.
>
> If there is any way we can start small and not with this big scary
> infrastructure change I would very much prefer it.

I have thought about this a little more and the entire reason for
updating things on the fly really really disturbs me.  To prove you are
t

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
Mimi Zohar  writes:

> Hi Andrew,
>
> On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar  
>> > wrote:
>> > 
>> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
>> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
>> > > of the running kernel must be saved and then restored on the subsequent
>> > > boot, possibly of a different architecture.
>> > > 
>> > > The existing securityfs binary_runtime_measurements file conveniently
>> > > provides a serialized format of the IMA measurement list. This patch
>> > > set serializes the measurement list in this format and restores it.
>> > > 
>> > > Up to now, the binary_runtime_measurements was defined as architecture
>> > > native format.  The assumption being that userspace could and would
>> > > handle any architecture conversions.  With the ability of carrying the
>> > > measurement list across kexec, possibly from one architecture to a
>> > > different one, the per boot architecture information is lost and with it
>> > > the ability of recalculating the template digest hash.  To resolve this
>> > > problem, without breaking the existing ABI, this patch set introduces
>> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
>> > > defined as little endian.
>> > > 
>> > > The need for this boot command line option will be limited to the
>> > > existing version 1 format of the binary_runtime_measurements.
>> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>> > > support for larger digests).
>> > > 
>> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>> > > hand-over for the next kernel" patch set. 
>> > > 
>> > > These patches can also be found in the next-kexec-restore branch of:
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>> > 
>> > I'll merge these into -mm to get some linux-next exposure.  I don't
>> > know what your upstream merge plans will be?
>> 
>> Sounds good.  I'm hoping to get some review/comments on this patch set
>> as well.  At the moment, I'm chasing down a kernel test robot report
>> from this afternoon.
>
> My concern about changing the canonical format as originally defined in
> patch 9/9 from big endian to little endian never materialized.  Andreas
> Steffan, the patch author, is happy either way.
>
> We proposed two methods of addressing Eric Biederman's concerns of not
> including the IMA measurement list segment in the kexec hash as
> described in  https://lkml.org/lkml/2016/9/9/355.
>
> - defer calculating and verifying the serialized IMA measurement list
> buffer hash to IMA
> - calculate the kexec hash on load, verify it on the kexec execute,
> before re-calculating and updating it.

I need to ask: How is this anticipated to interact with kexec on panic?
Because honestly I can't see this ever working in that case.  The
assumption is that the original kernel has gone crazy.  So from a
practical standpoint any trusted path should have been invalidated.

This entire idea of updating the kexec image makes me extremely
extremely nervous.  It feels like sticking a screwdriver through the
spokes of your bicycle tires while riding down the road.

I can see tracking to see if the list has changed at some
point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.

At least the common bootloader cases that I know of using kexec are very
minimal distributions that live in a ramdisk and as such it should be
very straight forward to measure what is needed at or before
sys_kexec_load.  But that was completely dismissed as unrealistic so I
don't have a clue what actual problem you are trying to solve.

If there is any way we can start small and not with this big scary
infrastructure change I would very much prefer it.

Eric


Re: [PATCH v4 0/5] kexec_file: Add buffer hand-over for the next kernel

2016-09-09 Thread Eric W. Biederman
Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> On Wednesday, 07 September 2016, 09:19:40, Eric W. Biederman wrote:
>> ebied...@xmission.com (Eric W. Biederman) writes:
>> > Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:
>> >> Hello,
>> >> 
>> >> The purpose of this new version of the series is to fix a small issue
>> >> that I found, which is that the kernel doesn't remove the memory
>> >> reservation for the hand-over buffer it received from the previous
>> >> kernel in the device tree it sets up for the next kernel. The result
>> >> is that for each successive kexec, a stale hand-over buffer is left
>> >> behind, wasting memory.
>> >> 
>> >> This is fixed by changes to kexec_free_handover_buffer and
>> >> setup_handover_buffer in patch 2. The other change is to fix checkpatch
>> >> warnings in the last patch.
>> > 
>> > This is fundamentally broken.  You do not increase the integrity of a
>> > system by dropping integrity checks.
>> > 
>> > No. No. No. No.
>> > 
>> > Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>
>
> The IMA measurement list can be verified without the need of a checksum over 
> its contents by replaying the PCR extend operations and checking that the 
> result matches the registers in the TPM device. So the fact that it is not 
> part of the kexec segments checksum verification doesn't actually reduce the 
> integrity of the system.

Bit flips and errant DMA transfers are the concern here.  That happens
routinely and can easily result in a corrupt data structure which may be
non-trivial to verify.

> Currently, IMA doesn't perform that verification when it restores the 
> measurement list from the kexec handover buffer, so if you believe it's 
> necessary to do that check at boot time, we could do one of the following:
>
> 1. Have IMA replay the PCR extend operations when it restores the 
> measurement list from the handover buffer and validate it against the TPM 
> PCRs, or
>
> 2. Have a buffer hash in the ima_kexec_hdr that IMA includes in the handover 
> buffer, and verify the buffer checksum before restoring the measurement 
> list.
>
> What do you think?

I think you are playing very much with fire and I am extremely
uncomfortable with the entire concept.  I think you are making things
more complicated in a way that will allow systems to try and start
booting when their memory is corrupt.  Which may wind up corrupting
someone's non-volatile storage.

It makes me doubly nervous that this adds a general purpose facility
that is generally not at all reusable.

I have seen malicious actors cause entirely too much damage to be at all
comfortable using a data structure before we validate that it was valid
before we started booting.  This isn't the same case but it is close
enough I don't trust someone just splatting data structures.
We can't guarantee integrity but we should not bypass best practices
either.

>> To be constructive, the way we have handled similar situations in the
>> past (hotplug memory) is to call kexec_load again.
>
> Thanks for your suggestion. Unfortunately it's always possible for new 
> measurements to be added to the measurement list between the kexec_file_load 
> and the reboot. We see that happen in practice with system scripts and 
> configuration files that are only read or executed during the reboot 
> process. They are only measured by IMA as a result of the kexec execute.

If I understand what you are saying correctly I expect things could be
setup so that those measurements are forced to happen before kexec load.

Especially in a boot loader context which you described earlier I
believe you should have quite a lot of control of the system.  Having a
facility that fundamentally undermines the design of kexec for a case
where someone might do something you have not predicted does not make me
comfortable.

Which is to say I don't see why you can't measure things before the
kexec_load system call in a tightly controlled setup like a boot loader.
Which should mean that, in practice, no measurements change.  I believe
that should increase the reliability of the system overall.

Having code in the kernel that updates a buffer that kexec will use
after that buffer is loaded in memory honestly gives me the heebie
jeebies.

Eric
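The first option Thiago proposed above — replaying the PCR extend operations and comparing the result against the TPM register — works because a PCR can only ever be updated as PCR = SHA-1(PCR || digest). A hedged userspace sketch of that replay follows; real code would parse the binary measurement list and read the PCR from the TPM, and all names here are illustrative.

```python
import hashlib

def pcr_extend(pcr, measurement_digest):
    # TPM 1.2 extend semantics: new PCR = SHA-1(old PCR || digest).
    return hashlib.sha1(pcr + measurement_digest).digest()

def replay_measurement_list(measurements):
    # A PCR starts as 20 zero bytes after a hard reboot; replay every
    # recorded measurement in order to recompute its expected value.
    pcr = b"\x00" * 20
    for m in measurements:
        pcr = pcr_extend(pcr, hashlib.sha1(m).digest())
    return pcr

def verify_list(measurements, tpm_pcr_value):
    # The carried-over list is only trustworthy if replaying it
    # reproduces the value actually held in the TPM register.
    return replay_measurement_list(measurements) == tpm_pcr_value
```

Because the TPM register itself survives the soft reboot, any tampering with (or truncation of) the handed-over list makes the replay diverge from the hardware value.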


Re: [PATCH v4 0/5] kexec_file: Add buffer hand-over for the next kernel

2016-09-07 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:
>
>> Hello,
>>
>> The purpose of this new version of the series is to fix a small issue that
>> I found, which is that the kernel doesn't remove the memory reservation
>> for the hand-over buffer it received from the previous kernel in the
>> device tree it sets up for the next kernel. The result is that for each
>> successive kexec, a stale hand-over buffer is left behind, wasting memory.
>>
>> This is fixed by changes to kexec_free_handover_buffer and
>> setup_handover_buffer in patch 2. The other change is to fix checkpatch
>> warnings in the last patch.
>
> This is fundamentally broken.  You do not increase the integrity of a
> system by dropping integrity checks.
>
> No. No. No. No.
>
> Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

To be constructive, the way we have handled similar situations in the
past (hotplug memory) is to call kexec_load again.

Eric


Re: [PATCH v4 0/5] kexec_file: Add buffer hand-over for the next kernel

2016-09-07 Thread Eric W. Biederman
Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> Hello,
>
> The purpose of this new version of the series is to fix a small issue that
> I found, which is that the kernel doesn't remove the memory reservation
> for the hand-over buffer it received from the previous kernel in the
> device tree it sets up for the next kernel. The result is that for each
> successive kexec, a stale hand-over buffer is left behind, wasting memory.
>
> This is fixed by changes to kexec_free_handover_buffer and
> setup_handover_buffer in patch 2. The other change is to fix checkpatch
> warnings in the last patch.

This is fundamentally broken.  You do not increase the integrity of a
system by dropping integrity checks.

No. No. No. No.

Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

Eric


Re: [PATCH v4 3/5] kexec_file: Allow skipping checksum calculation for some segments.

2016-09-06 Thread Eric W. Biederman
Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> writes:

> Add skip_checksum member to struct kexec_buf to specify whether the
> corresponding segment should be part of the checksum calculation.
>
> The next patch will add a way to update segments after a kimage is loaded.
> Segments that will be updated in this way should not be checksummed,
> otherwise they will cause the purgatory checksum verification to fail
> when the machine is rebooted.
>
> As a bonus, we don't need to special-case the purgatory segment anymore
> to avoid checksumming it.
>
> Places using struct kexec_buf get false as the default value for
> skip_checksum since they all use designated initializers.  Therefore,
> there is no behavior change with this patch and all segments except the
> purgatory are checksummed.

What in the world?  This fundamentally makes kexec unsafe to use.  If
any updates need to be made they should be made before they are part of
the entire image checksum.

No way should this be merged anywhere ever.

Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

>
> Signed-off-by: Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com>
> ---
>  include/linux/kexec.h | 23 ++-
>  kernel/kexec_file.c   | 15 +++
>  2 files changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 16561e96a6d7..edadff6c86ff 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -100,6 +100,9 @@ struct kexec_segment {
>   size_t bufsz;
>   unsigned long mem;
>   size_t memsz;
> +
> + /* Whether this segment is ignored in the checksum calculation. */
> + bool skip_checksum;
>  };
>  
>  #ifdef CONFIG_COMPAT
> @@ -151,15 +154,16 @@ struct kexec_file_ops {
>  
>  /**
>   * struct kexec_buf - parameters for finding a place for a buffer in memory
> - * @image:   kexec image in which memory to search.
> - * @buffer:  Contents which will be copied to the allocated memory.
> - * @bufsz:   Size of @buffer.
> - * @mem: On return will have address of the buffer in memory.
> - * @memsz:   Size for the buffer in memory.
> - * @buf_align:   Minimum alignment needed.
> - * @buf_min: The buffer can't be placed below this address.
> - * @buf_max: The buffer can't be placed above this address.
> - * @top_down:Allocate from top of memory.
> - * @skip_checksum: Don't verify checksum for this buffer in purgatory.
> + * @image:   kexec image in which memory to search.
> + * @buffer:  Contents which will be copied to the allocated memory.
> + * @bufsz:   Size of @buffer.
> + * @mem: On return will have address of the buffer in memory.
> + * @memsz:   Size for the buffer in memory.
> + * @buf_align:   Minimum alignment needed.
> + * @buf_min: The buffer can't be placed below this address.
> + * @buf_max: The buffer can't be placed above this address.
> + * @top_down:    Allocate from top of memory.
> + * @skip_checksum:   Don't verify checksum for this buffer in purgatory.
>   */
>  struct kexec_buf {
>   struct kimage *image;
> @@ -171,6 +175,7 @@ struct kexec_buf {
>   unsigned long buf_min;
>   unsigned long buf_max;
>   bool top_down;
> + bool skip_checksum;
>  };
>  
>  int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f5684adfad07..0e90d1446cb0 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -584,6 +584,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
>   ksegment->bufsz = kbuf->bufsz;
>   ksegment->mem = kbuf->mem;
>   ksegment->memsz = kbuf->memsz;
> + ksegment->skip_checksum = kbuf->skip_checksum;
>   kbuf->image->nr_segments++;
>   return 0;
>  }
> @@ -598,7 +599,6 @@ static int kexec_calculate_store_digests(struct kimage 
> *image)
>   char *digest;
>   void *zero_buf;
>   struct kexec_sha_region *sha_regions;
> - struct purgatory_info *pi = &image->purgatory_info;
>  
>   zero_buf = __va(page_to_pfn(ZERO_PAGE(0)) << PAGE_SHIFT);
>   zero_buf_sz = PAGE_SIZE;
> @@ -638,11 +638,7 @@ static int kexec_calculate_store_digests(struct kimage 
> *image)
>   struct kexec_segment *ksegment;
>  
>   ksegment = &image->segment[i];
> - /*
> -  * Skip purgatory as it will be modified once we put digest
> -  * info in purgatory.
> -  */
> - if (ksegment->kbuf == pi->purgatory_buf)
> + if (ksegment->skip_checksum)
>   continue;
>  
> 
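The behavior the patch above introduces — digesting every segment except those flagged skip_checksum — can be modeled outside the kernel like this. The structures below are hypothetical stand-ins for struct kexec_segment and kexec_calculate_store_digests(), and SHA-256 is chosen arbitrarily for the sketch.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Segment:
    data: bytes
    skip_checksum: bool = False  # mirrors ksegment->skip_checksum

def calculate_digest(segments):
    # Segments flagged skip_checksum (e.g. the purgatory itself, or
    # buffers meant to be updated after load) are left out of the
    # digest that purgatory later verifies.
    h = hashlib.sha256()
    for seg in segments:
        if seg.skip_checksum:
            continue
        h.update(seg.data)
    return h.hexdigest()
```

This also makes concrete what Eric objects to: a flagged segment can change arbitrarily after load without the digest noticing, which is exactly the integrity gap the nack is about.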

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Petr Tesarik <ptesa...@suse.cz> writes:
>
>> On Tue, 12 Jul 2016 13:25:11 -0300
>> Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> wrote:
>>
>>> Hi Eric,
>>> 
>>> I'm trying to understand your concerns leading to your nack. I hope you 
>>> don't mind expanding your thoughts on them a bit.
>>> 
>>> On Tuesday, 12 July 2016, 08:25:48, Eric W. Biederman wrote:
>>> > AKASHI Takahiro <takahiro.aka...@linaro.org> writes:
>>> > > Device tree blob must be passed to a second kernel on DTB-capable
>>> > > archs, like powerpc and arm64, but the current kernel interface
>>> > > lacks this support.
>>> > > 
>>> > > This patch extends kexec_file_load system call by adding an extra
>>> > > argument to this syscall so that an arbitrary number of file descriptors
>>> > > can be handed out from user space to the kernel.
>>> > > 
>>> > > See the background [1].
>>> > > 
>>> > > Please note that the new interface looks quite similar to the current
>>> > > system call, but that it won't always mean that it provides the "binary
>>> > > compatibility."
>>> > > 
>>> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
>>> > 
>>> > So this design is wrong.  The kernel already has the device tree blob,
>>> > you should not be extracting it from the kernel munging it, and then
>>> > reinserting it in the kernel if you want signatures and everything to
>>> > pass.
>>> 
>>> I don't understand how the kernel signature will be invalidated. 
>>> 
>>> There are some types of boot images that can embed a device tree blob in 
>>> them, but the kernel can also be handed a separate device tree blob from 
>>> firmware, the boot loader, or kexec. This latter case is what we are 
>>> discussing, so we are not talking about modifying an embedded blob in the 
>>> kernel image.
>>> 
>>> > What x86 does is pass it's equivalent of the device tree blob from one
>>> > kernel to another directly and behind the scenes.  It does not go
>>> > through userspace for this.
>>> > 
>>> > Until a persuasive case can be made for going around the kernel and
>>> > probably adding a feature (like code execution) that can be used to
>>> > defeat the signature scheme I am going to nack this.
>>> 
>>> I also don't understand what you mean by code execution. How does passing a 
>>> device tree blob via kexec enables code execution? How can the signature 
>>> scheme be defeated?
>>
>> I'm not an expert on DTB, so I can't provide an example of code
>> execution, but you have already mentioned the /chosen/linux,stdout-path
>> property. If an attacker redirects the bootloader to an insecure
>> console, they may get access to the system that would otherwise be
>> impossible.
>>
>> In general, tampering with the hardware inventory of a machine opens up
>> a security hole, and one must be very cautious which modifications are
>> allowed. You're giving this power to an (unsigned, hence untrusted)
>> userspace application; Eric argues that only the kernel should have
>> this power.
>
> At the very least it should be signed.  And of course the more signed
> images we have in different combinations the more easily someone can
> find a combination that does things the people performing the signing
> didn't realize they were allowing.
>
> So if we can avoid adding an extra variable into the mix, it would be good.

But it is even more than that.  There was a giant design discussion that
lasted months before this code was added on x86.  The facts on the
ground on x86 are substantially similar to ARM64.  So coming up and
saying that oh that design sucks and we want to do something completely
different to achieve the exact same goals and then not even discussing
why the current design cannot work in the problem descriptions is
inconsiderate.

Not taking the time to understand how something works and why and then
asking people to explain to them what they are doing wrong is rude.  It
is a waste of everyone's time.  It

I thought I had said something to that effect, but it doesn't look like
I did.  Apologies for not being clear about that.

I have had a lot of that this last little while.  Code with big design
issues that really should be justified given people are aiming to
overturn previous design decisions and not even considering those
previous decisions, and I find it quite tiring.

Especially when we are dealing with design decisions with a security
impact, and people want to add code to achieve a security goal, I expect
people to be paying attention to what effect their changes have on the
entire ecosystem.  Not just saying the current behavior is inconvenient
and using that as a rationale for changing things.

Eric

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Eric W. Biederman
Petr Tesarik <ptesa...@suse.cz> writes:

> On Tue, 12 Jul 2016 13:25:11 -0300
> Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> wrote:
>
>> Hi Eric,
>> 
>> I'm trying to understand your concerns leading to your nack. I hope you 
>> don't mind expanding your thoughts on them a bit.
>> 
>> On Tuesday, 12 July 2016, 08:25:48, Eric W. Biederman wrote:
>> > AKASHI Takahiro <takahiro.aka...@linaro.org> writes:
>> > > Device tree blob must be passed to a second kernel on DTB-capable
>> > > archs, like powerpc and arm64, but the current kernel interface
>> > > lacks this support.
>> > > 
>> > > This patch extends kexec_file_load system call by adding an extra
>> > > argument to this syscall so that an arbitrary number of file descriptors
>> > > can be handed out from user space to the kernel.
>> > > 
>> > > See the background [1].
>> > > 
>> > > Please note that the new interface looks quite similar to the current
>> > > system call, but that does not always mean it provides binary
>> > > compatibility.
>> > > 
>> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
>> > 
>> > So this design is wrong.  The kernel already has the device tree blob,
>> > you should not be extracting it from the kernel, munging it, and then
>> > reinserting it into the kernel if you want signatures and everything to
>> > pass.
>> 
>> I don't understand how the kernel signature will be invalidated. 
>> 
>> There are some types of boot images that can embed a device tree blob in 
>> them, but the kernel can also be handed a separate device tree blob from 
>> firmware, the boot loader, or kexec. This latter case is what we are 
>> discussing, so we are not talking about modifying an embedded blob in the 
>> kernel image.
>> 
>> > What x86 does is pass its equivalent of the device tree blob from one
>> > kernel to another directly and behind the scenes.  It does not go
>> > through userspace for this.
>> > 
>> > Until a persuasive case can be made for going around the kernel and
>> > probably adding a feature (like code execution) that can be used to
>> > defeat the signature scheme I am going to nack this.
>> 
>> I also don't understand what you mean by code execution. How does passing a 
>> device tree blob via kexec enable code execution? How can the signature 
>> scheme be defeated?
>
> I'm not an expert on DTB, so I can't provide an example of code
> execution, but you have already mentioned the /chosen/linux,stdout-path
> property. If an attacker redirects the bootloader to an insecure
> console, they may get access to the system that would otherwise be
> impossible.
>
> In general, tampering with the hardware inventory of a machine opens up
> a security hole, and one must be very cautious which modifications are
> allowed. You're giving this power to an (unsigned, hence untrusted)
> userspace application; Eric argues that only the kernel should have
> this power.

At the very least it should be signed.  And of course the more signed
images we have in different combinations, the more easily someone can
find a combination that does things the people performing the signing
didn't realize they were allowing.

So if we can avoid adding an extra variable into the mix, it would be good.

Eric


Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Eric W. Biederman
AKASHI Takahiro <takahiro.aka...@linaro.org> writes:

> Device tree blob must be passed to a second kernel on DTB-capable
> archs, like powerpc and arm64, but the current kernel interface
> lacks this support.
>   
> This patch extends kexec_file_load system call by adding an extra
> argument to this syscall so that an arbitrary number of file descriptors
> can be handed out from user space to the kernel.
>
> See the background [1].
>
> Please note that the new interface looks quite similar to the current
> system call, but that does not always mean it provides binary
> compatibility.
>
> [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html

So this design is wrong.  The kernel already has the device tree blob;
you should not be extracting it from the kernel, munging it, and then
reinserting it into the kernel if you want signatures and everything to
pass.

What x86 does is pass its equivalent of the device tree blob from one
kernel to another directly and behind the scenes.  It does not go
through userspace for this.

Until a persuasive case can be made for going around the kernel and
probably adding a feature (like code execution) that can be used to
defeat the signature scheme, I am going to nack this.

Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

I am happy to see support for other architectures, but let's not build
an attackable infrastructure just for the sake of not moving some code
in the kernel.

Eric

Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available

2015-07-14 Thread Eric W. Biederman
Vivek Goyal vgo...@redhat.com writes:

 On Tue, Jul 14, 2015 at 03:48:33PM +, dwal...@fifo99.com wrote:
 On Tue, Jul 14, 2015 at 11:40:40AM -0400, Vivek Goyal wrote:
  On Tue, Jul 14, 2015 at 03:34:30PM +, dwal...@fifo99.com wrote:
   On Tue, Jul 14, 2015 at 11:02:08AM -0400, Vivek Goyal wrote:
On Tue, Jul 14, 2015 at 01:59:19PM +, dwal...@fifo99.com wrote:
 On Mon, Jul 13, 2015 at 08:19:45PM -0500, Eric W. Biederman wrote:
  dwal...@fifo99.com writes:
  
   On Fri, Jul 10, 2015 at 08:41:28AM -0500, Eric W. Biederman 
   wrote:
   Hidehiro Kawai hidehiro.kawai...@hitachi.com writes:
   
You can call panic notifiers and kmsg dumpers before kdump by
specifying crash_kexec_post_notifiers as a boot parameter.
However, it doesn't make sense if kdump is not available.  In 
that
case, disable crash_kexec_post_notifiers boot parameter so 
that
you can't change the value of the parameter.
   
   Nacked-by: Eric W. Biederman ebied...@xmission.com
  
   I think it would make sense if he just replaced kdump with 
   kexec.
  
  It would be less insane, however it still makes no sense as without
  kexec on panic support crash_kexec is a noop.  So the value of the
  setting makes no difference.
 
 Can you explain more, I don't really understand what you mean. Are 
 you suggesting
 the whole crash_kexec_post_notifiers feature has no value ?

Daniel,

BTW, why are you using crash_kexec_post_notifiers commandline? Why not
without it?
   
   It was explained in the prior thread but to rehash, the notifiers are 
   used to do a switch
   over from the crashed machine to another redundant machine.
  
  So why not detect failure using polling or issue notifications from second
  kernel.
  
  IOW, expecting that a crashed machine will be able to deliver notification
  reliably is flawed to begin with, IMHO.
 
 It's flawed to think you can kexec, but you still do it right ? I've not 
 gotten into
 the deep details of this switching process, but that's how this interface is 
 used.

 Sure. But the deal here is that users of interface know that sometimes it
 can be unreliable. And in the absence of more reliable mechanism, somewhat
 less reliable mechanism is fine. 

  
  If a machine is failing, there is a high chance it can't deliver you the
  notification. Detecting that failure using some kind of polling mechanism
  might be more reliable. And it will make even the kdump mechanism more
  reliable so that it does not have to run panic notifiers after the crash.
 
 I think what your suggesting is that my company should change how it's 
 hardware works
 and that's not really an option for me. This isn't a simple thing like 
 checking over the
 network if the machine is down or not, this is way more complex hardware 
 design.

 That means you are ready to live with an unreliable design. There might be
 cases where notifier does not get run properly and you will not do switch
 despite the fact that OS has failed. I was just trying to nudge you in
 a direction which could be more reliable mechanism.

Sigh.  I see some deep confusion going on here.

The panic notifiers are just that: panic notifiers.  They have not been,
nor should they be, tied to kexec.  If those notifiers force a switchover
between machines, I fail to see why you would care whether it was kexec
or another panic situation that forced that switchover.

Now if you want a reliable design, I strongly recommend, as I have been
recommending for 15 years, that the magic failover code be placed in
either the new kernel or a stub that precedes the new kernel.

That gives the greatest reliability we know how to engineer, and it lets
you do whatever you need to do.

Especially if it is not desirable for the panic notifiers to run without
the presence of kexec, I very strongly recommend not using them at all
and just writing a stub of code that can run before a new kernel starts.

Eric

Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available

2015-07-14 Thread Eric W. Biederman
Vivek Goyal vgo...@redhat.com writes:

 On Tue, Jul 14, 2015 at 05:29:53PM +, dwal...@fifo99.com wrote:

 [..]
If a machine is failing, there is a high chance it can't deliver you the
notification. Detecting that failure using some kind of polling mechanism
might be more reliable. And it will make even the kdump mechanism more
reliable so that it does not have to run panic notifiers after the crash.
   
   I think what you're suggesting is that my company should change how its
   hardware works, and that's not really an option for me. This isn't a
   simple thing like checking over the network if the machine is down or
   not; this is way more complex hardware design.
  
   That means you are ready to live with an unreliable design. There might
   be cases where the notifier does not get run properly and you will not
   do the switch despite the fact that the OS has failed. I was just trying
   to nudge you in the direction of a more reliable mechanism.
  
  Sigh.  I see some deep confusion going on here.
  
  The panic notifiers are just that: panic notifiers.  They have not been,
  nor should they be, tied to kexec.  If those notifiers force a switchover
  between machines, I fail to see why you would care whether it was kexec
  or another panic situation that forced that switchover.
 
 Hidehiro isn't fixing the failover situation on my side, he's fixing register
 information collection when crash_kexec_post_notifiers is used.

 Sure. Given that we have created this new parameter, let us fix it so that
 we can capture the other cpu register state in crash dump.

 I am a little disappointed that it was not tested well when this parameter
 was introduced. We should have at least tested it to the extent of seeing
 whether there is proper cpu state present for all cpus in the crash dump.

 At that point of time it looked like a simple modification
 to allow panic notifiers before crash_kexec().

Either that or we say no one cares enough, and it is known broken, so
let's just revert the fool thing.

I honestly can't see how to support panic notifiers before kexec.
There is no way to tell what is being done, and all of the pieces,
including smp_send_stop, are known to be buggy.

It isn't like this latest set of patches was reviewed/tested much
better, as the first patch was wrong.

Eric

Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available

2015-07-13 Thread Eric W. Biederman
dwal...@fifo99.com writes:

 On Fri, Jul 10, 2015 at 08:41:28AM -0500, Eric W. Biederman wrote:
 Hidehiro Kawai hidehiro.kawai...@hitachi.com writes:
 
  You can call panic notifiers and kmsg dumpers before kdump by
  specifying crash_kexec_post_notifiers as a boot parameter.
  However, it doesn't make sense if kdump is not available.  In that
  case, disable crash_kexec_post_notifiers boot parameter so that
  you can't change the value of the parameter.
 
 Nacked-by: Eric W. Biederman ebied...@xmission.com

 I think it would make sense if he just replaced kdump with kexec.

It would be less insane, however it still makes no sense as without
kexec on panic support crash_kexec is a noop.  So the value of the
setting makes no difference.

Eric

Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available

2015-07-10 Thread Eric W. Biederman
Hidehiro Kawai hidehiro.kawai...@hitachi.com writes:

 You can call panic notifiers and kmsg dumpers before kdump by
 specifying crash_kexec_post_notifiers as a boot parameter.
 However, it doesn't make sense if kdump is not available.  In that
 case, disable crash_kexec_post_notifiers boot parameter so that
 you can't change the value of the parameter.

Nacked-by: Eric W. Biederman ebied...@xmission.com

You are confusing kexec on panic and CONFIG_CRASH_DUMP,
which is about the tools for reading the state of the previous kernel.

Eric

 Signed-off-by: Hidehiro Kawai hidehiro.kawai...@hitachi.com
 Cc: Andrew Morton a...@linux-foundation.org
 Cc: Eric Biederman ebied...@xmission.com
 Cc: Vivek Goyal vgo...@redhat.com
 ---
  kernel/panic.c |2 ++
  1 file changed, 2 insertions(+)

 diff --git a/kernel/panic.c b/kernel/panic.c
 index 04e91ff..5252331 100644
 --- a/kernel/panic.c
 +++ b/kernel/panic.c
 @@ -502,12 +502,14 @@ __visible void __stack_chk_fail(void)
  core_param(pause_on_oops, pause_on_oops, int, 0644);
  core_param(panic_on_warn, panic_on_warn, int, 0644);
  
 +#ifdef CONFIG_CRASH_DUMP
  static int __init setup_crash_kexec_post_notifiers(char *s)
  {
   crash_kexec_post_notifiers = true;
   return 0;
  }
  early_param(crash_kexec_post_notifiers, setup_crash_kexec_post_notifiers);
 +#endif
  
  static int __init oops_setup(char *s)
  {

Re: [PATCH] arch: configuration, deleting 'CONFIG_BUG' since always need it.

2013-05-24 Thread Eric W. Biederman
Chen Gang gang.c...@asianux.com writes:

 The crazy user can unset 'CONFIG_BUG' in menuconfig: General setup ->
 Configure standard kernel features (expert users) -> BUG() Support.

 But in fact, we always need it, and quite a few architectures have
 already implemented it (e.g. alpha, arc, arm, avr32, blackfin, cris,
 frv, ia64, m68k, mips, mn10300, parisc, powerpc, s390, sh, sparc, x86).

 And the kernel has also already prepared a default effective implementation
 for the architectures which are unwilling to implement it themselves
 (e.g. arm64, c6x, h8300, hexagon, m32r, metag, microblaze, openrisc,
 score, tile, um, unicore32, xtensa).

 So we should get rid of 'CONFIG_BUG' and leave it always enabled everywhere.


This looks like the right way to handle this to me.  If the BUG
annotations are too big and not needed, they should simply be deleted
from the code base.  Disabling CONFIG_BUG, which removes the BUG
annotations from the binaries without modifying the source code, seems
like the wrong approach.

Acked-by: Eric W. Biederman ebied...@xmission.com


Re: [PATCH] arch: configuration, deleting 'CONFIG_BUG' since always need it.

2013-05-23 Thread Eric W. Biederman
Arnd Bergmann a...@arndb.de writes:

 On Thursday 23 May 2013, Geert Uytterhoeven wrote:
  The problem is: trying to fix that will mean the result is a larger
  kernel than if you just do the usual arch-implemented thing of placing
  a defined faulting instruction at the BUG() site - which defeats the
  purpose of turning off CONFIG_BUG.
 
 Is __builtin_unreachable() working well these days?
 

 Hmm, I just tried the trivial patch below, which seemed to do the right thing.
 Needs a little more investigation, but that might actually be the correct
 solution. I thought that at some point __builtin_unreachable() was the same
 as do {} while (1), but this is not the case with the gcc I was using --
 it just tells gcc that we don't expect to ever get here.

Yes.

We already have this abstracted in compiler.h as the macro unreachable,
so the slight modification of your patch below should handle this case.

For compilers without __builtin_unreachable(), unreachable() expands to
do {} while(1), but an infinite loop seems reasonable and preserves the
semantics of the code, unlike the current no-op that is do {} while(0).

 Signed-off-by: Arnd Bergmann a...@arndb.de

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 7d10f96..9afff7d 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -108,11 +108,11 @@ extern void warn_slowpath_null(const char *file, const 
int line);
 
 #else /* !CONFIG_BUG */
 #ifndef HAVE_ARCH_BUG
-#define BUG() do {} while(0)
+#define BUG() unreachable ()
 #endif
 
 #ifndef HAVE_ARCH_BUG_ON
-#define BUG_ON(condition) do { if (condition) ; } while(0)
+#define BUG_ON(condition) do { if (condition) unreachable(); } while(0)
 #endif
 
 #ifndef HAVE_ARCH_WARN_ON


Re: [PATCH 1/2] kdump: Allow shrinking of kdump region to be overridden

2010-08-24 Thread Eric W. Biederman
Anton Blanchard an...@samba.org writes:

 On ppc64 the crashkernel region almost always overlaps an area of firmware.
 This works fine except when using the sysfs interface to reduce the kdump
 region. If we free the firmware area we are guaranteed to crash.

That is a ppc64 bug.  Firmware should not be in the reserved region.  Any
random kernel-like thing can be put into that region at any valid
address, and the fact that shrinking the region frees your firmware means
that using that region could also stomp on your firmware (which I assume
would be a bad thing).

So please fix the ppc64 reservation.

Eric


Re: [Patch v2] kexec: increase max of kexec segments and use dynamic allocation

2010-07-27 Thread Eric W. Biederman
Milton Miller milt...@bga.com writes:

 [ Added kexec at lists.infradead.org and linuxppc-dev@lists.ozlabs.org ]

 
 Currently KEXEC_SEGMENT_MAX is only 16, which is too small for machines with
 many memory ranges.  When hibernating on a machine with disjoint memory we
 need one segment for each memory region. Increase this hard limit to 16K,
 which is reasonably large.
 
 And change ->segment from a static array to dynamically allocated memory.
 
 Cc: Neil Horman nhor...@redhat.com
 Cc: huang ying huang.ying.cari...@gmail.com
 Cc: Eric W. Biederman ebied...@xmission.com
 Signed-off-by: WANG Cong amw...@redhat.com
 
 ---
 diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
 b/arch/powerpc/kernel/machine_kexec_64.c
 index ed31a29..f115585 100644
 --- a/arch/powerpc/kernel/machine_kexec_64.c
 +++ b/arch/powerpc/kernel/machine_kexec_64.c
 @@ -131,10 +131,7 @@ static void copy_segments(unsigned long ind)
  void kexec_copy_flush(struct kimage *image)
  {
  long i, nr_segments = image->nr_segments;
 -struct  kexec_segment ranges[KEXEC_SEGMENT_MAX];
 -
 -/* save the ranges on the stack to efficiently flush the icache */
 -memcpy(ranges, image->segment, sizeof(ranges));
 +struct  kexec_segment range;

 I'm glad you found our copy on the stack and removed the stack overflow
 that comes with this bump, but ...

  
  /*
   * After this call we may not use anything allocated in dynamic
 @@ -148,9 +145,11 @@ void kexec_copy_flush(struct kimage *image)
   * we need to clear the icache for all dest pages sometime,
   * including ones that were in place on the original copy
   */
 -for (i = 0; i < nr_segments; i++)
 -flush_icache_range((unsigned long)__va(ranges[i].mem),
 -(unsigned long)__va(ranges[i].mem + ranges[i].memsz));
 +for (i = 0; i < nr_segments; i++) {
 +memcpy(&range, &image->segment[i], sizeof(range));
 +flush_icache_range((unsigned long)__va(range.mem),
 +(unsigned long)__va(range.mem + range.memsz));
 +}
  }

 This is executed after the copy, so as it says,
 we may not use anything allocated in dynamic memory.

 We could allocate control pages to copy the segment list into.
 Actually ppc64 doesn't use the existing control page, but that
 is only 4kB today.

 We need the list to icache flush all the pages in all the segments,
 as the indirect list doesn't include pages that were allocated at
 their destination.

An interesting point.

 Or maybe the icache flush should be done in the generic code
 like it does for crash load segments?

Please.  I don't quite understand the icache flush requirement,
but we really should not be looking at the segments in the
architecture-specific code.

Ideally we would only keep the segment information around for
the duration of the kexec_load syscall and not have it when
it comes time to start the second kernel.

I am puzzled.  We should be completely replacing the page tables, so
can't we just do a global flush?  Perhaps I am being naive about what
is required for a ppc flush.

Eric


  1   2   >