Re: [PATCH 1/1] livepatch: Rename KLP_* to KLP_TRANSITION_*

2024-05-07 Thread Josh Poimboeuf
On Tue, May 07, 2024 at 01:01:11PM +0800, zhangwar...@gmail.com wrote:
> From: Wardenjohn 
> 
> The original KLP_* macros describe the state of the klp transition.
> Rename the KLP_* macros to KLP_TRANSITION_* to fix the confusing
> description of the klp transition state.
> 
> Signed-off-by: Wardenjohn 

Acked-by: Josh Poimboeuf 

-- 
Josh



Re: [PATCH 0/1] *** Replace KLP_* to KLP_TRANSITION_* ***

2024-05-06 Thread Josh Poimboeuf
On Tue, May 07, 2024 at 10:56:09AM +0800, zhang warden wrote:
> 
> 
> > On May 7, 2024, at 10:41, Josh Poimboeuf  wrote:
> > 
> > On Tue, May 07, 2024 at 10:21:40AM +0800, zhang warden wrote:
> >> 
> >> 
> >>> 
> >>> transition state. With these macros renamed, comments are not
> >>> necessary at this point.
> >>> 
> >> Sorry for my carelessness, the comment still remains in the code. However,
> >> the comment in the code does no harm here. Maybe it can be kept.
> > 
> > The comments aren't actually correct.
> > 
> > For example, KLP_TRANSITION_UNPATCHED doesn't always mean "transitioning
> > to unpatched state".  Sometimes it means "transitioning *from* unpatched
> > state".
> > 
> > -- 
> > Josh
> 
> OK, I got it. I will remove the comment later. After all, the comment is
> not necessary at this point once we rename the macros.

Yeah, removing them altogether might be best, as the meaning of these
can vary slightly depending on the operation (patching vs unpatching),
and also depending on where it's stored (task->patch_state vs klp_target_state).
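
To illustrate the two readings (purely illustrative assignments, not a
quote of the code): while patching,

    klp_target_state  = KLP_TRANSITION_PATCHED;    /* goal of the operation */
    task->patch_state = KLP_TRANSITION_UNPATCHED;  /* this task hasn't been
                                                      switched yet, i.e. it is
                                                      transitioning *from* the
                                                      unpatched state */

whereas while unpatching, the same KLP_TRANSITION_UNPATCHED value in
klp_target_state really does mean "transitioning *to* unpatched".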

-- 
Josh



Re: [PATCH 0/1] *** Replace KLP_* to KLP_TRANSITION_* ***

2024-05-06 Thread Josh Poimboeuf
On Tue, May 07, 2024 at 10:21:40AM +0800, zhang warden wrote:
> 
> 
> > 
> > transition state. With these macros renamed, comments are not
> > necessary at this point.
> > 
> Sorry for my carelessness, the comment still remains in the code. However,
> the comment in the code does no harm here. Maybe it can be kept.

The comments aren't actually correct.

For example, KLP_TRANSITION_UNPATCHED doesn't always mean "transitioning
to unpatched state".  Sometimes it means "transitioning *from* unpatched
state".

-- 
Josh



Re: [PATCH] livepatch.h: Add comment to klp transition state

2024-05-05 Thread Josh Poimboeuf
On Mon, Apr 29, 2024 at 03:26:28PM +0800, zhangwar...@gmail.com wrote:
> From: Wardenjohn 
> 
> livepatch.h uses KLP_UNDEFINED/KLP_UNPATCHED/KLP_PATCHED for the klp
> transition state.
> When livepatch is ready but idle, using KLP_UNDEFINED seems very confusing.
> In order not to introduce potential risks to the kernel, just update the
> comments for these states.
> ---
>  include/linux/livepatch.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 9b9b38e89563..b6a214f2f8e3 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -18,9 +18,9 @@
>  #if IS_ENABLED(CONFIG_LIVEPATCH)
>  
>  /* task patch states */
> -#define KLP_UNDEFINED-1
> -#define KLP_UNPATCHED 0
> -#define KLP_PATCHED   1
> +#define KLP_UNDEFINED-1 /* idle, no transition in progress */
> +#define KLP_UNPATCHED 0 /* transitioning to unpatched state */
> +#define KLP_PATCHED   1 /* transitioning to patched state */

Instead of the comments, how about we just rename them to

  KLP_TRANSITION_IDLE
  KLP_TRANSITION_UNPATCHED
  KLP_TRANSITION_PATCHED

which shouldn't break userspace AFAIK.
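
I.e., keep the same values and just change the names; roughly:

  /* task patch states */
  #define KLP_TRANSITION_IDLE      -1
  #define KLP_TRANSITION_UNPATCHED  0
  #define KLP_TRANSITION_PATCHED    1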

-- 
Josh



Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-18 Thread Josh Poimboeuf
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote:
> On Sat, 13 Apr 2024 12:53:38 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> > 
> > > > Anyway, since we typically run stuff from NMI context, accessing user
> > > > data is 'interesting'. As such I would really like to make this work
> > > > depend on the call-graph rework that pushes all the user access bits
> > > > into return-to-user.  
> > > 
> > > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > > could look at and think about what a rebase looks like? Or do you have
> > > someone in mind I should work with for this?  
> > 
> > I've been offline for a little while and still need to catch up with
> > things myself.
> > 
> > Josh was working on that when I dropped off IIRC, I'm not entirely sure
> > where things are at currently (and there is no way I can ever hope to
> > process the backlog).
> > 
> > Anybody know where we are with that?
> 
> It's still very much on my RADAR, but with layoffs and such, my
> priorities have unfortunately changed. I'm hoping to start helping out
> in the near future though (in a month or two).
> 
> Josh was working on it, but I think he got pulled off onto other
> priorities too :-p

Yeah, this is still a priority for me and I hope to get back to it over
the next few weeks (crosses fingers).

-- 
Josh



Re: [POC 0/7] livepatch: Make livepatch states, callbacks, and shadow variables work together

2023-11-10 Thread Josh Poimboeuf
On Fri, Nov 10, 2023 at 06:04:21PM +0100, Petr Mladek wrote:
> This POC is a material for the discussion "Simplify Livepatch Callbacks,
> Shadow Variables, and States handling" at LPC 2023, see
> https://lpc.events/event/17/contributions/1541/
> 
> It obsoletes the patchset adding the garbage collection of shadow
> variables. This new solution is based on ideas from Nicolai Stange.
> And it should also be in sync with Josh's ideas mentioned into
> the thread about the garbage collection, see
> https://lore.kernel.org/r/20230204235910.4j4ame5ntqogqi7m@treble

Nice!  I like how it brings the "features" together and makes them easy
to use.  This looks like a vast improvement.

Was there a reason to change the naming?  I'm thinking

  setup / enable / disable / release

is less precise than

  pre_patch / post_patch / pre_unpatch / post_unpatch.

Also, I'm thinking "replaced" instead of "obsolete" would be more
consistent with the existing terminology.

For example, in __klp_enable_patch():

ret = klp_setup_states(patch);
if (ret)
goto err;

if (patch->replace)
klp_disable_obsolete_states(patch);

it's not immediately clear why "disable obsolete" would be the "replace"
counterpart to "setup".

Similarly, in klp_complete_transition():

if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) {
klp_unpatch_replaced_patches(klp_transition_patch);
klp_discard_nops(klp_transition_patch);
klp_release_obsolete_states(klp_transition_patch);
}

it's a little jarring to have "unpatch replaced" followed by "release
obsolete".

So instead of:

  int  klp_setup_states(struct klp_patch *patch);
  void klp_enable_states(struct klp_patch *patch);
  void klp_disable_states(struct klp_patch *patch);
  void klp_release_states(struct klp_patch *patch);

  void klp_enable_obsolete_states(struct klp_patch *patch);
  void klp_disable_obsolete_states(struct klp_patch *patch);
  void klp_release_obsolete_states(struct klp_patch *patch);

how about something like:

  int  klp_states_pre_patch(void);
  void klp_states_post_patch(void);
  void klp_states_pre_unpatch(void);
  void klp_states_post_unpatch(void);

  void klp_states_post_patch_replaced(void);
  void klp_states_pre_unpatch_replaced(void);
  void klp_states_post_unpatch_replaced(void);

?

(note that passing klp_transition_patch might be optional since state.c
already has visibility to it anyway)
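
With that naming, the __klp_enable_patch() hunk above would read roughly
like this (just a sketch of the mapping, not a tested patch):

	ret = klp_states_pre_patch();
	if (ret)
		goto err;

	if (patch->replace)
		klp_states_pre_unpatch_replaced();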

-- 
Josh



Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"

2023-11-09 Thread Josh Poimboeuf
On Thu, Nov 09, 2023 at 02:50:48PM -0800, Ankur Arora wrote:
> >> I guess I'm not fully understanding what the cond rescheds are for. But
> >> would an IPI to all CPUs setting NEED_RESCHED, fix it?
> 
> Yeah. We could just temporarily toggle to full preemption, when
> NEED_RESCHED_LAZY is always upgraded to NEED_RESCHED which will
> then send IPIs.
> 
> > If all livepatch arches had the ORC unwinder, yes.
> >
> > The problem is that frame pointer (and similar) unwinders can't reliably
> > unwind past an interrupt frame.
> 
> Ah, I wonder if we could just disable the preempt_schedule_irq() path
> temporarily? Hooking into schedule() alongside something like this:
> 
> @@ -379,7 +379,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs 
> *regs)
> 
>  void irqentry_exit_cond_resched(void)
>  {
> -   if (!preempt_count()) {
> +   if (klp_cond_resched_disable() && !preempt_count()) {
> 
> The problem would be tasks that don't go through any preemptible
> sections.

Let me back up a bit and explain what klp is trying to do.

When a livepatch is applied, klp needs to unwind all the tasks,
preferably within a reasonable amount of time.

We can't unwind task A from task B while task A is running, since task A
could be changing the stack during the unwind.  So task A needs to be
blocked or asleep.  The only exception to that is if the unwind happens
in the context of task A itself.

The problem we were seeing was CPU-bound kthreads (e.g., vhost_worker)
not getting patched within a reasonable amount of time.  We fixed it by
hooking the klp unwind into cond_resched() so it can unwind from the
task itself.

It only worked because we had a non-preempted hook (because non-ORC
unwinders can't unwind reliably through preemption) which called klp to
unwind from the context of the task.

Without something to hook into, we have a problem.  We could of course
hook into schedule(), but if the kthread never calls schedule() from a
non-preempted context then it still doesn't help.
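
For reference, the cond_resched() hook boils down to something like this
(a heavily simplified sketch; the real code in kernel/livepatch/transition.c
also has to juggle PREEMPT_DYNAMIC):

	void __klp_sched_try_switch(void)
	{
		if (likely(!klp_patch_pending(current)))
			return;

		/* current is at a known-safe point, try to transition it */
		klp_try_switch_task(current);
	}

with cond_resched() gaining a call into the above.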

-- 
Josh



Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"

2023-11-09 Thread Josh Poimboeuf
On Thu, Nov 09, 2023 at 12:31:47PM -0500, Steven Rostedt wrote:
> On Thu, 9 Nov 2023 09:26:37 -0800
> Josh Poimboeuf  wrote:
> 
> > On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
> > > On Tue,  7 Nov 2023 13:56:53 -0800
> > > Ankur Arora  wrote:
> > >   
> > > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> > > > 
> > > > Note that removing this commit reintroduces "live patches failing to
> > > > complete within a reasonable amount of time due to CPU-bound kthreads."
> > > > 
> > > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> > > > existence of cond_resched() so this will need an alternate fix.  
> > 
> > We definitely don't want to introduce a regression, something will need
> > to be figured out before removing cond_resched().
> > 
> > We could hook into preempt_schedule_irq(), but that wouldn't work for
> > non-ORC.
> > 
> > Another option would be to hook into schedule().  Then livepatch could
> > set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
> > through the preemption path then we have the same problem for non-ORC.
> > 
> > Worst case we'll need to sprinkle cond_livepatch() everywhere :-/
> > 
> 
> I guess I'm not fully understanding what the cond rescheds are for. But
> would an IPI to all CPUs setting NEED_RESCHED, fix it?

If all livepatch arches had the ORC unwinder, yes.

The problem is that frame pointer (and similar) unwinders can't reliably
unwind past an interrupt frame.

-- 
Josh



Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"

2023-11-09 Thread Josh Poimboeuf
On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote:
> On Tue,  7 Nov 2023 13:56:53 -0800
> Ankur Arora  wrote:
> 
> > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8.
> > 
> > Note that removing this commit reintroduces "live patches failing to
> > complete within a reasonable amount of time due to CPU-bound kthreads."
> > 
> > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and
> > existence of cond_resched() so this will need an alternate fix.

We definitely don't want to introduce a regression, something will need
to be figured out before removing cond_resched().

We could hook into preempt_schedule_irq(), but that wouldn't work for
non-ORC.

Another option would be to hook into schedule().  Then livepatch could
set TIF_NEED_RESCHED on remaining unpatched tasks.  But again if they go
through the preemption path then we have the same problem for non-ORC.

Worst case we'll need to sprinkle cond_livepatch() everywhere :-/
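
Where cond_livepatch() would be something like the below (the name is made
up here, but klp_patch_pending() and klp_update_patch_state() are the
existing helpers a task can use on itself at a known-safe point):

	static inline void cond_livepatch(void)
	{
		if (unlikely(klp_patch_pending(current)))
			klp_update_patch_state(current);
	}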

-- 
Josh



Re: [PATCH 01/15] objtool: Find a destination for jumps beyond the section end

2021-04-20 Thread Josh Poimboeuf
On Tue, Apr 20, 2021 at 01:25:43PM -0700, Sami Tolvanen wrote:
> On Tue, Apr 20, 2021 at 11:14 AM Josh Poimboeuf  wrote:
> >
> > On Fri, Apr 16, 2021 at 01:38:30PM -0700, Sami Tolvanen wrote:
> > > With -ffunction-sections, Clang can generate a jump beyond the end of
> > > a section when the section ends in an unreachable instruction.
> >
> > Why?  Can you show an example?
> 
> Here's the warning I'm seeing when building allyesconfig + CFI:
> 
> vmlinux.o: warning: objtool:
> rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c()+0x149:
> can't find jump dest instruction at
> .text.rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c+0x7dc
> 
> $ objdump -d -r -j
> .text.rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c
> vmlinux.o
>  :
>   ...
>  149:   0f 85 8d 06 00 00   jne    7dc <.compoundliteral.4>
>   ...
>  7d7:   e8 00 00 00 00  callq  7dc <.compoundliteral.4>
> 7d8: R_X86_64_PLT32 __stack_chk_fail-0x4

Instead of silencing the warning by faking the jump destination, I'd
rather improve the warning to something like

  "warning: rockchip_spi_transfer_one() falls through to the next function"

which is what we normally do in this type of situation.

It may be caused by UB, or a compiler bug, but either way we should
figure out the root cause.

-- 
Josh



Re: [PATCH 02/15] objtool: Add CONFIG_CFI_CLANG support

2021-04-20 Thread Josh Poimboeuf
On Fri, Apr 16, 2021 at 01:38:31PM -0700, Sami Tolvanen wrote:
> +static int fix_cfi_relocs(const struct elf *elf)
> +{
> + struct section *sec, *tmpsec;
> + struct reloc *reloc, *tmpreloc;
> +
> + list_for_each_entry_safe(sec, tmpsec, &elf->sections, list) {
> + list_for_each_entry_safe(reloc, tmpreloc, &sec->reloc_list, list) {

These loops don't remove structs from the lists, so are the "safe"
iterators really needed?
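
I.e., plain iteration should be enough here, something like:

	list_for_each_entry(sec, &elf->sections, list) {
		list_for_each_entry(reloc, &sec->reloc_list, list) {
			/* ... per-reloc fixup ... */
		}
	}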

> + struct symbol *sym;
> + struct reloc *func_reloc;
> +
> + /*
> +  * CONFIG_CFI_CLANG replaces function relocations to 
> refer
> +  * to an intermediate jump table. Undo the conversion so
> +  * objtool can make sense of things.
> +  */

I think this comment could be clearer if it were placed above the
function.

> + if (!reloc->sym->sec->cfi_jt)
> + continue;
> +
> + if (reloc->sym->type == STT_SECTION)
> + sym = find_func_by_offset(reloc->sym->sec,
> +   reloc->addend);
> + else
> + sym = reloc->sym;
> +
> + if (!sym || !sym->sec)
> + continue;

This should be a fatal error, it probably means something's gone wrong
with the reading of the jump table.

> +
> + /*
> +  * The jump table immediately jumps to the actual 
> function,
> +  * so look up the relocation there.
> +  */
> + func_reloc = find_reloc_by_dest_range(elf, sym->sec, 
> sym->offset, 5);

The jump instruction's reloc is always at offset 1, so this can maybe be
optimized to:

func_reloc = find_reloc_by_dest(elf, sym->sec, 
sym->offset+1);

though of course that will probably break (future) arm64 objtool.  Maybe
an arch-specific offset from the start of the insn would be good.
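
Something like the below hypothetical helper (x86 value shown) would keep
it arch-agnostic:

	/* offset of the jump destination reloc from the start of the insn */
	unsigned long arch_jump_table_reloc_offset(void)
	{
		return 1;	/* x86: rel32 immediately follows the 1-byte opcode */
	}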

> + if (!func_reloc || !func_reloc->sym)
> + continue;

This should also be an error.

> +
> + reloc->sym = func_reloc->sym;

I think we should also do 'reloc->addend = 0', because of the
STT_SECTION case:

  0025  259e000b R_X86_64_32S    
.text..L.cfi.jumptable.8047 + 8

which then translates to

  0009  06320004 R_X86_64_PLT32  
x2apic_prepare_cpu - 4

so the original addend of '8' is no longer valid.  Copying the '-4'
wouldn't be right either, because that only applies to a PLT32 text
reloc.  A '0' should be appropriate for the 32S data reloc.

This addend might not actually be read by any code, so it may not
currently be an actual bug, but better safe than sorry.

-- 
Josh



Re: [PATCH 01/15] objtool: Find a destination for jumps beyond the section end

2021-04-20 Thread Josh Poimboeuf
On Fri, Apr 16, 2021 at 01:38:30PM -0700, Sami Tolvanen wrote:
> With -ffunction-sections, Clang can generate a jump beyond the end of
> a section when the section ends in an unreachable instruction.

Why?  Can you show an example?

-- 
Josh



[tip: objtool/core] x86/crypto/aesni-intel_avx: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: e163be86fff3deec70f63330fc43fedf892c9aee
Gitweb:
https://git.kernel.org/tip/e163be86fff3deec70f63330fc43fedf892c9aee
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:17 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00

x86/crypto/aesni-intel_avx: Standardize stack alignment prologue

Use RBP instead of R14 for saving the old stack pointer before
realignment.  This resembles what compilers normally do.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/02d00a0903a0959f4787e186e2a07d271e1f63d4.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 188f184..98e3552 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -251,22 +251,20 @@ VARIABLE_OFFSET = 16*8
 .macro FUNC_SAVE
 push%r12
 push%r13
-push%r14
 push%r15
 
-mov %rsp, %r14
-
-
+   push%rbp
+   mov %rsp, %rbp
 
 sub $VARIABLE_OFFSET, %rsp
 and $~63, %rsp# align rsp to 64 bytes
 .endm
 
 .macro FUNC_RESTORE
-mov %r14, %rsp
+mov %rbp, %rsp
+   pop %rbp
 
 pop %r15
-pop %r14
 pop %r13
 pop %r12
 .endm


[tip: objtool/core] x86/crypto/aesni-intel_avx: Remove unused macros

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 4f08300916e882a0c34a2f325ff3fea2be2e57b3
Gitweb:
https://git.kernel.org/tip/4f08300916e882a0c34a2f325ff3fea2be2e57b3
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:15 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:33 -05:00

x86/crypto/aesni-intel_avx: Remove unused macros

These macros are no longer used; remove them.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/53f7136ea93ebdbca399959e6d2991ecb46e733e.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 2cf8e94..4fdf38e 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -212,10 +212,6 @@ HashKey_8_k= 16*21   # store XOR of HashKey^8 <<1 mod 
poly here (for Karatsu
 #define arg4 %rcx
 #define arg5 %r8
 #define arg6 %r9
-#define arg7 STACK_OFFSET+8*1(%r14)
-#define arg8 STACK_OFFSET+8*2(%r14)
-#define arg9 STACK_OFFSET+8*3(%r14)
-#define arg10 STACK_OFFSET+8*4(%r14)
 #define keysize 2*15*16(arg1)
 
 i = 0
@@ -237,9 +233,6 @@ define_reg j %j
 .noaltmacro
 .endm
 
-# need to push 4 registers into stack to maintain
-STACK_OFFSET = 8*4
-
 TMP1 =   16*0# Temporary storage for AAD
 TMP2 =   16*1# Temporary storage for AES State 2 (State 1 is stored in an 
XMM register)
 TMP3 =   16*2# Temporary storage for AES State 3
@@ -256,7 +249,6 @@ VARIABLE_OFFSET = 16*8
 
 
 .macro FUNC_SAVE
-#the number of pushes must equal STACK_OFFSET
 push%r12
 push%r13
 push%r14


[tip: objtool/core] x86/crypto/aesni-intel_avx: Fix register usage comments

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: ff5796b6dbea4763fdca002101e32b60aa17f8e8
Gitweb:
https://git.kernel.org/tip/ff5796b6dbea4763fdca002101e32b60aa17f8e8
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:16 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:33 -05:00

x86/crypto/aesni-intel_avx: Fix register usage comments

Fix register usage comments to match reality.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/8655d4513a0ed1eddec609165064153973010aa2.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S 
b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 4fdf38e..188f184 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -286,7 +286,7 @@ VARIABLE_OFFSET = 16*8
 
 # combined for GCM encrypt and decrypt functions
 # clobbering all xmm registers
-# clobbering r10, r11, r12, r13, r14, r15
+# clobbering r10, r11, r12, r13, r15, rax
 .macro  GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 
GHASH_MUL ENC_DEC REP
 vmovdqu AadHash(arg2), %xmm8
 vmovdqu  HashKey(arg2), %xmm13  # xmm13 = HashKey
@@ -988,7 +988,7 @@ _partial_block_done_\@:
 ## num_initial_blocks = b mod 4#
 ## encrypt the initial num_initial_blocks blocks and apply ghash on the 
ciphertext
 ## r10, r11, r12, rax are clobbered
-## arg1, arg3, arg4, r14 are used as a pointer only, not modified
+## arg1, arg2, arg3, arg4 are used as pointers only, not modified
 
 .macro INITIAL_BLOCKS_AVX REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 
XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC
i = (8-\num_initial_blocks)
@@ -1223,7 +1223,7 @@ _initial_blocks_done\@:
 
 # encrypt 8 blocks at a time
 # ghash the 8 previously encrypted ciphertext blocks
-# arg1, arg3, arg4 are used as pointers only, not modified
+# arg1, arg2, arg3, arg4 are used as pointers only, not modified
 # r11 is the data offset value
 .macro GHASH_8_ENCRYPT_8_PARALLEL_AVX REP T1 T2 T3 T4 T5 T6 CTR XMM1 XMM2 XMM3 
XMM4 XMM5 XMM6 XMM7 XMM8 T7 loop_idx ENC_DEC
 
@@ -1936,7 +1936,7 @@ SYM_FUNC_END(aesni_gcm_finalize_avx_gen2)
 ## num_initial_blocks = b mod 4#
 ## encrypt the initial num_initial_blocks blocks and apply ghash on the 
ciphertext
 ## r10, r11, r12, rax are clobbered
-## arg1, arg3, arg4, r14 are used as a pointer only, not modified
+## arg1, arg2, arg3, arg4 are used as pointers only, not modified
 
 .macro INITIAL_BLOCKS_AVX2 REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 
XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC VER
i = (8-\num_initial_blocks)
@@ -2178,7 +2178,7 @@ _initial_blocks_done\@:
 
 # encrypt 8 blocks at a time
 # ghash the 8 previously encrypted ciphertext blocks
-# arg1, arg3, arg4 are used as pointers only, not modified
+# arg1, arg2, arg3, arg4 are used as pointers only, not modified
 # r11 is the data offset value
 .macro GHASH_8_ENCRYPT_8_PARALLEL_AVX2 REP T1 T2 T3 T4 T5 T6 CTR XMM1 XMM2 
XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T7 loop_idx ENC_DEC
 


[tip: objtool/core] objtool: Support asm jump tables

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 99033461e685b48549ec77608b4bda75ddf772ce
Gitweb:
https://git.kernel.org/tip/99033461e685b48549ec77608b4bda75ddf772ce
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:14 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:32 -05:00

objtool: Support asm jump tables

Objtool detection of asm jump tables would normally just work, except
for the fact that asm retpolines use alternatives.  Objtool thinks the
alternative code path (a jump to the retpoline) is a sibling call.

Don't treat alternative indirect branches as sibling calls when the
original instruction has a jump table.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/460cf4dc675d64e1124146562cabd2c05aa322e8.1614182415.git.jpoim...@redhat.com
---
 tools/objtool/check.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index a0f762a..46621e8 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -108,6 +108,18 @@ static struct instruction *prev_insn_same_sym(struct 
objtool_file *file,
for (insn = next_insn_same_sec(file, insn); insn;   \
 insn = next_insn_same_sec(file, insn))
 
+static bool is_jump_table_jump(struct instruction *insn)
+{
+   struct alt_group *alt_group = insn->alt_group;
+
+   if (insn->jump_table)
+   return true;
+
+   /* Retpoline alternative for a jump table? */
+   return alt_group && alt_group->orig_group &&
+  alt_group->orig_group->first_insn->jump_table;
+}
+
 static bool is_sibling_call(struct instruction *insn)
 {
/*
@@ -120,7 +132,7 @@ static bool is_sibling_call(struct instruction *insn)
 
/* An indirect jump is either a sibling call or a jump to a table. */
if (insn->type == INSN_JUMP_DYNAMIC)
> -   return list_empty(&insn->alts);
+   return !is_jump_table_jump(insn);
 
/* add_jump_destinations() sets insn->call_dest for sibling calls. */
return (is_static_jump(insn) && insn->call_dest);


[tip: objtool/core] x86/crypto/sha512-avx2: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: ec063e090bd6487097d459bb4272508b78448270
Gitweb:
https://git.kernel.org/tip/ec063e090bd6487097d459bb4272508b78448270
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:24 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00

x86/crypto/sha512-avx2: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/b1a7b29fcfc65d60a3b6e77ef75f4762a5b8488d.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha512-avx2-asm.S | 42 ++
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S 
b/arch/x86/crypto/sha512-avx2-asm.S
index 3a44bdc..072cb0f 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -102,17 +102,13 @@ SRND_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
 CTX_SIZE = 1*8
-RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 5*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
 frame_INP = frame_SRND + SRND_SIZE
 frame_INPEND = frame_INP + INP_SIZE
 frame_CTX = frame_INPEND + INPEND_SIZE
-frame_RSPSAVE = frame_CTX + CTX_SIZE
-frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
-frame_size = frame_GPRSAVE + GPRSAVE_SIZE
+frame_size = frame_CTX + CTX_SIZE
 
 ## assume buffers not aligned
 #defineVMOVDQ vmovdqu
@@ -570,18 +566,18 @@ frame_size = frame_GPRSAVE + GPRSAVE_SIZE
 # "blocks" is the message length in SHA512 blocks
 
 SYM_FUNC_START(sha512_transform_rorx)
+   # Save GPRs
+   push%rbx
+   push%r12
+   push%r13
+   push%r14
+   push%r15
+
# Allocate Stack Space
-   mov %rsp, %rax
+   push%rbp
+   mov %rsp, %rbp
sub $frame_size, %rsp
and $~(0x20 - 1), %rsp
-   mov %rax, frame_RSPSAVE(%rsp)
-
-   # Save GPRs
-   mov %rbx, 8*0+frame_GPRSAVE(%rsp)
-   mov %r12, 8*1+frame_GPRSAVE(%rsp)
-   mov %r13, 8*2+frame_GPRSAVE(%rsp)
-   mov %r14, 8*3+frame_GPRSAVE(%rsp)
-   mov %r15, 8*4+frame_GPRSAVE(%rsp)
 
shl $7, NUM_BLKS# convert to bytes
jz  done_hash
@@ -672,15 +668,17 @@ loop2:
 
 done_hash:
 
-# Restore GPRs
-   mov 8*0+frame_GPRSAVE(%rsp), %rbx
-   mov 8*1+frame_GPRSAVE(%rsp), %r12
-   mov 8*2+frame_GPRSAVE(%rsp), %r13
-   mov 8*3+frame_GPRSAVE(%rsp), %r14
-   mov 8*4+frame_GPRSAVE(%rsp), %r15
-
# Restore Stack Pointer
-   mov frame_RSPSAVE(%rsp), %rsp
+   mov %rbp, %rsp
+   pop %rbp
+
+   # Restore GPRs
+   pop %r15
+   pop %r14
+   pop %r13
+   pop %r12
+   pop %rbx
+
ret
 SYM_FUNC_END(sha512_transform_rorx)
 


[tip: objtool/core] x86/crypto/sha_ni: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 35a0067d2c02a7c35466db5f207b7b9265de84d9
Gitweb:
https://git.kernel.org/tip/35a0067d2c02a7c35466db5f207b7b9265de84d9
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:20 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:35 -05:00

x86/crypto/sha_ni: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/5033e1a79867dff1b18e1b4d0783c38897d3f223.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha1_ni_asm.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/crypto/sha1_ni_asm.S b/arch/x86/crypto/sha1_ni_asm.S
index 11efe3a..5d8415f 100644
--- a/arch/x86/crypto/sha1_ni_asm.S
+++ b/arch/x86/crypto/sha1_ni_asm.S
@@ -59,8 +59,6 @@
 #define DATA_PTR   %rsi/* 2nd arg */
 #define NUM_BLKS   %rdx/* 3rd arg */
 
-#define RSPSAVE%rax
-
 /* gcc conversion */
 #define FRAME_SIZE 32  /* space for 2x16 bytes */
 
@@ -96,7 +94,8 @@
 .text
 .align 32
 SYM_FUNC_START(sha1_ni_transform)
-   mov %rsp, RSPSAVE
+   push%rbp
+   mov %rsp, %rbp
sub $FRAME_SIZE, %rsp
and $~0xF, %rsp
 
@@ -288,7 +287,8 @@ SYM_FUNC_START(sha1_ni_transform)
pextrd  $3, E0, 1*16(DIGEST_PTR)
 
 .Ldone_hash:
-   mov RSPSAVE, %rsp
+   mov %rbp, %rsp
+   pop %rbp
 
ret
 SYM_FUNC_END(sha1_ni_transform)


[tip: objtool/core] x86/crypto/crc32c-pcl-intel: Standardize jump table

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 2b02ed55482a1c5c310a7f53707292fcf1601e7a
Gitweb:
https://git.kernel.org/tip/2b02ed55482a1c5c310a7f53707292fcf1601e7a
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:19 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00

x86/crypto/crc32c-pcl-intel: Standardize jump table

Simplify the jump table code so that it resembles a compiler-generated
table.

This enables ORC unwinding by allowing objtool to follow all the
potential code paths.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/5357a039def90b8ef6b5874ef12cda008ecf18ba.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S 
b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 884dc76..ac1f303 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -53,7 +53,7 @@
 .endm
 
 .macro JMPTBL_ENTRY i
-.word crc_\i - crc_array
+.quad crc_\i
 .endm
 
 .macro JNC_LESS_THAN j
@@ -168,10 +168,7 @@ continue_block:
xor crc2, crc2
 
## branch into array
-   lea jump_table(%rip), %bufp
-   movzwq  (%bufp, %rax, 2), len
-   lea crc_array(%rip), %bufp
-   lea (%bufp, len, 1), %bufp
+   mov jump_table(,%rax,8), %bufp
JMP_NOSPEC bufp
 



[tip: objtool/core] x86/crypto/camellia-aesni-avx2: Unconditionally allocate stack buffer

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: dabe5167a3cbb4bf16b20c0e5b6497513e2e3a08
Gitweb:
https://git.kernel.org/tip/dabe5167a3cbb4bf16b20c0e5b6497513e2e3a08
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:18 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00

x86/crypto/camellia-aesni-avx2: Unconditionally allocate stack buffer

A conditional stack allocation violates traditional unwinding
requirements when a single instruction can have differing stack layouts.

There's no benefit in allocating the stack buffer conditionally.  Just
do it unconditionally.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/85ac96613ee5784b6239c18d3f68b1f3c509caa3.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S 
b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index 782e971..706f708 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -990,6 +990,7 @@ SYM_FUNC_START(camellia_cbc_dec_32way)
 *  %rdx: src (32 blocks)
 */
FRAME_BEGIN
+   subq $(16 * 32), %rsp;
 
vzeroupper;
 
@@ -1002,7 +1003,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way)
 %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 %ymm15, %rdx, (key_table)(CTX, %r8, 8));
 
-   movq %rsp, %r10;
cmpq %rsi, %rdx;
je .Lcbc_dec_use_stack;
 
@@ -1015,7 +1015,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way)
 * dst still in-use (because dst == src), so use stack for temporary
 * storage.
 */
-   subq $(16 * 32), %rsp;
movq %rsp, %rax;
 
 .Lcbc_dec_continue:
@@ -1025,7 +1024,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way)
vpxor %ymm7, %ymm7, %ymm7;
vinserti128 $1, (%rdx), %ymm7, %ymm7;
vpxor (%rax), %ymm7, %ymm7;
-   movq %r10, %rsp;
vpxor (0 * 32 + 16)(%rdx), %ymm6, %ymm6;
vpxor (1 * 32 + 16)(%rdx), %ymm5, %ymm5;
vpxor (2 * 32 + 16)(%rdx), %ymm4, %ymm4;
@@ -1047,6 +1045,7 @@ SYM_FUNC_START(camellia_cbc_dec_32way)
 
vzeroupper;
 
+   addq $(16 * 32), %rsp;
FRAME_END
ret;
 SYM_FUNC_END(camellia_cbc_dec_32way)


[tip: objtool/core] x86/crypto/sha512-ssse3: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 27d26793f2105281d9374928448142777cef6f74
Gitweb:
https://git.kernel.org/tip/27d26793f2105281d9374928448142777cef6f74
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:25 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:37 -05:00

x86/crypto/sha512-ssse3: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/6ecaaac9f3828fbb903513bf90c34a08380a8e35.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha512-ssse3-asm.S | 41 +
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/sha512-ssse3-asm.S 
b/arch/x86/crypto/sha512-ssse3-asm.S
index 50812af..bd51c90 100644
--- a/arch/x86/crypto/sha512-ssse3-asm.S
+++ b/arch/x86/crypto/sha512-ssse3-asm.S
@@ -74,14 +74,10 @@ tmp0 =  %rax
 
 W_SIZE = 80*8
 WK_SIZE = 2*8
-RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 5*8
 
 frame_W = 0
 frame_WK = frame_W + W_SIZE
-frame_RSPSAVE = frame_WK + WK_SIZE
-frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
-frame_size = frame_GPRSAVE + GPRSAVE_SIZE
+frame_size = frame_WK + WK_SIZE
 
 # Useful QWORD "arrays" for simpler memory references
 # MSG, DIGEST, K_t, W_t are arrays
@@ -283,18 +279,18 @@ SYM_FUNC_START(sha512_transform_ssse3)
test msglen, msglen
je nowork
 
+   # Save GPRs
+   push%rbx
+   push%r12
+   push%r13
+   push%r14
+   push%r15
+
# Allocate Stack Space
-   mov %rsp, %rax
+   push%rbp
+   mov %rsp, %rbp
sub $frame_size, %rsp
and $~(0x20 - 1), %rsp
-   mov %rax, frame_RSPSAVE(%rsp)
-
-   # Save GPRs
-   mov %rbx, frame_GPRSAVE(%rsp)
-   mov %r12, frame_GPRSAVE +8*1(%rsp)
-   mov %r13, frame_GPRSAVE +8*2(%rsp)
-   mov %r14, frame_GPRSAVE +8*3(%rsp)
-   mov %r15, frame_GPRSAVE +8*4(%rsp)
 
 updateblock:
 
@@ -355,15 +351,16 @@ updateblock:
dec msglen
jnz updateblock
 
-   # Restore GPRs
-   mov frame_GPRSAVE(%rsp),  %rbx
-   mov frame_GPRSAVE +8*1(%rsp), %r12
-   mov frame_GPRSAVE +8*2(%rsp), %r13
-   mov frame_GPRSAVE +8*3(%rsp), %r14
-   mov frame_GPRSAVE +8*4(%rsp), %r15
-
# Restore Stack Pointer
-   mov frame_RSPSAVE(%rsp), %rsp
+   mov %rbp, %rsp
+   pop %rbp
+
+   # Restore GPRs
+   pop %r15
+   pop %r14
+   pop %r13
+   pop %r12
+   pop %rbx
 
 nowork:
ret


[tip: objtool/core] x86/crypto/sha256-avx2: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: ce5846668076aa76a17ab559f0296374e3611fec
Gitweb:
https://git.kernel.org/tip/ce5846668076aa76a17ab559f0296374e3611fec
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:22 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00

x86/crypto/sha256-avx2: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/8048e7444c49a8137f05265262b83dc50f8fb7f3.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha256-avx2-asm.S | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S 
b/arch/x86/crypto/sha256-avx2-asm.S
index 11ff60c..4087f74 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -117,15 +117,13 @@ _XMM_SAVE_SIZE= 0
 _INP_END_SIZE  = 8
 _INP_SIZE  = 8
 _CTX_SIZE  = 8
-_RSP_SIZE  = 8
 
 _XFER  = 0
 _XMM_SAVE  = _XFER + _XFER_SIZE
 _INP_END   = _XMM_SAVE + _XMM_SAVE_SIZE
 _INP   = _INP_END  + _INP_END_SIZE
 _CTX   = _INP  + _INP_SIZE
-_RSP   = _CTX  + _CTX_SIZE
-STACK_SIZE = _RSP  + _RSP_SIZE
+STACK_SIZE = _CTX  + _CTX_SIZE
 
 # rotate_Xs
 # Rotate values of symbols X0...X3
@@ -533,11 +531,11 @@ SYM_FUNC_START(sha256_transform_rorx)
pushq   %r14
pushq   %r15
 
-   mov %rsp, %rax
+   push%rbp
+   mov %rsp, %rbp
+
subq$STACK_SIZE, %rsp
and $-32, %rsp  # align rsp to 32 byte boundary
-   mov %rax, _RSP(%rsp)
-
 
shl $6, NUM_BLKS# convert to bytes
jz  done_hash
@@ -704,7 +702,8 @@ only_one_block:
 
 done_hash:
 
-   mov _RSP(%rsp), %rsp
+   mov %rbp, %rsp
+   pop %rbp
 
popq%r15
popq%r14


[tip: objtool/core] x86/crypto/sha1_avx2: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 20114c899cafa8313534a841cab0ab1f7ab09672
Gitweb:
https://git.kernel.org/tip/20114c899cafa8313534a841cab0ab1f7ab09672
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:21 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:35 -05:00

x86/crypto/sha1_avx2: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/fdaaf8670ed1f52f55ba9a6bbac98c1afddc1af6.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha1_avx2_x86_64_asm.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S 
b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
index 1e594d6..5eed620 100644
--- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
+++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
@@ -645,9 +645,9 @@ _loop3:
RESERVE_STACK  = (W_SIZE*4 + 8+24)
 
/* Align stack */
-   mov %rsp, %rbx
+   push%rbp
+   mov %rsp, %rbp
and $~(0x20-1), %rsp
-   push%rbx
sub $RESERVE_STACK, %rsp
 
avx2_zeroupper
@@ -665,8 +665,8 @@ _loop3:
 
avx2_zeroupper
 
-   add $RESERVE_STACK, %rsp
-   pop %rsp
+   mov %rbp, %rsp
+   pop %rbp
 
pop %r15
pop %r14


[tip: objtool/core] x86/crypto: Enable objtool in crypto code

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: 7d3d10e0e85fb7c23a86a70f795b1eabd2bc030b
Gitweb:
https://git.kernel.org/tip/7d3d10e0e85fb7c23a86a70f795b1eabd2bc030b
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:26 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:37 -05:00

x86/crypto: Enable objtool in crypto code

Now that all the stack alignment prologues have been cleaned up in the
crypto code, enable objtool.  Among other benefits, this will allow ORC
unwinding to work.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/fc2a1918c50e33e46ef0e9a5de02743f2f6e3639.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/Makefile | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b28e36b..d0959e7 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -2,8 +2,6 @@
 #
 # x86 crypto algorithms
 
-OBJECT_FILES_NON_STANDARD := y
-
 obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o


[tip: objtool/core] x86/crypto/sha512-avx: Standardize stack alignment prologue

2021-04-20 Thread tip-bot2 for Josh Poimboeuf
The following commit has been merged into the objtool/core branch of tip:

Commit-ID: d61684b56edf369f0a6d388088d7c9d59f1618d4
Gitweb:
https://git.kernel.org/tip/d61684b56edf369f0a6d388088d7c9d59f1618d4
Author:Josh Poimboeuf 
AuthorDate:Wed, 24 Feb 2021 10:29:23 -06:00
Committer: Josh Poimboeuf 
CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00

x86/crypto/sha512-avx: Standardize stack alignment prologue

Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf 
Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 
Tested-by: Sami Tolvanen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Herbert Xu 
Link: 
https://lore.kernel.org/r/d36e9ea1c819d87fa89b3df3fa83e2a1ede18146.1614182415.git.jpoim...@redhat.com
---
 arch/x86/crypto/sha512-avx-asm.S | 41 ++-
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx-asm.S b/arch/x86/crypto/sha512-avx-asm.S
index 684d58c..3d8f0fd 100644
--- a/arch/x86/crypto/sha512-avx-asm.S
+++ b/arch/x86/crypto/sha512-avx-asm.S
@@ -76,14 +76,10 @@ tmp0= %rax
 W_SIZE = 80*8
 # W[t] + K[t] | W[t+1] + K[t+1]
 WK_SIZE = 2*8
-RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 5*8
 
 frame_W = 0
 frame_WK = frame_W + W_SIZE
-frame_RSPSAVE = frame_WK + WK_SIZE
-frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
-frame_size = frame_GPRSAVE + GPRSAVE_SIZE
+frame_size = frame_WK + WK_SIZE
 
 # Useful QWORD "arrays" for simpler memory references
 # MSG, DIGEST, K_t, W_t are arrays
@@ -281,18 +277,18 @@ SYM_FUNC_START(sha512_transform_avx)
test msglen, msglen
je nowork
 
+   # Save GPRs
+   push%rbx
+   push%r12
+   push%r13
+   push%r14
+   push%r15
+
# Allocate Stack Space
-   mov %rsp, %rax
+   push%rbp
+   mov %rsp, %rbp
sub $frame_size, %rsp
and $~(0x20 - 1), %rsp
-   mov %rax, frame_RSPSAVE(%rsp)
-
-   # Save GPRs
-   mov %rbx, frame_GPRSAVE(%rsp)
-   mov %r12, frame_GPRSAVE +8*1(%rsp)
-   mov %r13, frame_GPRSAVE +8*2(%rsp)
-   mov %r14, frame_GPRSAVE +8*3(%rsp)
-   mov %r15, frame_GPRSAVE +8*4(%rsp)
 
 updateblock:
 
@@ -353,15 +349,16 @@ updateblock:
dec msglen
jnz updateblock
 
-   # Restore GPRs
-   mov frame_GPRSAVE(%rsp),  %rbx
-   mov frame_GPRSAVE +8*1(%rsp), %r12
-   mov frame_GPRSAVE +8*2(%rsp), %r13
-   mov frame_GPRSAVE +8*3(%rsp), %r14
-   mov frame_GPRSAVE +8*4(%rsp), %r15
-
# Restore Stack Pointer
-   mov frame_RSPSAVE(%rsp), %rsp
+   mov %rbp, %rsp
+   pop %rbp
+
+   # Restore GPRs
+   pop %r15
+   pop %r14
+   pop %r13
+   pop %r12
+   pop %rbx
 
 nowork:
ret


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-19 Thread Josh Don
On Mon, Apr 19, 2021 at 2:01 AM Peter Zijlstra  wrote:
>
> Josh, you being on the other Google team, the one that actually uses the
> cgroup interface AFAIU, can you fight the good fight with TJ on this?

A bit of extra context is in
https://lore.kernel.org/lkml/cabk29nttscu2ho7v9di+fh2gv8zu5xic5inrwpfclhpd+dk...@mail.gmail.com.

On the management/auditing side, the cgroup interface gives a clear
indication of which tasks share a cookie. It is a bit less attractive
to add a prctl interface for enumerating this.

Also on the management side, I illustrated in the above message how a
thread would potentially group together other threads. One limitation
of the current prctl interface is that the share_{to, from} always
operates on the current thread. Granted we can work around this as
described, and also potentially extend the prctl interface to operate
on two tasks.
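
For example, with the prctl() interface, pulling thread B into thread A's
group has to be driven from one of the two threads themselves, roughly like
this (the command names are those proposed in the series and are shown only
for illustration):

	/* run in thread A: share A's core-sched cookie with thread B */
	prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tid_of_b, PIDTYPE_PID, 0);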

So I agree that the cgroup interface here isn't strictly necessary,
though it seems convenient. I will double-check with internal teams
that would be using the interface to see if there are any other
considerations I'm missing.

On Mon, Apr 19, 2021 at 4:30 AM Tejun Heo  wrote:
>
> My suggestion is going ahead with the per-process interface with cgroup
> extension on mind in case actual use cases arise. Also, when planning cgroup
> integration, putting dynamic migration front and center likely isn't a good
> idea.

tasks would not be frequently moved around; I'd expect security
configuration to remain mostly static. Or maybe I'm misunderstanding
your emphasis here?


If you feel the above is not strong enough (i.e. there should be a use
case not feasible with prctl), I'd support moving forward with the prctl
stuff for now, since the cgroup interface is independent.

Thanks,
Josh


Re: [PATCH] sched/debug: Rename the sched_debug parameter to sched_debug_verbose

2021-04-19 Thread Josh Don
Hi Peter,

Looks reasonable to me.

> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4756,7 +4756,8 @@
>
> sbni=   [NET] Granch SBNI12 leased line adapter
>
> -   sched_debug [KNL] Enables verbose scheduler debug messages.
> +   sched_debug_verbose
> +   [KNL] Enables verbose scheduler debug messages.

boot param is not renamed from sched_debug below.

> @@ -22,7 +22,7 @@ early_param("sched_debug", sched_debug_s
>
>  static inline bool sched_debug(void)

nit: consider renaming. Or, we can even get rid of this function
entirely, since in the !CONFIG_SCHED_DEBUG case we have
sched_debug_verbose defined to 0.


Re: [PATCH v2] sched: Warn on long periods of pending need_resched

2021-04-16 Thread Josh Don
On Fri, Apr 16, 2021 at 8:05 AM Peter Zijlstra  wrote:
>
> On Tue, Mar 30, 2021 at 03:44:12PM -0700, Josh Don wrote:
> > Peter,
> >
> > Since you've already pulled the need_resched warning patch into your
> > tree, I'm including just the diff based on that patch (in response to
> > Mel's comments) below. This should be squashed into the original
> > patch.
> >
>
> Sorry, I seem to have missed this. The patch is completely whitespace
> mangled through.
>
> I've attached my latest copy of the patch and held it back for now,
> please resubmit.

Yikes, sorry about that. I've squashed the fixup and sent the updated
patch (with trimmed cc list): https://lkml.org/lkml/2021/4/16/1125

Thanks,
Josh


[PATCH v2 resubmit] sched: Warn on long periods of pending need_resched

2021-04-16 Thread Josh Don
From: Paul Turner 

The CPU scheduler sets the need_resched flag to signal that schedule()
should run on a particular CPU. But schedule() may not happen immediately
in cases where the current task is executing in kernel mode (no
preemption) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.

This feature (LATENCY_WARN) is default disabled.

Signed-off-by: Paul Turner 
Signed-off-by: Josh Don 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210323035706.572953-1-josh...@google.com
---
This squashes the fixup from https://lkml.org/lkml/2021/3/30/1309.

 include/linux/sched/sysctl.h |  3 ++
 kernel/sched/core.c  | 70 +++-
 kernel/sched/debug.c | 13 +++
 kernel/sched/features.h  |  2 ++
 kernel/sched/sched.h | 10 ++
 5 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 0a3f34638cf5..db2c0f34aaaf 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -48,6 +48,9 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+
+extern int sysctl_resched_latency_warn_ms;
+extern int sysctl_resched_latency_warn_once;
 #endif
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9565f304ac46..c07a4c17205f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -58,7 +58,17 @@ const_debug unsigned int sysctl_sched_features =
 #include "features.h"
0;
 #undef SCHED_FEAT
-#endif
+
+/*
+ * Print a warning if need_resched is set for the given duration (if
+ * LATENCY_WARN is enabled).
+ *
+ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
+ * per boot.
+ */
+__read_mostly int sysctl_resched_latency_warn_ms = 100;
+__read_mostly int sysctl_resched_latency_warn_once = 1;
+#endif /* CONFIG_SCHED_DEBUG */
 
 /*
  * Number of tasks to iterate in a single balance run.
@@ -4527,6 +4537,55 @@ unsigned long long task_sched_runtime(struct task_struct 
*p)
return ns;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_resched_latency(struct rq *rq)
+{
+   int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
+   u64 resched_latency, now = rq_clock(rq);
+   static bool warned_once;
+
+   if (sysctl_resched_latency_warn_once && warned_once)
+   return 0;
+
+   if (!need_resched() || !latency_warn_ms)
+   return 0;
+
+   if (system_state == SYSTEM_BOOTING)
+   return 0;
+
+   if (!rq->last_seen_need_resched_ns) {
+   rq->last_seen_need_resched_ns = now;
+   rq->ticks_without_resched = 0;
+   return 0;
+   }
+
+   rq->ticks_without_resched++;
+   resched_latency = now - rq->last_seen_need_resched_ns;
+   if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
+   return 0;
+
+   warned_once = true;
+
+   return resched_latency;
+}
+
+static int __init setup_resched_latency_warn_ms(char *str)
+{
+   long val;
+
+   if ((kstrtol(str, 0, &val))) {
+   pr_warn("Unable to set resched_latency_warn_ms\n");
+   return 1;
+   }
+
+   sysctl_resched_latency_warn_ms = val;
+   return 1;
+}
+__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
+#else
+static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
+#endif /* CONFIG_SCHED_DEBUG */
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -4538,6 +4597,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
unsigned long thermal_pressure;
+   u64 resched_latency;
 
arch_scale_freq_tick();
sched_clock_tick();
@@ -4548,10 +4608,15 @@ void scheduler_tick(void)
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
+   if (sched_feat(LATENCY_WARN))
+   resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
 
 rq_unlock(rq, &rf);
 
+   if (sched_feat(LATENCY_WARN) && resched_latency)
+   resched_latency_warn(cpu, resched_latency);
+
perf_event_task_tick();
 
 #ifdef CONFIG_SMP
@@ -5046,6 +5111,9 @@ static void __sched notrace __schedule(bool preempt)
 

Re: [PATCH 00/13] [RFC] Rust support

2021-04-16 Thread Josh Triplett
On Fri, Apr 16, 2021 at 12:27:39PM +0800, Boqun Feng wrote:
> Josh, I think it's good if we can connect to the people working on the Rust
> memory model, I think the right person is Ralf Jung and the right place
> is https://github.com/rust-lang/unsafe-code-guidelines, but you
> certainly know better than me ;-) Or maybe we can use Rust-for-Linux or
> the linux-toolchains list to discuss.

Ralf is definitely the right person to talk to. I don't think the UCG
repository is the right place to start that discussion, though. For now,
I'd suggest starting an email thread with Ralf and some C-and-kernel
memory model folks (hi Paul!) to suss out the most important changes
that would be needed.

With my language team hat on, I'd *absolutely* like to see the Rust
memory model support RCU-style deferred reclamation in a sound way,
ideally with as little unsafe code as possible.


Re: [PATCH v2] X86: Makefile: Replace -pg with CC_FLAGS_FTRACE

2021-04-16 Thread Josh Poimboeuf
Adding Steven Rostedt (ftrace maintainer).

On Fri, Apr 16, 2021 at 01:39:28PM +0800, zhaoxiao wrote:
> In preparation for x86 supporting ftrace built on other compiler
> options, let's have the x86 Makefiles remove the $(CC_FLAGS_FTRACE)
> flags, whatever these may be, rather than assuming '-pg'.
> 
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: zhaoxiao 
> ---
> v2: add the same change for the other Makefile in arch/x86 directory.
>  arch/x86/entry/vdso/Makefile |  8 
>  arch/x86/kernel/Makefile | 16 
>  arch/x86/kernel/cpu/Makefile |  4 ++--
>  arch/x86/lib/Makefile|  2 +-
>  arch/x86/mm/Makefile |  4 ++--
>  arch/x86/xen/Makefile|  6 +++---
>  6 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index 05c4abc2fdfd..c5bd91bf9f93 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -96,10 +96,10 @@ $(vobjs): KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO) 
> $(GCC_PLUGINS_CFLAGS) $(
>  #
>  # vDSO code runs in userspace and -pg doesn't help with profiling anyway.
>  #
> -CFLAGS_REMOVE_vclock_gettime.o = -pg
> -CFLAGS_REMOVE_vdso32/vclock_gettime.o = -pg
> -CFLAGS_REMOVE_vgetcpu.o = -pg
> -CFLAGS_REMOVE_vsgx.o = -pg
> +CFLAGS_REMOVE_vclock_gettime.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_vdso32/vclock_gettime.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_vgetcpu.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_vsgx.o = $(CC_FLAGS_FTRACE)
>  
>  #
>  # X32 processes use x32 vDSO to access 64bit kernel data.
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 2ddf08351f0b..2811fc6a17ba 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -13,14 +13,14 @@ CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
>  
>  ifdef CONFIG_FUNCTION_TRACER
>  # Do not profile debug and lowlevel utilities
> -CFLAGS_REMOVE_tsc.o = -pg
> -CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
> -CFLAGS_REMOVE_pvclock.o = -pg
> -CFLAGS_REMOVE_kvmclock.o = -pg
> -CFLAGS_REMOVE_ftrace.o = -pg
> -CFLAGS_REMOVE_early_printk.o = -pg
> -CFLAGS_REMOVE_head64.o = -pg
> -CFLAGS_REMOVE_sev-es.o = -pg
> +CFLAGS_REMOVE_tsc.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_paravirt-spinlocks.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_pvclock.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_kvmclock.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_early_printk.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_head64.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_sev-es.o = $(CC_FLAGS_FTRACE)
>  endif
>  
>  KASAN_SANITIZE_head$(BITS).o := n
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index 637b499450d1..4435c6de9145 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -5,8 +5,8 @@
>  
>  # Don't trace early stages of a secondary CPU boot
>  ifdef CONFIG_FUNCTION_TRACER
> -CFLAGS_REMOVE_common.o = -pg
> -CFLAGS_REMOVE_perf_event.o = -pg
> +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_perf_event.o = $(CC_FLAGS_FTRACE)
>  endif
>  
>  # If these files are instrumented, boot hangs during the first second.
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index bad4dee4f0e4..0aa71b8a5bc1 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -21,7 +21,7 @@ KASAN_SANITIZE_cmdline.o  := n
>  KCSAN_SANITIZE_cmdline.o  := n
>  
>  ifdef CONFIG_FUNCTION_TRACER
> -CFLAGS_REMOVE_cmdline.o = -pg
> +CFLAGS_REMOVE_cmdline.o = $(CC_FLAGS_FTRACE)
>  endif
>  
>  CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5864219221ca..91883d5a0293 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -12,8 +12,8 @@ KASAN_SANITIZE_mem_encrypt_identity.o   := n
>  KCSAN_SANITIZE := n
>  
>  ifdef CONFIG_FUNCTION_TRACER
> -CFLAGS_REMOVE_mem_encrypt.o  = -pg
> -CFLAGS_REMOVE_mem_encrypt_identity.o = -pg
> +CFLAGS_REMOVE_mem_encrypt.o  = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_mem_encrypt_identity.o = $(CC_FLAGS_FTRACE)
>  endif
>  
>  obj-y:=  init.o init_$(BITS).o fault.o 
> ioremap.o extable.o mmap.o \
> diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
> index 40b5779fce21..179dfc124c94 100644
> --- a/arch/x86/xen/Makefile
> +++ b/arch/x86/xen/Makefile
> @@ -2,9 +2,9 @@
>  
>  ifdef CONFIG_FUNCTION_TRACER
>  # Do not profile debug and lowlevel utilities
> -CFLAGS_REMOVE_spinlock.o = -pg
> -CFLAGS_REMOVE_time.o = -pg
> -CFLAGS_REMOVE_irq.o = -pg
> +CFLAGS_REMOVE_spinlock.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_time.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_irq.o = $(CC_FLAGS_FTRACE)
>  endif
>  
>  # Make sure early boot has no stackprotector
> -- 
> 2.20.1
> 
> 
> 

-- 
Josh



Re: [PATCH 00/13] [RFC] Rust support

2021-04-14 Thread Josh Triplett
On Wed, Apr 14, 2021 at 01:21:52PM -0700, Linus Torvalds wrote:
> On Wed, Apr 14, 2021 at 1:10 PM Matthew Wilcox  wrote:
> >
> > There's a philosophical point to be discussed here which you're skating
> > right over!  Should rust-in-the-linux-kernel provide the same memory
> > allocation APIs as the rust-standard-library, or should it provide a Rusty
> > API to the standard-linux-memory-allocation APIs?
> 
> Yeah, I think that the standard Rust API may simply not be acceptable
> inside the kernel, if it has similar behavior to the (completely
> broken) C++ "new" operator.
> 
> So anything that does "panic!" in the normal Rust API model needs to
> be (statically) caught, and never exposed as an actual call to
> "panic()/BUG()" in the kernel.

Rust has both kinds of allocation APIs: you can call a method like
`Box::new` that panics on allocation failure, or a method like
`Box::try_new` that returns an error on allocation failure.

With some additional infrastructure that's still in progress, we could
just not supply the former kind of methods at all, and *only* supply the
latter, so that you're forced to handle allocation failure. That just
requires introducing some further ability to customize the Rust standard
library.

(There are some cases of methods in the standard library that don't have
a `try_` equivalent, but we could fix that. Right now, for instance,
there isn't a `try_` equivalent of every Vec method, and you're instead
expected to call `try_reserve` to make sure you have enough memory
first; however, that could potentially be changed.)


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-13 Thread Josh Poimboeuf
On Mon, Apr 12, 2021 at 05:59:33PM +0100, Mark Brown wrote:
> On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
> 
> > Hm, for that matter, even without renaming things, a comment above
> > stack_trace_save_tsk_reliable() describing the meaning of "reliable"
> > would be a good idea.
> 
> Might be better to place something at the prototype for
> arch_stack_walk_reliable() or cross link the two since that's where any
> new architectures should be starting, or perhaps even better to extend
> the document that Mark wrote further and point to that from both places.  
> 
> Some more explicit pointer to live patching as the only user would
> definitely be good but I think the more important thing would be writing
> down any assumptions in the API that aren't already written down and
> we're supposed to be relying on.  Mark's document captured a lot of it
> but it sounds like there's more here, and even with knowing that this
> interface is only used by live patch and digging into what it does it's
> not always clear what happens to work with the code right now and what's
> something that's suitable to be relied on.

Something like so?

From: Josh Poimboeuf 
Subject: [PATCH] livepatch: Clarify the meaning of 'reliable'

Update the comments and documentation to reflect what 'reliable'
unwinding actually means, in the context of live patching.

Suggested-by: Mark Brown 
Signed-off-by: Josh Poimboeuf 
---
 .../livepatch/reliable-stacktrace.rst | 26 +
 arch/x86/kernel/stacktrace.c  |  6 
 include/linux/stacktrace.h| 29 +--
 kernel/stacktrace.c   |  7 -
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/Documentation/livepatch/reliable-stacktrace.rst 
b/Documentation/livepatch/reliable-stacktrace.rst
index 67459d2ca2af..e325efc7e952 100644
--- a/Documentation/livepatch/reliable-stacktrace.rst
+++ b/Documentation/livepatch/reliable-stacktrace.rst
@@ -72,7 +72,21 @@ The unwinding process varies across architectures, their 
respective procedure
 call standards, and kernel configurations. This section describes common
 details that architectures should consider.
 
-4.1 Identifying successful termination
+4.1 Only preemptible code needs reliability detection
+-
+
+The only current user of reliable stacktracing is livepatch, which only
+calls it for a) inactive tasks; or b) the current task in task context.
+
+Therefore, the unwinder only needs to detect the reliability of stacks
+involving *preemptible* code.
+
+Practically speaking, reliability of stacks involving *non-preemptible*
+code is a "don't-care".  It may help to return a wrong reliability
+result for such cases, if it results in reduced complexity, since such
+cases will not happen in practice.
+
+4.2 Identifying successful termination
 --
 
 Unwinding may terminate early for a number of reasons, including:
@@ -95,7 +109,7 @@ architectures verify that a stacktrace ends at an expected 
location, e.g.
 * On a specific stack expected for a kernel entry point (e.g. if the
   architecture has separate task and IRQ stacks).
 
-4.2 Identifying unwindable code
+4.3 Identifying unwindable code
 ---
 
 Unwinding typically relies on code following specific conventions (e.g.
@@ -129,7 +143,7 @@ unreliable to unwind from, e.g.
 
 * Identifying specific portions of code using bounds information.
 
-4.3 Unwinding across interrupts and exceptions
+4.4 Unwinding across interrupts and exceptions
 --
 
 At function call boundaries the stack and other unwind state is expected to be
@@ -156,7 +170,7 @@ have no such cases) should attempt to unwind across 
exception boundaries, as
 doing so can prevent unnecessarily stalling livepatch consistency checks and
 permits livepatch transitions to complete more quickly.
 
-4.4 Rewriting of return addresses
+4.5 Rewriting of return addresses
 -
 
 Some trampolines temporarily modify the return address of a function in order
@@ -222,7 +236,7 @@ middle of return_to_handler and can report this as 
unreliable. Architectures
 are not required to unwind from other trampolines which modify the return
 address.
 
-4.5 Obscuring of return addresses
+4.6 Obscuring of return addresses
 -
 
 Some trampolines do not rewrite the return address in order to intercept
@@ -249,7 +263,7 @@ than the link register as would usually be the case.
 Architectures must either ensure that unwinders either reliably unwind
 such cases, or report the unwinding as unreliable.
 
-4.6 Link register unreliability
+4.7 Link register unreliability
 ---
 
 On some other architectures, 'call' instructions place 

Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Josh Poimboeuf
On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
> > > FWIW, over the years we've had zero issues with encoding the frame
> > > pointer on x86.  After you save pt_regs, you encode the frame pointer to
> > > point to it.  Ideally in the same macro so it's hard to overlook.
> > > 
> > 
> > I had the same opinion. In fact, in my encoding scheme, I have additional
> > checks to make absolutely sure that it is a true encoding and not stack
> > corruption. The chances of all of those values accidentally matching are,
> > well, null.
> 
> Right, stack corruption -- which is already exceedingly rare -- would
> have to be combined with a miracle or two in order to come out of the
> whole thing marked as 'reliable' :-)
> 
> And really, we already take a similar risk today by "trusting" the frame
> pointer value on the stack to a certain extent.

Oh yeah, I forgot to mention some more benefits of encoding the frame
pointer (or marking pt_regs in some other way):

a) Stack addresses can be printed properly: '%pS' for printing regs->pc
   and '%pB' for printing call returns.

   Using '%pS' for call returns (as arm64 seems to do today) will result
   in printing the wrong function when you have tail calls to noreturn
   functions on the stack (which is actually quite common for calls to
   panic(), die(), etc).

   More details:

   https://lkml.kernel.org/r/20210403155948.ubbgtwmlsdyar7yp@treble

b) Stack dumps to the console can dump the exception registers they find
   along the way.  This is actually quite nice for debugging.


-- 
Josh



Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Josh Poimboeuf
On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
> > FWIW, over the years we've had zero issues with encoding the frame
> > pointer on x86.  After you save pt_regs, you encode the frame pointer to
> > point to it.  Ideally in the same macro so it's hard to overlook.
> > 
> 
> I had the same opinion. In fact, in my encoding scheme, I have additional
> checks to make absolutely sure that it is a true encoding and not stack
> corruption. The chances of all of those values accidentally matching are,
> well, null.

Right, stack corruption -- which is already exceedingly rare -- would
have to be combined with a miracle or two in order to come out of the
whole thing marked as 'reliable' :-)

And really, we already take a similar risk today by "trusting" the frame
pointer value on the stack to a certain extent.

> >> I think there's a lot more code that we cannot unwind, e.g. KVM
> >> exception code, or almost anything marked with SYM_CODE_END().
> > 
> > Just a reminder that livepatch only unwinds blocked tasks (plus the
> > 'current' task which calls into livepatch).  So practically speaking, it
> > doesn't matter whether the 'unreliable' detection has full coverage.
> > The only exceptions which really matter are those which end up calling
> > schedule(), e.g. preemption or page faults.
> > 
> > Being able to consistently detect *all* possible unreliable paths would
> > be nice in theory, but it's unnecessary and may not be worth the extra
> > complexity.
> > 
> 
> You do have a point. I tried to think of arch_stack_walk_reliable() as
> something that should be implemented independent of livepatching. But
> I could not really come up with a single example of where else it would
> really be useful.
> 
> So, if we assume that the reliable stack trace is solely for the purpose
> of livepatching, I agree with your earlier comments as well.

One thought: if folks really view this as a problem, it might help to
just rename things to reduce confusion.

For example, instead of calling it 'reliable', we could call it
something more precise, like 'klp_reliable', to indicate that its
reliable enough for live patching.

Then have a comment above 'klp_reliable' and/or
stack_trace_save_tsk_klp_reliable() which describes what that means.

Hm, for that matter, even without renaming things, a comment above
stack_trace_save_tsk_reliable() describing the meaning of "reliable"
would be a good idea.

-- 
Josh



Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Josh Poimboeuf
On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madve...@linux.microsoft.com wrote:
> > From: "Madhavan T. Venkataraman" 
> > 
> > There are a number of places in kernel code where the stack trace is not
> > reliable. Enhance the unwinder to check for those cases and mark the
> > stack trace as unreliable. Once all of the checks are in place, the unwinder
> > can provide a reliable stack trace. But before this can be used for 
> > livepatch,
> > some other entity needs to guarantee that the frame pointers are all set up
> > correctly in kernel functions. objtool is currently being worked on to
> > fill that gap.
> > 
> > Except for the return address check, all the other checks involve checking
> > the return PC of every frame against certain kernel functions. To do this,
> > implement some infrastructure code:
> > 
> > - Define a special_functions[] array and populate the array with
> >   the special functions
> 
> I'm not too keen on having to manually collate this within the unwinder,
> as it's very painful from a maintenance perspective.

Agreed.

> I'd much rather we could associate this information with the
> implementations of these functions, so that they're more likely to
> stay in sync.
> 
> Further, I believe all the special cases are assembly functions, and
> most of those are already in special sections to begin with. I reckon
> it'd be simpler and more robust to reject unwinding based on the
> section. If we need to unwind across specific functions in those
> sections, we could opt-in with some metadata. So e.g. we could reject
> all functions in ".entry.text", special casing the EL0 entry functions
> if necessary.

Couldn't this also end up being somewhat fragile?  Saying "certain
sections are deemed unreliable" isn't necessarily obvious to somebody
who doesn't already know about it, and it could be overlooked or
forgotten over time.  And there's no way to enforce it stays that way.

FWIW, over the years we've had zero issues with encoding the frame
pointer on x86.  After you save pt_regs, you encode the frame pointer to
point to it.  Ideally in the same macro so it's hard to overlook.

If you're concerned about debuggers getting confused by the encoding -
which debuggers specifically?  In my experience, if vmlinux has
debuginfo, gdb and most other debuggers will use DWARF (which is already
broken in asm code) and completely ignore frame pointers.

> I think there's a lot more code that we cannot unwind, e.g. KVM
> exception code, or almost anything marked with SYM_CODE_END().

Just a reminder that livepatch only unwinds blocked tasks (plus the
'current' task which calls into livepatch).  So practically speaking, it
doesn't matter whether the 'unreliable' detection has full coverage.
The only exceptions which really matter are those which end up calling
schedule(), e.g. preemption or page faults.

Being able to consistently detect *all* possible unreliable paths would
be nice in theory, but it's unnecessary and may not be worth the extra
complexity.

-- 
Josh



Re: [PATCH 0/5] Introduce support for PSF mitigation

2021-04-09 Thread Josh Poimboeuf
On Thu, Apr 08, 2021 at 09:56:47AM -0500, Saripalli, RK wrote:
> Josh, thank you for taking the time to review the patches.
> 
> On 4/7/2021 5:39 PM, Josh Poimboeuf wrote:
> > On Tue, Apr 06, 2021 at 10:49:59AM -0500, Ramakrishna Saripalli wrote:
> >> Because PSF speculation is limited to the current program context,
> >> the impact of bad PSF speculation is very similar to that of
> >> Speculative Store Bypass (Spectre v4)
> >>
> >> Predictive Store Forwarding controls:
> >> There are two hardware control bits which influence the PSF feature:
> >> - MSR 48h bit 2 – Speculative Store Bypass (SSBD)
> >> - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD)
> >>
> >> The PSF feature is disabled if either of these bits are set.  These bits
> >> are controllable on a per-thread basis in an SMT system. By default, both
> >> SSBD and PSFD are 0 meaning that the speculation features are enabled.
> >>
> >> While the SSBD bit disables PSF and speculative store bypass, PSFD only
> >> disables PSF.
> >>
> >> PSFD may be desirable for software which is concerned with the
> >> speculative behavior of PSF but desires a smaller performance impact than
> >> setting SSBD.
> > 
> > Hi Ramakrishna,
> > 
> > Is there a realistic scenario where an application would want to disable
> > PSF, but not disable SSB?
> 
> It is possible most applications have been reviewed and scrubbed for
> SSB-type attacks but PSF-type issues may not have been looked at yet.

It's "possible", but is it realistic?  As far as I know, SSB is
impractical to scrub an application for.

Do we know of any real-world cases where this option is needed?

-- 
Josh



[tip: sched/core] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-09 Thread tip-bot2 for Josh Hunt
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 6db12ee0456d0e369c7b59788d46e15a56ad0294
Gitweb:
https://git.kernel.org/tip/6db12ee0456d0e369c7b59788d46e15a56ad0294
Author:Josh Hunt 
AuthorDate:Thu, 01 Apr 2021 22:58:33 -04:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00

psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/20210402025833.27599-1-joh...@akamai.com
---
 kernel/sched/psi.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9..d1212f1 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v)
 return psi_show(m, &psi_system, PSI_CPU);
 }
 
+static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void 
*))
+{
+   if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
+   return single_open(file, psi_show, NULL);
+}
+
 static int psi_io_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_io_show, NULL);
+   return psi_open(file, psi_io_show);
 }
 
 static int psi_memory_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_memory_show, NULL);
+   return psi_open(file, psi_memory_show);
 }
 
 static int psi_cpu_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_cpu_show, NULL);
+   return psi_open(file, psi_cpu_show);
 }
 
 struct psi_trigger *psi_trigger_create(struct psi_group *group,
@@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
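
For context, a minimal sketch of the kind of consumer this enables: an
unprivileged process holding CAP_SYS_RESOURCE registering a PSI trigger and
waiting for events. The trigger format follows Documentation/accounting/psi.rst;
the path and thresholds are illustrative only.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/*
	 * "some 150000 1000000": notify when tasks are stalled on memory
	 * for >= 150ms within any 1s window.
	 */
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int fd;

	fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		perror("open /proc/pressure/memory");
		return 1;
	}
	/* With this change, CAP_SYS_RESOURCE is enough; full root is not. */
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		perror("write trigger");
		return 1;
	}
	fds.fd = fd;
	fds.events = POLLPRI;
	while (poll(&fds, 1, -1) > 0) {
		if (fds.revents & POLLERR) {
			fprintf(stderr, "monitor went away\n");
			break;
		}
		if (fds.revents & POLLPRI)
			printf("memory pressure event\n");
	}
	close(fd);
	return 0;
}

Granting the binary cap_sys_resource (for example with setcap) is enough once
this change is in place.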


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-08 Thread Josh Don
On Thu, Apr 8, 2021 at 9:47 AM Peter Zijlstra  wrote:
>
> On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote:
>
> > I'm curious whether the cgroup API actually simplifies things that are
> > possible with the clone/prctl API or allows anything that wouldn't be
> > otherwise possible.
>
> With the cgroup API it is impossible for a task to escape the cgroup
> constraint. It can never share a core with anything not in the subtree.
>
> This is not possible with just the task interface.
>
> If this is actually needed I've no clue, IMO all of cgroups is not
> needed :-) Clearly other people feel differently about that.

The cgroup interface seems very nice from a management perspective;
moving arbitrary tasks around in the cgroup hierarchy will handle the
necessary cookie adjustments to ensure that nothing shares with an
unexpected task. It also makes auditing the core scheduling groups
very straightforward; anything in a cookie'd cgroup's tasks file will
be guaranteed isolated from the rest of the system, period.

Further, if a system management thread wants two arbitrary tasks A and
B to share a cookie, this seems more painful with the PRCTL interface.
The management thread would need to do something like (see the sketch below)
- PR_SCHED_CORE_CREATE
- PR_SCHED_CORE_SHARE_TO A
- PR_SCHED_CORE_SHARE_TO B
- PR_SCHED_CORE_CLEAR
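
In C that sequence would look roughly like the sketch below. The
PR_SCHED_CORE_* request names and the placeholder values follow this thread's
proposal, not a merged ABI, so treat them as hypothetical:

#include <sys/prctl.h>
#include <sys/types.h>

/* Hypothetical request numbers, purely for illustration. */
#define PR_SCHED_CORE_CREATE	62
#define PR_SCHED_CORE_SHARE_TO	63
#define PR_SCHED_CORE_CLEAR	64

/* Give tasks A and B a shared core-sched cookie from a management thread. */
static int share_core_cookie(pid_t a, pid_t b)
{
	if (prctl(PR_SCHED_CORE_CREATE, 0, 0, 0, 0))	/* new cookie on self */
		return -1;
	if (prctl(PR_SCHED_CORE_SHARE_TO, a, 0, 0, 0))	/* push cookie to A */
		return -1;
	if (prctl(PR_SCHED_CORE_SHARE_TO, b, 0, 0, 0))	/* push cookie to B */
		return -1;
	return prctl(PR_SCHED_CORE_CLEAR, 0, 0, 0, 0);	/* drop it from self */
}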


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-08 Thread Josh Poimboeuf
On Thu, Apr 08, 2021 at 08:18:21AM +0200, Greg KH wrote:
> On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote:
> > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > > On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote:
> > > > As for the syfs deadlock possible with drivers, this fixes it in a 
> > > > generic way:
> > > > 
> > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > > > Author: Luis Chamberlain 
> > > > Date:   Sat Mar 27 14:58:15 2021 +
> > > > 
> > > > sysfs: add optional module_owner to attribute
> > > > 
> > > > This is needed as otherwise the owner of the attribute
> > > > or group read/store might have a shared lock used on driver removal,
> > > > and deadlock if we race with driver removal.
> > > > 
> > > > Signed-off-by: Luis Chamberlain 
> > > 
> > > No, please no.  Module removal is a "best effort", if the system dies
> > > when it happens, that's on you.  I am not willing to expend extra energy
> > > and maintance of core things like sysfs for stuff like this that does
> > > not matter in any system other than a developer's box.
> > 
> > So I mentioned this on IRC, and some folks were surprised to hear that
> > module unloading is unsupported and is just a development aid.
> > 
> > Is this stance documented anywhere?
> > 
> > If we really believe this to be true, we should make rmmod taint the
> > kernel.
> 
> My throw-away comment here seems to have gotten way more attention than
> it deserved, sorry about that everyone.
> 
> Nothing is supported for anything here, it's all "best effort" :)
> 
> And I would love a taint for rmmod, but what is that going to help out
> with?

I'm having trouble following this conclusion.  Sure, we give our best
effort, but if "nothing is supported" then what are we even doing here?

Either we aim to support module unloading, or we don't.

If we do support it, "best effort" isn't a valid reason for nacking
fixes.

If we don't support it, but still want to keep it half-working for
development purposes, tainting on rmmod will make it crystal clear that
the system has been destabilized.
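
A minimal sketch of what that could look like (TAINT_RMMOD is hypothetical,
proposed in this thread; the bit number and message are illustrative only):

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/printk.h>

#define TAINT_RMMOD	18	/* hypothetical: would need a real, unused taint bit */

/* Would be called from delete_module() once the unload is committed. */
static void taint_on_module_unload(struct module *mod)
{
	pr_notice("%s: module unload taints kernel\n", mod->name);
	add_taint(TAINT_RMMOD, LOCKDEP_STILL_OK);
}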

-- 
Josh



Re: [PATCH 0/5] Introduce support for PSF mitigation

2021-04-07 Thread Josh Poimboeuf
On Tue, Apr 06, 2021 at 10:49:59AM -0500, Ramakrishna Saripalli wrote:
> Because PSF speculation is limited to the current program context,
> the impact of bad PSF speculation is very similar to that of
> Speculative Store Bypass (Spectre v4)
> 
> Predictive Store Forwarding controls:
> There are two hardware control bits which influence the PSF feature:
> - MSR 48h bit 2 – Speculative Store Bypass (SSBD)
> - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD)
> 
> The PSF feature is disabled if either of these bits are set.  These bits
> are controllable on a per-thread basis in an SMT system. By default, both
> SSBD and PSFD are 0 meaning that the speculation features are enabled.
> 
> While the SSBD bit disables PSF and speculative store bypass, PSFD only
> disables PSF.
> 
> PSFD may be desirable for software which is concerned with the
> speculative behavior of PSF but desires a smaller performance impact than
> setting SSBD.

Hi Ramakrishna,

Is there a realistic scenario where an application would want to disable
PSF, but not disable SSB?

Maybe I'm missing something, but I'd presume an application would either
care about this class of attacks, or not.

-- 
Josh



Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-07 Thread Josh Poimboeuf
On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote:
> > As for the syfs deadlock possible with drivers, this fixes it in a generic 
> > way:
> > 
> > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > Author: Luis Chamberlain 
> > Date:   Sat Mar 27 14:58:15 2021 +
> > 
> > sysfs: add optional module_owner to attribute
> > 
> > This is needed as otherwise the owner of the attribute
> > or group read/store might have a shared lock used on driver removal,
> > and deadlock if we race with driver removal.
> > 
> > Signed-off-by: Luis Chamberlain 
> 
> No, please no.  Module removal is a "best effort", if the system dies
> when it happens, that's on you.  I am not willing to expend extra energy
> and maintance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.

So I mentioned this on IRC, and some folks were surprised to hear that
module unloading is unsupported and is just a development aid.

Is this stance documented anywhere?

If we really believe this to be true, we should make rmmod taint the
kernel.

-- 
Josh



Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-07 Thread Josh Poimboeuf
On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote:
> 
> > Same for Red Hat.  Unloading livepatch modules seems to work fine, but
> > isn't officially supported.
> > 
> > That said, if rmmod is just considered a development aid, and we're
> > going to be ignoring bugs, we should make it official with a new
> > TAINT_RMMOD.
> 
> Another option would be to have live-patch modules leak a module
> reference by default, except when some debug sysctl is set or something.
> Then only those LP modules loaded while the sysctl is set to 'YOLO' can
> be unloaded.

The issue is broader than just live patching.

My suggestion was that if we aren't going to fix bugs in kernel module
unloading, then unloading modules shouldn't be supported, and should
taint the kernel.

-- 
Josh



Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-06 Thread Josh Poimboeuf
On Tue, Apr 06, 2021 at 02:00:19PM +0200, Miroslav Benes wrote:
> Hi,
> 
> > > Driver developers will simply have to open code these protections. In
> > > light of what I see on LTP / fuzzing, I suspect the use case will grow
> > > and we'll have to revisit this in the future. But for now, sure, we can
> > > just open code the required protections everywhere to not crash on module
> > > removal.
> > 
> > LTP and fuzzing too do not remove modules.  So I do not understand the
> > root problem here, that's just something that does not happen on a real
> > system.
> 
> If I am not mistaken, the issue that Luis tries to solve here was indeed 
> found by running LTP.
> 
> > On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote:
> > > On Fri, Apr 02, 2021 at 06:30:16PM +, Luis Chamberlain wrote:
> > > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > > > > No, please no.  Module removal is a "best effort",
> > > > 
> > > > Not for live patching. I am not sure if I am missing any other valid
> > > > use case?
> > > 
> > > live patching removes modules?  We have so many code paths that are
> > > "best effort" when it comes to module unloading, trying to resolve this
> > > one is a valiant try, but not realistic.
> > 
> > Miroslav, your input / help here would be valuable. I did the
> > generalization work because you said it would be worthy for you too...
> 
> Yes, we have the option to revert and remove the existing live patch from 
> the system. I am not sure how (if) it is used in practice.
> 
> At least at SUSE we do not support the option. But we are only one of the 
> many downstream users. So yes, there is the option.

Same for Red Hat.  Unloading livepatch modules seems to work fine, but
isn't officially supported.

That said, if rmmod is just considered a development aid, and we're
going to be ignoring bugs, we should make it official with a new
TAINT_RMMOD.

-- 
Josh



Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks

2021-04-03 Thread Josh Poimboeuf
inal return address. But I am
> not entirely sure how to safely traverse that list for stack traces
> not on the current process. So, I have taken the easy way out.

For kretprobes, unwinding from the trampoline or kretprobe handler
shouldn't be a reliability concern for live patching, for similar
reasons as above.

Otherwise, when unwinding from a blocked task which has
'kretprobe_trampoline' on the stack, the unwinder needs a way to get the
original return address.  Masami has been working on an interface to
make that possible for x86.  I assume something similar could be done
for arm64.

> Optprobes
> =
> 
> Optprobes may be implemented in the future for arm64. For optprobes,
> the relevant trampoline(s) can be added to special_functions[].

Similar comment here, livepatch doesn't unwind from such trampolines.

-- 
Josh



Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-03 Thread Josh Poimboeuf
On Thu, Apr 01, 2021 at 10:24:04PM -0500, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> @@ -447,9 +464,9 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>  #endif
>   bl  switch_to_vhe   // Prefer VHE if possible
>   add sp, sp, #16
> - mov x29, #0
> - mov x30, #0
> - b   start_kernel
> + setup_final_frame
> + bl  start_kernel
> + nop
>  SYM_FUNC_END(__primary_switched)
>  
>   .pushsection ".rodata", "a"
> @@ -606,14 +623,14 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
>   cbz x2, __secondary_too_slow
>   msr sp_el0, x2
>   scs_load x2, x3
> - mov x29, #0
> - mov x30, #0
> + setup_final_frame
>  
>  #ifdef CONFIG_ARM64_PTR_AUTH
>   ptrauth_keys_init_cpu x2, x3, x4, x5
>  #endif
>  
> - b   secondary_start_kernel
> + bl  secondary_start_kernel
> + nop
>  SYM_FUNC_END(__secondary_switched)

I'm somewhat arm-ignorant, so take the following comments with a grain
of salt.


I don't think changing these to 'bl' is necessary, unless you wanted
__primary_switched() and __secondary_switched() to show up in the
stacktrace for some reason?  If so, that seems like a separate patch.


Also, why are nops added after the calls?  My guess would be because,
since these are basically tail calls to "noreturn" functions, the stack
dump code would otherwise show the wrong function, i.e. whatever
function happens to be after the 'bl'.

We had the same issue for x86.  It can be fixed by using '%pB' instead
of '%pS' when printing the address in dump_backtrace_entry().  See
sprint_backtrace() for more details.
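
For illustration, a sketch of the difference when printing a return address
pulled off the stack:

#include <linux/printk.h>

/*
 * Sketch: 'addr' is a return address found on the stack.  %pS resolves the
 * address exactly as given, so when the last thing a function does is call
 * a noreturn function, the saved return address can point past the end of
 * the caller, i.e. at the start of the next function, and the wrong symbol
 * gets printed.  %pB backs the address up before resolving it, so it names
 * the function that actually made the call.
 */
static void print_return_address(unsigned long addr)
{
	printk("%pS\n", (void *)addr);	/* can misattribute the frame */
	printk("%pB\n", (void *)addr);	/* names the calling function */
}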

BTW I think the same issue exists for GCC-generated code.  The following
shows several such cases:

  objdump -dr vmlinux |awk '/bl   / {bl=1;l=$0;next} bl == 1 && /^$/ {print l; 
print} // {bl=0}'


However, looking at how arm64 unwinds through exceptions in kernel
space, using '%pB' might have side effects when the exception LR
(elr_el1) points to the beginning of a function.  Then '%pB' would show
the end of the previous function, instead of the function which was
interrupted.

So you may need to rethink how to unwind through in-kernel exceptions.

Basically, when printing a stack return address, you want to use '%pB'
for a call return address and '%pS' for an interrupted address.

On x86, with the frame pointer unwinder, we encode the frame pointer by
setting a bit in %rbp which tells the unwinder that it's a special
pt_regs frame.  Then instead of treating it like a normal call frame,
the stack dump code prints the registers, and the return address
(regs->ip) gets printed with '%pS'.
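
As a sketch of the idea (the real x86 version does this in asm via
ENCODE_FRAME_POINTER; the bit choice and helper names here are illustrative
only):

#include <linux/types.h>
#include <linux/ptrace.h>

#define FRAME_IS_PT_REGS	0x1UL	/* illustrative tag bit */

/* Saved frame pointer value that marks "this frame points at pt_regs". */
static inline unsigned long encode_frame_pointer(struct pt_regs *regs)
{
	return (unsigned long)regs | FRAME_IS_PT_REGS;
}

static inline bool frame_points_at_pt_regs(unsigned long fp)
{
	return fp & FRAME_IS_PT_REGS;
}

static inline struct pt_regs *decode_frame_pointer(unsigned long fp)
{
	return (struct pt_regs *)(fp & ~FRAME_IS_PT_REGS);
}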

>  SYM_FUNC_START_LOCAL(__secondary_too_slow)
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 325c83b1a24d..906baa232a89 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -437,6 +437,11 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> stack_start,
>   }
>   p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
>   p->thread.cpu_context.sp = (unsigned long)childregs;
> + /*
> +  * For the benefit of the unwinder, set up childregs->stackframe
> +  * as the final frame for the new task.
> +  */
> + p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
>  
>   ptrace_hw_copy_thread(p);
>  
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index ad20981dfda4..72f5af8c69dc 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -44,16 +44,16 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
> stackframe *frame)
>   unsigned long fp = frame->fp;
>   struct stack_info info;
>  
> - /* Terminal record; nothing to unwind */
> - if (!fp)
> + if (!tsk)
> + tsk = current;
> +
> + /* Final frame; nothing to unwind */
> + if (fp == (unsigned long) task_pt_regs(tsk)->stackframe)
>   return -ENOENT;

As far as I can tell, the regs stackframe value is initialized to zero
during syscall entry, so isn't this basically just 'if (fp == 0)'?

Shouldn't it instead be comparing with the _address_ of the stackframe
field to make sure it reached the end?

-- 
Josh



Re: [PATCH 9/9] sched: prctl() and cgroup interaction

2021-04-02 Thread Josh Don
Hi Peter,

I walked through the reference counting, and it seems good to me
(though it did take a few passes to fully digest the invariants for
the fat cookie stuff).

> +unsigned long sched_core_alloc_cookie(unsigned int type)
>  {
> struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
> if (!ck)
> return 0;
> -   refcount_set(&ck->refcnt, 1);
> +   WARN_ON_ONCE(type > GROUP_COOKIE);
> +   sched_core_init_cookie(ck, type);
> sched_core_get();
>
> -   return (unsigned long)ck;
> +   return (unsigned long)ck | type;
 > }

This feels like it needs to be stronger than a WARN_ON_ONCE; could
create a corrupted address that we later try to kfree().
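
e.g., a sketch of the stricter variant (based on the hunk quoted above;
refuse to hand out a cookie at all rather than proceed with a bogus type):

unsigned long sched_core_alloc_cookie(unsigned int type)
{
	struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);

	if (!ck)
		return 0;

	/* Bail out instead of tagging a pointer with an out-of-range type. */
	if (WARN_ON_ONCE(type > GROUP_COOKIE)) {
		kfree(ck);
		return 0;
	}

	sched_core_init_cookie(ck, type);
	sched_core_get();

	return (unsigned long)ck | type;
}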

Also, for my own edification, why will the bottom two bits here always be 0?

> -unsigned long sched_core_alloc_cookie(void)
> +static inline void *cookie_ptr(unsigned long cookie)
> +{
> +   return (void *)(cookie & ~3UL);
> +}
> +
> +static inline int cookie_type(unsigned long cookie)
> +{
> +   return cookie & 3;
> +}

s/3/FAT_COOKIE

> +#define FAT_COOKIE 0x03

Move to sched.h to group with TASK/GROUP_COOKIE?

> +static unsigned long __sched_core_fat_cookie(struct task_struct *p,
> +void **spare_fat,
> +unsigned long cookie)
> +{

This function looks good to me, but could use some more comments about
the pre/post-condition assumptions. Ie. cookie already has a get()
associated with it, caller is expected to kfree the spare_fat.
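
For instance, a comment along these lines (a sketch capturing only the
assumptions spelled out above):

/*
 * __sched_core_fat_cookie - build a compound (fat) cookie for @p
 * @p:         task whose cookie fields are being combined
 * @spare_fat: preallocated fat cookie; if it is not consumed here, the
 *             caller is expected to kfree() it
 * @cookie:    cookie the caller already holds a get() reference on
 */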

> +   raw_spin_lock_irqsave(&fat_lock, flags);
> +   n = rb_find_add(&fat->node, &fat_root, fat_cmp);
> +   raw_spin_unlock_irqrestore(&fat_lock, flags);
> +
> +   if (n) {
> +   sched_core_put_fat(fat);
> +   fat = node_2_fat(n);

This put() doesn't seem strictly necessary; caller will be
unconditionally freeing the spare_fat. Keep anyway for completeness,
but add a comment?


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2021-04-02 Thread Josh Hunt

On 4/2/21 12:25 PM, Jiri Kosina wrote:

On Thu, 3 Sep 2020, John Fastabend wrote:


At this point I fear we could consider reverting the NOLOCK stuff.
I personally would hate doing so, but it looks like NOLOCK benefits are
outweighed by its issues.


I agree, NOLOCK brings more pains than gains. There are many race
conditions hidden in generic qdisc layer, another one is enqueue vs.
reset which is being discussed in another thread.


Sure. Seems they crept in over time. I had some plans to write a
lockless HTB implementation. But with fq+EDT with BPF it seems that
it is no longer needed, we have a more generic/better solution.  So
I dropped it. Also most folks should really be using fq, fq_codel,
etc. by default anyways. Using pfifo_fast alone is not ideal IMO.


Half a year later, we still have the NOLOCK implementation
present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.

And we've just been bitten by this very same race which appears to be
still unfixed, with single packet being stuck in pfifo_fast qdisc
basically indefinitely due to this very race that this whole thread began
with back in 2019.

Unless there are

(a) any nice ideas how to solve this in an elegant way without
(re-)introducing extra spinlock (Cong's fix) or

(b) any objections to revert as per the argumentation above

I'll be happy to send a revert of the whole NOLOCK implementation next
week.



Jiri

If you have a reproducer can you try 
https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your 
suggestion of reverting nolock makes sense to me. We've moved to using 
fq as our default now b/c of this bug.


Josh


[PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Josh Hunt
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
Acked-by: Johannes Weiner 
---
 kernel/sched/psi.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9bd7ed..d1212f17a898 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v)
 return psi_show(m, &psi_system, PSI_CPU);
 }
 
+static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void 
*))
+{
+   if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
+   return single_open(file, psi_show, NULL);
+}
+
 static int psi_io_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_io_show, NULL);
+   return psi_open(file, psi_io_show);
 }
 
 static int psi_memory_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_memory_show, NULL);
+   return psi_open(file, psi_memory_show);
 }
 
 static int psi_cpu_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_cpu_show, NULL);
+   return psi_open(file, psi_cpu_show);
 }
 
 struct psi_trigger *psi_trigger_create(struct psi_group *group,
@@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
-- 
2.17.1



Re: [PATCH 7/9] sched: Cgroup core-scheduling interface

2021-04-01 Thread Josh Don
Thanks, allowing for multiple group cookies in a hierarchy is a nice
improvement.

> +   if (tgi != tg) {
> +   if (tgi->core_cookie || (tgi->core_parent && 
> tgi->core_parent != tg))
> +   continue;
> +
> +   tgi->core_parent = parent;
> +   tgi->core_cookie = 0;

core_cookie must already be 0, given the check above.


Re: [PATCH 3/9] sched: Trivial core scheduling cookie management

2021-04-01 Thread Josh Don
> +/*
> + * sched_core_update_cookie - Common helper to update a task's core cookie. 
> This
> + * updates the selected cookie field.
> + * @p: The task whose cookie should be updated.
> + * @cookie: The new cookie.
> + * @cookie_type: The cookie field to which the cookie corresponds.

No cookie_type in this patch (or cookie fields). Also might want to
call out that the caller is responsible for put'ing the return value.


Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Josh Hunt

On 4/1/21 10:47 AM, Eric W. Biederman wrote:

Kees Cook  writes:


On Wed, Mar 31, 2021 at 11:36:28PM -0500, Eric W. Biederman wrote:

Josh Hunt  writes:


Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.


The test for CAP_SYS_RESOURCE really needs to be in open rather
than in write.

Otherwise a suid root executable could have stdout redirected
into these files.


Right. Or check against f_cred. (See uses of kallsyms_show_value())
https://urldefense.com/v3/__https://www.kernel.org/doc/html/latest/security/credentials.html*open-file-credentials__;Iw!!GjvTz_vk!B_aeVyHMG20VNUGx001EFKpeYlahLQHye7oS8sokXuZOhVDTtF_deDl71a_KYA$


We really want to limit checking against f_cred to those cases where we
break userspace by checking in open.  AKA the cases where we made the
mistake of putting the permission check in the wrong place and now can't
fix it.

Since this change is changing the permissions that open uses already, I
don't see any reason we can't perform a proper check in open.

Eric



Thank you for the feedback. I will spin a v2 doing the check in open.

Josh


[PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-03-31 Thread Josh Hunt
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
---
 kernel/sched/psi.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9bd7ed..98ff7baf1ba8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1270,6 +1270,9 @@ static ssize_t psi_write(struct file *file, const char 
__user *user_buf,
if (!nbytes)
return -EINVAL;
 
+   if (!capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
buf_size = min(nbytes, sizeof(buf));
if (copy_from_user(buf, user_buf, buf_size))
return -EFAULT;
@@ -1353,9 +1356,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
-- 
2.17.1



Re: [RFC PATCH -tip 3/3] x86/kprobes,orc: Unwind optprobe trampoline correctly

2021-03-31 Thread Josh Poimboeuf
On Wed, Mar 31, 2021 at 02:44:56PM +0900, Masami Hiramatsu wrote:
> +#ifdef CONFIG_UNWINDER_ORC
> +unsigned long recover_optprobe_trampoline(unsigned long addr, unsigned long 
> *sp)
> +{
> + unsigned long offset, entry, probe_addr;
> + struct optimized_kprobe *op;
> + struct orc_entry *orc;
> +
> + entry = find_kprobe_optinsn_slot_entry(addr);
> + if (!entry)
> + return addr;
> +
> + offset = addr - entry;
> +
> + /* Decode arg1 and get the optprobe */
> + op = (void *)extract_set_arg1((void *)(entry + TMPL_MOVE_IDX));
> + if (!op)
> + return addr;
> +
> + probe_addr = (unsigned long)op->kp.addr;
> +
> + if (offset < TMPL_END_IDX) {
> + orc = orc_find((unsigned long)optprobe_template_func + offset);
> + if (!orc || orc->sp_reg != ORC_REG_SP)
> + return addr;
> + /*
> +  * Since optprobe trampoline doesn't push caller on the stack,
> +  * need to decrement 1 stack entry size
> +  */
> + *sp += orc->sp_offset - sizeof(long);
> + return probe_addr;
> + } else {
> + return probe_addr + offset - TMPL_END_IDX;
> + }
> +}
> +#endif

Hm, I'd like to avoid intertwining kprobes and ORC like this.

ORC unwinds other generated code by assuming the generated code uses a
frame pointer.  Could we do that here?

With CONFIG_FRAME_POINTER, unwinding works because SAVE_REGS_STRING has
ENCODE_FRAME_POINTER, but that's not going to work for ORC.

Instead of these patches, can we 'push %rbp; mov %rsp, %rbp' at the
beginning of the template and 'pop %rbp' at the end?

I guess SAVE_REGS_STRING would need to be smart enough to push the
original saved version of %rbp.  Of course then that breaks the
kretprobe_trampoline() usage, so it may need to be a separate macro.

[ Or make the same change to kretprobe_trampoline().  Then the other
  patch set wouldn't be needed either ;-) ]

Of course the downside is, when you get an interrupt during the frame
pointer setup, unwinding is broken.  But I think that's acceptable for
generated code.  We've lived with that limitation for all code, with
CONFIG_FRAME_POINTER, for many years.

Eventually we may want to have a way to register generated code (and the
ORC for it).

-- 
Josh



Re: [PATCH v2] sched: Warn on long periods of pending need_resched

2021-03-30 Thread Josh Don
Peter,

Since you've already pulled the need_resched warning patch into your
tree, I'm including just the diff based on that patch (in response to
Mel's comments) below. This should be squashed into the original
patch.

Thanks,
Josh

---
From 85796b4d299b1cf3f99bde154a356ce1061221b7 Mon Sep 17 00:00:00 2001
From: Josh Don 
Date: Mon, 22 Mar 2021 20:57:06 -0700
Subject: [PATCH] fixup: sched: Warn on long periods of pending need_resched

---
 kernel/sched/core.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6fdf15eebc0d..c07a4c17205f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -61,17 +61,13 @@ const_debug unsigned int sysctl_sched_features =

 /*
  * Print a warning if need_resched is set for the given duration (if
- * resched_latency_warn_enabled is set).
+ * LATENCY_WARN is enabled).
  *
  * If sysctl_resched_latency_warn_once is set, only one warning will be shown
  * per boot.
- *
- * Resched latency will be ignored for the first resched_boot_quiet_sec, to
- * reduce false alarms.
  */
-int sysctl_resched_latency_warn_ms = 100;
-int sysctl_resched_latency_warn_once = 1;
-static const long resched_boot_quiet_sec = 600;
+__read_mostly int sysctl_resched_latency_warn_ms = 100;
+__read_mostly int sysctl_resched_latency_warn_once = 1;
 #endif /* CONFIG_SCHED_DEBUG */

 /*
@@ -4542,20 +4538,19 @@ unsigned long long task_sched_runtime(struct
task_struct *p)
 }

 #ifdef CONFIG_SCHED_DEBUG
-static u64 resched_latency_check(struct rq *rq)
+static u64 cpu_resched_latency(struct rq *rq)
 {
  int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
- u64 need_resched_latency, now = rq_clock(rq);
+ u64 resched_latency, now = rq_clock(rq);
  static bool warned_once;

  if (sysctl_resched_latency_warn_once && warned_once)
  return 0;

- if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2))
+ if (!need_resched() || !latency_warn_ms)
  return 0;

- /* Disable this warning for the first few mins after boot */
- if (now < resched_boot_quiet_sec * NSEC_PER_SEC)
+ if (system_state == SYSTEM_BOOTING)
  return 0;

  if (!rq->last_seen_need_resched_ns) {
@@ -4565,13 +4560,13 @@ static u64 resched_latency_check(struct rq *rq)
  }

  rq->ticks_without_resched++;
- need_resched_latency = now - rq->last_seen_need_resched_ns;
- if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
+ resched_latency = now - rq->last_seen_need_resched_ns;
+ if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
  return 0;

  warned_once = true;

- return need_resched_latency;
+ return resched_latency;
 }

 static int __init setup_resched_latency_warn_ms(char *str)
@@ -4588,7 +4583,7 @@ static int __init setup_resched_latency_warn_ms(char *str)
 }
 __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
 #else
-static inline u64 resched_latency_check(struct rq *rq) { return 0; }
+static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
 #endif /* CONFIG_SCHED_DEBUG */

 /*
@@ -4614,7 +4609,7 @@ void scheduler_tick(void)
  update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
  curr->sched_class->task_tick(rq, curr, 0);
  if (sched_feat(LATENCY_WARN))
- resched_latency = resched_latency_check(rq);
+ resched_latency = cpu_resched_latency(rq);
  calc_global_load_tick(rq);

 rq_unlock(rq, &rf);
-- 
2.31.0.291.g576ba9dcdaf-goog


Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure

2021-03-30 Thread Josh Don
On Mon, Mar 29, 2021 at 2:55 AM Peter Zijlstra  wrote:
>
> OK, fixed the fails. My tired head made it unconditionally return the
> cookie-id of 'current' instead of task. Pushed out an update.

I see you have the per-task and prctl stuff pulled into your tree. I
can rebase the compound cookie and cgroup api patches on top if you'd
like; not sure if you've already re-ordered it locally. Any other
comments on the former?

> > > Also, we really need a better name than coretag.c.
> >
> > Yea, we don't really otherwise use the phrase "tagging". core_sched.c
> > is probably too confusing given we have sched/core.c.
>
> Right, so I tried core_sched and my fingers already hate it as much as
> kernel/scftorture.c (which I'd assumed my fingers would get used to
> eventually, but n).
>
> Looking at kernel/sched/ C is very overrepresented, so we really don't
> want another I think. B, E, G, H, J, K, N, seem to still be available in
> the first half of the alphabet. Maybe, bonghits.c, gabbleduck.c ?

hardware_vuln.c? Tricky to avoid a C with cpu, core, and cookie :)


Re: [PATCH resend 5/8] sched: cgroup cookie API for core scheduling

2021-03-30 Thread Josh Don
On Tue, Mar 30, 2021 at 2:29 AM Peter Zijlstra  wrote:
> > On Wed, Mar 24, 2021 at 05:40:17PM -0400, Joel Fernandes (Google) wrote:
> > > +
> > > +   if (!tg->core_tagged && val) {
> > > +   /* Tag is being set. Check ancestors and descendants. */
> > > +   if (cpu_core_get_group_cookie(tg) ||
> > > +   cpu_core_check_descendants(tg, true /* tag */)) {
> > > +   ret = -EBUSY;
> > > +   goto out_unlock;
> > > +   }
> >
> > So the desired semantics is to only allow a single tag on any upwards
> > path? Isn't that in conflict with the cgroup requirements?
> >
> > TJ?

I carried this requirement over from the previous iteration, but I
don't see a reason why we can't just dump this and have each task use
the group cookie of its closest tagged ancestor. Joel, is there any
context here I'm missing?

FWIW I also just realized that cpu_core_check_descendants() is busted
as it recurses only on one child.

> > > +   } else if (tg->core_tagged && !val) {
> > > +   /* Tag is being reset. Check descendants. */
> > > +   if (cpu_core_check_descendants(tg, true /* tag */)) {
> >
> > I'm struggling to understand this. If, per the above, you cannot set
> > when either a parent is already set or a child is set, then how can a
> > child be set to refuse clearing?

Yes this is superfluous with the above semantics.


Re: [PATCH -tip v4 10/12] x86/kprobes: Push a fake return address at kretprobe_trampoline

2021-03-29 Thread Josh Poimboeuf
On Fri, Mar 26, 2021 at 10:20:09AM -0400, Steven Rostedt wrote:
> On Fri, 26 Mar 2021 21:03:49 +0900
> Masami Hiramatsu  wrote:
> 
> > I confirmed this is not related to this series, but occurs when I build 
> > kernels with different
> > configs without cleanup.
> > 
> > Once I build kernel with CONFIG_UNWIND_GUESS=y (for testing), and after 
> > that,
> > I build kernel again with CONFIG_UNWIND_ORC=y (but without make clean), this
> > happened. In this case, I guess ORC data might be corrupted?
> > When I cleanup and rebuild, the stacktrace seems correct.
> 
> Hmm, that should be fixed. Seems like we are missing a dependency somewhere.

Thomas reported something similar: for example arch/x86/kernel/head_64.o
doesn't get rebuilt when changing unwinders.

  https://lkml.kernel.org/r/87tuqublrb@nanos.tec.linutronix.de

Masahiro, any idea how we can force a full tree rebuild when changing
the unwinder?

-- 
Josh



Re: [PATCH v3 16/16] objtool,x86: Rewrite retpoline thunk calls

2021-03-29 Thread Josh Poimboeuf
On Fri, Mar 26, 2021 at 04:12:15PM +0100, Peter Zijlstra wrote:
> @@ -61,3 +89,15 @@ SYM_FUNC_END(__x86_indirect_thunk_\reg)
>  #define GEN(reg) EXPORT_THUNK(reg)
>  #include 
>  
> +#undef GEN
> +#define GEN(reg) ALT_THUNK reg
> +#include 
> +
> +#undef GEN
> +#define GEN(reg) __EXPORT_THUNK(__x86_indirect_alt_call_ ## reg)
> +#include 
> +
> +#undef GEN
> +#define GEN(reg) __EXPORT_THUNK(__x86_indirect_alt_jmp_ ## reg)
> +#include 
> +

Git complains about this last newline.

Otherwise everything looks pretty good to me.  Let me run it through the
test matrix.

-- 
Josh



Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure

2021-03-26 Thread Josh Don
Hi Peter,

On Fri, Mar 26, 2021 at 5:10 PM Peter Zijlstra  wrote:
>
> On Wed, Mar 24, 2021 at 05:40:14PM -0400, Joel Fernandes (Google) wrote:
> > From: Josh Don 
> >
> > A single unsigned long is insufficient as a cookie value for core
> > scheduling. We will minimally have cookie values for a per-task and a
> > per-group interface, which must be combined into an overall cookie.
> >
> > This patch adds the infrastructure necessary for setting task and group
> > cookie. Namely, it reworks the core_cookie into a struct, and provides
> > interfaces for setting task and group cookie, as well as other
> > operations (i.e. compare()). Subsequent patches will use these hooks to
> > provide an API for setting these cookies.
> >
>
> *urgh*... so I specifically wanted the task interface first to avoid /
> get-rid of all this madness. And then you keep it :-(

Sorry, I misunderstood the ask here :/ I had separated out the cgroup
interface parts of the patch, leaving (mostly) the parts which
introduced a compound cookie structure. I see now that you just wanted
the plain task interface to start, with no notion of group cookie.

> I've spend the past few hours rewriting patches #2 and #3, and adapting
> #4. The thing was working before I added SHARE_FROM back and introduced
> GET, but now I'm seeing a few FAILs from the selftest.
>
> I'm too tired to make sense of anything much, or even focus my eyes
> consistently, so I'll have to prod at it some more next week, but I've
> pushed out the lot to my queue.git:
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core-sched

Thanks, I'll take a look next week.

> Also, we really need a better name than coretag.c.

Yea, we don't really otherwise use the phrase "tagging". core_sched.c
is probably too confusing given we have sched/core.c.


Re: [PATCH v2] sched: Warn on long periods of pending need_resched

2021-03-25 Thread Josh Don
> On Wed, Mar 24, 2021 at 01:39:16PM +, Mel Gorman wrote:
> I'm not going to NAK because I do not have hard data that shows they must
> exist. However, I won't ACK either because I bet a lot of tasty beverages
> the next time we meet that the following parameters will generate reports
> if removed.
>
> kernel.sched_latency_ns
> kernel.sched_migration_cost_ns
> kernel.sched_min_granularity_ns
> kernel.sched_wakeup_granularity_ns
>
> I know they are altered by tuned for different profiles and some people do
> go the effort to create custom profiles for specific applications. They
> also show up in "Official Benchmarking" such as SPEC CPU 2017 and
> some vendors put a *lot* of effort into SPEC CPU results for bragging
> rights. They show up in technical books and best practice guides for
> applications.  Finally they show up in Google when searching for "tuning
> sched_foo". I'm not saying that any of these are even accurate or a good
> idea, just that they show up near the top of the results and they are
> sufficiently popular that they might as well be an ABI.

+1, these seem like sufficiently well-known scheduler tunables, and
not really SCHED_DEBUG.


Re: [PATCH v2] sched: Warn on long periods of pending need_resched

2021-03-25 Thread Josh Don
On Wed, Mar 24, 2021 at 4:27 AM Mel Gorman  wrote:
>
> I'm not a fan of the name. I know other sysctls have _enabled in the
> name but it's redundant. If you say the name out loud, it sounds weird.
> I would suggest an alternative but see below.

Now using the version rebased by Peter; this control has gone away and
we simply have a scheduling feature, "LATENCY_WARN".

> I suggest semantics and naming similar to hung_task_warnings
> because it's sortof similar. resched_latency_warnings would combine
> resched_latency_warn_enabled and resched_latency_warn_once. 0 would mean
> "never warn", -1 would mean always warn and any positive value means
> "warn X number of times".

See above. I'm happy with the enabled bit being toggled separately by
a sched feature; the warn_once behavior is not overloaded with the
enabling/disabling. Also, I don't see value in "warn X number of
times", given the warning is rate limited anyway.

> > +int sysctl_resched_latency_warn_ms = 100;
> > +int sysctl_resched_latency_warn_once = 1;
>
> Use __read_mostly

Good point, done.

> > +#ifdef CONFIG_SCHED_DEBUG
> > +static u64 resched_latency_check(struct rq *rq)
> > +{
> > + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
> > + u64 need_resched_latency, now = rq_clock(rq);
> > + static bool warned_once;
> > +
> > + if (sysctl_resched_latency_warn_once && warned_once)
> > + return 0;
> > +
>
> That is a global variable that can be modified in parallel and I do not
> think it's properly locked (scheduler_tick is holding rq lock which does
> not protect this).
>
> Consider making resched_latency_warnings atomic and use
> atomic_dec_if_positive. If it drops to zero in this path, disable the
> static branch.
>
> That said, it may be overkill. hung_task_warnings does not appear to have
> special protection that prevents it going to -1 or lower values by accident
> either. Maybe it can afford to be a bit more relaxed because a system that
> is spamming hung task warnings is probably dead or might as well be dead.

There's no real issue if we race over modification to that sysctl.
This is intentionally not more strongly synchronized for that reason.

> > + if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2))
> > + return 0;
> > +
>
> Why is 1ms special? Regardless of the answer, if the sysctl should not
> be 1 then the user should not be able to set it to 1.

Yea let me change that to !latency_warn_ms so it isn't HZ specific.

>
> > + /* Disable this warning for the first few mins after boot */
> > + if (now < resched_boot_quiet_sec * NSEC_PER_SEC)
> > + return 0;
> > +
>
> Check system_state == SYSTEM_BOOTING instead?

Yea, that might be better; let me test that.
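
i.e. something like this (untested sketch) in place of the fixed
resched_boot_quiet_sec window:

	/* Skip the warning until boot has finished */
	if (system_state == SYSTEM_BOOTING)
		return 0;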

> > + if (!rq->last_seen_need_resched_ns) {
> > + rq->last_seen_need_resched_ns = now;
> > + rq->ticks_without_resched = 0;
> > + return 0;
> > + }
> > +
> > + rq->ticks_without_resched++;
> > + need_resched_latency = now - rq->last_seen_need_resched_ns;
> > + if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
> > + return 0;
> > +
>
> The naming need_resched_latency implies it's a boolean but it's not.
> Maybe just resched_latency?
>
> Similarly, resched_latency_check implies it returns a boolean but it
> returns an excessive latency value. At this point I've been reading the
> patch for a long time so I've ran out of naming suggestions :)

The "need_" part does confuse it a bit; I reworded these to hopefully
make it more clear.

> > + warned_once = true;
> > +
> > + return need_resched_latency;
> > +}
> > +
>
> I note that you split when a warning is needed and printing the warning
> but it's not clear why. Sure you are under the RQ lock but there are other
> places that warn under the RQ lock. I suppose for consistency it could
> use SCHED_WARN_ON even though all this code is under SCHED_DEBUG already.

We had seen a circular lock dependency warning (console_sem, pi lock,
rq lock), since printing might need to wake a waiter. However, I do
see plenty of warns under rq->lock, so maybe I missed a patch to
address this?


Re: [PATCH -tip v4 10/12] x86/kprobes: Push a fake return address at kretprobe_trampoline

2021-03-24 Thread Josh Poimboeuf
On Wed, Mar 24, 2021 at 10:40:58AM +0900, Masami Hiramatsu wrote:
> On Tue, 23 Mar 2021 23:30:07 +0100
> Peter Zijlstra  wrote:
> 
> > On Mon, Mar 22, 2021 at 03:41:40PM +0900, Masami Hiramatsu wrote:
> > >   ".global kretprobe_trampoline\n"
> > >   ".type kretprobe_trampoline, @function\n"
> > >   "kretprobe_trampoline:\n"
> > >  #ifdef CONFIG_X86_64
> > 
> > So what happens if we get an NMI here? That is, after the RET but before
> > the push? Then our IP points into the trampoline but we've not done that
> > push yet.
> 
> Not only NMI, but also interrupts can happen. There is no cli/sti here.
> 
> Anyway, thanks for pointing!
> I think in UNWIND_HINT_TYPE_REGS and UNWIND_HINT_TYPE_REGS_PARTIAL cases
> ORC unwinder also has to check the state->ip and if it is 
> kretprobe_trampoline,
> it should be recovered.
> What about this?

I think the REGS and REGS_PARTIAL cases can also be affected by function
graph tracing.  So should they use the generic unwind_recover_ret_addr()
instead of unwind_recover_kretprobe()?

-- 
Josh



Re: [PATCH 1/6] sched: migration changes for core scheduling

2021-03-23 Thread Josh Don
On Mon, Mar 22, 2021 at 8:54 PM Li, Aubrey  wrote:
>
> On 2021/3/22 20:56, Peter Zijlstra wrote:
> > On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote:
> >> Please let me know if I put cookie match check at the right position
> >> in task_hot(), if so, I'll obtain some performance data of it.
> >>

Will be sending out a rebased stack soon, with this updated patch included.


[PATCH v2] sched: Warn on long periods of pending need_resched

2021-03-22 Thread Josh Don
From: Paul Turner 

CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.

This feature is default disabled. It can be toggled on using sysctl
resched_latency_warn_enabled.

Signed-off-by: Paul Turner 
Signed-off-by: Josh Don 
---
Delta from v1:
- separate sysctl for enabling/disabling and triggering warn_once
  behavior
- add documentation
- static branch for the enable
 Documentation/admin-guide/sysctl/kernel.rst | 23 ++
 include/linux/sched/sysctl.h|  4 ++
 kernel/sched/core.c | 78 -
 kernel/sched/debug.c| 10 +++
 kernel/sched/sched.h| 10 +++
 kernel/sysctl.c | 24 +++
 6 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
b/Documentation/admin-guide/sysctl/kernel.rst
index 1d56a6b73a4e..2d4a21d3b79f 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1077,6 +1077,29 @@ ROM/Flash boot loader. Maybe to tell it what to do after
 rebooting. ???
 
 
+resched_latency_warn_enabled
+
+
+Enables/disables a warning that will trigger if need_resched is set for
+longer than sysctl ``resched_latency_warn_ms``. This warning likely
+indicates a kernel bug, such as a failure to call cond_resched().
+
+Requires ``CONFIG_SCHED_DEBUG``.
+
+
+resched_latency_warn_ms
+===
+
+See ``resched_latency_warn_enabled``.
+
+
+resched_latency_warn_once
+=
+
+If set, ``resched_latency_warn_enabled`` will only trigger one warning
+per boot.
+
+
 sched_energy_aware
 ==
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..43a1f5ab819a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -48,6 +48,10 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
 
+extern struct static_key_false resched_latency_warn_enabled;
+extern int sysctl_resched_latency_warn_ms;
+extern int sysctl_resched_latency_warn_once;
+
 int sched_proc_update_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..d69ae342b450 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -58,7 +58,21 @@ const_debug unsigned int sysctl_sched_features =
 #include "features.h"
0;
 #undef SCHED_FEAT
-#endif
+
+/*
+ * Print a warning if need_resched is set for the given duration (if
+ * resched_latency_warn_enabled is set).
+ *
+ * If sysctl_resched_latency_warn_once is set, only one warning will be shown
+ * per boot.
+ *
+ * Resched latency will be ignored for the first resched_boot_quiet_sec, to
+ * reduce false alarms.
+ */
+int sysctl_resched_latency_warn_ms = 100;
+int sysctl_resched_latency_warn_once = 1;
+const long resched_boot_quiet_sec = 600;
+#endif /* CONFIG_SCHED_DEBUG */
 
 /*
  * Number of tasks to iterate in a single balance run.
@@ -4520,6 +4534,58 @@ unsigned long long task_sched_runtime(struct task_struct 
*p)
return ns;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 resched_latency_check(struct rq *rq)
+{
+   int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
+   u64 need_resched_latency, now = rq_clock(rq);
+   static bool warned_once;
+
+   if (sysctl_resched_latency_warn_once && warned_once)
+   return 0;
+
+   if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2))
+   return 0;
+
+   /* Disable this warning for the first few mins after boot */
+   if (now < resched_boot_quiet_sec * NSEC_PER_SEC)
+   return 0;
+
+   if (!rq->last_seen_need_resched_ns) {
+   rq->last_seen_need_resched_ns = now;
+   rq->ticks_without_resched = 0;
+   return 0;
+   }
+
+   rq->ticks_without_resched++;
+   need_resched_latency = now - rq->last_seen_need_resched_ns;
+   if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
+   return 0;
+
+   warned_once = true;
+
+   return need_resched_latency;
+}
+
+static int __init setup_resched_latency_warn_ms(char *str)
+

Re: [PATCH] sched: Warn on long periods of pending need_resched

2021-03-22 Thread Josh Don
On Fri, Mar 19, 2021 at 2:02 AM Mel Gorman  wrote:
>
> Default disabling and hidden behind a static branch would be useful
> because the majority of users are not going to know what to do about
> a need_resched warning and the sysctl is not documented. As Ingo said,
> SCHED_DEBUG is enabled by default a lot. While I'm not sure what motivates
> most distributions, I have found it useful to display topology information
> on boot and in rare cases, for the enabling/disabling of sched features.
> Hence, sched debug features should avoid adding too much overhead where
> possible.
>
> The static branch would mean splitting the very large inline functions
> adding by the patch. The inline section should do a static check only and
> do the main work in a function in kernel/sched/debug.c so it has minimal
> overhead if unused.
>
> --
> Mel Gorman
> SUSE Labs

Makes sense, I'll include this in v2.

Thanks,
Josh


Re: [PATCH 2/6] sched: tagging interface for core scheduling

2021-03-22 Thread Josh Don
> > +static unsigned long sched_core_alloc_task_cookie(void)
> > +{
> > + struct sched_core_task_cookie *ck =
> > + kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
>
> struct sched_core_task_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
>
> Also, those type names are unfortunately long..
>
> > +static void sched_core_get_task_cookie(unsigned long cookie)
> > +{
> > + struct sched_core_task_cookie *ptr =
> > + (struct sched_core_task_cookie *)cookie;
>
> struct sched_core_task_cookie *ptr = (void *)cookie;
>
> Know your language and use it to avoid typing excessively long names :-)

Good point, done. Keeping sched_core_task_cookie for now unless you'd
prefer a replacement, since it is only used internally by coretag.c.


Re: [PATCH -tip v3 05/11] x86/kprobes: Add UNWIND_HINT_FUNC on kretprobe_trampoline code

2021-03-21 Thread Josh Poimboeuf
On Sat, Mar 20, 2021 at 10:05:43PM +0900, Masami Hiramatsu wrote:
> On Sat, 20 Mar 2021 21:16:16 +0900
> Masami Hiramatsu  wrote:
> 
> > On Fri, 19 Mar 2021 21:22:39 +0900
> > Masami Hiramatsu  wrote:
> > 
> > > From: Josh Poimboeuf 
> > > 
> > > Add UNWIND_HINT_FUNC on kretporbe_trampoline code so that ORC
> > > information is generated on the kretprobe_trampoline correctly.
> > > 
> > 
> > Test bot also found a new warning for this.
> > 
> > > >> arch/x86/kernel/kprobes/core.o: warning: objtool: 
> > > >> kretprobe_trampoline()+0x25: call without frame pointer save/setup
> > 
> > With CONFIG_FRAME_POINTER=y.
> > 
> > Of course this can be fixed with additional "push %bp; mov %sp, %bp" before 
> > calling
> > trampoline_handler. But actually we know that this function has a bit 
> > special
> > stack frame too. 
> > 
> > Can I recover STACK_FRAME_NON_STANDARD(kretprobe_trampoline) when 
> > CONFIG_FRAME_POINTER=y ?
> 
> So something like this. Does it work?
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index b31058a152b6..651f337dc880 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -760,6 +760,10 @@ int kprobe_int3_handler(struct pt_regs *regs)
>  }
>  NOKPROBE_SYMBOL(kprobe_int3_handler);
>  
> +#ifdef CONFIG_FRAME_POINTER
> +#undef UNWIND_HINT_FUNC
> +#define UNWIND_HINT_FUNC
> +#endif

This hunk isn't necessary.  The unwind hints don't actually have an
effect with frame pointers.

>  /*
>   * When a retprobed function returns, this code saves registers and
>   * calls trampoline_handler() runs, which calls the kretprobe's handler.
> @@ -797,7 +801,14 @@ asm(
>   ".size kretprobe_trampoline, .-kretprobe_trampoline\n"
>  );
>  NOKPROBE_SYMBOL(kretprobe_trampoline);
> -
> +#ifdef CONFIG_FRAME_POINTER
> +/*
> + * kretprobe_trampoline skips updating frame pointer. The frame pointer
> + * saved in trampoline_handler points to the real caller function's
> + * frame pointer.
> + */
> +STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
> +#endif
>  
>  /*
>   * Called from kretprobe_trampoline

Ack.

-- 
Josh



Re: [PATCH -tip v3 05/11] x86/kprobes: Add UNWIND_HINT_FUNC on kretprobe_trampoline code

2021-03-21 Thread Josh Poimboeuf
On Sat, Mar 20, 2021 at 09:16:16PM +0900, Masami Hiramatsu wrote:
> On Fri, 19 Mar 2021 21:22:39 +0900
> Masami Hiramatsu  wrote:
> 
> > From: Josh Poimboeuf 
> > 
> > Add UNWIND_HINT_FUNC on kretporbe_trampoline code so that ORC
> > information is generated on the kretprobe_trampoline correctly.
> > 
> 
> Test bot also found a new warning for this.
> 
> > >> arch/x86/kernel/kprobes/core.o: warning: objtool: 
> > >> kretprobe_trampoline()+0x25: call without frame pointer save/setup
> 
> With CONFIG_FRAME_POINTER=y.
> 
> Of course this can be fixed with additional "push %bp; mov %sp, %bp" before 
> calling
> trampoline_handler. But actually we know that this function has a bit special
> stack frame too. 
> 
> Can I recover STACK_FRAME_NON_STANDARD(kretprobe_trampoline) when 
> CONFIG_FRAME_POINTER=y ?

Yes, that's what I'd recommend.

-- 
Josh



Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 04:56:30PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 10:30:26AM -0500, Josh Poimboeuf wrote:
> > On Fri, Mar 19, 2021 at 09:06:44AM +0100, Peter Zijlstra wrote:
> > > > Also doesn't the alternative code already insert nops?
> > > 
> > > Problem is that the {call,jmp} *%\reg thing is not fixed length. They're
> > > 2 or 3 bytes depending on which register is picked.
> > 
> > Why do they need to be fixed length?  Objtool can use sym->len as the
> > alternative replacement length.  Then alternatives can add nops as
> > needed.
> 
> UNDEF has size 0. That is, unless these symbols exist in the translation
> unit (they do not) we have no clue.
> 
> Arguably I could parse the symbol name again and then we know the
> register number and thus if we need REX etc.. but I figured we wanted to
> avoid all that.

Ah, makes sense now.

-- 
Josh



Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 04:24:40PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 10:12:59AM -0500, Josh Poimboeuf wrote:
> 
> > > -void elf_add_reloc(struct elf *elf, struct reloc *reloc)
> > > +int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long 
> > > offset,
> > > +   unsigned int type, struct symbol *sym, int addend)
> > >  {
> > > - struct section *sec = reloc->sec;
> > > + struct reloc *reloc;
> > > +
> > > + reloc = malloc(sizeof(*reloc));
> > > + if (!reloc) {
> > > + perror("malloc");
> > > + return -1;
> > > + }
> > > + memset(reloc, 0, sizeof(*reloc));
> > > +
> > > + reloc->sec = sec->reloc;
> > 
> > Maybe elf_add_reloc() could create the reloc section if it doesn't
> > already exist, that would help remove the multiple calls to
> > elf_create_reloc_section().
> 
> I'll do that as yet another patch ;-)

Ok!

> > > + reloc->offset = offset;
> > > + reloc->type = type;
> > > + reloc->sym = sym;
> > > + reloc->addend = addend;
> > >  
> > > 	list_add_tail(&reloc->list, &sec->reloc_list);
> > 
> > This should be sec->reloc->reloc_list?
> 
> Absolutely, there were a few other 'funnies' as well that I just fixed
> whilst trying to make the whole new stack work again :-)

I'm sure some of those were my fault, thanks for testing my crap :-)

-- 
Josh



Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 09:06:44AM +0100, Peter Zijlstra wrote:
> > Also doesn't the alternative code already insert nops?
> 
> Problem is that the {call,jmp} *%\reg thing is not fixed length. They're
> 2 or 3 bytes depending on which register is picked.

Why do they need to be fixed length?  Objtool can use sym->len as the
alternative replacement length.  Then alternatives can add nops as
needed.
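
(For reference, the two lengths in question; the extended registers need
a REX prefix, e.g.:

	call	*%rax		# ff d0		(2 bytes)
	call	*%r11		# 41 ff d3	(3 bytes)

so the replacement isn't the same size for every register.)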

-- 
Josh



Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 10:22:54AM +0100, Peter Zijlstra wrote:
> @@ -814,6 +814,11 @@ struct section *elf_create_reloc_section
>   }
>  }
>  
> +static inline bool is_reloc_section(struct section *reloc)
> +{
> + return reloc->base && reloc->base->reloc == reloc;
> +}

I believe the 2nd condition will always be true, so it can just be
'return reloc->base'.

-- 
Josh



Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 10:47:36AM +0100, Peter Zijlstra wrote:
> Full patch, because diff on a diff on a diff is getting ludicrous hard
> to read :-)

Appreciated :-)

> -void elf_add_reloc(struct elf *elf, struct reloc *reloc)
> +int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset,
> +   unsigned int type, struct symbol *sym, int addend)
>  {
> - struct section *sec = reloc->sec;
> + struct reloc *reloc;
> +
> + reloc = malloc(sizeof(*reloc));
> + if (!reloc) {
> + perror("malloc");
> + return -1;
> + }
> + memset(reloc, 0, sizeof(*reloc));
> +
> + reloc->sec = sec->reloc;

Maybe elf_add_reloc() could create the reloc section if it doesn't
already exist, that would help remove the multiple calls to
elf_create_reloc_section().

> + reloc->offset = offset;
> + reloc->type = type;
> + reloc->sym = sym;
> + reloc->addend = addend;
>  
> 	list_add_tail(&reloc->list, &sec->reloc_list);

This should be sec->reloc->reloc_list?

-- 
Josh



Re: [PATCH v2 10/14] objtool: Extract elf_symbol_add()

2021-03-19 Thread Josh Poimboeuf
On Fri, Mar 19, 2021 at 10:54:05AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 18, 2021 at 09:14:03PM -0500, Josh Poimboeuf wrote:
> > On Thu, Mar 18, 2021 at 06:11:13PM +0100, Peter Zijlstra wrote:
> > > Create a common helper to add symbols.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > >  tools/objtool/elf.c |   57 
> > > ++--
> > >  1 file changed, 33 insertions(+), 24 deletions(-)
> > > 
> > > --- a/tools/objtool/elf.c
> > > +++ b/tools/objtool/elf.c
> > > @@ -290,12 +290,41 @@ static int read_sections(struct elf *elf
> > >   return 0;
> > >  }
> > >  
> > > +static bool elf_symbol_add(struct elf *elf, struct symbol *sym)
> > 
> > How about "elf_add_symbol()" for consistency with my other suggestions
> > (elf_add_reloc() and elf_add_string()).  And return an int.
> 
> Changed the nane, how about void? This latest incarnation doesn't
> actually have an error path. Still doesn't hurt to have one and not use
> it I suppose...

void would be my preference, just to avoid unnecessary error handling
boilerplate in the caller.

-- 
Josh



Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:17PM +0100, Peter Zijlstra wrote:
> When the compiler emits: "CALL __x86_indirect_thunk_\reg" for an
> indirect call, have objtool rewrite it to:
> 
>   ALTERNATIVE "call __x86_indirect_thunk_\reg",
>   "call *%reg", ALT_NOT(X86_FEATURE_RETPOLINE)
> 
> Additionally, in order to not emit endless identical
> .altinst_replacement chunks, use a global symbol for them, see
> __x86_indirect_alt_*.
> 
> This also avoids objtool from having to do code generation.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

This is better than I expected.  Nice workaround for not generating
code.
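
To make it concrete for one register, as I read the patch the end result
for %rax is effectively:

	ALTERNATIVE "call __x86_indirect_thunk_rax",
		    "call *%rax", ALT_NOT(X86_FEATURE_RETPOLINE)

with the replacement bytes coming from the shared
__x86_indirect_alt_call_rax stub rather than a per-site copy.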

> +.macro ALT_THUNK reg
> +
> + .align 1
> +
> +SYM_FUNC_START_NOALIGN(__x86_indirect_alt_call_\reg)
> + ANNOTATE_RETPOLINE_SAFE
> +1:   call*%\reg
> +2:   .skip   5-(2b-1b), 0x90
> +SYM_FUNC_END(__x86_indirect_alt_call_\reg)
> +
> +SYM_FUNC_START_NOALIGN(__x86_indirect_alt_jmp_\reg)
> + ANNOTATE_RETPOLINE_SAFE
> +1:   jmp *%\reg
> +2:   .skip   5-(2b-1b), 0x90
> +SYM_FUNC_END(__x86_indirect_alt_jmp_\reg)

This mysterious code needs a comment.  Shouldn't it be in
.altinstr_replacement or something?

Also doesn't the alternative code already insert nops?

> +int arch_rewrite_retpoline(struct objtool_file *file,
> +struct instruction *insn,
> +struct reloc *reloc)
> +{
> + struct symbol *sym;
> + char name[32] = "";
> +
> + if (!strcmp(insn->sec->name, ".text.__x86.indirect_thunk"))
> + return 0;
> +
> + sprintf(name, "__x86_indirect_alt_%s_%s",
> + insn->type == INSN_JUMP_DYNAMIC ? "jmp" : "call",
> + reloc->sym->name + 21);
> +
> + sym = find_symbol_by_name(file->elf, name);
> + if (!sym) {
> + sym = elf_create_undef_symbol(file->elf, name);
> + if (!sym) {
> + WARN("elf_create_undef_symbol");
> + return -1;
> + }
> + }
> +
> + elf_add_alternative(file->elf, insn, sym,
> + ALT_NOT(X86_FEATURE_RETPOLINE), 5, 5);
> +
> + return 0;
> +}

Need to propagate the error.

-- 
Josh



Re: [PATCH v2 12/14] objtool: Allow archs to rewrite retpolines

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:15PM +0100, Peter Zijlstra wrote:
> @@ -1212,6 +1225,8 @@ static int handle_group_alt(struct objto
>   dest_off = arch_jump_destination(insn);
>   if (dest_off == special_alt->new_off + special_alt->new_len)
>   insn->jump_dest = next_insn_same_sec(file, 
> last_orig_insn);
> + else
> + insn->jump_dest = find_insn(file, insn->sec, dest_off);
>  
>   if (!insn->jump_dest) {
>   WARN_FUNC("can't find alternative jump destination",

So I assume this change is because of the ordering change: now this is
done before add_jump_destinations().

But doesn't that mean the alternative jump modification (changing the
dest to the end of the original insns) will get overwritten later?

Also the new hunk to be an oversimplified version of
add_jump_destinations().  I'm not quite convinced that it will always do
the right thing for this case.

> @@ -1797,11 +1812,15 @@ static int decode_sections(struct objtoo
>   if (ret)
>   return ret;
>  
> - ret = add_jump_destinations(file);
> + /*
> +  * Must be before add_{jump,call}_destination; for they can add
> +  * magic alternatives.
> +  */
> + ret = add_special_section_alts(file);

This reordering is unfortunate.  Maybe there's a better way, though I
don't have any ideas, at least until I get to the most controversial
patch.

-- 
Josh



Re: [PATCH v2 05/14] objtool: Per arch retpoline naming

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:08PM +0100, Peter Zijlstra wrote:
> @@ -872,7 +877,7 @@ static int add_jump_destinations(struct
>   } else if (reloc->sym->type == STT_SECTION) {
>   dest_sec = reloc->sym->sec;
>   dest_off = arch_dest_reloc_offset(reloc->addend);
> - } else if (!strncmp(reloc->sym->name, "__x86_indirect_thunk_", 
> 21)) {
> + } else if (arch_is_retpoline(reloc->sym)) {
>   /*
>* Retpoline jumps are really dynamic jumps in
>* disguise, so convert them accordingly.

There's another one in add_call_destinations().

-- 
Josh



Re: [PATCH v2 11/14] objtool: Add elf_create_undef_symbol()

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:14PM +0100, Peter Zijlstra wrote:
> Allow objtool to create undefined symbols; this allows creating
> relocations to symbols not currently in the symbol table.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  tools/objtool/elf.c |   63 
> 
>  tools/objtool/include/objtool/elf.h |1 
>  2 files changed, 64 insertions(+)
> 
> --- a/tools/objtool/elf.c
> +++ b/tools/objtool/elf.c
> @@ -724,6 +724,69 @@ static int elf_strtab_concat(struct elf
>   return len;
>  }
>  
> +struct symbol *elf_create_undef_symbol(struct elf *elf, const char *name)
> +{
> + struct section *symtab;
> + struct symbol *sym;
> + Elf_Data *data;
> + Elf_Scn *s;
> +
> + sym = malloc(sizeof(*sym));
> + if (!sym) {
> + perror("malloc");
> + return NULL;
> + }
> + memset(sym, 0, sizeof(*sym));
> +
> + sym->name = strdup(name);
> +
> + sym->sym.st_name = elf_strtab_concat(elf, sym->name, NULL);
> + if (sym->sym.st_name == -1)
> + return NULL;
> +
> + sym->sym.st_info = 0x10; /* STB_GLOBAL, STT_NOTYPE */

There's a generic macro for this:
    
sym->sym.st_info = GELF_ST_INFO(STB_GLOBAL, STT_NOTYPE);

And sym->bind and sym->type should probably get set.
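
Something like this (untested):

	sym->bind = STB_GLOBAL;
	sym->type = STT_NOTYPE;
	sym->sym.st_info = GELF_ST_INFO(sym->bind, sym->type);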

-- 
Josh



Re: [PATCH v2 10/14] objtool: Extract elf_symbol_add()

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:13PM +0100, Peter Zijlstra wrote:
> Create a common helper to add symbols.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  tools/objtool/elf.c |   57 
> ++--
>  1 file changed, 33 insertions(+), 24 deletions(-)
> 
> --- a/tools/objtool/elf.c
> +++ b/tools/objtool/elf.c
> @@ -290,12 +290,41 @@ static int read_sections(struct elf *elf
>   return 0;
>  }
>  
> +static bool elf_symbol_add(struct elf *elf, struct symbol *sym)

How about "elf_add_symbol()" for consistency with my other suggestions
(elf_add_reloc() and elf_add_string()).  And return an int.

-- 
Josh



Re: [PATCH v2 09/14] objtool: Extract elf_strtab_concat()

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:12PM +0100, Peter Zijlstra wrote:
> Create a common helper to append strings to a strtab.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  tools/objtool/elf.c |   73 
> +---
>  1 file changed, 42 insertions(+), 31 deletions(-)
> 
> --- a/tools/objtool/elf.c
> +++ b/tools/objtool/elf.c
> @@ -676,13 +676,51 @@ struct elf *elf_open_read(const char *na
>   return NULL;
>  }
>  
> +static int elf_strtab_concat(struct elf *elf, char *name, const char 
> *strtab_name)
> +{
> + struct section *strtab = NULL;
> + Elf_Data *data;
> + Elf_Scn *s;
> + int len;
> +
> + if (strtab_name)
> + strtab = find_section_by_name(elf, strtab_name);
> + if (!strtab)
> + strtab = find_section_by_name(elf, ".strtab");
> + if (!strtab) {
> + WARN("can't find %s or .strtab section", strtab_name);
> + return -1;
> + }

This part's a bit mysterious (and it loses the Clang comment).  Maybe we
can leave the section finding logic in elf_create_section()?

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index b85db6efb9d3..db9ad54894d8 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -676,19 +676,17 @@ struct elf *elf_open_read(const char *name, int flags)
return NULL;
 }
 
-static int elf_strtab_concat(struct elf *elf, char *name, const char 
*strtab_name)
+static int elf_add_string(struct elf *elf, struct section *strtab, char *str)
 {
struct section *strtab = NULL;
Elf_Data *data;
Elf_Scn *s;
int len;
 
-   if (strtab_name)
-   strtab = find_section_by_name(elf, strtab_name);
if (!strtab)
strtab = find_section_by_name(elf, ".strtab");
if (!strtab) {
-   WARN("can't find %s or .strtab section", strtab_name);
+   WARN("can't find .strtab section");
return -1;
}
 
@@ -718,7 +716,7 @@ static int elf_strtab_concat(struct elf *elf, char *name, 
const char *strtab_nam
 struct section *elf_create_section(struct elf *elf, const char *name,
   unsigned int sh_flags, size_t entsize, int 
nr)
 {
-   struct section *sec;
+   struct section *sec, *shstrtab;
size_t size = entsize * nr;
Elf_Scn *s;
 
@@ -777,7 +775,15 @@ struct section *elf_create_section(struct elf *elf, const 
char *name,
sec->sh.sh_addralign = 1;
sec->sh.sh_flags = SHF_ALLOC | sh_flags;
 
-   sec->sh.sh_name = elf_strtab_concat(elf, sec->name, ".shstrtab");
+   /* Add section name to .shstrtab (or .strtab for Clang) */
+   shstrtab = find_section_by_name(elf, ".shstrtab");
+   if (!shstrtab)
+   shstrtab = find_section_by_name(elf, ".strtab");
+   if (!shstrtab) {
+   WARN("can't find .shstrtab or .strtab section");
+   return NULL;
+   }
+   sec->sh.sh_name = elf_add_string(elf, sec->name, shstrtab);
if (sec->sh.sh_name == -1)
return NULL;
 



Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:11:11PM +0100, Peter Zijlstra wrote:
> We have 4 instances of adding a relocation. Create a common helper
> to avoid growing even more.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

I'm not a fan of the API -- how about squashing this in?  Untested, of
course.

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 6a52d56504b1..38ffa3b2595e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -469,12 +469,11 @@ static int create_static_call_sections(struct 
objtool_file *file)
memset(site, 0, sizeof(struct static_call_site));
 
/* populate reloc for 'addr' */
-   if (!elf_create_reloc(file->elf, sec, idx * sizeof(struct 
static_call_site),
- R_X86_64_PC32,
- reloc_to_insn, insn, 0)) {
-   WARN_ELF("elf_create_reloc: static_call_site::addr");
+   if (elf_add_reloc_to_insn(file->elf, sec,
+ idx * sizeof(struct static_call_site),
+ R_X86_64_PC32,
+ insn->sec, insn->offset))
return -1;
-   }
 
/* find key symbol */
key_name = strdup(insn->call_dest->name);
@@ -511,13 +510,11 @@ static int create_static_call_sections(struct 
objtool_file *file)
free(key_name);
 
/* populate reloc for 'key' */
-   if (!elf_create_reloc(file->elf, sec, idx * sizeof(struct 
static_call_site) + 4,
- R_X86_64_PC32,
- reloc_to_sym, key_sym,
- is_sibling_call(insn) * 
STATIC_CALL_SITE_TAIL)) {
-   WARN_ELF("elf_create_reloc: static_call_site::key");
+   if (elf_add_reloc(file->elf, sec,
+ idx * sizeof(struct static_call_site) + 4,
+ R_X86_64_PC32, key_sym,
+ is_sibling_call(insn) * 
STATIC_CALL_SITE_TAIL))
return -1;
-   }
 
idx++;
}
@@ -559,12 +556,11 @@ static int create_mcount_loc_sections(struct objtool_file 
*file)
loc = (unsigned long *)sec->data->d_buf + idx;
memset(loc, 0, sizeof(unsigned long));
 
-   if (!elf_create_reloc(file->elf, sec, idx * sizeof(unsigned 
long),
- R_X86_64_64,
- reloc_to_insn, insn, 0)) {
-   WARN_ELF("elf_create_reloc: __mcount_loc");
+   if (elf_add_reloc_to_insn(file->elf, sec,
+ idx * sizeof(unsigned long),
+ R_X86_64_64,
+ insn->sec, insn->offset))
return -1;
-   }
 
idx++;
}
diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index d2afe2454de4..b3bd97b2b034 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -447,76 +447,73 @@ static int read_symbols(struct elf *elf)
return -1;
 }
 
-static void elf_add_reloc(struct elf *elf, struct reloc *reloc)
+int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset,
+ unsigned int type, struct symbol *sym, int addend)
 {
-   struct section *sec = reloc->sec;
+   struct reloc *reloc;
+
+   reloc = malloc(sizeof(*reloc));
+   if (!reloc) {
+   perror("malloc");
+   return -1;
+   }
+   memset(reloc, 0, sizeof(*reloc));
+
+   reloc->sec = sec->reloc;
+   reloc->offset = offset;
+   reloc->type = type;
+   reloc->sym = sym;
+   reloc->addend = addend;
 
 	list_add_tail(&reloc->list, &sec->reloc_list);
 	elf_hash_add(elf->reloc_hash, &reloc->hash, reloc_hash(reloc));
 
-   sec->changed = true;
-}
-
-void reloc_to_sym(struct reloc *reloc, void *dst)
-{
-   struct symbol *sym = dst;
-   reloc->sym = sym;
+   return 0;
 }
 
-void __reloc_to_insn(struct reloc *reloc, struct section *sec, unsigned long 
offset)
+static int insn_to_sym_addend(struct section *sec, unsigned long offset,
+ struct symbol **sym, int *addend)
 {
if (sec->sym) {
-   reloc->sym = sec->sym;
-   reloc->addend = offset;
-   return;
+   *sym = sec->sym;
+   *addend = offset;
+   return 0;
}
 
/*
 * The Clang assembler strips section symbols, so we have to reference
 * the function symbol instead:
 */
-   reloc->sym = find_symbol_containing(sec, offset);
-   if (!reloc->sym) {
+   *sym = find_symbol_containing(sec, offset);
+   if (!*sym) {
/*
 

Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 12:38:42PM -0500, Josh Poimboeuf wrote:
> On Thu, Mar 18, 2021 at 06:04:25PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 18, 2021 at 11:36:40AM -0500, Josh Poimboeuf wrote:
> > > > I was thinking you could get a section changed without touching
> > > > relocations, but while that is theoretically possible, it is exceedingly
> > > > unlikely (and objtool doesn't do that).
> > > 
> > > Hm?  This is a *relocation* section, not a normal one.  So by
> > > definition, it only changes when its relocations change.
> > 
> > The way I read this code:
> > 
> > 	list_for_each_entry(sec, &elf->sections, list) {
> > if (sec->changed) {
> > +   if (sec->reloc &&
> > +   elf_rebuild_reloc_section(elf, sec->reloc)) {
> > +   WARN_ELF("elf_rebuild_reloc_section");
> > +   return -1;
> > +   }
> > 
> > is that we iterate the regular sections (which could be dirtied because
> > we changed some data), and if that section has a relocation section, we
> > rebuild that for good measure (even though it might not have altered
> > relocations).
> > 
> > Or am I just totally confused ?
> 
> Ah, you're right.  I'm the one that's confused.  I guess I was also
> confused when I wrote that hunk, but it just happens to work anyway.
> 
> It would be cleaner to do something like
> 
> 	if (is_reloc_sec(sec) &&
> 	    elf_rebuild_reloc_section(elf, sec)) {
> 
> so we process the changed reloc section directly, instead of relying on
> the (most likely) fact that the corresponding text section also changed.

i.e., in actual code:

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 66c49c2e20a6..3b3d19a5e626 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -948,8 +948,7 @@ int elf_write(struct elf *elf)
/* Update changed relocation sections and section headers: */
 	list_for_each_entry(sec, &elf->sections, list) {
if (sec->changed) {
-   if (sec->reloc &&
-   elf_rebuild_reloc_section(elf, sec->reloc)) {
+   if (sec->base && elf_rebuild_reloc_section(elf, sec)) {
WARN_ELF("elf_rebuild_reloc_section");
return -1;
}



Re: [sched] 663017c554: WARNING:at_kernel/sched/core.c:#scheduler_tick

2021-03-18 Thread Josh Don
The warning is WAI (holding spinlock for 100ms). However, since this
is expected for locktorture, it makes sense to not have the warning
enabled while the test is running. I can add that to the patch.


Re: [PATCH] kbuild: rebuild GCC plugins when the compiler is upgraded

2021-03-18 Thread Josh Poimboeuf
On Wed, Mar 10, 2021 at 10:44:46PM +0900, Masahiro Yamada wrote:
> > Masahiro,
> >
> > Ping.  Do you have a better approach for building GCC plugins in the
> > external module directory?
> 
> 
> I am not sure if building GCC plugins in the external module directory
> is the right approach.

I'm certainly open to any better ideas that would allow GCC plugins to
be enabled in distro kernel builds.

-- 
Josh



Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 06:04:25PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 18, 2021 at 11:36:40AM -0500, Josh Poimboeuf wrote:
> > > I was thinking you could get a section changed without touching
> > > relocations, but while that is theoretically possible, it is exceedingly
> > > unlikely (and objtool doesn't do that).
> > 
> > Hm?  This is a *relocation* section, not a normal one.  So by
> > definition, it only changes when its relocations change.
> 
> The way I read this code:
> 
> 	list_for_each_entry(sec, &elf->sections, list) {
>   if (sec->changed) {
> + if (sec->reloc &&
> + elf_rebuild_reloc_section(elf, sec->reloc)) {
> + WARN_ELF("elf_rebuild_reloc_section");
> + return -1;
> + }
> 
> is that we iterate the regular sections (which could be dirtied because
> we changed some data), and if that section has a relocation section, we
> rebuild that for good measure (even though it might not have altered
> relocations).
> 
> Or am I just totally confused ?

Ah, you're right.  I'm the one that's confused.  I guess I was also
confused when I wrote that hunk, but it just happens to work anyway.

It would be cleaner to do something like

	if (is_reloc_sec(sec) &&
	    elf_rebuild_reloc_section(elf, sec)) {

so we process the changed reloc section directly, instead of relying on
the (most likely) fact that the corresponding text section also changed.

-- 
Josh



Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 01:57:03PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 17, 2021 at 07:49:17PM -0500, Josh Poimboeuf wrote:
> > On Wed, Mar 17, 2021 at 09:12:15AM +0100, Peter Zijlstra wrote:
> > > On Tue, Mar 16, 2021 at 10:34:17PM -0500, Josh Poimboeuf wrote:
> > > > On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote:
> > > > > --- a/tools/objtool/elf.c
> > > > > +++ b/tools/objtool/elf.c
> > > > > @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru
> > > > >  
> > > > > 	list_add_tail(&reloc->list, &sec->reloc_list);
> > > > > 	elf_hash_add(elf->reloc_hash, &reloc->hash, reloc_hash(reloc));
> > > > > +
> > > > > + sec->rereloc = true;
> > > > >  }
> > > > 
> > > > Can we just reuse sec->changed for this?  Something like this on top
> > > > (untested of course):
> > > 
> > > I think my worry was that we'd dirty too much and slow down the write,
> > > but I haven't done any actual performance measurements on this.
> > 
> > Really?  I thought my proposal was purely aesthetic, no functional
> > change, but my brain is toasty this week due to other distractions so
> > who knows.
> 
> I was thinking you could get a section changed without touching
> relocations, but while that is theoretically possible, it is exceedingly
> unlikely (and objtool doesn't do that).

Hm?  This is a *relocation* section, not a normal one.  So by
definition, it only changes when its relocations change.

-- 
Josh



Re: [PATCH 3/3] static_call: Fix static_call_update() sanity check

2021-03-18 Thread Josh Poimboeuf
On Thu, Mar 18, 2021 at 12:31:59PM +0100, Peter Zijlstra wrote:
>   if (!kernel_text_address((unsigned long)site_addr)) {
> - WARN_ONCE(1, "can't patch static call site at 
> %pS",
> + /*
> +  * This skips patching __exit, which is part of
> +  * init_section_contains() but is not part of
> +  * kernel_text_address().
> +  *
> +  * Skipping __exit is fine since it will never
> +  * be executed.
> +  */
> + WARN_ONCE(!static_call_is_init(site),
> +   "can't patch static call site at %pS",
> site_addr);
>   continue;
>   }

It might be good to clarify the situation for __exit in modules in the
comment and/or changelog, as they both seem to be implicitly talking
only about __exit in vmlinux.

For CONFIG_MODULE_UNLOAD, the code ends up in the normal text area, so
static_call_is_init() is false and kernel_text_address() is true.

For !CONFIG_MODULE_UNLOAD, the code gets discarded during module load,
so static_call_is_init() and kernel_text_address() are both false.  I
guess that will trigger a warning?

-- 
Josh



Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-17 Thread Josh Poimboeuf
On Wed, Mar 17, 2021 at 09:12:15AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 16, 2021 at 10:34:17PM -0500, Josh Poimboeuf wrote:
> > On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote:
> > > --- a/tools/objtool/elf.c
> > > +++ b/tools/objtool/elf.c
> > > @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru
> > >  
> > > 	list_add_tail(&reloc->list, &sec->reloc_list);
> > > 	elf_hash_add(elf->reloc_hash, &reloc->hash, reloc_hash(reloc));
> > > +
> > > + sec->rereloc = true;
> > >  }
> > 
> > Can we just reuse sec->changed for this?  Something like this on top
> > (untested of course):
> 
> I think my worry was that we'd dirty too much and slow down the write,
> but I haven't done any actual performance measurements on this.

Really?  I thought my proposal was purely aesthetic, no functional
change, but my brain is toasty this week due to other distractions so
who knows.

> Let me do a few runs and see if it matters at all.

-- 
Josh



Re: [PATCH 6/9] objtool: Add elf_create_undef_symbol()

2021-03-17 Thread Josh Poimboeuf
On Wed, Mar 17, 2021 at 03:13:43PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 17, 2021 at 02:52:23PM +0100, Miroslav Benes wrote:
> 
> > > + if (!elf_symbol_add(elf, sym, SHN_XINDEX)) {
> > > + WARN("elf_symbol_add");
> > > + return NULL;
> > > + }
> > 
> > SHN_XINDEX means that the extended section index is used. Above you seem 
> > to use it in the opposite sense too (assigning to shndx when shndx_data is 
> > NULL). While it makes the code easier to handle, it is a bit confusing 
> > (and maybe I am just confused now). Could you add a comment about that, 
> > please? elf_symbol_add() seems like a good place.
> 
> Yes, that was a horrible thing to do :/ And you understood it right.
> 
> Looking at it again, I'm not sure it is actually correct tho; shouldn't
> elf_create_undef_symbol() also look at gelf_getsymshndx() of symtab ?
> 
> What toolchain generates these extended sections and how? That is, how
> do I test this crud..

SHN_XINDEX is basically a special-case extension to original ELF for
supporting more than 64k sections.
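You typically only hit it with -ffunction-sections (or similarly
section-happy builds) on big objects, where the section count overflows
the 16-bit st_shndx field.  On the read side it looks roughly like this
(sketch, using the libelf gelf_getsymshndx() API):

	GElf_Sym sym;
	Elf32_Word shndx;

	/* symtab_data = .symtab data, shndx_data = .symtab_shndx data */
	gelf_getsymshndx(symtab_data, shndx_data, i, &sym, &shndx);

	/* the real index lives in .symtab_shndx when st_shndx overflows */
	idx = (sym.st_shndx == SHN_XINDEX) ? shndx : sym.st_shndx;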

-- 
Josh



Re: [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C

2021-03-17 Thread Josh Poimboeuf
On Wed, Mar 17, 2021 at 11:12:42AM -0700, Andy Lutomirski wrote:
> ret_from_fork is written in asm, slightly differently, for x86_32 and
> x86_64.  Convert it to C.
> 
> This is a straight conversion without any particular cleverness.  As a
> further cleanup, the code that sets up the ret_from_fork argument registers
> could be adjusted to put the arguments in the correct registers.
> 
> This will cause the ORC unwinder to find pt_regs even for kernel threads on
> x86_64.  This seems harmless.
> 
> The 32-bit comment above the now-deleted schedule_tail_wrapper was
> obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
> same problem more cleanly.
> 
> Cc: Josh Poimboeuf 
> Signed-off-by: Andy Lutomirski 

Acked-by: Josh Poimboeuf 

-- 
Josh



Re: [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads

2021-03-17 Thread Josh Poimboeuf
On Wed, Mar 17, 2021 at 11:12:41AM -0700, Andy Lutomirski wrote:
> @@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
>   /* Kernel thread ? */
>   if (unlikely(p->flags & PF_KTHREAD)) {
>   memset(childregs, 0, sizeof(struct pt_regs));
> +
> + /*
> +  * Even though there is no real user state here, these
> +  * are were user regs belong, and kernel_execve() will

s/were/where/

> +  * overwrite them with user regs.  Put an obviously
> +  * invalid value that nonetheless causes user_mode(regs)
> +  * to return true in CS.
> +  *
> +  * This also prevents the unwinder from thinking that there
> +  * is invalid kernel state at the top of the stack.
> +  */
> + childregs->cs = 3;
> +
>   kthread_frame_init(frame, sp, arg);
>   return 0;
>   }

Acked-by: Josh Poimboeuf 

-- 
Josh



Re: [PATCH] sched: Warn on long periods of pending need_resched

2021-03-17 Thread Josh Don
On Wed, Mar 17, 2021 at 1:31 AM Ingo Molnar  wrote:
>
>
> * Josh Don  wrote:
>
> > +static inline u64 resched_latency_check(struct rq *rq)
> > +{
> > + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
> > + bool warn_only_once = (latency_warn_ms == 
> > RESCHED_DEFAULT_WARN_LATENCY_MS);
> > + u64 need_resched_latency, now = rq_clock(rq);
> > + static bool warned_once;
> > +
> > + if (warn_only_once && warned_once)
> > + return 0;
> > +
> > + if (!need_resched() || latency_warn_ms < 2)
> > + return 0;
> > +
> > + /* Disable this warning for the first few mins after boot */
> > + if (now < RESCHED_BOOT_QUIET_SEC * NSEC_PER_SEC)
> > + return 0;
> > +
> > + if (!rq->last_seen_need_resched_ns) {
> > + rq->last_seen_need_resched_ns = now;
> > + rq->ticks_without_resched = 0;
> > + return 0;
> > + }
> > +
> > + rq->ticks_without_resched++;
>
> So AFAICS this will only really do something useful on full-nohz
> kernels with sufficiently long scheduler ticks, right?

Not quite sure what you mean; it is actually the inverse? Since we
rely on the tick to detect the resched latency, on nohz-full we won't
have detection on cpus running a single thread. The ideal scenario is
!nohz-full and tick interval << warn_ms.

> On other kernels the scheduler tick interrupt, when it returns to
> user-space, will trigger a reschedule if it sees a need_resched.

True for the case where we return to userspace, but we could instead
be executing in a non-preemptible region of the kernel. This is where
we've seen/fixed kernel bugs.
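
The typical offender looks something like this (hypothetical sketch):

	list_for_each_entry(obj, &big_list, node) {
		expensive_cleanup(obj);
		/* missing: cond_resched(); */
	}

i.e. a long kernel-side loop with no preemption point, which is exactly
the case the tick-based check is meant to catch.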

Best,
Josh


Re: [PATCH] sched: Warn on long periods of pending need_resched

2021-03-17 Thread Josh Don
On Wed, Mar 17, 2021 at 1:25 AM Ingo Molnar  wrote:
>
> * Josh Don  wrote:
>
> > If resched_latency_warn_ms is set to the default value, only one warning
> > will be produced per boot.
>
> Looks like a value hack, should probably be a separate flag,
> defaulting to warn-once.

Agreed, done.

> > This warning only exists under CONFIG_SCHED_DEBUG. If it goes off, it is
> > likely that there is a missing cond_resched() somewhere.
>
> CONFIG_SCHED_DEBUG is default-y, so most distros have it enabled.

To avoid log spam for people who don't care, I was considering having
the feature default disabled. Perhaps a better alternative is to only
show a single line warning and not print the full backtrace by
default. Does the latter sound good to you?
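
Roughly, something like a one-line ratelimited printk instead of the
full WARN (sketch):

	pr_warn_ratelimited("sched: CPU %d: need_resched set for %llu ns without schedule\n",
			    cpu, latency);

and only dump a backtrace when someone explicitly opts in.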

> > +#ifdef CONFIG_KASAN
> > +#define RESCHED_DEFAULT_WARN_LATENCY_MS 101
> > +#define RESCHED_BOOT_QUIET_SEC 600
> > +#else
> > +#define RESCHED_DEFAULT_WARN_LATENCY_MS 51
> > +#define RESCHED_BOOT_QUIET_SEC 300
> >  #endif
> > +int sysctl_resched_latency_warn_ms = RESCHED_DEFAULT_WARN_LATENCY_MS;
> > +#endif /* CONFIG_SCHED_DEBUG */
>
> I'd really just make this a single value - say 100 or 200 msecs.

Replacing these both with a single value (the more conservative
default of 100ms and 600s).

> > +static inline void resched_latency_warn(int cpu, u64 latency)
> > +{
> > + static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 
> > 1);
> > +
> > +	WARN(__ratelimit(&latency_check_ratelimit),
> > +  "CPU %d: need_resched set for > %llu ns (%d ticks) "
> > +  "without schedule\n",
> > +  cpu, latency, cpu_rq(cpu)->ticks_without_resched);
> > +}
>
> Could you please put the 'sched:' prefix into scheduler warnings.
> Let's have a bit of a namespace structure in new warnings.

Sounds good, done.


Re: [PATCH] objtool,static_call: Don't emit static_call_site for .exit.text

2021-03-17 Thread Josh Poimboeuf
On Wed, Mar 17, 2021 at 01:45:57PM +0100, Peter Zijlstra wrote:
> arguably it simply isn't a good idea to use static_call() in __exit
> code anyway, since module unload is never a performance critical path.

Couldn't you make the same argument about __init functions, which are
allowed to do static calls?

We might consider a STATIC_CALL_SITE_EXIT flag, but I suppose we've run
out of flag space.
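
(The flags are carried in the low bits of the static_call_site key
offset; since the key is at least 4-byte aligned that only leaves two
bits, and both are already taken:

	/* include/linux/static_call_types.h */
	#define STATIC_CALL_SITE_TAIL 1UL	/* tail call */
	#define STATIC_CALL_SITE_INIT 2UL	/* init section */
	#define STATIC_CALL_SITE_FLAGS 3UL

so an EXIT flag would need to find its bit somewhere else.)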

-- 
Josh



[PATCH] sched: Warn on long periods of pending need_resched

2021-03-16 Thread Josh Don
From: Paul Turner 

CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. Monitoring is done via
the tick and the accuracy is hence limited to jiffy scale. This also
means that we won't trigger the warning if the tick is disabled.

If resched_latency_warn_ms is set to the default value, only one warning
will be produced per boot.

This warning only exists under CONFIG_SCHED_DEBUG. If it goes off, it is
likely that there is a missing cond_resched() somewhere.

Signed-off-by: Paul Turner 
Signed-off-by: Josh Don 
---
We've caught various bugs using this patch. In fact, a followup patch to
this one will be a patch introducing a missing cond_resched(). However,
this may be too noisy to have as default enabled (especially given that
it requires some tuning); I'm open to having this default disabled
instead.
 kernel/sched/core.c  | 91 
 kernel/sched/sched.h |  6 +++
 kernel/sysctl.c  |  8 
 3 files changed, 105 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..ac037fc87d5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -58,7 +58,25 @@ const_debug unsigned int sysctl_sched_features =
 #include "features.h"
0;
 #undef SCHED_FEAT
+
+/*
+ * Print a warning if need_resched is set for at least this long. At the
+ * default value, only a single warning will be printed per boot.
+ *
+ * Values less than 2 disable the feature.
+ *
+ * A kernel compiled with CONFIG_KASAN tends to run more slowly on average.
+ * Increase the need_resched timeout to reduce false alarms.
+ */
+#ifdef CONFIG_KASAN
+#define RESCHED_DEFAULT_WARN_LATENCY_MS 101
+#define RESCHED_BOOT_QUIET_SEC 600
+#else
+#define RESCHED_DEFAULT_WARN_LATENCY_MS 51
+#define RESCHED_BOOT_QUIET_SEC 300
 #endif
+int sysctl_resched_latency_warn_ms = RESCHED_DEFAULT_WARN_LATENCY_MS;
+#endif /* CONFIG_SCHED_DEBUG */
 
 /*
  * Number of tasks to iterate in a single balance run.
@@ -4520,6 +4538,71 @@ unsigned long long task_sched_runtime(struct task_struct 
*p)
return ns;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+
+static inline u64 resched_latency_check(struct rq *rq)
+{
+   int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms);
+   bool warn_only_once = (latency_warn_ms == 
RESCHED_DEFAULT_WARN_LATENCY_MS);
+   u64 need_resched_latency, now = rq_clock(rq);
+   static bool warned_once;
+
+   if (warn_only_once && warned_once)
+   return 0;
+
+   if (!need_resched() || latency_warn_ms < 2)
+   return 0;
+
+   /* Disable this warning for the first few mins after boot */
+   if (now < RESCHED_BOOT_QUIET_SEC * NSEC_PER_SEC)
+   return 0;
+
+   if (!rq->last_seen_need_resched_ns) {
+   rq->last_seen_need_resched_ns = now;
+   rq->ticks_without_resched = 0;
+   return 0;
+   }
+
+   rq->ticks_without_resched++;
+   need_resched_latency = now - rq->last_seen_need_resched_ns;
+   if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC)
+   return 0;
+
+   warned_once = true;
+
+   return need_resched_latency;
+}
+
+static inline void resched_latency_warn(int cpu, u64 latency)
+{
+   static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);
+
+	WARN(__ratelimit(&latency_check_ratelimit),
+"CPU %d: need_resched set for > %llu ns (%d ticks) "
+"without schedule\n",
+cpu, latency, cpu_rq(cpu)->ticks_without_resched);
+}
+
+
+static int __init setup_resched_latency_warn_ms(char *str)
+{
+   long val;
+
+	if ((kstrtol(str, 0, &val))) {
+   pr_warn("Unable to set resched_latency_warn_ms\n");
+   return 1;
+   }
+
+   sysctl_resched_latency_warn_ms = val;
+   return 1;
+}
+__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
+#else
+static inline u64 resched_latency_check(struct rq *rq) { return 0; }
+static inline void resched_latency_warn(int cpu, u64 latency) {}
+#endif /* CONFIG_SCHED_DEBUG */
+
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -4531,6 +4614,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
unsigned long thermal_pressure;
+   u64 resched_latency;
 
arch_scale_freq_tick();
sched_clock_tick();
@@ -4541,11 +4625,15 @@ void scheduler_tick(void)
thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
update_thermal_load_avg(rq

Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic

2021-03-16 Thread Josh Poimboeuf
On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote:
> --- a/tools/objtool/elf.c
> +++ b/tools/objtool/elf.c
> @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru
>  
> 	list_add_tail(&reloc->list, &sec->reloc_list);
> 	elf_hash_add(elf->reloc_hash, &reloc->hash, reloc_hash(reloc));
> +
> + sec->rereloc = true;
>  }

Can we just reuse sec->changed for this?  Something like this on top
(untested of course):

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index addcec88ac9f..b9cb74a54681 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -480,7 +480,7 @@ void elf_add_reloc(struct elf *elf, struct reloc *reloc)
 	list_add_tail(&reloc->list, &sec->reloc_list);
 	elf_hash_add(elf->reloc_hash, &reloc->hash, reloc_hash(reloc));
 
-   sec->rereloc = true;
+   sec->changed = true;
 }
 
 static int read_rel_reloc(struct section *sec, int i, struct reloc *reloc, 
unsigned int *symndx)
@@ -882,9 +882,6 @@ int elf_rebuild_reloc_section(struct elf *elf, struct 
section *sec)
struct reloc *reloc;
int nr;
 
-   sec->changed = true;
-   elf->changed = true;
-
nr = 0;
 	list_for_each_entry(reloc, &sec->reloc_list, list)
nr++;
@@ -894,8 +891,6 @@ int elf_rebuild_reloc_section(struct elf *elf, struct 
section *sec)
case SHT_RELA: return elf_rebuild_rela_reloc_section(sec, nr);
default:   return -1;
}
-
-   sec->rereloc = false;
 }
 
 int elf_write_insn(struct elf *elf, struct section *sec,
@@ -950,14 +945,15 @@ int elf_write(struct elf *elf)
struct section *sec;
Elf_Scn *s;
 
-	list_for_each_entry(sec, &elf->sections, list) {
-   if (sec->reloc && sec->reloc->rereloc)
-   elf_rebuild_reloc_section(elf, sec->reloc);
-   }
-
-   /* Update section headers for changed sections: */
+   /* Update changed relocation sections and section headers: */
list_for_each_entry(sec, >sections, list) {
if (sec->changed) {
+   if (sec->reloc &&
+   elf_rebuild_reloc_section(elf, sec->reloc)) {
+   WARN_ELF("elf_rebuild_reloc_section");
+   return -1;
+   }
+
s = elf_getscn(elf->elf, sec->idx);
if (!s) {
WARN_ELF("elf_getscn");
@@ -969,6 +965,7 @@ int elf_write(struct elf *elf)
}
 
sec->changed = false;
+   elf->changed = true;
}
}
 
diff --git a/tools/objtool/include/objtool/elf.h 
b/tools/objtool/include/objtool/elf.h
index 9fdd4c5f9f32..e6890cc70a25 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -39,7 +39,7 @@ struct section {
char *name;
int idx;
unsigned int len;
-   bool changed, text, rodata, noinstr, rereloc;
+   bool changed, text, rodata, noinstr;
 };
 
 struct symbol {



Re: [PATCH 4/9] objtool: Fix static_call list generation

2021-03-16 Thread Josh Poimboeuf
On Fri, Mar 12, 2021 at 06:16:17PM +0100, Peter Zijlstra wrote:
> @@ -1701,6 +1706,9 @@ static int decode_sections(struct objtoo
>   if (ret)
>   return ret;
>  
> + /*
> +  * Must be before add_{jump_call}_desetination.
> +  */

s/desetination/destination/

-- 
Josh



Re: [PATCH 2/9] objtool: Correctly handle retpoline thunk calls

2021-03-16 Thread Josh Poimboeuf
On Fri, Mar 12, 2021 at 06:16:15PM +0100, Peter Zijlstra wrote:
> Just like JMP handling, convert a direct CALL to a retpoline thunk
> into a retpoline safe indirect CALL.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  tools/objtool/check.c |   12 
>  1 file changed, 12 insertions(+)
> 
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -953,6 +953,18 @@ static int add_call_destinations(struct
> dest_off);
>   return -1;
>   }
> +
> + } else if (!strncmp(reloc->sym->name, "__x86_indirect_thunk_", 
> 21)) {
> + /*
> +  * Retpoline calls are really dynamic calls in
> +  * disguise, so convert them accodingly.

s/accodingly/accordingly/
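
(For context, the conversion the quoted comment describes presumably boils
down to retyping the instruction, roughly like the sketch below -- this is a
guess at the shape of the change, not the actual hunk; INSN_CALL_DYNAMIC and
the retpoline_safe flag are assumed from other objtool code:)

			/* treat the thunk call as a retpoline-safe indirect call */
			insn->type = INSN_CALL_DYNAMIC;
			insn->retpoline_safe = true;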

-- 
Josh



Re: [PATCH -tip v2 00/10] kprobes: Fix stacktrace with kretprobes

2021-03-15 Thread Josh Poimboeuf
On Fri, Mar 12, 2021 at 03:41:44PM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> Here is the 2nd version of the series to fix the stacktrace with kretprobe.
> 
> The 1st series is here;
> 
> https://lore.kernel.org/bpf/161495873696.346821.10161501768906432924.stgit@devnote2/
> 
> In this version I merged the ORC unwinder fix for kretprobe which was
> discussed in the previous thread. [3/10] is updated according to
> Miroslav's comment. [4/10] is updated to simplify the code. [5/10]-[9/10]
> were discussed in the previous thread and are introduced to the series.
> 
> Daniel, can you test this again as well? Josh and I discussed a slightly
> different method and I've implemented it in this version.
> 
> This actually changes the kretprobe behavior a bit: the instruction
> pointer in the pt_regs passed to the kretprobe user handler is now
> correctly set to the real return address, so user handlers can get it
> via the instruction_pointer() API.
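
(As an aside, a return handler consuming that corrected IP would look roughly
like the sketch below -- not taken from the series; the handler name and
message are made up:)

	#include <linux/kprobes.h>

	static int my_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
	{
		/* with this series, regs->ip is the real return address */
		pr_info("returning to %pS\n", (void *)instruction_pointer(regs));
		return 0;
	}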

When I add WARN_ON(1) to a test kretprobe, it doesn't unwind properly.

show_trace_log_lvl() reads the entire stack in lockstep with calls to
the unwinder so that it can decide which addresses get prefixed with
'?'.  So it expects to find an actual return address on the stack.
Instead it finds %rsp.  So it never syncs up with unwind_next_frame()
and shows all remaining addresses as unreliable.

  Call Trace:
   __kretprobe_trampoline_handler+0xca/0x1a0
   trampoline_handler+0x3d/0x50
   kretprobe_trampoline+0x25/0x50
   ? init_test_probes.cold+0x323/0x387
   ? init_kprobes+0x144/0x14c
   ? init_optprobes+0x15/0x15
   ? do_one_initcall+0x5b/0x300
   ? lock_is_held_type+0xe8/0x140
   ? kernel_init_freeable+0x174/0x2cd
   ? rest_init+0x233/0x233
   ? kernel_init+0xa/0x11d
   ? ret_from_fork+0x22/0x30

How about pushing 'kretprobe_trampoline' instead of %rsp for the return
address placeholder.  That fixes the above test, and removes the need
for the weird 'state->ip == sp' check:

  Call Trace:
   __kretprobe_trampoline_handler+0xca/0x150
   trampoline_handler+0x3d/0x50
   kretprobe_trampoline+0x29/0x50
   ? init_test_probes.cold+0x323/0x387
   elfcorehdr_read+0x10/0x10
   init_kprobes+0x144/0x14c
   ? init_optprobes+0x15/0x15
   do_one_initcall+0x72/0x280
   kernel_init_freeable+0x174/0x2cd
   ? rest_init+0x122/0x122
   kernel_init+0xa/0x10e
   ret_from_fork+0x22/0x30

Though, init_test_probes.cold() (the real return address) is still
listed as unreliable.  Maybe we need a call to kretprobe_find_ret_addr()
in show_trace_log_lvl(), similar to the ftrace_graph_ret_addr() call
there.
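
Concretely, such a call would slot in next to the existing
ftrace_graph_ret_addr() handling in the stack-scanning loop -- a sketch only,
with the iterator variable and the exact kretprobe_find_ret_addr() signature
assumed from this series:

	/* inside the stack-scanning loop of show_trace_log_lvl() */
	real_addr = ftrace_graph_ret_addr(task, &graph_idx, addr, stack);
	if (is_kretprobe_trampoline(real_addr))
		real_addr = kretprobe_find_ret_addr(task, &kr_iter);
	if (real_addr != addr)
		printk_stack_address(addr, 0, log_lvl);
	printk_stack_address(real_addr, reliable, log_lvl);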



diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 06f33bfebc50..70188fdd288e 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -766,19 +766,19 @@ asm(
"kretprobe_trampoline:\n"
/* We don't bother saving the ss register */
 #ifdef CONFIG_X86_64
-   "   pushq %rsp\n"
+   /* Push fake return address to tell the unwinder it's a kretprobe */
+   "   pushq $kretprobe_trampoline\n"
UNWIND_HINT_FUNC
"   pushfq\n"
SAVE_REGS_STRING
"   movq %rsp, %rdi\n"
"   call trampoline_handler\n"
-   /* Replace saved sp with true return address. */
+   /* Replace the fake return address with the real one. */
"   movq %rax, 19*8(%rsp)\n"
RESTORE_REGS_STRING
"   popfq\n"
 #else
"   pushl %esp\n"
-   UNWIND_HINT_FUNC
"   pushfl\n"
SAVE_REGS_STRING
"   movl %esp, %eax\n"
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 1d1b9388a1b1..1d3de84d2410 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -548,8 +548,7 @@ bool unwind_next_frame(struct unwind_state *state)
 * In those cases, find the correct return address from
 * task->kretprobe_instances list.
 */
-   if (state->ip == sp ||
-   is_kretprobe_trampoline(state->ip))
+   if (is_kretprobe_trampoline(state->ip))
state->ip = kretprobe_find_ret_addr(state->task,
&state->kr_iter);
 



