Re: [PATCH 1/1] livepatch: Rename KLP_* to KLP_TRANSITION_*
On Tue, May 07, 2024 at 01:01:11PM +0800, zhangwar...@gmail.com wrote: > From: Wardenjohn > > The original macros of KLP_* is about the state of the transition. > Rename macros of KLP_* to KLP_TRANSITION_* to fix the confusing > description of klp transition state. > > Signed-off-by: Wardenjohn Acked-by: Josh Poimboeuf -- Josh
Re: [PATCH 0/1] *** Replace KLP_* to KLP_TRANSITION_* ***
On Tue, May 07, 2024 at 10:56:09AM +0800, zhang warden wrote: > > > > On May 7, 2024, at 10:41, Josh Poimboeuf wrote: > > > > On Tue, May 07, 2024 at 10:21:40AM +0800, zhang warden wrote: > >> > >> > >>> > >>> transition state. With this marcos renamed, comments are not > >>> necessary at this point. > >>> > >> Sorry for my careless, the comment still remains in the code. However, > >> comment in the code do no harms here. Maybe it can be kept. > > > > The comments aren't actually correct. > > > > For example, KLP_TRANSITION_UNPATCHED doesn't always mean "transitioning > > to unpatched state". Sometimes it means "transitioning *from* unpatched > > state". > > > > -- > > Josh > > OK, I got it. I will remove the comment later. After all, comment is > not necessary at this point after we rename the macros. Yeah, removing them altogether might be best, as the meaning of these can vary slightly depending on the operation (patching vs unpatching), and also depending on where it's stored (task->patch_state vs klp_target_state). -- Josh
Re: [PATCH 0/1] *** Replace KLP_* to KLP_TRANSITION_* ***
On Tue, May 07, 2024 at 10:21:40AM +0800, zhang warden wrote: > > > > > > transition state. With this marcos renamed, comments are not > > necessary at this point. > > > Sorry for my careless, the comment still remains in the code. However, > comment in the code do no harms here. Maybe it can be kept. The comments aren't actually correct. For example, KLP_TRANSITION_UNPATCHED doesn't always mean "transitioning to unpatched state". Sometimes it means "transitioning *from* unpatched state". -- Josh
Re: [PATCH] livepatch.h: Add comment to klp transition state
On Mon, Apr 29, 2024 at 03:26:28PM +0800, zhangwar...@gmail.com wrote: > From: Wardenjohn > > livepatch.h use KLP_UNDEFINED\KLP_UNPATCHED\KLP_PATCHED for klp transition > state. > When livepatch is ready but idle, using KLP_UNDEFINED seems very confusing. > In order not to introduce potential risks to kernel, just update comment > to these state. > --- > include/linux/livepatch.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h > index 9b9b38e89563..b6a214f2f8e3 100644 > --- a/include/linux/livepatch.h > +++ b/include/linux/livepatch.h > @@ -18,9 +18,9 @@ > #if IS_ENABLED(CONFIG_LIVEPATCH) > > /* task patch states */ > -#define KLP_UNDEFINED-1 > -#define KLP_UNPATCHED 0 > -#define KLP_PATCHED 1 > +#define KLP_UNDEFINED-1 /* idle, no transition in progress */ > +#define KLP_UNPATCHED 0 /* transitioning to unpatched state */ > +#define KLP_PATCHED 1 /* transitioning to patched state */ Instead of the comments, how about we just rename them to KLP_TRANSITION_IDLE KLP_TRANSITION_UNPATCHED KLP_TRANSITION_PATCHED which shouldn't break userspace AFAIK. -- Josh
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote: > On Sat, 13 Apr 2024 12:53:38 +0200 > Peter Zijlstra wrote: > > > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote: > > > > > > Anyway, since we typically run stuff from NMI context, accessing user > > > > data is 'interesting'. As such I would really like to make this work > > > > depend on the call-graph rework that pushes all the user access bits > > > > into return-to-user. > > > > > > Cool, I assume that's the SFRAME work? Are there pointers to work I > > > could look at and think about what a rebase looks like? Or do you have > > > someone in mind I should work with for this? > > > > I've been offline for a little while and still need to catch up with > > things myself. > > > > Josh was working on that when I dropped off IIRC, I'm not entirely sure > > where things are at currently (and there is no way I can ever hope to > > process the backlog). > > > > Anybody know where we are with that? > > It's still very much on my RADAR, but with layoffs and such, my > priorities have unfortunately changed. I'm hoping to start helping out > in the near future though (in a month or two). > > Josh was working on it, but I think he got pulled off onto other > priorities too :-p Yeah, this is still a priority for me and I hope to get back to it over the next few weeks (crosses fingers). -- Josh
Re: [POC 0/7] livepatch: Make livepatch states, callbacks, and shadow variables work together
On Fri, Nov 10, 2023 at 06:04:21PM +0100, Petr Mladek wrote: > This POC is a material for the discussion "Simplify Livepatch Callbacks, > Shadow Variables, and States handling" at LPC 2013, see > https://lpc.events/event/17/contributions/1541/ > > It obsoletes the patchset adding the garbage collection of shadow > variables. This new solution is based on ideas from Nicolai Stange. > And it should also be in sync with Josh's ideas mentioned into > the thread about the garbage collection, see > https://lore.kernel.org/r/20230204235910.4j4ame5ntqogqi7m@treble Nice! I like how it brings the "features" together and makes them easy to use. This looks like a vast improvement. Was there a reason to change the naming? I'm thinking setup / enable / disable / release is less precise than pre_patch / post_patch / pre_unpatch / post_unpatch. Also, I'm thinking "replaced" instead of "obsolete" would be more consistent with the existing terminology. For example, in __klp_enable_patch(): ret = klp_setup_states(patch); if (ret) goto err; if (patch->replace) klp_disable_obsolete_states(patch); it's not immediately clear why "disable obsolete" would be the "replace" counterpart to "setup". Similarly, in klp_complete_transition(): if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) { klp_unpatch_replaced_patches(klp_transition_patch); klp_discard_nops(klp_transition_patch); klp_release_obsolete_states(klp_transition_patch); } it's a little jarring to have "unpatch replaced" followed by "release obsolete". So instead of: int klp_setup_states(struct klp_patch *patch); void klp_enable_states(struct klp_patch *patch); void klp_disable_states(struct klp_patch *patch); void klp_release_states(struct klp_patch *patch); void klp_enable_obsolete_states(struct klp_patch *patch); void klp_disable_obsolete_states(struct klp_patch *patch); void klp_release_obsolete_states(struct klp_patch *patch); how about something like: int klp_states_pre_patch(void); void klp_states_post_patch(void); void klp_states_pre_unpatch(void); void klp_states_post_unpatch(void); void klp_states_post_patch_replaced(void); void klp_states_pre_unpatch_replaced(void); void klp_states_post_unpatch_replaced(void); ? (note that passing klp_transition_patch might be optional since state.c already has visibility to it anyway) -- Josh
Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
On Thu, Nov 09, 2023 at 02:50:48PM -0800, Ankur Arora wrote: > >> I guess I'm not fully understanding what the cond rescheds are for. But > >> would an IPI to all CPUs setting NEED_RESCHED, fix it? > > Yeah. We could just temporarily toggle to full preemption, when > NEED_RESCHED_LAZY is always upgraded to NEED_RESCHED which will > then send IPIs. > > > If all livepatch arches had the ORC unwinder, yes. > > > > The problem is that frame pointer (and similar) unwinders can't reliably > > unwind past an interrupt frame. > > Ah, I wonder if we could just disable the preempt_schedule_irq() path > temporarily? Hooking into schedule() alongside something like this: > > @@ -379,7 +379,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs > *regs) > > void irqentry_exit_cond_resched(void) > { > - if (!preempt_count()) { > + if (klp_cond_resched_disable() && !preempt_count()) { > > The problem would be tasks that don't go through any preemptible > sections. Let me back up a bit and explain what klp is trying to do. When a livepatch is applied, klp needs to unwind all the tasks, preferably within a reasonable amount of time. We can't unwind task A from task B while task A is running, since task A could be changing the stack during the unwind. So task A needs to be blocked or asleep. The only exception to that is if the unwind happens in the context of task A itself. The problem we were seeing was CPU-bound kthreads (e.g., vhost_worker) not getting patched within a reasonable amount of time. We fixed it by hooking the klp unwind into cond_resched() so it can unwind from the task itself. It only worked because we had a non-preempted hook (because non-ORC unwinders can't unwind reliably through preemption) which called klp to unwind from the context of the task. Without something to hook into, we have a problem. We could of course hook into schedule(), but if the kthread never calls schedule() from a non-preempted context then it still doesn't help. -- Josh
Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
On Thu, Nov 09, 2023 at 12:31:47PM -0500, Steven Rostedt wrote: > On Thu, 9 Nov 2023 09:26:37 -0800 > Josh Poimboeuf wrote: > > > On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote: > > > On Tue, 7 Nov 2023 13:56:53 -0800 > > > Ankur Arora wrote: > > > > > > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8. > > > > > > > > Note that removing this commit reintroduces "live patches failing to > > > > complete within a reasonable amount of time due to CPU-bound kthreads." > > > > > > > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and > > > > existence of cond_resched() so this will need an alternate fix. > > > > We definitely don't want to introduce a regression, something will need > > to be figured out before removing cond_resched(). > > > > We could hook into preempt_schedule_irq(), but that wouldn't work for > > non-ORC. > > > > Another option would be to hook into schedule(). Then livepatch could > > set TIF_NEED_RESCHED on remaining unpatched tasks. But again if they go > > through the preemption path then we have the same problem for non-ORC. > > > > Worst case we'll need to sprinkle cond_livepatch() everywhere :-/ > > > > I guess I'm not fully understanding what the cond rescheds are for. But > would an IPI to all CPUs setting NEED_RESCHED, fix it? If all livepatch arches had the ORC unwinder, yes. The problem is that frame pointer (and similar) unwinders can't reliably unwind past an interrupt frame. -- Josh
Re: [RFC PATCH 07/86] Revert "livepatch,sched: Add livepatch task switching to cond_resched()"
On Tue, Nov 07, 2023 at 06:16:09PM -0500, Steven Rostedt wrote: > On Tue, 7 Nov 2023 13:56:53 -0800 > Ankur Arora wrote: > > > This reverts commit e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8. > > > > Note that removing this commit reintroduces "live patches failing to > > complete within a reasonable amount of time due to CPU-bound kthreads." > > > > Unfortunately this fix depends quite critically on PREEMPT_DYNAMIC and > > existence of cond_resched() so this will need an alternate fix. We definitely don't want to introduce a regression, something will need to be figured out before removing cond_resched(). We could hook into preempt_schedule_irq(), but that wouldn't work for non-ORC. Another option would be to hook into schedule(). Then livepatch could set TIF_NEED_RESCHED on remaining unpatched tasks. But again if they go through the preemption path then we have the same problem for non-ORC. Worst case we'll need to sprinkle cond_livepatch() everywhere :-/ -- Josh
Re: [PATCH 01/15] objtool: Find a destination for jumps beyond the section end
On Tue, Apr 20, 2021 at 01:25:43PM -0700, Sami Tolvanen wrote: > On Tue, Apr 20, 2021 at 11:14 AM Josh Poimboeuf wrote: > > > > On Fri, Apr 16, 2021 at 01:38:30PM -0700, Sami Tolvanen wrote: > > > With -ffunction-sections, Clang can generate a jump beyond the end of > > > a section when the section ends in an unreachable instruction. > > > > Why? Can you show an example? > > Here's the warning I'm seeing when building allyesconfig + CFI: > > vmlinux.o: warning: objtool: > rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c()+0x149: > can't find jump dest instruction at > .text.rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c+0x7dc > > $ objdump -d -r -j > .text.rockchip_spi_transfer_one.f088382d97b74759d70e27e891fe8f1c > vmlinux.o > : > ... > 149: 0f 85 8d 06 00 00 jne7dc <.compoundliteral.4> > ... > 7d7: e8 00 00 00 00 callq 7dc <.compoundliteral.4> > 7d8: R_X86_64_PLT32 __stack_chk_fail-0x4 Instead of silencing the warning by faking the jump destination, I'd rather improve the warning to something like "warning: rockchip_spi_transfer_one() falls through to the next function" which is what we normally do in this type of situation. It may be caused by UB, or a compiler bug, but either way we should figure out the root cause. -- Josh
Re: [PATCH 02/15] objtool: Add CONFIG_CFI_CLANG support
On Fri, Apr 16, 2021 at 01:38:31PM -0700, Sami Tolvanen wrote: > +static int fix_cfi_relocs(const struct elf *elf) > +{ > + struct section *sec, *tmpsec; > + struct reloc *reloc, *tmpreloc; > + > + list_for_each_entry_safe(sec, tmpsec, >sections, list) { > + list_for_each_entry_safe(reloc, tmpreloc, >reloc_list, > list) { These loops don't remove structs from the lists, so are the "safe" iterators really needed? > + struct symbol *sym; > + struct reloc *func_reloc; > + > + /* > + * CONFIG_CFI_CLANG replaces function relocations to > refer > + * to an intermediate jump table. Undo the conversion so > + * objtool can make sense of things. > + */ I think this comment could be clearer if it were placed above the function. > + if (!reloc->sym->sec->cfi_jt) > + continue; > + > + if (reloc->sym->type == STT_SECTION) > + sym = find_func_by_offset(reloc->sym->sec, > + reloc->addend); > + else > + sym = reloc->sym; > + > + if (!sym || !sym->sec) > + continue; This should be a fatal error, it probably means something's gone wrong with the reading of the jump table. > + > + /* > + * The jump table immediately jumps to the actual > function, > + * so look up the relocation there. > + */ > + func_reloc = find_reloc_by_dest_range(elf, sym->sec, > sym->offset, 5); The jump instruction's reloc is always at offset 1, so this can maybe be optimized to: func_reloc = find_reloc_by_dest(elf, sym->sec, sym->offset+1); though of course that will probably break (future) arm64 objtool. Maybe an arch-specific offset from the start of the insn would be good. > + if (!func_reloc || !func_reloc->sym) > + continue; This should also be an error. > + > + reloc->sym = func_reloc->sym; I think we should also do 'reloc->addend = 0', because of the STT_SECTION case: 0025 259e000b R_X86_64_32S .text..L.cfi.jumptable.8047 + 8 which then translates to 0009 06320004 R_X86_64_PLT32 x2apic_prepare_cpu - 4 so the original addend of '8' is no longer valid. Copying the '-4' wouldn't be right either, because that only applies to a PLT32 text reloc. A '0' should be appropriate for the 32S data reloc. This addend might not actually be read by any code, so it may not currently be an actual bug, but better safe than sorry. -- Josh
Re: [PATCH 01/15] objtool: Find a destination for jumps beyond the section end
On Fri, Apr 16, 2021 at 01:38:30PM -0700, Sami Tolvanen wrote: > With -ffunction-sections, Clang can generate a jump beyond the end of > a section when the section ends in an unreachable instruction. Why? Can you show an example? -- Josh
[tip: objtool/core] x86/crypto/aesni-intel_avx: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: e163be86fff3deec70f63330fc43fedf892c9aee Gitweb: https://git.kernel.org/tip/e163be86fff3deec70f63330fc43fedf892c9aee Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:17 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00 x86/crypto/aesni-intel_avx: Standardize stack alignment prologue Use RBP instead of R14 for saving the old stack pointer before realignment. This resembles what compilers normally do. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/02d00a0903a0959f4787e186e2a07d271e1f63d4.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 188f184..98e3552 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -251,22 +251,20 @@ VARIABLE_OFFSET = 16*8 .macro FUNC_SAVE push%r12 push%r13 -push%r14 push%r15 -mov %rsp, %r14 - - + push%rbp + mov %rsp, %rbp sub $VARIABLE_OFFSET, %rsp and $~63, %rsp# align rsp to 64 bytes .endm .macro FUNC_RESTORE -mov %r14, %rsp +mov %rbp, %rsp + pop %rbp pop %r15 -pop %r14 pop %r13 pop %r12 .endm
[tip: objtool/core] x86/crypto/aesni-intel_avx: Remove unused macros
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 4f08300916e882a0c34a2f325ff3fea2be2e57b3 Gitweb: https://git.kernel.org/tip/4f08300916e882a0c34a2f325ff3fea2be2e57b3 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:15 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:33 -05:00 x86/crypto/aesni-intel_avx: Remove unused macros These macros are no longer used; remove them. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/53f7136ea93ebdbca399959e6d2991ecb46e733e.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 8 1 file changed, 8 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 2cf8e94..4fdf38e 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -212,10 +212,6 @@ HashKey_8_k= 16*21 # store XOR of HashKey^8 <<1 mod poly here (for Karatsu #define arg4 %rcx #define arg5 %r8 #define arg6 %r9 -#define arg7 STACK_OFFSET+8*1(%r14) -#define arg8 STACK_OFFSET+8*2(%r14) -#define arg9 STACK_OFFSET+8*3(%r14) -#define arg10 STACK_OFFSET+8*4(%r14) #define keysize 2*15*16(arg1) i = 0 @@ -237,9 +233,6 @@ define_reg j %j .noaltmacro .endm -# need to push 4 registers into stack to maintain -STACK_OFFSET = 8*4 - TMP1 = 16*0# Temporary storage for AAD TMP2 = 16*1# Temporary storage for AES State 2 (State 1 is stored in an XMM register) TMP3 = 16*2# Temporary storage for AES State 3 @@ -256,7 +249,6 @@ VARIABLE_OFFSET = 16*8 .macro FUNC_SAVE -#the number of pushes must equal STACK_OFFSET push%r12 push%r13 push%r14
[tip: objtool/core] x86/crypto/aesni-intel_avx: Fix register usage comments
The following commit has been merged into the objtool/core branch of tip: Commit-ID: ff5796b6dbea4763fdca002101e32b60aa17f8e8 Gitweb: https://git.kernel.org/tip/ff5796b6dbea4763fdca002101e32b60aa17f8e8 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:16 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:33 -05:00 x86/crypto/aesni-intel_avx: Fix register usage comments Fix register usage comments to match reality. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/8655d4513a0ed1eddec609165064153973010aa2.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/aesni-intel_avx-x86_64.S | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S index 4fdf38e..188f184 100644 --- a/arch/x86/crypto/aesni-intel_avx-x86_64.S +++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S @@ -286,7 +286,7 @@ VARIABLE_OFFSET = 16*8 # combined for GCM encrypt and decrypt functions # clobbering all xmm registers -# clobbering r10, r11, r12, r13, r14, r15 +# clobbering r10, r11, r12, r13, r15, rax .macro GCM_ENC_DEC INITIAL_BLOCKS GHASH_8_ENCRYPT_8_PARALLEL GHASH_LAST_8 GHASH_MUL ENC_DEC REP vmovdqu AadHash(arg2), %xmm8 vmovdqu HashKey(arg2), %xmm13 # xmm13 = HashKey @@ -988,7 +988,7 @@ _partial_block_done_\@: ## num_initial_blocks = b mod 4# ## encrypt the initial num_initial_blocks blocks and apply ghash on the ciphertext ## r10, r11, r12, rax are clobbered -## arg1, arg3, arg4, r14 are used as a pointer only, not modified +## arg1, arg2, arg3, arg4 are used as pointers only, not modified .macro INITIAL_BLOCKS_AVX REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC i = (8-\num_initial_blocks) @@ -1223,7 +1223,7 @@ _initial_blocks_done\@: # encrypt 8 blocks at a time # ghash the 8 previously encrypted ciphertext blocks -# arg1, arg3, arg4 are used as pointers only, not modified +# arg1, arg2, arg3, arg4 are used as pointers only, not modified # r11 is the data offset value .macro GHASH_8_ENCRYPT_8_PARALLEL_AVX REP T1 T2 T3 T4 T5 T6 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T7 loop_idx ENC_DEC @@ -1936,7 +1936,7 @@ SYM_FUNC_END(aesni_gcm_finalize_avx_gen2) ## num_initial_blocks = b mod 4# ## encrypt the initial num_initial_blocks blocks and apply ghash on the ciphertext ## r10, r11, r12, rax are clobbered -## arg1, arg3, arg4, r14 are used as a pointer only, not modified +## arg1, arg2, arg3, arg4 are used as pointers only, not modified .macro INITIAL_BLOCKS_AVX2 REP num_initial_blocks T1 T2 T3 T4 T5 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T6 T_key ENC_DEC VER i = (8-\num_initial_blocks) @@ -2178,7 +2178,7 @@ _initial_blocks_done\@: # encrypt 8 blocks at a time # ghash the 8 previously encrypted ciphertext blocks -# arg1, arg3, arg4 are used as pointers only, not modified +# arg1, arg2, arg3, arg4 are used as pointers only, not modified # r11 is the data offset value .macro GHASH_8_ENCRYPT_8_PARALLEL_AVX2 REP T1 T2 T3 T4 T5 T6 CTR XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 T7 loop_idx ENC_DEC
[tip: objtool/core] objtool: Support asm jump tables
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 99033461e685b48549ec77608b4bda75ddf772ce Gitweb: https://git.kernel.org/tip/99033461e685b48549ec77608b4bda75ddf772ce Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:14 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:32 -05:00 objtool: Support asm jump tables Objtool detection of asm jump tables would normally just work, except for the fact that asm retpolines use alternatives. Objtool thinks the alternative code path (a jump to the retpoline) is a sibling call. Don't treat alternative indirect branches as sibling calls when the original instruction has a jump table. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/460cf4dc675d64e1124146562cabd2c05aa322e8.1614182415.git.jpoim...@redhat.com --- tools/objtool/check.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index a0f762a..46621e8 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -108,6 +108,18 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file, for (insn = next_insn_same_sec(file, insn); insn; \ insn = next_insn_same_sec(file, insn)) +static bool is_jump_table_jump(struct instruction *insn) +{ + struct alt_group *alt_group = insn->alt_group; + + if (insn->jump_table) + return true; + + /* Retpoline alternative for a jump table? */ + return alt_group && alt_group->orig_group && + alt_group->orig_group->first_insn->jump_table; +} + static bool is_sibling_call(struct instruction *insn) { /* @@ -120,7 +132,7 @@ static bool is_sibling_call(struct instruction *insn) /* An indirect jump is either a sibling call or a jump to a table. */ if (insn->type == INSN_JUMP_DYNAMIC) - return list_empty(>alts); + return !is_jump_table_jump(insn); /* add_jump_destinations() sets insn->call_dest for sibling calls. */ return (is_static_jump(insn) && insn->call_dest);
[tip: objtool/core] x86/crypto/sha512-avx2: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: ec063e090bd6487097d459bb4272508b78448270 Gitweb: https://git.kernel.org/tip/ec063e090bd6487097d459bb4272508b78448270 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:24 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00 x86/crypto/sha512-avx2: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/b1a7b29fcfc65d60a3b6e77ef75f4762a5b8488d.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha512-avx2-asm.S | 42 ++ 1 file changed, 20 insertions(+), 22 deletions(-) diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S index 3a44bdc..072cb0f 100644 --- a/arch/x86/crypto/sha512-avx2-asm.S +++ b/arch/x86/crypto/sha512-avx2-asm.S @@ -102,17 +102,13 @@ SRND_SIZE = 1*8 INP_SIZE = 1*8 INPEND_SIZE = 1*8 CTX_SIZE = 1*8 -RSPSAVE_SIZE = 1*8 -GPRSAVE_SIZE = 5*8 frame_XFER = 0 frame_SRND = frame_XFER + XFER_SIZE frame_INP = frame_SRND + SRND_SIZE frame_INPEND = frame_INP + INP_SIZE frame_CTX = frame_INPEND + INPEND_SIZE -frame_RSPSAVE = frame_CTX + CTX_SIZE -frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE -frame_size = frame_GPRSAVE + GPRSAVE_SIZE +frame_size = frame_CTX + CTX_SIZE ## assume buffers not aligned #defineVMOVDQ vmovdqu @@ -570,18 +566,18 @@ frame_size = frame_GPRSAVE + GPRSAVE_SIZE # "blocks" is the message length in SHA512 blocks SYM_FUNC_START(sha512_transform_rorx) + # Save GPRs + push%rbx + push%r12 + push%r13 + push%r14 + push%r15 + # Allocate Stack Space - mov %rsp, %rax + push%rbp + mov %rsp, %rbp sub $frame_size, %rsp and $~(0x20 - 1), %rsp - mov %rax, frame_RSPSAVE(%rsp) - - # Save GPRs - mov %rbx, 8*0+frame_GPRSAVE(%rsp) - mov %r12, 8*1+frame_GPRSAVE(%rsp) - mov %r13, 8*2+frame_GPRSAVE(%rsp) - mov %r14, 8*3+frame_GPRSAVE(%rsp) - mov %r15, 8*4+frame_GPRSAVE(%rsp) shl $7, NUM_BLKS# convert to bytes jz done_hash @@ -672,15 +668,17 @@ loop2: done_hash: -# Restore GPRs - mov 8*0+frame_GPRSAVE(%rsp), %rbx - mov 8*1+frame_GPRSAVE(%rsp), %r12 - mov 8*2+frame_GPRSAVE(%rsp), %r13 - mov 8*3+frame_GPRSAVE(%rsp), %r14 - mov 8*4+frame_GPRSAVE(%rsp), %r15 - # Restore Stack Pointer - mov frame_RSPSAVE(%rsp), %rsp + mov %rbp, %rsp + pop %rbp + + # Restore GPRs + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbx + ret SYM_FUNC_END(sha512_transform_rorx)
[tip: objtool/core] x86/crypto/sha_ni: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 35a0067d2c02a7c35466db5f207b7b9265de84d9 Gitweb: https://git.kernel.org/tip/35a0067d2c02a7c35466db5f207b7b9265de84d9 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:20 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:35 -05:00 x86/crypto/sha_ni: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/5033e1a79867dff1b18e1b4d0783c38897d3f223.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha1_ni_asm.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/crypto/sha1_ni_asm.S b/arch/x86/crypto/sha1_ni_asm.S index 11efe3a..5d8415f 100644 --- a/arch/x86/crypto/sha1_ni_asm.S +++ b/arch/x86/crypto/sha1_ni_asm.S @@ -59,8 +59,6 @@ #define DATA_PTR %rsi/* 2nd arg */ #define NUM_BLKS %rdx/* 3rd arg */ -#define RSPSAVE%rax - /* gcc conversion */ #define FRAME_SIZE 32 /* space for 2x16 bytes */ @@ -96,7 +94,8 @@ .text .align 32 SYM_FUNC_START(sha1_ni_transform) - mov %rsp, RSPSAVE + push%rbp + mov %rsp, %rbp sub $FRAME_SIZE, %rsp and $~0xF, %rsp @@ -288,7 +287,8 @@ SYM_FUNC_START(sha1_ni_transform) pextrd $3, E0, 1*16(DIGEST_PTR) .Ldone_hash: - mov RSPSAVE, %rsp + mov %rbp, %rsp + pop %rbp ret SYM_FUNC_END(sha1_ni_transform)
[tip: objtool/core] x86/crypto/crc32c-pcl-intel: Standardize jump table
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 2b02ed55482a1c5c310a7f53707292fcf1601e7a Gitweb: https://git.kernel.org/tip/2b02ed55482a1c5c310a7f53707292fcf1601e7a Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:19 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00 x86/crypto/crc32c-pcl-intel: Standardize jump table Simplify the jump table code so that it resembles a compiler-generated table. This enables ORC unwinding by allowing objtool to follow all the potential code paths. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/5357a039def90b8ef6b5874ef12cda008ecf18ba.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S index 884dc76..ac1f303 100644 --- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S +++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S @@ -53,7 +53,7 @@ .endm .macro JMPTBL_ENTRY i -.word crc_\i - crc_array +.quad crc_\i .endm .macro JNC_LESS_THAN j @@ -168,10 +168,7 @@ continue_block: xor crc2, crc2 ## branch into array - lea jump_table(%rip), %bufp - movzwq (%bufp, %rax, 2), len - lea crc_array(%rip), %bufp - lea (%bufp, len, 1), %bufp + mov jump_table(,%rax,8), %bufp JMP_NOSPEC bufp
[tip: objtool/core] x86/crypto/camellia-aesni-avx2: Unconditionally allocate stack buffer
The following commit has been merged into the objtool/core branch of tip: Commit-ID: dabe5167a3cbb4bf16b20c0e5b6497513e2e3a08 Gitweb: https://git.kernel.org/tip/dabe5167a3cbb4bf16b20c0e5b6497513e2e3a08 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:18 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:34 -05:00 x86/crypto/camellia-aesni-avx2: Unconditionally allocate stack buffer A conditional stack allocation violates traditional unwinding requirements when a single instruction can have differing stack layouts. There's no benefit in allocating the stack buffer conditionally. Just do it unconditionally. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/85ac96613ee5784b6239c18d3f68b1f3c509caa3.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S index 782e971..706f708 100644 --- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S +++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S @@ -990,6 +990,7 @@ SYM_FUNC_START(camellia_cbc_dec_32way) * %rdx: src (32 blocks) */ FRAME_BEGIN + subq $(16 * 32), %rsp; vzeroupper; @@ -1002,7 +1003,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way) %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, %rdx, (key_table)(CTX, %r8, 8)); - movq %rsp, %r10; cmpq %rsi, %rdx; je .Lcbc_dec_use_stack; @@ -1015,7 +1015,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way) * dst still in-use (because dst == src), so use stack for temporary * storage. */ - subq $(16 * 32), %rsp; movq %rsp, %rax; .Lcbc_dec_continue: @@ -1025,7 +1024,6 @@ SYM_FUNC_START(camellia_cbc_dec_32way) vpxor %ymm7, %ymm7, %ymm7; vinserti128 $1, (%rdx), %ymm7, %ymm7; vpxor (%rax), %ymm7, %ymm7; - movq %r10, %rsp; vpxor (0 * 32 + 16)(%rdx), %ymm6, %ymm6; vpxor (1 * 32 + 16)(%rdx), %ymm5, %ymm5; vpxor (2 * 32 + 16)(%rdx), %ymm4, %ymm4; @@ -1047,6 +1045,7 @@ SYM_FUNC_START(camellia_cbc_dec_32way) vzeroupper; + addq $(16 * 32), %rsp; FRAME_END ret; SYM_FUNC_END(camellia_cbc_dec_32way)
[tip: objtool/core] x86/crypto/sha512-ssse3: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 27d26793f2105281d9374928448142777cef6f74 Gitweb: https://git.kernel.org/tip/27d26793f2105281d9374928448142777cef6f74 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:25 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:37 -05:00 x86/crypto/sha512-ssse3: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/6ecaaac9f3828fbb903513bf90c34a08380a8e35.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha512-ssse3-asm.S | 41 + 1 file changed, 19 insertions(+), 22 deletions(-) diff --git a/arch/x86/crypto/sha512-ssse3-asm.S b/arch/x86/crypto/sha512-ssse3-asm.S index 50812af..bd51c90 100644 --- a/arch/x86/crypto/sha512-ssse3-asm.S +++ b/arch/x86/crypto/sha512-ssse3-asm.S @@ -74,14 +74,10 @@ tmp0 = %rax W_SIZE = 80*8 WK_SIZE = 2*8 -RSPSAVE_SIZE = 1*8 -GPRSAVE_SIZE = 5*8 frame_W = 0 frame_WK = frame_W + W_SIZE -frame_RSPSAVE = frame_WK + WK_SIZE -frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE -frame_size = frame_GPRSAVE + GPRSAVE_SIZE +frame_size = frame_WK + WK_SIZE # Useful QWORD "arrays" for simpler memory references # MSG, DIGEST, K_t, W_t are arrays @@ -283,18 +279,18 @@ SYM_FUNC_START(sha512_transform_ssse3) test msglen, msglen je nowork + # Save GPRs + push%rbx + push%r12 + push%r13 + push%r14 + push%r15 + # Allocate Stack Space - mov %rsp, %rax + push%rbp + mov %rsp, %rbp sub $frame_size, %rsp and $~(0x20 - 1), %rsp - mov %rax, frame_RSPSAVE(%rsp) - - # Save GPRs - mov %rbx, frame_GPRSAVE(%rsp) - mov %r12, frame_GPRSAVE +8*1(%rsp) - mov %r13, frame_GPRSAVE +8*2(%rsp) - mov %r14, frame_GPRSAVE +8*3(%rsp) - mov %r15, frame_GPRSAVE +8*4(%rsp) updateblock: @@ -355,15 +351,16 @@ updateblock: dec msglen jnz updateblock - # Restore GPRs - mov frame_GPRSAVE(%rsp), %rbx - mov frame_GPRSAVE +8*1(%rsp), %r12 - mov frame_GPRSAVE +8*2(%rsp), %r13 - mov frame_GPRSAVE +8*3(%rsp), %r14 - mov frame_GPRSAVE +8*4(%rsp), %r15 - # Restore Stack Pointer - mov frame_RSPSAVE(%rsp), %rsp + mov %rbp, %rsp + pop %rbp + + # Restore GPRs + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbx nowork: ret
[tip: objtool/core] x86/crypto/sha256-avx2: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: ce5846668076aa76a17ab559f0296374e3611fec Gitweb: https://git.kernel.org/tip/ce5846668076aa76a17ab559f0296374e3611fec Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:22 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00 x86/crypto/sha256-avx2: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/8048e7444c49a8137f05265262b83dc50f8fb7f3.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha256-avx2-asm.S | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S index 11ff60c..4087f74 100644 --- a/arch/x86/crypto/sha256-avx2-asm.S +++ b/arch/x86/crypto/sha256-avx2-asm.S @@ -117,15 +117,13 @@ _XMM_SAVE_SIZE= 0 _INP_END_SIZE = 8 _INP_SIZE = 8 _CTX_SIZE = 8 -_RSP_SIZE = 8 _XFER = 0 _XMM_SAVE = _XFER + _XFER_SIZE _INP_END = _XMM_SAVE + _XMM_SAVE_SIZE _INP = _INP_END + _INP_END_SIZE _CTX = _INP + _INP_SIZE -_RSP = _CTX + _CTX_SIZE -STACK_SIZE = _RSP + _RSP_SIZE +STACK_SIZE = _CTX + _CTX_SIZE # rotate_Xs # Rotate values of symbols X0...X3 @@ -533,11 +531,11 @@ SYM_FUNC_START(sha256_transform_rorx) pushq %r14 pushq %r15 - mov %rsp, %rax + push%rbp + mov %rsp, %rbp + subq$STACK_SIZE, %rsp and $-32, %rsp # align rsp to 32 byte boundary - mov %rax, _RSP(%rsp) - shl $6, NUM_BLKS# convert to bytes jz done_hash @@ -704,7 +702,8 @@ only_one_block: done_hash: - mov _RSP(%rsp), %rsp + mov %rbp, %rsp + pop %rbp popq%r15 popq%r14
[tip: objtool/core] x86/crypto/sha1_avx2: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 20114c899cafa8313534a841cab0ab1f7ab09672 Gitweb: https://git.kernel.org/tip/20114c899cafa8313534a841cab0ab1f7ab09672 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:21 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:35 -05:00 x86/crypto/sha1_avx2: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/fdaaf8670ed1f52f55ba9a6bbac98c1afddc1af6.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha1_avx2_x86_64_asm.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S b/arch/x86/crypto/sha1_avx2_x86_64_asm.S index 1e594d6..5eed620 100644 --- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S +++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S @@ -645,9 +645,9 @@ _loop3: RESERVE_STACK = (W_SIZE*4 + 8+24) /* Align stack */ - mov %rsp, %rbx + push%rbp + mov %rsp, %rbp and $~(0x20-1), %rsp - push%rbx sub $RESERVE_STACK, %rsp avx2_zeroupper @@ -665,8 +665,8 @@ _loop3: avx2_zeroupper - add $RESERVE_STACK, %rsp - pop %rsp + mov %rbp, %rsp + pop %rbp pop %r15 pop %r14
[tip: objtool/core] x86/crypto: Enable objtool in crypto code
The following commit has been merged into the objtool/core branch of tip: Commit-ID: 7d3d10e0e85fb7c23a86a70f795b1eabd2bc030b Gitweb: https://git.kernel.org/tip/7d3d10e0e85fb7c23a86a70f795b1eabd2bc030b Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:26 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:37 -05:00 x86/crypto: Enable objtool in crypto code Now that all the stack alignment prologues have been cleaned up in the crypto code, enable objtool. Among other benefits, this will allow ORC unwinding to work. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/fc2a1918c50e33e46ef0e9a5de02743f2f6e3639.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/Makefile | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index b28e36b..d0959e7 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -2,8 +2,6 @@ # # x86 crypto algorithms -OBJECT_FILES_NON_STANDARD := y - obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
[tip: objtool/core] x86/crypto/sha512-avx: Standardize stack alignment prologue
The following commit has been merged into the objtool/core branch of tip: Commit-ID: d61684b56edf369f0a6d388088d7c9d59f1618d4 Gitweb: https://git.kernel.org/tip/d61684b56edf369f0a6d388088d7c9d59f1618d4 Author:Josh Poimboeuf AuthorDate:Wed, 24 Feb 2021 10:29:23 -06:00 Committer: Josh Poimboeuf CommitterDate: Mon, 19 Apr 2021 12:36:36 -05:00 x86/crypto/sha512-avx: Standardize stack alignment prologue Use a more standard prologue for saving the stack pointer before realigning the stack. This enables ORC unwinding by allowing objtool to understand the stack realignment. Signed-off-by: Josh Poimboeuf Tested-by: Ard Biesheuvel Acked-by: Ard Biesheuvel Tested-by: Sami Tolvanen Acked-by: Peter Zijlstra (Intel) Acked-by: Herbert Xu Link: https://lore.kernel.org/r/d36e9ea1c819d87fa89b3df3fa83e2a1ede18146.1614182415.git.jpoim...@redhat.com --- arch/x86/crypto/sha512-avx-asm.S | 41 ++- 1 file changed, 19 insertions(+), 22 deletions(-) diff --git a/arch/x86/crypto/sha512-avx-asm.S b/arch/x86/crypto/sha512-avx-asm.S index 684d58c..3d8f0fd 100644 --- a/arch/x86/crypto/sha512-avx-asm.S +++ b/arch/x86/crypto/sha512-avx-asm.S @@ -76,14 +76,10 @@ tmp0= %rax W_SIZE = 80*8 # W[t] + K[t] | W[t+1] + K[t+1] WK_SIZE = 2*8 -RSPSAVE_SIZE = 1*8 -GPRSAVE_SIZE = 5*8 frame_W = 0 frame_WK = frame_W + W_SIZE -frame_RSPSAVE = frame_WK + WK_SIZE -frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE -frame_size = frame_GPRSAVE + GPRSAVE_SIZE +frame_size = frame_WK + WK_SIZE # Useful QWORD "arrays" for simpler memory references # MSG, DIGEST, K_t, W_t are arrays @@ -281,18 +277,18 @@ SYM_FUNC_START(sha512_transform_avx) test msglen, msglen je nowork + # Save GPRs + push%rbx + push%r12 + push%r13 + push%r14 + push%r15 + # Allocate Stack Space - mov %rsp, %rax + push%rbp + mov %rsp, %rbp sub $frame_size, %rsp and $~(0x20 - 1), %rsp - mov %rax, frame_RSPSAVE(%rsp) - - # Save GPRs - mov %rbx, frame_GPRSAVE(%rsp) - mov %r12, frame_GPRSAVE +8*1(%rsp) - mov %r13, frame_GPRSAVE +8*2(%rsp) - mov %r14, frame_GPRSAVE +8*3(%rsp) - mov %r15, frame_GPRSAVE +8*4(%rsp) updateblock: @@ -353,15 +349,16 @@ updateblock: dec msglen jnz updateblock - # Restore GPRs - mov frame_GPRSAVE(%rsp), %rbx - mov frame_GPRSAVE +8*1(%rsp), %r12 - mov frame_GPRSAVE +8*2(%rsp), %r13 - mov frame_GPRSAVE +8*3(%rsp), %r14 - mov frame_GPRSAVE +8*4(%rsp), %r15 - # Restore Stack Pointer - mov frame_RSPSAVE(%rsp), %rsp + mov %rbp, %rsp + pop %rbp + + # Restore GPRs + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbx nowork: ret
Re: [PATCH 0/9] sched: Core scheduling interfaces
On Mon, Apr 19, 2021 at 2:01 AM Peter Zijlstra wrote: > > Josh, you being on the other Google team, the one that actually uses the > cgroup interface AFAIU, can you fight the good fight with TJ on this? A bit of extra context is in https://lore.kernel.org/lkml/cabk29nttscu2ho7v9di+fh2gv8zu5xic5inrwpfclhpd+dk...@mail.gmail.com. On the management/auditing side, the cgroup interface gives a clear indication of which tasks share a cookie. It is a bit less attractive to add a prctl interface for enumerating this. Also on the management side, I illustrated in the above message how a thread would potentially group together other threads. One limitation of the current prctl interface is that the share_{to, from} always operates on the current thread. Granted we can work around this as described, and also potentially extend the prctl interface to operate on two tasks. So I agree that the cgroup interface here isn't strictly necessary, though it seems convenient. I will double-check with internal teams that would be using the interface to see if there are any other considerations I'm missing. On Mon, Apr 19, 2021 at 4:30 AM Tejun Heo wrote: > > My suggestion is going ahead with the per-process interface with cgroup > extension on mind in case actual use cases arise. Also, when planning cgroup > integration, putting dynamic migration front and center likely isn't a good > idea. tasks would not be frequently moved around; I'd expect security configuration to remain mostly static. Or maybe I'm misunderstanding your emphasis here? If you feel the above is not strong enough (ie. there should be a use case not feasible with prctl), I'd support that we move forward with the prctl stuff for now, since the cgroup interface is independant. Thanks, Josh
Re: [PATCH] sched/debug: Rename the sched_debug parameter to sched_debug_verbose
Hi Peter, Looks reasonable to me. > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -4756,7 +4756,8 @@ > > sbni= [NET] Granch SBNI12 leased line adapter > > - sched_debug [KNL] Enables verbose scheduler debug messages. > + sched_debug_verbose > + [KNL] Enables verbose scheduler debug messages. boot param is not renamed from sched_debug below. > @@ -22,7 +22,7 @@ early_param("sched_debug", sched_debug_s > > static inline bool sched_debug(void) nit: consider renaming. Or, we can even get rid of this function entirely, since in the !CONFIG_SCHED_DEBUG case we have sched_debug_verbose defined to 0.
Re: [PATCH v2] sched: Warn on long periods of pending need_resched
On Fri, Apr 16, 2021 at 8:05 AM Peter Zijlstra wrote: > > On Tue, Mar 30, 2021 at 03:44:12PM -0700, Josh Don wrote: > > Peter, > > > > Since you've already pulled the need_resched warning patch into your > > tree, I'm including just the diff based on that patch (in response to > > Mel's comments) below. This should be squashed into the original > > patch. > > > > Sorry, I seem to have missed this. The patch is completely whitespace > mangled through. > > I've attached my latest copy of the patch and held it back for now, > please resubmit. Yikes, sorry about that. I've squashed the fixup and sent the updated patch (with trimmed cc list): https://lkml.org/lkml/2021/4/16/1125 Thanks, Josh
[PATCH v2 resubmit] sched: Warn on long periods of pending need_resched
From: Paul Turner CPU scheduler marks need_resched flag to signal a schedule() on a particular CPU. But, schedule() may not happen immediately in cases where the current task is executing in the kernel mode (no preemption state) for extended periods of time. This patch adds a warn_on if need_resched is pending for more than the time specified in sysctl resched_latency_warn_ms. If it goes off, it is likely that there is a missing cond_resched() somewhere. Monitoring is done via the tick and the accuracy is hence limited to jiffy scale. This also means that we won't trigger the warning if the tick is disabled. This feature (LATENCY_WARN) is default disabled. Signed-off-by: Paul Turner Signed-off-by: Josh Don Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210323035706.572953-1-josh...@google.com --- This squashes the fixup from https://lkml.org/lkml/2021/3/30/1309. include/linux/sched/sysctl.h | 3 ++ kernel/sched/core.c | 70 +++- kernel/sched/debug.c | 13 +++ kernel/sched/features.h | 2 ++ kernel/sched/sched.h | 10 ++ 5 files changed, 97 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 0a3f34638cf5..db2c0f34aaaf 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -48,6 +48,9 @@ extern unsigned int sysctl_numa_balancing_scan_size; #ifdef CONFIG_SCHED_DEBUG extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; + +extern int sysctl_resched_latency_warn_ms; +extern int sysctl_resched_latency_warn_once; #endif /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9565f304ac46..c07a4c17205f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -58,7 +58,17 @@ const_debug unsigned int sysctl_sched_features = #include "features.h" 0; #undef SCHED_FEAT -#endif + +/* + * Print a warning if need_resched is set for the given duration (if + * LATENCY_WARN is enabled). + * + * If sysctl_resched_latency_warn_once is set, only one warning will be shown + * per boot. + */ +__read_mostly int sysctl_resched_latency_warn_ms = 100; +__read_mostly int sysctl_resched_latency_warn_once = 1; +#endif /* CONFIG_SCHED_DEBUG */ /* * Number of tasks to iterate in a single balance run. @@ -4527,6 +4537,55 @@ unsigned long long task_sched_runtime(struct task_struct *p) return ns; } +#ifdef CONFIG_SCHED_DEBUG +static u64 cpu_resched_latency(struct rq *rq) +{ + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); + u64 resched_latency, now = rq_clock(rq); + static bool warned_once; + + if (sysctl_resched_latency_warn_once && warned_once) + return 0; + + if (!need_resched() || !latency_warn_ms) + return 0; + + if (system_state == SYSTEM_BOOTING) + return 0; + + if (!rq->last_seen_need_resched_ns) { + rq->last_seen_need_resched_ns = now; + rq->ticks_without_resched = 0; + return 0; + } + + rq->ticks_without_resched++; + resched_latency = now - rq->last_seen_need_resched_ns; + if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC) + return 0; + + warned_once = true; + + return resched_latency; +} + +static int __init setup_resched_latency_warn_ms(char *str) +{ + long val; + + if ((kstrtol(str, 0, ))) { + pr_warn("Unable to set resched_latency_warn_ms\n"); + return 1; + } + + sysctl_resched_latency_warn_ms = val; + return 1; +} +__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms); +#else +static inline u64 cpu_resched_latency(struct rq *rq) { return 0; } +#endif /* CONFIG_SCHED_DEBUG */ + /* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. @@ -4538,6 +4597,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; unsigned long thermal_pressure; + u64 resched_latency; arch_scale_freq_tick(); sched_clock_tick(); @@ -4548,10 +4608,15 @@ void scheduler_tick(void) thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); curr->sched_class->task_tick(rq, curr, 0); + if (sched_feat(LATENCY_WARN)) + resched_latency = cpu_resched_latency(rq); calc_global_load_tick(rq); rq_unlock(rq, ); + if (sched_feat(LATENCY_WARN) && resched_latency) + resched_latency_warn(cpu, resched_latency); + perf_event_task_tick(); #ifdef CONFIG_SMP @@ -5046,6 +5111,9 @@ static void __sched notrace __schedule(bool preempt)
Re: [PATCH 00/13] [RFC] Rust support
On Fri, Apr 16, 2021 at 12:27:39PM +0800, Boqun Feng wrote: > Josh, I think it's good if we can connect to the people working on Rust > memoryg model, I think the right person is Ralf Jung and the right place > is https://github.com/rust-lang/unsafe-code-guidelines, but you > cerntainly know better than me ;-) Or maybe we can use Rust-for-Linux or > linux-toolchains list to discuss. Ralf is definitely the right person to talk to. I don't think the UCG repository is the right place to start that discussion, though. For now, I'd suggest starting an email thread with Ralf and some C-and-kernel memory model folks (hi Paul!) to suss out the most important changes that would be needed. With my language team hat on, I'd *absolutely* like to see the Rust memory model support RCU-style deferred reclamation in a sound way, ideally with as little unsafe code as possible.
Re: [PATCH v2] X86: Makefile: Replace -pg with CC_FLAGS_FTRACE
Adding Steven Rostedt (ftrace maintainer). On Fri, Apr 16, 2021 at 01:39:28PM +0800, zhaoxiao wrote: > In preparation for x86 supporting ftrace built on other compiler > options, let's have the x86 Makefiles remove the $(CC_FLAGS_FTRACE) > flags, whatever these may be, rather than assuming '-pg'. > > There should be no functional change as a result of this patch. > > Signed-off-by: zhaoxiao > --- > v2: add the same change for the other Makefile in arch/x86 directory. > arch/x86/entry/vdso/Makefile | 8 > arch/x86/kernel/Makefile | 16 > arch/x86/kernel/cpu/Makefile | 4 ++-- > arch/x86/lib/Makefile| 2 +- > arch/x86/mm/Makefile | 4 ++-- > arch/x86/xen/Makefile| 6 +++--- > 6 files changed, 20 insertions(+), 20 deletions(-) > > diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile > index 05c4abc2fdfd..c5bd91bf9f93 100644 > --- a/arch/x86/entry/vdso/Makefile > +++ b/arch/x86/entry/vdso/Makefile > @@ -96,10 +96,10 @@ $(vobjs): KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO) > $(GCC_PLUGINS_CFLAGS) $( > # > # vDSO code runs in userspace and -pg doesn't help with profiling anyway. > # > -CFLAGS_REMOVE_vclock_gettime.o = -pg > -CFLAGS_REMOVE_vdso32/vclock_gettime.o = -pg > -CFLAGS_REMOVE_vgetcpu.o = -pg > -CFLAGS_REMOVE_vsgx.o = -pg > +CFLAGS_REMOVE_vclock_gettime.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_vdso32/vclock_gettime.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_vgetcpu.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_vsgx.o = $(CC_FLAGS_FTRACE) > > # > # X32 processes use x32 vDSO to access 64bit kernel data. > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile > index 2ddf08351f0b..2811fc6a17ba 100644 > --- a/arch/x86/kernel/Makefile > +++ b/arch/x86/kernel/Makefile > @@ -13,14 +13,14 @@ CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE) > > ifdef CONFIG_FUNCTION_TRACER > # Do not profile debug and lowlevel utilities > -CFLAGS_REMOVE_tsc.o = -pg > -CFLAGS_REMOVE_paravirt-spinlocks.o = -pg > -CFLAGS_REMOVE_pvclock.o = -pg > -CFLAGS_REMOVE_kvmclock.o = -pg > -CFLAGS_REMOVE_ftrace.o = -pg > -CFLAGS_REMOVE_early_printk.o = -pg > -CFLAGS_REMOVE_head64.o = -pg > -CFLAGS_REMOVE_sev-es.o = -pg > +CFLAGS_REMOVE_tsc.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_paravirt-spinlocks.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_pvclock.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_kvmclock.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_early_printk.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_head64.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_sev-es.o = $(CC_FLAGS_FTRACE) > endif > > KASAN_SANITIZE_head$(BITS).o := n > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile > index 637b499450d1..4435c6de9145 100644 > --- a/arch/x86/kernel/cpu/Makefile > +++ b/arch/x86/kernel/cpu/Makefile > @@ -5,8 +5,8 @@ > > # Don't trace early stages of a secondary CPU boot > ifdef CONFIG_FUNCTION_TRACER > -CFLAGS_REMOVE_common.o = -pg > -CFLAGS_REMOVE_perf_event.o = -pg > +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_perf_event.o = $(CC_FLAGS_FTRACE) > endif > > # If these files are instrumented, boot hangs during the first second. > diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile > index bad4dee4f0e4..0aa71b8a5bc1 100644 > --- a/arch/x86/lib/Makefile > +++ b/arch/x86/lib/Makefile > @@ -21,7 +21,7 @@ KASAN_SANITIZE_cmdline.o := n > KCSAN_SANITIZE_cmdline.o := n > > ifdef CONFIG_FUNCTION_TRACER > -CFLAGS_REMOVE_cmdline.o = -pg > +CFLAGS_REMOVE_cmdline.o = $(CC_FLAGS_FTRACE) > endif > > CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile > index 5864219221ca..91883d5a0293 100644 > --- a/arch/x86/mm/Makefile > +++ b/arch/x86/mm/Makefile > @@ -12,8 +12,8 @@ KASAN_SANITIZE_mem_encrypt_identity.o := n > KCSAN_SANITIZE := n > > ifdef CONFIG_FUNCTION_TRACER > -CFLAGS_REMOVE_mem_encrypt.o = -pg > -CFLAGS_REMOVE_mem_encrypt_identity.o = -pg > +CFLAGS_REMOVE_mem_encrypt.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_mem_encrypt_identity.o = $(CC_FLAGS_FTRACE) > endif > > obj-y:= init.o init_$(BITS).o fault.o > ioremap.o extable.o mmap.o \ > diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile > index 40b5779fce21..179dfc124c94 100644 > --- a/arch/x86/xen/Makefile > +++ b/arch/x86/xen/Makefile > @@ -2,9 +2,9 @@ > > ifdef CONFIG_FUNCTION_TRACER > # Do not profile debug and lowlevel utilities > -CFLAGS_REMOVE_spinlock.o = -pg > -CFLAGS_REMOVE_time.o = -pg > -CFLAGS_REMOVE_irq.o = -pg > +CFLAGS_REMOVE_spinlock.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_time.o = $(CC_FLAGS_FTRACE) > +CFLAGS_REMOVE_irq.o = $(CC_FLAGS_FTRACE) > endif > > # Make sure early boot has no stackprotector > -- > 2.20.1 > > > -- Josh
Re: [PATCH 00/13] [RFC] Rust support
On Wed, Apr 14, 2021 at 01:21:52PM -0700, Linus Torvalds wrote: > On Wed, Apr 14, 2021 at 1:10 PM Matthew Wilcox wrote: > > > > There's a philosophical point to be discussed here which you're skating > > right over! Should rust-in-the-linux-kernel provide the same memory > > allocation APIs as the rust-standard-library, or should it provide a Rusty > > API to the standard-linux-memory-allocation APIs? > > Yeah, I think that the standard Rust API may simply not be acceptable > inside the kernel, if it has similar behavior to the (completely > broken) C++ "new" operator. > > So anything that does "panic!" in the normal Rust API model needs to > be (statically) caught, and never exposed as an actual call to > "panic()/BUG()" in the kernel. Rust has both kinds of allocation APIs: you can call a method like `Box::new` that panics on allocation failure, or a method like `Box::try_new` that returns an error on allocation failure. With some additional infrastructure that's still in progress, we could just not supply the former kind of methods at all, and *only* supply the latter, so that you're forced to handle allocation failure. That just requires introducing some further ability to customize the Rust standard library. (There are some cases of methods in the standard library that don't have a `try_` equivalent, but we could fix that. Right now, for instance, there isn't a `try_` equivalent of every Vec method, and you're instead expected to call `try_reserve` to make sure you have enough memory first; however, that could potentially be changed.)
Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks
On Mon, Apr 12, 2021 at 05:59:33PM +0100, Mark Brown wrote: > On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote: > > > Hm, for that matter, even without renaming things, a comment above > > stack_trace_save_tsk_reliable() describing the meaning of "reliable" > > would be a good idea. > > Might be better to place something at the prototype for > arch_stack_walk_reliable() or cross link the two since that's where any > new architectures should be starting, or perhaps even better to extend > the document that Mark wrote further and point to that from both places. > > Some more explict pointer to live patching as the only user would > definitely be good but I think the more important thing would be writing > down any assumptions in the API that aren't already written down and > we're supposed to be relying on. Mark's document captured a lot of it > but it sounds like there's more here, and even with knowing that this > interface is only used by live patch and digging into what it does it's > not always clear what happens to work with the code right now and what's > something that's suitable to be relied on. Something like so? From: Josh Poimboeuf Subject: [PATCH] livepatch: Clarify the meaning of 'reliable' Update the comments and documentation to reflect what 'reliable' unwinding actually means, in the context of live patching. Suggested-by: Mark Brown Signed-off-by: Josh Poimboeuf --- .../livepatch/reliable-stacktrace.rst | 26 + arch/x86/kernel/stacktrace.c | 6 include/linux/stacktrace.h| 29 +-- kernel/stacktrace.c | 7 - 4 files changed, 53 insertions(+), 15 deletions(-) diff --git a/Documentation/livepatch/reliable-stacktrace.rst b/Documentation/livepatch/reliable-stacktrace.rst index 67459d2ca2af..e325efc7e952 100644 --- a/Documentation/livepatch/reliable-stacktrace.rst +++ b/Documentation/livepatch/reliable-stacktrace.rst @@ -72,7 +72,21 @@ The unwinding process varies across architectures, their respective procedure call standards, and kernel configurations. This section describes common details that architectures should consider. -4.1 Identifying successful termination +4.1 Only preemptible code needs reliability detection +- + +The only current user of reliable stacktracing is livepatch, which only +calls it for a) inactive tasks; or b) the current task in task context. + +Therefore, the unwinder only needs to detect the reliability of stacks +involving *preemptible* code. + +Practically speaking, reliability of stacks involving *non-preemptible* +code is a "don't-care". It may help to return a wrong reliability +result for such cases, if it results in reduced complexity, since such +cases will not happen in practice. + +4.2 Identifying successful termination -- Unwinding may terminate early for a number of reasons, including: @@ -95,7 +109,7 @@ architectures verify that a stacktrace ends at an expected location, e.g. * On a specific stack expected for a kernel entry point (e.g. if the architecture has separate task and IRQ stacks). -4.2 Identifying unwindable code +4.3 Identifying unwindable code --- Unwinding typically relies on code following specific conventions (e.g. @@ -129,7 +143,7 @@ unreliable to unwind from, e.g. * Identifying specific portions of code using bounds information. -4.3 Unwinding across interrupts and exceptions +4.4 Unwinding across interrupts and exceptions -- At function call boundaries the stack and other unwind state is expected to be @@ -156,7 +170,7 @@ have no such cases) should attempt to unwind across exception boundaries, as doing so can prevent unnecessarily stalling livepatch consistency checks and permits livepatch transitions to complete more quickly. -4.4 Rewriting of return addresses +4.5 Rewriting of return addresses - Some trampolines temporarily modify the return address of a function in order @@ -222,7 +236,7 @@ middle of return_to_handler and can report this as unreliable. Architectures are not required to unwind from other trampolines which modify the return address. -4.5 Obscuring of return addresses +4.6 Obscuring of return addresses - Some trampolines do not rewrite the return address in order to intercept @@ -249,7 +263,7 @@ than the link register as would usually be the case. Architectures must either ensure that unwinders either reliably unwind such cases, or report the unwinding as unreliable. -4.6 Link register unreliability +4.7 Link register unreliability --- On some other architectures, 'call' instructions place
Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks
On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote: > On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote: > > > FWIW, over the years we've had zero issues with encoding the frame > > > pointer on x86. After you save pt_regs, you encode the frame pointer to > > > point to it. Ideally in the same macro so it's hard to overlook. > > > > > > > I had the same opinion. In fact, in my encoding scheme, I have additional > > checks to make absolutely sure that it is a true encoding and not stack > > corruption. The chances of all of those values accidentally matching are, > > well, null. > > Right, stack corruption -- which is already exceedingly rare -- would > have to be combined with a miracle or two in order to come out of the > whole thing marked as 'reliable' :-) > > And really, we already take a similar risk today by "trusting" the frame > pointer value on the stack to a certain extent. Oh yeah, I forgot to mention some more benefits of encoding the frame pointer (or marking pt_regs in some other way): a) Stack addresses can be printed properly: '%pS' for printing regs->pc and '%pB' for printing call returns. Using '%pS' for call returns (as arm64 seems to do today) will result in printing the wrong function when you have tail calls to noreturn functions on the stack (which is actually quite common for calls to panic(), die(), etc). More details: https://lkml.kernel.org/r/20210403155948.ubbgtwmlsdyar7yp@treble b) Stack dumps to the console can dump the exception registers they find along the way. This is actually quite nice for debugging. -- Josh
Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks
On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote: > > FWIW, over the years we've had zero issues with encoding the frame > > pointer on x86. After you save pt_regs, you encode the frame pointer to > > point to it. Ideally in the same macro so it's hard to overlook. > > > > I had the same opinion. In fact, in my encoding scheme, I have additional > checks to make absolutely sure that it is a true encoding and not stack > corruption. The chances of all of those values accidentally matching are, > well, null. Right, stack corruption -- which is already exceedingly rare -- would have to be combined with a miracle or two in order to come out of the whole thing marked as 'reliable' :-) And really, we already take a similar risk today by "trusting" the frame pointer value on the stack to a certain extent. > >> I think there's a lot more code that we cannot unwind, e.g. KVM > >> exception code, or almost anything marked with SYM_CODE_END(). > > > > Just a reminder that livepatch only unwinds blocked tasks (plus the > > 'current' task which calls into livepatch). So practically speaking, it > > doesn't matter whether the 'unreliable' detection has full coverage. > > The only exceptions which really matter are those which end up calling > > schedule(), e.g. preemption or page faults. > > > > Being able to consistently detect *all* possible unreliable paths would > > be nice in theory, but it's unnecessary and may not be worth the extra > > complexity. > > > > You do have a point. I tried to think of arch_stack_walk_reliable() as > something that should be implemented independent of livepatching. But > I could not really come up with a single example of where else it would > really be useful. > > So, if we assume that the reliable stack trace is solely for the purpose > of livepatching, I agree with your earlier comments as well. One thought: if folks really view this as a problem, it might help to just rename things to reduce confusion. For example, instead of calling it 'reliable', we could call it something more precise, like 'klp_reliable', to indicate that its reliable enough for live patching. Then have a comment above 'klp_reliable' and/or stack_trace_save_tsk_klp_reliable() which describes what that means. Hm, for that matter, even without renaming things, a comment above stack_trace_save_tsk_reliable() describing the meaning of "reliable" would be a good idea. -- Josh
Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks
On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote: > On Mon, Apr 05, 2021 at 03:43:09PM -0500, madve...@linux.microsoft.com wrote: > > From: "Madhavan T. Venkataraman" > > > > There are a number of places in kernel code where the stack trace is not > > reliable. Enhance the unwinder to check for those cases and mark the > > stack trace as unreliable. Once all of the checks are in place, the unwinder > > can provide a reliable stack trace. But before this can be used for > > livepatch, > > some other entity needs to guarantee that the frame pointers are all set up > > correctly in kernel functions. objtool is currently being worked on to > > fill that gap. > > > > Except for the return address check, all the other checks involve checking > > the return PC of every frame against certain kernel functions. To do this, > > implement some infrastructure code: > > > > - Define a special_functions[] array and populate the array with > > the special functions > > I'm not too keen on having to manually collate this within the unwinder, > as it's very painful from a maintenance perspective. Agreed. > I'd much rather we could associate this information with the > implementations of these functions, so that they're more likely to > stay in sync. > > Further, I believe all the special cases are assembly functions, and > most of those are already in special sections to begin with. I reckon > it'd be simpler and more robust to reject unwinding based on the > section. If we need to unwind across specific functions in those > sections, we could opt-in with some metadata. So e.g. we could reject > all functions in ".entry.text", special casing the EL0 entry functions > if necessary. Couldn't this also end up being somewhat fragile? Saying "certain sections are deemed unreliable" isn't necessarily obvious to somebody who doesn't already know about it, and it could be overlooked or forgotten over time. And there's no way to enforce it stays that way. FWIW, over the years we've had zero issues with encoding the frame pointer on x86. After you save pt_regs, you encode the frame pointer to point to it. Ideally in the same macro so it's hard to overlook. If you're concerned about debuggers getting confused by the encoding - which debuggers specifically? In my experience, if vmlinux has debuginfo, gdb and most other debuggers will use DWARF (which is already broken in asm code) and completely ignore frame pointers. > I think there's a lot more code that we cannot unwind, e.g. KVM > exception code, or almost anything marked with SYM_CODE_END(). Just a reminder that livepatch only unwinds blocked tasks (plus the 'current' task which calls into livepatch). So practically speaking, it doesn't matter whether the 'unreliable' detection has full coverage. The only exceptions which really matter are those which end up calling schedule(), e.g. preemption or page faults. Being able to consistently detect *all* possible unreliable paths would be nice in theory, but it's unnecessary and may not be worth the extra complexity. -- Josh
Re: [PATCH 0/5] Introduce support for PSF mitigation
On Thu, Apr 08, 2021 at 09:56:47AM -0500, Saripalli, RK wrote: > Josh, thank you for taking the time to review the patches. > > On 4/7/2021 5:39 PM, Josh Poimboeuf wrote: > > On Tue, Apr 06, 2021 at 10:49:59AM -0500, Ramakrishna Saripalli wrote: > >> Because PSF speculation is limited to the current program context, > >> the impact of bad PSF speculation is very similar to that of > >> Speculative Store Bypass (Spectre v4) > >> > >> Predictive Store Forwarding controls: > >> There are two hardware control bits which influence the PSF feature: > >> - MSR 48h bit 2 – Speculative Store Bypass (SSBD) > >> - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD) > >> > >> The PSF feature is disabled if either of these bits are set. These bits > >> are controllable on a per-thread basis in an SMT system. By default, both > >> SSBD and PSFD are 0 meaning that the speculation features are enabled. > >> > >> While the SSBD bit disables PSF and speculative store bypass, PSFD only > >> disables PSF. > >> > >> PSFD may be desirable for software which is concerned with the > >> speculative behavior of PSF but desires a smaller performance impact than > >> setting SSBD. > > > > Hi Ramakrishna, > > > > Is there a realistic scenario where an application would want to disable > > PSF, but not disable SSB? > > It is possible most applications have been reviewed and scrubbed for > SSB-type attacks but PSF-type issues may not have been looked at yet. It's "possible", but is it realistic? As far as I know, SSB is impractical to scrub an application for. Do we know of any real-world cases where this option is needed? -- Josh
[tip: sched/core] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
The following commit has been merged into the sched/core branch of tip: Commit-ID: 6db12ee0456d0e369c7b59788d46e15a56ad0294 Gitweb: https://git.kernel.org/tip/6db12ee0456d0e369c7b59788d46e15a56ad0294 Author:Josh Hunt AuthorDate:Thu, 01 Apr 2021 22:58:33 -04:00 Committer: Peter Zijlstra CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00 psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt Signed-off-by: Peter Zijlstra (Intel) Acked-by: Johannes Weiner Link: https://lkml.kernel.org/r/20210402025833.27599-1-joh...@akamai.com --- kernel/sched/psi.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9..d1212f1 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v) return psi_show(m, _system, PSI_CPU); } +static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *)) +{ + if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) + return -EPERM; + + return single_open(file, psi_show, NULL); +} + static int psi_io_open(struct inode *inode, struct file *file) { - return single_open(file, psi_io_show, NULL); + return psi_open(file, psi_io_show); } static int psi_memory_open(struct inode *inode, struct file *file) { - return single_open(file, psi_memory_show, NULL); + return psi_open(file, psi_memory_show); } static int psi_cpu_open(struct inode *inode, struct file *file) { - return single_open(file, psi_cpu_show, NULL); + return psi_open(file, psi_cpu_show); } struct psi_trigger *psi_trigger_create(struct psi_group *group, @@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, _io_proc_ops); - proc_create("pressure/memory", 0, NULL, _memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, _cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, _io_proc_ops); + proc_create("pressure/memory", 0666, NULL, _memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, _cpu_proc_ops); } return 0; }
Re: [PATCH 0/9] sched: Core scheduling interfaces
On Thu, Apr 8, 2021 at 9:47 AM Peter Zijlstra wrote: > > On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote: > > > I'm curious whether the cgroup API actually simplifies things that are > > possible with the clone/prctl API or allows anything that wouldn't be > > otherwise possible. > > With the cgroup API it is impossible for a task to escape the cgroup > constraint. It can never share a core with anything not in the subtree. > > This is not possible with just the task interface. > > If this is actually needed I've no clue, IMO all of cgroups is not > needed :-) Clearly other people feel differently about that. The cgroup interface seems very nice from a management perspective; moving arbitrary tasks around in the cgroup hierarchy will handle the necessary cookie adjustments to ensure that nothing shares with an unexpected task. It also makes auditing the core scheduling groups very straightforward; anything in a cookie'd cgroup's tasks file will be guaranteed isolated from the rest of the system, period. Further, if a system management thread wants two arbitrary tasks A and B to share a cookie, this seems more painful with the PRCTL interface. The management thread would need to something like - PR_SCHED_CORE_CREATE - PR_SCHED_CORE_SHARE_TO A - PR_SCHED_CORE_SHARE_TO B - PR_SCHED_CORE_CLEAR
Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
On Thu, Apr 08, 2021 at 08:18:21AM +0200, Greg KH wrote: > On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote: > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > > On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote: > > > > As for the syfs deadlock possible with drivers, this fixes it in a > > > > generic way: > > > > > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > > > Author: Luis Chamberlain > > > > Date: Sat Mar 27 14:58:15 2021 + > > > > > > > > sysfs: add optional module_owner to attribute > > > > > > > > This is needed as otherwise the owner of the attribute > > > > or group read/store might have a shared lock used on driver removal, > > > > and deadlock if we race with driver removal. > > > > > > > > Signed-off-by: Luis Chamberlain > > > > > > No, please no. Module removal is a "best effort", if the system dies > > > when it happens, that's on you. I am not willing to expend extra energy > > > and maintance of core things like sysfs for stuff like this that does > > > not matter in any system other than a developer's box. > > > > So I mentioned this on IRC, and some folks were surprised to hear that > > module unloading is unsupported and is just a development aid. > > > > Is this stance documented anywhere? > > > > If we really believe this to be true, we should make rmmod taint the > > kernel. > > My throw-away comment here seems to have gotten way more attention than > it deserved, sorry about that everyone. > > Nothing is supported for anything here, it's all "best effort" :) > > And I would love a taint for rmmod, but what is that going to help out > with? I'm having trouble following this conclusion. Sure, we give our best effort, but if "nothing is supported" then what are we even doing here? Either we aim to support module unloading, or we don't. If we do support it, "best effort" isn't a valid reason for nacking fixes. If we don't support it, but still want to keep it half-working for development purposes, tainting on rmmod will make it crystal clear that the system has been destabilized. -- Josh
Re: [PATCH 0/5] Introduce support for PSF mitigation
On Tue, Apr 06, 2021 at 10:49:59AM -0500, Ramakrishna Saripalli wrote: > Because PSF speculation is limited to the current program context, > the impact of bad PSF speculation is very similar to that of > Speculative Store Bypass (Spectre v4) > > Predictive Store Forwarding controls: > There are two hardware control bits which influence the PSF feature: > - MSR 48h bit 2 – Speculative Store Bypass (SSBD) > - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD) > > The PSF feature is disabled if either of these bits are set. These bits > are controllable on a per-thread basis in an SMT system. By default, both > SSBD and PSFD are 0 meaning that the speculation features are enabled. > > While the SSBD bit disables PSF and speculative store bypass, PSFD only > disables PSF. > > PSFD may be desirable for software which is concerned with the > speculative behavior of PSF but desires a smaller performance impact than > setting SSBD. Hi Ramakrishna, Is there a realistic scenario where an application would want to disable PSF, but not disable SSB? Maybe I'm missing something, but I'd presume an application would either care about this class of attacks, or not. -- Josh
Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote: > > As for the syfs deadlock possible with drivers, this fixes it in a generic > > way: > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > Author: Luis Chamberlain > > Date: Sat Mar 27 14:58:15 2021 + > > > > sysfs: add optional module_owner to attribute > > > > This is needed as otherwise the owner of the attribute > > or group read/store might have a shared lock used on driver removal, > > and deadlock if we race with driver removal. > > > > Signed-off-by: Luis Chamberlain > > No, please no. Module removal is a "best effort", if the system dies > when it happens, that's on you. I am not willing to expend extra energy > and maintance of core things like sysfs for stuff like this that does > not matter in any system other than a developer's box. So I mentioned this on IRC, and some folks were surprised to hear that module unloading is unsupported and is just a development aid. Is this stance documented anywhere? If we really believe this to be true, we should make rmmod taint the kernel. -- Josh
Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote: > On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote: > > > Same for Red Hat. Unloading livepatch modules seems to work fine, but > > isn't officially supported. > > > > That said, if rmmod is just considered a development aid, and we're > > going to be ignoring bugs, we should make it official with a new > > TAINT_RMMOD. > > Another option would be to have live-patch modules leak a module > reference by default, except when some debug sysctl is set or something. > Then only those LP modules loaded while the sysctl is set to 'YOLO' can > be unloaded. The issue is broader than just live patching. My suggestion was that if we aren't going to fix bugs in kernel module unloading, then unloading modules shouldn't be supported, and should taint the kernel. -- Josh
Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
On Tue, Apr 06, 2021 at 02:00:19PM +0200, Miroslav Benes wrote: > Hi, > > > > Driver developers will simply have to open code these protections. In > > > light of what I see on LTP / fuzzing, I suspect the use case will grow > > > and we'll have to revisit this in the future. But for now, sure, we can > > > just open code the required protections everywhere to not crash on module > > > removal. > > > > LTP and fuzzing too do not remove modules. So I do not understand the > > root problem here, that's just something that does not happen on a real > > system. > > If I am not mistaken, the issue that Luis tries to solve here was indeed > found by running LTP. > > > On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote: > > > On Fri, Apr 02, 2021 at 06:30:16PM +, Luis Chamberlain wrote: > > > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > > > > No, please no. Module removal is a "best effort", > > > > > > > > Not for live patching. I am not sure if I am missing any other valid > > > > use case? > > > > > > live patching removes modules? We have so many code paths that are > > > "best effort" when it comes to module unloading, trying to resolve this > > > one is a valiant try, but not realistic. > > > > Miroslav, your input / help here would be valuable. I did the > > generalization work because you said it would be worthy for you too... > > Yes, we have the option to revert and remove the existing live patch from > the system. I am not sure how (if) it is used in practice. > > At least at SUSE we do not support the option. But we are only one of the > many downstream users. So yes, there is the option. Same for Red Hat. Unloading livepatch modules seems to work fine, but isn't officially supported. That said, if rmmod is just considered a development aid, and we're going to be ignoring bugs, we should make it official with a new TAINT_RMMOD. -- Josh
Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks
inal return address. But I am > not entirely sure how to safely traverse that list for stack traces > not on the current process. So, I have taken the easy way out. For kretprobes, unwinding from the trampoline or kretprobe handler shouldn't be a reliability concern for live patching, for similar reasons as above. Otherwise, when unwinding from a blocked task which has 'kretprobe_trampoline' on the stack, the unwinder needs a way to get the original return address. Masami has been working on an interface to make that possible for x86. I assume something similar could be done for arm64. > Optprobes > = > > Optprobes may be implemented in the future for arm64. For optprobes, > the relevant trampoline(s) can be added to special_functions[]. Similar comment here, livepatch doesn't unwind from such trampolines. -- Josh
Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record
On Thu, Apr 01, 2021 at 10:24:04PM -0500, madve...@linux.microsoft.com wrote: > From: "Madhavan T. Venkataraman" > @@ -447,9 +464,9 @@ SYM_FUNC_START_LOCAL(__primary_switched) > #endif > bl switch_to_vhe // Prefer VHE if possible > add sp, sp, #16 > - mov x29, #0 > - mov x30, #0 > - b start_kernel > + setup_final_frame > + bl start_kernel > + nop > SYM_FUNC_END(__primary_switched) > > .pushsection ".rodata", "a" > @@ -606,14 +623,14 @@ SYM_FUNC_START_LOCAL(__secondary_switched) > cbz x2, __secondary_too_slow > msr sp_el0, x2 > scs_load x2, x3 > - mov x29, #0 > - mov x30, #0 > + setup_final_frame > > #ifdef CONFIG_ARM64_PTR_AUTH > ptrauth_keys_init_cpu x2, x3, x4, x5 > #endif > > - b secondary_start_kernel > + bl secondary_start_kernel > + nop > SYM_FUNC_END(__secondary_switched) I'm somewhat arm-ignorant, so take the following comments with a grain of salt. I don't think changing these to 'bl' is necessary, unless you wanted __primary_switched() and __secondary_switched() to show up in the stacktrace for some reason? If so, that seems like a separate patch. Also, why are nops added after the calls? My guess would be because, since these are basically tail calls to "noreturn" functions, the stack dump code would otherwise show the wrong function, i.e. whatever function happens to be after the 'bl'. We had the same issue for x86. It can be fixed by using '%pB' instead of '%pS' when printing the address in dump_backtrace_entry(). See sprint_backtrace() for more details. BTW I think the same issue exists for GCC-generated code. The following shows several such cases: objdump -dr vmlinux |awk '/bl / {bl=1;l=$0;next} bl == 1 && /^$/ {print l; print} // {bl=0}' However, looking at how arm64 unwinds through exceptions in kernel space, using '%pB' might have side effects when the exception LR (elr_el1) points to the beginning of a function. Then '%pB' would show the end of the previous function, instead of the function which was interrupted. So you may need to rethink how to unwind through in-kernel exceptions. Basically, when printing a stack return address, you want to use '%pB' for a call return address and '%pS' for an interrupted address. On x86, with the frame pointer unwinder, we encode the frame pointer by setting a bit in %rbp which tells the unwinder that it's a special pt_regs frame. Then instead of treating it like a normal call frame, the stack dump code prints the registers, and the return address (regs->ip) gets printed with '%pS'. > SYM_FUNC_START_LOCAL(__secondary_too_slow) > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c > index 325c83b1a24d..906baa232a89 100644 > --- a/arch/arm64/kernel/process.c > +++ b/arch/arm64/kernel/process.c > @@ -437,6 +437,11 @@ int copy_thread(unsigned long clone_flags, unsigned long > stack_start, > } > p->thread.cpu_context.pc = (unsigned long)ret_from_fork; > p->thread.cpu_context.sp = (unsigned long)childregs; > + /* > + * For the benefit of the unwinder, set up childregs->stackframe > + * as the final frame for the new task. > + */ > + p->thread.cpu_context.fp = (unsigned long)childregs->stackframe; > > ptrace_hw_copy_thread(p); > > diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c > index ad20981dfda4..72f5af8c69dc 100644 > --- a/arch/arm64/kernel/stacktrace.c > +++ b/arch/arm64/kernel/stacktrace.c > @@ -44,16 +44,16 @@ int notrace unwind_frame(struct task_struct *tsk, struct > stackframe *frame) > unsigned long fp = frame->fp; > struct stack_info info; > > - /* Terminal record; nothing to unwind */ > - if (!fp) > + if (!tsk) > + tsk = current; > + > + /* Final frame; nothing to unwind */ > + if (fp == (unsigned long) task_pt_regs(tsk)->stackframe) > return -ENOENT; As far as I can tell, the regs stackframe value is initialized to zero during syscall entry, so isn't this basically just 'if (fp == 0)'? Shouldn't it instead be comparing with the _address_ of the stackframe field to make sure it reached the end? -- Josh
Re: [PATCH 9/9] sched: prctl() and cgroup interaction
Hi Peter, I walked through the reference counting, and it seems good to me (though it did take a few passes to fully digest the invariants for the fat cookie stuff). > +unsigned long sched_core_alloc_cookie(unsigned int type) > { > struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL); > if (!ck) > return 0; > - refcount_set(>refcnt, 1); > + WARN_ON_ONCE(type > GROUP_COOKIE); > + sched_core_init_cookie(ck, type); > sched_core_get(); > > - return (unsigned long)ck; > + return (unsigned long)ck | type; > } This feels like it needs to be stronger than a WARN_ON_ONCE; could create a corrupted address that we later try to kfree(). Also, for my own edification, why will the bottom two bits here always be 0? > -unsigned long sched_core_alloc_cookie(void) > +static inline void *cookie_ptr(unsigned long cookie) > +{ > + return (void *)(cookie & ~3UL); > +} > + > +static inline int cookie_type(unsigned long cookie) > +{ > + return cookie & 3; > +} s/3/FAT_COOKIE > +#define FAT_COOKIE 0x03 Move to sched.h to group with TASK/GROUP_COOKIE? > +static unsigned long __sched_core_fat_cookie(struct task_struct *p, > +void **spare_fat, > +unsigned long cookie) > +{ This function looks good to me, but could use some more comments about the pre/post-condition assumptions. Ie. cookie already has a get() associated with it, caller is expected to kfree the spare_fat. > + raw_spin_lock_irqsave(_lock, flags); > + n = rb_find_add(>node, _root, fat_cmp); > + raw_spin_unlock_irqrestore(_lock, flags); > + > + if (n) { > + sched_core_put_fat(fat); > + fat = node_2_fat(n); This put() doesn't seem strictly necessary; caller will be unconditionally freeing the spare_fat. Keep anyway for completeness, but add a comment?
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 4/2/21 12:25 PM, Jiri Kosina wrote: On Thu, 3 Sep 2020, John Fastabend wrote: At this point I fear we could consider reverting the NOLOCK stuff. I personally would hate doing so, but it looks like NOLOCK benefits are outweighed by its issues. I agree, NOLOCK brings more pains than gains. There are many race conditions hidden in generic qdisc layer, another one is enqueue vs. reset which is being discussed in another thread. Sure. Seems they crept in over time. I had some plans to write a lockless HTB implementation. But with fq+EDT with BPF it seems that it is no longer needed, we have a more generic/better solution. So I dropped it. Also most folks should really be using fq, fq_codel, etc. by default anyways. Using pfifo_fast alone is not ideal IMO. Half a year later, we still have the NOLOCK implementation present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself. And we've just been bitten by this very same race which appears to be still unfixed, with single packet being stuck in pfifo_fast qdisc basically indefinitely due to this very race that this whole thread began with back in 2019. Unless there are (a) any nice ideas how to solve this in an elegant way without (re-)introducing extra spinlock (Cong's fix) or (b) any objections to revert as per the argumentation above I'll be happy to send a revert of the whole NOLOCK implementation next week. Jiri If you have a reproducer can you try https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your suggestion of reverting nolock makes sense to me. We've moved to using fq as our default now b/c of this bug. Josh
[PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt Acked-by: Johannes Weiner --- kernel/sched/psi.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9bd7ed..d1212f17a898 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v) return psi_show(m, _system, PSI_CPU); } +static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *)) +{ + if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) + return -EPERM; + + return single_open(file, psi_show, NULL); +} + static int psi_io_open(struct inode *inode, struct file *file) { - return single_open(file, psi_io_show, NULL); + return psi_open(file, psi_io_show); } static int psi_memory_open(struct inode *inode, struct file *file) { - return single_open(file, psi_memory_show, NULL); + return psi_open(file, psi_memory_show); } static int psi_cpu_open(struct inode *inode, struct file *file) { - return single_open(file, psi_cpu_show, NULL); + return psi_open(file, psi_cpu_show); } struct psi_trigger *psi_trigger_create(struct psi_group *group, @@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, _io_proc_ops); - proc_create("pressure/memory", 0, NULL, _memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, _cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, _io_proc_ops); + proc_create("pressure/memory", 0666, NULL, _memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, _cpu_proc_ops); } return 0; } -- 2.17.1
Re: [PATCH 7/9] sched: Cgroup core-scheduling interface
Thanks, allowing for multiple group cookies in a hierarchy is a nice improvement. > + if (tgi != tg) { > + if (tgi->core_cookie || (tgi->core_parent && > tgi->core_parent != tg)) > + continue; > + > + tgi->core_parent = parent; > + tgi->core_cookie = 0; core_cookie must already be 0, given the check above.
Re: [PATCH 3/9] sched: Trivial core scheduling cookie management
> +/* > + * sched_core_update_cookie - Common helper to update a task's core cookie. > This > + * updates the selected cookie field. > + * @p: The task whose cookie should be updated. > + * @cookie: The new cookie. > + * @cookie_type: The cookie field to which the cookie corresponds. No cookie_type in this patch (or cookie fields). Also might want to call out that the caller is responsible for put'ing the return value.
Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
On 4/1/21 10:47 AM, Eric W. Biederman wrote: Kees Cook writes: On Wed, Mar 31, 2021 at 11:36:28PM -0500, Eric W. Biederman wrote: Josh Hunt writes: Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. The test for CAP_SYS_RESOURCE really needs to be in open rather than in write. Otherwise a suid root executable could have stdout redirected into these files. Right. Or check against f_cred. (See uses of kallsyms_show_value()) https://urldefense.com/v3/__https://www.kernel.org/doc/html/latest/security/credentials.html*open-file-credentials__;Iw!!GjvTz_vk!B_aeVyHMG20VNUGx001EFKpeYlahLQHye7oS8sokXuZOhVDTtF_deDl71a_KYA$ We really want to limit checking against f_cred to those cases where we break userspace by checking in open. AKA the cases where we made the mistake of putting the permission check in the wrong place and now can't fix it. Since this change is change the permissions that open uses already I don't see any reason we can't perform a proper check in open. Eric Thank you for the feedback. I will spin a v2 doing the check in open. Josh
[PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt --- kernel/sched/psi.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9bd7ed..98ff7baf1ba8 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1270,6 +1270,9 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf, if (!nbytes) return -EINVAL; + if (!capable(CAP_SYS_RESOURCE)) + return -EPERM; + buf_size = min(nbytes, sizeof(buf)); if (copy_from_user(buf, user_buf, buf_size)) return -EFAULT; @@ -1353,9 +1356,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, _io_proc_ops); - proc_create("pressure/memory", 0, NULL, _memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, _cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, _io_proc_ops); + proc_create("pressure/memory", 0666, NULL, _memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, _cpu_proc_ops); } return 0; } -- 2.17.1
Re: [RFC PATCH -tip 3/3] x86/kprobes,orc: Unwind optprobe trampoline correctly
On Wed, Mar 31, 2021 at 02:44:56PM +0900, Masami Hiramatsu wrote: > +#ifdef CONFIG_UNWINDER_ORC > +unsigned long recover_optprobe_trampoline(unsigned long addr, unsigned long > *sp) > +{ > + unsigned long offset, entry, probe_addr; > + struct optimized_kprobe *op; > + struct orc_entry *orc; > + > + entry = find_kprobe_optinsn_slot_entry(addr); > + if (!entry) > + return addr; > + > + offset = addr - entry; > + > + /* Decode arg1 and get the optprobe */ > + op = (void *)extract_set_arg1((void *)(entry + TMPL_MOVE_IDX)); > + if (!op) > + return addr; > + > + probe_addr = (unsigned long)op->kp.addr; > + > + if (offset < TMPL_END_IDX) { > + orc = orc_find((unsigned long)optprobe_template_func + offset); > + if (!orc || orc->sp_reg != ORC_REG_SP) > + return addr; > + /* > + * Since optprobe trampoline doesn't push caller on the stack, > + * need to decrement 1 stack entry size > + */ > + *sp += orc->sp_offset - sizeof(long); > + return probe_addr; > + } else { > + return probe_addr + offset - TMPL_END_IDX; > + } > +} > +#endif Hm, I'd like to avoid intertwining kprobes and ORC like this. ORC unwinds other generated code by assuming the generated code uses a frame pointer. Could we do that here? With CONFIG_FRAME_POINTER, unwinding works because SAVE_REGS_STRING has ENCODE_FRAME_POINTER, but that's not going to work for ORC. Instead of these patches, can we 'push %rbp; mov %rsp, %rbp' at the beginning of the template and 'pop %rbp' at the end? I guess SAVE_REGS_STRING would need to be smart enough to push the original saved version of %rbp. Of course then that breaks the kretprobe_trampoline() usage, so it may need to be a separate macro. [ Or make the same change to kretprobe_trampoline(). Then the other patch set wouldn't be needed either ;-) ] Of course the downside is, when you get an interrupt during the frame pointer setup, unwinding is broken. But I think that's acceptable for generated code. We've lived with that limitation for all code, with CONFIG_FRAME_POINTER, for many years. Eventually we may want to have a way to register generated code (and the ORC for it). -- Josh
Re: [PATCH v2] sched: Warn on long periods of pending need_resched
Peter, Since you've already pulled the need_resched warning patch into your tree, I'm including just the diff based on that patch (in response to Mel's comments) below. This should be squashed into the original patch. Thanks, Josh --- >From 85796b4d299b1cf3f99bde154a356ce1061221b7 Mon Sep 17 00:00:00 2001 From: Josh Don Date: Mon, 22 Mar 2021 20:57:06 -0700 Subject: [PATCH] fixup: sched: Warn on long periods of pending need_resched --- kernel/sched/core.c | 29 - 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6fdf15eebc0d..c07a4c17205f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -61,17 +61,13 @@ const_debug unsigned int sysctl_sched_features = /* * Print a warning if need_resched is set for the given duration (if - * resched_latency_warn_enabled is set). + * LATENCY_WARN is enabled). * * If sysctl_resched_latency_warn_once is set, only one warning will be shown * per boot. - * - * Resched latency will be ignored for the first resched_boot_quiet_sec, to - * reduce false alarms. */ -int sysctl_resched_latency_warn_ms = 100; -int sysctl_resched_latency_warn_once = 1; -static const long resched_boot_quiet_sec = 600; +__read_mostly int sysctl_resched_latency_warn_ms = 100; +__read_mostly int sysctl_resched_latency_warn_once = 1; #endif /* CONFIG_SCHED_DEBUG */ /* @@ -4542,20 +4538,19 @@ unsigned long long task_sched_runtime(struct task_struct *p) } #ifdef CONFIG_SCHED_DEBUG -static u64 resched_latency_check(struct rq *rq) +static u64 cpu_resched_latency(struct rq *rq) { int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); - u64 need_resched_latency, now = rq_clock(rq); + u64 resched_latency, now = rq_clock(rq); static bool warned_once; if (sysctl_resched_latency_warn_once && warned_once) return 0; - if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2)) + if (!need_resched() || !latency_warn_ms) return 0; - /* Disable this warning for the first few mins after boot */ - if (now < resched_boot_quiet_sec * NSEC_PER_SEC) + if (system_state == SYSTEM_BOOTING) return 0; if (!rq->last_seen_need_resched_ns) { @@ -4565,13 +4560,13 @@ static u64 resched_latency_check(struct rq *rq) } rq->ticks_without_resched++; - need_resched_latency = now - rq->last_seen_need_resched_ns; - if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC) + resched_latency = now - rq->last_seen_need_resched_ns; + if (resched_latency <= latency_warn_ms * NSEC_PER_MSEC) return 0; warned_once = true; - return need_resched_latency; + return resched_latency; } static int __init setup_resched_latency_warn_ms(char *str) @@ -4588,7 +4583,7 @@ static int __init setup_resched_latency_warn_ms(char *str) } __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms); #else -static inline u64 resched_latency_check(struct rq *rq) { return 0; } +static inline u64 cpu_resched_latency(struct rq *rq) { return 0; } #endif /* CONFIG_SCHED_DEBUG */ /* @@ -4614,7 +4609,7 @@ void scheduler_tick(void) update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); curr->sched_class->task_tick(rq, curr, 0); if (sched_feat(LATENCY_WARN)) - resched_latency = resched_latency_check(rq); + resched_latency = cpu_resched_latency(rq); calc_global_load_tick(rq); rq_unlock(rq, ); -- 2.31.0.291.g576ba9dcdaf-goog
Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
On Mon, Mar 29, 2021 at 2:55 AM Peter Zijlstra wrote: > > OK, fixed the fails. My tired head made it unconditionally return the > cookie-id of 'current' instead of task. Pushed out an update. I see you have the per-task and prctl stuff pulled into your tree. I can rebase the compound cookie and cgroup api patches on top if you'd like; not sure if you've already re-ordered it locally. Any other comments on the former? > > > Also, we really need a better name than coretag.c. > > > > Yea, we don't really otherwise use the phrase "tagging". core_sched.c > > is probably too confusing given we have sched/core.c. > > Right, so I tried core_sched and my fingers already hate it as much as > kernel/scftorture.c (which I'd assumed my fingers would get used to > eventually, but n). > > Looking at kernel/sched/ C is very overrepresented, so we really don't > want another I think. B, E, G, H, J, K, N, seem to still be available in > the first half of the alphabet. Maybe, bonghits.c, gabbleduck.c ? hardware_vuln.c? Tricky to avoid a C with cpu, core, and cookie :)
Re: [PATCH resend 5/8] sched: cgroup cookie API for core scheduling
On Tue, Mar 30, 2021 at 2:29 AM Peter Zijlstra wrote: > > On Wed, Mar 24, 2021 at 05:40:17PM -0400, Joel Fernandes (Google) wrote: > > > + > > > + if (!tg->core_tagged && val) { > > > + /* Tag is being set. Check ancestors and descendants. */ > > > + if (cpu_core_get_group_cookie(tg) || > > > + cpu_core_check_descendants(tg, true /* tag */)) { > > > + ret = -EBUSY; > > > + goto out_unlock; > > > + } > > > > So the desired semantics is to only allow a single tag on any upwards > > path? Isn't that in conflict with the cgroup requirements? > > > > TJ? I carried this requirement over from the previous iteration, but I don't see a reason why we can't just dump this and have each task use the group cookie of its closest tagged ancestor. Joel, is there any context here I'm missing? FWIW I also just realized that cpu_core_check_descendants() is busted as it recurses only on one child. > > > + } else if (tg->core_tagged && !val) { > > > + /* Tag is being reset. Check descendants. */ > > > + if (cpu_core_check_descendants(tg, true /* tag */)) { > > > > I'm struggling to understand this. If, per the above, you cannot set > > when either a parent is already set or a child is set, then how can a > > child be set to refuse clearing? Yes this is superfluous with the above semantics.
Re: [PATCH -tip v4 10/12] x86/kprobes: Push a fake return address at kretprobe_trampoline
On Fri, Mar 26, 2021 at 10:20:09AM -0400, Steven Rostedt wrote: > On Fri, 26 Mar 2021 21:03:49 +0900 > Masami Hiramatsu wrote: > > > I confirmed this is not related to this series, but occurs when I build > > kernels with different > > configs without cleanup. > > > > Once I build kernel with CONFIG_UNWIND_GUESS=y (for testing), and after > > that, > > I build kernel again with CONFIG_UNWIND_ORC=y (but without make clean), this > > happened. In this case, I guess ORC data might be corrupted? > > When I cleanup and rebuild, the stacktrace seems correct. > > Hmm, that should be fixed. Seems like we are missing a dependency somewhere. Thomas reported something similar: for example arch/x86/kernel/head_64.o doesn't get rebuilt when changing unwinders. https://lkml.kernel.org/r/87tuqublrb@nanos.tec.linutronix.de Masahiro, any idea how we can force a full tree rebuild when changing the unwinder? -- Josh
Re: [PATCH v3 16/16] objtool,x86: Rewrite retpoline thunk calls
On Fri, Mar 26, 2021 at 04:12:15PM +0100, Peter Zijlstra wrote: > @@ -61,3 +89,15 @@ SYM_FUNC_END(__x86_indirect_thunk_\reg) > #define GEN(reg) EXPORT_THUNK(reg) > #include > > +#undef GEN > +#define GEN(reg) ALT_THUNK reg > +#include > + > +#undef GEN > +#define GEN(reg) __EXPORT_THUNK(__x86_indirect_alt_call_ ## reg) > +#include > + > +#undef GEN > +#define GEN(reg) __EXPORT_THUNK(__x86_indirect_alt_jmp_ ## reg) > +#include > + Git complains about this last newline. Otherwise everything looks pretty good to me. Let me run it through the test matrix. -- Josh
Re: [PATCH resend 2/8] sched: core scheduling tagging infrastructure
Hi Peter, On Fri, Mar 26, 2021 at 5:10 PM Peter Zijlstra wrote: > > On Wed, Mar 24, 2021 at 05:40:14PM -0400, Joel Fernandes (Google) wrote: > > From: Josh Don > > > > A single unsigned long is insufficient as a cookie value for core > > scheduling. We will minimally have cookie values for a per-task and a > > per-group interface, which must be combined into an overall cookie. > > > > This patch adds the infrastructure necessary for setting task and group > > cookie. Namely, it reworks the core_cookie into a struct, and provides > > interfaces for setting task and group cookie, as well as other > > operations (i.e. compare()). Subsequent patches will use these hooks to > > provide an API for setting these cookies. > > > > *urgh*... so I specifically wanted the task interface first to avoid / > get-rid of all this madness. And then you keep it :-( Sorry, I misunderstood the ask here :/ I had separated out the cgroup interface parts of the patch, leaving (mostly) the parts which introduced a compound cookie structure. I see now that you just wanted the plain task interface to start, with no notion of group cookie. > I've spend the past few hours rewriting patches #2 and #3, and adapting > #4. The thing was working before I added SHARE_FROM back and introduced > GET, but now I'm seeing a few FAILs from the selftest. > > I'm too tired to make sense of anything much, or even focus my eyes > consistently, so I'll have to prod at it some more next week, but I've > pushed out the lot to my queue.git: > > > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core-sched Thanks, I'll take a look next week. > Also, we really need a better name than coretag.c. Yea, we don't really otherwise use the phrase "tagging". core_sched.c is probably too confusing given we have sched/core.c.
Re: [PATCH v2] sched: Warn on long periods of pending need_resched
> On Wed, Mar 24, 2021 at 01:39:16PM +, Mel Gorman wrote: > I'm not going to NAK because I do not have hard data that shows they must > exist. However, I won't ACK either because I bet a lot of tasty beverages > the next time we meet that the following parameters will generate reports > if removed. > > kernel.sched_latency_ns > kernel.sched_migration_cost_ns > kernel.sched_min_granularity_ns > kernel.sched_wakeup_granularity_ns > > I know they are altered by tuned for different profiles and some people do > go the effort to create custom profiles for specific applications. They > also show up in "Official Benchmarking" such as SPEC CPU 2017 and > some vendors put a *lot* of effort into SPEC CPU results for bragging > rights. They show up in technical books and best practice guids for > applications. Finally they show up in Google when searching for "tuning > sched_foo". I'm not saying that any of these are even accurate or a good > idea, just that they show up near the top of the results and they are > sufficiently popular that they might as well be an ABI. +1, these seem like sufficiently well-known scheduler tunables, and not really SCHED_DEBUG.
Re: [PATCH v2] sched: Warn on long periods of pending need_resched
On Wed, Mar 24, 2021 at 4:27 AM Mel Gorman wrote: > > I'm not a fan of the name. I know other sysctls have _enabled in the > name but it's redundant. If you say the name out loud, it sounds weird. > I would suggest an alternative but see below. Now using the version rebased by Peter; this control has gone away and we have simply a scheduling feature "LATENCY WARN" > I suggest semantics and naming similar to hung_task_warnings > because it's sortof similar. resched_latency_warnings would combine > resched_latency_warn_enabled and resched_latency_warn_once. 0 would mean > "never warn", -1 would mean always warn and any positive value means > "warn X number of times". See above. I'm happy with the enabled bit being toggled separately by a sched feature; the warn_once behavior is not overloaded with the enabling/disabling. Also, I don't see value in "warn X number of times", given the warning is rate limited anyway. > > +int sysctl_resched_latency_warn_ms = 100; > > +int sysctl_resched_latency_warn_once = 1; > > Use __read_mostly Good point, done. > > +#ifdef CONFIG_SCHED_DEBUG > > +static u64 resched_latency_check(struct rq *rq) > > +{ > > + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); > > + u64 need_resched_latency, now = rq_clock(rq); > > + static bool warned_once; > > + > > + if (sysctl_resched_latency_warn_once && warned_once) > > + return 0; > > + > > That is a global variable that can be modified in parallel and I do not > think it's properly locked (scheduler_tick is holding rq lock which does > not protect this). > > Consider making resched_latency_warnings atomic and use > atomic_dec_if_positive. If it drops to zero in this path, disable the > static branch. > > That said, it may be overkill. hung_task_warnings does not appear to have > special protection that prevents it going to -1 or lower values by accident > either. Maybe it can afford to be a bit more relaxed because a system that > is spamming hung task warnings is probably dead or might as well be dead. There's no real issue if we race over modification to that sysctl. This is intentionally not more strongly synchronized for that reason. > > + if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2)) > > + return 0; > > + > > Why is 1ms special? Regardless of the answer, if the sysctl should not > be 1 then the user should not be able to set it to 1. Yea let me change that to !latency_warn_ms so it isn't HZ specific. > > > + /* Disable this warning for the first few mins after boot */ > > + if (now < resched_boot_quiet_sec * NSEC_PER_SEC) > > + return 0; > > + > > Check system_state == SYSTEM_BOOTING instead? Yea, that might be better; let me test that. > > + if (!rq->last_seen_need_resched_ns) { > > + rq->last_seen_need_resched_ns = now; > > + rq->ticks_without_resched = 0; > > + return 0; > > + } > > + > > + rq->ticks_without_resched++; > > + need_resched_latency = now - rq->last_seen_need_resched_ns; > > + if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC) > > + return 0; > > + > > The naming need_resched_latency implies it's a boolean but it's not. > Maybe just resched_latency? > > Similarly, resched_latency_check implies it returns a boolean but it > returns an excessive latency value. At this point I've been reading the > patch for a long time so I've ran out of naming suggestions :) The "need_" part does confuse it a bit; I reworded these to hopefully make it more clear. > > + warned_once = true; > > + > > + return need_resched_latency; > > +} > > + > > I note that you split when a warning is needed and printing the warning > but it's not clear why. Sure you are under the RQ lock but there are other > places that warn under the RQ lock. I suppose for consistency it could > use SCHED_WARN_ON even though all this code is under SCHED_DEBUG already. We had seen a circular lock dependency warning (console_sem, pi lock, rq lock), since printing might need to wake a waiter. However, I do see plenty of warns under rq->lock, so maybe I missed a patch to address this?
Re: [PATCH -tip v4 10/12] x86/kprobes: Push a fake return address at kretprobe_trampoline
On Wed, Mar 24, 2021 at 10:40:58AM +0900, Masami Hiramatsu wrote: > On Tue, 23 Mar 2021 23:30:07 +0100 > Peter Zijlstra wrote: > > > On Mon, Mar 22, 2021 at 03:41:40PM +0900, Masami Hiramatsu wrote: > > > ".global kretprobe_trampoline\n" > > > ".type kretprobe_trampoline, @function\n" > > > "kretprobe_trampoline:\n" > > > #ifdef CONFIG_X86_64 > > > > So what happens if we get an NMI here? That is, after the RET but before > > the push? Then our IP points into the trampoline but we've not done that > > push yet. > > Not only NMI, but also interrupts can happen. There is no cli/sti here. > > Anyway, thanks for pointing! > I think in UNWIND_HINT_TYPE_REGS and UNWIND_HINT_TYPE_REGS_PARTIAL cases > ORC unwinder also has to check the state->ip and if it is > kretprobe_trampoline, > it should be recovered. > What about this? I think the REGS and REGS_PARTIAL cases can also be affected by function graph tracing. So should they use the generic unwind_recover_ret_addr() instead of unwind_recover_kretprobe()? -- Josh
Re: [PATCH 1/6] sched: migration changes for core scheduling
On Mon, Mar 22, 2021 at 8:54 PM Li, Aubrey wrote: > > On 2021/3/22 20:56, Peter Zijlstra wrote: > > On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote: > >> Please let me know if I put cookie match check at the right position > >> in task_hot(), if so, I'll obtain some performance data of it. > >> Will be sending out a rebased stack soon, with this updated patch included.
[PATCH v2] sched: Warn on long periods of pending need_resched
From: Paul Turner CPU scheduler marks need_resched flag to signal a schedule() on a particular CPU. But, schedule() may not happen immediately in cases where the current task is executing in the kernel mode (no preemption state) for extended periods of time. This patch adds a warn_on if need_resched is pending for more than the time specified in sysctl resched_latency_warn_ms. If it goes off, it is likely that there is a missing cond_resched() somewhere. Monitoring is done via the tick and the accuracy is hence limited to jiffy scale. This also means that we won't trigger the warning if the tick is disabled. This feature is default disabled. It can be toggled on using sysctl resched_latency_warn_enabled. Signed-off-by: Paul Turner Signed-off-by: Josh Don --- Delta from v1: - separate sysctl for enabling/disabling and triggering warn_once behavior - add documentation - static branch for the enable Documentation/admin-guide/sysctl/kernel.rst | 23 ++ include/linux/sched/sysctl.h| 4 ++ kernel/sched/core.c | 78 - kernel/sched/debug.c| 10 +++ kernel/sched/sched.h| 10 +++ kernel/sysctl.c | 24 +++ 6 files changed, 148 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 1d56a6b73a4e..2d4a21d3b79f 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1077,6 +1077,29 @@ ROM/Flash boot loader. Maybe to tell it what to do after rebooting. ??? +resched_latency_warn_enabled + + +Enables/disables a warning that will trigger if need_resched is set for +longer than sysctl ``resched_latency_warn_ms``. This warning likely +indicates a kernel bug, such as a failure to call cond_resched(). + +Requires ``CONFIG_SCHED_DEBUG``. + + +resched_latency_warn_ms +=== + +See ``resched_latency_warn_enabled``. + + +resched_latency_warn_once += + +If set, ``resched_latency_warn_enabled`` will only trigger one warning +per boot. + + sched_energy_aware == diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 3c31ba88aca5..43a1f5ab819a 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -48,6 +48,10 @@ extern unsigned int sysctl_numa_balancing_scan_size; extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; +extern struct static_key_false resched_latency_warn_enabled; +extern int sysctl_resched_latency_warn_ms; +extern int sysctl_resched_latency_warn_once; + int sched_proc_update_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 98191218d891..d69ae342b450 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -58,7 +58,21 @@ const_debug unsigned int sysctl_sched_features = #include "features.h" 0; #undef SCHED_FEAT -#endif + +/* + * Print a warning if need_resched is set for the given duration (if + * resched_latency_warn_enabled is set). + * + * If sysctl_resched_latency_warn_once is set, only one warning will be shown + * per boot. + * + * Resched latency will be ignored for the first resched_boot_quiet_sec, to + * reduce false alarms. + */ +int sysctl_resched_latency_warn_ms = 100; +int sysctl_resched_latency_warn_once = 1; +const long resched_boot_quiet_sec = 600; +#endif /* CONFIG_SCHED_DEBUG */ /* * Number of tasks to iterate in a single balance run. @@ -4520,6 +4534,58 @@ unsigned long long task_sched_runtime(struct task_struct *p) return ns; } +#ifdef CONFIG_SCHED_DEBUG +static u64 resched_latency_check(struct rq *rq) +{ + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); + u64 need_resched_latency, now = rq_clock(rq); + static bool warned_once; + + if (sysctl_resched_latency_warn_once && warned_once) + return 0; + + if (!need_resched() || WARN_ON_ONCE(latency_warn_ms < 2)) + return 0; + + /* Disable this warning for the first few mins after boot */ + if (now < resched_boot_quiet_sec * NSEC_PER_SEC) + return 0; + + if (!rq->last_seen_need_resched_ns) { + rq->last_seen_need_resched_ns = now; + rq->ticks_without_resched = 0; + return 0; + } + + rq->ticks_without_resched++; + need_resched_latency = now - rq->last_seen_need_resched_ns; + if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC) + return 0; + + warned_once = true; + + return need_resched_latency; +} + +static int __init setup_resched_latency_warn_ms(char *str) +
Re: [PATCH] sched: Warn on long periods of pending need_resched
On Fri, Mar 19, 2021 at 2:02 AM Mel Gorman wrote: > > Default disabling and hidden behind a static branch would be useful > because the majority of users are not going to know what to do about > a need_resched warning and the sysctl is not documented. As Ingo said, > SCHED_DEBUG is enabled by default a lot. While I'm not sure what motivates > most distributions, I have found it useful to display topology information > on boot and in rare cases, for the enabling/disabling of sched features. > Hence, sched debug features should avoid adding too much overhead where > possible. > > The static branch would mean splitting the very large inline functions > adding by the patch. The inline section should do a static check only and > do the main work in a function in kernel/sched/debug.c so it has minimal > overhead if unused. > > -- > Mel Gorman > SUSE Labs Makes sense, I'll include this in v2. Thanks, Josh
Re: [PATCH 2/6] sched: tagging interface for core scheduling
> > +static unsigned long sched_core_alloc_task_cookie(void) > > +{ > > + struct sched_core_task_cookie *ck = > > + kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL); > > struct sched_core_task_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL); > > Also, those type names are unfortunately long.. > > > +static void sched_core_get_task_cookie(unsigned long cookie) > > +{ > > + struct sched_core_task_cookie *ptr = > > + (struct sched_core_task_cookie *)cookie; > > struct sched_core_task_cookie *ptr = (void *)cookie; > > Know your language and use it to avoid typing excessively long names :-) Good point, done. Keeping sched_core_task_cookie for now unless you'd prefer a replacement, since it is only used internally by coretag.c.
Re: [PATCH -tip v3 05/11] x86/kprobes: Add UNWIND_HINT_FUNC on kretprobe_trampoline code
On Sat, Mar 20, 2021 at 10:05:43PM +0900, Masami Hiramatsu wrote: > On Sat, 20 Mar 2021 21:16:16 +0900 > Masami Hiramatsu wrote: > > > On Fri, 19 Mar 2021 21:22:39 +0900 > > Masami Hiramatsu wrote: > > > > > From: Josh Poimboeuf > > > > > > Add UNWIND_HINT_FUNC on kretporbe_trampoline code so that ORC > > > information is generated on the kretprobe_trampoline correctly. > > > > > > > Test bot also found a new warning for this. > > > > > >> arch/x86/kernel/kprobes/core.o: warning: objtool: > > > >> kretprobe_trampoline()+0x25: call without frame pointer save/setup > > > > With CONFIG_FRAME_POINTER=y. > > > > Of course this can be fixed with additional "push %bp; mov %sp, %bp" before > > calling > > trampoline_handler. But actually we know that this function has a bit > > special > > stack frame too. > > > > Can I recover STACK_FRAME_NON_STANDARD(kretprobe_trampoline) when > > CONFIG_FRAME_POINTER=y ? > > So something like this. Does it work? > > diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c > index b31058a152b6..651f337dc880 100644 > --- a/arch/x86/kernel/kprobes/core.c > +++ b/arch/x86/kernel/kprobes/core.c > @@ -760,6 +760,10 @@ int kprobe_int3_handler(struct pt_regs *regs) > } > NOKPROBE_SYMBOL(kprobe_int3_handler); > > +#ifdef CONFIG_FRAME_POINTER > +#undef UNWIND_HINT_FUNC > +#define UNWIND_HINT_FUNC > +#endif This hunk isn't necessary. The unwind hints don't actually have an effect with frame pointers. > /* > * When a retprobed function returns, this code saves registers and > * calls trampoline_handler() runs, which calls the kretprobe's handler. > @@ -797,7 +801,14 @@ asm( > ".size kretprobe_trampoline, .-kretprobe_trampoline\n" > ); > NOKPROBE_SYMBOL(kretprobe_trampoline); > - > +#ifdef CONFIG_FRAME_POINTER > +/* > + * kretprobe_trampoline skips updating frame pointer. The frame pointer > + * saved in trampoline_handler points to the real caller function's > + * frame pointer. > + */ > +STACK_FRAME_NON_STANDARD(kretprobe_trampoline); > +#endif > > /* > * Called from kretprobe_trampoline Ack. -- Josh
Re: [PATCH -tip v3 05/11] x86/kprobes: Add UNWIND_HINT_FUNC on kretprobe_trampoline code
On Sat, Mar 20, 2021 at 09:16:16PM +0900, Masami Hiramatsu wrote: > On Fri, 19 Mar 2021 21:22:39 +0900 > Masami Hiramatsu wrote: > > > From: Josh Poimboeuf > > > > Add UNWIND_HINT_FUNC on kretporbe_trampoline code so that ORC > > information is generated on the kretprobe_trampoline correctly. > > > > Test bot also found a new warning for this. > > > >> arch/x86/kernel/kprobes/core.o: warning: objtool: > > >> kretprobe_trampoline()+0x25: call without frame pointer save/setup > > With CONFIG_FRAME_POINTER=y. > > Of course this can be fixed with additional "push %bp; mov %sp, %bp" before > calling > trampoline_handler. But actually we know that this function has a bit special > stack frame too. > > Can I recover STACK_FRAME_NON_STANDARD(kretprobe_trampoline) when > CONFIG_FRAME_POINTER=y ? Yes, that's what I'd recommend. -- Josh
Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls
On Fri, Mar 19, 2021 at 04:56:30PM +0100, Peter Zijlstra wrote: > On Fri, Mar 19, 2021 at 10:30:26AM -0500, Josh Poimboeuf wrote: > > On Fri, Mar 19, 2021 at 09:06:44AM +0100, Peter Zijlstra wrote: > > > > Also doesn't the alternative code already insert nops? > > > > > > Problem is that the {call,jmp} *%\reg thing is not fixed length. They're > > > 2 or 3 bytes depending on which register is picked. > > > > Why do they need to be fixed length? Objtool can use sym->len as the > > alternative replacement length. Then alternatives can add nops as > > needed. > > UNDEF has size 0. That is, unless these symbols exist in the translation > unit (they do not) we have no clue. > > Arguably I could parse the symbol name again and then we know the > register number and thus if we need REX etc.. but I figured we wanted to > avoid all that. Ah, makes sense now. -- Josh
Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper
On Fri, Mar 19, 2021 at 04:24:40PM +0100, Peter Zijlstra wrote: > On Fri, Mar 19, 2021 at 10:12:59AM -0500, Josh Poimboeuf wrote: > > > > -void elf_add_reloc(struct elf *elf, struct reloc *reloc) > > > +int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long > > > offset, > > > + unsigned int type, struct symbol *sym, int addend) > > > { > > > - struct section *sec = reloc->sec; > > > + struct reloc *reloc; > > > + > > > + reloc = malloc(sizeof(*reloc)); > > > + if (!reloc) { > > > + perror("malloc"); > > > + return -1; > > > + } > > > + memset(reloc, 0, sizeof(*reloc)); > > > + > > > + reloc->sec = sec->reloc; > > > > Maybe elf_add_reloc() could create the reloc section if it doesn't > > already exist, that would help remove the multiple calls to > > elf_create_reloc_section(). > > I'll do that as yet another patch ;-) Ok! > > > + reloc->offset = offset; > > > + reloc->type = type; > > > + reloc->sym = sym; > > > + reloc->addend = addend; > > > > > > list_add_tail(>list, >reloc_list); > > > > This should be sec->reloc->reloc_list? > > Absolutely, there were a few other 'funnies' as well that I just fixed > whilst trying to make the whole new stack work again :-) I'm sure some of those were my fault, thanks for testing my crap :-) -- Josh
Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls
On Fri, Mar 19, 2021 at 09:06:44AM +0100, Peter Zijlstra wrote: > > Also doesn't the alternative code already insert nops? > > Problem is that the {call,jmp} *%\reg thing is not fixed length. They're > 2 or 3 bytes depending on which register is picked. Why do they need to be fixed length? Objtool can use sym->len as the alternative replacement length. Then alternatives can add nops as needed. -- Josh
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Fri, Mar 19, 2021 at 10:22:54AM +0100, Peter Zijlstra wrote: > @@ -814,6 +814,11 @@ struct section *elf_create_reloc_section > } > } > > +static inline bool is_reloc_section(struct section *reloc) > +{ > + return reloc->base && reloc->base->reloc == reloc; > +} I believe the 2nd condition will always be true, so it can just be 'return reloc->base'. -- Josh
Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper
On Fri, Mar 19, 2021 at 10:47:36AM +0100, Peter Zijlstra wrote: > Full patch, because diff on a diff on a diff is getting ludicrous hard > to read :-) Appreciated :-) > -void elf_add_reloc(struct elf *elf, struct reloc *reloc) > +int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset, > + unsigned int type, struct symbol *sym, int addend) > { > - struct section *sec = reloc->sec; > + struct reloc *reloc; > + > + reloc = malloc(sizeof(*reloc)); > + if (!reloc) { > + perror("malloc"); > + return -1; > + } > + memset(reloc, 0, sizeof(*reloc)); > + > + reloc->sec = sec->reloc; Maybe elf_add_reloc() could create the reloc section if it doesn't already exist, that would help remove the multiple calls to elf_create_reloc_section(). > + reloc->offset = offset; > + reloc->type = type; > + reloc->sym = sym; > + reloc->addend = addend; > > list_add_tail(>list, >reloc_list); This should be sec->reloc->reloc_list? -- Josh
Re: [PATCH v2 10/14] objtool: Extract elf_symbol_add()
On Fri, Mar 19, 2021 at 10:54:05AM +0100, Peter Zijlstra wrote: > On Thu, Mar 18, 2021 at 09:14:03PM -0500, Josh Poimboeuf wrote: > > On Thu, Mar 18, 2021 at 06:11:13PM +0100, Peter Zijlstra wrote: > > > Create a common helper to add symbols. > > > > > > Signed-off-by: Peter Zijlstra (Intel) > > > --- > > > tools/objtool/elf.c | 57 > > > ++-- > > > 1 file changed, 33 insertions(+), 24 deletions(-) > > > > > > --- a/tools/objtool/elf.c > > > +++ b/tools/objtool/elf.c > > > @@ -290,12 +290,41 @@ static int read_sections(struct elf *elf > > > return 0; > > > } > > > > > > +static bool elf_symbol_add(struct elf *elf, struct symbol *sym) > > > > How about "elf_add_symbol()" for consistency with my other suggestions > > (elf_add_reloc() and elf_add_string()). And return an int. > > Changed the nane, how about void? This latest incarnation doesn't > actually have an error path. Still doesn't hurt to have one and not use > it I suppose... void would be my preference, just to avoid unnecessary error handling boilerplate in the caller. -- Josh
Re: [PATCH v2 14/14] objtool,x86: Rewrite retpoline thunk calls
On Thu, Mar 18, 2021 at 06:11:17PM +0100, Peter Zijlstra wrote: > When the compiler emits: "CALL __x86_indirect_thunk_\reg" for an > indirect call, have objtool rewrite it to: > > ALTERNATIVE "call __x86_indirect_thunk_\reg", > "call *%reg", ALT_NOT(X86_FEATURE_RETPOLINE) > > Additionally, in order to not emit endless identical > .altinst_replacement chunks, use a global symbol for them, see > __x86_indirect_alt_*. > > This also avoids objtool from having to do code generation. > > Signed-off-by: Peter Zijlstra (Intel) This is better than I expected. Nice workaround for not generating code. > +.macro ALT_THUNK reg > + > + .align 1 > + > +SYM_FUNC_START_NOALIGN(__x86_indirect_alt_call_\reg) > + ANNOTATE_RETPOLINE_SAFE > +1: call*%\reg > +2: .skip 5-(2b-1b), 0x90 > +SYM_FUNC_END(__x86_indirect_alt_call_\reg) > + > +SYM_FUNC_START_NOALIGN(__x86_indirect_alt_jmp_\reg) > + ANNOTATE_RETPOLINE_SAFE > +1: jmp *%\reg > +2: .skip 5-(2b-1b), 0x90 > +SYM_FUNC_END(__x86_indirect_alt_jmp_\reg) This mysterious code needs a comment. Shouldn't it be in .altinstr_replacement or something? Also doesn't the alternative code already insert nops? > +int arch_rewrite_retpoline(struct objtool_file *file, > +struct instruction *insn, > +struct reloc *reloc) > +{ > + struct symbol *sym; > + char name[32] = ""; > + > + if (!strcmp(insn->sec->name, ".text.__x86.indirect_thunk")) > + return 0; > + > + sprintf(name, "__x86_indirect_alt_%s_%s", > + insn->type == INSN_JUMP_DYNAMIC ? "jmp" : "call", > + reloc->sym->name + 21); > + > + sym = find_symbol_by_name(file->elf, name); > + if (!sym) { > + sym = elf_create_undef_symbol(file->elf, name); > + if (!sym) { > + WARN("elf_create_undef_symbol"); > + return -1; > + } > + } > + > + elf_add_alternative(file->elf, insn, sym, > + ALT_NOT(X86_FEATURE_RETPOLINE), 5, 5); > + > + return 0; > +} Need to propagate the error. -- Josh
Re: [PATCH v2 12/14] objtool: Allow archs to rewrite retpolines
On Thu, Mar 18, 2021 at 06:11:15PM +0100, Peter Zijlstra wrote: > @@ -1212,6 +1225,8 @@ static int handle_group_alt(struct objto > dest_off = arch_jump_destination(insn); > if (dest_off == special_alt->new_off + special_alt->new_len) > insn->jump_dest = next_insn_same_sec(file, > last_orig_insn); > + else > + insn->jump_dest = find_insn(file, insn->sec, dest_off); > > if (!insn->jump_dest) { > WARN_FUNC("can't find alternative jump destination", So I assume this change is because of the ordering change: now this is done before add_jump_destinations(). But doesn't that mean the alternative jump modification (changing the dest to the end of the original insns) will get overwritten later? Also the new hunk to be an oversimplified version of add_jump_destinations(). I'm not quite convinced that it will always do the right thing for this case. > @@ -1797,11 +1812,15 @@ static int decode_sections(struct objtoo > if (ret) > return ret; > > - ret = add_jump_destinations(file); > + /* > + * Must be before add_{jump,call}_destination; for they can add > + * magic alternatives. > + */ > + ret = add_special_section_alts(file); This reordering is unfortunate. Maybe there's a better way, though I don't have any ideas, at least until I get to the most controversial patch. -- Josh
Re: [PATCH v2 05/14] objtool: Per arch retpoline naming
On Thu, Mar 18, 2021 at 06:11:08PM +0100, Peter Zijlstra wrote: > @@ -872,7 +877,7 @@ static int add_jump_destinations(struct > } else if (reloc->sym->type == STT_SECTION) { > dest_sec = reloc->sym->sec; > dest_off = arch_dest_reloc_offset(reloc->addend); > - } else if (!strncmp(reloc->sym->name, "__x86_indirect_thunk_", > 21)) { > + } else if (arch_is_retpoline(reloc->sym)) { > /* >* Retpoline jumps are really dynamic jumps in >* disguise, so convert them accordingly. There's another one in add_call_destinations(). -- Josh
Re: [PATCH v2 11/14] objtool: Add elf_create_undef_symbol()
On Thu, Mar 18, 2021 at 06:11:14PM +0100, Peter Zijlstra wrote: > Allow objtool to create undefined symbols; this allows creating > relocations to symbols not currently in the symbol table. > > Signed-off-by: Peter Zijlstra (Intel) > --- > tools/objtool/elf.c | 63 > > tools/objtool/include/objtool/elf.h |1 > 2 files changed, 64 insertions(+) > > --- a/tools/objtool/elf.c > +++ b/tools/objtool/elf.c > @@ -724,6 +724,69 @@ static int elf_strtab_concat(struct elf > return len; > } > > +struct symbol *elf_create_undef_symbol(struct elf *elf, const char *name) > +{ > + struct section *symtab; > + struct symbol *sym; > + Elf_Data *data; > + Elf_Scn *s; > + > + sym = malloc(sizeof(*sym)); > + if (!sym) { > + perror("malloc"); > + return NULL; > + } > + memset(sym, 0, sizeof(*sym)); > + > + sym->name = strdup(name); > + > + sym->sym.st_name = elf_strtab_concat(elf, sym->name, NULL); > + if (sym->sym.st_name == -1) > + return NULL; > + > + sym->sym.st_info = 0x10; /* STB_GLOBAL, STT_NOTYPE */ There's a generic macro for this: sym->sym.st_info = GELF_ST_INFO(STB_GLOBAL, STT_NOTYPE); And sym->bind and sym->type should probably get set. -- Josh
Re: [PATCH v2 10/14] objtool: Extract elf_symbol_add()
On Thu, Mar 18, 2021 at 06:11:13PM +0100, Peter Zijlstra wrote: > Create a common helper to add symbols. > > Signed-off-by: Peter Zijlstra (Intel) > --- > tools/objtool/elf.c | 57 > ++-- > 1 file changed, 33 insertions(+), 24 deletions(-) > > --- a/tools/objtool/elf.c > +++ b/tools/objtool/elf.c > @@ -290,12 +290,41 @@ static int read_sections(struct elf *elf > return 0; > } > > +static bool elf_symbol_add(struct elf *elf, struct symbol *sym) How about "elf_add_symbol()" for consistency with my other suggestions (elf_add_reloc() and elf_add_string()). And return an int. -- Josh
Re: [PATCH v2 09/14] objtool: Extract elf_strtab_concat()
On Thu, Mar 18, 2021 at 06:11:12PM +0100, Peter Zijlstra wrote: > Create a common helper to append strings to a strtab. > > Signed-off-by: Peter Zijlstra (Intel) > --- > tools/objtool/elf.c | 73 > +--- > 1 file changed, 42 insertions(+), 31 deletions(-) > > --- a/tools/objtool/elf.c > +++ b/tools/objtool/elf.c > @@ -676,13 +676,51 @@ struct elf *elf_open_read(const char *na > return NULL; > } > > +static int elf_strtab_concat(struct elf *elf, char *name, const char > *strtab_name) > +{ > + struct section *strtab = NULL; > + Elf_Data *data; > + Elf_Scn *s; > + int len; > + > + if (strtab_name) > + strtab = find_section_by_name(elf, strtab_name); > + if (!strtab) > + strtab = find_section_by_name(elf, ".strtab"); > + if (!strtab) { > + WARN("can't find %s or .strtab section", strtab_name); > + return -1; > + } This part's a bit mysterious (and it loses the Clang comment). Maybe we can leave the section finding logic in elf_create_section()? diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c index b85db6efb9d3..db9ad54894d8 100644 --- a/tools/objtool/elf.c +++ b/tools/objtool/elf.c @@ -676,19 +676,17 @@ struct elf *elf_open_read(const char *name, int flags) return NULL; } -static int elf_strtab_concat(struct elf *elf, char *name, const char *strtab_name) +static int elf_add_string(struct elf *elf, struct section *strtab, char *str) { struct section *strtab = NULL; Elf_Data *data; Elf_Scn *s; int len; - if (strtab_name) - strtab = find_section_by_name(elf, strtab_name); if (!strtab) strtab = find_section_by_name(elf, ".strtab"); if (!strtab) { - WARN("can't find %s or .strtab section", strtab_name); + WARN("can't find .strtab section"); return -1; } @@ -718,7 +716,7 @@ static int elf_strtab_concat(struct elf *elf, char *name, const char *strtab_nam struct section *elf_create_section(struct elf *elf, const char *name, unsigned int sh_flags, size_t entsize, int nr) { - struct section *sec; + struct section *sec, *shstrtab; size_t size = entsize * nr; Elf_Scn *s; @@ -777,7 +775,15 @@ struct section *elf_create_section(struct elf *elf, const char *name, sec->sh.sh_addralign = 1; sec->sh.sh_flags = SHF_ALLOC | sh_flags; - sec->sh.sh_name = elf_strtab_concat(elf, sec->name, ".shstrtab"); + /* Add section name to .shstrtab (or .strtab for Clang) */ + shstrtab = find_section_by_name(elf, ".shstrtab"); + if (!shstrtab) + shstrtab = find_section_by_name(elf, ".strtab"); + if (!shstrtab) { + WARN("can't find .shstrtab or .strtab section"); + return NULL; + } + sec->sh.sh_name = elf_add_string(elf, sec->name, shstrtab); if (sec->sh.sh_name == -1) return NULL;
Re: [PATCH v2 08/14] objtool: Add elf_create_reloc() helper
On Thu, Mar 18, 2021 at 06:11:11PM +0100, Peter Zijlstra wrote: > We have 4 instances of adding a relocation. Create a common helper > to avoid growing even more. > > Signed-off-by: Peter Zijlstra (Intel) I'm not a fan of the API -- how about squashing this in? Untested, of course. diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 6a52d56504b1..38ffa3b2595e 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -469,12 +469,11 @@ static int create_static_call_sections(struct objtool_file *file) memset(site, 0, sizeof(struct static_call_site)); /* populate reloc for 'addr' */ - if (!elf_create_reloc(file->elf, sec, idx * sizeof(struct static_call_site), - R_X86_64_PC32, - reloc_to_insn, insn, 0)) { - WARN_ELF("elf_create_reloc: static_call_site::addr"); + if (elf_add_reloc_to_insn(file->elf, sec, + idx * sizeof(struct static_call_site), + R_X86_64_PC32, + insn->sec, insn->offset)) return -1; - } /* find key symbol */ key_name = strdup(insn->call_dest->name); @@ -511,13 +510,11 @@ static int create_static_call_sections(struct objtool_file *file) free(key_name); /* populate reloc for 'key' */ - if (!elf_create_reloc(file->elf, sec, idx * sizeof(struct static_call_site) + 4, - R_X86_64_PC32, - reloc_to_sym, key_sym, - is_sibling_call(insn) * STATIC_CALL_SITE_TAIL)) { - WARN_ELF("elf_create_reloc: static_call_site::key"); + if (elf_add_reloc(file->elf, sec, + idx * sizeof(struct static_call_site) + 4, + R_X86_64_PC32, key_sym, + is_sibling_call(insn) * STATIC_CALL_SITE_TAIL)) return -1; - } idx++; } @@ -559,12 +556,11 @@ static int create_mcount_loc_sections(struct objtool_file *file) loc = (unsigned long *)sec->data->d_buf + idx; memset(loc, 0, sizeof(unsigned long)); - if (!elf_create_reloc(file->elf, sec, idx * sizeof(unsigned long), - R_X86_64_64, - reloc_to_insn, insn, 0)) { - WARN_ELF("elf_create_reloc: __mcount_loc"); + if (elf_add_reloc_to_insn(file->elf, sec, + idx * sizeof(unsigned long), + R_X86_64_64, + insn->sec, insn->offset)) return -1; - } idx++; } diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c index d2afe2454de4..b3bd97b2b034 100644 --- a/tools/objtool/elf.c +++ b/tools/objtool/elf.c @@ -447,76 +447,73 @@ static int read_symbols(struct elf *elf) return -1; } -static void elf_add_reloc(struct elf *elf, struct reloc *reloc) +int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset, + unsigned int type, struct symbol *sym, int addend) { - struct section *sec = reloc->sec; + struct reloc *reloc; + + reloc = malloc(sizeof(*reloc)); + if (!reloc) { + perror("malloc"); + return -1; + } + memset(reloc, 0, sizeof(*reloc)); + + reloc->sec = sec->reloc; + reloc->offset = offset; + reloc->type = type; + reloc->sym = sym; + reloc->addend = addend; list_add_tail(>list, >reloc_list); elf_hash_add(elf->reloc_hash, >hash, reloc_hash(reloc)); - sec->changed = true; -} - -void reloc_to_sym(struct reloc *reloc, void *dst) -{ - struct symbol *sym = dst; - reloc->sym = sym; + return 0; } -void __reloc_to_insn(struct reloc *reloc, struct section *sec, unsigned long offset) +static int insn_to_sym_addend(struct section *sec, unsigned long offset, + struct symbol **sym, int *addend) { if (sec->sym) { - reloc->sym = sec->sym; - reloc->addend = offset; - return; + *sym = sec->sym; + *addend = offset; + return 0; } /* * The Clang assembler strips section symbols, so we have to reference * the function symbol instead: */ - reloc->sym = find_symbol_containing(sec, offset); - if (!reloc->sym) { + *sym = find_symbol_containing(sec, offset); + if (!*sym) { /*
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Thu, Mar 18, 2021 at 12:38:42PM -0500, Josh Poimboeuf wrote: > On Thu, Mar 18, 2021 at 06:04:25PM +0100, Peter Zijlstra wrote: > > On Thu, Mar 18, 2021 at 11:36:40AM -0500, Josh Poimboeuf wrote: > > > > I was thinking you could get a section changed without touching > > > > relocations, but while that is theoretically possible, it is exceedingly > > > > unlikely (and objtool doesn't do that). > > > > > > Hm? This is a *relocation* section, not a normal one. So by > > > definition, it only changes when its relocations change. > > > > The way I read this code: > > > > list_for_each_entry(sec, >sections, list) { > > if (sec->changed) { > > + if (sec->reloc && > > + elf_rebuild_reloc_section(elf, sec->reloc)) { > > + WARN_ELF("elf_rebuild_reloc_section"); > > + return -1; > > + } > > > > is that we iterate the regular sections (which could be dirtied because > > we changed some data), and if that section has a relocation section, we > > rebuild that for good measure (even though it might not have altered > > relocations). > > > > Or am I just totally confused ? > > Ah, you're right. I'm the one that's confused. I guess I was also > confused when I wrote that hunk, but it just happens to work anyway. > > It would be cleaner to do something like > > if ((is_reloc_sec(sec) && > elf_rebuild_reloc_section(elf, sec)) { > > so we process the changed reloc section directly, instead of relying on > the (most likely) fact that the corresponding text section also changed. i.e., in actual code: diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c index 66c49c2e20a6..3b3d19a5e626 100644 --- a/tools/objtool/elf.c +++ b/tools/objtool/elf.c @@ -948,8 +948,7 @@ int elf_write(struct elf *elf) /* Update changed relocation sections and section headers: */ list_for_each_entry(sec, >sections, list) { if (sec->changed) { - if (sec->reloc && - elf_rebuild_reloc_section(elf, sec->reloc)) { + if (sec->base && elf_rebuild_reloc_section(elf, sec)) { WARN_ELF("elf_rebuild_reloc_section"); return -1; }
Re: [sched] 663017c554: WARNING:at_kernel/sched/core.c:#scheduler_tick
The warning is WAI (holding spinlock for 100ms). However, since this is expected for locktorture, it makes sense to not have the warning enabled while the test is running. I can add that to the patch.
Re: [PATCH] kbuild: rebuild GCC plugins when the compiler is upgraded
On Wed, Mar 10, 2021 at 10:44:46PM +0900, Masahiro Yamada wrote: > > Masahiro, > > > > Ping. Do you have a better approach for building GCC plugins in the > > external module directory? > > > I am not sure if building GCC plugins in the external module directory > is the right approach. I'm certainly open to any better ideas that would allow GCC plugins to be enabled in distro kernel builds. -- Josh
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Thu, Mar 18, 2021 at 06:04:25PM +0100, Peter Zijlstra wrote: > On Thu, Mar 18, 2021 at 11:36:40AM -0500, Josh Poimboeuf wrote: > > > I was thinking you could get a section changed without touching > > > relocations, but while that is theoretically possible, it is exceedingly > > > unlikely (and objtool doesn't do that). > > > > Hm? This is a *relocation* section, not a normal one. So by > > definition, it only changes when its relocations change. > > The way I read this code: > > list_for_each_entry(sec, >sections, list) { > if (sec->changed) { > + if (sec->reloc && > + elf_rebuild_reloc_section(elf, sec->reloc)) { > + WARN_ELF("elf_rebuild_reloc_section"); > + return -1; > + } > > is that we iterate the regular sections (which could be dirtied because > we changed some data), and if that section has a relocation section, we > rebuild that for good measure (even though it might not have altered > relocations). > > Or am I just totally confused ? Ah, you're right. I'm the one that's confused. I guess I was also confused when I wrote that hunk, but it just happens to work anyway. It would be cleaner to do something like if ((is_reloc_sec(sec) && elf_rebuild_reloc_section(elf, sec)) { so we process the changed reloc section directly, instead of relying on the (most likely) fact that the corresponding text section also changed. -- Josh
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Thu, Mar 18, 2021 at 01:57:03PM +0100, Peter Zijlstra wrote: > On Wed, Mar 17, 2021 at 07:49:17PM -0500, Josh Poimboeuf wrote: > > On Wed, Mar 17, 2021 at 09:12:15AM +0100, Peter Zijlstra wrote: > > > On Tue, Mar 16, 2021 at 10:34:17PM -0500, Josh Poimboeuf wrote: > > > > On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote: > > > > > --- a/tools/objtool/elf.c > > > > > +++ b/tools/objtool/elf.c > > > > > @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru > > > > > > > > > > list_add_tail(>list, >reloc_list); > > > > > elf_hash_add(elf->reloc_hash, >hash, reloc_hash(reloc)); > > > > > + > > > > > + sec->rereloc = true; > > > > > } > > > > > > > > Can we just reuse sec->changed for this? Something like this on top > > > > (untested of course): > > > > > > I think my worry was that we'd dirty too much and slow down the write, > > > but I haven't done any actual performance measurements on this. > > > > Really? I thought my proposal was purely aesthetic, no functional > > change, but my brain is toasty this week due to other distractions so > > who knows. > > I was thinking you could get a section changed without touching > relocations, but while that is theoretically possible, it is exceedingly > unlikely (and objtool doesn't do that). Hm? This is a *relocation* section, not a normal one. So by definition, it only changes when its relocations change. -- Josh
Re: [PATCH 3/3] static_call: Fix static_call_update() sanity check
On Thu, Mar 18, 2021 at 12:31:59PM +0100, Peter Zijlstra wrote: > if (!kernel_text_address((unsigned long)site_addr)) { > - WARN_ONCE(1, "can't patch static call site at > %pS", > + /* > + * This skips patching __exit, which is part of > + * init_section_contains() but is not part of > + * kernel_text_address(). > + * > + * Skipping __exit is fine since it will never > + * be executed. > + */ > + WARN_ONCE(!static_call_is_init(site), > + "can't patch static call site at %pS", > site_addr); > continue; > } It might be good to clarify the situation for __exit in modules in the comment and/or changelog, as they both seem to be implicitly talking only about __exit in vmlinux. For CONFIG_MODULE_UNLOAD, the code ends up in the normal text area, so static_call_is_init() is false and kernel_text_address() is true. For !CONFIG_MODULE_UNLOAD, the code gets discarded during module load, so static_call_is_init() and kernel_text_address() are both false. I guess that will trigger a warning? -- Josh
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Wed, Mar 17, 2021 at 09:12:15AM +0100, Peter Zijlstra wrote: > On Tue, Mar 16, 2021 at 10:34:17PM -0500, Josh Poimboeuf wrote: > > On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote: > > > --- a/tools/objtool/elf.c > > > +++ b/tools/objtool/elf.c > > > @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru > > > > > > list_add_tail(>list, >reloc_list); > > > elf_hash_add(elf->reloc_hash, >hash, reloc_hash(reloc)); > > > + > > > + sec->rereloc = true; > > > } > > > > Can we just reuse sec->changed for this? Something like this on top > > (untested of course): > > I think my worry was that we'd dirty too much and slow down the write, > but I haven't done any actual performance measurements on this. Really? I thought my proposal was purely aesthetic, no functional change, but my brain is toasty this week due to other distractions so who knows. > Let me do a few runs and see if it matters at all. -- Josh
Re: [PATCH 6/9] objtool: Add elf_create_undef_symbol()
On Wed, Mar 17, 2021 at 03:13:43PM +0100, Peter Zijlstra wrote: > On Wed, Mar 17, 2021 at 02:52:23PM +0100, Miroslav Benes wrote: > > > > + if (!elf_symbol_add(elf, sym, SHN_XINDEX)) { > > > + WARN("elf_symbol_add"); > > > + return NULL; > > > + } > > > > SHN_XINDEX means that the extended section index is used. Above you seem > > to use it in the opposite sense too (assigning to shndx when shndx_data is > > NULL). While it makes the code easier to handle, it is a bit confusing > > (and maybe I am just confused now). Could you add a comment about that, > > please? elf_symbol_add() seems like a good place. > > Yes, that was a horrible thing to do :/ And you understood it right. > > Looking at it again, I'm not sure it is actually correct tho; shouldn't > elf_create_undef_symbol() also look at gelf_getsymshndx() of symtab ? > > What toolchain generates these extended sections and how? That is, how > do I test this crud.. SHN_XINDEX is basically a special-case extension to original ELF for supporting more than 64k sections. -- Josh
Re: [PATCH v4 3/9] x86/entry: Convert ret_from_fork to C
On Wed, Mar 17, 2021 at 11:12:42AM -0700, Andy Lutomirski wrote: > ret_from_fork is written in asm, slightly differently, for x86_32 and > x86_64. Convert it to C. > > This is a straight conversion without any particular cleverness. As a > further cleanup, the code that sets up the ret_from_fork argument registers > could be adjusted to put the arguments in the correct registers. > > This will cause the ORC unwinder to find pt_regs even for kernel threads on > x86_64. This seems harmless. > > The 32-bit comment above the now-deleted schedule_tail_wrapper was > obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the > same problem more cleanly. > > Cc: Josh Poimboeuf > Signed-off-by: Andy Lutomirski Acked-by: Josh Poimboeuf -- Josh
Re: [PATCH v4 2/9] x86/kthread,dumpstack: Set task_pt_regs->cs.RPL=3 for kernel threads
On Wed, Mar 17, 2021 at 11:12:41AM -0700, Andy Lutomirski wrote: > @@ -163,6 +163,19 @@ int copy_thread(unsigned long clone_flags, unsigned long > sp, unsigned long arg, > /* Kernel thread ? */ > if (unlikely(p->flags & PF_KTHREAD)) { > memset(childregs, 0, sizeof(struct pt_regs)); > + > + /* > + * Even though there is no real user state here, these > + * are were user regs belong, and kernel_execve() will s/were/where/ > + * overwrite them with user regs. Put an obviously > + * invalid value that nonetheless causes user_mode(regs) > + * to return true in CS. > + * > + * This also prevents the unwinder from thinking that there > + * is invalid kernel state at the top of the stack. > + */ > + childregs->cs = 3; > + > kthread_frame_init(frame, sp, arg); > return 0; > } Acked-by: Josh Poimboeuf -- Josh
Re: [PATCH] sched: Warn on long periods of pending need_resched
On Wed, Mar 17, 2021 at 1:31 AM Ingo Molnar wrote: > > > * Josh Don wrote: > > > +static inline u64 resched_latency_check(struct rq *rq) > > +{ > > + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); > > + bool warn_only_once = (latency_warn_ms == > > RESCHED_DEFAULT_WARN_LATENCY_MS); > > + u64 need_resched_latency, now = rq_clock(rq); > > + static bool warned_once; > > + > > + if (warn_only_once && warned_once) > > + return 0; > > + > > + if (!need_resched() || latency_warn_ms < 2) > > + return 0; > > + > > + /* Disable this warning for the first few mins after boot */ > > + if (now < RESCHED_BOOT_QUIET_SEC * NSEC_PER_SEC) > > + return 0; > > + > > + if (!rq->last_seen_need_resched_ns) { > > + rq->last_seen_need_resched_ns = now; > > + rq->ticks_without_resched = 0; > > + return 0; > > + } > > + > > + rq->ticks_without_resched++; > > So AFAICS this will only really do something useful on full-nohz > kernels with sufficiently long scheduler ticks, right? Not quite sure what you mean; it is actually the inverse? Since we rely on the tick to detect the resched latency, on nohz-full we won't have detection on cpus running a single thread. The ideal scenario is !nohz-full and tick interval << warn_ms. > On other kernels the scheduler tick interrupt, when it returns to > user-space, will trigger a reschedule if it sees a need_resched. True for the case where we return to userspace, but we could instead be executing in a non-preemptible region of the kernel. This is where we've seen/fixed kernel bugs. Best, Josh
Re: [PATCH] sched: Warn on long periods of pending need_resched
On Wed, Mar 17, 2021 at 1:25 AM Ingo Molnar wrote: > > * Josh Don wrote: > > > If resched_latency_warn_ms is set to the default value, only one warning > > will be produced per boot. > > Looks like a value hack, should probably be a separate flag, > defaulting to warn-once. Agreed, done. > > This warning only exists under CONFIG_SCHED_DEBUG. If it goes off, it is > > likely that there is a missing cond_resched() somewhere. > > CONFIG_SCHED_DEBUG is default-y, so most distros have it enabled. To avoid log spam for people who don't care, I was considering having the feature default disabled. Perhaps a better alternative is to only show a single line warning and not print the full backtrace by default. Does the latter sound good to you? > > +#ifdef CONFIG_KASAN > > +#define RESCHED_DEFAULT_WARN_LATENCY_MS 101 > > +#define RESCHED_BOOT_QUIET_SEC 600 > > +#else > > +#define RESCHED_DEFAULT_WARN_LATENCY_MS 51 > > +#define RESCHED_BOOT_QUIET_SEC 300 > > #endif > > +int sysctl_resched_latency_warn_ms = RESCHED_DEFAULT_WARN_LATENCY_MS; > > +#endif /* CONFIG_SCHED_DEBUG */ > > I'd really just make this a single value - say 100 or 200 msecs. Replacing these both with a single value (the more conservative default of 100ms and 600s). > > +static inline void resched_latency_warn(int cpu, u64 latency) > > +{ > > + static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, > > 1); > > + > > + WARN(__ratelimit(_check_ratelimit), > > + "CPU %d: need_resched set for > %llu ns (%d ticks) " > > + "without schedule\n", > > + cpu, latency, cpu_rq(cpu)->ticks_without_resched); > > +} > > Could you please put the 'sched:' prefix into scheduler warnings. > Let's have a bit of a namespace structure in new warnings. Sounds good, done.
Re: [PATCH] objtool,static_call: Don't emit static_call_site for .exit.text
On Wed, Mar 17, 2021 at 01:45:57PM +0100, Peter Zijlstra wrote: > arguably it simply isn't a good idea to use static_call() in __exit > code anyway, since module unload is never a performance critical path. Couldn't you make the same argument about __init functions, which are allowed to do static calls? We might consider a STATIC_CALL_SITE_EXIT flag, but I suppose we've run out of flag space. -- Josh
[PATCH] sched: Warn on long periods of pending need_resched
From: Paul Turner CPU scheduler marks need_resched flag to signal a schedule() on a particular CPU. But, schedule() may not happen immediately in cases where the current task is executing in the kernel mode (no preemption state) for extended periods of time. This patch adds a warn_on if need_resched is pending for more than the time specified in sysctl resched_latency_warn_ms. Monitoring is done via the tick and the accuracy is hence limited to jiffy scale. This also means that we won't trigger the warning if the tick is disabled. If resched_latency_warn_ms is set to the default value, only one warning will be produced per boot. This warning only exists under CONFIG_SCHED_DEBUG. If it goes off, it is likely that there is a missing cond_resched() somewhere. Signed-off-by: Paul Turner Signed-off-by: Josh Don --- We've caught various bugs using this patch. In fact, a followup patch to this one will be a patch introducing a missing cond_resched(). However, this may be too noisy to have as default enabled (especially given that it requires some tuning); I'm open to having this default disabled instead. kernel/sched/core.c | 91 kernel/sched/sched.h | 6 +++ kernel/sysctl.c | 8 3 files changed, 105 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 98191218d891..ac037fc87d5e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -58,7 +58,25 @@ const_debug unsigned int sysctl_sched_features = #include "features.h" 0; #undef SCHED_FEAT + +/* + * Print a warning if need_resched is set for at least this long. At the + * default value, only a single warning will be printed per boot. + * + * Values less than 2 disable the feature. + * + * A kernel compiled with CONFIG_KASAN tends to run more slowly on average. + * Increase the need_resched timeout to reduce false alarms. + */ +#ifdef CONFIG_KASAN +#define RESCHED_DEFAULT_WARN_LATENCY_MS 101 +#define RESCHED_BOOT_QUIET_SEC 600 +#else +#define RESCHED_DEFAULT_WARN_LATENCY_MS 51 +#define RESCHED_BOOT_QUIET_SEC 300 #endif +int sysctl_resched_latency_warn_ms = RESCHED_DEFAULT_WARN_LATENCY_MS; +#endif /* CONFIG_SCHED_DEBUG */ /* * Number of tasks to iterate in a single balance run. @@ -4520,6 +4538,71 @@ unsigned long long task_sched_runtime(struct task_struct *p) return ns; } +#ifdef CONFIG_SCHED_DEBUG + +static inline u64 resched_latency_check(struct rq *rq) +{ + int latency_warn_ms = READ_ONCE(sysctl_resched_latency_warn_ms); + bool warn_only_once = (latency_warn_ms == RESCHED_DEFAULT_WARN_LATENCY_MS); + u64 need_resched_latency, now = rq_clock(rq); + static bool warned_once; + + if (warn_only_once && warned_once) + return 0; + + if (!need_resched() || latency_warn_ms < 2) + return 0; + + /* Disable this warning for the first few mins after boot */ + if (now < RESCHED_BOOT_QUIET_SEC * NSEC_PER_SEC) + return 0; + + if (!rq->last_seen_need_resched_ns) { + rq->last_seen_need_resched_ns = now; + rq->ticks_without_resched = 0; + return 0; + } + + rq->ticks_without_resched++; + need_resched_latency = now - rq->last_seen_need_resched_ns; + if (need_resched_latency <= latency_warn_ms * NSEC_PER_MSEC) + return 0; + + warned_once = true; + + return need_resched_latency; +} + +static inline void resched_latency_warn(int cpu, u64 latency) +{ + static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1); + + WARN(__ratelimit(_check_ratelimit), +"CPU %d: need_resched set for > %llu ns (%d ticks) " +"without schedule\n", +cpu, latency, cpu_rq(cpu)->ticks_without_resched); +} + + +static int __init setup_resched_latency_warn_ms(char *str) +{ + long val; + + if ((kstrtol(str, 0, ))) { + pr_warn("Unable to set resched_latency_warn_ms\n"); + return 1; + } + + sysctl_resched_latency_warn_ms = val; + return 1; +} +__setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms); +#else +static inline u64 resched_latency_check(struct rq *rq) { return 0; } +static inline void resched_latency_warn(int cpu, u64 latency) {} +#endif /* CONFIG_SCHED_DEBUG */ + + /* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. @@ -4531,6 +4614,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; unsigned long thermal_pressure; + u64 resched_latency; arch_scale_freq_tick(); sched_clock_tick(); @@ -4541,11 +4625,15 @@ void scheduler_tick(void) thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); update_thermal_load_avg(rq
Re: [PATCH 5/9] objtool: Rework rebuild_reloc logic
On Fri, Mar 12, 2021 at 06:16:18PM +0100, Peter Zijlstra wrote: > --- a/tools/objtool/elf.c > +++ b/tools/objtool/elf.c > @@ -479,6 +479,8 @@ void elf_add_reloc(struct elf *elf, stru > > list_add_tail(>list, >reloc_list); > elf_hash_add(elf->reloc_hash, >hash, reloc_hash(reloc)); > + > + sec->rereloc = true; > } Can we just reuse sec->changed for this? Something like this on top (untested of course): diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c index addcec88ac9f..b9cb74a54681 100644 --- a/tools/objtool/elf.c +++ b/tools/objtool/elf.c @@ -480,7 +480,7 @@ void elf_add_reloc(struct elf *elf, struct reloc *reloc) list_add_tail(>list, >reloc_list); elf_hash_add(elf->reloc_hash, >hash, reloc_hash(reloc)); - sec->rereloc = true; + sec->changed = true; } static int read_rel_reloc(struct section *sec, int i, struct reloc *reloc, unsigned int *symndx) @@ -882,9 +882,6 @@ int elf_rebuild_reloc_section(struct elf *elf, struct section *sec) struct reloc *reloc; int nr; - sec->changed = true; - elf->changed = true; - nr = 0; list_for_each_entry(reloc, >reloc_list, list) nr++; @@ -894,8 +891,6 @@ int elf_rebuild_reloc_section(struct elf *elf, struct section *sec) case SHT_RELA: return elf_rebuild_rela_reloc_section(sec, nr); default: return -1; } - - sec->rereloc = false; } int elf_write_insn(struct elf *elf, struct section *sec, @@ -950,14 +945,15 @@ int elf_write(struct elf *elf) struct section *sec; Elf_Scn *s; - list_for_each_entry(sec, >sections, list) { - if (sec->reloc && sec->reloc->rereloc) - elf_rebuild_reloc_section(elf, sec->reloc); - } - - /* Update section headers for changed sections: */ + /* Update changed relocation sections and section headers: */ list_for_each_entry(sec, >sections, list) { if (sec->changed) { + if (sec->reloc && + elf_rebuild_reloc_section(elf, sec->reloc)) { + WARN_ELF("elf_rebuild_reloc_section"); + return -1; + } + s = elf_getscn(elf->elf, sec->idx); if (!s) { WARN_ELF("elf_getscn"); @@ -969,6 +965,7 @@ int elf_write(struct elf *elf) } sec->changed = false; + elf->changed = true; } } diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h index 9fdd4c5f9f32..e6890cc70a25 100644 --- a/tools/objtool/include/objtool/elf.h +++ b/tools/objtool/include/objtool/elf.h @@ -39,7 +39,7 @@ struct section { char *name; int idx; unsigned int len; - bool changed, text, rodata, noinstr, rereloc; + bool changed, text, rodata, noinstr; }; struct symbol {
Re: [PATCH 4/9] objtool: Fix static_call list generation
On Fri, Mar 12, 2021 at 06:16:17PM +0100, Peter Zijlstra wrote: > @@ -1701,6 +1706,9 @@ static int decode_sections(struct objtoo > if (ret) > return ret; > > + /* > + * Must be before add_{jump_call}_desetination. > + */ s/desetination/destination/ -- Josh
Re: [PATCH 2/9] objtool: Correctly handle retpoline thunk calls
On Fri, Mar 12, 2021 at 06:16:15PM +0100, Peter Zijlstra wrote: > Just like JMP handling, convert a direct CALL to a retpoline thunk > into a retpoline safe indirect CALL. > > Signed-off-by: Peter Zijlstra (Intel) > --- > tools/objtool/check.c | 12 > 1 file changed, 12 insertions(+) > > --- a/tools/objtool/check.c > +++ b/tools/objtool/check.c > @@ -953,6 +953,18 @@ static int add_call_destinations(struct > dest_off); > return -1; > } > + > + } else if (!strncmp(reloc->sym->name, "__x86_indirect_thunk_", > 21)) { > + /* > + * Retpoline calls are really dynamic calls in > + * disguise, so convert them accodingly. s/accodingly/accordingly/ -- Josh
Re: [PATCH -tip v2 00/10] kprobes: Fix stacktrace with kretprobes
On Fri, Mar 12, 2021 at 03:41:44PM +0900, Masami Hiramatsu wrote: > Hello, > > Here is the 2nd version of the series to fix the stacktrace with kretprobe. > > The 1st series is here; > > https://lore.kernel.org/bpf/161495873696.346821.10161501768906432924.stgit@devnote2/ > > In this version I merged the ORC unwinder fix for kretprobe which discussed > in the > previous thread. [3/10] is updated according to the Miroslav's comment. > [4/10] is > updated for simplify the code. [5/10]-[9/10] are discussed in the previsous > tread > and are introduced to the series. > > Daniel, can you also test this again? I and Josh discussed a bit different > method and I've implemented it on this version. > > This actually changes the kretprobe behavisor a bit, now the instraction > pointer in > the pt_regs passed to kretprobe user handler is correctly set the real return > address. So user handlers can get it via instruction_pointer() API. When I add WARN_ON(1) to a test kretprobe, it doesn't unwind properly. show_trace_log_lvl() reads the entire stack in lockstep with calls to the unwinder so that it can decide which addresses get prefixed with '?'. So it expects to find an actual return address on the stack. Instead it finds %rsp. So it never syncs up with unwind_next_frame() and shows all remaining addresses as unreliable. Call Trace: __kretprobe_trampoline_handler+0xca/0x1a0 trampoline_handler+0x3d/0x50 kretprobe_trampoline+0x25/0x50 ? init_test_probes.cold+0x323/0x387 ? init_kprobes+0x144/0x14c ? init_optprobes+0x15/0x15 ? do_one_initcall+0x5b/0x300 ? lock_is_held_type+0xe8/0x140 ? kernel_init_freeable+0x174/0x2cd ? rest_init+0x233/0x233 ? kernel_init+0xa/0x11d ? ret_from_fork+0x22/0x30 How about pushing 'kretprobe_trampoline' instead of %rsp for the return address placeholder. That fixes the above test, and removes the need for the weird 'state->ip == sp' check: Call Trace: __kretprobe_trampoline_handler+0xca/0x150 trampoline_handler+0x3d/0x50 kretprobe_trampoline+0x29/0x50 ? init_test_probes.cold+0x323/0x387 elfcorehdr_read+0x10/0x10 init_kprobes+0x144/0x14c ? init_optprobes+0x15/0x15 do_one_initcall+0x72/0x280 kernel_init_freeable+0x174/0x2cd ? rest_init+0x122/0x122 kernel_init+0xa/0x10e ret_from_fork+0x22/0x30 Though, init_test_probes.cold() (the real return address) is still listed as unreliable. Maybe we need a call to kretprobe_find_ret_addr() in show_trace_log_lvl(), similar to the ftrace_graph_ret_addr() call there. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 06f33bfebc50..70188fdd288e 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -766,19 +766,19 @@ asm( "kretprobe_trampoline:\n" /* We don't bother saving the ss register */ #ifdef CONFIG_X86_64 - " pushq %rsp\n" + /* Push fake return address to tell the unwinder it's a kretprobe */ + " pushq $kretprobe_trampoline\n" UNWIND_HINT_FUNC " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" - /* Replace saved sp with true return address. */ + /* Replace the fake return address with the real one. */ " movq %rax, 19*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n" #else " pushl %esp\n" - UNWIND_HINT_FUNC " pushfl\n" SAVE_REGS_STRING " movl %esp, %eax\n" diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index 1d1b9388a1b1..1d3de84d2410 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -548,8 +548,7 @@ bool unwind_next_frame(struct unwind_state *state) * In those cases, find the correct return address from * task->kretprobe_instances list. */ - if (state->ip == sp || - is_kretprobe_trampoline(state->ip)) + if (is_kretprobe_trampoline(state->ip)) state->ip = kretprobe_find_ret_addr(state->task, >kr_iter);