Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 02/04, Ravi Bangoria wrote:
>
> +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
> +{
> +	struct page *page;
> +	struct vm_area_struct *vma;
> +	void *kaddr;
> +	unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
> +
> +	if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
> +		return -EINVAL;

"vma" is not used, and I don't think you need FOLL_SPLIT_PMD. Otherwise I
can't really comment on this ppc-specific change.

To be honest, I don't even understand why we need this fix. Sure, a breakpoint
in the middle of a 64-bit insn won't work, but why do we care? The user should
know what he is doing.

Not to mention we can't really trust get_user_pages() in that this page can be
modified by the mm owner or a debugger...

But I won't argue.

Oleg.
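For reference, the usual shape of such a helper, with the unused vma and the
FOLL_SPLIT_PMD flag dropped as suggested above, might look like the sketch
below. This is illustrative only, not the patch under review; error handling
is simplified.

	static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
	{
		struct page *page;
		void *kaddr;

		if (get_user_pages_remote(mm, addr, 1, FOLL_FORCE, &page, NULL, NULL) <= 0)
			return -EINVAL;

		/* instructions are word-aligned, so the read cannot cross a page */
		kaddr = kmap_atomic(page);
		*instr = *(u32 *)(kaddr + (addr & ~PAGE_MASK));
		kunmap_atomic(kaddr);
		put_page(page);
		return 0;
	}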
Re: [PATCH] powerpc/uprobes: Don't allow probe on suffix of prefixed instruction
On 01/19, Ravi Bangoria wrote:
>
> Probe on 2nd word of a prefixed instruction is invalid scenario and
> should be restricted.

I don't understand this ppc-specific problem, but...

> +#ifdef CONFIG_PPC64
> +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +			      uprobe_opcode_t opcode)
> +{
> +	uprobe_opcode_t prefix;
> +	void *kaddr;
> +	struct ppc_inst inst;
> +
> +	/* Don't check if vaddr is pointing to the beginning of page */
> +	if (!(vaddr & ~PAGE_MASK))
> +		return 0;

So the fix is incomplete? Or can an insn at the start of a page never be
prefixed?

> +int __weak arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +				     uprobe_opcode_t opcode)
> +{
> +	return 0;
> +}
> +
>  static int verify_opcode(struct page *page, unsigned long vaddr,
>  			 uprobe_opcode_t *new_opcode)
>  {
>  	uprobe_opcode_t old_opcode;
> @@ -275,6 +281,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>  	if (is_swbp_insn(new_opcode)) {
>  		if (is_swbp)	/* register: already installed? */
>  			return 0;
> +		if (arch_uprobe_verify_opcode(page, vaddr, old_opcode))
> +			return -EINVAL;

Well, this doesn't look good... To me it would be better to change the
prepare_uprobe() path to copy the potential prefix into uprobe->arch and
check ppc_inst_prefixed() in arch_uprobe_analyze_insn().

What do you think?

Oleg.
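A very rough sketch of that alternative is below. The "prefix" field and its
population in prepare_uprobe() are hypothetical, purely to illustrate the
idea; this is not the fix that was eventually merged.

	/*
	 * Hypothetical: prepare_uprobe() stored the word preceding the probed
	 * address in auprobe->prefix when the address is not page-aligned.
	 */
	int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe,
				     struct mm_struct *mm, unsigned long addr)
	{
		if (ppc_inst_prefixed(ppc_inst(auprobe->prefix)))
			return -EINVAL;	/* probing the suffix of a prefixed insn */
		return 0;
	}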
Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
Christophe, et al,

So what? Are you going to push your change or should I re-send 1-2 without
the whitespace cleanups?

On 11/19, Oleg Nesterov wrote:
>
> On 11/19, Christophe Leroy wrote:
> >
> > I think the following should work, and not require the first patch (compile
> > tested only).
> >
> > --- a/arch/powerpc/kernel/ptrace/ptrace-view.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
> > @@ -234,9 +234,21 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
> >  	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
> >  		     offsetof(struct pt_regs, msr) + sizeof(long));
> >
> > +#ifdef CONFIG_PPC64
> > +	membuf_write(&to, &target->thread.regs->orig_gpr3,
> > +		     offsetof(struct pt_regs, softe) - offsetof(struct pt_regs, orig_gpr3));
> > +	membuf_store(&to, 1UL);
> > +
> > +	BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> > +		     offsetof(struct pt_regs, softe) + sizeof(long));
> > +
> > +	membuf_write(&to, &target->thread.regs->trap,
> > +		     sizeof(struct user_pt_regs) - offsetof(struct pt_regs, trap));
> > +#else
> >  	membuf_write(&to, &target->thread.regs->orig_gpr3,
> >  		     sizeof(struct user_pt_regs) -
> >  		     offsetof(struct pt_regs, orig_gpr3));
> > +#endif
> >  	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> >  			   sizeof(struct user_pt_regs));
> >  }
>
> Probably yes.
>
> This mirrors the previous patch I sent
> (https://lore.kernel.org/lkml/20190917143753.ga12...@redhat.com/)
> and this is exactly what I tried to avoid; we can make a simpler fix now.
>
> But let me repeat, I agree with any fix even if imo my version simplifies the
> code, just commit this change and let's forget this problem.
>
> Oleg.
Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
On 11/19, Christophe Leroy wrote:
>
> I think the following should work, and not require the first patch (compile
> tested only).
>
> --- a/arch/powerpc/kernel/ptrace/ptrace-view.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
> @@ -234,9 +234,21 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
>  	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
>  		     offsetof(struct pt_regs, msr) + sizeof(long));
>
> +#ifdef CONFIG_PPC64
> +	membuf_write(&to, &target->thread.regs->orig_gpr3,
> +		     offsetof(struct pt_regs, softe) - offsetof(struct pt_regs, orig_gpr3));
> +	membuf_store(&to, 1UL);
> +
> +	BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> +		     offsetof(struct pt_regs, softe) + sizeof(long));
> +
> +	membuf_write(&to, &target->thread.regs->trap,
> +		     sizeof(struct user_pt_regs) - offsetof(struct pt_regs, trap));
> +#else
>  	membuf_write(&to, &target->thread.regs->orig_gpr3,
>  		     sizeof(struct user_pt_regs) -
>  		     offsetof(struct pt_regs, orig_gpr3));
> +#endif
>  	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
>  			   sizeof(struct user_pt_regs));
>  }

Probably yes.

This mirrors the previous patch I sent
(https://lore.kernel.org/lkml/20190917143753.ga12...@redhat.com/)
and this is exactly what I tried to avoid; we can make a simpler fix now.

But let me repeat, I agree with any fix even if imo my version simplifies the
code, just commit this change and let's forget this problem.

Oleg.
Re: [PATCH v3 0/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
On 11/19, Christophe Leroy wrote:
>
> On 19/11/2020 at 17:01, Oleg Nesterov wrote:
> > Can we finally fix this problem? ;)
> >
> > My previous attempt was ignored, see
>
> That doesn't seem right.
>
> Michael made a suggestion it seems, can you respond to it?

I did, see https://lore.kernel.org/lkml/20200611105830.gb12...@redhat.com/

> > Sorry, uncompiled/untested, I don't have a ppc machine.
>
> I compiled with ppc64_defconfig, that seems ok. Still untested.

Thanks.

Oleg.
Re: [PATCH v3 1/2] powerpc/ptrace: simplify gpr_get/tm_cgpr_get
On 11/19, Christophe Leroy wrote:
>
> On 19/11/2020 at 17:02, Oleg Nesterov wrote:
> > gpr_get() does membuf_write() twice to override pt_regs->msr in between.
>
> Is there anything wrong with that ?

Nothing wrong, but imo the code and 2/2 look simpler after this patch.
I tried to explain this in the changelog.

> >  int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
> >  		struct membuf to)
> >  {
> > +	struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
> > +
> >  	if (!cpu_has_feature(CPU_FTR_TM))
> >  		return -ENODEV;
> > @@ -97,17 +99,12 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
> >  	flush_altivec_to_thread(target);
> >
> >  	membuf_write(&to, &target->thread.ckpt_regs,
> > -		     offsetof(struct pt_regs, msr));
> > -	membuf_store(&to, get_user_ckpt_msr(target));
> > +		     sizeof(struct user_pt_regs));
>
> This looks mis-aligned. But it should fit on a single line, now we allow up
> to 100 chars on a line.

OK, I can change this.

> > -	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
> > -		     offsetof(struct pt_regs, msr) + sizeof(long));
> > +	membuf_store(&to_msr, get_user_ckpt_msr(target));
> > -	membuf_write(&to, &target->thread.ckpt_regs.orig_gpr3,
> > -		     sizeof(struct user_pt_regs) -
> > -		     offsetof(struct pt_regs, orig_gpr3));
> >  	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> > -			   sizeof(struct user_pt_regs));
> > +			sizeof(struct user_pt_regs));
>
> I can't see any change here except the alignment. Can you leave it as is ?

I just tried to make tm_cgpr_get() and gpr_get() look similar. Sure, I can
leave it as is.

Better yet, could you please fix this problem somehow so that I could forget
about the bug assigned to me? I know nothing about powerpc, and personally
I do not care about this (minor) bug, I agree with any changes.

> > -	membuf_write(&to, target->thread.regs, offsetof(struct pt_regs, msr));
> > -	membuf_store(&to, get_user_msr(target));
> > +	membuf_write(&to, target->thread.regs,
> > +		     sizeof(struct user_pt_regs));
>
> This should fit on a single line.
>
> >  	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> > -			   sizeof(struct user_pt_regs));
> > +			sizeof(struct user_pt_regs));
>
> This should not change, it's not part of the changes for this patch.

See above, I can leave it as is.

> > --- a/include/linux/regset.h
> > +++ b/include/linux/regset.h
> > @@ -46,6 +46,18 @@ static inline int membuf_write(struct membuf *s, const void *v, size_t size)
> >  	return s->left;
> >  }
> > +static inline struct membuf membuf_at(const struct membuf *s, size_t offs)
> > +{
> > +	struct membuf n = *s;
>
> Is there any point in using a struct membuf * instead of a struct membuf as
> a parameter ?

This matches the other membuf_ helpers.

Oleg.
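The pattern the series converges on, stripped of the powerpc specifics, is
roughly the sketch below. It is illustrative only (the function name is made
up); it assumes the regset membuf API as quoted above.

	static int example_regs_get(struct task_struct *target,
				    const struct user_regset *regset, struct membuf to)
	{
		/* remember where the msr slot will land before 'to' advances */
		struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));

		/* copy the whole register block in one go... */
		membuf_write(&to, target->thread.regs, sizeof(struct user_pt_regs));

		/* ...then overwrite the single field that must be sanitized */
		membuf_store(&to_msr, get_user_msr(target));

		return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
				   sizeof(struct user_pt_regs));
	}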
Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
On 11/19, Oleg Nesterov wrote: > > This is not consistent and this breaks the user-regs-peekpoke test > from https://sourceware.org/systemtap/wiki/utrace/tests/ See the test-case below. Oleg. /* Test case for PTRACE_SETREGS modifying the requested ragisters. x86* counterpart of the s390* testcase `user-area-access.c'. This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely. */ /* FIXME: EFLAGS should be tested restricted on the appropriate bits. */ #define _GNU_SOURCE 1 #if defined __powerpc__ || defined __sparc__ # define user_regs_struct pt_regs #endif #ifdef __ia64__ #define ia64_fpreg ia64_fpreg_DISABLE #define pt_all_user_regs pt_all_user_regs_DISABLE #endif /* __ia64__ */ #include #ifdef __ia64__ #undef ia64_fpreg #undef pt_all_user_regs #endif /* __ia64__ */ #include #include #include #if defined __i386__ || defined __x86_64__ #include #endif #include #include #include #include #include #include #include #include #include #include #include #include /* ia64 has PTRACE_SETREGS but it has no USER_REGS_STRUCT. */ #if !defined PTRACE_SETREGS || defined __ia64__ int main (void) { return 77; } #else /* PTRACE_SETREGS */ /* The minimal alignment we use for the random access ranges. */ #define REGALIGN (sizeof (long)) static pid_t child; static void cleanup (void) { if (child > 0) kill (child, SIGKILL); child = 0; } static void handler_fail (int signo) { cleanup (); signal (SIGABRT, SIG_DFL); abort (); } int main (void) { long l; int status, i; pid_t pid; union { struct user_regs_struct user; unsigned char byte[sizeof (struct user_regs_struct)]; } u, u2; int start; setbuf (stdout, NULL); atexit (cleanup); signal (SIGABRT, handler_fail); signal (SIGALRM, handler_fail); signal (SIGINT, handler_fail); i = alarm (10); assert (i == 0); child = fork (); switch (child) { case -1: assert_perror (errno); assert (0); case 0: l = ptrace (PTRACE_TRACEME, 0, NULL, NULL); assert (l == 0); // Prevent rt_sigprocmask() call called by glibc after raise(). syscall (__NR_tkill, getpid (), SIGSTOP); assert (0); default: break; } pid = waitpid (child, , 0); assert (pid == child); assert (WIFSTOPPED (status)); assert (WSTOPSIG (status) == SIGSTOP); /* Fetch U2 from the inferior. */ errno = 0; # ifdef __sparc__ l = ptrace (PTRACE_GETREGS, child, , NULL); # else l = ptrace (PTRACE_GETREGS, child, NULL, ); # endif assert_perror (errno); assert (l == 0); /* Initialize U with a pattern. */ for (i = 0; i < sizeof u.byte; i++) u.byte[i] = i; #ifdef __x86_64__ /* non-EFLAGS modifications fail with EIO, EFLAGS gets back different. */ u.user.eflags = u2.user.eflags; u.user.cs = u2.user.cs; u.user.ds = u2.user.ds; u.user.es = u2.user.es; u.user.fs = u2.user.fs; u.user.gs = u2.user.gs; u.user.ss = u2.user.ss; u.user.fs_base = u2.user.fs_base; u.user.gs_base = u2.user.gs_base; /* RHEL-4 refuses to set too high (and invalid) PC values. */ u.user.rip = (unsigned long) handler_fail; /* 2.6.25 always truncates and sign-extends orig_rax. */ u.user.orig_rax = (int) u.user.orig_rax; #endif /* __x86_64__ */ #ifdef __i386__ /* These values get back different. 
*/ u.user.xds = u2.user.xds; u.user.xes = u2.user.xes; u.user.xfs = u2.user.xfs; u.user.xgs = u2.user.xgs; u.user.xcs = u2.user.xcs; u.user.eflags = u2.user.eflags; u.user.xss = u2.user.xss; /* RHEL-4 refuses to set too high (and invalid) PC values. */ u.user.eip = (unsigned long) handler_fail; #endif /* __i386__ */ #ifdef __powerpc__ /* These fields are constrained. */ u.user.msr = u2.user.msr; # ifdef __powerpc64__ u.user.softe = u2.user.softe; # else u.user.mq = u2.user.mq; # endif /* __powerpc64__ */ u.user.trap = u2.user.trap; u.user.dar = u2.user.dar; u.user.dsisr = u2.user.dsisr; u.user.result = u2.user.result; #endif /* __powerpc__ */ /* Poke U. */ # ifdef __sparc__ l = ptrace (PTRACE_SETREGS, child, , NULL); # else l = ptrace (PTRACE_SETREGS, child, NULL, ); # endif assert (l == 0); /* Peek into U2. */ # ifdef __sparc__ l = ptrace (PTRACE_GETREGS, child, , NULL); # else l = ptrace (PTRACE_GETREGS, child, NULL, ); # endif assert (l == 0); /* Verify it matches. */ if (memcmp (, , sizeof u.byte) != 0) { for (start = 0; start + REGALIGN <= sizeof u.byte; start += REGALIGN) if (*(unsigned long *) (u.byte + start) != *(unsigned long *) (u2.byte + start)) printf ("\ mismatch at offset %#x: SETREGS wrote %lx GETREGS read %lx\n",
[PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1, but PTRACE_GETREGS still copies pt_regs->softe as is. This is not consistent and this breaks the user-regs-peekpoke test from https://sourceware.org/systemtap/wiki/utrace/tests/ Reported-by: Jan Kratochvil Signed-off-by: Oleg Nesterov --- arch/powerpc/kernel/ptrace/ptrace-tm.c | 8 +++- arch/powerpc/kernel/ptrace/ptrace-view.c | 8 +++- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/ptrace/ptrace-tm.c b/arch/powerpc/kernel/ptrace/ptrace-tm.c index f8fcbd85d4cb..d0d339f86e61 100644 --- a/arch/powerpc/kernel/ptrace/ptrace-tm.c +++ b/arch/powerpc/kernel/ptrace/ptrace-tm.c @@ -87,6 +87,10 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset, struct membuf to) { struct membuf to_msr = membuf_at(, offsetof(struct pt_regs, msr)); +#ifdef CONFIG_PPC64 + struct membuf to_softe = membuf_at(, + offsetof(struct pt_regs, softe)); +#endif if (!cpu_has_feature(CPU_FTR_TM)) return -ENODEV; @@ -102,7 +106,9 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset, sizeof(struct user_pt_regs)); membuf_store(_msr, get_user_ckpt_msr(target)); - +#ifdef CONFIG_PPC64 + membuf_store(_softe, 0x1ul); +#endif return membuf_zero(, ELF_NGREG * sizeof(unsigned long) - sizeof(struct user_pt_regs)); } diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c index 39686ede40b3..f554ccfcbfae 100644 --- a/arch/powerpc/kernel/ptrace/ptrace-view.c +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c @@ -218,6 +218,10 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset, struct membuf to) { struct membuf to_msr = membuf_at(, offsetof(struct pt_regs, msr)); +#ifdef CONFIG_PPC64 + struct membuf to_softe = membuf_at(, + offsetof(struct pt_regs, softe)); +#endif int i; if (target->thread.regs == NULL) @@ -233,7 +237,9 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset, sizeof(struct user_pt_regs)); membuf_store(_msr, get_user_msr(target)); - +#ifdef CONFIG_PPC64 + membuf_store(_softe, 0x1ul); +#endif return membuf_zero(, ELF_NGREG * sizeof(unsigned long) - sizeof(struct user_pt_regs)); } -- 2.25.1.362.g51ebf55
[PATCH v3 1/2] powerpc/ptrace: simplify gpr_get/tm_cgpr_get
gpr_get() does membuf_write() twice to override pt_regs->msr in between.
We can call membuf_write() once and change ->msr in the kernel buffer,
this simplifies the code and the next fix.

The patch adds a new simple helper, membuf_at(offs), it returns the new
membuf which can be safely used after membuf_write().

Signed-off-by: Oleg Nesterov
---
 arch/powerpc/kernel/ptrace/ptrace-tm.c   | 13 +++++--------
 arch/powerpc/kernel/ptrace/ptrace-view.c | 13 +++++--------
 include/linux/regset.h                   | 12 ++++++++++++
 3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace/ptrace-tm.c b/arch/powerpc/kernel/ptrace/ptrace-tm.c
index 54f2d076206f..f8fcbd85d4cb 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-tm.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-tm.c
@@ -86,6 +86,8 @@ int tm_cgpr_active(struct task_struct *target, const struct user_regset *regset)
 int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
 		struct membuf to)
 {
+	struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
+
 	if (!cpu_has_feature(CPU_FTR_TM))
 		return -ENODEV;
 
@@ -97,17 +99,12 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
 	flush_altivec_to_thread(target);
 
 	membuf_write(&to, &target->thread.ckpt_regs,
-		     offsetof(struct pt_regs, msr));
-	membuf_store(&to, get_user_ckpt_msr(target));
+		     sizeof(struct user_pt_regs));
 
-	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
-		     offsetof(struct pt_regs, msr) + sizeof(long));
+	membuf_store(&to_msr, get_user_ckpt_msr(target));
 
-	membuf_write(&to, &target->thread.ckpt_regs.orig_gpr3,
-		     sizeof(struct user_pt_regs) -
-		     offsetof(struct pt_regs, orig_gpr3));
 	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
-			   sizeof(struct user_pt_regs));
+			sizeof(struct user_pt_regs));
 }
 
 /*
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 7e6478e7ed07..39686ede40b3 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -217,6 +217,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, unsigned long data)
 static int gpr_get(struct task_struct *target, const struct user_regset *regset,
 		   struct membuf to)
 {
+	struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
 	int i;
 
 	if (target->thread.regs == NULL)
@@ -228,17 +229,13 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
 			target->thread.regs->gpr[i] = NV_REG_POISON;
 	}
 
-	membuf_write(&to, target->thread.regs, offsetof(struct pt_regs, msr));
-	membuf_store(&to, get_user_msr(target));
+	membuf_write(&to, target->thread.regs,
+		     sizeof(struct user_pt_regs));
 
-	BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
-		     offsetof(struct pt_regs, msr) + sizeof(long));
+	membuf_store(&to_msr, get_user_msr(target));
 
-	membuf_write(&to, &target->thread.regs->orig_gpr3,
-		     sizeof(struct user_pt_regs) -
-		     offsetof(struct pt_regs, orig_gpr3));
 	return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
-			   sizeof(struct user_pt_regs));
+			sizeof(struct user_pt_regs));
 }
 
 static int gpr_set(struct task_struct *target, const struct user_regset *regset,
diff --git a/include/linux/regset.h b/include/linux/regset.h
index c3403f328257..a00765f0e8cf 100644
--- a/include/linux/regset.h
+++ b/include/linux/regset.h
@@ -46,6 +46,18 @@ static inline int membuf_write(struct membuf *s, const void *v, size_t size)
 	return s->left;
 }
 
+static inline struct membuf membuf_at(const struct membuf *s, size_t offs)
+{
+	struct membuf n = *s;
+
+	if (offs > n.left)
+		offs = n.left;
+	n.p += offs;
+	n.left -= offs;
+
+	return n;
+}
+
 /* current s->p must be aligned for v; v must be a scalar */
 #define membuf_store(s, v)				\
 ({							\
-- 
2.25.1.362.g51ebf55
[PATCH v3 0/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too
Can we finally fix this problem? ;) My previous attempt was ignored, see https://lore.kernel.org/lkml/20190917121256.ga8...@redhat.com/ Now that gpr_get() was changed to use membuf API we can make a simpler fix. Sorry, uncompiled/untested, I don't have a ppc machine. Oleg. arch/powerpc/kernel/ptrace/ptrace-tm.c | 21 - arch/powerpc/kernel/ptrace/ptrace-view.c | 21 - include/linux/regset.h | 12 3 files changed, 36 insertions(+), 18 deletions(-)
Re: [PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too
On 06/11, Madhavan Srinivasan wrote:
>
> On 6/10/20 8:37 PM, Oleg Nesterov wrote:
> > Hi,
> >
> > looks like this patch was forgotten.
>
> yep, I missed this. But mpe did have comments for the patch.
>
> https://lkml.org/lkml/2019/9/19/107

Yes, and I thought that I had replied... apparently not, sorry!

So let me repeat, I am fine either way, I do not understand this ppc-specific
code and I can't really test this change. Let me quote that email from Michael:

> We could do it like below. I'm 50/50 though on whether it's worth it, or
> if we should just go with the big ifdef like in your patch.

up to you ;)

Hmm. And yes,

> >> This is not consistent and this breaks
> >> http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke

this link is 404. Jan, could you correct the link above?

Oleg.
Re: [PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too
Hi, looks like this patch was forgotten. Do you think this should be fixed or should we document that PTRACE_GETREGS is not consistent with PTRACE_PEEKUSER on ppc64? On 09/17, Oleg Nesterov wrote: > > I don't have a ppc machine, this patch wasn't even compile tested, > could you please review? > > The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in > ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1, > but PTRACE_GETREGS still copies pt_regs->softe as is. > > This is not consistent and this breaks > http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke > > Reported-by: Jan Kratochvil > Signed-off-by: Oleg Nesterov > --- > arch/powerpc/kernel/ptrace.c | 25 + > 1 file changed, 25 insertions(+) > > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > index 8c92feb..291acfb 100644 > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const > struct user_regset *regset, > BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) != >offsetof(struct pt_regs, msr) + sizeof(long)); > > +#ifdef CONFIG_PPC64 > + if (!ret) > + ret = user_regset_copyout(, , , , > + >thread.regs->orig_gpr3, > + offsetof(struct pt_regs, orig_gpr3), > + offsetof(struct pt_regs, softe)); > + > + if (!ret) { > + unsigned long softe = 0x1; > + ret = user_regset_copyout(, , , , , > + offsetof(struct pt_regs, softe), > + offsetof(struct pt_regs, softe) + > + sizeof(softe)); > + } > + > + BUILD_BUG_ON(offsetof(struct pt_regs, trap) != > + offsetof(struct pt_regs, softe) + sizeof(long)); > + > + if (!ret) > + ret = user_regset_copyout(, , , , > + >thread.regs->trap, > + offsetof(struct pt_regs, trap), > + sizeof(struct user_pt_regs)); > +#else > if (!ret) > ret = user_regset_copyout(, , , , > >thread.regs->orig_gpr3, > offsetof(struct pt_regs, orig_gpr3), > sizeof(struct user_pt_regs)); > +#endif > if (!ret) > ret = user_regset_copyout_zero(, , , , > sizeof(struct user_pt_regs), -1); > -- > 2.5.0 >
[PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too
I don't have a ppc machine, this patch wasn't even compile tested, could you please review? The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1, but PTRACE_GETREGS still copies pt_regs->softe as is. This is not consistent and this breaks http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke Reported-by: Jan Kratochvil Signed-off-by: Oleg Nesterov --- arch/powerpc/kernel/ptrace.c | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92feb..291acfb 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset, BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) != offsetof(struct pt_regs, msr) + sizeof(long)); +#ifdef CONFIG_PPC64 + if (!ret) + ret = user_regset_copyout(, , , , + >thread.regs->orig_gpr3, + offsetof(struct pt_regs, orig_gpr3), + offsetof(struct pt_regs, softe)); + + if (!ret) { + unsigned long softe = 0x1; + ret = user_regset_copyout(, , , , , + offsetof(struct pt_regs, softe), + offsetof(struct pt_regs, softe) + + sizeof(softe)); + } + + BUILD_BUG_ON(offsetof(struct pt_regs, trap) != +offsetof(struct pt_regs, softe) + sizeof(long)); + + if (!ret) + ret = user_regset_copyout(, , , , + >thread.regs->trap, + offsetof(struct pt_regs, trap), + sizeof(struct user_pt_regs)); +#else if (!ret) ret = user_regset_copyout(, , , , >thread.regs->orig_gpr3, offsetof(struct pt_regs, orig_gpr3), sizeof(struct user_pt_regs)); +#endif if (!ret) ret = user_regset_copyout_zero(, , , , sizeof(struct user_pt_regs), -1); -- 2.5.0
[PATCH?] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too
I don't have a ppc machine, this patch wasn't even compile tested, could you please review? The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1, but PTRACE_GETREGS still copies pt_regs->softe as is. This is not consistent and this breaks http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke Reported-by: Jan Kratochvil Signed-off-by: Oleg Nesterov --- arch/powerpc/kernel/ptrace.c | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92feb..9e9342c 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset, BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) != offsetof(struct pt_regs, msr) + sizeof(long)); +#ifdef CONFIG_PPC64 + if (!ret) + ret = user_regset_copyout(, , , , + >thread.regs->orig_gpr3, + offsetof(struct pt_regs, orig_gpr3), + offsetof(struct pt_regs, softe)); + + if (!ret) { + unsigned long softe = 0x1; + ret = user_regset_copyout(, , , , , + offsetof(struct pt_regs, softe), + offsetof(struct pt_regs, softe) + + sizeof(softe)); + } + + BUILD_BUG_ON(offsetof(struct pt_regs, trap) != +offsetof(struct pt_regs, softe) + sizeof(long)); + + if (!ret) + ret = user_regset_copyout(, , , , + >thread.regs->trap, + offsetof(struct pt_regs, trap), + sizeof(struct user_pt_regs)); +#else if (!ret) ret = user_regset_copyout(, , , , >thread.regs->orig_gpr3, offsetof(struct pt_regs, orig_gpr3), sizeof(struct user_pt_regs)); +#endif if (!ret) ret = user_regset_copyout_zero(, , , , sizeof(struct user_pt_regs), -1); -- 2.5.0
Re: [PATCH v1 1/2] open: add close_range()
On 05/23, Christian Brauner wrote: > > So given that we would really need another find_next_open_fd() I think > sticking to the simple cond_resched() version I sent before is better > for now until we see real-world performance issues. OK, agreed. Oleg.
Re: [PATCH v2 1/2] open: add close_range()
On 05/23, Christian Brauner wrote:
>
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> +	unsigned int cur_max;
> +
> +	if (fd > max_fd)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	cur_max = files_fdtable(files)->max_fds;
> +	rcu_read_unlock();
> +
> +	/* cap to last valid index into fdtable */
> +	max_fd = max(max_fd, (cur_max - 1));
	         ^^^
Hmm. min() ?

Oleg.
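Presumably the intended cap is the opposite, i.e. something like:

	/* cap to last valid index into fdtable */
	max_fd = min(max_fd, cur_max - 1);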
Re: [PATCH v1 1/2] open: add close_range()
On 05/22, Christian Brauner wrote:
>
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
>  {
> -	struct file *file;
> +	struct file *file = NULL;
>  	struct fdtable *fdt;
>
>  	spin_lock(&files->file_lock);
> @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
>  		goto out_unlock;
>  	rcu_assign_pointer(fdt->fd[fd], NULL);
>  	__put_unused_fd(files, fd);
> -	spin_unlock(&files->file_lock);
> -	return filp_close(file, files);
>
>  out_unlock:
>  	spin_unlock(&files->file_lock);
> -	return -EBADF;
> +	return file;

...

> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> +	unsigned int cur_max;
> +
> +	if (fd > max_fd)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	cur_max = files_fdtable(files)->max_fds;
> +	rcu_read_unlock();
> +
> +	/* cap to last valid index into fdtable */
> +	if (max_fd >= cur_max)
> +		max_fd = cur_max - 1;
> +
> +	while (fd <= max_fd) {
> +		struct file *file;
> +
> +		file = pick_file(files, fd++);

Well, how about something like

	static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
	{
		unsigned int maxfd = fdt->max_fds;
		unsigned int maxbit = maxfd / BITS_PER_LONG;
		unsigned int bitbit = start / BITS_PER_LONG;

		bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
		if (bitbit > maxfd)
			return maxfd;
		if (bitbit > start)
			start = bitbit;
		return find_next_bit(fdt->open_fds, maxfd, start);
	}

	unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
	{
		unsigned fd;
		struct file *file;
		struct fdtable *fdt;

		spin_lock(&files->file_lock);
		fdt = files_fdtable(files);
		fd = find_next_opened_fd(fdt, start);
		if (fd >= fdt->max_fds || fd > maxfd) {
			fd = -1;
			goto out;
		}

		file = fdt->fd[fd];
		rcu_assign_pointer(fdt->fd[fd], NULL);
		__put_unused_fd(files, fd);
	out:
		spin_unlock(&files->file_lock);

		if (fd == -1u)
			return fd;

		filp_close(file, files);
		return fd + 1;
	}

?

Then close_range() can do

	while (fd < max_fd)
		fd = close_next_fd(fd, maxfd);

Oleg.
Re: [PATCH 5/5] asm-generic: remove ptrace.h
On 05/20, Christoph Hellwig wrote: > > No one is using this header anymore. > > Signed-off-by: Christoph Hellwig > Acked-by: Arnd Bergmann > --- > MAINTAINERS| 1 - > arch/mips/include/asm/ptrace.h | 5 --- > include/asm-generic/ptrace.h | 74 -- > 3 files changed, 80 deletions(-) > delete mode 100644 include/asm-generic/ptrace.h Acked-by: Oleg Nesterov
Re: [PATCH 4/5] x86: don't use asm-generic/ptrace.h
On 05/20, Christoph Hellwig wrote: > > Doing the indirection through macros for the regs accessors just > makes them harder to read, so implement the helpers directly. Acked-by: Oleg Nesterov
Re: [PATCH v1 1/2] pid: add pidfd_open()
On 05/17, Aleksa Sarai wrote: > > On 2019-05-16, Oleg Nesterov wrote: > > On 05/17, Aleksa Sarai wrote: > > > On 2019-05-16, Oleg Nesterov wrote: > > > > On 05/16, Christian Brauner wrote: > > > > > With the introduction of pidfds through CLONE_PIDFD it is possible to > > > > > created pidfds at process creation time. > > > > > > > > Now I am wondering why do we need CLONE_PIDFD, you can just do > > > > > > > > pid = fork(); > > > > pidfd_open(pid); > > > > > > While the race window would be exceptionally short, there is the > > > possibility that the child will die > > > > Yes, > > > > > and their pid will be recycled > > > before you do pidfd_open(). > > > > No. > > > > Unless the caller's sub-thread does wait() before pidfd_open(), of course. > > Or unless you do signal(SIGCHILD, SIG_IGN). > > What about CLONE_PARENT? I should have mentioned CLONE_PARENT ;) Of course in this case the child can be reaped before pidfd_open(). But how often do you or other people use clone(CLONE_PARENT) ? not to mention you can trivially eliminate/detect this race if you really need this. Don't get me wrong, I am not trying to say that CLONE_PIDFD is a bad idea. But to me pidfd_open() is much more useful. Say, as a perl programmer I can easily use pidfd_open(), but not CLONE_PIDFD. Oleg.
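For what it's worth, the userspace usage Oleg has in mind is roughly the
sketch below. The __NR_pidfd_open fallback definition (434) and the raw
syscall() invocation are illustrative assumptions; at the time of this
thread there was no glibc wrapper.

	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <unistd.h>
	#include <poll.h>
	#include <stdio.h>

	#ifndef __NR_pidfd_open
	#define __NR_pidfd_open 434	/* assumption: number used on most architectures */
	#endif

	int main(void)
	{
		pid_t pid = fork();

		if (pid == 0) {		/* child */
			sleep(1);
			_exit(0);
		}

		int pidfd = syscall(__NR_pidfd_open, pid, 0);
		if (pidfd < 0)
			perror("pidfd_open");

		/* a pidfd becomes readable when the target process exits */
		struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
		poll(&pfd, 1, -1);

		waitpid(pid, NULL, 0);
		return 0;
	}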
Re: [PATCH v1 1/2] pid: add pidfd_open()
On 05/17, Aleksa Sarai wrote: > > On 2019-05-16, Oleg Nesterov wrote: > > On 05/16, Christian Brauner wrote: > > > > > > With the introduction of pidfds through CLONE_PIDFD it is possible to > > > created pidfds at process creation time. > > > > Now I am wondering why do we need CLONE_PIDFD, you can just do > > > > pid = fork(); > > pidfd_open(pid); > > While the race window would be exceptionally short, there is the > possibility that the child will die Yes, > and their pid will be recycled > before you do pidfd_open(). No. Unless the caller's sub-thread does wait() before pidfd_open(), of course. Or unless you do signal(SIGCHILD, SIG_IGN). Oleg.
Re: [PATCH v1 1/2] pid: add pidfd_open()
On 05/16, Christian Brauner wrote: > > With the introduction of pidfds through CLONE_PIDFD it is possible to > created pidfds at process creation time. Now I am wondering why do we need CLONE_PIDFD, you can just do pid = fork(); pidfd_open(pid); > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) > +{ > + int fd, ret; > + struct pid *p; > + struct task_struct *tsk; > + > + if (flags) > + return -EINVAL; > + > + if (pid <= 0) > + return -EINVAL; > + > + p = find_get_pid(pid); > + if (!p) > + return -ESRCH; > + > + ret = 0; > + rcu_read_lock(); > + /* > + * If this returns non-NULL the pid was used as a thread-group > + * leader. Note, we race with exec here: If it changes the > + * thread-group leader we might return the old leader. > + */ > + tsk = pid_task(p, PIDTYPE_TGID); > + if (!tsk) > + ret = -ESRCH; > + rcu_read_unlock(); > + > + fd = ret ?: pidfd_create(p); > + put_pid(p); > + return fd; > +} Looks correct, feel free to add Reviewed-by: Oleg Nesterov But why do we need task_struct *tsk? rcu_read_lock(); if (!pid_task(PIDTYPE_TGID)) ret = -ESRCH; rcu_read_unlock(); and in fact we do not even need rcu_read_lock(), we could do // shut up rcu_dereference_check() rcu_lock_acquire(_lock_map); if (!pid_task(PIDTYPE_TGID)) ret = -ESRCH; rcu_lock_release(_lock_map); Well... I won't insist, but the comment about the race with exec looks a bit confusing to me. It is true, but we do not care at all, we are not going to use the task_struct returned by pid_task(). Oleg.
Re: [PATCH 1/2] pid: add pidfd_open()
On 05/15, Oleg Nesterov wrote: > > On 05/15, Christian Brauner wrote: > > > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) > > +{ > > + int fd, ret; > > + struct pid *p; > > + struct task_struct *tsk; > > + > > + if (flags) > > + return -EINVAL; > > + > > + if (pid <= 0) > > + return -EINVAL; > > + > > + p = find_get_pid(pid); > > + if (!p) > > + return -ESRCH; > > + > > + rcu_read_lock(); > > + tsk = pid_task(p, PIDTYPE_PID); > > You do not need find_get_pid() before rcu_lock and put_pid() at the end. > You can just do find_vpid() under rcu_read_lock(). Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't do this under rcu_read_lock(). So I was wrong, you can't avoid get/put_pid. Oleg.
Re: [PATCH 1/2] pid: add pidfd_open()
On 05/15, Christian Brauner wrote: > > On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote: > > > > it seems that you can do a single check > > > > tsk = pid_task(p, PIDTYPE_TGID); > > if (!tsk) > > ret = -ESRCH; > > > > this even looks more correct if we race with exec changing the leader. > > The logic here being that you can only reach the thread_group leader > from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid? Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid, struct pid has no "type" or something like this. The logic is that pid->tasks[PIDTYPE_XXX] is the list of task which use this pid as "XXX" type. For example, clone(CLONE_THREAD) creates a pid which has a single non- empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or SID. So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was used for a group-leader, see copy_process() which does if (thread_group_leader(p)) attach_pid(p, PIDTYPE_TGID); If we race with exec which changes the leader pid_task(TGID) can return the old leader. We do not care, but this means that we should not check thread_group_leader(). Oleg.
Re: [PATCH 1/2] pid: add pidfd_open()
On 05/15, Christian Brauner wrote: > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) > +{ > + int fd, ret; > + struct pid *p; > + struct task_struct *tsk; > + > + if (flags) > + return -EINVAL; > + > + if (pid <= 0) > + return -EINVAL; > + > + p = find_get_pid(pid); > + if (!p) > + return -ESRCH; > + > + rcu_read_lock(); > + tsk = pid_task(p, PIDTYPE_PID); You do not need find_get_pid() before rcu_lock and put_pid() at the end. You can just do find_vpid() under rcu_read_lock(). > + if (!tsk) > + ret = -ESRCH; > + else if (unlikely(!thread_group_leader(tsk))) > + ret = -EINVAL; it seems that you can do a single check tsk = pid_task(p, PIDTYPE_TGID); if (!tsk) ret = -ESRCH; this even looks more correct if we race with exec changing the leader. Oleg.
Re: [PATCH linux-next v10 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
On 04/16, Dmitry V. Levin wrote: > > [Andrew, could you take this patchset into your tree, please?] Just in case... I have already acked 6/7. Other patches look good to me too, just I don't think I can actually review these non-x86 changes. Oleg.
Re: remove asm-generic/ptrace.h
On 05/01, Christoph Hellwig wrote: > > Hi all, > > asm-generic/ptrace.h is a little weird in that it doesn't actually > implement any functionality, but it provided multiple layers of macros > that just implement trivial inline functions. We implement those > directly in the few architectures and be off with a much simpler > design. Oh, thanks, I was always confused by these macros ;) Oleg.
Re: [PATCH v2 3/6] x86: clean up _TIF_SYSCALL_EMU handling using ptrace_syscall_enter hook
On 04/30, Sudeep Holla wrote: > > On Mon, Mar 18, 2019 at 04:33:22PM +0100, Oleg Nesterov wrote: > > > > And it seems that _TIF_WORK_SYSCALL_ENTRY needs some cleanups too... We > > don't need > > "& _TIF_WORK_SYSCALL_ENTRY" in syscall_trace_enter, and > > _TIF_WORK_SYSCALL_ENTRY > > should not include _TIF_NOHZ? > > > > I was about to post the updated version and checked this to make sure I have > covered everything or not. I had missed the above comment. All architectures > have _TIF_NOHZ in their mask that they check to do work. And from x86, I read > "...syscall_trace_enter(). Also includes TIF_NOHZ for enter_from_user_mode()" > So I don't understand why _TIF_NOHZ needs to be dropped. I have already forgot this discussion... But after I glanced at this code again I still think the same, and I don't understand why do you disagree. > Also if we need to drop, we can address that separately examining all archs. Sure, and I was only talking about x86. We can keep TIF_NOHZ and even set_tsk_thread_flag(TIF_NOHZ) in context_tracking_cpu_set() if some arch needs this but remove TIF_NOHZ from TIF_WORK_SYSCALL_ENTRY in arch/x86/include/asm/thread_info.h, afaics this shouldn't make any difference. And I see no reason why x86 needs to use TIF_WORK_SYSCALL_ENTRY in syscall_trace_enter(). Oleg.
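Concretely, the suggestion amounts to dropping _TIF_NOHZ from the x86 mask,
roughly as sketched below (based on how the mask looked at the time; the
exact set of remaining flags is from memory and may differ):

	#define _TIF_WORK_SYSCALL_ENTRY	\
		(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
		 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)	/* no _TIF_NOHZ */

with syscall_trace_enter() then reading READ_ONCE(ti->flags) without the
"& _TIF_WORK_SYSCALL_ENTRY" masking.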
Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU
On 03/19, Oleg Nesterov wrote: > > Well, personally I see no point... Again, after the trivial simplification > x86 does > > if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) { > ret = tracehook_report_syscall_entry(regs); > if (ret || (work & _TIF_SYSCALL_EMU)) > return -1L; > } > > this looks simple enough for copy-and-paste. > > > If there's a better way to achieve the same > > I can only say that if we add a common helper, I think it should absorb > tracehook_report_syscall_entry() and handle both TIF's just like the code > above does. Not sure this makes any sense. this won't work, looking at 6/6 I see that arm64 needs to distinguish _TRACE and _EMU ... I don't understand this code, but it looks suspicious. If tracehook_report_syscall_entry() returns nonzero the tracee was killed, syscall_trace_enter() should just return. To me this is another indication that consolidation makes no sense ;) Oleg.
Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU
On 03/18, Sudeep Holla wrote: > > On Mon, Mar 18, 2019 at 06:33:41PM +0100, Oleg Nesterov wrote: > > On 03/18, Sudeep Holla wrote: > > > > > > On Mon, Mar 18, 2019 at 06:20:24PM +0100, Oleg Nesterov wrote: > > > > > > > > Again, to me this patch just makes the code look worse. Honestly, I > > > > don't > > > > think that the new (badly named) ptrace_syscall_enter() hook makes any > > > > sense. > > > > > > > > > > Worse because we end up reading current_thread_info->flags twice ? > > > > Mostly because in my opinion ptrace_syscall_enter() buys nothing but makes > > the caller's code less readable/understandable. > > > > Sure, this is subjective. > > > > Based on what we have in that function today, I tend to agree. Will and > Richard were in the opinion to consolidate SYSEMU handling Well, personally I see no point... Again, after the trivial simplification x86 does if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) { ret = tracehook_report_syscall_entry(regs); if (ret || (work & _TIF_SYSCALL_EMU)) return -1L; } this looks simple enough for copy-and-paste. > If there's a better way to achieve the same I can only say that if we add a common helper, I think it should absorb tracehook_report_syscall_entry() and handle both TIF's just like the code above does. Not sure this makes any sense. Oleg.
Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU
On 03/18, Sudeep Holla wrote: > > On Mon, Mar 18, 2019 at 06:20:24PM +0100, Oleg Nesterov wrote: > > > > Again, to me this patch just makes the code look worse. Honestly, I don't > > think that the new (badly named) ptrace_syscall_enter() hook makes any > > sense. > > > > Worse because we end up reading current_thread_info->flags twice ? Mostly because in my opinion ptrace_syscall_enter() buys nothing but makes the caller's code less readable/understandable. Sure, this is subjective. Oleg.
Re: [PATCH v2 1/6] ptrace: move clearing of TIF_SYSCALL_EMU flag to core
On 03/18, Sudeep Holla wrote: > @@ -534,6 +534,10 @@ static int ptrace_detach(struct task_struct *child, unsigned int data) > /* Architecture-specific hardware disable .. */ > ptrace_disable(child); > > +#ifdef TIF_SYSCALL_EMU > + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); > +#endif perhaps it makes sense to factor out clear_tsk_thread_flag(TIF_SYSCALL_EMU), but then we should probably clear it along with TIF_SYSCALL_TRACE in __ptrace_unlink? Oleg.
Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU
On 03/18, Sudeep Holla wrote: > > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -3278,35 +3278,29 @@ long do_syscall_trace_enter(struct pt_regs *regs) > > user_exit(); > > - flags = READ_ONCE(current_thread_info()->flags) & > - (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE); > - > - if (flags) { > - int rc = tracehook_report_syscall_entry(regs); > + if (unlikely(ptrace_syscall_enter(regs))) { > + /* > + * A nonzero return code from tracehook_report_syscall_entry() > + * tells us to prevent the syscall execution, but we are not > + * going to execute it anyway. > + * > + * Returning -1 will skip the syscall execution. We want to > + * avoid clobbering any registers, so we don't goto the skip > + * label below. > + */ > + return -1; > + } > > - if (unlikely(flags & _TIF_SYSCALL_EMU)) { > - /* > - * A nonzero return code from > - * tracehook_report_syscall_entry() tells us to prevent > - * the syscall execution, but we are not going to > - * execute it anyway. > - * > - * Returning -1 will skip the syscall execution. We want > - * to avoid clobbering any registers, so we don't goto > - * the skip label below. > - */ > - return -1; > - } > + flags = READ_ONCE(current_thread_info()->flags) & _TIF_SYSCALL_TRACE; Why do we need READ_ONCE() with this change? And now that we change a single bit "flags" doesn't look like a good name. Again, to me this patch just makes the code look worse. Honestly, I don't think that the new (badly named) ptrace_syscall_enter() hook makes any sense. Oleg.
Re: [PATCH v2 3/6] x86: clean up _TIF_SYSCALL_EMU handling using ptrace_syscall_enter hook
On 03/18, Sudeep Holla wrote: > > --- a/arch/x86/entry/common.c > +++ b/arch/x86/entry/common.c > @@ -70,22 +70,16 @@ static long syscall_trace_enter(struct pt_regs *regs) > > struct thread_info *ti = current_thread_info(); > unsigned long ret = 0; > - bool emulated = false; > u32 work; > > if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) > BUG_ON(regs != task_pt_regs(current)); > > - work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY; > - > - if (unlikely(work & _TIF_SYSCALL_EMU)) > - emulated = true; > - > - if ((emulated || (work & _TIF_SYSCALL_TRACE)) && > - tracehook_report_syscall_entry(regs)) > + if (unlikely(ptrace_syscall_enter(regs))) > return -1L; > > - if (emulated) > + work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY; > + if ((work & _TIF_SYSCALL_TRACE) && tracehook_report_syscall_entry(regs)) > return -1L; Well, I won't really argue, but to be honest I think this change doesn't make the code better... With this patch tracehook_report_syscall_entry() has 2 callers, to me this just adds some confusion. I agree that the usage of emulated/_TIF_SYSCALL_EMU looks a bit overcomplicated, I'd suggest a simple cleanup below. And it seems that _TIF_WORK_SYSCALL_ENTRY needs some cleanups too... We don't need "& _TIF_WORK_SYSCALL_ENTRY" in syscall_trace_enter, and _TIF_WORK_SYSCALL_ENTRY should not include _TIF_NOHZ? Oleg. --- x/arch/x86/entry/common.c +++ x/arch/x86/entry/common.c @@ -70,23 +70,18 @@ static long syscall_trace_enter(struct pt_regs *regs) struct thread_info *ti = current_thread_info(); unsigned long ret = 0; - bool emulated = false; u32 work; if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) BUG_ON(regs != task_pt_regs(current)); - work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY; + work = READ_ONCE(ti->flags); - if (unlikely(work & _TIF_SYSCALL_EMU)) - emulated = true; - - if ((emulated || (work & _TIF_SYSCALL_TRACE)) && - tracehook_report_syscall_entry(regs)) - return -1L; - - if (emulated) - return -1L; + if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) { + ret = tracehook_report_syscall_entry(regs); + if (ret || (work & _TIF_SYSCALL_EMU)) + return -1L; + } #ifdef CONFIG_SECCOMP /*
Re: [PATCH] powerpc/ptrace: cleanup do_syscall_trace_enter
On 12/16, Dmitry V. Levin wrote: > > long do_syscall_trace_enter(struct pt_regs *regs) > { > + u32 cached_flags; > + > user_exit(); > > - if (test_thread_flag(TIF_SYSCALL_EMU)) { > - /* > - * A nonzero return code from tracehook_report_syscall_entry() > - * tells us to prevent the syscall execution, but we are not > - * going to execute it anyway. > - * > - * Returning -1 will skip the syscall execution. We want to > - * avoid clobbering any register also, thus, not 'gotoing' > - * skip label. > - */ > - if (tracehook_report_syscall_entry(regs)) > - ; > - return -1; > - } > + cached_flags = READ_ONCE(current_thread_info()->flags) & > +(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE); > > - /* > - * The tracer may decide to abort the syscall, if so tracehook > - * will return !0. Note that the tracer may also just change > - * regs->gpr[0] to an invalid syscall number, that is handled > - * below on the exit path. > - */ > - if (test_thread_flag(TIF_SYSCALL_TRACE) && > - tracehook_report_syscall_entry(regs)) > - goto skip; > + if (cached_flags) { > + int rc = tracehook_report_syscall_entry(regs); > + > + if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) { > + /* > + * A nonzero return code from > + * tracehook_report_syscall_entry() tells us > + * to prevent the syscall execution, but > + * we are not going to execute it anyway. > + * > + * Returning -1 will skip the syscall execution. > + * We want to avoid clobbering any register also, > + * thus, not 'gotoing' skip label. > + */ > + return -1; > + } > + > + if (rc) { > + /* > + * The tracer decided to abort the syscall. > + * Note that the tracer may also just change > + * regs->gpr[0] to an invalid syscall number, > + * that is handled below on the exit path. > + */ > + goto skip; > + } > + } Looks good to me, Oleg.
Re: [PATCH v6] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call
On 12/07, Dmitry V. Levin wrote: > > Please make either v5 or v6 edition of this fix, or any similar fix, > into v4.20. IIUC, v5 above means [PATCH v5 23/25] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call you sent in another series... > long do_syscall_trace_enter(struct pt_regs *regs) > { > + struct thread_info *ti; > + u32 cached_flags; > + > user_exit(); > > - if (test_thread_flag(TIF_SYSCALL_EMU)) { > - ptrace_report_syscall(regs); > - /* > - * Returning -1 will skip the syscall execution. We want to > - * avoid clobbering any register also, thus, not 'gotoing' > - * skip label. > - */ > - return -1; > - } > + ti = current_thread_info(); > + cached_flags = READ_ONCE(ti->flags) & > +(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE | > + _TIF_SYSCALL_TRACEPOINT); > > - /* > - * The tracer may decide to abort the syscall, if so tracehook > - * will return !0. Note that the tracer may also just change > - * regs->gpr[0] to an invalid syscall number, that is handled > - * below on the exit path. > - */ > - if (test_thread_flag(TIF_SYSCALL_TRACE) && > - tracehook_report_syscall_entry(regs)) > - goto skip; > + if (cached_flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) { > + int rc = tracehook_report_syscall_entry(regs); > + > + if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) { > + /* > + * A nonzero return code from > + * tracehook_report_syscall_entry() tells us > + * to prevent the syscall execution, but > + * we are not going to execute it anyway. > + * > + * Returning -1 will skip the syscall execution. > + * We want to avoid clobbering any register also, > + * thus, not 'gotoing' skip label. > + */ > + return -1; > + } > + > + if (rc) { > + /* > + * The tracer decided to abort the syscall. > + * Note that the tracer may also just change > + * regs->gpr[0] to an invalid syscall number, > + * that is handled below on the exit path. > + */ > + goto skip; > + } > + } > > /* Run seccomp after ptrace; allow it to set gpr[3]. */ > if (do_seccomp(regs)) > @@ -3293,7 +3309,7 @@ long do_syscall_trace_enter(struct pt_regs *regs) > if (regs->gpr[0] >= NR_syscalls) > goto skip; > > - if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) > + if (unlikely(cached_flags & _TIF_SYSCALL_TRACEPOINT)) I will leave this to maintainers, but to me this change looks good and imo it also cleanups the code. However I am not sure cached_flags should include _TIF_SYSCALL_TRACEPOINT. If nothing else, the caller can sleep in ptrace_stop() unpredictably long and TIF_SYSCALL_TRACEPOINT can be set/cleared meanwhile. Oleg.
Re: [PATCH v4] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call
On 12/07, Dmitry V. Levin wrote:
>
> On Fri, Dec 07, 2018 at 10:12:49PM +1100, Michael Ellerman wrote:
> > > Sorry, this patch does not work, please ignore it.
> >
> > Hmm OK. Why exactly?
>
> Unfortunately, I have no idea why it doesn't work.
> All I can say is it breaks strace because the kernel no longer sends
> syscall entry stops.

Maybe because TIF_SYSCALL_EMU/etc is a bit number, not a mask? IOW, rather
than "whatever & TIF_XXX" you should do "whatever & _TIF_XXX" instead?

Oleg.
Re: [PATCH v3 1/2] livepatch: send a fake signal to all blocking tasks
On 11/02, Miroslav Benes wrote: > > On Wed, 1 Nov 2017, Oleg Nesterov wrote: > > > Note also that wake_up_state(TASK_INTERRUPTIBLE) won't wakeup the TASK_IDLE > > kthreads, and most of the kthreads which use TASK_INTERRUPTIBLE should use > > TASK_IDLE today, because in most cases TASK_INTERRUPTIBLE was used to not > > contribute to loadavg. > > Yes. Unfortunately, we have TASK_IDLE for more than two years now and > nothing much has happened yet. TASK_IDLE is still used sporadically. I'd > like to be on the safe side with livepatch OK, as I said I won't argue, > and given that > TASK_INTERRUPTIBLE loops should be prepared for spurious wakeups by > definition, Not really when it comes to kthreads. Once again, unless kthread does allow_signal() TASK_INTERRUPTIBLE does not really differ from TASK_UNINTERRUPTIBLE except the latter contributes to loadavg. And that is why TASK_INTERRUPTIBLE was commonly used instead of TASK_UNINTERRUPTIBLE, so I do not think that TASK_INTERRUPTIBLE loops are more ready in general than TASK_UNINTERRUPTIBLE. Oleg.
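As a concrete illustration of the distinction, a kthread main loop written
with TASK_IDLE is uninterruptible, does not contribute to loadavg, and must
treat any wakeup as possibly spurious. A minimal sketch of the idiom (not
code from the livepatch series):

	static int my_kthread(void *unused)
	{
		for (;;) {
			/* like TASK_UNINTERRUPTIBLE, but not counted in loadavg */
			set_current_state(TASK_IDLE);
			if (kthread_should_stop()) {
				__set_current_state(TASK_RUNNING);
				break;
			}
			schedule();
			__set_current_state(TASK_RUNNING);
			/*
			 * Woken up, possibly spuriously: re-check the actual
			 * condition before doing any work, do not assume a
			 * real event happened.
			 */
		}
		return 0;
	}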
Re: [PATCH v3 1/2] livepatch: send a fake signal to all blocking tasks
On 11/01, Petr Mladek wrote:
>
> On Tue 2017-10-31 12:48:52, Miroslav Benes wrote:
> > +	if (task->flags & PF_KTHREAD) {
> > +		/*
> > +		 * Wake up a kthread which still has not been migrated.
> > +		 */
> > +		wake_up_process(task);
>
> I have just noticed that freezer used wake_up_state(p, TASK_INTERRUPTIBLE);
> IMHO, we should do so as well.

I won't argue, but...

> wake_up_process() wakes also tasks in TASK_UNINTERRUPTIBLE state.
> These might not be ready for an unexpected wakeup. For example,
> see concat_dev_erase() in drivers/mtd/mtdconcat.c.

I'd say that concat_dev_erase() should be fixed, any code should be ready
for a spurious wakeup.

Note also that wake_up_state(TASK_INTERRUPTIBLE) won't wake up the TASK_IDLE
kthreads, and most of the kthreads which use TASK_INTERRUPTIBLE should use
TASK_IDLE today, because in most cases TASK_INTERRUPTIBLE was used just to
avoid contributing to loadavg.

Oleg.
Re: [PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers
On 06/29, James Morse wrote: > > compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK, > instead using those in ptrace_request(). The compat variant should > read a compat_sigset_t from userspace instead of ptrace_request()s > sigset_t. Acked-by: Oleg Nesterov <o...@redhat.com>
Re: [PATCH v5] powerpc: Do not make the entire heap executable
On 09/28, Kees Cook wrote: > > This is where the flags are actually built from what's coming in > through the newly created exported function vm_brk_flags() below. The > only flag we're acting on is VM_EXEC (passed in from set_brk() above). > I think do_brk_flags() should mask the valid flags, or we'll regret it > in the future. I'd like to see something like: > > /* Until we need other flags, refuse anything except VM_EXEC. */ > if ((flags & (~VM_EXEC)) != 0) > return -EINVAL; > flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; I tried to suggest this too. In particular it would be simply wrong to accept VM_LOCKED in flags. Oleg.
Re: [PATCH v4] powerpc: Do not make the entire heap executable
On 08/10, Denys Vlasenko wrote: > > Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire > brk area* with executable rights for all binaries, even --secure-plt ones. > > Stop doing that. Can't really review this patch, but at least the change in mm/mmap.c looks technically correct to me... One nit below, feel free to ignore. > @@ -2668,7 +2668,7 @@ static int do_brk(unsigned long addr, unsigned long > request) > if (!len) > return 0; > > - flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; > + flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; OK. But note that we have mlock_future_check(mm->def_flags); a few lines below and after this change this _looks_ wrong because VM_LOCKED can come from the new "flags" argument passed to do_brk(). Nobody does this right now, still this looks wrong/confusing. I'd suggest to add another change - mlock_future_check(mm->def_flags); + mlock_future_check(flags); or add a sanity check at the start to deny VM_LOCKED and perhaps something else... The same for vm_brk_flags() which after your change does do_brk_flags(flags); populate = (mm->def_flags & VM_LOCKED); again, this is just a nit, I do not think it will be ever called with VM_LOCKED in "flags". Oleg.
Re: [PATCH 09/14] resource limits: track highwater mark of locked memory
On 07/15, Topi Miettinen wrote:
>
> On 07/15/16 15:14, Oleg Nesterov wrote:
> >
> > Btw this is not right. The same for the previous patch which tracks
> > RLIMIT_STACK. The "current" task can be the debugger/etc.
>
> acct_stack_growth() is called from expand_upwards() and
> expand_downwards(). They call security_mmap_addr() and the various LSM
> implementations also use the current task in the checks. Are these also not
> right?

Just suppose that the stack grows because you read/write to /proc/pid/mem.

> > Yes, yes, this just reminds that the whole rlimit logic in this path
> > is broken but still...
>
> I'd be happy to fix the logic with a separate prerequisite patch and
> then use the right logic for this patch, but I'm not sure I know how.
> Could you elaborate a bit?

If only I knew how to fix this ;) I mean, if only I could suggest a simple
fix. Because IMHO we do not really care, rlimits are obsolete.

Oleg.
Re: [PATCH 09/14] resource limits: track highwater mark of locked memory
On 07/15, Topi Miettinen wrote:
>
> Track maximum size of locked memory, to be able to configure
> RLIMIT_MEMLOCK resource limits. The information is available
> with the taskstats and cgroupstats netlink sockets.

So I personally still dislike the very idea of this series... but I won't
argue if you convince the maintainers.

> @@ -2020,6 +2020,10 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
>  		return -ENOMEM;
>
>  	update_resource_highwatermark(RLIMIT_STACK, actual_size);
> +	if (vma->vm_flags & VM_LOCKED)
> +		update_resource_highwatermark(RLIMIT_MEMLOCK,
> +					      (mm->locked_vm + grow) <<
> +					      PAGE_SHIFT);

Btw this is not right. The same for the previous patch which tracks
RLIMIT_STACK. The "current" task can be the debugger/etc.

Yes, yes, this just reminds us that the whole rlimit logic in this path
is broken, but still...

Oleg.
Re: [PATCH v2 00/20] Fix handling of compat_siginfo_t
On 11/07, Andy Lutomirski wrote: > > On Wed, Nov 4, 2015 at 4:50 PM, Amanieu d'Antras wrote: > > One issue that isn't resolved in this series is sending signals between a > > 32-bit > > process and 64-bit process. Sending a si_int will work correctly, but a > > si_ptr > > value will likely get corrupted due to the different layouts of the 32-bit > > and > > 64-bit siginfo_t structures. > > This is so screwed up it's not even funny. Agreed, > A 64-bit big-endian compat calls rt_sigqueueinfo. It passes in (among > other things) a sigval_t. The kernel can choose to interpret it I always thought that the kernel should not interpret it at all. And indeed, copy_siginfo_to_user() does if (from->si_code < 0) return __copy_to_user(to, from, sizeof(siginfo_t)) probably copy_siginfo_to_user32() should do something similar, at least it should not truncate ->si_code if it is less than zero. Not sure what signalfd_copyinfo() should do. But perhaps I was wrong, I failed to find man sigqueueinfo, and man sigqueue() documents that it passes sigval_t. > BTW, x86 has its own set of screwups here. Somehow cr2 and error_code > ended up as part of ucontext instead of siginfo, which makes > absolutely no sense to me and bloats task_struct. Yes, and probably ->ip should have been part of siginfo too. Say, if you get SIGBUS you can't trust sc->ip if another signal was dequeued before SIGBUS, in this case sc->ip will point to the handler of that other signal. That is why we have SYNCHRONOUS_MASK and it helps, but still this doesn't look nice. Oleg.
Re: [PATCH 4/8] x86/spinlock: Leftover conversion ACCESS_ONCE-READ_ONCE
On 01/15, Christian Borntraeger wrote: On 15.01.2015 at 20:38, Oleg Nesterov wrote: On 01/15, Christian Borntraeger wrote: --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -186,7 +186,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) __ticket_t head = ACCESS_ONCE(lock->tickets.head); for (;;) { - struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets); + struct __raw_tickets tmp = READ_ONCE(lock->tickets); Agreed, but what about another ACCESS_ONCE() above? Oleg. tickets.head is a scalar type, so ACCESS_ONCE does work fine with gcc 4.6/4.7. My goal was to convert all accesses on non-scalar types I understand, but READ_ONCE(lock->tickets.head) looks better anyway and arch_spin_lock() already uses READ_ONCE() for this. So why should we keep the last ACCESS_ONCE() in spinlock.h ? Just to make another cosmetic cleanup which touches the same function later? Oleg.
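With both accesses converted, the function under discussion would read roughly as below. This is only a sketch of the suggested follow-up cleanup, not the committed change; the loop body is unchanged and elided here.

	static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
	{
		__ticket_t head = READ_ONCE(lock->tickets.head);

		for (;;) {
			struct __raw_tickets tmp = READ_ONCE(lock->tickets);
			/* ... unchanged: break once the lock is seen unlocked ... */
		}
	}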
Re: [PATCH 4/8] x86/spinlock: Leftover conversion ACCESS_ONCE-READ_ONCE
On 01/15, Christian Borntraeger wrote: --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -186,7 +186,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) __ticket_t head = ACCESS_ONCE(lock->tickets.head); for (;;) { - struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets); + struct __raw_tickets tmp = READ_ONCE(lock->tickets); Agreed, but what about another ACCESS_ONCE() above? Oleg.
Re: [PATCH v4 1/3] init/main.c: Give init_task a canary
On 09/12, Aaron Tomlin wrote: Tasks get their end of stack set to STACK_END_MAGIC with the aim to catch stack overruns. Currently this feature does not apply to init_task. This patch removes this restriction. Note that a similar patch was posted by Prarit Bhargava [1] some time ago but was never merged. [1]: http://marc.info/?l=linux-kernelm=127144305403241w=2 Signed-off-by: Aaron Tomlin atom...@redhat.com Acked-by: Michael Ellerman m...@ellerman.id.au Acked-by: Oleg Nesterov o...@redhat.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] sched: Add helper for task stack page overrun checking
On 09/04, Aaron Tomlin wrote: +#define task_stack_end_corrupted(task) \ + (*(end_of_stack(task)) != STACK_END_MAGIC) and it is always used along with tsk != init_task. But why do we exclude swapper/0? Can't we add *(end_of_stack(current)) = STACK_END_MAGIC; somewhere at the start of start_kernel() ? If not, perhaps this new helper should check task != init_task itself so that we can simplify its users? Oleg. static inline int object_is_on_stack(void *obj) { diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index 8a4e5cb..06c7390 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -13,7 +13,6 @@ #include <linux/sysctl.h> #include <linux/init.h> #include <linux/fs.h> -#include <linux/magic.h> #include <asm/setup.h> @@ -171,8 +170,8 @@ check_stack(unsigned long ip, unsigned long *stack) i++; } - if ((current != init_task && - *(end_of_stack(current)) != STACK_END_MAGIC)) { + if (current != init_task && + task_stack_end_corrupted(current)) { print_max_stack(); BUG(); } -- 1.9.3
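Concretely, the two alternatives being suggested would look something like this (a sketch only):

	/* Either: give swapper/0 the canary too, early in start_kernel() ... */
	*(end_of_stack(&init_task)) = STACK_END_MAGIC;

	/* ... or: fold the init_task check into the helper itself. */
	#define task_stack_end_corrupted(task) \
		((task) != &init_task && \
		 *(end_of_stack(task)) != STACK_END_MAGIC)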
Re: bit fields data tearing
On 07/13, Benjamin Herrenschmidt wrote: On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote: OK, looks like this is compiler bug, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080 Thanks to Dan who informed me privately. So yes, there is this compiler bug which means a bitfield access can cause a r-m-w access to a neighbouring field Thanks. So I can forward this all back to bugzilla. but in general, I would be wary of bitfields anyway since accessing them isn't going to be atomic anyway... it's too easy to get things wrong and in most cases the benefit is yet to be demonstrated. Sure, bit fields should be used with care. But the same arguments apply to bitmasks, they are often used without atomic set/clear_bit. In your example, I don't see the point of the bitfield. This is just a test-case. The real code has more adjacent bit fields, only the tracee can modify them, and only the debugger can change ->freeze_stop. Thanks, Oleg.
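A condensed userspace illustration of the hazard being discussed is sketched below (the original reproducer was a kernel module, and the names here are made up). It only misbehaves when built with a compiler affected by the gcc bug referenced above, e.g. the gcc 4.4 era on ppc64/s390: there, the store to s.flag can be compiled as a wider read-modify-write that also covers the neighbouring s.counter, so the second thread occasionally reads back a value it never wrote.

	#include <pthread.h>
	#include <stdio.h>

	struct shared {
		unsigned int flag:1;	/* written only by flip_bits() */
		unsigned int pad:31;
		int counter;		/* written and checked by bump_counter() */
	} s;

	static void *flip_bits(void *arg)
	{
		(void)arg;
		for (;;)
			s.flag ^= 1;	/* may become a wide r-m-w covering s.counter */
		return NULL;
	}

	static void *bump_counter(void *arg)
	{
		(void)arg;
		for (int i = 0; ; i = (i + 1) & 0xffff) {
			s.counter = i;
			/* force a reload so the compiler cannot elide the check */
			int back = *(volatile int *)&s.counter;
			if (back != i) {
				printf("tearing: wrote %d, read back %d\n", i, back);
				return NULL;
			}
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, flip_bits, NULL);
		pthread_create(&t2, NULL, bump_counter, NULL);
		pthread_join(t2, NULL);
		return 0;
	}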
bit fields data tearing
Hello, I am not sure I should ask here, but since Documentation/memory-barriers.txt mentions load/store tearing perhaps my question is not completely off-topic... I am fighting with mysterious RHEL bug, it can be reproduced on ppc and s390 but not on x86. Finally I seem to understand the problem, and I even wrote the stupid kernel module to ensure, see it below at the end. It triggers the problem immediately, kt_2() sees the wrong value in freeze_stop. (If I turn -freeze_stop int long, the problem goes away). So the question is: is this gcc bug or the code below is buggy? If it is buggy, then probably memory-barriers.txt could mention that you should be carefull with bit fields, even ACCESS_ONCE() obviously can't help. Or this just discloses my ignorance and you need at least aligned(long) after a bit field to be thread-safe ? I thought that compiler should take care and add the necessary alignment if (say) CPU can't update a single byte/uint. gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm: .kt_2: 0: 7c 08 02 a6 mflrr0 4: fb 81 ff e0 std r28,-32(r1) 8: fb a1 ff e8 std r29,-24(r1) c: fb c1 ff f0 std r30,-16(r1) 10: fb e1 ff f8 std r31,-8(r1) 14: eb c2 00 00 ld r30,0(r2) 18: f8 01 00 10 std r0,16(r1) 1c: f8 21 ff 71 stdur1,-144(r1) 20: 7c 7d 1b 78 mr r29,r3 24: 3b e0 00 00 li r31,0 28: 78 3c 04 64 rldicr r28,r1,0,49 2c: 3b 9c 00 80 addir28,r28,128 30: 48 00 00 2c b 5c .kt_2+0x5c 34: 60 00 00 00 nop 38: 60 00 00 00 nop 3c: 60 00 00 00 nop 40: 93 fd 00 04 stw r31,4(r29) 44: e8 9d 00 06 lwa r4,4(r29) 48: 7f 84 f8 00 cmpwcr7,r4,r31 4c: 40 de 00 4c bne-cr7,98 .kt_2+0x98 50: e8 1c 00 00 ld r0,0(r28) 54: 78 09 f7 e3 rldicl. r9,r0,62,63 58: 40 c2 00 54 bne-ac .kt_2+0xac 5c: 48 00 00 01 bl 5c .kt_2+0x5c 60: 60 00 00 00 nop 64: 3b ff 00 01 addir31,r31,1 68: 2f a3 00 00 cmpdi cr7,r3,0 6c: 7f ff 07 b4 extsw r31,r31 70: 41 9e ff d0 beq+cr7,40 .kt_2+0x40 74: 38 21 00 90 addir1,r1,144 78: 38 60 00 00 li r3,0 7c: e8 01 00 10 ld r0,16(r1) 80: eb 81 ff e0 ld r28,-32(r1) 84: eb a1 ff e8 ld r29,-24(r1) 88: eb c1 ff f0 ld r30,-16(r1) 8c: eb e1 ff f8 ld r31,-8(r1) 90: 7c 08 03 a6 mtlrr0 94: 4e 80 00 20 blr 98: e8 7e 80 28 ld r3,-32728(r30) 9c: 7f e5 fb 78 mr r5,r31 a0: 48 00 00 01 bl a0 .kt_2+0xa0 a4: 60 00 00 00 nop a8: 4b ff ff a8 b 50 .kt_2+0x50 ac: 48 00 00 01 bl ac .kt_2+0xac b0: 60 00 00 00 nop b4: 4b ff ff a8 b 5c .kt_2+0x5c b8: 60 00 00 00 nop bc: 60 00 00 00 nop 00c0 .kt_1: c0: 7c 08 02 a6 mflrr0 c4: fb 81 ff e0 std r28,-32(r1) c8: fb a1 ff e8 std r29,-24(r1) cc: fb c1 ff f0 std r30,-16(r1) d0: fb e1 ff f8 std r31,-8(r1) d4: eb c2 00 00 ld r30,0(r2) d8: f8 01 00 10 std r0,16(r1) dc: f8 21 ff 71 stdur1,-144(r1) e0: 7c 7d 1b 78 mr r29,r3 e4: 3b e0 00 00 li r31,0 e8: 78 3c 04 64 rldicr r28,r1,0,49 ec: 3b 9c 00 80 addir28,r28,128 f0: 48 00 00 38 b 128 .kt_1+0x68 f4: 60 00 00 00 nop f8: 60 00 00 00 nop fc: 60 00 00 00 nop 100: e8 1d 00 00 ld r0,0(r29) 104: 79 20 e8 0e rldimi r0,r9,61,0 108: f8 1d 00 00 std r0,0(r29) 10c: 80 1d 00 00 lwz r0,0(r29) 110: 54 00 1f 7e rlwinm r0,r0,3,29,31 114: 7f 80 f8 00 cmpwcr7,r0,r31 118: 40 de 00 6c bne-cr7,184 .kt_1+0xc4 11c: e8 1c 00 00 ld r0,0(r28) 120: 78 09 f7 e3 rldicl. r9,r0,62,63 124: 40 c2 00 70 bne-194 .kt_1+0xd4 128: 48 00 00 01 bl 128 .kt_1+0x68 12c: 60 00 00 00 nop 130: 3b ff 00 01 addir31,r31,1 134: 2f a3 00 00 cmpdi cr7,r3,0 138: 7f ff 07 b4 extsw r31,r31 13c: 2f 1f 00 07 cmpwi cr6,r31,7 140: 7b e9
Re: bit fields data tearing
OK, looks like this is compiler bug, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080 Thanks to Dan who informed me privately. On 07/12, Oleg Nesterov wrote: Hello, I am not sure I should ask here, but since Documentation/memory-barriers.txt mentions load/store tearing perhaps my question is not completely off-topic... I am fighting with mysterious RHEL bug, it can be reproduced on ppc and s390 but not on x86. Finally I seem to understand the problem, and I even wrote the stupid kernel module to ensure, see it below at the end. It triggers the problem immediately, kt_2() sees the wrong value in freeze_stop. (If I turn -freeze_stop int long, the problem goes away). So the question is: is this gcc bug or the code below is buggy? If it is buggy, then probably memory-barriers.txt could mention that you should be carefull with bit fields, even ACCESS_ONCE() obviously can't help. Or this just discloses my ignorance and you need at least aligned(long) after a bit field to be thread-safe ? I thought that compiler should take care and add the necessary alignment if (say) CPU can't update a single byte/uint. gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm: .kt_2: 0: 7c 08 02 a6 mflrr0 4: fb 81 ff e0 std r28,-32(r1) 8: fb a1 ff e8 std r29,-24(r1) c: fb c1 ff f0 std r30,-16(r1) 10: fb e1 ff f8 std r31,-8(r1) 14: eb c2 00 00 ld r30,0(r2) 18: f8 01 00 10 std r0,16(r1) 1c: f8 21 ff 71 stdur1,-144(r1) 20: 7c 7d 1b 78 mr r29,r3 24: 3b e0 00 00 li r31,0 28: 78 3c 04 64 rldicr r28,r1,0,49 2c: 3b 9c 00 80 addir28,r28,128 30: 48 00 00 2c b 5c .kt_2+0x5c 34: 60 00 00 00 nop 38: 60 00 00 00 nop 3c: 60 00 00 00 nop 40: 93 fd 00 04 stw r31,4(r29) 44: e8 9d 00 06 lwa r4,4(r29) 48: 7f 84 f8 00 cmpwcr7,r4,r31 4c: 40 de 00 4c bne-cr7,98 .kt_2+0x98 50: e8 1c 00 00 ld r0,0(r28) 54: 78 09 f7 e3 rldicl. r9,r0,62,63 58: 40 c2 00 54 bne-ac .kt_2+0xac 5c: 48 00 00 01 bl 5c .kt_2+0x5c 60: 60 00 00 00 nop 64: 3b ff 00 01 addir31,r31,1 68: 2f a3 00 00 cmpdi cr7,r3,0 6c: 7f ff 07 b4 extsw r31,r31 70: 41 9e ff d0 beq+cr7,40 .kt_2+0x40 74: 38 21 00 90 addir1,r1,144 78: 38 60 00 00 li r3,0 7c: e8 01 00 10 ld r0,16(r1) 80: eb 81 ff e0 ld r28,-32(r1) 84: eb a1 ff e8 ld r29,-24(r1) 88: eb c1 ff f0 ld r30,-16(r1) 8c: eb e1 ff f8 ld r31,-8(r1) 90: 7c 08 03 a6 mtlrr0 94: 4e 80 00 20 blr 98: e8 7e 80 28 ld r3,-32728(r30) 9c: 7f e5 fb 78 mr r5,r31 a0: 48 00 00 01 bl a0 .kt_2+0xa0 a4: 60 00 00 00 nop a8: 4b ff ff a8 b 50 .kt_2+0x50 ac: 48 00 00 01 bl ac .kt_2+0xac b0: 60 00 00 00 nop b4: 4b ff ff a8 b 5c .kt_2+0x5c b8: 60 00 00 00 nop bc: 60 00 00 00 nop 00c0 .kt_1: c0: 7c 08 02 a6 mflrr0 c4: fb 81 ff e0 std r28,-32(r1) c8: fb a1 ff e8 std r29,-24(r1) cc: fb c1 ff f0 std r30,-16(r1) d0: fb e1 ff f8 std r31,-8(r1) d4: eb c2 00 00 ld r30,0(r2) d8: f8 01 00 10 std r0,16(r1) dc: f8 21 ff 71 stdur1,-144(r1) e0: 7c 7d 1b 78 mr r29,r3 e4: 3b e0 00 00 li r31,0 e8: 78 3c 04 64 rldicr r28,r1,0,49 ec: 3b 9c 00 80 addir28,r28,128 f0: 48 00 00 38 b 128 .kt_1+0x68 f4: 60 00 00 00 nop f8: 60 00 00 00 nop fc: 60 00 00 00 nop 100: e8 1d 00 00 ld r0,0(r29) 104: 79 20 e8 0e rldimi r0,r9,61,0 108: f8 1d 00 00 std r0,0(r29) 10c: 80 1d 00 00 lwz r0,0(r29) 110: 54 00 1f 7e rlwinm r0,r0,3,29,31 114: 7f 80 f8 00 cmpwcr7,r0,r31 118: 40 de 00 6c bne-cr7,184 .kt_1+0xc4 11c: e8 1c 00 00 ld r0,0(r28) 120: 78 09 f7 e3 rldicl. r9,r0,62,63 124: 40 c2 00 70 bne-194 .kt_1+0xd4 128: 48 00 00 01 bl 128 .kt_1+0x68 12c: 60 00 00 00 nop 130: 3b ff 00 01 addir31,r31,1 134: 2f a3 00 00 cmpdi cr7,r3,0 138
Re: perf events ring buffer memory barrier on powerpc
On 10/29, Peter Zijlstra wrote: On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote: @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output * Userspace could choose to issue a mb() before updating the * tail pointer. So that all reads will be completed before the * write is issued. +* +* See perf_output_put_handle(). */ tail = ACCESS_ONCE(rb->user_page->data_tail); - smp_rmb(); + smp_mb(); offset = head = local_read(&rb->head); head += size; if (unlikely(!perf_output_space(rb, tail, offset, head))) That said; it would be very nice to be able to remove this barrier. This is in every event write path :/ Yes.. And I'm very much afraid that I simply confused you. Perhaps Victor is right and we do not need this mb(). So I am waiting for the end of this story too. And btw I do not understand why we need it (or smp_rmb) right after ACCESS_ONCE(data_tail). Oleg.
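The pairing being argued about can be pictured with a minimal single-producer/single-consumer ring buffer. The sketch below is illustrative pseudo-code only, not the perf ring-buffer implementation: the names are made up, the smp_*()/ACCESS_ONCE() primitives are the kernel's, wrap-around copying is omitted, and whether barrier (A) really has to be a full smp_mb() rather than smp_rmb() is exactly the open question in this thread (the sketch shows the smp_mb() variant from the patch).

	#define RB_SIZE 4096

	struct ring {
		unsigned long	head;		/* written by the producer (kernel)    */
		unsigned long	tail;		/* written by the consumer (userspace) */
		char		data[RB_SIZE];
	};

	/* producer side (kernel) */
	void produce(struct ring *rb, const void *buf, unsigned long len)
	{
		unsigned long tail = ACCESS_ONCE(rb->tail);	/* READ ->data_tail */

		smp_mb();		/* (A) pairs with (D) */
		if (RB_SIZE - (rb->head - tail) < len)
			return;					/* no room */
		memcpy(&rb->data[rb->head % RB_SIZE], buf, len);	/* WRITE $data */
		smp_wmb();		/* (B) pairs with (C) */
		rb->head += len;				/* STORE ->data_head */
	}

	/* consumer side (userspace) */
	void consume(struct ring *rb, void *buf, unsigned long len)
	{
		unsigned long head = ACCESS_ONCE(rb->head);	/* READ ->data_head */

		smp_rmb();		/* (C) pairs with (B) */
		if (head - rb->tail < len)
			return;					/* nothing to read */
		memcpy(buf, &rb->data[rb->tail % RB_SIZE], len);	/* READ $data */
		smp_mb();		/* (D) pairs with (A) */
		rb->tail += len;				/* WRITE ->data_tail */
	}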
Re: perf events ring buffer memory barrier on powerpc
On 10/28, Peter Zijlstra wrote: Lets add Paul and Oleg to the thread; this is getting far more 'fun' that it should be ;-) Heh. All I can say is that I would like to see the authoritative answer, you know who can shed a light ;) But to avoid the confusion, wmb() added by this patch looks obviously correct to me. + * Since the mmap() consumer (userspace) can run on a different CPU: + * + * kernel user + * + * READ -data_tail READ -data_head + * smp_rmb() (A) smp_rmb() (C) + * WRITE $dataREAD $data + * smp_wmb() (B) smp_mb()(D) + * STORE -data_head WRITE -data_tail + * + * Where A pairs with D, and B pairs with C. + * + * I don't think A needs to be a full barrier because we won't in fact + * write data until we see the store from userspace. this matches the intuition, but ... So we simply don't + * issue the data WRITE until we observe it. why do we need any barrier (rmb) then? how it can help to serialize with WRITE $data ? (of course there could be other reasons for this rmb(), just I can't really understand A pairs with D). And this reminds me about the memory barrier in kfifo.c which I was not able to understand. Hmm, it has already gone away, and now I do not understand kfifo.c at all ;) But I have found the commit, attached below. Note that that commit added the full barrier into __kfifo_put(). And to me it looks the same as A above. Following the logic above we could say that we do not need a barrier (at least the full one), we won't in fact write into the unread area until we see the store to -out from __kfifo_get() ? In short. I am confused, I _feel_ that A has to be a full barrier, but I can't prove this. And let me suggest the artificial/simplified example, boolBUSY; data_t DATA; bool try_to_get(data_t *data) { if (!BUSY) return false; rmb(); *data = DATA; mb(); BUSY = false; return true; } bool try_to_put(data_t *data) { if (BUSY) return false; mb(); // : do we really need it? I think yes. DATA = *data; wmb(); BUSY = true; return true; } Again, following the description above we could turn the mb() in _put() into barrier(), but I do not think we can rely on the contorl dependency. Oleg. --- commit a45bce49545739a940f8bd4ca85c3b7435564893 Author: Paul E. McKenney paul...@us.ibm.com Date: Fri Sep 29 02:00:11 2006 -0700 [PATCH] memory ordering in __kfifo primitives Both __kfifo_put() and __kfifo_get() have header comments stating that if there is but one concurrent reader and one concurrent writer, locking is not necessary. This is almost the case, but a couple of memory barriers are needed. Another option would be to change the header comments to remove the bit about locking not being needed, and to change the those callers who currently don't use locking to add the required locking. The attachment analyzes this approach, but the patch below seems simpler. Signed-off-by: Paul E. McKenney paul...@us.ibm.com Cc: Stelian Pop stel...@popies.net Signed-off-by: Andrew Morton a...@osdl.org Signed-off-by: Linus Torvalds torva...@osdl.org diff --git a/kernel/kfifo.c b/kernel/kfifo.c index 64ab045..5d1d907 100644 --- a/kernel/kfifo.c +++ b/kernel/kfifo.c @@ -122,6 +122,13 @@ unsigned int __kfifo_put(struct kfifo *fifo, len = min(len, fifo-size - fifo-in + fifo-out); + /* +* Ensure that we sample the fifo-out index -before- we +* start putting bytes into the kfifo. 
+*/ + + smp_mb(); + /* first put the data starting from fifo-in to buffer end */ l = min(len, fifo-size - (fifo-in (fifo-size - 1))); memcpy(fifo-buffer + (fifo-in (fifo-size - 1)), buffer, l); @@ -129,6 +136,13 @@ unsigned int __kfifo_put(struct kfifo *fifo, /* then put the rest (if any) at the beginning of the buffer */ memcpy(fifo-buffer, buffer + l, len - l); + /* +* Ensure that we add the bytes to the kfifo -before- +* we update the fifo-in index. +*/ + + smp_wmb(); + fifo-in += len; return len; @@ -154,6 +168,13 @@ unsigned int __kfifo_get(struct kfifo *fifo, len = min(len, fifo-in - fifo-out); + /* +* Ensure that we sample the fifo-in index -before- we +* start removing bytes from the kfifo. +*/ + + smp_rmb(); + /* first get the data from fifo-out until the end of the buffer */ l = min(len,
Re: perf events ring buffer memory barrier on powerpc
On 10/28, Paul E. McKenney wrote: On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote: --- linux-2.6.orig/kernel/events/ring_buffer.c +++ linux-2.6/kernel/events/ring_buffer.c @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc goto out; /* -* Publish the known good head. Rely on the full barrier implied -* by atomic_dec_and_test() order the rb-head read and this -* write. +* Since the mmap() consumer (userspace) can run on a different CPU: +* +* kernel user +* +* READ -data_tail READ -data_head +* smp_rmb() (A) smp_rmb() (C) Given that both of the kernel's subsequent operations are stores/writes, doesn't (A) need to be smp_mb()? Yes, this is my understanding^Wfeeling too, but I have to admit that I can't really explain to myself why _exactly_ we need mb() here. And let me copy-and-paste the artificial example from my previous email, boolBUSY; data_t DATA; bool try_to_get(data_t *data) { if (!BUSY) return false; rmb(); *data = DATA; mb(); BUSY = false; return true; } bool try_to_put(data_t *data) { if (BUSY) return false; mb(); // : do we really need it? I think yes. DATA = *data; wmb(); BUSY = true; return true; } (just in case, the code above obviously assumes that _get or _put can't race with itself, but they can race with each other). Could you confirm that try_to_put() actually needs mb() between LOAD(BUSY) and STORE(DATA) ? I am sure it actually needs, but I will appreciate it if you can explain why. IOW, how it is possible that without mb() try_to_put() can overwrite DATA before try_to_get() completes its *data = DATA in this particular case. Perhaps this can happen if, say, reader and writer share a level of cache or something like this... Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
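In a more readable layout, the artificial example from the message above is (the same pseudo-code, only reformatted; data_t is left undefined as in the original):

	bool	BUSY;
	data_t	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();
		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// do we really need it? I think yes.
		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}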
Re: linux-next: manual merge of the akpm tree with the powerpc tree
On 06/26, Benjamin Herrenschmidt wrote: On Wed, 2013-06-26 at 16:56 +1000, Stephen Rothwell wrote: Today's linux-next merge of the akpm tree got a conflict in arch/powerpc/kernel/ptrace.c between commit b0b0aa9c7faf (powerpc/hw_brk: Fix setting of length for exact mode breakpoints) from the powerpc tree and commit ptrace/powerpc: revert hw_breakpoints: Fix racy access to ptrace breakpoints from the akpm tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). Where does the latter come from ? ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints.patch attached below. You were cc'ed every time ;) Why didn't it go through the powerpc tree ? Because this series needs to update any user of ptrace_get/put_breakpoints in arch/ (simply remove these calls), then change the core kernel code, then fix arch/86. -- From: Oleg Nesterov o...@redhat.com Subject: ptrace/powerpc: revert hw_breakpoints: Fix racy access to ptrace breakpoints This reverts commit 07fa7a0a8a586 (hw_breakpoints: Fix racy access to ptrace breakpoints) and removes ptrace_get/put_breakpoints() added by other commits. The patch was fine but we can no longer race with SIGKILL after 9899d11f (ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL), the __TASK_TRACED tracee can't be woken up and -ptrace_bps[] can't go away. Signed-off-by: Oleg Nesterov o...@redhat.com Acked-by: Michael Neuling mi...@neuling.org Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Frederic Weisbecker fweis...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Jan Kratochvil jan.kratoch...@redhat.com Cc: Paul Mundt let...@linux-sh.org Cc: Will Deacon will.dea...@arm.com Cc: Prasad pra...@linux.vnet.ibm.com Cc: Russell King li...@arm.linux.org.uk Signed-off-by: Andrew Morton a...@linux-foundation.org --- arch/powerpc/kernel/ptrace.c | 20 1 file changed, 20 deletions(-) diff -puN arch/powerpc/kernel/ptrace.c~ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints arch/powerpc/kernel/ptrace.c --- a/arch/powerpc/kernel/ptrace.c~ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints +++ a/arch/powerpc/kernel/ptrace.c @@ -974,16 +974,12 @@ int ptrace_set_debugreg(struct task_stru hw_brk.type = (data HW_BRK_TYPE_DABR) | HW_BRK_TYPE_PRIV_ALL; hw_brk.len = 8; #ifdef CONFIG_HAVE_HW_BREAKPOINT - if (ptrace_get_breakpoints(task) 0) - return -ESRCH; - bp = thread-ptrace_bps[0]; if ((!data) || !(hw_brk.type HW_BRK_TYPE_RDWR)) { if (bp) { unregister_hw_breakpoint(bp); thread-ptrace_bps[0] = NULL; } - ptrace_put_breakpoints(task); return 0; } if (bp) { @@ -996,11 +992,9 @@ int ptrace_set_debugreg(struct task_stru ret = modify_user_hw_breakpoint(bp, attr); if (ret) { - ptrace_put_breakpoints(task); return ret; } thread-ptrace_bps[0] = bp; - ptrace_put_breakpoints(task); thread-hw_brk = hw_brk; return 0; } @@ -1015,12 +1009,9 @@ int ptrace_set_debugreg(struct task_stru ptrace_triggered, NULL, task); if (IS_ERR(bp)) { thread-ptrace_bps[0] = NULL; - ptrace_put_breakpoints(task); return PTR_ERR(bp); } - ptrace_put_breakpoints(task); - #endif /* CONFIG_HAVE_HW_BREAKPOINT */ task-thread.hw_brk = hw_brk; #else /* CONFIG_PPC_ADV_DEBUG_REGS */ @@ -1439,9 +1430,6 @@ static long ppc_set_hwdebug(struct task_ if (bp_info-trigger_type PPC_BREAKPOINT_TRIGGER_WRITE) brk.type |= HW_BRK_TYPE_WRITE; #ifdef CONFIG_HAVE_HW_BREAKPOINT - if (ptrace_get_breakpoints(child) 0) - return -ESRCH; - /* * Check if the request is for 'range' breakpoints. 
We can * support it if range 8 bytes. @@ -1449,12 +1437,10 @@ static long ppc_set_hwdebug(struct task_ if (bp_info-addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE) { len = bp_info-addr2 - bp_info-addr; } else if (bp_info-addr_mode != PPC_BREAKPOINT_MODE_EXACT) { - ptrace_put_breakpoints(child); return -EINVAL; } bp = thread-ptrace_bps[0]; if (bp) { - ptrace_put_breakpoints(child); return -ENOSPC; } @@ -1468,11 +1454,9 @@ static long ppc_set_hwdebug(struct task_ ptrace_triggered, NULL, child); if (IS_ERR(bp)) { thread-ptrace_bps[0] = NULL; - ptrace_put_breakpoints(child
[PATCH] ptrace/powerpc: dont flush_ptrace_hw_breakpoint() on fork()
arch_dup_task_struct() does flush_ptrace_hw_breakpoint(src), this destroys the parent's breakpoints for no reason. We should clear child-thread.ptrace_bps[] copied by dup_task_struct() instead. Signed-off-by: Oleg Nesterov o...@redhat.com Acked-by: Michael Neuling mi...@neuling.org --- x/arch/powerpc/kernel/process.c +++ x/arch/powerpc/kernel/process.c @@ -910,10 +910,6 @@ int arch_dup_task_struct(struct task_str flush_altivec_to_thread(src); flush_vsx_to_thread(src); flush_spe_to_thread(src); -#ifdef CONFIG_HAVE_HW_BREAKPOINT - flush_ptrace_hw_breakpoint(src); -#endif /* CONFIG_HAVE_HW_BREAKPOINT */ - *dst = *src; return 0; } @@ -984,6 +980,10 @@ int copy_thread(unsigned long clone_flag p-thread.ksp_limit = (unsigned long)task_stack_page(p) + _ALIGN_UP(sizeof(struct thread_info), 16); +#ifdef CONFIG_HAVE_HW_BREAKPOINT + p-thread.ptrace_bps[0] = NULL; +#endif + #ifdef CONFIG_PPC_STD_MMU_64 if (mmu_has_feature(MMU_FTR_SLB)) { unsigned long sp_vsid; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 1/4] uprobes: add trap variant helper
On 03/22, Ananth N Mavinakayanahalli wrote: Some architectures like powerpc have multiple variants of the trap instruction. Introduce an additional helper is_trap_insn() for run-time handling of non-uprobe traps on such architectures. Looks fine to me. I am going to take this into my tree, hopefully Ben won't object. The main change is arch-agnostic and the changes in arch/powerpc are trivial. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/22, Ananth N Mavinakayanahalli wrote: On Thu, Mar 21, 2013 at 04:58:09PM +0100, Oleg Nesterov wrote: - verify_opcode()->is_swbp_insn() means: is this insn fine for uprobe? (we do not care about gdb, we simply ignore this problem) I will write up a patch for this case.. So, IIUC we do not care to send gdb a SIGTRAP if we have replaced a conditional trap from gdb with an unconditional uprobes one, right? Yes. And just in case, we do not send SIGTRAP if gdb used the same/unconditional insn. We simply can't know if someone else wants to know that the task hits this breakpoint. Oleg.
Re: [PATCH 1/3] uprobes: add trap variant helper
On 03/22, Ananth N Mavinakayanahalli wrote: +/** + * is_trap_insn - check if instruction is breakpoint instruction. + * @insn: instruction to be checked. + * Default implementation of is_trap_insn + * Returns true if @insn is a breakpoint instruction. + * + * This function is needed for the case where an architecture has multiple + * trap instructions (like powerpc). + */ +bool __weak is_trap_insn(uprobe_opcode_t *insn) +{ + return is_swbp_insn(insn); +} OK, thanks, the whole series looks fine, just one note... My patch also changed prepare_uprobe() to use is_trap_insn(), and I think this is right. Exactly because of 3/3 which removes is_trap() from arch_uprobe_analyze_insn(). If the original insn is_trap(), we do not want to singlestep it and get another trap after we hit handle_swbp(). Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/21, Ananth N Mavinakayanahalli wrote: On Wed, Mar 20, 2013 at 05:06:44PM +0100, Oleg Nesterov wrote: But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to verify. If not, we need 2 definitions. is_uprobe_insn() should still check insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap(). Its fine from gdb's perspective with my patch. Yes, but this doesn't look right from uprobe's perspective. So, install_breakpoint()-prepare_uprobe()-is_swbp_insn() will return ENOTSUPP. In fact, arch_uprobe_analyze_insn() will also do the same. Yes and the check in arch_uprobe_analyze_insn() should go away. But you missed my point. Please forget about prepare_uprobe(), it is wrong anyway. And, prepare_uprobe() inspects the _original_ insn from the file, this has nothing install_breakpoint/etc. I meant verify_opcode() called by install_breakpoint/etc. For the case where X already exists, verify_opcode() currently returns 0. IMO, it should return -EEXIST, Oh, this is debatable. Currently we assume that uprobe should win. See below... unless you are proposing that uprobes should ride on the existing trap (even if its a variant). Yes. And this is what the current implementation does. If you are proposing that uprobes ride on X if it already exists, that's not always possible and is a big can of worms... see below... Sure. Whatever we do uprobe and gdb can confuse each other. Unless we rework the vm code completely (not sure this is realistic). OK. So, if I understand correctly, gdb can use some conditional breakpoint, and it is possible that this insn won't generate the trap? Yes it is possible if the condition is not met. If the condition is met, the instruction will generate a trap, and uprobes will do a send_sig(SIGTRAP) from handle_swbp(). Unless there is uprobe at the same address. Once again, uprobe wins. Your patch only fixes the case when the task hits a non-UPROBE_SWBP_INSN breakpoint and there is no uprobe at the same address. Then this patch is not right, or at least we need another change on top? Once again. Suppose that gdb installs the TRAP_IF_R1_GT_R2. After that uprobe_register() is called, but it won't change this insn because verify_opcode() returns 0. Then the probed task hits this breakoint with r1 r2 and we do not report this event. At this time, the condition for the trap is not satisfied, so no exception occurs. Yes, and thus uprobe-handler() is not called, this was my point. If the expectation is that the trap always trigger, then all such trap variants need to be replaced with the unconditional trap Yes. that is why I suggested the patch which doesn't affect verify_opcode(). uprobe_register() should replace the conditional trap with the unconditional UPROBE_SWBP_INSN. uprobes should win. and we should either add logic to re-execute the condional trap after uprobe handling and send_sig() via handle_swbp() or emulate the condition in software and do a send_sig() if needed. Unlikely this is possible. Or even desirable. So. I still think that we actually need something like below, and powerpc should reimplement is_trap_insn() to use is_trap(insn). No? I don't see how this will help, Hmm. This should equally fix this particular problem? handle_swbp() will send the trap if is_trap() == T? Again, unless there is uprobe, but this was always true. especially since the gdb-uprobes is fraught with races. They can race anyway, whatever we do. Unless we rework write_opcode/etc completely. 
With your proposed patch, we refuse to insert a uprobe if the underlying instruction is a UPROBE_SWBP_INSTRUCTION; If underlying means the original insn in vma-file, this is already true. My patch doesn't change this logic. Otherwise - no, we do not refuse to insert a uprobe if this insn was already changed by gdb. changing is_swbp_at_addr() will need changes in handle_swbp() too. I don't think so. Why? But, unlike x86, we cannot expect a uprobe with an underlying trap variant (X) to always trigger. And that is why I think write_opcode() should rewrite the conditional trap. IMHO, its not a good idea to do that for x86 either, This change has no effect fo x86. IMHO, I really think we should not allow uprobe_register() to succeed if the underlying instruction is a breakpoint (or a variant thereof). I disagree. Just for example. Suppose we change install_breakpoint() so that it fails if the underlying instruction is int3. (once again, underlying does not mean the original insn from vma-vm_file). First of all, this is very much nontrivial. I simply do not see how we can do this. If nothing else, uprobe_register() can race with uprobe_mmap() and install_breakpoint() can be called twice with the same vaddr. With this change register or mmap can fail. But suppose you can do this. Now you can write the trivial application which mmaps glibc and inserts int3
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/21, Ananth N Mavinakayanahalli wrote: On Wed, Mar 20, 2013 at 05:07:28PM +0100, Oleg Nesterov wrote: On 03/20, Ananth N Mavinakayanahalli wrote: On Wed, Mar 20, 2013 at 01:43:01PM +0100, Oleg Nesterov wrote: On 03/20, Oleg Nesterov wrote: IOW, if I wasn't clear... Lets forget about gdb/etc for the moment. Suppose we apply the patch below. Will uprobes on powerpc work? If yes, then your patch should be fine. If not, we probably need more changes. Yes, it will work fine. Even if this new insn is conditional? Yes. But this can't be true. If it is conditional, it won't always generate a trap, this means uprobe won't actually work. Oleg.
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
Hi Ananth, First of all, let me remind that I know nothing about powerpc ;) But iirc we already discussed this a bit, I forgot the details but still I have some concerns... On 03/20, Ananth N Mavinakayanahalli wrote: GDB uses a variant of the trap instruction that is different from the one used by uprobes. Currently, running gdb on a program being traced by uprobes causes an endless loop since uprobes doesn't understand that the trap is inserted by some other entity and hence a SIGTRAP needs to be delivered. Yes, and thus is_swbp_at_addr()-is_swbp_insn() called by handle_swbp() should be updated, +bool is_swbp_insn(uprobe_opcode_t *insn) +{ + return (is_trap(*insn)); +} And this patch should fix the problem. (and probably this is fine for prepare_uprobe()). But, at the same time, is the new definition fine for verify_opcode()? IOW, powerpc has another is_trap() insn(s) used by gdb, lets denote it X. X != UPROBE_SWBP_INSN. Suppose that gdb installs the trap X at some addr, and then uprobe_register() tries to install uprobe at the same address. Then set_swbp() will do nothing, assuming the uprobe was already installed. But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to verify. If not, we need 2 definitions. is_uprobe_insn() should still check insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap(). And I am just curious, could you explain how X and UPROBE_SWBP_INSN differ? Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/20, Oleg Nesterov wrote: But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to verify. If not, we need 2 definitions. is_uprobe_insn() should still check insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap(). And I am just curious, could you explain how X and UPROBE_SWBP_INSN differ? IOW, if I wasn't clear... Lets forget about gdb/etc for the moment. Suppose we apply the patch below. Will uprobes on powerpc work? If yes, then your patch should be fine. If not, we probably need more changes. And, forgot to mention. If you change is_swbp_insn(), you can remove is_trap() from arch_uprobe_analyze_insn(). Oleg. --- x/arch/powerpc/include/asm/uprobes.h +++ x/arch/powerpc/include/asm/uprobes.h @@ -31,7 +31,7 @@ typedef ppc_opcode_t uprobe_opcode_t; #define UPROBE_XOL_SLOT_BYTES (MAX_UINSN_BYTES) /* The following alias is needed for reference from arch-agnostic code */ -#define UPROBE_SWBP_INSN BREAKPOINT_INSTRUCTION +#define UPROBE_SWBP_INSN TRAP_INSN_USED_BY_GDB #define UPROBE_SWBP_INSN_SIZE 4 /* swbp insn size in bytes */ struct arch_uprobe { ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/20, Ananth N Mavinakayanahalli wrote: On Wed, Mar 20, 2013 at 01:26:39PM +0100, Oleg Nesterov wrote: But, at the same time, is the new definition fine for verify_opcode()? IOW, powerpc has another is_trap() insn(s) used by gdb, lets denote it X. X != UPROBE_SWBP_INSN. Suppose that gdb installs the trap X at some addr, and then uprobe_register() tries to install uprobe at the same address. Then set_swbp() will do nothing, assuming the uprobe was already installed. But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to verify. If not, we need 2 definitions. is_uprobe_insn() should still check insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap(). is_trap() checks for all trap variants on powerpc, including the one uprobe uses. It returns true if the instruction is *any* trap variant. I understand, So, install_breakpoint()-prepare_uprobe()-is_swbp_insn() will return ENOTSUPP. In fact, arch_uprobe_analyze_insn() will also do the same. Yes and the check in arch_uprobe_analyze_insn() should go away. But you missed my point. Please forget about prepare_uprobe(), it is wrong anyway. And, prepare_uprobe() inspects the _original_ insn from the file, this has nothing install_breakpoint/etc. I meant verify_opcode() called by install_breakpoint/etc. This itself should take care of all the cases. And I am just curious, could you explain how X and UPROBE_SWBP_INSN differ? Powerpc has numerous variants of the trap instruction based on comparison of two registers or a regsiter and immediate value and a condition (less than, greater than, [signed forms thereof], or equal to). Uprobes uses 0x7fe0008 which is 'tw 31,0,0' which essentially is an unconditional trap. Gdb uses many traps, one of which is 0x7d821008 which is twge r2,r2, which is basically trap if r2 greater than or equal to r2. OK. So, if I understand correctly, gdb can use some conditional breakpoint, and it is possible that this insn won't generate the trap? Then this patch is not right, or at least we need another change on top? Once again. Suppose that gdb installs the TRAP_IF_R1_GT_R2. After that uprobe_register() is called, but it won't change this insn because verify_opcode() returns 0. Then the probed task hits this breakoint with r1 r2 and we do not report this event. So. I still think that we actually need something like below, and powerpc should reimplement is_trap_insn() to use is_trap(insn). No? Oleg. --- x/kernel/events/uprobes.c +++ x/kernel/events/uprobes.c @@ -532,6 +532,12 @@ static int copy_insn(struct uprobe *upro return __copy_insn(mapping, filp, uprobe-arch.insn, bytes, uprobe-offset); } +bool __weak is_trap_insn(uprobe_opcode_t *insn) +{ + // powerpc: should use is_trap() + return is_swbp_insn(insn); +} + static int prepare_uprobe(struct uprobe *uprobe, struct file *file, struct mm_struct *mm, unsigned long vaddr) { @@ -550,7 +556,7 @@ static int prepare_uprobe(struct uprobe goto out; ret = -ENOTSUPP; - if (is_swbp_insn((uprobe_opcode_t *)uprobe-arch.insn)) + if (is_trap_insn((uprobe_opcode_t *)uprobe-arch.insn)) goto out; ret = arch_uprobe_analyze_insn(uprobe-arch, mm, vaddr); @@ -1452,7 +1458,7 @@ static int is_swbp_at_addr(struct mm_str copy_opcode(page, vaddr, opcode); put_page(page); out: - return is_swbp_insn(opcode); + return is_trap_insn(opcode); } static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
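On the powerpc side, the override of the weak helper that is being suggested here would then be a one-liner along these lines (a sketch only):

	/* arch/powerpc: sketch of the suggested override of the __weak default */
	bool is_trap_insn(uprobe_opcode_t *insn)
	{
		return is_trap(*insn);
	}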
Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints
On 03/20, Ananth N Mavinakayanahalli wrote: On Wed, Mar 20, 2013 at 01:43:01PM +0100, Oleg Nesterov wrote: On 03/20, Oleg Nesterov wrote: But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to verify. If not, we need 2 definitions. is_uprobe_insn() should still check insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap(). And I am just curious, could you explain how X and UPROBE_SWBP_INSN differ? IOW, if I wasn't clear... Lets forget about gdb/etc for the moment. Suppose we apply the patch below. Will uprobes on powerpc work? If yes, then your patch should be fine. If not, we probably need more changes. Yes, it will work fine. Even if this new insn is conditional? Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] lglock: add read-preference local-global rwlock
On 03/05, Lai Jiangshan wrote: On 03/03/13 01:06, Oleg Nesterov wrote: On 03/02, Michel Lespinasse wrote: My version would be slower if it needs to take the slow path in a reentrant way, but I'm not sure it matters either :) I'd say, this doesn't matter at all, simply because this can only happen if we race with the active writer. It can also happen when interrupted. (still very rarely) arch_spin_trylock() ---interrupted, __this_cpu_read() returns 0. arch_spin_trylock() fails slowpath, any nested will be slowpath too. ... ..._read_unlock() ---interrupt __this_cpu_inc() Yes sure. Or it can take the local lock after we already take the global fallback_lock. But the same can happen with FALLBACK_BASE, just because we need to take a lock (local or global) first, then increment the counter. (I worry too much. I tend to remove FALLBACK_BASE now, we should add it only after we proved we needed it, this part is not proved) Agreed, great ;) Oleg.
Re: [PATCH V2] lglock: add read-preference local-global rwlock
On 03/05, Lai Jiangshan wrote: On 03/03/13 01:20, Oleg Nesterov wrote: On 03/02, Lai Jiangshan wrote: +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw) +{ + switch (__this_cpu_read(*lgrw-reader_refcnt)) { + case 1: + __this_cpu_write(*lgrw-reader_refcnt, 0); + lg_local_unlock(lgrw-lglock); + return; + case FALLBACK_BASE: + __this_cpu_write(*lgrw-reader_refcnt, 0); + read_unlock(lgrw-fallback_rwlock); + rwlock_release(lg-lock_dep_map, 1, _RET_IP_); I guess case 1: should do rwlock_release() too. Already do it in lg_local_unlock(lgrw-lglock); before it returns. (I like reuse old code) Yes, I was wrong thanks. Another case when I didn't notice that you re-use the regular lg_ code... We need rwlock_acquire_read() even in the fast-path, and this acquire_read should be paired with rwlock_acquire() in _write_lock(), but it does spin_acquire(lg-lock_dep_map). Yes, currently this is the same (afaics) but perhaps fallback_rwlock-dep_map would be more clean. I can't tell which one is better. I try to use fallback_rwlock-dep_map later. I am not sure which one should be better too, please check. Again, I forgot that _write_lock/unlock use lg_global_*() code. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH V2] lglock: add read-preference local-global rwlock
On 03/02, Oleg Nesterov wrote: On 03/02, Lai Jiangshan wrote: +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw) +{ + switch (__this_cpu_read(*lgrw-reader_refcnt)) { + case 1: + __this_cpu_write(*lgrw-reader_refcnt, 0); + lg_local_unlock(lgrw-lglock); + return; + case FALLBACK_BASE: + __this_cpu_write(*lgrw-reader_refcnt, 0); + read_unlock(lgrw-fallback_rwlock); + rwlock_release(lg-lock_dep_map, 1, _RET_IP_); I guess case 1: should do rwlock_release() too. Otherwise, at first glance looks correct... However, I still think that FALLBACK_BASE only adds the unnecessary complications. But even if I am right this is subjective of course, please feel free to ignore. Yes, but... And btw, I am not sure about lg-lock_dep_map, perhaps we should use fallback_rwlock-dep_map ? We need rwlock_acquire_read() even in the fast-path, and this acquire_read should be paired with rwlock_acquire() in _write_lock(), but it does spin_acquire(lg-lock_dep_map). Yes, currently this is the same (afaics) but perhaps fallback_rwlock-dep_map would be more clean. Please ignore this part. I missed that lg_rwlock_global_write_lock() relies on lg_global_lock(), and I just noticed that it does rwlock_acquire(lg-lock_dep_map). Hmm. But then I do not understand the lglock annotations. Obviously, rwlock_acquire_read() in lg_local_lock() can't even detect the simplest deadlock, say, lg_local_lock(LOCK) + lg_local_lock(LOCK). Not to mention spin_lock(X) + lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X). OK, I understand that it is not easy to make these annotations correct... Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] lglock: add read-preference local-global rwlock
On 03/02, Lai Jiangshan wrote: On 02/03/13 02:28, Oleg Nesterov wrote: Lai, I didn't read this discussion except the code posted by Michel. I'll try to read this patch carefully later, but I'd like to ask a couple of questions. This version looks more complex than Michel's, why? Just curious, I am trying to understand what I missed. See http://marc.info/?l=linux-kernel&m=136196350213593 Michel changed my old draft version a little, his version is good enough for me. Yes, I see. But imho Michel suggested a valuable cleanup, the code becomes even simpler with the same performance. Your v2 looks almost correct to me, but I still think it makes sense to incorporate the simplification from Michel. My new version tries to add a little better nestable support with only adding single __this_cpu_op() in _read_[un]lock(). How? Afaics with or without FALLBACK_BASE you need _read + _inc/dec in _read_lock/unlock. Oleg.
Re: [PATCH] lglock: add read-preference local-global rwlock
On 03/02, Michel Lespinasse wrote: My version would be slower if it needs to take the slow path in a reentrant way, but I'm not sure it matters either :) I'd say, this doesn't matter at all, simply because this can only happen if we race with the active writer. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH V2] lglock: add read-preference local-global rwlock
On 03/02, Lai Jiangshan wrote: +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw) +{ + switch (__this_cpu_read(*lgrw-reader_refcnt)) { + case 1: + __this_cpu_write(*lgrw-reader_refcnt, 0); + lg_local_unlock(lgrw-lglock); + return; + case FALLBACK_BASE: + __this_cpu_write(*lgrw-reader_refcnt, 0); + read_unlock(lgrw-fallback_rwlock); + rwlock_release(lg-lock_dep_map, 1, _RET_IP_); I guess case 1: should do rwlock_release() too. Otherwise, at first glance looks correct... However, I still think that FALLBACK_BASE only adds the unnecessary complications. But even if I am right this is subjective of course, please feel free to ignore. And btw, I am not sure about lg-lock_dep_map, perhaps we should use fallback_rwlock-dep_map ? We need rwlock_acquire_read() even in the fast-path, and this acquire_read should be paired with rwlock_acquire() in _write_lock(), but it does spin_acquire(lg-lock_dep_map). Yes, currently this is the same (afaics) but perhaps fallback_rwlock-dep_map would be more clean. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] lglock: add read-preference local-global rwlock
Lai, I didn't read this discussion except the code posted by Michel. I'll try to read this patch carefully later, but I'd like to ask a couple of questions. This version looks more complex than Michel's, why? Just curious, I am trying to understand what I missed. See http://marc.info/?l=linux-kernelm=136196350213593 And I can't understand FALLBACK_BASE... OK, suppose that CPU_0 does _write_unlock() and releases -fallback_rwlock. CPU_1 does _read_lock(), and ... +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw) +{ + struct lglock *lg = lgrw-lglock; + + preempt_disable(); + rwlock_acquire_read(lg-lock_dep_map, 0, 0, _RET_IP_); + if (likely(!__this_cpu_read(*lgrw-reader_refcnt))) { + if (!arch_spin_trylock(this_cpu_ptr(lg-lock))) { _trylock() fails, + read_lock(lgrw-fallback_rwlock); + __this_cpu_add(*lgrw-reader_refcnt, FALLBACK_BASE); so we take -fallback_rwlock and -reader_refcnt == FALLBACK_BASE. CPU_0 does lg_global_unlock(lgrw-lglock) and finishes _write_unlock(). Interrupt handler on CPU_1 does _read_lock() notices -reader_refcnt != 0 and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1. Then irq does _read_unlock(), and +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw) +{ + switch (__this_cpu_dec_return(*lgrw-reader_refcnt)) { + case 0: + lg_local_unlock(lgrw-lglock); + return; + case FALLBACK_BASE: + __this_cpu_sub(*lgrw-reader_refcnt, FALLBACK_BASE); + read_unlock(lgrw-fallback_rwlock); hits this case? Doesn't look right, but most probably I missed something. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
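For comparison, the simpler scheme being referred to (essentially Michel's variant, without FALLBACK_BASE) looks roughly like the sketch below. The lock side follows the code quoted in the percpu_rwlock thread below; the unlock side is the obvious counterpart, reconstructed here rather than quoted from the thread, and the lockdep annotations are omitted.

	void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
	{
		preempt_disable();

		/* fast path: already a nested reader, or the per-cpu spinlock is free */
		if (__this_cpu_read(*lgrw->local_refcnt) ||
		    arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
			__this_cpu_inc(*lgrw->local_refcnt);
			return;
		}

		/* slow path: a writer holds (or is taking) the per-cpu locks */
		read_lock(&lgrw->fallback_rwlock);
	}

	void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
	{
		if (likely(__this_cpu_read(*lgrw->local_refcnt))) {
			if (!__this_cpu_dec_return(*lgrw->local_refcnt))
				arch_spin_unlock(this_cpu_ptr(lgrw->lglock->lock));
		} else {
			read_unlock(&lgrw->fallback_rwlock);
		}

		preempt_enable();
	}

An interrupt that hits between the __this_cpu_read() and the __this_cpu_inc() in the fast path simply pushes the nested reader to the slow path, which is the case discussed as acceptable in the percpu_rwlock thread below.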
Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/28, Michel Lespinasse wrote: On Thu, Feb 28, 2013 at 3:25 AM, Oleg Nesterov o...@redhat.com wrote: On 02/27, Michel Lespinasse wrote: +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw) +{ + preempt_disable(); + + if (__this_cpu_read(*lgrw->local_refcnt) || + arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) { + __this_cpu_inc(*lgrw->local_refcnt); Please look at __this_cpu_generic_to_op(). You need this_cpu_inc() to avoid the race with irqs. The same for _read_unlock. Hmmm, I was thinking that this was safe because while interrupts might modify local_refcnt to acquire a nested read lock, they are expected to release that lock as well which would set local_refcnt back to its original value ??? Yes, yes, this is correct. I meant that (in general, x86 is fine) __this_cpu_inc() itself is not irq-safe. It simply does pcp += 1. this_cpu_inc() is fine, _this_cpu_generic_to_op() does cli/sti around. I know this only because I made the same mistake recently, and Srivatsa explained the problem to me ;) Oleg.
Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/28, Oleg Nesterov wrote: On 02/28, Michel Lespinasse wrote: On Thu, Feb 28, 2013 at 3:25 AM, Oleg Nesterov o...@redhat.com wrote: On 02/27, Michel Lespinasse wrote: +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw) +{ + preempt_disable(); + + if (__this_cpu_read(*lgrw-local_refcnt) || + arch_spin_trylock(this_cpu_ptr(lgrw-lglock-lock))) { + __this_cpu_inc(*lgrw-local_refcnt); Please look at __this_cpu_generic_to_op(). You need this_cpu_inc() to avoid the race with irs. The same for _read_unlock. Hmmm, I was thinking that this was safe because while interrupts might modify local_refcnt to acquire a nested read lock, they are expected to release that lock as well which would set local_refcnt back to its original value ??? Yes, yes, this is correct. I meant that (in general, x86 is fine) __this_cpu_inc() itself is not irq-safe. It simply does pcp += 1. this_cpu_inc() is fine, _this_cpu_generic_to_op() does cli/sti around. Just in case, it is not that I really understand why __this_cpu_inc() can race with irq in this particular case (given that irq handler should restore the counter). So perhaps I am wrong again. The comments in include/linux/percpu.h look confusing to me, and I simply know nothing about !x86 architectures. But since, say, preempt_disable() doesn't do anything special then probably __this_cpu_inc() is fine too. In short: please ignore me ;) Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/27, Michel Lespinasse wrote: +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw) +{ + preempt_disable(); + + if (__this_cpu_read(*lgrw->local_refcnt) || + arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) { + __this_cpu_inc(*lgrw->local_refcnt); Please look at __this_cpu_generic_to_op(). You need this_cpu_inc() to avoid the race with irqs. The same for _read_unlock. But otherwise I agree, looks like a clever and working idea to me. And simple! There is an interesting case where lg_rwlock_local_read_lock could be interrupted after getting the local lglock but before incrementing local_refcnt to 1; if that happens any nested readers within that interrupt will have to take the global rwlock read side. I think this is perfectly acceptable Agreed. Or an interrupt can do spin_trylock(percpu-lock) after we take the global ->fallback_rwlock (if we race with write_lock + write_unlock), but I do not see any possible deadlock in this case. Oleg.
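For reference, the generic (non-x86) fallback being referred to looks roughly like this, paraphrased from the include/linux/percpu.h of that era (the exact helper names have changed over time). The whole point is the irq protection around the read-modify-write; __this_cpu_inc(), by contrast, expands to a plain "pcp += 1" with no such protection, so on these architectures an interrupt can land between the load and the store.

	/* Paraphrase of the generic fallback for this_cpu_*() operations. */
	#define _this_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		raw_local_irq_save(flags);				\
		*__this_cpu_ptr(&(pcp)) op val;				\
		raw_local_irq_restore(flags);				\
	} while (0)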
Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/08, Paul E. McKenney wrote: On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote: void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock) { - read_unlock(pcpu_rwlock-global_rwlock); We need an smp_mb() here to keep the critical section ordered before the this_cpu_dec() below. Otherwise, if a writer shows up just after we exit the fastpath, that writer is not guaranteed to see the effects of our critical section. Equivalently, the prior read-side critical section just might see some of the writer's updates, which could be a bit of a surprise to the reader. Agreed, we should not assume that a reader doesn't write. And we should ensure that this read section actually completes before this_cpu_dec(). + /* +* We never allow heterogeneous nesting of readers. So it is trivial +* to find out the kind of reader we are, and undo the operation +* done by our corresponding percpu_read_lock(). +*/ + if (__this_cpu_read(*pcpu_rwlock-reader_refcnt)) { + this_cpu_dec(*pcpu_rwlock-reader_refcnt); + smp_wmb(); /* Paired with smp_rmb() in sync_reader() */ Given an smp_mb() above, I don't understand the need for this smp_wmb(). Isn't the idea that if the writer sees -reader_refcnt decremented to zero, it also needs to see the effects of the corresponding reader's critical section? I am equally confused ;) OTOH, we can probably aboid any barrier if reader_nested_percpu() == T. +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) +{ + unsigned int cpu; + + drop_writer_signal(pcpu_rwlock, smp_processor_id()); Why do we drop ourselves twice? More to the point, why is it important to drop ourselves first? And don't we need mb() _before_ we clear -writer_signal ? +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, + unsigned int cpu) +{ + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ As I understand it, the purpose of this memory barrier is to ensure that the stores in drop_writer_signal() happen before the reads from -reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the race between a new reader attempting to use the fastpath and this writer acquiring the lock. Unless I am confused, this must be smp_mb() rather than smp_rmb(). And note that before sync_reader() we call announce_writer_active() which already adds mb() before sync_all_readers/sync_reader, so this rmb() looks unneeded. But, at the same time, could you confirm that we do not need another mb() after sync_all_readers() in percpu_write_lock() ? I mean, without mb(), can't this reader_uses_percpu_refcnt() LOAD leak into the critical section protected by -global_rwlock? Then this LOAD can be re-ordered with other memory operations done by the writer. Srivatsa, I think that the code would be more understandable if you kill the helpers like sync_reader/raise_writer_signal. Perhaps even all write helpers, I am not sure. At least, it seems to me that all barriers should be moved to percpu_write_lock/unlock. But I won't insist of course, up to you. And cosmetic nit... How about struct xxx { unsigned long reader_refcnt; boolwriter_signal; } struct percpu_rwlock { struct xxx __percpu *xxx; rwlock_tglobal_rwlock; }; ? This saves one alloc_percpu() and ensures that reader_refcnt/writer_signal are always in the same cache-line. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
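The cosmetic suggestion at the end, written out: keep both per-cpu fields in one structure so that a single alloc_percpu() covers them and they share a cache line. A sketch, with a placeholder name for the per-cpu struct:

	struct percpu_rwlock_cpu {		/* placeholder name */
		unsigned long	reader_refcnt;
		bool		writer_signal;
	};

	struct percpu_rwlock {
		struct percpu_rwlock_cpu __percpu *cpu;
		rwlock_t			   global_rwlock;
	};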
Re: [PATCH v5 05/45] percpu_rwlock: Make percpu-rwlocks IRQ-safe, optimally
only one cosmetic nit...

On 01/22, Srivatsa S. Bhat wrote:
> +#define READER_PRESENT		(1UL << 16)
> +#define READER_REFCNT_MASK	(READER_PRESENT - 1)
> +
>  #define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)		\
>  	(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
>
>  #define reader_nested_percpu(pcpu_rwlock)			\
> -	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> +	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) & READER_REFCNT_MASK)
>
>  #define writer_active(pcpu_rwlock)				\
>  	(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))

I think this all can go to lib/percpu-rwlock.c. Nobody needs to know these implementation details.

Oleg.
Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/11, Srivatsa S. Bhat wrote: On 02/10/2013 11:36 PM, Oleg Nesterov wrote: +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) +{ + unsigned int cpu; + + drop_writer_signal(pcpu_rwlock, smp_processor_id()); Why do we drop ourselves twice? More to the point, why is it important to drop ourselves first? And don't we need mb() _before_ we clear -writer_signal ? Oh, right! Or, how about moving announce_writer_inactive() to _after_ write_unlock()? Not sure this will help... but, either way it seems we have another problem... percpu_rwlock tries to be generic. This means we should ignore its usage in hotplug, and _write_lock should not race with _write_unlock. IOW. Suppose that _write_unlock clears -writer_signal. We need to ensure that this can't race with another write which wants to set this flag. Perhaps it should be counter as well, and it should be protected by the same -global_rwlock, but _write_lock() should drop it before sync_all_readers() and then take it again? +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, +unsigned int cpu) +{ + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ As I understand it, the purpose of this memory barrier is to ensure that the stores in drop_writer_signal() happen before the reads from -reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the race between a new reader attempting to use the fastpath and this writer acquiring the lock. Unless I am confused, this must be smp_mb() rather than smp_rmb(). And note that before sync_reader() we call announce_writer_active() which already adds mb() before sync_all_readers/sync_reader, so this rmb() looks unneeded. My intention was to help the writer see the -reader_refcnt drop to zero ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer. Hmm, interesting... Not sure, but can't really comment. However I can answer your next question: Please correct me if my understanding of memory barriers is wrong here.. Who? Me??? No we have paulmck for that ;) Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
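Very roughly, the "writer_signal as a counter protected by ->global_rwlock" idea would look like this (a sketch only, to illustrate the ordering; it assumes writer_signal becomes an integer counter and reuses the patch's sync_all_readers() helper):

	void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
	{
		unsigned int cpu;

		/*
		 * Bump the per-cpu writer counters under the global lock so
		 * that concurrent _write_lock/_write_unlock callers cannot
		 * corrupt them.
		 */
		write_lock(&pcpu_rwlock->global_rwlock);
		for_each_possible_cpu(cpu)
			per_cpu(*pcpu_rwlock->writer_signal, cpu)++;
		write_unlock(&pcpu_rwlock->global_rwlock);

		smp_mb();
		sync_all_readers(pcpu_rwlock);	/* wait for fastpath readers */

		write_lock(&pcpu_rwlock->global_rwlock);
	}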
Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
On 02/11, Srivatsa S. Bhat wrote:
> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> > > +{
> > > +	unsigned int cpu;
> > > +
> > > +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >
> > Why do we drop ourselves twice? More to the point, why is it important to
> > drop ourselves first?
>
> I don't see where we are dropping ourselves twice. Note that we are no
> longer in the cpu_online_mask, so the 'for' loop below won't include us.
> So we need to manually drop ourselves. It doesn't matter whether we drop
> ourselves first or later.

Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under _write_lock.

Perhaps _write_lock/unlock should use for_each_possible_cpu() instead? Hmm... I think this makes sense anyway. Otherwise, in theory, percpu_write_lock(&random_non_hotplug_lock) can race with cpu_up?

Oleg.
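i.e., something along these lines (a sketch; raise_writer_signal()/drop_writer_signal() are the helpers from the quoted patch):

	static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
	{
		unsigned int cpu;

		/*
		 * Walk the possible mask, not the online mask: a CPU on its
		 * way up or down can still run readers, and this also keeps
		 * writers that have nothing to do with hotplug from racing
		 * with cpu_up()/cpu_down(). It also removes the need to drop
		 * ourselves by hand.
		 */
		for_each_possible_cpu(cpu)
			raise_writer_signal(pcpu_rwlock, cpu);

		smp_mb();
	}

	static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
	{
		unsigned int cpu;

		smp_mb();	/* order the write section before clearing */

		for_each_possible_cpu(cpu)
			drop_writer_signal(pcpu_rwlock, cpu);
	}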
Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
On 01/23, Michel Lespinasse wrote: On Tue, Jan 22, 2013 at 11:32 AM, Steven Rostedt rost...@goodmis.org wrote: I thought global locks are now fair. That is, a reader will block if a writer is waiting. Hence, the above should deadlock on the current rwlock_t types. I believe you are mistaken here. struct rw_semaphore is fair (and blocking), but rwlock_t is unfair. The reason we can't easily make rwlock_t fair is because tasklist_lock currently depends on the rwlock_t unfairness - tasklist_lock readers typically don't disable local interrupts, and tasklist_lock may be acquired again from within an interrupt, which would deadlock if rwlock_t was fair and a writer was queued by the time the interrupt is processed. Yes. And, iirc, it was even documented somewhere that while rwlock_t is not really nice, it is good to share the locking with interrupts. You do not need to disable irqs. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
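For anyone unfamiliar with it, the pattern that relies on this unfairness looks like this (illustration only):

	static void example_reader(void)
	{
		/* tasklist_lock readers typically do NOT disable interrupts */
		read_lock(&tasklist_lock);
		/*
		 * If an interrupt fires here and its handler also takes
		 * read_lock(&tasklist_lock), today this always succeeds
		 * because rwlock_t readers are never queued behind a waiting
		 * writer. With a fair rwlock the nested read_lock() would
		 * spin on that queued writer, which in turn waits for us:
		 * deadlock.
		 */
		read_unlock(&tasklist_lock);
	}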
Re: [PATCH v4 2/2] powerpc: Uprobes port to powerpc
On 08/23, Benjamin Herrenschmidt wrote:
> On Thu, 2012-08-23 at 11:02 +0530, Srikar Dronamraju wrote:
> > insn is updated/accessed in the arch independent code. Size of
> > uprobe_opcode_t could be different for different archs. uprobe_opcode_t
> > represents the size of the smallest breakpoint instruction for an arch.
> > Hence u8 works out the best. I know we could still use uprobe_opcode_t
> > and achieve the same. In which case, we would have to interpret
> > MAX_UINSN_BYTES differently. Do you see any advantages of using
> > uprobe_opcode_t instead of u8 across archs?
>
> But don't you actively rely on the fact that on powerpc, unlike x86, you
> -can- atomically replace an instruction with a single 32-bit store ?

I must have missed something... But powerpc does not replace an instruction, the arch independent code does this and it assumes that uprobe->arch.insn is u8[MAX_UINSN_BYTES].

Perhaps you meant that on powerpc it is safe to replace the insn even if this can race with some CPU executing this code? But uprobes has to replace the original page anyway, we should not write to ->vm_file.

I agree that memcpy() in arch_uprobe_analyze_insn() and arch_uprobe_skip_sstep() looks a bit strange. Maybe powerpc can do

	struct arch_uprobe {
		union {
			u8  insn[MAX_UINSN_BYTES];
			u32 ainsn;
		};
	};

and use auprobe->ainsn directly, I dunno.

But probably I misunderstood you.

Oleg.
Re: [PATCH v4 2/2] powerpc: Uprobes port to powerpc
On 08/22, Ananth N Mavinakayanahalli wrote:
> +int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned int insn;
> +
> +	if (addr & 0x03)
> +		return -EINVAL;
> +
> +	memcpy(&insn, auprobe->insn, MAX_UINSN_BYTES);
> +	if (is_trap(insn))
> +		return -ENOTSUPP;

This addresses the only concern I had, thanks.

Once again, I am in no position to review the powerpc code, but since I made some noise before: I think the patch is fine.

Oleg.
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On 08/21, Ananth N Mavinakayanahalli wrote: On Fri, Aug 17, 2012 at 05:00:31PM +0200, Oleg Nesterov wrote: We should also take care of the in-memory copy, in case gdb had inserted a breakpoint at the same location, right? gdb (or even the application itself) and uprobes can obviously confuse each other, in many ways, and we can do nothing at least currently. Just we should ensure that the kernel can't crash/hang/etc. Absolutely. The proper fix for this at least from a breakpoint insertion perspective is to educate gdb (possibly ptrace itself) to fail on a breakpoint insertion request on an already existing one. Oh, I don't think this is possible. And there are other problems like this. Uprobe can confuse gdb too, in many ways. For example, uprobe_register() can wrongly _remove_ int3 installed by gdb. The proper fix, I think, is to rework the whole idea about uprobe bps, but this is really in the long term. install_breakpoint() should only unmap the page and mark its pte as owned by kernel, FOLL_WRITE should not work. Something like migration or PROT_NONE. The task itself should install bp during the page fault. And we need the backing store for the pages with uprobes. Yes, this all is very vague and I can be wrong. Anyway, this is relatively minor, we have more serious problems. Updating is_swbp_insn() per-arch where needed will take care of both the cases, 'cos it gets called before arch_analyze_uprobe_insn() too. For example. set_swbp()-is_swbp_insn() == T means that (for example) uprobe_register() and uprobe_mmap() raced with each other and there is no need for set_swbp(). This is true for Intel like architectures that have *one* swbp instruction. On Powerpc, gdb for instance, can insert a trap variant at the address. Therefore, is_swbp_insn() by definition should return true for all trap variants. Not in this case, I think. OK, I was going to do this later, but this discussion makes me think I should try to send the patch sooner. set_swbp()-is_swbp_at_addr() is simply unneeded and in fact should be considered as unnecessary pessimization. set_orig_insn()-is_swbp_at_addr() makes more sense, but it can't fix all races with userpace. Still it should die. OK. I will separate out the is_swbp_insn() change into a separate patch. Great thanks. And if we remove is_swbp_insn() from set_swbp() and set_orig_insn() then the semantics of is_swbp_insn() will much more clear, and in this case I powerpc probably really needs to change it. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On 08/17, Ananth N Mavinakayanahalli wrote: On Thu, Aug 16, 2012 at 05:21:12PM +0200, Oleg Nesterov wrote: Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic code, should only return true if insn == UPROBE_SWBP_INSN (just in case, this logic needs more fixes but this is offtopic). I think it does... If powerpc has another insn(s) which can trigger powerpc's do_int3() counterpart, they should be rejected by arch_uprobe_analyze_insn(). I think. The insn that gets passed to arch_uprobe_analyze_insn() is copy_insn()'s version, which is the file copy of the instruction. Yes, exactly. And we are going to single-step this saved uprobe-arch.insn, even if gdb/whatever replaces the original insn later or already replaced. So arch_uprobe_analyze_insn() should reject the unsafe instructions which we can't step over safely. We should also take care of the in-memory copy, in case gdb had inserted a breakpoint at the same location, right? gdb (or even the application itself) and uprobes can obviously confuse each other, in many ways, and we can do nothing at least currently. Just we should ensure that the kernel can't crash/hang/etc. Updating is_swbp_insn() per-arch where needed will take care of both the cases, 'cos it gets called before arch_analyze_uprobe_insn() too. For example. set_swbp()-is_swbp_insn() == T means that (for example) uprobe_register() and uprobe_mmap() raced with each other and there is no need for set_swbp(). However, find_active_uprobe()-is_swbp_at_addr()-is_swbp_insn() is different, true confirms that this insn has triggered do_int3() and thus we need send_sig(SIGTRAP) (just in case, this is not strictly correct too but offtopic again). We definitely need more changes/fixes/improvements in this area. And perhaps powerpc requires more changes in the arch-neutral code, I dunno. In particular, I think is_swbp_insn() should have a single caller, is_swbp_at_addr(), and this caller should always play with current-mm. And many, many other changes in the long term. So far I think that, if powerpc really needs to change is_swbp_insn(), it would be better to make another patch and discuss this change. But of course I can't judge. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
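For reference, the arch-agnostic helper whose semantics are being discussed is tiny (roughly, as it reads in the current tree):

	/* kernel/events/uprobes.c */
	bool __weak is_swbp_insn(uprobe_opcode_t *insn)
	{
		/* true only for the breakpoint insn uprobes itself installs */
		return *insn == UPROBE_SWBP_INSN;
	}

i.e. it only recognizes the one opcode uprobes installs, which is the semantics described above; rejecting other trap variants is left to arch_uprobe_analyze_insn().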
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On 08/16, Ananth N Mavinakayanahalli wrote: On Thu, Aug 16, 2012 at 07:41:53AM +1000, Benjamin Herrenschmidt wrote: On Wed, 2012-08-15 at 18:59 +0200, Oleg Nesterov wrote: On 07/26, Ananth N Mavinakayanahalli wrote: From: Ananth N Mavinakayanahalli ana...@in.ibm.com This is the port of uprobes to powerpc. Usage is similar to x86. I am just curious why this series was ignored by powerpc maintainers... Because it arrived too late for the previous merge window considering my limited bandwidth for reviewing things and that nobody else seems to have reviewed it :-) It's still on track for the next one, and I'm hoping to dedicate most of next week going through patches doing a powerpc -next. Thanks Ben! Great! Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe UPROBE_SWBP_INSN (at least) ? (I assume that emulate_step() can't handle this case but of course I do not understand arch/powerpc/lib/sstep.c) Note that uprobe_pre_sstep_notifier() sets utask-state = UTASK_BP_HIT without any checks. This doesn't look right if it was UTASK_SSTEP... But again, I do not know what powepc will actually do if we try to single-step over UPROBE_SWBP_INSN. Ananth ? set_swbp() will return -EEXIST to install_breakpoint if we are trying to put a breakpoint on UPROBE_SWBP_INSN. not really, this -EEXIST (already removed by recent changes) means that bp was already installed. But this doesn't matter, So, the arch agnostic code itself takes care of this case... Yes. I forgot about install_breakpoint()-is_swbp_insn() check which returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does this. or am I missing something? No, it is me. However, I see that we need a powerpc specific is_swbp_insn() implementation since we will have to take care of all the trap variants. Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic code, should only return true if insn == UPROBE_SWBP_INSN (just in case, this logic needs more fixes but this is offtopic). If powerpc has another insn(s) which can trigger powerpc's do_int3() counterpart, they should be rejected by arch_uprobe_analyze_insn(). I think. I will need to update the patches based on changes being made by Oleg and Sebastien for the single-step issues. Perhaps you can do this in a separate change? We need some (simple) changes in the arch agnostic code first, they should not break poweppc. These changes are still under discussion. Once we have __weak arch_uprobe_step* you can reimplement these hooks and fix the problems with single-stepping. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On 07/26, Ananth N Mavinakayanahalli wrote:
> From: Ananth N Mavinakayanahalli ana...@in.ibm.com
>
> This is the port of uprobes to powerpc. Usage is similar to x86.

I am just curious why this series was ignored by powerpc maintainers...

Of course I cannot review this code, I know nothing about powerpc, but the patches look simple/straightforward. Paul, Benjamin?

Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe UPROBE_SWBP_INSN (at least)? (I assume that emulate_step() can't handle this case, but of course I do not understand arch/powerpc/lib/sstep.c.)

Note that uprobe_pre_sstep_notifier() sets utask->state = UTASK_BP_HIT without any checks. This doesn't look right if it was UTASK_SSTEP... But again, I do not know what powerpc will actually do if we try to single-step over UPROBE_SWBP_INSN.

Oleg.
Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
On 06/14, Srikar Dronamraju wrote: * Oleg Nesterov o...@redhat.com [2012-06-13 21:15:19]: For example. Suppose there is some instruction in /lib64/libc.so which is valid for 64-bit, but not for 32-bit. Suppose that a 32-bit application does mmap(/lib64/libc.so, PROT_EXEC). How correct is it to have a 32 bit binary link to a 64 bit binary/library? No, I didn't mean this. I guess you misunderstood my point, see below. Now. If vma_prio_tree_foreach() finds this 32-bit mm first, uprobe_register() fails even if there are other 64-bit applications which could be traced. Or. uprobe_register() succeeds because it finds a 64-bit mm first, and then that 32-bit application actually executes the invalid insn. We can move arch_uprobe_analyze_insn() outside of !UPROBE_COPY_INSN block. Or, perhaps, validate_insn_bits() should call both validate_insn_32bits() and validate_insn_64bits(), and set the UPROBE_VALID_IF_32 / UPROBE_VALID_IF_64 flags. install_breakpoint() should do the additinal check before set_swbp() and verify that .ia32_compat matches UPROBE_VALID_IF_*. What do you think? Lets say we do find a 32 bit app and 64 bit app using the same library and the underlying instruction is valid for tracing in 64 bit and not 32 bit. So when we are registering, and failed to insert a breakpoint for the 32 bit app, should we just bail out or should we return a failure? I do not really know, I tend to think we should not fail. But this is another story... Look. Suppose that a 32-bit app starts after uprobe_register() succeeds. In this case we have no option, uprobe_mmap()-install_breakpoint() should silently fail. Currently it doesn't, this is one of the reasons why I think the validation logic is wrong. And. if install_breakpoint() can fail later anyway (in this case), then I think uprobe_register() should not fail. But probably this needs more discussion. I would probably prefer to read the underlying file something similar to what exec does and based on the magic decipher if we should verify for 32 bit instructions or 64 bit instructions. But this can't protect from the malicious user who does mmap(64-bit-code, PROT_EXEC) from a 32-bit app, and this can confuse uprobes even if that 32-bit app never tries to actually execute that 64-bit-code. That is why I think we need the additional (and arch-dependant) check before every set_swbp(), but arch_uprobe_analyze_insn/etc should not depend on task/mm/vaddr/whatever. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
On 06/12, Oleg Nesterov wrote: On 06/12, Srikar Dronamraju wrote: Note also that we should move this !UPROBE_COPY_INSN from install_breakpoint() to somewhere near alloc_uprobe(). This code is called only once, it looks a bit strange to use the random mm (the first mm vma_prio_tree_foreach() finds) and its mapping to verify the insn. In fact this is simply not correct and should be fixed, note that on x86 arch_uprobe_analyze_insn() checks The reason we delay the copy_insn to the first insert is because we have to get access to mm. For archs like x86, we want to know if the executable is 32 bit or not Yes. And this is wrong afaics. Once again. This !UPROBE_COPY_INSN code is called only once, and it uses the random mm. After that install_breakpoint() just calls set_swbp(another_mm) while the insn can be invalid because another_mm-ia32_compat != mm-ia32_compat. So in effect, if we get access to struct file corresponding to the inode and if the inode corresponds to 32 bit executable file or 64 bit executable file during register, then we can move it around alloc_uprobe(). I don't think this can work. I have another simple fix in mind, I'll write another email later. For example. Suppose there is some instruction in /lib64/libc.so which is valid for 64-bit, but not for 32-bit. Suppose that a 32-bit application does mmap(/lib64/libc.so, PROT_EXEC). Now. If vma_prio_tree_foreach() finds this 32-bit mm first, uprobe_register() fails even if there are other 64-bit applications which could be traced. Or. uprobe_register() succeeds because it finds a 64-bit mm first, and then that 32-bit application actually executes the invalid insn. We can move arch_uprobe_analyze_insn() outside of !UPROBE_COPY_INSN block. Or, perhaps, validate_insn_bits() should call both validate_insn_32bits() and validate_insn_64bits(), and set the UPROBE_VALID_IF_32 / UPROBE_VALID_IF_64 flags. install_breakpoint() should do the additinal check before set_swbp() and verify that .ia32_compat matches UPROBE_VALID_IF_*. What do you think? Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
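To make the proposal concrete, a sketch (UPROBE_VALID_IF_* and insn_valid_for_mm() are hypothetical names; validate_insn_32bits/64bits() are the existing x86 helpers named above):

	static int validate_insn_bits(struct uprobe *uprobe, struct insn *insn)
	{
		uprobe->flags &= ~(UPROBE_VALID_IF_32 | UPROBE_VALID_IF_64);

		if (!validate_insn_32bits(uprobe, insn))
			uprobe->flags |= UPROBE_VALID_IF_32;
		if (!validate_insn_64bits(uprobe, insn))
			uprobe->flags |= UPROBE_VALID_IF_64;

		/* fail registration only if the insn is valid for neither mode */
		return (uprobe->flags & (UPROBE_VALID_IF_32 | UPROBE_VALID_IF_64)) ?
			0 : -ENOTSUPP;
	}

	/* in install_breakpoint(), before set_swbp(): silently skip this mm
	 * if the copied insn is not valid for its mode */
	static bool insn_valid_for_mm(struct uprobe *uprobe, struct mm_struct *mm)
	{
		if (mm->context.ia32_compat)
			return uprobe->flags & UPROBE_VALID_IF_32;
		return uprobe->flags & UPROBE_VALID_IF_64;
	}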
Re: Q: a_ops->readpage() && struct file
On 06/13, Peter Zijlstra wrote:
> On Mon, 2012-06-11 at 21:09 +0200, Oleg Nesterov wrote:
> > Stupid question. I'm afraid the answer is no but I'll ask anyway.
> > Is it safe to pass filp == NULL to mapping->readpage()? In fact I do
> > not understand why it needs struct file* and I do not see any example
> > of actual usage.
>
> Looking at afs_readpage it looks like it's OK to pass in NULL. Same for
> nfs_readpage. They use the file, if provided, to avoid some lookups, but
> seem to deal with not having it.

Yes, and reiser4 does the same.

> This answer by example is of course not authoritative nor complete.

Yes... perhaps we should simply change this code to use NULL and collect the bug-reports ;)

Oleg.
Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
On 06/12, Srikar Dronamraju wrote:
> > Note also that we should move this !UPROBE_COPY_INSN from
> > install_breakpoint() to somewhere near alloc_uprobe(). This code is
> > called only once, it looks a bit strange to use the random mm (the
> > first mm vma_prio_tree_foreach() finds) and its mapping to verify the
> > insn. In fact this is simply not correct and should be fixed, note
> > that on x86 arch_uprobe_analyze_insn() checks
>
> The reason we delay the copy_insn to the first insert is because we have
> to get access to mm. For archs like x86, we want to know if the
> executable is 32 bit or not

Yes. And this is wrong afaics.

Once again. This !UPROBE_COPY_INSN code is called only once, and it uses the random mm. After that install_breakpoint() just calls set_swbp(another_mm) while the insn can be invalid because another_mm->ia32_compat != mm->ia32_compat.

> So in effect, if we get access to struct file corresponding to the inode
> and if the inode corresponds to 32 bit executable file or 64 bit
> executable file during register, then we can move it around
> alloc_uprobe().

I don't think this can work. I have another simple fix in mind, I'll write another email later.

Oleg.
Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
Ananth, Srikar,

I think the patch is correct and I am sorry for nit-picking, this is really minor. But,

On 06/08, Ananth N Mavinakayanahalli wrote:
> Changes in V2:
> Pass (unsigned long)addr

Well, IMHO, this is confusing. First of all, why do we have this addr or even vaddr? It should not exist. We pass it to copy_insn(), but for what?? copy_insn() should simply use uprobe->offset, the virtual address for this particular mapping does not matter at all. I am going to send the cleanup.

Note also that we should move this !UPROBE_COPY_INSN from install_breakpoint() to somewhere near alloc_uprobe(). This code is called only once, it looks a bit strange to use the random mm (the first mm vma_prio_tree_foreach() finds) and its mapping to verify the insn. In fact this is simply not correct and should be fixed, note that on x86 arch_uprobe_analyze_insn() checks mm->context.ia32_compat.

IOW, perhaps uprobe->offset makes more sense?

> --- linux-3.5-rc1.orig/kernel/events/uprobes.c
> +++ linux-3.5-rc1/kernel/events/uprobes.c
> @@ -697,7 +697,7 @@ install_breakpoint(struct uprobe *uprobe
>  	if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn))
>  		return -EEXIST;
>
> -	ret = arch_uprobe_analyze_insn(&uprobe->arch, mm);
> +	ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, addr);

Just fyi, this conflicts with "[PATCH 1/3] uprobes: install_breakpoint() should fail if is_swbp_insn() == T" I sent, but the conflict is trivial.

Oleg.
Q: a_ops->readpage() && struct file
On 06/11, Oleg Nesterov wrote:
> Note also that we should move this !UPROBE_COPY_INSN from
> install_breakpoint() to somewhere near alloc_uprobe().

The main problem is, uprobe_register() doesn't have struct file for read_mapping_page().

Stupid question. I'm afraid the answer is "no" but I'll ask anyway. Is it safe to pass filp == NULL to mapping->readpage()? In fact I do not understand why it needs struct file* and I do not see any example of actual usage.

Yes, I didn't try to grep very much and I understand that the filesystem can do something special. Say, it can use file->private_data... However, there is read_cache_page_gfp() which does use a_ops->readpage(filp = NULL), and the comment says nothing about the riskiness.

If not, is there any other way uprobe_register(struct inode *inode, loff_t offset) can read the page at this offset?

And btw, read_mapping_page() accepts void *data. Why? It uses filler == a_ops->readpage, so it shouldn't accept anything but a file pointer?

Please help, I know nothing about vfs.

Oleg.
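In other words, what uprobe_register() would like to do is roughly this (a sketch, assuming the answer to the NULL question is yes):

	static struct page *read_uprobe_page(struct inode *inode, loff_t offset)
	{
		/*
		 * read_mapping_page() forwards its last argument to
		 * a_ops->readpage() as the struct file pointer;
		 * uprobe_register() only has the inode, so it would have
		 * to pass NULL here.
		 */
		return read_mapping_page(inode->i_mapping,
					 offset >> PAGE_CACHE_SHIFT, NULL);
	}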
Re: [PATCH 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
On 06/06, Ananth N Mavinakayanahalli wrote:
> From: Ananth N Mavinakayanahalli ana...@in.ibm.com
>
> On RISC architectures like powerpc, instructions are fixed size.
> Instruction analysis on such platforms is just a matter of (insn % 4).
> Pass the vaddr at which the uprobe is to be inserted so that
> arch_uprobe_analyze_insn() can flag misaligned registration requests.

And the next patch checks vaddr & 0x03. But why do you need this new arg?

arch_uprobe_analyze_insn() could check

	container_of(auprobe, struct uprobe, arch)->offset & 0x3

with the same effect, no? vm_start/vm_pgoff are obviously page-aligned.

Oleg.
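i.e., something like the following on the powerpc side, with no extra argument (a sketch; struct uprobe is private to kernel/events/uprobes.c, so in practice this check would live behind a small helper there):

	int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm)
	{
		struct uprobe *uprobe = container_of(auprobe, struct uprobe, arch);

		/*
		 * vma->vm_start and vm_pgoff are page aligned, so the file
		 * offset alone determines the alignment of every mapped
		 * address the probe can land on.
		 */
		if (uprobe->offset & 0x03)
			return -EINVAL;

		return 0;
	}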
Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h
On 06/07, Eric Paris wrote:
> On Tue, 2011-06-07 at 19:19 +0200, Oleg Nesterov wrote:
> > With or without this patch, can't we call audit_syscall_exit() twice
> > if there is something else in _TIF_WORK_SYSCALL_EXIT mask apart from
> > SYSCALL_AUDIT? First time it is called from asm, then from
> > syscall_trace_leave(), no?
> >
> > For example. The task has TIF_SYSCALL_AUDIT and nothing else, it does
> > system_call->auditsys->system_call_fastpath. What if it gets, say,
> > TIF_SYSCALL_TRACE before ret_from_sys_call?
>
> No harm is done calling twice. The first call will do the real work and
> cleanup. It will set a flag in the audit data that the work has been
> done (in_syscall == 0) thus the second call will then not do any real
> work and won't have anything to clean up.

Hmm... and I assume context->previous != NULL is not possible on x86_64. OK, thanks.

And I guess all the CONFIG_AUDITSYSCALL code in entry.S is only needed to micro-optimize the case when TIF_SYSCALL_AUDIT is the only reason for the slow path. I wonder if it really makes a measurable difference...

Oleg.
Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h
On 06/08, Eric Paris wrote: On Wed, 2011-06-08 at 18:36 +0200, Oleg Nesterov wrote: And I guess, all CONFIG_AUDITSYSCALL code in entry.S is only needed to microoptimize the case when TIF_SYSCALL_AUDIT is the only reason for the slow path. I wonder if it really makes the measureble difference... All I know is what Roland put in the changelog: Avoiding the iret return path when syscall audit is enabled helps performance a lot. I believe this was a result of Fedora starting auditd by default and then Linus bitching about how slow a null syscall in a tight loop was. It was an optimization for a microbenchmark. How much it affects things on a real syscall that does real work is probably going to be determined by how much work is done in the syscall. and probably by how much work is done in audit_syscall_entry/exit. OK. Thanks a lot Eric for your explanations. Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h
On 06/08, Oleg Nesterov wrote:
> OK. Thanks a lot Eric for your explanations.

Yes, but may I ask another one?

Shouldn't the copy_process()->audit_alloc(tsk) path do clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT) if it doesn't set tsk->audit_context?

I can be easily wrong, but afaics otherwise the child can run with the TIF_SYSCALL_AUDIT bit copied from the parent's thread_info by dup_task_struct()->setup_thread_stack() and without ->audit_context, right? For what? Any other reason why audit_syscall_entry() checks context != NULL?

IOW, any reason the patch below is wrong? I am just curious, thanks.

Oleg.

--- x/kernel/auditsc.c
+++ x/kernel/auditsc.c
@@ -885,6 +885,8 @@ int audit_alloc(struct task_struct *tsk)
 	if (likely(!audit_ever_enabled))
 		return 0; /* Return if not auditing. */
 
+	clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+
 	state = audit_filter_task(tsk, &key);
 	if (likely(state == AUDIT_DISABLED))
 		return 0;
@@ -1591,9 +1593,7 @@ void audit_syscall_entry(int arch, int m
 	struct audit_context *context = tsk->audit_context;
 	enum audit_state state;
 
-	if (unlikely(!context))
-		return;
-
+	BUG_ON(!context);
 	/*
 	 * This happens only on certain architectures that make system
 	 * calls in kernel_thread via the entry.S interface, instead of
Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h
On 06/03, Eric Paris wrote: The audit system previously expected arches calling to audit_syscall_exit to supply as arguments if the syscall was a success and what the return code was. Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things by converting from negative retcodes to an audit internal magic value stating success or failure. This helper was wrong and could indicate that a valid pointer returned to userspace was a failed syscall. The fix is to fix the layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it in turns calls back into arch code to collect the return value and to determine if the syscall was a success or failure. We also define a generic is_syscall_success() macro which determines success/failure based on if the value is -MAX_ERRNO. This works for arches like x86 which do not use a separate mechanism to indicate syscall failure. I know nothing about audit, but the patch looks fine to me. But I have a bit off-topic question, diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 8a445a0..b7b1f88 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -53,6 +53,7 @@ #include asm/paravirt.h #include asm/ftrace.h #include asm/percpu.h +#include linux/err.h /* Avoid __ASSEMBLER__'ifying linux/audit.h just for this. */ #include linux/elf-em.h @@ -564,17 +565,16 @@ auditsys: jmp system_call_fastpath /* - * Return fast path for syscall audit. Call audit_syscall_exit() + * Return fast path for syscall audit. Call __audit_syscall_exit() * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT * masked off. */ sysret_audit: movq RAX-ARGOFFSET(%rsp),%rsi /* second arg, syscall return value */ - cmpq $0,%rsi/* is it 0? */ - setl %al/* 1 if so, 0 if not */ + cmpq $-MAX_ERRNO,%rsi /* is it -MAX_ERRNO? */ + setbe %al /* 1 if so, 0 if not */ movzbl %al,%edi /* zero-extend that into %edi */ - inc %edi /* first arg, 0-1(AUDITSC_SUCCESS), 1-2(AUDITSC_FAILURE) */ - call audit_syscall_exit + call __audit_syscall_exit With or without this patch, can't we call audit_syscall_exit() twice if there is something else in _TIF_WORK_SYSCALL_EXIT mask apart from SYSCALL_AUDIT ? First time it is called from asm, then from syscall_trace_leave(), no? For example. The task has TIF_SYSCALL_AUDIT and nothing else, it does system_call-auditsys-system_call_fastpath. What if it gets, say, TIF_SYSCALL_TRACE before ret_from_sys_call? Oleg. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
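For completeness, the generic helper the changelog describes ends up being a one-liner; arches where success is not encoded in the sign of the return value (powerpc, for instance, checks the SO bit in CR0) override it:

	/* include/linux/ptrace.h, generic fallback */
	#ifndef is_syscall_success
	#define is_syscall_success(regs) \
		(!IS_ERR_VALUE((unsigned long)(regs_return_value(regs))))
	#endif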