Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-06 Thread Oleg Nesterov
On 02/04, Ravi Bangoria wrote:
>
> +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
> +{
> + struct page *page;
> + struct vm_area_struct *vma;
> + void *kaddr;
> + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
> +
> + if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
> + return -EINVAL;

"vma" is not used, and I don't think you need FOLL_SPLIT_PMD.

Otherwise I can't really comment on this ppc-specific change.

To be honest, I don't even understand why we need this fix. Sure, a
breakpoint in the middle of a 64-bit insn won't work, but why do we care? The
user should know what they are doing.

Not to mention we can't really trust get_user_pages() in that this page
can be modified by mm owner or debugger...

But I won't argue.

Oleg.
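
For reference, a trimmed sketch of get_instr() with the comments above applied
(no vma, no FOLL_SPLIT_PMD). This is only an illustration against the v5.11-era
get_user_pages_remote() signature, not the committed fix:

	static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
	{
		struct page *page;
		void *kaddr;

		/* NULL is a valid vmas argument, so the unused vma goes away */
		if (get_user_pages_remote(mm, addr, 1, FOLL_FORCE,
					  &page, NULL, NULL) <= 0)
			return -EINVAL;

		kaddr = kmap_atomic(page);
		*instr = *(u32 *)(kaddr + (addr & ~PAGE_MASK));
		kunmap_atomic(kaddr);
		put_page(page);
		return 0;
	}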



Re: [PATCH] powerpc/uprobes: Don't allow probe on suffix of prefixed instruction

2021-01-19 Thread Oleg Nesterov
On 01/19, Ravi Bangoria wrote:
>
> Probing the 2nd word of a prefixed instruction is an invalid scenario and
> should be restricted.

I don't understand this ppc-specific problem, but...

> +#ifdef CONFIG_PPC64
> +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +   uprobe_opcode_t opcode)
> +{
> + uprobe_opcode_t prefix;
> + void *kaddr;
> + struct ppc_inst inst;
> +
> + /* Don't check if vaddr is pointing to the beginning of page */
> + if (!(vaddr & ~PAGE_MASK))
> + return 0;

So the fix is incomplete? Or is it that an insn at the start of a page can't be prefixed?

> +int __weak arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +  uprobe_opcode_t opcode)
> +{
> + return 0;
> +}
> +
>  static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
>  {
>   uprobe_opcode_t old_opcode;
> @@ -275,6 +281,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>   if (is_swbp_insn(new_opcode)) {
>   if (is_swbp)/* register: already installed? */
>   return 0;
> + if (arch_uprobe_verify_opcode(page, vaddr, old_opcode))
> + return -EINVAL;

Well, this doesn't look good...

To me it would be better to change the prepare_uprobe() path to copy
the potential prefix into uprobe->arch and check ppc_inst_prefixed()
in arch_uprobe_analyze_insn(). What do you think?

Oleg.
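
For context, a rough sketch of the alternative suggested above: have
prepare_uprobe() copy the word preceding the probe address into uprobe->arch,
then reject the probe in the arch hook if that word is a prefix. The
prev_word field and is_prefix_word() helper are hypothetical, purely for
illustration:

	/* ISA 3.1: the first word of a prefixed instruction has primary opcode 1 */
	static bool is_prefix_word(u32 word)
	{
		return (word >> 26) == 1;
	}

	int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe,
				     struct mm_struct *mm, unsigned long vaddr)
	{
		/* a probe on the suffix (2nd word) of a prefixed insn is invalid;
		 * auprobe->prev_word is assumed to hold the word at vaddr - 4,
		 * copied by prepare_uprobe() */
		if ((vaddr & ~PAGE_MASK) && is_prefix_word(auprobe->prev_word))
			return -EINVAL;
		return 0;
	}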



Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-23 Thread Oleg Nesterov
Christophe, et al,

So what?

Are you going to push your change or should I re-send 1-2 without
whitespace cleanups?

On 11/19, Oleg Nesterov wrote:
>
> On 11/19, Christophe Leroy wrote:
> >
> > I think the following should work, and not require the first patch (compile
> > tested only).
> >
> > --- a/arch/powerpc/kernel/ptrace/ptrace-view.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
> > @@ -234,9 +234,21 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
> > BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
> >  offsetof(struct pt_regs, msr) + sizeof(long));
> > 
> > +#ifdef CONFIG_PPC64
> > +   membuf_write(&to, &target->thread.regs->orig_gpr3,
> > +offsetof(struct pt_regs, softe) - offsetof(struct pt_regs, orig_gpr3));
> > +   membuf_store(&to, 1UL);
> > +
> > +   BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> > +offsetof(struct pt_regs, softe) + sizeof(long));
> > +
> > +   membuf_write(&to, &target->thread.regs->trap,
> > +sizeof(struct user_pt_regs) - offsetof(struct pt_regs, trap));
> > +#else
> > membuf_write(&to, &target->thread.regs->orig_gpr3,
> > sizeof(struct user_pt_regs) -
> > offsetof(struct pt_regs, orig_gpr3));
> > +#endif
> > return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> >  sizeof(struct user_pt_regs));
> >  }
> 
> Probably yes.
> 
> This mirrors the previous patch I sent
> (https://lore.kernel.org/lkml/20190917143753.ga12...@redhat.com/)
> and this is exactly what I tried to avoid; we can make a simpler fix now.
> 
> But let me repeat, I agree with any fix even if IMO my version simplifies
> the code; just commit this change and let's forget this problem.
> 
> Oleg.



Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-19 Thread Oleg Nesterov
On 11/19, Christophe Leroy wrote:
>
> I think the following should work, and not require the first patch (compile
> tested only).
>
> --- a/arch/powerpc/kernel/ptrace/ptrace-view.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
> @@ -234,9 +234,21 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
>   BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
>offsetof(struct pt_regs, msr) + sizeof(long));
> 
> +#ifdef CONFIG_PPC64
> + membuf_write(&to, &target->thread.regs->orig_gpr3,
> +  offsetof(struct pt_regs, softe) - offsetof(struct pt_regs, orig_gpr3));
> + membuf_store(&to, 1UL);
> +
> + BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> +  offsetof(struct pt_regs, softe) + sizeof(long));
> +
> + membuf_write(&to, &target->thread.regs->trap,
> +  sizeof(struct user_pt_regs) - offsetof(struct pt_regs, trap));
> +#else
>   membuf_write(&to, &target->thread.regs->orig_gpr3,
>   sizeof(struct user_pt_regs) -
>   offsetof(struct pt_regs, orig_gpr3));
> +#endif
>   return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
>sizeof(struct user_pt_regs));
>  }

Probably yes.

This mirrors the previous patch I sent
(https://lore.kernel.org/lkml/20190917143753.ga12...@redhat.com/)
and this is exactly what I tried to avoid; we can make a simpler fix now.

But let me repeat, I agree with any fix even if IMO my version simplifies
the code; just commit this change and let's forget this problem.

Oleg.



Re: [PATCH v3 0/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-19 Thread Oleg Nesterov
On 11/19, Christophe Leroy wrote:
>
>
> Le 19/11/2020 à 17:01, Oleg Nesterov a écrit :
> >Can we finally fix this problem? ;)
> >
> >My previous attempt was ignored, see
>
> That doesn't seem right.
>
> Michael made some suggestions it seems, can you respond to them?

I did, see https://lore.kernel.org/lkml/20200611105830.gb12...@redhat.com/

> >Sorry, uncompiled/untested, I don't have a ppc machine.
>
> I compiled with ppc64_defconfig, that seems ok. Still untested.

Thanks.

Oleg.



Re: [PATCH v3 1/2] powerpc/ptrace: simplify gpr_get/tm_cgpr_get

2020-11-19 Thread Oleg Nesterov
On 11/19, Christophe Leroy wrote:
>
>
> Le 19/11/2020 à 17:02, Oleg Nesterov a écrit :
> >gpr_get() does membuf_write() twice to override pt_regs->msr in between.
>
> Is there anything wrong with that ?

Nothing wrong, but imo the code and 2/2 looks simpler after this patch.
I tried to explain this in the changelog.

> >  int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
> > struct membuf to)
> >  {
> >+struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
> >+
> > if (!cpu_has_feature(CPU_FTR_TM))
> > return -ENODEV;
> >@@ -97,17 +99,12 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
> > flush_altivec_to_thread(target);
> > membuf_write(&to, &target->thread.ckpt_regs,
> >-offsetof(struct pt_regs, msr));
> >-membuf_store(&to, get_user_ckpt_msr(target));
> >+sizeof(struct user_pt_regs));
>
> This looks mis-aligned. But it should fit on a single line, now we allow up 
> to 100 chars on a line.

OK, I can change this.

> >-BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
> >- offsetof(struct pt_regs, msr) + sizeof(long));
> >+membuf_store(&to_msr, get_user_ckpt_msr(target));
> >-membuf_write(&to, &target->thread.ckpt_regs.orig_gpr3,
> >-sizeof(struct user_pt_regs) -
> >-offsetof(struct pt_regs, orig_gpr3));
> > return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> >-sizeof(struct user_pt_regs));
> >+sizeof(struct user_pt_regs));
>
> I can't see any change here except the alignment. Can you leave it as is ?

I just tried to make tm_cgpr_get() and gpr_get() look similar.

Sure, I can leave it as is.

Better yet, could you please fix this problem somehow so that I could forget
about the bug assigned to me?

I know nothing about powerpc, and personally I do not care about this (minor)
bug, I agree with any changes.

> >-membuf_write(&to, target->thread.regs, offsetof(struct pt_regs, msr));
> >-membuf_store(&to, get_user_msr(target));
> >+membuf_write(&to, target->thread.regs,
> >+sizeof(struct user_pt_regs));
>
> This should fit on a single line.
>
> > return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
> >- sizeof(struct user_pt_regs));
> >+sizeof(struct user_pt_regs));
>
> This should not change, it's not part of the changes for this patch.

See above, I can leave it as is.

> >--- a/include/linux/regset.h
> >+++ b/include/linux/regset.h
> >@@ -46,6 +46,18 @@ static inline int membuf_write(struct membuf *s, const void *v, size_t size)
> > return s->left;
> >  }
> >+static inline struct membuf membuf_at(const struct membuf *s, size_t offs)
> >+{
> >+struct membuf n = *s;
>
> Is there any point in using a struct membuf * instead of a struct membuf as
> parameter?

This matches other membuf_ helpers.

Oleg.



Re: [PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-19 Thread Oleg Nesterov
On 11/19, Oleg Nesterov wrote:
>
> This is not consistent and this breaks the user-regs-peekpoke test
> from https://sourceware.org/systemtap/wiki/utrace/tests/

See the test-case below.

Oleg.

/* Test case for PTRACE_SETREGS modifying the requested registers.
   x86* counterpart of the s390* testcase `user-area-access.c'.

   This software is provided 'as-is', without any express or implied
   warranty.  In no event will the authors be held liable for any damages
   arising from the use of this software.

   Permission is granted to anyone to use this software for any purpose,
   including commercial applications, and to alter it and redistribute it
   freely.  */

/* FIXME: EFLAGS should be tested restricted on the appropriate bits.  */

#define _GNU_SOURCE 1

#if defined __powerpc__ || defined __sparc__
# define user_regs_struct pt_regs
#endif

#ifdef __ia64__
#define ia64_fpreg ia64_fpreg_DISABLE
#define pt_all_user_regs pt_all_user_regs_DISABLE
#endif  /* __ia64__ */
#include 
#ifdef __ia64__
#undef ia64_fpreg
#undef pt_all_user_regs
#endif  /* __ia64__ */
#include 
#include 
#include 
#if defined __i386__ || defined __x86_64__
#include 
#endif
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

/* ia64 has PTRACE_SETREGS but it has no USER_REGS_STRUCT.  */
#if !defined PTRACE_SETREGS || defined __ia64__

int
main (void)
{
  return 77;
}

#else   /* PTRACE_SETREGS */

/* The minimal alignment we use for the random access ranges.  */
#define REGALIGN (sizeof (long))

static pid_t child;

static void
cleanup (void)
{
  if (child > 0)
kill (child, SIGKILL);
  child = 0;
}

static void
handler_fail (int signo)
{
  cleanup ();
  signal (SIGABRT, SIG_DFL);
  abort ();
}

int
main (void)
{
  long l;
  int status, i;
  pid_t pid;
  union
{
  struct user_regs_struct user;
  unsigned char byte[sizeof (struct user_regs_struct)];
} u, u2;
  int start;

  setbuf (stdout, NULL);
  atexit (cleanup);
  signal (SIGABRT, handler_fail);
  signal (SIGALRM, handler_fail);
  signal (SIGINT, handler_fail);
  i = alarm (10);
  assert (i == 0);

  child = fork ();
  switch (child)
{
case -1:
  assert_perror (errno);
  assert (0);

case 0:
  l = ptrace (PTRACE_TRACEME, 0, NULL, NULL);
  assert (l == 0);

  // Prevent the rt_sigprocmask() call made by glibc after raise().
  syscall (__NR_tkill, getpid (), SIGSTOP);
  assert (0);

default:
  break;
}

  pid = waitpid (child, &status, 0);
  assert (pid == child);
  assert (WIFSTOPPED (status));
  assert (WSTOPSIG (status) == SIGSTOP);

  /* Fetch U2 from the inferior.  */
  errno = 0;
# ifdef __sparc__
  l = ptrace (PTRACE_GETREGS, child, &u2, NULL);
# else
  l = ptrace (PTRACE_GETREGS, child, NULL, &u2);
# endif
  assert_perror (errno);
  assert (l == 0);

  /* Initialize U with a pattern.  */
  for (i = 0; i < sizeof u.byte; i++)
u.byte[i] = i;
#ifdef __x86_64__
  /* non-EFLAGS modifications fail with EIO,  EFLAGS gets back different.  */
  u.user.eflags = u2.user.eflags;
  u.user.cs = u2.user.cs;
  u.user.ds = u2.user.ds;
  u.user.es = u2.user.es;
  u.user.fs = u2.user.fs;
  u.user.gs = u2.user.gs;
  u.user.ss = u2.user.ss;
  u.user.fs_base = u2.user.fs_base;
  u.user.gs_base = u2.user.gs_base;
  /* RHEL-4 refuses to set too high (and invalid) PC values.  */
  u.user.rip = (unsigned long) handler_fail;
  /* 2.6.25 always truncates and sign-extends orig_rax.  */
  u.user.orig_rax = (int) u.user.orig_rax;
#endif  /* __x86_64__ */
#ifdef __i386__
  /* These values get back different.  */
  u.user.xds = u2.user.xds;
  u.user.xes = u2.user.xes;
  u.user.xfs = u2.user.xfs;
  u.user.xgs = u2.user.xgs;
  u.user.xcs = u2.user.xcs;
  u.user.eflags = u2.user.eflags;
  u.user.xss = u2.user.xss;
  /* RHEL-4 refuses to set too high (and invalid) PC values.  */
  u.user.eip = (unsigned long) handler_fail;
#endif  /* __i386__ */
#ifdef __powerpc__
  /* These fields are constrained.  */
  u.user.msr = u2.user.msr;
# ifdef __powerpc64__
  u.user.softe = u2.user.softe;
# else
  u.user.mq = u2.user.mq;
# endif /* __powerpc64__ */
  u.user.trap = u2.user.trap;
  u.user.dar = u2.user.dar;
  u.user.dsisr = u2.user.dsisr;
  u.user.result = u2.user.result;
#endif  /* __powerpc__ */

  /* Poke U.  */
# ifdef __sparc__
  l = ptrace (PTRACE_SETREGS, child, &u, NULL);
# else
  l = ptrace (PTRACE_SETREGS, child, NULL, &u);
# endif
  assert (l == 0);

  /* Peek into U2.  */
# ifdef __sparc__
  l = ptrace (PTRACE_GETREGS, child, &u2, NULL);
# else
  l = ptrace (PTRACE_GETREGS, child, NULL, &u2);
# endif
  assert (l == 0);

  /* Verify it matches.  */
  if (memcmp (&u, &u2, sizeof u.byte) != 0)
{
  for (start = 0; start + REGALIGN <= sizeof u.byte; start += REGALIGN)
if (*(unsigned long *) (u.byte + start)
!= *(unsigned long *) (u2.byte + start))
  printf ("\
mismatch at offset %#x: SETREGS wrote %lx GETREGS read %lx\n",
   

[PATCH v3 2/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-19 Thread Oleg Nesterov
The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in
ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1,
but PTRACE_GETREGS still copies pt_regs->softe as is.

This is not consistent and this breaks the user-regs-peekpoke test
from https://sourceware.org/systemtap/wiki/utrace/tests/

Reported-by: Jan Kratochvil 
Signed-off-by: Oleg Nesterov 
---
 arch/powerpc/kernel/ptrace/ptrace-tm.c   | 8 +++-
 arch/powerpc/kernel/ptrace/ptrace-view.c | 8 +++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace/ptrace-tm.c b/arch/powerpc/kernel/ptrace/ptrace-tm.c
index f8fcbd85d4cb..d0d339f86e61 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-tm.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-tm.c
@@ -87,6 +87,10 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
struct membuf to)
 {
struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
+#ifdef CONFIG_PPC64
+   struct membuf to_softe = membuf_at(&to,
+   offsetof(struct pt_regs, softe));
+#endif
 
if (!cpu_has_feature(CPU_FTR_TM))
return -ENODEV;
@@ -102,7 +106,9 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
sizeof(struct user_pt_regs));
 
membuf_store(&to_msr, get_user_ckpt_msr(target));
-
+#ifdef CONFIG_PPC64
+   membuf_store(&to_softe, 0x1ul);
+#endif
return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
sizeof(struct user_pt_regs));
 }
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 39686ede40b3..f554ccfcbfae 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -218,6 +218,10 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
   struct membuf to)
 {
struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
+#ifdef CONFIG_PPC64
+   struct membuf to_softe = membuf_at(&to,
+   offsetof(struct pt_regs, softe));
+#endif
int i;
 
if (target->thread.regs == NULL)
@@ -233,7 +237,9 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
sizeof(struct user_pt_regs));
 
membuf_store(&to_msr, get_user_msr(target));
-
+#ifdef CONFIG_PPC64
+   membuf_store(&to_softe, 0x1ul);
+#endif
return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
sizeof(struct user_pt_regs));
 }
-- 
2.25.1.362.g51ebf55
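
For illustration, a user-space check for the inconsistency this patch fixes
(a sketch assuming ppc64, with PT_SOFTE and struct pt_regs from asm/ptrace.h;
error handling omitted, 'child' is assumed to be a stopped tracee):

	#include <asm/ptrace.h>
	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <stdio.h>

	static void check_softe(pid_t child)
	{
		struct pt_regs regs;
		long peeked = ptrace(PTRACE_PEEKUSER, child,
				     (void *)(PT_SOFTE * sizeof(long)), NULL);

		ptrace(PTRACE_GETREGS, child, NULL, &regs);
		/* before this patch: peeked == 1 while regs.softe leaks the
		 * raw pt_regs->softe value */
		if ((unsigned long)peeked != regs.softe)
			printf("mismatch: PEEKUSER %ld vs GETREGS %lu\n",
			       peeked, regs.softe);
	}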




[PATCH v3 1/2] powerpc/ptrace: simplify gpr_get/tm_cgpr_get

2020-11-19 Thread Oleg Nesterov
gpr_get() does membuf_write() twice to override pt_regs->msr in between.
We can call membuf_write() once and change ->msr in the kernel buffer,
this simplifies the code and the next fix.

The patch adds a new simple helper, membuf_at(offs), it returns the new
membuf which can be safely used after membuf_write().

Signed-off-by: Oleg Nesterov 
---
 arch/powerpc/kernel/ptrace/ptrace-tm.c   | 13 +
 arch/powerpc/kernel/ptrace/ptrace-view.c | 13 +
 include/linux/regset.h   | 12 
 3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace/ptrace-tm.c b/arch/powerpc/kernel/ptrace/ptrace-tm.c
index 54f2d076206f..f8fcbd85d4cb 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-tm.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-tm.c
@@ -86,6 +86,8 @@ int tm_cgpr_active(struct task_struct *target, const struct user_regset *regset)
 int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
struct membuf to)
 {
+   struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
+
if (!cpu_has_feature(CPU_FTR_TM))
return -ENODEV;
 
@@ -97,17 +99,12 @@ int tm_cgpr_get(struct task_struct *target, const struct user_regset *regset,
flush_altivec_to_thread(target);
 
membuf_write(&to, &target->thread.ckpt_regs,
-   offsetof(struct pt_regs, msr));
-   membuf_store(&to, get_user_ckpt_msr(target));
+   sizeof(struct user_pt_regs));
 
-   BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
-offsetof(struct pt_regs, msr) + sizeof(long));
+   membuf_store(&to_msr, get_user_ckpt_msr(target));
 
-   membuf_write(&to, &target->thread.ckpt_regs.orig_gpr3,
-   sizeof(struct user_pt_regs) -
-   offsetof(struct pt_regs, orig_gpr3));
return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
-   sizeof(struct user_pt_regs));
+   sizeof(struct user_pt_regs));
 }
 
 /*
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 7e6478e7ed07..39686ede40b3 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -217,6 +217,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, unsigned long data)
 static int gpr_get(struct task_struct *target, const struct user_regset *regset,
   struct membuf to)
 {
+   struct membuf to_msr = membuf_at(&to, offsetof(struct pt_regs, msr));
int i;
 
if (target->thread.regs == NULL)
@@ -228,17 +229,13 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
target->thread.regs->gpr[i] = NV_REG_POISON;
}
 
-   membuf_write(&to, target->thread.regs, offsetof(struct pt_regs, msr));
-   membuf_store(&to, get_user_msr(target));
+   membuf_write(&to, target->thread.regs,
+   sizeof(struct user_pt_regs));
 
-   BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
-offsetof(struct pt_regs, msr) + sizeof(long));
+   membuf_store(&to_msr, get_user_msr(target));
 
-   membuf_write(&to, &target->thread.regs->orig_gpr3,
-   sizeof(struct user_pt_regs) -
-   offsetof(struct pt_regs, orig_gpr3));
return membuf_zero(&to, ELF_NGREG * sizeof(unsigned long) -
-sizeof(struct user_pt_regs));
+   sizeof(struct user_pt_regs));
 }
 
 static int gpr_set(struct task_struct *target, const struct user_regset 
*regset,
diff --git a/include/linux/regset.h b/include/linux/regset.h
index c3403f328257..a00765f0e8cf 100644
--- a/include/linux/regset.h
+++ b/include/linux/regset.h
@@ -46,6 +46,18 @@ static inline int membuf_write(struct membuf *s, const void *v, size_t size)
return s->left;
 }
 
+static inline struct membuf membuf_at(const struct membuf *s, size_t offs)
+{
+   struct membuf n = *s;
+
+   if (offs > n.left)
+   offs = n.left;
+   n.p += offs;
+   n.left -= offs;
+
+   return n;
+}
+
 /* current s->p must be aligned for v; v must be a scalar */
 #define membuf_store(s, v) \
 ({ \
-- 
2.25.1.362.g51ebf55




[PATCH v3 0/2] powerpc/ptrace: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-11-19 Thread Oleg Nesterov
Can we finally fix this problem? ;)

My previous attempt was ignored, see

https://lore.kernel.org/lkml/20190917121256.ga8...@redhat.com/

Now that gpr_get() was changed to use membuf API we can make a simpler fix.

Sorry, uncompiled/untested, I don't have a ppc machine.

Oleg.

 arch/powerpc/kernel/ptrace/ptrace-tm.c   | 21 -
 arch/powerpc/kernel/ptrace/ptrace-view.c | 21 -
 include/linux/regset.h   | 12 
 3 files changed, 36 insertions(+), 18 deletions(-)



Re: [PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-06-11 Thread Oleg Nesterov
On 06/11, Madhavan Srinivasan wrote:
>
>
> On 6/10/20 8:37 PM, Oleg Nesterov wrote:
> >Hi,
> >
> >looks like this patch was forgotten.
>
> yep, I missed this. But mpe did have comments for the patch.
>
> https://lkml.org/lkml/2019/9/19/107

Yes, and I thought that I had replied... apparently not, sorry!

So let me repeat, I am fine either way, I do not understand this
ppc-specific code and I can't really test this change.

Let me quote that email from Michael:

> We could do it like below. I'm 50/50 though on whether it's worth it, or
> if we should just go with the big ifdef like in your patch.

up to you ;)

Hmm. And yes,

> >>This is not consistent and this breaks
> >>http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke

this is 404.

Jan, could you correct the link above?

Oleg.



Re: [PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too

2020-06-10 Thread Oleg Nesterov
Hi,

looks like this patch was forgotten.

Do you think this should be fixed or should we document that
PTRACE_GETREGS is not consistent with PTRACE_PEEKUSER on ppc64?


On 09/17, Oleg Nesterov wrote:
>
> I don't have a ppc machine, this patch wasn't even compile tested,
> could you please review?
> 
> The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in
> ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1,
> but PTRACE_GETREGS still copies pt_regs->softe as is.
> 
> This is not consistent and this breaks
> http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke
> 
> Reported-by: Jan Kratochvil 
> Signed-off-by: Oleg Nesterov 
> ---
>  arch/powerpc/kernel/ptrace.c | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 8c92feb..291acfb 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
>   BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
>offsetof(struct pt_regs, msr) + sizeof(long));
>  
> +#ifdef CONFIG_PPC64
> + if (!ret)
> + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
> +   &target->thread.regs->orig_gpr3,
> +   offsetof(struct pt_regs, orig_gpr3),
> +   offsetof(struct pt_regs, softe));
> +
> + if (!ret) {
> + unsigned long softe = 0x1;
> + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &softe,
> +   offsetof(struct pt_regs, softe),
> +   offsetof(struct pt_regs, softe) +
> +   sizeof(softe));
> + }
> +
> + BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
> +  offsetof(struct pt_regs, softe) + sizeof(long));
> +
> + if (!ret)
> + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
> +   &target->thread.regs->trap,
> +   offsetof(struct pt_regs, trap),
> +   sizeof(struct user_pt_regs));
> +#else
>   if (!ret)
>   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
> &target->thread.regs->orig_gpr3,
> offsetof(struct pt_regs, orig_gpr3),
> sizeof(struct user_pt_regs));
> +#endif
>   if (!ret)
> + ret = user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf,
>  sizeof(struct user_pt_regs), -1);
> -- 
> 2.5.0
> 



[PATCH? v2] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too

2019-09-17 Thread Oleg Nesterov
I don't have a ppc machine, this patch wasn't even compile tested,
could you please review?

The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in
ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1,
but PTRACE_GETREGS still copies pt_regs->softe as is.

This is not consistent and this breaks
http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke

Reported-by: Jan Kratochvil 
Signed-off-by: Oleg Nesterov 
---
 arch/powerpc/kernel/ptrace.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92feb..291acfb 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
 offsetof(struct pt_regs, msr) + sizeof(long));
 
+#ifdef CONFIG_PPC64
+   if (!ret)
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &target->thread.regs->orig_gpr3,
+ offsetof(struct pt_regs, orig_gpr3),
+ offsetof(struct pt_regs, softe));
+
+   if (!ret) {
+   unsigned long softe = 0x1;
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &softe,
+ offsetof(struct pt_regs, softe),
+ offsetof(struct pt_regs, softe) +
+ sizeof(softe));
+   }
+
+   BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
+offsetof(struct pt_regs, softe) + sizeof(long));
+
+   if (!ret)
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &target->thread.regs->trap,
+ offsetof(struct pt_regs, trap),
+ sizeof(struct user_pt_regs));
+#else
if (!ret)
ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
  &target->thread.regs->orig_gpr3,
  offsetof(struct pt_regs, orig_gpr3),
  sizeof(struct user_pt_regs));
+#endif
if (!ret)
ret = user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf,
   sizeof(struct user_pt_regs), -1);
-- 
2.5.0




[PATCH?] powerpc: Hard wire PT_SOFTE value to 1 in gpr_get() too

2019-09-17 Thread Oleg Nesterov
I don't have a ppc machine, this patch wasn't even compile tested,
could you please review?

The commit a8a4b03ab95f ("powerpc: Hard wire PT_SOFTE value to 1 in
ptrace & signals") changed ptrace_get_reg(PT_SOFTE) to report 0x1,
but PTRACE_GETREGS still copies pt_regs->softe as is.

This is not consistent and this breaks
http://sourceware.org/systemtap/wiki/utrace/tests/user-regs-peekpoke

Reported-by: Jan Kratochvil 
Signed-off-by: Oleg Nesterov 
---
 arch/powerpc/kernel/ptrace.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92feb..9e9342c 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -363,11 +363,36 @@ static int gpr_get(struct task_struct *target, const struct user_regset *regset,
BUILD_BUG_ON(offsetof(struct pt_regs, orig_gpr3) !=
 offsetof(struct pt_regs, msr) + sizeof(long));
 
+#ifdef CONFIG_PPC64
+   if (!ret)
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &target->thread.regs->orig_gpr3,
+ offsetof(struct pt_regs, orig_gpr3),
+ offsetof(struct pt_regs, softe));
+
+   if (!ret) {
+   unsigned long softe = 0x1;
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &softe,
+ offsetof(struct pt_regs, softe),
+ offsetof(struct pt_regs, softe) +
+ sizeof(softe));
+   }
+
+   BUILD_BUG_ON(offsetof(struct pt_regs, trap) !=
+offsetof(struct pt_regs, softe) + sizeof(long));
+
+   if (!ret)
+   ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &target->thread.regs->trap,
+ offsetof(struct pt_regs, trap),
+ sizeof(struct user_pt_regs));
+#else
if (!ret)
ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
  &target->thread.regs->orig_gpr3,
  offsetof(struct pt_regs, orig_gpr3),
  sizeof(struct user_pt_regs));
+#endif
if (!ret)
ret = user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf,
   sizeof(struct user_pt_regs), -1);
-- 
2.5.0




Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Oleg Nesterov
On 05/23, Christian Brauner wrote:
>
> So given that we would really need another find_next_open_fd() I think
> sticking to the simple cond_resched() version I sent before is better
> for now until we see real-world performance issues.

OK, agreed.

Oleg.



Re: [PATCH v2 1/2] open: add close_range()

2019-05-23 Thread Oleg Nesterov
On 05/23, Christian Brauner wrote:
>
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + max_fd = max(max_fd, (cur_max - 1));
 ^^^

Hmm. min() ?

Oleg.



Re: [PATCH v1 1/2] open: add close_range()

2019-05-22 Thread Oleg Nesterov
On 05/22, Christian Brauner wrote:
>
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
>  {
> - struct file *file;
> + struct file *file = NULL;
>   struct fdtable *fdt;
>  
>   spin_lock(&files->file_lock);
> @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
>   goto out_unlock;
>   rcu_assign_pointer(fdt->fd[fd], NULL);
>   __put_unused_fd(files, fd);
> - spin_unlock(&files->file_lock);
> - return filp_close(file, files);
>  
>  out_unlock:
>   spin_unlock(&files->file_lock);
> - return -EBADF;
> + return file;

...

> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + if (max_fd >= cur_max)
> + max_fd = cur_max - 1;
> +
> + while (fd <= max_fd) {
> + struct file *file;
> +
> + file = pick_file(files, fd++);

Well, how about something like

static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
{
unsigned int maxfd = fdt->max_fds;
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;

bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit > maxfd)
return maxfd;
if (bitbit > start)
start = bitbit;
return find_next_bit(fdt->open_fds, maxfd, start);
}

unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
{
unsigned fd;
struct file *file;
struct fdtable *fdt;

spin_lock(&files->file_lock);
fdt = files_fdtable(files);
fd = find_next_opened_fd(fdt, start);
if (fd >= fdt->max_fds || fd > maxfd) {
fd = -1;
goto out;
}

file = fdt->fd[fd];
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
out:
spin_unlock(&files->file_lock);

if (fd == -1u)
return fd;

filp_close(file, files);
return fd + 1;
}

?

Then close_range() can do

while (fd <= max_fd)
fd = close_next_fd(files, fd, max_fd);

Oleg.



Re: [PATCH 5/5] asm-generic: remove ptrace.h

2019-05-20 Thread Oleg Nesterov
On 05/20, Christoph Hellwig wrote:
>
> No one is using this header anymore.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Arnd Bergmann 
> ---
>  MAINTAINERS|  1 -
>  arch/mips/include/asm/ptrace.h |  5 ---
>  include/asm-generic/ptrace.h   | 74 --
>  3 files changed, 80 deletions(-)
>  delete mode 100644 include/asm-generic/ptrace.h

Acked-by: Oleg Nesterov 



Re: [PATCH 4/5] x86: don't use asm-generic/ptrace.h

2019-05-20 Thread Oleg Nesterov
On 05/20, Christoph Hellwig wrote:
>
> Doing the indirection through macros for the regs accessors just
> makes them harder to read, so implement the helpers directly.

Acked-by: Oleg Nesterov 



Re: [PATCH v1 1/2] pid: add pidfd_open()

2019-05-16 Thread Oleg Nesterov
On 05/17, Aleksa Sarai wrote:
>
> On 2019-05-16, Oleg Nesterov  wrote:
> > On 05/17, Aleksa Sarai wrote:
> > > On 2019-05-16, Oleg Nesterov  wrote:
> > > > On 05/16, Christian Brauner wrote:
> > > > > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > > > > created pidfds at process creation time.
> > > >
> > > > Now I am wondering why do we need CLONE_PIDFD, you can just do
> > > >
> > > > pid = fork();
> > > > pidfd_open(pid);
> > >
> > > While the race window would be exceptionally short, there is the
> > > possibility that the child will die
> >
> > Yes,
> >
> > > and their pid will be recycled
> > > before you do pidfd_open().
> >
> > No.
> >
> > Unless the caller's sub-thread does wait() before pidfd_open(), of course.
> > Or unless you do signal(SIGCHLD, SIG_IGN).
>
> What about CLONE_PARENT?

I should have mentioned CLONE_PARENT ;)

Of course in this case the child can be reaped before pidfd_open(). But how
often do you or other people use clone(CLONE_PARENT)? Not to mention you can
trivially eliminate/detect this race if you really need this.

Don't get me wrong, I am not trying to say that CLONE_PIDFD is a bad idea.

But to me pidfd_open() is much more useful. Say, as a perl programmer I can
easily use pidfd_open(), but not CLONE_PIDFD.

Oleg.
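
For reference, a minimal sketch of the fork()+pidfd_open() pattern described
above. The two-argument prototype matches this series; the syscall number is
the one eventually assigned (434), since no glibc wrapper existed at the time:

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	#ifndef SYS_pidfd_open
	#define SYS_pidfd_open 434
	#endif

	int main(void)
	{
		pid_t pid = fork();
		if (pid == 0)
			_exit(0);	/* child */

		/* until the parent wait()s (and as long as SIGCHLD is not
		 * SIG_IGN), the child stays a zombie and its pid cannot be
		 * recycled, so this is race-free for direct children */
		int pidfd = syscall(SYS_pidfd_open, pid, 0);
		/* ... poll/wait on pidfd ... */
		return pidfd < 0;
	}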



Re: [PATCH v1 1/2] pid: add pidfd_open()

2019-05-16 Thread Oleg Nesterov
On 05/17, Aleksa Sarai wrote:
>
> On 2019-05-16, Oleg Nesterov  wrote:
> > On 05/16, Christian Brauner wrote:
> > >
> > > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > > created pidfds at process creation time.
> >
> > Now I am wondering why do we need CLONE_PIDFD, you can just do
> >
> > pid = fork();
> > pidfd_open(pid);
>
> While the race window would be exceptionally short, there is the
> possibility that the child will die

Yes,

> and their pid will be recycled
> before you do pidfd_open().

No.

Unless the caller's sub-thread does wait() before pidfd_open(), of course.
Or unless you do signal(SIGCHLD, SIG_IGN).

Oleg.



Re: [PATCH v1 1/2] pid: add pidfd_open()

2019-05-16 Thread Oleg Nesterov
On 05/16, Christian Brauner wrote:
>
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.

Now I am wondering why do we need CLONE_PIDFD, you can just do

pid = fork();
pidfd_open(pid);

> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> + int fd, ret;
> + struct pid *p;
> + struct task_struct *tsk;
> +
> + if (flags)
> + return -EINVAL;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + p = find_get_pid(pid);
> + if (!p)
> + return -ESRCH;
> +
> + ret = 0;
> + rcu_read_lock();
> + /*
> +  * If this returns non-NULL the pid was used as a thread-group
> +  * leader. Note, we race with exec here: If it changes the
> +  * thread-group leader we might return the old leader.
> +  */
> + tsk = pid_task(p, PIDTYPE_TGID);
> + if (!tsk)
> + ret = -ESRCH;
> + rcu_read_unlock();
> +
> + fd = ret ?: pidfd_create(p);
> + put_pid(p);
> + return fd;
> +}

Looks correct, feel free to add Reviewed-by: Oleg Nesterov 

But why do we need task_struct *tsk?

rcu_read_lock();
if (!pid_task(p, PIDTYPE_TGID))
ret = -ESRCH;
rcu_read_unlock();

and in fact we do not even need rcu_read_lock(), we could do

// shut up rcu_dereference_check()
rcu_lock_acquire(&rcu_lock_map);
if (!pid_task(p, PIDTYPE_TGID))
ret = -ESRCH;
rcu_lock_release(&rcu_lock_map);

Well... I won't insist, but the comment about the race with exec looks a bit
confusing to me. It is true, but we do not care at all, we are not going to
use the task_struct returned by pid_task().

Oleg.



Re: [PATCH 1/2] pid: add pidfd_open()

2019-05-15 Thread Oleg Nesterov
On 05/15, Oleg Nesterov wrote:
>
> On 05/15, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +   int fd, ret;
> > +   struct pid *p;
> > +   struct task_struct *tsk;
> > +
> > +   if (flags)
> > +   return -EINVAL;
> > +
> > +   if (pid <= 0)
> > +   return -EINVAL;
> > +
> > +   p = find_get_pid(pid);
> > +   if (!p)
> > +   return -ESRCH;
> > +
> > +   rcu_read_lock();
> > +   tsk = pid_task(p, PIDTYPE_PID);
>
> You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> You can just do find_vpid() under rcu_read_lock().

Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't
do this under rcu_read_lock().

So I was wrong, you can't avoid get/put_pid.

Oleg.



Re: [PATCH 1/2] pid: add pidfd_open()

2019-05-15 Thread Oleg Nesterov
On 05/15, Christian Brauner wrote:
>
> On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> >
> > it seems that you can do a single check
> >
> > tsk = pid_task(p, PIDTYPE_TGID);
> > if (!tsk)
> > ret = -ESRCH;
> >
> > this even looks more correct if we race with exec changing the leader.
>
> The logic here being that you can only reach the thread_group leader
> from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?

Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid,
struct pid has no "type" or something like this.

The logic is that pid->tasks[PIDTYPE_XXX] is the list of tasks which use
this pid as "XXX" type.

For example, clone(CLONE_THREAD) creates a pid which has a single non-
empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or
SID.

So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was
used for a group-leader, see copy_process() which does

if (thread_group_leader(p))
attach_pid(p, PIDTYPE_TGID);


If we race with exec which changes the leader pid_task(TGID) can return
the old leader. We do not care, but this means that we should not check
thread_group_leader().

Oleg.



Re: [PATCH 1/2] pid: add pidfd_open()

2019-05-15 Thread Oleg Nesterov
On 05/15, Christian Brauner wrote:
>
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> + int fd, ret;
> + struct pid *p;
> + struct task_struct *tsk;
> +
> + if (flags)
> + return -EINVAL;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + p = find_get_pid(pid);
> + if (!p)
> + return -ESRCH;
> +
> + rcu_read_lock();
> + tsk = pid_task(p, PIDTYPE_PID);

You do not need find_get_pid() before rcu_lock and put_pid() at the end.
You can just do find_vpid() under rcu_read_lock().

> + if (!tsk)
> + ret = -ESRCH;
> + else if (unlikely(!thread_group_leader(tsk)))
> + ret = -EINVAL;

it seems that you can do a single check

tsk = pid_task(p, PIDTYPE_TGID);
if (!tsk)
ret = -ESRCH;

this even looks more correct if we race with exec changing the leader.

Oleg.



Re: [PATCH linux-next v10 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request

2019-05-09 Thread Oleg Nesterov
On 04/16, Dmitry V. Levin wrote:
>
> [Andrew, could you take this patchset into your tree, please?]

Just in case...

I have already acked 6/7.

Other patches look good to me too, it's just that I don't think I can actually
review these non-x86 changes.

Oleg.



Re: remove asm-generic/ptrace.h

2019-05-02 Thread Oleg Nesterov
On 05/01, Christoph Hellwig wrote:
>
> Hi all,
>
> asm-generic/ptrace.h is a little weird in that it doesn't actually
> implement any functionality, but it provided multiple layers of macros
> that just implement trivial inline functions.  We implement those
> directly in the few architectures and be off with a much simpler
> design.

Oh, thanks, I was always confused by these macros ;)

Oleg.



Re: [PATCH v2 3/6] x86: clean up _TIF_SYSCALL_EMU handling using ptrace_syscall_enter hook

2019-05-01 Thread Oleg Nesterov
On 04/30, Sudeep Holla wrote:
>
> On Mon, Mar 18, 2019 at 04:33:22PM +0100, Oleg Nesterov wrote:
> >
> > And it seems that _TIF_WORK_SYSCALL_ENTRY needs some cleanups too... We don't
> > need "& _TIF_WORK_SYSCALL_ENTRY" in syscall_trace_enter, and
> > _TIF_WORK_SYSCALL_ENTRY should not include _TIF_NOHZ?
> >
>
> I was about to post the updated version and checked this to see whether I had
> covered everything. I had missed the above comment. All architectures
> have _TIF_NOHZ in their mask that they check to do work. And from x86, I read
> "...syscall_trace_enter(). Also includes TIF_NOHZ for enter_from_user_mode()"
> So I don't understand why _TIF_NOHZ needs to be dropped.

I had already forgotten this discussion... But after glancing at this code again
I still think the same, and I don't understand why you disagree.

> Also if we need to drop, we can address that separately examining all archs.

Sure, and I was only talking about x86. We can keep TIF_NOHZ and even
set_tsk_thread_flag(TIF_NOHZ) in context_tracking_cpu_set() if some arch needs
this, but remove TIF_NOHZ from TIF_WORK_SYSCALL_ENTRY in
arch/x86/include/asm/thread_info.h; afaics this shouldn't make any difference.

And I see no reason why x86 needs to use TIF_WORK_SYSCALL_ENTRY in
syscall_trace_enter().

Oleg.



Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU

2019-03-19 Thread Oleg Nesterov
On 03/19, Oleg Nesterov wrote:
>
> Well, personally I see no point... Again, after the trivial simplification
> x86 does
>
>   if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
>   ret = tracehook_report_syscall_entry(regs);
>   if (ret || (work & _TIF_SYSCALL_EMU))
>   return -1L;
>   }
>
> this looks simple enough for copy-and-paste.
>
> > If there's a better way to achieve the same
>
> I can only say that if we add a common helper, I think it should absorb
> tracehook_report_syscall_entry() and handle both TIF's just like the code
> above does. Not sure this makes any sense.

This won't work; looking at 6/6 I see that arm64 needs to distinguish
_TRACE and _EMU ... I don't understand this code, but it looks suspicious.
If tracehook_report_syscall_entry() returns nonzero the tracee was killed,
syscall_trace_enter() should just return.

To me this is another indication that consolidation makes no sense ;)

Oleg.



Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU

2019-03-19 Thread Oleg Nesterov
On 03/18, Sudeep Holla wrote:
>
> On Mon, Mar 18, 2019 at 06:33:41PM +0100, Oleg Nesterov wrote:
> > On 03/18, Sudeep Holla wrote:
> > >
> > > On Mon, Mar 18, 2019 at 06:20:24PM +0100, Oleg Nesterov wrote:
> > > >
> > > > Again, to me this patch just makes the code look worse. Honestly, I don't
> > > > think that the new (badly named) ptrace_syscall_enter() hook makes any sense.
> > > >
> > >
> > > Worse because we end up reading current_thread_info->flags twice ?
> >
> > Mostly because in my opinion ptrace_syscall_enter() buys nothing but makes
> > the caller's code less readable/understandable.
> >
> > Sure, this is subjective.
> >
>
> Based on what we have in that function today, I tend to agree. Will and
> Richard were of the opinion to consolidate SYSEMU handling

Well, personally I see no point... Again, after the trivial simplification
x86 does

if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
ret = tracehook_report_syscall_entry(regs);
if (ret || (work & _TIF_SYSCALL_EMU))
return -1L;
}

this looks simple enough for copy-and-paste.

> If there's a better way to achieve the same

I can only say that if we add a common helper, I think it should absorb
tracehook_report_syscall_entry() and handle both TIF's just like the code
above does. Not sure this makes any sense.

Oleg.



Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU

2019-03-18 Thread Oleg Nesterov
On 03/18, Sudeep Holla wrote:
>
> On Mon, Mar 18, 2019 at 06:20:24PM +0100, Oleg Nesterov wrote:
> >
> > Again, to me this patch just makes the code look worse. Honestly, I don't
> > think that the new (badly named) ptrace_syscall_enter() hook makes any 
> > sense.
> >
>
> Worse because we end up reading current_thread_info->flags twice ?

Mostly because in my opinion ptrace_syscall_enter() buys nothing but makes
the caller's code less readable/understandable.

Sure, this is subjective.

Oleg.



Re: [PATCH v2 1/6] ptrace: move clearing of TIF_SYSCALL_EMU flag to core

2019-03-18 Thread Oleg Nesterov
On 03/18, Sudeep Holla wrote:
>
> @@ -534,6 +534,10 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
>   /* Architecture-specific hardware disable .. */
>   ptrace_disable(child);
>  
> +#ifdef TIF_SYSCALL_EMU
> + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
> +#endif

perhaps it makes sense to factor out clear_tsk_thread_flag(TIF_SYSCALL_EMU),
but then we should probably clear it along with TIF_SYSCALL_TRACE in
__ptrace_unlink?

Oleg.



Re: [PATCH v2 4/6] powerpc: use common ptrace_syscall_enter hook to handle _TIF_SYSCALL_EMU

2019-03-18 Thread Oleg Nesterov
On 03/18, Sudeep Holla wrote:
>
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -3278,35 +3278,29 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>  
>   user_exit();
>  
> - flags = READ_ONCE(current_thread_info()->flags) &
> - (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
> -
> - if (flags) {
> - int rc = tracehook_report_syscall_entry(regs);
> + if (unlikely(ptrace_syscall_enter(regs))) {
> + /*
> +  * A nonzero return code from tracehook_report_syscall_entry()
> +  * tells us to prevent the syscall execution, but we are not
> +  * going to execute it anyway.
> +  *
> +  * Returning -1 will skip the syscall execution. We want to
> +  * avoid clobbering any registers, so we don't goto the skip
> +  * label below.
> +  */
> + return -1;
> + }
>  
> - if (unlikely(flags & _TIF_SYSCALL_EMU)) {
> - /*
> -  * A nonzero return code from
> -  * tracehook_report_syscall_entry() tells us to prevent
> -  * the syscall execution, but we are not going to
> -  * execute it anyway.
> -  *
> -  * Returning -1 will skip the syscall execution. We want
> -  * to avoid clobbering any registers, so we don't goto
> -  * the skip label below.
> -  */
> - return -1;
> - }
> + flags = READ_ONCE(current_thread_info()->flags) & _TIF_SYSCALL_TRACE;

Why do we need READ_ONCE() with this change?

And now that we check a single bit, "flags" doesn't look like a good name.

Again, to me this patch just makes the code look worse. Honestly, I don't
think that the new (badly named) ptrace_syscall_enter() hook makes any sense.

Oleg.



Re: [PATCH v2 3/6] x86: clean up _TIF_SYSCALL_EMU handling using ptrace_syscall_enter hook

2019-03-18 Thread Oleg Nesterov
On 03/18, Sudeep Holla wrote:
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -70,22 +70,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
>  
>   struct thread_info *ti = current_thread_info();
>   unsigned long ret = 0;
> - bool emulated = false;
>   u32 work;
>  
>   if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>   BUG_ON(regs != task_pt_regs(current));
>  
> - work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
> -
> - if (unlikely(work & _TIF_SYSCALL_EMU))
> - emulated = true;
> -
> - if ((emulated || (work & _TIF_SYSCALL_TRACE)) &&
> - tracehook_report_syscall_entry(regs))
> + if (unlikely(ptrace_syscall_enter(regs)))
>   return -1L;
>  
> - if (emulated)
> + work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
> + if ((work & _TIF_SYSCALL_TRACE) && tracehook_report_syscall_entry(regs))
>   return -1L;

Well, I won't really argue, but to be honest I think this change doesn't make
the code better... With this patch tracehook_report_syscall_entry() has 2
callers; to me this just adds some confusion.
to me this just adds some confusion.

I agree that the usage of emulated/_TIF_SYSCALL_EMU looks a bit overcomplicated,
I'd suggest a simple cleanup below.

And it seems that _TIF_WORK_SYSCALL_ENTRY needs some cleanups too... We don't
need "& _TIF_WORK_SYSCALL_ENTRY" in syscall_trace_enter, and _TIF_WORK_SYSCALL_ENTRY
should not include _TIF_NOHZ?

Oleg.


--- x/arch/x86/entry/common.c
+++ x/arch/x86/entry/common.c
@@ -70,23 +70,18 @@ static long syscall_trace_enter(struct pt_regs *regs)
 
struct thread_info *ti = current_thread_info();
unsigned long ret = 0;
-   bool emulated = false;
u32 work;
 
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
BUG_ON(regs != task_pt_regs(current));
 
-   work = READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
+   work = READ_ONCE(ti->flags);
 
-   if (unlikely(work & _TIF_SYSCALL_EMU))
-   emulated = true;
-
-   if ((emulated || (work & _TIF_SYSCALL_TRACE)) &&
-   tracehook_report_syscall_entry(regs))
-   return -1L;
-
-   if (emulated)
-   return -1L;
+   if (work & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
+   ret = tracehook_report_syscall_entry(regs);
+   if (ret || (work & _TIF_SYSCALL_EMU))
+   return -1L;
+   }
 
 #ifdef CONFIG_SECCOMP
/*



Re: [PATCH] powerpc/ptrace: cleanup do_syscall_trace_enter

2018-12-17 Thread Oleg Nesterov
On 12/16, Dmitry V. Levin wrote:
>
>  long do_syscall_trace_enter(struct pt_regs *regs)
>  {
> + u32 cached_flags;
> +
>   user_exit();
>  
> - if (test_thread_flag(TIF_SYSCALL_EMU)) {
> - /*
> -  * A nonzero return code from tracehook_report_syscall_entry()
> -  * tells us to prevent the syscall execution, but we are not
> -  * going to execute it anyway.
> -  *
> -  * Returning -1 will skip the syscall execution. We want to
> -  * avoid clobbering any register also, thus, not 'gotoing'
> -  * skip label.
> -  */
> - if (tracehook_report_syscall_entry(regs))
> - ;
> - return -1;
> - }
> + cached_flags = READ_ONCE(current_thread_info()->flags) &
> +(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
>  
> - /*
> -  * The tracer may decide to abort the syscall, if so tracehook
> -  * will return !0. Note that the tracer may also just change
> -  * regs->gpr[0] to an invalid syscall number, that is handled
> -  * below on the exit path.
> -  */
> - if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> - tracehook_report_syscall_entry(regs))
> - goto skip;
> + if (cached_flags) {
> + int rc = tracehook_report_syscall_entry(regs);
> +
> + if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) {
> + /*
> +  * A nonzero return code from
> +  * tracehook_report_syscall_entry() tells us
> +  * to prevent the syscall execution, but
> +  * we are not going to execute it anyway.
> +  *
> +  * Returning -1 will skip the syscall execution.
> +  * We want to avoid clobbering any register also,
> +  * thus, not 'gotoing' skip label.
> +  */
> + return -1;
> + }
> +
> + if (rc) {
> + /*
> +  * The tracer decided to abort the syscall.
> +  * Note that the tracer may also just change
> +  * regs->gpr[0] to an invalid syscall number,
> +  * that is handled below on the exit path.
> +  */
> + goto skip;
> + }
> + }

Looks good to me,

Oleg.



Re: [PATCH v6] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

2018-12-10 Thread Oleg Nesterov
On 12/07, Dmitry V. Levin wrote:
>
> Please make either v5 or v6 edition of this fix, or any similar fix,
> into v4.20.

IIUC, v5 above means

[PATCH v5 23/25] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

you sent in another series...

>  long do_syscall_trace_enter(struct pt_regs *regs)
>  {
> + struct thread_info *ti;
> + u32 cached_flags;
> +
>   user_exit();
>  
> - if (test_thread_flag(TIF_SYSCALL_EMU)) {
> - ptrace_report_syscall(regs);
> - /*
> -  * Returning -1 will skip the syscall execution. We want to
> -  * avoid clobbering any register also, thus, not 'gotoing'
> -  * skip label.
> -  */
> - return -1;
> - }
> + ti = current_thread_info();
> + cached_flags = READ_ONCE(ti->flags) &
> +(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE |
> + _TIF_SYSCALL_TRACEPOINT);
>  
> - /*
> -  * The tracer may decide to abort the syscall, if so tracehook
> -  * will return !0. Note that the tracer may also just change
> -  * regs->gpr[0] to an invalid syscall number, that is handled
> -  * below on the exit path.
> -  */
> - if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> - tracehook_report_syscall_entry(regs))
> - goto skip;
> + if (cached_flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
> + int rc = tracehook_report_syscall_entry(regs);
> +
> + if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) {
> + /*
> +  * A nonzero return code from
> +  * tracehook_report_syscall_entry() tells us
> +  * to prevent the syscall execution, but
> +  * we are not going to execute it anyway.
> +  *
> +  * Returning -1 will skip the syscall execution.
> +  * We want to avoid clobbering any register also,
> +  * thus, not 'gotoing' skip label.
> +  */
> + return -1;
> + }
> +
> + if (rc) {
> + /*
> +  * The tracer decided to abort the syscall.
> +  * Note that the tracer may also just change
> +  * regs->gpr[0] to an invalid syscall number,
> +  * that is handled below on the exit path.
> +  */
> + goto skip;
> + }
> + }
>  
>   /* Run seccomp after ptrace; allow it to set gpr[3]. */
>   if (do_seccomp(regs))
> @@ -3293,7 +3309,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>   if (regs->gpr[0] >= NR_syscalls)
>   goto skip;
>  
> - if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> + if (unlikely(cached_flags & _TIF_SYSCALL_TRACEPOINT))

I will leave this to maintainers, but to me this change looks good and IMO it
also cleans up the code.

However I am not sure cached_flags should include _TIF_SYSCALL_TRACEPOINT. If
nothing else, the caller can sleep in ptrace_stop() unpredictably long and
TIF_SYSCALL_TRACEPOINT can be set/cleared meanwhile.

Oleg.



Re: [PATCH v4] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

2018-12-07 Thread Oleg Nesterov
On 12/07, Dmitry V. Levin wrote:
>
> On Fri, Dec 07, 2018 at 10:12:49PM +1100, Michael Ellerman wrote:
>
> > > Sorry, this patch does not work, please ignore it.
> >
> > Hmm OK. Why exactly?
>
> Unfortunately, I have no idea why it doesn't work.
> All I can say is it breaks strace because the kernel no longer sends
> syscall entry stops.

Maybe because TIF_SYSCALL_EMU/etc is a bit number, not a mask? IOW, rather
than

whatever & TIF_XXX

you should do

whatever & _TIF_XXX

instead?

Oleg.
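
For clarity, the distinction in question: in the kernel's thread_info headers
the TIF_* names are bit numbers and the _TIF_* names are the corresponding
masks, e.g. on x86 (illustrative):

	#define TIF_SYSCALL_EMU		6	/* bit number */
	#define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)	/* mask */

so "flags & TIF_SYSCALL_EMU" silently tests the wrong bits.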



Re: [PATCH v3 1/2] livepatch: send a fake signal to all blocking tasks

2017-11-02 Thread Oleg Nesterov
On 11/02, Miroslav Benes wrote:
>
> On Wed, 1 Nov 2017, Oleg Nesterov wrote:
>
> > Note also that wake_up_state(TASK_INTERRUPTIBLE) won't wakeup the TASK_IDLE
> > kthreads, and most of the kthreads which use TASK_INTERRUPTIBLE should use
> > TASK_IDLE today, because in most cases TASK_INTERRUPTIBLE was used to not
> > contribute to loadavg.
>
> Yes. Unfortunately, we have TASK_IDLE for more than two years now and
> nothing much has happened yet. TASK_IDLE is still used sporadically. I'd
> like to be on the safe side with livepatch

OK, as I said I won't argue,

> and given that
> TASK_INTERRUPTIBLE loops should be prepared for spurious wakeups by
> definition,

Not really when it comes to kthreads.

Once again, unless kthread does allow_signal() TASK_INTERRUPTIBLE does
not really differ from TASK_UNINTERRUPTIBLE except the latter contributes
to loadavg. And that is why TASK_INTERRUPTIBLE was commonly used instead
of TASK_UNINTERRUPTIBLE, so I do not think that TASK_INTERRUPTIBLE loops
are more ready in general than TASK_UNINTERRUPTIBLE.

Oleg.
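
To illustrate the distinction (a generic sketch, not code from the thread):
a kthread that merely parks itself should sleep in TASK_IDLE, which is
TASK_UNINTERRUPTIBLE | TASK_NOLOAD, so it neither inflates loadavg nor is
woken by wake_up_state(..., TASK_INTERRUPTIBLE):

	static int worker_fn(void *data)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_IDLE);
			if (!kthread_should_stop())
				schedule();	/* kthread_stop() still wakes us */
			__set_current_state(TASK_RUNNING);
			/* ... do work ... */
		}
		return 0;
	}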



Re: [PATCH v3 1/2] livepatch: send a fake signal to all blocking tasks

2017-11-01 Thread Oleg Nesterov
On 11/01, Petr Mladek wrote:
>
> On Tue 2017-10-31 12:48:52, Miroslav Benes wrote:
> > +   if (task->flags & PF_KTHREAD) {
> > +   /*
> > +* Wake up a kthread which still has not been migrated.
> > +*/
> > +   wake_up_process(task);
>
> I have just noticed that freezer used wake_up_state(p, TASK_INTERRUPTIBLE);
> IMHO, we should do so as well.

I won't argue, but...

> wake_up_process() wakes also tasks in TASK_UNINTERRUPTIBLE state.
> These might not be ready for an unexpected wakeup. For example,
> see concat_dev_erase() in drivers/mtd/mtdconcat.c.

I'd say that concat_dev_erase() should be fixed, any code should be ready
for spurious wakeup.

Note also that wake_up_state(TASK_INTERRUPTIBLE) won't wakeup the TASK_IDLE
kthreads, and most of the kthreads which use TASK_INTERRUPTIBLE should use
TASK_IDLE today, because in most cases TASK_INTERRUPTIBLE was used to not
contribute to loadavg.

Oleg.



Re: [PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-07-10 Thread Oleg Nesterov
On 06/29, James Morse wrote:
>
> compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
> instead using those in ptrace_request(). The compat variant should
> read a compat_sigset_t from userspace instead of ptrace_request()s
> sigset_t.

Acked-by: Oleg Nesterov <o...@redhat.com>



Re: [PATCH v5] powerpc: Do not make the entire heap executable

2016-09-29 Thread Oleg Nesterov
On 09/28, Kees Cook wrote:
>
> This is where the flags are actually built from what's coming in
> through the newly created exported function vm_brk_flags() below. The
> only flag we're acting on is VM_EXEC (passed in from set_brk() above).
> I think do_brk_flags() should mask the valid flags, or we'll regret it
> in the future. I'd like to see something like:
>
> /* Until we need other flags, refuse anything except VM_EXEC. */
> if ((flags & (~VM_EXEC)) != 0)
> return -EINVAL;
> flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;

I tried to suggest this too. In particular it would be simply wrong
to accept VM_LOCKED in flags.

Oleg.



Re: [PATCH v4] powerpc: Do not make the entire heap executable

2016-08-10 Thread Oleg Nesterov
On 08/10, Denys Vlasenko wrote:
>
> Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire
> brk area* with executable rights for all binaries, even --secure-plt ones.
>
> Stop doing that.

Can't really review this patch, but at least the change in mm/mmap.c looks
technically correct to me... One nit below, feel free to ignore.

> @@ -2668,7 +2668,7 @@ static int do_brk(unsigned long addr, unsigned long 
> request)
>   if (!len)
>   return 0;
>
> - flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
> + flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;

OK. But note that we have

mlock_future_check(mm->def_flags);

a few lines below and after this change this _looks_ wrong because
VM_LOCKED can come from the new "flags" argument passed to do_brk().
Nobody does this right now, still this looks wrong/confusing.

I'd suggest to add another change

-   mlock_future_check(mm->def_flags);
+   mlock_future_check(flags);

or add a sanity check at the start to deny VM_LOCKED and perhaps
something else...

The same for vm_brk_flags() which after your change does

do_brk_flags(flags);
populate = (mm->def_flags & VM_LOCKED);

again, this is just a nit, I do not think it will be ever called
with VM_LOCKED in "flags".

Oleg.
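
Taken together, the two nits would land in the brk path roughly like this (a
sketch only, against the mm/mmap.c of that time; mlock_future_check() is the
existing helper there):

	static int do_brk_flags(unsigned long addr, unsigned long len,
				unsigned long flags)
	{
		struct mm_struct *mm = current->mm;

		/* until we need other flags, refuse anything except VM_EXEC */
		if (flags & ~VM_EXEC)
			return -EINVAL;

		flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;

		/* check the flags we will actually map with, not mm->def_flags */
		if (mlock_future_check(mm, flags, len))
			return -EAGAIN;

		/* ... the rest of do_brk() unchanged ... */
		return 0;
	}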



Re: [PATCH 09/14] resource limits: track highwater mark of locked memory

2016-07-18 Thread Oleg Nesterov
On 07/15, Topi Miettinen wrote:
>
> On 07/15/16 15:14, Oleg Nesterov wrote:
> >
> > Btw this is not right. The same for the previous patch which tracks
> > RLIMIT_STACK. The "current" task can debugger/etc.
>
> acct_stack_growth() is called from expand_upwards() and
> expand_downwards(). They call security_mmap_addr() and the various LSM
> implementations also use current task in the checks. Are these also not
> right?

Just suppose that the stack grows because you read/write to /proc/pid/mem.

> > Yes, yes, this just reminds that the whole rlimit logic in this path
> > is broken but still...
>
> I'd be happy to fix the logic with a separate prerequisite patch and
> then use the right logic for this patch, but I'm not sure I know how.
> Could you elaborate a bit?

If only I knew how to fix this ;) I mean, if only I could suggest a
simple fix. Because IMHO we do not really care, rlimits are obsolete.

Oleg.


Re: [PATCH 09/14] resource limits: track highwater mark of locked memory

2016-07-15 Thread Oleg Nesterov
On 07/15, Topi Miettinen wrote:
>
> Track maximum size of locked memory, to be able to configure
> RLIMIT_MEMLOCK resource limits. The information is available
> with taskstats and cgroupstats netlink socket.

So I personally still dislike the very idea of this series... but I won't
argue if you convince maintainers.

> @@ -2020,6 +2020,10 @@ static int acct_stack_growth(struct vm_area_struct 
> *vma, unsigned long size, uns
>   return -ENOMEM;
>  
>   update_resource_highwatermark(RLIMIT_STACK, actual_size);
> + if (vma->vm_flags & VM_LOCKED)
> + update_resource_highwatermark(RLIMIT_MEMLOCK,
> +   (mm->locked_vm + grow) <<
> +   PAGE_SHIFT);

Btw this is not right. The same for the previous patch which tracks
RLIMIT_STACK. The "current" task can be a debugger/etc.

Yes, yes, this just reminds that the whole rlimit logic in this path
is broken but still...

Oleg.


Re: [PATCH v2 00/20] Fix handling of compat_siginfo_t

2015-11-09 Thread Oleg Nesterov
On 11/07, Andy Lutomirski wrote:
>
> On Wed, Nov 4, 2015 at 4:50 PM, Amanieu d'Antras  wrote:
> > One issue that isn't resolved in this series is sending signals between a 
> > 32-bit
> > process and 64-bit process. Sending a si_int will work correctly, but a 
> > si_ptr
> > value will likely get corrupted due to the different layouts of the 32-bit 
> > and
> > 64-bit siginfo_t structures.
>
> This is so screwed up it's not even funny.

Agreed,

> A 64-bit big-endian compat calls rt_sigqueueinfo.  It passes in (among
> other things) a sigval_t.  The kernel can choose to interpret it

I always thought that the kernel should not interpret it at all. And indeed,
copy_siginfo_to_user() does

if (from->si_code < 0)
return __copy_to_user(to, from, sizeof(siginfo_t))

probably copy_siginfo_to_user32() should do something similar, at least
it should not truncate ->si_code if it is less than zero.

Not sure what signalfd_copyinfo() should do.

But perhaps I was wrong, I failed to find man sigqueueinfo, and man
sigqueue() documents that it passes sigval_t.


> BTW, x86 has its own set of screwups here.  Somehow cr2 and error_code
> ended up as part of ucontext instead of siginfo, which makes
> absolutely no sense to me and bloats task_struct.

Yes, and probably ->ip should have been the part of siginfo too. Say,
if you get SIGBUS you can't trust sc->ip if another signal was dequeued
before SIGBUS, in this case sc->ip will point to the handler of that
another signal. That is why we have SYNCHRONOUS_MASK and it helps, but
still this doesn't look nice.
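
For reference, SYNCHRONOUS_MASK is the kernel/signal.c set of fault-like
signals that are dequeued before any others, precisely so that the register
state still matches the fault:

	#define SYNCHRONOUS_MASK \
		(sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
		 sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))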

Oleg.


Re: [PATCH 4/8] x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE

2015-01-15 Thread Oleg Nesterov
On 01/15, Christian Borntraeger wrote:

 On 15.01.2015 at 20:38, Oleg Nesterov wrote:
  On 01/15, Christian Borntraeger wrote:
 
  --- a/arch/x86/include/asm/spinlock.h
  +++ b/arch/x86/include/asm/spinlock.h
  @@ -186,7 +186,7 @@ static inline void 
  arch_spin_unlock_wait(arch_spinlock_t *lock)
  __ticket_t head = ACCESS_ONCE(lock->tickets.head);
 
 for (;;) {
   -  struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
   +  struct __raw_tickets tmp = READ_ONCE(lock->tickets);
 
  Agreed, but what about another ACCESS_ONCE() above?
 
  Oleg.

 tickets.head is a scalar type, so ACCESS_ONCE does work fine with gcc 4.6/4.7.
 My goal was to convert all accesses on non-scalar types

I understand, but READ_ONCE(lock->tickets.head) looks better anyway and
arch_spin_lock() already use READ_ONCE() for this.

So why should we keep the last ACCESS_ONCE() in spinlock.h ? Just to make
another cosmetic cleanup which touches the same function later?

Oleg.
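
The leftover being pointed at would then simply become:

	__ticket_t head = READ_ONCE(lock->tickets.head);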


Re: [PATCH 4/8] x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE

2015-01-15 Thread Oleg Nesterov
On 01/15, Christian Borntraeger wrote:

 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -186,7 +186,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
 *lock)
  __ticket_t head = ACCESS_ONCE(lock->tickets.head);
  
   for (;;) {
 - struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 + struct __raw_tickets tmp = READ_ONCE(lock->tickets);

Agreed, but what about another ACCESS_ONCE() above?

Oleg.


Re: [PATCH v4 1/3] init/main.c: Give init_task a canary

2014-09-18 Thread Oleg Nesterov
On 09/12, Aaron Tomlin wrote:

 Tasks get their end of stack set to STACK_END_MAGIC with the
 aim to catch stack overruns. Currently this feature does not
 apply to init_task. This patch removes this restriction.

 Note that a similar patch was posted by Prarit Bhargava [1]
 some time ago but was never merged.

 [1]: http://marc.info/?l=linux-kernel&m=127144305403241&w=2

 Signed-off-by: Aaron Tomlin atom...@redhat.com
 Acked-by: Michael Ellerman m...@ellerman.id.au

Acked-by: Oleg Nesterov o...@redhat.com


Re: [PATCH 1/2] sched: Add helper for task stack page overrun checking

2014-09-04 Thread Oleg Nesterov
On 09/04, Aaron Tomlin wrote:

 +#define task_stack_end_corrupted(task) \
 + (*(end_of_stack(task)) != STACK_END_MAGIC)

and it is always used along with tsk != init_task.

But why we exclude swapper/0? Can't we add

	*end_of_stack(current) = STACK_END_MAGIC;

somewhere at the start of start_kernel() ?

If not, perhaps this new helper should check task != init_task
itself so that we can simplify its users?

Oleg.

  
  static inline int object_is_on_stack(void *obj)
  {
 diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
 index 8a4e5cb..06c7390 100644
 --- a/kernel/trace/trace_stack.c
 +++ b/kernel/trace/trace_stack.c
 @@ -13,7 +13,6 @@
  #include linux/sysctl.h
  #include linux/init.h
  #include linux/fs.h
 -#include linux/magic.h
  
  #include asm/setup.h
  
 @@ -171,8 +170,8 @@ check_stack(unsigned long ip, unsigned long *stack)
   i++;
   }
  
  - if ((current != init_task &&
  - *(end_of_stack(current)) != STACK_END_MAGIC)) {
  + if (current != init_task &&
 + task_stack_end_corrupted(current)) {
   print_max_stack();
   BUG();
   }
 -- 
 1.9.3
 


Re: bit fields data tearing

2014-07-13 Thread Oleg Nesterov
On 07/13, Benjamin Herrenschmidt wrote:

 On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote:
  OK, looks like this is compiler bug,
 
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
 
  Thanks to Dan who informed me privately.

 So yes, there is this compiler bug which means a bitfield
 access can cause a r-m-w access to a neighbouring field

Thanks. So I can forward this all back to bugzilla.

 but
 in general, I would be weary of bitfields anyway since accessing
 them isn't going to be atomic anyway... it's too easy to get things
 wrong and in most cases the benefit is yet to be demonstrated.

Sure, bit fields should be used with care. But the same arguments apply
to bitmasks; they are often used without atomic set_bit()/clear_bit().
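
(For comparison, the bitmask flavour of the same hazard, with hypothetical
FLAG_A/FLAG_B:

	flags |= FLAG_A;	/* CPU 0: load, or, store */
	flags &= ~FLAG_B;	/* CPU 1: load, and, store */

one plain read-modify-write can silently overwrite the other unless
set_bit()/clear_bit() are used.)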

 In your example, I don't see the point of the bitfield.

This is just a test-case. The real code has more adjacent bit fields; only
the tracee can modify them, and only the debugger can change ->freeze_stop.

Thanks,

Oleg.


bit fields data tearing

2014-07-12 Thread Oleg Nesterov
Hello,

I am not sure I should ask here, but since Documentation/memory-barriers.txt
mentions load/store tearing, perhaps my question is not completely off-topic...

I am fighting with a mysterious RHEL bug; it can be reproduced on ppc and s390
but not on x86. Finally I seem to understand the problem, and I even wrote a
stupid kernel module to verify it, see it below at the end.

It triggers the problem immediately, kt_2() sees the wrong value in freeze_stop.
(If I turn ->freeze_stop into long, the problem goes away).

So the question is: is this a gcc bug, or is the code below buggy?

If it is buggy, then probably memory-barriers.txt could mention that you should
be careful with bit fields, even ACCESS_ONCE() obviously can't help.

Or this just discloses my ignorance and you need at least aligned(long) after a
bit field to be thread-safe? I thought that the compiler should take care and
add the necessary alignment if (say) the CPU can't update a single byte/uint.

gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:

0000000000000000 <.kt_2>:
   0:   7c 08 02 a6 mflrr0
   4:   fb 81 ff e0 std r28,-32(r1)
   8:   fb a1 ff e8 std r29,-24(r1)
   c:   fb c1 ff f0 std r30,-16(r1)
  10:   fb e1 ff f8 std r31,-8(r1)
  14:   eb c2 00 00 ld  r30,0(r2)
  18:   f8 01 00 10 std r0,16(r1)
  1c:   f8 21 ff 71 stdur1,-144(r1)
  20:   7c 7d 1b 78 mr  r29,r3
  24:   3b e0 00 00 li  r31,0
  28:   78 3c 04 64 rldicr  r28,r1,0,49
  2c:   3b 9c 00 80 addir28,r28,128
  30:   48 00 00 2c b   5c .kt_2+0x5c
  34:   60 00 00 00 nop
  38:   60 00 00 00 nop
  3c:   60 00 00 00 nop
  40:   93 fd 00 04 stw r31,4(r29)
  44:   e8 9d 00 06 lwa r4,4(r29)
  48:   7f 84 f8 00 cmpwcr7,r4,r31
  4c:   40 de 00 4c bne-cr7,98 .kt_2+0x98
  50:   e8 1c 00 00 ld  r0,0(r28)
  54:   78 09 f7 e3 rldicl. r9,r0,62,63
  58:   40 c2 00 54 bne-ac .kt_2+0xac
  5c:   48 00 00 01 bl  5c .kt_2+0x5c
  60:   60 00 00 00 nop
  64:   3b ff 00 01 addir31,r31,1
  68:   2f a3 00 00 cmpdi   cr7,r3,0
  6c:   7f ff 07 b4 extsw   r31,r31
  70:   41 9e ff d0 beq+cr7,40 .kt_2+0x40
  74:   38 21 00 90 addir1,r1,144
  78:   38 60 00 00 li  r3,0
  7c:   e8 01 00 10 ld  r0,16(r1)
  80:   eb 81 ff e0 ld  r28,-32(r1)
  84:   eb a1 ff e8 ld  r29,-24(r1)
  88:   eb c1 ff f0 ld  r30,-16(r1)
  8c:   eb e1 ff f8 ld  r31,-8(r1)
  90:   7c 08 03 a6 mtlrr0
  94:   4e 80 00 20 blr
  98:   e8 7e 80 28 ld  r3,-32728(r30)
  9c:   7f e5 fb 78 mr  r5,r31
  a0:   48 00 00 01 bl  a0 .kt_2+0xa0
  a4:   60 00 00 00 nop
  a8:   4b ff ff a8 b   50 .kt_2+0x50
  ac:   48 00 00 01 bl  ac .kt_2+0xac
  b0:   60 00 00 00 nop
  b4:   4b ff ff a8 b   5c .kt_2+0x5c
  b8:   60 00 00 00 nop
  bc:   60 00 00 00 nop

00000000000000c0 <.kt_1>:
  c0:   7c 08 02 a6 mflrr0
  c4:   fb 81 ff e0 std r28,-32(r1)
  c8:   fb a1 ff e8 std r29,-24(r1)
  cc:   fb c1 ff f0 std r30,-16(r1)
  d0:   fb e1 ff f8 std r31,-8(r1)
  d4:   eb c2 00 00 ld  r30,0(r2)
  d8:   f8 01 00 10 std r0,16(r1)
  dc:   f8 21 ff 71 stdur1,-144(r1)
  e0:   7c 7d 1b 78 mr  r29,r3
  e4:   3b e0 00 00 li  r31,0
  e8:   78 3c 04 64 rldicr  r28,r1,0,49
  ec:   3b 9c 00 80 addir28,r28,128
  f0:   48 00 00 38 b   128 .kt_1+0x68
  f4:   60 00 00 00 nop
  f8:   60 00 00 00 nop
  fc:   60 00 00 00 nop
 100:   e8 1d 00 00 ld  r0,0(r29)
 104:   79 20 e8 0e rldimi  r0,r9,61,0
 108:   f8 1d 00 00 std r0,0(r29)
 10c:   80 1d 00 00 lwz r0,0(r29)
 110:   54 00 1f 7e rlwinm  r0,r0,3,29,31
 114:   7f 80 f8 00 cmpwcr7,r0,r31
 118:   40 de 00 6c bne-cr7,184 .kt_1+0xc4
 11c:   e8 1c 00 00 ld  r0,0(r28)
 120:   78 09 f7 e3 rldicl. r9,r0,62,63
 124:   40 c2 00 70 bne-194 .kt_1+0xd4
 128:   48 00 00 01 bl  128 .kt_1+0x68
 12c:   60 00 00 00 nop
 130:   3b ff 00 01 addir31,r31,1
 134:   2f a3 00 00 cmpdi   cr7,r3,0
 138:   7f ff 07 b4 extsw   r31,r31
 13c:   2f 1f 00 07 cmpwi   cr6,r31,7
 140:   7b e9 
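
The module source promised above ("see it below at the end") did not survive
in the archive, and the listing is cut off. A rough reconstruction of what it
presumably looked like; the struct layout, names and loop bodies are guesses,
only ->freeze_stop and the adjacent bit field are from the thread:

	#include <linux/module.h>
	#include <linux/kthread.h>

	struct obj {
		unsigned int	bits:3;		/* touched only by kt_1 */
		int		freeze_stop;	/* touched only by kt_2 */
	};

	static struct obj o;
	static struct task_struct *t1, *t2;

	/* writes ->freeze_stop, then immediately re-reads and checks it */
	static int kt_2(void *arg)
	{
		int i = 0;

		while (!kthread_should_stop()) {
			o.freeze_stop = i;
			if (o.freeze_stop != i)
				pr_crit("tearing: %d != %d\n", o.freeze_stop, i);
			i++;
			cond_resched();
		}
		return 0;
	}

	/*
	 * Updates only the bit field, but gcc 4.4 on ppc64 compiles this to
	 * the 64-bit ld/rldimi/std sequence seen in the listing above, i.e.
	 * it also rewrites the bytes of the neighbouring ->freeze_stop.
	 */
	static int kt_1(void *arg)
	{
		int i = 0;

		while (!kthread_should_stop()) {
			o.bits = i++ & 7;
			cond_resched();
		}
		return 0;
	}

	static int __init tearing_init(void)
	{
		t1 = kthread_run(kt_1, NULL, "kt_1");
		t2 = kthread_run(kt_2, NULL, "kt_2");
		return 0;
	}

	static void __exit tearing_exit(void)
	{
		kthread_stop(t2);
		kthread_stop(t1);
	}

	module_init(tearing_init);
	module_exit(tearing_exit);
	MODULE_LICENSE("GPL");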

Re: bit fields data tearing

2014-07-12 Thread Oleg Nesterov
OK, looks like this is compiler bug,

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Thanks to Dan who informed me privately.

On 07/12, Oleg Nesterov wrote:

 Hello,

 I am not sure I should ask here, but since Documentation/memory-barriers.txt
 mentions load/store tearing perhaps my question is not completely off-topic...

 I am fighting with mysterious RHEL bug, it can be reproduced on ppc and s390
 but not on x86. Finally I seem to understand the problem, and I even wrote the
 stupid kernel module to ensure, see it below at the end.

 It triggers the problem immediately, kt_2() sees the wrong value in 
 freeze_stop.
 (If I turn ->freeze_stop into long, the problem goes away).

 So the question is: is this gcc bug or the code below is buggy?

 If it is buggy, then probably memory-barriers.txt could mention that you 
 should
 be careful with bit fields, even ACCESS_ONCE() obviously can't help.

 Or this just discloses my ignorance and you need at least aligned(long) after 
 a
 bit field to be thread-safe ? I thought that compiler should take care and add
 the necessary alignment if (say) CPU can't update a single byte/uint.

 gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:

 [... quoted disassembly snipped; identical to the listing above ...]

Re: perf events ring buffer memory barrier on powerpc

2013-10-29 Thread Oleg Nesterov
On 10/29, Peter Zijlstra wrote:

 On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote:
  @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
   * Userspace could choose to issue a mb() before updating the
   * tail pointer. So that all reads will be completed before the
   * write is issued.
  +*
  +* See perf_output_put_handle().
   */
  tail = ACCESS_ONCE(rb->user_page->data_tail);
  -   smp_rmb();
  +   smp_mb();
  offset = head = local_read(&rb->head);
  head += size;
  if (unlikely(!perf_output_space(rb, tail, offset, head)))

 That said; it would be very nice to be able to remove this barrier. This
 is in every event write path :/

Yes.. And I'm afraid very much that I simply confused you. Perhaps Victor
is right and we do not need this mb(). So I am waiting for the end of
this story too.

And btw I do not understand why we need it (or smp_rmb) right after
ACCESS_ONCE(data_tail).

Oleg.
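
For context, the userspace half of that contract (the (C)/(D) column) ends up
looking roughly like this; a sketch, not from the thread, and rmb()/mb() stand
for whatever barrier macros the tool side provides:

	#include <linux/perf_event.h>

	static void consume_ring(struct perf_event_mmap_page *up, char *data,
				 __u64 mask)
	{
		__u64 tail = up->data_tail;
		__u64 head = *(volatile __u64 *)&up->data_head; /* READ ->data_head (C) */

		rmb();				/* pairs with the kernel's (B) */
		while (tail < head) {		/* READ $data */
			struct perf_event_header *eh =
				(struct perf_event_header *)(data + (tail & mask));
			tail += eh->size;
		}
		mb();				/* pairs with the kernel's (A) */
		up->data_tail = tail;		/* WRITE ->data_tail (D) */
	}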



Re: perf events ring buffer memory barrier on powerpc

2013-10-28 Thread Oleg Nesterov
On 10/28, Peter Zijlstra wrote:

 Lets add Paul and Oleg to the thread; this is getting far more 'fun'
 that it should be ;-)

Heh. All I can say is that I would like to see the authoritative answer,
you know who can shed a light ;)

But to avoid the confusion, wmb() added by this patch looks obviously
correct to me.

 +  * Since the mmap() consumer (userspace) can run on a different CPU:
 +  *
 +  *   kernel user
 +  *
 +  *   READ ->data_tail		READ ->data_head
 +  *   smp_rmb()  (A)		smp_rmb()   (C)
 +  *   WRITE $data			READ $data
 +  *   smp_wmb()  (B)		smp_mb()    (D)
 +  *   STORE ->data_head		WRITE ->data_tail
 +  *
 +  * Where A pairs with D, and B pairs with C.
 +  *
 +  * I don't think A needs to be a full barrier because we won't in fact
 +  * write data until we see the store from userspace.

this matches the intuition, but ...

 So we simply don't
 +  * issue the data WRITE until we observe it.

why do we need any barrier (rmb) then? how can it help to serialize with
WRITE $data ?

(of course there could be other reasons for this rmb(), just I can't
 really understand A pairs with D).

And this reminds me about the memory barrier in kfifo.c which I was not
able to understand. Hmm, it has already gone away, and now I do not
understand kfifo.c at all ;) But I have found the commit, attached below.

Note that that commit added the full barrier into __kfifo_put(). And to
me it looks the same as A above. Following the logic above we could say
that we do not need a barrier (at least the full one), we won't in fact
write into the unread area until we see the store to ->out from
__kfifo_get() ?


In short. I am confused, I _feel_ that A has to be a full barrier, but
I can't prove this. And let me suggest the artificial/simplified example,

boolBUSY;
data_t  DATA;

bool try_to_get(data_t *data)
{
if (!BUSY)
return false;

rmb();

*data = DATA;
mb();
BUSY = false;

return true;
}

bool try_to_put(data_t *data)
{
if (BUSY)
return false;

mb();   // : do we really need it? I think yes.

DATA = *data;
wmb();
BUSY = true;

return true;
}

Again, following the description above we could turn the mb() in _put()
into barrier(), but I do not think we can rely on the control dependency.

Oleg.
---

commit a45bce49545739a940f8bd4ca85c3b7435564893
Author: Paul E. McKenney paul...@us.ibm.com
Date:   Fri Sep 29 02:00:11 2006 -0700

[PATCH] memory ordering in __kfifo primitives

Both __kfifo_put() and __kfifo_get() have header comments stating that if
there is but one concurrent reader and one concurrent writer, locking is not
necessary.  This is almost the case, but a couple of memory barriers are
needed.  Another option would be to change the header comments to remove the
bit about locking not being needed, and to change the those callers who
currently don't use locking to add the required locking.  The attachment
analyzes this approach, but the patch below seems simpler.

Signed-off-by: Paul E. McKenney paul...@us.ibm.com
Cc: Stelian Pop stel...@popies.net
Signed-off-by: Andrew Morton a...@osdl.org
Signed-off-by: Linus Torvalds torva...@osdl.org

diff --git a/kernel/kfifo.c b/kernel/kfifo.c
index 64ab045..5d1d907 100644
--- a/kernel/kfifo.c
+++ b/kernel/kfifo.c
@@ -122,6 +122,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
 
len = min(len, fifo->size - fifo->in + fifo->out);
 
+   /*
+* Ensure that we sample the fifo->out index -before- we
+* start putting bytes into the kfifo.
+*/
+
+   smp_mb();
+
/* first put the data starting from fifo->in to buffer end */
l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
@@ -129,6 +136,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
/* then put the rest (if any) at the beginning of the buffer */
memcpy(fifo->buffer, buffer + l, len - l);
 
+   /*
+* Ensure that we add the bytes to the kfifo -before-
+* we update the fifo->in index.
+*/
+
+   smp_wmb();
+
fifo->in += len;
 
return len;
@@ -154,6 +168,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,
 
len = min(len, fifo->in - fifo->out);
 
+   /*
+* Ensure that we sample the fifo->in index -before- we
+* start removing bytes from the kfifo.
+*/
+
+   smp_rmb();
+
/* first get the data from fifo->out until the end of the buffer */
l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));

Re: perf events ring buffer memory barrier on powerpc

2013-10-28 Thread Oleg Nesterov
On 10/28, Paul E. McKenney wrote:

 On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
  --- linux-2.6.orig/kernel/events/ring_buffer.c
  +++ linux-2.6/kernel/events/ring_buffer.c
  @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
  goto out;
 
  /*
  -* Publish the known good head. Rely on the full barrier implied
  -* by atomic_dec_and_test() order the rb->head read and this
  -* write.
  +* Since the mmap() consumer (userspace) can run on a different CPU:
  +*
  +*   kernel user
  +*
  +*   READ ->data_tail   READ ->data_head
  +*   smp_rmb()  (A) smp_rmb()   (C)

 Given that both of the kernel's subsequent operations are stores/writes,
 doesn't (A) need to be smp_mb()?

Yes, this is my understanding^Wfeeling too, but I have to admit that
I can't really explain to myself why _exactly_ we need mb() here.

And let me copy-and-paste the artificial example from my previous
email,

boolBUSY;
data_t  DATA;

bool try_to_get(data_t *data)
{
if (!BUSY)
return false;

rmb();

*data = DATA;
mb();
BUSY = false;

return true;
}

bool try_to_put(data_t *data)
{
if (BUSY)
return false;

mb();   // : do we really need it? I think yes.

DATA = *data;
wmb();
BUSY = true;

return true;
}

(just in case, the code above obviously assumes that _get or _put
 can't race with itself, but they can race with each other).

Could you confirm that try_to_put() actually needs mb() between
LOAD(BUSY) and STORE(DATA) ?

I am sure it actually needs, but I will appreciate it if you can
explain why. IOW, how it is possible that without mb() try_to_put()
can overwrite DATA before try_to_get() completes its *data = DATA
in this particular case.

Perhaps this can happen if, say, reader and writer share a level of
cache or something like this...

Oleg.



Re: linux-next: manual merge of the akpm tree with the powerpc tree

2013-06-26 Thread Oleg Nesterov
On 06/26, Benjamin Herrenschmidt wrote:

 On Wed, 2013-06-26 at 16:56 +1000, Stephen Rothwell wrote:
  Today's linux-next merge of the akpm tree got a conflict in
  arch/powerpc/kernel/ptrace.c between commit b0b0aa9c7faf
  (powerpc/hw_brk: Fix setting of length for exact mode breakpoints) from
  the powerpc tree and commit ptrace/powerpc: revert hw_breakpoints: Fix
  racy access to ptrace breakpoints from the akpm tree.
 
  I fixed it up (see below) and can carry the fix as necessary (no action
  is required).

 Where does the latter come from ?

ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints.patch
attached below.

You were cc'ed every time ;)

 Why didn't it go through the powerpc tree ?

Because this series needs to update any user of ptrace_get/put_breakpoints
in arch/ (simply remove these calls), then change the core kernel code, then
fix arch/x86.

--
From: Oleg Nesterov o...@redhat.com
Subject: ptrace/powerpc: revert hw_breakpoints: Fix racy access to ptrace 
breakpoints

This reverts commit 07fa7a0a8a586 (hw_breakpoints: Fix racy access to
ptrace breakpoints) and removes ptrace_get/put_breakpoints() added by
other commits.

The patch was fine but we can no longer race with SIGKILL after 9899d11f
(ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL),
the __TASK_TRACED tracee can't be woken up and ->ptrace_bps[] can't go
away.

Signed-off-by: Oleg Nesterov o...@redhat.com
Acked-by: Michael Neuling mi...@neuling.org
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Jan Kratochvil jan.kratoch...@redhat.com
Cc: Paul Mundt let...@linux-sh.org
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Russell King li...@arm.linux.org.uk
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 arch/powerpc/kernel/ptrace.c |   20 
 1 file changed, 20 deletions(-)

diff -puN 
arch/powerpc/kernel/ptrace.c~ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints
 arch/powerpc/kernel/ptrace.c
--- 
a/arch/powerpc/kernel/ptrace.c~ptrace-powerpc-revert-hw_breakpoints-fix-racy-access-to-ptrace-breakpoints
+++ a/arch/powerpc/kernel/ptrace.c
@@ -974,16 +974,12 @@ int ptrace_set_debugreg(struct task_stru
hw_brk.type = (data & HW_BRK_TYPE_DABR) | HW_BRK_TYPE_PRIV_ALL;
hw_brk.len = 8;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
-   if (ptrace_get_breakpoints(task) < 0)
-   return -ESRCH;
-
bp = thread->ptrace_bps[0];
if ((!data) || !(hw_brk.type & HW_BRK_TYPE_RDWR)) {
if (bp) {
unregister_hw_breakpoint(bp);
thread->ptrace_bps[0] = NULL;
}
-   ptrace_put_breakpoints(task);
return 0;
}
if (bp) {
@@ -996,11 +992,9 @@ int ptrace_set_debugreg(struct task_stru
 
ret =  modify_user_hw_breakpoint(bp, &attr);
if (ret) {
-   ptrace_put_breakpoints(task);
return ret;
}
thread->ptrace_bps[0] = bp;
-   ptrace_put_breakpoints(task);
thread->hw_brk = hw_brk;
return 0;
}
@@ -1015,12 +1009,9 @@ int ptrace_set_debugreg(struct task_stru
   ptrace_triggered, NULL, task);
if (IS_ERR(bp)) {
thread->ptrace_bps[0] = NULL;
-   ptrace_put_breakpoints(task);
return PTR_ERR(bp);
}
 
-   ptrace_put_breakpoints(task);
-
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
task->thread.hw_brk = hw_brk;
 #else /* CONFIG_PPC_ADV_DEBUG_REGS */
@@ -1439,9 +1430,6 @@ static long ppc_set_hwdebug(struct task_
if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
brk.type |= HW_BRK_TYPE_WRITE;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
-   if (ptrace_get_breakpoints(child) < 0)
-   return -ESRCH;
-
/*
 * Check if the request is for 'range' breakpoints. We can
 * support it if range < 8 bytes.
@@ -1449,12 +1437,10 @@ static long ppc_set_hwdebug(struct task_
if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE) {
len = bp_info->addr2 - bp_info->addr;
} else if (bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT) {
-   ptrace_put_breakpoints(child);
return -EINVAL;
}
bp = thread->ptrace_bps[0];
if (bp) {
-   ptrace_put_breakpoints(child);
return -ENOSPC;
}
 
@@ -1468,11 +1454,9 @@ static long ppc_set_hwdebug(struct task_
   ptrace_triggered, NULL, child);
if (IS_ERR(bp)) {
thread->ptrace_bps[0] = NULL;
-   ptrace_put_breakpoints(child);

[PATCH] ptrace/powerpc: dont flush_ptrace_hw_breakpoint() on fork()

2013-04-21 Thread Oleg Nesterov
arch_dup_task_struct() does flush_ptrace_hw_breakpoint(src), this
destroys the parent's breakpoints for no reason. We should clear
child->thread.ptrace_bps[] copied by dup_task_struct() instead.

Signed-off-by: Oleg Nesterov o...@redhat.com
Acked-by: Michael Neuling mi...@neuling.org

--- x/arch/powerpc/kernel/process.c
+++ x/arch/powerpc/kernel/process.c
@@ -910,10 +910,6 @@ int arch_dup_task_struct(struct task_str
flush_altivec_to_thread(src);
flush_vsx_to_thread(src);
flush_spe_to_thread(src);
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
-   flush_ptrace_hw_breakpoint(src);
-#endif /* CONFIG_HAVE_HW_BREAKPOINT */
-
*dst = *src;
return 0;
 }
@@ -984,6 +980,10 @@ int copy_thread(unsigned long clone_flag
p->thread.ksp_limit = (unsigned long)task_stack_page(p) +
_ALIGN_UP(sizeof(struct thread_info), 16);
 
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+   p->thread.ptrace_bps[0] = NULL;
+#endif
+
 #ifdef CONFIG_PPC_STD_MMU_64
if (mmu_has_feature(MMU_FTR_SLB)) {
unsigned long sp_vsid;



Re: [PATCH v2 1/4] uprobes: add trap variant helper

2013-03-23 Thread Oleg Nesterov
On 03/22, Ananth N Mavinakayanahalli wrote:

 Some architectures like powerpc have multiple variants of the trap
 instruction. Introduce an additional helper is_trap_insn() for run-time
 handling of non-uprobe traps on such architectures.

Looks fine to me.

I am going to take this into my tree, hopefully Ben won't object. The
main change is arch-agnostic and the changes in arch/powerpc are trivial.

Oleg.



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-22 Thread Oleg Nesterov
On 03/22, Ananth N Mavinakayanahalli wrote:

 On Thu, Mar 21, 2013 at 04:58:09PM +0100, Oleg Nesterov wrote:
 
  - verify_opcode()-is_swbp_insn() means:
 
  is this insn fine for uprobe? (we do not care about
  gdb, we simply ignore this problem)

 I will write up a patch for this case.. So, IIUC we do not care to send
 gdb a SIGTRAP if we have replaced a conditional trap from gdb with an
 unconditional uprobes one, right?

Yes.

And just in case, we do not send SIGTRAP if gdb used the same/unconditional
insn. We simply can't know if someone else wants to know that the task hits
this breakpoint.

Oleg.



Re: [PATCH 1/3] uprobes: add trap variant helper

2013-03-22 Thread Oleg Nesterov
On 03/22, Ananth N Mavinakayanahalli wrote:

 +/**
 + * is_trap_insn - check if instruction is breakpoint instruction.
 + * @insn: instruction to be checked.
 + * Default implementation of is_trap_insn
 + * Returns true if @insn is a breakpoint instruction.
 + *
 + * This function is needed for the case where an architecture has multiple
 + * trap instructions (like powerpc).
 + */
 +bool __weak is_trap_insn(uprobe_opcode_t *insn)
 +{
 + return is_swbp_insn(insn);
 +}

OK, thanks, the whole series looks fine, just one note...

My patch also changed prepare_uprobe() to use is_trap_insn(), and I think
this is right. Exactly because of 3/3 which removes is_trap() from
arch_uprobe_analyze_insn().

If the original insn is_trap(), we do not want to singlestep it and get
another trap after we hit handle_swbp().

Oleg.



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-21 Thread Oleg Nesterov
On 03/21, Ananth N Mavinakayanahalli wrote:

 On Wed, Mar 20, 2013 at 05:06:44PM +0100, Oleg Nesterov wrote:

But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to
verify. If not, we need 2 definitions. is_uprobe_insn() should still 
check
insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap().

 Its fine from gdb's perspective with my patch.

Yes, but this doesn't look right from uprobe's perspective.

 So, install_breakpoint()->prepare_uprobe()->is_swbp_insn() will return
   ENOTSUPP. In fact, arch_uprobe_analyze_insn() will also do the same.
 
  Yes and the check in arch_uprobe_analyze_insn() should go away.
 
  But you missed my point. Please forget about prepare_uprobe(), it is
  wrong anyway. And, prepare_uprobe() inspects the _original_ insn from
  the file, this has nothing install_breakpoint/etc.
 
  I meant verify_opcode() called by install_breakpoint/etc.

 For the case where X already exists, verify_opcode() currently returns 0.
 IMO, it should return -EEXIST,

Oh, this is debatable. Currently we assume that uprobe should win.
See below...

 unless you are proposing that uprobes
 should ride on the existing trap (even if its a variant).

Yes. And this is what the current implementation does.

 If you are proposing that uprobes ride on X if it already exists, that's
 not always possible and is a big can of worms... see below...

Sure. Whatever we do uprobe and gdb can confuse each other. Unless we
rework the vm code completely (not sure this is realistic).

  OK. So, if I understand correctly, gdb can use some conditional
  breakpoint, and it is possible that this insn won't generate the
  trap?

 Yes it is possible if the condition is not met. If the condition is
 met, the instruction will generate a trap, and uprobes will do a
 send_sig(SIGTRAP) from handle_swbp().

Unless there is uprobe at the same address. Once again, uprobe wins.

Your patch only fixes the case when the task hits a non-UPROBE_SWBP_INSN
breakpoint and there is no uprobe at the same address.

  Then this patch is not right, or at least we need another change
  on top?
 
  Once again. Suppose that gdb installs the TRAP_IF_R1_GT_R2.
 
  After that uprobe_register() is called, but it won't change this
  insn because verify_opcode() returns 0.
 
Then the probed task hits this breakpoint with r1 > r2 and we do
  not report this event.

 At this time, the condition for the trap is not satisfied, so no
 exception occurs.

Yes, and thus uprobe->handler() is not called, this was my point.

 If the expectation is that the trap always trigger,
 then all such trap variants need to be replaced with the unconditional
 trap

Yes. That is why I suggested the patch which doesn't affect verify_opcode().
uprobe_register() should replace the conditional trap with the unconditional
UPROBE_SWBP_INSN. uprobes should win.

 and we should either add logic to re-execute the condional trap
 after uprobe handling and send_sig() via handle_swbp() or emulate the
 condition in software and do a send_sig() if needed.

Unlikely this is possible. Or even desirable.

  So. I still think that we actually need something like below, and
  powerpc should reimplement is_trap_insn() to use is_trap(insn).
 
  No?

 I don't see how this will help,

Hmm. This should equally fix this particular problem? handle_swbp()
will send the trap if is_trap() == T?

Again, unless there is uprobe, but this was always true.

 especially since the gdb-uprobes is
 fraught with races.

They can race anyway, whatever we do.

Unless we rework write_opcode/etc completely.

 With your proposed patch, we refuse to insert a uprobe if the underlying
 instruction is a UPROBE_SWBP_INSTRUCTION;

If underlying means the original insn in vma->file, this is already
true. My patch doesn't change this logic.

Otherwise - no, we do not refuse to insert a uprobe if this insn was
already changed by gdb.

 changing is_swbp_at_addr()
 will need changes in handle_swbp() too.

I don't think so. Why?

 But, unlike x86, we cannot
 expect a uprobe with an underlying trap variant (X) to always trigger.

And that is why I think write_opcode() should rewrite the conditional
trap.

 IMHO, its not a good idea to do that for x86 either,

This change has no effect fo x86.

 IMHO, I really think we should not allow uprobe_register() to succeed if
 the underlying instruction is a breakpoint (or a variant thereof).

I disagree.

Just for example. Suppose we change install_breakpoint() so that it fails
if the underlying instruction is int3. (once again, underlying does not
mean the original insn from vma->vm_file).

First of all, this is very much nontrivial. I simply do not see how we
can do this. If nothing else, uprobe_register() can race with uprobe_mmap()
and install_breakpoint() can be called twice with the same vaddr. With
this change register or mmap can fail.

But suppose you can do this. Now you can write the trivial application
which mmaps glibc and inserts int3

Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-21 Thread Oleg Nesterov
On 03/21, Ananth N Mavinakayanahalli wrote:
 On Wed, Mar 20, 2013 at 05:07:28PM +0100, Oleg Nesterov wrote:
  On 03/20, Ananth N Mavinakayanahalli wrote:
  
   On Wed, Mar 20, 2013 at 01:43:01PM +0100, Oleg Nesterov wrote:
On 03/20, Oleg Nesterov wrote:

IOW, if I wasn't clear... Lets forget about gdb/etc for the moment.
Suppose we apply the patch below. Will uprobes on powerpc work?
   
If yes, then your patch should be fine. If not, we probably need more
changes.
  
   Yes, it will work fine.
 
  Even if this new insn is conditional?

 Yes.

But this can't be true. If it is conditional, it won't always generate a
trap, this means uprobe won't actually work.

Oleg.



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-20 Thread Oleg Nesterov
Hi Ananth,

First of all, let me remind you that I know nothing about powerpc ;)

But iirc we already discussed this a bit, I forgot the details but
still I have some concerns...

On 03/20, Ananth N Mavinakayanahalli wrote:

 GDB uses a variant of the trap instruction that is different from the
 one used by uprobes. Currently, running gdb on a program being traced
 by uprobes causes an endless loop since uprobes doesn't understand
 that the trap is inserted by some other entity and hence a SIGTRAP needs
 to be delivered.

Yes, and thus is_swbp_at_addr()->is_swbp_insn() called by handle_swbp()
should be updated,

 +bool is_swbp_insn(uprobe_opcode_t *insn)
 +{
 + return (is_trap(*insn));
 +}

And this patch should fix the problem. (and probably this is fine
for prepare_uprobe()).


But, at the same time, is the new definition fine for verify_opcode()?

IOW, powerpc has another is_trap() insn(s) used by gdb, let's denote it X.
X != UPROBE_SWBP_INSN.

Suppose that gdb installs the trap X at some addr, and then uprobe_register()
tries to install uprobe at the same address. Then set_swbp() will do nothing,
assuming the uprobe was already installed.

But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to
verify. If not, we need 2 definitions. is_uprobe_insn() should still check
insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap().

And I am just curious, could you explain how X and UPROBE_SWBP_INSN
differ?

Oleg.



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-20 Thread Oleg Nesterov
On 03/20, Oleg Nesterov wrote:

 But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to
 verify. If not, we need 2 definitions. is_uprobe_insn() should still check
 insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap().

 And I am just curious, could you explain how X and UPROBE_SWBP_INSN
 differ?

IOW, if I wasn't clear... Let's forget about gdb/etc for the moment.
Suppose we apply the patch below. Will uprobes on powerpc work?

If yes, then your patch should be fine. If not, we probably need more
changes.

And, forgot to mention. If you change is_swbp_insn(), you can remove
is_trap() from arch_uprobe_analyze_insn().

Oleg.

--- x/arch/powerpc/include/asm/uprobes.h
+++ x/arch/powerpc/include/asm/uprobes.h
@@ -31,7 +31,7 @@ typedef ppc_opcode_t uprobe_opcode_t;
 #define UPROBE_XOL_SLOT_BYTES  (MAX_UINSN_BYTES)
 
 /* The following alias is needed for reference from arch-agnostic code */
-#define UPROBE_SWBP_INSN   BREAKPOINT_INSTRUCTION
+#define UPROBE_SWBP_INSN   TRAP_INSN_USED_BY_GDB
 #define UPROBE_SWBP_INSN_SIZE  4 /* swbp insn size in bytes */
 
 struct arch_uprobe {



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-20 Thread Oleg Nesterov
On 03/20, Ananth N Mavinakayanahalli wrote:

 On Wed, Mar 20, 2013 at 01:26:39PM +0100, Oleg Nesterov wrote:
  But, at the same time, is the new definition fine for verify_opcode()?
 
  IOW, powerpc has another is_trap() insn(s) used by gdb, lets denote it X.
  X != UPROBE_SWBP_INSN.
 
  Suppose that gdb installs the trap X at some addr, and then 
  uprobe_register()
  tries to install uprobe at the same address. Then set_swbp() will do 
  nothing,
  assuming the uprobe was already installed.
 
  But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to
  verify. If not, we need 2 definitions. is_uprobe_insn() should still check
  insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap().

 is_trap() checks for all trap variants on powerpc, including the one
 uprobe uses. It returns true if the instruction is *any* trap variant.

I understand,

 So, install_breakpoint()->prepare_uprobe()->is_swbp_insn() will return
 ENOTSUPP. In fact, arch_uprobe_analyze_insn() will also do the same.

Yes and the check in arch_uprobe_analyze_insn() should go away.

But you missed my point. Please forget about prepare_uprobe(), it is
wrong anyway. And, prepare_uprobe() inspects the _original_ insn from
the file, this has nothing to do with install_breakpoint/etc.

I meant verify_opcode() called by install_breakpoint/etc.

 This itself should take care of all the cases.

  And I am just curious, could you explain how X and UPROBE_SWBP_INSN
  differ?

 Powerpc has numerous variants of the trap instruction based on
 comparison of two registers or a register and immediate value and a condition
 (less than, greater than, [signed forms thereof], or equal to).

 Uprobes uses 0x7fe00008 which is 'tw 31,0,0', which essentially is an
 unconditional trap.

 Gdb uses many traps, one of which is 0x7d821008 which is twge r2,r2,
 which is basically trap if r2 greater than or equal to r2.

OK. So, if I understand correctly, gdb can use some conditional
breakpoint, and it is possible that this insn won't generate the
trap?

Then this patch is not right, or at least we need another change
on top?

Once again. Suppose that gdb installs the TRAP_IF_R1_GT_R2.

After that uprobe_register() is called, but it won't change this
insn because verify_opcode() returns 0.

Then the probed task hits this breakpoint with r1 > r2 and we do
not report this event.


So. I still think that we actually need something like below, and
powerpc should reimplement is_trap_insn() to use is_trap(insn).

No?

Oleg.

--- x/kernel/events/uprobes.c
+++ x/kernel/events/uprobes.c
@@ -532,6 +532,12 @@ static int copy_insn(struct uprobe *upro
return __copy_insn(mapping, filp, uprobe->arch.insn, bytes, 
uprobe->offset);
 }
 
+bool __weak is_trap_insn(uprobe_opcode_t *insn)
+{
+   // powerpc: should use is_trap()
+   return is_swbp_insn(insn);
+}
+
 static int prepare_uprobe(struct uprobe *uprobe, struct file *file,
struct mm_struct *mm, unsigned long vaddr)
 {
@@ -550,7 +556,7 @@ static int prepare_uprobe(struct uprobe 
goto out;
 
ret = -ENOTSUPP;
-   if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn))
+   if (is_trap_insn((uprobe_opcode_t *)uprobe->arch.insn))
goto out;
 
ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, vaddr);
@@ -1452,7 +1458,7 @@ static int is_swbp_at_addr(struct mm_str
copy_opcode(page, vaddr, &opcode);
put_page(page);
  out:
-   return is_swbp_insn(opcode);
+   return is_trap_insn(opcode);
 }
 
 static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
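
For completeness, the powerpc-side override suggested here would presumably be
just this (a sketch; is_trap() is powerpc's existing helper that matches every
trap variant):

	/* arch/powerpc/kernel/uprobes.c, sketch only */
	bool is_trap_insn(uprobe_opcode_t *insn)
	{
		/* true for *any* powerpc trap variant, not only UPROBE_SWBP_INSN */
		return is_trap(*insn);
	}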



Re: [PATCH] powerpc/uprobes: teach uprobes to ignore gdb breakpoints

2013-03-20 Thread Oleg Nesterov
On 03/20, Ananth N Mavinakayanahalli wrote:

 On Wed, Mar 20, 2013 at 01:43:01PM +0100, Oleg Nesterov wrote:
  On 03/20, Oleg Nesterov wrote:
  
   But we did not install UPROBE_SWBP_INSN. Is it fine? I hope yes, just to
   verify. If not, we need 2 definitions. is_uprobe_insn() should still check
   insns == UPROBE_SWBP_INSN, and is_swbp_insn() should check is_trap().
  
   And I am just curious, could you explain how X and UPROBE_SWBP_INSN
   differ?
 
  IOW, if I wasn't clear... Lets forget about gdb/etc for the moment.
  Suppose we apply the patch below. Will uprobes on powerpc work?
 
  If yes, then your patch should be fine. If not, we probably need more
  changes.

 Yes, it will work fine.

Even if this new insn is conditional?

Oleg.



Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-05 Thread Oleg Nesterov
On 03/05, Lai Jiangshan wrote:

 On 03/03/13 01:06, Oleg Nesterov wrote:
  On 03/02, Michel Lespinasse wrote:
 
  My version would be slower if it needs to take the
  slow path in a reentrant way, but I'm not sure it matters either :)
 
  I'd say, this doesn't matter at all, simply because this can only happen
  if we race with the active writer.

 It can also happen when interrupted. (still very rarely)

 arch_spin_trylock()
   ---interrupted,
   __this_cpu_read() returns 0.
   arch_spin_trylock() fails
   slowpath, any nested will be slowpath too.
   ...
   ..._read_unlock()
   ---interrupt
 __this_cpu_inc()
 

Yes sure. Or it can take the local lock after we already take the global
fallback_lock.

But the same can happen with FALLBACK_BASE, just because we need to take
a lock (local or global) first, then increment the counter.

 (I worry too much. I tend to remove FALLBACK_BASE now; we should
 add it only after we have proved we need it, this part is not proved)

Agreed, great ;)

Oleg.



Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Oleg Nesterov
On 03/05, Lai Jiangshan wrote:

 On 03/03/13 01:20, Oleg Nesterov wrote:
  On 03/02, Lai Jiangshan wrote:
 
  +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
  +{
   +  switch (__this_cpu_read(*lgrw->reader_refcnt)) {
   +  case 1:
   +  __this_cpu_write(*lgrw->reader_refcnt, 0);
   +  lg_local_unlock(&lgrw->lglock);
   +  return;
   +  case FALLBACK_BASE:
   +  __this_cpu_write(*lgrw->reader_refcnt, 0);
   +  read_unlock(&lgrw->fallback_rwlock);
   +  rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
 
  I guess case 1: should do rwlock_release() too.

 Already done in lg_local_unlock(&lgrw->lglock); before it returns.
 (I like reusing old code)

Yes, I was wrong thanks. Another case when I didn't notice that you
re-use the regular lg_ code...

  We need rwlock_acquire_read() even in the fast-path, and this acquire_read
  should be paired with rwlock_acquire() in _write_lock(), but it does
  spin_acquire(&lg->lock_dep_map). Yes, currently this is the same (afaics)
  but perhaps fallback_rwlock->dep_map would be more clean.

 I can't tell which one is better. I will try to use fallback_rwlock->dep_map later.

I am not sure which one should be better too, please check.

Again, I forgot that _write_lock/unlock use lg_global_*() code.

Oleg.



Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-03 Thread Oleg Nesterov
On 03/02, Oleg Nesterov wrote:

 On 03/02, Lai Jiangshan wrote:
 
  +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
  +{
  +   switch (__this_cpu_read(*lgrw->reader_refcnt)) {
  +   case 1:
  +   __this_cpu_write(*lgrw->reader_refcnt, 0);
  +   lg_local_unlock(&lgrw->lglock);
  +   return;
  +   case FALLBACK_BASE:
  +   __this_cpu_write(*lgrw->reader_refcnt, 0);
  +   read_unlock(&lgrw->fallback_rwlock);
  +   rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);

 I guess case 1: should do rwlock_release() too.

 Otherwise, at first glance looks correct...

 However, I still think that FALLBACK_BASE only adds the unnecessary
 complications. But even if I am right this is subjective of course, please
 feel free to ignore.

Yes, but...

 And btw, I am not sure about lg->lock_dep_map, perhaps we should use
 fallback_rwlock->dep_map ?

 We need rwlock_acquire_read() even in the fast-path, and this acquire_read
 should be paired with rwlock_acquire() in _write_lock(), but it does
 spin_acquire(&lg->lock_dep_map). Yes, currently this is the same (afaics)
 but perhaps fallback_rwlock->dep_map would be more clean.

Please ignore this part.

I missed that lg_rwlock_global_write_lock() relies on lg_global_lock(),
and I just noticed that it does rwlock_acquire(&lg->lock_dep_map).

Hmm. But then I do not understand the lglock annotations. Obviously,
rwlock_acquire_read() in lg_local_lock() can't even detect the simplest
deadlock, say, lg_local_lock(LOCK) + lg_local_lock(LOCK). Not to mention
spin_lock(X) + lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).

OK, I understand that it is not easy to make these annotations correct...

Oleg.



Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-02 Thread Oleg Nesterov
On 03/02, Lai Jiangshan wrote:

 On 02/03/13 02:28, Oleg Nesterov wrote:
  Lai, I didn't read this discussion except the code posted by Michel.
  I'll try to read this patch carefully later, but I'd like to ask
  a couple of questions.
 
  This version looks more complex than Michel's, why? Just curious, I
  am trying to understand what I missed. See
  http://marc.info/?l=linux-kernel&m=136196350213593

 Michel changed my old draft version a little, his version is good enough for 
 me.

Yes, I see. But imho Michel suggested a valuable cleanup; the code
becomes even simpler with the same performance.

Your v2 looks almost correct to me, but I still think it makes sense
to incorporate the simplification from Michel.

 My new version tries to add a little better nestable support with only
 adding single __this_cpu_op() in _read_[un]lock().

How? Afaics with or without FALLBACK_BASE you need _read + _inc/dec in
_read_lock/unlock.

Oleg.



Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-02 Thread Oleg Nesterov
On 03/02, Michel Lespinasse wrote:

 My version would be slower if it needs to take the
 slow path in a reentrant way, but I'm not sure it matters either :)

I'd say, this doesn't matter at all, simply because this can only happen
if we race with the active writer.

Oleg.



Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-02 Thread Oleg Nesterov
On 03/02, Lai Jiangshan wrote:

 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
  + switch (__this_cpu_read(*lgrw->reader_refcnt)) {
  + case 1:
  + __this_cpu_write(*lgrw->reader_refcnt, 0);
  + lg_local_unlock(&lgrw->lglock);
  + return;
  + case FALLBACK_BASE:
  + __this_cpu_write(*lgrw->reader_refcnt, 0);
  + read_unlock(&lgrw->fallback_rwlock);
  + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);

I guess case 1: should do rwlock_release() too.

Otherwise, at first glance looks correct...

However, I still think that FALLBACK_BASE only adds the unnecessary
complications. But even if I am right this is subjective of course, please
feel free to ignore.

And btw, I am not sure about lg->lock_dep_map, perhaps we should use
fallback_rwlock->dep_map ?

We need rwlock_acquire_read() even in the fast-path, and this acquire_read
should be paired with rwlock_acquire() in _write_lock(), but it does
spin_acquire(&lg->lock_dep_map). Yes, currently this is the same (afaics)
but perhaps fallback_rwlock->dep_map would be more clean.

Oleg.



Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-01 Thread Oleg Nesterov
Lai, I didn't read this discussion except the code posted by Michel.
I'll try to read this patch carefully later, but I'd like to ask
a couple of questions.

This version looks more complex than Michel's, why? Just curious, I
am trying to understand what I missed. See
http://marc.info/?l=linux-kernel&m=136196350213593

And I can't understand FALLBACK_BASE...

OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.

CPU_1 does _read_lock(), and ...

 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 + struct lglock *lg = &lgrw->lglock;
 +
 + preempt_disable();
 + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 + if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
 + if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {

_trylock() fails,

 + read_lock(&lgrw->fallback_rwlock);
 + __this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);

so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.

CPU_0 does lg_global_unlock(&lgrw->lglock) and finishes _write_unlock().

Interrupt handler on CPU_1 does _read_lock(), notices ->reader_refcnt != 0
and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.

Then irq does _read_unlock(), and

 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 + switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
 + case 0:
 + lg_local_unlock(&lgrw->lglock);
 + return;
 + case FALLBACK_BASE:
 + __this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
 + read_unlock(&lgrw->fallback_rwlock);

hits this case?

Doesn't look right, but most probably I missed something.

Oleg.
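
Spelled out as a timeline, the interleaving described above is:

	CPU_1 _read_lock():   trylock fails -> read_lock(&fallback_rwlock),
	                      refcnt = FALLBACK_BASE
	irq   _read_lock():   refcnt != 0 -> __this_cpu_inc(),
	                      refcnt = FALLBACK_BASE + 1
	irq   _read_unlock(): __this_cpu_dec_return() == FALLBACK_BASE, so it
	                      takes the FALLBACK_BASE case and drops
	                      ->fallback_rwlock out from under the interrupted
	                      reader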



Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-28 Thread Oleg Nesterov
On 02/28, Michel Lespinasse wrote:

 On Thu, Feb 28, 2013 at 3:25 AM, Oleg Nesterov o...@redhat.com wrote:
  On 02/27, Michel Lespinasse wrote:
 
  +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
  +{
  +   preempt_disable();
  +
   +   if (__this_cpu_read(*lgrw->local_refcnt) ||
   +   arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
   +   __this_cpu_inc(*lgrw->local_refcnt);
 
  Please look at __this_cpu_generic_to_op(). You need this_cpu_inc()
   to avoid the race with irqs. The same for _read_unlock.

 Hmmm, I was thinking that this was safe because while interrupts might
 modify local_refcnt to acquire a nested read lock, they are expected
 to release that lock as well which would set local_refcnt back to its
 original value ???

Yes, yes, this is correct.

I meant that (in general, x86 is fine) __this_cpu_inc() itself is not
irq-safe. It simply does pcp += 1.

this_cpu_inc() is fine, _this_cpu_generic_to_op() does cli/sti around.
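
Roughly, simplified from include/linux/percpu.h (so take the details with
a grain of salt):

	/* __this_cpu_inc() is a plain, non-atomic read-modify-write: */
	#define __this_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		*__this_cpu_ptr(&(pcp)) op val;				\
	} while (0)

	/* this_cpu_inc() does the same RMW with irqs disabled: */
	#define _this_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		raw_local_irq_save(flags);				\
		*__this_cpu_ptr(&(pcp)) op val;				\
		raw_local_irq_restore(flags);				\
	} while (0)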

I know this only because I did the same mistake recently, and Srivatsa
explained the problem to me ;)

Oleg.



Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-28 Thread Oleg Nesterov
On 02/28, Oleg Nesterov wrote:
 On 02/28, Michel Lespinasse wrote:
 
  On Thu, Feb 28, 2013 at 3:25 AM, Oleg Nesterov o...@redhat.com wrote:
   On 02/27, Michel Lespinasse wrote:
  
   +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
   +{
   +   preempt_disable();
   +
    +   if (__this_cpu_read(*lgrw->local_refcnt) ||
    +   arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
    +   __this_cpu_inc(*lgrw->local_refcnt);
  
   Please look at __this_cpu_generic_to_op(). You need this_cpu_inc()
    to avoid the race with irqs. The same for _read_unlock.
 
  Hmmm, I was thinking that this was safe because while interrupts might
  modify local_refcnt to acquire a nested read lock, they are expected
  to release that lock as well which would set local_refcnt back to its
  original value ???

 Yes, yes, this is correct.

 I meant that (in general, x86 is fine) __this_cpu_inc() itself is not
 irq-safe. It simply does pcp += 1.

 this_cpu_inc() is fine, _this_cpu_generic_to_op() does cli/sti around.

Just in case, it is not that I really understand why __this_cpu_inc() can
race with irq in this particular case (given that irq handler should
restore the counter).

So perhaps I am wrong again. The comments in include/linux/percpu.h look
confusing to me, and I simply know nothing about !x86 architectures. But
since, say, preempt_disable() doesn't do anything special then probably
__this_cpu_inc() is fine too.

In short: please ignore me ;)

Oleg.



Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-27 Thread Oleg Nesterov
On 02/27, Michel Lespinasse wrote:

 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 +   preempt_disable();
 +
  +   if (__this_cpu_read(*lgrw->local_refcnt) ||
  +   arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
  +   __this_cpu_inc(*lgrw->local_refcnt);

Please look at __this_cpu_generic_to_op(). You need this_cpu_inc()
to avoid the race with irqs. The same for _read_unlock.

But otherwise I agree, looks like a clever and working idea to me.
And simple!

 There is an interesting case where lg_rwlock_local_read_lock could be
 interrupted after getting the local lglock but before incrementing
 local_refcnt to 1; if that happens any nested readers within that
 interrupt will have to take the global rwlock read side. I think this
 is perfectly acceptable

Agreed.

Or an interrupt can do spin_trylock(percpu->lock) after we take the global
->fallback_rwlock (if we race with write_lock + write_unlock), but I do
not see any possible deadlock in this case.

Oleg.



Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-10 Thread Oleg Nesterov
On 02/08, Paul E. McKenney wrote:

 On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
 
   void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
   {
  -   read_unlock(&pcpu_rwlock->global_rwlock);

 We need an smp_mb() here to keep the critical section ordered before the
 this_cpu_dec() below.  Otherwise, if a writer shows up just after we
 exit the fastpath, that writer is not guaranteed to see the effects of
 our critical section.  Equivalently, the prior read-side critical section
 just might see some of the writer's updates, which could be a bit of
 a surprise to the reader.

Agreed, we should not assume that a reader doesn't write. And we should
ensure that this read section actually completes before this_cpu_dec().

  +   /*
  +* We never allow heterogeneous nesting of readers. So it is trivial
  +* to find out the kind of reader we are, and undo the operation
  +* done by our corresponding percpu_read_lock().
  +*/
  +   if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
  +   this_cpu_dec(*pcpu_rwlock->reader_refcnt);
  +   smp_wmb(); /* Paired with smp_rmb() in sync_reader() */

 Given an smp_mb() above, I don't understand the need for this smp_wmb().
 Isn't the idea that if the writer sees ->reader_refcnt decremented to
 zero, it also needs to see the effects of the corresponding reader's
 critical section?

I am equally confused ;)

OTOH, we can probably avoid any barrier if reader_nested_percpu() == T.
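
IOW, something like this is what I'd expect (surely untested, and I am
guessing the lock side does preempt_disable()):

	void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
	{
		if (reader_nested_percpu(pcpu_rwlock)) {
			/* the outer reader keeps us safe, no barrier */
			this_cpu_dec(*pcpu_rwlock->reader_refcnt);
		} else if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
			/* order the critical section before the final dec */
			smp_mb();
			this_cpu_dec(*pcpu_rwlock->reader_refcnt);
		} else {
			read_unlock(&pcpu_rwlock->global_rwlock);
		}
		preempt_enable();
	}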


  +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
  +{
  +   unsigned int cpu;
  +
  +   drop_writer_signal(pcpu_rwlock, smp_processor_id());

 Why do we drop ourselves twice?  More to the point, why is it important to
 drop ourselves first?

And don't we need mb() _before_ we clear ->writer_signal ?

  +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
  +  unsigned int cpu)
  +{
  +   smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */

 As I understand it, the purpose of this memory barrier is to ensure
 that the stores in drop_writer_signal() happen before the reads from
 ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
 race between a new reader attempting to use the fastpath and this writer
 acquiring the lock.  Unless I am confused, this must be smp_mb() rather
 than smp_rmb().

And note that before sync_reader() we call announce_writer_active() which
already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
unneeded.

But, at the same time, could you confirm that we do not need another mb()
after sync_all_readers() in percpu_write_lock() ? I mean, without mb(),
can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
protected by ->global_rwlock? Then this LOAD can be re-ordered with other
memory operations done by the writer.



Srivatsa, I think that the code would be more understandable if you kill
the helpers like sync_reader/raise_writer_signal. Perhaps even all write
helpers, I am not sure. At least, it seems to me that all barriers should
be moved to percpu_write_lock/unlock. But I won't insist of course, up to
you.

And cosmetic nit... How about

struct xxx {
	unsigned long	reader_refcnt;
	bool		writer_signal;
};

struct percpu_rwlock {
	struct xxx __percpu	*xxx;
	rwlock_t		global_rwlock;
};

?

This saves one alloc_percpu() and ensures that reader_refcnt/writer_signal
are always in the same cache-line.

Oleg.



Re: [PATCH v5 05/45] percpu_rwlock: Make percpu-rwlocks IRQ-safe, optimally

2013-02-10 Thread Oleg Nesterov
only one cosmetic nit...

On 01/22, Srivatsa S. Bhat wrote:

 +#define READER_PRESENT	(1UL << 16)
 +#define READER_REFCNT_MASK   (READER_PRESENT - 1)
 +
  #define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)  \
  (ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))

  #define reader_nested_percpu(pcpu_rwlock)\
 - (__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
 + (__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) & READER_REFCNT_MASK)

  #define writer_active(pcpu_rwlock)   \
  (__this_cpu_read(*((pcpu_rwlock)->writer_signal)))

I think this all can go to lib/percpu-rwlock.c. Nobody needs to know
these implementation details.

Oleg.



Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-10 Thread Oleg Nesterov
On 02/11, Srivatsa S. Bhat wrote:

 On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
  +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
  +{
  +   unsigned int cpu;
  +
  +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
 
  Why do we drop ourselves twice?  More to the point, why is it important to
  drop ourselves first?
 
  And don't we need mb() _before_ we clear ->writer_signal ?
 

 Oh, right! Or, how about moving announce_writer_inactive() to _after_
 write_unlock()?

Not sure this will help... but, either way it seems we have another
problem...

percpu_rwlock tries to be generic. This means we should ignore its
usage in hotplug, and _write_lock should not race with _write_unlock.

IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
that this can't race with another write which wants to set this flag.
Perhaps it should be counter as well, and it should be protected by
the same ->global_rwlock, but _write_lock() should drop it before
sync_all_readers() and then take it again?

  +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
  +unsigned int cpu)
  +{
  + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
 
  As I understand it, the purpose of this memory barrier is to ensure
  that the stores in drop_writer_signal() happen before the reads from
  ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
  race between a new reader attempting to use the fastpath and this writer
  acquiring the lock.  Unless I am confused, this must be smp_mb() rather
  than smp_rmb().
 
  And note that before sync_reader() we call announce_writer_active() which
  already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
  unneeded.
 

 My intention was to help the writer see the ->reader_refcnt drop to zero
 ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.

Hmm, interesting... Not sure, but can't really comment. However I can
answer your next question:

 Please correct me if my understanding of memory barriers is wrong here..

Who? Me??? No, we have paulmck for that ;)

Oleg.



Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-10 Thread Oleg Nesterov
On 02/11, Srivatsa S. Bhat wrote:

 On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
  +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
  +{
  +  unsigned int cpu;
  +
  +  drop_writer_signal(pcpu_rwlock, smp_processor_id());
 
  Why do we drop ourselves twice?  More to the point, why is it important to
  drop ourselves first?
 

 I don't see where we are dropping ourselves twice. Note that we are no longer
 in the cpu_online_mask, so the 'for' loop below won't include us. So we need
 to manually drop ourselves. It doesn't matter whether we drop ourselves first
 or later.

Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
_write_lock.

Perhaps _write_lock/unlock should use for_each_possible_cpu() instead?

Hmm... I think this makes sense anyway. Otherwise, in theory,
percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?
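
IOW (untested):

	static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
	{
		unsigned int cpu;

		/* covers the dying cpu as well, no special-casing needed */
		for_each_possible_cpu(cpu)
			drop_writer_signal(pcpu_rwlock, cpu);
	}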

Oleg.



Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend

2013-01-24 Thread Oleg Nesterov
On 01/23, Michel Lespinasse wrote:

 On Tue, Jan 22, 2013 at 11:32 AM, Steven Rostedt rost...@goodmis.org wrote:
 
  I thought global locks are now fair. That is, a reader will block if a
  writer is waiting. Hence, the above should deadlock on the current
  rwlock_t types.

 I believe you are mistaken here. struct rw_semaphore is fair (and
 blocking), but rwlock_t is unfair. The reason we can't easily make
 rwlock_t fair is because tasklist_lock currently depends on the
 rwlock_t unfairness - tasklist_lock readers typically don't disable
 local interrupts, and tasklist_lock may be acquired again from within
 an interrupt, which would deadlock if rwlock_t was fair and a writer
 was queued by the time the interrupt is processed.

Yes.

And, iirc, it was even documented somewhere that while rwlock_t is not
really nice, it is good to share the locking with interrupts. You do
not need to disable irqs.
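
Just to spell out the pattern which relies on that unfairness, a made-up
handler (the process-context side takes read_lock(&tasklist_lock) with
irqs enabled):

	static irqreturn_t example_irq_handler(int irq, void *dev_id)
	{
		/*
		 * Safe only because rwlock_t readers are not queued behind
		 * a waiting writer; with a fair lock this read_lock() could
		 * deadlock against the interrupted reader on this CPU.
		 */
		read_lock(&tasklist_lock);
		/* ... walk the tasks ... */
		read_unlock(&tasklist_lock);

		return IRQ_HANDLED;
	}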

Oleg.



Re: [PATCH v4 2/2] powerpc: Uprobes port to powerpc

2012-08-23 Thread Oleg Nesterov
On 08/23, Benjamin Herrenschmidt wrote:

 On Thu, 2012-08-23 at 11:02 +0530, Srikar Dronamraju wrote:
  
 
  insn is updated/accessed in the arch independent code. Size of
  uprobe_opcode_t could be different for different archs.
  uprobe_opcode_t
  represents the size of the smallest breakpoint instruction for an
  arch.
 
  Hence u8 works out the best. I know we could still use uprobe_opcode_t
  and achieve the same. In which case, we would have to interpret
  MAX_UINSN_BYTES differently. Do you see any advantages of using
  uprobe_opcode_t instead of u8 across archs?

 But don't you actively rely on the fact that on powerpc, unlike x86, you
 -can- atomically replace an instruction with a single 32-bit store ?

I must have missed something...

But powerpc does not replace an instruction, the arch independent code
does this and it assumes that uprobe->arch.insn is u8[MAX_UINSN_BYTES].

Perhaps you meant that on powerpc it is safe to replace the insn
even if this can race with some CPU executing this code? But uprobes
has to replace the original page anyway, we should not write to
->vm_file.

I agree that memcpy() in arch_uprobe_analyze_insn() and
arch_uprobe_skip_sstep() looks a bit strange. May be powerpc can do

struct arch_uprobe {
union {
u8  insn[MAX_UINSN_BYTES];
u32 ainsn;
};
};

and use auprobe->ainsn directly, I dunno.
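
Then, say, the trap check becomes (untested):

	int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe,
				     struct mm_struct *mm, unsigned long addr)
	{
		if (addr & 0x03)
			return -EINVAL;

		if (is_trap(auprobe->ainsn))	/* no memcpy needed */
			return -ENOTSUPP;

		return 0;
	}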

But probably I misunderstood you.

Oleg.



Re: [PATCH v4 2/2] powerpc: Uprobes port to powerpc

2012-08-22 Thread Oleg Nesterov
On 08/22, Ananth N Mavinakayanahalli wrote:

 +int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long addr)
 +{
 + unsigned int insn;
 +
 + if (addr & 0x03)
 + return -EINVAL;
 +
 + memcpy(&insn, auprobe->insn, MAX_UINSN_BYTES);
 + if (is_trap(insn))
 + return -ENOTSUPP;

This addresses the only concern I had, thanks.

Once again, I am in no position to review the powerpc code, but
since I made some noise before: I think the patch is fine.

Oleg.



Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-21 Thread Oleg Nesterov
On 08/21, Ananth N Mavinakayanahalli wrote:

 On Fri, Aug 17, 2012 at 05:00:31PM +0200, Oleg Nesterov wrote:

   We should also take
   care of the in-memory copy, in case gdb had inserted a breakpoint at the
   same location, right?
 
  gdb (or even the application itself) and uprobes can obviously confuse
  each other, in many ways, and we can do nothing at least currently.
  Just we should ensure that the kernel can't crash/hang/etc.

 Absolutely. The proper fix for this at least from a breakpoint insertion
 perspective is to educate gdb (possibly ptrace itself) to fail on a
 breakpoint insertion request on an already existing one.

Oh, I don't think this is possible. And there are other problems like
this. Uprobe can confuse gdb too, in many ways. For example,
uprobe_register() can wrongly _remove_ int3 installed by gdb.

The proper fix, I think, is to rework the whole idea about uprobe bps,
but this is really in the long term. install_breakpoint() should
only unmap the page and mark its pte as owned by kernel, FOLL_WRITE
should not work. Something like migration or PROT_NONE. The task
itself should install bp during the page fault. And we need the
backing store for the pages with uprobes. Yes, this all is very
vague and I can be wrong.

Anyway, this is relatively minor, we have more serious problems.

   Updating is_swbp_insn() per-arch where needed will
   take care of both the cases, 'cos it gets called before
   arch_analyze_uprobe_insn() too.
 
  For example. set_swbp()->is_swbp_insn() == T means that (for example)
  uprobe_register() and uprobe_mmap() raced with each other and there is
  no need for set_swbp().

 This is true for Intel like architectures that have *one* swbp
 instruction. On Powerpc, gdb for instance, can insert a trap variant at
 the address. Therefore, is_swbp_insn() by definition should return true
 for all trap variants.

Not in this case, I think.

OK, I was going to do this later, but this discussion makes me think
I should try to send the patch sooner.

set_swbp()->is_swbp_at_addr() is simply unneeded and in fact should
be considered as unnecessary pessimization.

set_orig_insn()->is_swbp_at_addr() makes more sense, but it can't fix
all races with userpace. Still it should die.

 OK. I will separate out the is_swbp_insn() change into a separate patch.

Great, thanks. And if we remove is_swbp_insn() from set_swbp() and
set_orig_insn() then the semantics of is_swbp_insn() will be much more
clear, and in this case powerpc probably really needs to change it.

Oleg.



Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-17 Thread Oleg Nesterov
On 08/17, Ananth N Mavinakayanahalli wrote:

 On Thu, Aug 16, 2012 at 05:21:12PM +0200, Oleg Nesterov wrote:

  Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
  code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
  this logic needs more fixes but this is offtopic).

 I think it does...

  If powerpc has another insn(s) which can trigger powerpc's do_int3()
  counterpart, they should be rejected by arch_uprobe_analyze_insn().
  I think.

 The insn that gets passed to arch_uprobe_analyze_insn() is copy_insn()'s
 version, which is the file copy of the instruction.

Yes, exactly. And we are going to single-step this saved uprobe->arch.insn,
even if gdb/whatever replaces the original insn later or already replaced.

So arch_uprobe_analyze_insn() should reject the unsafe instructions which
we can't step over safely.

 We should also take
 care of the in-memory copy, in case gdb had inserted a breakpoint at the
 same location, right?

gdb (or even the application itself) and uprobes can obviously confuse
each other, in many ways, and we can do nothing at least currently.
Just we should ensure that the kernel can't crash/hang/etc.

 Updating is_swbp_insn() per-arch where needed will
 take care of both the cases, 'cos it gets called before
 arch_analyze_uprobe_insn() too.

For example. set_swbp()->is_swbp_insn() == T means that (for example)
uprobe_register() and uprobe_mmap() raced with each other and there is
no need for set_swbp().

However, find_active_uprobe()->is_swbp_at_addr()->is_swbp_insn() is
different, true confirms that this insn has triggered do_int3() and
thus we need send_sig(SIGTRAP) (just in case, this is not strictly
correct too but offtopic again).

We definitely need more changes/fixes/improvements in this area. And
perhaps powerpc requires more changes in the arch-neutral code, I dunno.
In particular, I think is_swbp_insn() should have a single caller,
is_swbp_at_addr(), and this caller should always play with current->mm.
And many, many other changes in the long term.

So far I think that, if powerpc really needs to change is_swbp_insn(),
it would be better to make another patch and discuss this change.
But of course I can't judge.

Oleg.



Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-16 Thread Oleg Nesterov
On 08/16, Ananth N Mavinakayanahalli wrote:

 On Thu, Aug 16, 2012 at 07:41:53AM +1000, Benjamin Herrenschmidt wrote:
  On Wed, 2012-08-15 at 18:59 +0200, Oleg Nesterov wrote:
   On 07/26, Ananth N Mavinakayanahalli wrote:
   
From: Ananth N Mavinakayanahalli ana...@in.ibm.com
   
This is the port of uprobes to powerpc. Usage is similar to x86.
  
   I am just curious why this series was ignored by powerpc maintainers...
 
  Because it arrived too late for the previous merge window considering my
  limited bandwidth for reviewing things and that nobody else seems to
  have reviewed it :-)
 
  It's still on track for the next one, and I'm hoping to dedicate most of
  next week going through patches  doing a powerpc -next.

 Thanks Ben!

Great!

   Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe
   UPROBE_SWBP_INSN (at least) ?
  
   (I assume that emulate_step() can't handle this case but of course I
do not understand arch/powerpc/lib/sstep.c)
  
   Note that uprobe_pre_sstep_notifier() sets utask->state = UTASK_BP_HIT
   without any checks. This doesn't look right if it was UTASK_SSTEP...
  
   But again, I do not know what powerpc will actually do if we try to
   single-step over UPROBE_SWBP_INSN.
 
  Ananth ?

 set_swbp() will return -EEXIST to install_breakpoint if we are trying to
 put a breakpoint on UPROBE_SWBP_INSN.

not really, this -EEXIST (already removed by recent changes) means that
bp was already installed.

But this doesn't matter,

 So, the arch agnostic code itself
 takes care of this case...

Yes. I forgot about install_breakpoint()->is_swbp_insn() check which
returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does
this.

 or am I missing something?

No, it is me.

 However, I see that we need a powerpc specific is_swbp_insn()
 implementation since we will have to take care of all the trap variants.

Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
this logic needs more fixes but this is offtopic).

If powerpc has another insn(s) which can trigger powerpc's do_int3()
counterpart, they should be rejected by arch_uprobe_analyze_insn().
I think.

 I will need to update the patches based on changes being made by Oleg
 and Sebastien for the single-step issues.

Perhaps you can do this in a separate change?

We need some (simple) changes in the arch agnostic code first, they
should not break poweppc. These changes are still under discussion.
Once we have __weak arch_uprobe_step* you can reimplement these
hooks and fix the problems with single-stepping.

Oleg.



Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc

2012-08-15 Thread Oleg Nesterov
On 07/26, Ananth N Mavinakayanahalli wrote:

 From: Ananth N Mavinakayanahalli ana...@in.ibm.com

 This is the port of uprobes to powerpc. Usage is similar to x86.

I am just curious why this series was ignored by powerpc maintainers...

Of course I can not review this code, I know nothing about powerpc,
but the patches look simple/straightforward.

Paul, Benjamin?





Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe
UPROBE_SWBP_INSN (at least) ?

(I assume that emulate_step() can't handle this case but of course I
 do not understand arch/powerpc/lib/sstep.c)

Note that uprobe_pre_sstep_notifier() sets utask->state = UTASK_BP_HIT
without any checks. This doesn't look right if it was UTASK_SSTEP...

But again, I do not know what powerpc will actually do if we try to
single-step over UPROBE_SWBP_INSN.

Oleg.



Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()

2012-06-14 Thread Oleg Nesterov
On 06/14, Srikar Dronamraju wrote:

 * Oleg Nesterov o...@redhat.com [2012-06-13 21:15:19]:

  For example. Suppose there is some instruction in /lib64/libc.so which
  is valid for 64-bit, but not for 32-bit.
 
  Suppose that a 32-bit application does mmap(/lib64/libc.so, PROT_EXEC).
 

 How correct is it to have a 32 bit binary link to a 64 bit binary/library?

No, I didn't mean this. I guess you misunderstood my point, see below.

  Now. If vma_prio_tree_foreach() finds this 32-bit mm first, 
  uprobe_register()
  fails even if there are other 64-bit applications which could be traced.
 
  Or. uprobe_register() succeeds because it finds a 64-bit mm first, and
  then that 32-bit application actually executes the invalid insn.
 
  We can move arch_uprobe_analyze_insn() outside of !UPROBE_COPY_INSN block.
 
  Or, perhaps, validate_insn_bits() should call both
  validate_insn_32bits() and validate_insn_64bits(), and set the
  UPROBE_VALID_IF_32 / UPROBE_VALID_IF_64 flags. install_breakpoint()
  should do the additinal check before set_swbp() and verify that
  .ia32_compat matches UPROBE_VALID_IF_*.
 

  What do you think?
 

 Lets say we do find a 32 bit app and 64 bit app using the same library
 and the underlying instruction is valid for tracing in 64 bit and not 32
 bit. So when we are registering, and failed to insert a breakpoint  for
 the 32 bit app, should we just bail out or should we return a failure?

I do not really know, I tend to think we should not fail. But this is
another story...

Look. Suppose that a 32-bit app starts after uprobe_register() succeeds.
In this case we have no option, uprobe_mmap()->install_breakpoint()
should silently fail. Currently it doesn't, this is one of the reasons
why I think the validation logic is wrong.

And. if install_breakpoint() can fail later anyway (in this case), then
I think uprobe_register() should not fail.

But probably this needs more discussion.


 I would probably prefer to read the underlying file something similar to
 what exec does and based on the magic decipher if we should verify for
 32 bit instructions or 64 bit instructions.

But this can't protect from the malicious user who does
mmap(64-bit-code, PROT_EXEC) from a 32-bit app, and this can confuse
uprobes even if that 32-bit app never tries to actually execute that
64-bit-code.

That is why I think we need the additional (and arch-dependent) check
before every set_swbp(), but arch_uprobe_analyze_insn/etc should not
depend on task/mm/vaddr/whatever.
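
Something like this is what I mean, an x86-only and untested sketch
(UPROBE_VALID_IF_* are the proposed flags; the real check has to be
arch-dependent):

	/* called before every set_swbp() */
	static bool uprobe_valid_for_mm(struct uprobe *uprobe, struct mm_struct *mm)
	{
		if (mm->context.ia32_compat)
			return uprobe->flags & UPROBE_VALID_IF_32;

		return uprobe->flags & UPROBE_VALID_IF_64;
	}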

Oleg.



Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()

2012-06-13 Thread Oleg Nesterov
On 06/12, Oleg Nesterov wrote:

 On 06/12, Srikar Dronamraju wrote:
  
   Note also that we should move this !UPROBE_COPY_INSN from
   install_breakpoint() to somewhere near alloc_uprobe(). This code
   is called only once, it looks a bit strange to use the random mm
   (the first mm vma_prio_tree_foreach() finds) and its mapping to
   verify the insn. In fact this is simply not correct and should be
   fixed, note that on x86 arch_uprobe_analyze_insn() checks
 
  The reason we delay the copy_insn to the first insert is because
  we have to get access to mm. For archs like x86, we want to know if the
  executable is 32 bit or not

 Yes. And this is wrong afaics.

 Once again. This !UPROBE_COPY_INSN code is called only once, and it
 uses the random mm. After that install_breakpoint() just calls
 set_swbp(another_mm) while the insn can be invalid because
  another_mm->ia32_compat != mm->ia32_compat.

  So in effect, if we get access to
  struct file corresponding to the inode and if the inode corresponds to
  32 bit executable file or 64 bit executable file during register, then
  we can move it around alloc_uprobe().

 I don't think this can work. I have another simple fix in mind, I'll
 write another email later.

For example. Suppose there is some instruction in /lib64/libc.so which
is valid for 64-bit, but not for 32-bit.

Suppose that a 32-bit application does mmap(/lib64/libc.so, PROT_EXEC).

Now. If vma_prio_tree_foreach() finds this 32-bit mm first, uprobe_register()
fails even if there are other 64-bit applications which could be traced.

Or. uprobe_register() succeeds because it finds a 64-bit mm first, and
then that 32-bit application actually executes the invalid insn.

We can move arch_uprobe_analyze_insn() outside of !UPROBE_COPY_INSN block.

Or, perhaps, validate_insn_bits() should call both
validate_insn_32bits() and validate_insn_64bits(), and set the
UPROBE_VALID_IF_32 / UPROBE_VALID_IF_64 flags. install_breakpoint()
should do the additinal check before set_swbp() and verify that
.ia32_compat matches UPROBE_VALID_IF_*.

What do you think?

Oleg.



Re: Q: a_ops->readpage() && struct file

2012-06-13 Thread Oleg Nesterov
On 06/13, Peter Zijlstra wrote:

 On Mon, 2012-06-11 at 21:09 +0200, Oleg Nesterov wrote:
  Stupid question. I'm afraid the answer is no but I'll ask anyway.
  Is it safe to pass filp == NULL to mapping->readpage()? In fact
  I do not understand why it needs struct file* and I do not see
  any example of actual usage.

 Looking at afs_readpage it looks like its OK to pass in NULL. Same for
 nfs_readpage. They use the file, if provided, to avoid some lookups, but
 seem to deal with not having it.

Yes, and reiser4 does the same.

 This answer by example is of course not authorative nor complete.

Yes... perhaps we should simply change this code to use NULL and
collect the bug-reports ;)

Oleg.



Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()

2012-06-12 Thread Oleg Nesterov
On 06/12, Srikar Dronamraju wrote:
 
  Note also that we should move this !UPROBE_COPY_INSN from
  install_breakpoint() to somewhere near alloc_uprobe(). This code
  is called only once, it looks a bit strange to use the random mm
  (the first mm vma_prio_tree_foreach() finds) and its mapping to
  verify the insn. In fact this is simply not correct and should be
  fixed, note that on x86 arch_uprobe_analyze_insn() checks

 The reason we delay the copy_insn to the first insert is because
 we have to get access to mm. For archs like x86, we want to know if the
 executable is 32 bit or not

Yes. And this is wrong afaics.

Once again. This !UPROBE_COPY_INSN code is called only once, and it
uses the random mm. After that install_breakpoint() just calls
set_swbp(another_mm) while the insn can be invalid because
another_mm->ia32_compat != mm->ia32_compat.

 So in effect, if we get access to
 struct file corresponding to the inode and if the inode corresponds to
 32 bit executable file or 64 bit executable file during register, then
 we can move it around alloc_uprobe().

I don't think this can work. I have another simple fix in mind, I'll
write another email later.

Oleg.



Re: [PATCH v2 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()

2012-06-11 Thread Oleg Nesterov
Ananth, Srikar,

I think the patch is correct and I am sorry for nit-picking,
this is really minor.

But,

On 06/08, Ananth N Mavinakayanahalli wrote:

 Changes in V2:
 Pass (unsigned long)addr

Well, IMHO, this is confusing.

First of all, why do we have this addr or even vaddr? It should
not exist. We pass it to copy_insn(), but for what?? copy_insn()
should simply use uprobe->offset, the virtual address for this
particular mapping does not matter at all. I am going to send
the cleanup.

Note also that we should move this !UPROBE_COPY_INSN from
install_breakpoint() to somewhere near alloc_uprobe(). This code
is called only once, it looks a bit strange to use the random mm
(the first mm vma_prio_tree_foreach() finds) and its mapping to
verify the insn. In fact this is simply not correct and should be
fixed, note that on x86 arch_uprobe_analyze_insn() checks
mm->context.ia32_compat.

IOW, perhaps uprobe->offset makes more sense?

 --- linux-3.5-rc1.orig/kernel/events/uprobes.c
 +++ linux-3.5-rc1/kernel/events/uprobes.c
 @@ -697,7 +697,7 @@ install_breakpoint(struct uprobe *uprobe
    if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn))
   return -EEXIST;

  - ret = arch_uprobe_analyze_insn(&uprobe->arch, mm);
  + ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, addr);

Just fyi, this conflicts with
[PATCH 1/3] uprobes: install_breakpoint() should fail if is_swbp_insn() == T
I sent, but the conflict is trivial.

Oleg.



Q: a_ops->readpage() && struct file

2012-06-11 Thread Oleg Nesterov
On 06/11, Oleg Nesterov wrote:

 Note also that we should move this !UPROBE_COPY_INSN from
 install_breakpoint() to somewhere near alloc_uprobe().

The main problem is, uprobe_register() doesn't have struct file
for read_mapping_page().

Stupid question. I'm afraid the answer is no but I'll ask anyway.
Is it safe to pass filp == NULL to mapping->readpage()? In fact
I do not understand why it needs struct file* and I do not see
any example of actual usage. Yes, I didn't try to grep very much
and I understand that the filesystem can do something special.
Say it can use file->private_data...

However. There is read_cache_page_gfp() which does use
a_ops-readpage(filp = NULL), and the comment says nothing about
the riskiness.


If not, is there any other way
uprobe_register(struct inode *inode, loff_t offset) can read the
page at this offset?
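
Assuming the answer to the first question is yes, something like this
is what I have in mind:

	/* read the page which covers 'offset', without a struct file */
	page = read_cache_page_gfp(inode->i_mapping,
				   offset >> PAGE_CACHE_SHIFT, GFP_KERNEL);
	if (IS_ERR(page))
		return PTR_ERR(page);
	/* ... kmap() and copy the insn ... */
	page_cache_release(page);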



And btw, read_mapping_page() accepts void *data. Why? It uses
filler == a_ops->readpage, it shouldn't accept anything but a file
pointer?


Please help, I know nothing about vfs.

Oleg.



Re: [PATCH 1/2] uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()

2012-06-06 Thread Oleg Nesterov
On 06/06, Ananth N Mavinakayanahalli wrote:

 From: Ananth N Mavinakayanahalli ana...@in.ibm.com

 On RISC architectures like powerpc, instructions are fixed size.
 Instruction analysis on such platforms is just a matter of (insn % 4).
 Pass the vaddr at which the uprobe is to be inserted so that
 arch_uprobe_analyze_insn() can flag misaligned registration requests.

And the next patch checks vaddr & 0x03.

But why do you need this new arg? arch_uprobe_analyze_insn() could
check container_of(auprobe, struct uprobe, arch)->offset & 0x3 with
the same effect, no? vm_start/vm_pgoff are obviously page-aligned.
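
IOW (untested, and assuming arch code can see the definition of
struct uprobe):

	int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm)
	{
		struct uprobe *uprobe = container_of(auprobe, struct uprobe, arch);

		if (uprobe->offset & 0x03)
			return -EINVAL;

		return 0;
	}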

Oleg.



Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h

2011-06-08 Thread Oleg Nesterov
On 06/07, Eric Paris wrote:

 On Tue, 2011-06-07 at 19:19 +0200, Oleg Nesterov wrote:
 
  With or without this patch, can't we call audit_syscall_exit() twice
  if there is something else in _TIF_WORK_SYSCALL_EXIT mask apart from
  SYSCALL_AUDIT ? First time it is called from asm, then from
  syscall_trace_leave(), no?
 
  For example. The task has TIF_SYSCALL_AUDIT and nothing else, it does
  system_call-auditsys-system_call_fastpath. What if it gets, say,
  TIF_SYSCALL_TRACE before ret_from_sys_call?

 No harm is done calling twice.  The first call will do the real work and
 cleanup.  It will set a flag in the audit data that the work has been
 done (in_syscall == 0) thus the second call will then not do any real
 work and won't have anything to clean up.

Hmm... and I assume context->previous != NULL is not possible on x86_64.
OK, thanks.

And I guess, all CONFIG_AUDITSYSCALL code in entry.S is only needed to
microoptimize the case when TIF_SYSCALL_AUDIT is the only reason for the
slow path. I wonder if it really makes the measurable difference...

Oleg.



Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h

2011-06-08 Thread Oleg Nesterov
On 06/08, Eric Paris wrote:

 On Wed, 2011-06-08 at 18:36 +0200, Oleg Nesterov wrote:
  And I guess, all CONFIG_AUDITSYSCALL code in entry.S is only needed to
  microoptimize the case when TIF_SYSCALL_AUDIT is the only reason for the
   slow path. I wonder if it really makes the measurable difference...

 All I know is what Roland put in the changelog:

 Avoiding the iret return path when syscall audit is enabled helps
 performance a lot.

 I believe this was a result of Fedora starting auditd by default and
 then Linus bitching about how slow a null syscall in a tight loop was.
 It was an optimization for a microbenchmark.  How much it affects things
 on a real syscall that does real work is probably going to be determined
 by how much work is done in the syscall.

and probably by how much work is done in audit_syscall_entry/exit.

OK. Thanks a lot Eric for your explanations.

Oleg.



Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h

2011-06-08 Thread Oleg Nesterov
On 06/08, Oleg Nesterov wrote:

 OK. Thanks a lot Eric for your explanations.

Yes. But may I ask another one?

Shouldn't copy_process()->audit_alloc(tsk) path do
clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT) if it doesn't
set tsk->audit_context?

I can be easily wrong, but afaics otherwise the child can run
with TIF_SYSCALL_AUDIT bit copied from parent's thread_info by
dup_task_struct()->setup_thread_stack() and without ->audit_context,
right? For what?

Any other reason why audit_syscall_entry() checks context != NULL?

IOW. Any reason the patch below is wrong?

I am just curious, thanks.

Oleg.

--- x/kernel/auditsc.c
+++ x/kernel/auditsc.c
@@ -885,6 +885,8 @@ int audit_alloc(struct task_struct *tsk)
if (likely(!audit_ever_enabled))
return 0; /* Return if not auditing. */
 
+   clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+
	state = audit_filter_task(tsk, &key);
if (likely(state == AUDIT_DISABLED))
return 0;
@@ -1591,9 +1593,7 @@ void audit_syscall_entry(int arch, int m
	struct audit_context *context = tsk->audit_context;
enum audit_state state;
 
-   if (unlikely(!context))
-   return;
-
+   BUG_ON(!context);
/*
 * This happens only on certain architectures that make system
 * calls in kernel_thread via the entry.S interface, instead of



Re: [PATCH -v2] Audit: push audit success and retcode into arch ptrace.h

2011-06-07 Thread Oleg Nesterov
On 06/03, Eric Paris wrote:

 The audit system previously expected arches calling to audit_syscall_exit to
 supply as arguments if the syscall was a success and what the return code was.
 Audit also provides a helper AUDITSC_RESULT which was supposed to simplify 
 things
 by converting from negative retcodes to an audit internal magic value stating
 success or failure.  This helper was wrong and could indicate that a valid
 pointer returned to userspace was a failed syscall.  The fix is to fix the
 layering foolishness.  We now pass audit_syscall_exit a struct pt_reg and it
 in turns calls back into arch code to collect the return value and to
 determine if the syscall was a success or failure.  We also define a generic
 is_syscall_success() macro which determines success/failure based on if the
 value is < -MAX_ERRNO.  This works for arches like x86 which do not use a
 separate mechanism to indicate syscall failure.

I know nothing about audit, but the patch looks fine to me.
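
For the record, afaics the new generic helper boils down to this
(modulo the per-arch overrides):

	#define is_syscall_success(regs) \
		(!IS_ERR_VALUE((unsigned long)(regs_return_value(regs))))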


But I have a bit off-topic question,

 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index 8a445a0..b7b1f88 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -53,6 +53,7 @@
   #include <asm/paravirt.h>
   #include <asm/ftrace.h>
   #include <asm/percpu.h>
  +#include <linux/err.h>

   /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
   #include <linux/elf-em.h>
 @@ -564,17 +565,16 @@ auditsys:
   jmp system_call_fastpath

   /*
 -  * Return fast path for syscall audit.  Call audit_syscall_exit()
 +  * Return fast path for syscall audit.  Call __audit_syscall_exit()
* directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
* masked off.
*/
  sysret_audit:
   movq RAX-ARGOFFSET(%rsp),%rsi   /* second arg, syscall return value */
  - cmpq $0,%rsi/* is it < 0? */
  - setl %al/* 1 if so, 0 if not */
  + cmpq $-MAX_ERRNO,%rsi   /* is it < -MAX_ERRNO? */
  + setbe %al   /* 1 if so, 0 if not */
    movzbl %al,%edi /* zero-extend that into %edi */
  - inc %edi /* first arg, 0->1(AUDITSC_SUCCESS), 1->2(AUDITSC_FAILURE) */
 - call audit_syscall_exit
 + call __audit_syscall_exit

With or without this patch, can't we call audit_syscall_exit() twice
if there is something else in _TIF_WORK_SYSCALL_EXIT mask apart from
SYSCALL_AUDIT ? First time it is called from asm, then from
syscall_trace_leave(), no?

For example. The task has TIF_SYSCALL_AUDIT and nothing else, it does
system_call->auditsys->system_call_fastpath. What if it gets, say,
TIF_SYSCALL_TRACE before ret_from_sys_call?

Oleg.
