Re: [PATCH] x86: entry: Remove _TIF_SINGLESTEP define leftover

2021-02-08 Thread Kyle Huey
Looks good to me.

- Kyle

On Mon, Feb 8, 2021 at 1:43 PM Sedat Dilek  wrote:
>
> After commit 6342adcaa683 ("entry: Ensure trap after single-step on
> system call return") a _TIF_SINGLESTEP define is obsolete for arch/x86.
>
> So, remove the leftover in arch/x86/include/asm/thread_info.h file.
>
> Fixes: 6342adcaa683 ("entry: Ensure trap after single-step on system call 
> return"
> CC: Kyle Huey 
> Signed-off-by: Sedat Dilek 
> ---
>  arch/x86/include/asm/thread_info.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/thread_info.h 
> b/arch/x86/include/asm/thread_info.h
> index 0d751d5da702..8861967e0305 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -101,7 +101,6 @@ struct thread_info {
>  #define _TIF_NOTIFY_RESUME     (1 << TIF_NOTIFY_RESUME)
>  #define _TIF_SIGPENDING        (1 << TIF_SIGPENDING)
>  #define _TIF_NEED_RESCHED      (1 << TIF_NEED_RESCHED)
> -#define _TIF_SINGLESTEP        (1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD              (1 << TIF_SSBD)
>  #define _TIF_SPEC_IB           (1 << TIF_SPEC_IB)
>  #define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
> --
> 2.30.0
>


Re: [Linux v5.11-rc7] x86: entry: Leftover of _TIF_SINGLESTEP define?

2021-02-07 Thread Kyle Huey
On Sun, Feb 7, 2021 at 3:09 PM Sedat Dilek  wrote:
>
> Hi,
>
> congrats to Linux v5.11-rc7.
>
> after commit 6342adcaa683 ("entry: Ensure trap after single-step on
> system call return"):
>
> $ git grep '\_TIF_SINGLESTEP' arch/x86/
> arch/x86/include/asm/thread_info.h:#define _TIF_SINGLESTEP  (1 << TIF_SINGLESTEP)
>
> Is this a leftover, and can it be removed (now)?
>
> Thanks.
>
> Regards,
> - Sedat -
>
> [1] https://marc.info/?l=linux-kernel&m=161273724611262&w=2
> [2] https://git.kernel.org/linus/6342adcaa683

Yes it looks like that can be removed now.

- Kyle


Re: [PATCH] entry: Fix missed trap after single-step on system call return

2021-02-03 Thread Kyle Huey
On Wed, Feb 3, 2021 at 10:11 AM Kyle Huey  wrote:
>
> On Wed, Feb 3, 2021 at 10:00 AM Gabriel Krisman Bertazi
>  wrote:
> > This seems to pass Kyle's test case.  Kyle, can you verify it works with
> > rr?
>
> I will test it later today.

I have verified that a) the test case I sent earlier passes now and b)
all rr tests pass now.

Tested-by: Kyle Huey 

- Kyle


Re: [PATCH] entry: Fix missed trap after single-step on system call return

2021-02-03 Thread Kyle Huey
On Wed, Feb 3, 2021 at 10:00 AM Gabriel Krisman Bertazi
 wrote:
>
> Linus Torvalds  writes:
>
> > On Sun, Jan 31, 2021 at 3:35 PM Linus Torvalds
> >  wrote:
> >>
> >> I wonder if the simple solution is to just
> >>
> >>  (a) always set one of the SYSCALL_WORK_EXIT bits on the child in
> >> ptrace (exactly to catch the child on system call exit)
> >>
> >>  (b) basically revert 299155244770 ("entry: Drop usage of TIF flags in
> >> the generic syscall code") and have the syscall exit code check the
> >> TIF_SINGLESTEP flag
> >
> > Actually, (b) looks unnecessary - as long as we get to
> > syscall_exit_work(), the current code will work fine.
> >
> > So maybe just add a dummy SYSCALL_WORK_SYSCALL_EXIT_TRAP, and set that
> > flag whenever a singlestep is requested for a process that is currently
> > in a system call?
> >
> > IOW, make it a very explicit "do TF for system calls", rather than the
> > old code that was doing so implicitly and not very obviously. Hmm?
>
> Linus,
>
> Does the patch below follow your suggestion?  I'm setting the
> SYSCALL_WORK shadowing TIF_SINGLESTEP every time, instead of only when
> the child is inside a system call.  Is this acceptable?
>
> This seems to pass Kyle's test case.  Kyle, can you verify it works with
> rr?

I will test it later today.

> I can also turn Kyle's test case into a selftest, if it is ok with him.

Sure. Consider whatever license/copyright/etc you need granted.

- Kyle

> Thanks,
>
> -- >8 --
> Subject: [PATCH] entry: Fix missed trap after single-step on a system call 
> return
>
> Commit 299155244770 ("entry: Drop usage of TIF flags in the generic
> syscall code") introduces a bug on architectures using the generic
> syscall entry code, in which processes stopped by PTRACE_SYSCALL do not
> trap on syscall return after receiving a TIF_SINGLESTEP. The reason is
> that the meaning of the TIF_SINGLESTEP flag is overloaded to cause the
> trap after a system call is executed, but since the above commit the
> syscall handler only checks the SYSCALL_WORK flags on exit work.
>
> This patch splits the meaning of TIF_SINGLESTEP such that it only means
> single-step mode, and creates a new type of SYSCALL_WORK to request a
> trap immediately after a syscall in single-step mode.  In the current
> implementation, the SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag
> for simplicity.
>
> Since x86 is the only code already using the generic syscall handling,
> this also updates that architecture to flip this bit when a tracer
> enables single step.
>
> Suggested-by: Linus Torvalds 
> Fixes: 299155244770 ("entry: Drop usage of TIF flags in the generic syscall 
> code")
> Signed-off-by: Gabriel Krisman Bertazi 
> ---
>  arch/x86/include/asm/entry-common.h |  2 --
>  arch/x86/kernel/step.c  | 10 --
>  include/linux/entry-common.h|  1 +
>  include/linux/thread_info.h |  2 ++
>  kernel/entry/common.c   | 12 ++--
>  5 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/entry-common.h 
> b/arch/x86/include/asm/entry-common.h
> index 6fe54b2813c1..2b87b191b3b8 100644
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -43,8 +43,6 @@ static __always_inline void arch_check_user_regs(struct 
> pt_regs *regs)
>  }
>  #define arch_check_user_regs arch_check_user_regs
>
> -#define ARCH_SYSCALL_EXIT_WORK (_TIF_SINGLESTEP)
> -
>  static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
>   unsigned long ti_work)
>  {
> diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
> index 60d2c3798ba2..de975b62f10a 100644
> --- a/arch/x86/kernel/step.c
> +++ b/arch/x86/kernel/step.c
> @@ -127,12 +127,17 @@ static int enable_single_step(struct task_struct *child)
> regs->flags |= X86_EFLAGS_TF;
>
> /*
> -* Always set TIF_SINGLESTEP - this guarantees that
> -* we single-step system calls etc..  This will also
> +* Always set TIF_SINGLESTEP.  This will also
>  * cause us to set TF when returning to user mode.
>  */
> set_tsk_thread_flag(child, TIF_SINGLESTEP);
>
> +   /*
> +* Ensure that a trap is triggered once stepping out of a system
> +* call prior to executing any user instruction.
> +*/
> +   set_task_syscall_work(child, SYSCALL_EXIT_TRAP);
> +
> oflags = regs->flags;
>
> /* Set TF on the kernel stack.. */
> @@ -230,6 +235,7 @@ void user_disable_single_step(struct task_struct *child)
>
> /* Always clear TIF_SINGLESTEP... */
> clear_tsk_thread_flag(child, TIF_SINGLESTEP);
> +   clear_task_syscall_work(child, SYSCALL_EXIT_TRAP);
>
> /* But touch TF only if it was set by us.. */
> if (test_and_clear_tsk_thread_flag(child, TIF_FORCED_TF))
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 

Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-31 Thread Kyle Huey
On Sun, Jan 31, 2021 at 3:36 PM Andy Lutomirski  wrote:
> > The odd system call tracing part I have no idea who depends on it
> > (apparently "rr", which I assume is some replay thing), and I suspect
> > our semantics for it has been basically random historical one, and
> > it's apparently what changed.
> >
> > That's the one that we _really_ should have a test-case for, along
> > with some documentation and code comment what the actual semantics
> > need to be so that we don't break it again.
>
> This rr thing may be tangled up with the nonsense semantics of SYSRET.  I’ll 
> muck around with Kyle’s test and try to figure out what broke.
>
> I’m guessing the issue is that we are correctly setting TF in the EFLAGS 
> image, but IRET helpfully only traps after the first user insn executes, 
> which isn’t what the tracer expects.

The state of TF shouldn't really matter here. There should be no user
space code execution in the example I gave. This behavior all happens
in the kernel and not on the silicon.

- Kyle


Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-31 Thread Kyle Huey
On Sun, Jan 31, 2021 at 2:27 PM Kyle Huey  wrote:
>
> On Sun, Jan 31, 2021 at 2:20 PM Andy Lutomirski  wrote:
> >
> >
> >
> > > On Jan 31, 2021, at 2:08 PM, Kyle Huey  wrote:
> > >
> > > On Sun, Jan 31, 2021 at 2:04 PM Andy Lutomirski  
> > > wrote:
> > >> Indeed, and I have tests for this.
> > >
> > > Do you mean you already have a test case or that you would like a
> > > minimized test case?
> >
> > A smallish test that we could stick in selftests would be great if that’s 
> > straightforward.
>
> I'll look into it.
>
> - Kyle

A minimal test case follows.

The key to triggering this bug is to enter a ptrace syscall stop and
then use PTRACE_SINGLESTEP to exit it. On a good kernel this will not
result in any userspace code execution in the tracee because on the
way out of the kernel's syscall handling path the singlestep trap will
be raised immediately. On a bad kernel that stop will not be raised,
and in the example below, the program will crash.

- Kyle

---

#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

void do_child() {
  /* Synchronize with the parent */
  kill(getpid(), SIGSTOP);
  /* Do a syscall */
  printf("child is alive\n");
  /* Return and exit */
}

int main() {
  pid_t child = -1;
  int status = 0;
  unsigned long long previous_rip = 0;
  struct user_regs_struct regs;

  if ((child = fork()) == 0) {
  do_child();
  return 0;
  }

  /* Adds 0x80 to syscall stops so we can see them easily */
  intptr_t options = PTRACE_O_TRACESYSGOOD;
  /* Take control of the child (which should be waiting) */
  assert(ptrace(PTRACE_SEIZE, child, NULL, options) == 0);
  assert(waitpid(child, &status, 0) == child);
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);

  /* Advance to the syscall stop for the write underlying
   * the child's printf.
   */
  assert(ptrace(PTRACE_SYSCALL, child, NULL, 0) == 0);
  assert(waitpid(child, &status, 0) == child);
  /* Should be a syscall stop */
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80));

  /* Mess with the child's registers, so it will crash if
   * it executes any code
   */
  assert(ptrace(PTRACE_GETREGS, child, NULL, &regs) == 0);
  previous_rip = regs.rip;
  regs.rip = 0xdeadbeef;
  assert(ptrace(PTRACE_SETREGS, child, NULL, &regs) == 0);
  /* Singlestep. This should trap without executing any code */
  assert(ptrace(PTRACE_SINGLESTEP, child, NULL, 0) == 0);
  assert(waitpid(child, &status, 0) == child);
  /* Should be at a singlestep SIGTRAP. In a buggy kernel,
   * the SIGTRAP is skipped, execution resumes, and we
   * get a SIGSEGV at the invalid address.
   */
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);

  /* Restore registers */
  assert(ptrace(PTRACE_GETREGS, child, NULL, &regs) == 0);
  regs.rip = previous_rip;
  assert(ptrace(PTRACE_SETREGS, child, NULL, &regs) == 0);

  /* Continue to the end of the program */
  assert(ptrace(PTRACE_CONT, child, NULL, 0) == 0);
  assert(waitpid(child, &status, 0) == child);
  /* Verify the child exited cleanly */
  assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

  printf("SUCCESS\n");

  return 0;
}


Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-31 Thread Kyle Huey
On Sun, Jan 31, 2021 at 2:20 PM Andy Lutomirski  wrote:
>
>
>
> > On Jan 31, 2021, at 2:08 PM, Kyle Huey  wrote:
> >
> > On Sun, Jan 31, 2021 at 2:04 PM Andy Lutomirski  
> > wrote:
> >> Indeed, and I have tests for this.
> >
> > Do you mean you already have a test case or that you would like a
> > minimized test case?
>
> A smallish test that we could stick in selftests would be great if that’s 
> straightforward.

I'll look into it.

- Kyle


Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-31 Thread Kyle Huey
On Sun, Jan 31, 2021 at 2:04 PM Andy Lutomirski  wrote:
> Indeed, and I have tests for this.

Do you mean you already have a test case or that you would like a
minimized test case?

- Kyle


Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-30 Thread Kyle Huey
On Sat, Jan 30, 2021 at 5:56 PM Linus Torvalds
 wrote:
>
> On Sat, Jan 30, 2021 at 5:32 PM Kyle Huey  wrote:
> >
> > I tested that, with 2991552447707d791d9d81a5dc161f9e9e90b163 reverted
> > and Yuxuan's patch applied to Linus's tip, rr works and passes all
> > tests.
>
> There's a patch in the -tip tree that hasn't been merged yet:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/urgent
>
> (there's only that one patch in that branch right now, commit ID
> 41c1a06d1d1544bed9692ba72a5692454eee1945).
>
> It should be making its way my direction any day now, but in the
> meantime if you can verify that it makes things work for you, that
> would be great..

Right, I'm saying that that is not sufficient.

41c1a06d1d1544bed9692ba72a5692454eee1945 does indeed fix the bug
introduced in 64eb35f701f04b30706e21d1b02636b5d31a37d2, but
2991552447707d791d9d81a5dc161f9e9e90b163 introduced another bug that
remains unfixed.

- Kyle


[REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-30 Thread Kyle Huey
Yuxuan Shui previously reported a regression in single step reporting,
introduced in 64eb35f701f04b30706e21d1b02636b5d31a37d2, with a patch
to fix it.

However, after that is fixed, there is another regression introduced
later in the same series, in 2991552447707d791d9d81a5dc161f9e9e90b163,
that again breaks the same code.

The patch renames ARCH_SYSCALL_EXIT_WORK to ARCH_SYSCALL_WORK_EXIT,
which orphans the definition of ARCH_SYSCALL_EXIT_WORK in
arch/x86/include/asm/entry-common.h. No work was done to port
TIF_SINGLESTEP to syscall_work. Although report_single_step still checks
current_thread_info()->flags, the exit path no longer consults the TIF
values at all when deciding whether to enter syscall_exit_work, so
report_single_step is never called and we again fail to report the
single step.

I tested that, with 2991552447707d791d9d81a5dc161f9e9e90b163 reverted
and Yuxuan's patch applied to Linus's tip, rr works and passes all
tests.

- Kyle


Re: [PATCH] ptrace: restore the previous single step reporting behavior

2021-01-28 Thread Kyle Huey
On Thu, Jan 28, 2021 at 11:10 AM Linus Torvalds
 wrote:
> You should have pointed to the actual patch.

Sorry, I broke the reply threading in my mail client.

- Kyle


Re: [PATCH] ptrace: restore the previous single step reporting behavior

2021-01-27 Thread Kyle Huey
Hey everyone,

This regression[0] has totally broken rr on 5.11. Could we get someone
to look at and merge Yuxuan's patch here?

- Kyle

[0] https://github.com/rr-debugger/rr/issues/2793


Re: [PATCH 3/3] x86/debug: Fix PTRACE_{BLOCK,SINGLE}STEP vs ptrace_get_debugreg(6)

2020-10-27 Thread Kyle Huey
On Tue, Oct 27, 2020 at 2:44 AM Peter Zijlstra  wrote:
>
> Commit d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to
> thread.virtual_dr6") changed the semantics of the variable from random
> collection of bits, to exactly only those bits that ptrace() needs.
>
> Unfortunately we lost DR_STEP for PTRACE_{BLOCK,SINGLE}STEP.
>
> Fixes: d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to 
> thread.virtual_dr6")
> Reported-by: Kyle Huey 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/ptrace.h |2 ++
>  arch/x86/kernel/step.c|9 +
>  arch/x86/kernel/traps.c   |2 +-
>  3 files changed, 12 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -355,6 +355,8 @@ static inline unsigned long regs_get_ker
>  #define arch_has_block_step()  (boot_cpu_data.x86 >= 6)
>  #endif
>
> +extern unsigned long user_dr_step(unsigned long dr6);
> +
>  #define ARCH_HAS_USER_SINGLE_STEP_REPORT
>
>  struct user_desc;
> --- a/arch/x86/kernel/step.c
> +++ b/arch/x86/kernel/step.c
> @@ -235,3 +235,12 @@ void user_disable_single_step(struct tas
> if (test_and_clear_tsk_thread_flag(child, TIF_FORCED_TF))
> task_pt_regs(child)->flags &= ~X86_EFLAGS_TF;
>  }
> +
> +unsigned long user_dr_step(unsigned long dr6)
> +{
> +   if (test_thread_flag(TIF_BLOCKSTEP) ||
> +   test_thread_flag(TIF_SINGLESTEP))
> +   return dr6 & DR_STEP;
> +
> +   return 0;
> +}
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -940,7 +940,7 @@ static __always_inline void exc_debug_us
>  * Clear the virtual DR6 value, ptrace() routines will set bits here
>  * for things it wants signals for.
>  */
> -   current->thread.virtual_dr6 = 0;
> +   current->thread.virtual_dr6 = user_dr_step(dr6);
>
> /*
>  * The SDM says "The processor clears the BTF flag when it
>
>

Tested-by: Kyle Huey 

Confirmed that this patch series fixes rr when applied on top of the
v5.10-rc1 tag.

- Kyle


Re: [PATCH] Fix compat regression in process_vm_rw()

2020-10-26 Thread Kyle Huey
On Mon, Oct 26, 2020 at 5:03 PM Jens Axboe  wrote:
>
> The removal of compat_process_vm_{readv,writev} didn't change
> process_vm_rw(), which always assumes it's not doing a compat syscall.
> Instead of passing in 'false' unconditionally for 'compat', make it
> conditional on in_compat_syscall().
>
> Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}")
> Reported-by: Kyle Huey 
> Signed-off-by: Jens Axboe 
>
> ---
>
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index fd12da80b6f2..05676722d9cd 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -273,7 +273,8 @@ static ssize_t process_vm_rw(pid_t pid,
> return rc;
> if (!iov_iter_count())
> goto free_iov_l;
> -   iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r, 
> false);
> +   iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
> +   in_compat_syscall());
> if (IS_ERR(iov_r)) {
> rc = PTR_ERR(iov_r);
> goto free_iov_l;
>
> --
> Jens Axboe
>

I tested this patch and it does fix the original testcase I reported.

- Kyle


[REGRESSION] mm: process_vm_readv testcase no longer works after compat_process_vm_readv removed

2020-10-26 Thread Kyle Huey
A test program from the rr[0] test suite, vm_readv_writev[1], no
longer works on 5.10-rc1 when compiled as a 32 bit binary and executed
on a 64 bit kernel. The first process_vm_readv call (on line 35) now
fails with EFAULT. I have bisected this to
c3973b401ef2b0b8005f8074a10e96e3ea093823.

It should be fairly straightforward to extract the test case from our
repository into a standalone program.

- Kyle

[0] https://rr-project.org/
[1] https://github.com/mozilla/rr/blob/master/src/test/vm_readv_writev.c
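
For illustration, a minimal standalone sketch of the failing pattern (this
is not the rr test itself; it simply reads the caller's own memory, and
assumes it is built with -m32 and run on a 64 bit kernel so the compat
path is exercised):

#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
  char src[16] = "hello";
  char dst[16] = { 0 };
  struct iovec local = { .iov_base = dst, .iov_len = sizeof(dst) };
  struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

  /* With the regression, compat callers got -1/EFAULT here. */
  ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
  assert(n == (ssize_t)sizeof(src));
  assert(strcmp(dst, "hello") == 0);
  return 0;
}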


Re: [REGRESSION] x86/debug: After PTRACE_SINGLESTEP DR_STEP is no longer reported in dr6

2020-10-26 Thread Kyle Huey
On Mon, Oct 26, 2020 at 10:18 AM Peter Zijlstra  wrote:
>
> On Mon, Oct 26, 2020 at 10:12:30AM -0700, Kyle Huey wrote:
> > On Mon, Oct 26, 2020 at 9:55 AM Peter Zijlstra  wrote:
> > > @@ -935,6 +936,26 @@ static __always_inline void exc_debug_user(struct 
> > > pt_regs *regs,
> > > irqentry_enter_from_user_mode(regs);
> > > instrumentation_begin();
> > >
> > > +   /*
> > > +* Clear the virtual DR6 value, ptrace routines will set bits 
> > > here for
> > > +* things we want signals for.
> > > +*/
> > > +   current->thread.virtual_dr6 = 0;
> > > +
> > > +   /*
> > > +* If PTRACE requested SINGLE(BLOCK)STEP, make sure to reflect 
> > > that in
> > > +* the ptrace visible DR6 copy.
> > > +*/
> > > +   if (test_thread_flag(TIF_BLOCKSTEP) || 
> > > test_thread_flag(TIF_SINGLESTEP))
> > > +   current->thread.virtual_dr6 |= (dr6 & DR_STEP);
> > > +
> > > +   /*
> > > +* The SDM says "The processor clears the BTF flag when it
> > > +* generates a debug exception."  Clear TIF_BLOCKSTEP to keep
> > > +* TIF_BLOCKSTEP in sync with the hardware BTF flag.
> > > +*/
> > > +   clear_thread_flag(TIF_BLOCKSTEP);
> > > +
> > > /*
> > >  * If dr6 has no reason to give us about the origin of this trap,
> > >  * then it's very likely the result of an icebp/int01 trap.
> >
> > This looks good to me (at least the non BTF parts), and I'll test it
> > shortly, but especially now that clearing virtual_dr6 is moved to
> > exc_debug_user I still don't see why it's not ok to copy the entire
> > dr6 value into virtual_dr6 unconditionally.  Any extraneous dr6 state
> > from an in-kernel #DB would have been picked up and cleared already
> > when we entered exc_debug_kernel.
>
> There is !ptrace user breakpoints as well. Why should we want potential
> random bits in dr6 ?
>
> Suppose perf and ptrace set a user breakpoint on the exact same
> instruction. The #DB fires and has two DR_TRAP# bits set. perf consumes
> one and ptrace consumes one.
>
> Only the ptrace one should be visible to ptrace, the perf one doesn't
> affect the userspace execution at all and shouldn't be visible.

Ok. Makes sense.

I can confirm that your second patch does fix the behavior I was
seeing and rr works again.

- Kyle


Re: [REGRESSION] x86/debug: After PTRACE_SINGLESTEP DR_STEP is no longer reported in dr6

2020-10-26 Thread Kyle Huey
On Mon, Oct 26, 2020 at 9:55 AM Peter Zijlstra  wrote:
>
> On Mon, Oct 26, 2020 at 05:31:00PM +0100, Peter Zijlstra wrote:
> > In that respect, I think the current virtual_dr6 = 0 is placed wrong, it
> > should only be in exc_debug_user(). The only 'problem' then is that we
> > seem to be able to lose BTF, but perhaps that is already an extant bug.
> >
> > Consider:
> >
> >  - perf: setup in-kernel #DB
> >  - tracer: ptrace(PTRACE_SINGLEBLOCK)
> >  - tracee: #DB on perf breakpoint, loses BTF
> >  - tracee .. never triggers actual blockstep
> >
> > Hmm ? Should we re-set BTF when TIF_BLOCKSTEP && !user_mode(regs) ?
>
> Something like so then.
>
> Or sould we also have the userspace #DB re-set BTF when it was !DR_STEP?
> I need to go untangle that ptrace stuff :/
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 3c70fb34028b..31de8b0980ca 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -793,19 +793,6 @@ static __always_inline unsigned long 
> debug_read_clear_dr6(void)
> set_debugreg(DR6_RESERVED, 6);
> dr6 ^= DR6_RESERVED; /* Flip to positive polarity */
>
> -   /*
> -* Clear the virtual DR6 value, ptrace routines will set bits here for
> -* things we want signals for.
> -*/
> -   current->thread.virtual_dr6 = 0;
> -
> -   /*
> -* The SDM says "The processor clears the BTF flag when it
> -* generates a debug exception."  Clear TIF_BLOCKSTEP to keep
> -* TIF_BLOCKSTEP in sync with the hardware BTF flag.
> -*/
> -   clear_thread_flag(TIF_BLOCKSTEP);
> -
> return dr6;
>  }
>
> @@ -873,6 +860,20 @@ static __always_inline void exc_debug_kernel(struct 
> pt_regs *regs,
>  */
> WARN_ON_ONCE(user_mode(regs));
>
> +   if (test_thread_flag(TIF_BLOCKSTEP)) {
> +   /*
> +* The SDM says "The processor clears the BTF flag when it
> +* generates a debug exception." but PTRACE_BLOCKSTEP 
> requested
> +* it for userspace, but we just took a kernel #DB, so re-set
> +* BTF.
> +*/
> +   unsigned long debugctl;
> +
> +   rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +   debugctl |= DEBUGCTLMSR_BTF;
> +   wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +   }
> +
> /*
>  * Catch SYSENTER with TF set and clear DR_STEP. If this hit a
>  * watchpoint at the same time then that will still be handled.
> @@ -935,6 +936,26 @@ static __always_inline void exc_debug_user(struct 
> pt_regs *regs,
> irqentry_enter_from_user_mode(regs);
> instrumentation_begin();
>
> +   /*
> +* Clear the virtual DR6 value, ptrace routines will set bits here for
> +* things we want signals for.
> +*/
> +   current->thread.virtual_dr6 = 0;
> +
> +   /*
> +* If PTRACE requested SINGLE(BLOCK)STEP, make sure to reflect that in
> +* the ptrace visible DR6 copy.
> +*/
> +   if (test_thread_flag(TIF_BLOCKSTEP) || 
> test_thread_flag(TIF_SINGLESTEP))
> +   current->thread.virtual_dr6 |= (dr6 & DR_STEP);
> +
> +   /*
> +* The SDM says "The processor clears the BTF flag when it
> +* generates a debug exception."  Clear TIF_BLOCKSTEP to keep
> +* TIF_BLOCKSTEP in sync with the hardware BTF flag.
> +*/
> +   clear_thread_flag(TIF_BLOCKSTEP);
> +
> /*
>  * If dr6 has no reason to give us about the origin of this trap,
>  * then it's very likely the result of an icebp/int01 trap.

This looks good to me (at least the non BTF parts), and I'll test it
shortly, but especially now that clearing virtual_dr6 is moved to
exc_debug_user I still don't see why it's not ok to copy the entire
dr6 value into virtual_dr6 unconditionally.  Any extraneous dr6 state
from an in-kernel #DB would have been picked up and cleared already
when we entered exc_debug_kernel.

- Kyle


Re: [REGRESSION] x86/debug: After PTRACE_SINGLESTEP DR_STEP is no longer reported in dr6

2020-10-26 Thread Kyle Huey
On Mon, Oct 26, 2020 at 9:05 AM Peter Zijlstra  wrote:
>
> On Mon, Oct 26, 2020 at 04:55:21PM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 26, 2020 at 07:33:08AM -0700, Kyle Huey wrote:
> > > After resuming a ptracee with PTRACE_SINGLESTEP, in the following
> > > ptrace stop retrieving the dr6 value for the tracee gets a value that
> > > does not include DR_STEP (it is in fact always DR6_RESERVED). I
> > > bisected this to the 13cb73490f475f8e7669f9288be0bcfa85399b1f merge. I
> > > did not bisect further.
> > >
> > > I don't see any handling to ever set DR_STEP in virtual_dr6, so I
> > > think this code is just broken.
> > >
> > > Sorry for not testing this when I was CCd on the original patch series :)
> >
> > Urgh, now I have to try and remember how all that worked again ...
> >
> > I suspect it's either one (or both) of the last two:
> >
> >   f4956cf83ed1 ("x86/debug: Support negative polarity DR6 bits")
> >   d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
> >
> >
> > Just to clarify, the sequence is something like:
> >
> >  - tracer: ptrace(PTRACE_SINGLESTEP)
> >  - tracee: #DB, DR6 contains DR_STEP
> >  - tracer: ptrace_get_debugreg(6)
> >
> > ?
> >
> > You're right that that would be broken, let me try and figure out what
> > the best way would be 'fix' that.
> >
> > Also, can you confirm that pthread_set_debugreg(6) should not do
> > anything useful?
>
>
> Does something like this make sense?
>
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 3c70fb34028b..0e7641ac19a8 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -799,6 +799,13 @@ static __always_inline unsigned long 
> debug_read_clear_dr6(void)
>  */
> current->thread.virtual_dr6 = 0;
>
> +   /*
> +* If PTRACE requested SINGLE(BLOCK)STEP, make sure to reflect that in
> +* the ptrace visible DR6 copy.
> +*/
> +   if (test_thread_flag(TIF_BLOCKSTEP) || 
> test_thread_flag(TIF_SINGLESTEP))
> +   current->thread.virtual_dr6 |= dr6 & DR_STEP;
> +
> /*
>  * The SDM says "The processor clears the BTF flag when it
>  * generates a debug exception."  Clear TIF_BLOCKSTEP to keep

I don't think the `& DR_STEP` should be necessary, that bit should be
set by the hardware, I believe.

Also if a program sets TF on itself in EFLAGS and generates a trap it
should still have DR_STEP set in the ptrace-visible dr6, which this
won't do.

Is there a reason not to always copy dr6 into virtual_dr6 here,
regardless of TIF_SINGLESTEP/etc?

- Kyle


Re: [REGRESSION] x86/debug: After PTRACE_SINGLESTEP DR_STEP is no longer reported in dr6

2020-10-26 Thread Kyle Huey
On Mon, Oct 26, 2020 at 8:55 AM Peter Zijlstra  wrote:
> Urgh, now I have to try and remember how all that worked again ...

Sorry.

> I suspect it's either one (or both) of the last two:
>
>   f4956cf83ed1 ("x86/debug: Support negative polarity DR6 bits")
>   d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")

I think it's the latter, particularly the removal of this assignment[0]

> Just to clarify, the sequence is something like:
>
>  - tracer: ptrace(PTRACE_SINGLESTEP)
>  - tracee: #DB, DR6 contains DR_STEP
>  - tracer: ptrace_get_debugreg(6)

Right.

> Also, can you confirm that pthread_set_debugreg(6) should not do
> anything useful?

I don't believe it did anything useful.

- Kyle

[0] 
https://github.com/torvalds/linux/commit/d53d9bc0cf78#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5L790


[REGRESSION] x86/debug: After PTRACE_SINGLESTEP DR_STEP is no longer reported in dr6

2020-10-26 Thread Kyle Huey
After resuming a ptracee with PTRACE_SINGLESTEP, retrieving the dr6
value for the tracee at the following ptrace stop returns a value that
does not include DR_STEP (it is in fact always DR6_RESERVED). I
bisected this to the 13cb73490f475f8e7669f9288be0bcfa85399b1f merge. I
did not bisect further.

I don't see any handling to ever set DR_STEP in virtual_dr6, so I
think this code is just broken.

Sorry for not testing this when I was CCd on the original patch series :)

- Kyle
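
For reference, a minimal sketch of how a tracer reads the ptrace-visible
dr6 being described here (an illustration, not rr's code; it assumes an
x86-64 tracer and a tracee already in the stop that follows
PTRACE_SINGLESTEP, with error handling reduced to an errno check, and the
helper name singlestep_reported is made up):

#include <errno.h>
#include <stddef.h>
#include <sys/debugreg.h>   /* DR_STEP */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>       /* struct user */

static int singlestep_reported(pid_t pid) {
  errno = 0;
  long dr6 = ptrace(PTRACE_PEEKUSER, pid,
                    offsetof(struct user, u_debugreg[6]), NULL);
  /* Before the regression this value had DR_STEP set; afterwards it was
   * always DR6_RESERVED. */
  return errno == 0 && (dr6 & DR_STEP);
}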


Re: [PATCH] core/entry: Report syscall correctly for trace and audit

2020-09-14 Thread Kyle Huey
On Fri, Sep 11, 2020 at 5:58 PM Kees Cook  wrote:
>
> On v5.8 when doing seccomp syscall rewrites (e.g. getpid into getppid
> as seen in the seccomp selftests), trace (and audit) correctly see the
> rewritten syscall on entry and exit:
>
> seccomp_bpf-1307  [000]  22974.874393: sys_enter: NR 110 (...
> seccomp_bpf-1307  [000] .N.. 22974.874401: sys_exit: NR 110 = 1304
>
> With mainline we see a mismatched enter and exit (the original syscall
> is incorrectly visible on entry):
>
> seccomp_bpf-1030  [000] 21.806766: sys_enter: NR 39 (...
> seccomp_bpf-1030  [000] 21.806767: sys_exit: NR 110 = 1027
>
> When ptrace or seccomp change the syscall, this needs to be visible to
> trace and audit at that time as well. Update the syscall earlier so they
> see the correct value.
>
> Reported-by: Michael Ellerman 
> Fixes: d88d59b64ca3 ("core/entry: Respect syscall number rewrites")
> Cc: Thomas Gleixner 
> Cc: Kyle Huey 
> Cc: Andy Lutomirski 
> Cc: Ingo Molnar 
> Signed-off-by: Kees Cook 
> ---
>  kernel/entry/common.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 18683598edbc..6fdb6105e6d6 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -60,13 +60,15 @@ static long syscall_trace_enter(struct pt_regs *regs, 
> long syscall,
> return ret;
> }
>
> +   /* Either of the above might have changed the syscall number */
> +   syscall = syscall_get_nr(current, regs);
> +
> if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
> trace_sys_enter(regs, syscall);
>
> syscall_enter_audit(regs, syscall);
>
> -   /* The above might have changed the syscall number */
> -   return ret ? : syscall_get_nr(current, regs);
> +   return ret ? : syscall;
>  }
>
>  static __always_inline long
> --
> 2.25.1
>

lgtm

- Kyle


Re: [PATCH] seccomp: kill process instead of thread for unknown actions

2020-09-07 Thread Kyle Huey
On Mon, Aug 31, 2020 at 12:37 PM Kees Cook  wrote:
>
> On Fri, Aug 28, 2020 at 09:56:13PM -0400, Rich Felker wrote:
> > Asynchronous termination of a thread outside of the userspace thread
> > library's knowledge is an unsafe operation that leaves the process in
> > an inconsistent, corrupt, and possibly unrecoverable state. In order
> > to make new actions that may be added in the future safe on kernels
> > not aware of them, change the default action from
> > SECCOMP_RET_KILL_THREAD to SECCOMP_RET_KILL_PROCESS.
> >
> > Signed-off-by: Rich Felker 
> > ---
> >
> > This fundamental problem with SECCOMP_RET_KILL_THREAD, and that it
> > should be considered unsafe and deprecated, was recently noted/fixed
> > seccomp in the man page and its example. Here I've only changed the
> > default action for new/unknown action codes. Ideally the behavior for
> > strict seccomp mode would be changed too but I think that breaks
> > stability policy; in any case it's less likely to be an issue since
> > strict mode is hard or impossible to use reasonably in a multithreaded
> > process.
> >
> > Unfortunately changing this now won't help older kernels where unknown
> > new actions would still be handled unsafely, but at least it makes it
> > so the problem will fade away over time.
>
> I think this is probably fine to change now. I'd always wanted to
> "upgrade" the default to KILL_PROCESS, but wanted to wait for
> KILL_PROCESS to exist at all for a while first. :)
>
> I'm not aware of any filter generators (e.g. libseccomp, Chrome) that
> depend on unknown filter return values to cause a KILL_THREAD, and
> everything I've seen indicates that they aren't _accidentally_ depending
> on it either (i.e. they both produce "valid" filters). It's possible
> that something out there doesn't, and in that case, we likely need to
> make a special case for whatever bad filter value it chose, but we can
> cross that bridge when we come to it.
>
> I've added Kyle and Robert to CC as well, as they have noticed subtle
> changes to seccomp behavior in the past. I *think* this change should be
> fine, but perhaps they will see something I don't. :)

I can't think of anything here that would break stuff, though I do
believe rr needs some changes to how it handles this (I don't think
our current behavior is an accurate emulation of the kernel).

- Kyle

> >
> >  kernel/seccomp.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index d653d8426de9..ce1875fa6b39 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -910,10 +910,10 @@ static int __seccomp_filter(int this_syscall, const 
> > struct seccomp_data *sd,
> >   seccomp_init_siginfo(&info, this_syscall, data);
> >   do_coredump(&info);
> >   }
> > - if (action == SECCOMP_RET_KILL_PROCESS)
> > - do_group_exit(SIGSYS);
> > - else
> > + if (action == SECCOMP_RET_KILL_THREAD)
> >   do_exit(SIGSYS);
> > + else
> > + do_group_exit(SIGSYS);
>
> I need to think a little more, but I suspect we should change the coredump
> logic (above the quoted code) too... (i.e. "action == 
> SECCOMP_RET_KILL_PROCESS"
> -> "action != SECCOMP_RET_KILL_THREAD")
>
> >   }
> >
> >   unreachable();
> > --
> > 2.21.0
> >
>
> Thanks!
>
> -Kees
>
> --
> Kees Cook


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-25 Thread Kyle Huey
On Tue, Aug 25, 2020 at 12:32 PM Andy Lutomirski  wrote:
>
> On Tue, Aug 25, 2020 at 11:50 AM Kyle Huey  wrote:
> >
> > On Tue, Aug 25, 2020 at 10:31 AM Kyle Huey  wrote:
> > >
> > > On Tue, Aug 25, 2020 at 9:46 AM Andy Lutomirski  wrote:
> > > >
> > > > On Tue, Aug 25, 2020 at 9:32 AM Kyle Huey  wrote:
> > > > >
> > > > > On Tue, Aug 25, 2020 at 9:12 AM Andy Lutomirski  
> > > > > wrote:
> > > > > > I don’t like this at all. Your behavior really shouldn’t depend on
> > > > > > whether the new instructions are available.  Also, some day I would
> > > > > > like to change Linux to have the new behavior even if FSGSBASE
> > > > > > instructions are not available, and this will break rr again.  (The
> > > > > > current !FSGSBASE behavior is an ugly optimization of dubious value.
> > > > > > I would not go so far as to describe it as correct.)
> > > > >
> > > > > Ok.
> > > > >
> > > > > > I would suggest you do one of the following things:
> > > > > >
> > > > > > 1. Use int $0x80 directly to load 32-bit regs into a child.  This
> > > > > > might dramatically simplify your code and should just do the right
> > > > > > thing.
> > > > >
> > > > > I don't know what that means.
> > > >
> > > > This is untested, but what I mean is:
> > > >
> > > > static int ptrace32(int req, pid_t pid, int addr, int data) {
> > > >int ret;
> > > >/* new enough kernels won't clobber r8, etc. */
> > > >asm volatile ("int $0x80" : "=a" (ret) : "a" (26 /* ptrace */), "b"
> > > > (req), "c" (pid), "d" (addr), "S" (data) : "flags", "r8", "r9", "r10",
> > > > "r11");
> > > >return ret;
> > > > }
> > > >
> > > > with a handful of caveats:
> > > >
> > > >  - This won't compile with -fPIC, I think.  Instead you'll need to
> > > > write a little bit of asm to set up and restore ebx yourself.  gcc is
> > > > silly like this.
> > > >
> > > >  - Note that addr is an int.  You'll need to mmap(..., MAP_32BIT, ...)
> > > > to get a buffer that can be pointed to with an int.
> > > >
> > > > The advantage is that this should work on all kernels that support
> > > > 32-bit mode at all.
> > > >
> > > > >
> > > > > > 2. Something like your patch but make it unconditional.
> > > > > >
> > > > > > 3. Ask for, and receive, real kernel support for setting FS and GS 
> > > > > > in
> > > > > > the way that 32-bit code expects.
> > > > >
> > > > > I think the easiest way forward for us would be a PTRACE_GET/SETREGSET
> > > > > like operation that operates on the regsets according to the
> > > > > *tracee*'s bitness (rather than the tracer, as it works currently).
> > > > > Does that sound workable?
> > > > >
> > > >
> > > > Strictly speaking, on Linux, there is no unified concept of a task's
> > > > bitness, so "set all these registers according to the target's
> > > > bitness" is not well defined.  We could easily give you a
> > > > PTRACE_SETREGS_X86_32, etc, though.
> > >
> > > In the process of responding to this I spent some time doing code
> > > inspection and discovered a subtlety in the ptrace API that I was
> > > previously unaware of. PTRACE_GET/SETREGS use the regset views
> > > corresponding to the tracer but PTRACE_GET/SETREGSET use the regset
> > > views corresponding to the tracee. This means it is possible for us
> > > today to set FS/GS "the old way" with a 64 bit tracer/32 bit tracee
> > > combo, as long as we use PTRACE_SETREGSET with NT_PRSTATUS instead of
> > > PTRACE_SETREGS.
> >
> > Alright I reverted the previous changes and switched us to use
> > PTRACE_SETREGSET with NT_PRSTATUS[0] and our 32 bit tests pass again.
> > I assume this behavior will remain unchanged indefinitely even when
> > the fs/gsbase manipulation instructions are not available since the 32
> > bit user_regs_struct can't grow?
>
> Correct.
>
> I think it would be reasonable to add new, less erratic PTRACE_SETxyz
> modes, but the old ones will stay as is.
>
> Strictly speaking, the issue you had wasn't a ptrace change.  It was a
> context switching change.  Before the change, you were programming
> garbage into the tracee state, but the context switch code wasn't
> capable of restoring the garbage correctly and instead happened to
> restore the state you actually wanted.  With the new code, the context
> switch actually worked correctly (for some value of correctly), and
> the tracee crashed.  Not that this distinction really matters.

Yes. We've been feeding the kernel trash for years. You were just
letting us get away with it before ;)

- Kyle


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-25 Thread Kyle Huey
On Tue, Aug 25, 2020 at 10:31 AM Kyle Huey  wrote:
>
> On Tue, Aug 25, 2020 at 9:46 AM Andy Lutomirski  wrote:
> >
> > On Tue, Aug 25, 2020 at 9:32 AM Kyle Huey  wrote:
> > >
> > > On Tue, Aug 25, 2020 at 9:12 AM Andy Lutomirski  
> > > wrote:
> > > > I don’t like this at all. Your behavior really shouldn’t depend on
> > > > whether the new instructions are available.  Also, some day I would
> > > > like to change Linux to have the new behavior even if FSGSBASE
> > > > instructions are not available, and this will break rr again.  (The
> > > > current !FSGSBASE behavior is an ugly optimization of dubious value.
> > > > I would not go so far as to describe it as correct.)
> > >
> > > Ok.
> > >
> > > > I would suggest you do one of the following things:
> > > >
> > > > 1. Use int $0x80 directly to load 32-bit regs into a child.  This
> > > > might dramatically simplify your code and should just do the right
> > > > thing.
> > >
> > > I don't know what that means.
> >
> > This is untested, but what I mean is:
> >
> > static int ptrace32(int req, pid_t pid, int addr, int data) {
> >int ret;
> >/* new enough kernels won't clobber r8, etc. */
> >asm volatile ("int $0x80" : "=a" (ret) : "a" (26 /* ptrace */), "b"
> > (req), "c" (pid), "d" (addr), "S" (data) : "flags", "r8", "r9", "r10",
> > "r11");
> >return ret;
> > }
> >
> > with a handful of caveats:
> >
> >  - This won't compile with -fPIC, I think.  Instead you'll need to
> > write a little bit of asm to set up and restore ebx yourself.  gcc is
> > silly like this.
> >
> >  - Note that addr is an int.  You'll need to mmap(..., MAP_32BIT, ...)
> > to get a buffer that can be pointed to with an int.
> >
> > The advantage is that this should work on all kernels that support
> > 32-bit mode at all.
> >
> > >
> > > > 2. Something like your patch but make it unconditional.
> > > >
> > > > 3. Ask for, and receive, real kernel support for setting FS and GS in
> > > > the way that 32-bit code expects.
> > >
> > > I think the easiest way forward for us would be a PTRACE_GET/SETREGSET
> > > like operation that operates on the regsets according to the
> > > *tracee*'s bitness (rather than the tracer, as it works currently).
> > > Does that sound workable?
> > >
> >
> > Strictly speaking, on Linux, there is no unified concept of a task's
> > bitness, so "set all these registers according to the target's
> > bitness" is not well defined.  We could easily give you a
> > PTRACE_SETREGS_X86_32, etc, though.
>
> In the process of responding to this I spent some time doing code
> inspection and discovered a subtlety in the ptrace API that I was
> previously unaware of. PTRACE_GET/SETREGS use the regset views
> corresponding to the tracer but PTRACE_GET/SETREGSET use the regset
> views corresponding to the tracee. This means it is possible for us
> today to set FS/GS "the old way" with a 64 bit tracer/32 bit tracee
> combo, as long as we use PTRACE_SETREGSET with NT_PRSTATUS instead of
> PTRACE_SETREGS.

Alright I reverted the previous changes and switched us to use
PTRACE_SETREGSET with NT_PRSTATUS[0] and our 32 bit tests pass again.
I assume this behavior will remain unchanged indefinitely even when
the fs/gsbase manipulation instructions are not available since the 32
bit user_regs_struct can't grow?

- Kyle

[0] 
https://github.com/mozilla/rr/commit/5c12d5f9ab77e526f852cbca82f454a42e3a6e30#diff-b509a7939392c11bb3c517b00da4526fL1447
(most of the rest of this commit is fixing our *emulation* of
PTRACE_GETREGSET/PTRACE_SETREGSET which was not respecting the
tracee's 32 vs 64 bit status).


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-25 Thread Kyle Huey
On Tue, Aug 25, 2020 at 9:46 AM Andy Lutomirski  wrote:
>
> On Tue, Aug 25, 2020 at 9:32 AM Kyle Huey  wrote:
> >
> > On Tue, Aug 25, 2020 at 9:12 AM Andy Lutomirski  wrote:
> > > I don’t like this at all. Your behavior really shouldn’t depend on
> > > whether the new instructions are available.  Also, some day I would
> > > like to change Linux to have the new behavior even if FSGSBASE
> > > instructions are not available, and this will break rr again.  (The
> > > current !FSGSBASE behavior is an ugly optimization of dubious value.
> > > I would not go so far as to describe it as correct.)
> >
> > Ok.
> >
> > > I would suggest you do one of the following things:
> > >
> > > 1. Use int $0x80 directly to load 32-bit regs into a child.  This
> > > might dramatically simplify your code and should just do the right
> > > thing.
> >
> > I don't know what that means.
>
> This is untested, but what I mean is:
>
> static int ptrace32(int req, pid_t pid, int addr, int data) {
>int ret;
>/* new enough kernels won't clobber r8, etc. */
>asm volatile ("int $0x80" : "=a" (ret) : "a" (26 /* ptrace */), "b"
> (req), "c" (pid), "d" (addr), "S" (data) : "flags", "r8", "r9", "r10",
> "r11");
>return ret;
> }
>
> with a handful of caveats:
>
>  - This won't compile with -fPIC, I think.  Instead you'll need to
> write a little bit of asm to set up and restore ebx yourself.  gcc is
> silly like this.
>
>  - Note that addr is an int.  You'll need to mmap(..., MAP_32BIT, ...)
> to get a buffer that can be pointed to with an int.
>
> The advantage is that this should work on all kernels that support
> 32-bit mode at all.
>
> >
> > > 2. Something like your patch but make it unconditional.
> > >
> > > 3. Ask for, and receive, real kernel support for setting FS and GS in
> > > the way that 32-bit code expects.
> >
> > I think the easiest way forward for us would be a PTRACE_GET/SETREGSET
> > like operation that operates on the regsets according to the
> > *tracee*'s bitness (rather than the tracer, as it works currently).
> > Does that sound workable?
> >
>
> Strictly speaking, on Linux, there is no unified concept of a task's
> bitness, so "set all these registers according to the target's
> bitness" is not well defined.  We could easily give you a
> PTRACE_SETREGS_X86_32, etc, though.

In the process of responding to this I spent some time doing code
inspection and discovered a subtlety in the ptrace API that I was
previously unaware of. PTRACE_GET/SETREGS use the regset views
corresponding to the tracer but PTRACE_GET/SETREGSET use the regset
views corresponding to the tracee. This means it is possible for us
today to set FS/GS "the old way" with a 64 bit tracer/32 bit tracee
combo, as long as we use PTRACE_SETREGSET with NT_PRSTATUS instead of
PTRACE_SETREGS.

- Kyle
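
As a concrete illustration of that API subtlety (a minimal sketch, not
rr's code; it assumes the tracee is already in a ptrace stop and that the
caller supplies a register buffer laid out for the tracee, and the helper
names are made up):

#include <elf.h>         /* NT_PRSTATUS */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>     /* struct iovec */

/* PTRACE_GETREGSET/SETREGSET interpret the buffer using the *tracee's*
 * regset view, so a 64 bit tracer passes a 32 bit user_regs_struct for a
 * 32 bit tracee.  PTRACE_GETREGS/SETREGS instead use the tracer's view. */
static long get_tracee_regs(pid_t pid, void *regs, size_t size) {
  struct iovec iov = { .iov_base = regs, .iov_len = size };
  /* On success the kernel shrinks iov.iov_len to the size it filled,
   * which reveals whether the tracee uses the 32 or 64 bit layout. */
  return ptrace(PTRACE_GETREGSET, pid, (void *)NT_PRSTATUS, &iov);
}

static long set_tracee_regs(pid_t pid, void *regs, size_t size) {
  struct iovec iov = { .iov_base = regs, .iov_len = size };
  return ptrace(PTRACE_SETREGSET, pid, (void *)NT_PRSTATUS, &iov);
}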


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-25 Thread Kyle Huey
On Tue, Aug 25, 2020 at 9:12 AM Andy Lutomirski  wrote:
> I don’t like this at all. Your behavior really shouldn’t depend on
> whether the new instructions are available.  Also, some day I would
> like to change Linux to have the new behavior even if FSGSBASE
> instructions are not available, and this will break rr again.  (The
> current !FSGSBASE behavior is an ugly optimization of dubious value.
> I would not go so far as to describe it as correct.)

Ok.

> I would suggest you do one of the following things:
>
> 1. Use int $0x80 directly to load 32-bit regs into a child.  This
> might dramatically simplify your code and should just do the right
> thing.

I don't know what that means.

> 2. Something like your patch but make it unconditional.
>
> 3. Ask for, and receive, real kernel support for setting FS and GS in
> the way that 32-bit code expects.

I think the easiest way forward for us would be a PTRACE_GET/SETREGSET
like operation that operates on the regsets according to the
*tracee*'s bitness (rather than the tracer, as it works currently).
Does that sound workable?

- Kyle


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-24 Thread Kyle Huey
On Mon, Aug 24, 2020 at 5:31 PM Andy Lutomirski  wrote:
>
> On Mon, Aug 24, 2020 at 4:52 PM H. Peter Anvin  wrote:
> >
> > On 2020-08-24 14:10, Andy Lutomirski wrote:
> > >
> > > PTRACE_READ_SEGMENT_DESCRIPTOR to read a segment descriptor.
> > >
> > > PTRACE_SET_FS / PTRACE_SET_GS: Sets FS or GS and updates the base 
> > > accordingly.
> > >
> > > PTRACE_READ_SEGMENT_BASE: pass in a segment selector, get a base out.
> > > You would use this to populate the base fields.
> > >
> > > or perhaps a ptrace SETREGS variant that tries to preserve the old
> > > base semantics and magically sets the bases to match the selectors if
> > > the selectors are nonzero.
> > >
> > > Do any of these choices sound preferable to any of you?
> > >
> >
> > My suggestion would be to export the GDT and LDT as a (readonly or mostly
> > readonly) regset(s) rather than adding entirely new operations. We could 
> > allow
> > the LDT and the per-thread GDT entries to be written, subject to the same
> > limitations as the corresponding system calls.
> >
>
> That seems useful, although we'd want to do some extensive
> sanitization of the GDT.  But maybe it's obnoxious to ask Kyle and
> Robert to parse the GDT, LDT, and selector just to emulate the
> demented pre-5.9 ptrace() behavior.
>
> --Andy

We've already addressed the main issue on rr's side[0]. The only
outstanding issue is that if you record a trace with 32 bit programs
on a pre-5.9 64 bit kernel and then try to replay it on 5.9 it won't
work. If you hit this case rr will print an error message telling you
to boot your 5.9 kernel with nofsgsbase if you want to replay the
trace. I think that's probably sufficient. 32 bit is legacy stuff we
don't care that much about anyways, replaying traces on a different
kernel/machine has always been a bit dicey, and if you absolutely must
do it there is a workaround. I'm not inclined to do much work to
support the narrow remaining case.

- Kyle

[0] Namely, we're tracking fs/gsbase for 32 bit tracees on 64 bit
kernels where the fs/gsbase instructions work in new recordings now:
https://github.com/mozilla/rr/commit/c3292c75dbd8c9ce5256496108965c0442424eef


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-21 Thread Kyle Huey
On Fri, Aug 21, 2020 at 7:53 PM Andy Lutomirski  wrote:
>
>
>
> > On Aug 21, 2020, at 2:33 PM, Kyle Huey  wrote:
> >
> > On Fri, Aug 21, 2020 at 1:08 PM Bae, Chang Seok
> >  wrote:
> >>
> >>
> >>>> On Aug 20, 2020, at 21:41, Kyle Huey  wrote:
> >>>
> >>> On the x86-64 5.9-rc1 TLS is completely broken in 32 bit tracees when
> >>> running under rr[0]. Booting the kernel with `nofsgsbase` fixes it and
> >>> I bisected to 
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.8&id=673903495c85137791d5820d690229efe09c8f7b.
> >>>
> >>> STR:
> >>> 1. Build rr from source by
> >>> a. git clone https://github.com/mozilla/rr
> >>> b. mkdir rr/obj
> >>> c. cd rr/obj
> >>> d. cmake ..
> >>> e. make -j16
> >>> 2. Run the simple 32 bit tracee outside of rr with `./bin/simple_32`.
> >>> It should print a message and exit cleanly.
> >>> 3. Run it under rr with `./bin/rr ./bin/simple_32`.
> >>>
> >>> It should behave the same way, but with fsgsbase enabled it will
> >>> segfault. The `simple_32` binary is a simple "hello world" type
> >>> program but it does link to pthreads, so pre-main code attempts to
> >>> access TLS variables.
> >>>
> >>> The interplay between 32 bit and 64 bit TLS is dark magic to me
> >>> unfortunately so this is all the useful information I have.
> >>
> >> As I run it and collect the ptrace logs, it starts to set FSBASE with
> >> some numbers, e.g. 140632147826496, and then later attempts to set GS
> >> with 99 through putreg(), not putreg32().
> >>
> >> With FSGSBASE, the FS/GS base is decoupled from the selector. Andy
> >> made putreg32() to retain the old behavior, fetching FS/GS base
> >> according to the index, but the putreg() does not do. So, rr probably
> >> relies on the old behavior as observed to setting the GS index only.
> >> But it was previously considered to be okay with this comment [1]:
> >>
> >>   "Our modifications to fs/gs/fs_base/gs_base are always either a)
> >>setting values that the kernel set during recording to make them
> >>happen during replay or b) emulating PTRACE_SET_REGS that a tracee
> >>ptracer tried to set on another tracee. Either way I think the
> >>effects are going to be the same as what would happen if the
> >>program were run without rr."
> >>
> >> It is not straightforward to go into the rr internal to me. Robert,
> >> any thought?
> >
> > Hmm. When we are running a 32 bit tracee in a 64 bit build of rr we
> > internally convert between the 32 bit and 64 bit user_regs_structs[0]
> > at the ptrace boundary. This does not preserve the fs/gsbase (because
> > there is no fs/gsbase in the 32 bit user_regs_struct, of course).
> >
> > 40c45904f818c1f6555294ca27afc5fda4f09e68 added magic for a 32 bit
> > tracer tracing a 32 bit tracee on a 64 bit kernel, but it looks like a
> > 64 bit tracer tracing a 32 bit tracee on a 64 bit kernel *is* now
> > expected to preserve the fs/gsbase values (or die, in our case).
> >
> > Is that correct?
>
> I was certainly not expecting rr to do this, and I thought I had asked in 
> advance.  What exact ptrace() calls are you doing here?  Is this POKEUSER or 
> something else?  Breaking rr is at least impolite, and I’d like to fix this.

I believe we are PTRACE_GETREGSing and later PTRACE_SETREGSing, but
doing the latter with garbage for fs/gs_base for a 32 bit tracee. That
didn't used to matter (because those values were completely ignored
for a 32 bit tracee) but now it does.

There's a good case that that's our fault and I'm happy to spend my
"don't break userspace" points somewhere else ;)

- Kyle


Re: [REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-21 Thread Kyle Huey
On Fri, Aug 21, 2020 at 1:08 PM Bae, Chang Seok
 wrote:
>
>
> > On Aug 20, 2020, at 21:41, Kyle Huey  wrote:
> >
> > On the x86-64 5.9-rc1 TLS is completely broken in 32 bit tracees when
> > running under rr[0]. Booting the kernel with `nofsgsbase` fixes it and
> > I bisected to 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.8&id=673903495c85137791d5820d690229efe09c8f7b.
> >
> > STR:
> > 1. Build rr from source by
> >  a. git clone https://github.com/mozilla/rr
> >  b. mkdir rr/obj
> >  c. cd rr/obj
> >  d. cmake ..
> >  e. make -j16
> > 2. Run the simple 32 bit tracee outside of rr with `./bin/simple_32`.
> > It should print a message and exit cleanly.
> > 3. Run it under rr with `./bin/rr ./bin/simple_32`.
> >
> > It should behave the same way, but with fsgsbase enabled it will
> > segfault. The `simple_32` binary is a simple "hello world" type
> > program but it does link to pthreads, so pre-main code attempts to
> > access TLS variables.
> >
> > The interplay between 32 bit and 64 bit TLS is dark magic to me
> > unfortunately so this is all the useful information I have.
>
> As I run it and collect the ptrace logs, it starts to set FSBASE with
> some numbers, e.g. 140632147826496, and then later attempts to set GS
> with 99 through putreg(), not putreg32().
>
> With FSGSBASE, the FS/GS base is decoupled from the selector. Andy
> made putreg32() to retain the old behavior, fetching FS/GS base
> according to the index, but the putreg() does not do. So, rr probably
> relies on the old behavior as observed to setting the GS index only.
> But it was previously considered to be okay with this comment [1]:
>
>"Our modifications to fs/gs/fs_base/gs_base are always either a)
> setting values that the kernel set during recording to make them
> happen during replay or b) emulating PTRACE_SET_REGS that a tracee
> ptracer tried to set on another tracee. Either way I think the
> effects are going to be the same as what would happen if the
> program were run without rr."
>
> It is not straightforward to go into the rr internal to me. Robert,
> any thought?

Hmm. When we are running a 32 bit tracee in a 64 bit build of rr we
internally convert between the 32 bit and 64 bit user_regs_structs[0]
at the ptrace boundary. This does not preserve the fs/gsbase (because
there is no fs/gsbase in the 32 bit user_regs_struct, of course).

40c45904f818c1f6555294ca27afc5fda4f09e68 added magic for a 32 bit
tracer tracing a 32 bit tracee on a 64 bit kernel, but it looks like a
64 bit tracer tracing a 32 bit tracee on a 64 bit kernel *is* now
expected to preserve the fs/gsbase values (or die, in our case).

Is that correct?

- Kyle

[0] 
https://github.com/mozilla/rr/blob/fcd2a26680a3fc2bda5f40d732d0c72b9628357b/src/Registers.cc#L519


[REGRESSION] x86/cpu fsgsbase breaks TLS in 32 bit rr tracees on a 64 bit system

2020-08-20 Thread Kyle Huey
On the x86-64 5.9-rc1 TLS is completely broken in 32 bit tracees when
running under rr[0]. Booting the kernel with `nofsgsbase` fixes it and
I bisected to 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.8&id=673903495c85137791d5820d690229efe09c8f7b.

STR:
1. Build rr from source by
  a. git clone https://github.com/mozilla/rr
  b. mkdir rr/obj
  c. cd rr/obj
  d. cmake ..
  e. make -j16
2. Run the simple 32 bit tracee outside of rr with `./bin/simple_32`.
It should print a message and exit cleanly.
3. Run it under rr with `./bin/rr ./bin/simple_32`.

It should behave the same way, but with fsgsbase enabled it will
segfault. The `simple_32` binary is a simple "hello world" type
program but it does link to pthreads, so pre-main code attempts to
access TLS variables.

The interplay between 32 bit and 64 bit TLS is dark magic to me
unfortunately so this is all the useful information I have.

- Kyle

[0] https://rr-project.org/
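
For reference, a stand-in sketch of the kind of program involved
(hypothetical, not the actual simple_32 source from the rr repository): a
32 bit binary whose TLS access goes through the segment base that the
fsgsbase change affects. Build with something like gcc -m32 -pthread.

#include <stdio.h>

static __thread int counter;   /* TLS access goes through the %gs base on i386 */

int main(void) {
  counter++;
  printf("TLS counter = %d\n", counter);
  return 0;
}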


Re: [REGRESSION] x86/entry: Tracer no longer has opportunity to change the syscall number at entry via orig_ax

2020-08-20 Thread Kyle Huey
On Wed, Aug 19, 2020 at 12:44 PM Thomas Gleixner  wrote:
>
> On Wed, Aug 19 2020 at 10:14, Kyle Huey wrote:
> > tl;dr: after 27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14 ptracer
> > modifications to orig_ax in a syscall entry trace stop are not honored
> > and this breaks our code.
>
> My fault and I have no idead why none of the silly test cases
> noticed. Fix below.

That'll do it, thanks.

- Kyle

> Thanks,
>
> tglx
> ---
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 9852e0d62d95..fcae019158ca 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -65,7 +65,8 @@ static long syscall_trace_enter(struct pt_regs *regs, long 
> syscall,
>
> syscall_enter_audit(regs, syscall);
>
> -   return ret ? : syscall;
> +   /* The above might have changed the syscall number */
> +   return ret ? : syscall_get_nr(current, regs);
>  }
>
>  noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)


Re: [REGRESSION 5.8] x86/entry: DR0 break-on-write not working

2020-08-19 Thread Kyle Huey
On Wed, Aug 19, 2020 at 11:42 AM  wrote:
>
> On Wed, Aug 19, 2020 at 10:53:58AM -0700, Kyle Huey wrote:
> > rr, a userspace record and replay debugger[0], has a test suite that
> > attempts to exercise strange corners of the Linux API. One such
> > test[1] began failing after 2bbc68f8373c0631ebf137f376fbea00e8086be7.
> > I have not tried to understand what has changed in the kernel here but
> > since the commit message says "No functional change" I assume
> > something has gone wrong.
> >
> > The test expects to get a SIGTRAP when watchvar is written to in the
> > forked child, but instead the program just exits normally and we get a
> > status value corresponding to that (exit code 77 = wait status
> > 0x4d00). This test program should be usable outside of rr's test suite
> > if you replace the test_assert/atomic_puts functions with
> > assert/printf and replace the util.h include with appropriate standard
> > includes.
> >
> > This regression is present in 5.8.
>
> $ uname -a
> Linux ivb-ep 5.9.0-rc1-dirty #343 SMP PREEMPT Wed Aug 19 15:04:35 CEST 2020 
> x86_64 GNU/Linux
>
> $ ./ptrace_debug_regs
> FAILED: errno=0 (Success)
> ptrace_debug_regs: ptrace_debug_regs.c:104: main: Assertion `"FAILED: !" && 
> check_cond(status == ((5 << 8) | 0x7f))' failed.
> Aborted
>
> I'm guessing that is not the expected outcome; is that the same failure you
> saw?

Yes. Is status also 0x4d00 for you?

The program is expected to complete with no assertions firing.

- Kyle

> ---
> /* -*- Mode: C; tab-width: 8; c-basic-offset: 2; indent-tabs-mode: nil; -*- */
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> /**
>  * Print the printf-like arguments to stdout as atomic-ly as we can
>  * manage.  Async-signal-safe.  Does not flush stdio buffers (doing so
>  * isn't signal safe).
>  */
> __attribute__((format(printf, 1, 2))) inline static int atomic_printf(
> const char* fmt, ...) {
>   va_list args;
>   char buf[1024];
>   int len;
>
>   va_start(args, fmt);
>   len = vsnprintf(buf, sizeof(buf) - 1, fmt, args);
>   va_end(args);
>   return write(STDOUT_FILENO, buf, len);
> }
>
> /**
>  * Write |str| on its own line to stdout as atomic-ly as we can
>  * manage.  Async-signal-safe.  Does not flush stdio buffers (doing so
>  * isn't signal safe).
>  */
> inline static int atomic_puts(const char* str) {
>   return atomic_printf("%s\n", str);
> }
>
> inline static int check_cond(int cond) {
>   if (!cond) {
> atomic_printf("FAILED: errno=%d (%s)\n", errno, strerror(errno));
>   }
>   return cond;
> }
>
> #define test_assert(cond) assert("FAILED: !" && check_cond(cond))
>
> #define NEW_VALUE 0xabcdef
>
> static void breakpoint(void) {}
>
> static char watch_var;
>
> int main(void) {
>   pid_t child;
>   int status;
>   int pipe_fds[2];
>
>   test_assert(0 == pipe(pipe_fds));
>
>   if (0 == (child = fork())) {
> char ch;
> read(pipe_fds[0], &ch, 1);
> breakpoint();
> watch_var = 1;
> return 77;
>   }
>
>   test_assert(0 == ptrace(PTRACE_ATTACH, child, NULL, NULL));
>   test_assert(child == waitpid(child, &status, 0));
>   test_assert(status == ((SIGSTOP << 8) | 0x7f));
>   test_assert(1 == write(pipe_fds[1], "x", 1));
>
>   test_assert(0 == ptrace(PTRACE_POKEUSER, child,
>   (void*)offsetof(struct user, u_debugreg[0]),
>   (void*)breakpoint));
>   /* Enable DR0 break-on-exec */
>   test_assert(0 == ptrace(PTRACE_POKEUSER, child,
>   (void*)offsetof(struct user, u_debugreg[7]),
>   (void*)0x1));
>
>   test_assert(0 == ptrace(PTRACE_CONT, child, NULL, NULL));
>   test_assert(child == waitpid(child, &status, 0));
>   test_assert(status == ((SIGTRAP << 8) | 0x7f));
>   test_assert(0x1 == ptrace(PTRACE_PEEKUSER, child,
> (void*)offsetof(struct user, u_debugreg[6])));
>
>   test_assert(0 == ptrace(PTRACE_POKEUSER, child,
>   (void*)offsetof(struct user, u_debugreg[0]),
>   &watch_var));
>   /* Enable DR0 break-on-write */
>   test_assert(0 == ptrace(PTRACE_POKEUSER, child,
>   (void*)offsetof(struct user, u_debugreg[7]),
>   (void*)0x10001));
>
>   test_assert(0 == ptrace(PTRACE_CONT, child, NULL, NULL));
>   test_assert(child == waitpid(child, &status, 0));
>   test_assert(status == ((SIGTRAP << 8) | 0x7f));
>   test_assert(0x1 == ptrace(PTRACE_PEEKUSER, child,
> (void*)offsetof(struct user, u_debugreg[6])));
>
>   test_assert(0 == ptrace(PTRACE_DETACH, child, NULL, NULL));
>
>   test_assert(child == waitpid(child, &status, 0));
>   test_assert(WIFEXITED(status));
>   test_assert(WEXITSTATUS(status) == 77);
>
>   atomic_puts("EXIT-SUCCESS");
>   return 0;
> }


[REGRESSION 5.8] x86/entry: DR0 break-on-write not working

2020-08-19 Thread Kyle Huey
rr, a userspace record and replay debugger[0], has a test suite that
attempts to exercise strange corners of the Linux API. One such
test[1] began failing after 2bbc68f8373c0631ebf137f376fbea00e8086be7.
I have not tried to understand what has changed in the kernel here but
since the commit message says "No functional change" I assume
something has gone wrong.

The test expects to get a SIGTRAP when watchvar is written to in the
forked child, but instead the program just exits normally and we get a
status value corresponding to that (exit code 77 = wait status
0x4d00). This test program should be usable outside of rr's test suite
if you replace the test_assert/atomic_puts functions with
assert/printf and replace the util.h include with appropriate standard
includes.

This regression is present in 5.8.

- Kyle

[0] https://rr-project.org/
[1] 
https://github.com/mozilla/rr/blob/master/src/test/x86/ptrace_debug_regs.c#L55


[REGRESSION] x86/entry: Tracer no longer has opportunity to change the syscall number at entry via orig_ax

2020-08-19 Thread Kyle Huey
tl;dr: after 27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14 ptracer
modifications to orig_ax in a syscall entry trace stop are not honored
and this breaks our code.

rr, a userspace record and replay debugger[0], redirects syscalls of
its ptracee through an in-process LD_PRELOAD-injected solib. To do
this, it ptraces the tracee to a syscall entry event, and then, if the
syscall instruction is not our redirected syscall instruction, it
examines the tracee's code and pattern matches against a set of
syscall invocations that it knows how to rewrite. If that succeeds, rr
hijacks[1] the current syscall entry by setting orig_ax to something
innocuous like SYS_gettid, runs the hijacked syscall, and then
restores program state to before the syscall entry trace event and
allows the tracee to execute forwards, through the newly patched code
and into our injected solib.
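
(For concreteness, a stripped-down sketch of that hijack at a syscall-entry
stop -- assumptions: an x86-64 tracee already stopped via PTRACE_SYSCALL,
error handling omitted, and SYS_gettid chosen as the innocuous syscall:)

#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>

/* At a syscall-entry trace stop, replace the pending syscall.  Before
 * 27d6b4d14f5c the rewritten orig_rax was what actually got dispatched;
 * after it, the originally-trapped syscall number runs instead. */
static void hijack_syscall(pid_t tracee) {
  struct user_regs_struct regs;
  ptrace(PTRACE_GETREGS, tracee, NULL, &regs);
  regs.orig_rax = SYS_gettid;                  /* innocuous replacement */
  ptrace(PTRACE_SETREGS, tracee, NULL, &regs);
  ptrace(PTRACE_SYSCALL, tracee, NULL, NULL);  /* run the hijacked syscall */
}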

Before 27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14 modifications to
orig_ax were honored by x86's syscall_enter_trace[2]. The generic arch
code however does not honor any modifications to the syscall number[3]
(presumably because on most architectures syscall results clobber the
first argument and not the syscall number, so there is no equivalent
to orig_rax).

Note that the above is just one example of when rr changes the syscall
number this way. This is done in many places in our code and rr is
largely broken on 5.9-rc1 at the moment because of this bug.

- Kyle

[0] https://rr-project.org/
[1] 
https://github.com/mozilla/rr/blob/cd61ba22ccc05b426691312784674c0eb8e654ef/src/Task.cc#L872
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?h=v5.8=bcf876870b95592b52519ed4aafcf9d95999bc9c#n204
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/entry/common.c?h=v5.8=27d6b4d14f5c3ab21c4aef87dd04055a2d7adf14#n44


Re: [PATCH] kvm/x86/vmx: switch MSR_MISC_FEATURES_ENABLES between host and guest

2019-03-14 Thread Kyle Huey
On Thu, Mar 14, 2019 at 7:50 PM Xiaoyao Li  wrote:
>
> CPUID faulting is a feature of the CPUID instruction. When CPUID faulting is
> enabled, any execution of the CPUID instruction outside system-management
> mode (SMM) causes a general-protection fault (#GP) if the CPL > 0.
>
> About this feature, detailed information can be found at
> https://www.intel.com/content/dam/www/public/us/en/documents/application-notes/virtualization-technology-flexmigration-application-note.pdf
>
> The issue is that current KVM doesn't switch the value of
> MSR_MISC_FEATURES_ENABLES between host and guest. If MSR_MISC_FEATURES_ENABLES
> exists on the hardware CPU and the host enables CPUID faulting (setting bit 0
> of MSR_MISC_FEATURES_ENABLES), it affects the guest's behavior because
> CPUID faulting enabled by the host is passed through to the guest.

The host doesn't enable CPUID faulting and keep it enabled for
everything, though; it only enables it for specified user-space
processes (via arch_prctl). How does CPUID faulting "leak" into the
KVM guest?
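
(For reference, the per-task opt-in being described -- a minimal sketch
using the ARCH_SET_CPUID arch_prctl from asm/prctl.h; after this call a
CPUID executed at CPL 3 faults and the task gets SIGSEGV:)

#include <asm/prctl.h>      /* ARCH_SET_CPUID / ARCH_GET_CPUID */
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  /* Turn on CPUID faulting for this task only; the kernel switches the
   * MSR bit around context switches of this task, not system-wide. */
  return syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0L);
}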

- Kyle

> From my tests, when host enables cpuid faulting, it causes guest boot failure
> when guest uses *modprobe* to load modules. Below is the error log:
>
> [1.233556] traps: modprobe[71] general protection fault ip:7f0077f6495c 
> sp:7ffda148d808 error:0 in ld-2.17.so[7f0077f4d000+22000]
> [1.237780] traps: modprobe[73] general protection fault ip:7fad5aba095c 
> sp:7ffd36067378 error:0 in ld-2.17.so[7fad5ab89000+22000]
> [1.241930] traps: modprobe[75] general protection fault ip:7f3edb89495c 
> sp:7fffa1a81308 error:0 in ld-2.17.so[7f3edb87d000+22000]
> [1.245998] traps: modprobe[77] general protection fault ip:7f91d670895c 
> sp:7ffc25fa7f38 error:0 in ld-2.17.so[7f91d66f1000+22000]
> [1.250016] traps: modprobe[79] general protection fault ip:7f0ddbbdc95c 
> sp:7ffe9c34f8d8 error:0 in ld-2.17.so[7f0ddbbc5000+22000]
>
> *modprobe* executes the CPUID instruction, triggering CPUID faulting in the
> guest. In the end, because the guest cannot *modprobe* modules, the boot fails.
>
> This patch switches MSR_MISC_FEATURES_ENABLES between host and guest when
> hardware has this MSR.
>
> This patch doesn't conflict with commit db2336a80489 ("KVM: x86: virtualize
> cpuid faulting"), which provides a software emulation of CPUID faulting for
> the x86 arch. Below is an analysis of how CPUID faulting will work after
> applying this patch:
>
> 1. If host cpu is AMD. It doesn't have MSR_MISC_FEATURES_ENABLES, so we can 
> just
> use the software emulation of cpuid faulting.
>
> 2. If host cpu is Intel and it doesn't have MSR_MISC_FEATURES_ENABLES. The 
> same
> as case 1, we can just use the software emulation of cpuid faulting.
>
> 3. If the host CPU is Intel and it has MSR_MISC_FEATURES_ENABLES. With this patch,
> it will write the guest's value into MSR_MISC_FEATURES_ENABLES on VM entry.
> If the guest enables CPUID faulting and then executes the CPUID instruction with
> CPL > 0, it will cause a #GP exception in the guest instead of a VM exit because of
> CPUID, so it doesn't go through the KVM emulation path but uses the hardware
> feature. It is also a benefit that we needn't use a VM exit to inject #GP to
> emulate the CPUID faulting feature.
>
> Intel SDM vol3.25.1.1 specifies the priority between cpuid faulting
> and CPUID instruction.
>
> Signed-off-by: Xiaoyao Li 
> ---
>  arch/x86/kvm/vmx/vmx.c | 19 +++
>  1 file changed, 19 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 30a6bcd735ec..90707fae688e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6321,6 +6321,23 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx 
> *vmx)
> msrs[i].host, false);
>  }
>
> +static void atomic_switch_msr_misc_features_enables(struct kvm_vcpu *vcpu)
> +{
> +   u64 host_msr;
> +   struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +   /* if MSR MISC_FEATURES_ENABLES doesn't exist on the hardware, do 
> nothing*/
> +   if (rdmsrl_safe(MSR_MISC_FEATURES_ENABLES, &host_msr))
> +   return;
> +
> +   if (host_msr == vcpu->arch.msr_misc_features_enables)
> +   clear_atomic_switch_msr(vmx, MSR_MISC_FEATURES_ENABLES);
> +   else
> +   add_atomic_switch_msr(vmx, MSR_MISC_FEATURES_ENABLES,
> + vcpu->arch.msr_misc_features_enables,
> + host_msr, false);
> +}
> +
>  static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>  {
> vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
> @@ -6562,6 +6579,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>
> atomic_switch_perf_msrs(vmx);
>
> +   atomic_switch_msr_misc_features_enables(vcpu);
> +
> vmx_update_hv_timer(vcpu);
>
> /*
> --
> 2.19.1


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-27 Thread Kyle Huey
On Wed, Nov 21, 2018 at 12:14 AM Peter Zijlstra  wrote:
>
> On Tue, Nov 20, 2018 at 02:38:54PM -0800, Andi Kleen wrote:
> > > In fact, I'll argue FREEZE_ON_OVERFLOW is unfixably broken for
> > > independent counters, because while one counter overflows, we'll stall
> > > counting on all others until we've handled the PMI.
> > >
> > > Even though the PMI might not be for them and they very much want/need
> > > to continue counting.
> >
> > We stop all counters in any case for the PMI. With freeze-on-PMI it just
> > happens slightly earlier.
>
> Hiding the PMI is fine and good. The PMI is not the workload. Stopping
> it earlier is _NOT_ good, it hides your actual workload.

It does seem that FREEZE_PERFMON_ON_PMI (misnamed as it is) is of
rather limited use (or even negative, in our case) to a counter that's
already restricted to ring 3.

- Kyle


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
On Tue, Nov 20, 2018 at 1:18 PM Andi Kleen  wrote:
>
> > I suppose that's fair that it's better for some use cases. The flip
> > side is that it's no longer possible to get exactly accurate counts
> > from user space if you're using the PMI (because any events between
> > the overflow itself and the transition to the PMI handler are
> > permanently lost) which is catastrophically bad for us :)
>
> Yes that's a fair point. For most usages it doesn't matter.
>
> I suspect that's a case for supporting opt-out for freezing
> per perf event, and rr using that.

I don't see how you could easily opt-out on a per perf event basis. If
I'm reading the SDM correctly the Freeze_PerfMon_On_PMI setting is
global and affects all counters on that CPU. Even counters that don't
use the PMI at all will still be frozen if another counter overflows
and counter freezing is enabled. It would seem that a counter that
wants to use counter freezing and a counter that wants the behavior we
want would be mutually exclusive. I suppose the kernel could handle
all of that but it's a bit involved.
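
(For reference, the bit in question is IA32_DEBUGCTL[12]; IA32_DEBUGCTL is
MSR 0x1d9 and is per logical CPU, which is why the setting can't be scoped
to a single perf event. A rough way to peek at it -- assuming the msr
module is loaded and you have root -- is something like:)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  uint64_t debugctl = 0;
  int fd = open("/dev/cpu/0/msr", O_RDONLY);
  if (fd < 0 || pread(fd, &debugctl, sizeof(debugctl), 0x1d9) != sizeof(debugctl))
    return 1;
  /* Bit 12 is Freeze_PerfMon_On_PMI in the SDM's IA32_DEBUGCTL layout. */
  printf("Freeze_PerfMon_On_PMI: %s\n", (debugctl >> 12) & 1 ? "set" : "clear");
  return 0;
}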

- Kyle


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
On Tue, Nov 20, 2018 at 1:19 PM Stephane Eranian  wrote:
>
> On Tue, Nov 20, 2018 at 12:53 PM Kyle Huey  wrote:
> >
> > On Tue, Nov 20, 2018 at 12:11 PM Andi Kleen  wrote:
> > >
> > > > > > Given that we're already at rc3, and that this renders rr unusable,
> > > > > > we'd ask that counter freezing be disabled for the 4.20 release.
> > > > >
> > > > > The boot option should be good enough for the release?
> > > >
> > > > I'm not entirely sure what you mean here. We want you to flip the
> > > > default boot option so this feature is off for this release. i.e. rr
> > > > should work by default on 4.20 and people should have to opt into the
> > > > inaccurate behavior if they want faster PMI servicing.
> > >
> > > I don't think it's inaccurate, it's just different
> > > than what you are used to.
> > >
> > > For profiling including the kernel it's actually far more accurate
> > > because the count is stopped much earlier near the sampling
> > > point. Otherwise there is a considerable over count into
> > > the PMI handler.
> > >
> > > In your case you limit the count to ring 3 so it's always cut off
> > > at the transition point into the kernel, while with freezing
> > > it's at the overflow point.
> >
> > I suppose that's fair that it's better for some use cases. The flip
> > side is that it's no longer possible to get exactly accurate counts
> > from user space if you're using the PMI (because any events between
> > the overflow itself and the transition to the PMI handler are
> > permanently lost) which is catastrophically bad for us :)
> >
> Let me make sure I got this right. During recording, you count on
> retired-cond-branch
> and you record the value of the PMU counter at specific locations,
> e.g., syscalls.
> During replay, you program the branch-conditional-retired to overflow
> on interrupt at
> each recorded values. So if you sampled the event at 1,000,000 and
> then at 1,500,000.
> Then you program the event with a period of 1,000,000 first, on
> overflow the counter interrupts
> and you get a signal. Then, you reprogram the event for a new period
> of 500,000.  During recording
> and replay the event is limited to ring 3 (user level). Am I
> understanding this right?

This is largely correct, except that we only program the interrupt for
events that we would not naturally stop at during the course of
execution such as asynchronous signals or context switch points. At
events that we would naturally stop at (i.e. we can stop at syscalls
via ptrace) we simply check that the counters match to find any
discrepancies faster, before they affect an async signal delivery.
Let's say I have the following event sequence:

1. alarm syscall at rbc=1000
2. SIGALARM delivery at rbc=8000
3. exit syscall at rbc=9000

During replay, we begin the program and run to the syscall via a
PTRACE_SYSCALL ptrace. When the replayed process stops, we check that
the value of the rbc counter is 1000 (we also check that all registers
match what we recorded) and then we emulate the effects of the syscall
on the replayed process's registers and memory.

Then we see that the next event is an asynchronous signal, and we
program the rbc counter to interrupt after an additional (8000 - 1000
- SKID_SIZE) events (where SKID_SIZE has been chosen by
experimentation to ensure that the PMU interrupt is not delivered
*after* the point in the program we care about. For Skylake this value
is 100). We then resume the program with a PTRACE_CONT ptrace and wait
for the PMI to stop the replayed tracee. We advance the program to the
exact point that we care about through a combination of breakpoints
and singlestepping, and then deliver the SIGALARM.

Once that is done, we see that the next event is the exit syscall, and
we again do a PTRACE_SYSCALL ptrace to get to it. Once there we check
the rbc counter value and registers match what were recorded, and
perform the syscall.

Our counters are always restricted to ring 3 in both recording and replay.
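
(A bare-bones sketch of how that interrupt gets programmed with
perf_event_open -- the raw event code for retired conditional branches is
model-specific, so the config value here is only a placeholder, and rr's
real setup additionally routes the overflow to a signal on the tracee:)

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Count retired conditional branches in ring 3 only, overflowing (and
 * raising a PMI) after `period` more events, e.g. 8000 - 1000 - SKID_SIZE. */
static int program_rcb_interrupt(pid_t tracee, unsigned long long period) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x5101c4;        /* placeholder raw code, CPU-model specific */
  attr.sample_period = period;
  attr.exclude_kernel = 1;       /* ring 3 only, as described above */
  attr.exclude_hv = 1;
  return syscall(SYS_perf_event_open, &attr, tracee, -1, -1, 0);
}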

- Kyle


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
On Tue, Nov 20, 2018 at 12:11 PM Andi Kleen  wrote:
>
> > > > Given that we're already at rc3, and that this renders rr unusable,
> > > > we'd ask that counter freezing be disabled for the 4.20 release.
> > >
> > > The boot option should be good enough for the release?
> >
> > I'm not entirely sure what you mean here. We want you to flip the
> > default boot option so this feature is off for this release. i.e. rr
> > should work by default on 4.20 and people should have to opt into the
> > inaccurate behavior if they want faster PMI servicing.
>
> I don't think it's inaccurate, it's just different
> than what you are used to.
>
> For profiling including the kernel it's actually far more accurate
> because the count is stopped much earlier near the sampling
> point. Otherwise there is a considerable over count into
> the PMI handler.
>
> In your case you limit the count to ring 3 so it's always cut off
> at the transition point into the kernel, while with freezing
> it's at the overflow point.

I suppose that's fair that it's better for some use cases. The flip
side is that it's no longer possible to get exactly accurate counts
from user space if you're using the PMI (because any events between
the overflow itself and the transition to the PMI handler are
permanently lost) which is catastrophically bad for us :)

- Kyle


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
On Tue, Nov 20, 2018 at 11:41 AM Andi Kleen  wrote:
>
> > rr, a userspace record and replay debugger[0], uses the PMU interrupt
> > (PMI) to stop a program during replay to inject asynchronous events
> > such as signals. With perf counter freezing enabled we are reliably
> > seeing perf event overcounts during replay. This behavior is easily
>
> It's hard to see how it could over count since the PMU freezes
> earlier than the PMI with freezing. So it should count less.
> Did you mean under count?

Yes, I did mean under count, see my last email.

> > Given that we're already at rc3, and that this renders rr unusable,
> > we'd ask that counter freezing be disabled for the 4.20 release.
>
> The boot option should be good enough for the release?

I'm not entirely sure what you mean here. We want you to flip the
default boot option so this feature is off for this release. i.e. rr
should work by default on 4.20 and people should have to opt into the
inaccurate behavior if they want faster PMI servicing.

> A reasonable future option would be to expose an option to disable it in
> the perf_event. Then rr could set it.

- Kyle


Re: [REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
On Tue, Nov 20, 2018 at 10:16 AM Stephane Eranian  wrote:
> I would like to understand better the PMU behavior you are relying upon and
> why the V4 freeze approach is breaking it. Could you elaborate?

I investigated a bit more to write this response and discovered that
my initial characterization of the problem as an overcount during
replay is incorrect; what we are actually seeing is an undercount
during recording.

rr relies on the userspace retired-conditional-branches counter being
exactly the same between recording and replay. The primary reason we
do this is to establish a program "timeline", allowing us to find the
correct place to inject asynchronous signals during replay (the
program counter plus the retired-conditional-branches counter value
uniquely identifies a point in most programs). Because we run the rcb
counter during recording, we piggy back on it by programming it to
interrupt the program every few hundred thousand branches to give us a
chance to context switch to a different program thread.
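
(At each tracee stop the recorded value is just a read() of the counter fd
-- a sketch, assuming `fd` came from a counting perf_event_open of the rcb
event with the default read_format:)

#include <stdint.h>
#include <unistd.h>

static uint64_t read_rcb(int fd) {
  uint64_t count = 0;
  read(fd, &count, sizeof(count));  /* raw count for recording/comparison */
  return count;
}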

We've found that with counter freezing enabled, when the PMI fires,
the reported value of the retired conditional branches counter is low
by something on the order of 10 branches. In a single threaded
program, although the PMI fires, we don't actually record a context
switch or the counter value at this point. We continue on to the next
tracee event (e.g. a syscall) and record the counter value at that
point. Then, during replay, we replay to the syscall and check that
the replay counter value matches the recorded value and find that it
is too high. (NB: during a single threaded replay the PMI is not used
here because there is no asynchronous event.) Repeatedly recording the
same program produces traces that have different recorded
retired-conditional-branch counter values after the first PMI fired
during recording, but during replay we always count off the same
number of branches, further suggesting that the replay value is
correct. And finally, recordings made on a kernel with counter
freezing active still fail to replay on a kernel without counter
freezing active.

I don't know what the underlying mechanism for the loss of counter
events is (e.g. whether it's incorrect code in the interrupt handler,
a silicon bug, or what) but it's clear that the counter freezing
implementation is causing events to be lost.

- Kyle


[REGRESSION] x86, perf: counter freezing breaks rr

2018-11-20 Thread Kyle Huey
tl;dr: rr is currently broken on 4.20rc2, which I bisected to
af3bdb991a5cb57c189d34aadbd3aa88995e0d9f. I further confirmed that
booting the 4.20rc2 kernel with `disable_counter_freezing=true` allows
rr to work.

rr, a userspace record and replay debugger[0], uses the PMU interrupt
(PMI) to stop a program during replay to inject asynchronous events
such as signals. With perf counter freezing enabled we are reliably
seeing perf event overcounts during replay. This behavior is easily
demonstrated by attempting to record and replay the `alarm` test from
rr's test suite. Through bisection I determined that [1] is the first
bad commit, and further testing showed that booting the kernel with
`disable_counter_freezing=true` fixes rr.

This behavior has been observed on two different CPUs (a Core i7-6700K
and a Xeon E3-1505M v5). We have no reason to believe it is limited to
specific CPU models, this information is included only for
completeness.

Given that we're already at rc3, and that this renders rr unusable,
we'd ask that counter freezing be disabled for the 4.20 release.

Thanks,

- Kyle

[0] https://rr-project.org/
[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=af3bdb991a5cb57c189d34aadbd3aa88995e0d9f


Re: [PATCH v1 0/2] perf: Drop leaked kernel samples

2018-06-15 Thread Kyle Huey
On Fri, Jun 15, 2018 at 5:50 PM, Jin, Yao  wrote:
> Hi All,
>
> This patch raised many questions, I was prepared. :)
>
> I'd like to try another proposal: add a special flag in the returned
> perf_sample_data to indicate to the perf binary that this sample is a leaked
> sample.
>
> For example, create a new PERF_SAMPLE_RETURN_LEAKAGE in
> perf_event_sample_format.
>
> In perf_prepare_sample(),
>
> if (event->attr.exclude_kernel && !user_mode(regs))
> data->type |= PERF_SAMPLE_RETURN_LEAKAGE;
>
> Now all the samples are kept and the leaked kernel samples are tagged with
> PERF_SAMPLE_RETURN_LEAKAGE.
>
> In perf binary, it filters out the samples with PERF_SAMPLE_RETURN_LEAKAGE.
> It needs perf binary modification but rr doesn't need to be changed.
>
> I don't 0-stuff some fields because:
>
> 1. Keeping the skid info should allow us to look at it if we find it
> interesting later.
>
> 2. Not sure if 0-stuffing some fields has potential conflicts with some
> applications.
>
> Is this proposal reasonable?
>
> Thanks
> Jin Yao
>
>
> On 6/16/2018 1:34 AM, Robert O'Callahan wrote:
>>
>> On Fri, Jun 15, 2018 at 10:16 AM, Kyle Huey  wrote:
>>>
>>>
>>> If you want a sysctl for your own reasons that's fine. But we don't
>>> want a sysctl. We want to work without any further configuration.
>>
>>
>> Also toggling a sysctl would require root privileges, but rr does not
>> currently need to run as root. Thus rr users would have to either
>> permanently change their system configuration (and every extra
>> configuration step is a pain), or run rr as root so rr can toggle the
>> sysctl itself. Running rr as root is highly undesirable.
>>
>> Rob
>>
>

If the problem you're trying to fix is an inappropriate disclosure of
kernel-space information to user-space, how does filtering the samples
in user space solve anything?

- Kyle


Re: [PATCH v1 0/2] perf: Drop leaked kernel samples

2018-06-15 Thread Kyle Huey
On Thu, Jun 14, 2018 at 10:11 PM, Jin, Yao  wrote:
>
>
> On 6/15/2018 11:35 AM, Kyle Huey wrote:
>>
>> I strongly object to this patch as written. As I said when I
>> originally reported[0] the regression introduced by the previous
>> version of this patch a year ago.
>>
>> "It seems like this change should, at a bare minimum, be limited to
>> counters that actually perform sampling of register state when the
>> interrupt fires.  In our case, with the retired conditional branches
>> counter restricted to counting userspace events only, it makes no
>> difference that the PMU interrupt happened to be delivered in the
>> kernel."
>>
>> This means identifying which values of `perf_event_attr::sample_type`
>> are security concerns (presumably PERF_SAMPLE_IP is, and
>> PERF_SAMPLE_TIME is not, and someone needs to go through and decide on
>> all of them) and filtering on those values for this new behavior.
>>
>> And because rr sets its sample_type to 0, once you do that, the sysctl
>> will not be necessary.
>>
>> - Kyle
>>
>
> Since rr sets sample_type to 0, the easiest way is to add checking like:
>
> if (event->attr.sample_type) {
> if (event->attr.exclude_kernel && !user_mode(regs))
> return false;
> }
>
> So the rr doesn't need to be changed and for other use cases the leaked
> kernel samples will be dropped.
>
> But I don't like this because:
>
> 1. It's too specific to the rr case.

Keeping existing software working is the first rule of kernel development!

There is no disclosure of kernel space state in the way rr uses this
API, so there is no reason that this API should not keep working.

> 2. If we create a new sample_type, e.g. PERF_SAMPLE_ALLOW_LEAKAGE, the code
> will be:
>
> if !(event->attr.sample_type & PERF_SAMPLE_ALLOW_LEAKAGE) {
> if (event->attr.exclude_kernel && !user_mode(regs))
> return false;
> }
>
> But rr needs to add PERF_SAMPLE_ALLOW_LEAKAGE to sample_type since by
> default the bit is not set.

There's no reason to add a new PERF_SAMPLE flag. You need to audit the
*existing* PERF_SAMPLE flags and figure out which ones are problems,
and then do

if (event->attr.exclude_kernel && !user_mode(regs) &&
sampling_discloses_kernel_information(event->attr.sample_type)) {
return false;
}

> 3. Sysctl is a more flexible way. It provides us with an option: when we want
> to see if skid exists, we can use sysctl to turn that on.

If you want a sysctl for your own reasons that's fine. But we don't
want a sysctl. We want to work without any further configuration.

> Thanks
> Jin Yao
>

- Kyle


Re: [PATCH v1 0/2] perf: Drop leaked kernel samples

2018-06-14 Thread Kyle Huey
I strongly object to this patch as written. As I said when I
originally reported[0] the regression introduced by the previous
version of this patch a year ago.

"It seems like this change should, at a bare minimum, be limited to
counters that actually perform sampling of register state when the
interrupt fires.  In our case, with the retired conditional branches
counter restricted to counting userspace events only, it makes no
difference that the PMU interrupt happened to be delivered in the
kernel."

This means identifying which values of `perf_event_attr::sample_type`
are security concerns (presumably PERF_SAMPLE_IP is, and
PERF_SAMPLE_TIME is not, and someone needs to go through and decide on
all of them) and filtering on those values for this new behavior.
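
(A sketch of the sort of filter that implies -- the helper name and the
exact set of PERF_SAMPLE_* bits are illustrative; deciding on that set is
precisely the audit being asked for here:)

#include <linux/perf_event.h>
#include <stdbool.h>

/* Illustrative only: sample_type bits that can expose kernel state.
 * A sample_type of 0, as rr uses, would pass and keep working. */
static bool sampling_discloses_kernel_information(__u64 sample_type) {
  const __u64 disclosing = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN |
                           PERF_SAMPLE_ADDR | PERF_SAMPLE_BRANCH_STACK |
                           PERF_SAMPLE_REGS_INTR;
  return (sample_type & disclosing) != 0;
}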

And because rr sets its sample_type to 0, once you do that, the sysctl
will not be necessary.

- Kyle

On Fri, Jun 15, 2018 at 3:03 AM, Jin Yao  wrote:
> On workloads that do a lot of kernel entry/exits we see kernel
> samples, even though :u is specified. This is due to skid existing.
>
> This might be a security issue because it can leak kernel addresses even
> though kernel sampling support is disabled.
>
> One patch "perf/core: Drop kernel samples even though :u is specified"
> was posted in last year but it was reverted because it introduced a
> regression issue that broke the rr-project.
>
> Now this patch set uses sysctl to control the dropping of leaked
> kernel samples.
>
> /sys/devices/cpu/perf_allow_sample_leakage:
>
> 0 - default, drop the leaked kernel samples.
> 1 - don't drop the leaked kernel samples.
>
> For rr it can write 1 to /sys/devices/cpu/perf_allow_sample_leakage to
> keep original system behavior.
>
> Jin Yao (2):
>   perf/core: Use sysctl to turn on/off dropping leaked kernel samples
>   perf Documentation: Introduce the sysctl perf_allow_sample_leakage
>
>  kernel/events/core.c | 58 
> 
>  tools/perf/Documentation/perf-record.txt | 14 
>  2 files changed, 72 insertions(+)
>
> --
> 2.7.4
>

[0] https://lkml.org/lkml/2017/6/27/1159


Re: [PATCH 0/3] fix SIGNAL_UNKILLABLE && SIGKILL interaction

2017-11-13 Thread Kyle Huey
On Mon, Nov 13, 2017 at 6:53 AM, Oleg Nesterov  wrote:
> ping...
>
> Dmitry confirms this actually fixes the problem reported by syzkaller
> we discussed in another thread.
>
>
> On 11/03, Oleg Nesterov wrote:
>>
>> On 11/02, Oleg Nesterov wrote:
>> >
>> > I need to write the changelog, and perhaps even split this small patch for
>> > better documentation.
>>
>> OK, it is not clear if I answered Eric's concerns or not, let me send the
>> patches for review anyway. I tried to document every change in signal.c.
>>
>> Oleg.
>>
>>  kernel/signal.c | 18 ++
>>  1 file changed, 10 insertions(+), 8 deletions(-)
>

I can confirm that your patches fix the issue I reported as well.

- Kyle


Re: rseq event_counter field

2017-10-30 Thread Kyle Huey
On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski  wrote:
> Answering both emails here.
>
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
>
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter.  If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
>
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
>
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code.  With event_counter, you can do roughly this:
>
> struct rseq_state rseq = begin();
>
> int cpu = rseq_getcpu();
> Calculate some stuff.  Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
>
> commit(, some pointer you found, some value to write, some
> double-checks to check);
>
> With a real database, this style of programming works well.  You
> probably have a progress guarantee, and the database gives you the
> semantics you want.  With rseq, you don't get any of this.  If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress.  You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
>
> Without event_counter, you can use a different abstraction.  In your
> percpu data, you have something like:
>
> struct rseq_lock lock;
> struct freelist_entry *head;
>
> Then you can do:
>
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
>
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
>
> commit(, cpu, some pointer you found, some value to write, some
> double-checks to check);
>
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant.  But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code.  And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
>
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well.  The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
>
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number.  On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
>  If this is done, then you still have to do:
>
> __this_thread_rseq->rseq_cs = _rseq_cs_here;
>
> but that's a *store* to the rseq cacheline.  It'll go in the store
> buffer and, in the fast case, nothing ever depends on it.  Hence it's
> unlikely to stall.  Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
>
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't involved.
>
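
A compile-only sketch of the lock-style per-cpu push contrasted above: the
rseq_* primitives are the hypothetical ones from the quoted mail (this API
never landed upstream in this form) and are declared as opaque stubs purely
so the control flow is concrete.

struct rseq_state { int dummy; };               /* stub, opaque in the mail   */
struct rseq_lock  { int dummy; };               /* stub, guards per-cpu data  */

extern int rseq_getcpu(void);
extern struct rseq_state rseq_acquire(struct rseq_lock *lock);
extern int rseq_commit_ptr(struct rseq_state rs, void **slot, void *newval);

struct freelist_entry { struct freelist_entry *next; };
struct percpu_list    { struct rseq_lock lock; struct freelist_entry *head; };

/* Push onto the current cpu's freelist.  Arbitrary (even slow) C code is
 * fine between rseq_acquire() and the short asm commit; only data guarded
 * by `lock` may be touched.  On abort, simply retry on the (new) cpu.      */
static int percpu_push(struct percpu_list *lists, struct freelist_entry *e)
{
        for (;;) {
                int cpu = rseq_getcpu();
                struct percpu_list *l = &lists[cpu];
                struct rseq_state rs = rseq_acquire(&l->lock);

                e->next = l->head;
                if (rseq_commit_ptr(rs, (void **)&l->head, e) == 0)
                        return cpu;
        }
}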

Re: [PATCH v16 09/10] x86/arch_prctl: Selftest for ARCH_[GET|SET]_CPUID

2017-10-10 Thread Kyle Huey
On Tue, Oct 10, 2017 at 8:35 PM, Wanpeng Li <kernel...@gmail.com> wrote:
> Hi Kyle,
> 2017-03-20 16:16 GMT+08:00 Kyle Huey <m...@kylehuey.com>:
>> Test disabling and reenabling the cpuid instruction via the new arch_prctl
>> ARCH_SET_CPUID, retrieving the current state via ARCH_GET_CPUID, and the
>> expected behaviors across fork() and exec().
>>
>> Signed-off-by: Kyle Huey <kh...@kylehuey.com>
>> ---
>>  tools/testing/selftests/x86/Makefile  |   2 +-
>>  tools/testing/selftests/x86/cpuid_fault.c | 251 
>> ++
>
> I'm not sure why this commit was not merged upstream. I tested
> 4.14-rc3 w/ this testcase on a haswell client; however, I encounter the
> splat below. Any idea?

Thanks for pointing out that this never got merged.  That's quite
disappointing, especially after reviewers insisted I write this test.

The failure you're seeing is because the values of ARCH_GET_CPUID and
ARCH_SET_CPUID changed and the values hardcoded in the test are no
longer accurate.  If you set them to the correct values (0x1011 and
0x1012 respectively) the test should pass.

- Kyle

> # ./cpuid_fault_64
> cpuid() == {d, 756e6547, 6c65746e, 49656e69}
> arch_prctl(ARCH_GET_CPUID); ARCH_GET_CPUID is unsupported on this kernel.
>
> Regards,
> Wanpeng Li
>
>>  2 files changed, 252 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/x86/cpuid_fault.c
>>
>> diff --git a/tools/testing/selftests/x86/Makefile 
>> b/tools/testing/selftests/x86/Makefile
>> index 38e0a9ca5d71..acda4e5fcf25 100644
>> --- a/tools/testing/selftests/x86/Makefile
>> +++ b/tools/testing/selftests/x86/Makefile
>> @@ -6,7 +6,7 @@ include ../lib.mk
>>
>>  TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
>> ptrace_syscall test_mremap_vdso \
>> check_initial_reg_state sigreturn ldt_gdt iopl 
>> mpx-mini-test ioperm \
>> -   protection_keys test_vdso
>> +   protection_keys test_vdso cpuid_fault
>>  TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso 
>> unwind_vdso \
>> test_FCMOV test_FCOMI test_FISTTP \
>> vdso_restorer
>> diff --git a/tools/testing/selftests/x86/cpuid_fault.c 
>> b/tools/testing/selftests/x86/cpuid_fault.c
>> new file mode 100644
>> index ..e3b93c28c655
>> --- /dev/null
>> +++ b/tools/testing/selftests/x86/cpuid_fault.c
>> @@ -0,0 +1,251 @@
>> +
>> +/*
>> + * Tests for arch_prctl(ARCH_GET_CPUID, ...) / arch_prctl(ARCH_SET_CPUID, 
>> ...)
>> + *
>> + * Basic test to test behaviour of ARCH_GET_CPUID and ARCH_SET_CPUID
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +
>> +/*
>> +#define ARCH_GET_CPUID 0x1005
>> +#define ARCH_SET_CPUID 0x1006
>> +#ifdef __x86_64__
>> +#define SYS_arch_prctl 158
>> +#else
>> +#define SYS_arch_prctl 384
>> +#endif
>> +*/
>> +
>> +const char *cpuid_names[] = {
>> +   [0] = "[cpuid disabled]",
>> +   [1] = "[cpuid enabled]",
>> +};
>> +
>> +int arch_prctl(int option, unsigned long arg2)
>> +{
>> +   return syscall(SYS_arch_prctl, option, arg2);
>> +}
>> +
>> +int cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx,
>> + unsigned int *edx)
>> +{
>> +   return __get_cpuid(0, eax, ebx, ecx, edx);
>> +}
>> +
>> +int do_child_exec_test(int eax, int ebx, int ecx, int edx)
>> +{
>> +   int cpuid_val = 0, child = 0, status = 0;
>> +
>> +   printf("arch_prctl(ARCH_GET_CPUID); ");
>> +
>> +   cpuid_val = arch_prctl(ARCH_GET_CPUID, 0);
>> +   if (cpuid_val < 0)
>> +   errx(1, "ARCH_GET_CPUID fails now, but not before?");
>> +
>> +   printf("cpuid_val == %s\n", cpuid_names[cpuid_val]);
>> +   if (cpuid_val != 0)
>> +   errx(1, "How did cpuid get re-enabled on fork?");
>> +
>> +   child = fork();
>> +   if (child == 0) {
>> +   cpuid_val = arch_prctl(ARCH_GET_CPUID, 0);
>> +   if (cpuid_val < 0)
>> +   errx(1, "ARCH_GET_CPUID fails now, but not before?");
>> +
>> +   printf("cpuid_val == %

Re: [PATCH] mm: Fix typo in VM_MPX definition

2017-09-26 Thread Kyle Huey
Could we get this patch into the next 4.14 rc too?  This "typo" causes a bunch
of sections in /proc/N/maps to be incorrectly labelled [mpx] which confuses rr.
We could probably work around it if we had to but doing this right is trivial.

Thanks,

- Kyle


Regression related to ipc shmctl compat

2017-09-25 Thread Kyle Huey
Beginning with 553f770ef71b, the following program fails when compiled
for 32 bit and executed on a 64 bit kernel and succeeds when compiled
for 64 bit and executed on a 64 bit kernel.  It continues to fail even after
58aff0af7573.  When compiled as 32 bit, an shmctl call fails with
EBADR (see the XXX comment).

The test program is adapted from rr's shm test[0].

#define _GNU_SOURCE 1

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static uint64_t GUARD_VALUE = 0xdeadbeeff00dbaad;

inline static size_t ceil_page_size(size_t size) {
  size_t page_size = sysconf(_SC_PAGESIZE);
  return (size + page_size - 1) & ~(page_size - 1);
}

/**
 * Allocate 'size' bytes, fill with 'value', place canary value before
 * the allocated block, and put guard pages before and after. Ensure
 * there's a guard page immediately after `size`.
 * This lets us catch cases where too much data is being recorded --- which can
 * cause errors if the recorder tries to read invalid memory.
 */
inline static void* allocate_guard(size_t size, char value) {
  size_t page_size = sysconf(_SC_PAGESIZE);
  size_t map_size = ceil_page_size(size + sizeof(GUARD_VALUE)) + 2 * page_size;
  char* cp = (char*)mmap(NULL, map_size, PROT_READ | PROT_WRITE,
 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  assert(cp != MAP_FAILED);
  /* create guard pages */
  assert(munmap(cp, page_size) == 0);
  assert(munmap(cp + map_size - page_size, page_size) == 0);
  cp = cp + map_size - page_size - size;
  memcpy(cp - sizeof(GUARD_VALUE), &GUARD_VALUE, sizeof(GUARD_VALUE));
  memset(cp, value, size);
  return cp;
}

/**
 * Verify that canary value before the block allocated at 'p'
 * (of size 'size') is still valid.
 */
inline static void verify_guard(__attribute__((unused)) size_t size, void* p) {
  char* cp = (char*)p;
  assert(memcmp(cp - sizeof(GUARD_VALUE), &GUARD_VALUE,
sizeof(GUARD_VALUE)) == 0);
}

/**
 * Verify that canary value before the block allocated at 'p'
 * (of size 'size') is still valid, and free the block.
 */
inline static void free_guard(size_t size, void* p) {
  verify_guard(size, p);
  size_t page_size = sysconf(_SC_PAGESIZE);
  size_t map_size = ceil_page_size(size + sizeof(GUARD_VALUE)) + 2 * page_size;
  char* cp = (char*)p + size + page_size - map_size;
  assert(0 == munmap(cp, map_size - 2 * page_size));
}

#define ALLOCATE_GUARD(p, v) p = allocate_guard(sizeof(*p), v)
#define VERIFY_GUARD(p) verify_guard(sizeof(*p), p)
#define FREE_GUARD(p) free_guard(sizeof(*p), p)

/* Make SIZE not a multiple of the page size, to ensure we handle that case.
   But make sure it's even, since we divide it by two. */
#define SIZE ((int)(16 * page_size) - 10)

static int shmid;

static void before_writing(void) {}
static void after_writing(void) {}

static int run_child(void) {
  int i;
  char* p;
  char* p2;
  pid_t child2;
  int status;
  struct shmid_ds* ds;
  struct shminfo* info;
  struct shm_info* info2;

  size_t page_size = sysconf(_SC_PAGESIZE);

  ALLOCATE_GUARD(ds, 'd');
  assert(0 == shmctl(shmid, IPC_STAT, ds));
  VERIFY_GUARD(ds);
  assert((int)ds->shm_segsz == SIZE);
  assert(ds->shm_cpid == getppid());
  assert(ds->shm_nattch == 0);

  ds->shm_perm.mode = 0660;
  assert(0 == shmctl(shmid, IPC_SET, ds));

  ALLOCATE_GUARD(info, 'i');
  assert(0 <= shmctl(shmid, IPC_INFO, (struct shmid_ds*)info));
  VERIFY_GUARD(info);
  assert(info->shmmin == 1);

  ALLOCATE_GUARD(info2, 'j');
  // XXX: This shmctl call fails with EBADR when compiled with -m32
  assert(0 <= shmctl(shmid, SHM_INFO, (struct shmid_ds*)info2));
  VERIFY_GUARD(info2);
  assert(info2->used_ids > 0);
  assert(info2->used_ids < 100);

  p = shmat(shmid, NULL, 0);
  assert(p != (char*)-1);

  before_writing();

  for (i = 0; i < SIZE; ++i) {
assert(p[i] == 0);
  }
  memset(p, 'r', SIZE / 2);

  after_writing();

  p2 = shmat(shmid, NULL, 0);
  assert(p2 != (char*)-1);
  memset(p + SIZE / 2, 'r', SIZE / 2);
  assert(0 == shmdt(p));
  assert(0 == shmdt(p2));

  assert(p == mmap(p, SIZE, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  assert(p[0] == 0);

  p = shmat(shmid, p, SHM_REMAP);
  assert(p != (char*)-1);
  for (i = 0; i < SIZE; ++i) {
assert(p[i] == 'r');
  }

  if ((child2 = fork()) == 0) {
memset(p, 's', SIZE);
return 0;
  }
  assert(child2 == waitpid(child2, &status, __WALL));
  assert(0 == status);
  for (i = 0; i < SIZE; ++i) {
assert(p[i] == 's');
  }

  return 0;
}

int main(void) {
  pid_t child;
  int status;

  size_t page_size = sysconf(_SC_PAGESIZE);

  shmid = shmget(IPC_PRIVATE, SIZE, 0666);
  assert(shmid >= 0);

  if ((child = fork()) == 0) {
return run_child();
  }

  printf("child %d\n", child);

  assert(child == waitpid(child, &status, __WALL));
  /* delete the shm before testing status, because we want to ensure the
 segment is deleted even if the test failed. */
  assert(0 == shmctl(shmid, IPC_RMID, NULL));
  assert(status == 0);

  printf("EXIT-SUCCESS\n");

  return 0;
}

- Kyle

[0] 
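
A much smaller sketch that isolates just the failing call, distilled from the
program above; built with -m32 on an affected 64-bit kernel the SHM_INFO call
reports EBADR, while a native 64-bit build prints the id count:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
        struct shm_info info;
        int ret;
        int id = shmget(IPC_PRIVATE, 4096, 0666);

        if (id < 0) {
                perror("shmget");
                return 1;
        }

        /* SHM_INFO fills a struct shm_info despite the shmid_ds-typed argument. */
        ret = shmctl(id, SHM_INFO, (struct shmid_ds*)&info);
        if (ret < 0)
                perror("shmctl(SHM_INFO)");      /* EBADR via the broken compat path */
        else
                printf("used_ids = %d\n", (int)info.used_ids);

        shmctl(id, IPC_RMID, NULL);
        return ret < 0;
}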


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-07-17 Thread Kyle Huey
On Tue, Jul 11, 2017 at 8:32 AM, Kyle Huey <m...@kylehuey.com> wrote:
> On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mi...@kernel.org> wrote:
>>
>> * Kyle Huey <m...@kylehuey.com> wrote:
>>
>>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <rob...@ocallahan.org> 
>>> wrote:
>>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>>> >> Should any of those be moved into the "should be dropped" pile?
>>> >
>>> > Why not be conservative and clear every sample you're not sure about?
>>> >
>>> > We'd appreciate a fix sooner rather than later here, since rr is
>>> > currently broken on every stable Linux kernel and our attempts to
>>> > implement a workaround have failed.
>>> >
>>> > (We have separate "interrupt" and "measure" counters, and I thought we
>>> > might work around this regression by programming the "interrupt"
>>> > counter to count kernel events as well as user events (interrupting
>>> > early is OK), but that caused our (completely separate) "measure"
>>> > counter to report off-by-one results (!), which seems to be a
>>> > different bug present on a range of older kernels.)
>>>
>>> This seems to have stalled out here unfortunately.
>>>
>>> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,
>>> alternatively, can we move the patch at the top of this thread forward
>>> on the stable branches until we do reach an answer to that question?
>>>
>>> We've abandoned hope of working around this problem in rr and are
>>> currently broken for all of our users with an up-to-date kernel, so
>>> the situation for us is rather dire at the moment I'm afraid.
>>
>> Sorry about that - I've queued up a revert for the original commit and will 
>> send
>> the fix to Linus later today. I've added a -stable tag as well so it can be
>> forwarded to Greg the moment it hits upstream.
>
> Great, thank you.
>
> - Kyle

Hi again,

I saw that the revert landed on perf/urgent but it doesn't look like
that got pulled by Linus in time for 4.13-rc1.  Consider this a gentle
poke to please get this merged :)

- Kyle


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-07-11 Thread Kyle Huey
On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Kyle Huey <m...@kylehuey.com> wrote:
>
>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <rob...@ocallahan.org> 
>> wrote:
>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> >> Should any of those be moved into the "should be dropped" pile?
>> >
>> > Why not be conservative and clear every sample you're not sure about?
>> >
>> > We'd appreciate a fix sooner rather than later here, since rr is
>> > currently broken on every stable Linux kernel and our attempts to
>> > implement a workaround have failed.
>> >
>> > (We have separate "interrupt" and "measure" counters, and I thought we
>> > might work around this regression by programming the "interrupt"
>> > counter to count kernel events as well as user events (interrupting
>> > early is OK), but that caused our (completely separate) "measure"
>> > counter to report off-by-one results (!), which seems to be a
>> > different bug present on a range of older kernels.)
>>
>> This seems to have stalled out here unfortunately.
>>
>> Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,
>> alternatively, can we move the patch at the top of this thread forward
>> on the stable branches until we do reach an answer to that question?
>>
>> We've abandoned hope of working around this problem in rr and are
>> currently broken for all of our users with an up-to-date kernel, so
>> the situation for us is rather dire at the moment I'm afraid.
>
> Sorry about that - I've queued up a revert for the original commit and will 
> send
> the fix to Linus later today. I've added a -stable tag as well so it can be
> forwarded to Greg the moment it hits upstream.

Great, thank you.

- Kyle


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-07-10 Thread Kyle Huey
On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan  wrote:
> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland  wrote:
>> Should any of those be moved into the "should be dropped" pile?
>
> Why not be conservative and clear every sample you're not sure about?
>
> We'd appreciate a fix sooner rather than later here, since rr is
> currently broken on every stable Linux kernel and our attempts to
> implement a workaround have failed.
>
> (We have separate "interrupt" and "measure" counters, and I thought we
> might work around this regression by programming the "interrupt"
> counter to count kernel events as well as user events (interrupting
> early is OK), but that caused our (completely separate) "measure"
> counter to report off-by-one results (!), which seems to be a
> different bug present on a range of older kernels.)

This seems to have stalled out here unfortunately.

Can we get a consensus (from ingo or peterz?) on Mark's question?  Or,
alternatively, can we move the patch at the top of this thread forward
on the stable branches until we do reach an answer to that question?

We've abandoned hope of working around this problem in rr and are
currently broken for all of our users with an up-to-date kernel, so
the situation for us is rather dire at the moment I'm afraid.

Thanks,

- Kyle


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-06-30 Thread Kyle Huey
On Wed, Jun 28, 2017 at 3:55 PM, Kyle Huey <m...@kylehuey.com> wrote:
> On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header 
>>> > *header,
>>> > struct perf_output_handle handle;
>>> > struct perf_event_header header;
>>> >
>>> > +   /*
>>> > +* For security, drop the skid kernel samples if necessary.
>>> > +*/
>>> > +   if (!sample_is_allowed(event, regs))
>>> > +   return ret;
>>>
>>> Just a bare return here.
>>
>> Ugh, yes. Sorry about that. I'll fix that up.
>>
>> [...]
>>
>>> I can confirm that with that fixed to compile, this patch fixes rr.
>>
>> Thanks for giving this a go.
>>
>> Having thought about this some more, I think Vince does make a good
>> point that throwing away samples is liable to break stuff, e.g. that
>> which only relies on (non-sensitive) samples.
>>
>> It still seems wrong to make up data, though.
>>
>> Maybe for exclude_kernel && !exclude_user events we can always generate
>> samples from the user regs, rather than the exception regs. That's going
>> to be closer to what the user wants, regardless. I'll take a look
>> tomorrow.
>
> I'm not very familiar with the kernel internals, but the reason I
> didn't suggest this originally is it seems like it will be difficult
> to determine what the "correct" userspace registers are.  For example,
> what happens if a performance counter is fixed to a given tid, the
> interrupt fires during a context switch from that task to another that
> is not being monitored, and the kernel is far enough along in the
> context switch that the current task struct has been switched out?
> Reporting the new task's registers seems as bad as reporting the
> kernel's registers.  But maybe this is easier than I imagine for
> whatever reason.
>
> Something to think about.
>
> - Kyle

We've noticed that the regressing changeset made it into stable
branches that distros are shipping now[0], and we're starting to
receive bug reports from our users.  We would really like to get this
patch or something similar into 4.12 and the next stable releases if
at all possible.

Thanks!

- Kyle

[0] Full list in https://mail.mozilla.org/pipermail/rr-dev/2017-June/000500.html


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-06-28 Thread Kyle Huey
On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header 
>> > *header,
>> > struct perf_output_handle handle;
>> > struct perf_event_header header;
>> >
>> > +   /*
>> > +* For security, drop the skid kernel samples if necessary.
>> > +*/
>> > +   if (!sample_is_allowed(event, regs))
>> > +   return ret;
>>
>> Just a bare return here.
>
> Ugh, yes. Sorry about that. I'll fix that up.
>
> [...]
>
>> I can confirm that with that fixed to compile, this patch fixes rr.
>
> Thanks for giving this a go.
>
> Having thought about this some more, I think Vince does make a good
> point that throwing away samples is liable to break stuff, e.g. that
> which only relies on (non-sensitive) samples.
>
> It still seems wrong to make up data, though.
>
> Maybe for exclude_kernel && !exclude_user events we can always generate
> samples from the user regs, rather than the exception regs. That's going
> to be closer to what the user wants, regardless. I'll take a look
> tomorrow.

I'm not very familiar with the kernel internals, but the reason I
didn't suggest this originally is it seems like it will be difficult
to determine what the "correct" userspace registers are.  For example,
what happens if a performance counter is fixed to a given tid, the
interrupt fires during a context switch from that task to another that
is not being monitored, and the kernel is far enough along in the
context switch that the current task struct has been switched out?
Reporting the new task's registers seems as bad as reporting the
kernel's registers.  But maybe this is easier than I imagine for
whatever reason.

Something to think about.

- Kyle


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-28 Thread Kyle Huey
On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> >> My understanding of the situation is as follows:
>> >>
>> >> There is some time, call it t_0, where the hardware counter overflows.
>> >> The PMU triggers an interrupt, but this is not instantaneous.  Call
>> >> the time when the interrupt is actually delivered t_1.  Then t_1 - t_0
>> >> is the "skid".
>> >>
>> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> >> *must* be running a userspace program.  But by t_1, the CPU may be
>> >> doing something else.  Your patch changed things so that if at t_1 the
>> >> CPU is in the kernel, then the interrupt is discarded.  But rr has
>> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> >> on the fd returned by perf_event_open).  This change results in the
>> >> signal never being delivered, because the interrupt was ignored.
>> >> (More accurately, the signal is delivered the *next* time the counter
>> >> overflows, which is far past where we wanted to inject our
>> >> asynchronous event into our tracee.
>> >
>> > Yes, this is a bug.
>> >
>> > As we're trying to avoid sampling state, I think we can move the check
>> > into perf_prepare_sample() or __perf_event_output(), where that state is
>> > actually sampled. I'll take a look at that momentarily.
>> >
>> > Just to clarify, you don't care about the sample state at all? i.e. you
>> > don't need the user program counter?
>>
>> Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
>> etc are all 0.
>> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
>> is what we do use.
>
> Given that, I must be missing something.
>
> In __perf_event_overflow(), we already bail out early if
> !is_sampling_event(event), i.e. when the sample_period is 0.
>
> Your attr has a sample_period of zero, so something must be initialising
> that.
>
> Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> fiddling with the sample period behind your back?

We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
an enormous number if we don't actually want an interrupt).  See
`PerfCounters::reset`, line 446.

> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Perhaps sample_period is misleadingly named :)  Alternatively, you
could imagine it as sampling where we're only interested in whether
the counter passed the sampling value or not.

>> > Is that signal delivered to the tracee, or to a different process that
>> > traces it? If the latter, what ensures that the task is stopped
>> > sufficiently quickly?
>>
>> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
>> In practice we've found that on modern Intel hardware that the
>> interrupt and resulting signal delivery delay is bounded by a
>> relatively small number of counter events.
>
> Ok.
>
> Thanks,
> Mark.

- Kyle
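
For context, the setup being described looks roughly like the sketch below;
the event choice, signal number and error handling are illustrative rather
than rr's exact code (see the PerfCounters.cc link quoted earlier in the
thread for the real thing):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Count retired branches in user mode only and deliver a signal to `tid`
 * once `period` events have been counted.  No sample data is ever read;
 * the overflow signal is the only thing the tracer consumes.             */
static int overflow_counter(pid_t tid, unsigned long long period)
{
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS; /* stand-in for rr's raw event */
        attr.sample_period = period;    /* "interrupt after this many events"           */
        attr.exclude_kernel = 1;        /* the :u constraint at issue in this thread    */
        attr.exclude_guest = 1;

        fd = syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
        if (fd < 0)
                return -1;

        /* Route the overflow as a queued signal to the tracee thread itself. */
        {
                struct f_owner_ex own = { .type = F_OWNER_TID, .pid = tid };
                fcntl(fd, F_SETOWN_EX, &own);
        }
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
        fcntl(fd, F_SETSIG, SIGSTKFLT); /* any otherwise-unused signal number works     */
        return fd;
}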


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-28 Thread Kyle Huey
On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland  wrote:
> On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland  wrote:
>> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> >> My understanding of the situation is as follows:
>> >>
>> >> There is some time, call it t_0, where the hardware counter overflows.
>> >> The PMU triggers an interrupt, but this is not instantaneous.  Call
>> >> the time when the interrupt is actually delivered t_1.  Then t_1 - t_0
>> >> is the "skid".
>> >>
>> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> >> *must* be running a userspace program.  But by t_1, the CPU may be
>> >> doing something else.  Your patch changed things so that if at t_1 the
>> >> CPU is in the kernel, then the interrupt is discarded.  But rr has
>> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> >> on the fd returned by perf_event_open).  This change results in the
>> >> signal never being delivered, because the interrupt was ignored.
>> >> (More accurately, the signal is delivered the *next* time the counter
>> >> overflows, which is far past where we wanted to inject our
>> >> asynchronous event into our tracee.
>> >
>> > Yes, this is a bug.
>> >
>> > As we're trying to avoid smapling state, I think we can move the check
>> > into perf_prepare_sample() or __perf_event_output(), where that state is
>> > actually sampled. I'll take a look at that momentarily.
>> >
>> > Just to clarify, you don't care about the sample state at all? i.e. you
>> > don't need the user program counter?
>>
>> Right. `sample_regs_user`, `sample_star_user`, `branch_sample_type`,
>> etc are all 0.
>> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
>> is what we do use.
>
> Given that, I must be missing something.
>
> In __perf_event_overflow(), we already bail out early if
> !is_sampling_event(event), i.e. when the sample_period is 0.
>
> Your attr has a sample_period of zero, so something must be initialising
> that.
>
> Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> fiddling with the sample period behind your back?

We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
an enormous number if we don't actually want an interrupt.  See
`PerfCounters::reset`, line 446.

> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Perhaps sample_period is misleadingly named :)  Alternatively, you
could imagine it as sampling where we're only interested in whether
the counter passed the sampling value or not.

>> > Is that signal delivered to the tracee, or to a different process that
>> > traces it? If the latter, what ensures that the task is stopped
>> > sufficiently quickly?
>>
>> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
>> In practice we've found that on modern Intel hardware that the
>> interrupt and resulting signal delivery delay is bounded by a
>> relatively small number of counter events.
>
> Ok.
>
> Thanks,
> Mark.

- Kyle


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-06-28 Thread Kyle Huey
On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>
>> As we're trying to avoid smapling state, I think we can move the check
>> into perf_prepare_sample() or __perf_event_output(), where that state is
>> actually sampled. I'll take a look at that momentarily.
>>
>> Just to clarify, you don't care about the sample state at all? i.e. you
>> don't need the user program counter?
>>
>> Is that signal delivered to the tracee, or to a different process that
>> traces it? If the latter, what ensures that the task is stopped
>> sufficiently quickly?
>>
>> > It seems to me that it might be reasonable to ignore the interrupt if
>> > the purpose of the interrupt is to trigger sampling of the CPUs
>> > register state.  But if the interrupt will trigger some other
>> > operation, such as a signal on an fd, then there's no reason to drop
>> > it.
>>
>> Agreed. I'll try to have a patch for this soon.
>>
>> I just need to figure out exactly where that overflow signal is
>> generated by the perf core.
>
> I've figured that out now. That's handled by perf_pending_event(), whcih
> is the irq_work we schedule in __perf_event_overflow().
>
> Does the below patch work for you?
>
> >8
> From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutl...@arm.com>
> Date: Wed, 28 Jun 2017 11:39:25 +0100
> Subject: [PATCH] perf/core: generate overflow signal when samples are dropped
>
> We recently tried to kill an information leak where users could receive
> kernel samples due to skid on the PMU interrupt. To block this, we
> bailed out early in perf_event_overflow(), as we do for non-sampling
> events.
>
> This broke rr, which uses sampling events to receive a signal on
> overflow (but does not care about the contents of the sample). These
> signals are critical to the correct operation of rr.
>
> Instead of bailing out early in perf_event_overflow, we can bail prior
> to performing the actual sampling in __perf_event_output(). This avoids
> the information leak, but preserves the generation of the signal.
>
> Since we don't place any sample data into the ring buffer, the signal is
> arguably spurious. However, a userspace ringbuffer consumer can already
> consume data prior to taking the associated signals, and therefore must
> handle spurious signals to operate correctly. Thus, this signal
> shouldn't be harmful.
>
> Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is 
> specified")
> Reported-by: Kyle Huey <m...@kylehuey.com>
> Signed-off-by: Mark Rutland <mark.rutl...@arm.com>
> Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
> Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Jin Yao <yao@linux.intel.com>
> Cc: Peter Zijlstra (Intel) <pet...@infradead.org>
> ---
>  kernel/events/core.c | 46 +-
>  1 file changed, 25 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6c4e523..6b263f3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header 
> *header,
> }
>  }
>
> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> +{
> +   /*
> +* Due to interrupt latency (AKA "skid"), we may enter the
> +* kernel before taking an overflow, even if the PMU is only
> +* counting user events.
> +* To avoid leaking information to userspace, we must always
> +* reject kernel samples when exclude_kernel is set.
> +*/
> +   if (event->attr.exclude_kernel && !user_mode(regs))
> +   return false;
> +
> +   return true;
> +}
> +
>  static void __always_inline
>  __perf_event_output(struct perf_event *event,
> struct perf_sample_data *data,
> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header 
> *header,
> struct perf_output_handle handle;
> struct perf_event_header header;
>
> +   /*
> +* For security, drop the skid kernel samples if necessary.
> +*/
> +   if (!sample_is_allowed(event, regs))
> +   return ret;

Just a bare return here.

> +
> /* protect the callchain buffers */
> rcu_read_lock();
>
&


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-28 Thread Kyle Huey
On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao@linux.intel.com> wrote:
>> > Hi,
>> >
>> > In theory, the PMI interrupts in skid region should be dropped, right?
>>
>> No, why would they be dropped?
>>
>> My understanding of the situation is as follows:
>>
>> There is some time, call it t_0, where the hardware counter overflows.
>> The PMU triggers an interrupt, but this is not instantaneous.  Call
>> the time when the interrupt is actually delivered t_1.  Then t_1 - t_0
>> is the "skid".
>>
>> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> *must* be running a userspace program.  But by t_1, the CPU may be
>> doing something else.  Your patch changed things so that if at t_1 the
>> CPU is in the kernel, then the interrupt is discarded.  But rr has
>> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> on the fd returned by perf_event_open).  This change results in the
>> signal never being delivered, because the interrupt was ignored.
>> (More accurately, the signal is delivered the *next* time the counter
>> overflows, which is far past where we wanted to inject our
>> asynchronous event into our tracee.)
>
> Yes, this is a bug.
>
> As we're trying to avoid sampling state, I think we can move the check
> into perf_prepare_sample() or __perf_event_output(), where that state is
> actually sampled. I'll take a look at that momentarily.
>
> Just to clarify, you don't care about the sample state at all? i.e. you
> don't need the user program counter?

Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
etc are all 0.
https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
is what we do use.

> Is that signal delivered to the tracee, or to a different process that
> traces it? If the latter, what ensures that the task is stopped
> sufficiently quickly?

It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
In practice we've found that on modern Intel hardware the
interrupt and resulting signal delivery delay is bounded by a
relatively small number of counter events.
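
For reference, a stripped-down sketch of that setup (hypothetical helper,
error handling and the CPU-specific event encoding omitted; the real code
is in the PerfCounters.cc link above):

/* Illustrative only: open a userspace-only counter for a thread and
 * arrange for a signal to be sent to that thread on overflow.  The
 * config value and the signal chosen here are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/perf_event.h>

int open_ticks_counter(pid_t tid, uint64_t period)
{
	struct perf_event_attr attr;
	struct f_owner_ex owner;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	attr.config = 0; /* CPU-specific retired-conditional-branches encoding */
	attr.sample_period = period;
	attr.exclude_kernel = 1; /* count userspace only */
	attr.exclude_guest = 1;

	fd = syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);

	/* Deliver the overflow notification as a signal to the tracee itself. */
	owner.type = F_OWNER_TID;
	owner.pid = tid;
	fcntl(fd, F_SETOWN_EX, &owner);
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
	fcntl(fd, F_SETSIG, SIGSTKFLT); /* any otherwise-unused signal */

	return fd;
}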

>> It seems to me that it might be reasonable to ignore the interrupt if
>> the purpose of the interrupt is to trigger sampling of the CPUs
>> register state.  But if the interrupt will trigger some other
>> operation, such as a signal on an fd, then there's no reason to drop
>> it.
>
> Agreed. I'll try to have a patch for this soon.
>
> I just need to figure out exactly where that overflow signal is
> generated by the perf core.
>
> Thanks,
> Mark.

- Kyle


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-28 Thread Kyle Huey
On Tue, Jun 27, 2017 at 10:35 PM, Jin, Yao <yao@linux.intel.com> wrote:
> Hi Kyle,
>
> I understand your requirement. Sorry I don't know the detail of rr debugger,
> but I guess if it just uses counter overflow to deliver a signal, could it
> set the counter without "exclude_kernel"?

Unfortunately we cannot.  We depend on the counter value being exactly
the same between recording and replay, and dropping `exclude_kernel`
would introduce non-determinism.

> Another consideration is, I'm not sure if the kernel can accurately realize
> that if the interrupt is to trigger sampling or just deliver a signal.
> Userspace may know that but kernel may not.

After looking at this code a bit more, I think that changing the
`sample_is_allowed` check from an early return to a guard around the
invocation of `overflow_handler` would fix this.  I believe, but have
not tested, that `perf_event_fasync` is what must run to deliver our
signal, while the `overflow_handler` is what copies the kernel RIP/etc
into the output buffer that you want to skip.
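
Roughly something like this at the tail of __perf_event_overflow()
(untested, and only a sketch of the shape of the change I mean: skip the
sample output, keep the wakeup/signal delivery):

	/*
	 * Illustrative sketch, not a patch: guard just the sample output
	 * with the skid check, but still run the fasync path below so
	 * F_SETSIG users get their notification.
	 */
	if (sample_is_allowed(event, regs))
		event->overflow_handler(event, data, regs);

	if (*perf_event_fasync(event) && event->pending_kill) {
		event->pending_wakeup = 1;
		irq_work_queue(&event->pending);
	}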

- Kyle

> Anyway let's listen to more comments from community.
>
> Thanks
>
> Jin Yao
>
>
>
> On 6/28/2017 12:51 PM, Kyle Huey wrote:
>>
>> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao@linux.intel.com> wrote:
>>>
>>> Hi,
>>>
>>> In theory, the PMI interrupts in skid region should be dropped, right?
>>
>> No, why would they be dropped?
>>
>> My understanding of the situation is as follows:
>>
>> There is some time, call it t_0, where the hardware counter overflows.
>> The PMU triggers an interrupt, but this is not instantaneous.  Call
>> the time when the interrupt is actually delivered t_1.  Then t_1 - t_0
>> is the "skid".
>>
>> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> *must* be running a userspace program.  But by t_1, the CPU may be
>> doing something else.  Your patch changed things so that if at t_1 the
>> CPU is in the kernel, then the interrupt is discarded.  But rr has
>> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> on the fd returned by perf_event_open).  This change results in the
>> signal never being delivered, because the interrupt was ignored.
>> (More accurately, the signal is delivered the *next* time the counter
>> overflows, which is far past where we wanted to inject our
>> asynchronous event into our tracee.)
>>
>> It seems to me that it might be reasonable to ignore the interrupt if
>> the purpose of the interrupt is to trigger sampling of the CPUs
>> register state.  But if the interrupt will trigger some other
>> operation, such as a signal on an fd, then there's no reason to drop
>> it.
>>
>>> For a userspace debugger, is it the only choice that relies on the *skid*
>>> PMI interrupt?
>>
>> I don't understand this question, but hopefully the above clarified
>> things.
>>
>> - Kyle
>>
>>> Thanks
>>> Jin Yao
>>>
>>>
>>> On 6/28/2017 9:01 AM, Kyle Huey wrote:
>>>>
>>>> Sent again with LKML CCd, sorry for the noise.
>>>>
>>>> - Kyle
>>>>
>>>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <m...@kylehuey.com> wrote:
>>>>>
>>>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>>>>> a candidate for backporting to stable branches.
>>>>>
>>>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>>>>> to stop a program during replay to inject asynchronous events such as
>>>>> signals.  We are counting retired conditional branches in userspace
>>>>> only.  This changeset causes the kernel to drop interrupts on the
>>>>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>>>>> kernel mode for whatever reason.  When replaying traces of complex
>>>>> programs such as Firefox, we intermittently fail to deliver
>>>>> asynchronous events on time, leading the replay to diverge from the
>>>>> recorded state.
>>>>>
>>>>> It seems like this change should, at a bare minimum, be limited to
>>>>> counters that actually perform sampling of register state when the
>>>>> interrupt fires.  In our case, with the retired conditional branches
>>>>> counter restricted to counting userspace events only, it makes no
>>>>> difference that the PMU interrupt happened to be delivered in the
>>>>> kernel.
>>>>>
>>>>> As this makes rr unusable on complex applications and cannot be
>>>>> efficiently worked around, we would appreciate this being addressed
>>>>> before 4.12 is finalized, and the regression not being introduced to
>>>>> stable branches.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Kyle
>>>>>
>>>>> [0] http://rr-project.org/
>>>
>>>
>


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-27 Thread Kyle Huey
On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao@linux.intel.com> wrote:
> Hi,
>
> In theory, the PMI interrupts in skid region should be dropped, right?

No, why would they be dropped?

My understanding of the situation is as follows:

There is some time, call it t_0, where the hardware counter overflows.
The PMU triggers an interrupt, but this is not instantaneous.  Call
the time when the interrupt is actually delivered t_1.  Then t_1 - t_0
is the "skid".

Note that if the counter is `exclude_kernel`, then at t_0 the CPU
*must* be running a userspace program.  But by t_1, the CPU may be
doing something else.  Your patch changed things so that if at t_1 the
CPU is in the kernel, then the interrupt is discarded.  But rr has
programmed the counter to deliver a signal on overflow (via F_SETSIG
on the fd returned by perf_event_open).  This change results in the
signal never being delivered, because the interrupt was ignored.
(More accurately, the signal is delivered the *next* time the counter
overflows, which is far past where we wanted to inject our
asynchronous event into our tracee.)

It seems to me that it might be reasonable to ignore the interrupt if
the purpose of the interrupt is to trigger sampling of the CPUs
register state.  But if the interrupt will trigger some other
operation, such as a signal on an fd, then there's no reason to drop
it.

> For a userspace debugger, is it the only choice that relies on the *skid*
> PMI interrupt?

I don't understand this question, but hopefully the above clarified things.

- Kyle

> Thanks
> Jin Yao
>
>
> On 6/28/2017 9:01 AM, Kyle Huey wrote:
>>
>> Sent again with LKML CCd, sorry for the noise.
>>
>> - Kyle
>>
>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <m...@kylehuey.com> wrote:
>>>
>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>>> a candidate for backporting to stable branches.
>>>
>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>>> to stop a program during replay to inject asynchronous events such as
>>> signals.  We are counting retired conditional branches in userspace
>>> only.  This changeset causes the kernel to drop interrupts on the
>>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>>> kernel mode for whatever reason.  When replaying traces of complex
>>> programs such as Firefox, we intermittently fail to deliver
>>> asynchronous events on time, leading the replay to diverge from the
>>> recorded state.
>>>
>>> It seems like this change should, at a bare minimum, be limited to
>>> counters that actually perform sampling of register state when the
>>> interrupt fires.  In our case, with the retired conditional branches
>>> counter restricted to counting userspace events only, it makes no
>>> difference that the PMU interrupt happened to be delivered in the
>>> kernel.
>>>
>>> As this makes rr unusable on complex applications and cannot be
>>> efficiently worked around, we would appreciate this being addressed
>>> before 4.12 is finalized, and the regression not being introduced to
>>> stable branches.
>>>
>>> Thanks,
>>>
>>> - Kyle
>>>
>>> [0] http://rr-project.org/
>
>


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-27 Thread Kyle Huey
Sent again with LKML CCd, sorry for the noise.

- Kyle

On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <m...@kylehuey.com> wrote:
> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
> a candidate for backporting to stable branches.
>
> rr, a userspace record and replay debugger[0], uses the PMU interrupt
> to stop a program during replay to inject asynchronous events such as
> signals.  We are counting retired conditional branches in userspace
> only.  This changeset causes the kernel to drop interrupts on the
> floor if, during the PMU interrupt's "skid" region, the CPU enters
> kernel mode for whatever reason.  When replaying traces of complex
> programs such as Firefox, we intermittently fail to deliver
> asynchronous events on time, leading the replay to diverge from the
> recorded state.
>
> It seems like this change should, at a bare minimum, be limited to
> counters that actually perform sampling of register state when the
> interrupt fires.  In our case, with the retired conditional branches
> counter restricted to counting userspace events only, it makes no
> difference that the PMU interrupt happened to be delivered in the
> kernel.
>
> As this makes rr unusable on complex applications and cannot be
> efficiently worked around, we would appreciate this being addressed
> before 4.12 is finalized, and the regression not being introduced to
> stable branches.
>
> Thanks,
>
> - Kyle
>
> [0] http://rr-project.org/


Re: [PATCH] x86/syscalls/32: ignore arch_prctl for other architectures

2017-03-23 Thread Kyle Huey
On Thu, Mar 23, 2017 at 8:18 AM, Arnd Bergmann  wrote:
> sys_arch_prctl is only provided on x86, and there is no reason
> to add it elsewhere. However, including it on the 32-bit syscall
> table caused a warning for most configurations on non-x86:
>
> :1328:2: warning: #warning syscall arch_prctl not implemented [-Wcpp]
>
> This adds an exception to the syscall table checking script.
>
> Fixes: 79170fda313e ("x86/syscalls/32: Wire up arch_prctl on x86-32")
> Signed-off-by: Arnd Bergmann 
> ---
>  scripts/checksyscalls.sh | 1 +
>  1 file changed, 1 insertion(+)
>
> Thomas, can you apply this on top of the x86 patch?
>
> diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
> index f7ad07128811..0cce56da3706 100755
> --- a/scripts/checksyscalls.sh
> +++ b/scripts/checksyscalls.sh
> @@ -148,6 +148,7 @@ cat << EOF
>  #define __IGNORE_sysfs
>  #define __IGNORE_uselib
>  #define __IGNORE__sysctl
> +#define __IGNORE_arch_prctl
>
>  /* ... including the "new" 32-bit uid syscalls */
>  #define __IGNORE_lchown32
> --
> 2.9.0
>

Ah, nice. lgtm.

- Kyle


Re: [tip:x86/process] x86/arch_prctl: Add ARCH_[GET|SET]_CPUID

2017-03-21 Thread Kyle Huey
On Tue, Mar 21, 2017 at 1:34 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * tip-bot for Kyle Huey <tip...@zytor.com> wrote:
>
>> Commit-ID:  e9ea1e7f53b852147cbd568b0568c7ad97ec21a3
>> Gitweb: 
>> http://git.kernel.org/tip/e9ea1e7f53b852147cbd568b0568c7ad97ec21a3
>> Author: Kyle Huey <m...@kylehuey.com>
>> AuthorDate: Mon, 20 Mar 2017 01:16:26 -0700
>> Committer:  Thomas Gleixner <t...@linutronix.de>
>> CommitDate: Mon, 20 Mar 2017 16:10:34 +0100
>>
>> x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
>
> Note that this series broke the UML build:
>
>   /home/mingo/tip/arch/x86/um/syscalls_64.c:15:6: error: conflicting types 
> for ‘arch_prctl’
>   ...
>
> etc.
>
> Thanks,
>
> Ingo

Yes, I sent another patch ("um: fix build error due to typo")
yesterday and tglx applied it to x86/process.

- Kyle


[tip:x86/process] um/arch_prctl: Fix fallout from x86 arch_prctl() rework

2017-03-21 Thread tip-bot for Kyle Huey
Commit-ID:  d582799fe5de1c1ca127d7f364db12a660cf46d4
Gitweb: http://git.kernel.org/tip/d582799fe5de1c1ca127d7f364db12a660cf46d4
Author: Kyle Huey <m...@kylehuey.com>
AuthorDate: Mon, 20 Mar 2017 16:05:35 -0700
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 21 Mar 2017 10:08:29 +0100

um/arch_prctl: Fix fallout from x86 arch_prctl() rework

The recent arch_prctl rework added a bracket instead of a comma. Fix it.

Fixes: 17a6e1b8e8e8 ("x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()")
Signed-off-by: Kyle Huey <kh...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: kbuild-...@01.org
Link: http://lkml.kernel.org/r/20170320230535.11281-1-kh...@kylehuey.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/um/syscalls_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 81b9fe1..58f5166 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -12,7 +12,7 @@
 #include  /* XXX This should get the constants from libc */
 #include 
 
-long arch_prctl(struct task_struct *task, int option)
+long arch_prctl(struct task_struct *task, int option,
unsigned long __user *arg2)
 {
unsigned long *ptr = arg2, tmp;


[PATCH] um: fix build error due to typo

2017-03-20 Thread Kyle Huey
Oops.

Signed-off-by: Kyle Huey <kh...@kylehuey.com>
---
 arch/x86/um/syscalls_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 81b9fe100f7c..58f51667e2e4 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -12,7 +12,7 @@
 #include  /* XXX This should get the constants from libc */
 #include 
 
-long arch_prctl(struct task_struct *task, int option)
+long arch_prctl(struct task_struct *task, int option,
unsigned long __user *arg2)
 {
unsigned long *ptr = arg2, tmp;
-- 
2.11.0



Re: [PATCH v16 08/10] x86/arch_prctl: Add ARCH_[GET|SET]_CPUID

2017-03-20 Thread Kyle Huey
On Mon, Mar 20, 2017 at 8:00 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Mon, 20 Mar 2017, Kyle Huey wrote:
>> --- a/arch/x86/include/uapi/asm/prctl.h
>> +++ b/arch/x86/include/uapi/asm/prctl.h
>> @@ -6,8 +6,17 @@
>>  #define ARCH_GET_FS 0x1003
>>  #define ARCH_GET_GS 0x1004
>>
>> +#define ARCH_GET_CPUID 0x1005
>> +#define ARCH_SET_CPUID 0x1006
>> +
>>  #define ARCH_MAP_VDSO_X32    0x2001
>>  #define ARCH_MAP_VDSO_32 0x2002
>>  #define ARCH_MAP_VDSO_64 0x2003
>>
>> +#ifdef CONFIG_CHECKPOINT_RESTORE
>> +# define ARCH_MAP_VDSO_X32   0x2001
>> +# define ARCH_MAP_VDSO_320x2002
>> +# define ARCH_MAP_VDSO_640x2003
>> +#endif
>
> That hunk is bogus in two aspects:
>  - It's just a copy of the above wrapped in a ifdef
>
>  - The ifdef is broken, because the UAPI headers do not know about
>that.
>
> I dropped it.

Looks like that was introduced when I rebased over a01aa6c9f40f for
v14 and I didn't catch it manually.  Thanks for finding and fixing
that.
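
For anyone following the thread, a minimal sketch of how the new prctls
are meant to be used from userspace (illustrative only; assumes uapi
headers from a kernel with this series, and goes through syscall(2) since
glibc has no arch_prctl wrapper):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <asm/prctl.h> /* ARCH_GET_CPUID / ARCH_SET_CPUID */

int main(void)
{
	/* Nonzero return means CPUID currently executes normally. */
	long enabled = syscall(SYS_arch_prctl, ARCH_GET_CPUID, 0);

	printf("cpuid enabled: %ld\n", enabled);

	/* Ask the kernel to make CPUID fault (SIGSEGV) in this task so a
	 * tracer such as rr can trap and emulate it.  May fail on hardware
	 * without CPUID faulting support. */
	if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) != 0)
		perror("ARCH_SET_CPUID");

	return 0;
}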

- Kyle


[PATCH v16 06/10] x86/syscalls/32: Wire up arch_prctl on x86-32

2017-03-20 Thread Kyle Huey
Hook up arch_prctl to call do_arch_prctl() on x86-32, and in 32 bit compat
mode on x86-64. This allows us to have arch_prctls that are not specific to
64 bits.

On UML, simply stub out this syscall.

Signed-off-by: Kyle Huey <kh...@kylehuey.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/kernel/process_32.c   | 7 +++
 arch/x86/kernel/process_64.c   | 7 +++
 arch/x86/um/Makefile   | 2 +-
 arch/x86/um/syscalls_32.c  | 7 +++
 include/linux/compat.h | 2 ++
 6 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/um/syscalls_32.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 9ba050fe47f3..0af59fa789ea 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -390,3 +390,4 @@
 381  i386  pkey_alloc      sys_pkey_alloc
 382  i386  pkey_free       sys_pkey_free
 383  i386  statx           sys_statx
+384  i386  arch_prctl      sys_arch_prctl          compat_sys_arch_prctl
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4c818f8bc135..ff40e74c9181 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -56,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void __show_regs(struct pt_regs *regs, int all)
 {
@@ -304,3 +306,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
return prev_p;
 }
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+   return do_arch_prctl_common(current, option, arg2);
+}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d81b0a60a45c..ea1a6180bf39 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -635,6 +635,13 @@ SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
return ret;
 }
 
+#ifdef CONFIG_IA32_EMULATION
+COMPAT_SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+   return do_arch_prctl_common(current, option, arg2);
+}
+#endif
+
 unsigned long KSTK_ESP(struct task_struct *task)
 {
return task_pt_regs(task)->sp;
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index e7e7055a8658..69f0827d5f53 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -16,7 +16,7 @@ obj-y = bug.o bugs_$(BITS).o delay.o fault.o ldt.o \
 
 ifeq ($(CONFIG_X86_32),y)
 
-obj-y += checksum_32.o
+obj-y += checksum_32.o syscalls_32.o
 obj-$(CONFIG_ELF_CORE) += elfcore.o
 
 subarch-y = ../lib/string_32.o ../lib/atomic64_32.o ../lib/atomic64_cx8_32.o
diff --git a/arch/x86/um/syscalls_32.c b/arch/x86/um/syscalls_32.c
new file mode 100644
index ..627d68836b16
--- /dev/null
+++ b/arch/x86/um/syscalls_32.c
@@ -0,0 +1,7 @@
+#include 
+#include 
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+   return -EINVAL;
+}
diff --git a/include/linux/compat.h b/include/linux/compat.h
index aef47be2a5c1..af9dbc44fd92 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -723,6 +723,8 @@ asmlinkage long compat_sys_sched_rr_get_interval(compat_pid_t pid,
 asmlinkage long compat_sys_fanotify_mark(int, unsigned int, __u32, __u32,
int, const char __user *);
 
+asmlinkage long compat_sys_arch_prctl(int option, unsigned long arg2);
+
 /*
  * For most but not all architectures, "am I in a compat syscall?" and
  * "am I a compat task?" are the same question.  For architectures on which
-- 
2.11.0



[PATCH v16 05/10] x86/arch_prctl: Add do_arch_prctl_common

2017-03-20 Thread Kyle Huey
Add do_arch_prctl_common() to handle arch_prctls that are not specific to 64
bit mode. Call it from the syscall entry point, but not any of the other
callsites in the kernel, which all want one of the existing 64 bit only
arch_prctls.

Signed-off-by: Kyle Huey <kh...@kylehuey.com>
---
 arch/x86/include/asm/proto.h | 3 +++
 arch/x86/kernel/process.c| 6 ++
 arch/x86/kernel/process_64.c | 8 +++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 4e276f6cb1c1..8d3964fc5f91 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -31,4 +31,7 @@ void x86_report_nx(void);
 
 extern int reboot_force;
 
+long do_arch_prctl_common(struct task_struct *task, int option,
+ unsigned long cpuid_enabled);
+
 #endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 366db7782fc6..b12e95eceb83 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -545,3 +545,9 @@ unsigned long get_wchan(struct task_struct *p)
put_task_stack(p);
return ret;
 }
+
+long do_arch_prctl_common(struct task_struct *task, int option,
+ unsigned long cpuid_enabled)
+{
+   return -EINVAL;
+}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e37f764c11cc..d81b0a60a45c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -626,7 +626,13 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 {
-   return do_arch_prctl_64(current, option, arg2);
+   long ret;
+
+   ret = do_arch_prctl_64(current, option, arg2);
+   if (ret == -EINVAL)
+   ret = do_arch_prctl_common(current, option, arg2);
+
+   return ret;
 }
 
 unsigned long KSTK_ESP(struct task_struct *task)
-- 
2.11.0


