Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Ingo Molnar
* Vince Weaver wrote: > On Tue, 6 May 2014, Vince Weaver wrote: > > > In any case if we can get the recent patches applied in time for 3.15 I > > think it will turn out to be a nice release perf-event-stability wise. > > I should also mention I have a list of open perf_fuzzer issues here: >

Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Ingo Molnar
* Vince Weaver wrote: > On Mon, 5 May 2014, Ingo Molnar wrote: > > > > I'm also thinking about waiting a bit before applying anything even > > borderline intrusive to the perf core, to make sure there's enough > > fuzz time to declare stable state (at least as far into the ABI as the > >

Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Peter Zijlstra
On Tue, May 06, 2014 at 12:57:08PM -0400, Vince Weaver wrote: > On Mon, 5 May 2014, Vince Weaver wrote: > > > On Mon, 5 May 2014, Vince Weaver wrote: > > > > > Meanwhile the haswell and AMD machines have been fuzzing away without > > > issue, I don't know why the core2 machine is always the

Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Peter Zijlstra
On Tue, May 06, 2014 at 12:57:08PM -0400, Vince Weaver wrote: On Mon, 5 May 2014, Vince Weaver wrote: On Mon, 5 May 2014, Vince Weaver wrote: Meanwhile the haswell and AMD machines have been fuzzing away without issue, I don't know why the core2 machine is always the trouble maker.

Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Ingo Molnar
* Vince Weaver vincent.wea...@maine.edu wrote: On Mon, 5 May 2014, Ingo Molnar wrote: I'm also thinking about waiting a bit before applying anything even borderline intrusive to the perf core, to make sure there's enough fuzz time to declare stable state (at least as far into the ABI

Re: [perf] more perf_fuzzer memory corruption

2014-05-07 Thread Ingo Molnar
* Vince Weaver vincent.wea...@maine.edu wrote: On Tue, 6 May 2014, Vince Weaver wrote: In any case if we can get the recent patches applied in time for 3.15 I think it will turn out to be a nice release perf-event-stability wise. I should also mention I have a list of open perf_fuzzer

Re: [perf] more perf_fuzzer memory corruption

2014-05-06 Thread Vince Weaver
On Tue, 6 May 2014, Vince Weaver wrote: > In any case if we can get the recent patches applied in time for 3.15 I > think it will turn out to be a nice release perf-event-stability wise. I should also mention I have a list of open perf_fuzzer issues here:

Re: [perf] more perf_fuzzer memory corruption

2014-05-06 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: > On Mon, 5 May 2014, Vince Weaver wrote: > > > Meanwhile the haswell and AMD machines have been fuzzing away without > > issue, I don't know why the core2 machine is always the trouble maker. > > The haswell has been fuzzing 12 hours with only a NMI

Re: [perf] more perf_fuzzer memory corruption

2014-05-06 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: On Mon, 5 May 2014, Vince Weaver wrote: Meanwhile the haswell and AMD machines have been fuzzing away without issue, I don't know why the core2 machine is always the trouble maker. The haswell has been fuzzing 12 hours with only a NMI

Re: [perf] more perf_fuzzer memory corruption

2014-05-06 Thread Vince Weaver
On Tue, 6 May 2014, Vince Weaver wrote: In any case if we can get the recent patches applied in time for 3.15 I think it will turn out to be a nice release perf-event-stability wise. I should also mention I have a list of open perf_fuzzer issues here:

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Ingo Molnar wrote: > > I'm also thinking about waiting a bit before applying anything even > borderline intrusive to the perf core, to make sure there's enough > fuzz time to declare stable state (at least as far into the ABI as the > fuzzing is able to reach). Future

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: > Meanwhile the haswell and AMD machines have been fuzzing away without > issue, I don't know why the core2 machine is always the trouble maker. The haswell has been fuzzing 12 hours with only a NMI dazed/confused message. The AMD A10 machine however

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: > > It looks like it is stuck repeating this forever: > > perf_fuzzer-5256 [000] 275.943049: kmalloc: > > (T.1262+0xe) call_site=810d022f ptr=0x8800cb028400 > > bytes_req=216 bytes_alloc=256 gfp_flags=GFP_KERNEL|GFP_ZERO

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Mon, May 05, 2014 at 02:47:32PM -0400, Vince Weaver wrote: > On Mon, 5 May 2014, Peter Zijlstra wrote: > > > Cute.. does the below cure? > > > > > > --- > > Subject: perf: Fix perf_event_init_context() > > From: Peter Zijlstra > > Date: Mon May 5 19:12:20 CEST 2014 > > > >

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: > Cute.. does the below cure? > > > --- > Subject: perf: Fix perf_event_init_context() > From: Peter Zijlstra > Date: Mon May 5 19:12:20 CEST 2014 > > perf_pin_task_context() can return NULL but perf_event_init_context() > assumes it will not,

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Ingo Molnar
* Vince Weaver wrote: > On Mon, 5 May 2014, Peter Zijlstra wrote: > > > Does this one work better? Making sure all __perf_remove_from_context() > > callers pass the right structure seems to improve things no end. My > > machine is now happy to reboot again. > > Yes, I've been fuzzing this for

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Mon, May 05, 2014 at 01:10:55PM -0400, Vince Weaver wrote: > On Mon, 5 May 2014, Vince Weaver wrote: > > > (Although often things like to crash the instant my tested-by e-mails > > clear the lkml list.) > > This did turn up on the core2 machine. I had been seeing this problem > earlier but

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: > (Although often things like to crash the instant my tested-by e-mails > clear the lkml list.) This did turn up on the core2 machine. I had been seeing this problem earlier but was hoping it was part of the memory corruption issue: [ 4918.921921] BUG:

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: > Does this one work better? Making sure all __perf_remove_from_context() > callers pass the right structure seems to improve things no end. My > machine is now happy to reboot again. Yes, I've been fuzzing this for a few hours on both my haswell and

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote: > On Fri, 2 May 2014, Vince Weaver wrote: > > > I've been fuzzing without your additional patch for 6 hours and all looks > > (almost) good. I can add in your patch and let it fuzz overnight. > > and I applied the additional patch,

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Vince Weaver wrote: I've been fuzzing without your additional patch for 6 hours and all looks (almost) good. I can add in your patch and let it fuzz overnight. and I applied the additional patch, installed

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: Does this one work better? Making sure all __perf_remove_from_context() callers pass the right structure seems to improve things no end. My machine is now happy to reboot again. Yes, I've been fuzzing this for a few hours on both my haswell and core2

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: (Although often things like to crash the instant my tested-by e-mails clear the lkml list.) This did turn up on the core2 machine. I had been seeing this problem earlier but was hoping it was part of the memory corruption issue: [ 4918.921921] BUG:

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Mon, May 05, 2014 at 01:10:55PM -0400, Vince Weaver wrote: On Mon, 5 May 2014, Vince Weaver wrote: (Although often things like to crash the instant my tested-by e-mails clear the lkml list.) This did turn up on the core2 machine. I had been seeing this problem earlier but was

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Ingo Molnar
* Vince Weaver vincent.wea...@maine.edu wrote: On Mon, 5 May 2014, Peter Zijlstra wrote: Does this one work better? Making sure all __perf_remove_from_context() callers pass the right structure seems to improve things no end. My machine is now happy to reboot again. Yes, I've been

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: Cute.. does the below cure? --- Subject: perf: Fix perf_event_init_context() From: Peter Zijlstra pet...@infradead.org Date: Mon May 5 19:12:20 CEST 2014 perf_pin_task_context() can return NULL but perf_event_init_context() assumes it will

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Peter Zijlstra
On Mon, May 05, 2014 at 02:47:32PM -0400, Vince Weaver wrote: On Mon, 5 May 2014, Peter Zijlstra wrote: Cute.. does the below cure? --- Subject: perf: Fix perf_event_init_context() From: Peter Zijlstra pet...@infradead.org Date: Mon May 5 19:12:20 CEST 2014

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Peter Zijlstra wrote: It looks like it is stuck repeating this forever: perf_fuzzer-5256 [000] 275.943049: kmalloc: (T.1262+0xe) call_site=810d022f ptr=0x8800cb028400 bytes_req=216 bytes_alloc=256 gfp_flags=GFP_KERNEL|GFP_ZERO

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Vince Weaver wrote: Meanwhile the haswell and AMD machines have been fuzzing away without issue, I don't know why the core2 machine is always the trouble maker. The haswell has been fuzzing 12 hours with only a NMI dazed/confused message. The AMD A10 machine however has

Re: [perf] more perf_fuzzer memory corruption

2014-05-05 Thread Vince Weaver
On Mon, 5 May 2014, Ingo Molnar wrote: I'm also thinking about waiting a bit before applying anything even borderline intrusive to the perf core, to make sure there's enough fuzz time to declare stable state (at least as far into the ABI as the fuzzing is able to reach). Future bisection

Re: [perf] more perf_fuzzer memory corruption

2014-05-03 Thread Peter Zijlstra
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote: > On Fri, 2 May 2014, Vince Weaver wrote: > > > I've been fuzzing without your additional patch for 6 hours and all looks > > (almost) good. I can add in your patch and let it fuzz overnight. > > and I applied the additional patch,

Re: [perf] more perf_fuzzer memory corruption

2014-05-03 Thread Peter Zijlstra
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Vince Weaver wrote: I've been fuzzing without your additional patch for 6 hours and all looks (almost) good. I can add in your patch and let it fuzz overnight. and I applied the additional patch, installed

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Vince Weaver wrote: > I've been fuzzing without your additional patch for 6 hours and all looks > (almost) good. I can add in your patch and let it fuzz overnight. and I applied the additional patch, installed the kernel, hit reboot, and the following happened (this was

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Thomas Gleixner wrote: > > OK the proper patch has been running the quick reproducer for a bit > > without triggering the issue, I'll let it run a bit more and then upgrade > > to full fuzzing. > > If you do that, please add the patch below. I've been fuzzing without your

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Thomas Gleixner
On Fri, 2 May 2014, Vince Weaver wrote: > On Fri, 2 May 2014, Thomas Gleixner wrote: > > > Hmm, and where comes the WARN_ON in _free_event() from? That's not in > > Peter's last patch. > > ahh, you're right :( My fault. I gave the new patch and the previous > patch similar names and applied

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Thomas Gleixner wrote: > Hmm, and where comes the WARN_ON in _free_event() from? That's not in > Peter's last patch. ahh, you're right :( My fault. I gave the new patch and the previous patch similar names and applied the wrong one. OK the proper patch has been running the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Thomas Gleixner
On Fri, 2 May 2014, Vince Weaver wrote: > On Fri, 2 May 2014, Peter Zijlstra wrote: > > > On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: > > > On Fri, 2 May 2014, Peter Zijlstra wrote: > > > > > > > In principle the vfs file refcounting should be responsible for that. > > > > But

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: > On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: > > On Fri, 2 May 2014, Peter Zijlstra wrote: > > > > > In principle the vfs file refcounting should be responsible for that. > > > But I'll go over it in a bit. > > > > The poll code is

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: > On Fri, 2 May 2014, Peter Zijlstra wrote: > > > In principle the vfs file refcounting should be responsible for that. > > But I'll go over it in a bit. > > The poll code is ancient and the C-parser in my head really can't handle >

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 01:06:52PM -0400, Vince Weaver wrote: > On Fri, 2 May 2014, Peter Zijlstra wrote: > > > > Can you give this a spin? > > > > --- > > Subject: perf: Fix race in removing an event > > From: Peter Zijlstra > > Date: Fri May 2 16:56:01 CEST 2014 > > Nope, still shows the bug

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: > > Can you give this a spin? > > --- > Subject: perf: Fix race in removing an event > From: Peter Zijlstra > Date: Fri May 2 16:56:01 CEST 2014 Nope, still shows the bug pretty quickly: [ 210.411542] [ cut here ] [

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: > In principle the vfs file refcounting should be responsible for that. > But I'll go over it in a bit. The poll code is ancient and the C-parser in my head really can't handle it very well. Anyway for completeness this is the kind of thing I'm seeing.

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 12:22:30PM -0400, Vince Weaver wrote: > > I'll try the patch next. > > Meanwhile, can polling on a closed event cause problems with the reference > count? > > In my various failure traces there's always been a poll() active at the > time of crash, and I added some

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
I'll try the patch next. Meanwhile, can polling on a closed event cause problems with the reference count? In my various failure traces there's always been a poll() active at the time of crash, and I added some trace_printk()s and it looks like poll is at least attempting to poll on the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote: > It is a race condition of sorts, because it's just a 10us or so > interleaving of calls that causes the bug to happen or not. > > In the good trace: > > [parent] __perf_event_task_sched_out (and hence perf_swevent_del) >

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote: > > OK, humor me a bit here. > > I'm looking at the buggy trace and comparing against a "good" trace where > the bug doesn't happen. > > It is a race condition of sorts, because it's just a 10us or so > interleaving of calls that

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote: OK, humor me a bit here. I'm looking at the buggy trace and comparing against a good trace where the bug doesn't happen. It is a race condition of sorts, because it's just a 10us or so interleaving of calls that causes the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote: It is a race condition of sorts, because it's just a 10us or so interleaving of calls that causes the bug to happen or not. In the good trace: [parent] __perf_event_task_sched_out (and hence perf_swevent_del)

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 12:22:30PM -0400, Vince Weaver wrote: I'll try the patch next. Meanwhile, can polling on a closed event cause problems with the reference count? In my various failure traces there's always been a poll() active at the time of crash, and I added some

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: In principle the vfs file refcounting should be responsible for that. But I'll go over it in a bit. The poll code is ancient and the C-parser in my head really can't handle it very well. Anyway for completeness this is the kind of thing I'm seeing.

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: Can you give this a spin? --- Subject: perf: Fix race in removing an event From: Peter Zijlstra pet...@infradead.org Date: Fri May 2 16:56:01 CEST 2014 Nope, still shows the bug pretty quickly: [ 210.411542] [ cut here ]

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 01:06:52PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Peter Zijlstra wrote: Can you give this a spin? --- Subject: perf: Fix race in removing an event From: Peter Zijlstra pet...@infradead.org Date: Fri May 2 16:56:01 CEST 2014 Nope, still shows the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Peter Zijlstra
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Peter Zijlstra wrote: In principle the vfs file refcounting should be responsible for that. But I'll go over it in a bit. The poll code is ancient and the C-parser in my head really can't handle it very

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Peter Zijlstra wrote: On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Peter Zijlstra wrote: In principle the vfs file refcounting should be responsible for that. But I'll go over it in a bit. The poll code is ancient and the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Thomas Gleixner
On Fri, 2 May 2014, Vince Weaver wrote: On Fri, 2 May 2014, Peter Zijlstra wrote: On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote: On Fri, 2 May 2014, Peter Zijlstra wrote: In principle the vfs file refcounting should be responsible for that. But I'll go over it in

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Thomas Gleixner wrote: Hmm, and where comes the WARN_ON in _free_event() from? That's not in Peter's last patch. ahh, you're right :( My fault. I gave the new patch and the previous patch similar names and applied the wrong one. OK the proper patch has been running the

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Thomas Gleixner
On Fri, 2 May 2014, Vince Weaver wrote: On Fri, 2 May 2014, Thomas Gleixner wrote: Hmm, and where comes the WARN_ON in _free_event() from? That's not in Peter's last patch. ahh, you're right :( My fault. I gave the new patch and the previous patch similar names and applied the wrong

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Thomas Gleixner wrote: OK the proper patch has been running the quick reproducer for a bit without triggering the issue, I'll let it run a bit more and then upgrade to full fuzzing. If you do that, please add the patch below. I've been fuzzing without your

Re: [perf] more perf_fuzzer memory corruption

2014-05-02 Thread Vince Weaver
On Fri, 2 May 2014, Vince Weaver wrote: I've been fuzzing without your additional patch for 6 hours and all looks (almost) good. I can add in your patch and let it fuzz overnight. and I applied the additional patch, installed the kernel, hit reboot, and the following happened (this was

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
OK, with the following patch I've been running the problem test case for an hour without triggering the bug. I'm sure this is the wrong fix (maybe patching over the problem instead of fixing the root cause), but it works for me. It looks like this whole mess got introduced with 76e1d9047 in

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
OK, humor me a bit here. I'm looking at the buggy trace and comparing against a "good" trace where the bug doesn't happen. It is a race condition of sorts, because it's just a 10us or so interleaving of calls that causes the bug to happen or not. In the good trace: [parent]

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Thomas Gleixner wrote: > Heading out now and postponing the chase for tomorrow morning. Some decoding of the trace. One thing that's possibly unrelated, but on both this and the previous bug the main thread was doing a "perf_poll" while the bug is triggered. I guess in

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Vince Weaver wrote: > On Thu, 1 May 2014, Peter Zijlstra wrote: > > > > But yes please! > > OK, sorry for the delay, had forgotten to re-enable -pg for perf in the > makefile when I applied your patch so had to re-build the kernel. > > The trace is here: >

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Peter Zijlstra wrote: > > But yes please! OK, sorry for the delay, had forgotten to re-enable -pg for perf in the makefile when I applied your patch so had to re-build the kernel. The trace is here: www.eece.maine.edu/~vweaver/junk/pzbug.out.bz2 No analysis so

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 10:27:45AM -0400, Vince Weaver wrote: > On Thu, 1 May 2014, Vince Weaver wrote: > > > On Wed, 30 Apr 2014, Peter Zijlstra wrote: > > > > > Vince, could you add the below to whatever tracing muck you already > > > have? > > and this might be what you're looking for. This

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Vince Weaver wrote: > On Wed, 30 Apr 2014, Peter Zijlstra wrote: > > > Vince, could you add the below to whatever tracing muck you already > > have? and this might be what you're looking for. This is with a different random seed than the one I've used for other traces,

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Wed, 30 Apr 2014, Peter Zijlstra wrote: > Vince, could you add the below to whatever tracing muck you already > have? OK, running with your patch, I get this message a few times. No crashing or memory corruption messages, but as I've said before that only happens maybe 10% of the time,

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Thomas Gleixner wrote: > On Thu, 1 May 2014, Peter Zijlstra wrote: > > On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: > > > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: > > > > And that's the issue which puzzles us. Let's look at what we

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Peter Zijlstra wrote: > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: > > And that's the issue which puzzles us. Let's look at what we expect: > > > > Now the trace shows a different story: > > > > perf_fuzzer-4387 [001] 1802.628659: sys_enter:

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:35:02PM +0200, Thomas Gleixner wrote: > > grep ptr=0x880118fda000 bug.out | less > > > > We find lovely bits such as: > > > > perf_fuzzer-4387 [001] 1773.427175: kmalloc: > > (perf_event_alloc+0x5a) call_site=8113a8fa

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Peter Zijlstra wrote: > On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: > > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: > > > And that's the issue which puzzles us. Let's look at what we expect: > > > > > > Now the trace shows a different

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: > > And that's the issue which puzzles us. Let's look at what we expect: > > > > Now the trace shows a different story: > > > > perf_fuzzer-4387 [001]

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: > And that's the issue which puzzles us. Let's look at what we expect: > > Now the trace shows a different story: > > perf_fuzzer-4387 [001] 1802.628659: sys_enter:NR 298 > (69bb58, 0, , 12, 0, 0)

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: And that's the issue which puzzles us. Let's look at what we expect: Now the trace shows a different story: perf_fuzzer-4387 [001] 1802.628659: sys_enter:NR 298 (69bb58, 0, , 12, 0, 0) That's a

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: And that's the issue which puzzles us. Let's look at what we expect: Now the trace shows a different story: perf_fuzzer-4387 [001] 1802.628659:

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: And that's the issue which puzzles us. Let's look at what we expect: Now the trace shows a different story:

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 02:35:02PM +0200, Thomas Gleixner wrote: grep ptr=0x880118fda000 bug.out | less We find lovely bits such as: perf_fuzzer-4387 [001] 1773.427175: kmalloc: (perf_event_alloc+0x5a) call_site=8113a8fa ptr=0x880118fda000

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: And that's the issue which puzzles us. Let's look at what we expect: Now the trace shows a different story: perf_fuzzer-4387 [001] 1802.628659: sys_enter: NR

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Thomas Gleixner wrote: On Thu, 1 May 2014, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote: On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote: And that's the issue which puzzles us. Let's look at what we expect:

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Wed, 30 Apr 2014, Peter Zijlstra wrote: Vince, could you add the below to whatever tracing muck you already have? OK, running with your patch, I get this message a few times. No crashing or memory corruption messages, but as I've said before that only happens maybe 10% of the time, let

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Vince Weaver wrote: On Wed, 30 Apr 2014, Peter Zijlstra wrote: Vince, could you add the below to whatever tracing muck you already have? and this might be what you're looking for. This is with a different random seed than the one I've used for other traces, your

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Peter Zijlstra
On Thu, May 01, 2014 at 10:27:45AM -0400, Vince Weaver wrote: On Thu, 1 May 2014, Vince Weaver wrote: On Wed, 30 Apr 2014, Peter Zijlstra wrote: Vince, could you add the below to whatever tracing muck you already have? and this might be what you're looking for. This is with a

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Peter Zijlstra wrote: But yes please! OK, sorry for the delay, had forgotten to re-enable -pg for perf in the makefile when I applied your patch so had to re-build the kernel. The trace is here: www.eece.maine.edu/~vweaver/junk/pzbug.out.bz2 No analysis so

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Thomas Gleixner
On Thu, 1 May 2014, Vince Weaver wrote: On Thu, 1 May 2014, Peter Zijlstra wrote: But yes please! OK, sorry for the delay, had forgotten to re-enable -pg for perf in the makefile when I applied your patch so had to re-build the kernel. The trace is here:

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
On Thu, 1 May 2014, Thomas Gleixner wrote: Heading out now and postponing the chase for tomorrow morning. Some decoding of the trace. One thing that's possibly unrelated, but on both this and the previous bug the main thread was doing a perf_poll while the bug is triggered. I guess in theory

Re: [perf] more perf_fuzzer memory corruption

2014-05-01 Thread Vince Weaver
OK, humor me a bit here. I'm looking at the buggy trace and comparing against a good trace where the bug doesn't happen. It is a race condition of sorts, because it's just a 10us or so interleaving of calls that causes the bug to happen or not. In the good trace: [parent]

Re: [perf] more perf_fuzzer memory corruption

2014-04-30 Thread Thomas Gleixner
On Wed, 30 Apr 2014, Vince Weaver wrote: > On Wed, 30 Apr 2014, Peter Zijlstra wrote: > > > > > Vince, could you add the below to whatever tracing muck you already > > have? > > > > After staring at your traces all day with Thomas, we have doubts about > > the refcount integrity. > > I've

Re: [perf] more perf_fuzzer memory corruption

2014-04-30 Thread Vince Weaver
On Wed, 30 Apr 2014, Peter Zijlstra wrote: > > Vince, could you add the below to whatever tracing muck you already > have? > > After staring at your traces all day with Thomas, we have doubts about > the refcount integrity. I've been staring at traces all day too. Will try your patch

Re: [perf] more perf_fuzzer memory corruption

2014-04-30 Thread Peter Zijlstra
Vince, could you add the below to whatever tracing muck you already have? After staring at your traces all day with Thomas, we have doubts about the refcount integrity. --- kernel/events/core.c | 146 +-- 1 file changed, 82 insertions(+), 64

Re: [perf] more perf_fuzzer memory corruption

2014-04-30 Thread Vince Weaver
On Wed, 30 Apr 2014, Peter Zijlstra wrote: Vince, could you add the below to whatever tracing muck you already have? After staring at your traces all day with Thomas, we have doubts about the refcount integrity. I've been staring at traces all day too. Will try your patch tomorrow.

Re: [perf] more perf_fuzzer memory corruption

2014-04-30 Thread Thomas Gleixner
On Wed, 30 Apr 2014, Vince Weaver wrote: On Wed, 30 Apr 2014, Peter Zijlstra wrote: Vince, could you add the below to whatever tracing muck you already have? After staring at your traces all day with Thomas, we have doubts about the refcount integrity. I've been staring at

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Vince Weaver
On Tue, 29 Apr 2014, Peter Zijlstra wrote: > Fair point, nope not in that case. If you can trigger this without ever > using .inherit=1 this would exclude a lot of funny code. I don't think inherit is being set, but I'm not actually sure. I will have to add that to the trace_printk() and

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Steven Rostedt
On Tue, 29 Apr 2014 14:21:56 -0400 (EDT) Vince Weaver wrote: > Also trace-cmd is a pain to use. Any suggested events I should trace > beyond the obvious? > > Part of the problem is that despite what the documentation says it doesn't > look like you can combine the "-P pid" and "-c" children

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Steven Rostedt
On Tue, 29 Apr 2014 14:11:09 -0400 (EDT) Vince Weaver wrote: > I've actually given up on source code inspection to figure out what's > going on in kernel/events/core.c. What I do now is write simple test > cases and do an ftrace function trace. The results are often surprising. You might

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Peter Zijlstra
On Tue, Apr 29, 2014 at 02:21:56PM -0400, Vince Weaver wrote: > On Tue, 29 Apr 2014, Peter Zijlstra wrote: > > > > Event #16 is a SW event created and running in the parent on CPU0. > > > > A regular software one, right? Not a timer one. > > Maybe. From traces I have it looks like it's a

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Vince Weaver
On Tue, 29 Apr 2014, Peter Zijlstra wrote: > > Event #16 is a SW event created and running in the parent on CPU0. > > A regular software one, right? Not a timer one. Maybe. From traces I have it looks like it's a regular one (i.e. calls perf_swevent_add() ) but who knows at this point. When

Re: [perf] more perf_fuzzer memory corruption

2014-04-29 Thread Vince Weaver
On Tue, 29 Apr 2014, Peter Zijlstra wrote: > On Mon, Apr 28, 2014 at 10:21:34AM -0400, Vince Weaver wrote: > > so it's looking more and more like this issue is with a > > PERF_COUNT_SW_TASK_CLOCK > > event. > > But they don't actually use the hlist thing.. yes. This turns out into another
