* Vince Weaver wrote:
> On Tue, 6 May 2014, Vince Weaver wrote:
>
> > In any case if we can get the recent patches applied in time for 3.15 I
> > think it will turn out to be a nice release perf-event-stability wise.
>
> I should also mention I have a list of open perf_fuzzer issues here:
>
* Vince Weaver wrote:
> On Mon, 5 May 2014, Ingo Molnar wrote:
> >
> > I'm also thinking about waiting a bit before applying anything even
> > borderline intrusive to the perf core, to make sure there's enough
> > fuzz time to declare stable state (at least as far into the ABI as the
> >
On Tue, May 06, 2014 at 12:57:08PM -0400, Vince Weaver wrote:
> On Mon, 5 May 2014, Vince Weaver wrote:
>
> > On Mon, 5 May 2014, Vince Weaver wrote:
> >
> > > Meanwhile the haswell and AMD machines have been fuzzing away without
> > > issue, I don't know why the core2 machine is always the
On Tue, May 06, 2014 at 12:57:08PM -0400, Vince Weaver wrote:
On Mon, 5 May 2014, Vince Weaver wrote:
On Mon, 5 May 2014, Vince Weaver wrote:
Meanwhile the haswell and AMD machines have been fuzzing away without
issue, I don't know why the core2 machine is always the trouble maker.
* Vince Weaver vincent.wea...@maine.edu wrote:
On Mon, 5 May 2014, Ingo Molnar wrote:
I'm also thinking about waiting a bit before applying anything even
borderline intrusive to the perf core, to make sure there's enough
fuzz time to declare stable state (at least as far into the ABI
* Vince Weaver vincent.wea...@maine.edu wrote:
On Tue, 6 May 2014, Vince Weaver wrote:
In any case if we can get the recent patches applied in time for 3.15 I
think it will turn out to be a nice release perf-event-stability wise.
I should also mention I have a list of open perf_fuzzer
On Tue, 6 May 2014, Vince Weaver wrote:
> In any case if we can get the recent patches applied in time for 3.15 I
> think it will turn out to be a nice release perf-event-stability wise.
I should also mention I have a list of open perf_fuzzer issues here:
On Mon, 5 May 2014, Vince Weaver wrote:
> On Mon, 5 May 2014, Vince Weaver wrote:
>
> > Meanwhile the haswell and AMD machines have been fuzzing away without
> > issue, I don't know why the core2 machine is always the trouble maker.
>
> The haswell has been fuzzing 12 hours with only an NMI
On Mon, 5 May 2014, Vince Weaver wrote:
On Mon, 5 May 2014, Vince Weaver wrote:
Meanwhile the haswell and AMD machines have been fuzzing away without
issue, I don't know why the core2 machine is always the trouble maker.
The haswell has been fuzzing 12 hours with only an NMI
On Tue, 6 May 2014, Vince Weaver wrote:
In any case if we can get the recent patches applied in time for 3.15 I
think it will turn out to be a nice release perf-event-stability wise.
I should also mention I have a list of open perf_fuzzer issues here:
On Mon, 5 May 2014, Ingo Molnar wrote:
>
> I'm also thinking about waiting a bit before applying anything even
> borderline intrusive to the perf core, to make sure there's enough
> fuzz time to declare stable state (at least as far into the ABI as the
> fuzzing is able to reach). Future
On Mon, 5 May 2014, Vince Weaver wrote:
> Meanwhile the haswell and AMD machines have been fuzzing away without
> issue, I don't know why the core2 machine is always the trouble maker.
The haswell has been fuzzing 12 hours with only an NMI dazed/confused
message.
The AMD A10 machine however
On Mon, 5 May 2014, Peter Zijlstra wrote:
> > It looks like it is stuck repeating this forever:
> > perf_fuzzer-5256 [000] 275.943049: kmalloc:
> > (T.1262+0xe) call_site=810d022f ptr=0x8800cb028400
> > bytes_req=216 bytes_alloc=256 gfp_flags=GFP_KERNEL|GFP_ZERO
On Mon, May 05, 2014 at 02:47:32PM -0400, Vince Weaver wrote:
> On Mon, 5 May 2014, Peter Zijlstra wrote:
>
> > Cute.. does the below cure?
> >
> >
> > ---
> > Subject: perf: Fix perf_event_init_context()
> > From: Peter Zijlstra
> > Date: Mon May 5 19:12:20 CEST 2014
> >
> >
On Mon, 5 May 2014, Peter Zijlstra wrote:
> Cute.. does the below cure?
>
>
> ---
> Subject: perf: Fix perf_event_init_context()
> From: Peter Zijlstra
> Date: Mon May 5 19:12:20 CEST 2014
>
> perf_pin_task_context() can return NULL but perf_event_init_context()
> assumes it will not,
* Vince Weaver wrote:
> On Mon, 5 May 2014, Peter Zijlstra wrote:
>
> > Does this one work better? Making sure all __perf_remove_from_context()
> > callers pass the right structure seems to improve things no end. My
> > machine is now happy to reboot again.
>
> Yes, I've been fuzzing this for
On Mon, May 05, 2014 at 01:10:55PM -0400, Vince Weaver wrote:
> On Mon, 5 May 2014, Vince Weaver wrote:
>
> > (Although often things like to crash the instant my tested-by e-mails
> > clear the lkml list.)
>
> This did turn up on the core2 machine. I had been seeing this problem
> earlier but
On Mon, 5 May 2014, Vince Weaver wrote:
> (Although often things like to crash the instant my tested-by e-mails
> clear the lkml list.)
This did turn up on the core2 machine. I had been seeing this problem
earlier but was hoping it was part of the memory corruption issue:
[ 4918.921921] BUG:
On Mon, 5 May 2014, Peter Zijlstra wrote:
> Does this one work better? Making sure all __perf_remove_from_context()
> callers pass the right structure seems to improve things no end. My
> machine is now happy to reboot again.
Yes, I've been fuzzing this for a few hours on both my haswell and
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote:
> On Fri, 2 May 2014, Vince Weaver wrote:
>
> > I've been fuzzing without your additional patch for 6 hours and all looks
> > (almost) good. I can add in your patch and let it fuzz overnight.
>
> and I applied the additional patch,
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Vince Weaver wrote:
I've been fuzzing without your additional patch for 6 hours and all looks
(almost) good. I can add in your patch and let it fuzz overnight.
and I applied the additional patch, installed
On Mon, 5 May 2014, Peter Zijlstra wrote:
Does this one work better? Making sure all __perf_remove_from_context()
callers pass the right structure seems to improve things no end. My
machine is now happy to reboot again.
Yes, I've been fuzzing this for a few hours on both my haswell and core2
On Mon, 5 May 2014, Vince Weaver wrote:
(Although often things like to crash the instant my tested-by e-mails
clear the lkml list.)
This did turn up on the core2 machine. I had been seeing this problem
earlier but was hoping it was part of the memory corruption issue:
[ 4918.921921] BUG:
On Mon, May 05, 2014 at 01:10:55PM -0400, Vince Weaver wrote:
On Mon, 5 May 2014, Vince Weaver wrote:
(Although often things like to crash the instant my tested-by e-mails
clear the lkml list.)
This did turn up on the core2 machine. I had been seeing this problem
earlier but was
* Vince Weaver vincent.wea...@maine.edu wrote:
On Mon, 5 May 2014, Peter Zijlstra wrote:
Does this one work better? Making sure all __perf_remove_from_context()
callers pass the right structure seems to improve things no end. My
machine is now happy to reboot again.
Yes, I've been
On Mon, 5 May 2014, Peter Zijlstra wrote:
Cute.. does the below cure?
---
Subject: perf: Fix perf_event_init_context()
From: Peter Zijlstra pet...@infradead.org
Date: Mon May 5 19:12:20 CEST 2014
perf_pin_task_context() can return NULL but perf_event_init_context()
assumes it will
On Mon, May 05, 2014 at 02:47:32PM -0400, Vince Weaver wrote:
On Mon, 5 May 2014, Peter Zijlstra wrote:
Cute.. does the below cure?
---
Subject: perf: Fix perf_event_init_context()
From: Peter Zijlstra pet...@infradead.org
Date: Mon May 5 19:12:20 CEST 2014
On Mon, 5 May 2014, Peter Zijlstra wrote:
It looks like it is stuck repeating this forever:
perf_fuzzer-5256 [000] 275.943049: kmalloc:
(T.1262+0xe) call_site=810d022f ptr=0x8800cb028400
bytes_req=216 bytes_alloc=256 gfp_flags=GFP_KERNEL|GFP_ZERO
On Mon, 5 May 2014, Vince Weaver wrote:
Meanwhile the haswell and AMD machines have been fuzzing away without
issue, I don't know why the core2 machine is always the trouble maker.
The haswell has been fuzzing 12 hours with only an NMI dazed/confused
message.
The AMD A10 machine however has
On Mon, 5 May 2014, Ingo Molnar wrote:
I'm also thinking about waiting a bit before applying anything even
borderline intrusive to the perf core, to make sure there's enough
fuzz time to declare stable state (at least as far into the ABI as the
fuzzing is able to reach). Future bisection
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote:
> On Fri, 2 May 2014, Vince Weaver wrote:
>
> > I've been fuzzing without your additional patch for 6 hours and all looks
> > (almost) good. I can add in your patch and let it fuzz overnight.
>
> and I applied the additional patch,
On Fri, May 02, 2014 at 11:02:25PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Vince Weaver wrote:
I've been fuzzing without your additional patch for 6 hours and all looks
(almost) good. I can add in your patch and let it fuzz overnight.
and I applied the additional patch, installed
On Fri, 2 May 2014, Vince Weaver wrote:
> I've been fuzzing without your additional patch for 6 hours and all looks
> (almost) good. I can add in your patch and let it fuzz overnight.
and I applied the additional patch, installed the kernel, hit reboot, and
the following happened (this was
On Fri, 2 May 2014, Thomas Gleixner wrote:
> > OK the proper patch has been running the quick reproducer for a bit
> > without triggering the issue, I'll let it run a bit more and then upgrade
> > to full fuzzing.
>
> If you do that, please add the patch below.
I've been fuzzing without your
On Fri, 2 May 2014, Vince Weaver wrote:
> On Fri, 2 May 2014, Thomas Gleixner wrote:
>
> > Hmm, and where comes the WARN_ON in _free_event() from? That's not in
> > Peter's last patch.
>
> ahh, you're right :( My fault. I gave the new patch and the previous
> patch similar names and applied
On Fri, 2 May 2014, Thomas Gleixner wrote:
> Hmm, and where comes the WARN_ON in _free_event() from? That's not in
> Peter's last patch.
ahh, you're right :( My fault. I gave the new patch and the previous
patch similar names and applied the wrong one.
OK the proper patch has been running the
On Fri, 2 May 2014, Vince Weaver wrote:
> On Fri, 2 May 2014, Peter Zijlstra wrote:
>
> > On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
> > > On Fri, 2 May 2014, Peter Zijlstra wrote:
> > >
> > > > In principle the vfs file refcounting should be responsible for that.
> > > > But
On Fri, 2 May 2014, Peter Zijlstra wrote:
> On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
> > On Fri, 2 May 2014, Peter Zijlstra wrote:
> >
> > > In principle the vfs file refcounting should be responsible for that.
> > > But I'll go over it in a bit.
> >
> > The poll code is
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
> On Fri, 2 May 2014, Peter Zijlstra wrote:
>
> > In principle the vfs file refcounting should be responsible for that.
> > But I'll go over it in a bit.
>
> The poll code is ancient and the C-parser in my head really can't handle
>
On Fri, May 02, 2014 at 01:06:52PM -0400, Vince Weaver wrote:
> On Fri, 2 May 2014, Peter Zijlstra wrote:
> >
> > Can you give this a spin?
> >
> > ---
> > Subject: perf: Fix race in removing an event
> > From: Peter Zijlstra
> > Date: Fri May 2 16:56:01 CEST 2014
>
> Nope, still shows the bug
On Fri, 2 May 2014, Peter Zijlstra wrote:
>
> Can you give this a spin?
>
> ---
> Subject: perf: Fix race in removing an event
> From: Peter Zijlstra
> Date: Fri May 2 16:56:01 CEST 2014
Nope, still shows the bug pretty quickly:
[ 210.411542] [ cut here ]
[
On Fri, 2 May 2014, Peter Zijlstra wrote:
> In principle the vfs file refcounting should be responsible for that.
> But I'll go over it in a bit.
The poll code is ancient and the C-parser in my head really can't handle
it very well.
Anyway for completeness this is the kind of thing I'm seeing.
On Fri, May 02, 2014 at 12:22:30PM -0400, Vince Weaver wrote:
>
> I'll try the patch next.
>
> Meanwhile, can polling on a closed event cause problems with the reference
> count?
>
> In my various failure traces there's always been a poll() active at the
> time of crash, and I added some
I'll try the patch next.
Meanwhile, can polling on a closed event cause problems with the reference
count?
In my various failure traces there's always been a poll() active at the
time of crash, and I added some trace_printk()s and it looks like poll is
at least attempting to poll on the
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote:
> It is a race condition of sorts, because it's just a 10us or so
> interleaving of calls that causes the bug to happen or not.
>
> In the good trace:
>
> [parent] __perf_event_task_sched_out (and hence perf_swevent_del)
>
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote:
>
> OK, humor me a bit here.
>
> I'm looking at the buggy trace and comparing against a "good" trace where
> the bug doesn't happen.
>
> It is a race condition of sorts, because it's just a 10us or so
> interleaving of calls that
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote:
OK, humor me a bit here.
I'm looking at the buggy trace and comparing against a good trace where
the bug doesn't happen.
It is a race condition of sorts, because it's just a 10us or so
interleaving of calls that causes the
On Thu, May 01, 2014 at 02:49:01PM -0400, Vince Weaver wrote:
It is a race condition of sorts, because it's just a 10us or so
interleaving of calls that causes the bug to happen or not.
In the good trace:
[parent] __perf_event_task_sched_out (and hence perf_swevent_del)
I'll try the patch next.
Meanwhile, can polling on a closed event cause problems with the reference
count?
In my various failure traces there's always been a poll() active at the
time of crash, and I added some trace_printk()s and it looks like poll is
at least attempting to poll on the
On Fri, May 02, 2014 at 12:22:30PM -0400, Vince Weaver wrote:
I'll try the patch next.
Meanwhile, can polling on a closed event cause problems with the reference
count?
In my various failure traces there's always been a poll() active at the
time of crash, and I added some
On Fri, 2 May 2014, Peter Zijlstra wrote:
In principle the vfs file refcounting should be responsible for that.
But I'll go over it in a bit.
The poll code is ancient and the C-parser in my head really can't handle
it very well.
Anyway for completeness this is the kind of thing I'm seeing.
On Fri, 2 May 2014, Peter Zijlstra wrote:
Can you give this a spin?
---
Subject: perf: Fix race in removing an event
From: Peter Zijlstra pet...@infradead.org
Date: Fri May 2 16:56:01 CEST 2014
Nope, still shows the bug pretty quickly:
[ 210.411542] [ cut here ]
On Fri, May 02, 2014 at 01:06:52PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Peter Zijlstra wrote:
Can you give this a spin?
---
Subject: perf: Fix race in removing an event
From: Peter Zijlstra pet...@infradead.org
Date: Fri May 2 16:56:01 CEST 2014
Nope, still shows the
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Peter Zijlstra wrote:
In principle the vfs file refcounting should be responsible for that.
But I'll go over it in a bit.
The poll code is ancient and the C-parser in my head really can't handle
it very
On Fri, 2 May 2014, Peter Zijlstra wrote:
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Peter Zijlstra wrote:
In principle the vfs file refcounting should be responsible for that.
But I'll go over it in a bit.
The poll code is ancient and the
On Fri, 2 May 2014, Vince Weaver wrote:
On Fri, 2 May 2014, Peter Zijlstra wrote:
On Fri, May 02, 2014 at 12:43:17PM -0400, Vince Weaver wrote:
On Fri, 2 May 2014, Peter Zijlstra wrote:
In principle the vfs file refcounting should be responsible for that.
But I'll go over it in
On Fri, 2 May 2014, Thomas Gleixner wrote:
Hmm, and where comes the WARN_ON in _free_event() from? That's not in
Peter's last patch.
ahh, you're right :( My fault. I gave the new patch and the previous
patch similar names and applied the wrong one.
OK the proper patch has been running the
On Fri, 2 May 2014, Vince Weaver wrote:
On Fri, 2 May 2014, Thomas Gleixner wrote:
Hmm, and where comes the WARN_ON in _free_event() from? That's not in
Peter's last patch.
ahh, you're right :( My fault. I gave the new patch and the previous
patch similar names and applied the wrong
On Fri, 2 May 2014, Thomas Gleixner wrote:
OK the proper patch has been running the quick reproducer for a bit
without triggering the issue, I'll let it run a bit more and then upgrade
to full fuzzing.
If you do that, please add the patch below.
I've been fuzzing without your
On Fri, 2 May 2014, Vince Weaver wrote:
I've been fuzzing without your additional patch for 6 hours and all looks
(almost) good. I can add in your patch and let it fuzz overnight.
and I applied the additional patch, installed the kernel, hit reboot, and
the following happened (this was
OK, with the following patch I've been running the problem test case for
an hour without triggering the bug.
I'm sure this is the wrong fix (maybe patching over the problem instead of
fixing the root cause), but it works for me.
It looks like this whole mess got introduced with 76e1d9047 in
OK, humor me a bit here.
I'm looking at the buggy trace and comparing against a "good" trace where
the bug doesn't happen.
It is a race condition of sorts, because it's just a 10us or so
interleaving of calls that causes the bug to happen or not.
In the good trace:
[parent]
On Thu, 1 May 2014, Thomas Gleixner wrote:
> Heading out now and postponing the chase for tomorrow morning.
Some decoding of the trace.
One thing that's possibly unrelated, but on both this and the previous
bug the main thread was doing a "perf_poll" while the bug is triggered.
I guess in
On Thu, 1 May 2014, Vince Weaver wrote:
> On Thu, 1 May 2014, Peter Zijlstra wrote:
> >
> > But yes please!
>
> OK, sorry for the delay, had forgotten to re-enable -pg for perf in the
> makefile when I applied your patch so had to re-build the kernel.
>
> The trace is here:
>
On Thu, 1 May 2014, Peter Zijlstra wrote:
>
> But yes please!
OK, sorry for the delay, had forgotten to re-enable -pg for perf in the
makefile when I applied your patch so had to re-build the kernel.
The trace is here:
www.eece.maine.edu/~vweaver/junk/pzbug.out.bz2
No analysis so
On Thu, May 01, 2014 at 10:27:45AM -0400, Vince Weaver wrote:
> On Thu, 1 May 2014, Vince Weaver wrote:
>
> > On Wed, 30 Apr 2014, Peter Zijlstra wrote:
> >
> > > Vince, could you add the below to whatever tracing muck you already
> > > have?
>
> and this might be what you're looking for. This
On Thu, 1 May 2014, Vince Weaver wrote:
> On Wed, 30 Apr 2014, Peter Zijlstra wrote:
>
> > Vince, could you add the below to whatever tracing muck you already
> > have?
and this might be what you're looking for. This is with a different
random seed than the one I've used for other traces,
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
> Vince, could you add the below to whatever tracing muck you already
> have?
OK, running with your patch, I get this message a few times. No crashing
or memory corruption messages, but as I've said before that only happens
maybe 10% of the time,
On Thu, 1 May 2014, Thomas Gleixner wrote:
> On Thu, 1 May 2014, Peter Zijlstra wrote:
> > On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
> > > > And that's the issue which puzzles us. Let's look at what we
On Thu, 1 May 2014, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
> > And that's the issue which puzzles us. Let's look at what we expect:
> >
> > Now the trace shows a different story:
> >
> > perf_fuzzer-4387 [001] 1802.628659: sys_enter:
On Thu, May 01, 2014 at 02:35:02PM +0200, Thomas Gleixner wrote:
> > grep ptr=0x880118fda000 bug.out | less
> >
> > We find lovely bits such as:
> >
> > perf_fuzzer-4387 [001] 1773.427175: kmalloc:
> > (perf_event_alloc+0x5a) call_site=8113a8fa
On Thu, 1 May 2014, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
> > On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
> > > And that's the issue which puzzles us. Let's look at what we expect:
> > >
> > > Now the trace shows a different
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
> > And that's the issue which puzzles us. Let's look at what we expect:
> >
> > Now the trace shows a different story:
> >
> > perf_fuzzer-4387 [001]
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
> And that's the issue which puzzles us. Let's look at what we expect:
>
> Now the trace shows a different story:
>
> perf_fuzzer-4387 [001] 1802.628659: sys_enter:NR 298
> (69bb58, 0, , 12, 0, 0)
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
And that's the issue which puzzles us. Let's look at what we expect:
Now the trace shows a different story:
perf_fuzzer-4387 [001] 1802.628659: sys_enter:NR 298
(69bb58, 0, , 12, 0, 0)
That's a
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
And that's the issue which puzzles us. Let's look at what we expect:
Now the trace shows a different story:
perf_fuzzer-4387 [001] 1802.628659:
On Thu, 1 May 2014, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
And that's the issue which puzzles us. Let's look at what we expect:
Now the trace shows a different story:
On Thu, May 01, 2014 at 02:35:02PM +0200, Thomas Gleixner wrote:
grep ptr=0x880118fda000 bug.out | less
We find lovely bits such as:
perf_fuzzer-4387 [001] 1773.427175: kmalloc:
(perf_event_alloc+0x5a) call_site=8113a8fa ptr=0x880118fda000
On Thu, 1 May 2014, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
And that's the issue which puzzles us. Let's look at what we expect:
Now the trace shows a different story:
perf_fuzzer-4387 [001] 1802.628659: sys_enter:
NR
On Thu, 1 May 2014, Thomas Gleixner wrote:
On Thu, 1 May 2014, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:26:02PM +0200, Peter Zijlstra wrote:
On Thu, May 01, 2014 at 12:51:33AM +0200, Thomas Gleixner wrote:
And that's the issue which puzzles us. Let's look at what we expect:
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
Vince, could you add the below to whatever tracing muck you already
have?
OK, running with your patch, I get this message a few times. No crashing
or memory corruption messages, but as I've said before that only happens
maybe 10% of the time, let
On Thu, 1 May 2014, Vince Weaver wrote:
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
Vince, could you add the below to whatever tracing muck you already
have?
and this might be what you're looking for. This is with a different
random seed than the one I've used for other traces, your
On Thu, May 01, 2014 at 10:27:45AM -0400, Vince Weaver wrote:
On Thu, 1 May 2014, Vince Weaver wrote:
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
Vince, could you add the below to whatever tracing muck you already
have?
and this might be what you're looking for. This is with a
On Thu, 1 May 2014, Peter Zijlstra wrote:
But yes please!
OK, sorry for the delay, had forgotten to re-enable -pg for perf in the
makefile when I applied your patch so had to re-build the kernel.
The trace is here:
www.eece.maine.edu/~vweaver/junk/pzbug.out.bz2
No analysis so
On Thu, 1 May 2014, Vince Weaver wrote:
On Thu, 1 May 2014, Peter Zijlstra wrote:
But yes please!
OK, sorry for the delay, had forgotten to re-enable -pg for perf in the
makefile when I applied your patch so had to re-build the kernel.
The trace is here:
On Thu, 1 May 2014, Thomas Gleixner wrote:
Heading out now and postponing the chase for tomorrow morning.
Some decoding of the trace.
One thing that's possibly unrelated, but on both this and the previous
bug the main thread was doing a perf_poll while the bug is triggered.
I guess in theory
OK, humor me a bit here.
I'm looking at the buggy trace and comparing against a good trace where
the bug doesn't happen.
It is a race condition of sorts, because it's just a 10us or so
interleaving of calls that causes the bug to happen or not.
In the good trace:
[parent]
OK, with the following patch I've been running the problem test case for
an hour without triggering the bug.
I'm sure this is the wrong fix (maybe patching over the problem instead of
fixing the root cause), but it works for me.
It looks like this whole mess got introduced with 76e1d9047 in
On Wed, 30 Apr 2014, Vince Weaver wrote:
> On Wed, 30 Apr 2014, Peter Zijlstra wrote:
>
> >
> > Vince, could you add the below to whatever tracing muck you already
> > have?
> >
> > After staring at your traces all day with Thomas, we have doubts about
> > the refcount integrity.
>
> I've
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
>
> Vince, could you add the below to whatever tracing muck you already
> have?
>
> After staring at your traces all day with Thomas, we have doubts about
> the refcount integrity.
I've been staring at traces all day too. Will try your patch
Vince, could you add the below to whatever tracing muck you already
have?
After staring at your traces all day with Thomas, we have doubts about
the refcount integrity.
---
kernel/events/core.c | 146 +--
1 file changed, 82 insertions(+), 64
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
Vince, could you add the below to whatever tracing muck you already
have?
After staring at your traces all day with Thomas, we have doubts about
the refcount integrity.
I've been staring at traces all day too. Will try your patch tomorrow.
On Wed, 30 Apr 2014, Vince Weaver wrote:
On Wed, 30 Apr 2014, Peter Zijlstra wrote:
Vince, could you add the below to whatever tracing muck you already
have?
After staring at your traces all day with Thomas, we have doubts about
the refcount integrity.
I've been staring at
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> Fair point, nope not in that case. If you can trigger this without ever
> using .inherit=1 this would exclude a lot of funny code.
I don't think inherit is being set, but I'm not actually sure.
I will have to add that to the trace_printk() and
On Tue, 29 Apr 2014 14:21:56 -0400 (EDT)
Vince Weaver wrote:
> Also trace-cmd is a pain to use. Any suggested events I should trace
> beyond the obvious?
>
> Part of the problem is that despite what the documentation says it doesn't
> look like you can combine the "-P pid" and "-c" children
On Tue, 29 Apr 2014 14:11:09 -0400 (EDT)
Vince Weaver wrote:
> I've actually given up on source code inspection to figure out what's
> going on in kernel/events/core.c. What I do now is write simple test
> cases and do an ftrace function trace. The results are often surprising.
You might
On Tue, Apr 29, 2014 at 02:21:56PM -0400, Vince Weaver wrote:
> On Tue, 29 Apr 2014, Peter Zijlstra wrote:
>
> > > Event #16 is a SW event created and running in the parent on CPU0.
> >
> > A regular software one, right? Not a timer one.
>
> Maybe. From traces I have it looks like it's a
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> > Event #16 is a SW event created and running in the parent on CPU0.
>
> A regular software one, right? Not a timer one.
Maybe. From traces I have it looks like it's a regular one (i.e. calls
perf_swevent_add() ) but who knows at this point.
When
On Tue, 29 Apr 2014, Peter Zijlstra wrote:
> On Mon, Apr 28, 2014 at 10:21:34AM -0400, Vince Weaver wrote:
> > so it's looking more and more like this issue is with a
> > PERF_COUNT_SW_TASK_CLOCK
> > event.
>
> But they don't actually use the hlist thing..
yes.
This turns out into another