Re: Yet more softlockups.

2013-07-12 Thread Vince Weaver
On Fri, 12 Jul 2013, Dave Jones wrote:

> 
> Here's a fun trick:
> 
> trinity -c perf_event_open -C4 -q -l off
> 
> Within about a minute, that brings any of my boxes to its knees.
> The softlockup detector starts going nuts, and then the box wedges solid.

Are you running with this patch?

 [PATCH 1/2] perf: Clone child context from parent context pmu

 https://lkml.org/lkml/2013/7/9/310

It hasn't hit Linus' tree yet.

It fixes a bug that my perf_fuzzer would hit within seconds but it took me 
over a month of trace bisecting and kernel bisecting to isolate it.

The symptoms were stuck processes and NMI tracebacks leading to hard 
locks.  

With the patch applied my perf_fuzzer (which uses the same perf_event_open 
syscall generator as trinity) runs for hours w/o problems.

Vince


Re: Yet more softlockups.

2013-07-12 Thread Dave Hansen
I added a WARN_ONCE() at the first point where we enable a perf event. It
turns out the watchdog code uses perf these days:

> [1.003260] [ cut here ]
> [1.007943] WARNING: at 
> /home/davehans/linux.git/arch/x86/kernel/cpu/perf_event.c:471 
> x86_pmu_event_init+0x249/0x430()
> [1.018906] Modules linked in:
> [1.021999] CPU: 0 PID: 11 Comm: watchdog/0 Not tainted 
> 3.10.0-05973-gd2b4a64-dirty #190
> [1.030313] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS 
> PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [1.041093]  819dc0b8 881fd33e3c98 816270c5 
> 1a9d
> [1.048601]  0009 881fd33e3cd8 81090f50 
> 881fd3248048
> [1.056112]  881fff26a800 881fff26a800  
> 
> [1.063621] Call Trace:
> [1.066106]  [] dump_stack+0x55/0x76
> [1.071313]  [] warn_slowpath_common+0x70/0xa0
> [1.077410]  [] warn_slowpath_null+0x1a/0x20
> [1.083330]  [] x86_pmu_event_init+0x249/0x430
> [1.089427]  [] perf_init_event+0xef/0x130
> [1.095170]  [] perf_event_alloc+0x358/0x4a0
> [1.101091]  [] ? set_next_entity+0x3d/0x80
> [1.106924]  [] ? touch_nmi_watchdog+0x80/0x80
> [1.113018]  [] 
> perf_event_create_kernel_counter+0x2e/0xe0
> [1.120176]  [] watchdog_enable+0xfd/0x1e0
> [1.125919]  [] smpboot_thread_fn+0xe3/0x1f0
> [1.131839]  [] ? 
> smpboot_unregister_percpu_thread+0x50/0x50
> [1.139250]  [] kthread+0xc0/0xd0
> [1.144200]  [] ? flush_kthread_worker+0xb0/0xb0
> [1.150474]  [] ret_from_fork+0x7c/0xb0
> [1.155950]  [] ? flush_kthread_worker+0xb0/0xb0
> [1.162227] ---[ end trace 32c04e859824e00d ]---

static int watchdog_nmi_enable(unsigned int cpu)
{
	...
	/* Try to register using hardware perf events */
	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, ...

That at least solves the boot-time mystery.
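For completeness, the event the watchdog registers is just a pinned hardware
cycles counter, so every CPU ends up with an active perf event as soon as the
watchdog threads come up. Roughly (paraphrased from kernel/watchdog.c of that
era, not a verbatim quote):

static struct perf_event_attr wd_hw_attr = {
	.type		= PERF_TYPE_HARDWARE,
	.config		= PERF_COUNT_HW_CPU_CYCLES,
	.size		= sizeof(struct perf_event_attr),
	.pinned		= 1,
	.disabled	= 1,
};

/* in watchdog_nmi_enable():
 *	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
 *	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
 *						 watchdog_overflow_callback, NULL);
 */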


Re: Yet more softlockups.

2013-07-12 Thread Dave Hansen
On 07/12/2013 11:07 AM, David Ahern wrote:
> And Dave Hansen: I think nmi.c has the same do_div problem as
> kernel/events/core.c that Stephane fixed. Your patch has:
> 
> whole_msecs = do_div(delta, (1000 * 1000));
> decimal_msecs = do_div(delta, 1000) % 1000;

Yup.  There should be fixes in the x86 tip tree for both of these now.


Re: Yet more softlockups.

2013-07-12 Thread David Ahern

On 7/12/13 11:50 AM, Dave Jones wrote:


Given that you can run trinity long enough to hit this, however, it makes me
think you won't be able to trigger the bug I'm talking about.
Perhaps virtualised perf counters are somehow immune to this problem, because
on bare metal it literally takes seconds.


I can generate the message; it just does not lock up:

...
[  131.837129] perf samples too long (35849 > 35714), lowering 
kernel.perf_event_max_sample_rate to 4000

...
[  237.271170] INFO: NMI handler (perf_event_nmi_handler) took too long 
to run: 57900.001 msecs



And Dave Hansen: I think nmi.c has the same do_div problem as 
kernel/events/core.c that Stephane fixed. Your patch has:


whole_msecs = do_div(delta, (1000 * 1000));
decimal_msecs = do_div(delta, 1000) % 1000;
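For reference, do_div() divides its first argument in place and evaluates to
the remainder, so the return value above is the remainder rather than the
quotient. A minimal sketch of the presumably intended computation (variable
names assumed, not taken from the actual patch):

/* Sketch only: do_div(n, base) updates n to n / base and evaluates to
 * n % base.  Assuming 'delta' holds the handler duration in nanoseconds: */
u64 whole_msecs = delta;
u32 remainder_ns = do_div(whole_msecs, 1000 * 1000);	/* whole_msecs is now delta / 1e6 */
u32 decimal_msecs = remainder_ns / 1000;		/* three-digit fractional part */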

David


Re: Yet more softlockups.

2013-07-12 Thread Dave Jones
On Fri, Jul 12, 2013 at 11:40:06AM -0600, David Ahern wrote:
 > On 7/12/13 11:18 AM, Dave Jones wrote:
 > > On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
 > >   > On 7/12/13 9:45 AM, Dave Jones wrote:
 > >   > > Here's a fun trick:
 > >   > >
 > >   > > trinity -c perf_event_open -C4 -q -l off
 > >   > >
 > 
 > In _get_address, case 8 must be happening a lot and I don't see a free 
 > when that address comes from malloc. Perhaps all of the rand() calls are 
 > breaking down in the VM. If I change that case from malloc to something 
 > static - like page_rand - memory stays flat.

Oh right, yeah, that's a known bug. I forgot about the ARG_ADDRESS case.
Another hacky workaround is to bound the trinity runs with -N100 or something
and do that in a shell loop, so the memory gets freed up on exit.

I need to find the time to add proper callbacks to release allocated memory.

Given that you can run trinity long enough to hit this, however, it makes me
think you won't be able to trigger the bug I'm talking about.
Perhaps virtualised perf counters are somehow immune to this problem, because
on bare metal it literally takes seconds.

Dave



Re: Yet more softlockups.

2013-07-12 Thread David Ahern

On 7/12/13 11:18 AM, Dave Jones wrote:

On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
  > On 7/12/13 9:45 AM, Dave Jones wrote:
  > > Here's a fun trick:
  > >
  > > trinity -c perf_event_open -C4 -q -l off
  > >
  > > Within about a minute, that brings any of my boxes to its knees.
  > > The softlockup detector starts going nuts, and then the box wedges solid.
  >
  > I tried that in a VM running latest Linus tree. I see trinity children
  > getting nuked regularly from oom.

Weird. I'm curious what the backtrace looks like in those cases.
Where is it trying to allocate memory ?
(Though that usually isn't too helpful, but in the absence of
  anything else..)


(gdb) bt
#0  0x00313b27f3e0 in malloc () from /lib64/libc.so.6
#1  0x00404405 in _get_address (null_allowed=null_allowed@entry=1 '\001') at generic-sanitise.c:151
#2  0x004044ca in get_address () at generic-sanitise.c:182
#3  0x004052a0 in fill_arg (childno=<optimized out>, call=call@entry=298, argnum=argnum@entry=1) at generic-sanitise.c:415
#4  0x0040548d in generic_sanitise (childno=childno@entry=0) at generic-sanitise.c:615
#5  0x00405620 in mkcall (childno=childno@entry=0) at syscall.c:131
#6  0x00407d85 in child_process () at child.c:219
#7  0x004073ad in fork_children () at main.c:103
#8  main_loop () at main.c:308
#9  do_main_loop () at main.c:342
#10 0x0040253a in main (argc=<optimized out>, argv=<optimized out>) at trinity.c:180



In _get_address, case 8 must be happening a lot and I don't see a free 
when that address comes from malloc. Perhaps all of the rand() calls are 
breaking down in the VM. If I change that case from malloc to something 
static - like page_rand - memory stays flat.
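A minimal sketch of the pattern I mean (simplified; this is not trinity's
actual code, and the case number and names here are assumptions):

#include <stdlib.h>

#define PAGE_SIZE 4096
static char page_rand[PAGE_SIZE];	/* stand-in for trinity's page_rand */

/* One of the random cases returns freshly malloc()ed memory, the address
 * is only ever used as a throwaway syscall argument, and nothing frees it,
 * so VmData grows on every hit. */
static void *get_random_address(void)
{
	switch (rand() % 9) {
	case 8:
		return malloc(PAGE_SIZE);	/* leaked */
	default:
		return page_rand;		/* static buffer: memory stays flat */
	}
}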


David


Re: Yet more softlockups.

2013-07-12 Thread Dave Jones
On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
 > On 7/12/13 9:45 AM, Dave Jones wrote:
 > > Here's a fun trick:
 > >
 > > trinity -c perf_event_open -C4 -q -l off
 > >
 > > Within about a minute, that brings any of my boxes to its knees.
 > > The softlockup detector starts going nuts, and then the box wedges solid.
 > 
 > I tried that in a VM running latest Linus tree. I see trinity children 
 > getting nuked regularly from oom.

Weird. I'm curious what the backtrace looks like in those cases.
Where is it trying to allocate memory ?
(Though that usually isn't too helpful, but in the absence of
 anything else..)

 > I was dumping Vm elements using:
 > 
 > while [ 1 ]; do echo $(date) $(egrep Vm /proc/$pid/status); sleep 1; done
 > 
 > And right before the process is killed was the line:
 > 
 > Fri Jul 12 11:00:19 MDT 2013 VmPeak: 2867472 kB VmSize: 2867472 kB 
 > VmLck: 0 kB VmPin: 0 kB VmHWM: 1493092 kB VmRSS: 1493092 kB VmData: 
 > 2857944 kB VmStk: 136 kB VmExe: 100 kB VmLib: 1844 kB VmPTE: 5628 kB 
 > VmSwap: 0 kB
 > 
 > The VmData is growing fairly steadily and strace shows a lot of brk 
 > calls. Is that normal for trinity - or this command line?
 
Hmm, there are a few known leaks in trinity that I still haven't got
around to fixing, but none that should be triggered from this.

I wonder if something about being virtualised is perturbing its behaviour
somehow. Though I can't think what would matter. 

 > Looking at the perf_event_open calls I see a lot of E2BIG errors in 
 > addition to EINVAL. e.g,
 > 
 > ...
 > perf_event_open(0xba9000, 0, 0x4c, 0xcc, 0) = -1 EINVAL (Invalid argument)
 > alarm(0)= 1
 > getppid()   = 9031
 > alarm(1)= 0
 > perf_event_open(0xba9000, 0x2a6e, 0xe, 0xfd, 0) = -1 E2BIG (Argument 
 > list too long)
 > alarm(0)= 1
 > getppid()   = 9031
 > alarm(1)= 0

Yep, that'll happen.
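Both are expected from a fuzzer: trinity passes random pointers and values as
the perf_event_attr, and E2BIG generally means the attr.size the kernel read
was larger than the structure it knows about. For contrast, a hedged sketch of
a minimal well-formed call (field values are illustrative, not taken from this
thread):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);	/* the kernel wants a sane .size; a bogus, oversized one is one road to -E2BIG */
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;

	/* count cycles for the calling thread on any CPU */
	return sys_perf_event_open(&attr, 0, -1, -1, 0) < 0;
}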

Perhaps try CONFIG_MEMLEAK ?

Dave



Re: Yet more softlockups.

2013-07-12 Thread David Ahern

On 7/12/13 9:45 AM, Dave Jones wrote:

Here's a fun trick:

trinity -c perf_event_open -C4 -q -l off

Within about a minute, that brings any of my boxes to its knees.
The softlockup detector starts going nuts, and then the box wedges solid.


I tried that in a VM running latest Linus tree. I see trinity children 
getting nuked regularly from oom.


I was dumping Vm elements using:

while [ 1 ]; do echo $(date) $(egrep Vm /proc/$pid/status); sleep 1; done

And right before the process is killed was the line:

Fri Jul 12 11:00:19 MDT 2013 VmPeak: 2867472 kB VmSize: 2867472 kB 
VmLck: 0 kB VmPin: 0 kB VmHWM: 1493092 kB VmRSS: 1493092 kB VmData: 
2857944 kB VmStk: 136 kB VmExe: 100 kB VmLib: 1844 kB VmPTE: 5628 kB 
VmSwap: 0 kB


The VmData is growing fairly steadily and strace shows a lot of brk 
calls. Is that normal for trinity - or this command line?


Looking at the perf_event_open calls I see a lot of E2BIG errors in 
addition to EINVAL. e.g,


...
perf_event_open(0xba9000, 0, 0x4c, 0xcc, 0) = -1 EINVAL (Invalid argument)
alarm(0)= 1
getppid()   = 9031
alarm(1)= 0
perf_event_open(0xba9000, 0x2a6e, 0xe, 0xfd, 0) = -1 E2BIG (Argument 
list too long)

alarm(0)= 1
getppid()   = 9031
alarm(1)= 0
...

David


Re: Yet more softlockups.

2013-07-12 Thread Dave Jones
On Fri, Jul 12, 2013 at 08:55:31AM -0700, Dave Hansen wrote:

 > I mean that somebody turned 'active_events' on without actually wanting
 > perf to be on.  I'd be curious how it got set to something nonzero.
 > Could you stick a WARN_ONCE() or printk_ratelimit() on the three sites
 > that modify it?
 
I'll add to my list of things to get to, but it probably won't be until 
post-weekend.

 > > Here's a fun trick:
 > > 
 > > trinity -c perf_event_open -C4 -q -l off
 > > 
 > > Within about a minute, that brings any of my boxes to its knees.
 > > The softlockup detector starts going nuts, and then the box wedges solid.
 > 
 > *But* dropping the perf sample rate has been really effective at keeping
 > me from hitting it, and I've had to use _lots_ of CPUs (60-160) doing
 > those NMIs at once to trigger the lockups.
 > 
 > Being able to trigger it with so few CPUs is interesting though.  I'll
 > try on my hardware.

I hacked trinity to pause for 24s after each call, and changed the kernel
to taint on lockup (so that trinity would immediately stop).
My hope was to find the combination of perf_event_open calls that triggered this.
After 12 hours, it's still ticking along. It's now done about a thousand
more calls than is usually necessary to hit the bug, which makes me wonder
if this is somehow timing related.

Perhaps it's also worth noting that the majority of the calls trinity makes will return -EINVAL.

Dave



Re: Yet more softlockups.

2013-07-12 Thread Dave Hansen
On 07/12/2013 08:45 AM, Dave Jones wrote:
> On Fri, Jul 12, 2013 at 08:38:52AM -0700, Dave Hansen wrote:
>  > Dave, for your case, my suspicion would be that it got turned on
>  > inadvertently, or that we somehow have a bug which bumped up
>  > perf_event.c's 'active_events' and we're running some perf code that we
>  > don't have to.
>  
> What do you mean by 'inadvertently'? I see this during bootup every time,
> unless systemd or something has started playing with perf (which AFAIK it
> isn't).

I mean that somebody turned 'active_events' on without actually wanting
perf to be on.  I'd be curious how it got set to something nonzero.
Could you stick a WARN_ONCE() or printk_ratelimit() on the three sites
that modify it?
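Something like the following is what I have in mind (a sketch only, not a
tested patch; the exact call sites in arch/x86/kernel/cpu/perf_event.c would
need checking):

/* Sketch: warn once, with a backtrace, the first time active_events
 * goes nonzero, so whoever creates the first perf event shows up. */
static void note_active_events_bump(void)
{
	WARN_ONCE(atomic_read(&active_events) > 0,
		  "perf: active_events bumped by %s (pid %d)\n",
		  current->comm, current->pid);
}

/* ...and call note_active_events_bump() right after each site that does
 * atomic_inc(&active_events). */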

>  > But, I'm suspicious.  I was having all kinds of issues with perf and
>  > NMIs taking hundreds of milliseconds.  I never isolated it to having a
>  > real, single, cause.  I attributed it to my large NUMA system just being
>  > slow.  Your description makes me wonder what I missed, though.
> 
> Here's a fun trick:
> 
> trinity -c perf_event_open -C4 -q -l off
> 
> Within about a minute, that brings any of my boxes to its knees.
> The softlockup detector starts going nuts, and then the box wedges solid.

On my box, the same happens with 'perf top'. ;)

*But* dropping the perf sample rate has been really effective at keeping
me from hitting it, and I've had to use _lots_ of CPUs (60-160) doing
those NMIs at once to trigger the lockups.

Being able to trigger it with so few CPUs is interesting though.  I'll
try on my hardware.


Re: Yet more softlockups.

2013-07-12 Thread Dave Jones
On Fri, Jul 12, 2013 at 08:38:52AM -0700, Dave Hansen wrote:
 
 > The warning comes from calling perf_sample_event_took(), which is only
 > called from one place: perf_event_nmi_handler().
 > 
 > So we can be pretty sure that the perf NMI is firing, or at least that
 > this handler code is running.
 > 
 > nmi_handle() says:
 > /*
 >  * NMIs are edge-triggered, which means if you have enough
 >  * of them concurrently, you can lose some because only one
 >  * can be latched at any given time.  Walk the whole list
 >  * to handle those situations.
 >  */
 > 
 > perf_event_nmi_handler() probably gets _called_ when the watchdog NMI
 > goes off.  But, it should hit this check:
 > 
 > if (!atomic_read(&active_events))
 > return NMI_DONE;
 > 
 > and return quickly. This is before it has a chance to call
 > perf_sample_event_took().
 > 
 > Dave, for your case, my suspicion would be that it got turned on
 > inadvertently, or that we somehow have a bug which bumped up
 > perf_event.c's 'active_events' and we're running some perf code that we
 > don't have to.
 
What do you mean by 'inadvertently'? I see this during bootup every time,
unless systemd or something has started playing with perf (which AFAIK it
isn't).

 > But, I'm suspicious.  I was having all kinds of issues with perf and
 > NMIs taking hundreds of milliseconds.  I never isolated it to having a
 > real, single, cause.  I attributed it to my large NUMA system just being
 > slow.  Your description makes me wonder what I missed, though.

Here's a fun trick:

trinity -c perf_event_open -C4 -q -l off

Within about a minute, that brings any of my boxes to its knees.
The softlockup detector starts going nuts, and then the box wedges solid.

(You may need to bump -C depending on your CPU count. I've never seen it happen
 with a single process, but -C2 seems to be a minimum)

That *is* using perf though, so I kind of expect bad shit to happen when there 
are bugs.
The "during bootup" case is still a head-scratcher.

Dave



Re: Yet more softlockups.

2013-07-12 Thread Dave Hansen
On 07/12/2013 03:31 AM, Ingo Molnar wrote:
> * Dave Jones  wrote:
>> On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
>>  > On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
>>  > > I get this right after booting..
>>  > > 
>>  > > [  114.516619] perf samples too long (4262 > 2500), lowering 
>> kernel.perf_event_max_sample_rate to 5
>>  > 
>>  > You can disable this warning by:
>>  > 
>>  > echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
>>
>> Yes, but why is this even being run when I'm not running perf ?
>>
>> The only NMI source running should be the watchdog.
> 
> The NMI watchdog is a perf event.
> 
> I've Cc:-ed Dave Hansen, the author of those changes - is this a false 
> positive or some real problem?

The warning comes from calling perf_sample_event_took(), which is only
called from one place: perf_event_nmi_handler().

So we can be pretty sure that the perf NMI is firing, or at least that
this handler code is running.

nmi_handle() says:
/*
 * NMIs are edge-triggered, which means if you have enough
 * of them concurrently, you can lose some because only one
 * can be latched at any given time.  Walk the whole list
 * to handle those situations.
 */

perf_event_nmi_handler() probably gets _called_ when the watchdog NMI
goes off.  But, it should hit this check:

if (!atomic_read(&active_events))
return NMI_DONE;

and return quickly. This is before it has a chance to call
perf_sample_event_took().
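For reference, the overall shape of the handler at the time was roughly the
following (paraphrased from memory, not a verbatim copy of
arch/x86/kernel/cpu/perf_event.c):

static int perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	u64 start_clock, finish_clock;
	int ret;

	/* bail out early when nobody has any perf events armed */
	if (!atomic_read(&active_events))
		return NMI_DONE;

	/* otherwise run the PMU handler and measure how long it took */
	start_clock = local_clock();
	ret = x86_pmu.handle_irq(regs);
	finish_clock = local_clock();

	perf_sample_event_took(finish_clock - start_clock);

	return ret;
}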

Dave, for your case, my suspicion would be that it got turned on
inadvertently, or that we somehow have a bug which bumped up
perf_event.c's 'active_events' and we're running some perf code that we
don't have to.

But, I'm suspicious.  I was having all kinds of issues with perf and
NMIs taking hundreds of milliseconds.  I never isolated it to having a
real, single, cause.  I attributed it to my large NUMA system just being
slow.  Your description makes me wonder what I missed, though.


Re: Yet more softlockups.

2013-07-12 Thread Ingo Molnar

* Dave Jones  wrote:

> On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
>  > On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
>  > > I get this right after booting..
>  > > 
>  > > [  114.516619] perf samples too long (4262 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
>  > 
>  > You can disable this warning by:
>  > 
>  > echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
> 
> Yes, but why is this even being run when I'm not running perf ?
> 
> The only NMI source running should be the watchdog.

The NMI watchdog is a perf event.

I've Cc:-ed Dave Hansen, the author of those changes - is this a false 
positive or some real problem?

Thanks,

Ingo


Re: Yet more softlockups.

2013-07-10 Thread Dave Jones
On Wed, Jul 10, 2013 at 11:39:50AM -0400, Vince Weaver wrote:
 > On Wed, 10 Jul 2013, Dave Jones wrote:
 > 
 > > Something is really fucked up in the kernel side of perf.
 > > I get this right after booting..
 > > 
 > > [  114.516619] perf samples too long (4262 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > > 
 > > That's before I even get a chance to log in, so I'm pretty sure perf isn't 
 > > even being run.
 > > 
 > > And this is just booting up my desktop, not fuzzing.
 > 
 > This is probably related to the problem described in the thread
 >perf: fix interrupt handler timing harness
 >http://comments.gmane.org/gmane.linux.kernel/1520775
 > which in theory has a fix.  Not sure why it'd trigger during boot.

Bah. That's not it either.

[ 1903.293233] perf samples too long (2514 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5
[ 1903.295248] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
19498.002 msecs

Dave




Re: Yet more softlockups.

2013-07-10 Thread Dave Jones
On Wed, Jul 10, 2013 at 11:39:50AM -0400, Vince Weaver wrote:
 > On Wed, 10 Jul 2013, Dave Jones wrote:
 > 
 > > Something is really fucked up in the kernel side of perf.
 > > I get this right after booting..
 > > 
 > > [  114.516619] perf samples too long (4262 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > > 
 > > That's before I even get a chance to log in, so I'm pretty sure perf isn't 
 > > even being run.
 > > 
 > > And this is just booting up my desktop, not fuzzing.
 > 
 > This is probably related to the problem described in the thread
 >perf: fix interrupt handler timing harness
 >http://comments.gmane.org/gmane.linux.kernel/1520775
 > which in theory has a fix.  Not sure why it'd trigger during boot.

Ah, missed that thread. Thanks for the pointer. That looks important,
and might actually explain some of the other oddness I've been seeing.

 > NMI watchdog?

likely.

Dave




Re: Yet more softlockups.

2013-07-10 Thread Dave Jones
On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
 > On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
 > > I get this right after booting..
 > > 
 > > [  114.516619] perf samples too long (4262 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > 
 > You can disable this warning by:
 > 
 > echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

Yes, but why is this even being run when I'm not running perf ?

The only NMI source running should be the watchdog.

Dave



Re: Yet more softlockups.

2013-07-10 Thread Vince Weaver
On Wed, 10 Jul 2013, Dave Jones wrote:

> Something is really fucked up in the kernel side of perf.
> I get this right after booting..
> 
> [  114.516619] perf samples too long (4262 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
> 
> That's before I even get a chance to log in, so I'm pretty sure perf isn't 
> even being run.
> 
> And this is just booting up my desktop, not fuzzing.

This is probably related to the problem described in the thread
   perf: fix interrupt handler timing harness
   http://comments.gmane.org/gmane.linux.kernel/1520775
which in theory has a fix.  Not sure why it'd trigger during boot.
NMI watchdog?

Vince



Re: Yet more softlockups.

2013-07-10 Thread Markus Trippelsdorf
On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
> On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
>  > 
>  > * Dave Jones  wrote:
>  > 
>  > > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
>  > >  > On Fri, 5 Jul 2013, Dave Jones wrote:
>  > >  > 
>  > >  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
>  > >  > > perf samples too long (2519 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
>  > >  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
> 238147.002 msecs
>  > >  > 
>  > >  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
>  > >  > claims it did run 23.8 seconds.
>  > >  > 
>  > >  > Are there more instances of NMI handler messages ?
>  > > 
>  > > [ 2552.006181] perf samples too long (2511 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
>  > > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long 
> to run: 500392.002 msecs
>  > 
>  > Dave, could you pull in the latest perf fixes at:
>  > 
>  >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
>  > 
>  > In particular this:
>  > 
>  >   e5302920da9e perf: Fix interrupt handler timing harness
>  > 
>  > could make a difference - if your tests somehow end up activating perf.
> 
> Something is really fucked up in the kernel side of perf.
> I get this right after booting..
> 
> [  114.516619] perf samples too long (4262 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5

You can disable this warning by:

echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

-- 
Markus


Re: Yet more softlockups.

2013-07-10 Thread Dave Jones
On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
 > 
 > * Dave Jones  wrote:
 > 
 > > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
 > >  > On Fri, 5 Jul 2013, Dave Jones wrote:
 > >  > 
 > >  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
 > >  > > perf samples too long (2519 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > >  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
 > > 238147.002 msecs
 > >  > 
 > >  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
 > >  > claims it did run 23.8 seconds.
 > >  > 
 > >  > Are there more instances of NMI handler messages ?
 > > 
 > > [ 2552.006181] perf samples too long (2511 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to 
 > > run: 500392.002 msecs
 > 
 > Dave, could you pull in the latest perf fixes at:
 > 
 >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
 > 
 > In particular this:
 > 
 >   e5302920da9e perf: Fix interrupt handler timing harness
 > 
 > could make a difference - if your tests somehow end up activating perf.

Something is really fucked up in the kernel side of perf.
I get this right after booting..

[  114.516619] perf samples too long (4262 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5

That's before I even get a chance to log in, so I'm pretty sure perf isn't even 
being run.

And this is just booting up my desktop, not fuzzing.

Dave



Re: Yet more softlockups.

2013-07-06 Thread Dave Jones
On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
 > 
 > * Dave Jones  wrote:
 > 
 > > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
 > >  > On Fri, 5 Jul 2013, Dave Jones wrote:
 > >  > 
 > >  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
 > >  > > perf samples too long (2519 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > >  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
 > > 238147.002 msecs
 > >  > 
 > >  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
 > >  > claims it did run 23.8 seconds.
 > >  > 
 > >  > Are there more instances of NMI handler messages ?
 > > 
 > > [ 2552.006181] perf samples too long (2511 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to 
 > > run: 500392.002 msecs
 > 
 > Dave, could you pull in the latest perf fixes at:
 > 
 >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
 > 
 > In particular this:
 > 
 >   e5302920da9e perf: Fix interrupt handler timing harness
 > 
 > could make a difference - if your tests somehow end up activating perf.

Not seeing the NMI printk (yet?), but saw this..

[ 2322.758913] perf samples too long (2502 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5

Dave



Re: Yet more softlockups.

2013-07-06 Thread Ingo Molnar

* Dave Jones  wrote:

> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
>  > On Fri, 5 Jul 2013, Dave Jones wrote:
>  > 
>  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
>  > > perf samples too long (2519 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
>  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
> 238147.002 msecs
>  > 
>  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
>  > claims it did run 23.8 seconds.
>  > 
>  > Are there more instances of NMI handler messages ?
> 
> [ 2552.006181] perf samples too long (2511 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to 
> run: 500392.002 msecs

Dave, could you pull in the latest perf fixes at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent

In particular this:

  e5302920da9e perf: Fix interrupt handler timing harness

could make a difference - if your tests somehow end up activating perf.

Thanks,

Ingo


RE: Yet more softlockups.

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Seiji Aguchi wrote:
> > -Original Message-
> > Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
> > fault here, as it changes the IDT handling for NMI context.
> 
> This softlockup happens while the interrupt tracepoints are disabled,
> because if they were enabled, "smp_trace_apic_timer_interrupt" would be
> displayed instead of "smp_apic_timer_interrupt" in the call trace below.
> 
> But I can't yet say how this issue is related to the tracepoint stuff,

I doubt it is related. I rather suspect that trinity is able to fuzz
perf into a DoS facility.

Dave, can you trace the perf sys calls and dump that data over serial
when the softlockup hits?

Thanks,

tglx


RE: Yet more softlockups.

2013-07-05 Thread Seiji Aguchi


> -Original Message-
> From: H. Peter Anvin [mailto:h...@zytor.com]
> Sent: Friday, July 05, 2013 12:41 PM
> To: Thomas Gleixner
> Cc: Dave Jones; Linus Torvalds; Linux Kernel; Ingo Molnar; Peter Zijlstra; 
> Seiji Aguchi
> Subject: Re: Yet more softlockups.
> 
> On 07/05/2013 09:02 AM, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Dave Jones wrote:
> >> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> >>   > On Fri, 5 Jul 2013, Dave Jones wrote:
> >>   >
> >>   > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> >>   > > perf samples too long (2519 > 2500), lowering 
> >> kernel.perf_event_max_sample_rate to 5
> >>   > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
> >> 238147.002 msecs
> >>   >
> >>   > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> >>   > claims it did run 23.8 seconds.
> >>   >
> >>   > Are there more instances of NMI handler messages ?
> >>
> >> [ 2552.006181] perf samples too long (2511 > 2500), lowering 
> >> kernel.perf_event_max_sample_rate to 5
> >> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to 
> >> run: 500392.002 msecs
> >
> > Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)
> >
> 
> Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
> fault here, as it changes the IDT handling for NMI context.

This softlockup happens while the interrupt tracepoints are disabled,
because if they were enabled, "smp_trace_apic_timer_interrupt" would be
displayed instead of "smp_apic_timer_interrupt" in the call trace below.

But I can't yet say how this issue is related to the tracepoint stuff;
I need to reproduce it on my machine first.

Call Trace:
  
 [] __do_softirq+0xff/0x440
 [] irq_exit+0xcd/0xe0
 [] smp_apic_timer_interrupt+0x6b/0x9b
 [] apic_timer_interrupt+0x6f/0x80

Seiji


Re: Yet more softlockups.

2013-07-05 Thread H. Peter Anvin

On 07/05/2013 09:02 AM, Thomas Gleixner wrote:

On Fri, 5 Jul 2013, Dave Jones wrote:

On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
  > On Fri, 5 Jul 2013, Dave Jones wrote:
  >
  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
  > > perf samples too long (2519 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5
  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
238147.002 msecs
  >
  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
  > claims it did run 23.8 seconds.
  >
  > Are there more instances of NMI handler messages ?

[ 2552.006181] perf samples too long (2511 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5
[ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
500392.002 msecs


Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)



Hmmm... this makes me wonder if the interrupt tracepoint stuff is at 
fault here, as it changes the IDT handling for NMI context.


-hpa



Re: Yet more softlockups.

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Dave Jones wrote:
> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
>  > On Fri, 5 Jul 2013, Dave Jones wrote:
>  > 
>  > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
>  > > perf samples too long (2519 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
>  > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
> 238147.002 msecs
>  > 
>  > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
>  > claims it did run 23.8 seconds.
>  > 
>  > Are there more instances of NMI handler messages ?
> 
> [ 2552.006181] perf samples too long (2511 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to 
> run: 500392.002 msecs

Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)


Re: Yet more softlockups.

2013-07-05 Thread Dave Jones
On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
 > On Fri, 5 Jul 2013, Dave Jones wrote:
 > 
 > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
 > > perf samples too long (2519 > 2500), lowering 
 > > kernel.perf_event_max_sample_rate to 5
 > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
 > > 238147.002 msecs
 > 
 > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
 > claims it did run 23.8 seconds.
 > 
 > Are there more instances of NMI handler messages ?

[ 2552.006181] perf samples too long (2511 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5
[ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 
500392.002 msecs

Dave



Re: Yet more softlockups.

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Dave Jones wrote:

> BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> perf samples too long (2519 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 5
> INFO: NMI handler (perf_event_nmi_handler) took too long to run: 238147.002 
> msecs

So we see a softlockup of 23 seconds and the perf_event_nmi_handler
claims it did run 23.8 seconds.

Are there more instances of NMI handler messages ?

Thanks,

tglx