Re: Yet more softlockups.
On Fri, 12 Jul 2013, Dave Jones wrote:
> Here's a fun trick:
>
> trinity -c perf_event_open -C4 -q -l off
>
> Within about a minute, that brings any of my boxes to its knees.
> The softlockup detector starts going nuts, and then the box wedges solid.

Are you running with the patch

  [PATCH 1/2] perf: Clone child context from parent context pmu
  https://lkml.org/lkml/2013/7/9/310

It hasn't hit Linus's git tree yet. It fixes a bug that my perf_fuzzer would hit within seconds, but it took me over a month of trace bisecting and kernel bisecting to isolate it. The symptoms were stuck processes and NMI tracebacks leading to hard locks.

With the patch applied, my perf_fuzzer (which uses the same perf_event_open syscall generator as trinity) runs for hours w/o problems.

Vince

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Yet more softlockups.
I added the WARN_ONCE() at the first place we enable a perf event. The watchdog code looks to use perf these days:

> [1.003260] ------------[ cut here ]------------
> [1.007943] WARNING: at /home/davehans/linux.git/arch/x86/kernel/cpu/perf_event.c:471 x86_pmu_event_init+0x249/0x430()
> [1.018906] Modules linked in:
> [1.021999] CPU: 0 PID: 11 Comm: watchdog/0 Not tainted 3.10.0-05973-gd2b4a64-dirty #190
> [1.030313] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [1.063621] Call Trace:
> [1.066106] [] dump_stack+0x55/0x76
> [1.071313] [] warn_slowpath_common+0x70/0xa0
> [1.077410] [] warn_slowpath_null+0x1a/0x20
> [1.083330] [] x86_pmu_event_init+0x249/0x430
> [1.089427] [] perf_init_event+0xef/0x130
> [1.095170] [] perf_event_alloc+0x358/0x4a0
> [1.101091] [] ? set_next_entity+0x3d/0x80
> [1.106924] [] ? touch_nmi_watchdog+0x80/0x80
> [1.113018] [] perf_event_create_kernel_counter+0x2e/0xe0
> [1.120176] [] watchdog_enable+0xfd/0x1e0
> [1.125919] [] smpboot_thread_fn+0xe3/0x1f0
> [1.131839] [] ? smpboot_unregister_percpu_thread+0x50/0x50
> [1.139250] [] kthread+0xc0/0xd0
> [1.144200] [] ? flush_kthread_worker+0xb0/0xb0
> [1.150474] [] ret_from_fork+0x7c/0xb0
> [1.155950] [] ? flush_kthread_worker+0xb0/0xb0
> [1.162227] ---[ end trace 32c04e859824e00d ]---

static int watchdog_nmi_enable(unsigned int cpu)
{
	...
	/* Try to register using hardware perf events */
	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, ...

That at least solves the boot-time mystery.
Re: Yet more softlockups.
On 07/12/2013 11:07 AM, David Ahern wrote:
> And Dave Hansen: I think nmi.c has the same do_div problem as
> kernel/events/core.c that Stephane fixed. Your patch has:
>
> whole_msecs = do_div(delta, (1000 * 1000));
> decimal_msecs = do_div(delta, 1000) % 1000;

Yup. There should be fixes in the x86 tip tree for both of these now.
Re: Yet more softlockups.
On 7/12/13 11:50 AM, Dave Jones wrote:
> Given you can run trinity long enough that you hit this however, makes me
> think you won't be able to trigger the bug I'm talking about. Perhaps
> virtualised perf counters are somehow immune to this problem, because on
> bare-metal, it literally takes seconds.

I can generate the message, it just does not lock up:

...
[  131.837129] perf samples too long (35849 > 35714), lowering kernel.perf_event_max_sample_rate to 4000
...
[  237.271170] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 57900.001 msecs

And Dave Hansen: I think nmi.c has the same do_div problem as kernel/events/core.c that Stephane fixed. Your patch has:

whole_msecs = do_div(delta, (1000 * 1000));
decimal_msecs = do_div(delta, 1000) % 1000;

David
Re: Yet more softlockups.
On Fri, Jul 12, 2013 at 11:40:06AM -0600, David Ahern wrote:
> On 7/12/13 11:18 AM, Dave Jones wrote:
> > On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
> > > On 7/12/13 9:45 AM, Dave Jones wrote:
> > > > Here's a fun trick:
> > > >
> > > > trinity -c perf_event_open -C4 -q -l off
>
> In _get_address, case 8 must be happening a lot and I don't see a free
> when that address comes from malloc. Perhaps all of the rand() calls are
> breaking down in the VM. If I change that case from malloc to something
> static - like page_rand - memory stays flat.

Oh right, yeah, that's a known bug. I forgot about the ARG_ADDRESS case. Another hacky workaround is to bound the trinity runs with -N100 or something and do that in a shell loop, so it frees it up on exit. I need to find the time to add proper callbacks to release allocated memory.

Given you can run trinity long enough that you hit this however, makes me think you won't be able to trigger the bug I'm talking about. Perhaps virtualised perf counters are somehow immune to this problem, because on bare-metal, it literally takes seconds.

Dave
Re: Yet more softlockups.
On 7/12/13 11:18 AM, Dave Jones wrote:
> On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
> > On 7/12/13 9:45 AM, Dave Jones wrote:
> > > Here's a fun trick:
> > >
> > > trinity -c perf_event_open -C4 -q -l off
> > >
> > > Within about a minute, that brings any of my boxes to its knees.
> > > The softlockup detector starts going nuts, and then the box wedges solid.
> >
> > I tried that in a VM running latest Linus tree. I see trinity children
> > getting nuked regularly from oom.
>
> Weird. I'm curious what the backtrace looks like in those cases.
> Where is it trying to allocate memory ?
> (Though that isn't usually too helpful in most cases, but in absence of anything else..)

(gdb) bt
#0  0x00313b27f3e0 in malloc () from /lib64/libc.so.6
#1  0x00404405 in _get_address (null_allowed=null_allowed@entry=1 '\001') at generic-sanitise.c:151
#2  0x004044ca in get_address () at generic-sanitise.c:182
#3  0x004052a0 in fill_arg (childno=<optimized out>, call=call@entry=298, argnum=argnum@entry=1) at generic-sanitise.c:415
#4  0x0040548d in generic_sanitise (childno=childno@entry=0) at generic-sanitise.c:615
#5  0x00405620 in mkcall (childno=childno@entry=0) at syscall.c:131
#6  0x00407d85 in child_process () at child.c:219
#7  0x004073ad in fork_children () at main.c:103
#8  main_loop () at main.c:308
#9  do_main_loop () at main.c:342
#10 0x0040253a in main (argc=<optimized out>, argv=<optimized out>) at trinity.c:180

In _get_address, case 8 must be happening a lot and I don't see a free when that address comes from malloc. Perhaps all of the rand() calls are breaking down in the VM. If I change that case from malloc to something static - like page_rand - memory stays flat.

David
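The leak pattern David describes can be sketched in a few lines. This is an illustrative assumption about the shape of the fix, not trinity's actual internals: the tracking pointer and both function names below are hypothetical, standing in for the "proper callbacks to release allocated memory" that Dave says trinity still lacks.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch: one case in trinity's _get_address() returns a
 * fresh malloc() that nothing ever frees, so every fuzzed syscall can
 * grow VmData.  Remembering the pointer lets a post-syscall hook
 * release it. */

static void *last_malloced_addr;        /* assumed per-child bookkeeping */

static void *get_malloced_address(size_t len)
{
        void *addr = malloc(len);       /* leaked in the original code */
        last_malloced_addr = addr;      /* remember it so it can be freed */
        return addr;
}

/* Hypothetical cleanup callback, run after the syscall returns. */
static void release_syscall_args(void)
{
        free(last_malloced_addr);       /* free(NULL) is a safe no-op */
        last_malloced_addr = NULL;
}

/* Exercise one allocate/release cycle; returns 1 on success. */
static int leak_fix_demo(void)
{
        void *p = get_malloced_address(4096);
        if (p == NULL || last_malloced_addr != p)
                return 0;
        release_syscall_args();
        return last_malloced_addr == NULL;
}
```

Dave's interim workaround downthread (bounding runs with -N100 in a shell loop) achieves the same effect by letting process exit reclaim the memory.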
Re: Yet more softlockups.
On Fri, Jul 12, 2013 at 11:12:13AM -0600, David Ahern wrote:
> On 7/12/13 9:45 AM, Dave Jones wrote:
> > Here's a fun trick:
> >
> > trinity -c perf_event_open -C4 -q -l off
> >
> > Within about a minute, that brings any of my boxes to its knees.
> > The softlockup detector starts going nuts, and then the box wedges solid.
>
> I tried that in a VM running latest Linus tree. I see trinity children
> getting nuked regularly from oom.

Weird. I'm curious what the backtrace looks like in those cases. Where is it trying to allocate memory ? (Though that isn't usually too helpful in most cases, but in absence of anything else..)

> I was dumping Vm elements using:
>
> while [ 1 ]; do echo $(date) $(egrep Vm /proc/$pid/status); sleep 1; done
>
> And right before the process is killed was the line:
>
> Fri Jul 12 11:00:19 MDT 2013 VmPeak: 2867472 kB VmSize: 2867472 kB
> VmLck: 0 kB VmPin: 0 kB VmHWM: 1493092 kB VmRSS: 1493092 kB VmData:
> 2857944 kB VmStk: 136 kB VmExe: 100 kB VmLib: 1844 kB VmPTE: 5628 kB
> VmSwap: 0 kB
>
> The VmData is growing fairly steadily and strace shows a lot of brk
> calls. Is that normal for trinity - or this command line?

Hmm, there are a few known leaks in trinity that I still haven't got around to fixing, but none that should be triggered from this. I wonder if something about being virtualised is perturbing its behaviour somehow. Though I can't think what would matter.

> Looking at the perf_event_open calls I see a lot of E2BIG errors in
> addition to EINVAL. e.g.,
>
> perf_event_open(0xba9000, 0, 0x4c, 0xcc, 0) = -1 EINVAL (Invalid argument)
> alarm(0) = 1
> getppid() = 9031
> alarm(1) = 0
> perf_event_open(0xba9000, 0x2a6e, 0xe, 0xfd, 0) = -1 E2BIG (Argument list too long)
> alarm(0) = 1
> getppid() = 9031
> alarm(1) = 0

Yep, that'll happen. Perhaps try CONFIG_MEMLEAK ?
Dave
Re: Yet more softlockups.
On 7/12/13 9:45 AM, Dave Jones wrote:
> Here's a fun trick:
>
> trinity -c perf_event_open -C4 -q -l off
>
> Within about a minute, that brings any of my boxes to its knees.
> The softlockup detector starts going nuts, and then the box wedges solid.

I tried that in a VM running latest Linus tree. I see trinity children getting nuked regularly from oom. I was dumping Vm elements using:

while [ 1 ]; do echo $(date) $(egrep Vm /proc/$pid/status); sleep 1; done

And right before the process is killed was the line:

Fri Jul 12 11:00:19 MDT 2013 VmPeak: 2867472 kB VmSize: 2867472 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 1493092 kB VmRSS: 1493092 kB VmData: 2857944 kB VmStk: 136 kB VmExe: 100 kB VmLib: 1844 kB VmPTE: 5628 kB VmSwap: 0 kB

The VmData is growing fairly steadily and strace shows a lot of brk calls. Is that normal for trinity - or this command line?

Looking at the perf_event_open calls I see a lot of E2BIG errors in addition to EINVAL. e.g.,

...
perf_event_open(0xba9000, 0, 0x4c, 0xcc, 0) = -1 EINVAL (Invalid argument)
alarm(0) = 1
getppid() = 9031
alarm(1) = 0
perf_event_open(0xba9000, 0x2a6e, 0xe, 0xfd, 0) = -1 E2BIG (Argument list too long)
alarm(0) = 1
getppid() = 9031
alarm(1) = 0
...

David
Re: Yet more softlockups.
On Fri, Jul 12, 2013 at 08:55:31AM -0700, Dave Hansen wrote:
> I mean that somebody turned 'active_events' on without actually wanting
> perf to be on. I'd be curious how it got set to something nonzero.
> Could you stick a WARN_ONCE() or printk_ratelimit() on the three sites
> that modify it?

I'll add it to my list of things to get to, but it probably won't be until post-weekend.

> > Here's a fun trick:
> >
> > trinity -c perf_event_open -C4 -q -l off
> >
> > Within about a minute, that brings any of my boxes to its knees.
> > The softlockup detector starts going nuts, and then the box wedges solid.
>
> *But* dropping the perf sample rate has been really effective at keeping
> me from hitting it, and I've had to use _lots_ of CPUs (60-160) doing
> those NMIs at once to trigger the lockups.
>
> Being able to trigger it with so few CPUs is interesting though. I'll
> try on my hardware.

I hacked trinity to pause for 24s after each call, and changed the kernel to taint on lockup (so that trinity would immediately stop). My hope was to find the combination of perf_event_open calls that triggered this. After 12 hours, it's still ticking along. It's now done about a thousand more calls than is usually necessary to hit the bug. Which makes me wonder if this is timing related somehow.

Perhaps also worth noting that the majority of calls trinity makes will return -EINVAL.

Dave
Re: Yet more softlockups.
On 07/12/2013 08:45 AM, Dave Jones wrote:
> On Fri, Jul 12, 2013 at 08:38:52AM -0700, Dave Hansen wrote:
> > Dave, for your case, my suspicion would be that it got turned on
> > inadvertently, or that we somehow have a bug which bumped up
> > perf_event.c's 'active_events' and we're running some perf code that we
> > don't have to.
>
> What do you mean 'inadvertently'? I see this during bootup every time.
> Unless systemd or something has started playing with perf, (which afaik it
> isn't)

I mean that somebody turned 'active_events' on without actually wanting perf to be on. I'd be curious how it got set to something nonzero. Could you stick a WARN_ONCE() or printk_ratelimit() on the three sites that modify it?

> > But, I'm suspicious. I was having all kinds of issues with perf and
> > NMIs taking hundreds of milliseconds. I never isolated it to having a
> > real, single, cause. I attributed it to my large NUMA system just being
> > slow. Your description makes me wonder what I missed, though.
>
> Here's a fun trick:
>
> trinity -c perf_event_open -C4 -q -l off
>
> Within about a minute, that brings any of my boxes to its knees.
> The softlockup detector starts going nuts, and then the box wedges solid.

On my box, the same happens with 'perf top'. ;)

*But* dropping the perf sample rate has been really effective at keeping me from hitting it, and I've had to use _lots_ of CPUs (60-160) doing those NMIs at once to trigger the lockups.

Being able to trigger it with so few CPUs is interesting though. I'll try on my hardware.
Re: Yet more softlockups.
On Fri, Jul 12, 2013 at 08:38:52AM -0700, Dave Hansen wrote:
> The warning comes from calling perf_sample_event_took(), which is only
> called from one place: perf_event_nmi_handler().
>
> So we can be pretty sure that the perf NMI is firing, or at least that
> this handler code is running.
>
> nmi_handle() says:
>
> /*
>  * NMIs are edge-triggered, which means if you have enough
>  * of them concurrently, you can lose some because only one
>  * can be latched at any given time. Walk the whole list
>  * to handle those situations.
>  */
>
> perf_event_nmi_handler() probably gets _called_ when the watchdog NMI
> goes off. But, it should hit this check:
>
> if (!atomic_read(&active_events))
>         return NMI_DONE;
>
> and return quickly. This is before it has a chance to call
> perf_sample_event_took().
>
> Dave, for your case, my suspicion would be that it got turned on
> inadvertently, or that we somehow have a bug which bumped up
> perf_event.c's 'active_events' and we're running some perf code that we
> don't have to.

What do you mean 'inadvertently'? I see this during bootup every time. Unless systemd or something has started playing with perf, (which afaik it isn't)

> But, I'm suspicious. I was having all kinds of issues with perf and
> NMIs taking hundreds of milliseconds. I never isolated it to having a
> real, single, cause. I attributed it to my large NUMA system just being
> slow. Your description makes me wonder what I missed, though.

Here's a fun trick:

trinity -c perf_event_open -C4 -q -l off

Within about a minute, that brings any of my boxes to its knees. The softlockup detector starts going nuts, and then the box wedges solid. (You may need to bump -C depending on your CPU count. I've never seen it happen with a single process, but -C2 seems to be a minimum)

That *is* using perf though, so I kind of expect bad shit to happen when there are bugs. The "during bootup" case is still a head-scratcher.
Dave
Re: Yet more softlockups.
On 07/12/2013 03:31 AM, Ingo Molnar wrote:
> * Dave Jones wrote:
> > On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
> > > On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
> > > > I get this right after booting..
> > > >
> > > > [ 114.516619] perf samples too long (4262 > 2500), lowering
> > > > kernel.perf_event_max_sample_rate to 5
> > >
> > > You can disable this warning by:
> > >
> > > echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
> >
> > Yes, but why is this even being run when I'm not running perf ?
> >
> > The only NMI source running should be the watchdog.
>
> The NMI watchdog is a perf event.
>
> I've Cc:-ed Dave Hansen, the author of those changes - is this a false
> positive or some real problem?

The warning comes from calling perf_sample_event_took(), which is only called from one place: perf_event_nmi_handler(). So we can be pretty sure that the perf NMI is firing, or at least that this handler code is running.

nmi_handle() says:

/*
 * NMIs are edge-triggered, which means if you have enough
 * of them concurrently, you can lose some because only one
 * can be latched at any given time. Walk the whole list
 * to handle those situations.
 */

perf_event_nmi_handler() probably gets _called_ when the watchdog NMI goes off. But, it should hit this check:

if (!atomic_read(&active_events))
        return NMI_DONE;

and return quickly. This is before it has a chance to call perf_sample_event_took().

Dave, for your case, my suspicion would be that it got turned on inadvertently, or that we somehow have a bug which bumped up perf_event.c's 'active_events' and we're running some perf code that we don't have to.

But, I'm suspicious. I was having all kinds of issues with perf and NMIs taking hundreds of milliseconds. I never isolated it to having a real, single, cause. I attributed it to my large NUMA system just being slow. Your description makes me wonder what I missed, though.
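The gate Dave Hansen describes can be emulated in userspace. This is a sketch of the control flow only; the names mirror the kernel's for clarity, but this is not the real handler. The point is that the handler is cheap whenever active_events is zero, so seeing the "perf samples too long" message implies something (per the WARN_ONCE experiment elsewhere in this thread, the NMI watchdog's own kernel counter) has bumped the count.

```c
#include <stdatomic.h>

/* Userspace emulation of the early-out in perf's NMI handler:
 * bail immediately unless some perf event is active. */

enum { NMI_DONE = 0, NMI_HANDLED = 1 };

static atomic_int active_events;        /* bumped when a perf event is created */

static int perf_event_nmi_handler(void)
{
        if (!atomic_load(&active_events))
                return NMI_DONE;        /* cheap path: perf not in use */

        /* The real handler would service the PMU interrupt here and
         * then call perf_sample_event_took() with the elapsed time,
         * which is what emits the "perf samples too long" warning. */
        return NMI_HANDLED;
}

/* Emulate what perf_event_create_kernel_counter() implies: the NMI
 * watchdog registering its cycle counter makes the count nonzero. */
static void watchdog_registers_counter(void)
{
        atomic_fetch_add(&active_events, 1);
}
```

With active_events at zero the handler returns NMI_DONE without doing any work; after watchdog_registers_counter() it takes the full path, matching the boot-time behaviour Dave Jones observed.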
Re: Yet more softlockups.
* Dave Jones wrote:
> On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
> > On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
> > > I get this right after booting..
> > >
> > > [ 114.516619] perf samples too long (4262 > 2500), lowering
> > > kernel.perf_event_max_sample_rate to 5
> >
> > You can disable this warning by:
> >
> > echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
>
> Yes, but why is this even being run when I'm not running perf ?
>
> The only NMI source running should be the watchdog.

The NMI watchdog is a perf event.

I've Cc:-ed Dave Hansen, the author of those changes - is this a false positive or some real problem?

Thanks,

	Ingo
Re: Yet more softlockups.
On Wed, Jul 10, 2013 at 11:39:50AM -0400, Vince Weaver wrote:
> On Wed, 10 Jul 2013, Dave Jones wrote:
> > Something is really fucked up in the kernel side of perf.
> > I get this right after booting..
> >
> > [ 114.516619] perf samples too long (4262 > 2500), lowering
> > kernel.perf_event_max_sample_rate to 5
> >
> > That's before I even get a chance to log in, so I'm pretty sure perf isn't
> > even being run.
> >
> > And this is just booting up my desktop, not fuzzing.
>
> This is probably related to the problem described in the thread
>
>   perf: fix interrupt handler timing harness
>   http://comments.gmane.org/gmane.linux.kernel/1520775
>
> which in theory has a fix. Not sure why it'd trigger during boot.

Bah. That's not it either.

[ 1903.293233] perf samples too long (2514 > 2500), lowering kernel.perf_event_max_sample_rate to 5
[ 1903.295248] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 19498.002 msecs

Dave
Re: Yet more softlockups.
On Wed, Jul 10, 2013 at 11:39:50AM -0400, Vince Weaver wrote:
> On Wed, 10 Jul 2013, Dave Jones wrote:
> > Something is really fucked up in the kernel side of perf.
> > I get this right after booting..
> >
> > [ 114.516619] perf samples too long (4262 > 2500), lowering
> > kernel.perf_event_max_sample_rate to 5
> >
> > That's before I even get a chance to log in, so I'm pretty sure perf isn't
> > even being run.
> >
> > And this is just booting up my desktop, not fuzzing.
>
> This is probably related to the problem described in the thread
>
>   perf: fix interrupt handler timing harness
>   http://comments.gmane.org/gmane.linux.kernel/1520775
>
> which in theory has a fix. Not sure why it'd trigger during boot.

Ah, missed that thread. Thanks for the pointer. That looks important, and might actually explain some of the other oddness I've been seeing.

> NMI watchdog?

Likely.

Dave
Re: Yet more softlockups.
On Wed, Jul 10, 2013 at 05:20:15PM +0200, Markus Trippelsdorf wrote:
> On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
> > I get this right after booting..
> >
> > [ 114.516619] perf samples too long (4262 > 2500), lowering
> > kernel.perf_event_max_sample_rate to 5
>
> You can disable this warning by:
>
> echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

Yes, but why is this even being run when I'm not running perf ?

The only NMI source running should be the watchdog.

Dave
Re: Yet more softlockups.
On Wed, 10 Jul 2013, Dave Jones wrote:
> Something is really fucked up in the kernel side of perf.
> I get this right after booting..
>
> [ 114.516619] perf samples too long (4262 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5
>
> That's before I even get a chance to log in, so I'm pretty sure perf isn't
> even being run.
>
> And this is just booting up my desktop, not fuzzing.

This is probably related to the problem described in the thread

  perf: fix interrupt handler timing harness
  http://comments.gmane.org/gmane.linux.kernel/1520775

which in theory has a fix. Not sure why it'd trigger during boot. NMI watchdog?

Vince
Re: Yet more softlockups.
On 2013.07.10 at 11:13 -0400, Dave Jones wrote:
> On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
> > * Dave Jones wrote:
> > > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > > > perf samples too long (2519 > 2500), lowering
> > > > > kernel.perf_event_max_sample_rate to 5
> > > > > INFO: NMI handler (perf_event_nmi_handler) took too long to run:
> > > > > 238147.002 msecs
> > > >
> > > > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > > > claims it did run 23.8 seconds.
> > > >
> > > > Are there more instances of NMI handler messages ?
> > >
> > > [ 2552.006181] perf samples too long (2511 > 2500), lowering
> > > kernel.perf_event_max_sample_rate to 5
> > > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long
> > > to run: 500392.002 msecs
> >
> > Dave, could you pull in the latest perf fixes at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
> >
> > In particular this:
> >
> >   e5302920da9e perf: Fix interrupt handler timing harness
> >
> > could make a difference - if your tests somehow end up activating perf.
>
> Something is really fucked up in the kernel side of perf.
> I get this right after booting..
>
> [ 114.516619] perf samples too long (4262 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5

You can disable this warning by:

echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

--
Markus
Re: Yet more softlockups.
On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
> * Dave Jones wrote:
> > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > > perf samples too long (2519 > 2500), lowering
> > > > kernel.perf_event_max_sample_rate to 5
> > > > INFO: NMI handler (perf_event_nmi_handler) took too long to run:
> > > > 238147.002 msecs
> > >
> > > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > > claims it did run 23.8 seconds.
> > >
> > > Are there more instances of NMI handler messages ?
> >
> > [ 2552.006181] perf samples too long (2511 > 2500), lowering
> > kernel.perf_event_max_sample_rate to 5
> > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to
> > run: 500392.002 msecs
>
> Dave, could you pull in the latest perf fixes at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
>
> In particular this:
>
>   e5302920da9e perf: Fix interrupt handler timing harness
>
> could make a difference - if your tests somehow end up activating perf.

Something is really fucked up in the kernel side of perf. I get this right after booting..

[ 114.516619] perf samples too long (4262 > 2500), lowering kernel.perf_event_max_sample_rate to 5

That's before I even get a chance to log in, so I'm pretty sure perf isn't even being run.

And this is just booting up my desktop, not fuzzing.

Dave
Re: Yet more softlockups.
On Sat, Jul 06, 2013 at 09:24:08AM +0200, Ingo Molnar wrote:
> * Dave Jones wrote:
> > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > > perf samples too long (2519 > 2500), lowering
> > > > kernel.perf_event_max_sample_rate to 5
> > > > INFO: NMI handler (perf_event_nmi_handler) took too long to run:
> > > > 238147.002 msecs
> > >
> > > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > > claims it did run 23.8 seconds.
> > >
> > > Are there more instances of NMI handler messages ?
> >
> > [ 2552.006181] perf samples too long (2511 > 2500), lowering
> > kernel.perf_event_max_sample_rate to 5
> > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to
> > run: 500392.002 msecs
>
> Dave, could you pull in the latest perf fixes at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent
>
> In particular this:
>
>   e5302920da9e perf: Fix interrupt handler timing harness
>
> could make a difference - if your tests somehow end up activating perf.

Not seeing the NMI printk (yet?), but saw this..

[ 2322.758913] perf samples too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 5

Dave
Re: Yet more softlockups.
* Dave Jones wrote:
> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > perf samples too long (2519 > 2500), lowering
> > > kernel.perf_event_max_sample_rate to 5
> > > INFO: NMI handler (perf_event_nmi_handler) took too long to run:
> > > 238147.002 msecs
> >
> > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > claims it did run 23.8 seconds.
> >
> > Are there more instances of NMI handler messages ?
>
> [ 2552.006181] perf samples too long (2511 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5
> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to
> run: 500392.002 msecs

Dave, could you pull in the latest perf fixes at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/urgent

In particular this:

  e5302920da9e perf: Fix interrupt handler timing harness

could make a difference - if your tests somehow end up activating perf.

Thanks,

	Ingo
RE: Yet more softlockups.
On Fri, 5 Jul 2013, Seiji Aguchi wrote:
> > -----Original Message-----
> > Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
> > fault here, as it changes the IDT handling for NMI context.
>
> This softlockup happens while the interrupt tracepoints are disabled,
> because if they were enabled, "smp_trace_apic_timer_interrupt" would be
> displayed instead of "smp_apic_timer_interrupt" in the call trace below.
>
> But I can't say anything about how this issue is related to the tracepoint stuff,

I doubt it is related. I rather suspect that trinity is able to fuzz perf into a DoS facility.

Dave, can you trace the perf sys calls and dump that data over serial when the softlockup hits?

Thanks,

	tglx
RE: Yet more softlockups.
> -----Original Message-----
> From: H. Peter Anvin [mailto:h...@zytor.com]
> Sent: Friday, July 05, 2013 12:41 PM
> To: Thomas Gleixner
> Cc: Dave Jones; Linus Torvalds; Linux Kernel; Ingo Molnar; Peter Zijlstra; Seiji Aguchi
> Subject: Re: Yet more softlockups.
>
> On 07/05/2013 09:02 AM, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 5 Jul 2013, Dave Jones wrote:
> > > > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > > > perf samples too long (2519 > 2500), lowering
> > > > > kernel.perf_event_max_sample_rate to 5
> > > > > INFO: NMI handler (perf_event_nmi_handler) took too long to run:
> > > > > 238147.002 msecs
> > > >
> > > > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > > > claims it did run 23.8 seconds.
> > > >
> > > > Are there more instances of NMI handler messages ?
> > >
> > > [ 2552.006181] perf samples too long (2511 > 2500), lowering
> > > kernel.perf_event_max_sample_rate to 5
> > > [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to
> > > run: 500392.002 msecs
> >
> > Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)
>
> Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
> fault here, as it changes the IDT handling for NMI context.

This softlockup happens while the interrupt tracepoints are disabled, because if they were enabled, "smp_trace_apic_timer_interrupt" would be displayed instead of "smp_apic_timer_interrupt" in the call trace below. But I can't say anything about how this issue is related to the tracepoint stuff; I need to reproduce it on my machine first.
Call Trace:
 [] __do_softirq+0xff/0x440
 [] irq_exit+0xcd/0xe0
 [] smp_apic_timer_interrupt+0x6b/0x9b
 [] apic_timer_interrupt+0x6f/0x80

Seiji
Re: Yet more softlockups.
On 07/05/2013 09:02 AM, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Dave Jones wrote:
>> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
>> > On Fri, 5 Jul 2013, Dave Jones wrote:
>> >
>> > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
>> > > perf samples too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 5
>> > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 238147.002 msecs
>> >
>> > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
>> > claims it did run 23.8 seconds.
>> >
>> > Are there more instances of NMI handler messages ?
>>
>> [ 2552.006181] perf samples too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 5
>> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 500392.002 msecs
>
> Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)

Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
fault here, as it changes the IDT handling for NMI context.

	-hpa
Re: Yet more softlockups.
On Fri, 5 Jul 2013, Dave Jones wrote:
> On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Dave Jones wrote:
> >
> > > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > > perf samples too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 5
> > > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 238147.002 msecs
> >
> > So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> > claims it did run 23.8 seconds.
> >
> > Are there more instances of NMI handler messages ?
>
> [ 2552.006181] perf samples too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 5
> [ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 500392.002 msecs

Yuck. Spending 50 seconds in NMI context surely explains a softlockup :)
Re: Yet more softlockups.
On Fri, Jul 05, 2013 at 05:15:07PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Dave Jones wrote:
>
> > BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> > perf samples too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 5
> > INFO: NMI handler (perf_event_nmi_handler) took too long to run: 238147.002 msecs
>
> So we see a softlockup of 23 seconds and the perf_event_nmi_handler
> claims it did run 23.8 seconds.
>
> Are there more instances of NMI handler messages ?

[ 2552.006181] perf samples too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 5
[ 2552.008680] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 500392.002 msecs

	Dave
Re: Yet more softlockups.
On Fri, 5 Jul 2013, Dave Jones wrote:
> BUG: soft lockup - CPU#3 stuck for 23s! [trinity-child1:14565]
> perf samples too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 5
> INFO: NMI handler (perf_event_nmi_handler) took too long to run: 238147.002 msecs

So we see a softlockup of 23 seconds and the perf_event_nmi_handler
claims it did run 23.8 seconds.

Are there more instances of NMI handler messages ?

Thanks,

	tglx