Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri 2019-01-25 08:00:56, Andi Kleen wrote:
> > [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
> > lowering kernel.perf_event_max_sample_rate to 79750
> > [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
> > lowering kernel.perf_event_max_sample_rate to 63750
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
> > lowering kernel.perf_event_max_sample_rate to 48250
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
> > lowering kernel.perf_event_max_sample_rate to 38000
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
> > lowering kernel.perf_event_max_sample_rate to 29500
>
> These are fairly normal.

Unfortunately, they are. Could we set up our defaults so that they
don't normally happen?

pavel@amd:~/g/unicsy_demo$ dmesg | grep "took too"
[  761.507893] perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 4736.674595] perf: interrupt took too long (3209 > 3145), lowering kernel.perf_event_max_sample_rate to 62250
pavel@amd:~/g/unicsy_demo$

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, 1 Feb 2019, Jiri Olsa wrote:
> > I've just started fuzzing with the patch applied. Often it takes a few
> > hours to trigger the bug.
>
> cool, thanks

I let it run overnight and no crash.

> > Added question about this bug. It appeared that the crash was triggered
> > by the BTS driver over-writing kernel memory. The data being written, was
> > this user controllable? Meaning, is this a security issue being fixed, or
> > just a crashing issue?
>
> yea, I have an example that can trigger it immediately

I mean: the crash is happening because data structures are getting
over-written by the BTS driver. Depending on who and what is doing this,
this could be a security issue (i.e. if it was raw BTS data that contained
partially userspace-controlled values). Though even if this were the case
it would probably be hard to exploit.

Vince
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Sat, Feb 02, 2019 at 08:54:35AM +0530, Ravi Bangoria wrote:
>
> On 2/1/19 1:24 PM, Ravi Bangoria wrote:
> > I ran the fuzzer for a couple of hours but I didn't see any crash with
> > your previous patch.
> >
> > I'll try this newer one as well.
>
> I ran the fuzzer for ~8 hrs and no lockup so far.

yep, looks good; mine's been running for 2 days now

I'll put together a changelog and send it as a proper patch

jirka
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On 2/1/19 1:24 PM, Ravi Bangoria wrote:
> I ran the fuzzer for a couple of hours but I didn't see any crash with
> your previous patch.
>
> I'll try this newer one as well.

I ran the fuzzer for ~8 hrs and no lockup so far.

Thanks.
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, Feb 01, 2019 at 11:27:28AM -0500, Vince Weaver wrote:
> On Fri, 1 Feb 2019, Jiri Olsa wrote:
>
> > with attached patch I did not trigger the fuzzer crash
> > for over a day now, could you guys try?
>
> I've just started fuzzing with the patch applied. Often it takes a few
> hours to trigger the bug.

cool, thanks

> Added question about this bug. It appeared that the crash was triggered
> by the BTS driver over-writing kernel memory. The data being written, was
> this user controllable? Meaning, is this a security issue being fixed, or
> just a crashing issue?

yea, I have an example that can trigger it immediately

jirka
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, 1 Feb 2019, Jiri Olsa wrote:
> with attached patch I did not trigger the fuzzer crash
> for over a day now, could you guys try?

I've just started fuzzing with the patch applied. Often it takes a few
hours to trigger the bug.

Added question about this bug. It appeared that the crash was triggered
by the BTS driver over-writing kernel memory. The data being written, was
this user controllable? Meaning, is this a security issue being fixed, or
just a crashing issue?

Vince Weaver
vincent.wea...@maine.edu
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
Hi Jiri,

On 2/1/19 1:13 PM, Jiri Olsa wrote:
> On Thu, Jan 31, 2019 at 09:27:11AM +0100, Jiri Olsa wrote:
>> On Wed, Jan 30, 2019 at 07:36:48PM +0100, Jiri Olsa wrote:
>>
>> SNIP
>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 280a72b3a553..22ec63a0782e 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -4969,6 +4969,26 @@ static void __perf_event_period(struct perf_event *event,
>>>  	}
>>>  }
>>>
>>> +static int check_period(struct perf_event *event, u64 value)
>>> +{
>>> +	u64 sample_period_attr = event->attr.sample_period;
>>> +	u64 sample_period_hw = event->hw.sample_period;
>>> +	int ret;
>>> +
>>> +	if (event->attr.freq) {
>>> +		event->attr.sample_freq = value;
>>> +	} else {
>>> +		event->attr.sample_period = value;
>>> +		event->hw.sample_period = value;
>>> +	}
>>
>> hm, I think we need to check the period without changing the event,
>> because we don't disable pmu, so it might get picked up by bts code
>>
>> will check
>
> with attached patch I did not trigger the fuzzer crash
> for over a day now, could you guys try?

I ran the fuzzer for a couple of hours but I didn't see any crash with
your previous patch.

I'll try this newer one as well.

Thanks.
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Thu, Jan 31, 2019 at 09:27:11AM +0100, Jiri Olsa wrote:
> On Wed, Jan 30, 2019 at 07:36:48PM +0100, Jiri Olsa wrote:
>
> SNIP
>
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 280a72b3a553..22ec63a0782e 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -4969,6 +4969,26 @@ static void __perf_event_period(struct perf_event *event,
> >  	}
> >  }
> >
> > +static int check_period(struct perf_event *event, u64 value)
> > +{
> > +	u64 sample_period_attr = event->attr.sample_period;
> > +	u64 sample_period_hw = event->hw.sample_period;
> > +	int ret;
> > +
> > +	if (event->attr.freq) {
> > +		event->attr.sample_freq = value;
> > +	} else {
> > +		event->attr.sample_period = value;
> > +		event->hw.sample_period = value;
> > +	}
>
> hm, I think we need to check the period without changing the event,
> because we don't disable pmu, so it might get picked up by bts code
>
> will check

with attached patch I did not trigger the fuzzer crash
for over a day now, could you guys try?

thanks,
jirka

---
 arch/x86/events/core.c       | 14 ++++++++++++++
 arch/x86/events/intel/core.c |  9 +++++++++
 arch/x86/events/perf_event.h | 16 ++++++++++++++--
 include/linux/perf_event.h   |  5 +++++
 kernel/events/core.c         | 16 ++++++++++++++++
 5 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 374a19712e20..b684f0294f35 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2278,6 +2278,19 @@ void perf_check_microcode(void)
 		x86_pmu.check_microcode();
 }

+static int x86_pmu_check_period(struct perf_event *event, u64 value)
+{
+	if (x86_pmu.check_period && x86_pmu.check_period(event, value))
+		return -EINVAL;
+
+	if (value && x86_pmu.limit_period) {
+		if (x86_pmu.limit_period(event, value) > value)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
 static struct pmu pmu = {
 	.pmu_enable		= x86_pmu_enable,
 	.pmu_disable		= x86_pmu_disable,
@@ -2302,6 +2315,7 @@ static struct pmu pmu = {
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
 	.task_ctx_size		= sizeof(struct x86_perf_task_context),
+	.check_period		= x86_pmu_check_period,
 };

 void arch_perf_update_userpage(struct perf_event *event,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8b4e3020aba2..a3fbbd33beef 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3586,6 +3586,11 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx,
 	intel_pmu_lbr_sched_task(ctx, sched_in);
 }

+static int intel_pmu_check_period(struct perf_event *event, u64 value)
+{
+	return intel_pmu_has_bts_period(event, value) ? -EINVAL : 0;
+}
+
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");

@@ -3665,6 +3670,8 @@ static __initconst const struct x86_pmu core_pmu = {
 	.cpu_prepare		= intel_pmu_cpu_prepare,
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
+
+	.check_period		= intel_pmu_check_period,
 };

 static struct attribute *intel_pmu_attrs[];
@@ -3707,6 +3714,8 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
+
+	.check_period		= intel_pmu_check_period,
 };

 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 78d7b7031bfc..d46fd6754d92 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -646,6 +646,11 @@ struct x86_pmu {
 	 * Intel host/guest support (KVM)
 	 */
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
+
+	/*
+	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
+	 */
+	int (*check_period) (struct perf_event *event, u64 period);
 };

 struct x86_perf_task_context {
@@ -857,7 +862,7 @@ static inline int amd_pmu_init(void)

 #ifdef CONFIG_CPU_SUP_INTEL

-static inline bool intel_pmu_has_bts(struct perf_event *event)
+static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
@@ -868,7 +873,14 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
 	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
 	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);

-	return hw_event == bts_event && hwc->sample_period == 1;
+	return hw_event == bts_event && period == 1;
+}
+
+static inline
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
> Yeah, a stuck loop looks really scary inside an NMI handler.
> Should I just go ahead and send a patch to remove this warning?
> Or probably turn it into a pr_info()?

Not at this point. We would need to fix the PMU reset first to be more
selective.

-Andi
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, Jan 25, 2019 at 8:02 AM Andi Kleen wrote:
>
> > [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
> > lowering kernel.perf_event_max_sample_rate to 79750
> > [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
> > lowering kernel.perf_event_max_sample_rate to 63750
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
> > lowering kernel.perf_event_max_sample_rate to 48250
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
> > lowering kernel.perf_event_max_sample_rate to 38000
> > [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
> > lowering kernel.perf_event_max_sample_rate to 29500
>
> These are fairly normal.
>
> > [Fri Jan 25 10:32:44 2019] [ cut here ]
> > [Fri Jan 25 10:32:44 2019] perfevents: irq loop stuck!
>
> I believe it's always possible to cause an irq loop. This happens when
> the PMU is programmed to cause PMIs on multiple counters
> too quickly. Maybe should just recover from it without printing such
> scary messages.

Yeah, a stuck loop looks really scary inside an NMI handler.
Should I just go ahead and send a patch to remove this warning?
Or probably turn it into a pr_info()?

> Right now the scary message is justified because it resets the complete
> PMU. Perhaps need to be a bit more selective resetting on only
> the events that loop.
>
> > [Fri Jan 25 10:32:44 2019] WARNING: CPU: 1 PID: 0 at
> > arch/x86/events/intel/core.c:2440 intel_pmu_handle_irq+0x158/0x170
>
> This looks independent.
>
> I would apply the following patch (cut'n'pasted, so may need manual apply)
> and then run with

I would like to help as we keep seeing this warning for a rather long
time, but unfortunately the reproducer provided by Ravi doesn't trigger
any warning or crash here. Maybe I'm not using the right hardware to
trigger it?

[0.132136] Performance Events: PEBS fmt2+, Broadwell events, 16-deep LBR, full-width counters, Intel PMU driver.
[0.133003] ... version:               3
[0.134001] ... bit width:             48
[0.135001] ... generic registers:     4
[0.136001] ... value mask:
[0.137001] ... max period:            7fff
[0.138001] ... fixed-purpose events:  3
[0.139001] ... event mask:            0007000f

Thanks!
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Thu, Jan 31, 2019 at 01:28:34PM +0530, Ravi Bangoria wrote:
> Hi Andi,
>
> On 1/25/19 9:30 PM, Andi Kleen wrote:
> >> [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
> >> lowering kernel.perf_event_max_sample_rate to 79750
> >> [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
> >> lowering kernel.perf_event_max_sample_rate to 63750
> >> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
> >> lowering kernel.perf_event_max_sample_rate to 48250
> >> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
> >> lowering kernel.perf_event_max_sample_rate to 38000
> >> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
> >> lowering kernel.perf_event_max_sample_rate to 29500
> >
> > These are fairly normal.
>
> I understand that the throttling mechanism is designed exactly to do this.
> But I've observed that, every time I run the fuzzer, max_sample_rate gets
> throttled down to 250 (which is CONFIG_HZ, I guess). Doesn't this
> mean the interrupt time is somehow increasing gradually? Is that fine?

It's more like the throttling mechanism is a controller, and it takes
multiple tries to zoom in on the truly needed value.

You can measure the PMI time by enabling the nmi:nmi_handler trace
point. It directly reports it. From what I've seen it's a long-tail
distribution with regular large outliers. Most of the PMIs are not that
slow, just an occasional few are.

When I did some investigation on this a couple years back the outliers
were either due to call stack processing, or due to flushing the perf
ring buffer. There were some fixes on the call stack case back then,
but I'm sure more could be done.

For the call stack processing there isn't much more we can do, I think
(other than switching to call stack LBR only), but I suspect the buffer
flushing problem could be improved more.

It's relatively easy to investigate with a variant of the ftrace recipe
I posted earlier (but you need to fix the Makefile first to enable
ftrace for all of perf). Just add an ftrace trigger on the nmi_handler
trace point to stop tracing when the nmi_handler time exceeds a
threshold, and look at the traces.

-Andi
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Wed, Jan 30, 2019 at 07:36:48PM +0100, Jiri Olsa wrote:

SNIP

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 280a72b3a553..22ec63a0782e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4969,6 +4969,26 @@ static void __perf_event_period(struct perf_event *event,
>  	}
>  }
>
> +static int check_period(struct perf_event *event, u64 value)
> +{
> +	u64 sample_period_attr = event->attr.sample_period;
> +	u64 sample_period_hw = event->hw.sample_period;
> +	int ret;
> +
> +	if (event->attr.freq) {
> +		event->attr.sample_freq = value;
> +	} else {
> +		event->attr.sample_period = value;
> +		event->hw.sample_period = value;
> +	}

hm, I think we need to check the period without changing the event,
because we don't disable pmu, so it might get picked up by bts code

will check

jirka

> +
> +	ret = event->pmu->check_period(event);
> +
> +	event->attr.sample_period = sample_period_attr;
> +	event->hw.sample_period = sample_period_hw;
> +	return ret;
> +}
> +
>  static int perf_event_period(struct perf_event *event, u64 __user *arg)
>  {
>  	u64 value;
> @@ -4985,6 +5005,9 @@ static int perf_event_period(struct perf_event *event, u64 __user *arg)
>  	if (event->attr.freq && value > sysctl_perf_event_sample_rate)
>  		return -EINVAL;
>
> +	if (check_period(event, value))
> +		return -EINVAL;
> +
>  	event_function_call(event, __perf_event_period, &value);
>
>  	return 0;
> @@ -9601,6 +9624,11 @@ static int perf_pmu_nop_int(struct pmu *pmu)
>  	return 0;
>  }
>
> +static int perf_event_nop_int(struct perf_event *event)
> +{
> +	return 0;
> +}
> +
>  static DEFINE_PER_CPU(unsigned int, nop_txn_flags);
>
>  static void perf_pmu_start_txn(struct pmu *pmu, unsigned int flags)
> @@ -9901,6 +9929,9 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  		pmu->pmu_disable = perf_pmu_nop_void;
>  	}
>
> +	if (!pmu->check_period)
> +		pmu->check_period = perf_event_nop_int;
> +
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>
> --
> 2.17.2
>
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
Hi Andi,

On 1/25/19 9:30 PM, Andi Kleen wrote:
>> [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
>> lowering kernel.perf_event_max_sample_rate to 79750
>> [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
>> lowering kernel.perf_event_max_sample_rate to 63750
>> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
>> lowering kernel.perf_event_max_sample_rate to 48250
>> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
>> lowering kernel.perf_event_max_sample_rate to 38000
>> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
>> lowering kernel.perf_event_max_sample_rate to 29500
>
> These are fairly normal.

I understand that the throttling mechanism is designed exactly to do this.
But I've observed that, every time I run the fuzzer, max_sample_rate gets
throttled down to 250 (which is CONFIG_HZ, I guess). Doesn't this mean
the interrupt time is somehow increasing gradually? Is that fine?

Here is the sample dmesg:

[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (2928 > 2500), lowering kernel.perf_event_max_sample_rate to 68250
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (4363 > 3660), lowering kernel.perf_event_max_sample_rate to 45750
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 2.183 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (21382 > 5453), lowering kernel.perf_event_max_sample_rate to 9250
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (34548 > 26727), lowering kernel.perf_event_max_sample_rate to 5750
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.509 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (61682 > 43185), lowering kernel.perf_event_max_sample_rate to 3000
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.593 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (89206 > 77102), lowering kernel.perf_event_max_sample_rate to 2000
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.619 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (120188 > 111507), lowering kernel.perf_event_max_sample_rate to 1500
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.782 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (171065 > 150235), lowering kernel.perf_event_max_sample_rate to 1000
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 4.066 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (226815 > 213831), lowering kernel.perf_event_max_sample_rate to 750
[Thu Jan 31 09:25:40 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 5.364 msecs
[Thu Jan 31 09:25:40 2019] perf: interrupt took too long (300844 > 283518), lowering kernel.perf_event_max_sample_rate to 500
[Thu Jan 31 09:33:43 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 6.136 msecs
[Thu Jan 31 09:50:35 2019] perf: interrupt took too long (378352 > 376055), lowering kernel.perf_event_max_sample_rate to 500
[Thu Jan 31 09:53:47 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 6.456 msecs
[Thu Jan 31 09:57:31 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.002 msecs
[Thu Jan 31 10:01:30 2019] perf: interrupt took too long (478447 > 472940), lowering kernel.perf_event_max_sample_rate to 250
[Thu Jan 31 12:28:31 2019] perf: interrupt took too long (601630 > 598058), lowering kernel.perf_event_max_sample_rate to 250
[Thu Jan 31 12:28:31 2019] perf: interrupt took too long (754288 > 752037), lowering kernel.perf_event_max_sample_rate to 250
[Thu Jan 31 12:43:13 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 12.781 msecs
[Thu Jan 31 12:43:13 2019] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 13.583 msecs

Thanks,
Ravi
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Wed, Jan 30, 2019 at 07:36:48PM +0100, Jiri Olsa wrote:
> On Fri, Jan 25, 2019 at 12:16:27PM +0530, Ravi Bangoria wrote:
>
> SNIP
>
> > [ 1432.176374] general protection fault: [#1] SMP PTI
> > [ 1432.182253] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W  5.0.0-rc3-ravi-pfuzzer+ #1
> > [ 1432.192088] Hardware name: IBM CPU PLANAR -[8722xyz]-/00FL808, BIOS -[KOE162DUS-2.30]- 08/27/2018
> > [ 1432.206120] RIP: 0010:perf_prepare_sample+0x8f/0x510
> > [ 1432.211679] Code: ff ff 41 f6 c4 01 0f 85 22 02 00 00 41 f6 c4 20 74 26 4d 85 e4 0f 88 0c 01 00 00 4c 89 f6 4c 89 ff e8 f5 fe ff ff 49 89 45 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 41 f7 c4 00 04 00 00 74
> > [ 1432.232699] RSP: :95b3ff843a78 EFLAGS: 00010082
> > [ 1432.238548] RAX: 8d1217eb826cce00 RBX: 95b3ff843ad8 RCX: 001f
> > [ 1432.246536] RDX: RSI: 355563e5 RDI:
> > [ 1432.254522] RBP: 95b3ff843ab0 R08: d016493f3b42 R09:
> > [ 1432.262508] R10: 95b3ff843a08 R11: R12: 800306e5
> > [ 1432.270495] R13: 95b3ff843bc0 R14: 95b3ff843b18 R15: 95b3f6b65800
> > [ 1432.278483] FS: () GS:95b3ff84() knlGS:
> > [ 1432.287539] CS: 0010 DS: ES: CR0: 80050033
> > [ 1432.293969] CR2: 55bf7f768c90 CR3: 001fd220e005 CR4: 000606e0
> > [ 1432.301956] Call Trace:
> > [ 1432.304697]
> > [ 1432.306956]  ? intel_pmu_drain_bts_buffer+0x194/0x230
> > [ 1432.312615]  intel_pmu_drain_bts_buffer+0x160/0x230
> > [ 1432.318078]  ? tick_nohz_irq_exit+0x31/0x40
> > [ 1432.322765]  ? smp_call_function_single_interrupt+0x48/0xe0
> > [ 1432.328993]  ? call_function_single_interrupt+0xf/0x20
> > [ 1432.334745]  ? call_function_single_interrupt+0xa/0x20
> > [ 1432.340501]  ? x86_schedule_events+0x1a0/0x2f0
> > [ 1432.345475]  ? x86_pmu_commit_txn+0xb4/0x100
> > [ 1432.350258]  ? find_busiest_group+0x47/0x5d0
> > [ 1432.355039]  ? perf_event_set_state.part.42+0x12/0x50
> > [ 1432.360694]  ? perf_mux_hrtimer_restart+0x40/0xb0
> > [ 1432.365960]  intel_pmu_disable_event+0xae/0x100
> > [ 1432.371031]  ? intel_pmu_disable_event+0xae/0x100
> > [ 1432.376297]  x86_pmu_stop+0x7a/0xb0
> > [ 1432.380201]  x86_pmu_del+0x57/0x120
> > [ 1432.384105]  event_sched_out.isra.101+0x83/0x180
> > [ 1432.389272]  group_sched_out.part.103+0x57/0xe0
> > [ 1432.394343]  ctx_sched_out+0x188/0x240
> > [ 1432.398539]  ctx_resched+0xa8/0xd0
> > [ 1432.402345]  __perf_event_enable+0x193/0x1e0
> > [ 1432.407125]  event_function+0x8e/0xc0
> > [ 1432.411222]  remote_function+0x41/0x50
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Wed, Jan 30, 2019 at 12:39:47PM -0800, Andi Kleen wrote:
> Jiri Olsa writes:
> >
> > the patch adds check_period pmu callback.. I need to check if there's
> > a better way to do this, but so far it fixes the crash for me
> >
> > if you guys could check this patch, that'd be great
>
> There's already a limit_period callback, perhaps that could
> be extended. But ok, can do it this way too.

right, we call it within x86_perf_event_set_period to limit the period,
but I guess we should include this check after changing the period:

	if (event->attr.sample_period && x86_pmu.limit_period) {
		if (x86_pmu.limit_period(event, event->attr.sample_period) >
		    event->attr.sample_period)
			return -EINVAL;
	}

> I suspect there are some other cases that need this callback, not
> just BTS, e.g. the checks in hsw_hw_config

ok, the sample_period values for checkpointed events, we should check
for this as well.. I'll add that

jirka
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
Jiri Olsa writes:
>
> the patch adds check_period pmu callback.. I need to check if there's
> a better way to do this, but so far it fixes the crash for me
>
> if you guys could check this patch, that'd be great

There's already a limit_period callback, perhaps that could
be extended. But ok, can do it this way too.

I suspect there are some other cases that need this callback, not
just BTS, e.g. the checks in hsw_hw_config

-Andi
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, Jan 25, 2019 at 12:16:27PM +0530, Ravi Bangoria wrote:

SNIP

> [ 1432.176374] general protection fault: [#1] SMP PTI
> [ 1432.182253] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W  5.0.0-rc3-ravi-pfuzzer+ #1
> [ 1432.192088] Hardware name: IBM CPU PLANAR -[8722xyz]-/00FL808, BIOS -[KOE162DUS-2.30]- 08/27/2018
> [ 1432.206120] RIP: 0010:perf_prepare_sample+0x8f/0x510
> [ 1432.211679] Code: ff ff 41 f6 c4 01 0f 85 22 02 00 00 41 f6 c4 20 74 26 4d 85 e4 0f 88 0c 01 00 00 4c 89 f6 4c 89 ff e8 f5 fe ff ff 49 89 45 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 41 f7 c4 00 04 00 00 74
> [ 1432.232699] RSP: :95b3ff843a78 EFLAGS: 00010082
> [ 1432.238548] RAX: 8d1217eb826cce00 RBX: 95b3ff843ad8 RCX: 001f
> [ 1432.246536] RDX: RSI: 355563e5 RDI:
> [ 1432.254522] RBP: 95b3ff843ab0 R08: d016493f3b42 R09:
> [ 1432.262508] R10: 95b3ff843a08 R11: R12: 800306e5
> [ 1432.270495] R13: 95b3ff843bc0 R14: 95b3ff843b18 R15: 95b3f6b65800
> [ 1432.278483] FS: () GS:95b3ff84() knlGS:
> [ 1432.287539] CS: 0010 DS: ES: CR0: 80050033
> [ 1432.293969] CR2: 55bf7f768c90 CR3: 001fd220e005 CR4: 000606e0
> [ 1432.301956] Call Trace:
> [ 1432.304697]
> [ 1432.306956]  ? intel_pmu_drain_bts_buffer+0x194/0x230
> [ 1432.312615]  intel_pmu_drain_bts_buffer+0x160/0x230
> [ 1432.318078]  ? tick_nohz_irq_exit+0x31/0x40
> [ 1432.322765]  ? smp_call_function_single_interrupt+0x48/0xe0
> [ 1432.328993]  ? call_function_single_interrupt+0xf/0x20
> [ 1432.334745]  ? call_function_single_interrupt+0xa/0x20
> [ 1432.340501]  ? x86_schedule_events+0x1a0/0x2f0
> [ 1432.345475]  ? x86_pmu_commit_txn+0xb4/0x100
> [ 1432.350258]  ? find_busiest_group+0x47/0x5d0
> [ 1432.355039]  ? perf_event_set_state.part.42+0x12/0x50
> [ 1432.360694]  ? perf_mux_hrtimer_restart+0x40/0xb0
> [ 1432.365960]  intel_pmu_disable_event+0xae/0x100
> [ 1432.371031]  ? intel_pmu_disable_event+0xae/0x100
> [ 1432.376297]  x86_pmu_stop+0x7a/0xb0
> [ 1432.380201]  x86_pmu_del+0x57/0x120
> [ 1432.384105]  event_sched_out.isra.101+0x83/0x180
> [ 1432.389272]  group_sched_out.part.103+0x57/0xe0
> [ 1432.394343]  ctx_sched_out+0x188/0x240
> [ 1432.398539]  ctx_resched+0xa8/0xd0
> [ 1432.402345]  __perf_event_enable+0x193/0x1e0
> [ 1432.407125]  event_function+0x8e/0xc0
> [ 1432.411222]  remote_function+0x41/0x50
> [ 1432.415418]  flush_smp_call_function_queue+0x68/0x100
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
> [Fri Jan 25 10:28:53 2019] perf: interrupt took too long (2501 > 2500),
> lowering kernel.perf_event_max_sample_rate to 79750
> [Fri Jan 25 10:29:08 2019] perf: interrupt took too long (3136 > 3126),
> lowering kernel.perf_event_max_sample_rate to 63750
> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (4140 > 3920),
> lowering kernel.perf_event_max_sample_rate to 48250
> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (5231 > 5175),
> lowering kernel.perf_event_max_sample_rate to 38000
> [Fri Jan 25 10:29:11 2019] perf: interrupt took too long (6736 > 6538),
> lowering kernel.perf_event_max_sample_rate to 29500

These are fairly normal.

> [Fri Jan 25 10:32:44 2019] [ cut here ]
> [Fri Jan 25 10:32:44 2019] perfevents: irq loop stuck!

I believe it's always possible to cause an irq loop. This happens when
the PMU is programmed to cause PMIs on multiple counters
too quickly. Maybe we should just recover from it without printing such
scary messages.

Right now the scary message is justified because it resets the complete
PMU. Perhaps it needs to be a bit more selective, resetting only
the events that loop.

> [Fri Jan 25 10:32:44 2019] WARNING: CPU: 1 PID: 0 at
> arch/x86/events/intel/core.c:2440 intel_pmu_handle_irq+0x158/0x170

This looks independent.

I would apply the following patch (cut'n'pasted, so may need manual apply)
and then run with

cd /sys/kernel/debug/tracing
echo 5 > buffer_size_kb
echo default_do_nmi > set_graph_function
echo 1 > events/msr/enable
echo 'msr != 0xc100 && msr != 0x6e0' > events/msr/write_msr/filter
echo function_graph > current_tracer
echo printk:traceoff > set_ftrace_filter
echo 1 > tracing_on

and then collect the trace from /sys/kernel/debug/tracing/trace after
the oops. This should show the context of when it happens.

diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 3c022e33c109..8afc997110e0 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -1,7 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
-ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
-endif
 obj-y := core.o ring_buffer.o callchain.o
Re: System crash with perf_fuzzer (kernel: 5.0.0-rc3)
On Fri, 25 Jan 2019, Ravi Bangoria wrote:
> I'm seeing a system crash while running perf_fuzzer with the upstream
> kernel on an Intel machine. I hit the crash twice (out of which I don't
> have a log of the first crash, so I don't know if the reason is the same
> for both crashes). I've attached my .config with the mail.

> 	type = PERF_TYPE_HARDWARE;

> And, I'm running the fuzzer in a loop with *root*. (Let me know if
> running as root is harmful ;-) ).

There's a known issue related to Intel BTS events that you can trigger
with the perf_fuzzer, even as a normal user. I reported it a few months
ago but I don't think it ever got resolved.

The traces you get look similar to some that you posted. It's hard to
track down as it doesn't seem to be a simple issue; rather, it looks
like the BTS event handling is stomping over memory it shouldn't,
somehow.

Vince