Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Thu, Feb 23, 2017 at 06:28:01PM +0100, Peter Zijlstra wrote:
> I think Boris and Thomas were talking about moving the entire c1e
> detection out of there.

Yeah, see

  07c94a381253 ("x86/amd: Simplify AMD E400 aware idle routine")

tglx and I got rid of reading that MSR in the idle routine.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Thu, Feb 23, 2017 at 01:24:34PM +0100, Jiri Olsa wrote:
> On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> >
> > I think I prefer something like the below, that only annotates the one
> > RDMSR in question, instead of all of them.
> >
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 0888a879120f..d6c6aa80675f 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> >
> >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> >  			amd_e400_c1e_detected = true;
>
> hum, I might have missed some other solution in discussion,
> and can't see this one being pulled in.. should I resend this?

I think Boris and Thomas were talking about moving the entire c1e
detection out of there.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 0888a879120f..d6c6aa80675f 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> >
> >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> >  			amd_e400_c1e_detected = true;
>
> hum, I might have missed some other solution in discussion,
> and can't see this one being pulled in.. should I resend this?

I suppose you could also use the new notrace msr variant.

BTW this check is quite annoying.

-Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > hi,
> > Jan hit following output when msr tracepoints are enabled on amd server:
> >
> > [ 91.585653] ===============================
> > [ 91.589840] [ INFO: suspicious RCU usage. ]
> > [ 91.594025] 4.9.0-rc1+ #1 Not tainted
> > [ 91.597691] -------------------------------
> > [ 91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > [ 91.610222]
> > [ 91.610222] other info that might help us debug this:
> > [ 91.610222]
> > [ 91.618224]
> > [ 91.618224] RCU used illegally from idle CPU!
> > [ 91.618224] rcu_scheduler_active = 1, debug_locks = 0
> > [ 91.629081] RCU used illegally from extended quiescent state!
> > [ 91.634820] no locks held by swapper/1/0.
> > [ 91.638832]
> > [ 91.638832] stack backtrace:
> > [ 91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> > [ 91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
> > [ 91.656159]  c900018fbdf8 813ed43c 88017ede8000 0001
> > [ 91.663637]  c900018fbe28 810fdcd7 880233f95dd0 c0010055
> > [ 91.671107]  c900018fbe58 814297ac
> > [ 91.678560] Call Trace:
> > [ 91.681022]  [] dump_stack+0x85/0xc9
> > [ 91.686164]  [] lockdep_rcu_suspicious+0xe7/0x120
> > [ 91.692429]  [] do_trace_read_msr+0x14c/0x1b0
> > [ 91.698349]  [] native_read_msr+0x32/0x40
> > [ 91.703921]  [] amd_e400_idle+0x7e/0x110
> > [ 91.709407]  [] arch_cpu_idle+0xf/0x20
> > [ 91.714720]  [] default_idle_call+0x23/0x40
> > [ 91.720467]  [] cpu_startup_entry+0x1da/0x2b0
> > [ 91.726387]  [] start_secondary+0x17f/0x1f0
> >
> >
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
>
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.
>
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0888a879120f..d6c6aa80675f 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
>  	if (!amd_e400_c1e_detected) {
>  		u32 lo, hi;
>
> -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
>
>  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
>  			amd_e400_c1e_detected = true;

hum, I might have missed some other solution in discussion,
and can't see this one being pulled in.. should I resend this?

thanks,
jirka
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Wed, 30 Nov 2016, Borislav Petkov wrote:
> On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote:
> > Right, that's the safe bet. But I'm quite sure that the C1E crap only
> > starts to work _after_ ACPI initialization.
>
> Yap, I think it is an ACPI decision whether to enter C1E or not. And all
> those boxes which are unaffected - they actually are but since the FW
> doesn't enter C1E, they don't hit the bug.
>
> > tick_force_broadcast() is irreversible
>
> So if I let the cores in that small window use amd_e400_idle() and
> then when I detect the machine doesn't enter C1E after all, I do
> tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it
> and free it, that would work?
>
> Or do you see a better solution?

Start with the bug cleared and select amd_e400_idle(), which will then use
default_idle() and not do any of the broadcast crap.

After ACPI gets initialized check the C1E misfeature and if it's detected,
then set the CPU BUG and send an IPI to all online CPUs. I think this is
safe because the CPU running the ACPI initialization is not in idle and if
any of the other CPUs would be stuck due to the C1E enablement, then the
IPI will kick them out.

At least worth a try.

Thanks,

	tglx
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote:
> Right, that's the safe bet. But I'm quite sure that the C1E crap only
> starts to work _after_ ACPI initialization.

Yap, I think it is an ACPI decision whether to enter C1E or not. And all
those boxes which are unaffected - they actually are but since the FW
doesn't enter C1E, they don't hit the bug.

> tick_force_broadcast() is irreversible

So if I let the cores in that small window use amd_e400_idle() and
then when I detect the machine doesn't enter C1E after all, I do
tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it
and free it, that would work?

Or do you see a better solution?

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Wed, 30 Nov 2016, Borislav Petkov wrote:
> On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote:
> > The issue is that you obviously start with the assumption, that the machine
> > has this bug. As a consequence the machine is brute forced into tick
> > broadcast mode, which cannot be reverted when you clear that misfeature
> > after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
> > forced into broadcast mode, which is not what you want.
> >
> > As far as I understood the whole magic, this C1E misfeature takes only
> > effect _after_ ACPI has been initialized. So instead of setting the bug in
> > early boot and therefore forcing the broadcast nonsense, we should only set
> > it when ACPI has actually detected it.
>
> Problem is, select_idle_routine() runs a lot earlier than acpi_init() so
> there's a window where we don't definitively know yet whether the box is
> actually going to enter C1E or not.
>
> [ I presume the reason why we have to do the proper detection after
>   ACPI has been initialized is because the frickelware decides whether
>   to do C1E entry or not and then sets those bits in the MSR (or not). ]
>
> If in that window we enter idle and we're on an affected machine and we
> *don't* switch to broadcast mode, we risk not waking up from C1E, i.e.,
> the main reason this fix was even done.
>
> So, if we "prematurely" switch to broadcast mode on the affected CPUs,
> we're ok, it will be detected properly later and we're in broadcast
> mode already.

Right, that's the safe bet. But I'm quite sure that the C1E crap only
starts to work _after_ ACPI initialization.

> Now, on those machines which are not affected and we clear
> X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking
> of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and
> even freeing it so that they can do default_idle().
>
> But you're saying tick_broadcast_enter() is irreversible?

tick_force_broadcast() is irreversible.

Thanks,

	tglx
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote:
> The issue is that you obviously start with the assumption, that the machine
> has this bug. As a consequence the machine is brute forced into tick
> broadcast mode, which cannot be reverted when you clear that misfeature
> after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
> forced into broadcast mode, which is not what you want.
>
> As far as I understood the whole magic, this C1E misfeature takes only
> effect _after_ ACPI has been initialized. So instead of setting the bug in
> early boot and therefore forcing the broadcast nonsense, we should only set
> it when ACPI has actually detected it.

Problem is, select_idle_routine() runs a lot earlier than acpi_init() so
there's a window where we don't definitively know yet whether the box is
actually going to enter C1E or not.

[ I presume the reason why we have to do the proper detection after
  ACPI has been initialized is because the frickelware decides whether
  to do C1E entry or not and then sets those bits in the MSR (or not). ]

If in that window we enter idle and we're on an affected machine and we
*don't* switch to broadcast mode, we risk not waking up from C1E, i.e.,
the main reason this fix was even done.

So, if we "prematurely" switch to broadcast mode on the affected CPUs,
we're ok, it will be detected properly later and we're in broadcast
mode already.

Now, on those machines which are not affected and we clear
X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking
of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and
even freeing it so that they can do default_idle().

But you're saying tick_broadcast_enter() is irreversible?

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Tue, Nov 29, 2016 at 02:16:49PM +0100, Borislav Petkov wrote:
> On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> > IOW, what's the worst thing that can happen if we did this below?
> >
> > We basically get rid of the detection and switch the timer to broadcast
> > mode immediately on the halting CPU.
> >
> > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> > check so it will run on the affected CPUs only...
> >
> > Thoughts?
>
> Actually, here's a better version. The E400 detection works only after
> ACPI has been enabled so we piggyback the end of acpi_init().
>
> We don't need the MSR read now - we do
>
> 	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
>
> on the idle path which is as fast as it gets.
>
> Any complaints about this before I go and test it everywhere?
>
> It builds and boots in my guest here ok, not that it means a whole lot.
>
> The good news is, I have collected a fleet of boxes which all have that
> erratum so testing should be pretty reliable. Something that doesn't
> happen everytime!

+1 one test server over here.. will check ;-)

jirka
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Tue, 29 Nov 2016, Borislav Petkov wrote:
> On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> > IOW, what's the worst thing that can happen if we did this below?
> >
> > We basically get rid of the detection and switch the timer to broadcast
> > mode immediately on the halting CPU.
> >
> > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> > check so it will run on the affected CPUs only...
> >
> > Thoughts?
>
> Actually, here's a better version. The E400 detection works only after
> ACPI has been enabled so we piggyback the end of acpi_init().
>
> We don't need the MSR read now - we do
>
> 	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
>
> on the idle path which is as fast as it gets.
>
> Any complaints about this before I go and test it everywhere?

The issue is that you obviously start with the assumption, that the machine
has this bug. As a consequence the machine is brute forced into tick
broadcast mode, which cannot be reverted when you clear that misfeature
after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
forced into broadcast mode, which is not what you want.

As far as I understood the whole magic, this C1E misfeature takes only
effect _after_ ACPI has been initialized. So instead of setting the bug in
early boot and therefore forcing the broadcast nonsense, we should only set
it when ACPI has actually detected it.

Thanks,

	tglx
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> IOW, what's the worst thing that can happen if we did this below?
>
> We basically get rid of the detection and switch the timer to broadcast
> mode immediately on the halting CPU.
>
> amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> check so it will run on the affected CPUs only...
>
> Thoughts?

Actually, here's a better version. The E400 detection works only after
ACPI has been enabled so we piggyback the end of acpi_init().

We don't need the MSR read now - we do

	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))

on the idle path which is as fast as it gets.

Any complaints about this before I go and test it everywhere?

It builds and boots in my guest here ok, not that it means a whole lot.

The good news is, I have collected a fleet of boxes which all have that
erratum so testing should be pretty reliable. Something that doesn't
happen everytime! :-)

---
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 5391b0ae7cc3..aa8c879e814f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -94,7 +94,8 @@ static inline unsigned int acpi_processor_cstate_check(unsigned int max_cstate)
 	    boot_cpu_data.x86_model <= 0x05 &&
 	    boot_cpu_data.x86_mask < 0x0A)
 		return 1;
-	else if (amd_e400_c1e_detected)
+
+	else if (boot_cpu_has(X86_BUG_AMD_APIC_C1E))
 		return 1;
 	else
 		return max_cstate;
@@ -131,6 +132,7 @@ static inline bool acpi_has_cpu_in_madt(void)
 	return !!acpi_lapic;
 }
 
+extern void __init __weak arch_post_acpi_init(void);
 #else /* !CONFIG_ACPI */
 
 #define acpi_lapic 0
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 1d2b69fc0ceb..b3918609f6f1 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -204,6 +204,7 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
 #define static_cpu_has_bug(bit)		static_cpu_has((bit))
 #define boot_cpu_has_bug(bit)		cpu_has_bug(&boot_cpu_data, (bit))
+#define setup_clear_cpu_bug(bit)	setup_clear_cpu_cap((bit))
 
 #define MAX_CPU_FEATURES	(NCAPINTS * 32)
 #define cpu_have_feature	boot_cpu_has
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1f6a92903b09..d98c5c432209 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -636,7 +636,6 @@ extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
 extern unsigned long		boot_option_idle_override;
-extern bool			amd_e400_c1e_detected;
 
 enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT,
 			 IDLE_POLL};
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ee023919e476..56653b7322f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -289,9 +289,6 @@ void stop_this_cpu(void *dummy)
 		halt();
 }
 
-bool amd_e400_c1e_detected;
-EXPORT_SYMBOL(amd_e400_c1e_detected);
-
 static cpumask_var_t amd_e400_c1e_mask;
 
 void amd_e400_remove_cpu(int cpu)
@@ -301,26 +298,14 @@ void amd_e400_remove_cpu(int cpu)
 }
 
 /*
- * AMD Erratum 400 aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
+ * AMD Erratum 400 aware idle routine.
+ *
+ *
+ * We handle it the same way as C3 power states (local apic timer and TSC stop)
  */
 static void amd_e400_idle(void)
 {
-	if (!amd_e400_c1e_detected) {
-		u32 lo, hi;
-
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
-
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			amd_e400_c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			pr_info("System has AMD C1E enabled\n");
-		}
-	}
-
-	if (amd_e400_c1e_detected) {
+	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) {
 		int cpu = smp_processor_id();
 
 		if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
@@ -419,6 +404,30 @@ void __init init_amd_e400_c1e_mask(void)
 		zalloc_cpumask_var(&amd_e400_c1e_mask, GFP_KERNEL);
 }
 
+void __init arch_post_acpi_init(void)
+{
+	u32 lo, hi;
+
+	if (!boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
+		return;
+
+	/*
+	 * AMD E400 detection needs to happen *after* ACPI has been enabled. We
+	 * check for C1E active in the interrupt pending message MSR. If we detect
+	 * C1E enabled, machine is affected.
+	 */
+	rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+
+	if (!(lo &
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> IOW, what's the worst thing that can happen if we did this below?
>
> We basically get rid of the detection and switch the timer to broadcast
> mode immediately on the halting CPU.
>
> amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> check so it will run on the affected CPUs only...
>
> Thoughts?

Actually, here's a better version. The E400 detection works only after
ACPI has been enabled so we piggyback the end of acpi_init(). We don't
need the MSR read now - we do

	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))

on the idle path which is as fast as it gets.

Any complaints about this before I go and test it everywhere? It builds
and boots in my guest here ok, not that it means a whole lot.

The good news is, I have collected a fleet of boxes which all have that
erratum so testing should be pretty reliable. Something that doesn't
happen everytime! :-)

---
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 5391b0ae7cc3..aa8c879e814f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -94,7 +94,8 @@ static inline unsigned int acpi_processor_cstate_check(unsigned int max_cstate)
 	    boot_cpu_data.x86_model <= 0x05 &&
 	    boot_cpu_data.x86_mask < 0x0A)
 		return 1;
-	else if (amd_e400_c1e_detected)
+
+	else if (boot_cpu_has(X86_BUG_AMD_APIC_C1E))
 		return 1;
 	else
 		return max_cstate;
@@ -131,6 +132,7 @@ static inline bool acpi_has_cpu_in_madt(void)
 	return !!acpi_lapic;
 }

+extern void __init __weak arch_post_acpi_init(void);
 #else /* !CONFIG_ACPI */

 #define acpi_lapic 0
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 1d2b69fc0ceb..b3918609f6f1 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -204,6 +204,7 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
 #define static_cpu_has_bug(bit)	static_cpu_has((bit))
 #define boot_cpu_has_bug(bit)	cpu_has_bug(&boot_cpu_data, (bit))
+#define setup_clear_cpu_bug(bit)	setup_clear_cpu_cap((bit))

 #define MAX_CPU_FEATURES	(NCAPINTS * 32)
 #define cpu_have_feature	boot_cpu_has
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1f6a92903b09..d98c5c432209 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -636,7 +636,6 @@ extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);

 extern unsigned long	boot_option_idle_override;
-extern bool		amd_e400_c1e_detected;

 enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT,
			 IDLE_POLL};
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ee023919e476..56653b7322f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -289,9 +289,6 @@ void stop_this_cpu(void *dummy)
 		halt();
 }

-bool amd_e400_c1e_detected;
-EXPORT_SYMBOL(amd_e400_c1e_detected);
-
 static cpumask_var_t amd_e400_c1e_mask;

 void amd_e400_remove_cpu(int cpu)
@@ -301,26 +298,14 @@ void amd_e400_remove_cpu(int cpu)
 }

 /*
- * AMD Erratum 400 aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
+ * AMD Erratum 400 aware idle routine.
+ *
+ * We handle it the same way as C3 power states (local apic timer and TSC stop)
  */
 static void amd_e400_idle(void)
 {
-	if (!amd_e400_c1e_detected) {
-		u32 lo, hi;
-
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
-
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			amd_e400_c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			pr_info("System has AMD C1E enabled\n");
-		}
-	}
-
-	if (amd_e400_c1e_detected) {
+	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) {
 		int cpu = smp_processor_id();

 		if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
@@ -419,6 +404,30 @@ void __init init_amd_e400_c1e_mask(void)
 		zalloc_cpumask_var(&amd_e400_c1e_mask, GFP_KERNEL);
 }

+void __init arch_post_acpi_init(void)
+{
+	u32 lo, hi;
+
+	if (!boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
+		return;
+
+	/*
+	 * AMD E400 detection needs to happen *after* ACPI has been enabled. We
+	 * check for C1E active in the interrupt pending message MSR. If we detect
+	 * C1E enabled, machine is affected.
+	 */
+	rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+
+	if (!(lo &
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> Well, actually it's been optimized a lot from the first instances.
> Note, tracepoints need to be quite generic, so where it can be optimized
> is not obvious.

There is quite a bit of stuff that doesn't change from trace point to
trace point. So you could just cache all these decisions into a single
per cpu variable, and invalidate if anything changes the conditions.

> > > > Functions with # of instructions:
> > > >
> > > >   25 trace_event_raw_event_sched_switch
>
> This is from the TRACE_EVENT macro.
>
> 	if (trace_trigger_soft_disabled(trace_file))	\
> 		return;					\

Cache?

> 	if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
> 	    trace_event_ignore_this_pid(trace_file))
> 		return NULL;
>
> 	*current_rb = trace_file->tr->trace_buffer.buffer;

Indirection could be cached.

> 	if ((trace_file->flags &
> 	     (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) &&
> 	    (entry = this_cpu_read(trace_buffered_event))) {

Multiple tests of trace_file->flags in different functions, could be all
folded into a single test.

> Note: Some of these if statements can be encapsulated with jump labels, as they
> are false pretty much all of the time.
>
> 	/* Try to use the per cpu buffer first */
> 	val = this_cpu_inc_return(trace_buffered_event_cnt);
> 	if (val == 1) {
> 		trace_event_setup(entry, type, flags, pc);
> 		entry->array[0] = len;
> 		return entry;
> 	}
> 	this_cpu_dec(trace_buffered_event_cnt);

This can be pre cached in a fast path. If true just point per cpu
variable directly to buffer and invalidate if buffer changes or is full.

> 	preempt_disable_notrace();
>
> 	if (unlikely(atomic_read(&buffer->record_disabled)))
> 		goto out;

Cache.

> 	cpu = raw_smp_processor_id();
>
> 	if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask)))
> 		goto out;

Not needed for per cpu buffers in a fast path.

> 	cpu_buffer = buffer->buffers[cpu];
>
> 	if (unlikely(atomic_read(&cpu_buffer->record_disabled)))
> 		goto out;

cache

> 	if (unlikely(length > BUF_MAX_DATA_SIZE))
> 		goto out;
>
> 	if (unlikely(trace_recursive_lock(cpu_buffer)))
> 		goto out;

locking should invalidate state on this cpu, so shouldn't be needed

> 	u64 diff;
>
> 	rb_start_commit(cpu_buffer);
>
> #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
> 	/*
> 	 * Due to the ability to swap a cpu buffer from a buffer
> 	 * it is possible it was swapped before we committed.
> 	 * (committing stops a swap). We check for it here and
> 	 * if it happened, we have to fail the write.
> 	 */
> 	barrier();
> 	if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) {
> 		local_dec(&cpu_buffer->committing);
> 		local_dec(&cpu_buffer->commits);
> 		return NULL;
> 	}

swapping should invalidate state centrally (and then run slow path like
current path)

> 	/*
> 	 * We allow for interrupts to reenter here and do a trace.
> 	 * If one does, it will cause this original code to loop
> 	 * back here. Even with heavy interrupts happening, this
> 	 * should only happen a few times in a row. If this happens
> 	 * 1000 times in a row, there must be either an interrupt
> 	 * storm or we have something buggy.
> 	 * Bail!
> 	 */
> 	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
> 		goto out_fail;

didn't you disable interrupts earlier?

> 	/* Did the write stamp get updated already? */
> 	if (likely(info.ts >= cpu_buffer->write_stamp)) {
> 		info.delta = diff;
> 		if (unlikely(test_time_stamp(info.delta)))
> 			rb_handle_timestamp(cpu_buffer, &info);
> 	}

Not sure why a write stamp is needed. isn't that the last few entries?

...

-Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 28 Nov 2016 13:48:25 -0800
Andi Kleen wrote:

> > I took a look at this and forced some more functions to be inlined. I
> > did a little tweaking here and there. Could you pull my tree and see if
> > things are better? I don't currently have the hardware to run this
> > myself.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> >
> > branch: ftrace/core
>
> Didn't help much unfortunately. I think you need a real fast path here.

Well, actually it's been optimized a lot from the first instances.
Note, tracepoints need to be quite generic, so where it can be optimized
is not obvious.

> http://halobates.de/tracepoint-trace2
>
> $ wc -l tracepoint-trace2
> 595 tracepoint-trace2
>
> Opcode Histogram:
>
>     201 mov
>      51 push
>      51 pop
>      34 test
>      32 cmp
>      26 jz
>      24 jnz
>      24 and
>      16 ret
>      16 lea
>      15 call
>      12 add
>      10 jmp
>
> Functions with # of instructions:
>
>   25 trace_event_raw_event_sched_switch

This is from the TRACE_EVENT macro.

trace_event_raw_event_##call(void *__data, proto)			\
{									\
	struct trace_event_file *trace_file = __data;			\
	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
	struct trace_event_buffer fbuffer;				\
	struct trace_event_raw_##call *entry;				\
	int __data_size;						\
									\
	if (trace_trigger_soft_disabled(trace_file))			\
		return;							\
									\
	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
									\
	entry = trace_event_buffer_reserve(&fbuffer, trace_file,	\
				 sizeof(*entry) + __data_size);		\

>   19 trace_event_buffer_reserve

void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
				 struct trace_event_file *trace_file,
				 unsigned long len)
{
	struct trace_event_call *event_call = trace_file->event_call;

	if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
	    trace_event_ignore_this_pid(trace_file))
		return NULL;

	local_save_flags(fbuffer->flags);
	fbuffer->pc = preempt_count();
	/*
	 * If CONFIG_PREEMPT is enabled, then the tracepoint itself disables
	 * preemption (adding one to the preempt_count). Since we are
	 * interested in the preempt_count at the time the tracepoint was
	 * hit, we need to subtract one to offset the increment.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT))
		fbuffer->pc--;
	fbuffer->trace_file = trace_file;

	fbuffer->event =
		trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file,
						event_call->event.type, len,
						fbuffer->flags, fbuffer->pc);

>   21 trace_event_buffer_lock_reserve

struct ring_buffer_event *
trace_event_buffer_lock_reserve(struct ring_buffer **current_rb,
				struct trace_event_file *trace_file,
				int type, unsigned long len,
				unsigned long flags, int pc)
{
	struct ring_buffer_event *entry;
	int val;

	*current_rb = trace_file->tr->trace_buffer.buffer;

	if ((trace_file->flags &
	     (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) &&
	    (entry = this_cpu_read(trace_buffered_event))) {

Note: Some of these if statements can be encapsulated with jump labels, as they
are false pretty much all of the time.

		/* Try to use the per cpu buffer first */
		val = this_cpu_inc_return(trace_buffered_event_cnt);
		if (val == 1) {
			trace_event_setup(entry, type, flags, pc);
			entry->array[0] = len;
			return entry;
		}
		this_cpu_dec(trace_buffered_event_cnt);
	}

	entry = trace_buffer_lock_reserve(*current_rb, type, len, flags, pc);

>   57 ring_buffer_lock_reserve

struct ring_buffer_event *
ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length)
{
	struct ring_buffer_per_cpu *cpu_buffer;
	struct ring_buffer_event *event;
	int cpu;

	/* If we are tracing schedule, we don't want to recurse */
	preempt_disable_notrace();
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> I took a look at this and forced some more functions to be inlined. I
> did a little tweaking here and there. Could you pull my tree and see if
> things are better? I don't currently have the hardware to run this
> myself.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
>
> branch: ftrace/core

Didn't help much unfortunately. I think you need a real fast path here.

http://halobates.de/tracepoint-trace2

$ wc -l tracepoint-trace2
595 tracepoint-trace2

Opcode Histogram:

    201 mov
     51 push
     51 pop
     34 test
     32 cmp
     26 jz
     24 jnz
     24 and
     16 ret
     16 lea
     15 call
     12 add
     10 jmp

Functions with # of instructions:

  25 trace_event_raw_event_sched_switch
  19 trace_event_buffer_reserve
  21 trace_event_buffer_lock_reserve
  57 ring_buffer_lock_reserve
   3 trace_clock_local
   3 sched_clock
  24 native_sched_clock
   3 sched_clock
   2 trace_clock_local
  11 ring_buffer_lock_reserve
  81 __rb_reserve_next
  11 ring_buffer_lock_reserve
   6 trace_event_buffer_lock_reserve
  18 ring_buffer_event_data
   5 trace_event_buffer_lock_reserve
  38 tracing_generic_entry_update
  11 trace_event_buffer_lock_reserve
   5 trace_event_buffer_reserve
  18 ring_buffer_event_data
   4 trace_event_buffer_reserve
  28 trace_event_raw_event_sched_switch
  30 trace_event_buffer_commit
  19 trace_buffer_unlock_commit_regs
  11 ring_buffer_unlock_commit
  85 rb_commit
  18 ring_buffer_unlock_commit
  10 trace_buffer_unlock_commit_regs
   3 ftrace_trace_userstack
   7 trace_buffer_unlock_commit_regs
  11 trace_event_buffer_commit
   8 trace_event_raw_event_sched_switch

-Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen wrote:

> > It tries to be optimized. I "unoptimized" it a while back to pull out
> > all the inlines that were done in the tracepoint itself. That is, the
> > trace_() function is inlined in the code itself. By
> > breaking that up a bit, I was able to save a bunch of text because the
> > tracepoints were bloating the kernel tremendously.
>
> Just adding a few inlines won't fix the gigantic bloat that is currently
> there. See the PT trace I posted earlier (it was even truncated, it's
> actually worse). Just a single enabled trace point took about a us.
>
> POPF can cause some serializion but it won't be more than a few tens
> of cycles, which would be a few percent at best.
>
> Here is it again untruncated:
>
> http://halobates.de/tracepoint-trace

I took a look at this and forced some more functions to be inlined. I
did a little tweaking here and there. Could you pull my tree and see if
things are better? I don't currently have the hardware to run this
myself.

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git

branch: ftrace/core

Thanks!

-- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Tue, Nov 22, 2016 at 09:39:45AM -0500, Steven Rostedt wrote:
> On Mon, 21 Nov 2016 10:37:00 -0800
> Andi Kleen wrote:
>
> > Here is it again untruncated:
> >
> > http://halobates.de/tracepoint-trace
>
> BTW, what tool did you use to generate this?

perf and Intel PT, with the disassembler patches added and udis86
installed:

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/log/?h=perf/disassembler-2
http://udis86.sourceforge.net/

You need a CPU that supports PT, like Broadwell, Skylake, Goldmont.

# enable trace point
perf record -e intel_pt// -a sleep X
perf script --itrace=i0ns --ns -F cpu,ip,time,asm,sym

On Skylake can also get more accurate timing by enabling cycle timing
(at the cost of some overhead)

perf record -e intel_pt/cyc=1,cyc_thresh=1/ -a ...

The earlier trace didn't have that enabled.

-Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen wrote:

> Here is it again untruncated:
>
> http://halobates.de/tracepoint-trace

BTW, what tool did you use to generate this?

-- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 12:44:45PM -0800, Andi Kleen wrote:
> > > http://halobates.de/tracepoint-trace
> >
> > There's a lot of push and pop regs due to function calling. There's
> > places that inlines can still improve things, and perhaps even some
> > likely unlikelys well placed.
>
> Assuming you avoid all the push/pop and all the call/ret this would only be
> ~25% of the total instructions. There is just far too much logic and
> computation in there.
>
> % awk ' { print $5 } ' tracepoint-trace | sort | uniq -c | sort -rn
>     222 mov
>      57 push
>      57 pop
>      35 test
>      34 cmp
>      32 and
>      28 jz
>      25 jnz
>      21 ret
>      20 call
>      16 lea
>      11 add

Hmmm... It does indeed look like some performance analysis would be
good...

							Thanx, Paul
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > http://halobates.de/tracepoint-trace
>
> There's a lot of push and pop regs due to function calling. There's
> places that inlines can still improve things, and perhaps even some
> likely unlikelys well placed.

Assuming you avoid all the push/pop and all the call/ret this would only be
~25% of the total instructions. There is just far too much logic and
computation in there.

% awk ' { print $5 } ' tracepoint-trace | sort | uniq -c | sort -rn
    222 mov
     57 push
     57 pop
     35 test
     34 cmp
     32 and
     28 jz
     25 jnz
     21 ret
     20 call
     16 lea
     11 add

-Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 01:24:38PM -0500, Steven Rostedt wrote:
>
> Paul,
>
> On Mon, 21 Nov 2016 12:55:01 -0500
> Steven Rostedt wrote:
>
> > On Mon, 21 Nov 2016 18:18:53 +0100
> > Peter Zijlstra wrote:
> >
> > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> > > locking, as one does.
> >
> > Just to be clear, as ftrace in the kernel mostly represents function
> > tracing, which doesn't use RCU. This is a tracepoint feature.
> >
> > >
> > > Biggest objection would be that the rcu_irq_enter_irqson() thing does
> > > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> > > with that is quite a few cycles.
> >
> > Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not
> > concerned enough to put a heavy weight rcu idle code in for all
> > tracepoints.
> >
> > Although, what about a percpu flag that can be checked in the
> > tracepoint code to see if it should enable RCU or not?
> >
> > Hmm, I wonder if "rcu_is_watching()" is light enough to have in all
> > tracepoints?
>
> Is it possible to make rcu_is_watching() an inlined call to prevent the
> overhead of doing a function call? This way we could use this in the
> fast path of the tracepoint.

It would mean exposing the rcu_dynticks structure to the rest of the
kernel, but I guess that wouldn't be the end of the world.

Are you calling rcu_is_watching() or __rcu_is_watching()? The latter
is appropriate if preemption is already disabled.

							Thanx, Paul
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen wrote:

> Just adding a few inlines won't fix the gigantic bloat that is currently
> there. See the PT trace I posted earlier (it was even truncated, it's
> actually worse). Just a single enabled trace point took about a us.

If you want to play with that, enable "Add tracepoint that benchmarks
tracepoints", and enable /sys/kernel/debug/tracing/events/benchmark

I just did on my box, and I have a max of 1.27 us, average .184 us.
Thus, on average, a tracepoint takes 184 nanoseconds. Note, a cold
cache tracepoint (the first one it traced which is not saved in the
"max") was 3.3 us.

Yes it can still be improved, and I do spend time doing that.

>
> POPF can cause some serializion but it won't be more than a few tens
> of cycles, which would be a few percent at best.
>
> Here is it again untruncated:
>
> http://halobates.de/tracepoint-trace

There's a lot of push and pop regs due to function calling. There's
places that inlines can still improve things, and perhaps even some
likely unlikelys well placed.

-- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 10:37:00 -0800 Andi Kleen wrote: > Ok so how should tracing in idle code work then in your opinion? As I suggested already. If we can get a lightweight rcu_is_watching() then we can do the rcu_idle work when needed, and not when we don't need it. Sure this will add more branches, but that's still cheaper than popf. -- Steve
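For illustration, the fast path being suggested here could look roughly like the user-space mock below. All names and bodies are stand-ins for the sketch, not the kernel's actual implementation: the idea is simply that the tracepoint pays for the RCU-idle dance only when rcu_is_watching() says RCU is not already watching (e.g. when called from the idle loop).

```c
#include <assert.h>
#include <stdbool.h>

/* Mocked kernel primitives -- illustrative stand-ins only. */
static bool watching;           /* what rcu_is_watching() would report */
static int  rcu_enter_calls;    /* counts the expensive slow-path work */
static int  events_recorded;

static bool rcu_is_watching(void)      { return watching; }
static void rcu_irq_enter_irqson(void) { rcu_enter_calls++; }
static void rcu_irq_exit_irqson(void)  { }
static void __do_trace(void)           { events_recorded++; }

/*
 * Sketch of the proposed fast path: one extra branch in the common
 * case, and the POPF-heavy enter/exit pair only when RCU is idle.
 */
static void trace_sched_switch_sketch(void)
{
	if (rcu_is_watching()) {
		__do_trace();               /* common case: no extra cost */
	} else {
		rcu_irq_enter_irqson();     /* slow path: idle-loop callers */
		__do_trace();
		rcu_irq_exit_irqson();
	}
}
```

In this mock, tracing from normal context never touches the slow path; only an idle-context caller (watching == false) pays for the enter/exit pair.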
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> It tries to be optimized. I "unoptimized" it a while back to pull out > all the inlines that were done in the tracepoint itself. That is, the > trace_<event>() function is inlined in the code itself. By > breaking that up a bit, I was able to save a bunch of text because the > tracepoints were bloating the kernel tremendously. Just adding a few inlines won't fix the gigantic bloat that is currently there. See the PT trace I posted earlier (it was even truncated, it's actually worse). Just a single enabled trace point took about a us. POPF can cause some serialization but it won't be more than a few tens of cycles, which would be a few percent at best. Here it is again untruncated: http://halobates.de/tracepoint-trace $ wc -l tracepoint-trace 640 tracepoint-trace $ head -1 tracepoint-trace [001] 264171.903577780: 810d6040 trace_event_raw_event_sched_switch push %rbp $ tail -1 tracepoint-trace [001] 264171.903578780: 810d6117 trace_event_raw_event_sched_switch ret > There can be more optimization done too. But just because it's not > optimized to the best it can be (which should be our goal) is no > excuse to bloat it more with popf! Ok so how should tracing in idle code work then in your opinion? -Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
Paul, On Mon, 21 Nov 2016 12:55:01 -0500 Steven Rostedt wrote: > On Mon, 21 Nov 2016 18:18:53 +0100 > Peter Zijlstra wrote: > > > It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid > > locking, as one does. > > Just to be clear, ftrace in the kernel mostly represents function > tracing, which doesn't use RCU. This is a tracepoint feature. > > > > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > > POPF and rcu_irq_exit_irqson() does it again. So wrapping every tracepoint > > with that is quite a few cycles. > > Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not > concerned enough to put heavyweight rcu idle code in for all > tracepoints. > > Although, what about a percpu flag that can be checked in the > tracepoint code to see if it should enable RCU or not? > > Hmm, I wonder if "rcu_is_watching()" is light enough to have in all > tracepoints? Is it possible to make rcu_is_watching() an inlined call to prevent the overhead of doing a function call? This way we could use this in the fast path of the tracepoint. -- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 10:06:54 -0800 Andi Kleen wrote: > > And a popf can be much more expensive than any of these. You should > > know, not all instructions are equal. > > > > Using perf, I've seen popf take up almost 30% of a function the size of > > this. > > In any case it's a small fraction of the 600+ instructions which are currently > executed for every enabled trace point. > > If ftrace was actually optimized code this would make some sense, but it > clearly isn't ... It tries to be optimized. I "unoptimized" it a while back to pull out all the inlines that were done in the tracepoint itself. That is, the trace_<event>() function is inlined in the code itself. By breaking that up a bit, I was able to save a bunch of text because the tracepoints were bloating the kernel tremendously. There can be more optimization done too. But just because it's not optimized to the best it can be (which should be our goal) is no excuse to bloat it more with popf! -- Steve
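The "un-inlining" described above can be sketched as follows. This is a hedged user-space mock, not the kernel's real tracepoint machinery (which uses static keys rather than a plain flag, and the `trace_sched_switch`/`do_trace_sched_switch` names here are invented for the sketch): the point is that each call site only inlines a test and a call, while the heavy event bodywork lives in one shared out-of-line function, so thousands of call sites don't each carry the full buffer-reserve/commit path.

```c
#include <assert.h>
#include <stdbool.h>

static bool sched_switch_enabled;   /* stand-in for a static key */
static int  events;

/* Out-of-line slow path: shared by every call site. */
static void __attribute__((noinline)) do_trace_sched_switch(int prev, int next)
{
	(void)prev;
	(void)next;
	events++;   /* the real code reserves and commits a ring-buffer record */
}

/* What gets inlined at each call site: one test, one call. */
static inline void trace_sched_switch(int prev, int next)
{
	if (sched_switch_enabled)
		do_trace_sched_switch(prev, next);
}
```

When the event is disabled the call site costs a single predictable branch; the text savings come from every call site sharing `do_trace_sched_switch()` instead of expanding it inline.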
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> And a popf can be much more expensive than any of these. You should > know, not all instructions are equal. > > Using perf, I've seen popf take up almost 30% of a function the size of > this. In any case it's a small fraction of the 600+ instructions which are currently executed for every enabled trace point. If ftrace was actually optimized code this would make some sense, but it clearly isn't ... -Andi
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 09:45:04 -0800 Andi Kleen wrote: > On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > > > it got away with attached change.. but this rcu logic > > > > > is far beyond me, so it's just wild guess.. ;-) > > > > > > > > I think I prefer something like the below, that only annotates the one > > > > RDMSR in question, instead of all of them. > > > > > > It would be far better to just fix trace points so that they always work. > > > > > > This whole thing is a travesty: we have tens of thousands of lines of > > > code in > > > ftrace to support tracing in NMIs, but then "debug features"[1] like > > > this come around and make trace points unusable for far more code than > > > just the NMI handlers. > > > > Not sure the idle handlers are more code than we have NMI code, but yes, > > it's annoying. > > > > It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid > > locking, as one does. > > > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > > POPF and rcu_irq_exit_irqson() does it again. So wrapping every tracepoint > > with that is quite a few cycles. > > This is only when the trace point is enabled, right? > > ftrace is already executing a lot of instructions when enabled. Adding two > POPF > won't be the end of the world. And a popf can be much more expensive than any of these. You should know, not all instructions are equal. Using perf, I've seen popf take up almost 30% of a function the size of this. 
-- Steve > > -Andi > > > 264171.903577780: 810d6040 trace_event_raw_event_sched_switch > push %rbp > 264171.903577780: 810d6041 trace_event_raw_event_sched_switch > mov %rsp, %rbp > 264171.903577780: 810d6044 trace_event_raw_event_sched_switch > push %r15 > 264171.903577780: 810d6046 trace_event_raw_event_sched_switch > push %r14 > 264171.903577780: 810d6048 trace_event_raw_event_sched_switch > push %r13 > 264171.903577780: 810d604a trace_event_raw_event_sched_switch > push %r12 > 264171.903577780: 810d604c trace_event_raw_event_sched_switch > mov %esi, %r13d > 264171.903577780: 810d604f trace_event_raw_event_sched_switch > push %rbx > 264171.903577780: 810d6050 trace_event_raw_event_sched_switch > mov %rdi, %r12 > 264171.903577780: 810d6053 trace_event_raw_event_sched_switch > mov %rdx, %r14 > 264171.903577780: 810d6056 trace_event_raw_event_sched_switch > mov %rcx, %r15 > 264171.903577780: 810d6059 trace_event_raw_event_sched_switch > sub $0x30, %rsp > 264171.903577780: 810d605d trace_event_raw_event_sched_switch > mov 0x48(%rdi), %rbx > 264171.903577780: 810d6061 trace_event_raw_event_sched_switch > test $0x80, %bl > 264171.903577780: 810d6064 trace_event_raw_event_sched_switch > jnz 0x810d608c > 264171.903577780: 810d6066 trace_event_raw_event_sched_switch > test $0x40, %bl > 264171.903577780: 810d6069 trace_event_raw_event_sched_switch > jnz trace_event_raw_event_sched_switch+216 > 264171.903577780: 810d606f trace_event_raw_event_sched_switch > test $0x20, %bl > 264171.903577780: 810d6072 trace_event_raw_event_sched_switch > jz 0x810d6083 > 264171.903577780: 810d6083 trace_event_raw_event_sched_switch > and $0x1, %bh > 264171.903577780: 810d6086 trace_event_raw_event_sched_switch > jnz trace_event_raw_event_sched_switch+237 > 264171.903578113: 810d608c trace_event_raw_event_sched_switch > mov $0x40, %edx > 264171.903578113: 810d6091 trace_event_raw_event_sched_switch > mov %r12, %rsi > 264171.903578113: 810d6094 trace_event_raw_event_sched_switch > mov %rsp, 
%rdi > 264171.903578113: 810d6097 trace_event_raw_event_sched_switch > call trace_event_buffer_reserve > 264171.903578113: 8114f150 trace_event_buffer_reserve test > $0x1, 0x49(%rsi) > 264171.903578113: 8114f154 trace_event_buffer_reserve push > %rbp > 264171.903578113: 8114f155 trace_event_buffer_reserve mov > 0x10(%rsi), %r10 > 264171.903578113: 8114f159 trace_event_buffer_reserve mov > %rsp, %rbp > 264171.903578113: 8114f15c trace_event_buffer_reserve push > %rbx > 264171.903578113: 8114f15d trace_event_buffer_reserve jz > 0x8114f17e > 264171.903578113: 8114f17e trace_event_buffer_reserve mov > %rdx, %rcx > 264171.903578113: 8114f181 trace_event_buffer_reserve mov > %rdi, %rbx > 264171.903578113: 8114f184 trace_event_buffer_reserve pushfq > 264171.903578113: 8114f185 trace_event_buffer_reserve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 18:18:53 +0100 Peter Zijlstra wrote: > It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid > locking, as one does. Just to be clear, ftrace in the kernel mostly represents function tracing, which doesn't use RCU. This is a tracepoint feature. > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > POPF and rcu_irq_exit_irqson() does it again. So wrapping every tracepoint > with that is quite a few cycles. Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not concerned enough to put heavyweight rcu idle code in for all tracepoints. Although, what about a percpu flag that can be checked in the tracepoint code to see if it should enable RCU or not? Hmm, I wonder if "rcu_is_watching()" is light enough to have in all tracepoints? -- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > > it got away with attached change.. but this rcu logic > > > > is far beyond me, so it's just wild guess.. ;-) > > > > > > I think I prefer something like the below, that only annotates the one > > > RDMSR in question, instead of all of them. > > > > It would be far better to just fix trace points so that they always work. > > > > This whole thing is a travesty: we have tens of thousands of lines of code > > in > > ftrace to support tracing in NMIs, but then "debug features"[1] like > > this come around and make trace points unusable for far more code than > > just the NMI handlers. > > Not sure the idle handlers are more code than we have NMI code, but yes, > it's annoying. > > It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid > locking, as one does. > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > POPF and rcu_irq_exit_irqson() does it again. So wrapping every tracepoint > with that is quite a few cycles. This is only when the trace point is enabled, right? ftrace is already executing a lot of instructions when enabled. Adding two POPF won't be the end of the world. 
-Andi 264171.903577780: 810d6040 trace_event_raw_event_sched_switch push %rbp 264171.903577780: 810d6041 trace_event_raw_event_sched_switch mov %rsp, %rbp 264171.903577780: 810d6044 trace_event_raw_event_sched_switch push %r15 264171.903577780: 810d6046 trace_event_raw_event_sched_switch push %r14 264171.903577780: 810d6048 trace_event_raw_event_sched_switch push %r13 264171.903577780: 810d604a trace_event_raw_event_sched_switch push %r12 264171.903577780: 810d604c trace_event_raw_event_sched_switch mov %esi, %r13d 264171.903577780: 810d604f trace_event_raw_event_sched_switch push %rbx 264171.903577780: 810d6050 trace_event_raw_event_sched_switch mov %rdi, %r12 264171.903577780: 810d6053 trace_event_raw_event_sched_switch mov %rdx, %r14 264171.903577780: 810d6056 trace_event_raw_event_sched_switch mov %rcx, %r15 264171.903577780: 810d6059 trace_event_raw_event_sched_switch sub $0x30, %rsp 264171.903577780: 810d605d trace_event_raw_event_sched_switch mov 0x48(%rdi), %rbx 264171.903577780: 810d6061 trace_event_raw_event_sched_switch test $0x80, %bl 264171.903577780: 810d6064 trace_event_raw_event_sched_switch jnz 0x810d608c 264171.903577780: 810d6066 trace_event_raw_event_sched_switch test $0x40, %bl 264171.903577780: 810d6069 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+216 264171.903577780: 810d606f trace_event_raw_event_sched_switch test $0x20, %bl 264171.903577780: 810d6072 trace_event_raw_event_sched_switch jz 0x810d6083 264171.903577780: 810d6083 trace_event_raw_event_sched_switch and $0x1, %bh 264171.903577780: 810d6086 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+237 264171.903578113: 810d608c trace_event_raw_event_sched_switch mov $0x40, %edx 264171.903578113: 810d6091 trace_event_raw_event_sched_switch mov %r12, %rsi 264171.903578113: 810d6094 trace_event_raw_event_sched_switch mov %rsp, %rdi 264171.903578113: 810d6097 trace_event_raw_event_sched_switch call trace_event_buffer_reserve 
264171.903578113: 8114f150 trace_event_buffer_reserve test $0x1, 0x49(%rsi) 264171.903578113: 8114f154 trace_event_buffer_reserve push %rbp 264171.903578113: 8114f155 trace_event_buffer_reserve mov 0x10(%rsi), %r10 264171.903578113: 8114f159 trace_event_buffer_reserve mov %rsp, %rbp 264171.903578113: 8114f15c trace_event_buffer_reserve push %rbx 264171.903578113: 8114f15d trace_event_buffer_reserve jz 0x8114f17e 264171.903578113: 8114f17e trace_event_buffer_reserve mov %rdx, %rcx 264171.903578113: 8114f181 trace_event_buffer_reserve mov %rdi, %rbx 264171.903578113: 8114f184 trace_event_buffer_reserve pushfq 264171.903578113: 8114f185 trace_event_buffer_reserve pop %r8 264171.903578113: 8114f187 trace_event_buffer_reserve mov %gs:0x7eebd2c1(%rip), %r9d 264171.903578113: 8114f18f trace_event_buffer_reserve and $0x7fff, %r9d 264171.903578113: 8114f196 trace_event_buffer_reserve mov %r8, 0x20(%rdi) 264171.903578113: 8114f19a trace_event_buffer_reserve mov %rsi, 0x10(%rdi) 264171.903578113: 8114f19e trace_event_buffer_reserve mov %r9d, 0x28(%rdi) 264171.903578113: 8114f1a2 trace_event_buffer_reserve mov 0x40(%r10), %edx 264171.903578113: 8114f1a6 trace_event_buffer_reserve call trace_event_buffer_lock_reserve 264171.903578113: 81142210
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > it got away with attached change.. but this rcu logic > > > is far beyond me, so it's just wild guess.. ;-) > > > > I think I prefer something like the below, that only annotates the one > > RDMSR in question, instead of all of them. > > It would be far better to just fix trace points so that they always work. > > This whole thing is a travesty: we have tens of thousands of lines of code in > ftrace to support tracing in NMIs, but then "debug features"[1] like > this come around and make trace points unusable for far more code than > just the NMI handlers. Not sure the idle handlers are more code than we have NMI code, but yes, it's annoying. It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid locking, as one does. Biggest objection would be that the rcu_irq_enter_irqson() thing does POPF and rcu_irq_exit_irqson() does it again. So wrapping every tracepoint with that is quite a few cycles.
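Why the enter/exit pair implies two POPFs can be sketched as below. This is a simplified user-space model, not the real kernel code: the function names mirror the kernel's, but the bodies are illustrative stand-ins. Each `_irqson` wrapper does a `local_irq_save()`/`local_irq_restore()` around the RCU bookkeeping, and the restore is the POPF (a full reload of the flags register), so wrapping one tracepoint costs one POPF on entry and a second on exit.

```c
#include <assert.h>

static unsigned long cpu_flags = 0x200;  /* pretend IF (bit 9) is set */
static int popf_count;                   /* counts flag restores (POPF) */

/* Stand-ins for the kernel macros: save flags + CLI, then POPF. */
static void local_irq_save(unsigned long *flags)
{
	*flags = cpu_flags;
	cpu_flags &= ~0x200UL;               /* PUSHF + CLI */
}

static void local_irq_restore(unsigned long flags)
{
	cpu_flags = flags;                   /* POPF: the expensive part */
	popf_count++;
}

static void rcu_irq_enter(void) { }      /* RCU bookkeeping elided */
static void rcu_irq_exit(void)  { }

static void rcu_irq_enter_irqson(void)
{
	unsigned long flags;
	local_irq_save(&flags);
	rcu_irq_enter();
	local_irq_restore(flags);            /* one POPF ... */
}

static void rcu_irq_exit_irqson(void)
{
	unsigned long flags;
	local_irq_save(&flags);
	rcu_irq_exit();
	local_irq_restore(flags);            /* ... and a second per event */
}
```

Running the pair once, as a wrapped tracepoint would, performs two flag restores; that per-event cost is the objection being raised here.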
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > it got away with attached change.. but this rcu logic > > is far beyond me, so it's just wild guess.. ;-) > > I think I prefer something like the below, that only annotates the one > RDMSR in question, instead of all of them. It would be far better to just fix trace points so that they always work. This whole thing is a travesty: we have tens of thousands of lines of code in ftrace to support tracing in NMIs, but then "debug features"[1] like this come around and make trace points unusable for far more code than just the NMI handlers. -Andi [1] unclear what they actually "debug" here.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
+ tglx. On Mon, Nov 21, 2016 at 04:41:04PM +0100, Peter Zijlstra wrote: > Right, I just wondered about the !c1e present case. In that case we'll > forever read the msr in the hope of finding that bit set, and we never > will. Well, erratum 400 flag is set on everything F10h from model 0x2 onwards, which is basically everything that's still out there. And we set it for most of the K8s too. So I don't think there's an affected machine out there for which we don't enable that erratum, so I'd assume it is turned on everywhere. IOW, what's the worst thing that can happen if we did this below? We basically get rid of the detection and switch the timer to broadcast mode immediately on the halting CPU. amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))" check so it will run on the affected CPUs only... Thoughts? --- diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 0888a879120f..7ab9d0254b4d 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -354,41 +354,30 @@ void amd_e400_remove_cpu(int cpu) */ static void amd_e400_idle(void) { - if (!amd_e400_c1e_detected) { - u32 lo, hi; + int cpu = smp_processor_id(); - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); + if (!static_cpu_has(X86_FEATURE_NONSTOP_TSC)) + mark_tsc_unstable("TSC halt in AMD C1E"); - if (lo & K8_INTP_C1E_ACTIVE_MASK) { - amd_e400_c1e_detected = true; - if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) - mark_tsc_unstable("TSC halt in AMD C1E"); - pr_info("System has AMD C1E enabled\n"); - } - } - - if (amd_e400_c1e_detected) { - int cpu = smp_processor_id(); + pr_info("System has AMD C1E enabled\n"); - if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) { - cpumask_set_cpu(cpu, amd_e400_c1e_mask); - /* Force broadcast so ACPI can not interfere. 
*/ - tick_broadcast_force(); - pr_info("Switch to broadcast mode on CPU%d\n", cpu); - } - tick_broadcast_enter(); + if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) { + cpumask_set_cpu(cpu, amd_e400_c1e_mask); + /* Force broadcast so ACPI can not interfere. */ + tick_broadcast_force(); + pr_info("Switch to broadcast mode on CPU%d\n", cpu); + } + tick_broadcast_enter(); - default_idle(); + default_idle(); - /* -* The switch back from broadcast mode needs to be -* called with interrupts disabled. -*/ - local_irq_disable(); - tick_broadcast_exit(); - local_irq_enable(); - } else - default_idle(); + /* +* The switch back from broadcast mode needs to be +* called with interrupts disabled. +*/ + local_irq_disable(); + tick_broadcast_exit(); + local_irq_enable(); } /* -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 04:35:38PM +0100, Borislav Petkov wrote: > On Mon, Nov 21, 2016 at 03:37:16PM +0100, Peter Zijlstra wrote: > > Yeah, but this one does a printk() when it hits the contidion it checks > > for, so not tracing it would be fine I think. > > FWIW, there's already: > > static inline void notrace > native_write_msr(unsigned int msr, u32 low, u32 high) > { > __native_write_msr_notrace(msr, low, high); > if (msr_tracepoint_active(__tracepoint_write_msr)) > do_trace_write_msr(msr, ((u64)high << 32 | low), 0); > } > > so could be mirrored. Yeah, I know. I caused that to appear ;-) > > Also, Boris, why do we need to redo that rdmsr until we see that bit > > set? Can't we simply do the rdmsr once and then be done with it? > > Hmm, so the way I read the definition of those bits - > K8_INTP_C1E_ACTIVE_MASK - it looks like they get set when the cores > enter halt: > > 28 C1eOnCmpHalt: C1E on chip multi-processing halt. Read-write. > > 1=When all cores of a processor have entered the halt state, the processor > ^^^ > generates an IO cycle as specified by IORd, IOMsgData, and IOMsgAddr. > When this bit is set, SmiOnCmpHalt and IntPndMsg must be 0, otherwise > the behavior is undefined. For revision DA-C, this bit is only supported > for dual-core processors. For revision C3 and E, this bit is supported > for any number of cores. See 2.4.3.3.3 [Hardware Initiated C1E]. > > 27 SmiOnCmpHalt: SMI on chip multi-processing halt. Read-write. > > 1=When all cores of the processor have entered the halt state, the > ^ > > processor generates an SMI-trigger IO cycle as specified by IORd, > IOMsgData, and IOMsgAddr. When this bit is set C1eOnCmpHalt and > IntPndMsg must be 0, otherwise the behavior is undefined. The status is > stored in SMMFEC4[SmiOnCmpHaltSts]. See 2.4.3.3.1 [SMI Initiated C1E]. > > But the thing is, we do it only once until amd_e400_c1e_detected is true. 
> > FWIW, I'd love it if we could do > > if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E)) > > but I'm afraid we need to look at that MSR at least once until we set > the boolean. > > I could ask if that is really the case and whether we can detect it > differently... > > In any case, I have a box here in case you want me to test patches. Right, I just wondered about the !c1e present case. In that case we'll forever read the msr in the hope of finding that bit set, and we never will. Don't care too much, just idle curiosity. Let me do the rdmsr_notrace() thing.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 03:37:16PM +0100, Peter Zijlstra wrote: > Yeah, but this one does a printk() when it hits the contidion it checks > for, so not tracing it would be fine I think. FWIW, there's already: static inline void notrace native_write_msr(unsigned int msr, u32 low, u32 high) { __native_write_msr_notrace(msr, low, high); if (msr_tracepoint_active(__tracepoint_write_msr)) do_trace_write_msr(msr, ((u64)high << 32 | low), 0); } so could be mirrored. > Also, Boris, why do we need to redo that rdmsr until we see that bit > set? Can't we simply do the rdmsr once and then be done with it? Hmm, so the way I read the definition of those bits - K8_INTP_C1E_ACTIVE_MASK - it looks like they get set when the cores enter halt: 28 C1eOnCmpHalt: C1E on chip multi-processing halt. Read-write. 1=When all cores of a processor have entered the halt state, the processor ^^^ generates an IO cycle as specified by IORd, IOMsgData, and IOMsgAddr. When this bit is set, SmiOnCmpHalt and IntPndMsg must be 0, otherwise the behavior is undefined. For revision DA-C, this bit is only supported for dual-core processors. For revision C3 and E, this bit is supported for any number of cores. See 2.4.3.3.3 [Hardware Initiated C1E]. 27 SmiOnCmpHalt: SMI on chip multi-processing halt. Read-write. 1=When all cores of the processor have entered the halt state, the ^ processor generates an SMI-trigger IO cycle as specified by IORd, IOMsgData, and IOMsgAddr. When this bit is set C1eOnCmpHalt and IntPndMsg must be 0, otherwise the behavior is undefined. The status is stored in SMMFEC4[SmiOnCmpHaltSts]. See 2.4.3.3.1 [SMI Initiated C1E]. But the thing is, we do it only once until amd_e400_c1e_detected is true. FWIW, I'd love it if we could do if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E)) but I'm afraid we need to look at that MSR at least once until we set the boolean. I could ask if that is really the case and whether we can detect it differently... 
In any case, I have a box here in case you want me to test patches. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 09:15:43AM -0500, Steven Rostedt wrote: > On Mon, 21 Nov 2016 13:58:30 +0100 > Peter Zijlstra wrote: > > > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > > > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > > > index 0888a879120f..d6c6aa80675f 100644 > > > > --- a/arch/x86/kernel/process.c > > > > +++ b/arch/x86/kernel/process.c > > > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > > > if (!amd_e400_c1e_detected) { > > > > u32 lo, hi; > > > > > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > > > amd_e400_c1e_detected = true; > > > > OK, so while looking at this again, I don't like this either :/ > > > > Problem with this one is that it always adds the RCU fiddling overhead, > > even when we're not tracing. > > > > I could do an rdmsr_notrace() for this one, dunno if it's important. > > But that would neglect the point of tracing rdmsr. What about: > > /* tracepoints require RCU enabled */ > if (trace_read_msr_enabled()) > RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > else > rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); Yeah, but this one does a printk() when it hits the condition it checks for, so not tracing it would be fine I think. Also, Boris, why do we need to redo that rdmsr until we see that bit set? Can't we simply do the rdmsr once and then be done with it?
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, 21 Nov 2016 13:58:30 +0100 Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > > index 0888a879120f..d6c6aa80675f 100644 > > > --- a/arch/x86/kernel/process.c > > > +++ b/arch/x86/kernel/process.c > > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > > if (!amd_e400_c1e_detected) { > > > u32 lo, hi; > > > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > > amd_e400_c1e_detected = true; > > OK, so while looking at this again, I don't like this either :/ > > Problem with this one is that it always adds the RCU fiddling overhead, > even when we're not tracing. > > I could do an rdmsr_notrace() for this one, dunno if it's important. But that would neglect the point of tracing rdmsr. What about: /* tracepoints require RCU enabled */ if (trace_read_msr_enabled()) RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); else rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); -- Steve
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > index 0888a879120f..d6c6aa80675f 100644 > > --- a/arch/x86/kernel/process.c > > +++ b/arch/x86/kernel/process.c > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > if (!amd_e400_c1e_detected) { > > u32 lo, hi; > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > amd_e400_c1e_detected = true; OK, so while looking at this again, I don't like this either :/ Problem with this one is that it always adds the RCU fiddling overhead, even when we're not tracing. I could do an rdmsr_notrace() for this one, dunno if it's important.
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 12:31:31PM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 12:22:29PM +0100, Jiri Olsa wrote: > > On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote: > > > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > > > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > > > > > > > > I think I prefer something like the below, that only annotates the one > > > > > RDMSR in question, instead of all of them. > > > > > > > > I was wondering about that, but haven't found RCU_NONIDLE > > > > > > Its in rcupdate.h ;-) > > > > it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. > > ? > > Probably, I still have to make pretty the uncore patch from last week, > and I have a PEBS patch that needs looking at :-) > > BTW, if you have any opinion on that PEBS register set crap, do holler > :-) > > lkml.kernel.org/r/20161117171731.gv3...@twins.programming.kicks-ass.net ook, will check thanks, jirka
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 12:22:29PM +0100, Jiri Olsa wrote: > On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote: > > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > > > > > > I think I prefer something like the below, that only annotates the one > > > > RDMSR in question, instead of all of them. > > > > > > I was wondering about that, but haven't found RCU_NONIDLE > > > > Its in rcupdate.h ;-) > > it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ? Probably, I still have to make pretty the uncore patch from last week, and I have a PEBS patch that needs looking at :-) BTW, if you have any opinion on that PEBS register set crap, do holler :-) lkml.kernel.org/r/20161117171731.gv3...@twins.programming.kicks-ass.net
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > > > > I think I prefer something like the below, that only annotates the one > > > RDMSR in question, instead of all of them. > > > > I was wondering about that, but haven't found RCU_NONIDLE > > Its in rcupdate.h ;-) it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ? thanks, jirka
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:43:21AM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 01:02:25AM -0800, Paul E. McKenney wrote: > > On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote: > > > > > > it got away with attached change.. but this rcu logic > > > is far beyond me, so it's just wild guess.. ;-) > > > > If in idle, the _rcuidle() is needed, so: > > Well, the point is, only this one rdmsr users is in idle, all the others > are not, so we should not be annotating _all_ of them, should we? Fair enough! Thanx, Paul
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 01:02:25AM -0800, Paul E. McKenney wrote: > On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote: > > > > it got away with attached change.. but this rcu logic > > is far beyond me, so it's just wild guess.. ;-) > > If in idle, the _rcuidle() is needed, so: Well, the point is, only this one rdmsr users is in idle, all the others are not, so we should not be annotating _all_ of them, should we?
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > > I think I prefer something like the below, that only annotates the one > > RDMSR in question, instead of all of them. > > I was wondering about that, but haven't found RCU_NONIDLE Its in rcupdate.h ;-)
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > hi,
> > Jan hit following output when msr tracepoints are enabled on amd server:
> >
> > [ 91.585653] ===
> > [ 91.589840] [ INFO: suspicious RCU usage. ]
> > [ 91.594025] 4.9.0-rc1+ #1 Not tainted
> > [ 91.597691] ---
> > [ 91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious
> > rcu_dereference_check() usage!
> > [ 91.610222]
> > [ 91.610222] other info that might help us debug this:
> > [ 91.610222]
> > [ 91.618224]
> > [ 91.618224] RCU used illegally from idle CPU!
> > [ 91.618224] rcu_scheduler_active = 1, debug_locks = 0
> > [ 91.629081] RCU used illegally from extended quiescent state!
> > [ 91.634820] no locks held by swapper/1/0.
> > [ 91.638832]
> > [ 91.638832] stack backtrace:
> > [ 91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> > [ 91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03 ' 05/09/2008
> > [ 91.656159] c900018fbdf8 813ed43c 88017ede8000
> > 0001
> > [ 91.663637] c900018fbe28 810fdcd7 880233f95dd0
> > c0010055
> > [ 91.671107] c900018fbe58
> > 814297ac
> > [ 91.678560] Call Trace:
> > [ 91.681022] [] dump_stack+0x85/0xc9
> > [ 91.686164] [] lockdep_rcu_suspicious+0xe7/0x120
> > [ 91.692429] [] do_trace_read_msr+0x14c/0x1b0
> > [ 91.698349] [] native_read_msr+0x32/0x40
> > [ 91.703921] [] amd_e400_idle+0x7e/0x110
> > [ 91.709407] [] arch_cpu_idle+0xf/0x20
> > [ 91.714720] [] default_idle_call+0x23/0x40
> > [ 91.720467] [] cpu_startup_entry+0x1da/0x2b0
> > [ 91.726387] [] start_secondary+0x17f/0x1f0
> >
> >
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
>
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.

I was wondering about that, but haven't found RCU_NONIDLE

thanks,
jirka

>
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0888a879120f..d6c6aa80675f 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
>  	if (!amd_e400_c1e_detected) {
>  		u32 lo, hi;
>  
> -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
>  
>  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
>  			amd_e400_c1e_detected = true;
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
>
> [ 91.585653] ===
> [ 91.589840] [ INFO: suspicious RCU usage. ]
> [ 91.594025] 4.9.0-rc1+ #1 Not tainted
> [ 91.597691] ---
> [ 91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious
> rcu_dereference_check() usage!
> [ 91.610222]
> [ 91.610222] other info that might help us debug this:
> [ 91.610222]
> [ 91.618224]
> [ 91.618224] RCU used illegally from idle CPU!
> [ 91.618224] rcu_scheduler_active = 1, debug_locks = 0
> [ 91.629081] RCU used illegally from extended quiescent state!
> [ 91.634820] no locks held by swapper/1/0.
> [ 91.638832]
> [ 91.638832] stack backtrace:
> [ 91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> [ 91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03 ' 05/09/2008
> [ 91.656159] c900018fbdf8 813ed43c 88017ede8000
> 0001
> [ 91.663637] c900018fbe28 810fdcd7 880233f95dd0
> c0010055
> [ 91.671107] c900018fbe58
> 814297ac
> [ 91.678560] Call Trace:
> [ 91.681022] [] dump_stack+0x85/0xc9
> [ 91.686164] [] lockdep_rcu_suspicious+0xe7/0x120
> [ 91.692429] [] do_trace_read_msr+0x14c/0x1b0
> [ 91.698349] [] native_read_msr+0x32/0x40
> [ 91.703921] [] amd_e400_idle+0x7e/0x110
> [ 91.709407] [] arch_cpu_idle+0xf/0x20
> [ 91.714720] [] default_idle_call+0x23/0x40
> [ 91.720467] [] cpu_startup_entry+0x1da/0x2b0
> [ 91.726387] [] start_secondary+0x17f/0x1f0
>
>
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

I think I prefer something like the below, that only annotates the one
RDMSR in question, instead of all of them.

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0888a879120f..d6c6aa80675f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -357,7 +357,7 @@ static void amd_e400_idle(void)
 	if (!amd_e400_c1e_detected) {
 		u32 lo, hi;
 
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
 
 		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
 			amd_e400_c1e_detected = true;
Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
>
> [ 91.585653] ===
> [ 91.589840] [ INFO: suspicious RCU usage. ]
> [ 91.594025] 4.9.0-rc1+ #1 Not tainted
> [ 91.597691] ---
> [ 91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious
> rcu_dereference_check() usage!
> [ 91.610222]
> [ 91.610222] other info that might help us debug this:
> [ 91.610222]
> [ 91.618224]
> [ 91.618224] RCU used illegally from idle CPU!
> [ 91.618224] rcu_scheduler_active = 1, debug_locks = 0
> [ 91.629081] RCU used illegally from extended quiescent state!
> [ 91.634820] no locks held by swapper/1/0.
> [ 91.638832]
> [ 91.638832] stack backtrace:
> [ 91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> [ 91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03 ' 05/09/2008
> [ 91.656159] c900018fbdf8 813ed43c 88017ede8000
> 0001
> [ 91.663637] c900018fbe28 810fdcd7 880233f95dd0
> c0010055
> [ 91.671107] c900018fbe58
> 814297ac
> [ 91.678560] Call Trace:
> [ 91.681022] [] dump_stack+0x85/0xc9
> [ 91.686164] [] lockdep_rcu_suspicious+0xe7/0x120
> [ 91.692429] [] do_trace_read_msr+0x14c/0x1b0
> [ 91.698349] [] native_read_msr+0x32/0x40
> [ 91.703921] [] amd_e400_idle+0x7e/0x110
> [ 91.709407] [] arch_cpu_idle+0xf/0x20
> [ 91.714720] [] default_idle_call+0x23/0x40
> [ 91.720467] [] cpu_startup_entry+0x1da/0x2b0
> [ 91.726387] [] start_secondary+0x17f/0x1f0
>
>
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

If in idle, the _rcuidle() is needed, so:

Acked-by: Paul E. McKenney

> thanks,
> jirka
>
>
> ---
> diff --git a/arch/x86/lib/msr.c b/arch/x86/lib/msr.c
> index d1dee753b949..ca15becea3b6 100644
> --- a/arch/x86/lib/msr.c
> +++ b/arch/x86/lib/msr.c
> @@ -115,14 +115,14 @@ int msr_clear_bit(u32 msr, u8 bit)
>  #ifdef CONFIG_TRACEPOINTS
>  void do_trace_write_msr(unsigned msr, u64 val, int failed)
>  {
> -	trace_write_msr(msr, val, failed);
> +	trace_write_msr_rcuidle(msr, val, failed);
>  }
>  EXPORT_SYMBOL(do_trace_write_msr);
>  EXPORT_TRACEPOINT_SYMBOL(write_msr);
>  
>  void do_trace_read_msr(unsigned msr, u64 val, int failed)
>  {
> -	trace_read_msr(msr, val, failed);
> +	trace_read_msr_rcuidle(msr, val, failed);
>  }
>  EXPORT_SYMBOL(do_trace_read_msr);
>  EXPORT_TRACEPOINT_SYMBOL(read_msr);
>