Re: WARNING and PANIC in irq_matrix_free
Hi Thomas,

At 06/04/2018 07:17 PM, Thomas Gleixner wrote:
> On Mon, 4 Jun 2018, Dou Liyang wrote:
>> Here, why didn't we avoid this cleanup by
>>
>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>> index a75de0792942..0cc59646755f 100644
>> --- a/arch/x86/kernel/apic/vector.c
>> +++ b/arch/x86/kernel/apic/vector.c
>> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>>  	 */
>>  	WARN_ON_ONCE(managed);
>>
>> +	if (!vector)
>> +		return;
>> +
>>  	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>>  	irq_matrix_free(vector_matrix, cpu, vector, managed);
>>  	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
>>
>> Is there something I didn't consider? ;-)
>
> Well, that just prevents the warning, but the hlist is already corrupted.
> So you'd just cure the symptom ...

I see.

> I'm about to send a patch series which addresses that. Just need to finish
> writing changelogs.

Thank you for telling me that.

Thanks,
	dou
Re: WARNING and PANIC in irq_matrix_free
On Mon, 4 Jun 2018, Dou Liyang wrote:
> Here, why didn't we avoid this cleanup by
>
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index a75de0792942..0cc59646755f 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>  	 */
>  	WARN_ON_ONCE(managed);
>
> +	if (!vector)
> +		return;
> +
>  	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>  	irq_matrix_free(vector_matrix, cpu, vector, managed);
>  	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
>
> Is there something I didn't consider? ;-)

Well, that just prevents the warning, but the hlist is already corrupted.
So you'd just cure the symptom ...

I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.

Thanks,

	tglx
Re: WARNING and PANIC in irq_matrix_free
Hi Thomas,

Sorry to ask questions in this series; my mailbox was kicked out of the
mailing list a few days ago, so I didn't receive the e-mail. Please see
below.

At 05/29/2018 04:09 AM, Thomas Gleixner wrote:
> On Mon, 28 May 2018, Song Liu wrote:
> > This doesn't fix the issue with bnxt. Here is a trace with this patch:
> > [...]
>
> Thanks for providing the data!
>
> > 19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0
> > 19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2
> > 19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15
>
> Here IRQ 31 is shutdown and the vector freed.
>
> > 19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
> > 19d... 1610359249us : vector_reserve: irq=31 ret=0
> > 19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x
>
> And set to the magic reservation vector 239 to catch spurious interrupts.
>
> > 20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 reserve=0
> > 20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
> > 20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
> > 20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 prev_cpu=20

This means there is no associated previous vector.

> > 20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
> > 20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 apicdest=0x0014
>
> So here it gets initialized again and targets CPU9 now.
>
> > 20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
> > 20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 prev_cpu=9
> > 20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
> > 20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 apicdest=0x002c
>
> Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
> CPU9 for cleanup.
>
> > 20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, total_alloc=17
> > 20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 prev_cpu=20
> > 20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
> > 20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 apicdest=0x0004
>
> So here the shit hits the fan because that update schedules vector 33 on
> CPU20 for cleanup while the previous cleanup for CPU9 has not been done
> yet. Cute. Or not so cute.
>
> > 20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
> > 20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
>
> And frees the CPU 20 vector.
>
> > 9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0

Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data *apicd)
 	 */
 	WARN_ON_ONCE(managed);
 
+	if (!vector)
+		return;
+
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider? ;-)

Thanks,
	dou

> And then CPU9 claims that it's queued for cleanup. Bah.
>
> I'm still working on a fix as the elegant solution refused to work.
>
> Thanks,
>
> 	tglx
Re: WARNING and PANIC in irq_matrix_free
Song,

On Tue, 29 May 2018, Song Liu wrote:
> > On May 29, 2018, at 1:35 AM, Thomas Gleixner wrote:
> >> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> >
> > Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> > are related to vector allocation, i.e.: irq_vectors/vector_*
>
> Here is the ftrace dump:

Thanks for providing the data!

> 19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0
> 19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2
> 19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15

Here IRQ 31 is shutdown and the vector freed.

> 19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
> 19d... 1610359249us : vector_reserve: irq=31 ret=0
> 19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x

And set to the magic reservation vector 239 to catch spurious interrupts.

> 20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 reserve=0
> 20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
> 20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
> 20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 prev_cpu=20
> 20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
> 20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 apicdest=0x0014

So here it gets initialized again and targets CPU9 now.

> 20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
> 20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 prev_cpu=9
> 20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
> 20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 apicdest=0x002c

Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

> 20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, total_alloc=17
> 20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 prev_cpu=20
> 20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
> 20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 apicdest=0x0004

So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. Or not so cute.

> 20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
> 20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16

And frees the CPU 20 vector.

> 9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0

And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

	tglx
Re: WARNING and PANIC in irq_matrix_free
> On May 29, 2018, at 1:35 AM, Thomas Gleixner wrote:
>
> On Mon, 28 May 2018, Song Liu wrote:
>>> On May 28, 2018, at 1:09 PM, Thomas Gleixner wrote:
>>>
>>> On Mon, 28 May 2018, Song Liu wrote:
>>>> This doesn't fix the issue with bnxt. Here is a trace with this patch:
>>>
>>> That's just the backtrace, which is not really helpful. The real question
>>> is what leads to this scenario.
>>>
>>> What I was asking for is to enable the irq_vector and irq_matrix trace
>>> points along with ftrace_dump_on_oops. The latter dumps the trace buffer
>>> when the problem happens, and that should give us a hint what actually
>>> causes it.
>>
>> Attached is the dump I got. It seems pretty noisy, with many messages lost.
>>
>> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
>
> Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> are related to vector allocation, i.e.: irq_vectors/vector_*
>
> Thanks,
>
> 	tglx

Here is the ftrace dump:

[ 1609.472697]<...>-53363  19d... 1610359225us : vector_deactivate: irq=29 is_managed=0 can_reserve=1 reserve=0
[ 1609.472698]<...>-53363  19d... 1610359229us : vector_clear: irq=29 vector=33 cpu=21 prev_vector=0 prev_cpu=0
[ 1609.472699]<...>-53363  19d... 1610359230us : irq_matrix_free: bit=33 cpu=21 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11239, global_rsvd=23, total_alloc=17
[ 1609.472699]<...>-53363  19d... 1610359231us : irq_matrix_reserve: online_maps=56 global_avl=11239, global_rsvd=24, total_alloc=17
[ 1609.472700]<...>-53363  19d... 1610359232us : vector_reserve: irq=29 ret=0
[ 1609.472700]<...>-53363  19d... 1610359233us : vector_config: irq=29 vector=239 cpu=0 apicdest=0x
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_deactivate: irq=30 is_managed=0 can_reserve=1 reserve=0
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_clear: irq=30 vector=33 cpu=11 prev_vector=0 prev_cpu=1
[ 1609.472701]<...>-53363  19d... 1610359242us : irq_matrix_free: bit=33 cpu=11 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=24, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : irq_matrix_reserve: online_maps=56 global_avl=11240, global_rsvd=25, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : vector_reserve: irq=30 ret=0
[ 1609.472703]<...>-53363  19d... 1610359243us : vector_config: irq=30 vector=239 cpu=0 apicdest=0x
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_reserve: irq=31 ret=0
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x
[ 1609.472705]<...>-53363  19d... 1610359255us : vector_deactivate: irq=32 is_managed=0 can_reserve=1 reserve=0
[ 1609.472706]<...>-53363  19d... 1610359255us : vector_clear: irq=32 vector=33 cpu=9 prev_vector=0 prev_cpu=3
[ 1609.472707]<...>-53363  19d... 1610359255us : irq_matrix_free: bit=33 cpu=9 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11242, global_rsvd=26, total_alloc=14
[ 1609.472707]<...>-53363  19d... 1610359256us : irq_matrix_reserve: online_maps=56 global_avl=11242, global_rsvd=27, total_alloc=14
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_reserve: irq=32 ret=0
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_config: irq=32 vector=239 cpu=0 apicdest=0x
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_deactivate: irq=33 is_managed=0 can_reserve=1 reserve=0
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_clear: irq=33 vector=33 cpu=18 prev_vector=0 prev_cpu=4
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_free: bit=33 cpu=18 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11243, global_rsvd=27, total_alloc=13
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_reserve: online_maps=56 global_avl=11243, global_rsvd=28, total_alloc=13
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_reserve: irq=33 ret=0
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_config: irq=33 vector=239 cpu=0 apicdest=0x
[ 1609.472711]<...>-53363  19d... 1610359268us : vector_deactivate: irq=34 is_managed=0 can_reserve=1 reserve=0
[ 1609.472712]<...>-53363  19d... 1610359268us : vector_clear: irq=34 vector=34 cpu=7 prev_vector=0
Re: WARNING and PANIC in irq_matrix_free
On Mon, 28 May 2018, Song Liu wrote:
> > On May 28, 2018, at 1:09 PM, Thomas Gleixner wrote:
> >
> > On Mon, 28 May 2018, Song Liu wrote:
> >> This doesn't fix the issue with bnxt. Here is a trace with this patch:
> >
> > That's just the backtrace, which is not really helpful. The real question
> > is what leads to this scenario.
> >
> > What I was asking for is to enable the irq_vector and irq_matrix trace
> > points along with ftrace_dump_on_oops. The latter dumps the trace buffer
> > when the problem happens, and that should give us a hint what actually
> > causes it.
>
> Attached is the dump I got. It seems pretty noisy, with many messages lost.
>
> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.

Right. Sorry, I forgot to say that we only need the irq_vectors ones which
are related to vector allocation, i.e.: irq_vectors/vector_*

Thanks,

	tglx
Re: WARNING and PANIC in irq_matrix_free
On Mon, 28 May 2018, Song Liu wrote:
> This doesn't fix the issue with bnxt. Here is a trace with this patch:

That's just the backtrace, which is not really helpful. The real question is
what leads to this scenario.

What I was asking for is to enable the irq_vector and irq_matrix trace
points along with ftrace_dump_on_oops. The latter dumps the trace buffer
when the problem happens, and that should give us a hint what actually
causes it.

Thanks,

	tglx
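[Editor's note] For readers reproducing this, enabling the requested trace points plus the dump-on-oops knob looks roughly like the sketch below. It assumes tracefs is mounted at the usual /sys/kernel/debug/tracing and that the kernel was built with the irq_vectors/irq_matrix trace events.

```shell
# Enable the vector-allocation related trace points only (the glob
# matches vector_activate, vector_alloc, vector_clear, vector_config,
# vector_update, vector_free_moved, ...).
for e in /sys/kernel/debug/tracing/events/irq_vectors/vector_*/enable; do
    echo 1 > "$e"
done

# Dump the trace buffer to the console when an oops or WARN fires.
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops
```

The same knob can be set at boot with the `ftrace_dump_on_oops` kernel command-line parameter.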
Re: WARNING and PANIC in irq_matrix_free
> On May 28, 2018, at 3:53 AM, Thomas Gleixner wrote:
>
> On Fri, 25 May 2018, Song Liu wrote:
>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote:
>>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>>> I have your patch merged into my internal branch, it prints the
>>>> following:
>>>>
>>>> [ 4898.226258] Trying to clear prev_vector: 0
>>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>>
>>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>>
>>> Could you please enable the vector and irq matrix trace points and
>>> capture the trace when this happens?
>
> Does the patch below fix it?
>
> Thanks,
>
> 	tglx
>
> 8<---
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index bb6f7a2148d7..54af3d4884b1 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
>  	 * prev_vector for this and the offlined target case.
>  	 */
>  	apicd->prev_vector = 0;
> +	apicd->move_in_progress = false;
>  	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>  		goto setnew;
>  	/*

This doesn't fix the issue with bnxt. Here is a trace with this patch:

[  569.222495] WARNING: CPU: 20 PID: 0 at kernel/irq/matrix.c:373 irq_matrix_free+0x32/0xd0
[  569.238811] BUG: unable to handle kernel NULL pointer dereference at
[  569.238812] IP: bnxt_poll+0x163/0x830
[  569.238812] PGD 0
[  569.238812] P4D 0
[  569.238813] Oops: 0002 [#1] SMP PTI
[  569.238813] Modules linked in: nfsv3 nfs fscache ip6table_raw ip6table_mangle iptable_raw iptable_mangle ip6table_filter xt_NFLOG xt_comment xt_statistic iptable_filter nfnetlink_log tcp_diag inet_diag sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt iTCO_vendor_support lpc_ich efivars mfd_core i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler button acpi_cpufreq sch_fq_codel nfsd nfs_acl lockd auth_rpcgss oid_registry grace sunrpc fuse loop efivarfs autofs4
[  569.238824] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 4.16.0-00391-g3742c6a #813
[  569.238824] Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B12 08/17/2017
[  569.238824] RIP: 0010:bnxt_poll+0x163/0x830
[  569.238825] RSP: 0018:883ffef83b18 EFLAGS: 00010006
[  569.238825] RAX: 2c00 RBX: RCX:
[  569.238825] RDX: RSI: 881fdc53c000 RDI: 881f55a1da80
[  569.238826] RBP: 883ffef83b60 R08: 0011 R09: 4fbf4c8d
[  569.238826] R10: 883ffef83ab0 R11: 883fec6eee02 R12: 881ff6be2780
[  569.238826] R13: R14: 881f55a1da80 R15:
[  569.238827] FS: () GS:883ffef8() knlGS:
[  569.238827] CS: 0010 DS: ES: CR0: 80050033
[  569.238827] CR2: CR3: 0220a003 CR4: 003606e0
[  569.238827] DR0: DR1: DR2:
[  569.238828] DR3: DR6: fffe0ff0 DR7: 0400
[  569.238828] Call Trace:
[  569.238828]  ? __skb_tx_hash+0x94/0xb0
[  569.238829]  netpoll_poll_dev+0xc5/0x1a0
[  569.238829]  netpoll_send_skb_on_dev+0x12c/0x200
[  569.238829]  netpoll_send_udp+0x2d5/0x410
[  569.238829]  write_ext_msg+0x1e7/0x200
[  569.238830]  ? scnprintf+0x3a/0x70
[  569.238830]  console_unlock+0x35c/0x530
[  569.238830]  vprintk_emit+0x225/0x340
[  569.238830]  ? irq_matrix_free+0x32/0xd0
[  569.238831]  vprintk_default+0x1f/0x30
[  569.238831]  vprintk_func+0x35/0x70
[  569.238831]  printk+0x43/0x4b
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  __warn+0x6f/0x150
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  report_bug+0x83/0xe0
[  569.238833]  do_invalid_op+0x2c/0x70
[  569.238833]  invalid_op+0x1b/0x40
[  569.238833] RIP: 0010:irq_matrix_free+0x32/0xd0
[  569.238834] RSP: 0018:883ffef83f58 EFLAGS: 00010002
[  569.238834] RAX: 0014 RBX: 00014140 RCX:
[  569.238835] RDX: RSI: 0014 RDI:
Re: WARNING and PANIC in irq_matrix_free
> On May 28, 2018, at 7:27 AM, Thomas Gleixner wrote: > > On Mon, 28 May 2018, Tariq Toukan wrote: >> On 28/05/2018 1:53 PM, Thomas Gleixner wrote: >>> On Fri, 25 May 2018, Song Liu wrote: On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote: > On Wed, 23 May 2018, Tariq Toukan wrote: >> I have your patch merged into my internal branch, it prints the >> following: >> >> [ 4898.226258] Trying to clear prev_vector: 0 >> [ 4898.226439] Trying to clear prev_vector: 0 >> >> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. > > Could you please enable the vector and irq matrix trace points and > capture > the trace when this happens? >>> >>> Does the patch below fix it? >>> >>> Thanks, >>> >>> tglx >>> >>> 8<--- >>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c >>> index bb6f7a2148d7..54af3d4884b1 100644 >>> --- a/arch/x86/kernel/apic/vector.c >>> +++ b/arch/x86/kernel/apic/vector.c >>> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, >>> unsigned int newvec, >>> * prev_vector for this and the offlined target case. >>> */ >>> apicd->prev_vector = 0; >>> + apicd->move_in_progress = false; >>> if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR) >>> goto setnew; >>> /* >>> >> >> I took it into my internal branch. Will let you know. > > Note, I'd still like to see a trace w/o the patch which shows which order > of events leads to this. Even if the patch cures it, it might just paper > over it and not fixing the root cause. > > Thanks, > > tglx I just checked trace without the patch. It looks the same as the one with the patch. Thanks, Song
Re: WARNING and PANIC in irq_matrix_free
On Mon, 28 May 2018, Tariq Toukan wrote: > On 28/05/2018 1:53 PM, Thomas Gleixner wrote: > > On Fri, 25 May 2018, Song Liu wrote: > > > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner > > > wrote: > > > > On Wed, 23 May 2018, Tariq Toukan wrote: > > > > > I have your patch merged into my internal branch, it prints the > > > > > following: > > > > > > > > > > [ 4898.226258] Trying to clear prev_vector: 0 > > > > > [ 4898.226439] Trying to clear prev_vector: 0 > > > > > > > > > > i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. > > > > > > > > Could you please enable the vector and irq matrix trace points and > > > > capture > > > > the trace when this happens? > > > > Does the patch below fix it? > > > > Thanks, > > > > tglx > > > > 8<--- > > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c > > index bb6f7a2148d7..54af3d4884b1 100644 > > --- a/arch/x86/kernel/apic/vector.c > > +++ b/arch/x86/kernel/apic/vector.c > > @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, > > unsigned int newvec, > > * prev_vector for this and the offlined target case. > > */ > > apicd->prev_vector = 0; > > + apicd->move_in_progress = false; > > if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR) > > goto setnew; > > /* > > > > I took it into my internal branch. Will let you know. Note, I'd still like to see a trace w/o the patch which shows which order of events leads to this. Even if the patch cures it, it might just paper over it and not fixing the root cause. Thanks, tglx
Re: WARNING and PANIC in irq_matrix_free
On Fri, 25 May 2018, Song Liu wrote: > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote: > > On Wed, 23 May 2018, Tariq Toukan wrote: > >> I have your patch merged into my internal branch, it prints the following: > >> > >> [ 4898.226258] Trying to clear prev_vector: 0 > >> [ 4898.226439] Trying to clear prev_vector: 0 > >> > >> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. > > > > Could you please enable the vector and irq matrix trace points and capture > > the trace when this happens? Does the patch below fix it? Thanks, tglx 8<--- diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index bb6f7a2148d7..54af3d4884b1 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec, * prev_vector for this and the offlined target case. */ apicd->prev_vector = 0; + apicd->move_in_progress = false; if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR) goto setnew; /*
Re: WARNING and PANIC in irq_matrix_free
On 28/05/2018 1:53 PM, Thomas Gleixner wrote: On Fri, 25 May 2018, Song Liu wrote: On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote: On Wed, 23 May 2018, Tariq Toukan wrote: I have your patch merged into my internal branch, it prints the following: [ 4898.226258] Trying to clear prev_vector: 0 [ 4898.226439] Trying to clear prev_vector: 0 i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. Could you please enable the vector and irq matrix trace points and capture the trace when this happens? Does the patch below fix it? Thanks, tglx 8<--- diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index bb6f7a2148d7..54af3d4884b1 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec, * prev_vector for this and the offlined target case. */ apicd->prev_vector = 0; + apicd->move_in_progress = false; if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR) goto setnew; /* I took it into my internal branch. Will let you know. Tariq
Re: WARNING and PANIC in irq_matrix_free
Adding Broadcom developers to the thread. Hi Michael, Vasundhara, and Andy, Could you please help look into the case I found? ethtool crashes the system for both net/master and net-next/master. Thanks, Song On Fri, May 25, 2018 at 1:10 PM, Song Liu wrote: > Hi, > > We are seeing something probably related. > > We run ethtool on a system with Broadcom NIC to increase number of > combined queues. > > > [root@ ~]# ethtool -l eth0 > Channel parameters for eth0: > Pre-set maximums: > RX: 9 > TX: 8 > Other: 0 > Combined: 17 > Current hardware settings: > RX: 0 > TX: 0 > Other: 0 > Combined: 8 > > [root@ ~]# ethtool -L eth0 combined 16 > > The last command PANIC the kernel easily (5 out of 5 in my tests). > > I haven't got luck to catch much console output, the only line I got is: > > [ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371 > irq_matrix_free+0x32/0xd0 > > The NIC we have is > > Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller > > Thanks, > Song > > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote: >> On Wed, 23 May 2018, Tariq Toukan wrote: >>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote: >>> > On Fri, 18 May 2018, Dmitry Safonov wrote: >>> > > I'm not entirely sure that it's the same fault, but at least backtrace >>> > > looks resembling. >>> > >>> > Yes, it's similar, but not the same issue. I'll stare are the code ... >>> > >>> > Thanks, >>> > >>> > tglx >>> > >>> >>> We still see the issue in our daily regression runs. >>> I have your patch merged into my internal branch, it prints the following: >>> >>> [ 4898.226258] Trying to clear prev_vector: 0 >>> [ 4898.226439] Trying to clear prev_vector: 0 >>> >>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR. >> >> Could you please enable the vector and irq matrix trace points and capture >> the trace when this happens? >> >> Thanks, >> >> tglx >>
Re: WARNING and PANIC in irq_matrix_free
Hi,

We are seeing something probably related.

We run ethtool on a system with a Broadcom NIC to increase the number of
combined queues.

[root@ ~]# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:		9
TX:		8
Other:		0
Combined:	17
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	8

[root@ ~]# ethtool -L eth0 combined 16

The last command panics the kernel reliably (5 out of 5 in my tests).

I haven't had luck capturing much console output; the only line I got is:

[  504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
irq_matrix_free+0x32/0xd0

The NIC we have is

Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller

Thanks,
Song

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>> > > looks resembling.
>> >
>> > Yes, it's similar, but not the same issue. I'll stare at the code ...
>> >
>> > Thanks,
>> >
>> > tglx
>>
>> We still see the issue in our daily regression runs.
>> I have your patch merged into my internal branch, it prints the following:
>>
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>>
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>
> Could you please enable the vector and irq matrix trace points and capture
> the trace when this happens?
>
> Thanks,
>
> tglx
Re: WARNING and PANIC in irq_matrix_free
On Wed, 23 May 2018, Tariq Toukan wrote:
> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> > On Fri, 18 May 2018, Dmitry Safonov wrote:
> > > I'm not entirely sure that it's the same fault, but at least backtrace
> > > looks resembling.
> >
> > Yes, it's similar, but not the same issue. I'll stare at the code ...
> >
> > Thanks,
> >
> > tglx
>
> We still see the issue in our daily regression runs.
> I have your patch merged into my internal branch, it prints the following:
>
> [ 4898.226258] Trying to clear prev_vector: 0
> [ 4898.226439] Trying to clear prev_vector: 0
>
> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.

Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?

Thanks,

tglx
Re: WARNING and PANIC in irq_matrix_free
On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> On Fri, 18 May 2018, Dmitry Safonov wrote:
> > I'm not entirely sure that it's the same fault, but at least backtrace
> > looks resembling.
>
> Yes, it's similar, but not the same issue. I'll stare at the code ...
>
> Thanks,
>
> tglx

We still see the issue in our daily regression runs.
I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
Re: WARNING and PANIC in irq_matrix_free
On Fri, 18 May 2018, Dmitry Safonov wrote:
> I'm not entirely sure that it's the same fault, but at least backtrace
> looks resembling.

Yes, it's similar, but not the same issue. I'll stare at the code ...

Thanks,

tglx
Re: WARNING and PANIC in irq_matrix_free
2018-05-18 23:41 GMT+01:00 Dmitry Safonov <0x7f454...@gmail.com>:
> What I did - was running x86 selftests and ssh'ing at the same moment.
> Will try it again if it'll fire by any chance.

No, I've tried a couple of times (after a reboot, as the warning is ONCE),
but it looks hard to reproduce by hand without some automation.

Thanks,

Dmitry
Re: WARNING and PANIC in irq_matrix_free
Re: WARNING and PANIC in irq_matrix_free
Hi Thomas, Tariq,

2018-02-20 18:11 GMT+00:00 Thomas Gleixner:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>
>> Hi Thomas,
>>
>> We started seeing new issues in our net-device daily regression tests.
>> They are related to patch [1] introduced in kernel 4.15-rc1.
>>
>> We frequently see a warning in dmesg [2]. Repro is not consistent, we
>> tried to narrow it down to a smaller run but couldn't.
>>
>> In addition, sometimes (less frequent) the warning is followed by a
>> panic [3].
>
> Which might be just a consequence of the issue which triggers the warning.
>
>> I can share all needed details to help analyze this bug.
>> If you suspect specific flows, we can do an educated narrow down.
>
>> [2]
>> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 irq_matrix_free+0x30/0xd0
>> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
>> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
>> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX:
>> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 880237038400
>> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09:
>> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 88023fc4
>> [ 8664.925345] R13: R14: 0001 R15: 002b
>> [ 8664.927872] FS: () GS:88023fd4() knlGS:
>> [ 8664.930455] CS: 0010 DS: ES: CR0: 80050033
>> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 06e0
>> [ 8664.935557] DR0: DR1: DR2:
>> [ 8664.938051] DR3: DR6: fffe0ff0 DR7: 0400
>> [ 8664.940541] Call Trace:
>> [ 8664.942980] <IRQ>
>> [ 8664.945399] free_moved_vector+0x4e/0x100
>> [ 8664.947787] smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8664.950134] irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8664.952480] </IRQ>
>
> So that's:
>
> 	if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
> 		return;
>
> and the disassembly of
>
>> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74
>
> gives:
>
>   1f:	44 39 6f 04	cmp    %r13d,0x4(%rdi)
>   23:	77 06		ja     0x2b
>   25:	44 3b 6f 08	cmp    0x8(%rdi),%r13d
>   29:	72 0b		jb     0x36
>   2b:*	0f ff		(bad)	<-- trapping instruction
>
> So the vector to release should be in r13d, which is 0. /me scratches head.
>
>> [ 8943.038767] BUG: unable to handle kernel paging request at 00037a6b561b
>> [ 8943.040114] IP: free_moved_vector+0x61/0x100
>> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
>> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
>> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 0001
>> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 880237038400
>> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09:
>> [ 8943.061051] R10: R11: R12: 7f0c0001
>> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 00cc620d
>> [ 8943.063726] FS: () GS:88023fd4() knlGS:
>> [ 8943.064993] CS: 0010 DS: ES: CR0: 80050033
>> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 06e0
>> [ 8943.067522] DR0: DR1: DR2:
>> [ 8943.068771] DR3: DR6: fffe0ff0 DR7: 0400
>> [ 8943.070029] Call Trace:
>> [ 8943.071273] <IRQ>
>> [ 8943.072503] smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8943.073794] irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8943.075048] </IRQ>
> ...
>> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48
>
>   17:*	48 c7 00 00 00 00 00	movq $0x0,(%rax)	<-- trapping instruction
>
> I can't tell exactly which part of free_moved_vector() that is and which
> vector it tries to free, but it's more or less the same crap, just with a
> fatal vector number.
>
> Is there CPU hotplugging in play?
>
> I'll come back to you tomorrow with a plan for how to debug that after
> staring into the code some more.

I'm not entirely sure that it's the same fault, but at least the backtrace
looks similar. I've got it in a VM on the latest master:
v4.17-rc5-110-g2c71d338bef2. Let me know if you need the .config, CPU/qemu
details, or anything else. I'm not sure it will reproduce, but it might.
Re: WARNING and PANIC in irq_matrix_free
On 22/02/2018 11:38 PM, Thomas Gleixner wrote:
> On Wed, 21 Feb 2018, Tariq Toukan wrote:
>> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
>>> On Tue, 20 Feb 2018, Thomas Gleixner wrote:
>>>> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>>>>
>>>> Is there CPU hotplugging in play?
>>
>> No.
>
> Ok.
>
>>>> I'll come back to you tomorrow with a plan for how to debug that after
>>>> staring into the code some more.
>>>
>>> Do you have a rough idea what the test case is doing?
>>
>> It appears arbitrarily in different flows, like sending traffic or
>> interface configuration changes.
>
> Hmm. Looks like memory corruption, but I can't pinpoint it. Find below a
> debug patch which should prevent the crash and might give us some insight
> into the type of corruption.
>
> Please enable the irq_matrix and vector allocation trace points:
>
> echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
> echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable
>
> When the problem triggers, the bogus vector is printed and the trace is
> frozen. Please provide dmesg and the tracebuffer output.

OK, I'm temporarily adding this to the regression internal branch.
I'll let you know once we have a repro.

Thanks,
Tariq

> Thanks,
>
> 	tglx
>
> 8<--
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -822,6 +822,12 @@ static void free_moved_vector(struct api
>  	unsigned int cpu = apicd->prev_cpu;
>  	bool managed = apicd->is_managed;
>  
> +	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
> +		tracing_off();
> +		pr_err("Trying to clear prev_vector: %u\n", vector);
> +		goto out;
> +	}
> +
>  	/*
>  	 * This should never happen. Managed interrupts are not
>  	 * migrated except on CPU down, which does not involve the
> @@ -833,6 +839,7 @@ static void free_moved_vector(struct api
>  	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>  	irq_matrix_free(vector_matrix, cpu, vector, managed);
>  	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
> +out:
>  	hlist_del_init(&apicd->clist);
>  	apicd->prev_vector = 0;
>  	apicd->move_in_progress = 0;
Re: WARNING and PANIC in irq_matrix_free
On Wed, 21 Feb 2018, Tariq Toukan wrote:
> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> > On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> > > On Tue, 20 Feb 2018, Tariq Toukan wrote:
> > >
> > > Is there CPU hotplugging in play?
>
> No.

Ok.

> > > I'll come back to you tomorrow with a plan for how to debug that after
> > > staring into the code some more.
> >
> > Do you have a rough idea what the test case is doing?
>
> It appears arbitrarily in different flows, like sending traffic or
> interface configuration changes.

Hmm. Looks like memory corruption, but I can't pinpoint it. Find below a
debug patch which should prevent the crash and might give us some insight
into the type of corruption.

Please enable the irq_matrix and vector allocation trace points:

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers, the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.

Thanks,

	tglx

8<--

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
 	unsigned int cpu = apicd->prev_cpu;
 	bool managed = apicd->is_managed;
 
+	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
+		tracing_off();
+		pr_err("Trying to clear prev_vector: %u\n", vector);
+		goto out;
+	}
+
 	/*
 	 * This should never happen. Managed interrupts are not
 	 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
 	hlist_del_init(&apicd->clist);
 	apicd->prev_vector = 0;
 	apicd->move_in_progress = 0;
Re: WARNING and PANIC in irq_matrix_free
On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Thomas Gleixner wrote:
>> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>>
>> Is there CPU hotplugging in play?

No.

>> I'll come back to you tomorrow with a plan for how to debug that after
>> staring into the code some more.
>
> Do you have a rough idea what the test case is doing?

It appears arbitrarily in different flows, like sending traffic or
interface configuration changes.

Thanks,
Tariq
Re: WARNING and PANIC in irq_matrix_free
On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>
> Is there CPU hotplugging in play?
>
> I'll come back to you tomorrow with a plan for how to debug that after
> staring into the code some more.

Do you have a rough idea what the test case is doing?

Thanks,

	tglx
Re: WARNING and PANIC in irq_matrix_free
On Tue, 20 Feb 2018, Tariq Toukan wrote:
> Hi Thomas,
>
> We started seeing new issues in our net-device daily regression tests.
> They are related to patch [1] introduced in kernel 4.15-rc1.
>
> We frequently see a warning in dmesg [2]. Repro is not consistent, we
> tried to narrow it down to a smaller run but couldn't.
>
> In addition, sometimes (less frequent) the warning is followed by a
> panic [3].

Which might be just a consequence of the issue which triggers the warning.

> I can share all needed details to help analyze this bug.
> If you suspect specific flows, we can do an educated narrow down.

> [2]
> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 irq_matrix_free+0x30/0xd0
> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX:
> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 880237038400
> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09:
> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 88023fc4
> [ 8664.925345] R13: R14: 0001 R15: 002b
> [ 8664.927872] FS: () GS:88023fd4() knlGS:
> [ 8664.930455] CS: 0010 DS: ES: CR0: 80050033
> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 06e0
> [ 8664.935557] DR0: DR1: DR2:
> [ 8664.938051] DR3: DR6: fffe0ff0 DR7: 0400
> [ 8664.940541] Call Trace:
> [ 8664.942980] <IRQ>
> [ 8664.945399] free_moved_vector+0x4e/0x100
> [ 8664.947787] smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8664.950134] irq_move_cleanup_interrupt+0x95/0xa0
> [ 8664.952480] </IRQ>

So that's:

	if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
		return;

and the disassembly of

> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

gives:

  1f:	44 39 6f 04	cmp    %r13d,0x4(%rdi)
  23:	77 06		ja     0x2b
  25:	44 3b 6f 08	cmp    0x8(%rdi),%r13d
  29:	72 0b		jb     0x36
  2b:*	0f ff		(bad)	<-- trapping instruction

So the vector to release should be in r13d, which is 0. /me scratches head.

> [ 8943.038767] BUG: unable to handle kernel paging request at 00037a6b561b
> [ 8943.040114] IP: free_moved_vector+0x61/0x100
> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 0001
> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 880237038400
> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09:
> [ 8943.061051] R10: R11: R12: 7f0c0001
> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 00cc620d
> [ 8943.063726] FS: () GS:88023fd4() knlGS:
> [ 8943.064993] CS: 0010 DS: ES: CR0: 80050033
> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 06e0
> [ 8943.067522] DR0: DR1: DR2:
> [ 8943.068771] DR3: DR6: fffe0ff0 DR7: 0400
> [ 8943.070029] Call Trace:
> [ 8943.071273] <IRQ>
> [ 8943.072503] smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8943.073794] irq_move_cleanup_interrupt+0x95/0xa0
> [ 8943.075048] </IRQ>
...
> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48

  17:*	48 c7 00 00 00 00 00	movq $0x0,(%rax)	<-- trapping instruction

I can't tell exactly which part of free_moved_vector() that is and which
vector it tries to free, but it's more or less the same crap, just with a
fatal vector number.

Is there CPU hotplugging in play?

I'll come back to you tomorrow with a plan for how to debug that after
staring into the code some more.

Thanks,

	tglx