Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Dou Liyang

Hi Thomas,

At 06/04/2018 07:17 PM, Thomas Gleixner wrote:

On Mon, 4 Jun 2018, Dou Liyang wrote:

Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
*apicd)
  */
 WARN_ON_ONCE(managed);

+   if (!vector)
+   return;
+
 trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 irq_matrix_free(vector_matrix, cpu, vector, managed);
 per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider with? ;-)


Well, that just prevents the warning, but the hlist is already
corrupted. So you'd just cure the symptom ...



I see.


I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.



Thank you for telling me that.

Thanks,
dou




Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Dou Liyang

Hi Thomas,

At 06/04/2018 07:17 PM, Thomas Gleixner wrote:

On Mon, 4 Jun 2018, Dou Liyang wrote:

Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
*apicd)
  */
 WARN_ON_ONCE(managed);

+   if (!vector)
+   return;
+
 trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 irq_matrix_free(vector_matrix, cpu, vector, managed);
 per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider with? ;-)


Well, that just prevents the warning, but the hlist is already
corrupted. So you'd just cure the symptom ...



I see.


I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.



Thank you for telling me that.

Thanks,
dou




Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Thomas Gleixner
On Mon, 4 Jun 2018, Dou Liyang wrote:
> Here, why didn't we avoid this cleanup by
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index a75de0792942..0cc59646755f 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
> *apicd)
>  */
> WARN_ON_ONCE(managed);
> 
> +   if (!vector)
> +   return;
> +
> trace_vector_free_moved(apicd->irq, cpu, vector, managed);
> irq_matrix_free(vector_matrix, cpu, vector, managed);
> per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
> 
> Is there something I didn't consider with? ;-)

Well, that just prevents the warning, but the hlist is already
corrupted. So you'd just cure the symptom ...

I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.

Thanks,

tglx




Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Thomas Gleixner
On Mon, 4 Jun 2018, Dou Liyang wrote:
> Here, why didn't we avoid this cleanup by
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index a75de0792942..0cc59646755f 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
> *apicd)
>  */
> WARN_ON_ONCE(managed);
> 
> +   if (!vector)
> +   return;
> +
> trace_vector_free_moved(apicd->irq, cpu, vector, managed);
> irq_matrix_free(vector_matrix, cpu, vector, managed);
> per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
> 
> Is there something I didn't consider with? ;-)

Well, that just prevents the warning, but the hlist is already
corrupted. So you'd just cure the symptom ...

I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.

Thanks,

tglx




Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Dou Liyang

Hi Thomas,

Sorry to ask the questions at this series, my mailbox was kicked out of
the mailing list a few days ago, and didn't receive the e-mail.

please see below

At 05/29/2018 04:09 AM, Thomas Gleixner wrote:

On Mon, 28 May 2018, Song Liu wrote:

This doesn't fix the issue with bnxt. Here is a trace with this patch:


[...]


Thanks for providing the data!

 >  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 
can_reserve=1 reserve=0
 >  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 
prev_vector=0 prev_cpu=2
 >  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 
avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, 
global_rsvd=25, total_alloc=15


Here IRQ 31 is shutdown and the vector freed.

 >  19d... 1610359249us : irq_matrix_reserve: online_maps=56 
global_avl=11241, global_rsvd=26, total_alloc=15

 >  19d... 1610359249us : vector_reserve: irq=31 ret=0
 >  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 
apicdest=0x


And set to the magic reservation vector 239 to catch spurious interrupts.

 >  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 
can_reserve=1 reserve=0
 >  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 
reserved=1 ret=0
 >  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 
avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16
 >  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 
prev_vector=0 prev_cpu=20

  
   this means there is no associated previous vector.


 >  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
 >  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 
apicdest=0x0014


So here it gets initialized again and targets CPU9 now.

 >  20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 
avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16
 >  20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 
prev_vector=33 prev_cpu=9

 >  20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
 >  20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 
apicdest=0x002c


Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

 >  20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 
avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, 
global_rsvd=28, total_alloc=17
 >  20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 
prev_vector=33 prev_cpu=20

 >  20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
 >  20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 
apicdest=0x0004


So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. or not so cute

 >  20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 
is_managed=0
 >  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 
avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16


And frees the CPU 20 vector

 >   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 
is_managed=0




Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data 
*apicd)

 */
WARN_ON_ONCE(managed);

+   if (!vector)
+   return;
+
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider with? ;-)

Thanks,
dou.


And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

 tglx





Re: WARNING and PANIC in irq_matrix_free

2018-06-04 Thread Dou Liyang

Hi Thomas,

Sorry to ask the questions at this series, my mailbox was kicked out of
the mailing list a few days ago, and didn't receive the e-mail.

please see below

At 05/29/2018 04:09 AM, Thomas Gleixner wrote:

On Mon, 28 May 2018, Song Liu wrote:

This doesn't fix the issue with bnxt. Here is a trace with this patch:


[...]


Thanks for providing the data!

 >  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 
can_reserve=1 reserve=0
 >  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 
prev_vector=0 prev_cpu=2
 >  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 
avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, 
global_rsvd=25, total_alloc=15


Here IRQ 31 is shutdown and the vector freed.

 >  19d... 1610359249us : irq_matrix_reserve: online_maps=56 
global_avl=11241, global_rsvd=26, total_alloc=15

 >  19d... 1610359249us : vector_reserve: irq=31 ret=0
 >  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 
apicdest=0x


And set to the magic reservation vector 239 to catch spurious interrupts.

 >  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 
can_reserve=1 reserve=0
 >  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 
reserved=1 ret=0
 >  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 
avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16
 >  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 
prev_vector=0 prev_cpu=20

  
   this means there is no associated previous vector.


 >  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
 >  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 
apicdest=0x0014


So here it gets initialized again and targets CPU9 now.

 >  20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 
avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16
 >  20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 
prev_vector=33 prev_cpu=9

 >  20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
 >  20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 
apicdest=0x002c


Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

 >  20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 
avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, 
global_rsvd=28, total_alloc=17
 >  20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 
prev_vector=33 prev_cpu=20

 >  20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
 >  20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 
apicdest=0x0004


So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. or not so cute

 >  20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 
is_managed=0
 >  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 
avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=28, total_alloc=16


And frees the CPU 20 vector

 >   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 
is_managed=0




Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data 
*apicd)

 */
WARN_ON_ONCE(managed);

+   if (!vector)
+   return;
+
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider with? ;-)

Thanks,
dou.


And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

 tglx





Re: WARNING and PANIC in irq_matrix_free

2018-05-30 Thread Thomas Gleixner
Song,

On Tue, 29 May 2018, Song Liu wrote:
> > On May 29, 2018, at 1:35 AM, Thomas Gleixner  wrote:
> >> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> > 
> > Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> > are related to vector allocation, i.e.: irq_vectors/vector_*
>
> Here is the ftrace dump:

Thanks for providing the data!

>  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 
> reserve=0
>  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 
> prev_cpu=2
>  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 
> alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, 
> total_alloc=15

Here IRQ 31 is shutdown and the vector freed.

>  19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, 
> global_rsvd=26, total_alloc=15
>  19d... 1610359249us : vector_reserve: irq=31 ret=0
>  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 
> apicdest=0x

And set to the magic reservation vector 239 to catch spurious interrupts.

>  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 
> reserve=0
>  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
>  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 
> alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16
>  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 
> prev_cpu=20
>  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 
> apicdest=0x0014

So here it gets initialized again and targets CPU9 now.

>  20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 
> alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16
>  20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 
> prev_cpu=9
>  20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 
> apicdest=0x002c

Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

>  20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 
> alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, 
> total_alloc=17
>  20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 
> prev_cpu=20
>  20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
>  20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 
> apicdest=0x0004

So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. or not so cute

>  20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
>  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 
> alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16

And frees the CPU 20 vector

>   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0

And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-30 Thread Thomas Gleixner
Song,

On Tue, 29 May 2018, Song Liu wrote:
> > On May 29, 2018, at 1:35 AM, Thomas Gleixner  wrote:
> >> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> > 
> > Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> > are related to vector allocation, i.e.: irq_vectors/vector_*
>
> Here is the ftrace dump:

Thanks for providing the data!

>  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 
> reserve=0
>  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 
> prev_cpu=2
>  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 
> alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, 
> total_alloc=15

Here IRQ 31 is shutdown and the vector freed.

>  19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, 
> global_rsvd=26, total_alloc=15
>  19d... 1610359249us : vector_reserve: irq=31 ret=0
>  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 
> apicdest=0x

And set to the magic reservation vector 239 to catch spurious interrupts.

>  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 
> reserve=0
>  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
>  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 
> alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16
>  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 
> prev_cpu=20
>  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 
> apicdest=0x0014

So here it gets initialized again and targets CPU9 now.

>  20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 
> alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16
>  20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 
> prev_cpu=9
>  20dN.. 161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 
> apicdest=0x002c

Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

>  20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 
> alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, 
> total_alloc=17
>  20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 
> prev_cpu=20
>  20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
>  20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 
> apicdest=0x0004

So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. or not so cute

>  20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
>  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 
> alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, 
> total_alloc=16

And frees the CPU 20 vector

>   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0

And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-29 Thread Song Liu



> On May 29, 2018, at 1:35 AM, Thomas Gleixner  wrote:
> 
> On Mon, 28 May 2018, Song Liu wrote:
>>> On May 28, 2018, at 1:09 PM, Thomas Gleixner  wrote:
>>> 
>>> On Mon, 28 May 2018, Song Liu wrote:
 This doesn't fix the issue with bnxt. Here is a trace with this patch:
>>> 
>>> That's just the backtrace which is not really helpful. The real question is
>>> what leads to this scenaria.
>>> 
>>> What I was asking for is to enable the irq_vector and irq_matrix trace
>>> points along with ftrace_dump_on_oops. The latter dumps the trace buffer
>>> when the problem happens and that should give us a hint what actually
>>> causes that.
>> 
>> Attached is the dump I got. It seems pretty noisy, with many message lost.
>> 
>> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> 
> Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> are related to vector allocation, i.e.: irq_vectors/vector_* 
> 
> Thanks,
> 
>   tglx

Here is the ftrace dump:

[ 1609.472697]<...>-53363  19d... 1610359225us : vector_deactivate: irq=29 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472698]<...>-53363  19d... 1610359229us : vector_clear: irq=29 
vector=33 cpu=21 prev_vector=0 prev_cpu=0
[ 1609.472699]<...>-53363  19d... 1610359230us : irq_matrix_free: bit=33 
cpu=21 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11239, 
global_rsvd=23, total_alloc=17
[ 1609.472699]<...>-53363  19d... 1610359231us : irq_matrix_reserve: 
online_maps=56 global_avl=11239, global_rsvd=24, total_alloc=17
[ 1609.472700]<...>-53363  19d... 1610359232us : vector_reserve: irq=29 
ret=0
[ 1609.472700]<...>-53363  19d... 1610359233us : vector_config: irq=29 
vector=239 cpu=0 apicdest=0x
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_deactivate: irq=30 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_clear: irq=30 
vector=33 cpu=11 prev_vector=0 prev_cpu=1
[ 1609.472701]<...>-53363  19d... 1610359242us : irq_matrix_free: bit=33 
cpu=11 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=24, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : irq_matrix_reserve: 
online_maps=56 global_avl=11240, global_rsvd=25, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : vector_reserve: irq=30 
ret=0
[ 1609.472703]<...>-53363  19d... 1610359243us : vector_config: irq=30 
vector=239 cpu=0 apicdest=0x
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_deactivate: irq=31 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_clear: irq=31 
vector=33 cpu=20 prev_vector=0 prev_cpu=2
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_free: bit=33 
cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, 
global_rsvd=25, total_alloc=15
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_reserve: 
online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_reserve: irq=31 
ret=0
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_config: irq=31 
vector=239 cpu=0 apicdest=0x
[ 1609.472705]<...>-53363  19d... 1610359255us : vector_deactivate: irq=32 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472706]<...>-53363  19d... 1610359255us : vector_clear: irq=32 
vector=33 cpu=9 prev_vector=0 prev_cpu=3
[ 1609.472707]<...>-53363  19d... 1610359255us : irq_matrix_free: bit=33 
cpu=9 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11242, 
global_rsvd=26, total_alloc=14
[ 1609.472707]<...>-53363  19d... 1610359256us : irq_matrix_reserve: 
online_maps=56 global_avl=11242, global_rsvd=27, total_alloc=14
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_reserve: irq=32 
ret=0
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_config: irq=32 
vector=239 cpu=0 apicdest=0x
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_deactivate: irq=33 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_clear: irq=33 
vector=33 cpu=18 prev_vector=0 prev_cpu=4
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_free: bit=33 
cpu=18 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11243, 
global_rsvd=27, total_alloc=13
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_reserve: 
online_maps=56 global_avl=11243, global_rsvd=28, total_alloc=13
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_reserve: irq=33 
ret=0
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_config: irq=33 
vector=239 cpu=0 apicdest=0x
[ 1609.472711]<...>-53363  19d... 1610359268us : vector_deactivate: irq=34 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472712]<...>-53363  19d... 1610359268us : vector_clear: irq=34 
vector=34 cpu=7 prev_vector=0 

Re: WARNING and PANIC in irq_matrix_free

2018-05-29 Thread Song Liu



> On May 29, 2018, at 1:35 AM, Thomas Gleixner  wrote:
> 
> On Mon, 28 May 2018, Song Liu wrote:
>>> On May 28, 2018, at 1:09 PM, Thomas Gleixner  wrote:
>>> 
>>> On Mon, 28 May 2018, Song Liu wrote:
 This doesn't fix the issue with bnxt. Here is a trace with this patch:
>>> 
>>> That's just the backtrace which is not really helpful. The real question is
>>> what leads to this scenaria.
>>> 
>>> What I was asking for is to enable the irq_vector and irq_matrix trace
>>> points along with ftrace_dump_on_oops. The latter dumps the trace buffer
>>> when the problem happens and that should give us a hint what actually
>>> causes that.
>> 
>> Attached is the dump I got. It seems pretty noisy, with many message lost.
>> 
>> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> 
> Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> are related to vector allocation, i.e.: irq_vectors/vector_* 
> 
> Thanks,
> 
>   tglx

Here is the ftrace dump:

[ 1609.472697]<...>-53363  19d... 1610359225us : vector_deactivate: irq=29 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472698]<...>-53363  19d... 1610359229us : vector_clear: irq=29 
vector=33 cpu=21 prev_vector=0 prev_cpu=0
[ 1609.472699]<...>-53363  19d... 1610359230us : irq_matrix_free: bit=33 
cpu=21 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11239, 
global_rsvd=23, total_alloc=17
[ 1609.472699]<...>-53363  19d... 1610359231us : irq_matrix_reserve: 
online_maps=56 global_avl=11239, global_rsvd=24, total_alloc=17
[ 1609.472700]<...>-53363  19d... 1610359232us : vector_reserve: irq=29 
ret=0
[ 1609.472700]<...>-53363  19d... 1610359233us : vector_config: irq=29 
vector=239 cpu=0 apicdest=0x
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_deactivate: irq=30 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472701]<...>-53363  19d... 1610359242us : vector_clear: irq=30 
vector=33 cpu=11 prev_vector=0 prev_cpu=1
[ 1609.472701]<...>-53363  19d... 1610359242us : irq_matrix_free: bit=33 
cpu=11 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, 
global_rsvd=24, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : irq_matrix_reserve: 
online_maps=56 global_avl=11240, global_rsvd=25, total_alloc=16
[ 1609.472702]<...>-53363  19d... 1610359243us : vector_reserve: irq=30 
ret=0
[ 1609.472703]<...>-53363  19d... 1610359243us : vector_config: irq=30 
vector=239 cpu=0 apicdest=0x
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_deactivate: irq=31 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472703]<...>-53363  19d... 1610359248us : vector_clear: irq=31 
vector=33 cpu=20 prev_vector=0 prev_cpu=2
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_free: bit=33 
cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, 
global_rsvd=25, total_alloc=15
[ 1609.472704]<...>-53363  19d... 1610359249us : irq_matrix_reserve: 
online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_reserve: irq=31 
ret=0
[ 1609.472705]<...>-53363  19d... 1610359249us : vector_config: irq=31 
vector=239 cpu=0 apicdest=0x
[ 1609.472705]<...>-53363  19d... 1610359255us : vector_deactivate: irq=32 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472706]<...>-53363  19d... 1610359255us : vector_clear: irq=32 
vector=33 cpu=9 prev_vector=0 prev_cpu=3
[ 1609.472707]<...>-53363  19d... 1610359255us : irq_matrix_free: bit=33 
cpu=9 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11242, 
global_rsvd=26, total_alloc=14
[ 1609.472707]<...>-53363  19d... 1610359256us : irq_matrix_reserve: 
online_maps=56 global_avl=11242, global_rsvd=27, total_alloc=14
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_reserve: irq=32 
ret=0
[ 1609.472708]<...>-53363  19d... 1610359256us : vector_config: irq=32 
vector=239 cpu=0 apicdest=0x
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_deactivate: irq=33 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472709]<...>-53363  19d... 1610359261us : vector_clear: irq=33 
vector=33 cpu=18 prev_vector=0 prev_cpu=4
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_free: bit=33 
cpu=18 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11243, 
global_rsvd=27, total_alloc=13
[ 1609.472710]<...>-53363  19d... 1610359262us : irq_matrix_reserve: 
online_maps=56 global_avl=11243, global_rsvd=28, total_alloc=13
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_reserve: irq=33 
ret=0
[ 1609.472711]<...>-53363  19d... 1610359262us : vector_config: irq=33 
vector=239 cpu=0 apicdest=0x
[ 1609.472711]<...>-53363  19d... 1610359268us : vector_deactivate: irq=34 
is_managed=0 can_reserve=1 reserve=0
[ 1609.472712]<...>-53363  19d... 1610359268us : vector_clear: irq=34 
vector=34 cpu=7 prev_vector=0 

Re: WARNING and PANIC in irq_matrix_free

2018-05-29 Thread Thomas Gleixner
On Mon, 28 May 2018, Song Liu wrote:
> > On May 28, 2018, at 1:09 PM, Thomas Gleixner  wrote:
> >
> > On Mon, 28 May 2018, Song Liu wrote:
> >> This doesn't fix the issue with bnxt. Here is a trace with this patch:
> >
> > That's just the backtrace which is not really helpful. The real question is
> > what leads to this scenaria.
> >
> > What I was asking for is to enable the irq_vector and irq_matrix trace
> > points along with ftrace_dump_on_oops. The latter dumps the trace buffer
> > when the problem happens and that should give us a hint what actually
> > causes that.
> 
> Attached is the dump I got. It seems pretty noisy, with many message lost.
> 
> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.

Right. Sorry, I forgot to say that we only need the irq_vectors ones which
are related to vector allocation, i.e.: irq_vectors/vector_* 

Thanks,

tglx





Re: WARNING and PANIC in irq_matrix_free

2018-05-29 Thread Thomas Gleixner
On Mon, 28 May 2018, Song Liu wrote:
> > On May 28, 2018, at 1:09 PM, Thomas Gleixner  wrote:
> >
> > On Mon, 28 May 2018, Song Liu wrote:
> >> This doesn't fix the issue with bnxt. Here is a trace with this patch:
> >
> > That's just the backtrace which is not really helpful. The real question is
> > what leads to this scenaria.
> >
> > What I was asking for is to enable the irq_vector and irq_matrix trace
> > points along with ftrace_dump_on_oops. The latter dumps the trace buffer
> > when the problem happens and that should give us a hint what actually
> > causes that.
> 
> Attached is the dump I got. It seems pretty noisy, with many message lost.
> 
> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.

Right. Sorry, I forgot to say that we only need the irq_vectors ones which
are related to vector allocation, i.e.: irq_vectors/vector_* 

Thanks,

tglx





Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Song Liu wrote:
> This doesn't fix the issue with bnxt. Here is a trace with this patch:

That's just the backtrace which is not really helpful. The real question is
what leads to this scenaria.

What I was asking for is to enable the irq_vector and irq_matrix trace
points along with ftrace_dump_on_oops. The latter dumps the trace buffer
when the problem happens and that should give us a hint what actually
causes that.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Song Liu wrote:
> This doesn't fix the issue with bnxt. Here is a trace with this patch:

That's just the backtrace which is not really helpful. The real question is
what leads to this scenaria.

What I was asking for is to enable the irq_vector and irq_matrix trace
points along with ftrace_dump_on_oops. The latter dumps the trace buffer
when the problem happens and that should give us a hint what actually
causes that.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Song Liu



> On May 28, 2018, at 3:53 AM, Thomas Gleixner  wrote:
> 
> On Fri, 25 May 2018, Song Liu wrote:
>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
>>> On Wed, 23 May 2018, Tariq Toukan wrote:
 I have your patch merged into my internal branch, it prints the following:
 
 [ 4898.226258] Trying to clear prev_vector: 0
 [ 4898.226439] Trying to clear prev_vector: 0
 
 i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>> 
>>> Could you please enable the vector and irq matrix trace points and capture
>>> the trace when this happens?
> 
> Does the patch below fix it?
> 
> Thanks,
> 
>   tglx
> 
> 8<---
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index bb6f7a2148d7..54af3d4884b1 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
> unsigned int newvec,
>* prev_vector for this and the offlined target case.
>*/
>   apicd->prev_vector = 0;
> + apicd->move_in_progress = false;
>   if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>   goto setnew;
>   /*

This doesn't fix the issue with bnxt. Here is a trace with this patch:

[  569.222495] WARNING: CPU: 20 PID: 0 at kernel/irq/matrix.c:373 
irq_matrix_free+0x32/0xd0
[  569.238811] BUG: unable to handle kernel
[  569.238811] NULL pointer dereference
[  569.238812]  at 
[  569.238812] IP: bnxt_poll+0x163/0x830
[  569.238812] PGD 0
[  569.238812] P4D 0
[  569.238813] Oops: 0002 [#1] SMP PTI
[  569.238813] Modules linked in:
[  569.238813]  nfsv3
[  569.238814]  nfs
[  569.238814]  fscache
[  569.238814]  ip6table_raw
[  569.238814]  ip6table_mangle
[  569.238815]  iptable_raw
[  569.238815]  iptable_mangle
[  569.238815]  ip6table_filter
[  569.238816]  xt_NFLOG
[  569.238816]  xt_comment
[  569.238816]  xt_statistic
[  569.238816]  iptable_filter
[  569.238817]  nfnetlink_log
[  569.238817]  tcp_diag
[  569.238817]  inet_diag
[  569.238817]  sb_edac
[  569.238818]  x86_pkg_temp_thermal
[  569.238818]  intel_powerclamp
[  569.238818]  coretemp
[  569.238818]  kvm_intel
[  569.238818]  kvm
[  569.238819]  irqbypass
[  569.238819]  iTCO_wdt
[  569.238819]  iTCO_vendor_support
[  569.238819]  lpc_ich
[  569.238819]  efivars
[  569.238820]  mfd_core
[  569.238820]  i2c_i801
[  569.238820]  ipmi_si
[  569.238820]  ipmi_devintf
[  569.238820]  ipmi_msghandler
[  569.238821]  button
[  569.238821]  acpi_cpufreq
[  569.238821]  sch_fq_codel
[  569.238821]  nfsd
[  569.238821]  nfs_acl
[  569.238822]  lockd
[  569.238822]  auth_rpcgss
[  569.238822]  oid_registry
[  569.238822]  grace
[  569.238822]  sunrpc
[  569.238823]  fuse
[  569.238823]  loop
[  569.238823]  efivarfs
[  569.238823]  autofs4
[  569.238824] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 
4.16.0-00391-g3742c6a #813
[  569.238824] Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS 
F06_3B12 08/17/2017
[  569.238824] RIP: 0010:bnxt_poll+0x163/0x830
[  569.238825] RSP: 0018:883ffef83b18 EFLAGS: 00010006
[  569.238825] RAX: 2c00 RBX:  RCX: 
[  569.238825] RDX:  RSI: 881fdc53c000 RDI: 881f55a1da80
[  569.238826] RBP: 883ffef83b60 R08: 0011 R09: 4fbf4c8d
[  569.238826] R10: 883ffef83ab0 R11: 883fec6eee02 R12: 881ff6be2780
[  569.238826] R13:  R14: 881f55a1da80 R15: 
[  569.238827] FS:  () GS:883ffef8() 
knlGS:
[  569.238827] CS:  0010 DS:  ES:  CR0: 80050033
[  569.238827] CR2:  CR3: 0220a003 CR4: 003606e0
[  569.238827] DR0:  DR1:  DR2: 
[  569.238828] DR3:  DR6: fffe0ff0 DR7: 0400
[  569.238828] Call Trace:
[  569.238828]  
[  569.238828]  ? __skb_tx_hash+0x94/0xb0
[  569.238829]  netpoll_poll_dev+0xc5/0x1a0
[  569.238829]  netpoll_send_skb_on_dev+0x12c/0x200
[  569.238829]  netpoll_send_udp+0x2d5/0x410
[  569.238829]  write_ext_msg+0x1e7/0x200
[  569.238830]  ? scnprintf+0x3a/0x70
[  569.238830]  console_unlock+0x35c/0x530
[  569.238830]  vprintk_emit+0x225/0x340
[  569.238830]  ? irq_matrix_free+0x32/0xd0
[  569.238831]  vprintk_default+0x1f/0x30
[  569.238831]  vprintk_func+0x35/0x70
[  569.238831]  printk+0x43/0x4b
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  __warn+0x6f/0x150
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  report_bug+0x83/0xe0
[  569.238833]  do_invalid_op+0x2c/0x70
[  569.238833]  invalid_op+0x1b/0x40
[  569.238833] RIP: 0010:irq_matrix_free+0x32/0xd0
[  569.238834] RSP: 0018:883ffef83f58 EFLAGS: 00010002
[  569.238834] RAX: 0014 RBX: 00014140 RCX: 
[  569.238835] RDX:  RSI: 0014 RDI: 

Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Song Liu



> On May 28, 2018, at 3:53 AM, Thomas Gleixner  wrote:
> 
> On Fri, 25 May 2018, Song Liu wrote:
>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
>>> On Wed, 23 May 2018, Tariq Toukan wrote:
 I have your patch merged into my internal branch, it prints the following:
 
 [ 4898.226258] Trying to clear prev_vector: 0
 [ 4898.226439] Trying to clear prev_vector: 0
 
 i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>> 
>>> Could you please enable the vector and irq matrix trace points and capture
>>> the trace when this happens?
> 
> Does the patch below fix it?
> 
> Thanks,
> 
>   tglx
> 
> 8<---
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index bb6f7a2148d7..54af3d4884b1 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
> unsigned int newvec,
>* prev_vector for this and the offlined target case.
>*/
>   apicd->prev_vector = 0;
> + apicd->move_in_progress = false;
>   if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>   goto setnew;
>   /*

This doesn't fix the issue with bnxt. Here is a trace with this patch:

[  569.222495] WARNING: CPU: 20 PID: 0 at kernel/irq/matrix.c:373 
irq_matrix_free+0x32/0xd0
[  569.238811] BUG: unable to handle kernel
[  569.238811] NULL pointer dereference
[  569.238812]  at 
[  569.238812] IP: bnxt_poll+0x163/0x830
[  569.238812] PGD 0
[  569.238812] P4D 0
[  569.238813] Oops: 0002 [#1] SMP PTI
[  569.238813] Modules linked in:
[  569.238813]  nfsv3
[  569.238814]  nfs
[  569.238814]  fscache
[  569.238814]  ip6table_raw
[  569.238814]  ip6table_mangle
[  569.238815]  iptable_raw
[  569.238815]  iptable_mangle
[  569.238815]  ip6table_filter
[  569.238816]  xt_NFLOG
[  569.238816]  xt_comment
[  569.238816]  xt_statistic
[  569.238816]  iptable_filter
[  569.238817]  nfnetlink_log
[  569.238817]  tcp_diag
[  569.238817]  inet_diag
[  569.238817]  sb_edac
[  569.238818]  x86_pkg_temp_thermal
[  569.238818]  intel_powerclamp
[  569.238818]  coretemp
[  569.238818]  kvm_intel
[  569.238818]  kvm
[  569.238819]  irqbypass
[  569.238819]  iTCO_wdt
[  569.238819]  iTCO_vendor_support
[  569.238819]  lpc_ich
[  569.238819]  efivars
[  569.238820]  mfd_core
[  569.238820]  i2c_i801
[  569.238820]  ipmi_si
[  569.238820]  ipmi_devintf
[  569.238820]  ipmi_msghandler
[  569.238821]  button
[  569.238821]  acpi_cpufreq
[  569.238821]  sch_fq_codel
[  569.238821]  nfsd
[  569.238821]  nfs_acl
[  569.238822]  lockd
[  569.238822]  auth_rpcgss
[  569.238822]  oid_registry
[  569.238822]  grace
[  569.238822]  sunrpc
[  569.238823]  fuse
[  569.238823]  loop
[  569.238823]  efivarfs
[  569.238823]  autofs4
[  569.238824] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 
4.16.0-00391-g3742c6a #813
[  569.238824] Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS 
F06_3B12 08/17/2017
[  569.238824] RIP: 0010:bnxt_poll+0x163/0x830
[  569.238825] RSP: 0018:883ffef83b18 EFLAGS: 00010006
[  569.238825] RAX: 2c00 RBX:  RCX: 
[  569.238825] RDX:  RSI: 881fdc53c000 RDI: 881f55a1da80
[  569.238826] RBP: 883ffef83b60 R08: 0011 R09: 4fbf4c8d
[  569.238826] R10: 883ffef83ab0 R11: 883fec6eee02 R12: 881ff6be2780
[  569.238826] R13:  R14: 881f55a1da80 R15: 
[  569.238827] FS:  () GS:883ffef8() 
knlGS:
[  569.238827] CS:  0010 DS:  ES:  CR0: 80050033
[  569.238827] CR2:  CR3: 0220a003 CR4: 003606e0
[  569.238827] DR0:  DR1:  DR2: 
[  569.238828] DR3:  DR6: fffe0ff0 DR7: 0400
[  569.238828] Call Trace:
[  569.238828]  
[  569.238828]  ? __skb_tx_hash+0x94/0xb0
[  569.238829]  netpoll_poll_dev+0xc5/0x1a0
[  569.238829]  netpoll_send_skb_on_dev+0x12c/0x200
[  569.238829]  netpoll_send_udp+0x2d5/0x410
[  569.238829]  write_ext_msg+0x1e7/0x200
[  569.238830]  ? scnprintf+0x3a/0x70
[  569.238830]  console_unlock+0x35c/0x530
[  569.238830]  vprintk_emit+0x225/0x340
[  569.238830]  ? irq_matrix_free+0x32/0xd0
[  569.238831]  vprintk_default+0x1f/0x30
[  569.238831]  vprintk_func+0x35/0x70
[  569.238831]  printk+0x43/0x4b
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  __warn+0x6f/0x150
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  report_bug+0x83/0xe0
[  569.238833]  do_invalid_op+0x2c/0x70
[  569.238833]  invalid_op+0x1b/0x40
[  569.238833] RIP: 0010:irq_matrix_free+0x32/0xd0
[  569.238834] RSP: 0018:883ffef83f58 EFLAGS: 00010002
[  569.238834] RAX: 0014 RBX: 00014140 RCX: 
[  569.238835] RDX:  RSI: 0014 RDI: 

Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Song Liu



> On May 28, 2018, at 7:27 AM, Thomas Gleixner  wrote:
> 
> On Mon, 28 May 2018, Tariq Toukan wrote:
>> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
>>> On Fri, 25 May 2018, Song Liu wrote:
 On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner 
 wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> I have your patch merged into my internal branch, it prints the
>> following:
>> 
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>> 
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> 
> Could you please enable the vector and irq matrix trace points and
> capture
> the trace when this happens?
>>> 
>>> Does the patch below fix it?
>>> 
>>> Thanks,
>>> 
>>> tglx
>>> 
>>> 8<---
>>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>>> index bb6f7a2148d7..54af3d4884b1 100644
>>> --- a/arch/x86/kernel/apic/vector.c
>>> +++ b/arch/x86/kernel/apic/vector.c
>>> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
>>> unsigned int newvec,
>>>  * prev_vector for this and the offlined target case.
>>>  */
>>> apicd->prev_vector = 0;
>>> +   apicd->move_in_progress = false;
>>> if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>>> goto setnew;
>>> /*
>>> 
>> 
>> I took it into my internal branch. Will let you know.
> 
> Note, I'd still like to see a trace w/o the patch which shows which order
> of events leads to this. Even if the patch cures it, it might just paper
> over it and not fixing the root cause.
> 
> Thanks,
> 
>   tglx

I just checked trace without the patch. It looks the same as the one with
the patch. 

Thanks,
Song



Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Song Liu



> On May 28, 2018, at 7:27 AM, Thomas Gleixner  wrote:
> 
> On Mon, 28 May 2018, Tariq Toukan wrote:
>> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
>>> On Fri, 25 May 2018, Song Liu wrote:
 On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner 
 wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> I have your patch merged into my internal branch, it prints the
>> following:
>> 
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>> 
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> 
> Could you please enable the vector and irq matrix trace points and
> capture
> the trace when this happens?
>>> 
>>> Does the patch below fix it?
>>> 
>>> Thanks,
>>> 
>>> tglx
>>> 
>>> 8<---
>>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>>> index bb6f7a2148d7..54af3d4884b1 100644
>>> --- a/arch/x86/kernel/apic/vector.c
>>> +++ b/arch/x86/kernel/apic/vector.c
>>> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
>>> unsigned int newvec,
>>>  * prev_vector for this and the offlined target case.
>>>  */
>>> apicd->prev_vector = 0;
>>> +   apicd->move_in_progress = false;
>>> if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>>> goto setnew;
>>> /*
>>> 
>> 
>> I took it into my internal branch. Will let you know.
> 
> Note, I'd still like to see a trace w/o the patch which shows which order
> of events leads to this. Even if the patch cures it, it might just paper
> over it and not fixing the root cause.
> 
> Thanks,
> 
>   tglx

I just checked trace without the patch. It looks the same as the one with
the patch. 

Thanks,
Song



Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Tariq Toukan wrote:
> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
> > On Fri, 25 May 2018, Song Liu wrote:
> > > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner 
> > > wrote:
> > > > On Wed, 23 May 2018, Tariq Toukan wrote:
> > > > > I have your patch merged into my internal branch, it prints the
> > > > > following:
> > > > > 
> > > > > [ 4898.226258] Trying to clear prev_vector: 0
> > > > > [ 4898.226439] Trying to clear prev_vector: 0
> > > > > 
> > > > > i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> > > > 
> > > > Could you please enable the vector and irq matrix trace points and
> > > > capture
> > > > the trace when this happens?
> > 
> > Does the patch below fix it?
> > 
> > Thanks,
> > 
> > tglx
> > 
> > 8<---
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index bb6f7a2148d7..54af3d4884b1 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
> > unsigned int newvec,
> >  * prev_vector for this and the offlined target case.
> >  */
> > apicd->prev_vector = 0;
> > +   apicd->move_in_progress = false;
> > if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
> > goto setnew;
> > /*
> > 
> 
> I took it into my internal branch. Will let you know.

Note, I'd still like to see a trace w/o the patch which shows which order
of events leads to this. Even if the patch cures it, it might just paper
over it and not fixing the root cause.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Tariq Toukan wrote:
> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
> > On Fri, 25 May 2018, Song Liu wrote:
> > > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner 
> > > wrote:
> > > > On Wed, 23 May 2018, Tariq Toukan wrote:
> > > > > I have your patch merged into my internal branch, it prints the
> > > > > following:
> > > > > 
> > > > > [ 4898.226258] Trying to clear prev_vector: 0
> > > > > [ 4898.226439] Trying to clear prev_vector: 0
> > > > > 
> > > > > i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> > > > 
> > > > Could you please enable the vector and irq matrix trace points and
> > > > capture
> > > > the trace when this happens?
> > 
> > Does the patch below fix it?
> > 
> > Thanks,
> > 
> > tglx
> > 
> > 8<---
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index bb6f7a2148d7..54af3d4884b1 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
> > unsigned int newvec,
> >  * prev_vector for this and the offlined target case.
> >  */
> > apicd->prev_vector = 0;
> > +   apicd->move_in_progress = false;
> > if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
> > goto setnew;
> > /*
> > 
> 
> I took it into my internal branch. Will let you know.

Note, I'd still like to see a trace w/o the patch which shows which order
of events leads to this. Even if the patch cures it, it might just paper
over it and not fixing the root cause.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Fri, 25 May 2018, Song Liu wrote:
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
> > On Wed, 23 May 2018, Tariq Toukan wrote:
> >> I have your patch merged into my internal branch, it prints the following:
> >>
> >> [ 4898.226258] Trying to clear prev_vector: 0
> >> [ 4898.226439] Trying to clear prev_vector: 0
> >>
> >> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> >
> > Could you please enable the vector and irq matrix trace points and capture
> > the trace when this happens?

Does the patch below fix it?

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
unsigned int newvec,
 * prev_vector for this and the offlined target case.
 */
apicd->prev_vector = 0;
+   apicd->move_in_progress = false;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Thomas Gleixner
On Fri, 25 May 2018, Song Liu wrote:
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
> > On Wed, 23 May 2018, Tariq Toukan wrote:
> >> I have your patch merged into my internal branch, it prints the following:
> >>
> >> [ 4898.226258] Trying to clear prev_vector: 0
> >> [ 4898.226439] Trying to clear prev_vector: 0
> >>
> >> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> >
> > Could you please enable the vector and irq matrix trace points and capture
> > the trace when this happens?

Does the patch below fix it?

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
unsigned int newvec,
 * prev_vector for this and the offlined target case.
 */
apicd->prev_vector = 0;
+   apicd->move_in_progress = false;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Tariq Toukan



On 28/05/2018 1:53 PM, Thomas Gleixner wrote:

On Fri, 25 May 2018, Song Liu wrote:

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:

On Wed, 23 May 2018, Tariq Toukan wrote:

I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?


Does the patch below fix it?

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
unsigned int newvec,
 * prev_vector for this and the offlined target case.
 */
apicd->prev_vector = 0;
+   apicd->move_in_progress = false;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*



I took it into my internal branch. Will let you know.

Tariq


Re: WARNING and PANIC in irq_matrix_free

2018-05-28 Thread Tariq Toukan



On 28/05/2018 1:53 PM, Thomas Gleixner wrote:

On Fri, 25 May 2018, Song Liu wrote:

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:

On Wed, 23 May 2018, Tariq Toukan wrote:

I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?


Does the patch below fix it?

Thanks,

tglx

8<---
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, 
unsigned int newvec,
 * prev_vector for this and the offlined target case.
 */
apicd->prev_vector = 0;
+   apicd->move_in_progress = false;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*



I took it into my internal branch. Will let you know.

Tariq


Re: WARNING and PANIC in irq_matrix_free

2018-05-25 Thread Song Liu
Adding Broadcom developers to the thread.

Hi Michael, Vasundhara, and Andy,

Could you please help look into the case I found? ethtool crashes the system
for both net/master and net-next/master.

Thanks,
Song

On Fri, May 25, 2018 at 1:10 PM, Song Liu  wrote:
> Hi,
>
> We are seeing something probably related.
>
> We run ethtool on a system with Broadcom NIC to increase number of
> combined queues.
>
>
> [root@ ~]# ethtool -l eth0
> Channel parameters for eth0:
> Pre-set maximums:
> RX: 9
> TX: 8
> Other:  0
> Combined:   17
> Current hardware settings:
> RX: 0
> TX: 0
> Other:  0
> Combined:   8
>
> [root@ ~]# ethtool -L eth0 combined 16
>
> The last command PANIC the kernel easily (5 out of 5 in my tests).
>
> I haven't got luck to catch much console output, the only line I got is:
>
> [ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
> irq_matrix_free+0x32/0xd0
>
> The NIC we have is
>
> Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller
>
> Thanks,
> Song
>
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>>> > > looks resembling.
>>> >
>>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>>> >
>>> > Thanks,
>>> >
>>> > tglx
>>> >
>>>
>>> We still see the issue in our daily regression runs.
>>> I have your patch merged into my internal branch, it prints the following:
>>>
>>> [ 4898.226258] Trying to clear prev_vector: 0
>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>
>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>
>> Could you please enable the vector and irq matrix trace points and capture
>> the trace when this happens?
>>
>> Thanks,
>>
>> tglx
>>


Re: WARNING and PANIC in irq_matrix_free

2018-05-25 Thread Song Liu
Adding Broadcom developers to the thread.

Hi Michael, Vasundhara, and Andy,

Could you please help look into the case I found? ethtool crashes the system
for both net/master and net-next/master.

Thanks,
Song

On Fri, May 25, 2018 at 1:10 PM, Song Liu  wrote:
> Hi,
>
> We are seeing something probably related.
>
> We run ethtool on a system with Broadcom NIC to increase number of
> combined queues.
>
>
> [root@ ~]# ethtool -l eth0
> Channel parameters for eth0:
> Pre-set maximums:
> RX: 9
> TX: 8
> Other:  0
> Combined:   17
> Current hardware settings:
> RX: 0
> TX: 0
> Other:  0
> Combined:   8
>
> [root@ ~]# ethtool -L eth0 combined 16
>
> The last command PANIC the kernel easily (5 out of 5 in my tests).
>
> I haven't got luck to catch much console output, the only line I got is:
>
> [ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
> irq_matrix_free+0x32/0xd0
>
> The NIC we have is
>
> Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller
>
> Thanks,
> Song
>
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>>> > > looks resembling.
>>> >
>>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>>> >
>>> > Thanks,
>>> >
>>> > tglx
>>> >
>>>
>>> We still see the issue in our daily regression runs.
>>> I have your patch merged into my internal branch, it prints the following:
>>>
>>> [ 4898.226258] Trying to clear prev_vector: 0
>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>
>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>
>> Could you please enable the vector and irq matrix trace points and capture
>> the trace when this happens?
>>
>> Thanks,
>>
>> tglx
>>


Re: WARNING and PANIC in irq_matrix_free

2018-05-25 Thread Song Liu
Hi,

We are seeing something probably related.

We run ethtool on a system with Broadcom NIC to increase number of
combined queues.


[root@ ~]# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 9
TX: 8
Other:  0
Combined:   17
Current hardware settings:
RX: 0
TX: 0
Other:  0
Combined:   8

[root@ ~]# ethtool -L eth0 combined 16

The last command PANIC the kernel easily (5 out of 5 in my tests).

I haven't got luck to catch much console output, the only line I got is:

[ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
irq_matrix_free+0x32/0xd0

The NIC we have is

Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller

Thanks,
Song

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>> > > looks resembling.
>> >
>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>> >
>> > Thanks,
>> >
>> > tglx
>> >
>>
>> We still see the issue in our daily regression runs.
>> I have your patch merged into my internal branch, it prints the following:
>>
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>>
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>
> Could you please enable the vector and irq matrix trace points and capture
> the trace when this happens?
>
> Thanks,
>
> tglx
>


Re: WARNING and PANIC in irq_matrix_free

2018-05-25 Thread Song Liu
Hi,

We are seeing something probably related.

We run ethtool on a system with Broadcom NIC to increase number of
combined queues.


[root@ ~]# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 9
TX: 8
Other:  0
Combined:   17
Current hardware settings:
RX: 0
TX: 0
Other:  0
Combined:   8

[root@ ~]# ethtool -L eth0 combined 16

The last command PANIC the kernel easily (5 out of 5 in my tests).

I haven't got luck to catch much console output, the only line I got is:

[ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
irq_matrix_free+0x32/0xd0

The NIC we have is

Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller

Thanks,
Song

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner  wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>> > > looks resembling.
>> >
>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>> >
>> > Thanks,
>> >
>> > tglx
>> >
>>
>> We still see the issue in our daily regression runs.
>> I have your patch merged into my internal branch, it prints the following:
>>
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>>
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>
> Could you please enable the vector and irq matrix trace points and capture
> the trace when this happens?
>
> Thanks,
>
> tglx
>


Re: WARNING and PANIC in irq_matrix_free

2018-05-23 Thread Thomas Gleixner
On Wed, 23 May 2018, Tariq Toukan wrote:
> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> > On Fri, 18 May 2018, Dmitry Safonov wrote:
> > > I'm not entirely sure that it's the same fault, but at least backtrace
> > > looks resembling.
> > 
> > Yes, it's similar, but not the same issue. I'll stare are the code ...
> > 
> > Thanks,
> > 
> > tglx
> > 
> 
> We still see the issue in our daily regression runs.
> I have your patch merged into my internal branch, it prints the following:
> 
> [ 4898.226258] Trying to clear prev_vector: 0
> [ 4898.226439] Trying to clear prev_vector: 0
> 
> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.

Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?

Thanks,

tglx



Re: WARNING and PANIC in irq_matrix_free

2018-05-23 Thread Thomas Gleixner
On Wed, 23 May 2018, Tariq Toukan wrote:
> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> > On Fri, 18 May 2018, Dmitry Safonov wrote:
> > > I'm not entirely sure that it's the same fault, but at least backtrace
> > > looks resembling.
> > 
> > Yes, it's similar, but not the same issue. I'll stare are the code ...
> > 
> > Thanks,
> > 
> > tglx
> > 
> 
> We still see the issue in our daily regression runs.
> I have your patch merged into my internal branch, it prints the following:
> 
> [ 4898.226258] Trying to clear prev_vector: 0
> [ 4898.226439] Trying to clear prev_vector: 0
> 
> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.

Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?

Thanks,

tglx



Re: WARNING and PANIC in irq_matrix_free

2018-05-23 Thread Tariq Toukan



On 19/05/2018 2:20 PM, Thomas Gleixner wrote:

On Fri, 18 May 2018, Dmitry Safonov wrote:

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.


Yes, it's similar, but not the same issue. I'll stare are the code ...

Thanks,

tglx



We still see the issue in our daily regression runs.
I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Re: WARNING and PANIC in irq_matrix_free

2018-05-23 Thread Tariq Toukan



On 19/05/2018 2:20 PM, Thomas Gleixner wrote:

On Fri, 18 May 2018, Dmitry Safonov wrote:

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.


Yes, it's similar, but not the same issue. I'll stare are the code ...

Thanks,

tglx



We still see the issue in our daily regression runs.
I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.


Re: WARNING and PANIC in irq_matrix_free

2018-05-19 Thread Thomas Gleixner
On Fri, 18 May 2018, Dmitry Safonov wrote:
> I'm not entirely sure that it's the same fault, but at least backtrace
> looks resembling.

Yes, it's similar, but not the same issue. I'll stare are the code ...

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-19 Thread Thomas Gleixner
On Fri, 18 May 2018, Dmitry Safonov wrote:
> I'm not entirely sure that it's the same fault, but at least backtrace
> looks resembling.

Yes, it's similar, but not the same issue. I'll stare are the code ...

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-05-18 Thread Dmitry Safonov
2018-05-18 23:41 GMT+01:00 Dmitry Safonov <0x7f454...@gmail.com>:
> What I did - was running x86 selftests and ssh'ing at the same moment.
> Will try it again if it'll fire by any chance.

No, I've tried a couple of times (after reboot as it's ONCE), but it looks
hard to reproduce by hands without some automation or something.

Thanks,
 Dmitry


Re: WARNING and PANIC in irq_matrix_free

2018-05-18 Thread Dmitry Safonov
2018-05-18 23:41 GMT+01:00 Dmitry Safonov <0x7f454...@gmail.com>:
> What I did - was running x86 selftests and ssh'ing at the same moment.
> Will try it again if it'll fire by any chance.

No, I've tried a couple of times (after reboot as it's ONCE), but it looks
hard to reproduce by hands without some automation or something.

Thanks,
 Dmitry


Re: WARNING and PANIC in irq_matrix_free

2018-05-18 Thread Dmitry Safonov
Hi Thomas, Tariq,

2018-02-20 18:11 GMT+00:00 Thomas Gleixner :
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>
>> Hi Thomas,
>>
>> We started seeing new issues in our net-device daily regression tests.
>> They are related to patch [1] introduced in kernel 4.15-rc1.
>>
>> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried 
>> to
>> narrow it down to a smaller run but couldn't.
>>
>> In addition, sometimes (less frequent) the warning is followed by a panic 
>> [3].
>
> Which might be just a consequence of the issue which triggers the warning.
>
>> I can share all needed details to help analyze this bug.
>> If you suspect specific flows, we can do an educated narrow down.
>
>> [2]
>> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
>> irq_matrix_free+0x30/0xd0
>> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
>> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
>> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: 
>> 
>> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 
>> 880237038400
>> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: 
>> 
>> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 
>> 88023fc4
>> [ 8664.925345] R13:  R14: 0001 R15: 
>> 002b
>> [ 8664.927872] FS:  () GS:88023fd4() 
>> knlGS:
>> [ 8664.930455] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 
>> 06e0
>> [ 8664.935557] DR0:  DR1:  DR2: 
>> 
>> [ 8664.938051] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [ 8664.940541] Call Trace:
>> [ 8664.942980]  
>> [ 8664.945399]  free_moved_vector+0x4e/0x100
>> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8664.952480]  
>
> So thats:
>
> if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
> return;
>
> and the disassembly of
>
>> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
>> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
>> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74
>
> gives:
>
>   1f:   44 39 6f 04 cmp%r13d,0x4(%rdi)
>   23:   77 06   ja 0x2b
>   25:   44 3b 6f 08 cmp0x8(%rdi),%r13d
>   29:   72 0b   jb 0x36
>   2b:*  0f ff   (bad)   <-- trapping instruction
>
> So the vector to release should be in r13d, which is 0. /me scratches head.
>
>> [ 8943.038767] BUG: unable to handle kernel paging request at 
>> 00037a6b561b
>> [ 8943.040114] IP: free_moved_vector+0x61/0x100
>> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
>> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
>> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 
>> 0001
>> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 
>> 880237038400
>> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09: 
>> 
>> [ 8943.061051] R10:  R11:  R12: 
>> 7f0c0001
>> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 
>> 00cc620d
>> [ 8943.063726] FS:  () GS:88023fd4() 
>> knlGS:
>> [ 8943.064993] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 
>> 06e0
>> [ 8943.067522] DR0:  DR1:  DR2: 
>> 
>> [ 8943.068771] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [ 8943.070029] Call Trace:
>> [ 8943.071273]  
>> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8943.075048]  
> ...
>> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
>> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
>> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48
>
>  17:*   48 c7 00 00 00 00 00movq   $0x0,(%rax)  <-- trapping 
> instruction
>
> I can't tell exactly which part of free_moved_vector() that is and which
> vector it tries to free, but its more or less the same crap just with a
> fatal vector number.
>
> Is there CPU hotplugging in play?
>
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.
I've got it in VM on the last master: v4.17-rc5-110-g2c71d338bef2
Let me know if you need .config or CPU/qemu details or anything.
I'm not sure if it'll 

Re: WARNING and PANIC in irq_matrix_free

2018-05-18 Thread Dmitry Safonov
Hi Thomas, Tariq,

2018-02-20 18:11 GMT+00:00 Thomas Gleixner :
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>
>> Hi Thomas,
>>
>> We started seeing new issues in our net-device daily regression tests.
>> They are related to patch [1] introduced in kernel 4.15-rc1.
>>
>> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried 
>> to
>> narrow it down to a smaller run but couldn't.
>>
>> In addition, sometimes (less frequent) the warning is followed by a panic 
>> [3].
>
> Which might be just a consequence of the issue which triggers the warning.
>
>> I can share all needed details to help analyze this bug.
>> If you suspect specific flows, we can do an educated narrow down.
>
>> [2]
>> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
>> irq_matrix_free+0x30/0xd0
>> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
>> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
>> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: 
>> 
>> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 
>> 880237038400
>> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: 
>> 
>> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 
>> 88023fc4
>> [ 8664.925345] R13:  R14: 0001 R15: 
>> 002b
>> [ 8664.927872] FS:  () GS:88023fd4() 
>> knlGS:
>> [ 8664.930455] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 
>> 06e0
>> [ 8664.935557] DR0:  DR1:  DR2: 
>> 
>> [ 8664.938051] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [ 8664.940541] Call Trace:
>> [ 8664.942980]  
>> [ 8664.945399]  free_moved_vector+0x4e/0x100
>> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8664.952480]  
>
> So thats:
>
> if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
> return;
>
> and the disassembly of
>
>> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
>> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
>> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74
>
> gives:
>
>   1f:   44 39 6f 04 cmp%r13d,0x4(%rdi)
>   23:   77 06   ja 0x2b
>   25:   44 3b 6f 08 cmp0x8(%rdi),%r13d
>   29:   72 0b   jb 0x36
>   2b:*  0f ff   (bad)   <-- trapping instruction
>
> So the vector to release should be in r13d, which is 0. /me scratches head.
>
>> [ 8943.038767] BUG: unable to handle kernel paging request at 
>> 00037a6b561b
>> [ 8943.040114] IP: free_moved_vector+0x61/0x100
>> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
>> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
>> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 
>> 0001
>> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 
>> 880237038400
>> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09: 
>> 
>> [ 8943.061051] R10:  R11:  R12: 
>> 7f0c0001
>> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 
>> 00cc620d
>> [ 8943.063726] FS:  () GS:88023fd4() 
>> knlGS:
>> [ 8943.064993] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 
>> 06e0
>> [ 8943.067522] DR0:  DR1:  DR2: 
>> 
>> [ 8943.068771] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [ 8943.070029] Call Trace:
>> [ 8943.071273]  
>> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8943.075048]  
> ...
>> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
>> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
>> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48
>
>  17:*   48 c7 00 00 00 00 00movq   $0x0,(%rax)  <-- trapping 
> instruction
>
> I can't tell exactly which part of free_moved_vector() that is and which
> vector it tries to free, but its more or less the same crap just with a
> fatal vector number.
>
> Is there CPU hotplugging in play?
>
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.
I've got it in VM on the last master: v4.17-rc5-110-g2c71d338bef2
Let me know if you need .config or CPU/qemu details or anything.
I'm not sure if it'll reproduce, but might 

Re: WARNING and PANIC in irq_matrix_free

2018-02-25 Thread Tariq Toukan



On 22/02/2018 11:38 PM, Thomas Gleixner wrote:

On Wed, 21 Feb 2018, Tariq Toukan wrote:

On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.


Ok.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It arbitrary appears in different flows, like sending traffic or interface
configuration changes.


Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points.

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.



OK, I'm temporarily adding this to the regression internal branch. I'll 
let you know once we have a repro.


Thanks,
Tariq


Thanks,

tglx

8<--
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
unsigned int cpu = apicd->prev_cpu;
bool managed = apicd->is_managed;
  
+	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {

+   tracing_off();
+   pr_err("Trying to clear prev_vector: %u\n", vector);
+   goto out;
+   }
+
/*
 * This should never happen. Managed interrupts are not
 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
hlist_del_init(>clist);
apicd->prev_vector = 0;
apicd->move_in_progress = 0;



Re: WARNING and PANIC in irq_matrix_free

2018-02-25 Thread Tariq Toukan



On 22/02/2018 11:38 PM, Thomas Gleixner wrote:

On Wed, 21 Feb 2018, Tariq Toukan wrote:

On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.


Ok.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It arbitrary appears in different flows, like sending traffic or interface
configuration changes.


Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points.

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.



OK, I'm temporarily adding this to the regression internal branch. I'll 
let you know once we have a repro.


Thanks,
Tariq


Thanks,

tglx

8<--
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
unsigned int cpu = apicd->prev_cpu;
bool managed = apicd->is_managed;
  
+	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {

+   tracing_off();
+   pr_err("Trying to clear prev_vector: %u\n", vector);
+   goto out;
+   }
+
/*
 * This should never happen. Managed interrupts are not
 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
hlist_del_init(>clist);
apicd->prev_vector = 0;
apicd->move_in_progress = 0;



Re: WARNING and PANIC in irq_matrix_free

2018-02-22 Thread Thomas Gleixner
On Wed, 21 Feb 2018, Tariq Toukan wrote:
> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> > On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> > > On Tue, 20 Feb 2018, Tariq Toukan wrote:
> > > 
> > > Is there CPU hotplugging in play?
> 
> No.

Ok.

> > > 
> > > I'll come back to you tomorrow with a plan how to debug that after staring
> > > into the code some more.
> > 
> > Do you have a rough idea what the test case is doing?
> > 
> 
> It arbitrary appears in different flows, like sending traffic or interface
> configuration changes.

Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points. 

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.

Thanks,

tglx

8<--
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
unsigned int cpu = apicd->prev_cpu;
bool managed = apicd->is_managed;
 
+   if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
+   tracing_off();
+   pr_err("Trying to clear prev_vector: %u\n", vector);
+   goto out;
+   }
+
/*
 * This should never happen. Managed interrupts are not
 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
hlist_del_init(>clist);
apicd->prev_vector = 0;
apicd->move_in_progress = 0;


Re: WARNING and PANIC in irq_matrix_free

2018-02-22 Thread Thomas Gleixner
On Wed, 21 Feb 2018, Tariq Toukan wrote:
> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> > On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> > > On Tue, 20 Feb 2018, Tariq Toukan wrote:
> > > 
> > > Is there CPU hotplugging in play?
> 
> No.

Ok.

> > > 
> > > I'll come back to you tomorrow with a plan how to debug that after staring
> > > into the code some more.
> > 
> > Do you have a rough idea what the test case is doing?
> > 
> 
> It arbitrary appears in different flows, like sending traffic or interface
> configuration changes.

Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points. 

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.

Thanks,

tglx

8<--
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
unsigned int cpu = apicd->prev_cpu;
bool managed = apicd->is_managed;
 
+   if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
+   tracing_off();
+   pr_err("Trying to clear prev_vector: %u\n", vector);
+   goto out;
+   }
+
/*
 * This should never happen. Managed interrupts are not
 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
trace_vector_free_moved(apicd->irq, cpu, vector, managed);
irq_matrix_free(vector_matrix, cpu, vector, managed);
per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
hlist_del_init(>clist);
apicd->prev_vector = 0;
apicd->move_in_progress = 0;


Re: WARNING and PANIC in irq_matrix_free

2018-02-21 Thread Tariq Toukan



On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It arbitrary appears in different flows, like sending traffic or 
interface configuration changes.


Thanks,
Tariq


Re: WARNING and PANIC in irq_matrix_free

2018-02-21 Thread Tariq Toukan



On 20/02/2018 8:18 PM, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Thomas Gleixner wrote:

On Tue, 20 Feb 2018, Tariq Toukan wrote:

Is there CPU hotplugging in play?


No.



I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.


Do you have a rough idea what the test case is doing?



It arbitrary appears in different flows, like sending traffic or 
interface configuration changes.


Thanks,
Tariq


Re: WARNING and PANIC in irq_matrix_free

2018-02-20 Thread Thomas Gleixner
On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
> 
> Is there CPU hotplugging in play?
> 
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

Do you have a rough idea what the test case is doing?

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-02-20 Thread Thomas Gleixner
On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
> 
> Is there CPU hotplugging in play?
> 
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

Do you have a rough idea what the test case is doing?

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-02-20 Thread Thomas Gleixner
On Tue, 20 Feb 2018, Tariq Toukan wrote:

> Hi Thomas,
> 
> We started seeing new issues in our net-device daily regression tests.
> They are related to patch [1] introduced in kernel 4.15-rc1.
> 
> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to
> narrow it down to a smaller run but couldn't.
> 
> In addition, sometimes (less frequent) the warning is followed by a panic [3].

Which might be just a consequence of the issue which triggers the warning.

> I can share all needed details to help analyze this bug.
> If you suspect specific flows, we can do an educated narrow down.

> [2]
> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
> irq_matrix_free+0x30/0xd0
> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: 
> 
> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 
> 880237038400
> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: 
> 
> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 
> 88023fc4
> [ 8664.925345] R13:  R14: 0001 R15: 
> 002b
> [ 8664.927872] FS:  () GS:88023fd4() 
> knlGS:
> [ 8664.930455] CS:  0010 DS:  ES:  CR0: 80050033
> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 
> 06e0
> [ 8664.935557] DR0:  DR1:  DR2: 
> 
> [ 8664.938051] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 8664.940541] Call Trace:
> [ 8664.942980]  
> [ 8664.945399]  free_moved_vector+0x4e/0x100
> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8664.952480]  

So thats:

if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
return;

and the disassembly of

> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

gives:

  1f:   44 39 6f 04 cmp%r13d,0x4(%rdi)
  23:   77 06   ja 0x2b
  25:   44 3b 6f 08 cmp0x8(%rdi),%r13d
  29:   72 0b   jb 0x36
  2b:*  0f ff   (bad)   <-- trapping instruction

So the vector to release should be in r13d, which is 0. /me scratches head.

> [ 8943.038767] BUG: unable to handle kernel paging request at 00037a6b561b
> [ 8943.040114] IP: free_moved_vector+0x61/0x100
> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 
> 0001
> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 
> 880237038400
> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09: 
> 
> [ 8943.061051] R10:  R11:  R12: 
> 7f0c0001
> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 
> 00cc620d
> [ 8943.063726] FS:  () GS:88023fd4() 
> knlGS:
> [ 8943.064993] CS:  0010 DS:  ES:  CR0: 80050033
> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 
> 06e0
> [ 8943.067522] DR0:  DR1:  DR2: 
> 
> [ 8943.068771] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 8943.070029] Call Trace:
> [ 8943.071273]  
> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8943.075048]  
...
> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48

 17:*   48 c7 00 00 00 00 00movq   $0x0,(%rax)  <-- trapping 
instruction

I can't tell exactly which part of free_moved_vector() that is and which
vector it tries to free, but its more or less the same crap just with a
fatal vector number.

Is there CPU hotplugging in play?

I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.

Thanks,

tglx


Re: WARNING and PANIC in irq_matrix_free

2018-02-20 Thread Thomas Gleixner
On Tue, 20 Feb 2018, Tariq Toukan wrote:

> Hi Thomas,
> 
> We started seeing new issues in our net-device daily regression tests.
> They are related to patch [1] introduced in kernel 4.15-rc1.
> 
> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to
> narrow it down to a smaller run but couldn't.
> 
> In addition, sometimes (less frequent) the warning is followed by a panic [3].

Which might be just a consequence of the issue which triggers the warning.

> I can share all needed details to help analyze this bug.
> If you suspect specific flows, we can do an educated narrow down.

> [2]
> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
> irq_matrix_free+0x30/0xd0
> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
> [ 8664.912624] RSP: 0018:88023fd43f70 EFLAGS: 00010002
> [ 8664.915149] RAX: 00026318 RBX: 880157a77ec0 RCX: 
> 
> [ 8664.917679] RDX: 0001 RSI: 0001 RDI: 
> 880237038400
> [ 8664.920244] RBP: 880237038400 R08: e8ba3c69 R09: 
> 
> [ 8664.922813] R10: 03ff R11: 0ad9 R12: 
> 88023fc4
> [ 8664.925345] R13:  R14: 0001 R15: 
> 002b
> [ 8664.927872] FS:  () GS:88023fd4() 
> knlGS:
> [ 8664.930455] CS:  0010 DS:  ES:  CR0: 80050033
> [ 8664.932996] CR2: 00f2c030 CR3: 0220a000 CR4: 
> 06e0
> [ 8664.935557] DR0:  DR1:  DR2: 
> 
> [ 8664.938051] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 8664.940541] Call Trace:
> [ 8664.942980]  
> [ 8664.945399]  free_moved_vector+0x4e/0x100
> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8664.952480]  

So thats:

if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
return;

and the disassembly of

> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

gives:

  1f:   44 39 6f 04 cmp%r13d,0x4(%rdi)
  23:   77 06   ja 0x2b
  25:   44 3b 6f 08 cmp0x8(%rdi),%r13d
  29:   72 0b   jb 0x36
  2b:*  0f ff   (bad)   <-- trapping instruction

So the vector to release should be in r13d, which is 0. /me scratches head.

> [ 8943.038767] BUG: unable to handle kernel paging request at 00037a6b561b
> [ 8943.040114] IP: free_moved_vector+0x61/0x100
> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
> [ 8943.055940] RSP: 0018:88023fd43fa0 EFLAGS: 00010007
> [ 8943.057233] RAX: 00037a6b561b RBX: 880157a77ec0 RCX: 
> 0001
> [ 8943.058506] RDX: 000155a8 RSI: 000155a8 RDI: 
> 880237038400
> [ 8943.059784] RBP: 880157a77ec0 R08: e8ba3c69 R09: 
> 
> [ 8943.061051] R10:  R11:  R12: 
> 7f0c0001
> [ 8943.062462] R13: 000155a8 R14: 0001 R15: 
> 00cc620d
> [ 8943.063726] FS:  () GS:88023fd4() 
> knlGS:
> [ 8943.064993] CS:  0010 DS:  ES:  CR0: 80050033
> [ 8943.066253] CR2: 00037a6b561b CR3: 00010badc000 CR4: 
> 06e0
> [ 8943.067522] DR0:  DR1:  DR2: 
> 
> [ 8943.068771] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 8943.070029] Call Trace:
> [ 8943.071273]  
> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8943.075048]  
...
> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48

 17:*   48 c7 00 00 00 00 00movq   $0x0,(%rax)  <-- trapping 
instruction

I can't tell exactly which part of free_moved_vector() that is and which
vector it tries to free, but its more or less the same crap just with a
fatal vector number.

Is there CPU hotplugging in play?

I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.

Thanks,

tglx