Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-18 Thread Thomas Gleixner
On Thu, 18 Jan 2018, Keith Busch wrote: > On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote: > > Can you please provide the output of > > > > # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ > > # cat /sys/kernel/debug/irq/irqs/48 > handler: handle_edge_irq > device: :1a:00.0

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-18 Thread Keith Busch
On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote: > Can you please provide the output of > > # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ # cat /sys/kernel/debug/irq/irqs/48 handler: handle_edge_irq device: :1a:00.0 status: 0x istate: 0x ddepth: 0

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-18 Thread Thomas Gleixner
On Wed, 17 Jan 2018, Keith Busch wrote: > On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote: > > Which device is allocating gazillions of non-managed interrupts? > > I believe that would be the i40e. :) So enterprise grade insanity was spot on. Can you please provide the output of

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-17 Thread Keith Busch
On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote: > Which device is allocating gazillions of non-managed interrupts? I believe that would be the i40e. :) > The patch below should cure that by spreading them out on allocation. Yep, this is successfully testing already over 200
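
The patch itself is not visible in this excerpt, so the following is only a rough sketch of what "spreading them out on allocation" could look like: each new vector is taken from whichever CPU currently has the most free vectors. The names `pick_least_loaded_cpu`, `alloc_vector_spread`, and the `avail[]` counter array are hypothetical and are not the kernel's irq matrix API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the index of the CPU with the most free vectors, or -1 if
 * no CPU has any left. avail[] is a hypothetical per-CPU free counter,
 * not a kernel data structure.
 */
static int pick_least_loaded_cpu(const unsigned int *avail, size_t ncpus)
{
    int best = -1;
    unsigned int best_avail = 0;

    assert(avail != NULL);
    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        /* Strict comparison: on a tie the lowest-numbered CPU wins. */
        if (avail[cpu] > best_avail) {
            best_avail = avail[cpu];
            best = (int)cpu;
        }
    }
    return best;
}

/*
 * Takes one vector from the CPU with the most free vectors.
 * Returns that CPU, or -1 when every CPU is exhausted (the ENOSPC case).
 */
static int alloc_vector_spread(unsigned int *avail, size_t ncpus)
{
    int cpu = pick_least_loaded_cpu(avail, ncpus);

    if (cpu >= 0)
        avail[cpu]--;
    return cpu;
}
```

With this policy a driver that requests many non-managed interrupts drains all CPUs at roughly the same rate instead of filling one CPU first, which is the behaviour the thread expects ("ENOSPC should never happen") while free vectors remain anywhere.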

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-17 Thread Thomas Gleixner
On Wed, 17 Jan 2018, Keith Busch wrote: > On Wed, Jan 17, 2018 at 10:32:12AM +0100, Thomas Gleixner wrote: > > On Wed, 17 Jan 2018, Thomas Gleixner wrote: > > > That doesn't sound right. The vectors should be spread evenly across the > > > CPUs. So ENOSPC should never happen. > > > > > > Can

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-17 Thread Thomas Gleixner
On Wed, 17 Jan 2018, Thomas Gleixner wrote: > On Wed, 17 Jan 2018, Keith Busch wrote: > > On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote: > > > Can you trace the matrix allocations from the very beginning or tell me > > > how > > > to reproduce. I'd like to figure out why this

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-17 Thread Thomas Gleixner
On Wed, 17 Jan 2018, Keith Busch wrote: > On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote: > > Can you trace the matrix allocations from the very beginning or tell me how > > to reproduce. I'd like to figure out why this is happening. > > Sure, I'll get the irq_matrix events. > >

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote: > Can you trace the matrix allocations from the very beginning or tell me how > to reproduce. I'd like to figure out why this is happening. Sure, I'll get the irq_matrix events. I reproduce this on a machine with 112 CPUs and 3

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Keith Busch wrote: > On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote: > > 8<-- > > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c > > index f8b03bb8e725..3cc471beb50b 100644 > > --- a/arch/x86/kernel/apic/vector.c

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote: > 8<-- > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c > index f8b03bb8e725..3cc471beb50b 100644 > --- a/arch/x86/kernel/apic/vector.c > +++ b/arch/x86/kernel/apic/vector.c > @@

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote: > What we want is s/i + 1/i/ > > That's correct because x86_vector_free_irqs() does: > >for (i = 0; i < nr; i++) > > > So if we fail at the first irq, then the loop will do nothing. Failing on > the
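
The off-by-one described here can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in arch/x86/kernel/apic/vector.c: `slot_in_use`, `bogus_frees`, `free_slots`, `alloc_slots`, and `slots_in_use` are hypothetical names. The only detail taken from the thread is that the free routine loops `for (i = 0; i < nr; i++)`, so an error path that failed at index `i` must pass `i`, not `i + 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Hypothetical stand-in for per-interrupt state; not a kernel structure. */
static bool slot_in_use[NR_SLOTS];

/* Counts attempts to free a slot that was never set up. */
static unsigned int bogus_frees;

/* Frees slots [0, nr), mirroring a "for (i = 0; i < nr; i++)" free loop. */
static void free_slots(size_t nr)
{
    assert(nr <= NR_SLOTS);
    for (size_t i = 0; i < nr; i++) {
        if (!slot_in_use[i])
            bogus_frees++;
        slot_in_use[i] = false;
    }
}

/*
 * Sets up nr slots; setting up slot fail_at is forced to fail.
 * Returns 0 on success, or -1 after undoing the partial work.
 */
static int alloc_slots(size_t nr, size_t fail_at)
{
    if (nr > NR_SLOTS)
        return -1;

    for (size_t i = 0; i < nr; i++) {
        if (i == fail_at) {
            /*
             * Slots 0..i-1 were set up, slot i was not, so the
             * count to undo is i. Passing i + 1 would also "free"
             * the slot that was never allocated.
             */
            free_slots(i);
            return -1;
        }
        slot_in_use[i] = true;
    }
    return 0;
}

/* Number of slots currently marked as set up. */
static size_t slots_in_use(void)
{
    size_t n = 0;

    for (size_t i = 0; i < NR_SLOTS; i++)
        n += slot_in_use[i] ? 1 : 0;
    return n;
}
```

Failing at the first slot (`fail_at == 0`) calls `free_slots(0)`, whose loop body never runs, which matches the remark that "the loop will do nothing" in that case.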

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Thomas Gleixner wrote: > On Tue, 16 Jan 2018, Keith Busch wrote: > > > This is all way over my head, but the part that obviously shows > > something's gone wrong: > > > > kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: > > bit=56 cpu=0 online=1

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Keith Busch wrote: > This is all way over my head, but the part that obviously shows > something's gone wrong: > > kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: > bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 >

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-15 Thread Keith Busch
This is all way over my head, but the part that obviously shows something's gone wrong: kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 global_avl=22084, global_rsvd=157, total_alloc=570

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-15 Thread Thomas Gleixner
On Sun, 14 Jan 2018, Keith Busch wrote: > I hoped to have a better report before the weekend, but I've run out of > time and without my machine till next week, so sending what I have and > praying someone more in the know will have a better clue. > > I've a few NVMe drives and occasionally the
