Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Ming Lei
Hi Kashyap,

On Fri, Feb 02, 2018 at 05:08:12PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Friday, February 2, 2018 3:44 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; Peter Rivera
> > Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
> >
> > Hi Kashyap,
> >
> > On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > > Hi All -
> > >
> > > We have seen CPU lockup issues in the field when a system has a large
> > > (more than 96) logical CPU count.
> > > The SAS3.0 controller (Invader series) supports at most 96 MSI-X vectors
> > > and the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.
> > >
> > > This may be a generic issue (for any PCI device that supports completion
> > > on multiple reply queues). Let me explain it with respect to the hardware
> > > supported by mpt3sas, just to simplify the problem and the possible
> > > changes to handle such issues. The IT HBA (mpt3sas) supports multiple
> > > reply queues in the completion path. The driver creates MSI-X vectors
> > > for the controller as "min(FW supported reply queues, logical CPUs)".
> > > If the submitter is not interrupted via a completion on the same CPU,
> > > there is a loop in the IO path. This behavior can cause hard/soft CPU
> > > lockups, IO timeouts, system sluggishness, etc.
> >
> > As I mentioned in another thread, this issue may be solved by SCSI_MQ by
> > mapping the reply queues into blk_mq hctxs, together with
> > QUEUE_FLAG_SAME_FORCE. Especially since you already set
> > 'smp_affinity_enable' to 1 by default, pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> > can spread the IRQ vectors across the CPUs perfectly for you.
> >
> > But the following Hannes's patch is required for the conversion.
> >
> > https://marc.info/?l=linux-block&m=149130770004507&w=2
> >
> 
> Hi Ming -
> 
> I went through the thread discussing "support host-wide tagset". The link
> below has the latest reply on that thread.
> https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> I think there is some confusion over the mpt3sas and megaraid_sas h/w
> behavior. The Broadcom/LSI HBA and MR hardware have only one h/w queue
> for submission, but there are multiple reply queues.

That shouldn't be a problem; you can still submit to the same hw queue from
all submission paths (all hw queues), just like the current implementation.
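
For illustration, a minimal queue_rq() sketch of that model (purely made-up
driver code, not mpt3sas or megaraid_sas: one hardware submission queue shared
by all hctxs, with the hctx number only selecting the reply queue):

#include <linux/blk-mq.h>

/* Illustrative only: the hardware has a single submission queue but several
 * reply queues.  Every hctx posts to the same submission queue; the hctx
 * number is simply carried in the request descriptor so the firmware posts
 * the completion to the matching reply queue.  All names are invented. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct example_ioc *ioc = hctx->queue->queuedata;
	struct request *rq = bd->rq;
	struct example_request_desc desc = {
		.smid		= rq->tag,
		.msix_index	= hctx->queue_num,	/* pick reply queue */
	};

	blk_mq_start_request(rq);
	example_post_to_single_submission_queue(ioc, &desc); /* invented helper */
	return BLK_STS_OK;
}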

> Even if I include Hannes' patch for the host-wide tagset, the problem
> described in this RFC will not be resolved.  In fact, the tagset approach
> can produce the same result if the completion queue count is less than the
> online CPU count. Don't you think? Or am I missing something?

If the reply queue count is less than the online CPU count, more than one CPU
may be mapped to some (or all) of the hw queues, but the completion is only
handled on one of the mapped CPUs, and it can be done on the request's
submission CPU via the QUEUE_FLAG_SAME_FORCE queue flag; please see
__blk_mq_complete_request().
Or do you have another requirement besides completing the request on its
submission CPU?
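
For reference, a simplified sketch of that completion-steering logic (modeled
on the blk-mq code of this era; locking and the softirq details are omitted,
so treat it as an approximation rather than the exact kernel source):

/* QUEUE_FLAG_SAME_COMP corresponds to rq_affinity=1; adding
 * QUEUE_FLAG_SAME_FORCE corresponds to rq_affinity=2 (force the completion
 * back to the submission CPU even when the CPUs share a cache). */
static void __blk_mq_complete_request_sketch(struct request *rq)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	bool shared = false;
	int cpu;

	/* rq_affinity=0: complete on whatever CPU took the interrupt */
	if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
		rq->q->softirq_done_fn(rq);
		return;
	}

	cpu = get_cpu();
	/* rq_affinity=1: a cache-sharing CPU is considered good enough */
	if (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags))
		shared = cpus_share_cache(cpu, ctx->cpu);

	if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
		/* IPI the submission CPU and run the completion there,
		 * using the kernel's remote-completion helper */
		rq->csd.func = __blk_mq_complete_request_remote;
		rq->csd.info = rq;
		rq->csd.flags = 0;
		smp_call_function_single_async(ctx->cpu, &rq->csd);
	} else {
		rq->q->softirq_done_fn(rq);
	}
	put_cpu();
}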

> 
> We don't have a problem in the submission path.  The current problem is
> that one MSI-X vector serving more than one CPU can cause an I/O loop.
> This is visible if we have a higher number of online CPUs.

Yeah, I know. As I mentioned above, your requirement of completing a
request on its submission CPU can be met with the current SCSI_MQ without
much difficulty.

Thanks,
Ming


RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, February 2, 2018 3:44 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
>
> Hi Kashyap,
>
> On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen CPU lockup issues in the field when a system has a large
> > (more than 96) logical CPU count.
> > The SAS3.0 controller (Invader series) supports at most 96 MSI-X vectors
> > and the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.
> >
> > This may be a generic issue (for any PCI device that supports completion
> > on multiple reply queues). Let me explain it with respect to the hardware
> > supported by mpt3sas, just to simplify the problem and the possible
> > changes to handle such issues. The IT HBA (mpt3sas) supports multiple
> > reply queues in the completion path. The driver creates MSI-X vectors
> > for the controller as "min(FW supported reply queues, logical CPUs)".
> > If the submitter is not interrupted via a completion on the same CPU,
> > there is a loop in the IO path. This behavior can cause hard/soft CPU
> > lockups, IO timeouts, system sluggishness, etc.
>
> As I mentioned in another thread, this issue may be solved by SCSI_MQ by
> mapping the reply queues into blk_mq hctxs, together with
> QUEUE_FLAG_SAME_FORCE. Especially since you already set
> 'smp_affinity_enable' to 1 by default, pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> can spread the IRQ vectors across the CPUs perfectly for you.
>
> But the following Hannes's patch is required for the conversion.
>
>   https://marc.info/?l=linux-block&m=149130770004507&w=2
>

Hi Ming -

I went through the thread discussing "support host-wide tagset". The link
below has the latest reply on that thread.
https://marc.info/?l=linux-block&m=149132580511346&w=2

I think there is some confusion over the mpt3sas and megaraid_sas h/w
behavior. The Broadcom/LSI HBA and MR hardware have only one h/w queue for
submission, but there are multiple reply queues.
Even if I include Hannes' patch for the host-wide tagset, the problem
described in this RFC will not be resolved.  In fact, the tagset approach
can produce the same result if the completion queue count is less than the
online CPU count. Don't you think? Or am I missing something?

We don't have a problem in the submission path.  The current problem is
that one MSI-X vector serving more than one CPU can cause an I/O loop.
This is visible if we have a higher number of online CPUs.

> >
> > Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> > (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> > from the reply descriptor queue upon receiving interrupts from the HBA.
> > If CPU A keeps pumping IOs, then CPU B (which is executing the ISR) will
> > always see valid reply descriptors in the reply descriptor queue and
> > will keep processing them in a loop without quitting the ISR handler.
> > The mpt3sas driver only exits the ISR handler when it finds an unused
> > reply descriptor in the reply descriptor queue. Since CPU A is
> > continuously sending IOs, CPU B may always see a valid reply descriptor
> > (posted by the HBA firmware after processing an IO) in the reply
> > descriptor queue. In the worst case, the driver will never quit this
> > loop in the ISR handler. Eventually, a CPU lockup will be detected by
> > the watchdog.
> >
> > The above-mentioned behavior is not common if "rq_affinity" is set to 2
> > or the affinity_hint is honored by irqbalance with the "exact" policy.
> > If rq_affinity is set to 2, the submitter will always be interrupted via
> > a completion on the same CPU.
> > If irqbalance uses the "exact" policy, the interrupt will be delivered
> > to the submitter CPU.
>
> Now that you use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to allocate the
> MSI-X vectors, the IRQ affinity can't be changed from userspace any more.
>
> >
> > Problem statement -
> > If the CPU count to MSI-X vector (reply descriptor queue) count ratio
> > is not 1:1, we are still exposed to the issue explained above, and for
> > that we don't have any solution.
> >
> > There is exposure to soft/hard lockups if the CPU count is more than
> > the MSI-X vectors supported by the device.
> >
> > If the CPU count to MSI-X vector count ratio is not 1:1 (in other
> > words, if the CPU count to MSI-X vector count ratio is something like
> > X:1, where X > 1), then the 'exact' irqbalance policy or rq_affinity = 2
> > won't help to avoid CPU hard/soft lockups. There won't be any
> > one-to-one mapping between CPU and MSI-X vector; instead one MSI-X
> > interrupt (or reply
> > descriptor queu

Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Ming Lei
Hi Kashyap,

On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> Hi All -
> 
> We have seen CPU lockup issues in the field when a system has a large
> (more than 96) logical CPU count.
> The SAS3.0 controller (Invader series) supports at most 96 MSI-X vectors
> and the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.
> 
> This may be a generic issue (for any PCI device that supports completion
> on multiple reply queues). Let me explain it with respect to the hardware
> supported by mpt3sas, just to simplify the problem and the possible
> changes to handle such issues. The IT HBA (mpt3sas) supports multiple
> reply queues in the completion path. The driver creates MSI-X vectors for
> the controller as "min(FW supported reply queues, logical CPUs)". If the
> submitter is not interrupted via a completion on the same CPU, there is a
> loop in the IO path. This behavior can cause hard/soft CPU lockups, IO
> timeouts, system sluggishness, etc.

As I mentioned in another thread, this issue may be solved by SCSI_MQ by
mapping the reply queues into blk_mq hctxs, together with
QUEUE_FLAG_SAME_FORCE. Especially since you already set
'smp_affinity_enable' to 1 by default, pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
can spread the IRQ vectors across the CPUs perfectly for you.

But the following Hannes's patch is required for the conversion.

https://marc.info/?l=linux-block&m=149130770004507&w=2
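
For illustration, a minimal sketch of that kind of conversion on a 4.15-era
kernel (the function and field names here, such as example_setup_reply_queues
and fw_max_reply_queues, are invented for the example and are not actual
driver code):

#include <linux/pci.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>

/* Let the PCI core spread managed MSI-X vectors across CPUs, then create one
 * blk-mq hw queue per reply queue / MSI-X vector. */
static int example_setup_reply_queues(struct pci_dev *pdev,
				      struct blk_mq_tag_set *set,
				      int fw_max_reply_queues)
{
	int nvec;

	/* min(FW supported reply queues, logical CPUs), managed affinity */
	nvec = pci_alloc_irq_vectors(pdev, 1,
				     min_t(int, fw_max_reply_queues,
					   num_online_cpus()),
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nvec < 0)
		return nvec;

	set->nr_hw_queues = nvec;
	return 0;
}

/* In blk_mq_ops::map_queues, reuse the managed irq affinity masks so that
 * each CPU maps to the hctx whose vector is affine to it. */
static int example_map_queues(struct blk_mq_tag_set *set)
{
	struct pci_dev *pdev = example_get_pdev(set);	/* hypothetical helper */

	return blk_mq_pci_map_queues(set, pdev);
}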

> 
> Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> from the reply descriptor queue upon receiving interrupts from the HBA.
> If CPU A keeps pumping IOs, then CPU B (which is executing the ISR) will
> always see valid reply descriptors in the reply descriptor queue and will
> keep processing them in a loop without quitting the ISR handler.  The
> mpt3sas driver only exits the ISR handler when it finds an unused reply
> descriptor in the reply descriptor queue. Since CPU A is continuously
> sending IOs, CPU B may always see a valid reply descriptor (posted by the
> HBA firmware after processing an IO) in the reply descriptor queue. In
> the worst case, the driver will never quit this loop in the ISR handler.
> Eventually, a CPU lockup will be detected by the watchdog.
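
To make the loop concrete, here is a heavily simplified, purely illustrative
ISR skeleton of the style being described (all structure and helper names are
invented; this is not the actual mpt3sas handler):

#include <linux/interrupt.h>

/* The handler only returns once it sees an *unused* reply descriptor, so a
 * fast enough submitter on another CPU can keep it spinning forever. */
static irqreturn_t example_reply_queue_isr(int irq, void *dev_id)
{
	struct example_reply_queue *rq = dev_id;
	union example_reply_descriptor *desc;

	desc = &rq->base[rq->head];
	while (desc->words != EXAMPLE_UNUSED_DESC) {	/* exit only on empty queue */
		example_complete_io(rq, desc);		/* complete one IO */

		rq->head = (rq->head + 1) % rq->depth;
		desc = &rq->base[rq->head];
		/* If other CPUs keep posting IOs, fresh descriptors keep
		 * arriving and this loop never terminates -> hard lockup. */
	}

	example_update_reply_post_index(rq);
	return IRQ_HANDLED;
}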
> 
> The above-mentioned behavior is not common if "rq_affinity" is set to 2
> or the affinity_hint is honored by irqbalance with the "exact" policy.
> If rq_affinity is set to 2, the submitter will always be interrupted via
> a completion on the same CPU.
> If irqbalance uses the "exact" policy, the interrupt will be delivered to
> the submitter CPU.

Now that you use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to allocate the
MSI-X vectors, the IRQ affinity can't be changed from userspace any more.

> 
> Problem statement -
> If the CPU count to MSI-X vector (reply descriptor queue) count ratio is
> not 1:1, we are still exposed to the issue explained above, and for that
> we don't have any solution.
> 
> There is exposure to soft/hard lockups if the CPU count is more than the
> MSI-X vectors supported by the device.
> 
> If the CPU count to MSI-X vector count ratio is not 1:1 (in other words,
> if the CPU count to MSI-X vector count ratio is something like X:1, where
> X > 1), then the 'exact' irqbalance policy or rq_affinity = 2 won't help
> to avoid CPU hard/soft lockups. There won't be any one-to-one mapping
> between CPU and MSI-X vector; instead one MSI-X interrupt (or reply
> descriptor queue) is shared with a group/set of CPUs, so there is a
> possibility of a loop forming in the IO path within that CPU group, and
> we may observe lockups.
> 
> For example: consider a system with two NUMA nodes, each node having four
> logical CPUs, and consider that the number of MSI-X vectors enabled on
> the HBA is two; then the CPU count to MSI-X vector count ratio is 4:1.
> e.g.
> MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
> and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
> node 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3      --> MSI-X 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7      --> MSI-X 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume the user started an application which uses all the CPUs of NUMA
> node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since this
> behavior depends upon irqbalance), say CPU 0, will receive the interrupts
> from MSI-X vector 0 for all the IOs. Eventually, CPU 0's IO submission
> percentage will decrease and its ISR processing percentage will increase,
> as it is busier processing the interrupts. Gradually the IO submission
> percentage on CPU 0 will drop to zero and its ISR processing percentage
> will reach 100 percent, as the IO loop has formed within NUMA node 0,
> i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy submitting heavy IOs
> and only CPU 0 is busy in the ISR path, as it always finds a valid reply
> descriptor in the reply descriptor 

RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-29 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Monday, January 29, 2018 2:29 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
>
> On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen CPU lockup issues in the field when a system has a large
> > (more than 96) logical CPU count.
> > The SAS3.0 controller (Invader series) supports at most 96 MSI-X vectors
> > and the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.
> >
> > This may be a generic issue (for any PCI device that supports completion
> > on multiple reply queues). Let me explain it with respect to the hardware
> > supported by mpt3sas, just to simplify the problem and the possible
> > changes to handle such issues. The IT HBA (mpt3sas) supports multiple
> > reply queues in the completion path. The driver creates MSI-X vectors
> > for the controller as "min(FW supported reply queues, logical CPUs)".
> > If the submitter is not interrupted via a completion on the same CPU,
> > there is a loop in the IO path. This behavior can cause hard/soft CPU
> > lockups, IO timeouts, system sluggishness, etc.
> >
> > Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> > (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> > from the reply descriptor queue upon receiving interrupts from the HBA.
> > If CPU A keeps pumping IOs, then CPU B (which is executing the ISR) will
> > always see valid reply descriptors in the reply descriptor queue and
> > will keep processing them in a loop without quitting the ISR handler.
> > The mpt3sas driver only exits the ISR handler when it finds an unused
> > reply descriptor in the reply descriptor queue. Since CPU A is
> > continuously sending IOs, CPU B may always see a valid reply descriptor
> > (posted by the HBA firmware after processing an IO) in the reply
> > descriptor queue. In the worst case, the driver will never quit this
> > loop in the ISR handler. Eventually, a CPU lockup will be detected by
> > the watchdog.
> >
> > The above-mentioned behavior is not common if "rq_affinity" is set to 2
> > or the affinity_hint is honored by irqbalance with the "exact" policy.
> > If rq_affinity is set to 2, the submitter will always be interrupted via
> > a completion on the same CPU.
> > If irqbalance uses the "exact" policy, the interrupt will be delivered
> > to the submitter CPU.
> >
> > Problem statement -
> > If the CPU count to MSI-X vector (reply descriptor queue) count ratio
> > is not 1:1, we are still exposed to the issue explained above, and for
> > that we don't have any solution.
> >
> > There is exposure to soft/hard lockups if the CPU count is more than
> > the MSI-X vectors supported by the device.
> >
> > If the CPU count to MSI-X vector count ratio is not 1:1 (in other
> > words, if the CPU count to MSI-X vector count ratio is something like
> > X:1, where X > 1), then the 'exact' irqbalance policy or rq_affinity = 2
> > won't help to avoid CPU hard/soft lockups. There won't be any
> > one-to-one mapping between CPU and MSI-X vector; instead one MSI-X
> > interrupt (or reply descriptor queue) is shared with a group/set of
> > CPUs, so there is a possibility of a loop forming in the IO path within
> > that CPU group, and we may observe lockups.
> >
> > For example: consider a system with two NUMA nodes, each node having
> > four logical CPUs, and consider that the number of MSI-X vectors
> > enabled on the HBA is two; then the CPU count to MSI-X vector count
> > ratio is 4:1.
> > e.g.
> > MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> > 0 and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
> > NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3      --> MSI-X 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7      --> MSI-X 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume the user started an application which uses all the CPUs of NUMA
> > node 0 for issuing IOs.
> > Only one CPU from the affinity list (it can be any CPU, since this
> > behavior depends upon irqbalance), say CPU 0, will receive the
> > interrupts from MSI-X vector 0 for all the IOs. Eventually, CPU 0's IO
> > submission percentage will decrease and its ISR processing percentage
> > will increase, as it is
> > more 

Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-29 Thread Hannes Reinecke
On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> Hi All -
> 
> We have seen CPU lockup issues in the field when a system has a large
> (more than 96) logical CPU count.
> The SAS3.0 controller (Invader series) supports at most 96 MSI-X vectors
> and the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.
> 
> This may be a generic issue (for any PCI device that supports completion
> on multiple reply queues). Let me explain it with respect to the hardware
> supported by mpt3sas, just to simplify the problem and the possible
> changes to handle such issues. The IT HBA (mpt3sas) supports multiple
> reply queues in the completion path. The driver creates MSI-X vectors for
> the controller as "min(FW supported reply queues, logical CPUs)". If the
> submitter is not interrupted via a completion on the same CPU, there is a
> loop in the IO path. This behavior can cause hard/soft CPU lockups, IO
> timeouts, system sluggishness, etc.
> 
> Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> from the reply descriptor queue upon receiving interrupts from the HBA.
> If CPU A keeps pumping IOs, then CPU B (which is executing the ISR) will
> always see valid reply descriptors in the reply descriptor queue and will
> keep processing them in a loop without quitting the ISR handler.  The
> mpt3sas driver only exits the ISR handler when it finds an unused reply
> descriptor in the reply descriptor queue. Since CPU A is continuously
> sending IOs, CPU B may always see a valid reply descriptor (posted by the
> HBA firmware after processing an IO) in the reply descriptor queue. In
> the worst case, the driver will never quit this loop in the ISR handler.
> Eventually, a CPU lockup will be detected by the watchdog.
> 
> The above-mentioned behavior is not common if "rq_affinity" is set to 2
> or the affinity_hint is honored by irqbalance with the "exact" policy.
> If rq_affinity is set to 2, the submitter will always be interrupted via
> a completion on the same CPU.
> If irqbalance uses the "exact" policy, the interrupt will be delivered to
> the submitter CPU.
> 
> Problem statement -
> If the CPU count to MSI-X vector (reply descriptor queue) count ratio is
> not 1:1, we are still exposed to the issue explained above, and for that
> we don't have any solution.
> 
> There is exposure to soft/hard lockups if the CPU count is more than the
> MSI-X vectors supported by the device.
> 
> If the CPU count to MSI-X vector count ratio is not 1:1 (in other words,
> if the CPU count to MSI-X vector count ratio is something like X:1, where
> X > 1), then the 'exact' irqbalance policy or rq_affinity = 2 won't help
> to avoid CPU hard/soft lockups. There won't be any one-to-one mapping
> between CPU and MSI-X vector; instead one MSI-X interrupt (or reply
> descriptor queue) is shared with a group/set of CPUs, so there is a
> possibility of a loop forming in the IO path within that CPU group, and
> we may observe lockups.
> 
> For example: consider a system with two NUMA nodes, each node having four
> logical CPUs, and consider that the number of MSI-X vectors enabled on
> the HBA is two; then the CPU count to MSI-X vector count ratio is 4:1.
> e.g.
> MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
> and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
> node 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3      --> MSI-X 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7      --> MSI-X 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume the user started an application which uses all the CPUs of NUMA
> node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since this
> behavior depends upon irqbalance), say CPU 0, will receive the interrupts
> from MSI-X vector 0 for all the IOs. Eventually, CPU 0's IO submission
> percentage will decrease and its ISR processing percentage will increase,
> as it is busier processing the interrupts. Gradually the IO submission
> percentage on CPU 0 will drop to zero and its ISR processing percentage
> will reach 100 percent, as the IO loop has formed within NUMA node 0,
> i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy submitting heavy IOs
> and only CPU 0 is busy in the ISR path, as it always finds a valid reply
> descriptor in the reply descriptor queue. Eventually, we will observe a
> hard lockup here.
> 
> The chances of hard/soft lockups occurring are directly proportional to
> the value of X. If the value of X is high, the chances of observing CPU
> lockups are high.
> 
> Solution -
> Fix 1 - Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
> driver will execute its ISR routine in softirq context, and it will
> always quit the loop based on the budget provided by the IRQ poll
> interface.
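
A minimal sketch of what such an irq_poll-based completion path can look
like, using the generic irq_poll API from include/linux/irq_poll.h (the
budget value, adapter structures and helpers here are invented, not the
actual mpt3sas code):

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define EXAMPLE_IRQ_POLL_BUDGET	64	/* invented budget value */

struct example_reply_queue {
	struct irq_poll	irqpoll;
	int		msix_index;
	/* ... reply descriptor ring state ... */
};

/* Hard-IRQ handler: mask the vector and hand the work off to irq_poll. */
static irqreturn_t example_isr(int irq, void *dev_id)
{
	struct example_reply_queue *rq = dev_id;

	disable_irq_nosync(irq);
	irq_poll_sched(&rq->irqpoll);
	return IRQ_HANDLED;
}

/* Softirq poll callback: process at most 'budget' reply descriptors, then
 * return; irq_poll reschedules the queue if the whole budget was used. */
static int example_irqpoll(struct irq_poll *iop, int budget)
{
	struct example_reply_queue *rq =
		container_of(iop, struct example_reply_queue, irqpoll);
	int done;

	done = example_process_replies(rq, budget);	/* invented helper */
	if (done < budget) {
		/* queue drained: stop polling and re-enable the interrupt */
		irq_poll_complete(iop);
		enable_irq(example_vector_to_irq(rq));	/* invented helper */
	}
	return done;
}

/* During setup:
 *	irq_poll_init(&rq->irqpoll, EXAMPLE_IRQ_POLL_BUDGET, example_irqpoll);
 */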
> 
> In these scenarios (i.e. where the CPU count to MSI-X vector count ratio
> is X:1, with X > 1), the IRQ poll interface will avoid CPU hard lockups
> due to the voluntary exit from the 

RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-22 Thread Kashyap Desai
>
> In summary,
> a CPU that is completing IO but not contributing to IO submission may
> cause a CPU lockup.
> If the CPU count to MSI-X vector count ratio is X:1 (where X > 1), then
> by using the irq poll interface we can avoid the CPU lockups, and by
> equally distributing the interrupts among the enabled MSI-X vectors we
> can avoid performance issues.
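
One way to picture the "equally distributing the interrupts" part is a
per-IO reply-queue selection like the sketch below (illustrative only; the
ioc structure, its fields and the policy are invented, not the actual
megaraid_sas/mpt3sas logic):

#include <linux/smp.h>
#include <linux/atomic.h>

/* Pick the reply queue (and hence the completion vector) for an IO.  With a
 * 1:1 CPU-to-vector mapping, keep the per-CPU assignment; otherwise spread
 * completions round-robin across all enabled reply queues. */
static u8 example_pick_reply_queue(struct example_ioc *ioc)
{
	unsigned int cpu = raw_smp_processor_id();

	if (ioc->reply_queue_count >= num_online_cpus())
		return ioc->cpu_to_msix[cpu];

	return (u8)((unsigned int)atomic_inc_return(&ioc->rr_counter) %
		    ioc->reply_queue_count);
}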
>
> We are planning to use both fixes only if the CPU count is more than the
> FW supported MSI-X vector count.
> Please review and provide your feedback. I have appended both patches.

Hi -
Assuming the method explained here is in line with the Linux SCSI subsystem
and there is no better method to fix such an issue,
I am planning to provide the same solution for internal testing, and the
maintainers of the respective drivers (mpt3sas and megaraid_sas) will post
the final patches upstream based on the results.
As of now the PoC results look promising with the above-mentioned solution,
and no CPU lockup was observed.

>
> Thanks, Kashyap
>
>