Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
- Original Message - > From: "Hannes Reinecke"> To: emi...@redhat.com, "vasu dev" , "robert w love" > > Cc: "Laurence Oberman" , "Linux SCSI Mailinglist" > , > fcoe-devel@open-fcoe.org, "Curtis Taylor (c...@us.ibm.com)" > , "Bud Brown" > Sent: Wednesday, October 12, 2016 11:46:16 AM > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by > fc_queuecommand on NUMA or large > configurations with Intel ixgbe running FCOE > > On 10/12/2016 05:26 PM, Ewan D. Milne wrote: > > On Tue, 2016-10-11 at 10:51 -0400, Ewan D. Milne wrote: > >> On Sat, 2016-10-08 at 19:35 +0200, Hannes Reinecke wrote: > >>> You might actually be hitting a limitation in the exchange manager code. > >>> The libfc exchange manager tries to be really clever and will assign a > >>> per-cpu exchange manager (probably to increase locality). However, we > >>> only have a limited number of exchanges, so on large systems we might > >>> actually run into a exchange starvation problem, where we have in theory > >>> enough free exchanges, but none for the submitting cpu. > >>> > >>> (Personally, the exchange manager code is in urgent need of reworking. > >>> It should be replaced by the sbitmap code from Omar). > >>> > >>> Do check how many free exchanges are actually present for the stalling > >>> CPU; it might be that you run into a starvation issue. > >> > >> We are still looking into this but one thing that looks bad is that > >> the exchange manager code rounds up the number of CPUs to the next > >> power of 2 before dividing up the exchange id space (and uses the lsbs > >> of the xid to extract the CPU when looking up an xid). We have a machine > >> with 288 CPUs, this code is just begging for a rewrite as it looks to > >> be wasting most of the limited xid space on ixgbe FCoE. > >> > >> Looks like we get 512 offloaded xids on this adapter and 4096-512 > >> non-offloaded xids. This would give 1 + 7 xids per CPU. However, I'm > >> not sure that even 4096 / 288 = 14 would be enough to prevent stalling. > >> > >> And, of course, potentially most of the CPUs aren't submitting I/O, so > >> the whole idea of per-CPU xid space is questionable. > >> > > > > fc_exch_alloc() used to try all the available exchange managers in the > > list for an available exchange id, but this was changed in 2010 so that > > if the first matched exchange manager couldn't allocate one, it fails > > and we end up returning host busy. This was due to commit: > > > > commit 3e22760d4db6fd89e0be46c3d132390a251da9c6 > > Author: Vasu Dev > > Date: Fri Mar 12 16:08:39 2010 -0800 > > > > [SCSI] libfc: use offload EM instance again instead jumping to next EM > > > > Since use of offloads is more efficient than switching > > to non-offload EM. However kept logic same to call em_match > > if it is provided in the list of EMs. > > > > Converted fc_exch_alloc to inline being now tiny a function > > and already not an exported libfc API any more. > > > > Signed-off-by: Vasu Dev > > Signed-off-by: Robert Love > > Signed-off-by: James Bottomley > > > > --- > > > > Setting the ddp_min module parameter to fcoe to 128MB prevents the ->match > > function from permitting the use of the offload exchange manager for the > > frame, > > and we no longer see the problem with host busy status, since it uses the > > larger non-offloaded pool. > > > Yes, this is also the impression I got from reading the spec. > The offload pool is mainly designed for large read or write commands, so > using it for _every_ frame is probably not a good idea. > And limiting it by the size of the transfers solves the problem quite > nicely, as a large size typically is only used by read and writes. > So please send a patch to revert that. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > h...@suse.de+49 911 74053 688 > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > I will revert the commit and test it here in the lab, and then submit the revert patch. Ewan can review. Thanks Laurence ___ fcoe-devel mailing list fcoe-devel@open-fcoe.org http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
On 10/12/2016 05:26 PM, Ewan D. Milne wrote: On Tue, 2016-10-11 at 10:51 -0400, Ewan D. Milne wrote: On Sat, 2016-10-08 at 19:35 +0200, Hannes Reinecke wrote: You might actually be hitting a limitation in the exchange manager code. The libfc exchange manager tries to be really clever and will assign a per-cpu exchange manager (probably to increase locality). However, we only have a limited number of exchanges, so on large systems we might actually run into a exchange starvation problem, where we have in theory enough free exchanges, but none for the submitting cpu. (Personally, the exchange manager code is in urgent need of reworking. It should be replaced by the sbitmap code from Omar). Do check how many free exchanges are actually present for the stalling CPU; it might be that you run into a starvation issue. We are still looking into this but one thing that looks bad is that the exchange manager code rounds up the number of CPUs to the next power of 2 before dividing up the exchange id space (and uses the lsbs of the xid to extract the CPU when looking up an xid). We have a machine with 288 CPUs, this code is just begging for a rewrite as it looks to be wasting most of the limited xid space on ixgbe FCoE. Looks like we get 512 offloaded xids on this adapter and 4096-512 non-offloaded xids. This would give 1 + 7 xids per CPU. However, I'm not sure that even 4096 / 288 = 14 would be enough to prevent stalling. And, of course, potentially most of the CPUs aren't submitting I/O, so the whole idea of per-CPU xid space is questionable. fc_exch_alloc() used to try all the available exchange managers in the list for an available exchange id, but this was changed in 2010 so that if the first matched exchange manager couldn't allocate one, it fails and we end up returning host busy. This was due to commit: commit 3e22760d4db6fd89e0be46c3d132390a251da9c6 Author: Vasu DevDate: Fri Mar 12 16:08:39 2010 -0800 [SCSI] libfc: use offload EM instance again instead jumping to next EM Since use of offloads is more efficient than switching to non-offload EM. However kept logic same to call em_match if it is provided in the list of EMs. Converted fc_exch_alloc to inline being now tiny a function and already not an exported libfc API any more. Signed-off-by: Vasu Dev Signed-off-by: Robert Love Signed-off-by: James Bottomley --- Setting the ddp_min module parameter to fcoe to 128MB prevents the ->match function from permitting the use of the offload exchange manager for the frame, and we no longer see the problem with host busy status, since it uses the larger non-offloaded pool. Yes, this is also the impression I got from reading the spec. The offload pool is mainly designed for large read or write commands, so using it for _every_ frame is probably not a good idea. And limiting it by the size of the transfers solves the problem quite nicely, as a large size typically is only used by read and writes. So please send a patch to revert that. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) ___ fcoe-devel mailing list fcoe-devel@open-fcoe.org http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
- Original Message - > From: "Laurence Oberman"> To: "Hannes Reinecke" > Cc: "Linux SCSI Mailinglist" , > fcoe-devel@open-fcoe.org, "Curtis Taylor (c...@us.ibm.com)" > , "Bud Brown" > Sent: Sunday, October 9, 2016 11:52:44 AM > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by > fc_queuecommand on NUMA or large > configurations with Intel ixgbe running FCOE > > > > - Original Message - > > From: "Laurence Oberman" > > To: "Hannes Reinecke" > > Cc: "Linux SCSI Mailinglist" , > > fcoe-devel@open-fcoe.org, "Curtis Taylor (c...@us.ibm.com)" > > , "Bud Brown" > > Sent: Saturday, October 8, 2016 3:44:16 PM > > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by > > fc_queuecommand on NUMA or large > > configurations with Intel ixgbe running FCOE > > > > > > > > - Original Message - > > > From: "Laurence Oberman" > > > To: "Hannes Reinecke" > > > Cc: "Linux SCSI Mailinglist" , > > > fcoe-devel@open-fcoe.org, "Curtis Taylor (c...@us.ibm.com)" > > > , "Bud Brown" > > > Sent: Saturday, October 8, 2016 1:53:01 PM > > > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by > > > fc_queuecommand on NUMA or large > > > configurations with Intel ixgbe running FCOE > > > > > > > > > > > > - Original Message - > > > > From: "Hannes Reinecke" > > > > To: "Laurence Oberman" , "Linux SCSI Mailinglist" > > > > , > > > > fcoe-devel@open-fcoe.org > > > > Cc: "Curtis Taylor (c...@us.ibm.com)" , "Bud Brown" > > > > > > > > Sent: Saturday, October 8, 2016 1:35:19 PM > > > > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by > > > > fc_queuecommand on NUMA or large > > > > configurations with Intel ixgbe running FCOE > > > > > > > > On 10/08/2016 02:57 PM, Laurence Oberman wrote: > > > > > Hello > > > > > > > > > > This has been a tough problem to chase down but was finally > > > > > reproduced. > > > > > This issue is apparent on RHEL kernels and upstream so justified > > > > > reporting > > > > > here. > > > > > > > > > > Its out there and some may not be aware its even happening other than > > > > > very > > > > > slow > > > > > performance using ixgbe and software FCOE on large configurations. > > > > > > > > > > Upstream Kernel used for reproducing is 4.8.0 > > > > > > > > > > I/O performance was noted to be very impacted on a large NUMA test > > > > > system > > > > > (64 CPUS 4 NUMA nodes) running the software fcoe stack with Intel > > > > > ixgbe > > > > > interfaces. > > > > > After capturing blktraces we saw for every I/O there was at least one > > > > > blk_requeue_request and sometimes hundreds or more. > > > > > This resulted in IOPS rates being marginal at best with queuing and > > > > > high > > > > > wait times. > > > > > After narrowing this down with systemtap and trace-cmd we added > > > > > further > > > > > debug and it was apparent this was dues to SCSI_MLQUEUE_HOST_BUSY > > > > > being > > > > > returned. > > > > > So I/O passes but very slowly as it constantly having to be requeued. > > > > > > > > > > The identical configuration in our lab with a single NUMA node and 4 > > > > > CPUS > > > > > does not see this issue at all. > > > > > The same large system that reproduces this was booted with numa=off > > > > > and > > > > > still sees the issue. > > > > > > > > > Have you tested with my FCoE fixes? > > > > I've done quite some fixes for libfc/fcoe, and it would be nice to see > > > > how the patches behave with this setup. > > > > > > > > > The flow is as follows: > > > > > > > > > > From with fc_queuecommand > > > > > fc_fcp_pkt_send() calls fc_fcp_cmd_send() calls > > > > > tt.exch_seq_send() which calls fc_exch_seq_send > > > > > > > > > > this fails and returns NULL in fc_exch_alloc() as the list traveral > > > > > never > > > > > creates a match. > > > > > > > > > > static struct fc_seq *fc_exch_seq_send(struct fc_lport *lport, > > > > > struct fc_frame *fp, > > > > > void (*resp)(struct fc_seq *, > > > > > struct fc_frame *fp, > > > > > void *arg), > > > > > void (*destructor)(struct fc_seq > > > > > *, > > > > > void *), > > > > > void *arg, u32 timer_msec) > > > > > { > > > > > struct fc_exch *ep; > > > > > struct fc_seq *sp = NULL; > > > >
Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
On 10/08/2016 02:57 PM, Laurence Oberman wrote: Hello This has been a tough problem to chase down but was finally reproduced. This issue is apparent on RHEL kernels and upstream so justified reporting here. Its out there and some may not be aware its even happening other than very slow > performance using ixgbe and software FCOE on large configurations. Upstream Kernel used for reproducing is 4.8.0 I/O performance was noted to be very impacted on a large NUMA test system (64 CPUS 4 NUMA nodes) running the software fcoe stack with Intel ixgbe interfaces. After capturing blktraces we saw for every I/O there was at least one blk_requeue_request and sometimes hundreds or more. This resulted in IOPS rates being marginal at best with queuing and high wait times. After narrowing this down with systemtap and trace-cmd we added further debug and it was apparent this was dues to SCSI_MLQUEUE_HOST_BUSY being returned. So I/O passes but very slowly as it constantly having to be requeued. The identical configuration in our lab with a single NUMA node and 4 CPUS does not see this issue at all. The same large system that reproduces this was booted with numa=off and still sees the issue. Have you tested with my FCoE fixes? I've done quite some fixes for libfc/fcoe, and it would be nice to see how the patches behave with this setup. The flow is as follows: From with fc_queuecommand fc_fcp_pkt_send() calls fc_fcp_cmd_send() calls tt.exch_seq_send() which calls fc_exch_seq_send this fails and returns NULL in fc_exch_alloc() as the list traveral never creates a match. static struct fc_seq *fc_exch_seq_send(struct fc_lport *lport, struct fc_frame *fp, void (*resp)(struct fc_seq *, struct fc_frame *fp, void *arg), void (*destructor)(struct fc_seq *, void *), void *arg, u32 timer_msec) { struct fc_exch *ep; struct fc_seq *sp = NULL; struct fc_frame_header *fh; struct fc_fcp_pkt *fsp = NULL; int rc = 1; ep = fc_exch_alloc(lport, fp); * Called Here and fails if (!ep) { fc_frame_free(fp); printk("RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = %p\n",ep); return NULL; } .. .. ] fc_exch_alloc() - Allocate an exchange from an EM on a * /** * local port's list of EMs. * @lport: The local port that will own the exchange * @fp:The FC frame that the exchange will be for * * This function walks the list of exchange manager(EM) * anchors to select an EM for a new exchange allocation. The * EM is selected when a NULL match function pointer is encountered * or when a call to a match function returns true. */ static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport, struct fc_frame *fp) { struct fc_exch_mgr_anchor *ema; list_for_each_entry(ema, >ema_list, ema_list) if (!ema->match || ema->match(fp)) return fc_exch_em_alloc(lport, ema->mp); return NULL; * Never matches so returns NULL } RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null) RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1 RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1 RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1 I am trying to get my head around why a large multi-node system sees this issue even with NUMA disabled. Has anybody seen this or is aware of this with configurations (using fc_queuecommand) I am continuing to add debug to narrow this down. You might actually be hitting a limitation in the exchange manager code. The libfc exchange manager tries to be really clever and will assign a per-cpu exchange manager (probably to increase locality). However, we only have a limited number of exchanges, so on large systems we might actually run into a exchange starvation problem, where we have in theory enough free exchanges, but none for the submitting cpu. (Personally, the exchange manager code is in urgent need of reworking. It should be replaced by the sbitmap code from Omar). Do check how many free exchanges are actually present for the stalling CPU; it might be that you run into a starvation issue. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: