Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter
Bill, > When doing a surprise removal of an adapter, some in flight I/Os can get > stuck and take a while to complete (they actually timeout and are > retried). We are not handling an early error exit from > qla2xxx_eh_abort properly. Applied to 4.20/scsi-fixes, thank you. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter
Hi Martin/Bill, Thanks for the patience with me. > On Nov 6, 2018, at 6:09 PM, Martin K. Petersen > wrote: > > External Email > > Himanshu, > > Please review Bill's rebased patch. Thanks! > >> When doing a surprise removal of an adapter, some in flight I/Os can get >> stuck and take a while to complete (they actually timeout and are >> retried). We are not handling an early error exit from >> qla2xxx_eh_abort properly. > > -- > Martin K. Petersen Oracle Linux Engineering This rebased patch looks good. Acked-by: Himanshu Madhani Thanks, - Himanshu
Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter
Himanshu, Please review Bill's rebased patch. Thanks! > When doing a surprise removal of an adapter, some in flight I/Os can get > stuck and take a while to complete (they actually timeout and are > retried). We are not handling an early error exit from > qla2xxx_eh_abort properly. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter
On Mon, 2018-11-05 at 11:32 -0500, Laurence Oberman wrote: > On Mon, 2018-11-05 at 11:23 -0500, Bill Kuzeja wrote: > > When doing a surprise removal of an adapter, some in flight I/Os > > can > > get > > stuck and take a while to complete (they actually timeout and are > > retried). We are not handling an early error exit from > > qla2xxx_eh_abort properly. > > > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting > > down chip") > > Signed-off-by: Bill Kuzeja > > > > --- > > Resending...can somebody review this (pretty please)?? > > > > After a hot remove of a Qlogic adapter, the driver's remove > > function > > gets > > called and we end up aborting all in progress I/Os. Here is the > > code > > flow: > > > > qla2x00_remove_one > > qla2x00_abort_isp_cleanup > > qla2x00_abort_all_cmds > > __qla2x00_abort_all_cmds > > qla2xxx_eh_abort > > > > At the start of qla2xxx_eh_abort, some sanity checks are done > > before > > actually sending the abort. One of these checks is a call to > > fc_block_scsi_eh. In the case of a hot remove, it turns out that > > this > > routine can exit with FAST_IO_FAIL. > > > > When this occurs, we return back to __qla2x00_abort_all_cmds with > > an > > extra reference on sp (because the abort never gets sent). > > Originally, > > this was addressed with another fix: > > > > commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after > > adapter break > > > > But this later this added change complicated matters: > > > > commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting > > down > > chip > > > > Because the abort is now being done earlier in the teardown > > (through > > qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past > > the first check because qla2x00_isp_reg_stat(ha) returns zero. When > > we > > fail a few lines later in fc_block_scsi_eh, this error is not > > handled > > properly in __qla2x00_abort_all_cmds and the I/O ends up hanging > > and > > timing out because of the extra reference. > > > > For this fix, a check for FAST_IO_FAIL is added to > > __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort > > succeeded or not. > > > > This removes the extra reference in this additional early exit > > case. > > In > > my testing (hw surprise removals and also adapter remove via > > sysfs), > > this > > eliminates the timeouts and delays and the remove proceeds > > smoothly. > > > > --- > > drivers/scsi/qla2xxx/qla_os.c | 10 -- > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/scsi/qla2xxx/qla_os.c > > b/drivers/scsi/qla2xxx/qla_os.c > > index 518f151..5261adf 100644 > > --- a/drivers/scsi/qla2xxx/qla_os.c > > +++ b/drivers/scsi/qla2xxx/qla_os.c > > @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct > > qla_hw_data *ha) > > static void > > __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res) > > { > > - int cnt; > > + int cnt, status; > > unsigned long flags; > > srb_t *sp; > > scsi_qla_host_t *vha = qp->vha; > > @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct > > qla_hw_data *ha) > > if (!sp_get(sp)) { > > spin_unlock_irqres > > to > > re > > (qp- > > > qp_lock_ptr, flags); > > > > - qla2xxx_eh_abort( > > + status = > > qla2xxx_eh_abort( > > GET_CMD_SP > > (s > > p)); > > spin_lock_irqsave > > (qp- > > > qp_lock_ptr, flags); > > > > + /* > > + * Get rid of > > extra > > reference caused > > + * by early exit > > from qla2xxx_eh_abort > > + */ > > + if (status == > > FAST_IO_FAIL) > > + atomic_dec > > (& > > sp->ref_count); > > } > > } > > sp->done(sp, res); > > Bill > The patch looks fine to me but I still cant apply it > I tried against linux-next which is 4.20 > What ree are you applying it against now > > Regards > Laurence > I has a note from Bill saying it applies cleanly now to stable. In that case Reviewed-by: Laurence Oberman
Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter
On Mon, 2018-11-05 at 11:23 -0500, Bill Kuzeja wrote: > When doing a surprise removal of an adapter, some in flight I/Os can > get > stuck and take a while to complete (they actually timeout and are > retried). We are not handling an early error exit from > qla2xxx_eh_abort properly. > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting > down chip") > Signed-off-by: Bill Kuzeja > > --- > Resending...can somebody review this (pretty please)?? > > After a hot remove of a Qlogic adapter, the driver's remove function > gets > called and we end up aborting all in progress I/Os. Here is the code > flow: > > qla2x00_remove_one > qla2x00_abort_isp_cleanup > qla2x00_abort_all_cmds > __qla2x00_abort_all_cmds > qla2xxx_eh_abort > > At the start of qla2xxx_eh_abort, some sanity checks are done before > actually sending the abort. One of these checks is a call to > fc_block_scsi_eh. In the case of a hot remove, it turns out that this > routine can exit with FAST_IO_FAIL. > > When this occurs, we return back to __qla2x00_abort_all_cmds with an > extra reference on sp (because the abort never gets sent). > Originally, > this was addressed with another fix: > > commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after > adapter break > > But this later this added change complicated matters: > > commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down > chip > > Because the abort is now being done earlier in the teardown (through > qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past > the first check because qla2x00_isp_reg_stat(ha) returns zero. When > we > fail a few lines later in fc_block_scsi_eh, this error is not handled > properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and > timing out because of the extra reference. > > For this fix, a check for FAST_IO_FAIL is added to > __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort > succeeded or not. > > This removes the extra reference in this additional early exit case. > In > my testing (hw surprise removals and also adapter remove via sysfs), > this > eliminates the timeouts and delays and the remove proceeds smoothly. > > --- > drivers/scsi/qla2xxx/qla_os.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_os.c > b/drivers/scsi/qla2xxx/qla_os.c > index 518f151..5261adf 100644 > --- a/drivers/scsi/qla2xxx/qla_os.c > +++ b/drivers/scsi/qla2xxx/qla_os.c > @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct > qla_hw_data *ha) > static void > __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res) > { > - int cnt; > + int cnt, status; > unsigned long flags; > srb_t *sp; > scsi_qla_host_t *vha = qp->vha; > @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct > qla_hw_data *ha) > if (!sp_get(sp)) { > spin_unlock_irqresto > re > (qp- > >qp_lock_ptr, flags); > - qla2xxx_eh_abort( > + status = > qla2xxx_eh_abort( > GET_CMD_SP(s > p)); > spin_lock_irqsave > (qp- > >qp_lock_ptr, flags); > + /* > + * Get rid of extra > reference caused > + * by early exit > from qla2xxx_eh_abort > + */ > + if (status == > FAST_IO_FAIL) > + atomic_dec(& > sp->ref_count); > } > } > sp->done(sp, res); Bill The patch looks fine to me but I still cant apply it I tried against linux-next which is 4.20 What ree are you applying it against now Regards Laurence