Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter

2018-11-08 Thread Martin K. Petersen


Bill,

> When doing a surprise removal of an adapter, some in flight I/Os can get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.

Applied to 4.20/scsi-fixes, thank you.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter

2018-11-07 Thread Madhani, Himanshu
Hi Martin/Bill, 

Thanks for the patience with me. 

> On Nov 6, 2018, at 6:09 PM, Martin K. Petersen  
> wrote:
> 
> External Email
> 
> Himanshu,
> 
> Please review Bill's rebased patch. Thanks!
> 
>> When doing a surprise removal of an adapter, some in flight I/Os can get
>> stuck and take a while to complete (they actually timeout and are
>> retried). We are not handling an early error exit from
>> qla2xxx_eh_abort properly.
> 
> --
> Martin K. Petersen  Oracle Linux Engineering

This rebased patch looks good. 

Acked-by: Himanshu Madhani 

Thanks,
- Himanshu



Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter

2018-11-06 Thread Martin K. Petersen


Himanshu,

Please review Bill's rebased patch. Thanks!

> When doing a surprise removal of an adapter, some in flight I/Os can get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter

2018-11-05 Thread Laurence Oberman
On Mon, 2018-11-05 at 11:32 -0500, Laurence Oberman wrote:
> On Mon, 2018-11-05 at 11:23 -0500, Bill Kuzeja wrote:
> > When doing a surprise removal of an adapter, some in flight I/Os
> > can
> > get 
> > stuck and take a while to complete (they actually timeout and are 
> > retried). We are not handling an early error exit from 
> > qla2xxx_eh_abort properly.
> > 
> > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> > down chip") 
> > Signed-off-by: Bill Kuzeja 
> > 
> > ---
> > Resending...can somebody review this (pretty please)??
> > 
> > After a hot remove of a Qlogic adapter, the driver's remove
> > function
> > gets
> > called and we end up aborting all in progress I/Os. Here is the
> > code
> > flow:
> > 
> > qla2x00_remove_one
> >   qla2x00_abort_isp_cleanup
> > qla2x00_abort_all_cmds
> >   __qla2x00_abort_all_cmds
> > qla2xxx_eh_abort
> > 
> > At the start of qla2xxx_eh_abort, some sanity checks are done
> > before
> > actually sending the abort. One of these checks is a call to
> > fc_block_scsi_eh. In the case of a hot remove, it turns out that
> > this
> > routine can exit with FAST_IO_FAIL.
> > 
> > When this occurs, we return back to __qla2x00_abort_all_cmds with
> > an
> > extra reference on sp (because the abort never gets sent).
> > Originally,
> > this was addressed with another fix:
> > 
> > commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after
> > adapter break
> > 
> > But this later this added change complicated matters:
> > 
> > commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting
> > down
> > chip
> > 
> > Because the abort is now being done earlier in the teardown
> > (through
> > qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past
> > the first check because qla2x00_isp_reg_stat(ha) returns zero. When
> > we
> > fail a few lines later in fc_block_scsi_eh, this error is not
> > handled
> > properly in __qla2x00_abort_all_cmds and the I/O ends up hanging
> > and
> > timing out because of the extra reference.
> > 
> > For this fix, a check for FAST_IO_FAIL is added to
> > __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort
> > succeeded or not.
> > 
> > This removes the extra reference in this additional early exit
> > case.
> > In
> > my testing (hw surprise removals and also adapter remove via
> > sysfs),
> > this
> > eliminates the timeouts and delays and the remove proceeds
> > smoothly.
> > 
> > ---
> >  drivers/scsi/qla2xxx/qla_os.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/scsi/qla2xxx/qla_os.c
> > b/drivers/scsi/qla2xxx/qla_os.c
> > index 518f151..5261adf 100644
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct
> > qla_hw_data *ha)
> >  static void
> >  __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res)
> >  {
> > -   int cnt;
> > +   int cnt, status;
> >     unsigned long flags;
> >     srb_t *sp;
> >     scsi_qla_host_t *vha = qp->vha;
> > @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct
> > qla_hw_data *ha)
> >     if (!sp_get(sp)) {
> >     spin_unlock_irqres
> > to
> > re
> >     (qp-
> > > qp_lock_ptr, flags);
> > 
> > -   qla2xxx_eh_abort(
> > +   status =
> > qla2xxx_eh_abort(
> >     GET_CMD_SP
> > (s
> > p));
> >     spin_lock_irqsave
> >     (qp-
> > > qp_lock_ptr, flags);
> > 
> > +   /*
> > +    * Get rid of
> > extra
> > reference caused
> > +    * by early exit
> > from qla2xxx_eh_abort
> > +    */
> > +   if (status ==
> > FAST_IO_FAIL)
> > +   atomic_dec
> > (&
> > sp->ref_count);
> >     }
> >     }
> >     sp->done(sp, res);
> 
> Bill 
> The patch looks fine to me but I still cant apply it
> I tried against linux-next which is 4.20
> What ree are you applying it against now
> 
> Regards
> Laurence
> 
I has a note from Bill saying it applies cleanly now to stable.

In that case
Reviewed-by: Laurence Oberman 



Re: [PATCH V2 RESEND] Timeouts occur on surprise removal of QLogic adapter

2018-11-05 Thread Laurence Oberman
On Mon, 2018-11-05 at 11:23 -0500, Bill Kuzeja wrote:
> When doing a surprise removal of an adapter, some in flight I/Os can
> get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.
> 
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> down chip") 
> Signed-off-by: Bill Kuzeja 
> 
> ---
> Resending...can somebody review this (pretty please)??
> 
> After a hot remove of a Qlogic adapter, the driver's remove function
> gets
> called and we end up aborting all in progress I/Os. Here is the code
> flow:
> 
> qla2x00_remove_one
>   qla2x00_abort_isp_cleanup
> qla2x00_abort_all_cmds
>   __qla2x00_abort_all_cmds
> qla2xxx_eh_abort
> 
> At the start of qla2xxx_eh_abort, some sanity checks are done before
> actually sending the abort. One of these checks is a call to
> fc_block_scsi_eh. In the case of a hot remove, it turns out that this
> routine can exit with FAST_IO_FAIL.
> 
> When this occurs, we return back to __qla2x00_abort_all_cmds with an
> extra reference on sp (because the abort never gets sent).
> Originally,
> this was addressed with another fix:
> 
> commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after
> adapter break
> 
> But this later this added change complicated matters:
> 
> commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down
> chip
> 
> Because the abort is now being done earlier in the teardown (through
> qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past
> the first check because qla2x00_isp_reg_stat(ha) returns zero. When
> we
> fail a few lines later in fc_block_scsi_eh, this error is not handled
> properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and
> timing out because of the extra reference.
> 
> For this fix, a check for FAST_IO_FAIL is added to
> __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort
> succeeded or not.
> 
> This removes the extra reference in this additional early exit case.
> In
> my testing (hw surprise removals and also adapter remove via sysfs),
> this
> eliminates the timeouts and delays and the remove proceeds smoothly.
> 
> ---
>  drivers/scsi/qla2xxx/qla_os.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c
> b/drivers/scsi/qla2xxx/qla_os.c
> index 518f151..5261adf 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
>  static void
>  __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res)
>  {
> - int cnt;
> + int cnt, status;
>   unsigned long flags;
>   srb_t *sp;
>   scsi_qla_host_t *vha = qp->vha;
> @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
>   if (!sp_get(sp)) {
>   spin_unlock_irqresto
> re
>   (qp-
> >qp_lock_ptr, flags);
> - qla2xxx_eh_abort(
> + status =
> qla2xxx_eh_abort(
>   GET_CMD_SP(s
> p));
>   spin_lock_irqsave
>   (qp-
> >qp_lock_ptr, flags);
> + /*
> +  * Get rid of extra
> reference caused
> +  * by early exit
> from qla2xxx_eh_abort
> +  */
> + if (status ==
> FAST_IO_FAIL)
> + atomic_dec(&
> sp->ref_count);
>   }
>   }
>   sp->done(sp, res);

Bill 
The patch looks fine to me but I still cant apply it
I tried against linux-next which is 4.20
What ree are you applying it against now

Regards
Laurence