Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-29 Thread jianchao.wang
Hi Himanshu

Thanks for your kindly response.

On 05/30/2018 01:44 AM, Madhani, Himanshu wrote:
> Thanks for the information. I was out for couple days. Still working through 
> my emails. 
> 
> Without core dump shared with us, things become hard to debug. We'll take a 
> look at this data. 
> 
> Also, if you have any reproducer test case, Can you share that with us so 
> that we can try to trigger It internally and see if we have better luck 
> debugging that.

This issue occurred in customer's environment twice till now, there is no 
reproduce steps.

If you have time, would you please look at our analysis to see whether it is 
reasonable ?
The kernel is not newest, but there should be the same issue in mainline with 
scsi + block legacy
The key point here is:
when the aborted command returns in irq context, it will invoke scsi_done, then 
blk_complete_request.
however, the scsi recovery context could clear the ATOM_COMPLETE and requeue 
the request before irq
context get it.

Thanks
Jianchao

> 
> On 5/28/18, 6:11 PM, "jianchao.wang"  wrote:
> 
> Hi Himanshu
> 
> do you need any other information ?
> 
> Thanks
> Jianchao
> 
> On 05/25/2018 02:48 PM, jianchao.wang wrote:
> > Hi Himanshu
> > 
> > I'm afraid I cannot provide you the vmcore file, it is from our 
> customer.
> > If any information needed in the vmcore, I could provide with you.
> > 
> > In this scene, there are two contexts:
> > 1. the irq context, scsi_done() from qla2x00_sp_compl()
> > 2. the kworker context, scmd_eh_abort_handler which invokes the
> > qla2xxx_eh_abort()
> > 
> > Let's see the irq context.
> > scsi_done() will invoke the blk_complete_request().
> > In common case, the REQ_ATOM_COMPLETE should have been set by
> > blk_rq_check_expired() when the request was confirmed timeout. So
> > __blk_complete_request() will not be invoked.
> > 
> > On the other hand, in the kworker context, scmd_eh_abort_handler() will 
> do some
> > other things on this aborted request.
> > 
> > rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> > if (rtn == SUCCESS) {
> > set_host_byte(scmd, DID_TIME_OUT)
> > if (scsi_host_eh_past_deadline(sdev->host)) {
> > 
> > } else if (!scsi_noretry_cmd(scmd) &&
> > (++scmd->retries <= scmd->allowed)) {
> > scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> > }
> >
> > }
> > The host type of the result will set to DID_TIME_OUT. 
> scsi_noretry_cmd() will
> > return false and scsi_queue_insert() will be invoked, finally the
> > blk_requeue_request().
> > 
> > blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the 
> request on
> > request_queue->queue_head.
> > 
> > There could be a race between these two contexts:
> > what if the blk_clear_rq_complete() in kworker context is invoked 
> before the
> > blk_mark_rq_complete() in irq context ?
> > 
> > BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. 
> Then the
> > scsi_softirq_done() will be invoked. At the moment, the request has been
> > requeued by kworker context and scsi_cmnd->result is set 0. On the cpu 
> of the
> > irq context, the blk_finish_request() will be invoked finally, and 
> panic comes
> > up due to  BUG_ON(blk_queued_rq(req)
> > 
> > Here is the dmesg log.
> > [4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> > [4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> > [4937745.181636] [ cut here ]
> > [4937745.181900] kernel BUG at block/blk-core.c:2601!
> > [4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.184919] invalid opcode:  [#1] SMP
> > [4937745.185173] Modules linked in: ...
> > [4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
> 4.1.12-61.1.9.el6uek.x86_64 #2
> > [4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
> /BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
>  

Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-29 Thread jianchao.wang
Hi Himanshu

Thanks for your kindly response.

On 05/30/2018 01:44 AM, Madhani, Himanshu wrote:
> Thanks for the information. I was out for couple days. Still working through 
> my emails. 
> 
> Without core dump shared with us, things become hard to debug. We'll take a 
> look at this data. 
> 
> Also, if you have any reproducer test case, Can you share that with us so 
> that we can try to trigger It internally and see if we have better luck 
> debugging that.

This issue occurred in customer's environment twice till now, there is no 
reproduce steps.

If you have time, would you please look at our analysis to see whether it is 
reasonable ?
The kernel is not newest, but there should be the same issue in mainline with 
scsi + block legacy
The key point here is:
when the aborted command returns in irq context, it will invoke scsi_done, then 
blk_complete_request.
however, the scsi recovery context could clear the ATOM_COMPLETE and requeue 
the request before irq
context get it.

Thanks
Jianchao

> 
> On 5/28/18, 6:11 PM, "jianchao.wang"  wrote:
> 
> Hi Himanshu
> 
> do you need any other information ?
> 
> Thanks
> Jianchao
> 
> On 05/25/2018 02:48 PM, jianchao.wang wrote:
> > Hi Himanshu
> > 
> > I'm afraid I cannot provide you the vmcore file, it is from our 
> customer.
> > If any information needed in the vmcore, I could provide with you.
> > 
> > In this scene, there are two contexts:
> > 1. the irq context, scsi_done() from qla2x00_sp_compl()
> > 2. the kworker context, scmd_eh_abort_handler which invokes the
> > qla2xxx_eh_abort()
> > 
> > Let's see the irq context.
> > scsi_done() will invoke the blk_complete_request().
> > In common case, the REQ_ATOM_COMPLETE should have been set by
> > blk_rq_check_expired() when the request was confirmed timeout. So
> > __blk_complete_request() will not be invoked.
> > 
> > On the other hand, in the kworker context, scmd_eh_abort_handler() will 
> do some
> > other things on this aborted request.
> > 
> > rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> > if (rtn == SUCCESS) {
> > set_host_byte(scmd, DID_TIME_OUT)
> > if (scsi_host_eh_past_deadline(sdev->host)) {
> > 
> > } else if (!scsi_noretry_cmd(scmd) &&
> > (++scmd->retries <= scmd->allowed)) {
> > scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> > }
> >
> > }
> > The host type of the result will set to DID_TIME_OUT. 
> scsi_noretry_cmd() will
> > return false and scsi_queue_insert() will be invoked, finally the
> > blk_requeue_request().
> > 
> > blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the 
> request on
> > request_queue->queue_head.
> > 
> > There could be a race between these two contexts:
> > what if the blk_clear_rq_complete() in kworker context is invoked 
> before the
> > blk_mark_rq_complete() in irq context ?
> > 
> > BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. 
> Then the
> > scsi_softirq_done() will be invoked. At the moment, the request has been
> > requeued by kworker context and scsi_cmnd->result is set 0. On the cpu 
> of the
> > irq context, the blk_finish_request() will be invoked finally, and 
> panic comes
> > up due to  BUG_ON(blk_queued_rq(req)
> > 
> > Here is the dmesg log.
> > [4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> > [4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> > [4937745.181636] [ cut here ]
> > [4937745.181900] kernel BUG at block/blk-core.c:2601!
> > [4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> > [4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> > [4937745.184919] invalid opcode:  [#1] SMP
> > [4937745.185173] Modules linked in: ...
> > [4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
> 4.1.12-61.1.9.el6uek.x86_64 #2
> > [4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
> /BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
>  

Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-28 Thread jianchao.wang
Hi Himanshu

do you need any other information ?

Thanks
Jianchao

On 05/25/2018 02:48 PM, jianchao.wang wrote:
> Hi Himanshu
> 
> I'm afraid I cannot provide you the vmcore file, it is from our customer.
> If any information needed in the vmcore, I could provide with you.
> 
> In this scene, there are two contexts:
> 1. the irq context, scsi_done() from qla2x00_sp_compl()
> 2. the kworker context, scmd_eh_abort_handler which invokes the
> qla2xxx_eh_abort()
> 
> Let's see the irq context.
> scsi_done() will invoke the blk_complete_request().
> In common case, the REQ_ATOM_COMPLETE should have been set by
> blk_rq_check_expired() when the request was confirmed timeout. So
> __blk_complete_request() will not be invoked.
> 
> On the other hand, in the kworker context, scmd_eh_abort_handler() will do 
> some
> other things on this aborted request.
> 
> rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> if (rtn == SUCCESS) {
> set_host_byte(scmd, DID_TIME_OUT)
> if (scsi_host_eh_past_deadline(sdev->host)) {
> 
> } else if (!scsi_noretry_cmd(scmd) &&
> (++scmd->retries <= scmd->allowed)) {
> scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> }
>
> }
> The host type of the result will set to DID_TIME_OUT. scsi_noretry_cmd() will
> return false and scsi_queue_insert() will be invoked, finally the
> blk_requeue_request().
> 
> blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the request 
> on
> request_queue->queue_head.
> 
> There could be a race between these two contexts:
> what if the blk_clear_rq_complete() in kworker context is invoked before the
> blk_mark_rq_complete() in irq context ?
> 
> BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. Then 
> the
> scsi_softirq_done() will be invoked. At the moment, the request has been
> requeued by kworker context and scsi_cmnd->result is set 0. On the cpu of the
> irq context, the blk_finish_request() will be invoked finally, and panic comes
> up due to  BUG_ON(blk_queued_rq(req)
> 
> Here is the dmesg log.
> [4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> [4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> [4937745.181636] [ cut here ]
> [4937745.181900] kernel BUG at block/blk-core.c:2601!
> [4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.184919] invalid opcode:  [#1] SMP
> [4937745.185173] Modules linked in: ...
> [4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
> 4.1.12-61.1.9.el6uek.x86_64 #2
> [4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
> /BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
> [4937745.189157] task: 880227fa9c00 ti: 880227fc task.ti: 
> 880227fc
> [4937745.189607] RIP: e030:[]  [] 
> blk_finish_request+0xd0/0xe0
> [4937745.190104] RSP: e02b:88022d483d78  EFLAGS: 00010083
> [4937745.190339] RAX: 0280 RBX: 8801c4508730 RCX: 
> 0280
> [4937745.190791] RDX: 8b5a RSI:  RDI: 
> 8801c4508730
> [4937745.191246] RBP: 88022d483d88 R08:  R09: 
> 4000
> [4937745.191694] R10: 000b R11: 7ffe R12: 
> 
> [4937745.192153] R13: 88022374d000 R14: 880184cb88c0 R15: 
> 8800536ccc90
> [4937745.197930] FS:  () GS:88022d48() 
> knlGS:88022d48
> [4937745.198389] CS:  e033 DS: 002b ES: 002b CR0: 80050033
> [4937745.198627] CR2: 016e1f18 CR3: 00010caed000 CR4: 
> 00042660
> [4937745.199080] Stack:
> [4937745.199309]  8801c4508730  88022d483dd8 
> 814ac82f
> [4937745.199790]  88022d483da8 0280 88022d483dd8 
> 880184cb88c0
> [4937745.200290]  8801c4508730   
> 
> [4937745.200745] Call Trace:
> [4937745.200968]  
> [4937745.200986]  [] scsi_end_request+0x11f/0x1e0
> [4937745.201422]  [] scsi_io_completion+0xac/0x500
> [4937745.201651]  [] scsi_finish_command+0xf0/0x150
> [4937745.201875]  [] scsi_softirq_done+0x14a/0x160
> [4937745.202103]  [] 

Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-28 Thread jianchao.wang
Hi Himanshu

do you need any other information ?

Thanks
Jianchao

On 05/25/2018 02:48 PM, jianchao.wang wrote:
> Hi Himanshu
> 
> I'm afraid I cannot provide you the vmcore file, it is from our customer.
> If any information needed in the vmcore, I could provide with you.
> 
> In this scene, there are two contexts:
> 1. the irq context, scsi_done() from qla2x00_sp_compl()
> 2. the kworker context, scmd_eh_abort_handler which invokes the
> qla2xxx_eh_abort()
> 
> Let's see the irq context.
> scsi_done() will invoke the blk_complete_request().
> In common case, the REQ_ATOM_COMPLETE should have been set by
> blk_rq_check_expired() when the request was confirmed timeout. So
> __blk_complete_request() will not be invoked.
> 
> On the other hand, in the kworker context, scmd_eh_abort_handler() will do 
> some
> other things on this aborted request.
> 
> rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> if (rtn == SUCCESS) {
> set_host_byte(scmd, DID_TIME_OUT)
> if (scsi_host_eh_past_deadline(sdev->host)) {
> 
> } else if (!scsi_noretry_cmd(scmd) &&
> (++scmd->retries <= scmd->allowed)) {
> scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> }
>
> }
> The host type of the result will set to DID_TIME_OUT. scsi_noretry_cmd() will
> return false and scsi_queue_insert() will be invoked, finally the
> blk_requeue_request().
> 
> blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the request 
> on
> request_queue->queue_head.
> 
> There could be a race between these two contexts:
> what if the blk_clear_rq_complete() in kworker context is invoked before the
> blk_mark_rq_complete() in irq context ?
> 
> BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. Then 
> the
> scsi_softirq_done() will be invoked. At the moment, the request has been
> requeued by kworker context and scsi_cmnd->result is set 0. On the cpu of the
> irq context, the blk_finish_request() will be invoked finally, and panic comes
> up due to  BUG_ON(blk_queued_rq(req)
> 
> Here is the dmesg log.
> [4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> [4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  0 2002.
> [4937745.181636] [ cut here ]
> [4937745.181900] kernel BUG at block/blk-core.c:2601!
> [4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:5:5 --  1 2002.
> [4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
> nexus=4:4:73 --  1 2002.
> [4937745.184919] invalid opcode:  [#1] SMP
> [4937745.185173] Modules linked in: ...
> [4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
> 4.1.12-61.1.9.el6uek.x86_64 #2
> [4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
> /BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
> [4937745.189157] task: 880227fa9c00 ti: 880227fc task.ti: 
> 880227fc
> [4937745.189607] RIP: e030:[]  [] 
> blk_finish_request+0xd0/0xe0
> [4937745.190104] RSP: e02b:88022d483d78  EFLAGS: 00010083
> [4937745.190339] RAX: 0280 RBX: 8801c4508730 RCX: 
> 0280
> [4937745.190791] RDX: 8b5a RSI:  RDI: 
> 8801c4508730
> [4937745.191246] RBP: 88022d483d88 R08:  R09: 
> 4000
> [4937745.191694] R10: 000b R11: 7ffe R12: 
> 
> [4937745.192153] R13: 88022374d000 R14: 880184cb88c0 R15: 
> 8800536ccc90
> [4937745.197930] FS:  () GS:88022d48() 
> knlGS:88022d48
> [4937745.198389] CS:  e033 DS: 002b ES: 002b CR0: 80050033
> [4937745.198627] CR2: 016e1f18 CR3: 00010caed000 CR4: 
> 00042660
> [4937745.199080] Stack:
> [4937745.199309]  8801c4508730  88022d483dd8 
> 814ac82f
> [4937745.199790]  88022d483da8 0280 88022d483dd8 
> 880184cb88c0
> [4937745.200290]  8801c4508730   
> 
> [4937745.200745] Call Trace:
> [4937745.200968]  
> [4937745.200986]  [] scsi_end_request+0x11f/0x1e0
> [4937745.201422]  [] scsi_io_completion+0xac/0x500
> [4937745.201651]  [] scsi_finish_command+0xf0/0x150
> [4937745.201875]  [] scsi_softirq_done+0x14a/0x160
> [4937745.202103]  [] 

Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-25 Thread jianchao.wang
Hi Himanshu

I'm afraid I cannot provide you the vmcore file, it is from our customer.
If any information needed in the vmcore, I could provide with you.

In this scene, there are two contexts:
1. the irq context, scsi_done() from qla2x00_sp_compl()
2. the kworker context, scmd_eh_abort_handler which invokes the
qla2xxx_eh_abort()

Let's see the irq context.
scsi_done() will invoke the blk_complete_request().
In common case, the REQ_ATOM_COMPLETE should have been set by
blk_rq_check_expired() when the request was confirmed timeout. So
__blk_complete_request() will not be invoked.

On the other hand, in the kworker context, scmd_eh_abort_handler() will do some
other things on this aborted request.

rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
if (rtn == SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT)
if (scsi_host_eh_past_deadline(sdev->host)) {

} else if (!scsi_noretry_cmd(scmd) &&
(++scmd->retries <= scmd->allowed)) {
scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
}

}
The host type of the result will set to DID_TIME_OUT. scsi_noretry_cmd() will
return false and scsi_queue_insert() will be invoked, finally the
blk_requeue_request().

blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the request on
request_queue->queue_head.

There could be a race between these two contexts:
what if the blk_clear_rq_complete() in kworker context is invoked before the
blk_mark_rq_complete() in irq context ?

BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. Then the
scsi_softirq_done() will be invoked. At the moment, the request has been
requeued by kworker context and scsi_cmnd->result is set 0. On the cpu of the
irq context, the blk_finish_request() will be invoked finally, and panic comes
up due to  BUG_ON(blk_queued_rq(req)

Here is the dmesg log.
[4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  0 2002.
[4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  0 2002.
[4937745.181636] [ cut here ]
[4937745.181900] kernel BUG at block/blk-core.c:2601!
[4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.184919] invalid opcode:  [#1] SMP
[4937745.185173] Modules linked in: ...
[4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
4.1.12-61.1.9.el6uek.x86_64 #2
[4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
/BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
[4937745.189157] task: 880227fa9c00 ti: 880227fc task.ti: 
880227fc
[4937745.189607] RIP: e030:[]  [] 
blk_finish_request+0xd0/0xe0
[4937745.190104] RSP: e02b:88022d483d78  EFLAGS: 00010083
[4937745.190339] RAX: 0280 RBX: 8801c4508730 RCX: 
0280
[4937745.190791] RDX: 8b5a RSI:  RDI: 
8801c4508730
[4937745.191246] RBP: 88022d483d88 R08:  R09: 
4000
[4937745.191694] R10: 000b R11: 7ffe R12: 

[4937745.192153] R13: 88022374d000 R14: 880184cb88c0 R15: 
8800536ccc90
[4937745.197930] FS:  () GS:88022d48() 
knlGS:88022d48
[4937745.198389] CS:  e033 DS: 002b ES: 002b CR0: 80050033
[4937745.198627] CR2: 016e1f18 CR3: 00010caed000 CR4: 
00042660
[4937745.199080] Stack:
[4937745.199309]  8801c4508730  88022d483dd8 
814ac82f
[4937745.199790]  88022d483da8 0280 88022d483dd8 
880184cb88c0
[4937745.200290]  8801c4508730   

[4937745.200745] Call Trace:
[4937745.200968]  
[4937745.200986]  [] scsi_end_request+0x11f/0x1e0
[4937745.201422]  [] scsi_io_completion+0xac/0x500
[4937745.201651]  [] scsi_finish_command+0xf0/0x150
[4937745.201875]  [] scsi_softirq_done+0x14a/0x160
[4937745.202103]  [] blk_done_softirq+0x86/0xa0
[4937745.202333]  [] __do_softirq+0x10a/0x350
[4937745.202558]  [] irq_exit+0x125/0x130
[4937745.202787]  [] xen_evtchn_do_upcall+0x39/0x50
[4937745.203020]  [] xen_do_hypervisor_callback+0x1e/0x40
[4937745.203247]  
[4937745.203263]  [] ? xen_hypercall_sched_op+0xa/0x20
[4937745.203703]  [] ? 

Re: scsi/qla2xxx: BUG_ON(blk_queued_rq(req) is triggered in blk_finish_request

2018-05-25 Thread jianchao.wang
Hi Himanshu

I'm afraid I cannot provide you the vmcore file, it is from our customer.
If any information needed in the vmcore, I could provide with you.

In this scene, there are two contexts:
1. the irq context, scsi_done() from qla2x00_sp_compl()
2. the kworker context, scmd_eh_abort_handler which invokes the
qla2xxx_eh_abort()

Let's see the irq context.
scsi_done() will invoke the blk_complete_request().
In common case, the REQ_ATOM_COMPLETE should have been set by
blk_rq_check_expired() when the request was confirmed timeout. So
__blk_complete_request() will not be invoked.

On the other hand, in the kworker context, scmd_eh_abort_handler() will do some
other things on this aborted request.

rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
if (rtn == SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT)
if (scsi_host_eh_past_deadline(sdev->host)) {

} else if (!scsi_noretry_cmd(scmd) &&
(++scmd->retries <= scmd->allowed)) {
scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
}

}
The host type of the result will set to DID_TIME_OUT. scsi_noretry_cmd() will
return false and scsi_queue_insert() will be invoked, finally the
blk_requeue_request().

blk_requeue_request() will clear the REQ_ATOM_COMPLETE and queue the request on
request_queue->queue_head.

There could be a race between these two contexts:
what if the blk_clear_rq_complete() in kworker context is invoked before the
blk_mark_rq_complete() in irq context ?

BLK_SOFTIRQ will be raised by __blk_complete_request() in irq context. Then the
scsi_softirq_done() will be invoked. At the moment, the request has been
requeued by kworker context and scsi_cmnd->result is set 0. On the cpu of the
irq context, the blk_finish_request() will be invoked finally, and panic comes
up due to  BUG_ON(blk_queued_rq(req)

Here is the dmesg log.
[4937745.180213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.180655] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  0 2002.
[4937745.181059] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.181514] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  0 2002.
[4937745.181636] [ cut here ]
[4937745.181900] kernel BUG at block/blk-core.c:2601!
[4937745.182095] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.182338] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.183046] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.183213] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.183844] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:5:5 --  1 2002.
[4937745.184038] qla2xxx [:92:00.1]-801c:4: Abort command issued 
nexus=4:4:73 --  1 2002.
[4937745.184919] invalid opcode:  [#1] SMP
[4937745.185173] Modules linked in: ...
[4937745.188237] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 
4.1.12-61.1.9.el6uek.x86_64 #2
[4937745.188691] Hardware name: Oracle Corporation SUN BLADE X4-2B  
/BD,ASSY,PCB,X4-2B , BIOS 28010200 01/16/2014
[4937745.189157] task: 880227fa9c00 ti: 880227fc task.ti: 
880227fc
[4937745.189607] RIP: e030:[]  [] 
blk_finish_request+0xd0/0xe0
[4937745.190104] RSP: e02b:88022d483d78  EFLAGS: 00010083
[4937745.190339] RAX: 0280 RBX: 8801c4508730 RCX: 
0280
[4937745.190791] RDX: 8b5a RSI:  RDI: 
8801c4508730
[4937745.191246] RBP: 88022d483d88 R08:  R09: 
4000
[4937745.191694] R10: 000b R11: 7ffe R12: 

[4937745.192153] R13: 88022374d000 R14: 880184cb88c0 R15: 
8800536ccc90
[4937745.197930] FS:  () GS:88022d48() 
knlGS:88022d48
[4937745.198389] CS:  e033 DS: 002b ES: 002b CR0: 80050033
[4937745.198627] CR2: 016e1f18 CR3: 00010caed000 CR4: 
00042660
[4937745.199080] Stack:
[4937745.199309]  8801c4508730  88022d483dd8 
814ac82f
[4937745.199790]  88022d483da8 0280 88022d483dd8 
880184cb88c0
[4937745.200290]  8801c4508730   

[4937745.200745] Call Trace:
[4937745.200968]  
[4937745.200986]  [] scsi_end_request+0x11f/0x1e0
[4937745.201422]  [] scsi_io_completion+0xac/0x500
[4937745.201651]  [] scsi_finish_command+0xf0/0x150
[4937745.201875]  [] scsi_softirq_done+0x14a/0x160
[4937745.202103]  [] blk_done_softirq+0x86/0xa0
[4937745.202333]  [] __do_softirq+0x10a/0x350
[4937745.202558]  [] irq_exit+0x125/0x130
[4937745.202787]  [] xen_evtchn_do_upcall+0x39/0x50
[4937745.203020]  [] xen_do_hypervisor_callback+0x1e/0x40
[4937745.203247]  
[4937745.203263]  [] ? xen_hypercall_sched_op+0xa/0x20
[4937745.203703]  [] ?