Re: iscsi_trx going into D state

2017-01-15 Thread Laurence Oberman
"Zhu Lingshan" > <ls...@suse.com>, "linux-rdma" <linux-r...@vger.kernel.org>, > linux-scsi@vger.kernel.org, "Sagi Grimberg" > <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de> > Sent: Friday, January 13, 2017 6:3

Re: iscsi_trx going into D state

2017-01-13 Thread Robert LeBlanc
, "Sagi Grimberg" >> <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de> >> Sent: Thursday, January 12, 2017 4:26:05 PM >> Subject: Re: iscsi_trx going into D state >> >> Sorry sent prematurely... >> >> On Thu, Jan 12, 2017

Re: iscsi_trx going into D state

2017-01-13 Thread Laurence Oberman
"Zhu Lingshan" > <ls...@suse.com>, "linux-rdma" <linux-r...@vger.kernel.org>, > linux-scsi@vger.kernel.org, "Sagi Grimberg" > <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de> > Sent: Thursday, January 12, 2017 4:26:05

Re: iscsi_trx going into D state

2017-01-12 Thread Robert LeBlanc
Sorry sent prematurely... On Thu, Jan 12, 2017 at 2:22 PM, Robert LeBlanc wrote: > I'm having trouble replicating the D state issue on Infiniband (I was > able to trigger it reliably a couple weeks back, I don't know if OFED > to verify the same results happen there as

Re: iscsi_trx going into D state

2017-01-12 Thread Robert LeBlanc
I have a crappy patch (sledgehammer approach) that seems to prevent the D state issue and the connection recovers, but things are possibly not being cleaned up properly in iSCSI and so it may have issues after a few recoveries (one test completed with a lot of resets but no iSCSI errors).

Re: iscsi_trx going into D state

2017-01-06 Thread Robert LeBlanc
Laurence, Since the summary may be helpful to others, I'm just going to send it to the list. I've been able to reproduce the D state problem on both Infiniband and RoCE, but it is much easier to reproduce on RoCE due to another bug and doesn't require being at the server to yank the cable

Re: iscsi_trx going into D state

2017-01-06 Thread Laurence Oberman
"linux-rdma" > <linux-r...@vger.kernel.org>, linux-scsi@vger.kernel.org, "Sagi Grimberg" > <s...@grimberg.me>, "Christoph Hellwig" > <h...@lst.de> > Sent: Tuesday, January 3, 2017 7:11:40 PM > Subject: Re: iscsi_trx going into D state > >

Re: iscsi_trx going into D state

2017-01-03 Thread Robert LeBlanc
With the last patch it is getting hung up on wait_for_completion in target_wait_for_sess_cmds. I don't know what t_state or fabric state mean. To me it looks like a queue is not being emptied, but it would help if someone confirmed this and has some pointers on how to properly flush them when the

Re: iscsi_trx going into D state

2017-01-03 Thread Robert LeBlanc
With this patch I'm not seeing the __ib_drain_sq backtraces, but I'm still seeing the previous backtraces. diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index 6dd43f6..1e53502 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++

Re: iscsi_trx going into D state

2016-12-30 Thread Robert LeBlanc
I decided to try something completely different... Running the stock CentOS 3.10 kernel and OFED 3.4 on both hosts, I'm not seeing the hung processes and the tests complete successfully. The same seems to be true for the target on 4.9 and the initiator on 3.10. However, with the target on 3.10

Re: iscsi_trx going into D state

2016-12-29 Thread Robert LeBlanc
OK, I've drilled down a little more and timeout = action(timeout); in do_wait_for_common() in kernel/sched/completion.c is not returning. I'll have to see if I can make more progress tomorrow. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On

Re: iscsi_trx going into D state

2016-12-29 Thread Robert LeBlanc
I know most people are ignoring this thread by now, but I hope someone is still reading and can offer some ideas. It looks like ib_drain_qp_done() is not being called the first time that __ib_drain_sq() is called from iscsit_close_connection(). I tried to debug wait_for_completion() and friends,

Re: iscsi_trx going into D state

2016-12-28 Thread Robert LeBlanc
Good news! I found a 10 Gb switch laying around and put it in place of the Linux router. I'm getting the same failure with the switch, so it is not something funky with the Linux router and easier to replicate. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654

Re: iscsi_trx going into D state

2016-12-28 Thread Robert LeBlanc
OK, here is some more info. This is a diagram of my current set up. ++ | Linux Router | | ConnectX-3 | | port 1 port 2 | ++ / \ +---+ /

Re: iscsi_trx going into D state

2016-12-27 Thread Robert LeBlanc
I realized that I did not set the default RoCE mode to v2 and the client is on a different subnet, probably why I'm seeing the -110 error. Iser should not go into D state because of this and should handle this gracefully, but may provide an easy way to replicate the issue. Robert

Re: iscsi_trx going into D state

2016-12-27 Thread Robert LeBlanc
I looked at this code and it is quite above my ability. I created this patch, but I don't know how to interrogate the queue to see how many items there are. If you can give me some more direction on what to try, I can keep fumbling around with this until someone smarter than me can figure it out.

Re: iscsi_trx going into D state

2016-12-22 Thread Doug Ledford
On 12/21/2016 6:39 PM, Robert LeBlanc wrote: > I hit a new backtrace today, hopefully it adds something. > > # cat /proc/19659/stack > [] iscsit_stop_session+0x1b1/0x1c0 > [] iscsi_check_for_session_reinstatement+0x1e2/0x270 > [] iscsi_target_check_for_existing_instances+0x30/0x40 > []

Re: iscsi_trx going into D state

2016-12-21 Thread Robert LeBlanc
I hit a new backtrace today, hopefully it adds something. # cat /proc/19659/stack [] iscsit_stop_session+0x1b1/0x1c0 [] iscsi_check_for_session_reinstatement+0x1e2/0x270 [] iscsi_target_check_for_existing_instances+0x30/0x40 [] iscsi_target_do_login+0x138/0x630 []

Re: iscsi_trx going into D state

2016-12-15 Thread Robert LeBlanc
Nicholas, I've found that the kernels I used were not able to be inspected using crash and I could not build the debug info for them. So I built a new 4.9 kernel and verified that I could inspect the crash. It is located at [1]. [1] http://mirrors.betterservers.com/trace/crash2.tar.xz

Re: iscsi_trx going into D state

2016-12-12 Thread Robert LeBlanc
Nicholas, After lots of set backs and having to give up trying to get kernel dumps on our "production" systems, I've been able to work out the issues we had with kdump and replicate the issue on my dev boxes. I have dumps from 4.4.30 and 4.9-rc8 (makedumpfile would not dump, so it is a straight

Re: iscsi_trx going into D state

2016-11-04 Thread Robert LeBlanc
We hit this yesterday, this time it was on the tx thread (the other ones before seem to be on the rx thread). We weren't able to get a kernel dump on this. We'll try to get one next time. # ps axuw | grep "D.*iscs[i]" root 12383 0.0 0.0 0 0 ? D Nov03 0:04 [iscsi_np]
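
The `ps axuw | grep "D.*iscs[i]"` check above can also be scripted. The following is a minimal Python sketch, not something posted in the thread: it assumes a Linux `/proc` layout, and the function names `parse_stat_line` and `find_d_state_tasks` are made up for illustration. It reads the state field of `/proc/<pid>/stat` and reports tasks in uninterruptible sleep (state `D`) whose name starts with a given prefix.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) parsed from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def find_d_state_tasks(name_prefix="iscsi", proc_root="/proc"):
    """List (pid, comm) for tasks in state D whose name starts with name_prefix."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited while scanning, or unexpected format
        if state == "D" and comm.startswith(name_prefix):
            hung.append((pid, comm))
    return hung
```

Calling `find_d_state_tasks()` on a target would return a list of `(pid, name)` pairs; each PID can then be inspected with `cat /proc/<pid>/stack` as done elsewhere in the thread.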

Re: iscsi_trx going into D state

2016-10-31 Thread Robert LeBlanc
Nicholas, Thanks for following up on this. We have been chasing other bugs in our provisioning and as such has reduced our load on the boxes. We are hoping to get that all straightened out this week and do some more testing. So far we have not had any iSCSI in D state since the patch, but we

Re: iscsi_trx going into D state

2016-10-29 Thread Nicholas A. Bellinger
Hi Robert, On Wed, 2016-10-19 at 10:41 -0600, Robert LeBlanc wrote: > Nicholas, > > I didn't have high hopes for the patch because we were not seeing > TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it > seemed to help regardless. Our clients finally OOMed from the hung >

Re: iscsi_trx going into D state

2016-10-19 Thread Robert LeBlanc
Nicholas, I didn't have high hopes for the patch because we were not seeing TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it seemed to help regardless. Our clients finally OOMed from the hung sessions, so we are having to reboot them and we will do some more testing. We haven't

Re: iscsi_trx going into D state

2016-10-19 Thread Nicholas A. Bellinger
On Tue, 2016-10-18 at 16:13 -0600, Robert LeBlanc wrote: > Nicholas, > > We patched this in and for the first time in many reboots, we didn't > have iSCSI going straight into D state. We have had to work on a > couple of other things, so we don't know if this is just a coincidence > or not. We

Re: iscsi_trx going into D state

2016-10-18 Thread Robert LeBlanc
Nicholas, We patched this in and for the first time in many reboots, we didn't have iSCSI going straight into D state. We have had to work on a couple of other things, so we don't know if this is just a coincidence or not. We will reboot back into the old kernel and back a few times and do some

Re: iscsi_trx going into D state

2016-10-18 Thread Nicholas A. Bellinger
On Tue, 2016-10-18 at 00:05 -0700, Nicholas A. Bellinger wrote: > Hello Robert, Zhu & Co, > > Thanks for your detailed bug report. Comments inline below. > > On Mon, 2016-10-17 at 22:42 -0600, Robert LeBlanc wrote: > > Sorry I forget that Android has an aversion to plain text emails. > > > >

Re: iscsi_trx going into D state

2016-10-18 Thread Nicholas A. Bellinger
Hello Robert, Zhu & Co, Thanks for your detailed bug report. Comments inline below. On Mon, 2016-10-17 at 22:42 -0600, Robert LeBlanc wrote: > Sorry I forget that Android has an aversion to plain text emails. > > If we can provide any information to help, let us know. We are willing > to patch

Re: iscsi_trx going into D state

2016-10-17 Thread Zhu Lingshan
Hi Robert, I think the reason why you can not logout the targets is that iscsi_np in D status. I think the patches fixed something, but it seems to be more than one code path can trigger these similar issues. as you can see, there are several call stacks, I am still working on it. Actually

Re: iscsi_trx going into D state

2016-10-17 Thread Robert LeBlanc
Sorry hit send too soon. In addition, on the client we see: # ps -aux | grep D | grep kworker root 5583 0.0 0.0 0 0 ? D 11:55 0:03 [kworker/11:0] root 7721 0.1 0.0 0 0 ? D 12:00 0:04 [kworker/4:25] root 10877 0.0 0.0 0 0 ?

Re: iscsi_trx going into D state

2016-10-17 Thread Robert LeBlanc
In addition, on the client we see: Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Oct 17, 2016 at 10:32 AM, Robert LeBlanc wrote: > Some more info as we hit this this morning. We have volumes mirrored > between

Re: iscsi_trx going into D state

2016-10-17 Thread Robert LeBlanc
Some more info as we hit this this morning. We have volumes mirrored between two targets and we had one target on the kernel with the three patches mentioned in this thread [0][1][2] and the other was on a kernel without the patches. We decided that after a week and a half we wanted to get both

Re: iscsi_trx going into D state

2016-10-07 Thread Zhu Lingshan
Hi Robert, I also see this issue, but this is not the only code path can trigger this problem, I think you may also see iscsi_np in D status. I fixed one code path which still not merged to mainline. I will forward you my patch later. Note: my patch only fixed one code path, you may see

Re: iscsi_trx going into D state

2016-10-05 Thread Robert LeBlanc
Thanks, we will apply that too. We'd like to get this stable. We'll report back on what we find with these patches. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Oct 5, 2016 at 12:03 PM, Christoph Hellwig wrote: >

Re: iscsi_trx going into D state

2016-10-05 Thread Christoph Hellwig
Hi Robert, I actually got the name wrong, the patch wasn't from Lee, but from Zhu, another SuSE engineer. This is the one: http://www.spinics.net/lists/target-devel/msg13463.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to

Re: iscsi_trx going into D state

2016-10-05 Thread Robert LeBlanc
We are not able to identify the patch that you mentioned from Lee, can you give us a commit or a link to the patch? Thanks, Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Oct 4, 2016 at 5:46 AM, Christoph Hellwig

Re: iscsi_trx going into D state

2016-10-04 Thread Robert LeBlanc
Do you want me to try this patch or wait for some of the suggestions Christoph brought up to be incorporated? Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Oct 4, 2016 at 5:46 AM, Christoph Hellwig wrote: > On Tue,

Re: iscsi_trx going into D state

2016-10-04 Thread Christoph Hellwig
On Tue, Oct 04, 2016 at 11:11:18AM +0200, Hannes Reinecke wrote: > Hmm. Looking at the code it looks as we might miss some calls to > 'complete'. Can you try with the attached patch? That only looks slightly better than the original. What this really needs is a waitqueue and waitevent on
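
To illustrate why a missed call to 'complete' leaves a waiter stuck, here is a small user-space analogy in Python. It is only a sketch and not kernel code: the `Completion` class and `close_connection` function are hypothetical names, and `threading.Event` merely stands in for a kernel completion. The waiter here uses a timeout so the example terminates; a waiter with no timeout would block forever when nobody signals, which is roughly what a task stuck in D state looks like.

```python
import threading


class Completion:
    """User-space stand-in for a kernel completion (illustrative only)."""

    def __init__(self):
        self._event = threading.Event()

    def complete(self):
        # Wake up anyone blocked in wait().
        self._event.set()

    def wait(self, timeout=None):
        # Returns True if complete() was called, False if the timeout expired.
        return self._event.wait(timeout)


def close_connection(comp, signal_done, timeout=0.2):
    """Run a worker that may or may not signal completion, then wait for it."""
    def worker():
        if signal_done:
            comp.complete()
        # If signal_done is False, this models a code path that
        # returns without ever calling complete().

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return comp.wait(timeout)
```

`close_connection(Completion(), True)` returns `True` immediately, while `close_connection(Completion(), False)` returns `False` only after the timeout expires, because nothing ever signals the waiter.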

Re: iscsi_trx going into D state

2016-10-04 Thread Hannes Reinecke
On 10/04/2016 09:55 AM, Johannes Thumshirn wrote: > On Fri, Sep 30, 2016 at 11:14:57AM -0600, Robert LeBlanc wrote: >> We are having a reoccurring problem where iscsi_trx is going into D >> state. It seems like it is waiting for a session tear down to happen >> or something, but keeps waiting. We

Re: iscsi_trx going into D state

2016-10-04 Thread Johannes Thumshirn
On Fri, Sep 30, 2016 at 11:14:57AM -0600, Robert LeBlanc wrote: > We are having a reoccurring problem where iscsi_trx is going into D > state. It seems like it is waiting for a session tear down to happen > or something, but keeps waiting. We have to reboot these targets on > occasion. This is