"Zhu Lingshan"
> <ls...@suse.com>, "linux-rdma" <linux-r...@vger.kernel.org>,
> linux-scsi@vger.kernel.org, "Sagi Grimberg"
> <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de>
> Sent: Friday, January 13, 2017 6:3
, "Sagi Grimberg"
>> <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de>
>> Sent: Thursday, January 12, 2017 4:26:05 PM
>> Subject: Re: iscsi_trx going into D state
>>
>> Sorry sent prematurely...
>>
>> On Thu, Jan 12, 2017
"Zhu Lingshan"
> <ls...@suse.com>, "linux-rdma" <linux-r...@vger.kernel.org>,
> linux-scsi@vger.kernel.org, "Sagi Grimberg"
> <s...@grimberg.me>, "Christoph Hellwig" <h...@lst.de>
> Sent: Thursday, January 12, 2017 4:26:05
Sorry sent prematurely...
On Thu, Jan 12, 2017 at 2:22 PM, Robert LeBlanc wrote:
> I'm having trouble replicating the D state issue on Infiniband (I was
> able to trigger it reliably a couple weeks back, I don't know if OFED
> to verify the same results happen there as
I have a crappy patch (sledgehammer approach) that seems to prevent
the D state issue and the connection recovers, but things are possibly
not being cleaned up properly in iSCSI and so it may have issues after
a few recoveries (one test completed with a lot of resets but no iSCSI
errors).
Laurence,
Since the summary may be helpful to others, I'm just going to send it
to the list.
I've been able to reproduce the D state problem on both Infiniband and
RoCE, but it is much easier to reproduce on RoCE due to another bug
and doesn't require being at the server to yank the cable
"linux-rdma"
> <linux-r...@vger.kernel.org>, linux-scsi@vger.kernel.org, "Sagi Grimberg"
> <s...@grimberg.me>, "Christoph Hellwig"
> <h...@lst.de>
> Sent: Tuesday, January 3, 2017 7:11:40 PM
> Subject: Re: iscsi_trx going into D state
>
>
With the last patch it is getting hung up on wait_for_completion in
target_wait_for_sess_cmds. I don't know what t_state or fabric state
mean. To me it looks like a queue is not being emptied, but it would
help if someone confirmed this and has some pointers on how to
properly flush them when the
With this patch I'm not seeing the __ib_drain_sq backtraces, but I'm
still seeing the previous backtraces.
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
b/drivers/infiniband/ulp/isert/ib_isert.c
index 6dd43f6..1e53502 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++
I decided to try something completely different... Running the stock
CentOS 3.10 kernel and OFED 3.4 on both hosts, I'm not seeing the hung
processes and the tests complete successfully. The same seems to be
true for the target on 4.9 and the initiator on 3.10.
However, with the target on 3.10
OK, I've drilled down a little more and
timeout = action(timeout);
in do_wait_for_common() in kernel/sched/completion.c is not returning.
I'll have to see if I can make more progress tomorrow.
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
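A rough userspace analogy for the hang described above, not kernel code: a waiter that blocks without a timeout never returns if the matching completion signal is never sent. This sketch uses Python's threading.Event as a stand-in for a kernel completion. The names wait_for_done and run_case are hypothetical and exist only for illustration.

```python
import threading
from typing import Optional


def wait_for_done(done: threading.Event, timeout_s: Optional[float]) -> bool:
    """Block until `done` is set; return False if the timeout expires first.

    With timeout_s=None this blocks indefinitely, which is the userspace
    analogue of a waiter stuck because the completion is never signalled.
    """
    return done.wait(timeout=timeout_s)


def run_case(signal_completion: bool, timeout_s: float = 0.2) -> bool:
    """Optionally signal the event, then report whether the wait completed."""
    done = threading.Event()
    if signal_completion:
        # Hypothetical "finisher" thread: stands in for the code path that
        # is supposed to signal the completion.
        finisher = threading.Thread(target=done.set)
        finisher.start()
        finisher.join()
    return wait_for_done(done, timeout_s)


if __name__ == "__main__":
    print("signalled:", run_case(True))
    print("never signalled:", run_case(False))
```

A bounded wait like this only makes the stall visible; it does not explain why the signal was never sent.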
On
I know most people are ignoring this thread by now, but I hope someone
is still reading and can offer some ideas.
It looks like ib_drain_qp_done() is not being called the first time
that __ib_drain_sq() is called from iscsit_close_connection(). I tried
to debug wait_for_completion() and friends,
Good news! I found a 10 Gb switch lying around and put it in place of
the Linux router. I'm getting the same failure with the switch, so it
is not something funky with the Linux router and easier to replicate.
Robert LeBlanc
OK, here is some more info. This is a diagram of my current set up.
+----------------+
|  Linux Router  |
|   ConnectX-3   |
| port 1  port 2 |
+----------------+
     /      \
+---+ /
I realized that I did not set the default RoCE mode to v2 and the
client is on a different subnet, probably why I'm seeing the -110
error. iSER should not go into D state because of this and should
handle this gracefully, but this may provide an easy way to replicate the
issue.
Robert
I looked at this code and it is quite above my ability. I created this
patch, but I don't know how to interrogate the queue to see how many
items there are. If you can give me some more direction on what to
try, I can keep fumbling around with this until someone smarter than
me can figure it out.
On 12/21/2016 6:39 PM, Robert LeBlanc wrote:
> I hit a new backtrace today, hopefully it adds something.
>
> # cat /proc/19659/stack
> [] iscsit_stop_session+0x1b1/0x1c0
> [] iscsi_check_for_session_reinstatement+0x1e2/0x270
> [] iscsi_target_check_for_existing_instances+0x30/0x40
> []
I hit a new backtrace today, hopefully it adds something.
# cat /proc/19659/stack
[] iscsit_stop_session+0x1b1/0x1c0
[] iscsi_check_for_session_reinstatement+0x1e2/0x270
[] iscsi_target_check_for_existing_instances+0x30/0x40
[] iscsi_target_do_login+0x138/0x630
[]
Nicholas,
I've found that the kernels I used were not able to be inspected using
crash and I could not build the debug info for them. So I built a new
4.9 kernel and verified that I could inspect the crash. It is located
at [1].
[1] http://mirrors.betterservers.com/trace/crash2.tar.xz
Nicholas,
After lots of setbacks and having to give up trying to get kernel
dumps on our "production" systems, I've been able to work out the
issues we had with kdump and replicate the issue on my dev boxes. I
have dumps from 4.4.30 and 4.9-rc8 (makedumpfile would not dump, so it
is a straight
We hit this yesterday, this time it was on the tx thread (the other
ones before seem to be on the rx thread). We weren't able to get a
kernel dump on this. We'll try to get one next time.
# ps axuw | grep "D.*iscs[i]"
root 12383 0.0 0.0 0 0 ? D Nov03 0:04 [iscsi_np]
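A minimal sketch of the same check done on the STAT column instead of with grep, assuming the standard `ps aux` column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND). The function name d_state_tasks is hypothetical. Matching on the column avoids false positives from a "D" appearing elsewhere on the line.

```python
import subprocess
from typing import List, Tuple


def d_state_tasks(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for every task whose STAT field starts with 'D'."""
    tasks = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        stat, command = fields[7], fields[10]
        if stat.startswith("D"):
            tasks.append((int(fields[1]), command))
    return tasks


if __name__ == "__main__":
    out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    for pid, command in d_state_tasks(out):
        if "iscsi" in command:
            print(pid, command)
```

For each PID reported, /proc/PID/stack (readable as root) shows where the task is blocked, as in the backtraces posted in this thread.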
Nicholas,
Thanks for following up on this. We have been chasing other bugs in
our provisioning and as such have reduced our load on the boxes. We are
hoping to get that all straightened out this week and do some more
testing. So far we have not had any iSCSI in D state since the patch,
but we
Hi Robert,
On Wed, 2016-10-19 at 10:41 -0600, Robert LeBlanc wrote:
> Nicholas,
>
> I didn't have high hopes for the patch because we were not seeing
> TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it
> seemed to help regardless. Our clients finally OOMed from the hung
>
Nicholas,
I didn't have high hopes for the patch because we were not seeing
TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it
seemed to help regardless. Our clients finally OOMed from the hung
sessions, so we are having to reboot them and we will do some more
testing. We haven't
On Tue, 2016-10-18 at 16:13 -0600, Robert LeBlanc wrote:
> Nicholas,
>
> We patched this in and for the first time in many reboots, we didn't
> have iSCSI going straight into D state. We have had to work on a
> couple of other things, so we don't know if this is just a coincidence
> or not. We
Nicholas,
We patched this in and for the first time in many reboots, we didn't
have iSCSI going straight into D state. We have had to work on a
couple of other things, so we don't know if this is just a coincidence
or not. We will reboot back into the old kernel and back a few times
and do some
On Tue, 2016-10-18 at 00:05 -0700, Nicholas A. Bellinger wrote:
> Hello Robert, Zhu & Co,
>
> Thanks for your detailed bug report. Comments inline below.
>
> On Mon, 2016-10-17 at 22:42 -0600, Robert LeBlanc wrote:
> > Sorry, I forgot that Android has an aversion to plain text emails.
> >
> >
Hello Robert, Zhu & Co,
Thanks for your detailed bug report. Comments inline below.
On Mon, 2016-10-17 at 22:42 -0600, Robert LeBlanc wrote:
> Sorry, I forgot that Android has an aversion to plain text emails.
>
> If we can provide any information to help, let us know. We are willing
> to patch
Hi Robert,
I think the reason you cannot log out of the targets is that iscsi_np is
in D state. I think the patches fixed something, but it seems that
more than one code path can trigger these similar issues. As you can
see, there are several call stacks; I am still working on it. Actually
Sorry hit send too soon.
In addition, on the client we see:
# ps -aux | grep D | grep kworker
root 5583 0.0 0.0 0 0 ? D 11:55 0:03 [kworker/11:0]
root 7721 0.1 0.0 0 0 ? D 12:00 0:04 [kworker/4:25]
root 10877 0.0 0.0 0 0 ?
In addition, on the client we see:
Robert LeBlanc
On Mon, Oct 17, 2016 at 10:32 AM, Robert LeBlanc wrote:
> Some more info as we hit this this morning. We have volumes mirrored
> between
Some more info as we hit this this morning. We have volumes mirrored
between two targets and we had one target on the kernel with the three
patches mentioned in this thread [0][1][2] and the other was on a
kernel without the patches. We decided that after a week and a half we
wanted to get both
Hi Robert,
I also see this issue, but this is not the only code path that can trigger
this problem; I think you may also see iscsi_np in D state. I fixed one
code path, which is still not merged to mainline. I will forward you my
patch later. Note: my patch only fixes one code path, so you may see
Thanks, we will apply that too. We'd like to get this stable. We'll
report back on what we find with these patches.
Robert LeBlanc
On Wed, Oct 5, 2016 at 12:03 PM, Christoph Hellwig wrote:
>
Hi Robert,
I actually got the name wrong, the patch wasn't from Lee, but from Zhu,
another SuSE engineer. This is the one:
http://www.spinics.net/lists/target-devel/msg13463.html
We are not able to identify the patch that you mentioned from Lee, can
you give us a commit or a link to the patch?
Thanks,
Robert LeBlanc
On Tue, Oct 4, 2016 at 5:46 AM, Christoph Hellwig
Do you want me to try this patch or wait for some of the suggestions
Christoph brought up to be incorporated?
Robert LeBlanc
On Tue, Oct 4, 2016 at 5:46 AM, Christoph Hellwig wrote:
> On Tue,
On Tue, Oct 04, 2016 at 11:11:18AM +0200, Hannes Reinecke wrote:
> Hmm. Looking at the code, it looks as if we might miss some calls to
> 'complete'. Can you try with the attached patch?
That only looks slightly better than the original. What this really
needs is a waitqueue and waitevent on
On 10/04/2016 09:55 AM, Johannes Thumshirn wrote:
> On Fri, Sep 30, 2016 at 11:14:57AM -0600, Robert LeBlanc wrote:
>> We are having a reoccurring problem where iscsi_trx is going into D
>> state. It seems like it is waiting for a session tear down to happen
>> or something, but keeps waiting. We
On Fri, Sep 30, 2016 at 11:14:57AM -0600, Robert LeBlanc wrote:
> We are having a reoccurring problem where iscsi_trx is going into D
> state. It seems like it is waiting for a session tear down to happen
> or something, but keeps waiting. We have to reboot these targets on
> occasion. This is