Hi Eugene
Ok, I've had a look through all those log files!

The verbose initiator log file 'iscsi.out', unfortunately did not help,
as I did not gain any insights from it. 

The 'messages' log file, shows a repeating set of error messages.
I will high-light these two lines:

Oct 23 18:24:57 redmine kernel:  connection16:0: iscsi: detected conn error 
(1011)
Oct 23 18:24:58 redmine iscsid: Kernel reported iSCSI connection 16:0 error 
(1011) state (3)

I did some Googling on these error strings, and found this interesting page:

http://web.bii.a-star.edu.sg/~james/100TB/14TB/

..where James Tay demonstrates that this can occur if 
the network layer fails (that is, unplug the ethernet cable)
or if iscsi target sends a TCP FIN packet back to the initiator
to tear down the session.

Next I looked at the 'snoop.out' which is from  a modified 'iscsisnoop.d'
with added time stamps.
(And those time stamps agree with those in file 'target_log', which
is the debug log file from the Solaris iscsi target.)

Here Eugene has captured a complete session from the initial logon.
What is interesting is that for a relatively long time, every thing
is running fine.

The session starts at "2008 Oct 23 18:24:00" with a notmal looking login.
At "2008 Oct 23 18:24:11" the target starts sequence of 'write(10)' data.
And it writes a lot of data - I estimate 1428 blocks each of 64 Kbytes.
So that's 89 Mbytes.
Towards the end of that sequence, things start to go wrong.
The 'responses' to each write start to be slow to appear.
Then the 'responses' become erratic and out of sequence.
Then the responses stop, and the whole thing stalls.
This is time "2008 Oct 23 18:24:12".
Yes, all those packets and data in just a couple of seconds!

After 17 to 18 seconds the system realises something is wrong,
and we get a recovery re-login.
This is time "2008 Oct 23 18:24:29"

We get a burst of writes, about 33 blocks of 64 Kbytes (~ 2 Mbytes)
But again the responses are slow and become slower and more eratic,
and within a fration of a seconds it's stalled again.

After 17 to 18 seconds the system realises something is wrong,
and we get a recovery re-login.
This is time "2008 Oct 23 18:24:47"

We are now in a loop, with the same pattern repeating over and over.

Ok, those are the facts of the log files.
The question is - what is the root cause of the problem?

When the iscsi target stops sending responses, that iscsi initiator
will just think that the target has died - effectively it has died.

It's almost as if a buffer on the target side has filled up and then
the flow-control kicks in, but does not work correctly.

I'm not sure if this is happening at the iScsi level, or lower
down in the stack at the TCP level.

This looks to me like some flow-control algorithm going wrong, 
or using the wrong &/or conflicting parameters.

We need some real iscsi &/or network expert to make sense of this.
Lets hope that someone can get the insight to spot the problem.

I've got Tano (from over on zfs-discuss) with similar problems.
He is currently emailing me some similar log files from
his attempt with the VMware ESX iscsi initiator.
So it will be interesting to see if they show a similar pattern.

Ok, out of time for tonight.
Best Regards
Nigel Smith
--
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss

Reply via email to