[This is a little long-winded and I end up back where nwsmith left off. I just
took a different approach.]
In relation to your first dtrace output, you have this repeating loop:
3 13558 iscsi_conn_state_machine:event 300056dc480 logged_in T15
3 13559 iscsi_sess_state_machine:event 30005b56000 logged_in N5
1 13558 iscsi_conn_state_machine:event 300056dc480 failed T5
1 13559 iscsi_sess_state_machine:event 30005b56000 failed N1
...This again backs up the statement that the tran_err is indirectly caused by
the initiator's connection dropping and restarting.
<clip from code>
412 * -T15: One or more of the following events caused this transition:
413 * - Internal event that indicates a transport connection timeout
414 * was received thus prompting transport RESET or transport
415 * connection closure.
416 * - A transport RESET
417 * - A transport disconnect indication.
418 * - Async PDU with AsyncEvent "Drop connection" (for this CID)
419 * - Async PDU with AsyncEvent "Drop all connections"
So it sounds like an event in which the target dropped the connection, either
via a notified disconnect (Async PDU) or a brute-force disconnect.
Assuming the code matches, or is similar to, the OpenSolaris code, there is only
one place that issues a T15 event: iscsi_rx_thread(), when it gets an
ISCSI_STATUS_TCP_RX_ERROR. All of those errors originate in iscsi_net.c from
recvmsg() failures. So I would guess the target disconnected the initiator.
This leads back to nwsmith's comments about "ERROR:Bad "Opcode": Got 0 expected
5." in the target log. The target seems to think the initiator is doing
something wrong.
I hadn't heard of newfs -T. Based on the source comment, "/* set up file
system for growth to over 1 TB */", it looks like it's for LUNs greater than
1 terabyte? Although one manpage comment I found on the net made it sound
like it was for legacy support. Does someone know what this really does? I
decided to stop reading the code; it's messing with the filesystem
geometry/density values.
>>> Are you using a greater than 1T partition/filesystem? <<<
Just a wild guess at this point, but maybe the newfs is causing the disk driver
to send the target some SCSI command it doesn't like via the iSCSI initiator.
Then the target doesn't respond, so the initiator sends a NOP to see what is
going on? The target doesn't expect the NOP and disconnects? Then things get
flushed and the process repeats, with a little IO getting done between these
resets. (I'm probably off in the weeds.)
I would search the snoop trace for the disconnects, or for the NOP, since the
NOP seems to be what's unexpected.
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss