[This is a little long-winded and I end up back where nwsmith left off; I just 
took a different approach.]

In relation to your first dtrace output, you have the repeating loop of...

3 13558 iscsi_conn_state_machine:event 300056dc480 logged_in T15
3 13559 iscsi_sess_state_machine:event 30005b56000 logged_in N5
1 13558 iscsi_conn_state_machine:event 300056dc480 failed T5
1 13559 iscsi_sess_state_machine:event 30005b56000 failed N1

...This again backs the statement that tran_err is indirectly caused by 
the initiator having the connection drop and restart.

<clip from code>
    412  * -T15: One or more of the following events caused this transition:
    413  *      - Internal event that indicates a transport connection timeout
    414  *        was received thus prompting transport RESET or transport
    415  *        connection closure.
    416  *      - A transport RESET
    417  *      - A transport disconnect indication.
    418  *      - Async PDU with AsyncEvent "Drop connection" (for this CID)
    419  *      - Async PDU with AsyncEvent "Drop all connections"

So it sounds like it tends to be an event in which the target disconnected the 
connection, either via a notified disconnect (Async PDU) or a brute-force disconnect.

Assuming the code matches or is similar to the OpenSolaris code, there is only 
one place that issues a T15 event: iscsi_rx_thread(), when it gets an 
ISCSI_STATUS_TCP_RX_ERROR.  And all of those originate in iscsi_net.c, caused by 
recvmsg() failures.  So I would guess the target disconnected the initiator.

This leads back to nwsmith's comments about "ERROR:Bad "Opcode": Got 0 expected 
5." in the target log.  The target seems to think the initiator is doing 
something wrong.

I hadn't heard of newfs -T.  Based on the source comment... "/* set up file 
system for growth to over 1 TB */"  it looks like it's for greater-than-1 TB 
LUNs?  Although one manpage comment I found on the net made it sound 
like it was for legacy support.  Does someone know what this really does?  I 
decided to stop reading the code.  It's messing with the filesystem 
geometry/density values.

>>> Are you using a greater than 1T partition/filesystem? <<<

Just a wild guess at this point, but maybe the newfs is causing the disk driver 
to send the target some SCSI command it doesn't like via the iSCSI initiator.  
Then the target doesn't respond, so the initiator sends a NOP to see what 
is going on?  The NOP isn't expected by the target, and it disconnects?  Then 
things get flushed and the process repeats, while you still get a little 
IO done during these resets.  (I'm probably off in the weeds.)

I would search the snoop trace for the disconnects, or for the NOP, since it 
seems to be unexpected.
 
 
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss