Cesar,

  I was hoping someone from Sun would have followed up on this one.  Maybe they
did respond offline?  If not, here are a couple of suggestions to isolate /
reduce the problem.

  I just want to confirm your comment, "timeout was about 5-10 minutes".  Does
this mean it was unresponsive for about 5-10 minutes and then continued?

  The blocking of the IO stack by the iSCSI initiator is typically due to the
Solaris 10+ devfs code path.  There are a couple of common reasons that can
lead to this blocking.

  1) Network Timeouts - The iSCSI initiator stack is highly dependent on the
networking stack below it and its responsiveness.  Some of the connect/etc.
timeouts have been tuned to reduce blocking delays, although there is likely
more tuning that could be done.  The initiator already tweaks these network
settings:

  TCP_CONN_NOTIFY_THRESHOLD
  TCP_CONN_ABORT_THRESHOLD
  TCP_ABORT_THRESHOLD

(All bets are off if you're using RADIUS or iSNS.  Those are duplicated code
paths, and in the past they didn't contain the same tweaks.)

  2) Excessive BUS_CONFIG calls - The devfs framework gets a little brainless
from time to time and will hammer the initiator with duplicate BUS_CONFIG
calls.  If this is occurring there are a couple of possible workarounds.


  To isolate the problem I recommend you use the following dtrace probes with
an anonymous trace buffer to capture the problem during boot.  (If I remember
right, you drop the below in a file, issue 'dtrace -A -s <file>', reboot
and force the problem, then once the system finally times out and boots use
'dtrace -a' to review the trace.)

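/* Log each per-target config request along with the stack that issued it. */
fbt:iscsi:iscsid_config_one:entry
{
        printf("entry: %s %d\n", (string)arg1, arg2);
        stack();
}
fbt:iscsi:iscsid_config_one:return
{
        printf("return\n");
}

/* Log each config-all request along with the stack that issued it. */
fbt:iscsi:iscsid_config_all:entry
{
        printf("entry:  %d\n", arg2);
        stack();
}
fbt:iscsi:iscsid_config_all:return
{
        printf("return\n");
}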

If you see long gaps between the entry and return points then the problem is 
likely more related to network timeouts.  If the entry and return points are 
short and frequent then the problem is probably due to repeated BUS_CONFIG 
calls.
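
To put numbers on those gaps, something like the following could be added to
the same script.  This is only a sketch: self->one_ts and self->all_ts are
thread-local variable names made up for illustration, and it assumes each
function returns on the same thread that entered it.

```
/* Record when each config call starts (hypothetical thread-local names). */
fbt:iscsi:iscsid_config_one:entry
{
        self->one_ts = timestamp;
}
fbt:iscsi:iscsid_config_all:entry
{
        self->all_ts = timestamp;
}

/* On return, print the elapsed time in milliseconds and clear the mark. */
fbt:iscsi:iscsid_config_one:return
/self->one_ts/
{
        printf("%s took %d ms\n", probefunc,
            (timestamp - self->one_ts) / 1000000);
        self->one_ts = 0;
}
fbt:iscsi:iscsid_config_all:return
/self->all_ts/
{
        printf("%s took %d ms\n", probefunc,
            (timestamp - self->all_ts) / 1000000);
        self->all_ts = 0;
}
```

Very large values on a handful of calls would point toward network timeouts;
many small values in quick succession would point toward a BUS_CONFIG storm.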

If the BUS_CONFIG calls are frequent, try increasing the "config-storm-delay"
value via iscsi.conf.  The default value for this field is 5 seconds.  Try
increasing it to 10/20 seconds.  Note: Increasing this has a side effect: if
you add or remove devices in less than 10/20 seconds those changes will be
missed and devfsadm will have to be re-issued.
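
For example, assuming iscsi.conf follows the usual driver.conf "name=value;"
syntax (I have not verified this on your release), the entry might look like
the following:

```
# Assumed driver.conf syntax; 10 is the example value from above (default 5).
config-storm-delay=10;
```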

I hope this information helps...
 
 
This message posted from opensolaris.org