Cesar,
I was hoping someone from Sun would have followed up on this one. Maybe they
did respond offline? If not, here are a couple of suggestions to isolate or
reduce the problem.
I just want to confirm your comment, "timeout was about 5-10 minutes". Does
this mean it was unresponsive for about 5-10 minutes and then continued?
The blocking of the IO stack by the iSCSI initiator is typically due to the
Solaris 10+ devfs code path. There are a couple common reasons that can lead
to this blocking.
1) Network Timeouts - The iSCSI initiator stack is highly dependent on the
networking stack below it and on its responsiveness. Some of the connect and
related timeouts have been tuned to reduce blocking delays, although there is
likely more tuning that could be done. The initiator already tweaks the
network settings listed below. (All bets are off if you're using RADIUS or
iSNS. Those features use duplicated code paths, and in the past they didn't
contain the same tweaks.)
TCP_CONN_NOTIFY_THRESHOLD
TCP_CONN_ABORT_THRESHOLD
TCP_ABORT_THRESHOLD
2) Excessive BUS_CONFIG calls - The devfs framework gets a little brainless
from time to time and will hammer the initiator with duplicate BUS_CONFIG
calls. If this is occurring there are a couple of possible workarounds.
To isolate the problem I recommend you use the following dtrace probes with
an anonymous trace buffer to capture the problem during boot. (If I remember
right you drop the below in a file, issue 'dtrace -A -s <file>', reboot
and force the problem, then once the system finally times out and boots use
'dtrace -a' to review the trace.)
fbt:iscsi:iscsid_config_one:entry
{
        printf("entry: %s %d\n", (string)arg1, arg2);
        stack();
}

fbt:iscsi:iscsid_config_one:return
{
        printf("return\n");
}

fbt:iscsi:iscsid_config_all:entry
{
        printf("entry: %d\n", arg2);
        stack();
}

fbt:iscsi:iscsid_config_all:return
{
        printf("return\n");
}
If you see long gaps between the entry and return points then the problem is
likely more related to network timeouts. If the entry and return points are
short and frequent then the problem is probably due to repeated BUS_CONFIG
calls.
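To make the long-gap versus short-and-frequent distinction easier to read from the trace, the probes can also record how long each call took. This is only a sketch, not something taken from the iSCSI sources: it reuses the same two fbt functions, and the thread-local names self->one_ts and self->all_ts are made up for illustration. DTrace's built-in timestamp variable counts nanoseconds, so dividing by 1000000 gives milliseconds.

```
/* Remember when each call started (per thread). */
fbt:iscsi:iscsid_config_one:entry
{
        self->one_ts = timestamp;
}

fbt:iscsi:iscsid_config_all:entry
{
        self->all_ts = timestamp;
}

/* On return, print the elapsed time in milliseconds. */
fbt:iscsi:iscsid_config_one:return
/self->one_ts/
{
        printf("config_one took %d ms\n",
            (timestamp - self->one_ts) / 1000000);
        self->one_ts = 0;
}

fbt:iscsi:iscsid_config_all:return
/self->all_ts/
{
        printf("config_all took %d ms\n",
            (timestamp - self->all_ts) / 1000000);
        self->all_ts = 0;
}
```

Many lines with small values would point to repeated BUS_CONFIG calls, while a few lines with very large values would point to network timeouts.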
If the BUS_CONFIG calls are frequent, try increasing the "config-storm-delay"
setting in iscsi.conf. The default value for this field is 5 seconds. Try
increasing it to 10 or 20 seconds. Note: Increasing this has a side effect: if
you add or remove devices within that 10 or 20 second window, those changes
will be missed and devfsadm will have to be re-run.
I hope this information helps...
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/storage-discuss