I've been doing some further testing with the Microsoft iScsi initiator,
and the Solaris iScsi target. Everything seemed to be going ok,...until it
crashed!
My test hardware has improved since I last reported.
I'm now using GigaBit Ethernet.
The Solaris snv_60 is running on a Pentium-D dual-core 3GHz, with 1GB RAM.
The initiator is running Windows Server 2003 R2 on a Pentium III, 930MHz, 512MB
RAM.
I tried copying files using Windows Explorer, and did not have any problem.
So I tried again with my old favourite, the SQLIOStress utility from Microsoft.
I've gone back to using the old version 4.00.050, (non-GUI command line).
I did one run of 16 iterations overnight for about 8 hours, with no problems.
I set it going again as I went to work this morning, & when I arrive home
tonight it had crashed, on iteration 12 of 16.
Windows displays a message box titled "Windows - Delayed Write Failed" saying:
"Windows was unable to save all the data for the file S:\$Mft. The data has
been lost. This error may be caused by a failure of your computer hardware or
network connection. Please try to save this file elsewhere."
Over on the Solaris side, we see a core dump similar to the one
that Nicholas Lee reported:
# cat /etc/release
Solaris Nevada snv_60 X86
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 12 March 2007
# tail -20l /var/svc/log/system-iscsitgt\:default.log
[ Apr 5 22:31:56 Executing start method ("/lib/svc/method/svc-iscsitgt start")
]
[ Apr 5 22:31:57 Method "start" exited with status 0 ]
[ Apr 16 13:31:06 Stopping because process dumped core. ]
[ Apr 16 13:31:08 Executing stop method ("/lib/svc/method/svc-iscsitgt stop
72") ]
[ Apr 16 13:31:08 Method "stop" exited with status 0 ]
[ Apr 16 13:31:09 Executing start method ("/lib/svc/method/svc-iscsitgt start")
]
[ Apr 16 13:31:10 Method "start" exited with status 0 ]
[ Apr 16 13:32:21 Stopping because process dumped core. ]
[ Apr 16 13:32:21 Executing stop method ("/lib/svc/method/svc-iscsitgt stop
128") ]
[ Apr 16 13:32:21 Method "stop" exited with status 0 ]
[ Apr 16 13:32:21 Executing start method ("/lib/svc/method/svc-iscsitgt start")
]
[ Apr 16 13:32:21 Method "start" exited with status 0 ]
[ Apr 16 13:33:12 Stopping because process dumped core. ]
[ Apr 16 13:33:12 Executing stop method ("/lib/svc/method/svc-iscsitgt stop
130") ]
[ Apr 16 13:33:12 Method "stop" exited with status 0 ]
[ Apr 16 13:33:12 Executing start method ("/lib/svc/method/svc-iscsitgt start")
]
[ Apr 16 13:33:12 Method "start" exited with status 0 ]
# cd /var/crash/solaris/
# ls -l
total 324416
-rw------- 1 root root 18352886831 Apr 16 13:32 core.iscsitgtd.10030
-rw------- 1 root root 18353079287 Apr 16 13:33 core.iscsitgtd.10040
-rw------- 1 root root 15627199 Apr 16 13:31 core.iscsitgtd.471
# mdb core.iscsitgtd.471
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x19()
libc.so.1`abort+0xd1()
libc.so.1`_assert+0x74()
t10_cmd_state_machine+0x15b()
t10_handle_destroy+0xd1()
sess_from_t10+0xad()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10030
mdb: core file data for mapping at fffffd7bb8400000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10040
mdb: core file data for mapping at fffffd7bb8600000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
I am still using the 'debug' version of the Microsoft initiator, and it logged
alot of output
in the window of 'DebugView'. I've included this below, in case it gives any
clues,
but unfortunately it may only make sense to the author of the initiator code!
00001711 485882.09375000 ResetLU : Session - F9358000, Request Count -
56
00001712 485882.09375000 Will replace CmdSN e3549d with ExpCmdSN e35465
00001713 485882.09375000 Resets TR = 0 LR = 1 for F9358000 Lun = 0
00001714 485882.12500000 Session F9358000, completed zombie request
F9D43000 with srb status 0xe
00001715 485882.12500000 Session F9358000, completed zombie request
F9D430D8 with srb status 0xe
00001716 485882.12500000 Session F9358000, completed zombie request
F9D431B0 with srb status 0xe
<--snip-Many similar lines-->
00001767 485882.12500000 Session F9358000, completed zombie request
F9D462A0 with srb status 0xe
00001768 485882.12500000 Session F9358000, completed zombie request
F9D46378 with srb status 0xe
00001769 485882.12500000 Session F9358000, completed zombie request
F9D46450 with srb status 0xe
00001770 485907.09375000 ResetLU : Session - F9358000, Request Count -
63
00001771 485907.09375000 Resets TR = 0 LR = 2 for F9358000 Lun = 0
00001772 485927.09375000 Will free F8CBFC30 after send completes.
Function 7
00001773 485927.09375000 Session F9358000, completed zombie request
F9D43000 with srb status 0xe
00001774 485927.09375000 Session F9358000, completed zombie request
F9D430D8 with srb status 0xe
<--snip-Many similar lines-->
00001834 485927.09375000 Session F9358000, completed zombie request
F9D46378 with srb status 0xe
00001835 485927.09375000 Session F9358000, completed zombie request
F9D46450 with srb status 0xe
00001836 485927.09375000 At TMF timeout completing reset srb FF53F360
for Session F9358000
00001837 485927.75000000 Processing unexpected PDU for F949E6E0
00001838 485952.15625000 ResetLU : Session - F9358000, Request Count - 1
00001839 485952.15625000 Will replace CmdSN e354a5 with ExpCmdSN e3549d
00001840 485952.15625000 Resets TR = 0 LR = 3 for F9358000 Lun = 0
00001841 485952.15625000 Invalid header F949E914
00001842 485952.15625000 Ignoring receive. Connection state 2,
ProtocolState 14
00001843 485952.15625000 Unable to process data given. Available 17,
Indicated 17
00001844 485952.15625000 ResetLU : Session - F9358000, Request Count - 1
00001845 485952.15625000 Free All Entries for F949E6E0: Found entry
F8C90E30 (7)
00001846 485952.15625000 AR FA020950 for timeout. Function 7
00001847 485952.15625000 At TMF timeout completing reset srb FFB968A0
for Session F9358000
00001848 485952.15625000 Premature Termination : Okay to remove
connection F949E6E0
00001849 485952.15625000 Premature Termination : Okay to remove session
F9358000
00001850 485953.62500000 [1320] 04/16/2007 12:32: 1.00000377:[ 528.
574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b
00001851 485967.12500000 Disconnect time NOT exceeded. Srb F8C2F1D4
[0x2a]
00001852 485977.15625000 Disconnect time NOT exceeded. Srb F8FD63E4
[0x2a]
00001853 485983.12500000 FF64D3C8 timed out waiting for relogin response
00001854 485983.12500000 Will queue login worker on timeout for FF64D3C8
00001855 485984.18750000 [1320] 04/16/2007 12:32:32.00000001:[ 528.
574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b
00001856 485984.18750000 Relogin. Reconnect count - 1
00001857 486017.18750000 ResetLU : Session - F9358000, Request Count - 1
00001858 486017.18750000 Resets TR = 0 LR = 1 for F9358000 Lun = 0
00001859 486022.31250000 RaidSendIrpSynchronous (FA478C78) irp FF5AB340
did not complete within 1e seconds
00001860 486035.12500000 Will retry lost connection F9206B70 (9). ERL 0
00001861 486035.12500000 ResetLU : Session - F9358000, Request Count - 1
00001862 486035.12500000 Free All Entries for F9206B70: Found entry
8177AE20 (7)
00001863 486035.12500000 AR F8CBFC30 for timeout. Function 7
00001864 486035.12500000 At TMF timeout completing reset srb FF0D44B0
for Session F9358000
00001865 486035.12500000 Premature Termination : Okay to remove
connection F9206B70
00001866 486035.12500000 Premature Termination : Okay to remove session
F9358000
00001867 486035.12500000 [1320] 04/16/2007 12:33:22.00000944:[ 528.
574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b
00001868 486035.56250000 Relogin. Reconnect count - 1
00001869 486065.56250000 RaidSendIrpSynchronous (FA478C78) irp FF5AB340
did not complete within 1e seconds
00001870 486068.18750000 ResetLU : Session - F9358000, Request Count -
63
00001871 486068.18750000 Will replace CmdSN e354e1 with ExpCmdSN e354a4
00001872 486068.18750000 Resets TR = 0 LR = 1 for F9358000 Lun = 0
00001873 486068.18750000 Failed to match unit FA478D30
00001874 486086.46875000 Will retry lost connection F9D2EB70 (9). ERL 0
00001875 486086.46875000 ResetLU : Session - F9358000, Request Count -
63
00001876 486086.46875000 Free All Entries for F9D2EB70: Found entry
FF5E11F8 (7)
00001877 486086.46875000 Will free F9115780 after send completes.
Function 7
00001878 486086.46875000 At TMF timeout completing reset srb FF0D44B0
for Session F9358000
00001879 486086.46875000 Will Not process request recovery in progress
for F9358000
00001880 486086.46875000 Disconnect time NOT exceeded. Srb FA2ED2C4
[0x2a]
00001881 486086.46875000 Will Not process request recovery in progress
for F9358000
00001882 486086.46875000 Disconnect time NOT exceeded. Srb FA2ED2C4
[0x2a]
<--snip - 100's of similar lines-->
00053756 486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4
[0x2a]
00053757 486087.65625000 Will Not process request recovery in progress
for F9358000
00053758 486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4
[0x2a]
00053759 486087.65625000 Will Not process request recovery in progress
for F9358000
00053760 486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4
[0x2a]
02603103 486146.75000000 Relogin. Reconnect count - 1
02603104 486147.34375000 [1320] 04/16/2007 12:35:14.00000564:[ 528.
574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b
<--Followed by repeating attempts to re-login, that fail-->
Rick - have you any thoughts on what may be happening here?
Is there anything further I can do to assist?
I will try to see if I can trigger the problem again while capturing packets,
but that will be difficult!
Thanks
Nigel Smith
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss