I've been doing some further testing with the Microsoft iScsi initiator,
and the Solaris iScsi target. Everything seemed to be going ok,...until it 
crashed!

My test hardware has improved since I last reported.
I'm now using GigaBit Ethernet.
The Solaris snv_60 is running on a Pentium-D dual-core 3GHz, with 1GB RAM.
The initiator is running Windows Server 2003 R2 on a Pentium III, 930MHz, 512MB 
RAM.

I tried copying files using Windows Explorer, and did not have any problem.
So I tried again with my old favourite, the SQLIOStress utility from Microsoft.
I've gone back to using the old version 4.00.050, (non-GUI command line).
I did one run of 16 iterations overnight for about 8 hours, with no problems.
I set it going again as I went to work this morning, & when I arrive home
tonight it had crashed, on iteration 12 of 16.

Windows displays a message box titled "Windows - Delayed Write Failed" saying:
"Windows was unable to save all the data for the file S:\$Mft.  The data has 
been lost. This error may be caused by a failure of your computer hardware or 
network connection. Please try to save this file elsewhere."

Over on the Solaris side, we see a core dump similar to the one
that Nicholas Lee reported:

# cat /etc/release
                            Solaris Nevada snv_60 X86
           Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 12 March 2007
# tail -20l /var/svc/log/system-iscsitgt\:default.log
[ Apr  5 22:31:56 Executing start method ("/lib/svc/method/svc-iscsitgt start") 
]
[ Apr  5 22:31:57 Method "start" exited with status 0 ]
[ Apr 16 13:31:06 Stopping because process dumped core. ]
[ Apr 16 13:31:08 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 
72") ]
[ Apr 16 13:31:08 Method "stop" exited with status 0 ]
[ Apr 16 13:31:09 Executing start method ("/lib/svc/method/svc-iscsitgt start") 
]
[ Apr 16 13:31:10 Method "start" exited with status 0 ]
[ Apr 16 13:32:21 Stopping because process dumped core. ]
[ Apr 16 13:32:21 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 
128") ]
[ Apr 16 13:32:21 Method "stop" exited with status 0 ]
[ Apr 16 13:32:21 Executing start method ("/lib/svc/method/svc-iscsitgt start") 
]
[ Apr 16 13:32:21 Method "start" exited with status 0 ]
[ Apr 16 13:33:12 Stopping because process dumped core. ]
[ Apr 16 13:33:12 Executing stop method ("/lib/svc/method/svc-iscsitgt stop 
130") ]
[ Apr 16 13:33:12 Method "stop" exited with status 0 ]
[ Apr 16 13:33:12 Executing start method ("/lib/svc/method/svc-iscsitgt start") 
]
[ Apr 16 13:33:12 Method "start" exited with status 0 ]
# cd /var/crash/solaris/
# ls -l
total 324416
-rw-------   1 root     root     18352886831 Apr 16 13:32 core.iscsitgtd.10030
-rw-------   1 root     root     18353079287 Apr 16 13:33 core.iscsitgtd.10040
-rw-------   1 root     root     15627199 Apr 16 13:31 core.iscsitgtd.471
# mdb core.iscsitgtd.471
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x19()
libc.so.1`abort+0xd1()
libc.so.1`_assert+0x74()
t10_cmd_state_machine+0x15b()
t10_handle_destroy+0xd1()
sess_from_t10+0xad()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10030
mdb: core file data for mapping at fffffd7bb8400000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>
# mdb core.iscsitgtd.10040
mdb: core file data for mapping at fffffd7bb8600000 not saved: I/O error
Loading modules: [ libumem.so.1 libavl.so.1 libc.so.1 libnvpair.so.1 
libuutil.so.1 ld.so.1 ]
> $c
libc.so.1`memcpy+0x199()
trans_rqst_dataout+0x127()
sbc_write+0x3cf()
sbc_cmd+0x21()
lu_runner+0x7bc()
libc.so.1`_thr_setup+0x67()
libc.so.1`_lwp_start()
>

I am still using the 'debug' version of the Microsoft initiator, and it logged 
alot of output 
in the window of 'DebugView'.  I've included this below, in case it gives any 
clues,
but unfortunately it may only make sense to the author of the initiator code!

00001711        485882.09375000 ResetLU : Session - F9358000, Request Count - 
56        
00001712        485882.09375000 Will replace CmdSN e3549d with ExpCmdSN e35465  
00001713        485882.09375000 Resets TR = 0 LR = 1 for F9358000 Lun = 0       
00001714        485882.12500000 Session F9358000, completed zombie request 
F9D43000 with srb status 0xe 
00001715        485882.12500000 Session F9358000, completed zombie request 
F9D430D8 with srb status 0xe 
00001716        485882.12500000 Session F9358000, completed zombie request 
F9D431B0 with srb status 0xe 
<--snip-Many similar lines-->
00001767        485882.12500000 Session F9358000, completed zombie request 
F9D462A0 with srb status 0xe 
00001768        485882.12500000 Session F9358000, completed zombie request 
F9D46378 with srb status 0xe 
00001769        485882.12500000 Session F9358000, completed zombie request 
F9D46450 with srb status 0xe 
00001770        485907.09375000 ResetLU : Session - F9358000, Request Count - 
63        
00001771        485907.09375000 Resets TR = 0 LR = 2 for F9358000 Lun = 0       
00001772        485927.09375000 Will free F8CBFC30 after send completes. 
Function 7     
00001773        485927.09375000 Session F9358000, completed zombie request 
F9D43000 with srb status 0xe 
00001774        485927.09375000 Session F9358000, completed zombie request 
F9D430D8 with srb status 0xe 
<--snip-Many similar lines-->
00001834        485927.09375000 Session F9358000, completed zombie request 
F9D46378 with srb status 0xe 
00001835        485927.09375000 Session F9358000, completed zombie request 
F9D46450 with srb status 0xe 
00001836        485927.09375000 At TMF timeout completing reset srb FF53F360 
for Session F9358000       
00001837        485927.75000000 Processing unexpected PDU for F949E6E0  
00001838        485952.15625000 ResetLU : Session - F9358000, Request Count - 1 
00001839        485952.15625000 Will replace CmdSN e354a5 with ExpCmdSN e3549d  
00001840        485952.15625000 Resets TR = 0 LR = 3 for F9358000 Lun = 0       
00001841        485952.15625000 Invalid header F949E914 
00001842        485952.15625000 Ignoring receive. Connection state 2, 
ProtocolState 14  
00001843        485952.15625000 Unable to process data given. Available 17, 
Indicated 17        
00001844        485952.15625000 ResetLU : Session - F9358000, Request Count - 1 
00001845        485952.15625000 Free All Entries for F949E6E0: Found entry 
F8C90E30 (7) 
00001846        485952.15625000 AR FA020950 for timeout. Function 7     
00001847        485952.15625000 At TMF timeout completing reset srb FFB968A0 
for Session F9358000       
00001848        485952.15625000 Premature Termination : Okay to remove 
connection F949E6E0      
00001849        485952.15625000 Premature Termination : Okay to remove session 
F9358000 
00001850        485953.62500000 [1320] 04/16/2007 12:32: 1.00000377:[     528.  
   574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b        
00001851        485967.12500000 Disconnect time NOT exceeded. Srb F8C2F1D4 
[0x2a]       
00001852        485977.15625000 Disconnect time NOT exceeded. Srb F8FD63E4 
[0x2a]       
00001853        485983.12500000 FF64D3C8 timed out waiting for relogin response 
00001854        485983.12500000 Will queue login worker on timeout for FF64D3C8 
00001855        485984.18750000 [1320] 04/16/2007 12:32:32.00000001:[     528.  
   574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b        
00001856        485984.18750000 Relogin. Reconnect count - 1    
00001857        486017.18750000 ResetLU : Session - F9358000, Request Count - 1 
00001858        486017.18750000 Resets TR = 0 LR = 1 for F9358000 Lun = 0       
00001859        486022.31250000 RaidSendIrpSynchronous (FA478C78) irp FF5AB340 
did not complete within 1e seconds       
00001860        486035.12500000 Will retry lost connection F9206B70 (9). ERL 0  
00001861        486035.12500000 ResetLU : Session - F9358000, Request Count - 1 
00001862        486035.12500000 Free All Entries for F9206B70: Found entry 
8177AE20 (7) 
00001863        486035.12500000 AR F8CBFC30 for timeout. Function 7     
00001864        486035.12500000 At TMF timeout completing reset srb FF0D44B0 
for Session F9358000       
00001865        486035.12500000 Premature Termination : Okay to remove 
connection F9206B70      
00001866        486035.12500000 Premature Termination : Okay to remove session 
F9358000 
00001867        486035.12500000 [1320] 04/16/2007 12:33:22.00000944:[     528.  
   574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b        
00001868        486035.56250000 Relogin. Reconnect count - 1    
00001869        486065.56250000 RaidSendIrpSynchronous (FA478C78) irp FF5AB340 
did not complete within 1e seconds       
00001870        486068.18750000 ResetLU : Session - F9358000, Request Count - 
63        
00001871        486068.18750000 Will replace CmdSN e354e1 with ExpCmdSN e354a4  
00001872        486068.18750000 Resets TR = 0 LR = 1 for F9358000 Lun = 0       
00001873        486068.18750000 Failed to match unit FA478D30   
00001874        486086.46875000 Will retry lost connection F9D2EB70 (9). ERL 0  
00001875        486086.46875000 ResetLU : Session - F9358000, Request Count - 
63        
00001876        486086.46875000 Free All Entries for F9D2EB70: Found entry 
FF5E11F8 (7) 
00001877        486086.46875000 Will free F9115780 after send completes. 
Function 7     
00001878        486086.46875000 At TMF timeout completing reset srb FF0D44B0 
for Session F9358000       
00001879        486086.46875000 Will Not process request recovery in progress 
for F9358000      
00001880        486086.46875000 Disconnect time NOT exceeded. Srb FA2ED2C4 
[0x2a]       
00001881        486086.46875000 Will Not process request recovery in progress 
for F9358000      
00001882        486086.46875000 Disconnect time NOT exceeded. Srb FA2ED2C4 
[0x2a]       
<--snip - 100's of similar lines-->
00053756        486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4 
[0x2a]       
00053757        486087.65625000 Will Not process request recovery in progress 
for F9358000      
00053758        486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4 
[0x2a]       
00053759        486087.65625000 Will Not process request recovery in progress 
for F9358000      
00053760        486087.65625000 Disconnect time NOT exceeded. Srb FA2ED2C4 
[0x2a]       
02603103        486146.75000000 Relogin. Reconnect count - 1    
02603104        486147.34375000 [1320] 04/16/2007 12:35:14.00000564:[     528.  
   574] iscsidsc: Did not find target portal 172.16.8.38/3260 0xefff000b        
<--Followed by repeating attempts to re-login, that fail-->

Rick - have you any thoughts on what may be happening here?
Is there anything further I can do to assist?
I will try to see if I can trigger the problem again while capturing packets,
but that will be difficult!
Thanks
Nigel Smith
 
 
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss

Reply via email to