Janwar Dinata wrote:
> Hi Mike,

Hey,

> I've been playing some more with the node.session.timeo.replacement_timeout =
> x.
> A good x for me is at least 30s.
>
> The test that I am doing is creating and deleting a target for at least 10
> minutes while traffic is running at the same time.
> I can adjust the timeout value that will allow the test to run without
> disktest reporting io errors on me.
>
> Attached is the what initiator /var/log/messages that during both successful
> and failing disktest.
> If everything on the messages file is as expected, then I think your last
> patch fix the problem I had.
>

Sorry about that. I am going to add what these common errors mean to the README so it is clear.

Dec 5 11:49:57 computeB014a kernel: connection5:0: ping timeout of 15 secs expired, last rx 86492733, last ping 86502733, now 86517751

This just means we sent a nop and did not get a response within 15 seconds. It is expected if the target is gone and is not failing the nop, or if you did something like a cable pull where nops are not going to reach the target.

Dec 5 11:49:57 computeB014a kernel: connection5:0: detected conn error (1011)
Dec 5 11:49:58 computeB014a iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)

This is just an indication that the connection was failed as a result of the nop timing out.

Dec 5 11:50:05 computeB014a iscsid: connection1:0 is operational after recovery (1 attempts)

We logged back in.

Dec 5 11:55:50 computeB014a iscsid: Target requests logout within 1 seconds for connection

The target sent an AEN, and we sent a logout.

Dec 5 11:55:53 computeB014a iscsid: connect failed (111)

We sent the logout and tried to log in again. We could not, because the target was not there.

Dec 5 11:56:09 computeB014a kernel: session4: session recovery timed out after 20 secs

This might concern you. It means that we tried to reconnect for 20 seconds (the replacement_timeout) but could not log back in, so we are going to fail any IO that is running and any new IO.
You would want a longer timeout to make sure that IO which could be sent, or is running at this time, does not get failed.

Dec 5 11:56:31 computeB014a iscsid: connection8:0 is operational after recovery (10 attempts)

The target came back and we logged back in.

Dec 5 11:56:35 computeB014a iscsid: iSCSI daemon with pid=21738 started!
Dec 5 11:56:35 computeB014a iscsid: iscsid shutting down.
Dec 5 11:56:35 computeB014a kernel: Loading iSCSI transport class v2.0-870.
Dec 5 11:56:35 computeB014a kernel: iscsi: registered transport (tcp)

It looks like you restarted the iscsi service. You normally do not want to do this while sessions are running and disks are in use, because it will force any IO that is running or queued to fail. Later in the log it looks like you deleted targets and restarted the iscsi service some more.

Dec 5 12:17:30 computeB014a iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
Dec 5 12:17:49 computeB014a kernel: session2: session recovery timed out after 20 secs
Dec 5 12:17:49 computeB014a kernel: sd 47:0:0:0: SCSI error: return code = 0x00010000

This is what you want to be concerned with. If the replacement/recovery timeout is too short, you will get these errors, which mean that the IO is going to be failed. If you are using dm-multipath this is fine, but if you have an application such as a filesystem using this disk directly, this will cause errors.

> Thanks
> Janwar
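As a rough aid for reading logs like the ones above, here is a small Python sketch that tags each kind of message discussed in this thread with a short paraphrase of its meaning, and measures the gap between two syslog lines. The regular expressions are assumptions written only from the sample lines quoted here, not a complete list of open-iscsi messages, and the names `PATTERNS`, `explain`, `timestamp` and `seconds_between` are hypothetical. Syslog timestamps carry no year, so a placeholder year is used; differences are only meaningful for lines from the same day.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical lookup table: each regex matches one kind of message discussed
# in this thread and maps it to a one-line paraphrase of the explanation.
# The patterns were written only from the sample lines quoted in the thread.
PATTERNS = [
    (re.compile(r"ping timeout of \d+ secs expired"),
     "nop sent, no response within the ping timeout"),
    (re.compile(r"error \(1011\)"),
     "connection failed because the nop timed out"),
    (re.compile(r"is operational after recovery"),
     "logged back in to the target"),
    (re.compile(r"Target requests logout"),
     "target sent an AEN asking for logout; initiator logs out"),
    (re.compile(r"connect failed \(111\)"),
     "relogin failed: errno 111 (ECONNREFUSED on Linux), target not there"),
    (re.compile(r"session recovery timed out after \d+ secs"),
     "replacement_timeout expired: running and new IO will be failed"),
    (re.compile(r"SCSI error: return code = 0x00010000"),
     "IO was failed back to the layer above the disk"),
    (re.compile(r"iscsid shutting down|iSCSI daemon with pid=\d+ started"),
     "iscsi service restarted: running or queued IO may be failed"),
]


def explain(line: str) -> Optional[str]:
    """Return the paraphrased meaning of the first matching pattern, if any."""
    for pattern, meaning in PATTERNS:
        if pattern.search(line):
            return meaning
    return None


def timestamp(line: str) -> datetime:
    """Parse the leading 'Mon D HH:MM:SS' syslog stamp of a line."""
    month, day, clock = line.split()[:3]
    # Syslog omits the year; 2000 is an arbitrary placeholder so that
    # differences between lines from the same day still come out right.
    return datetime.strptime(f"2000 {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds from the first log line to the second."""
    return (timestamp(second) - timestamp(first)).total_seconds()


SAMPLE_LOG = [
    "Dec 5 11:55:50 computeB014a iscsid: Target requests logout within 1 seconds for connection",
    "Dec 5 11:55:53 computeB014a iscsid: connect failed (111)",
    "Dec 5 11:56:09 computeB014a kernel: session4: session recovery timed out after 20 secs",
    "Dec 5 11:56:31 computeB014a iscsid: connection8:0 is operational after recovery (10 attempts)",
]

if __name__ == "__main__":
    for entry in SAMPLE_LOG:
        print(" ".join(entry.split()[:3]), "->", explain(entry))
    print("logout to recovery:", seconds_between(SAMPLE_LOG[0], SAMPLE_LOG[-1]), "s")
```

On the sample lines, the gap from the 11:55:50 logout request to the 11:56:31 recovery comes out to 41 seconds, roughly twice the 20 second replacement_timeout in use, which is consistent with the "session recovery timed out" line in between. That reading assumes those lines belong to the same session; the session and connection numbers in these lines do not match up directly, so check that before relying on it.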