Re: NFS hard semantics wanted: how to?

2011-01-01 Thread Mike Christie

On 01/01/2011 11:06 PM, Mike Christie wrote:


What kernel version are you using? We have exactly that :) If you set it
to -1 then you get your infinite timeout.



Oh yeah, this was added in 2.6.33.
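For completeness, the setting can be applied per-node with iscsiadm or changed on a live session through sysfs. A hedged sketch (the target name, portal, and session number below are placeholders, not values from this thread):

```shell
# Sketch: set the replacement timeout to -1 (never give up).
# Target IQN, portal address, and session number are placeholders.
iscsiadm -m node -T iqn.2011-01.example:target0 -p 192.168.0.1 \
    -o update -n node.session.timeo.replacement_timeout -v -1

# Or adjust a live session via sysfs:
echo -1 > /sys/class/iscsi_session/session1/recovery_tmo
```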

--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: NFS hard semantics wanted: how to?


2011-01-01 Thread Mike Christie

On 12/30/2010 10:09 PM, torn5 wrote:

On 12/22/2010 07:09 PM, Mike Christie wrote:

On 12/22/2010 05:57 AM, torn5 wrote:

Hello open-iscsi people
I am approaching iscsi, and I am currently doing some "reliability"
tests.

In particular I would like to be able to reboot the target machine
without the initiators losing data, like NFS hard mounts.

[CUT]

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117:
ext4_put_super:719
They look harmful...


Could you send the rest of your /var/log/messages? It should have some
scsi error code info and block layer error info.

Could you also turn on iscsi eh debugging



Hello Mike,
sorry for the delay in replying; I was running zillions of tests...

The error I reported was spurious; everything was actually fine.
It was a new, misleading ext4 feature: 300 seconds after mount it
reports the last errors seen on that filesystem (left over from my
earlier tests), and you can clear that error log only by paying money
to Ted Ts'o
just kidding


:)


the log would have been cleared if I had a newer version of fsck.ext4.
I was seeing those errors spat out during my disconnection tests and
I thought they were due to the disconnections, but they were just an old
log.

The replacement timeout thing works flawlessly; my congratulations on
this excellent piece of software, and thanks for all the information.

Just a few more questions:

1- Can I raise the number of SCSI resubmissions from 5, possibly by
recompiling the kernel? Do you know, and could you tell me, where that
number is defined? I grepped the sources, but there are too many values
and I'm not sure which is the right one.


It is controlled by the scsi layer and it is hardcoded in 
drivers/scsi/sd.h's SD_MAX_RETRIES definition.
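
For reference, the definition looks roughly like this in kernels of that
era (a sketch, not verbatim; the surrounding code varies by version):

```c
/* drivers/scsi/sd.h -- sketch of the relevant line */
#define SD_MAX_RETRIES 5
```

Since it is a compile-time constant, raising it means rebuilding sd_mod.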




2- Wouldn't it be better to have a separate error count for network
errors? I would raise that one. Why should a network error eat retries
meant for scsi errors? Is it the scsi standard that mandates treating
network failures and disk failures equally? It seems strange/unwise to me...


It is just a generic counter that says: if the command does not complete 
in 5 retries, then fail it. I don't think the value of 5 is based on 
anything specific to disks or network behavior (FC drivers, for 
example, do something similar for SAN problems). I think it is just based 
on past experience: if the IO is retryable but does not complete in 5 
retries for any reason, it is not going to complete.


It seems rare that a command would get 3 network errors and 2 disk 
errors, so I do not think this has come up before.


When the iscsi driver was being submitted a long time ago, there was 
code to make the retries configurable, but it got rejected. And I think 
on the linux-scsi list every so often someone sends a patch to make the 
retries configurable from sysfs, but it does not get picked up. If I can 
find the posts I will send them; I think there is more info in them.





3- this is kind of a bug report / feature request:
I wanted to raise replacement_tmo (via sysfs) to a very high value, but
it wrapped around. The limit seems to be 2**31/HZ; after that it wraps.
It doesn't tell you anything immediately, but at the first network
disconnection it expires instantly, as if it were below zero.
Hence, with HZ=1000 the max is about 24 days.
It might sound crazy, but I would like higher values. The thing is, we
have (virtual) machines with almost-abandoned services, and if those
freeze for 24 days we might not notice, and then we could start getting
errors and potentially filesystem corruption. I would like an
infinite timeout, i.e. a magic value that makes the counter never
expire. Since you seem to be using a signed value and 0 is already used
for no-timeout, -1


What kernel version are you using? We have exactly that :) If you set it 
to -1 then you get your infinite timeout.





Re: NFS hard semantics wanted: how to?

2010-12-30 Thread torn5

On 12/22/2010 07:09 PM, Mike Christie wrote:

On 12/22/2010 05:57 AM, torn5 wrote:

Hello open-iscsi people
I am approaching iscsi, and I am currently doing some "reliability" 
tests.


In particular I would like to be able to reboot the target machine
without the initiators losing data, like NFS hard mounts.

[CUT]

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: 
ext4_put_super:719

They look harmful...

Could you send the rest of your /var/log/messages? It should have some 
scsi error code info and block layer error info.


Could you also turn on iscsi eh debugging



Hello Mike,
sorry for the delay in replying; I was running zillions of tests...

The error I reported was spurious; everything was actually fine.
It was a new, misleading ext4 feature: 300 seconds after mount it 
reports the last errors seen on that filesystem (left over from my 
earlier tests), and you can clear that error log only by paying money to
Ted Ts'o

just kidding
the log would have been cleared if I had a newer version of fsck.ext4.
I was seeing those errors spat out during my disconnection tests and 
I thought they were due to the disconnections, but they were just an old log.


The replacement timeout thing works flawlessly; my congratulations on 
this excellent piece of software, and thanks for all the information.


Just a few more questions:

1- Can I raise the number of SCSI resubmissions from 5, possibly by 
recompiling the kernel? Do you know, and could you tell me, where that 
number is defined? I grepped the sources, but there are too many values 
and I'm not sure which is the right one.


2- Wouldn't it be better to have a separate error count for network 
errors? I would raise that one. Why should a network error eat retries 
meant for scsi errors? Is it the scsi standard that mandates treating 
network failures and disk failures equally? It seems strange/unwise to me...


3- this is kind of a bug report / feature request:
I wanted to raise replacement_tmo (via sysfs) to a very high value, but 
it wrapped around. The limit seems to be 2**31/HZ; after that it wraps. 
It doesn't tell you anything immediately, but at the first network 
disconnection it expires instantly, as if it were below zero.

Hence, with HZ=1000 the max is about 24 days.
It might sound crazy, but I would like higher values. The thing is, we 
have (virtual) machines with almost-abandoned services, and if those 
freeze for 24 days we might not notice, and then we could start getting 
errors and potentially filesystem corruption. I would like an 
infinite timeout, i.e. a magic value that makes the counter never 
expire. Since you seem to be using a signed value and 0 is already used 
for no-timeout, -1 could be a good value IMHO. Or else use a 64-bit 
value, or compute it another way so that HZ is applied later and does 
not make it wrap around (at that point one could enter about 68 years, 
which would be enough).
I tried to look at the source to patch it myself, but that value is 
passed around a lot and I couldn't really track where it was going.
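
The arithmetic behind the 24-day figure checks out; a quick sketch
(HZ=1000 is an assumption about the kernel configuration):

```python
# The timeout is held in jiffies in a signed 32-bit quantity, so the
# largest representable value is 2**31 / HZ seconds before it wraps.
HZ = 1000  # assumed CONFIG_HZ; 100 and 250 are also common choices
max_seconds = 2**31 // HZ
max_days = max_seconds / 86400
print(max_seconds)         # 2147483
print(round(max_days, 1))  # 24.9
```

With HZ=100 the ceiling would be ten times higher, roughly 248 days.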


Thanks for your help




Re: NFS hard semantics wanted: how to?

2010-12-22 Thread Mike Christie

On 12/22/2010 05:57 AM, torn5 wrote:

Hello open-iscsi people
I am approaching iscsi, and I am currently doing some "reliability" tests.

In particular I would like to be able to reboot the target machine
without the initiators losing data, like NFS hard mounts.

If the target goes down:
1) I want the device to be frozen so that applications get stuck while
trying to access the device,
2) and when the target comes up again, I want the in-flight commands to
be replayed to the target so that no data is lost.

I was able to obtain part 1 by increasing the replacement_timeout to a
high enough value.

However, it seems I cannot obtain part 2, because there are still errors
in dmesg. I think this is due to the lost in-flight commands (my
guess from what's written in the README).

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719
They look harmful...


Firstly, I don't understand why open-iscsi does not requeue in-flight
commands by itself as soon as it blocks the device on a lost connection.


This is what is done. When the connection problem (caused by the target 
reset in your case) is detected, the iscsi layer blocks the devices. 
Then it fails the IO that was running up to the scsi layer with a status 
that tells the scsi layer to requeue if it can (for tape and passthrough 
like sg io you cannot retry, but for disk commands you would retry up to 
5 times). Because the devices are blocked, all new IO and requeued IO 
then sits in the queue until we unblock the device queue.


If the replacement timeout expires before we can reconnect, the device 
is unblocked, everything in the device queue is failed, and any new IO 
is failed.


If we reconnect within the replacement timeout period, we unblock the 
queue, run the IO that was requeued during problem detection, and then 
run the new IOs.
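
The two outcomes can be summed up with a small sketch (plain Python, not
kernel code; names are illustrative only):

```python
# Hedged sketch of the recovery flow described above: on a connection
# problem the session blocks the device queue and fails running commands
# back up to the SCSI layer, which requeues retryable disk commands; the
# queue then drains one of two ways.

def settle_queued_io(reconnected_within_tmo, queued_io):
    """Return (command, outcome) pairs for IO held in a blocked queue."""
    if reconnected_within_tmo:
        # Queue unblocked after reconnect: requeued IO runs first,
        # then new IO, and everything completes normally.
        return [(cmd, "completed") for cmd in queued_io]
    # Replacement timeout expired: the queue is unblocked only to fail
    # everything in it, and new IO fails immediately as well.
    return [(cmd, "failed") for cmd in queued_io]

print(settle_queued_io(True, ["write-A", "write-B"]))
# [('write-A', 'completed'), ('write-B', 'completed')]
```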




It seems like the braindead-obvious solution to me. Then, if the
replacement_timeout expires, all commands (in-flight and queued) should
be failed together to the layer above. I don't understand why they


They don't.


should get a different treatment.


Secondly, I read in the docs that SCSI commands are retried 5 times.
OK, good! Then I don't understand why ext4 still sees data loss. I was
doing cycles of
...
stop target service
wait 15 secs
start target service
wait 15 secs
...
(meanwhile, the initiator is untarring tens of thousands of files
from a kernel tarball in an endless loop)


In just 15 seconds I cannot believe the scsi commands could really fail
5 times; that would be a 3-second timeout, which is too low...



Could you send the rest of your /var/log/messages? It should have some 
scsi error code info and block layer error info.


Could you also turn on iscsi eh debugging

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh

before running your test? That sends more logging info to /var/log/messages.



And also, when the SCSI layer resubmits the command (second submission),
the device is blocked, so the command should get stuck in the queue and
stay there until the connection is recovered (assuming a high enough
replacement_timeout), so the commands should not fail more than once.
Then why the errors?

I have even increased /sys/block/sdX/device/timeout to a very high
value. That's the SCSI timeout, isn't it?


That timeout only checks whether a command that has been sent to the 
driver completes within that time. When the problem is detected and we 
requeue IO, the timer is halted as the command is requeued; when the IO 
is restarted, the timer is reset.
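
So two different knobs are involved here. Roughly (sdX and the session
number below are placeholders):

```shell
# Per-command SCSI timeout: how long a dispatched command may run
# before the error handler fires.
cat /sys/block/sdX/device/timeout

# iSCSI replacement timeout: how long a dead session may stay blocked
# before queued and new IO is failed upward.
cat /sys/class/iscsi_session/session1/recovery_tmo
```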

