>>>>> "jb" == Jeff Bonwick <[EMAIL PROTECTED]> writes: >>>>> "rmc" == Ricardo M Correia <[EMAIL PROTECTED]> writes:
jb> We need a little more Code of Hammurabi in the storage
jb> industry.
It seems like most of the work people have to do now is cleaning up
after the sloppiness of others. At least, that's what takes the longest.
You could always mention which disks you found ignoring the
command---wouldn't that help the overall problem? I understand
there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but
I don't understand where it comes from.
http://www.ferris.edu/news/jimcrow/tom/
jb> displacement flush for disk caches that ignore the sync
jb> command.
Sounds like a good idea but:
(1) won't this break the NFS guarantees you were just saying should
never be broken?
I get it, someone else is breaking a standard so how can ZFS be
expected to yadda yadda yadda.  But I fear it will just push
``blame the sysadmin'' one step further out.  ex., Q. ``with ZFS
all my NFS clients become unstable after the server reboots,'' or
``I'm getting silent corruption with NFS''.  A. ``your drives
might have gremlins in them, no way to know,'' and ``well what do
you expect without a single integrity domain and TCP's weak
checksums. / no i'm using a crossover cable, and FCS is not
weak. / without ZFS managing a layer of redundancy it is probably
your RAM, or corruption on the, uh, path between the Ethernet MAC
chip and the PCI slot''
(1a) I'm concerned about how it'll be reported when it happens.
(a) if it's not reported at all, then ZFS is hiding the fact
that fsync() is not working. Also, other journaling
filesystems sometimes report when they find
``unexpected'' corruption, which is useful for finding
both hardware and software problems.
I'm already concerned ZFS is not reporting enough, like
when it says a vdev component is ONLINE, but 'zpool
offline pool <component>' fails with 'no valid replicas',
and then after a scrub there is no change in zpool status,
yet zpool offline suddenly works again.
ZFS should not ``simplify'' the user interface to the
point that it's hiding problems with itself and its
environment just to avoid discussion.
(b) if it is reported, then whenever the reporter-blob
raises its hand it will have the effect of exonerating
ZFS in most people's minds, like the stupid CKSUM column
does right now. ``ZFS-FEED-B33F error? oh yeah that's
the new ueberblock search code. that means your disks
are ignoring the SYNCHRONIZE CACHE command. thank GOD
you have ZFS with ANY OTHER FILESYSTEM all bets would be
totally off. lucky you. / I have tried ten different
models from all four brands. / yeah sucks don't it?
flagrant violation of the standard, industry wide. / my
linux testing tool says they're obeying the command fine
/ linux is crap / i added a patch to solaris to block
the SYNC CACHE command and the disks got faster so I
think it's not being ignored / well the stack is
complicated and flushing happens at many levels, like
think about controller performance, and that's
completely unsupported you are doing something REALLY
UNSAFE there you should NOT DO THAT it is STUPID'' and
so on, stalling the actual fix literally for years.
The right way to exonerate ZFS is to make a diagnostic
tool for the disks that proves they're broken, and then
don't buy those disks. not to make a new class of ZFS
fault report that could potentially capture all kinds of
problems, then hazily assign blame to an untestable
quantity. (a sketch of such a tool follows this list.)
(2) disks are probably not the only thing dropping write barriers.
So far we also suspect (unproven!) iSCSI targets/initiators,
particularly around a TCP reconnection event or target reboot,
and VM stacks, both VirtualBox and the HVM in UltraSPARC T1.
probably other stuff too.
I'm concerned that the assumptions you'll find safe to make about
disks once you get started, like ``nothing is more than 1s
stale,'' or ``size the on-disk cache with a CDB, imagine it's a
FIFO, and assume it'll be no worse than that,'' or ``you can get
an fsync by pausing reads for 500ms,'' or whatever, will add
robustness for current and future broken disks but won't apply to
other kinds of broken storage layers.
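
(about that diagnosis tool: a crude version can be a one-liner.
a sketch, assuming GNU dd, a scratch file on the disk under test,
and a filesystem that actually turns O_DSYNC writes into cache
flushes:

    # each 512-byte write is followed by a cache flush, so an
    # honest 7200rpm disk tops out around 100-200 of them per
    # second.  thousands per second means some layer is eating
    # the flush.
    $ time dd if=/dev/zero of=/testfs/scratch bs=512 count=1000 oflag=dsync

it doesn't prove _which_ layer is lying, but it turns ``hazily
assign blame'' into a number.)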
rmc> However, it is not so resilient when the storage system
rmc> suffers hiccups which cause phantom writes to occur
rmc> continuously, even if for a small period of time (say less
rmc> than 10 seconds), and then return to normal.
ha! that is a great idea. temporal ditto blocks: Important writes
should be written, aged in RAM for 1 minute, then rewritten. :) This
will help with latent sector errors caused by power sag/vibration
too. but... even I will admit that at some point you have to give up
and let the filesystem get corrupted.
actually I'm more in the camp of making ZFS fragile to incorrect
storage stacks, and offering an offline recovery tool that treats the
corrupt pool as read-only and copies it into a new filesystem (so you
need a second same-size empty pool to use the tool). I like this
painful way better than fsck-like things, and much better than silent
workarounds. but i'm probably in the wrong camp on this one.
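
(concretely, I imagine the tool being used like this. 'zrescue'
is made up and does not exist; this is just the shape of it:

    # second, empty pool at least as big as the sick one
    $ zpool create rescue c3t0d0 c3t1d0
    # walk the corrupt pool strictly read-only, copy out everything
    # that still passes its checksums, log everything that doesn't
    $ zrescue tank rescue

nothing it does can make the patient worse, which is more than you
can say for fsck-like things.)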
My reasoning is, we will not ultimately be happy with a filesystem
where fsync() is broken and ``that's the best you can do.'' To
compete with Netapp, we need to bang on this thing until it's
actually working. So far I think sysadmins are receptive to the idea
that they need to fix <...> about their setup, or make purchases with
extreme care, or do testing before production. We are not lazy and do
not expect an appliance-on-a-CD.
it's just that pass-the-buck won't ever deliver something useful.
When ext3 was corrupting filesystems on laptops, ext3 got blamed, and
ext3 was not at the root of the problem. But no one _accepted_ that
ext3 was correctly coded until the overall problem was fixed. (IIRC
it was: you need to send drives a stop-unit command before the ACPI
powerdown, because even if they ignore synchronize-cache they do
still flush when told to stop-unit.)
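
(on linux that fix boils down to running something like the
following from the shutdown scripts, just before the ACPI
poweroff. the exact tools here are my guess; the stop-unit
command is the point:

    # SCSI: STOP UNIT makes the drive flush its cache and park
    $ sg_start --stop /dev/sda
    # ATA equivalent, STANDBY IMMEDIATE:
    $ hdparm -y /dev/sda

either way the drive flushes on its own, no cooperation from the
synchronize-cache path required.)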
It's proper to have a strict separation between ``unclean shutdown''
and ``recovery from corruption''. UFS does have the separation
between log-rolling and fsck-ing, but ZFS could detect the difference
between unclean shutdown and corruption a lot better than UFS, and
that's good. Currently ZFS seems to detect it by telling you ``pool's
corrupt. <shrug>, destroy it.''---the fact that the recovery tool is
entirely absent isn't good, but keeping recovery actions like this
ueberblock-search strictly separate makes delivering something truly
correct on the ``unclean shutdown'' front more likely.
I think, if iSCSI target/initiator combinations are silently
discarding 10sec worth of writes (ex., when they drop and reconnect
their TCP session), then this needs to be proven and their
implementation can be and needs to be corrected, not speculated on and
then worked around.
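
proving it is not even hard. a sketch, assuming GNU dd and a raw
iSCSI LUN you can scribble on (the device name is made up):

    # write numbered 512-byte records synchronously; record a
    # number only after its write has been acknowledged back
    # through the initiator
    $ i=0
    $ while printf '%-512d' $i |
    >       dd of=/dev/sdb bs=512 seek=$i conv=notrunc oflag=dsync 2>/dev/null
    > do echo $i > last-acked; i=$((i+1)); done

yank the target's network cable mid-run, let the session
reconnect, then read the LUN back: every record up to the number
in last-acked must be present, or the stack discarded
acknowledged writes.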
And I bet this same trick of beefing up performance numbers by
discarding cache flushes is as rampant in the virtualization game as
in the hard disk game.
