>>>>> "jb" == Jeff Bonwick <[EMAIL PROTECTED]> writes: >>>>> "rmc" == Ricardo M Correia <[EMAIL PROTECTED]> writes:
jb> We need a little more Code of Hammurabi in the storage
jb> industry.
It seems like most of the work people have to do now is cleaning up
after the sloppiness of others. At least, that's what takes the longest.
You could always mention which disks you found ignoring the
command---wouldn't that help the overall problem? I understand
there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but
I don't understand where it comes from.
http://www.ferris.edu/news/jimcrow/tom/
jb> displacement flush for disk caches that ignore the sync
jb> command.
Sounds like a good idea but:
(1) won't this break the NFS guarantees you were just saying should
never be broken?
I get it, someone else is breaking a standard so how can ZFS be
expected to yadda yadda yadda.  But I fear it will just push
``blame the sysadmin'' one step further out.  ex., Q. ``with ZFS
all my NFS clients become unstable after the server reboots,'' or
``I'm getting silent corruption with NFS''.  A. ``your drives
might have gremlins in them, no way to know,'' and ``well what do
you expect without a single integrity domain and TCP's weak
checksums. / no i'm using a crossover cable, and FCS is not
weak. / without ZFS managing a layer of redundancy it is probably
your RAM, or corruption on the, uh, path between the Ethernet MAC
chip and the PCI slot''
(1a) I'm concerned about how it'll be reported when it happens.
(a) if it's not reported at all, then ZFS is hiding the fact
that fsync() is not working. Also, other journaling
filesystems sometimes report when they find
``unexpected'' corruption, which is useful for finding
both hardware and software problems.
I'm already concerned ZFS is not reporting enough, like
when it says a vdev component is ONLINE, but 'zpool
offline pool <component>' fails with 'no valid replicas',
and then after a scrub there is no change in zpool status,
yet zpool offline suddenly works again.
ZFS should not ``simplify'' the user interface to the
point that it's hiding problems with itself and its
environment just to avoid discussion.
(b) if it is reported, then whenever the reporter-blob
raises its hand it will have the effect of exonerating
ZFS in most people's minds, like the stupid CKSUM column
does right now. ``ZFS-FEED-B33F error? oh yeah that's
the new ueberblock search code. that means your disks
are ignoring the SYNCHRONIZE CACHE command. thank GOD
you have ZFS with ANY OTHER FILESYSTEM all bets would be
totally off. lucky you. / I have tried ten different
models from all four brands. / yeah sucks don't it?
flagrant violation of the standard, industry wide. / my
linux testing tool says they're obeying the command fine
/ linux is crap / i added a patch to solaris to block
the SYNC CACHE command and the disks got faster so I
think it's not being ignored / well the stack is
complicated and flushing happens at many levels, like
think about controller performance, and that's
completely unsupported you are doing something REALLY
UNSAFE there you should NOT DO THAT it is STUPID'' and
so on, stalling the actual fix literally for years.
The right way to exonerate ZFS is to make a diagnostic
tool for the disks that proves they're broken, and then
don't buy those disks. not to make a new class of ZFS
fault report that could potentially capture all kinds of
problems, then hazily assign blame to an untestable
quantity. (a sketch of such a tool follows this list.)
(2) disks are probably not the only thing dropping write barriers.
So far we also suspect (unproven!) iSCSI targets/initiators,
particularly around a TCP reconnection event or target reboot,
and VM stacks, both VirtualBox and the HVM in UltraSPARC T1.
probably other stuff too.
I'm concerned that the assumptions you'll find safe to make about
disks once you get started, like ``nothing is more than 1s
stale,'' or ``size the on-disk cache with a CDB, imagine it's a
FIFO, and assume it'll be no worse than that,'' or ``you can get
an fsync by pausing reads for 500ms,'' or whatever, will add
robustness for current and future broken disks but won't apply to
other kinds of broken storage layers.
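
(about that diagnosis tool: a crude version can be a one-liner.
a sketch, assuming GNU dd, a scratch file on the disk under test,
and a filesystem that actually turns O_DSYNC writes into cache
flushes:

    # each 512-byte write is followed by a cache flush, so an
    # honest 7200rpm disk tops out around 100-200 of them per
    # second.  thousands per second means some layer is eating
    # the flush.
    $ time dd if=/dev/zero of=/testfs/scratch bs=512 count=1000 oflag=dsync

it doesn't prove _which_ layer is lying, but it turns ``hazily
assign blame'' into a number.)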
rmc> However, it is not so resilient when the storage system
rmc> suffers hiccups which cause phantom writes to occur
rmc> continuously, even if for a small period of time (say less
rmc> than 10 seconds), and then return to normal.
ha! that is a great idea. temporal ditto blocks: Important writes
should be written, aged in RAM for 1 minute, then rewritten. :) This
will help with latent sector errors caused by power sag/vibration
too. but... even I will admit that at some point you have to give up
and let the filesystem get corrupted.
actually I'm more in the camp of making ZFS fragile to incorrect
storage stacks, and offering an offline recovery tool that treats the
corrupt pool as read-only and copies it into a new filesystem (so you
need a second same-size empty pool to use the tool). I like this
painful way better than fsck-like things, and much better than silent
workarounds. but i'm probably in the wrong camp on this one.
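
(concretely, I imagine the tool being used like this. 'zrescue'
is made up and does not exist; this is just the shape of it:

    # second, empty pool at least as big as the sick one
    $ zpool create rescue c3t0d0 c3t1d0
    # walk the corrupt pool strictly read-only, copy out everything
    # that still passes its checksums, log everything that doesn't
    $ zrescue tank rescue

nothing it does can make the patient worse, which is more than you
can say for fsck-like things.)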
My reasoning is, we will not ultimately be happy with a filesystem
where fsync() is broken and ``that's the best you can do.'' To
compete with Netapp, we need to bang on this thing until it's
actually working. So far I think sysadmins are receptive to the idea
that they need to fix <...> about their setup, or make purchases with
extreme care, or do testing before production. We are not lazy and do
not expect an appliance-on-a-CD.
it's just that pass-the-buck won't ever deliver something useful.
When ext3 was corrupting filesystems on laptops, ext3 got blamed, and
ext3 was not at the root of the problem. But no one _accepted_ that
ext3 was correctly coded until the overall problem was fixed. (IIRC
it was: you need to send drives a stop-unit command before the ACPI
powerdown, because even if they ignore synchronize-cache they do
still flush when told to stop-unit.)
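
(on linux that fix boils down to running something like the
following from the shutdown scripts, just before the ACPI
poweroff. the exact tools here are my guess; the stop-unit
command is the point:

    # SCSI: STOP UNIT makes the drive flush its cache and park
    $ sg_start --stop /dev/sda
    # ATA equivalent, STANDBY IMMEDIATE:
    $ hdparm -y /dev/sda

either way the drive flushes on its own, no cooperation from the
synchronize-cache path required.)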
It's proper to have a strict separation between ``unclean shutdown''
and ``recovery from corruption''. UFS does have the separation
between log-rolling and fsck-ing, but ZFS could detect the difference
between unclean shutdown and corruption a lot better than UFS, and
that's good. Currently ZFS seems to detect it by telling you ``pool's
corrupt. <shrug>, destroy it.''---the fact that the recovery tool is
entirely absent isn't good, but keeping recovery actions like this
ueberblock-search strictly separate makes delivering something truly
correct on the ``unclean shutdown'' front more likely.
I think, if iSCSI target/initiator combinations are silently
discarding 10sec worth of writes (ex., when they drop and reconnect
their TCP session), then this needs to be proven and their
implementation can be and needs to be corrected, not speculated on and
then worked around.
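
proving it is not even hard. a sketch, assuming GNU dd and a raw
iSCSI LUN you can scribble on (the device name is made up):

    # write numbered 512-byte records synchronously; record a
    # number only after its write has been acknowledged back
    # through the initiator
    $ i=0
    $ while printf '%-512d' $i |
    >       dd of=/dev/sdb bs=512 seek=$i conv=notrunc oflag=dsync 2>/dev/null
    > do echo $i > last-acked; i=$((i+1)); done

yank the target's network cable mid-run, let the session
reconnect, then read the LUN back: every record up to the number
in last-acked must be present, or the stack discarded
acknowledged writes.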
And I bet this same trick of beefing up performance numbers by
discarding cache flushes is as rampant in the virtualization game as
in the hard disk game.
