On 05/22/09 21:08, Toby Thain wrote:
> Yes, the important thing is to *detect* them; no system can run reliably
> with bad memory, and that includes any system with ZFS. Doing nutty
> things like calculating the checksum twice does not buy anything of
> value here.

All memory is "bad" if it doesn't have ECC; there are only varying
degrees of badness. Calculating the checksum twice on its own would
be nutty, as you say, but doing so on a separate copy of the data
might prevent unrecoverable errors after writes to mirrored drives.
You can't detect memory errors if you don't have ECC, but you can
try to mitigate them, and failing to do so makes ZFS less reliable
than the memory it runs on. The problem is that ZFS makes any file
with a bad checksum inaccessible, even if one really doesn't care
whether the data has been corrupted. A workaround might be a way to
allow such files to be readable despite the bad checksum...
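
Something like the following minimal user-space sketch is what I have
in mind. To be clear, this is not ZFS code: the fletcher-4-style sum
and all of the names (checksum_with_copy and so on) are mine, purely
for illustration.

/*
 * Sketch of the "checksum a separate copy" idea.  Not ZFS code; the
 * fletcher-4-style sum and every name here are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fletcher-4-style running sums over 32-bit words. */
static void
fletcher4(const void *buf, size_t size, uint64_t sum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; ip < end; ip++) {
		a += *ip;
		b += a;
		c += b;
		d += c;
	}
	sum[0] = a; sum[1] = b; sum[2] = c; sum[3] = d;
}

/*
 * Checksum the caller's buffer and an independent copy of it.  If the
 * two sums disagree, memory flipped a bit in one of the buffers after
 * the copy was taken; tell the caller to retry instead of committing
 * a block whose data and checksum may already disagree.
 */
static int
checksum_with_copy(const void *buf, size_t size, uint64_t sum[4])
{
	uint64_t sum2[4];
	void *copy = malloc(size);

	if (copy == NULL) {		/* no memory: skip the verification */
		fletcher4(buf, size, sum);
		return (0);
	}
	memcpy(copy, buf, size);

	fletcher4(buf, size, sum);
	fletcher4(copy, size, sum2);
	free(copy);

	return (memcmp(sum, sum2, sizeof (sum2)) == 0 ? 0 : -1);
}

int
main(void)
{
	uint32_t block[1024] = { [0] = 0xdeadbeef };
	uint64_t sum[4];

	while (checksum_with_copy(block, sizeof (block), sum) != 0)
		fprintf(stderr, "checksum mismatch, retrying\n");
	printf("checksum %llx:%llx:%llx:%llx\n",
	    (unsigned long long)sum[0], (unsigned long long)sum[1],
	    (unsigned long long)sum[2], (unsigned long long)sum[3]);
	return (0);
}

Granted, a flip that lands after the two sums have been compared still
slips through, so this narrows the window rather than closing it. But
it would turn many one-bit hits into retries instead of blocks that
fail verification on every leg of the mirror.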

In hindsight I probably should have merely reported the problem and
left those with more knowledge to propose a solution. Oh well.
> If the memory is this bad then applications will be dying all over the
> place, compilers will be segfaulting, and databases will be writing bad
> data even before it reaches ZFS.

But it isn't. Applications aren't dying, compilers are not segfaulting
(it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
is staying up for weeks at a time... And I wouldn't consider running a
non-trivial database application on a machine without ECC.

> Absolutely, memory diags are essential. And you certainly run them if
> you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.
> Your logic is rather tortuous. If the hardware is that crappy then
> there's not much ZFS can do about it.

Well, it could. For example, it could make copies of the data before
checksumming, so that a single memory hit doesn't result in an
unrecoverable file on a mirrored drive. Either the memory really is at
fault and could be mitigated that way, or there's a bug in ZFS. I am
more inclined to blame the memory, especially since the failure rate
isn't much higher than the expected rates reported elsewhere.
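
Here is a toy simulation of the failure mode I mean, assuming (as I
understand the write path) that the checksum is computed once and the
same in-memory buffer is then handed to every leg of the mirror. The
names and structure are mine, not the actual ZFS code:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define	BLOCK_WORDS	1024

/* Stand-in for a real fletcher checksum. */
static uint64_t
toy_checksum(const uint32_t *buf)
{
	uint64_t a = 0, b = 0;

	for (int i = 0; i < BLOCK_WORDS; i++) {
		a += buf[i];
		b += a;
	}
	return (b);
}

int
main(void)
{
	uint32_t buf[BLOCK_WORDS] = { [7] = 42 };
	uint32_t mirror0[BLOCK_WORDS], mirror1[BLOCK_WORDS];

	uint64_t sum = toy_checksum(buf); /* recorded in the parent block */

	buf[7] ^= 1;			/* one memory hit after checksumming */

	memcpy(mirror0, buf, sizeof (buf)); /* same buffer feeds both legs */
	memcpy(mirror1, buf, sizeof (buf));

	/* On read, both legs fail verification: nothing to self-heal from. */
	printf("mirror0 %s, mirror1 %s\n",
	    toy_checksum(mirror0) == sum ? "ok" : "BAD",
	    toy_checksum(mirror1) == sum ? "ok" : "BAD");
	return (0);
}

Compiled and run, this prints "mirror0 BAD, mirror1 BAD": once the flip
lands after checksumming, there is no good copy left for self-healing
to use, so the error is unrecoverable despite the mirror.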

>> Maybe this should be a new thread, but I suspect the following
>> proves that the problem must be memory, which raises the question
>> of how memory glitches can cause fatal ZFS checksum errors.

> Of course they can; but they will also break anything else on the machine.

But they don't. Checksum errors are reasonable, but not unrecoverable
ones on mirrors.
> How can a machine with bad memory "work fine with ext3"?

It does. It works fine with ZFS too, apart from the really annoying
unrecoverable files that turn up every now and then on mirrored drives.
That shouldn't happen even with lousy memory, and it wouldn't (doesn't)
with ECC. If there were a way to examine the two on-disk copies of such
a file and their checksums, I would be surprised if the copies differed
from each other (if they did, it would almost certainly be the controller
or the PCI bus itself causing the problem). My guess is that both copies
are identical and simply fail the checksum, which is what you would
expect from a memory hit.

-- Frank

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
