Thanks! I will definitely take this out with my afternoon tea for a read C:
On Fri, Apr 11, 2014 at 5:09 PM, Bayard Bell <buffer.g.overf...@gmail.com> wrote:

> If you want more of a smoking gun report on data corruption without ECC, try:
>
> https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc
>
> This view isn't isolated in terms of what people at Sun thought or what people at Oracle now think. Try googling for "zfs ecc site:blogs.oracle.com", and you'll find a recurring statement that ECC should be used even in home deployments, with maybe one odd exception.
>
> The Wikipedia article, correctly summarising the Google study, is plain in saying not that extremely high error rates are common but that error rates are highly variable in large-sample studies, with some systems seeing extremely high error rates. ECC gives significant assurance for an incremental cost, so what's your data worth? You're not guaranteed to be screwed by not using ECC (and the Google paper doesn't say this either), but you are assuming risks that ECC mitigates. Look at the blog above, however: even DIMMs that are high-quality but non-ECC can go wrong and result in nasty system corruption.
>
> What generally protects you in terms of pool integrity is metadata redundancy on top of integrity checks, but if you flip bits on metadata in-core before writing the redundant copies, well, that's a risk to pool integrity.
>
> I also think it's mistaken to say this is distinctly a problem with ZFS. Any "next-generation" filesystem that protects against on-disk corruption via checksums ends up with a residual risk focused on making sure that in-core data integrity is robust. You could well have these problems on the pools you've deployed, and there are a lot of situations in which you'd never know, and quite a lot (such as most of the bits in a photo or MP3) where you'd never notice low rates of bit-flipping. The fact that you haven't noticed doesn't mean there have been no problems in a strict sense; it's far more likely that you've been able to tolerate the flipping that's happened. The guy at Sun with the blog above got lucky: he was running high-quality non-ECC RAM, and it went pear-shaped, at least as metadata cancer, quite quickly, allowing him to recover by rolling back snapshots.
>
> Take a look out there, and you'll find people who are very confused about the risks and available mitigations. I found someone saying that there's no problem with more traditional RAID technologies because disks have CRCs. By comparison, you can find Bonwick, educated as a statistician, talking about SHA256 collisions relative to undetected ECC error rates and introducing ZFS data integrity safeguards by way of analogy to ECC. That's why the large-sample studies are interesting and useful: none of this technology makes data corruption impossible, it just goes to extreme lengths to marginalise the chances of those events by addressing known sources of errors and fundamental error scenarios. In-core is so core that if you tolerate error there, those errors will characterise systematic behaviour where you have better outcomes reasonably available (and that's **reasonably** available, I would suggest, in a way that the Madison paper's recommendation to make ZFS buffers magical isn't). CRC-32 does a great job detecting bad sectors and preventing them from being read back, but SHA256 in the right place in a system detects errors that a well-conceived vdev topology will generally make recoverable. That includes catching cases where an error isn't caught by CRC-32, which may be a rare result, but when you've got the kind of data densities that ZFS can allow, you're rolling the dice often enough that those results become interesting.
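(For anyone following along at home: the combination he's describing, strong checksums on top of a redundant vdev layout, is only a few commands to set up. This is just a rough sketch; the pool name "tank" and the device paths are placeholders, so adjust for your own disks:

    # mirrored vdev plus SHA256 checksums plus periodic scrubs
    zpool create tank mirror /dev/disk1 /dev/disk2
    zfs set checksum=sha256 tank    # default is fletcher4; sha256 is the stronger hash
    zfs set copies=2 tank           # optional extra data copies, as Jason mentions below
    zpool scrub tank                # walk every block and verify checksums
    zpool status -v tank            # per-device READ/WRITE/CKSUM counters and any bad files

None of that helps if a block is corrupted in RAM before the checksum is computed, which is exactly the point about ECC.)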
> ECC is one of the most basic steps to take, and if you look at the architectural literature, that's how it's treated. If you really want to be in on the joke, find the opensolaris zfs list thread from 2009 where someone asks about ECC, and someone else jumps in to remark on how VirtualBox can be poison for pool integrity, for reasons rehearsed in my last post.
>
> Cheers,
> Bayard
>
> On 1 April 2014 12:04, Jason Belec <jasonbe...@belecmartin.com> wrote:
>
>> ZFS is lots of parts, in most cases lots of cheap unreliable parts, refurbished parts, yadda yadda. As posted on this thread and many, many others, any issues are probably not ZFS but the parts of the whole. Yes, it could be ZFS, after you confirm that all the parts are pristine, maybe.
>>
>> My oldest system running ZFS is a Mac Mini (Intel Core Duo, 3GB non-ECC RAM); it is the home server for music, TV shows, movies, and some interim backups. The Mini has been modded for eSATA and has 6 drives connected. The pool is 2 RAIDZs of 3, mirrored, with copies set to 2. It's been running since ZFS was released from the Apple builds. Lost 3 drives, eventually traced to a new cable that had cracked at the connector; when hot enough it expanded, lifting 2 pins free of their connector counterparts and resulting in errors. Visually almost impossible to see. I replaced port multipliers, eSATA cards, RAM, Minis, the power supply, reinstalled the OS, reinstalled ZFS, restored the ZFS data from backup, and finally found the bad connector end only because it was hot and felt 'funny'.
>>
>> Frustrating, yes, but educational also. The happy news is that all the data was fine; my wife would have torn me to shreds if photos were missing, music was corrupt, etc., etc. And this was on the old, out-of-date but stable ZFS version we Mac users have been hugging onto for dear life. YMMV.
>>
>> Never had RAM as the issue, here in the mad science lab across 10 rotating systems or in any client location, pick your decade. However, I don't use cheap RAM either, and the only 2 systems I currently have requiring ECC don't even connect to ZFS, as they are both Xserves with other lives.
>>
>> --
>> Jason Belec
>> Sent from my iPad
>>
>> On Apr 1, 2014, at 12:13 AM, Daniel Becker <razzf...@gmail.com> wrote:
>>
>> On Mar 31, 2014, at 7:41 PM, Eric Jaw <naisa...@gmail.com> wrote:
>>
>> I started using ZFS a few weeks ago, so a lot of it is still new to me. I'm actually not completely certain about the "proper procedure" for repairing a pool. I'm not sure if I'm supposed to clear the errors before or after the scrub (little things). I'm not sure if it even matters. When I restarted the VM, the checksum counts cleared on their own.
>>
>> The counts are not maintained across reboots.
>>
>> On the first scrub it repaired roughly 1.65MB. None on the second scrub. Even after the scrub there were still 43 data errors; I was expecting those to go away.
>>
>> errors: 43 data errors, use '-v' for a list
>>
>> What this means is that in these 43 cases, the system was not able to correct the error (i.e., both drives in a mirror returned bad data).
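(On the "proper procedure" question above, the usual repair sequence looks something like the following. Just a sketch, with the pool name and file path as placeholders; it's the restore step, not the clear, that actually makes permanent errors go away:

    zpool scrub tank                # let ZFS repair whatever the redundancy can fix
    zpool status -v tank            # '-v' lists the files behind the "43 data errors"
    # restore the listed files from backup, or delete them, for example:
    cp /backup/photos/img_0001.jpg /tank/photos/img_0001.jpg
    zpool clear tank                # reset the error counters
    zpool scrub tank                # re-scrub; the error list should shrink once the bad files are replaced

Clearing only resets the counters; it doesn't repair anything on its own.)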
>> This is an excellent question. They're in 'Normal' mode. I remember looking into this before and deciding normal mode should be fine. I might be wrong, so thanks for bringing this up; I'll have to check it out again.
>>
>> The reason I was asking is that these symptoms would also be consistent with something outside the VM writing to the disks behind the VM's back. That's unlikely to happen accidentally with disk images, but raw disks are visible to the host OS as such, so it may be as simple as Windows deciding that it should initialize the "unformatted" (really, formatted with an unknown filesystem) devices. Or it could be a RAID controller that stores its array metadata in the last sector of the array's disks.
>>
>> memtest86 and memtest86+ for 18 hours came out okay. I'm on my third scrub and the number of errors has remained at 43. Checksum errors continue to pile up as the pool is getting scrubbed.
>>
>> I'm just as flustered about this. Thanks again for the input.
>>
>> Given that you're seeing a fairly large number of errors in your scrubs, the fact that memtest86 doesn't find anything at all very strongly suggests that this is not actually a memory issue.
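(One way to narrow down where the flipping happens is to watch the per-device counters while a scrub runs. Another rough sketch, pool name again a placeholder: checksum errors that pile up on a single device point at that disk, cable, or controller, like Jason's connector; errors spread evenly across every device point back at memory, or at something on the host touching the raw disks behind the VM's back:

    zpool scrub tank
    zpool status -v tank        # READ/WRITE/CKSUM counters per device, plus any damaged files
    zpool iostat -v tank 5      # per-device I/O every 5 seconds, to spot a device behaving oddly
    zpool clear tank            # reset the counters so the next scrub starts from zero

)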