Interesting point about different kinds of ECC memory. I wonder if the
difference is important enough to consider for a 20x3TB ZFS pool. To be on
the safe side, I will likely look into getting ECC memory.


On Fri, Apr 11, 2014 at 5:36 PM, Jason Belec <jasonbe...@belecmartin.com> wrote:

> Excellent. If you feel this is necessary, go for it. By your point of view,
> those whose systems don't have ECC should just run like the sky is falling.
> That said, I can guarantee none of the systems I have under my care have
> issues. How do I know? Well, the data is tested/compared at regular
> intervals. Maybe I'm the luckiest guy ever; where is that lottery ticket?
> Is ECC better? Possibly, probably in heavy-load environments, but no data
> has been provided to back this up, especially nothing in the context of
> what most users' needs are, at least here in the Mac space. Which ECC? Be
> specific. They are not all the same, just like regular RAM is not all the
> same, just like HDDs are not all the same. Fear mongering is wonderful and
> easy. Putting forth a solution guaranteed to be better is what's needed
> now. Did you actually reference a wiki? Seriously? A document anyone can
> edit to suit their view? I guess I come from a different era.
>
>
> Jason
> Sent from my iPhone 5S
>
> On Apr 11, 2014, at 5:09 PM, Bayard Bell <buffer.g.overf...@gmail.com>
> wrote:
>
> If you want more of a smoking gun report on data corruption without ECC,
> try:
>
> https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc
>
> This view isn't isolated in terms of what people at Sun thought or what
> people at Oracle now think. Try googling for "zfs ecc site:
> blogs.oracle.com", and you'll find a recurring statement that ECC should
> be used even in home deployments, with maybe one odd exception.
>
> The Wikipedia article, correctly summarising the Google study, is plain in
> saying not that extremely high error rates are common but that error rates
> are highly variable in large-sample studies, with some systems seeing
> extremely high error rates. ECC gives significant assurance for an
> incremental cost, so what's your data worth? You're not guaranteed to be
> screwed by not using ECC (and the Google paper doesn't say this either),
> but you are assuming risks that ECC mitigates. Look at the above blog,
> however: even high-quality non-ECC DIMMs can go wrong and result in nasty
> system corruption.
>
> What generally protects you in terms of pool integrity is metadata
> redundancy on top of integrity checks, but if you flip bits on metadata
> in-core before writing redundant copies, well, that's a risk to pool
> integrity.
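>
> (For what it's worth, the relevant knobs here are ordinary dataset
> properties; the pool/dataset name below is just a placeholder:
>
>     zfs get checksum,copies tank/data   # see what's currently in effect
>     zfs set copies=2 tank/data          # keep extra copies of user data too
>
> ZFS already keeps extra copies of metadata on its own; none of this helps
> if a block is corrupted in RAM before its checksum is computed, which is
> exactly the point above.)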
>
> I also think it's mistaken to say this is distinctly a problem with ZFS.
> Any "next-generation" filesystem that protects against on-disk corruption
> via checksums ends up with its residual risk concentrated on keeping
> in-core data integrity robust. You could well have these problems on the
> pools you've deployed; there are a lot of situations in which you'd never
> know, and quite a lot (such as most of the bits in a photo or MP3) where
> you'd never notice low rates of bit-flipping. The fact that you haven't
> noticed doesn't equate, in a strict sense, to there being no problems;
> it's far more likely that you've been able to tolerate the flipping that's
> happened. The guy at Sun with the blog above got lucky: he was running
> high-quality non-ECC RAM, and it went pear-shaped, at least for metadata,
> quickly enough that he could recover by rolling back snapshots.
>
> Take a look out there, and you'll find people who are very confused about
> the risks and available mitigations. I found someone saying that there's no
> problem with more traditional RAID technologies because disks have CRCs. By
> contrast, you can find Bonwick, educated as a statistician, comparing
> SHA256 collision rates to undetected ECC error rates and introducing ZFS
> data integrity safeguards by way of analogy to ECC. That's why the
> large-sample studies are interesting and useful: none of this technology
> makes data corruption impossible, it just goes to extreme lengths to
> marginalise the chances of those events by addressing known sources of
> errors and fundamental error scenarios. In-core memory is so central that
> if you tolerate errors there, those errors become systematic behaviour in a
> system where better outcomes are reasonably available (and that's
> **reasonably** available, I would suggest, in a way that the Madison
> paper's recommendation to make ZFS buffers magical isn't). CRC-32 does a
> great job detecting bad sectors and preventing them from being read back,
> but SHA256 in the right place in a system detects errors that a
> well-conceived vdev topology will generally make recoverable. That includes
> catching cases where an error isn't caught by CRC-32, which may be rare,
> but with the kind of data densities that ZFS allows, you're rolling the
> dice often enough that those results become interesting.
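>
> (If you want the stronger hash rather than the default fletcher4 checksum,
> it's a one-line property change that only applies to blocks written
> afterwards; the dataset name is again just a placeholder:
>
>     zfs set checksum=sha256 tank/data
>     zfs get checksum tank/data
> )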
>
> ECC is one of the most basic steps to take, and if you look at the
> architectural literature, that's how it's treated. If you really want to be
> in on the joke, find the opensolaris zfs list thread from 2009 where
> someone asks about ECC, and someone else jumps in to remark on how
> VirtualBox can be poison for pool integrity for reasons rehearsed in my
> last post.
>
> Cheers,
> Bayard
>
> On 1 April 2014 12:04, Jason Belec <jasonbe...@belecmartin.com> wrote:
>
>> ZFS is lots of parts, in most cases lots of cheap unreliable parts,
>> refurbished parts, yadda yadda. As posted on this thread and many, many
>> others, any issues are probably not ZFS but the parts of the whole. Yes, it
>> could be ZFS, but only after you confirm that all the parts are pristine,
>> maybe.
>>
>> My oldest system running ZFS is a Mac Mini Intel Core Duo with 3GB RAM
>> (not ECC); it is the home server for music, TV shows, movies, and some
>> interim backups. The Mini has been modded for eSATA and has 6 drives
>> connected. The pool is 2 RAIDZ of 3, mirrored, with copies set to 2. It has
>> been running since ZFS was first released in the Apple builds. I lost 3
>> drives, eventually traced to a new cable that had cracked at the connector;
>> when it got hot enough it expanded, lifting 2 pins free of their connector
>> counterparts and causing errors. It was visually almost impossible to see.
>> I replaced port multipliers, eSATA cards, RAM, Minis, the power supply,
>> reinstalled the OS, reinstalled ZFS, and restored the ZFS data from backup,
>> before finally finding the bad connector end only because it was hot and
>> felt 'funny'.
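>>
>> (For anyone chasing something similar: the per-device READ/WRITE/CKSUM
>> counters in zpool status are what eventually point at a cable or port
>> rather than a drive - errors that follow the cable when you move drives
>> around are the giveaway. The pool name below is just an example:
>>
>>     zpool status -v tank
>>     zpool clear tank      # reset the counters once the hardware is fixed
>> )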
>>
>> Frustrating, yes, but educational too. The happy news is that all the data
>> was fine - my wife would have torn me to shreds if photos had gone missing,
>> music was corrupt, etc. And this was on the old, out-of-date but stable ZFS
>> version we Mac users have been hugging onto for dear life. YMMV.
>>
>> RAM has never been the issue, here in the mad science lab across 10
>> rotating systems or in any client location - pick your decade. However, I
>> don't use cheap RAM either, and the only 2 systems I currently have that
>> require ECC don't even connect to ZFS, as they are both Xserves with
>> other lives.
>>
>>
>>
>> --
>> Jason Belec
>> Sent from my iPad
>>
>> On Apr 1, 2014, at 12:13 AM, Daniel Becker <razzf...@gmail.com> wrote:
>>
>> On Mar 31, 2014, at 7:41 PM, Eric Jaw <naisa...@gmail.com> wrote:
>>
>> I started using ZFS a few weeks ago, so a lot of it is still new to me.
>> I'm actually not completely certain about the "proper procedure" for
>> repairing a pool. I'm not sure whether I'm supposed to clear the errors
>> before or after the scrub (little things), or if it even matters. When I
>> restarted the VM, the checksum counts cleared on their own.
>>
>>
>> The counts are not maintained across reboots.
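>>
>> As for the procedure itself, the usual sequence is simply scrub, review,
>> then clear once you're satisfied; the pool name here is only an example:
>>
>>     zpool scrub tank       # start a full verification pass
>>     zpool status -v tank   # check progress and any affected files
>>     zpool clear tank       # reset the error counters afterwards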
>>
>>
>> On the first scrub it repaired roughly 1.65MB; none on the second scrub.
>> Even after the scrub there were still 43 data errors. I was expecting them
>> to go away.
>>
>>
>> errors: 43 data errors, use '-v' for a list
>>
>>
>> What this means is that in these 43 cases, the system was not able to
>> correct the error (i.e., both drives in a mirror returned bad data).
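>>
>> With the verbose flag, zpool status lists the affected files, roughly along
>> these lines (illustrative output, not from your pool):
>>
>>     errors: Permanent errors have been detected in the following files:
>>             /tank/data/example.file
>>
>> Those files have to be restored from backup; a scrub can't repair them if
>> no good copy exists on any device.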
>>
>>
>> This is an excellent question. They're in 'Normal' mode. I remember
>> looking into this before and deciding normal mode should be fine. I might
>> be wrong, so thanks for bringing this up. I'll have to check it out again.
>>
>>
>> The reason I was asking is that these symptoms would also be consistent
>> with something outside the VM writing to the disks behind the VM’s back;
>> that’s unlikely to happen accidentally with disk images, but raw disks are
>> visible to the host OS as such, so it may be as simple as Windows deciding
>> that it should initialize the “unformatted” (really, formatted with an
>> unknown filesystem) devices. Or it could be a raid controller that stores
>> its array metadata in the last sector of the array’s disks.
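>>
>> If the host is Windows, one quick mitigation (the disk number below is a
>> placeholder) is to take the raw disks offline in diskpart so the host won't
>> try to mount or "initialize" them; the VM should still be able to open them
>> directly:
>>
>>     diskpart
>>     DISKPART> select disk 2
>>     DISKPART> offline disk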
>>
>>
>> memtest86 and memtest86+ ran for 18 hours and came out okay. I'm on my
>> third scrub and the number of errors has remained at 43. Checksum errors
>> continue to pile up as the pool is getting scrubbed.
>>
>> I'm just as flustered about this. Thanks again for the input.
>>
>>
>> Given that you’re seeing a fairly large number of errors in your scrubs,
>> the fact that memtest86 doesn’t find anything at all very strongly suggests
>> that this is not actually a memory issue.

