Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
Jason, If you think I've said anything about the sky falling or referenced a wiki, you're responding to something other than what I wrote. I see no need for further reply. Cheers, Bayard

On 11 April 2014 22:36, Jason Belec jasonbe...@belecmartin.com wrote: Excellent. If you feel this is necessary, go for it. Those that have systems without ECC should just run like the sky is falling, by your point of view. That said, I can guarantee none of the systems I have under my care have issues. How do I know? Well, the data is tested/compared at regular intervals. Maybe I'm the luckiest guy ever; where is that lottery ticket? Is ECC better? Possibly, probably in heavy-load environments, but no data has been provided to back this up, especially nothing in the context of what most users' needs are, at least here in the Mac space. Which ECC? Be specific. They are not all the same. Just like regular RAM is not all the same. Just like HDDs are not all the same. Fear mongering is wonderful and easy. Putting forth a solution guaranteed to be better is what's needed now. Did you actually reference a wiki? Seriously? A document anyone can edit to suit their view? I guess I come from a different era. Jason Sent from my iPhone 5S

On Apr 11, 2014, at 5:09 PM, Bayard Bell buffer.g.overf...@gmail.com wrote: If you want more of a smoking-gun report on data corruption without ECC, try: https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc This view isn't isolated in terms of what people at Sun thought or what people at Oracle now think. Try googling for zfs ecc site:blogs.oracle.com, and you'll find a recurring statement that ECC should be used even in home deployments, with maybe one odd exception. The Wikipedia article, correctly summarising the Google study, is plain in saying not that extremely high error rates are common but that error rates are highly variable in large-sample studies, with some systems seeing extremely high error rates. ECC gives a significant assurance for an incremental cost, so what's your data worth? You're not guaranteed to be screwed by not using ECC (and the Google paper doesn't say this either), but you are assuming risks that ECC mitigates. Look at the above blog, however: even DIMMs that are high quality but non-ECC can go wrong and result in nasty system corruption. What generally protects pool integrity is metadata redundancy on top of integrity checks, but if you flip bits on metadata in-core before writing the redundant copies, well, that's a risk to pool integrity.

I also think it's mistaken to say this is distinctly a problem with ZFS. Any next-generation filesystem that protects against on-disk corruption via checksums ends up with a residual risk focused on making sure that in-core data integrity is robust. You could well have those problems on the pools you've deployed, and there are a lot of situations in which you'd never know and quite a lot (such as most of the bits in a photo or MP3) where you'd never notice low rates of bit-flipping. The fact that you haven't noticed doesn't equate to there being no problems in a strict sense; it's far more likely that you've been able to tolerate the flipping that's happened. The guy at Sun with the blog above got lucky: he was running high-quality non-ECC RAM, and it went pear-shaped, at least in terms of metadata cancer, quite quickly, allowing him to recover by rolling back snapshots. Take a look out there, and you'll find people who are very confused about the risks and available mitigations.
I found someone saying that there's no problem with more traditional RAID technologies because disks have CRCs. By comparison, you can find Bonwick, educated as a statistician, talking about SHA256 collisions by comparison to undetected ECC error rates and introducing ZFS data integrity safeguards by way of analogy to ECC. That's why the large-sample studies are interesting and useful: none of this technology makes data corruption impossible, it just goes to extreme lengths to marginalise the chances of those events by addressing known sources of errors and fundamental error scenarios--in-core is so core that if you tolerate errors there, those errors will characterise systematic behaviour in a situation where you have better outcomes reasonably available (and that's **reasonably** available, I would suggest, in a way that the Madison paper's recommendation to make ZFS buffers magical isn't). CRC-32 does a great job detecting bad sectors and preventing them from being read back, but SHA256 in the right place in a system detects errors that a well-conceived vdev topology will generally make recoverable. That includes catching cases where an error isn't caught by CRC-32, which may be a rare result, but when you've got the kind of data densities that ZFS can allow, you're rolling the dice often enough that those results become interesting. ECC is one of the most basic steps
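To make that concrete, the checksum-plus-redundancy arrangement described above is just ordinary ZFS configuration. A minimal sketch, assuming a hypothetical pool named tank built from two placeholder devices (device names vary by platform):

  # mirrored vdev: a block that fails its checksum on read can be repaired from the other copy
  zpool create tank mirror /dev/disk1 /dev/disk2
  # switch from the default fletcher4 checksum to SHA256 for stronger error detection
  zfs set checksum=sha256 tank

None of this helps, as noted above, if the bits were already flipped in RAM before the checksum was computed and the redundant copies written.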
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
It sounds like people are missing the forest for the trees. Some of us have been successfully RAIDing/deploying storage for years on everything from IDE vinum to SCSI XFS and beyond without ECC. We use ZFS today because of its feature set. Data integrity checking through checksumming is just one of those features, and it mitigates some issues that other file systems have historically failed to address. (Otherwise we should all be happy with existing journaling filesystems on a soft or hard RAID.) ECC just adds another layer of mitigation (and even in a less implementation-specific way, like how ZFS may 'prefer' raw device access instead of whatever storage abstraction the controller is presenting). Asserting that ECC is required has about the same logic to it (and I would say less logic to it) as asserting that a 3ware controller with raw JBOD passthrough is required.

On Sat, Apr 12, 2014 at 7:46 AM, Bayard Bell buffer.g.overf...@gmail.com wrote: Jason, Although I moved from OS X to illumos as a primary platform precisely because of ZFS (I ended up posting to the original list about the demise of the project because I happened to be doing an install the week Apple pulled the plug), I've spent enough time with OS X, including debugging storage interop issues with NexentaStor in significant commercial deployments, that it's risible to suggest I have zero knowledge of the platform and even more risible to imply that the role of ECC in ZFS architecture is here somehow fundamentally a matter of platform variation. I've pointed to a Solaris engineer showing core dumps from non-ECC RAM and reporting data corruption as a substantiated instance of ECC problems, and I've pointed to how ECC serves as a point of reference for one of ZFS's co-creators. I've explained that ECC in ZFS should be understood in terms of the scale ZFS allows and the challenges that scale creates for data integrity protection, and I've tried to contrast the economics of ECC with what I take to be a less compelling alternative sketched out by the Madison paper. At the same time as I've said that ECC use is generally assumed with ZFS, I've allowed that using it is a question of an incremental cost against the value of your data and the cost to replace it. I don't understand why you've decided to invest so much in arguing that ECC is so completely marginal a data integrity measure that you can't have a reasonable discussion about what gets people to different conclusions, and why you feel the need to be overtly dismissive of the professionalism and expertise of those who come to fundamentally different conclusions, but clearly there's not going to be a dialogue on this. My only interest in posting at this point is so that people on this list at least have a clear statement of both ends of the argument and can judge for themselves. Regards, Bayard

On 12 April 2014 11:44, Jason Belec jasonbe...@belecmartin.com wrote: Hhhhm, oh I get it, you have zero knowledge of the platform this list represents. No worries, appreciate your time clearing that up. -- Jason Belec Sent from my iPad

On Apr 12, 2014, at 6:26 AM, Bayard Bell buffer.g.overf...@gmail.com wrote: Jason, If you think I've said anything about the sky falling or referenced a wiki, you're responding to something other than what I wrote. I see no need for further reply. Cheers, Bayard
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
Interesting point about different kinds of ECC memory. I wonder if the difference is important enough to consider for a 20x3TB ZFS pool. For the sake of sakes, I will likely look into getting ECC memory.
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
All this talk about controllers, sync, buffers, storage, and cache got me thinking. I looked up how ZFS handles cache flushing, and how VirtualBox handles cache flushing.

*According to http://docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-6.html* ZFS issues *infrequent flushes* (every 5 seconds or so) after the uberblock updates. The flushing infrequency is fairly inconsequential, so no tuning is warranted here. ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on).

*According to http://www.virtualbox.org/manual/ch12.html* 12.2.2. Responding to guest IDE/SATA flush requests: If desired, the virtual disk images can be flushed when the guest issues the IDE FLUSH CACHE command. Normally *these requests are ignored* for improved performance. The parameters below are only accepted for disk drives. They must not be set for DVD drives.

I'm going to enable cache flushing and see how that affects results.

On Tue, Apr 1, 2014 at 7:14 PM, Bayard Bell buffer.g.overf...@gmail.com wrote: Could you explain how you're using VirtualBox and why you'd use a type 2 hypervisor in this context? Here's a scenario where you really have to mind with hypervisors: ZFS tells a virtualised controller that it needs to sync a buffer, and the controller tells ZFS that all's well while perhaps requesting an async flush. ZFS thinks it's done all the I/Os to roll a TXG to stable storage, but in the meantime something else crashes and whoosh go your buffers. I'm not sure it's come across particularly well in this thread, but ZFS doesn't and can't cope with hardware that's so unreliable that it tells lies about basic things, like whether your writes have made it to stable storage, or doesn't mind the shop, as is the case with non-ECC memory. It's one thing when you have a device reading back something that doesn't match the checksum, but it gets uglier when you've got a single I/O path and a controller that seems to write the wrong bits in stride (I've seen this) or when the problems are even closer to home (and again I emphasise RAM). You may not have problems right away. You may have problems where you can't tell the difference, like flipped bits in data buffers that have no other integrity checks. But you can run into complex failure scenarios where ZFS has to cash in on guarantees that were rather more approximate than what it was told, and then it may not be a case of having some bits flipped in photos or MP3s but of no longer being able to import your pool, or of having someone who knows how to operate zdb do some additional TXG rollback to get your data back after losing some updates. I don't know if you're running ZFS in a VM or running VMs on top of ZFS, but either way, you probably want to Google for data loss with VirtualBox and whatever device you're emulating and see whether there are known issues. You can find issue reports out there on VirtualBox data loss, but working through bug reports can be challenging. Cheers, Bayard

On 1 April 2014 16:34, Eric Jaw naisa...@gmail.com wrote: On Tuesday, April 1, 2014 7:04:39 AM UTC-4, jasonbelec wrote: ZFS is lots of parts, in most cases lots of cheap unreliable parts, refurbished parts, yadda yadda, and as posted on this thread and many, many others, any issues are probably not ZFS but the parts of the whole. Yes, it could be ZFS, after you confirm that all the parts are pristine, maybe.

I don't think it's ZFS.
ZFS is pretty solid. In my specific case, I'm trying to figure out why VirtualBox is creating these issues. I'm pretty sure that's the root cause, but I don't know why yet, so I'm just speculating at this point. Of course, I want to get my ZFS up and running so I can move on to what I really need to do, so it's easy to jump to a conclusion about something that I haven't thought of in my position. Hope you can understand.

My oldest system running ZFS is a Mac Mini Intel Core Duo with 3GB RAM (not ECC); it is the home server for music, TV shows, movies, and some interim backups. The mini has been modded for eSATA and has 6 drives connected. The pool is 2 RAIDZ of 3 mirrored, with copies set at 2. It's been running since ZFS was released from Apple builds. I lost 3 drives, eventually traced to a new cable that cracked at the connector, which, when hot enough, expanded, lifting 2 pins free of their connector counterparts and resulting in errors. Visually it was almost impossible to see. I replaced port multipliers, eSATA cards, RAM, minis, the power supply, reinstalled the OS, reinstalled ZFS, and restored the ZFS data from backup, finally finding the bad connector end only because it was hot and felt 'funny'. Frustrating, yes, but educational also. The happy news is, all the data was fine; the wife would have torn me to shreds if photos were missing, music was corrupt, etc., etc. And
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
The only time this should make a difference is when your host experiences an unclean shutdown / reset / crash.

On Apr 2, 2014, at 8:49 AM, Eric naisa...@gmail.com wrote: I believe we are referring to the same things. I JUST read about cache flushing. ZFS does cache flushing, and VirtualBox ignores cache flushes by default. Please, if you can, let me know the key settings you have used. From the documentation that I read, the command it said to issue is: VBoxManage setextradata "VM name" "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush" 0, where [x] is the disk value.

On Wed, Apr 2, 2014 at 2:37 AM, Boyd Waters waters.b...@gmail.com wrote: I was able to destroy ZFS pools by trying to access them from inside VirtualBox, until I read the detailed documentation and set the disk buffer options correctly. I will dig into my notes and post the key setting to this thread when I find it. But I've used ZFS for many years without ECC RAM with no trouble. It isn't the best way to go, but it isn't the lack of ECC that's killing a ZFS pool. It's the hypervisor hardware emulation and buffering. Sent from my iPad

On Apr 1, 2014, at 5:24 PM, Jason Belec jasonbe...@belecmartin.com wrote: I think Bayard has hit on some very interesting points, part of what I was alluding to, but very well presented here. Jason Sent from my iPhone 5S
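A sketch of what that setting looks like in practice, assuming a hypothetical VM named "zfs-vm" with two virtual disks attached to the AHCI (SATA) controller at ports 0 and 1 -- the VM name and port numbers are placeholders, and per the VirtualBox manual the setting should only be applied to disk drives, never DVD drives:

  # honour the guest's FLUSH CACHE requests on each attached disk (IgnoreFlush=0)
  VBoxManage setextradata "zfs-vm" "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0
  VBoxManage setextradata "zfs-vm" "VBoxInternal/Devices/ahci/0/LUN#1/Config/IgnoreFlush" 0

With IgnoreFlush set to 0, VirtualBox passes the guest's flush requests through instead of dropping them, which is what lets the uberblock and fsync-driven flushes ZFS issues actually reach stable storage.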
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
Eh, I suspected that.

On Wed, Apr 2, 2014 at 2:38 PM, Daniel Becker razzf...@gmail.com wrote: The only time this should make a difference is when your host experiences an unclean shutdown / reset / crash.
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
On Sun, Mar 2, 2014 at 12:46 AM, Jean-Yves Avenard jyaven...@gmail.com wrote: On 28 February 2014 20:32, Philip Robar philip.ro...@gmail.com wrote: cyberjock is the biggest troll ever; not even the people actually involved with FreeNAS (iX Systems) know what to do with him. He does spend an awful amount of time on the FreeNAS forums helping others, and as such they tolerate him on that basis. Otherwise, he's just someone doing nothing, with a lot of time on his hands, spewing the same stuff over and over simply because he has heard about it.

Well, that's at odds with his claims of how much time and effort he has put into learning about ZFS and is basically an ad hominem attack, but since Daniel Becker has already cast a fair amount of doubt on both the scenario and logic behind cyberjock's ECC vs non-ECC posts and his understanding of the architecture of ZFS, I'll move on.

Back to the ECC topic; one core issue with ZFS is that it will specifically write to the pool even when all you are doing is reading, in an attempt to correct any data found to have an incorrect checksum. So say you have corrupted memory: you read from the disk, ZFS believes the data is faulty (after all, the checksum will be incorrect due to faulty RAM) and starts to rewrite the data. That is one scenario where ZFS will corrupt an otherwise healthy pool until it's too late and all your data is gone. As such, ZFS is indeed more sensitive to bad RAM than other filesystems.

So, you're agreeing with cyberjock's conclusion, just not the path he took to get there.

Having said that, find me *ONE* official source other than the FreeNAS forum stating that ECC is a minimal requirement (and no, a wiki written by cyberjock doesn't count). Solaris never said so, FreeBSD didn't either, nor Sun.

So if a problem isn't documented, it's not a problem? Most Sun/Solaris documentation isn't going to mention the need for ECC memory because all Sun systems shipped with ECC memory. FreeBSD/PC-BSD/FreeNAS/NAS4Free/Linux in turn derive from worlds where ECC memory is effectively nonexistent, so their lack of documentation may stem from a combination of the ZFS folks just assuming that you have it and the distro people not realizing that you need it. FreeNAS's guide does state pretty strongly that you should use ECC memory. But if you insist, from Oracle Solaris 11.1 Administration: ZFS File Systems: Consider using ECC memory to protect against memory corruption. Silent memory corruption can potentially damage your data. [1] It seems to me that if using ZFS without ECC memory puts someone's data at an increased risk over other file systems, then they ought to be told that so that they can make an informed decision. Am I really being unreasonable about this?

Bad RAM however has nothing to do with the occasional bit flip that would be prevented using ECC RAM. The probability of a bit flip is low, very low.

You and Jason have both claimed this. This is at odds with papers and studies I've seen mentioned elsewhere. Here's what a little searching found: Soft error: https://en.wikipedia.org/wiki/Soft_error This says that there are numerous sources of soft errors in memory and other circuits besides cosmic rays. ECC memory: https://en.wikipedia.org/wiki/ECC_memory This states that design has dealt with the problem of increased circuit density.
It then mentions the research IBM did years ago and Google's 2009 report, which says: The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5-7 × 10^-11 errors/bit·h), i.e. about 5 single-bit errors in 8 gigabytes of RAM per hour using the top-end error rate, and more than 8% of DIMM memory modules affected by errors per year.

So, since you've agreed that ZFS is more vulnerable than other file systems to memory errors, and Google says that these errors are a lot more frequent than most people think they are, then the question becomes: just how much more vulnerable is ZFS, and is the extent of the corruption likely to be wider or more catastrophic than on other file systems?

Phil

[1] Oracle Solaris 11.1 Administration: ZFS File Systems: http://docs.oracle.com/cd/E26502_01/html/E29007/zfspools-4.html
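For anyone checking the parenthetical arithmetic, the quoted figure works out as follows, taking 8 GB of RAM as 8 × 8 × 1024 = 65,536 megabits:

  top end: 70,000 errors per 10^9 device-hours per megabit × 65,536 megabits ≈ 4.6 errors per hour
  low end: 25,000 errors per 10^9 device-hours per megabit × 65,536 megabits ≈ 1.6 errors per hour

which matches the "about 5 single-bit errors in 8 gigabytes of RAM per hour" at the top-end rate.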
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
On Sat, Mar 1, 2014 at 5:07 PM, Jason Belec jasonbe...@belecmartin.com wrote: Technically, what you qualify below is a truism under any hardware. ZFS is neither more nor less susceptible to RAM failure, as it has nothing to do with ZFS. Anything that gets written to the pool technically is sound. You have chosen a single possible point of failure; what of firmware, drive cache, motherboard, power surges, motion, etc.?

I'm sorry, but I'm not following your logic here. Are you saying that ZFS doesn't use RAM so it can't be affected by it? ZFS likes lots of memory and uses it aggressively. So my understanding is that large amounts of data are more likely to be in memory with ZFS than with other file systems. If Google's research is to be believed, then random memory errors are a lot more frequent than you think they are. As I understand it, ZFS does not checksum data while it's in memory. (While there is a debug flag to turn this on, I'm betting that the performance hit is pretty big.) So how does RAM failure or random bit flips have nothing to do with ZFS?

RAM/ECC RAM is like consumer drives vs pro drives in your system; recent long-term studies have shown you don't get much more for the extra money.

Do you have references to these studies? This directly conflicts with what I've seen posted, with references, in other forums on the frequency of soft memory errors, particularly on systems that run 24x7, and how ECC memory is able to correct these random errors.

I have been running ZFS in production using the past and current versions for OS X on over 60 systems (12 are servers) since Apple kicked ZFS loose. No systems (3 run ECC) have had data corruption or data loss.

That you know of.

Some pools have disappeared on the older ZFS but were easily recovered on modern (current development) and past OpenSolaris, FreeBSD, etc., as I keep clones of 'corrupted' pools for such tests. Almost always, these were the result of connector/cable failure. In that span of time no RAM has failed 'utterly' and all data and tests have shown quality storage. In that time 11 drives have failed and easily been replaced; 4 of those were OS drives, with data stored under ZFS and a regular clone of the OS also stored under ZFS just in case. All pools are backed up/replicated off site. Probably a lot more than most are doing for data integrity. No, this data I'm providing is not a guarantee. It's just data from someone who has grown to trust ZFS in the real world for clients that cannot lose data, for the most part due to legal regulations. I trust RAM manufacturers and drive manufacturers equally; I just verify for peace of mind with ZFS.

I have an opinion of people who run servers with legal or critical business data on them that do not use ECC memory, but I'll keep it to myself.

Phil
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
Technically, what you qualify below is a truism under any hardware. ZFS is neither more nor less susceptible to RAM failure, as it has nothing to do with ZFS. Anything that gets written to the pool technically is sound. You have chosen a single possible point of failure; what of firmware, drive cache, motherboard, power surges, motion, etc.? RAM/ECC RAM is like consumer drives vs pro drives in your system; recent long-term studies have shown you don't get much more for the extra money.

I have been running ZFS in production using the past and current versions for OS X on over 60 systems (12 are servers) since Apple kicked ZFS loose. No systems (3 run ECC) have had data corruption or data loss. Some pools have disappeared on the older ZFS but were easily recovered on modern (current development) and past OpenSolaris, FreeBSD, etc., as I keep clones of 'corrupted' pools for such tests. Almost always, these were the result of connector/cable failure. In that span of time no RAM has failed 'utterly' and all data and tests have shown quality storage. In that time 11 drives have failed and easily been replaced; 4 of those were OS drives, with data stored under ZFS and a regular clone of the OS also stored under ZFS just in case. All pools are backed up/replicated off site. Probably a lot more than most are doing for data integrity. No, this data I'm providing is not a guarantee. It's just data from someone who has grown to trust ZFS in the real world for clients that cannot lose data, for the most part due to legal regulations. I trust RAM manufacturers and drive manufacturers equally; I just verify for peace of mind with ZFS.

-- Jason Belec Sent from my iPad

On Mar 1, 2014, at 5:39 PM, Philip Robar philip.ro...@gmail.com wrote: On Fri, Feb 28, 2014 at 2:36 PM, Richard Elling richard.ell...@gmail.com wrote: We might buy this argument if, in fact, no other program had the same vulnerabilities. But *all* of them do -- including OS X. So it is disingenuous to claim this as a ZFS deficiency.

No, it's disingenuous of you to ignore the fact that I carefully qualified what I said. To repeat: it's claimed, with a detailed example and reasoned argument, that ZFS is MORE vulnerable to corruption due to memory errors when using non-ECC memory and that that corruption is MORE likely to be extensive or catastrophic than with other file systems. As I said, Jason's and Daniel Becker's responses are reassuring, but I'd really like a definitive answer to this, so I've reached out to one of the lead OpenZFS developers. Hopefully, I'll hear back from him.

Phil
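The "verify for peace of mind" Jason describes is typically just a scheduled scrub plus a status check. A minimal sketch, assuming a hypothetical pool named tank (the pool name and schedule are placeholders; on OS X you would more likely wrap this in a launchd job than in cron):

  # re-read and verify every checksummed block in the pool, repairing from redundancy where possible
  zpool scrub tank
  # afterwards, look for checksum errors and any repaired data
  zpool status -v tank

Run regularly (say, weekly), this is what turns "no noticed problems" into something closer to "no detected checksum errors".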
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
On Sun, Mar 2, 2014 at 12:46 AM, Jean-Yves Avenard jyaven...@gmail.com wrote: Back to the OP, I'm not sure why he felt he had to mention being part of SunOS. ZFS was never part of SunOS.

I didn't say I was part of SunOS (later renamed Solaris 1). SunOS was dead and buried years before I joined the network side of OS/Net. OS in this case just means operating system; it's not a reference to the OS in SunOS. By mentioning that I worked in the part of Sun that invented ZFS and saying that I am a fan of it, I was just trying to be clear that I was not attacking ZFS by questioning some aspect of it. Clearly, at least in some minds, I failed at that.

Phil
Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data
On Feb 26, 2014, at 10:51 PM, Daniel Becker razzf...@gmail.com wrote: Incidentally, that paper came up in a ZFS-related thread on Ars Technica just the other day (as did the link to the FreeNAS forum post). Let me just quote what I said there: The conclusion of the paper is that ZFS does not protect against in-memory corruption, and thus can't provide end-to-end integrity in the presence of memory errors. I am not arguing against that at all; obviously you'll want ECC on your ZFS-based server if you value data integrity -- just as you would if you were using any other file system. That doesn't really have anything to do with the claim that ZFS specifically makes lack of ECC more likely to cause total data loss, though. The sections you quote below basically say that while ZFS offers good protection against on-disk corruption, it does *not* effectively protect you against memory errors. Or, put another way, the authors are basically finding that despite all the FS-level checksumming, ZFS does not render ECC memory unnecessary (as one might perhaps naively expect). No claim is being made that memory errors affect ZFS more than other filesystems.

Yes. Just like anything else, end-to-end data integrity is needed. So until people write apps that self-check everything, there is a possibility that something you trust [1] can fail. As it happens, only the PC market demands no ECC. TANSTAAFL.

[1] http://en.wikipedia.org/wiki/Pentium_FDIV_bug

-- richard