Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-12 Thread Bayard Bell
Jason,

If you think I've said anything about the sky falling or referenced a wiki,
you're responding to something other than what I wrote. I see no need for
further reply.

Cheers,
Bayard


On 11 April 2014 22:36, Jason Belec jasonbe...@belecmartin.com wrote:

 Excellent. If you feel this is necessary, go for it. Those that have
 systems that don't have ECC should just run like the sky is falling, by your
 point of view. That said, I can guarantee none of the systems I have under my
 care have issues. How do I know? Well, the data is tested/compared at
 regular intervals. Maybe I'm the luckiest guy ever; where is that lottery
 ticket? Is ECC better? Possibly, probably in heavy-load environments, but no
 data has been provided to back this up, especially nothing in the context
 of what most users' needs are, at least here in the Mac space. Which ECC? Be
 specific. They are not all the same. Just like regular RAM is not all the
 same. Just like HDDs are not all the same. Fear mongering is wonderful and
 easy. Putting forth a solution guaranteed to be better is what's needed
 now. Did you actually reference a wiki? Seriously? A document anyone can
 edit to suit their view? I guess I come from a different era.


 Jason
 Sent from my iPhone 5S

 On Apr 11, 2014, at 5:09 PM, Bayard Bell buffer.g.overf...@gmail.com
 wrote:

 If you want more of a smoking gun report on data corruption without ECC,
 try:

 https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc

 This view isn't isolated in terms of what people at Sun thought or what
 people at Oracle now think. Try googling for zfs ecc site:blogs.oracle.com,
 and you'll find a recurring statement that ECC should be used even in home
 deployment, with maybe one odd exception.

 The Wikipedia article, correctly summarising the Google study, is plain in
 saying not that extremely high error rates are common but that error rates
 are highly variable in large-sample studies, with some systems seeing
 extremely high error rates. ECC gives significant assurance for an
 incremental cost, so what's your data worth? You're not guaranteed to be
 screwed by not using ECC (and the Google paper doesn't say this either),
 but you are assuming risks that ECC mitigates. Look at the above blog,
 however: even DIMMs that are high-quality but non-ECC can go wrong and
 result in nasty system corruption.

 What generally protects you in terms of pool integrity is metadata
 redundancy on top of integrity checks, but if you flip bits on metadata
 in-core before writing redundant copies, well, that's a risk to pool
 integrity.
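
 A scheduled scrub is the usual way to exercise those integrity checks end to
 end: it re-reads every allocated block, verifies its checksum, and repairs
 from redundancy where it can. A minimal sketch, assuming a pool named tank
 (both the pool name and the schedule are placeholders):

     # crontab entry: full scrub every Sunday at 03:00 (paths vary by platform)
     0 3 * * 0 zpool scrub tank

     # afterwards, inspect the CKSUM counters and any repaired or errored files
     zpool status -v tank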

 I also think it's mistaken to say this is distinctly a problem with ZFS.
 Any next-generation filesystem that provides protection against on-disk
 corruption via checksums ends up with its residual risk focused on making
 sure that in-core data integrity is robust. You could well have those
 problems on the pools you've deployed; there are a lot of situations in
 which you'd never know, and quite a lot (such as most of the bits in a photo
 or MP3) where you'd never notice low rates of bit-flipping. The fact that
 you haven't noticed doesn't equate to there being no problems in a strict
 sense; it's far more likely that you've been able to tolerate the flipping
 that's happened. The guy at Sun with the blog above got lucky: he was
 running high-quality non-ECC RAM, and it went pear-shaped, at least for
 metadata corruption, quite quickly, allowing him to recover by rolling back
 snapshots.
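
 One way to make that kind of silent flipping visible at the application
 level is to keep an independent checksum manifest of the files you care
 about and re-verify it periodically. A sketch, assuming shasum is on the
 path and the media lives under /tank/photos (both are placeholders):

     # build a manifest once
     find /tank/photos -type f -exec shasum -a 256 {} + > ~/photos.sha256

     # later, re-verify; anything not reported as OK has changed underneath you
     shasum -a 256 -c ~/photos.sha256 | grep -v ': OK$'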

 Take a look out there, and you'll find people who are very confused about
 the risks and available mitigations. I found someone saying that there's no
 problem with more traditional RAID technologies because disks have CRCs. By
 comparison, you can find Bonwick, educated as a statistician, talking about
 SHA256 collisions by comparison to undetected ECC error rates and
 introducing ZFS data integrity safeguards by way of analogy to ECC. That's
 why the large-sample studies are interesting and useful: none of this
 technology makes data corruption impossible; it just goes to extreme lengths
 to marginalise the chances of those events by addressing known sources of
 errors and fundamental error scenarios. In-core is so core that if you
 tolerate errors there, those errors will characterise systematic behaviour
 where better outcomes are reasonably available (and that's
 *reasonably* available, I would suggest, in a way that the Madison
 paper's recommendation to make ZFS buffers magical isn't). CRC-32 does a
 great job detecting bad sectors and preventing them from being read back,
 but SHA256 in the right place in a system detects errors that a
 well-conceived vdev topology will generally make recoverable. That includes
 catching cases where an error isn't caught by CRC-32, which may be a rare
 result, but when you've got the kind of data densities that ZFS can allow,
 you're rolling the dice often enough that those results become interesting.
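
 By way of illustration, putting that combination in place is a couple of
 commands at pool-creation time; a sketch with placeholder device names, not
 a recommendation for any particular layout:

     # mirrored vdevs supply the redundancy that makes checksum-detected errors repairable
     zpool create tank mirror disk1 disk2 mirror disk3 disk4

     # use SHA-256 instead of the default fletcher4 block checksum
     zfs set checksum=sha256 tank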

 ECC is one of the most basic steps 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-12 Thread Peter Lai
It sounds like people are missing the forest for the trees. Some of us
have been successfully RAIDing/deploying storage for years on
everything from IDE vinum to SCSI XFS and beyond without ECC. We use
ZFS today because of its featureset. Data integrity checking through
checksumming is just one of those features; it mitigates issues that other
file systems have historically failed to address. (Otherwise we should all be
happy with existing journaling filesystems on soft or hard RAID.) ECC just
adds another layer of mitigation (and even in a less implementation-specific
way, like how ZFS may 'prefer' raw device access instead of whatever storage
abstraction the controller is presenting). Asserting that ECC is required has
about the same logic to it (and I would say less) as asserting that a 3ware
controller with raw JBOD passthrough is required.

On Sat, Apr 12, 2014 at 7:46 AM, Bayard Bell
buffer.g.overf...@gmail.com wrote:
 Jason,

 Although I moved from OS X to illumos as a primary platform precisely
 because of ZFS (I ended up posting to the original list about the demise of
 the project because I happened to be doing an install the week Apple pulled
 the plug), I've spent enough time with OS X, including debugging storage
 interop issues with NexentaStor in significant commercial deployments, that
 it's risible to suggest I have zero knowledge of the platform and even more
 risible to imply that the role of ECC in ZFS architecture is here somehow
 fundamentally a matter of platform variation. I've pointed to a Solaris
 engineer showing core dumps from non-ECC RAM and reporting data corruption
 as a substantiated instance of ECC problems, and I've pointed to references
 to how ECC serves as a point of reference from one of its co-creators. I've
 explained that ECC in ZFS should be understood in terms of the scale it
 allows and the challenges that creates for data integrity protection, and
 I've tried to contrast the economics of ECC with what I take to be a less
 compelling alternative sketched out by the Madison paper. At the same time as
 I've said that ECC use is generally assumed in ZFS, I've allowed that doing
 so is a question of an incremental cost against the value of your data and
 costs to replace it.

 I don't understand why you've decided to invest so much in arguing that ECC
 is so completely marginal a data integrity measure that you can't have a
 reasonable discussion about what gets people to different conclusions and
 feel the need to be overtly dismissive of the professionalism and expertise
 of those who come to fundamentally different conclusions, but clearly
 there's not going to be a dialogue on this. My only interest in posting at
 this point is so that people on this list at least have a clear statement of
 both ends of the argument and can judge for themselves.

 Regards,
 Bayard


 On 12 April 2014 11:44, Jason Belec jasonbe...@belecmartin.com wrote:

 Hhhhm, oh I get it, you have zero knowledge of the platform this list
 represents. No worries, appreciate your time clearing that up.



 --
 Jason Belec
 Sent from my iPad

 On Apr 12, 2014, at 6:26 AM, Bayard Bell buffer.g.overf...@gmail.com
 wrote:

 Jason,

 If you think I've said anything about the sky falling or referenced a
 wiki, you're responding to something other than what I wrote. I see no need
 for further reply.

 Cheers,
 Bayard


 On 11 April 2014 22:36, Jason Belec jasonbe...@belecmartin.com wrote:

 Excellent. If you feel this is necessary, go for it. Those that have
 systems that don't have ECC should just run like the sky is falling, by your
 point of view. That said, I can guarantee none of the systems I have under my
 care have issues. How do I know? Well, the data is tested/compared at
 regular intervals. Maybe I'm the luckiest guy ever; where is that lottery
 ticket? Is ECC better? Possibly, probably in heavy-load environments, but no
 data has been provided to back this up, especially nothing in the context
 of what most users' needs are, at least here in the Mac space. Which ECC? Be
 specific. They are not all the same. Just like regular RAM is not all the
 same. Just like HDDs are not all the same. Fear mongering is wonderful and
 easy. Putting forth a solution guaranteed to be better is what's needed
 now. Did you actually reference a wiki? Seriously? A document anyone can
 edit to suit their view? I guess I come from a different era.


 Jason
 Sent from my iPhone 5S

 On Apr 11, 2014, at 5:09 PM, Bayard Bell buffer.g.overf...@gmail.com
 wrote:

 If you want more of a smoking gun report on data corruption without ECC,
 try:

 https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc

 This view isn't isolated in terms of what people at Sun thought or what
 people at Oracle now think. Try googling for zfs ecc
 site:blogs.oracle.com, and you'll find a recurring statement that ECC
 should be used even in home deployment, with maybe one odd exception.

 The Wikipedia article, correctly 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-11 Thread Eric
Interesting point about different kinds of ECC memory. I wonder if the
difference is important enough to consider for a 20x3TB ZFS pool. In any
case, I will likely look into getting ECC memory.


On Fri, Apr 11, 2014 at 5:36 PM, Jason Belec jasonbe...@belecmartin.com wrote:

 Excellent. If you feel this is necessary, go for it. Those that have
 systems that don't have ECC should just run like the sky is falling, by your
 point of view. That said, I can guarantee none of the systems I have under my
 care have issues. How do I know? Well, the data is tested/compared at
 regular intervals. Maybe I'm the luckiest guy ever; where is that lottery
 ticket? Is ECC better? Possibly, probably in heavy-load environments, but no
 data has been provided to back this up, especially nothing in the context
 of what most users' needs are, at least here in the Mac space. Which ECC? Be
 specific. They are not all the same. Just like regular RAM is not all the
 same. Just like HDDs are not all the same. Fear mongering is wonderful and
 easy. Putting forth a solution guaranteed to be better is what's needed
 now. Did you actually reference a wiki? Seriously? A document anyone can
 edit to suit their view? I guess I come from a different era.


 Jason
 Sent from my iPhone 5S

 On Apr 11, 2014, at 5:09 PM, Bayard Bell buffer.g.overf...@gmail.com
 wrote:

 If you want more of a smoking gun report on data corruption without ECC,
 try:

 https://blogs.oracle.com/vlad/entry/zfs_likes_to_have_ecc

 This view isn't isolated in terms of what people at Sun thought or what
 people at Oracle now think. Try googling for zfs ecc site:blogs.oracle.com,
 and you'll find a recurring statement that ECC should be used even in home
 deployment, with maybe one odd exception.

 The Wikipedia article, correctly summarising the Google study, is plain in
 saying not that extremely high error rates are common but that error rates
 are highly variable in large-sample studies, with some systems seeing
 extremely high error rates. ECC gives significant assurance for an
 incremental cost, so what's your data worth? You're not guaranteed to be
 screwed by not using ECC (and the Google paper doesn't say this either),
 but you are assuming risks that ECC mitigates. Look at the above blog,
 however: even DIMMs that are high-quality but non-ECC can go wrong and
 result in nasty system corruption.

 What generally protects you in terms of pool integrity is metadata
 redundancy on top of integrity checks, but if you flip bits on metadata
 in-core before writing redundant copies, well, that's a risk to pool
 integrity.

 I also think it's mistaken to say this is distinctly a problem with ZFS.
 Any next-generation filesystem that provides protection against on-disk
 corruption via checksums ends up with its residual risk focused on making
 sure that in-core data integrity is robust. You could well have those
 problems on the pools you've deployed; there are a lot of situations in
 which you'd never know, and quite a lot (such as most of the bits in a photo
 or MP3) where you'd never notice low rates of bit-flipping. The fact that
 you haven't noticed doesn't equate to there being no problems in a strict
 sense; it's far more likely that you've been able to tolerate the flipping
 that's happened. The guy at Sun with the blog above got lucky: he was
 running high-quality non-ECC RAM, and it went pear-shaped, at least for
 metadata corruption, quite quickly, allowing him to recover by rolling back
 snapshots.

 Take a look out there, and you'll find people who are very confused about
 the risks and available mitigations. I found someone saying that there's no
 problem with more traditional RAID technologies because disks have CRCs. By
 comparison, you can find Bonwick, educated as a statistician, talking about
 SHA256 collisions by comparison to undetected ECC error rates and
 introducing ZFS data integrity safeguards by way of analogy to ECC. That's
 why the large-sample studies are interesting and useful: none of this
 technology makes data corruption impossible; it just goes to extreme lengths
 to marginalise the chances of those events by addressing known sources of
 errors and fundamental error scenarios. In-core is so core that if you
 tolerate errors there, those errors will characterise systematic behaviour
 where better outcomes are reasonably available (and that's
 *reasonably* available, I would suggest, in a way that the Madison
 paper's recommendation to make ZFS buffers magical isn't). CRC-32 does a
 great job detecting bad sectors and preventing them from being read back,
 but SHA256 in the right place in a system detects errors that a
 well-conceived vdev topology will generally make recoverable. That includes
 catching cases where an error isn't caught by CRC-32, which may be a rare
 result, but when you've got the kind of data densities that ZFS can allow,
 you're rolling the dice often enough that those results become interesting.

 ECC is one 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-02 Thread Eric
All this talk about controller, sync, buffer, storage, cache got me
thinking.

I looked up how ZFS handles cache flushing, and how VirtualBox handles
cache flushing.

*According to
http://docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-6.html*

ZFS issues *infrequent flushes* (every 5 seconds or so) after the uberblock
 updates. The flushing infrequency is fairly inconsequential so no tuning is
 warranted here. ZFS also issues a flush every time an application requests
 a synchronous write (O_DSYNC, fsync, NFS commit, and so on).


*According to http://www.virtualbox.org/manual/ch12.html*

12.2.2. Responding to guest IDE/SATA flush requests

 If desired, the virtual disk images can be flushed when the guest issues
 the IDE FLUSH CACHE command. Normally *these requests are ignored* for
 improved performance. The parameters below are only accepted for disk
 drives. They must not be set for DVD drives.


I'm going to enable cache flushing and see how that affects results.
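
For reference, the knob that manual section describes comes down to clearing
IgnoreFlush per attached disk. A sketch assuming a SATA (AHCI) controller, a
VM named "zfs-test" and disk slot 0, all of which are placeholders for the
actual setup:

    # tell VirtualBox to honour guest flush requests on this disk
    VBoxManage setextradata "zfs-test" \
      "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0

IDE-attached disks use the piix3ide device name in place of ahci, per the
same chapter.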




On Tue, Apr 1, 2014 at 7:14 PM, Bayard Bell buffer.g.overf...@gmail.com wrote:

 Could you explain how you're using VirtualBox and why you'd use a type 2
 hypervisor in this context?

 Here's a scenario where you really have to mind with hypervisors: ZFS
 tells a virtualised controller that it needs to sync a buffer, and the
 controller tells ZFS that all's well while perhaps requesting an async
 flush. ZFS thinks it's done all the I/Os to roll a TXG to stable storage,
 but in the mean time something else crashes and whoosh go your buffers.

 I'm not sure it's come across particularly well in this thread, but ZFS
 doesn't and can't cope with hardware that's so unreliable that it tells
 lies about basic things, like whether your writes have made it to stable
 storage, or doesn't mind the shop, as is the case with non-ECC memory. It's
 one thing when you have a device reading back something that doesn't match
 the checksum, but it gets uglier when you've got a single I/O path and a
 controller that seems to write the wrong bits in stride (I've seen this) or
 when the problems are even closer to home (and again I emphasise RAM). You
 may not have problems right away. You may have problems where you can't
 tell the difference, like flipping bits in data buffers that have no other
 integrity checks. But you can run into complex failure scenarios where ZFS
 has to cash in on guarantees that were rather more approximate than what it
 was told, and then it may not be a case of having some bits flipped in
 photos or MP3s but no longer being able to import your pool or having
 someone who knows how to operate zdb do some additional TXG rollback to get
 your data back after losing some updates.
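
 For what it's worth, the less exotic end of that recovery path is the
 rewind option on import; a sketch with a placeholder pool name, and worth a
 dry run first, since rewinding discards the most recent transactions:

     # dry run: report whether a recovery import would work, without doing it
     zpool import -F -n tank

     # actually attempt the import, rewinding to the last usable TXG if needed
     zpool import -F tank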

 I don't know if you're running ZFS in a VM or running VMs on top of ZFS,
 but either way, you probably want to Google for "data loss VirtualBox"
 and whatever device you're emulating and see whether there are known
 issues. You can find issue reports out there on VirtualBox data loss, but
 working through bug reports can be challenging.

 Cheers,
 Bayard

 On 1 April 2014 16:34, Eric Jaw naisa...@gmail.com wrote:



 On Tuesday, April 1, 2014 7:04:39 AM UTC-4, jasonbelec wrote:

 ZFS is lots of parts, in most cases lots of cheap unreliable parts,
 refurbished parts, yadda yadda, as posted on this thread and many, many
 others, any issues are probably not ZFS but the parts of the whole. Yes, it
 could be ZFS, after you confirm that all the parts are pristine, maybe.


 I don't think it's ZFS. ZFS is pretty solid. In my specific case, I'm
 trying to figure out why VirtualBox is creating these issues. I'm pretty
 sure that's the root cause, but I don't know why yet. So I'm just
 speculating at this point. Of course, I want to get my ZFS up and running
 so I can move on to what I really need to do, so it's easy to jump to a
 conclusion about something that I haven't thought of in my position. Hope
 you can understand.



 My oldest system running ZFS is a Mac Mini Intel Core Duo with 3GB RAM
 (not ECC); it is the home server for music, TV shows, movies, and some
 interim backups. The mini has been modded for eSATA and has 6 drives
 connected. The pool is 2 RaidZ of 3, mirrored, with copies set at 2. Been
 running since ZFS was released from Apple builds. Lost 3 drives, eventually
 traced to a new cable that cracked at the connector, which when hot enough
 expanded, lifting 2 pins free of their connector counterparts and resulting
 in errors. Visually almost impossible to see. I replaced port multipliers,
 eSATA cards, RAM, minis, power supply, reinstalled the OS, reinstalled ZFS,
 restored ZFS data from backup, finally to find the bad connector end only
 because it was hot and felt 'funny'.
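
 For readers trying to picture that layout, a rough sketch of how such a
 pool might be created; the device names are placeholders, and the wording
 above is ambiguous, so treat this as illustrative only:

     # two three-disk raidz vdevs in one pool, keeping two copies of each block
     zpool create tank raidz disk1 disk2 disk3 raidz disk4 disk5 disk6
     zfs set copies=2 tank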

 Frustrating, yes, educational also. The happy news is, all the data was
 fine, wife would have torn me to shreds if photos were missing, music was
 corrupt, etc., etc. And 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-02 Thread Daniel Becker
The only time this should make a difference is when your host experiences an 
unclean shutdown / reset / crash.

 On Apr 2, 2014, at 8:49 AM, Eric naisa...@gmail.com wrote:
 
 I believe we are referring to the same things. I JUST read about cache 
 flushing. ZFS does cache flushing and VirtualBox ignores cache flushes by 
 default.
 
 Please, if you can, let me know the key settings you have used.
 
 From the documentation that I read, the command it said to issue is:
 
 VBoxManage setextradata "VM name" \
   "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush" 0
 
 Where [x] is the disk value 
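
 One way to confirm the key took effect, with the VM name again a
 placeholder, is to enumerate the VM's extra data afterwards:

     VBoxManage getextradata "VM name" enumerate | grep IgnoreFlush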
 
 
 On Wed, Apr 2, 2014 at 2:37 AM, Boyd Waters waters.b...@gmail.com wrote:
 I was able to destroy ZFS pools by trying to access them from inside
 VirtualBox, until I read the detailed documentation and set the disk buffer
 options correctly. I will dig into my notes and post the key setting to this
 thread when I find it.
 
 But I've used ZFS for many years without ECC RAM with no trouble. It isn't 
 the best way to go, but it isn't the lack of ECC that's killing a ZFS pool. 
 It's the hypervisor hardware emulation and buffering.
 
 Sent from my iPad
 
 On Apr 1, 2014, at 5:24 PM, Jason Belec jasonbe...@belecmartin.com wrote:
 
 I think Bayard has hit on some very interesting points, part of what I was 
 alluding to, but very well presented here. 
 
 Jason
 Sent from my iPhone 5S
 
 On Apr 1, 2014, at 7:14 PM, Bayard Bell buffer.g.overf...@gmail.com 
 wrote:
 
 Could you explain how you're using VirtualBox and why you'd use a type 2 
 hypervisor in this context?
 
 Here's a scenario where you really have to mind with hypervisors: ZFS 
 tells a virtualised controller that it needs to sync a buffer, and the 
 controller tells ZFS that all's well while perhaps requesting an async 
 flush. ZFS thinks it's done all the I/Os to roll a TXG to stable storage, 
 but in the mean time something else crashes and whoosh go your buffers.
 
 I'm not sure it's come across particularly well in this thread, but ZFS 
 doesn't and can't cope with hardware that's so unreliable that it tells 
 lies about basic things, like whether your writes have made it to stable 
 storage, or doesn't mind the shop, as is the case with non-ECC memory. 
 It's one thing when you have a device reading back something that doesn't 
 match the checksum, but it gets uglier when you've got a single I/O path 
 and a controller that seems to write the wrong bits in stride (I've seen 
 this) or when the problems are even closer to home (and again I emphasise 
 RAM). You may not have problems right away. You may have problems where 
 you can't tell the difference, like flipping bits in data buffers that 
 have no other integrity checks. But you can run into complex failure 
 scenarios where ZFS has to cash in on guarantees that were rather more 
 approximate than what it was told, and then it may not be a case of having 
 some bits flipped in photos or MP3s but no longer being able to import 
 your pool or having someone who knows how to operate zdb do some 
 additional TXG rollback to get your data back after losing some updates.
 
 I don't know if you're running ZFS in a VM or running VMs on top of ZFS, 
 but either way, you probably want to Google for data loss VirtualBox 
 and whatever device you're emulating and see whether there are known 
 issues. You can find issue reports out there on VirtualBox data loss, but 
 working through bug reports can be challenging.
 
 Cheers,
 Bayard
 
 On 1 April 2014 16:34, Eric Jaw naisa...@gmail.com wrote:
 
 
 On Tuesday, April 1, 2014 7:04:39 AM UTC-4, jasonbelec wrote:
 ZFS is lots of parts, in most cases lots of cheap unreliable parts, 
 refurbished parts, yadda yadda, as posted on this thread and many, many 
 others, any issues are probably not ZFS but the parts of the whole. Yes, 
 it could be ZFS, after you confirm that all the parts are pristine, 
 maybe. 
 
 
 I don't think it's ZFS. ZFS is pretty solid. In my specific case, I'm 
 trying to figure out why VirtualBox is creating these issues. I'm pretty 
 sure that's the root cause, but I don't know why yet. So I'm just 
 speculating at this point. Of course, I want to get my ZFS up and running 
 so I can move on to what I really need to do, so it's easy to jump to a 
 conclusion about something that I haven't thought of in my position. Hope 
 you can understand.
  
 
 My oldest system running ZFS is a Mac Mini Intel Core Duo with 3GB RAM 
 (not ECC); it is the home server for music, TV shows, movies, and some 
 interim backups. The mini has been modded for eSATA and has 6 drives 
 connected. The pool is 2 RaidZ of 3, mirrored, with copies set at 2. Been 
 running since ZFS was released from Apple builds. Lost 3 drives, 
 eventually traced to a new cable that cracked at the connector, which 
 when hot enough expanded, lifting 2 pins free of their connector 
 counterparts and resulting in errors. Visually almost impossible to see. I 
 replaced port multipliers, eSATA 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-04-02 Thread Eric
eh, I suspected that


On Wed, Apr 2, 2014 at 2:38 PM, Daniel Becker razzf...@gmail.com wrote:

 The only time this should make a difference is when your host experiences
 an unclean shutdown / reset / crash.

 On Apr 2, 2014, at 8:49 AM, Eric naisa...@gmail.com wrote:

 I believe we are referring to the same things. I JUST read about cache
 flushing. ZFS does cache flushing and VirtualBox ignores cache flushes by
 default.

 Please, if you can, let me know the key settings you have used.

 From the documentation that I read, the command it said to issue is:

 VBoxManage setextradata "VM name" \
   "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush" 0


 Where [x] is the disk value


 On Wed, Apr 2, 2014 at 2:37 AM, Boyd Waters waters.b...@gmail.com wrote:

 I was able to destroy ZFS pools by trying to access them from inside
 VirtualBox, until I read the detailed documentation and set the disk
 buffer options correctly. I will dig into my notes and post the key setting
 to this thread when I find it.

 But I've used ZFS for many years without ECC RAM with no trouble. It
 isn't the best way to go, but it isn't the lack of ECC that's killing a ZFS
 pool. It's the hypervisor hardware emulation and buffering.

 Sent from my iPad

 On Apr 1, 2014, at 5:24 PM, Jason Belec jasonbe...@belecmartin.com
 wrote:

 I think Bayard has hit on some very interesting points, part of what I
 was alluding to, but very well presented here.

 Jason
 Sent from my iPhone 5S

 On Apr 1, 2014, at 7:14 PM, Bayard Bell buffer.g.overf...@gmail.com
 wrote:

 Could you explain how you're using VirtualBox and why you'd use a type 2
 hypervisor in this context?

 Here's a scenario where you really have to mind with hypervisors: ZFS
 tells a virtualised controller that it needs to sync a buffer, and the
 controller tells ZFS that all's well while perhaps requesting an async
 flush. ZFS thinks it's done all the I/Os to roll a TXG to stable storage,
 but in the mean time something else crashes and whoosh go your buffers.

 I'm not sure it's come across particularly well in this thread, but ZFS
 doesn't and can't cope with hardware that's so unreliable that it tells
 lies about basic things, like whether your writes have made it to stable
 storage, or doesn't mind the shop, as is the case with non-ECC memory. It's
 one thing when you have a device reading back something that doesn't match
 the checksum, but it gets uglier when you've got a single I/O path and a
 controller that seems to write the wrong bits in stride (I've seen this) or
 when the problems are even closer to home (and again I emphasise RAM). You
 may not have problems right away. You may have problems where you can't
 tell the difference, like flipping bits in data buffers that have no other
 integrity checks. But you can run into complex failure scenarios where ZFS
 has to cash in on guarantees that were rather more approximate than what it
 was told, and then it may not be a case of having some bits flipped in
 photos or MP3s but no longer being able to import your pool or having
 someone who knows how to operate zdb do some additional TXG rollback to get
 your data back after losing some updates.

 I don't know if you're running ZFS in a VM or running VMs on top of ZFS,
 but either way, you probably want to Google for "data loss VirtualBox"
 and whatever device you're emulating and see whether there are known
 issues. You can find issue reports out there on VirtualBox data loss, but
 working through bug reports can be challenging.

 Cheers,
 Bayard

 On 1 April 2014 16:34, Eric Jaw naisa...@gmail.com wrote:



 On Tuesday, April 1, 2014 7:04:39 AM UTC-4, jasonbelec wrote:

 ZFS is lots of parts, in most cases lots of cheap unreliable parts,
 refurbished parts, yadda yadda, as posted on this thread and many, many
 others, any issues are probably not ZFS but the parts of the whole. Yes, it
 could be ZFS, after you confirm that all the parts are pristine, maybe.


 I don't think it's ZFS. ZFS is pretty solid. In my specific case, I'm
 trying to figure out why VirtualBox is creating these issues. I'm pretty
 sure that's the root cause, but I don't know why yet. So I'm just
 speculating at this point. Of course, I want to get my ZFS up and running
 so I can move on to what I really need to do, so it's easy to jump to a
 conclusion about something that I haven't thought of in my position. Hope
 you can understand.



 My oldest system running ZFS is a Mac Mini Intel Core Duo with 3GB RAM
 (not ECC); it is the home server for music, TV shows, movies, and some
 interim backups. The mini has been modded for eSATA and has 6 drives
 connected. The pool is 2 RaidZ of 3, mirrored, with copies set at 2. Been
 running since ZFS was released from Apple builds. Lost 3 drives, eventually
 traced to a new cable that cracked at the connector, which when hot enough
 expanded, lifting 2 pins free of their connector counterparts and resulting
 in errors. Visually almost impossible to see. I replaced port 

Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-03-02 Thread Philip Robar
On Sun, Mar 2, 2014 at 12:46 AM, Jean-Yves Avenard jyaven...@gmail.com wrote:

 On 28 February 2014 20:32, Philip Robar philip.ro...@gmail.com wrote:

 cyberjock is the biggest troll ever; not even the people actually
 involved with FreeNAS (iXsystems) know what to do with him. He does
 spend an awful amount of time on the freenas forums helping others and
 as such they tolerate him on that basis.

 Otherwise, he's just someone doing nothing, with a lot of time on his
 hands, spewing the same stuff over and over simply because he has
 heard about it.


Well, that's at odds with his claims of how much time and effort he has put
into learning about ZFS and is basically an ad hominem attack, but since
Daniel Becker has already cast a fair amount of doubt on both the scenario
and logic behind cyberjock's ECC vs non-ECC posts and his understanding of
the architecture of ZFS, I'll move on.



 Back to the ECC topic; one core issue with ZFS is that it will
 specifically write to the pool even when all you are doing is reading, in
 an attempt to correct any data found to have an incorrect checksum.
 So say you have corrupted memory: you read from the disk, ZFS believes
 the data is faulty (after all, the checksum will be incorrect due to
 faulty RAM) and starts to rewrite the data. That is one scenario where
 ZFS will corrupt an otherwise healthy pool until it's too late and all
 your data is gone.
 As such, ZFS is indeed more sensitive to bad RAM than other filesystems.


So, you're agreeing with cyberjock's conclusion, just not the path he took
to get there.



 Having said that: find me *ONE* official source other than the FreeNAS
 forum stating that ECC is a minimal requirement (and no, a wiki
 written by cyberjock doesn't count). Solaris never said so, FreeBSD
 didn't either, nor Sun.


So if a problem isn't documented, it's not a problem?

Most Sun/Solaris documentation isn't going to mention the need for ECC
memory because all Sun systems shipped with ECC memory.
FreeBSD/PC-BSD/FreeNAS/NAS4Free/Linux in turn derive from worlds where ECC
memory is effectively nonexistent so their lack of documentation may stem
from a combination of the ZFS folks just assuming that you have it and the
distro people not realizing that you need it. FreeNAS's guide does state
pretty strongly that you should use ECC memory. But if you insist, from
Oracle Solaris 11.1 Administration: ZFS File Systems: "Consider using ECC
memory to protect against memory corruption. Silent memory corruption can
potentially damage your data." [1]

It seems to me that if using ZFS without ECC memory puts someone's data at
an increased risk over other file systems, then they ought to be told that so
that they can make an informed decision. Am I really being unreasonable
about this?



 Bad RAM however has nothing to do with the occasional bit flip that
 would be prevented using ECC RAM. The probability of a bit flip is
 low, very low.


You and Jason have both claimed this. This is at odds with papers and
studies I've seen mentioned elsewhere. Here's what a little searching found:

Soft Error: https://en.wikipedia.org/wiki/Soft_error
Which says that there are numerous sources of soft errors in memory and
other circuits besides cosmic rays.

ECC Memory: https://en.wikipedia.org/wiki/ECC_memory
States that design has dealt with the problem of increased circuit density.
It then mentions the research IBM did years ago and Google's 2009 report
which says:

The actual error rate found was several orders of magnitude higher than
previous small-scale or laboratory studies, with 25,000 to 70,000 errors
per billion device hours per mega*bit* (about 2.5-7 × 10^-11 error/bit·h)
(i.e. about 5 single bit errors in 8 Gigabytes of RAM per hour using the
top-end error rate), and more than 8% of DIMM memory modules affected by
errors per year.
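
As a rough sanity check of that last figure, the arithmetic works out as
quoted; a sketch (8 GiB expressed in bits, times the low and high ends of the
quoted per-bit-hour rate):

    # errors per hour expected in 8 GiB at 2.5e-11 and 7e-11 errors/bit·h
    awk 'BEGIN { bits = 8 * 1024^3 * 8; print bits * 2.5e-11, bits * 7e-11 }'
    # prints roughly 1.7 and 4.8, i.e. about 5 per hour at the top end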


So, since you've agreed that ZFS is more vulnerable than other file systems
to memory errors, and Google says that these errors are a lot more frequent
than most people think they are, the question becomes: just how much more
vulnerable is ZFS, and is the extent of the corruption likely to be wider or
more catastrophic than on other file systems?


 Phil

[1] Oracle Solaris 11.1 Administration: ZFS File Systems:
http://docs.oracle.com/cd/E26502_01/html/E29007/zfspools-4.html



Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-03-02 Thread Philip Robar
On Sat, Mar 1, 2014 at 5:07 PM, Jason Belec jasonbe...@belecmartin.com wrote:

 Technically, what you qualify below is a truism under any hardware. ZFS is
 neither more nor less susceptible to RAM failure, as it has nothing to do
 with ZFS. Anything that gets written to the pool technically is sound. You
 have chosen a single possible point of failure; what of firmware, drive
 cache, motherboard, power surges, motion, etc.?


I'm sorry, but I'm not following your logic here. Are you saying that ZFS
doesn't use RAM so it can't be affected by it? ZFS likes lots of memory and
uses it aggressively. So my understanding is that large amounts of data are
more likely to be in memory with ZFS than with other file systems. If
Google's research is to believed then random memory errors are a lot more
frequent than you think that they are. As I understand it, ZFS does not
checksum data while it's in memory. (While there is a debug flag to turn this
on, I'm betting that the performance hit is pretty big.) So how does RAM
failure or random bit flips have nothing to do with ZFS?



 RAM/ECC RAM is like consumer drives vs pro drives in your system; recent
 long-term studies have shown you don't get much more for the extra money.


Do you have references to these studies? This directly conflicts with what
I've seen posted, with references, in other forums on the frequency of soft
memory errors, particularly on systems that run 24x7, and how ECC memory is
able to correct these random errors.



 I have been running ZFS in production using the past and current versions
 for OS X on over 60 systems (12 are servers) since Apple kicked ZFS loose.
 No systems (3 run ECC) have had data corruption or data loss.


That you know of.


 Some pools have disappeared on the older ZFS but were easily recovered on
 modern (current development) and past OpenSolaris, FreeBSD, etc., as I keep
 clones of 'corrupted' pools for such tests. Almost always, these were the
 result of connector/cable failure. In that span of time no RAM has failed
 'utterly' and all data and tests have shown quality storage. In that time
 11 drives have failed and easily been replaced, 4 of those were OS drives,
 data stored under ZFS and a regular clone of the OS also stored under ZFS
 just in case. All pools are backed-up/replicated off site. Probably a lot
 more than most are doing for data integrity.

 No, this data I'm providing is not a guarantee. It's just data from someone
 who has grown to trust ZFS in the real world for clients that cannot lose
 data for the most part due to legal regulations. I trust RAM manufacturers
 and drive manufacturers equally, I just verify for peace of mind with ZFS.


I have an opinion of people who run servers with legal or critical
business data on them without ECC memory, but I'll keep it to myself.

Phil



Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-03-01 Thread Jason Belec
Technically, what you qualify below is a truism under any hardware. ZFS is 
neither more nor less susceptible to RAM failure, as it has nothing to do with 
ZFS. Anything that gets written to the pool technically is sound. You have 
chosen a single possible point of failure; what of firmware, drive cache, 
motherboard, power surges, motion, etc.?

RAM/ECC RAM is like consumer drives vs pro drives in your system; recent 
long-term studies have shown you don't get much more for the extra money.

I have been running ZFS in production using the past and current versions for 
OS X on over 60 systems (12 are servers) since Apple kicked ZFS loose. No 
systems (3 run ECC) have had data corruption or data loss. Some pools have 
disappeared on the older ZFS but were easily recovered on modern (current 
development) and past OpenSolaris, FreeBSD, etc., as I keep clones of 
'corrupted' pools for such tests. Almost always, these were the result of 
connector/cable failure. In that span of time no RAM has failed 'utterly' and 
all data and tests have shown quality storage. In that time 11 drives have 
failed and easily been replaced, 4 of those were OS drives, data stored under 
ZFS and a regular clone of the OS also stored under ZFS just in case. All pools 
are backed-up/replicated off site. Probably a lot more than most are doing for 
data integrity.
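
For what it's worth, off-site replication of the kind described here is
typically a snapshot-plus-send pipeline; a minimal sketch with placeholder
pool, dataset, snapshot and host names:

    # take a snapshot and stream it to a pool on the off-site machine
    zfs snapshot tank/clients@2014-03-01
    zfs send tank/clients@2014-03-01 | ssh backuphost zfs receive -F backup/clients

    # later runs send only the changes since the previous snapshot
    zfs send -i tank/clients@2014-03-01 tank/clients@2014-03-08 | \
      ssh backuphost zfs receive backup/clients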

No, this data I'm providing is not a guarantee. It's just data from someone who 
has grown to trust ZFS in the real world for clients that cannot lose data for 
the most part due to legal regulations. I trust RAM manufacturers and drive 
manufacturers equally, I just verify for peace of mind with ZFS. 

--
Jason Belec
Sent from my iPad

 On Mar 1, 2014, at 5:39 PM, Philip Robar philip.ro...@gmail.com wrote:
 
 On Fri, Feb 28, 2014 at 2:36 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 
 We might buy this argument if, in fact, no other program had the same
 vulnerabilities. But *all* of them do -- including OSX. So it is disingenuous
 to claim this as a ZFS deficiency.
 
 No, it's disingenuous of you to ignore the fact that I carefully qualified 
 what I said. To repeat, it's claimed with a detailed example and reasoned 
 argument that ZFS is MORE vulnerable to corruption due to memory errors when 
 using non-ECC memory and that that corruption is MORE likely to be extensive 
 or catastrophic than with other file systems.
 
 As I said, Jason's and Daniel Becker's responses are reassuring, but I'd 
 really like a definitive answer to this so I've reached out to one of the 
 lead Open ZFS developers. Hopefully, I'll hear back from him.
 
 Phil
 
 



Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-03-01 Thread Philip Robar
On Sun, Mar 2, 2014 at 12:46 AM, Jean-Yves Avenard jyaven...@gmail.com wrote:


 Back to the OP, I'm not sure why he felt he had to mention being
 part of SunOS. ZFS was never part of SunOS.


I didn't say I was part of SunOS (later renamed to Solaris 1). SunOS was
dead and buried years before I joined the network side of OS/Net. OS in
this case just means operating system; it's not a reference to the OS in
SunOS.

By mentioning that I worked in the part of Sun that invented ZFS and saying
that I am a fan of it, I was just trying to be clear that I was not
attacking ZFS by questioning some aspect of it. Clearly, at least in some
minds I failed at that.

Phil



Re: [zfs-macos] ZFS w/o ECC RAM - Total loss of data

2014-02-27 Thread Richard Elling

On Feb 26, 2014, at 10:51 PM, Daniel Becker razzf...@gmail.com wrote:

 Incidentally, that paper came up in a ZFS-related thread on Ars Technica just 
 the other day (as did the link to the FreeNAS forum post). Let me just quote 
 what I said there:
 
 The conclusion of the paper is that ZFS does not protect against in-memory 
 corruption, and thus can't provide end-to-end integrity in the presence of 
 memory errors. I am not arguing against that at all; obviously you'll want 
 ECC on your ZFS-based server if you value data integrity -- just as you 
 would if you were using any other file system. That doesn't really have 
 anything to do with the claim that ZFS specifically makes lack of ECC more 
 likely to cause total data loss, though.
 
 The sections you quote below basically say that while ZFS offers good 
 protection against on-disk corruption, it does *not* effectively protect you 
 against memory errors. Or, put another way, the authors are basically finding 
 that despite all the FS-level checksumming, ZFS does not render ECC memory 
 unnecessary (as one might perhaps naively expect). No claim is being made 
 that memory errors affect ZFS more than other filesystems.

Yes. Just like anything else, end-to-end data integrity is needed. So until 
people write apps that self-check everything, there is a possibility that
something you trust [1] can fail. As it happens, only the PC market demands
no ECC. TANSTAAFL.

[1] http://en.wikipedia.org/wiki/Pentium_FDIV_bug
 -- richard
