Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:
Matthew Ahrens wrote: [...] Given the overwhelming criticism of this feature, I'm going to shelve it for now.

I'd really like to see this feature. You say ZFS should change our view of filesystems; I say be consistent. In the ZFS world we create one big pool out of all our disks and create filesystems on top of it. That way we don't have to care about resizing them, etc. But that way we also define redundancy at the pool level for all our filesystems. It is quite common to have data we don't really care about as well as data we care about a lot in the same pool. Before ZFS, I'd just create RAID0 for the former and RAID1 for the latter, but that is not the ZFS way, right?

My question is: how can I express my intent of defining the redundancy level based on the importance of my data, while still following the ZFS way, without the 'copies' feature? Please reconsider your choice.

--
Pawel Jakub Dawidek                 http://www.wheel.pl
[EMAIL PROTECTED]                   http://www.FreeBSD.org
FreeBSD committer                   Am I Evil? Yes, I Am!
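[A minimal sketch of the per-filesystem intent being asked for here, using the syntax from Matt's proposal; the pool and dataset names are hypothetical and the 'copies' property exists only in the proposal:

  # one big pool out of all the disks, as usual
  zpool create tank c0t0d0 c0t1d0 c0t2d0 c0t3d0

  # data we don't really care about: the default single copy
  zfs create tank/scratch

  # data we care about a lot: extra in-pool redundancy
  zfs create -o copies=2 tank/projects
]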
Re: [zfs-discuss] Proposal: multiple copies of user data
Just a me too mail:

On 13 Sep 2006, at 08:30, Richard Elling wrote:
Is this use of 'slightly' based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.
The field data I have says that complete disk failures are the exception.

It's the same here. In our 100-laptop population in the last 2 years, we had 2 dead drives and 10 or so with I/O errors.

BTW, this feature will be very welcome on my laptop! I can't wait :-)

I, too, would love having two copies of my important data on my laptop drive. Laptop drives are small enough as they are; there's no point in storing the OS, tmp and swap files twice as well. So if ditto-data blocks aren't hard to implement, they would be welcome. Otherwise there's still the mirror-split-your-drive approach.

Wout.
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/19/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:
[pardon the digression]
David Dyer-Bennet wrote:
On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:
Interestingly, the operation may succeed and yet we will get an error which recommends replacing the drive. For example, if the failure prediction threshold is exceeded. You might also want to replace the drive when there are no spare defect sectors available. Life would be easier if they really did simply die.
For one thing, people wouldn't be interested in doing ditto-block data! So, with ditto-block data, you survive any single-block failure, and most double-block failures, etc. What it doesn't lend itself to is simple computation of simple answers :-). In theory, and with an infinite budget, I'd approach this analogously to CPU architecture design based on large volumes of instruction trace data. If I had a large volume of disk operation traces with the hardware failures indicated, I could run this against the ZFS simulator and see what strategies produced the most robust single-disk results.
There is a significant difference. The functionality of a logic part is deterministic and discrete. The wear-out rate of a mechanical device is continuous and probabilistic. In the middle are discrete events with probabilities associated with them, but they are handled separately. In other words, we can use probability and statistics tools to analyze data loss in disk drives. This will be much faster and less expensive than running a bunch of traces. In fact, there has already been much written about disk drives, their failure modes, and the factors which contribute to their failure rates. We use such data to predict the probability of events such as non-recoverable reads (which is often specified in the data sheet).

Oh, I know there's a difference. It's not as big as it looks, though, if you remember that the instruction or disk operation traces are just *representative* of the workload, not the actual workload that has to run. So, yes, disk failures are certainly non-deterministic, but the actual instruction stream run by customers isn't the same one designed against, either. In both cases the design has to take the trace as a general guideline for the types of things that will happen, rather than as a strict workload to optimize for.

--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/
Re: [zfs-discuss] Proposal: multiple copies of user data
Richard Elling - PAE wrote:
This question was asked many times in this thread. IMHO, it is the single biggest reason we should implement ditto blocks for data. We did a study of disk failures in an enterprise RAID array a few years ago. One failure mode stands head and shoulders above the others: non-recoverable reads. A short summary:

  2,919          total errors reported
  1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
    961 (32.9%)  unrecovered errors (of all types)
     32  (1.1%)  other (eg. device not ready)
    707 (24.2%)  non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-recoverable failures that occur, including complete drive failures.

Does this take cascading failures into account? How often do you get an unrecoverable read and yet are still able to perform operations on the target media? That's where ditto blocks could come in handy, modulo the concerns around utilities and quotas.
Re: [zfs-discuss] Proposal: multiple copies of user data
reply below...

Torrey McMahon wrote:
Richard Elling - PAE wrote:
This question was asked many times in this thread. IMHO, it is the single biggest reason we should implement ditto blocks for data. We did a study of disk failures in an enterprise RAID array a few years ago. One failure mode stands head and shoulders above the others: non-recoverable reads. A short summary:

  2,919          total errors reported
  1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
    961 (32.9%)  unrecovered errors (of all types)
     32  (1.1%)  other (eg. device not ready)
    707 (24.2%)  non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-recoverable failures that occur, including complete drive failures.
Does this take cascading failures into account? How often do you get an unrecoverable read and yet are still able to perform operations on the target media? That's where ditto blocks could come in handy, modulo the concerns around utilities and quotas.

No event analysis has been done here; though we do have the data, the task is time consuming. Non-recoverable reads may not represent permanent failures. In the case of a RAID array, the data should be reconstructed and a rewrite + verify attempted, with the possibility of sparing the sector. ZFS can reconstruct the data and relocate the block.

I have some (voluminous) data on disk error rates as reported through kstat. I plan to attempt to get a better sense of the failure rates from that data. The disk vendors specify non-recoverable read error rates, but we think they are overly pessimistic for the first few years of life. We'd like to have a better sense of how to model this, for a variety of applications which are concerned with archival periods.
-- richard
Re: [zfs-discuss] Proposal: multiple copies of user data
Richard Elling - PAE wrote:
Non-recoverable reads may not represent permanent failures. In the case of a RAID array, the data should be reconstructed and a rewrite + verify attempted, with the possibility of sparing the sector. ZFS can reconstruct the data and relocate the block.

True, but if you're using a HW RAID array or some sort of protection within a zpool then you're already protected to a large degree. I'm looking for the number of cases where you get a permanent unrecoverable read error and yet can recover because you've got a ditto block someplace.
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: Richard Elling - PAE wrote: Non-recoverable reads may not represent permanent failures. In the case of a RAID array, the data should be reconstructed and a rewrite + verify attempted with the possibility of sparing the sector. ZFS can reconstruct the data and relocate the block. True but if you're using a HW raid array or some sort of protection within a zpool then you're already protected to a large degree. I'm looking for the amount of cases where you get a permanent unrecoverable read error and yet can recover because you've got a ditto block someplace. Agree. Non-recoverable reads are largely a JBOD problem. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
[apologies for being away from my data last week]

David Dyer-Bennet wrote:
The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don't have a good statistical view of disk failures.

This question was asked many times in this thread. IMHO, it is the single biggest reason we should implement ditto blocks for data. We did a study of disk failures in an enterprise RAID array a few years ago. One failure mode stands head and shoulders above the others: non-recoverable reads. A short summary:

  2,919          total errors reported
  1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
    961 (32.9%)  unrecovered errors (of all types)
     32  (1.1%)  other (eg. device not ready)
    707 (24.2%)  non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-recoverable failures that occur, including complete drive failures. Boo! Did that scare you? Halloween is next month! :-) Seagate said today that in a few years 3.5" disks will store 2.5 TBytes. Boo!

While I don't have data on laptop disk failures, I would not be surprised to see a similar distribution, though with a larger mechanical damage count. My laptops run hotter inside than my other systems and, as a rule of thumb, your disk failure rate increases by 2x for every 15C increase in temperature. Is your laptop disk hot?

The case for ditto data is clear to me. Many people are using single-disk systems, and many more people would really like to use single-disk systems but they really can't. Beyond spinning-rust systems, there are other forms of non-volatile storage which would apply here. For example, those people who suggested that you should back up your presentation to a CD fail to note that a speck of dust on the CD could lead you to lose one block of data. In my CD/DVD experience, such losses are blissfully ignored by the system, and you may blame the resulting crash on the cheap hardware you bought from your brother-in-law. Beyond CDs, I can see this as being a nice enhancement to limited-endurance devices such as flash.

While it is true that I could slice my disk up into multiple vdevs and mirror them, I'd much rather set a policy at a finer granularity: my files are more important than most of the other, mostly read-only and easily reconstructed, files on my system.

When ditto blocks for metadata were introduced, I took a look at the code and was pleasantly surprised. The code does an admirable job of ensuring spatial diversity in the face of multiple policies, even in the single-disk case. IMHO, this is the right way to implement this, and it allows you to mix policies with ease. As a RAS guy, I'm biased toward not wanting to lose data via easy-to-use interfaces. I don't see any downside to this feature, but lots of upside.
-- richard
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote: [appologies for being away from my data last week] David Dyer-Bennet wrote: The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial(small) disk corruptions more common than I think? I don't have a good statistical view of disk failures. This question was asked many times in this thread. IMHO, it is the single biggest reason we should implement ditto blocks for data. We did a study of disk failures in an enterprise RAID array a few years ago. One failure mode stands heads and shoulders above the others: non-recoverable reads. A short summary: 2,919 total errors reported 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) 961 (32.9%) unrecovered errors (of all types) 32 (1.1%) other (eg. device not ready) 707 (24.2%) non-recoverable reads In other words, non-recoverable reads represent 73.6% of the non- recoverable failures that occur, including complete drive failures. I don't see anything addressing complete drive failures vs. block failures here anywhere. Is there some way to read something about that out of this data? I'm thinking the operations succeeded also occurs read errors recovered by retries and such, as well as the write failure cited as an example? I guess I can conclude that the 66% for errors successfully recovered means that a lot of errors are not, in fact, entire-drive failures. So that's good (for ditto-data). So a maximum of 34% are whole-drive failures (and in reality I'm sure far lower). Anyway, facts on actual failures in the real world are *definitely* the useful way to conduct this discussion! [snip] While it is true that I could slice my disk up into multiple vdevs and mirror them, I'd much rather set a policy at a finer grainularity: my files are more important than most of the other, mostly read-only and easily reconstructed, files on my system. I definitely like the idea of setting policy at a finer granularity; I really want it to be at the file level, even per-directory doesn't fit reality very well in my view. When ditto blocks for metadata was introduced, I took a look at the code and was pleasantly suprised. The code does an admirable job of ensuring spatial diversity in the face of multiple policies, even in the single disk case. IMHO, this is the right way to implement this and allows you to mix policies with ease. That's very good to hear. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
more below... David Dyer-Bennet wrote: On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote: [appologies for being away from my data last week] David Dyer-Bennet wrote: The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial(small) disk corruptions more common than I think? I don't have a good statistical view of disk failures. This question was asked many times in this thread. IMHO, it is the single biggest reason we should implement ditto blocks for data. We did a study of disk failures in an enterprise RAID array a few years ago. One failure mode stands heads and shoulders above the others: non-recoverable reads. A short summary: 2,919 total errors reported 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) 961 (32.9%) unrecovered errors (of all types) 32 (1.1%) other (eg. device not ready) 707 (24.2%) non-recoverable reads In other words, non-recoverable reads represent 73.6% of the non- recoverable failures that occur, including complete drive failures. I don't see anything addressing complete drive failures vs. block failures here anywhere. Is there some way to read something about that out of this data? Complete failures are a non-zero category, but there is more than one error code which would result in the recommendation to replace the drive. Their counts are included in the 961-707=254 (26.4%) of other non- recoverable errors. In some cases a non-recoverable error can be corrected by a retry, and those also fall into the 26.4% bucket. Interestingly, the operation may succeed and yet we will get an error which recommends replacing the drive. For example, if the failure prediction threshold is exceeded. You might also want to replace the drive when there are no spare defect sectors available. Life would be easier if they really did simply die. I'm thinking the operations succeeded also occurs read errors recovered by retries and such, as well as the write failure cited as an example? Yes. I guess I can conclude that the 66% for errors successfully recovered means that a lot of errors are not, in fact, entire-drive failures. So that's good (for ditto-data). So a maximum of 34% are whole-drive failures (and in reality I'm sure far lower). I agree. Anyway, facts on actual failures in the real world are *definitely* the useful way to conduct this discussion! [snip] While it is true that I could slice my disk up into multiple vdevs and mirror them, I'd much rather set a policy at a finer grainularity: my files are more important than most of the other, mostly read-only and easily reconstructed, files on my system. I definitely like the idea of setting policy at a finer granularity; I really want it to be at the file level, even per-directory doesn't fit reality very well in my view. When ditto blocks for metadata was introduced, I took a look at the code and was pleasantly suprised. The code does an admirable job of ensuring spatial diversity in the face of multiple policies, even in the single disk case. IMHO, this is the right way to implement this and allows you to mix policies with ease. That's very good to hear. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote:
Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.

+1, especially in a two-disk (mirrored) configuration.

Currently I use two ZFS pools: one mirrored and the other unmirrored, spread over two disks (each disk partitioned with SVM). And I'm constantly fighting the filling-up of one pool while the other is empty. My current setup has the same space-balance problem as a traditional two-*static*-partition setup.

--
Jesus Cea Avion                       [EMAIL PROTECTED]
http://www.argo.es/~jcea/             jabber / xmpp:[EMAIL PROTECTED]
Re: [zfs-discuss] Proposal: multiple copies of user data
Neil A. Wilson wrote:
This is unfortunate. As a laptop user with only a single drive, I was looking forward to it since I've been bitten in the past by data loss caused by a bad area on the disk. I don't care about the space consumption because I generally don't come anywhere close to filling up the available space. It may not be the primary market for ZFS, but it could be a very useful side benefit.

I feel your pain. Although your hard drive will suffer from the extra seeks, I would suggest that you partition your HD in two and create a two-way ZFS mirror between the partitions. If space is an issue, you can use N partitions to build a raid-z, but your performance will suffer a lot because any data read would require N seeks.

--
Jesus Cea Avion                       [EMAIL PROTECTED]
http://www.argo.es/~jcea/             jabber / xmpp:[EMAIL PROTECTED]
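[A sketch of that mirror-split-your-drive workaround on a single laptop disk, assuming two slices (the s3/s4 names are placeholders) have already been carved out of the drive:

  # two slices of the same physical disk form a two-way mirror
  zpool create tank mirror c0t0d0s3 c0t0d0s4

  # ... or, trading performance for capacity, a raid-z across N slices
  zpool create tank raidz c0t0d0s3 c0t0d0s4 c0t0d0s5
]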
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: Matthew Ahrens wrote: The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data then an other? If that's the case I'm still confused as to what failure cases would still allow you to retrieve your data if there are more then one copy in the fs or pool.but I'll gladly take some enlightenment. :) (My apologies for the length of this response, I'll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure). One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.) Another plausible use case would be where you have some level of redundancy, say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail. I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified. 
The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-)

The real problem with this proposal is the confusion surrounding disk space accounting with copies > 1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time.

--matt
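[A sketch of the two use cases Matt describes, assuming the proposed 'copies' property and hypothetical device and dataset names:

  # use case 1: a four-disk pool with no pool-level redundancy;
  # only the important filesystem gets an extra in-pool copy
  zpool create tank c0t0d0 c0t1d0 c0t2d0 c0t3d0
  zfs create -o copies=2 tank/important

  # use case 2: single-parity raid-z groups (Thumper-style, shortened here);
  # the extremely important data gets copies=2 on top of the raid-z parity
  zpool create bigtank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
                       raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
  zfs create -o copies=2 bigtank/critical
]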
Re: [zfs-discuss] Proposal: multiple copies of user data
David Dyer-Bennet wrote:
On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote:
So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies.
Yes, you could do that. *I* would make a copy on a CD, which I would carry in a separate case from the laptop.

Do you back up the presentation to CD every time you make an edit? I think my presentation is a lot safer than your presentation. I'm sure both of our presentations would be equally safe, as we would know not to have the only copy(ies) on our person.

Similarly for your digital images example; I don't consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn't come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won't be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.) Well of course you would have a separate, independent copy if it really mattered. The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don't have a good statistical view of disk failures.

Well let's see - my friend accompanied me on a trip and saved her photos daily onto her laptop. Near the end of the trip her hard drive started having problems. The hard drive was not dead, as it was bootable and you could access certain data. Upon returning home she was able to retrieve some of her photos but not all. She would have been much happier having ZFS + copies. And yes, you could back up to CD/DVD every night, but it's a pain and people don't do it (as much as they should). Side note: it would have cost hundreds of dollars for data recovery to have just the *possibility* of getting the other photos.

eric
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote:
eric kustarz wrote:
Matthew Ahrens wrote:
Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems.
Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now.
So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies.
Under what failure modes would your data still be accessible? What things can go wrong that still allow you to access the data because some event has removed one copy but left the others?

Silent data corruption of one of the copies.
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/13/06, Richard Elling [EMAIL PROTECTED] wrote:
* Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.
Is this use of 'slightly' based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.
The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I'll expand my comments later.

BTW, this feature will be very welcome on my laptop! I can't wait :-)

On servers and stationary desktops, I just don't care whether it is a whole-disk failure or a few bad blocks. In that case I have the resources to mirror, RAID5, perform daily backups, etc. The laptop disk failures that I have seen have typically been limited to a few bad blocks. As Torrey McMahon mentioned, they tend to start out with some warning signs followed by a full failure. I would *really* like to have that window between warning signs and full failure as my opportunity to back up my data and replace my non-redundant hard drive with no data loss.

The only part of the proposal I don't like is the space accounting. Double or triple charging for data will only confuse those apps and users that check for free space or block usage. If this is worked out, it would be a great feature for those times when mirroring just isn't an option.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/13/06, Mike Gerdts [EMAIL PROTECTED] wrote: The only part of the proposal I don't like is space accounting. Double or triple charging for data will only confuse those apps and users that check for free space or block usage. Why exactly isn't reporting the free space divided by the copies value on that particular file system an easy solution for this? Did I miss something? Tobias ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
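[To illustrate Tobias's suggestion with made-up numbers (this is not how ZFS reports space today, just a sketch of the idea):

  # pool has 100 GB of raw space free
  # a filesystem with copies=1 would report:  100 GB available
  # a filesystem with copies=2 would report:  100 / 2 = 50 GB available
  # a filesystem with copies=3 would report:  100 / 3 = ~33 GB available
]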
Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, 12 Sep 2006, Matthew Ahrens wrote:
Torrey McMahon wrote:
Matthew Ahrens wrote:
The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.)
Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that's the case I'm still confused as to what failure cases would still allow you to retrieve your data if there are more than one copy in the fs or pool ... but I'll gladly take some enlightenment. :)
(My apologies for the length of this response, I'll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure).
... lots of Good Stuff elided

Soon Samsung will release a 100% flash-memory-based drive (32 GB) in a laptop form factor. But flash memory chips have a limited number of write cycles available, and when that limit is exceeded, the result is usually data corruption. Some people have already encountered this issue with USB thumb drives. It's especially annoying if you were using the thumb drive as what you thought was a 100% _reliable_ backup mechanism. This is a perfect application for ZFS copies=2. Also, consider that there is no time penalty for positioning the heads on a flash drive.

So now you would have 2 options in a laptop-type application with a single flash-based drive:

a) create a mirrored pool using 2 slices - expensive in terms of storage utilization
b) create a pool with no redundancy, then create a filesystem called importantPresentationData within that pool with copies=2 (or more)

Matthew - build it and they will come!

Regards,
Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
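[Hedged sketches of the two options Al describes, with hypothetical device and dataset names ('copies' being the proposed property):

  # option a: mirror two slices of the single flash drive (half the capacity)
  zpool create tank mirror c0t0d0s3 c0t0d0s4

  # option b: one unreplicated pool, extra copies only where it matters
  zpool create tank c0t0d0
  zfs create -o copies=2 tank/importantPresentationData
]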
Re: [zfs-discuss] Proposal: multiple copies of user data
On Wed, 2006-09-13 at 02:30, Richard Elling wrote:
The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I'll expand my comments later.

That matches my anecdotal experience with laptop drives; maybe I'm just lucky, or maybe I'm just paying more attention than most to the sounds they start to make when they're having a bad hair day, but so far they've always given *me* significant advance warning of impending doom, generally by failing to read a bunch of disk sectors.

That said, I think the best use case for the copies > 1 config would be in systems with exactly two disks -- which covers most of the 1U boxes out there.

One question for Matt: when ditto blocks are used with raidz1, how well does this handle the case where you encounter one or more single-sector read errors on other drive(s) while reconstructing a failed drive? For a concrete example:

  A0 B0 C0 D0 P0
  A1 B1 C1 D1 P1

  (A0==A1, B0==B1, ...; A^B^C^D==P)

Does the current implementation of raidz + ditto blocks cope with the case where all of A, C0, and D1 are unavailable?

- Bill
Re: [zfs-discuss] Proposal: multiple copies of user data
eric kustarz wrote:
I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies beyond what Matt is currently proposing - per-file copies, but it's more work (one thing being we don't have administrative control over files per se).

Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you could select files or even directories that have more than one copy in case of a problem down the road.
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but its more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more then one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
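[A sketch of Bart's suggestion, assuming the proposed 'copies' property and a hypothetical dataset layout; since the property only affects newly-written data, setting it right after installation covers everything created from then on:

  # the OS bits are replaceable; protect only the irreplaceable personal data
  zfs set copies=2 tank/home
  zfs get copies tank/home
]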
Re: [zfs-discuss] Proposal: multiple copies of user data
Bart Smaalders wrote: Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but its more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more then one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. We've been talking about user data but the chance of corrupting something on disk and then detecting a bad checksum on something in /kernel is also possible. (Disk drives do weird things from time to time.) If I was sufficiently paranoid I would want everything required to get into single-user mode, some other stuff, and then my user data, duplicated to avoid any issues. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote:
On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
B. DESCRIPTION A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs').
Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.

While encryption of existing data is not in scope for the first ZFS crypto phase, I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on, then the admin might think the data is safe; but if it says it is off, that isn't the truth either, because some of it may be encrypted.

--
Darren J Moffat
Re: [zfs-discuss] Proposal: multiple copies of user data
On 12/09/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated!

Flexibility is always nice, but this seems to greatly complicate things, both technically and conceptually (sometimes, good design is about what is left out :) ). Seems to me this lets you say 'files in this directory are x times more valuable than files elsewhere'. Others have covered some of my concerns (guarantees, cleanup, etc.). In addition:

* if I move a file somewhere else, does it become less important?
* zpools let you do that already (admittedly with less granularity, but *much* *much* more simply - and disk is cheap in my world)
* I don't need to do that :)

The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk?

* disks are cheap, and creating a mirror from a single disk is very easy (and conceptually simple)
* *removing* a disk from a mirror pair is simple too - I make mistakes sometimes
* in my experience, disks fail. When you get bad errors on part of a disk, the disk is about to die.
* you can already create a/several zpools using disk partitions as vdevs. That's not all that safe, and I don't see this being any safer.

Sorry to be negative, but to me ZFS' simplicity is one of its major features. I think this provides a cool feature, but I question its usefulness. Quite possibly I just don't have the particular itch this is intended to scratch - is this a much-requested feature?

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
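[For reference, the single-disk-to-mirror (and back) dance mentioned in the bullets above, with hypothetical device names; zpool attach and zpool detach are existing commands:

  # grow a single-disk pool into a two-way mirror when a second disk shows up
  zpool attach tank c0t0d0 c1t0d0

  # ... and drop back to a single disk if you change your mind
  zpool detach tank c1t0d0
]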
Re: [zfs-discuss] Proposal: multiple copies of user data
On 12/09/06, Darren J Moffat [EMAIL PROTECTED] wrote: Dick Davies wrote: The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk? Some systems have physical space for only a single drive - think most laptops! True - I'm a laptop user myself. But as I said, I'd assume the whole disk would fail (it does in my experience). If your hardware craps differently to mine, you could do a similar thing with partitions (or even files) as vdevs. Wouldn't be any less reliable. I'm still not Feeling the Magic on this one :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
This proposal would benefit greatly from a problem statement. As it stands, it feels like a solution looking for a problem. The Introduction mentions a different problem and solution, but then pretends that there is value to this solution. The Description section mentions some benefits of 'copies' relative to the existing situation, but requires that the reader piece together the whole picture. And IMO there aren't enough pieces :-), i.e. so far I haven't seen sufficient justification for the added administrative complexity and potential for confusion, both administrative and user.

Matthew Ahrens wrote:
Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated! --matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by storing up to three DVAs (Disk Virtual Addresses) in each block pointer. This feature is known as Ditto Blocks. When possible, the copies are stored on different disks. See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto blocks)" for details on ditto blocks. This case will extend this feature to allow system administrators to store multiple copies of user data as well, on a per-filesystem basis. These copies are in addition to any redundancy provided at the pool level (mirroring, raid-z, etc).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs'). The pool must be at least on-disk version 2 to use this feature (see 'zfs upgrade').

By default (copies=1), only two copies of most filesystem metadata are stored. However, if we are storing multiple copies of user data, then 3 copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several important ways:

* Different filesystems in the same pool can have different numbers of copies.
* The storage configuration is not constrained as it is with mirroring (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.

It is important to note that the copies provided by this feature are in addition to any redundancy provided by the pool configuration or the underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default) will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any 1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3 will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks failing without data loss (assuming that there are at least ncopies=3 mirror groups).
* In a pool with single-parity raid-z, a filesystem with copies=2 will be stored with 2 copies, each copy protected by its own parity block. The filesystem can tolerate any 3 disks failing without data loss (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
       they are inherited.

+      copies=1 | 2 | 3
+        Controls the number of copies of data stored for this dataset.
+        These copies are in addition to any redundancy provided by the
+        pool (eg. mirroring or raid-z). The copies will be stored on
+        different disks if possible.
+
+        Changing this property only affects newly-written data.
+        Therefore, it is recommended that this property be set at
+        filesystem creation time, using the '-o copies=' option.
+
       Temporary Mountpoint Properties

       When a file system is mounted, either through mount(1M) for
       legacy mounts or the zfs mount command for normal file

D. REFERENCES

--
-- Jeff VICTOR          Sun Microsystems           jeff.victor @ sun.com
   OS Ambassador        Sr. Technical Specialist
   Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated!

I've read the proposal, and followed the discussion so far. I have to say that I don't see any particular need for this feature. Possibly there is a need for a different feature, in which the entire control of redundancy is moved away from the pool level and to the file or filesystem level. I definitely see the attraction of being able to specify, by file and directory, the different degrees of reliability needed.

However, the details of the feature actually proposed don't seem to satisfy the need for extra reliability at the level that drives people to employ redundancy; it doesn't provide a guarantee. I see no need for additional non-guaranteed reliability on top of the levels of guarantee provided by use of redundancy at the pool level. Furthermore, as others have pointed out, this feature would add a high degree of user-visible complexity. From what I've seen here so far, I think this is a bad idea and should not be added.

--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/
Re: [zfs-discuss] Proposal: multiple copies of user data
Darren J Moffat wrote: While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability, if not all of the data is encrypted yet what should the encryption property say ? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either because some of it maybe in encrypted. I would also think that there's a significant problem around what to do about the previously unencrypted data. I assume that when performing a scrub to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there's a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the scrub would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Neil A. Wilson wrote: Darren J Moffat wrote: While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability, if not all of the data is encrypted yet what should the encryption property say ? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either because some of it maybe in encrypted. I would also think that there's a significant problem around what to do about the previously unencrypted data. I assume that when performing a scrub to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there's a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the scrub would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Right, that is a very important issue. Would a ZFS scrub framework do copy on write ? As you point out if it doesn't then we still need to do something about the old clear text blocks because strings(1) over the raw disk will show them. I see the desire to have a knob that says make this encrypted now but I personally believe that it is actually better if you can make this choice at the time you create the ZFS data set. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 10:36:30AM +0100, Darren J Moffat wrote:
Mike Gerdts wrote:
Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.
While encryption of existing data is not in scope for the first ZFS crypto phase, I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may be encrypted.

I agree -- there needs to be a filesystem re-write option, something like a scrub but at the filesystem level. Things that might be accomplished through it:

- record size changes
- compression toggling / compression algorithm changes
- encryption / re-keying / algorithm changes
- checksum algorithm changes
- ditto blocking

What else?

To me it's important that such scrubs not happen simply as a result of setting/changing a filesystem property, but it's also important that the user/admin be told that changing the property requires scrubbing in order to take effect for data/meta-data written before the change.

Nico
--
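[For context, the knobs Nico lists mostly map onto existing per-dataset properties that can already be changed at any time but only apply to blocks written afterwards; a hedged sketch with invented dataset names ('copies' is still only a proposal):

  zfs set recordsize=8k    tank/db
  zfs set compression=on   tank/docs
  zfs set checksum=sha256  tank/archive
  zfs set copies=2         tank/important

A filesystem-level rewrite pass of the kind described above would be what retroactively applies these settings to previously written data.]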
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote:
Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems.

Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now.

Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.

--matt
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. I think it's a valid problem. My understanding was that this didn't give a *guaranteed* solution, though. I think most people, when committing to the point of replication (spending actual money), need a guarantee at some level (not of course of total safety; but that the data actually does exist on separate disks, and will survive the destruction of one disk). A good solution to this problem would be valuable. (And I'd accept a non-guarantee on a single disk; or rather a guarantee that said if enough blocks to find the data exist, and a copy of each data block exists, we can retrieve the data; but that guarantee *does* exist I think). Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on an mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. I was never concerned at the free space issues (though I was concerned by some of the proposed solutions to what I saw as a non-issue). I'd be happy if the free space described how many bytes of default files you could add to the pool, and the user would have to understand that results would differ if they used non-default parameters. You're probably right that fewer people would mind having *more* space than an unthinking reading would show than less. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now.

This is unfortunate. As a laptop user with only a single drive, I was looking forward to it, since I've been bitten in the past by data loss caused by a bad area on the disk. I don't care about the space consumption because I generally don't come anywhere close to filling up the available space. It may not be the primary market for ZFS, but it could be a very useful side benefit.

Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt

I don't see much need for this in any area where I would use ZFS (either my own personal use or any case in which I would recommend it for production use). However, if you think that it's OK to under-report free space, then why not just do that for the data ditto blocks: if one or more of my filesystems are configured to keep two copies of the data, then simply report only half of the available space. If duplication isn't enabled for the entire pool but only for certain filesystems, then perhaps you could even take advantage of quotas for those filesystems to make a more accurate calculation.
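[Editor's note: a back-of-the-envelope sketch of the "report only half of the available space" idea in the message above. This is purely hypothetical; neither the function nor the behaviour exists in ZFS, it simply illustrates the arithmetic being proposed.]

    # Hypothetical sketch: show free space to a copies=2 filesystem as the
    # raw free space divided by its copies setting.  Not actual ZFS behaviour.
    def apparent_free(raw_free_bytes: int, copies: int) -> int:
        return raw_free_bytes // copies

    GB = 1024 ** 3
    print(apparent_free(100 * GB, 2))   # a copies=2 filesystem would show ~50 GB free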
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now.

So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop: I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies. Doing it for the filesystem is just one step higher (and makes it administratively easier, as I don't have to type the same command for each file that's important). Mirroring is just another step above that, though it's possibly replicating stuff you just don't care about.

Now placing extra copies of the data doesn't guarantee that data will survive multiple disk failures; but neither does having a mirrored pool guarantee the data will be there either (2 disk failures). Both methods are about increasing your chances of having your valuable data around. I for one would have loved to have multiple-copy filesystems + ZFS on my powerbook when I was travelling in Australia for a month; think of all the digital pictures you take and how pissed you would be if the one with the wild wombat didn't survive. It's maybe not an enterprise solution, but it seems like a consumer solution. Ensuring that the space accounting tools make sense is definitely a valid point though.

eric

Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote: So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop: I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.

Yes, you could do that. *I* would make a copy on a CD, which I would carry in a separate case from the laptop. I think my presentation is a lot safer than your presentation. Similarly for your digital images example; I don't consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn't come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won't be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

The more I look at it, the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don't have a good statistical view of disk failures.

-- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: B. DESCRIPTION A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs').

Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.

This feature is similar to using mirroring, but differs in several important ways: * Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.

Is this use of "slightly" based upon disk failure modes? That is, when disks fail, do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.

It is important to note that the copies provided by this feature are in addition to any redundancy provided by the pool configuration or the underlying storage. For example:

All of these examples seem to assume that there are six disks.

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default) will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any 1 disk failing without data loss. * In a pool with 2-way mirrors, a filesystem with copies=3 will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks failing without data loss (assuming that there are at least ncopies=3 mirror groups).

This one assumes the best-case scenario with 6 disks. Suppose you had 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on the 72 GB disks.

* In a pool with single-parity raid-z, a filesystem with copies=2 will be stored with 2 copies, each copy protected by its own parity block. The filesystem can tolerate any 3 disks failing without data loss (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
      they are inherited.
+     copies=1 | 2 | 3
+         Controls the number of copies of data stored for this dataset.
+         These copies are in addition to any redundancy provided by the
+         pool (eg. mirroring or raid-z). The copies will be stored on
+         different disks if possible.

Any statement about physical location on the disk? It would seem as though locating two copies sequentially on the disk would not provide nearly the amount of protection as having them fairly distant from each other.

-- Mike Gerdts http://mgerdts.blogspot.com/
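[Editor's note: a minimal sketch of the redundancy arithmetic in the quoted examples. The helper functions are hypothetical, and the figures only hold in the best case the proposal itself assumes, where every copy really does land on a distinct disk.]

    # Hypothetical sketch of the arithmetic quoted above; best case only.
    def total_copies(mirror_ways: int, copies: int) -> int:
        """Physical copies of each block: pool replication times the proposed
        'copies' property (e.g. 2-way mirror * copies=3 = 6)."""
        return mirror_ways * copies

    def max_whole_disk_failures(mirror_ways: int, copies: int) -> int:
        """Whole-disk failures tolerated in the best case: one less than the
        number of physical copies, provided there are at least 'copies'
        mirror groups to spread them across."""
        return total_copies(mirror_ways, copies) - 1

    # 2-way mirrors, copies=1 (default): 2 copies, survives any 1 disk failure.
    print(total_copies(2, 1), max_whole_disk_failures(2, 1))   # 2 1
    # 2-way mirrors, copies=3: 6 copies, survives any 5 disk failures (best case).
    print(total_copies(2, 3), max_whole_disk_failures(2, 3))   # 6 5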
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote: On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: B. DESCRIPTION A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs').

Would the user be held accountable for the space used by the extra copies?

Doh! Sorry I forgot to address that. I'll amend the proposal and manpage to include this information... Yes, the space used by the extra copies will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), and zfs list, and will count against their quota.

So if a user has a 1GB quota and stores one 512MB file with two copies activated, all his space will be used?

Yes, and as mentioned this will be reflected in all the space accounting tools.

What happens if the same user stores a file that is 756MB on the filesystem with multiple copies enabled and a 1GB quota, does the save fail?

Yes, they will get ENOSPC and see that their filesystem is full.

How would the user tell that his filesystem is full, since all the tools he is used to would report that he is using only 1/2 the space?

Any tool will report that in fact all the space is being used.

Is there a way for the sysadmin to get rid of the excess copies should disk space needs require it?

No, not without rewriting them. (This is the same behavior we have today with the 'compression' and 'checksum' properties. It's a long-term goal of ours to be able to go back and change these things after the fact (scrub them in, so to say), but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.)

If I start out with 2 copies and later change it to only 1 copy, do the files created before keep their 2 copies?

Yep, the property only affects newly-written data.

What happens if root needs to store a copy of an important file and there is no space, but there is space if extra copies are reclaimed?

They will get ENOSPC.

Will this be configurable behavior?

No.

--matt
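[Editor's note: a concrete illustration of the quota accounting described in the answers above. A rough sketch only; the function name is made up, and it ignores metadata overhead and compression.]

    # Rough sketch of how the 'copies' property would charge against a quota.
    MB = 1024 * 1024

    def charged_bytes(file_bytes: int, copies: int) -> int:
        # Each copy counts against the owner's quota.
        return file_bytes * copies

    quota = 1024 * MB                              # 1 GB quota
    print(charged_bytes(512 * MB, 2) <= quota)     # True:  a 512 MB file fills the quota exactly
    print(charged_bytes(756 * MB, 2) <= quota)     # False: the 756 MB write fails with ENOSPC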
Re: [zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote: Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this.

This is a long-term goal of ours, but with snapshots, it is extremely nontrivial to do efficiently and without increasing the amount of space used.

If so, this feature should subscribe to any generic framework provided by such an effort.

Yep, absolutely.

* Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss. Is this use of "slightly" based upon disk failure modes? That is, when disks fail, do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.

I'm basing this "slightly better" call on a model of random, complete-disk failures. I know that this is only an approximation. With many mirrors, most (but not all) 2-disk failures can be tolerated. With copies=2, almost no 2-top-level-vdev failures will be tolerated, because it's likely that *some* block will have both its copies on those 2 disks. With mirrors, you can arrange to mirror across cabinets, not within them, which you can't do with copies.

It is important to note that the copies provided by this feature are in addition to any redundancy provided by the pool configuration or the underlying storage. For example: All of these examples seem to assume that there are six disks.

Not really. There could be any number of mirrors or raid-z groups (although, I note, you need at least 'copies' groups to survive the maximum number of whole-disk failures).

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default) will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any 1 disk failing without data loss. * In a pool with 2-way mirrors, a filesystem with copies=3 will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks failing without data loss (assuming that there are at least ncopies=3 mirror groups). This one assumes the best-case scenario with 6 disks. Suppose you had 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on the 72 GB disks.

Yes, all these examples assume that our putting the copies on different disks when possible actually worked out. It will almost certainly work out unless you have a small number of different-sized devices, or are running with very little free space. If you need hard guarantees, you need to use actual mirroring.

Any statement about physical location on the disk? It would seem as though locating two copies sequentially on the disk would not provide nearly the amount of protection as having them fairly distant from each other.

Yep, if the copies can't be stored on different disks, they will be stored spread out on the same disk if possible (I think we aim for one on each quarter of the disk).

--matt
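[Editor's note: the "some block will have both its copies on those 2 disks" argument above can be made concrete with a small simulation. This is only an illustrative sketch, not ZFS's actual allocator; it assumes each block's two copies are placed on two distinct, uniformly chosen top-level vdevs.]

    # Illustrative sketch: with copies=2 spread uniformly over d top-level
    # vdevs, estimate the chance that NO block loses both copies when 2
    # particular vdevs fail.
    import random

    def p_survive_two_failures(d: int, nblocks: int, trials: int = 100) -> float:
        survive = 0
        for _ in range(trials):
            # place each block's two copies on two distinct vdevs
            placements = [tuple(random.sample(range(d), 2)) for _ in range(nblocks)]
            failed = set(random.sample(range(d), 2))
            if not any(set(p) <= failed for p in placements):
                survive += 1
        return survive / trials

    # Even with only a few thousand blocks, some block almost surely has both
    # copies on the two failed vdevs, so copies=2 essentially never survives a
    # 2-vdev loss; a pool of 2-way mirrors, by contrast, survives any 2-disk
    # loss that doesn't take out both halves of the same mirror.
    print(p_survive_two_failures(d=6, nblocks=5000))   # ~0.0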
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: James Dickens wrote: On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: B. DESCRIPTION A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs'). Would the user be held accountable for the space used by the extra copies? Doh! Sorry I forgot to address that. I'll amend the proposal and manpage to include this information... Yes, the space used by the extra copies will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), and zfs list, and will count against their quota. So if a user has a 1GB quota and stores one 512MB file with two copies activated, all his space will be used? Yes, and as mentioned this will be reflected in all the space accounting tools. What happens if the same user stores a file that is 756MB on the filesystem with multiple copies enabled and a 1GB quota, does the save fail? Yes, they will get ENOSPC and see that their filesystem is full. How would the user tell that his filesystem is full, since all the tools he is used to would report that he is using only 1/2 the space? Any tool will report that in fact all the space is being used. Is there a way for the sysadmin to get rid of the excess copies should disk space needs require it? No, not without rewriting them. (This is the same behavior we have today with the 'compression' and 'checksum' properties. It's a long-term goal of ours to be able to go back and change these things after the fact (scrub them in, so to say), but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.) If I start out with 2 copies and later change it to only 1 copy, do the files created before keep their 2 copies? Yep, the property only affects newly-written data. What happens if root needs to store a copy of an important file and there is no space, but there is space if extra copies are reclaimed? They will get ENOSPC.

Though I think this is a cool feature, I think it needs more work. I think there should be an option to make extra copies expendable. So the extra copies are a request: if the space is available, make them; if not, complete the write and log the event. If the user really requires guaranteed extra copies, then use mirrored or raided disks. It seems just to be a nightmare for the administrator: you start with 3 copies and then change to 2 copies, and you will have phantom copies that are only known to exist to the OS; it won't show in any reports, and zfs list doesn't have an option to show which files have multiple clones and which don't. There is no way to destroy multiple clones without rewriting every file on the disk.

James

Will this be configurable behavior? No. --matt
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote: Though I think this is a cool feature, I think it needs more work. I think there should be an option to make extra copies expendable. So the extra copies are a request: if the space is available, make them; if not, complete the write and log the event.

Are you asking for the extra copies that have already been written to be dynamically freed up when we are running low on space? That could be useful, but it isn't the problem I'm trying to solve with the 'copies' property (not to mention it would be extremely difficult to implement).

If the user really requires guaranteed extra copies, then use mirrored or raided disks.

Right, if you want everything to have extra redundancy, that use case is handled just fine today by mirrors or RAIDZ. The case where 'copies' is useful is when you want some data to be stored with more redundancy than the rest, without the burden of setting up different pools.

It seems just to be a nightmare for the administrator: you start with 3 copies and then change to 2 copies, and you will have phantom copies that are only known to exist to the OS; it won't show in any reports, and zfs list doesn't have an option to show which files have multiple clones and which don't. There is no way to destroy multiple clones without rewriting every file on the disk.

(I'm assuming you mean copies, not clones.) So would you prefer that the property be restricted to only being set at filesystem creation time, and not changed later? That way the number of copies of all files in the filesystem is always the same. It seems like the issue of knowing how many copies there are would be much worse in the system you're asking for, where the extra copies are freed up as needed...

--matt