Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-27 Thread Pawel Jakub Dawidek
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:
 Matthew Ahrens wrote:
[...]
 Given the overwhelming criticism of this feature, I'm going to shelve it for 
 now.

I'd really like to see this feature. You say ZFS should change our view
of filesystems; I say be consistent and follow through.

In ZFS world we create one big pool out of all our disks and create
filesystems on top of it. This way we don't have to care about resizing
them, etc. But this way we define redundancy at pool level for all our
filesystems.

It is quite common that we have data we don't really care about as well
as data we do care about a lot in the same pool. Before ZFS, I'd just
create RAID0 for the former and RAID1 for the latter, but this is not
the ZFS way, right?

My question is: how can I express my intent of defining the redundancy
level based on the importance of my data, while still following the ZFS
way, without the 'copies' feature?
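
For illustration, this is roughly what the proposed property would let me
express, versus the two-pool workaround I would have to use today (the
syntax follows the proposal; device names are hypothetical):

    # Proposed: one pool, redundancy policy per filesystem
    zfs create -o copies=2 tank/precious    # data I care about a lot
    zfs create tank/scratch                 # data I don't really care about

    # Today: split the disks into two pools up front
    zpool create fast c0t0d0 c0t1d0         # striped, no redundancy
    zpool create safe mirror c0t2d0 c0t3d0  # mirrored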

Please reconsider your choice.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-20 Thread Wout Mertens

Just a me too mail:

On 13 Sep 2006, at 08:30, Richard Elling wrote:


Is this use of "slightly" based upon disk failure modes?  That is, when
disks fail do they tend to get isolated areas of badness compared to
complete loss?  I would suggest that complete loss should include
someone tripping over the power cord to the external array that
houses the disk.


The field data I have says that complete disk failures are the exception.


It's the same here: in our 100-laptop population over the last 2 years,
we had 2 dead drives and 10 or so with I/O errors.



BTW, this feature will be very welcome on my laptop!  I can't wait :-)


I, too, would love having two copies of my important data on my  
laptop drive. Laptop drives are small enough as they are; there's no
point in storing the OS, tmp and swap files twice as well.


So if ditto-data blocks aren't hard to implement, they would be  
welcome. Otherwise there's still the mirror-split-your-drive approach.


Wout.


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-19 Thread David Dyer-Bennet

On 9/19/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:

[pardon the digression]

David Dyer-Bennet wrote:
 On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:

 Interestingly, the operation may succeed and yet we will get an error
 which recommends replacing the drive.  For example, if the failure
 prediction threshold is exceeded.  You might also want to replace the
 drive when there are no spare defect sectors available.  Life would be
 easier if they really did simply die.

 For one thing, people wouldn't be interested in doing ditto-block data!

 So, with ditto-block data, you survive any single-block failure, and
 most double-block failures, etc.  What it doesn't lend itself to is
 simple computation of simple answers :-).

 In theory, and with an infinite budget, I'd approach this analogously
 to cpu architecture design based on large volumes of instruction trace
 data.  If I had a large volume of disk operation traces with the
 hardware failures indicated, I could run this against the ZFS
 simulator and see what strategies produced the most robust single-disk
 results.

There is a significant difference.  The functionality of the logic part is
deterministic and discrete.  The wear-out rate of a mechanical device
is continuous and probabilistic.  In the middle are discrete events
with probabilities associated with them, but they are handled separately.
In other words, we can use probability and statistics tools to analyze
data loss in disk drives.  This will be much faster and less expensive
than running a bunch of traces.  In fact, there has already been much
written about disk drives, their failure modes, and factors which
contribute to their failure rates.  We use such data to predict the
probability of events such as non-recoverable reads (which is often
specified in the data sheet).


Oh, I know there's a difference.  It's not as big as it looks, though,
if you remember that the instruction or disk operation traces are just
*representative* of the workload, not the actual workload that has to
run.  So, yes, disk failures are certainly non-deterministic, but the
actual instruction stream run by customers isn't the same one designed
against, either.  In both cases the design has to take the trace as a
general guideline for types of things that will happen, rather than as
a strict workload to optimize for.
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-19 Thread Torrey McMahon

Richard Elling - PAE wrote:


This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands heads and shoulders above the
others: non-recoverable reads.  A short summary:

  2,919 total errors reported
  1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated)
961 (32.9%) unrecovered errors (of all types)
 32 (1.1%) other (eg. device not ready)
707 (24.2%) non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures. 



Does this take cascading failures into account? How often do you get an 
unrecoverable read and yet are still able to perform operations on the 
target media? That's where ditto blocks could come in handy, modulo the 
concerns around utilities and quotas.




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-19 Thread Richard Elling - PAE

reply below...

Torrey McMahon wrote:

Richard Elling - PAE wrote:


This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands heads and shoulders above the
others: non-recoverable reads.  A short summary:

  2,919 total errors reported
  1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated)
961 (32.9%) unrecovered errors (of all types)
 32 (1.1%) other (eg. device not ready)
707 (24.2%) non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures. 



Does this take cascading failures into account? How often do you get an 
unrecoverable read and yet are still able to perform operations on the 
target media? That's where ditto blocks could come in handy, modulo the 
concerns around utilities and quotas.


No event analysis has been done here; though we do have the data, the task is
time-consuming.

Non-recoverable reads may not represent permanent failures.  In the case
of a RAID array, the data should be reconstructed and a rewrite + verify
attempted with the possibility of sparing the sector.  ZFS can
reconstruct the data and relocate the block.

I have some (voluminous) data on disk error rates as reported through kstat.
I plan to attempt to get a better sense of the failure rates from that
data.  The disk vendors specify non-recoverable read error rates, but
we think they are overly pessimistic for the first few years of life.
We'd like to have a better sense of how to model this, for a variety of
applications which are concerned with archival periods.
 -- richard


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-19 Thread Torrey McMahon

Richard Elling - PAE wrote:


Non-recoverable reads may not represent permanent failures.  In the case
of a RAID array, the data should be reconstructed and a rewrite + verify
attempted with the possibility of sparing the sector.  ZFS can
reconstruct the data and relocate the block.




True, but if you're using a HW RAID array or some sort of protection 
within a zpool then you're already protected to a large degree. I'm 
looking for the number of cases where you get a permanent unrecoverable 
read error and yet can recover because you've got a ditto block someplace.



Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-19 Thread Richard Elling - PAE

Torrey McMahon wrote:

Richard Elling - PAE wrote:


Non-recoverable reads may not represent permanent failures.  In the case
of a RAID array, the data should be reconstructed and a rewrite + verify
attempted with the possibility of sparing the sector.  ZFS can
reconstruct the data and relocate the block.




True, but if you're using a HW RAID array or some sort of protection 
within a zpool then you're already protected to a large degree. I'm 
looking for the number of cases where you get a permanent unrecoverable 
read error and yet can recover because you've got a ditto block someplace.


Agree.  Non-recoverable reads are largely a JBOD problem.
 -- richard


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-18 Thread Richard Elling - PAE

[apologies for being away from my data last week]

David Dyer-Bennet wrote:

The more I look at it the more I think that a second copy on the same
disk doesn't protect against very much real-world risk.  Am I wrong
here?  Are partial(small) disk corruptions more common than I think?
I don't have a good statistical view of disk failures.


This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands heads and shoulders above the
others: non-recoverable reads.  A short summary:

  2,919 total errors reported
  1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated)
961 (32.9%) unrecovered errors (of all types)
 32 (1.1%) other (eg. device not ready)
707 (24.2%) non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures.
Boo!  Did that scare you?  Halloween is next month! :-)
Seagate said today that in a few years 3.5" disks will store 2.5 TBytes.
Boo!
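
To be clear, the 73.6% figure is computed against the 961 unrecovered
errors, not against the 2,919 total reports:

    awk 'BEGIN { printf "%.1f%%\n", 100 * 707 / 961 }'    # prints 73.6%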

While I don't have data on laptop disk failures, I would not be surprised
to see a similar distribution, though with a larger mechanical damage
count.  My laptops run hotter inside than my other systems and, as a rule
of thumb, your disk failure rate increases by 2x for every 15C rise in 
temperature.  Is your laptop disk hot?
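
As a back-of-the-envelope illustration of that rule of thumb (the 2x per
15C figure is the only input; the 20C delta below is made up):

    # rough failure-rate multiplier for a disk running dT degrees C hotter
    dT=20
    awk -v dT="$dT" 'BEGIN { printf "multiplier: %.1fx\n", 2 ^ (dT / 15) }'   # ~2.5x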

The case for ditto data is clear to me.  Many people are using single-disk
systems, and many more people would really like to use single-disk systems
but they really can't.

Beyond spinning rust systems, there are other forms of non-volatile
storage which would apply here.  For example, those people who suggested
that you should back up your presentation to a CD fail to note that a speck 
of dust on the CD could lead you to lose one block of data.  In my CD/DVD
experience, such losses are blissfully ignored by the system and you may
blame the resulting crash on the cheap hardware you bought from your
brother-in-law.  Beyond CDs, I can see this as being a nice enhancement
to limited endurance devices such as flash.

While it is true that I could slice my disk up into multiple vdevs and
mirror them, I'd much rather set a policy at a finer granularity: my 
files are more important than most of the other, mostly read-only and
easily reconstructed, files on my system.

When ditto blocks for metadata was introduced, I took a look at the
code and was pleasantly surprised.  The code does an admirable job of 
ensuring spatial diversity in the face of multiple policies, even in
the single disk case.  IMHO, this is the right way to implement this
and allows you to mix policies with ease.

As a RAS guy, I'm biased to not wanting to lose data, via easy-to-use 
interfaces.  I don't see any downside to this feature, but lots of upside.
 -- richard


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-18 Thread David Dyer-Bennet

On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:

[apologies for being away from my data last week]

David Dyer-Bennet wrote:
 The more I look at it the more I think that a second copy on the same
 disk doesn't protect against very much real-world risk.  Am I wrong
 here?  Are partial(small) disk corruptions more common than I think?
 I don't have a good statistical view of disk failures.

This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands heads and shoulders above the
others: non-recoverable reads.  A short summary:

   2,919 total errors reported
   1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated)
 961 (32.9%) unrecovered errors (of all types)
  32 (1.1%) other (eg. device not ready)
 707 (24.2%) non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures.


I don't see anything addressing complete drive failures vs. block
failures here anywhere.   Is there some way to read something about
that out of this data?

I'm thinking the operations succeeded category also covers read errors
recovered by retries and such, as well as the write failure cited as
an example?

I guess I can conclude that the 66% for errors successfully recovered
means that a lot of errors are not, in fact, entire-drive failures.
So that's good (for ditto-data).  So a maximum of 34% are whole-drive
failures (and in reality I'm sure far lower).

Anyway, facts on actual failures in the real world are *definitely*
the useful way to conduct this discussion!

[snip]


While it is true that I could slice my disk up into multiple vdevs and
mirror them, I'd much rather set a policy at a finer granularity: my
files are more important than most of the other, mostly read-only and
easily reconstructed, files on my system.


I definitely like the idea of setting policy at a finer granularity; I
really want it to be at the file level, even per-directory doesn't fit
reality very well in my view.


When ditto blocks for metadata was introduced, I took a look at the
code and was pleasantly surprised.  The code does an admirable job of
ensuring spatial diversity in the face of multiple policies, even in
the single disk case.  IMHO, this is the right way to implement this
and allows you to mix policies with ease.


That's very good to hear.
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-18 Thread Richard Elling - PAE

more below...

David Dyer-Bennet wrote:

On 9/18/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:

[apologies for being away from my data last week]

David Dyer-Bennet wrote:
 The more I look at it the more I think that a second copy on the same
 disk doesn't protect against very much real-world risk.  Am I wrong
 here?  Are partial(small) disk corruptions more common than I think?
 I don't have a good statistical view of disk failures.

This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands heads and shoulders above the
others: non-recoverable reads.  A short summary:

   2,919 total errors reported
   1,926 (66.0%) operations succeeded (eg. write failed, auto 
reallocated)

 961 (32.9%) unrecovered errors (of all types)
  32 (1.1%) other (eg. device not ready)
 707 (24.2%) non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures.


I don't see anything addressing complete drive failures vs. block
failures here anywhere.   Is there some way to read something about
that out of this data?


Complete failures are a non-zero category, but there is more than one
error code which would result in the recommendation to replace the drive.
Their counts are included in the 961-707=254 (26.4%) of other non-
recoverable errors.  In some cases a non-recoverable error can be
corrected by a retry, and those also fall into the 26.4% bucket.

Interestingly, the operation may succeed and yet we will get an error
which recommends replacing the drive.  For example, if the failure
prediction threshold is exceeded.  You might also want to replace the
drive when there are no spare defect sectors available.  Life would be
easier if they really did simply die.


I'm thinking the operations succeeded category also covers read errors
recovered by retries and such, as well as the write failure cited as
an example?


Yes.


I guess I can conclude that the 66% for errors successfully recovered
means that a lot of errors are not, in fact, entire-drive failures.
So that's good (for ditto-data).  So a maximum of 34% are whole-drive
failures (and in reality I'm sure far lower).


I agree.


Anyway, facts on actual failures in the real world are *definitely*
the useful way to conduct this discussion!

[snip]


While it is true that I could slice my disk up into multiple vdevs and
mirror them, I'd much rather set a policy at a finer granularity: my 
files are more important than most of the other, mostly read-only and
easily reconstructed, files on my system.


I definitely like the idea of setting policy at a finer granularity; I
really want it to be at the file level, even per-directory doesn't fit
reality very well in my view.


When ditto blocks for metadata was introduced, I took a look at the
code and was pleasantly surprised.  The code does an admirable job of 
ensuring spatial diversity in the face of multiple policies, even in
the single disk case.  IMHO, this is the right way to implement this
and allows you to mix policies with ease.


That's very good to hear.


 -- richard


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-14 Thread Jesus Cea

Matthew Ahrens wrote:
 Out of curiosity, what would you guys think about addressing this same
 problem by having the option to store some filesystems unreplicated on
 a mirrored (or raid-z) pool?  This would have the same issues of
 unexpected space usage, but since it would be *less* than expected, that
 might be more acceptable.  There are no plans to implement anything like
 this right now, but I just wanted to get a read on it.

+1, especially in a two disk (mirrored) configuration.

Currently I use two ZFS pools: one mirrored and the other unmirrored,
spread over two disks (each disk partitioned with SVM). And I'm
constantly fighting the fill-up of one pool while the other is empty.
My current setup has the same space-balance problem as a traditional
two-*static*-partition setup.

--
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[EMAIL PROTECTED] http://www.argo.es/~jcea/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[EMAIL PROTECTED] _/_/_/_/  _/_/_/_/_/
   _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
El amor es poner tu felicidad en la felicidad de otro - Leibniz


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-14 Thread Jesus Cea

Neil A. Wilson wrote:
 This is unfortunate.  As a laptop user with only a single drive, I was
 looking forward to it since I've been bitten in the past by data loss
 caused by a bad area on the disk.  I don't care about the space
 consumption because I generally don't come anywhere close to filling up
 the available space.  It may not be the primary market for ZFS, but it
 could be a very useful side benefit.

I feel your pain.

Although your hard drive will suffer from the extra seeks, I would suggest
partitioning your HD into two slices and creating a two-way ZFS mirror
between them. If space is an issue, you can use N partitions to create a
raid-z, but your performance will suffer a lot because any data read
would require N seeks.
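
A rough sketch of that layout, assuming hypothetical slice names on the
single laptop disk:

    # two-way mirror across two slices of the same disk (protects against
    # bad blocks and silent corruption, not against losing the whole drive)
    zpool create tank mirror c0d0s4 c0d0s5

    # or, trading read performance for space efficiency: raid-z across N slices
    zpool create tank raidz c0d0s3 c0d0s4 c0d0s5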

--
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[EMAIL PROTECTED] http://www.argo.es/~jcea/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[EMAIL PROTECTED] _/_/_/_/  _/_/_/_/_/
   _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
El amor es poner tu felicidad en la felicidad de otro - Leibniz


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Matthew Ahrens

Torrey McMahon wrote:

Matthew Ahrens wrote:
The problem that this feature attempts to address is when you have 
some data that is more important (and thus needs a higher level of 
redundancy) than other data.  Of course in some situations you can use 
multiple pools, but that is antithetical to ZFS's pooled storage 
model.  (You have to divide up your storage, you'll end up with 
stranded storage and bandwidth, etc.) 


Can you expand? I can think of some examples where using multiple pools 
- even on the same host - is quite useful given the current feature set 
of the product.  Or are you only discussing the specific case where a
host would want more reliability for a certain set of data than 
another? If that's the case I'm still confused as to what failure cases 
would still allow you to retrieve your data if there is more than one 
copy in the fs or pool... but I'll gladly take some enlightenment. :)


(My apologies for the length of this response, I'll try to address most 
of the issues brought up recently...)


When I wrote this proposal, I was only seriously thinking about the case 
where you want different amounts of redundancy for different data. 
Perhaps because I failed to make this clear, discussion has concentrated 
on laptop reliability issues.  It is true that there would be some 
benefit to using multiple copies on a single-disk (eg. laptop) pool, but 
of course it would not protect against the most common failure mode 
(whole disk failure).


One case where this feature would be useful is if you have a pool with 
no redundancy (ie. no mirroring or raid-z), because most of the data in 
the pool is not very important.  However, the pool may have a bunch of 
disks in it (say, four).  The administrator/user may realize (perhaps 
later on) that some of their data really *is* important and they would 
like some protection against losing it if a disk fails.  They may not 
have the option of adding more disks to mirror all of their data (cost 
or physical space constraints may apply here).  Their problem is solved 
by creating a new filesystem with copies=2 and putting the important 
data there.  Now, if a disk fails, then the data in the copies=2 
filesystem will not be lost.  Approximately 1/4 of the data in other 
filesystems will be lost.  (There is a small chance that some tiny 
fraction of the data in the copies=2 filesystem will still be lost if we 
were forced to put both copies on the disk that failed.)
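
A sketch of that scenario, using the proposed syntax and hypothetical 
device names:

    zpool create tank c0d0 c1d0 c2d0 c3d0    # four disks, no pool-level redundancy
    zfs create -o copies=2 tank/important    # only this filesystem gets two copies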


Another plausible use case would be where you have some level of 
redundancy, say you have a Thumper (X4500) with its 48 disks configured 
into 9 5-wide single-parity raid-z groups (with 3 spares).  If a single 
disk fails, there will be no data loss.  However, if two disks within 
the same raid-z group fail, data will be lost.  In this scenario, 
imagine that this data loss probability is acceptable for most of the 
data stored here, but there is some extremely important data for which 
this is unacceptable.  Rather than reconfiguring the entire pool for 
higher redundancy (say, double-parity raid-z) and less usable storage, 
you can simply create a filesystem with copies=2 within the raid-z 
storage pool.  Data within that filesystem will not be lost even if any 
three disks fail.
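
An abbreviated sketch of that configuration (only two of the nine raid-z 
groups are shown; device names are hypothetical):

    zpool create tank \
        raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 \
        raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 \
        spare c5t0d0 c5t1d0 c5t2d0
    zfs create -o copies=2 tank/critical     # extra copies only for the critical data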


I believe that these use cases, while not being extremely common, do 
occur.  The extremely low amount of engineering effort required to 
implement the feature (modulo the space accounting issues) seems 
justified.  The fact that this feature does not solve all problems (eg, 
it is not intended to be a replacement for mirroring) is not a downside; 
not all features need to be used in all situations :-)


The real problem with this proposal is the confusion surrounding disk 
space accounting with copies>1.  While the same issues are present when 
using compression, people are understandably less upset when files take 
up less space than expected.  Given the current lack of interest in this 
feature, the effort required to address the space accounting issue does 
not seem justified at this time.


--matt


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread eric kustarz

David Dyer-Bennet wrote:


On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote:


So it seems to me that having this feature per-file is really useful.
Say i have a presentation to give in Pleasanton, and the presentation
lives on my single-disk laptop - I want all the meta-data and the actual
presentation to be replicated.  We already use ditto blocks for the
meta-data.  Now we could have an extra copy of the actual data.  When i
get back from the presentation i can turn off the extra copies.



Yes, you could do that.

*I* would make a copy on a CD, which I would carry in a separate case
from the laptop.



Do you back up the presentation to CD every time you make an edit?



I think my presentation is a lot safer than your presentation.



I'm sure both of our presentations would be equally safe, as we would 
know not to have the only copy (or copies) on our person.




Similarly for your digital images example; I don't consider it safe
until I have two or more *independent* copies.  Two copies on a single
hard drive doesn't come even close to passing the test for me; as many
people have pointed out, those tend to fail all at once.  And I will
also point out that laptops get stolen a lot.  And of course all the
accidents involving fumble-fingers, OS bugs, and driver bugs won't be
helped by the data duplication either.  (Those will mostly be helped
by sensible use of snapshots, though, which is another argument for
ZFS on *any* disk you work on a lot.)



Well of course you would have a separate, independent copy if it really 
mattered.




The more I look at it the more I think that a second copy on the same
disk doesn't protect against very much real-world risk.  Am I wrong
here?  Are partial(small) disk corruptions more common than I think?
I don't have a good statistical view of disk failures.



Well let's see - my friend accompanied me on a trip and saved her photos 
daily onto her laptop.  Near the end of the trip her hard drive started 
having problems.  The hard drive was not dead, as it was bootable and 
you could access certain data.  Upon returning home she was able to 
retrieve some of her photos but not all.  She would have been much 
happier having ZFS + copies.


And yes, you could back up to CD/DVD every night, but it's a pain and 
people don't do it (as much as they should).


Side note: it would have cost hundreds of dollars for data recovery to 
have just the *possibility* of getting the other photos back.


eric




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread eric kustarz

Torrey McMahon wrote:


eric kustarz wrote:


Matthew Ahrens wrote:


Matthew Ahrens wrote:

Here is a proposal for a new 'copies' property which would allow 
different levels of replication for different filesystems.




Thanks everyone for your input.

The problem that this feature attempts to address is when you have 
some data that is more important (and thus needs a higher level of 
redundancy) than other data.  Of course in some situations you can 
use multiple pools, but that is antithetical to ZFS's pooled storage 
model.  (You have to divide up your storage, you'll end up with 
stranded storage and bandwidth, etc.)


Given the overwhelming criticism of this feature, I'm going to 
shelve it for now.




So it seems to me that having this feature per-file is really 
useful.  Say i have a presentation to give in Pleasanton, and the 
presentation lives on my single-disk laptop - I want all the 
meta-data and the actual presentation to be replicated.  We already 
use ditto blocks for the meta-data.  Now we could have an extra copy 
of the actual data.  When i get back from the presentation i can turn 
off the extra copies. 



Under what failure modes would your data still be accessible? What 
things can go wrong that still allow you to access the data because 
some event has removed one copy but left the others?



Silent data corruption of one of the copies.
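
A scrub would catch that case and repair the bad copy from the surviving 
one; something along the lines of:

    zpool scrub tank        # read and verify every block, rewriting from good copies
    zpool status -v tank    # report any checksum errors that were found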




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Mike Gerdts

On 9/13/06, Richard Elling [EMAIL PROTECTED] wrote:

 * Mirroring offers slightly better redundancy, because one disk from
each mirror can fail without data loss.

 Is this use of "slightly" based upon disk failure modes?  That is, when
 disks fail do they tend to get isolated areas of badness compared to
 complete loss?  I would suggest that complete loss should include
 someone tripping over the power cord to the external array that houses
 the disk.

The field data I have says that complete disk failures are the exception.
I hate to leave this as a teaser, I'll expand my comments later.

BTW, this feature will be very welcome on my laptop!  I can't wait :-)


On servers and stationary desktops, I just don't care whether it is a
whole disk failure or a few bad blocks.  In that case I have the
resources to mirror, RAID5, perform daily backups, etc.

The laptop disk failures that I have seen have typically been limited
to a few bad blocks.  As Torrey McMahon mentioned, they tend to start 
out with some warning signs followed by a full failure.  I would
*really* like to have that window between warning signs and full
failure as my opportunity to back up my data and replace my
non-redundant hard drive with no data loss.

The only part of the proposal I don't like is space accounting.
Double or triple charging for data will only confuse those apps and
users that check for free space or block usage.  If this is worked
out, it would be a great feature for those times when mirroring just
isn't an option.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Tobias Schacht

On 9/13/06, Mike Gerdts [EMAIL PROTECTED] wrote:

The only part of the proposal I don't like is space accounting.
Double or triple charging for data will only confuse those apps and
users that check for free space or block usage.


Why exactly isn't reporting the free space divided by the copies
value on that particular file system an easy solution for this? Did I
miss something?


Tobias


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Al Hopper
On Tue, 12 Sep 2006, Matthew Ahrens wrote:

 Torrey McMahon wrote:
  Matthew Ahrens wrote:
  The problem that this feature attempts to address is when you have
  some data that is more important (and thus needs a higher level of
  redundancy) than other data.  Of course in some situations you can use
  multiple pools, but that is antithetical to ZFS's pooled storage
  model.  (You have to divide up your storage, you'll end up with
  stranded storage and bandwidth, etc.)
 
  Can you expand? I can think of some examples where using multiple pools
  - even on the same host - is quite useful given the current feature set
  of the product.  Or are you only discussing the specific case where a
  host would want more reliability for a certain set of data than
  another? If that's the case I'm still confused as to what failure cases
  would still allow you to retrieve your data if there is more than one
  copy in the fs or pool... but I'll gladly take some enlightenment. :)

 (My apologies for the length of this response, I'll try to address most
 of the issues brought up recently...)

 When I wrote this proposal, I was only seriously thinking about the case
 where you want different amounts of redundancy for different data.
 Perhaps because I failed to make this clear, discussion has concentrated
 on laptop reliability issues.  It is true that there would be some
 benefit to using multiple copies on a single-disk (eg. laptop) pool, but
 of course it would not protect against the most common failure mode
 (whole disk failure).
... lots of Good Stuff elided 

Soon Samsung will release a 100% flash-memory-based drive (32GB) in a
laptop form factor.  But flash memory chips have a limited number of write
cycles available, and when that limit is exceeded, the result is usually data
corruption.  Some people have already encountered this issue with USB
thumb drives.  It's especially annoying if you were using the thumb drive
as what you thought was a 100% _reliable_ backup mechanism.

This is a perfect application for ZFS copies=2.  Also, consider that there
is no head-positioning time penalty on a flash drive.  So you would have
2 options in a laptop-type application with a single flash-based drive:

a) create a mirrored pool using 2 slices - expensive in terms of storage
   utilization
b) create a pool with no redundancy
   create a filesystem called importantPresentationData within that pool
   with copies=2 (or more).
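
A rough sketch of the two options above (slice/device names are hypothetical):

    # (a) mirrored pool built from two slices of the one flash drive
    zpool create tank mirror c0d0s3 c0d0s4

    # (b) single-device pool, extra copies only where they matter
    zpool create tank c0d0
    zfs create -o copies=2 tank/importantPresentationData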

Matthew - build it and they will come!

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Bill Sommerfeld
On Wed, 2006-09-13 at 02:30, Richard Elling wrote:
 The field data I have says that complete disk failures are the exception.
 I hate to leave this as a teaser, I'll expand my comments later.

That matches my anecdotal experience with laptop drives; maybe I'm just
lucky, or maybe I'm just paying more attention than most to the sounds they
start to make when they're having a bad hair day, but so far they've
always given *me* significant advance warning of impending doom,
generally by failing to read a bunch of disk sectors.

That said, I think the best use case for the copies > 1 config would be
in systems with exactly two disks -- which covers most of the 1U boxes
out there.

One question for Matt: when ditto blocks are used with raidz1, how well
does this handle the case where you encounter one or more single-sector
read errors on other drive(s) while reconstructing a failed drive?

For a concrete example:

A0 B0 C0 D0 P0
A1 B1 C1 D1 P1

(A0==A1, B0==B1, ...; A^B^C^D==P)

Does the current implementation of raidz + ditto blocks cope with the
case where all of A, C0, and D1 are unavailable?

- Bill



Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Torrey McMahon

eric kustarz wrote:


I want per pool, per dataset, and per file - where all are done by the 
filesystem (ZFS), not the application.  I was talking about a further 
enhancement to copies than what Matt is currently proposing - per 
file copies, but its more work (one thing being we don't have 
administrative control over files per se).


Now if you could do that and make it something that can be set at 
install time it would get a lot more interesting. When you install 
Solaris to that single laptop drive you can select files or even 
directories that have more than one copy in case of a problem down the road.





Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Bart Smaalders

Torrey McMahon wrote:

eric kustarz wrote:


I want per pool, per dataset, and per file - where all are done by the 
filesystem (ZFS), not the application.  I was talking about a further 
enhancement to copies than what Matt is currently proposing - per 
file copies, but it's more work (one thing being we don't have 
administrative control over files per se).


Now if you could do that and make it something that can be set at 
install time it would get a lot more interesting. When you install 
Solaris to that single laptop drive you can select files or even 
directories that have more than one copy in case of a problem down 
road.




Actually, this is a perfect use case for setting the copies=2
property after installation.  The original binaries are
quite replaceable; the customizations and personal files
created later on are not.
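
For example (hypothetical dataset name; per the proposal, the property
only affects data written after it is set):

    zfs set copies=2 tank/export/home    # personal files written from now on get two copies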

- Bart

--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-13 Thread Torrey McMahon

Bart Smaalders wrote:

Torrey McMahon wrote:

eric kustarz wrote:


I want per pool, per dataset, and per file - where all are done by 
the filesystem (ZFS), not the application.  I was talking about a 
further enhancement to copies than what Matt is currently 
proposing - per file copies, but it's more work (one thing being we 
don't have administrative control over files per se).


Now if you could do that and make it something that can be set at 
install time it would get a lot more interesting. When you install 
Solaris to that single laptop drive you can select files or even 
directories that have more than one copy in case of a problem down 
the road.




Actually, this is a perfect use case for setting the copies=2
property after installation.  The original binaries are
quite replaceable; the customizations and personal files
created later on are not.



We've been talking about user data, but the chance of corrupting 
something on disk and then detecting a bad checksum on something in 
/kernel is also possible. (Disk drives do weird things from time to 
time.) If I were sufficiently paranoid I would want everything required 
to get into single-user mode, some other stuff, and then my user data, 
duplicated to avoid any issues.




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Darren J Moffat

Mike Gerdts wrote:

On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored.  Its value must be 1, 2, or 3.
Like other properties (eg.  checksum, compression), it only affects
newly-written data.  As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').


Is there anything in the works to compress (or encrypt) existing data
after the fact?  For example, a special option to scrub that causes
the data to be re-written with the new properties could potentially do
this.  If so, this feature should subscribe to any generic framework
provided by such an effort.


While encryption of existing data is not in scope for the first ZFS 
crypto phase I am being careful in the design to ensure that it can be 
done later if such a ZFS framework becomes available.


The biggest problem I see with this is one of observability: if not all 
of the data is encrypted yet, what should the encryption property say? 
If it says encryption is on, then the admin might think the data is 
safe, but if it says it is off, that isn't the truth either, because 
some of it may be encrypted.


--
Darren J Moffat


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Dick Davies

On 12/09/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems.

Your comments are appreciated!


Flexibility is always nice, but this seems to greatly complicate things,
both technically and conceptually (sometimes, good design is about what
is left out :) ).

Seems to me this lets you say 'files in this directory are x times more
valuable than files elsewhere'. Others have covered some of my
concerns (guarantees, cleanup, etc.). In addition,

* if I move a file somewhere else, does it become less important?
* zpools let you do that already
 (admittedly with less granularity, but *much* *much* more simply -
 and disk is cheap in my world)
* I don't need to do that :)

The only real use I'd see would be for redundant copies
on a single disk, but then why wouldn't I just add a disk?

* disks are cheap, and creating a mirror from a single disk is very easy
 (and conceptually simple)
* *removing* a disk from a mirror pair is simple too - I make mistakes
 sometimes
* in my experience, disks fail. When you get bad errors on part of a disk,
 the disk is about to die.
* you can already create a/several zpools using disk
 partitions as vdevs. That's not all that safe, and I don't see this being
 any safer.


Sorry to be negative, but to me ZFS' simplicity is one of its major features.
I think this provides a cool feature, but I question its usefulness.

Quite possibly I just don't have the particular itch this is intended
to scratch - is this a much requested feature?


--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Dick Davies

On 12/09/06, Darren J Moffat [EMAIL PROTECTED] wrote:

Dick Davies wrote:

 The only real use I'd see would be for redundant copies
 on a single disk, but then why wouldn't I just add a disk?

Some systems have physical space for only a single drive - think most
laptops!


True - I'm a laptop user myself. But as I said, I'd assume the whole disk
would fail (it does in my experience).

If your hardware craps out differently than mine, you could do a similar thing
with partitions (or even files) as vdevs. It wouldn't be any less reliable.

I'm still not Feeling the Magic on this one :)

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Jeff Victor
This proposal would benefit greatly from a problem statement.  As it stands, it 
feels like a solution looking for a problem.


The Introduction mentions a different problem and solution, but then pretends that 
there is value to this solution.  The Description section mentions some benefits 
of 'copies' relative to the existing situation, but requires that the reader piece 
together the whole picture.  And IMO there aren't enough pieces :-), i.e. so far 
I haven't seen sufficient justification for the added administrative complexity 
and potential for confusion, for both administrators and users.


Matthew Ahrens wrote:
Here is a proposal for a new 'copies' property which would allow 
different levels of replication for different filesystems.


Your comments are appreciated!

--matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata.  This is accomplished by
storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
This feature is known as Ditto Blocks.  When possible, the copies are
stored on different disks.

See bug 6410698 ZFS metadata needs to be more highly replicated (ditto
blocks) for details on ditto blocks.

This case will extend this feature to allow system administrators to
store multiple copies of user data as well, on a per-filesystem basis.
These copies are in addition to any redundancy provided at the pool
level (mirroring, raid-z, etc).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored.  Its value must be 1, 2, or 3.
Like other properties (eg.  checksum, compression), it only affects
newly-written data.  As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').

The pool must be at least on-disk version 2 to use this feature (see
'zfs upgrade').
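
A minimal usage sketch of the proposed property:

    zfs create -o copies=2 tank/fs    # set at creation time, as recommended
    zfs get copies tank/fs            # inspect the current value
    zfs set copies=3 tank/fs          # allowed later, but affects only newly-written data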

By default (copies=1), only two copies of most filesystem metadata are
stored.  However, if we are storing multiple copies of user data, then 3
copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several
important ways:

* Different filesystems in the same pool can have different numbers of
   copies.
* The storage configuration is not constrained as it is with mirroring
   (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA
   needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from
   each mirror can fail without data loss.

It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage.  For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
   will be stored with 2 * 1 = 2 copies.  The filesystem can tolerate any
   1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3
   will be stored with 2 * 3 = 6 copies.  The filesystem can tolerate any
   5 disks failing without data loss (assuming that there are at least
   ncopies=3 mirror groups).
* In a pool with single-parity raid-z a filesystem with copies=2
   will be stored with 2 copies, each copy protected by its own parity
   block.  The filesystem can tolerate any 3 disks failing without data
   loss (assuming that there are at least ncopies=2 raid-z groups).


C. MANPAGE CHANGES
*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
      they are inherited.


+     copies=1 | 2 | 3
+
+         Controls the number of copies of data stored for this dataset.
+         These copies are in addition to any redundancy provided by the
+         pool (eg. mirroring or raid-z).  The copies will be stored on
+         different disks if possible.
+
+         Changing this property only affects newly-written data.
+         Therefore, it is recommended that this property be set at
+         filesystem creation time, using the '-o copies=' option.
+
+
      Temporary Mountpoint Properties
          When a file system is mounted, either through mount(1M)  for
          legacy  mounts  or  the  zfs mount command for normal file


D. REFERENCES


--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread David Dyer-Bennet

On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems.

Your comments are appreciated!


I've read the proposal, and followed the discussion so far.  I have to
say that I don't see any particular need for this feature.

Possibly there is a need for a different feature, in which the entire
control of redundancy is moved away from the pool level and to the
file or filesystem level.  I definitely see the attraction of being
able to specify by file and directory different degrees of reliability
needed.  However, the details of the feature actually proposed don't
seem to satisfy the need for extra reliability at the level that
drives people to employ redundancy; it doesn't provide a guaranty.

I see no need for additional non-guaranteed reliability on top of the
levels of guaranty provided by use of redundancy at the pool level.

Furthermore, as others have pointed out, this feature would add a high
degree of user-visible complexity.


From what I've seen here so far, I think this is a bad idea and should
not be added.
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Neil A. Wilson

Darren J Moffat wrote:
While encryption of existing data is not in scope for the first ZFS 
crypto phase I am being careful in the design to ensure that it can be 
done later if such a ZFS framework becomes available.


The biggest problem I see with this is one of observability, if not all 
of the data is encrypted yet what should the encryption property say ? 
If it says encryption is on then the admin might think the data is 
safe, but if it says it is off that isn't the truth either because 
some of it may be encrypted.


I would also think that there's a significant problem around what to do 
about the previously unencrypted data.  I assume that when performing a 
scrub to encrypt the data, the encrypted data will not be written on 
the same blocks previously used to hold the unencrypted data.  As such, 
there's a very good chance that the unencrypted data would still be 
there for quite some time.  You may not be able to access it through the 
filesystem, but someone with access to the raw disks may be able to 
recover at least parts of it.  In this case, the scrub would not only 
have to write the encrypted data but also overwrite the unencrypted data 
(multiple times?).




Neil


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Darren J Moffat

Neil A. Wilson wrote:

Darren J Moffat wrote:
While encryption of existing data is not in scope for the first ZFS 
crypto phase I am being careful in the design to ensure that it can be 
done later if such a ZFS framework becomes available.


The biggest problem I see with this is one of observability, if not 
all of the data is encrypted yet what should the encryption property 
say ? If it says encryption is on then the admin might think the data 
is safe, but if it says it is off that isn't the truth either 
because some of it may be encrypted.


I would also think that there's a significant problem around what to do 
about the previously unencrypted data.  I assume that when performing a 
scrub to encrypt the data, the encrypted data will not be written on 
the same blocks previously used to hold the unencrypted data.  As such, 
there's a very good chance that the unencrypted data would still be 
there for quite some time.  You may not be able to access it through the 
filesystem, but someone with access to the raw disks may be able to 
recover at least parts of it.  In this case, the scrub would not only 
have to write the encrypted data but also overwrite the unencrypted data 
(multiple times?).


Right, that is a very important issue.  Would a ZFS scrub framework do 
copy-on-write?  As you point out, if it doesn't, then we still need to do 
something about the old cleartext blocks, because strings(1) over the 
raw disk will show them.


I see the desire to have a knob that says make this encrypted now, but 
I personally believe that it is actually better if you can make this 
choice at the time you create the ZFS dataset.


--
Darren J Moffat


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Nicolas Williams
On Tue, Sep 12, 2006 at 10:36:30AM +0100, Darren J Moffat wrote:
 Mike Gerdts wrote:
 Is there anything in the works to compress (or encrypt) existing data
 after the fact?  For example, a special option to scrub that causes
 the data to be re-written with the new properties could potentially do
 this.  If so, this feature should subscribe to any generic framework
 provided by such an effort.
 
 While encryption of existing data is not in scope for the first ZFS 
 crypto phase I am being careful in the design to ensure that it can be 
 done later if such a ZFS framework becomes available.
 
 The biggest problem I see with this is one of observability, if not all 
 of the data is encrypted yet what should the encryption property say ? 
 If it says encryption is on then the admin might think the data is 
 safe, but if it says it is off that isn't the truth either because 
 some of it may be encrypted.

I agree -- there needs to be a filesystem re-write option, something
like a scrub but at the filesystem level.  Things that might be
accomplished through it:

 - record size changes
 - compression toggling / compression algorithm changes
 - encryption/re-keying/alg. changes
 - checksum alg. changes
 - ditto blocking

What else?

To me it's important that such scrubs not happen simply as a result of
setting/changing a filesystem property, but it's also important that the
user/admin be told that changing the property requires scrubbing in
order to take effect for data/meta-data written before the change.
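
For illustration (pool/fs is just a placeholder dataset name), the 
existing tunables already behave this way:

   zfs set compression=on pool/fs    # only newly-written blocks are compressed
   zfs set checksum=sha256 pool/fs   # only newly-written blocks use the new checksum
   # data written before the change keeps its old settings until it is rewritten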

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Matthew Ahrens

Matthew Ahrens wrote:
Here is a proposal for a new 'copies' property which would allow 
different levels of replication for different filesystems.


Thanks everyone for your input.

The problem that this feature attempts to address is when you have some 
data that is more important (and thus needs a higher level of 
redundancy) than other data.  Of course in some situations you can use 
multiple pools, but that is antithetical to ZFS's pooled storage model. 
 (You have to divide up your storage, you'll end up with stranded 
storage and bandwidth, etc.)


Given the overwhelming criticism of this feature, I'm going to shelve it 
for now.


Out of curiosity, what would you guys think about addressing this same 
problem by having the option to store some filesystems unreplicated on 
a mirrored (or raid-z) pool?  This would have the same issues of 
unexpected space usage, but since it would be *less* than expected, that 
might be more acceptable.  There are no plans to implement anything like 
this right now, but I just wanted to get a read on it.


--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread David Dyer-Bennet

On 9/12/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

Matthew Ahrens wrote:
 Here is a proposal for a new 'copies' property which would allow
 different levels of replication for different filesystems.

Thanks everyone for your input.

The problem that this feature attempts to address is when you have some
data that is more important (and thus needs a higher level of
redundancy) than other data.  Of course in some situations you can use
multiple pools, but that is antithetical to ZFS's pooled storage model.
  (You have to divide up your storage, you'll end up with stranded
storage and bandwidth, etc.)

Given the overwhelming criticism of this feature, I'm going to shelve it
for now.


I think it's a valid problem.  My understanding was that this didn't
give a *guaranteed* solution, though.  I think most people, when
committing to the point of replication (spending actual money), need a
guarantee at some level (not of course of total safety; but that the
data actually does exist on separate disks, and will survive the
destruction of one disk).  A good solution to this problem would be
valuable.  (And on a single disk I'd accept a weaker guarantee; or rather,
a guarantee that if enough blocks to find the data exist, and a
copy of each data block exists, the data can be retrieved; I think that
guarantee *does* exist).


Out of curiosity, what would you guys think about addressing this same
problem by having the option to store some filesystems unreplicated on
a mirrored (or raid-z) pool?  This would have the same issues of
unexpected space usage, but since it would be *less* than expected, that
might be more acceptable.  There are no plans to implement anything like
this right now, but I just wanted to get a read on it.


I was never concerned about the free space issues (though I was concerned
by some of the proposed solutions to what I saw as a non-issue).  I'd
be happy if the free space described how many bytes of default files
you could add to the pool, and the user would have to understand that
results would differ if they used non-default parameters.  You're
probably right that fewer people would mind ending up with *more* space
than a naive reading of the numbers suggests than with less.
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Neil A. Wilson

Matthew Ahrens wrote:

Matthew Ahrens wrote:
Here is a proposal for a new 'copies' property which would allow 
different levels of replication for different filesystems.


Thanks everyone for your input.

The problem that this feature attempts to address is when you have some 
data that is more important (and thus needs a higher level of 
redundancy) than other data.  Of course in some situations you can use 
multiple pools, but that is antithetical to ZFS's pooled storage model. 
 (You have to divide up your storage, you'll end up with stranded 
storage and bandwidth, etc.)


Given the overwhelming criticism of this feature, I'm going to shelve it 
for now.


This is unfortunate.  As a laptop user with only a single drive, I was 
looking forward to it since I've been bitten in the past by data loss 
caused by a bad area on the disk.  I don't care about the space 
consumption because I generally don't come anywhere close to filling up 
the available space.  It may not be the primary market for ZFS, but it 
could be a very useful side benefit.




Out of curiosity, what would you guys think about addressing this same 
problem by having the option to store some filesystems unreplicated on 
a mirrored (or raid-z) pool?  This would have the same issues of 
unexpected space usage, but since it would be *less* than expected, that 
might be more acceptable.  There are no plans to implement anything like 
this right now, but I just wanted to get a read on it.


I don't see much need for this in any area that I would use ZFS (either 
my own personal use or for any case in which I would recommend it for 
production use).


However, if you think that it's OK to under-report free space, then why 
not just do that for the data ditto blocks?  If one or more of my 
filesystems are configured to keep two copies of the data, then simply 
report only half of the available space.  If duplication isn't enabled 
for the entire pool but only for certain filesystems, then perhaps you 
could even take advantage of quotas for those filesystems to make a more 
accurate calculation.




--matt


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread eric kustarz

Matthew Ahrens wrote:


Matthew Ahrens wrote:

Here is a proposal for a new 'copies' property which would allow 
different levels of replication for different filesystems.



Thanks everyone for your input.

The problem that this feature attempts to address is when you have 
some data that is more important (and thus needs a higher level of 
redundancy) than other data.  Of course in some situations you can use 
multiple pools, but that is antithetical to ZFS's pooled storage 
model.  (You have to divide up your storage, you'll end up with 
stranded storage and bandwidth, etc.)


Given the overwhelming criticism of this feature, I'm going to shelve 
it for now.



So it seems to me that having this feature per-file is really useful.  
Say I have a presentation to give in Pleasanton, and the presentation 
lives on my single-disk laptop - I want all the meta-data and the actual 
presentation to be replicated.  We already use ditto blocks for the 
meta-data.  Now we could have an extra copy of the actual data.  When I 
get back from the presentation I can turn off the extra copies.
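
As a sketch of that workflow (the dataset name is hypothetical, and 
'copies' here is the proposed property, not something shipping today):

   zfs set copies=2 tank/presentations   # before the trip: extra replica for new writes
   cp slides.odp /tank/presentations/    # the newly-written file is stored twice
   zfs set copies=1 tank/presentations   # after the trip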


Doing it for the filesystem is just one step higher (and makes it 
administratively easier, as I don't have to type the same command for 
each file that's important).


Mirroring is just like another step above that - though it's possibly 
replicating stuff you just don't care about.


Now placing extra copies of the data doesn't guarantee that data will 
survive multiple disk failures; but neither does having a mirrored pool 
guarantee the data will be there either (2 disk failures).  Both methods 
are about increasing your chances of having your valuable data around.


I for one would have loved to have multiple copy filesystems + ZFS on my 
powerbook when I was travelling in Australia for a month - think of all 
the digital pictures you take and how pissed you would be if the one 
with the wild wombat didn't survive.


It's maybe not an enterprise solution, but it seems like a consumer solution.

Ensuring that the space accounting tools make sense is definitely a 
valid point though.


eric



Out of curiosity, what would you guys think about addressing this same 
problem by having the option to store some filesystems unreplicated on 
a mirrored (or raid-z) pool?  This would have the same issues of 
unexpected space usage, but since it would be *less* than expected, 
that might be more acceptable.  There are no plans to implement 
anything like this right now, but I just wanted to get a read on it.


--matt



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread David Dyer-Bennet

On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote:


So it seems to me that having this feature per-file is really useful.
Say I have a presentation to give in Pleasanton, and the presentation
lives on my single-disk laptop - I want all the meta-data and the actual
presentation to be replicated.  We already use ditto blocks for the
meta-data.  Now we could have an extra copy of the actual data.  When I
get back from the presentation I can turn off the extra copies.


Yes, you could do that.

*I* would make a copy on a CD, which I would carry in a separate case
from the laptop.

I think my presentation is a lot safer than your presentation.

Similarly for your digital images example; I don't consider it safe
until I have two or more *independent* copies.  Two copies on a single
hard drive doesn't come even close to passing the test for me; as many
people have pointed out, those tend to fail all at once.  And I will
also point out that laptops get stolen a lot.  And of course all the
accidents involving fumble-fingers, OS bugs, and driver bugs won't be
helped by the data duplication either.  (Those will mostly be helped
by sensible use of snapshots, though, which is another argument for
ZFS on *any* disk you work on a lot.)
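
For example (the dataset name is hypothetical), a snapshot taken before 
risky edits is what covers the fumble-finger case:

   zfs snapshot tank/home@before-edit
   # ...accidental rm or a bad save...
   zfs rollback tank/home@before-edit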

The more I look at it the more I think that a second copy on the same
disk doesn't protect against very much real-world risk.  Am I wrong
here?  Are partial (small) disk corruptions more common than I think?
I don't have a good statistical view of disk failures.
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-11 Thread Mike Gerdts

On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored.  Its value must be 1, 2, or 3.
Like other properties (eg.  checksum, compression), it only affects
newly-written data.  As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').


Is there anything in the works to compress (or encrypt) existing data
after the fact?  For example, a special option to scrub that causes
the data to be re-written with the new properties could potentially do
this.  If so, this feature should subscribe to any generic framework
provided by such an effort.


This feature is similar to using mirroring, but differs in several
important ways:

* Mirroring offers slightly better redundancy, because one disk from
   each mirror can fail without data loss.


Is this use of "slightly" based upon disk failure modes?  That is, when
disks fail, do they tend to get isolated areas of badness compared to
complete loss?  I would suggest that complete loss should include
someone tripping over the power cord to the external array that houses
the disk.


It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage.  For example:


All of these examples seem to assume that there are six disks.


* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
   will be stored with 2 * 1 = 2 copies.  The filesystem can tolerate any
   1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3
   will be stored with 2 * 3 = 6 copies.  The filesystem can tolerate any
   5 disks failing without data loss (assuming that there are at least
   ncopies=3 mirror groups).


This one assumes the best-case scenario with 6 disks.  Suppose you had 4 x
72 GB and 2 x 36 GB disks.  You could end up with multiple copies on
the 72 GB disks.


* In a pool with single-parity raid-z a filesystem with copies=2
   will be stored with 2 copies, each copy protected by its own parity
   block.  The filesystem can tolerate any 3 disks failing without data
   loss (assuming that there are at least ncopies=2 raid-z groups).


C. MANPAGE CHANGES
*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
they are inherited.


+  copies=1 | 2 | 3

+Controls the number of copies of data stored for this dataset.
+These copies are in addition to any redundancy provided by the
+pool (eg. mirroring or raid-z).  The copies will be stored on
+different disks if possible.


Any statement about physical location on the disk?   It would seem as
though locating two copies sequentially on the disk would not provide
nearly the amount of protection as having them fairly distant from
each other.


--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-11 Thread Matthew Ahrens

James Dickens wrote:

On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored.  Its value must be 1, 2, or 3.
Like other properties (eg.  checksum, compression), it only affects
newly-written data.  As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').


would the user be held accountable for the space used by the extra
copies? 


Doh!  Sorry I forgot to address that.  I'll amend the proposal and 
manpage to include this information...


Yes, the space used by the extra copies will be accounted for, eg. in 
stat(2), ls -s, df(1m), du(1), and zfs list, and will count against their quota.
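
As a sketch (the dataset name is hypothetical, and the behaviour is as 
proposed, not shipping today):

   zfs create -o copies=2 tank/important
   cp 100mb-file /tank/important/
   du -h /tank/important/100mb-file   # would show roughly 200M, not 100M
   zfs list tank/important            # USED likewise counts both copies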



so if a user has a 1GB quota and stores one  512MB file with
two copies activated, all his space will be used? 


Yes, and as mentioned this will be reflected in all the space accounting 
tools.



what happens if the
same user stores a file that is 756MB on the filesystem with multiple
copies enabled and a 1GB quota, does the save fail?


Yes, they will get ENOSPC and see that their filesystem is full.
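
To spell out the arithmetic with a hypothetical dataset: 756MB written 
with copies=2 consumes about 1512MB, which is more than a 1GB quota allows:

   zfs create -o copies=2 -o quota=1g tank/home/james
   cp 756mb-file /tank/home/james/   # fails with ENOSPC once the quota is exhausted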


How would the user
tell that his filesystem is full, since all the tools he is used to
would report that he is using only 1/2 the space?


Any tool will report that in fact all space is being used.


Is there a way for the sysadmin to get rid of the excess copies should
disk space needs require it?


No, not without rewriting them.  (This is the same behavior we have 
today with the 'compression' and 'checksum' properties.  It's a 
long-term goal of ours to be able to go back and change these things 
after the fact (scrub them in, so to say), but with snapshots, this is 
extremely nontrivial to do efficiently and without increasing the amount 
of space used.)



If I start out with 2 copies and later change it to 1 copy, do the
files created before keep their 2 copies?


Yep, the property only affects newly-written data.
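
So, as a hedged sketch (the paths are hypothetical), the only way to 
change the replication of an existing file is to rewrite it after 
changing the property:

   zfs set copies=1 tank/fs
   cp /tank/fs/big-file /tank/fs/big-file.tmp
   mv /tank/fs/big-file.tmp /tank/fs/big-file   # rewritten blocks use the current copies setting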


what happens if root needs to store a copy of an important file and
there is no space but there is space if extra copies are reclaimed?


They will get ENOSPC.


Will this be configurable behavior?


No.

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-11 Thread Matthew Ahrens

Mike Gerdts wrote:

Is there anything in the works to compress (or encrypt) existing data
after the fact?  For example, a special option to scrub that causes
the data to be re-written with the new properties could potentially do
this.


This is a long-term goal of ours, but with snapshots, this is extremely 
nontrivial to do efficiently and without increasing the amount of space 
used.


 If so, this feature should subscribe to any generic framework
 provided by such an effort.


Yep, absolutely.


* Mirroring offers slightly better redundancy, because one disk from
   each mirror can fail without data loss.


Is this use of slightly based upon disk failure modes?  That is, when
disks fail do they tend to get isolated areas of badness compared to
complete loss?  I would suggest that complete loss should include
someone tripping over the power cord to the external array that houses
the disk.


I'm basing this "slightly better" call on a model of random, 
complete-disk failures.  I know that this is only an approximation. 
With many mirrors, most (but not all) 2-disk failures can be tolerated. 
 With copies=2, almost no 2-top-level-vdev failures will be tolerated, 
because it's likely that *some* block will have both its copies on those 
2 disks.  With mirrors, you can arrange to mirror across cabinets, not 
within them, which you can't do with copies.
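
For instance (the device names are hypothetical), a pool built like this 
guarantees that every block has one copy on each controller or cabinet, 
which no per-dataset setting can promise:

   zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0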



It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage.  For example:


All of these examples seem to assume that there are six disks.


Not really.  There could be any number of mirrors or raid-z groups 
(although I note, you need at least 'copies' groups to survive the max 
whole-disk failures).



* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
   will be stored with 2 * 1 = 2 copies.  The filesystem can tolerate any
   1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3
   will be stored with 2 * 3 = 6 copies.  The filesystem can tolerate any
   5 disks failing without data loss (assuming that there are at least
   ncopies=3 mirror groups).


This one assumes best case scenario with 6 disks.  Suppose you had 4 x
72 GB and 2 x 36 GB disks.  You could end up with multiple copies on
the 72 GB disks.


Yes, all these examples assume that our putting the copies on different 
disks when possible actually worked out.  It will almost certainly work 
out unless you have a small number of different-sized devices, or are 
running with very little free space.  If you need hard guarantees, you 
need to use actual mirroring.



Any statement about physical location on the disk?   It would seem as
though locating two copies sequentially on the disk would not provide
nearly the amount of protection as having them fairly distant from
each other.


Yep, if the copies can't be stored on different disks, they will be 
stored spread-out on the same disk if possible (I think we aim for one 
on each quarter of the disk).


--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-11 Thread James Dickens

On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

James Dickens wrote:
 On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
 B. DESCRIPTION

 A new property will be added, 'copies', which specifies how many copies
 of the given filesystem will be stored.  Its value must be 1, 2, or 3.
 Like other properties (eg.  checksum, compression), it only affects
 newly-written data.  As such, it is recommended that the 'copies'
 property be set at filesystem-creation time
 (eg. 'zfs create -o copies=2 pool/fs').

 would the user be held accountable for the space used by the extra
 copies?

Doh!  Sorry I forgot to address that.  I'll amend the proposal and
manpage to include this information...

Yes, the space used by the extra copies will be accounted for, eg. in
stat(2), ls -s, df(1m), du(1), zfs list, and count against their quota.

 so if a user has a 1GB quota and stores one  512MB file with
 two copies activated, all his space will be used?

Yes, and as mentioned this will be reflected in all the space accounting
tools.

 what happens if the
 same user stores a file that is 756MB on the filesystem with multiple
 copies enabled and a 1GB quota, does the save fail?

Yes, they will get ENOSPC and see that their filesystem is full.

 How would the user
 tell that his filesystem is full, since all the tools he is used to
 would report that he is using only 1/2 the space?

Any tool will report that in fact all space is being used.

 Is there a way for the sysadmin to get rid of the excess copies should
 disk space needs require it?

No, not without rewriting them.  (This is the same behavior we have
today with the 'compression' and 'checksum' properties.  It's a
long-term goal of ours to be able to go back and change these things
after the fact (scrub them in, so to say), but with snapshots, this is
extremely nontrivial to do efficiently and without increasing the amount
of space used.)

 If I start out with 2 copies and later change it to 1 copy, do the
 files created before keep their 2 copies?

Yep, the property only affects newly-written data.

 what happens if root needs to store a copy of an important file and
 there is no space but there is space if extra copies are reclaimed?

They will get ENOSPC.


Though I think this is a cool feature, I think it needs more work. I
think there should be an option to make extra copies expendable. So the
extra copies are a request: if the space is available, make them; if
not, complete the write and log the event.

If the user really requires guaranteed extra copies, then use mirrored
or RAID disks.

It just seems to be a nightmare for the administrator: you start with
3 copies and then change to 2 copies, and you will have phantom copies
that are only known to the OS. They won't show in any reports,
zfs list doesn't have an option to show which files have multiple
clones and which don't, and there is no way to destroy multiple clones
without rewriting every file on the disk.

James



 Will this be configurable behavior?

No.

--matt


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-11 Thread Matthew Ahrens

James Dickens wrote:

Though I think this is a cool feature, I think it needs more work. I
think there should be an option to make extra copies expendable. So the
extra copies are a request: if the space is available, make them; if
not, complete the write and log the event.


Are you asking for the extra copies that have already been written to be 
dynamically freed up when we are running low on space?  That could be 
useful, but it isn't the problem I'm trying to solve with the 'copies' 
property (not to mention it would be extremely difficult to implement).



If the user really requires guaranteed extra copies, then use mirrored
or RAID disks.


Right, if you want everything to have extra redundancy, that use case is 
handled just fine today by mirrors or RAIDZ.


The case where 'copies' is useful is when you want some data to be 
stored with more redundancy than others, without the burden of setting 
up different pools.



It just seems to be a nightmare for the administrator: you start with
3 copies and then change to 2 copies, and you will have phantom copies
that are only known to the OS. They won't show in any reports,
zfs list doesn't have an option to show which files have multiple
clones and which don't, and there is no way to destroy multiple clones
without rewriting every file on the disk.


(I'm assuming you mean copies, not clones.)

So would you prefer that the property be restricted to only being set at 
filesystem creation time, and not changed later?  That way the number of 
copies of all files in the filesystem is always the same.


It seems like the issue of knowing how many copies there are would be 
much worse in the system you're asking for where the extra copies are 
freed up as needed...


--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss