Re: ditto blocks on ZFS

2014-05-23 Thread Duncan
Russell Coker posted on Fri, 23 May 2014 13:54:46 +1000 as excerpted:

 Is anyone doing research on how much free disk space is required on
 BTRFS for good performance?  If a rumor (whether correct or incorrect)
 goes around that you need 20% free space on a BTRFS filesystem for
 performance then that will vastly outweigh the space used for metadata.

Well, on btrfs there's free-space, and then there's free-space.  The 
chunk allocation and both data/metadata fragmentation make a difference.

That said, *IF* you're looking at the right numbers, btrfs doesn't 
actually require that much free space, and should run efficiently right 
down to just a few GiB free on pretty much any btrfs more than a few GiB 
in size.  So at least in the significant-fractions-of-a-TiB-and-up 
range, it doesn't require much free space /as/ /a/ /percentage/ at all.

**BUT BE SURE YOU'RE LOOKING AT THE RIGHT NUMBERS** as explained below.

Chunks:

On btrfs, both data and metadata are allocated in chunks, 1 GiB chunks 
for data, 256 MiB chunks for metadata.  The catch is that while both 
chunks and space within chunks can be allocated on-demand, deleting files 
only frees space within chunks -- the chunks themselves remain allocated 
to data/metadata whichever they were, and cannot be reallocated to the 
other.  To deallocate unused chunks, and to rewrite partially used 
chunks so their contents are consolidated onto fewer chunks and the rest 
freed, btrfs admins must currently run a btrfs balance manually (or via 
script).
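
As a concrete sketch (the mountpoint and the usage thresholds here are 
only examples, not recommendations for any particular filesystem):

btrfs balance start /mnt                         # full balance: rewrites every chunk, slow
btrfs balance start -dusage=20 -musage=20 /mnt   # only rewrite chunks under 20% used

The usage filters are normally all that's needed to return mostly-empty 
chunks to the unallocated pool without paying for a full rewrite.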

btrfs filesystem show:

For the btrfs filesystem show output, the individual devid lines show 
total filesystem space on the device vs. used, where used means 
allocated to chunks.[1]  Ideally (assuming equal sized devices) you 
should keep at least 2.5-3.0 GiB unallocated per device, since that 
allows the allocation of two chunks each for data (1 GiB each) and 
metadata (a quarter GiB each, but on single-device filesystems they are 
allocated in pairs by default, so half a GiB at a time, see below).  The 
balance process itself will want to allocate a new chunk to write into 
in order to rewrite and consolidate existing chunks, so you don't want 
to use up the last one available.  And since the filesystem could decide 
it needs to allocate another chunk for normal usage as well, you always 
want to keep at least two chunks' worth of each unallocated, thus 
2.5 GiB (3.0 GiB for single-device filesystems, see below): one chunk 
each of data/metadata for the filesystem if it needs it, and another to 
ensure balance can allocate at least the one chunk it needs to do its 
rewrite.

As I said, data chunks are 1 GiB, while metadata chunks are 256 MiB, a 
quarter GiB.  However, on a single-device btrfs, metadata will normally 
default to dup (duplicate, two copies for safety) mode, and will thus 
allocate two chunks, half a GiB at a time.  This is why you want 3 GiB 
minimum free on a single-device btrfs, space for two single-mode data 
chunk allocations (1 GiB * 2 = 2 GiB), plus two dup-mode metadata chunk 
allocations (256 MiB * 2 * 2 = 1 GiB).  But on multi-device btrfs, only a 
single copy is stored per device, so the metadata minimum reserve is only 
half a GiB per device (256 MiB * 2 = 512 MiB = half a GiB).
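
Spelled out as shell arithmetic (purely illustrative, just restating the 
numbers above in MiB):

data_chunk=1024   # MiB per data chunk
meta_chunk=256    # MiB per metadata chunk
# single-device, dup metadata: two data allocations plus two dup (x2) metadata allocations
echo "$(( 2*data_chunk + 2*2*meta_chunk )) MiB"   # 3072 MiB = 3.0 GiB
# multi-device raid1 metadata, one copy per device: two data plus two metadata allocations
echo "$(( 2*data_chunk + 2*meta_chunk )) MiB"     # 2560 MiB = 2.5 GiB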

That's the minimum unallocated space you need free.  More than that is 
nice and lets you go longer between having to worry about rebalances, but 
it really won't help btrfs efficiency that much, since btrfs uses already 
allocated chunk space where it can.

btrfs filesystem df:

Then there's the already chunk-allocated space.  btrfs filesystem df 
reports on this.  In the df output, total means allocated, while used 
means used out of that allocation, so the spread between the two is 
space that is allocated but unused.

Since btrfs allocates new chunks on-demand from the unallocated space 
pool, but cannot reallocate chunks between data and metadata on its own, 
and because the used blocks within existing chunks will get fragmented 
over time, it's best to keep the btrfs filesystem df reported spread 
between total and used to a minimum.

Of course, as I said above data chunks are 1 GiB each, so a data 
allocation spread of under a GiB won't be recoverable in any case, and a 
spread of 1-5 GiB isn't a big deal.  But if for instance btrfs filesystem 
df reports data 1.25 TiB total (that is, allocated) but only 250 GiB 
used, that's a spread of roughly a TiB, and running a btrfs balance in 
order to recover most of that spread to unallocated is a good idea.

Similarly with metadata, except it'll be allocated in 256 MiB chunks, 
two at a time by default on a single-device filesystem, so 512 MiB at a 
time in that case.  But again, if btrfs filesystem df is reporting say 
10.5 GiB total metadata but only perhaps 1.75 GiB used, the spread is 
many chunks' worth, and particularly if your unallocated reserve (as 
reported by btrfs filesystem show in the individual device lines) is 
getting low, it's time to consider rebalancing it to recover the unused 
metadata space to unallocated.
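
Putting it together, a quick check looks something like this (the 
mountpoint is only an example):

btrfs filesystem show /mnt   # per-device lines: size vs. space already allocated to chunks
btrfs filesystem df /mnt     # per-type: total (allocated) vs. used within those chunks

If show says a device is close to fully allocated while df shows a large 
total-vs-used spread, it's time for that (possibly filtered) balance.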

Re: ditto blocks on ZFS

2014-05-22 Thread Austin S Hemmelgarn
On 2014-05-21 19:05, Martin wrote:
 Very good comment from Ashford.
 
 
 Sorry, but I see no advantages from Russell's replies other than for a
 feel-good factor or a dangerous false sense of security. At best,
 there is a weak justification that for metadata, again going from 2% to
 4% isn't going to be a great problem (storage is cheap and fast).
 
 I thought an important idea behind btrfs was that we avoid by design in
 the first place the very long and vulnerable RAID rebuild scenarios
 suffered for block-level RAID...
 
 
 On 21/05/14 03:51, Russell Coker wrote:
 Absolutely. Hopefully this discussion will inspire the developers to
 consider this an interesting technical challenge and a feature that
 is needed to beat ZFS.
 
 Sorry, but I think that is completely the wrong reasoning. ...Unless
 that is you are some proprietary sales droid hyping features and big
 numbers! :-P
 
 
 Personally I'm not convinced we gain anything beyond what btrfs will
 eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
 
 Also note that usually you want your data to be 100% reliable and
 retrievable. Or if that fails, you go to your backups instead. Gambling
 on proportions and importance rather than *ensuring* fault/error
 tolerance is a very human thing... ;-)
 
 
 Sorry:
 
 Interesting idea but not convinced there's any advantage for disk/SSD
 storage.
 
 
 Regards,
 Martin
 
 
 
 
 
Another nice option in this case might be adding logic to make sure
that there is some (considerable) offset between copies of metadata
using the dup profile.  All of the filesystems whose low-level on-disk
structures I have actually looked at have had both copies of the System
chunks right next to each other, right at the beginning of the disk,
which of course undermines the usefulness of storing two copies of them
on disk.  Adding an offset in those allocations would provide some
better protection against some of the more common 'idiot' failure-modes
(e.g. trying to use dd to write a disk image to a USB flash drive, and
accidentally overwriting the first n GB of your first HDD instead).
Ideally, once we have n-way replication, System chunks should default to
one copy per device for multi-device filesystems.





Re: ditto blocks on ZFS

2014-05-22 Thread Tomasz Chmielewski
 I thought an important idea behind btrfs was that we avoid by design
 in the first place the very long and vulnerable RAID rebuild scenarios
 suffered for block-level RAID...

This may be true for SSDs - for ordinary spinning disks it's not
entirely the case.

For most RAID rebuilds, it still seems way faster with software RAID-1
where one drive is being read at its (almost) full speed, and the other
is being written to at its (almost) full speed (assuming no other IO
load).

With btrfs RAID-1, the way balance works after a disk replace involves
lots of disk head movement, resulting in a low overall rebuild speed,
especially with lots of snapshots and the related fragmentation.

The balance is also still not smart: it reads from one device and writes
to *both* devices (an extra, unnecessary write to the healthy device -
it should read from the healthy device and write only to the replaced
one).


Of course, other factors such as the amount of data or disk IO usage
during rebuild apply.
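
For reference, the btrfs-side rebuild typically looks roughly like one
of these (device names and mountpoint are only placeholders):

btrfs device add /dev/sdc /mnt      # add the new disk...
btrfs device delete /dev/sdb /mnt   # ...then removing the old one triggers the balance-style rebuild

btrfs replace start /dev/sdb /dev/sdc /mnt   # or the dedicated replace operation,
btrfs replace status /mnt                    # which copies the old device's data without a full balance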


-- 
Tomasz Chmielewski
http://wpkg.org


Re: ditto blocks on ZFS

2014-05-22 Thread ashford
Russell,

Overall, there are still a lot of unknowns WRT the stability and ROI
(Return On Investment) of implementing ditto blocks for BTRFS.  The good
news is that there's a lot of time before the underlying structure is in
place to support them, so there's time to figure this out a bit better.

 On Tue, 20 May 2014 07:56:41 ashf...@whisperpc.com wrote:
 1.  There will be more disk space used by the metadata.  I've been aware
 of space allocation issues in BTRFS for more than three years.  If the
 use of ditto blocks will make this issue worse, then it's probably not a
 good idea to implement it.  The actual increase in metadata space is
 probably small in most circumstances.

 Data, RAID1: total=2.51TB, used=2.50TB
 System, RAID1: total=32.00MB, used=376.00KB
 Metadata, RAID1: total=28.25GB, used=26.63GB

 The above is my home RAID-1 array.  It includes multiple backup copies of
 a medium size Maildir format mail spool which probably accounts for a
 significant portion of the used space, the Maildir spool has an average
 file size of about 70K and lots of hard links between different versions
 of the backup.  Even so the metadata is only 1% of the total used space.
 Going from 1% to 2% to improve reliability really isn't a problem.

 Data, RAID1: total=140.00GB, used=139.60GB
 System, RAID1: total=32.00MB, used=28.00KB
 Metadata, RAID1: total=4.00GB, used=2.97GB

 Above is a small Xen server which uses snapshots to backup the files for
 Xen block devices (the system is lightly loaded so I don't use nocow)
 and for data files that include a small Maildir spool.  It's still only
 2% of disk space used for metadata, again going from 2% to 4% isn't
 going to be a great problem.

You've addressed half of the issue.  It appears that the metadata is
normally a bit over 1% using the current methods, but two samples do not
make a statistical universe.  The good news is that these two samples are
from opposite extremes of usage, so I expect they're close to where the
overall average would end up.  I'd like to see a few more samples, from
other usage scenarios, just to be sure.

If the above numbers are normal, adding ditto blocks could increase the
size of the metadata from 1% to 2% or even 3%.  This isn't a problem.

What we still don't know, and probably won't until after it's implemented,
is whether or not the addition of ditto blocks will make the space
allocation worse.

 2.  Use of ditto blocks will increase write bandwidth to the disk.  This
 is a direct and unavoidable result of having more copies of the
 metadata.
 The actual impact of this would depend on the file-system usage pattern,
 but would probably be unnoticeable in most circumstances.  Does anyone
 have a “worst-case” scenario for testing?

 The ZFS design involves ditto blocks being spaced apart due to the fact
 that corruption tends to have some spacial locality.  So you are adding
 an extra seek.

 The worst case would be when you have lots of small synchronous writes,
 probably the default configuration of Maildir delivery would be a good
 case.

Is there a performance test for this?  That would be helpful in
determining the worst-case performance impact of implementing ditto
blocks, and probably some other enhancements as well.

 3.  Certain kinds of disk errors would be easier to recover from.  Some
 people here claim that those specific errors are rare.  I have no
 opinion on how often they happen, but I believe that if the overall
 disk space cost is low, it will have a reasonable return.  There would
 be virtually no reliability gains on an SSD-based file-system, as the
 ditto blocks would be written at the same time, and the SSD would be
 likely to map the logical blocks into the same page of flash memory.

 That claim is unproven AFAIK.

That claim is a direct result of how SSDs function.

 4.  If the BIO layer of BTRFS and the device driver are smart enough,
 ditto blocks could reduce I/O wait time.  This is a direct result of
 having more instances of the data on the disk, so it's likely that there
 will be a ditto block closer to where the disk head is currently.  The
 actual benefit for disk-based file-systems is likely to be under 1ms per
 metadata seek.  It's possible that a short-term backlog on one disk
 could cause BTRFS to use a ditto block on another disk, which could
 deliver 20ms of performance.  There would be no performance benefit for
 SSD-based file-systems.

 That is likely with RAID-5 and RAID-10.

It's likely with all disk layouts.  The reason just looks different on
different RAID structures.

 My experience is that once your disks are larger than about 500-750GB,
 RAID-6 becomes a much better choice, due to the increased chances of
 having an uncorrectable read error during a reconstruct.  My opinion is
 that anyone storing critical information in RAID-5, or even 2-disk
 RAID-1,
 with disks of this capacity, should either reconsider their storage
 topology, or verify that they have a good backup/restore mechanism in
 place for that data.

Re: ditto blocks on ZFS

2014-05-22 Thread Russell Coker
On Thu, 22 May 2014 15:09:40 ashf...@whisperpc.com wrote:
 You've addressed half of the issue.  It appears that the metadata is
 normally a bit over 1% using the current methods, but two samples do not
 make a statistical universe.  The good news is that these two samples are
 from opposite extremes of usage, so I expect they're close to where the
 overall average would end up.  I'd like to see a few more samples, from
 other usage scenarios, just to be sure.
 
 If the above numbers are normal, adding ditto blocks could increase the
 size of the metadata from 1% to 2% or even 3%.  This isn't a problem.
 
 What we still don't know, and probably won't until after it's implemented,
 is whether or not the addition of ditto blocks will make the space
 allocation worse.

I've been involved in many discussions about filesystem choice.  None of them 
have included anyone raising an issue about ZFS metadata space usage; probably 
most ZFS users don't even know about ditto blocks.

The relevant issue regarding disk space is the fact that filesystems tend to 
perform better if there is a reasonable amount of free space.  The amount of 
free space for good performance will depend on filesystem, usage pattern, and 
whatever you might define as good performance.

The first two Google hits on searching for recommended free space on ZFS 
recommended using no more than 80% and 85% of disk space.  Obviously if good 
performance requires 15% of free disk space then your capacity problem isn't 
going to be solved by not duplicating metadata.  Note that I am not aware of 
the accuracy of such claims about ZFS performance.

Is anyone doing research on how much free disk space is required on BTRFS for 
good performance?  If a rumor (whether correct or incorrect) goes around 
that you need 20% free space on a BTRFS filesystem for performance then that 
will vastly outweigh the space used for metadata.

  The ZFS design involves ditto blocks being spaced apart due to the fact
  that corruption tends to have some spacial locality.  So you are adding
  an extra seek.
  
  The worst case would be when you have lots of small synchronous writes,
  probably the default configuration of Maildir delivery would be a good
  case.
 
 Is there a performance test for this?  That would be helpful in
 determining the worst-case performance impact of implementing ditto
 blocks, and probably some other enhancements as well.

http://doc.coker.com.au/projects/postal/

My Postal mail server benchmark is one option.  There are more than a few 
benchmarks of synchronous writes of small files, but Postal uses real world 
programs that need such performance.  Delivering a single message via a 
typical Unix MTA requires synchronous writes of two queue files and then the 
destination file in the mail store.
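
If you just want a crude stand-in for that workload without setting up
an MTA, a loop like this (file count and size are arbitrary) generates
the same lots-of-small-synchronous-writes pattern:

mkdir -p testspool && cd testspool
for i in $(seq 1 1000); do
    dd if=/dev/urandom of=msg.$i bs=70k count=1 conv=fsync 2>/dev/null
done

Postal exercises the real delivery path (queue files plus the mail
store), which is the point of using it, but the loop is a quick first
approximation.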

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: ditto blocks on ZFS

2014-05-21 Thread Martin
Very good comment from Ashford.


Sorry, but I see no advantages from Russell's replies other than for a
feel-good factor or a dangerous false sense of security. At best,
there is a weak justification that for metadata, again going from 2% to
4% isn't going to be a great problem (storage is cheap and fast).

I thought an important idea behind btrfs was that we avoid by design in
the first place the very long and vulnerable RAID rebuild scenarios
suffered for block-level RAID...


On 21/05/14 03:51, Russell Coker wrote:
 Absolutely. Hopefully this discussion will inspire the developers to
 consider this an interesting technical challenge and a feature that
 is needed to beat ZFS.

Sorry, but I think that is completely the wrong reasoning. ...Unless
that is you are some proprietary sales droid hyping features and big
numbers! :-P


Personally I'm not convinced we gain anything beyond what btrfs will
eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.

Also note that usually you want your data to be 100% reliable and
retrievable. Or if that fails, you go to your backups instead. Gambling
on proportions and importance rather than *ensuring* fault/error
tolerance is a very human thing... ;-)


Sorry:

Interesting idea but not convinced there's any advantage for disk/SSD
storage.


Regards,
Martin






Re: ditto blocks on ZFS

2014-05-21 Thread Konstantinos Skarlatos

On 20/5/2014 5:07 AM, Russell Coker wrote:

On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:

This is extremely difficult to measure objectively. Subjectively ... see
below.


[snip]

*What other failure modes* should we guard against?

I know I'd sleep a /little/ better at night knowing that a double disk
failure on a raid5/1/10 configuration might ruin a ton of data along
with an obscure set of metadata in some long tree paths - but not the
entire filesystem.

My experience is that most disk failures that don't involve extreme physical
damage (EG dropping a drive on concrete) don't involve totally losing the
disk.  Much of the discussion about RAID failures concerns entirely failed
disks, but I believe that is due to RAID implementations such as Linux
software RAID that will entirely remove a disk when it gives errors.

I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
duplicate metadata.  If two disks with that problem were in a RAID-1 array
then duplicate metadata would be a significant benefit.


The other use-case/failure mode - where you are somehow unlucky enough
to have sets of bad sectors/bitrot on multiple disks that simultaneously
affect the only copies of the tree roots - is an extremely unlikely
scenario. As unlikely as it may be, the scenario is a very painful
consequence in spite of VERY little corruption. That is where the
peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading.  On page 12
they report latent sector errors on 9.5% of SATA disks per year.  So if you
lose one disk entirely the risk of having errors on a second disk is higher
than you would want for RAID-5.  While losing the root of the tree is
unlikely, losing a directory in the middle that has lots of subdirectories is
a risk.
Seeing the results of that paper, I think erasure coding is a better 
solution. Instead of having many copies of metadata or data, we could do 
erasure coding using something like zfec[1], which is used by 
Tahoe-LAFS, increasing their size by let's say 5-10%, and be quite safe 
even from multiple contiguous bad sectors.


[1] https://pypi.python.org/pypi/zfec
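
To put rough numbers on that claim: for a k-of-m scheme the extra space 
is m/k - 1, so 5-10% corresponds to something like k=20 plus one or two 
extra shares (the parameters below are purely illustrative, not zfec 
defaults):

awk -v k=20 -v m=21 'BEGIN { printf "%.0f%% overhead, survives %d lost shares\n", (m/k-1)*100, m-k }'
awk -v k=20 -v m=22 'BEGIN { printf "%.0f%% overhead, survives %d lost shares\n", (m/k-1)*100, m-k }'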


I can understand why people wouldn't want ditto blocks to be mandatory.  But
why are people arguing against them as an option?


As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
like to use RAID-1 with ditto blocks for my important data and RAID-0 for
unimportant data.





Re: ditto blocks on ZFS

2014-05-20 Thread Austin S Hemmelgarn
On 2014-05-19 22:07, Russell Coker wrote:
 On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
 This is extremely difficult to measure objectively. Subjectively ... see
 below.

 [snip]

 *What other failure modes* should we guard against?

 I know I'd sleep a /little/ better at night knowing that a double disk
 failure on a raid5/1/10 configuration might ruin a ton of data along
 with an obscure set of metadata in some long tree paths - but not the
 entire filesystem.
 
 My experience is that most disk failures that don't involve extreme physical 
 damage (EG dropping a drive on concrete) don't involve totally losing the 
 disk.  Much of the discussion about RAID failures concerns entirely failed 
 disks, but I believe that is due to RAID implementations such as Linux 
 software RAID that will entirely remove a disk when it gives errors.
 
 I have a disk which had ~14,000 errors of which ~2000 errors were corrected 
 by 
 duplicate metadata.  If two disks with that problem were in a RAID-1 array 
 then duplicate metadata would be a significant benefit.
 
 The other use-case/failure mode - where you are somehow unlucky enough
 to have sets of bad sectors/bitrot on multiple disks that simultaneously
 affect the only copies of the tree roots - is an extremely unlikely
 scenario. As unlikely as it may be, the scenario is a very painful
 consequence in spite of VERY little corruption. That is where the
 peace-of-mind/bragging rights come in.
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 The NetApp research on latent errors on drives is worth reading.  On page 12 
 they report latent sector errors on 9.5% of SATA disks per year.  So if you 
 lose one disk entirely the risk of having errors on a second disk is higher 
 than you would want for RAID-5.  While losing the root of the tree is 
 unlikely, losing a directory in the middle that has lots of subdirectories is 
 a risk.
 
 I can understand why people wouldn't want ditto blocks to be mandatory.  But 
 why are people arguing against them as an option?
 
 
 As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
 like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
 unimportant data.
 
But the proposed changes for n-way replication would already handle
this.  They would just need the option of having more than one copy per
device (which theoretically shouldn't be too hard once you have n-way
replication).  Also, BTRFS already has the option of replicating the
root tree across multiple devices (it is included in the System Data
subset), and in fact does so by default when using multiple devices.
There are also plans to have per-subvolume or per-file RAID level
selection, but IIRC that is planned for after n-way replication (and of
course after RAID 5/6, as n-way replication isn't going to be
implemented until after RAID 5/6).





Re: ditto blocks on ZFS

2014-05-20 Thread Brendan Hide

On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote:

On 2014-05-19 22:07, Russell Coker wrote:

[snip]
As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
like to use RAID-1 with ditto blocks for my important data and RAID-0 for
unimportant data.


But the proposed changes for n-way replication would already handle
this.
[snip]

Russell's specific request above is probably best handled by being able 
to change replication levels per subvolume - this won't be handled by 
N-way replication.


Extra replication on leaf nodes will make relatively little difference 
in the scenarios laid out in this thread - but on trunk nodes (folders 
or subvolumes closer to the filesystem root) it makes a significant 
difference. Plain N-way replication doesn't flexibly treat these two 
kinds of node differently.


As an example, Russell might have a server with two disks - yet he wants 
6 copies of all metadata for subvolumes and their immediate subfolders. 
At three folders deep he only wants to have 4 copies. At six folders 
deep, only 2. Ditto blocks add an attractive safety net without 
unnecessarily doubling or tripling the size of *all* metadata.


It is a good idea. The next question to me is whether or not it is 
something that can be implemented elegantly and whether or not a 
talented *dev* thinks it is a good idea.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ditto blocks on ZFS

2014-05-20 Thread Russell Coker
On Tue, 20 May 2014 07:56:41 ashf...@whisperpc.com wrote:
 1.  There will be more disk space used by the metadata.  I've been aware
 of space allocation issues in BTRFS for more than three years.  If the use
 of ditto blocks will make this issue worse, then it's probably not a good
 idea to implement it.  The actual increase in metadata space is probably
 small in most circumstances.

Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB

The above is my home RAID-1 array.  It includes multiple backup copies of a 
medium size Maildir format mail spool which probably accounts for a 
significant portion of the used space, the Maildir spool has an average file 
size of about 70K and lots of hard links between different versions of the 
backup.  Even so the metadata is only 1% of the total used space.  Going from 
1% to 2% to improve reliability really isn't a problem.
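
(Sanity-checking the 1% figure from the numbers above, metadata used as
a share of total used space:

awk 'BEGIN { printf "%.1f%%\n", 26.63 / (2.50 * 1024 + 26.63) * 100 }'

which prints 1.0%.)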

Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB

Above is a small Xen server which uses snapshots to backup the files for Xen 
block devices (the system is lightly loaded so I don't use nocow) and for data 
files that include a small Maildir spool.  It's still only 2% of disk space 
used for metadata, again going from 2% to 4% isn't going to be a great 
problem.

 2.  Use of ditto blocks will increase write bandwidth to the disk.  This
 is a direct and unavoidable result of having more copies of the metadata.
 The actual impact of this would depend on the file-system usage pattern,
 but would probably be unnoticeable in most circumstances.  Does anyone
 have a “worst-case” scenario for testing?

The ZFS design involves ditto blocks being spaced apart due to the fact that 
corruption tends to have some spacial locality.  So you are adding an extra 
seek.

The worst case would be when you have lots of small synchronous writes, 
probably the default configuration of Maildir delivery would be a good case.

As an aside, I've been thinking of patching a mail server to do a sleep() 
before fsync() on mail delivery to see if that improves aggregate performance.  
My theory is that if you have dozens of concurrent delivery attempts and they 
all sleep() before fsync(), the filesystem could write out metadata for 
multiple files in one pass in the most efficient manner.

 3.  Certain kinds of disk errors would be easier to recover from.  Some
 people here claim that those specific errors are rare.

All errors are rare.  :-#

Seriously you can run Ext4 on a single disk for years and probably not lose 
data.  It's just a matter of how many disks and how much reliability you want.

 I have no opinion
 on how often they happen, but I believe that if the overall disk space
 cost is low, it will have a reasonable return.  There would be virtually
 no reliability gains on an SSD-based file-system, as the ditto blocks
 would be written at the same time, and the SSD would be likely to map the
 logical blocks into the same page of flash memory.

That claim is unproven AFAIK.  On SSD the performance cost of such things is 
negligible (no seek cost) and losing 1% of disk space isn't a problem for most 
systems (admittedly the early SSDs were small).

 4.  If the BIO layer of BTRFS and the device driver are smart enough,
 ditto blocks could reduce I/O wait time.  This is a direct result of
 having more instances of the data on the disk, so it's likely that there
 will be a ditto block closer to where the disk head is currently.  The
 actual benefit for disk-based file-systems is likely to be under 1ms per
 metadata seek.  It's possible that a short-term backlog on one disk could
 cause BTRFS to use a ditto block on another disk, which could deliver
 20ms of performance.  There would be no performance benefit for SSD-based
 file-systems.

That is likely with RAID-5 and RAID-10.

 My experience is that once your disks are larger than about 500-750GB,
 RAID-6 becomes a much better choice, due to the increased chances of
 having an uncorrectable read error during a reconstruct.  My opinion is
 that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
 with disks of this capacity, should either reconsider their storage
 topology, or verify that they have a good backup/restore mechanism in
 place for that data.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research shows that the incidence of silent corruption is a lot 
greater than you would expect.  RAID-6 doesn't save you from this.  You need 
BTRFS or ZFS RAID-6.
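
(The mechanism that actually catches the silent corruption there is
checksums plus scrubbing; on BTRFS that means running something like the
following periodically, with /mnt standing in for the real mountpoint:

btrfs scrub start /mnt    # verify all checksums, repair from another copy where one exists
btrfs scrub status /mnt   # see the error counters

RAID-6 alone just recomputes parity from whatever the disks return.)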


On Tue, 20 May 2014 22:11:16 Brendan Hide wrote:
 Extra replication on leaf nodes will make relatively little difference 
 in the scenarios laid out in this thread - but on trunk nodes (folders 
 or subvolumes closer to the filesystem root) it makes a significant 
 difference. Plain N-way replication doesn't flexibly treat these two 
 kinds of node differently.

Re: ditto blocks on ZFS

2014-05-19 Thread Martin
On 18/05/14 17:09, Russell Coker wrote:
 On Sat, 17 May 2014 13:50:52 Martin wrote:
[...]
 Do you see or measure any real advantage?
 
 Imagine that you have a RAID-1 array where both disks get ~14,000 read 
 errors.  
 This could happen due to a design defect common to drives of a particular 
 model or some shared environmental problem.  Most errors would be corrected 
 by 
 RAID-1 but there would be a risk of some data being lost due to both copies 
 being corrupt.  Another possibility is that one disk could entirely die 
 (although total disk death seems rare nowadays) and the other could have 
 corruption.  If metadata was duplicated in addition to being on both disks 
 then the probability of data loss would be reduced.
 
 Another issue is the case where all drive slots are filled with active drives 
 (a very common configuration).  To replace a disk you have to physically 
 remove the old disk before adding the new one.  If the array is a RAID-1 or 
 RAID-5 then ANY error during reconstruction loses data.  Using dup for 
 metadata on top of the RAID protections (IE the ZFS ditto idea) means that 
 case doesn't lose you data.

Your example there is for the case where in effect there is no RAID. How
is that case any better than what btrfs already does by duplicating
metadata?



So...


What real-world failure modes do the ditto blocks usefully protect against?

And how does that compare for failure rates and against what is already
done?


For example, we have RAID1 and RAID5 to protect against any one RAID
chunk being corrupted or for the total loss of any one device.

The second part of that is that another failure cannot be tolerated
until the RAID is rebuilt.


Hence we have RAID6, which protects against any two failures of a chunk
or device: with just one failure, you can tolerate a second failure
whilst rebuilding the RAID.


And then we supposedly have safety-by-design where the filesystem itself
is using a journal and barriers/sync to ensure that the filesystem is
always kept in a consistent state, even after an interruption to any writes.


*What other failure modes* should we guard against?


There has been mention of fixing metadata keys corrupted by single bit flips...

Should Hamming codes be used instead of a CRC, so that we can have
multiple-bit-error-detect, single-bit-error-correct functionality for
all data, both in RAM and on disk, for those systems that do not use ECC RAM?

Would that be useful?...


Regards,
Martin



Re: ditto blocks on ZFS

2014-05-19 Thread Brendan Hide

On 2014/05/19 10:36 PM, Martin wrote:

On 18/05/14 17:09, Russell Coker wrote:

On Sat, 17 May 2014 13:50:52 Martin wrote:

[...]

Do you see or measure any real advantage?

[snip]
This is extremely difficult to measure objectively. Subjectively ... see 
below.

[snip]

*What other failure modes* should we guard against?


I know I'd sleep a /little/ better at night knowing that a double disk 
failure on a raid5/1/10 configuration might ruin a ton of data along 
with an obscure set of metadata in some long tree paths - but not the 
entire filesystem.


The other use-case/failure mode - where you are somehow unlucky enough 
to have sets of bad sectors/bitrot on multiple disks that simultaneously 
affect the only copies of the tree roots - is an extremely unlikely 
scenario. As unlikely as it may be, the scenario is a very painful 
consequence in spite of VERY little corruption. That is where the 
peace-of-mind/bragging rights come in.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ditto blocks on ZFS

2014-05-19 Thread Russell Coker
On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
 This is extremely difficult to measure objectively. Subjectively ... see
 below.
 
  [snip]
  
  *What other failure modes* should we guard against?
 
 I know I'd sleep a /little/ better at night knowing that a double disk
 failure on a raid5/1/10 configuration might ruin a ton of data along
 with an obscure set of metadata in some long tree paths - but not the
 entire filesystem.

My experience is that most disk failures that don't involve extreme physical 
damage (EG dropping a drive on concrete) don't involve totally losing the 
disk.  Much of the discussion about RAID failures concerns entirely failed 
disks, but I believe that is due to RAID implementations such as Linux 
software RAID that will entirely remove a disk when it gives errors.

I have a disk which had ~14,000 errors of which ~2000 errors were corrected by 
duplicate metadata.  If two disks with that problem were in a RAID-1 array 
then duplicate metadata would be a significant benefit.

 The other use-case/failure mode - where you are somehow unlucky enough
 to have sets of bad sectors/bitrot on multiple disks that simultaneously
 affect the only copies of the tree roots - is an extremely unlikely
 scenario. As unlikely as it may be, the scenario is a very painful
 consequence in spite of VERY little corruption. That is where the
 peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading.  On page 12 
they report latent sector errors on 9.5% of SATA disks per year.  So if you 
lose one disk entirely the risk of having errors on a second disk is higher 
than you would want for RAID-5.  While losing the root of the tree is 
unlikely, losing a directory in the middle that has lots of subdirectories is 
a risk.

I can understand why people wouldn't want ditto blocks to be mandatory.  But 
why are people arguing against them as an option?


As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
unimportant data.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: ditto blocks on ZFS

2014-05-18 Thread Russell Coker
On Sat, 17 May 2014 13:50:52 Martin wrote:
 On 16/05/14 04:07, Russell Coker wrote:
  https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
  
  Probably most of you already know about this, but for those of you who
  haven't the above describes ZFS ditto blocks which is a good feature we
  need on BTRFS.  The briefest summary is that on top of the RAID
  redundancy there...
 [... are additional copies of metadata ...]
 
 
 Is that idea not already implemented in effect in btrfs with the way
 that the superblocks are replicated multiple times, ever more times, for
 ever more huge storage devices?

No.  If the metadata for the root directory is corrupted then everything is 
lost even if the superblock is OK.  At every level in the directory tree a 
corruption will lose all levels below it; a corruption of /home would be very 
significant, as would a corruption of /home/importantuser/major-project.

 The one exception is for SSDs whereby there is the excuse that you
 cannot know whether your data is usefully replicated across different
 erase blocks on a single device, and SSDs are not 'that big' anyhow.

I am not convinced by that argument.  While you can't know that it's usefully 
replicated, you also can't say for sure that replication will never save you.  
There will surely be some random factors involved.  If dup on SSD will save 
you from 50% of corruption problems, is it worth doing?  What if it's 80% or 
20%?

I have BTRFS running as the root filesystem on Intel SSDs on four machines 
(one of which is a file server with a pair of large disks in a BTRFS RAID-1).  
On all of those systems I have dup for metadata; it doesn't take up any space 
I need for something else, and it might save me.
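
(For anyone wanting the same setup, it's just the metadata profile -
roughly, with the device and mountpoint below as placeholders:

mkfs.btrfs -m dup /dev/sdX                # dup metadata at mkfs time
btrfs balance start -mconvert=dup /mnt    # or convert an existing single-device filesystem

since mkfs may default to single metadata when it detects an SSD.)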

 So... Your idea of replicating metadata multiple times in proportion to
 assumed 'importance' or 'extent of impact if lost' is an interesting
 approach. However, is that appropriate and useful considering the real
 world failure mechanisms that are to be guarded against?

Firstly it's not my idea, it's the idea of the ZFS developers.  Secondly I 
started reading about this after doing some experiments with a failing SATA 
disk.  In spite of having ~14,000 read errors (which sounds like a lot but is 
a small fraction of a 2TB disk) the vast majority of the data was readable, 
largely due to ~2000 errors corrected by dup metadata.

 Do you see or measure any real advantage?

Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.  
This could happen due to a design defect common to drives of a particular 
model or some shared environmental problem.  Most errors would be corrected by 
RAID-1 but there would be a risk of some data being lost due to both copies 
being corrupt.  Another possibility is that one disk could entirely die 
(although total disk death seems rare nowadays) and the other could have 
corruption.  If metadata was duplicated in addition to being on both disks 
then the probability of data loss would be reduced.

Another issue is the case where all drive slots are filled with active drives 
(a very common configuration).  To replace a disk you have to physically 
remove the old disk before adding the new one.  If the array is a RAID-1 or 
RAID-5 then ANY error during reconstruction loses data.  Using dup for 
metadata on top of the RAID protections (IE the ZFS ditto idea) means that 
case doesn't lose you data.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: ditto blocks on ZFS

2014-05-17 Thread Martin
On 16/05/14 04:07, Russell Coker wrote:
 https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
 
 Probably most of you already know about this, but for those of you who 
 haven't 
 the above describes ZFS ditto blocks which is a good feature we need on 
 BTRFS.  The briefest summary is that on top of the RAID redundancy there...
[... are additional copies of metadata ...]


Is that idea not already implemented in effect in btrfs with the way
that the superblocks are replicated multiple times, ever more times, for
ever more huge storage devices?

The one exception is for SSDs whereby there is the excuse that you
cannot know whether your data is usefully replicated across different
erase blocks on a single device, and SSDs are not 'that big' anyhow.


So... Your idea of replicating metadata multiple times in proportion to
assumed 'importance' or 'extent of impact if lost' is an interesting
approach. However, is that appropriate and useful considering the real
world failure mechanisms that are to be guarded against?

Do you see or measure any real advantage?


Regards,
Martin



Re: ditto blocks on ZFS

2014-05-17 Thread Hugo Mills
On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote:
 On 16/05/14 04:07, Russell Coker wrote:
  https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
  
  Probably most of you already know about this, but for those of you who 
  haven't 
  the above describes ZFS ditto blocks which is a good feature we need on 
  BTRFS.  The briefest summary is that on top of the RAID redundancy there...
 [... are additional copies of metadata ...]
 
 
 Is that idea not already implemented in effect in btrfs with the way
 that the superblocks are replicated multiple times, ever more times, for
 ever more huge storage devices?

   Superblocks are the smallest part of the metadata. There's a whole
load of metadata outside the superblocks that isn't replicated in this
way.

 The one exception is for SSDs whereby there is the excuse that you
 cannot know whether your data is usefully replicated across different
 erase blocks on a single device, and SSDs are not 'that big' anyhow.
 
 
 So... Your idea of replicating metadata multiple times in proportion to
 assumed 'importance' or 'extent of impact if lost' is an interesting
 approach. However, is that appropriate and useful considering the real
 world failure mechanisms that are to be guarded against?
 
 Do you see or measure any real advantage?

   This. How many copies do you actually need? Are there concrete
statistics to show the marginal utility of each additional copy?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- IMPROVE YOUR ORGANISMS!!  -- Subject line of spam email --- 

