Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-26 Thread Ric Wheeler

On 06/26/2010 08:34 AM, Daniel Shiels wrote:

25.06.2010 22:58, Ric Wheeler wrote:
 

On 06/24/2010 06:06 PM, Daniel Taylor wrote:
   

[...]
 

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
daniel.tay...@wdc.com   wrote:

   

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K) or smaller (2K) blocks.  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).
 

Why is it not sensible to use btrfs on raid devices?
Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage.  Sometimes
it is nearly impossible to _not_ use RAID -- many servers
come with a built-in RAID card which can't be turned off or
disabled.  And hardware raid is faster (at least in theory),
if only because it puts less load on the various system busses.

To many enterprise folks, a statement like "we don't need hw
raid, we have a better solution" sounds like "it's just a toy,
don't use it."

Hmm?  ;)

/mjt, who always used and preferred _software_ raid due to
  multiple reasons, and never used btrfs so far.
 

It's not that you shouldn't use it on RAID; it's just that it loses some
of the file system's value.

Two nice features that btrfs provides are checksums and mirroring. If a
disk corrupts a block, btrfs will notice thanks to the strong checksum
and use the mirrored block. If you are using a raid system, the raid
layer won't know the data is corrupted, and raid doesn't provide a way
for the file system to get to the redundant block.

I read a paper from Sun a while back about the undetected read failure
rates of modern disks having not changed for many years. Disks are so
large now that undetected failures are unacceptably likely for many
systems. Hence ZFS doing similar in-file-system raid schemes.

In my lab I used dd to clobber data in some of my mirrors. Btrfs logs lots
of checksum errors but never corrupted a file. Doing the same on a classic
raid with classic filesystem (solaris with veritas volume manager)
silently gave me bad data depending on what disk it felt like reading
from.

Daniel.
   


I was one of the (many) people who worked at EMC on designing storage 
arrays. If you are using any high-end, external hardware array, it will 
detect data corruption proactively for you. Most arrays do continual 
scans for latent errors and have internal data-integrity checks that are 
used for this.


Note that DIF/DIX adds an extra 8 bytes of data-integrity information to 
disks implementing the newer standards. We don't do anything with that in 
btrfs today, but you could imagine ways to get even better data-integrity 
protection.


If you are using software RAID (MD), you should also use its internal 
checks to do this kind of proactive detection of latent errors on a 
regular basis (say once every week or two).


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-25 Thread Ric Wheeler

On 06/24/2010 06:06 PM, Daniel Taylor wrote:



   

-Original Message-
From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On
Behalf Of Mike Fedyk
Sent: Wednesday, June 23, 2010 9:51 PM
To: Daniel Taylor
Cc: Daniel J Blueman; Mat; LKML;
linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler;
Andrew Morton; Linus Torvalds; The development of BTRFS
Subject: Re: Btrfs: broken file system design (was Unbound(?)
internal fragmentation in Btrfs)

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
daniel.tay...@wdc.com  wrote:
 

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K) or smaller (2K) blocks.  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks.
   

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.

 

As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.

Otherwise we end up with the damage seen in the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
read-modify-write (RMW) cycles and cost a ~10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
   


Also interesting to note is that you can get significant overhead 
even with zero-byte-length files. Path names, metadata overhead, etc. can 
consume (depending on the pathname length) quite a bit of space per file.


Ric



RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread David Woodhouse
On Wed, 2010-06-23 at 20:43 -0700, Daniel Taylor wrote:
 There is also the issue of btrfs over RAID (which I know is not
 entirely sensible, but which will happen). 

Well, we could discourage that by merging the RAID support that's been
pending for a while but I suspect Chris is a bit busy right now :)

-- 
dwmw2



RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread Daniel Taylor
 

 -Original Message-
 From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On 
 Behalf Of Mike Fedyk
 Sent: Wednesday, June 23, 2010 9:51 PM
 To: Daniel Taylor
 Cc: Daniel J Blueman; Mat; LKML; 
 linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler; 
 Andrew Morton; Linus Torvalds; The development of BTRFS
 Subject: Re: Btrfs: broken file system design (was Unbound(?) 
 internal fragmentation in Btrfs)
 
 On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor 
 daniel.tay...@wdc.com wrote:
  Just an FYI reminder.  The original test (2K files) is utterly
  pathological for disk drives with 4K physical sectors, such as
  those now shipping from WD, Seagate, and others.  Some of the
 SSDs have larger (16K) or smaller (2K) blocks.  There is also
  the issue of btrfs over RAID (which I know is not entirely
  sensible, but which will happen).
 
  The absolute minimum allocation size for data should be the same
  as, and aligned with, the underlying disk block size.  If that
  results in underutilization, I think that's a good thing for
  performance, compared to read-modify-write cycles to update
  partial disk blocks.
 
 Block size = 4k
 
 Btrfs packs smaller objects into the blocks in certain cases.
 

As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.

Otherwise we end up with the damage seen in the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
read-modify-write (RMW) cycles and cost a ~10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
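The rotation figure is easy to verify (an editorial sketch appended here, not part of the original mail):

```python
# Worked check of the rotational-latency figure quoted above.
def full_rotation_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm  # 60,000 ms per minute / rotations per minute

# A read-modify-write that misses its sector waits at least one rotation:
print(full_rotation_ms(5400))  # ~11.1 ms, the figure cited for 5400 RPM
print(full_rotation_ms(7200))  # ~8.3 ms on a 7200 RPM drive
```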


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
 I have noticed that events in Btrfs develop by a scenario not predicted
 by the paper of academic Ohad Rodeh (in spite of the announcement that
 Btrfs is based on this paper). This is why I have started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".

-- Jamie


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
 I'll try to help, but I am rather pessimistic here: working out
 algorithms is something, which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

 Note that it's not necessarily a problem to have a few nodes with low
 utilisation.  Deliberate violation of the classic balancing heuristic
 is often useful for performance.[*]
 
 Ok, let's violate this, but not to the detriment of utilisation:
 internal fragmentation is a horror for file systems, the enemy #1
 (which was completely defeated in the last century, BTW).
 
   The problem you've found is only a
 real problem when there are _too many_ nodes with low utilisation.
 
 IMHO this is a problem, as we can not estimate the number of such nodes.
 Do you have any sane upper boundary for this number? I don't know of one.
 Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary.  I'm really not sure, though.  I
haven't read the code.

 [*] For example when filling a tree by inserting contiguously
 ascending keys, the classic "split into two when full" heuristic gives
 the worst possible results (50% lost space), and deliberately
 underfilling the most actively updated nodes, which is not permitted
 at all by the classic algorithm, gives denser packing in the end
 (almost zero lost space).
 
 At the end of what?

At the end of inserting keys in ascending order.  Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.

If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)

To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the active insertion heads.

The above describes classic B-trees where updates are assumed to be
written immediately.  Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.
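The contrast between the two strategies is easy to see in a toy simulation (an editorial sketch with an arbitrary leaf capacity, not btrfs's actual balancer):

```python
# Toy model of ascending-key insertion into a B-tree leaf level.
CAP = 100  # keys per leaf (arbitrary)

def classic_split(n):
    """Classic heuristic: split a full leaf into two half-full leaves."""
    leaves = [0]
    for _ in range(n):
        if leaves[-1] == CAP:
            leaves[-1] = CAP // 2       # left half stays behind at 50%
            leaves.append(CAP // 2)     # right half receives new keys
        leaves[-1] += 1
    return leaves

def underfill_tail(n):
    """Never split: start a fresh, initially empty rightmost leaf."""
    leaves = [0]
    for _ in range(n):
        if leaves[-1] == CAP:
            leaves.append(0)
        leaves[-1] += 1
    return leaves

def utilization(leaves):
    return sum(leaves) / (len(leaves) * CAP)

print(utilization(classic_split(1000)))   # ~0.53: long run of 50%-full leaves
print(utilization(underfill_tail(1000)))  # 1.0: every leaf full
```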

 I hope you don't speak about on-line defragmentation?

No, not at all.

 Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's
obvious when you think about it.  But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?

-- Jamie


RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Daniel Taylor
Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K) or smaller (2K) blocks.  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks. 



Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Mike Fedyk
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor daniel.tay...@wdc.com wrote:
 Just an FYI reminder.  The original test (2K files) is utterly
 pathological for disk drives with 4K physical sectors, such as
 those now shipping from WD, Seagate, and others.  Some of the
 SSDs have larger (16K) or smaller (2K) blocks.  There is also
 the issue of btrfs over RAID (which I know is not entirely
 sensible, but which will happen).

 The absolute minimum allocation size for data should be the same
 as, and aligned with, the underlying disk block size.  If that
 results in underutilization, I think that's a good thing for
 performance, compared to read-modify-write cycles to update
 partial disk blocks.

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Daniel J Blueman
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com wrote:
 Mat wrote:

 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:

 Hello everyone.

 I was asked to review/evaluate Btrfs for use in enterprise
 systems, and below are my first impressions (linux-2.6.33).

 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:

 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting "No space left on device" reports).

 # ls /mnt | wc -l
 59480

 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is "hey, where is the other 83% of my
 disk space???" I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.

 For example, look at the leaf 29425664: "items 1 free space 3892"
 (of 4096!!). Note that this free space (3892) is _dead_: any
 attempt to write to the file system will result in "No space left
 on device".

 Internal fragmentation (see Appendix A) of those 5 leafs is
 (1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
 ext4 and xfs: the latter in this example will show fragmentation
 near zero with blocksize = 2K. Even with 4K blocksize they will
 show better utilization, 0.50 (against 0.38 in btrfs)!

 I have a small question for btrfs developers: why do you folks put
 inline extents, xattrs, etc. -- items of variable size -- into the
 B-tree, in spite of the fact that a B-tree is a data structure NOT
 meant for variable-sized records? This disadvantage of B-trees was
 widely discussed. For example, maestro D. Knuth warned about this
 issue a long time ago (see Appendix C).

 It is a well-known fact that internal fragmentation of classic Bayer
 B-trees is bounded by the value 0.50 (see Appendix C). However this
 holds only if your tree contains records of the _same_ length
 (for example, extent pointers). Once you put into your B-tree records
 of variable length (restricted only by leaf size, like btrfs inline
 extents), your tree LOSES this boundary. Moreover, even worse:
 it is clear that in this case utilization of the B-tree scales as
 zero(!). That said, for every small E and for every amount of data N
 we can construct a consistent B-tree which contains the data N and
 has utilization worse than E. I.e., from the standpoint of utilization
 such trees can be completely degenerate.

 That said, the very important property of B-trees which guarantees
 non-zero utilization has been lost, and I don't see any substitute
 for it in the Btrfs code. In other words, where is a formal
 guarantee that all the disk space of our users won't be eaten by
 internal fragmentation? I consider such a guarantee a *necessary*
 condition for putting a file system into production.
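For what it's worth, the report's arithmetic checks out (an editorial verification of the quoted figures, not part of the original mail):

```python
# Reproducing the utilization figures from the report above.
files, file_size = 59480, 2048            # 2K files written before ENOSPC
partition = 659 * 1024 * 1024             # 659M btrfs partition
dirty_util = files * file_size / partition
print(dirty_util)                          # ~0.176, i.e. the ~0.17 quoted

# Internal fragmentation of the five sample leaves: free bytes / total bytes
free_space = [1572, 3892, 1901, 3666, 1675]
leaf_size = 4096
frag = sum(free_space) / (leaf_size * len(free_space))
print(frag)                                # ~0.62, matching the report
```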

Wow... a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructiveness in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.

If you can produce some specific suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
-- 
Daniel J Blueman


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).

 
 It must be a highly unexpected and difficult question for file system
 developers: how efficiently does your file system manage disk space?
 
 In the meanwhile I confirm that the Btrfs design is completely broken:
 records stored in the B-tree differ greatly from each other (it is
 unacceptable!), and the balancing algorithms have been modified in an
 insane manner. All these factors have led to the loss of *all* boundaries
 holding internal fragmentation and to exhaustive waste of disk space
 (and memory!), in spite of the claimed property of "scaling in their
 ability to address large storage".
 
 This is not a large storage, this is a scalable sieve: you can not
 rely on finding some given amount of water in it even after infinitely
 increasing the size of the sieve (read: escalating the pool of Btrfs
 devices).
 
 It seems that nobody reviewed Btrfs before its inclusion into the
 mainline. I have only found a pair of recommendations with the common
 idea that the Btrfs maintainer is not a crazy man. Plus a number of
 papers which admire the Btrfs phenomenon. Sigh.
 
 Well, let's decide what we can do in the current situation..
 The first obvious point here is that we *can not* put such a file system
 into production. Just because it doesn't provide any guarantees to our
 users regarding disk space utilization.

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easily be split during balancing.

-chris


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.



There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  


Hello, Chris. Thanks for the response!
I'm afraid that neither way will fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There are no inline extents here, so what are you going to split?
All leaves must be at least half filled, otherwise we lose the
boundary that provides non-zero utilization..

Any ideas?

Thanks,
Edward.

  

It must be a highly unexpected and difficult question for file system
developers: how efficiently does your file system manage disk space?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property scaling in their ability to
address large storage.

This is not a large storage, this is a scalable sieve: you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is not a crazy man. Plus a number of
papers which admire with the Btrfs phenomena. Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.



Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easily be split during balancing.

-chris

  




Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.
 
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 
 Hello, Chris. Thanks for response!
 I afraid that both ways won't fix the problem. Look at this leaf:
 
 [...]
 leaf 29425664 items 1 free space 3892 generation 8 owner 5
 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
 [...]
 
 There is no inline extents, and what are you going to split here?
 All leafs must be at least a half filled, otherwise we loose all
 boundaries, which provides non-zero utilization..

Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.

-chris
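Rough arithmetic illustrates Chris's point (an editorial sketch; the header and overhead sizes below are assumptions for illustration, not exact btrfs on-disk constants):

```python
# Why a mid-sized variable-length item strands leaf space: it is too big
# for two copies to share a 4K leaf, but too small to fill the leaf alone.
LEAF = 4096
LEAF_HEADER = 101      # assumed leaf header size
ITEM_OVERHEAD = 25     # assumed per-item header size
item_body = 2048 + 21  # assumed ~2K inline extent plus its payload header

usable = LEAF - LEAF_HEADER
per_item = item_body + ITEM_OVERHEAD
items_per_leaf = usable // per_item
dead = usable - items_per_leaf * per_item

print(items_per_leaf)         # 1: a second such item cannot fit
print(round(dead / LEAF, 2))  # ~0.46 of the leaf is stranded free space
```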


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.
 

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
   
But then the benefits of splitting must be high, because it solves this 
problem if inline extents are turned on.
   

It must be a highly unexpected and difficult question for file system
developers: how efficiently does your file system manage disk space?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property scaling in their ability to
address large storage.

This is not a large storage, this is a scalable sieve: you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is not a crazy man. Plus a number of
papers which admire with the Btrfs phenomena. Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.
 

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easily be split during balancing.
   
If we have to split everywhere, wouldn't that have some (dramatic)
impact on the performance of the Btrfs filesystem?

As was said above: splitting has high complexity.

-chris
--
   

Have fun
Christian Stroetmann


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting "No space left on device" reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is: hey, where is the other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked by the situation on the leaf level. Appendix B shows
 5 adjacent btrfs leaves which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note that this free space (3892) is _dead_: any
 attempt to write to the file system will result in "No space left
 on device".
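The utilization arithmetic above is easy to check directly; a minimal Python sketch using only the numbers reported in the test (not part of the original thread):

```python
# Recompute the "dirty utilization" from the test: 59480 files of 2048
# bytes each on a 659M partition, before ENOSPC was reported.
files = 59480                    # files created before "No space left on device"
file_size = 2048                 # bytes per file (bs=2048 count=1)
partition = 659 * 1024 * 1024    # 659M partition size in bytes

utilization = files * file_size / partition
print(f"utilization = {utilization:.3f}")            # 0.176
print(f"lost to fragmentation = {1 - utilization:.0%}")  # 82%
```

The exact value is about 0.176, i.e. roughly five sixths of the partition is unusable, matching the "where is the other 83%" complaint.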
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 But then the benefits of splitting must be high, because it solves
 this problem if inline extents are turned on.

It depends, we might also argue that for larger inline extents like this
we are better off with much larger leaves or with using max_inline=0
instead.

 It must be a highly unexpected and difficult question for file system
 developers: how efficiently does your file system manage disk space?
 
 In the meanwhile I confirm that the Btrfs design is completely broken:
 records stored in the B-tree differ greatly in size (which is
 unacceptable!), and the balancing algorithms have been modified in an
 insane manner. All these factors have led to the loss of *all* bounds
 on internal fragmentation and to exhaustive waste of disk space
 (and memory!), in spite of the claimed property of "scaling in the
 ability to address large storage".
 
 This is not large storage, this is a scalable sieve: you cannot
 rely on finding any given amount of water in it even after increasing
 the size of the sieve indefinitely (read: growing the pool of Btrfs
 devices).
 
 It seems that nobody reviewed Btrfs before its inclusion in the
 mainline. I have found only a couple of recommendations, with the
 common idea that the Btrfs maintainer is not a crazy man, plus a
 number of papers that admire the Btrfs phenomenon. Sigh.
 
 Well, let's decide what we can do in the current situation.
 The first obvious point is that we *cannot* put such a file system
 into production, simply because it doesn't provide our users any
 guarantees regarding disk space utilization.
 Are you basing all of this on inline extents?  The other extents of
 variable size are more flexible (taking up the room in the leaf), but
 they can also easily be split during balancing.
 If we have to split everywhere, doesn't that have some (dramatic)
 impact on the performance of the Btrfs filesystem?
 As was said above: splitting has high complexity.

Yes.  Both Edward and I have worked on the reiserfs sources where inline
extents were split during balancing.  It made for a few surprises in the
file read/write code.  They aren't impossible by any means, but I wanted
to avoid them.

-chris



Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Jamie Lokier
Edward Shishkin wrote:
 If you decide to base your file system on some algorithms then please
 use the original ones from proper academic papers. DO NOT modify the
 algorithms in solitude: this is a very fragile thing! All such
 modifications must be reviewed by specialists in the theory of
 algorithms. Such review can be done in various scientific journals of
 the proper level.
 
 Personally I don't see any way to improve the situation with Btrfs
 except fully redesigning it. If you want to base your file
 system on the paper of Ohad Rodeh, then please use *exactly* the
 Bayer B-trees that he refers to. That is, make sure that all
 records you put into the tree have equal length and that all non-root
 nodes of your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed-length records ([key,index] -> byte) and
applying the standard B-tree algorithms to that, which guarantees
algorithmic properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes, and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.
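A toy illustration of this modelling step (not btrfs code; the names are invented): each variable-length record is exploded into fixed-length (key, index) -> byte entries, so bounds stated per fixed-length record carry over to the byte payload.

```python
# Sketch of the argument above: a variable-length record (key -> string)
# is viewed as a run of fixed-length records ((key, index) -> byte).
# Classic B-tree fill guarantees then apply per fixed-length record, and
# storing the run "compressed" is just storing the string itself.

def explode(records):
    """Model each (key, string) as fixed-length ((key, index), byte) entries."""
    return [((key, i), b) for key, s in records for i, b in enumerate(s.encode())]

records = [("file_1", "hello"), ("file_2", "a much longer value")]
virtual = explode(records)

# The number of virtual fixed-length records equals the total payload
# bytes, so a >= 50% fill guarantee on virtual records translates into
# a >= 50% byte-utilization guarantee for the leaves holding the runs.
total_bytes = sum(len(s.encode()) for _, s in records)
assert len(virtual) == total_bytes
print(len(virtual))  # 5 + 19 = 24
```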

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

 As to current Btrfs code: *NOT ACK*!!! I don't think we need such
 file systems.

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not a deep design bug; it is just a result of the packing and
balancing heuristics.

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.

Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.

[*] For example, when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), while deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...
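Jamie's footnote can be checked with a toy simulation (a hypothetical leaf model, not btrfs code): fill leaves of an assumed capacity with ascending keys under the two split policies and compare utilization.

```python
# Toy model: leaves hold up to CAP fixed-size records; keys arrive in
# ascending order, so inserts always land in the rightmost leaf.

CAP = 100          # assumed records per leaf
N = 10_000         # ascending keys to insert

def classic_split(n, cap):
    """Classic policy: when the active leaf is full, split it in half."""
    leaves = [0]
    for _ in range(n):
        if leaves[-1] == cap:          # full: split into two half-filled leaves
            leaves[-1] = cap // 2
            leaves.append(cap // 2)
        leaves[-1] += 1
    return leaves

def right_leaning(n, cap):
    """Deliberate underfill: leave the full leaf packed, open a fresh one."""
    leaves = [0]
    for _ in range(n):
        if leaves[-1] == cap:          # full: keep it at 100%, start a new leaf
            leaves.append(0)
        leaves[-1] += 1
    return leaves

for name, fn in [("classic", classic_split), ("right-leaning", right_leaning)]:
    leaves = fn(N, CAP)
    util = sum(leaves) / (CAP * len(leaves))
    print(f"{name}: {len(leaves)} leaves, utilization {util:.2f}")
    # classic: 199 leaves, utilization 0.50
    # right-leaning: 100 leaves, utilization 1.00
```

For purely ascending inserts the classic heuristic leaves almost every leaf half full, while the right-leaning policy packs them completely, exactly the 50% vs. near-zero lost space contrast described above.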

-- Jamie


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
Chris Mason wrote:
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
Mat wrote:
On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is: hey, where is the other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked by the situation on the leaf level. Appendix B shows
5 adjacent btrfs leaves which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note that this free space (3892) is _dead_: any
attempt to write to the file system will result in "No space left
on device".


There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  

Hello, Chris. Thanks for the response!
I'm afraid neither way will fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There are no inline extents, and what are you going to split here?
All leaves must be at least half filled, otherwise we lose all the
bounds which provide non-zero utilization..



Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.
  


How about the left and right neighbors? They contain a lot of
free space (1572 and 1901 respectively).
I am not happy with the very existence of such shallow leaves, which
contain only one small (xattr) item..
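A quick check with the free-space numbers quoted above (leaf size 4096) shows that the items in these three leaves would fit in two. A Python sketch, ignoring per-item header overhead, which real btrfs leaves do have:

```python
# Free space reported for leaf 29425664 and its two neighbors (leaf size 4096).
LEAF = 4096
free = {"left": 1572, "leaf 29425664": 3892, "right": 1901}

used = {name: LEAF - f for name, f in free.items()}
total = sum(used.values())        # 2524 + 204 + 2195 = 4923 bytes of items
print(used)
print(total, "bytes of items across 3 leaves")

# 4923 bytes fit in two 4096-byte leaves, so a balancer that merged
# underfilled siblings could free one of the three leaves outright.
assert total <= 2 * LEAF
```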


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Daniel J Blueman wrote:
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com wrote:
Mat wrote:
On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is: hey, where is the other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked by the situation on the leaf level. Appendix B shows
5 adjacent btrfs leaves which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note that this free space (3892) is _dead_: any
attempt to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leaves is
(1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
ext4 and xfs: the latter in this example will show fragmentation
near zero with blocksize = 2K. Even with a 4K blocksize they will
show better utilization, 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: why do you folks put
inline extents, xattrs, and other items of variable size into the
B-tree, in spite of the fact that the B-tree is a data structure NOT
meant for variable-sized records? This disadvantage of B-trees was
widely discussed. For example, maestro D. Knuth warned about this
issue a long time ago (see Appendix C).

It is a well-known fact that internal fragmentation of classic Bayer
B-trees is bounded by 0.50 (see Appendix C). However, this holds
only if your tree contains records of the _same_ length (for example,
extent pointers). Once you put records of variable length into your
B-tree (bounded only by the leaf size, like btrfs inline extents),
your tree LOSES this bound. Moreover, even worse: it is clear that
in this case utilization of the B-tree scales as zero(!). That is,
for every small E and for every amount of data N we can construct a
consistent B-tree which contains the data N and has utilization worse
than E. I.e., from the standpoint of utilization such trees can be
completely degenerate.

That is, the very important property of B-trees which guarantees
non-zero utilization has been lost, and I don't see any substitute
for this property in the Btrfs code. In other words, where is the
formal guarantee that all of our users' disk space won't be eaten by
internal fragmentation? I consider such a guarantee a *necessary*
condition for putting a file system into production.



Wow... a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructiveness in such a critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
  


Sure it is impossible. I believe in the division of labour:
academics write algorithms, and we (engineers) implement them.

I have noticed that events in Btrfs develop by a scenario not predicted
by the paper of academic Ohad Rodeh (in spite of the announcement that
Btrfs is based on this paper). This is why I have started to grumble..

Thanks.


If you can produce some specific suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
  




Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com 
 wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting "No space left on device" reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is: hey, where is the other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked by the situation on the leaf level. Appendix B shows
 5 adjacent btrfs leaves which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note that this free space (3892) is _dead_: any
 attempt to write to the file system will result in "No space left
 on device".
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 Hello, Chris. Thanks for the response!
 I'm afraid neither way will fix the problem. Look at this leaf:
 
 [...]
 leaf 29425664 items 1 free space 3892 generation 8 owner 5
 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
 [...]
 
 There are no inline extents, and what are you going to split here?
 All leaves must be at least half filled, otherwise we lose all the
 bounds which provide non-zero utilization..
 
 Right, there is no inline extent because we require them to fit entirely
 in the leaf.  So we end up with mostly empty leaves because the inline
 item is large enough to make it difficult to push around but not large
 enough to fill the leaf.
 
 How about the left and right neighbors? They contain a lot of
 free space (1572 and 1901 respectively).
 I am not happy with the very existence of such shallow leaves, which
 contain only one small (xattr) item..

Sure, the balancing can also be made more aggressive.  This should be
very easy to fix.
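As an illustration of what "more aggressive" balancing could mean here, a toy merge pass over the five Appendix B leaves. This is a hypothetical model (each leaf is just a list of item sizes); real btrfs balancing must also rewrite parent pointers, respect locking and COW semantics, and account for per-item headers:

```python
# Hypothetical greedy merge pass over sibling leaves (not btrfs code).
LEAF_SIZE = 4096

def merge_siblings(leaves):
    """Absorb each leaf into its left sibling whenever the items fit."""
    out = [list(leaves[0])]
    for leaf in leaves[1:]:
        if sum(out[-1]) + sum(leaf) <= LEAF_SIZE:
            out[-1].extend(leaf)      # items fit: absorb into left sibling
        else:
            out.append(list(leaf))    # keep as a separate leaf
    return out

# The five leaves from Appendix B, reconstructed as used = 4096 - free
# for the reported free space 1572, 3892, 1901, 3666, 1675.
leaves = [[2524], [204], [2195], [430], [2421]]
merged = merge_siblings(leaves)
print(len(leaves), "leaves ->", len(merged))   # 5 leaves -> 3
```

Even this naive single pass packs the reported items into 3 leaves instead of 5, raising utilization from 0.38 to about 0.63 in this example.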

-chris



Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Daniel J Blueman wrote:
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com wrote:
Mat wrote:
On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is: hey, where is the other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked by the situation on the leaf level. Appendix B shows
5 adjacent btrfs leaves which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note that this free space (3892) is _dead_: any
attempt to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leaves is
(1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
ext4 and xfs: the latter in this example will show fragmentation
near zero with blocksize = 2K. Even with a 4K blocksize they will
show better utilization, 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: why do you folks put
inline extents, xattrs, and other items of variable size into the
B-tree, in spite of the fact that the B-tree is a data structure NOT
meant for variable-sized records? This disadvantage of B-trees was
widely discussed. For example, maestro D. Knuth warned about this
issue a long time ago (see Appendix C).

It is a well-known fact that internal fragmentation of classic Bayer
B-trees is bounded by 0.50 (see Appendix C). However, this holds
only if your tree contains records of the _same_ length (for example,
extent pointers). Once you put records of variable length into your
B-tree (bounded only by the leaf size, like btrfs inline extents),
your tree LOSES this bound. Moreover, even worse: it is clear that
in this case utilization of the B-tree scales as zero(!). That is,
for every small E and for every amount of data N we can construct a
consistent B-tree which contains the data N and has utilization worse
than E. I.e., from the standpoint of utilization such trees can be
completely degenerate.

That is, the very important property of B-trees which guarantees
non-zero utilization has been lost, and I don't see any substitute
for this property in the Btrfs code. In other words, where is the
formal guarantee that all of our users' disk space won't be eaten by
internal fragmentation? I consider such a guarantee a *necessary*
condition for putting a file system into production.
 

Wow... a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructiveness in such a critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
   

I find this somewhat off-topic, but:
for sure, it isn't impossible. History has shown, and the present
shows, that there are exceptions.

If you can produce some specific suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
   Daniel
   

Have fun
Christian Stroetmann


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Jamie Lokier wrote:

Edward Shishkin wrote:
   

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is a very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific journals of
the proper level.

Personally I don't see any way to improve the situation with Btrfs
except fully redesigning it. If you want to base your file
system on the paper of Ohad Rodeh, then please use *exactly* the
Bayer B-trees that he refers to. That is, make sure that all
records you put into the tree have equal length and that all non-root
nodes of your tree are at least half filled.
 

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
   


This is the point: Which kind of modified B-tree data structure is best 
suited?



Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed-length records ([key,index] -> byte) and
applying the standard B-tree algorithms to that, which guarantees
algorithmic properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes, and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

   

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
file systems.
 

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not a deep design bug; it is just a result of the packing and
balancing heuristics.
   


I think this is the most important design question for filesystems 
that use some kind of B-tree: if the wrong modified B-tree was chosen 
as the fundamental data structure, then this is a deep design bug.



If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.
   


This is a general problem of filesystem design, especially the packing 
and balancing heuristics, and a specific problem of the Btrfs filesystem. 
You can't simply say "do it in this or that way". That's why another 
filesystem uses something exotic like a B*-tree in conjunction with 
dancing trees as its fundamental data structure, which leads back to the 
deep design question/problem/decision/bug. And after I followed the 
explanations of this exotic B-tree version by its main developer, I knew 
right from the start of the development of the Btrfs filesystem that the 
right modified B-tree data structure had not been chosen, because the 
chosen one was too simple and too general. And for some days now I have 
the impression that no design decision was made at all, with the only 
exception that there had to be some kind of B-tree algorithm/data 
structure in the Btrfs filesystem.


And I also think that such a deep design decision can't simply be 
corrected in general (subjective opinion).



Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
   


The problem found is the first problem with the chosen modified B-tree 
data structure. I wouldn't call it a problem only in a special case.



[*] For example, when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), while deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...
   


Yes, but:
Firstly, maybe you are too focused on the classic B-tree algorithm here.
Secondly, a trick here, a split there, turning off a feature, and then? 
Then we have complexity at the end, which brings us back to the start: 
the design decision.


But if you say there are no 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Jamie Lokier wrote:

Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is a very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific journals of
the proper level.

Personally I don't see any way to improve the situation with Btrfs
except fully redesigning it. If you want to base your file
system on the paper of Ohad Rodeh, then please use *exactly* the
Bayer B-trees that he refers to. That is, make sure that all
records you put into the tree have equal length and that all non-root
nodes of your tree are at least half filled.



First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  


Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  


Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  


I didn't speak about intrinsic algorithmic problems :)
I just repeat (after Knuth et al.) that B-trees with variable-length 
records don't have any sane bound on internal fragmentation. The common 
idea is that if we don't want Btrfs to be in an infinite development 
stage, then we should choose some *sane* strategy (for example, the 
paper of Ohad Rodeh) and strictly adhere to it in future.


This is easily shown by modelling variable-length records (key ->
string) as a range of fixed-length records ([key,index] -> byte) and
applying the standard B-tree algorithms to that, which guarantees
algorithmic properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes, and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

  

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
file systems.



Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not a deep design bug; it is just a result of the packing and
balancing heuristics.
  


Frankly, I would like to believe in such an end, taking into account 
the amount of my contributions to the Btrfs project. At least to make 
sure I didn't do useless work..


If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well,


Because of top-down balancing. It doesn't like clever things like 
splitting and merging. Currently top-down works properly only with 
stupid classic Bayer B-trees.


 and then propose a fix.
  


I'll try to help, but I am rather pessimistic here: working out 
algorithms is something that doesn't like timelines..


Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]


Ok, let's violate this, but not to the detriment of utilisation:
internal fragmentation is a horror for file systems, the #1 enemy
(which was completely defeated in the last century, BTW).


  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
  


IMHO this is a problem, as we cannot estimate the number of such nodes.
Do you have any sane upper bound for this number? I don't know of one.
Maybe I have missed something?


[*] For example when filling a tree by inserting contiguously
ascending keys, the classic split into two when full heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).


At the end of what? I hope you don't mean on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!


  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
  



--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
 Jamie Lokier wrote:
 Edward Shishkin wrote:
 If you decide to base your file system on some algorithms then please
 use the original ones from proper academic papers. DO NOT modify the
 algorithms in solitude: this is a very fragile thing! All such
 modifications must be reviewed by specialists in the theory of
 algorithms. Such review can be done in various scientific journals of
 the proper level.
 
 Personally I don't see any way to improve the situation with Btrfs
 except fully redesigning it. If you want to base your file
 system on the paper of Ohad Rodeh, then please use *exactly* the
 Bayer B-trees that he refers to. That is, make sure that all
 records you put into the tree have equal length and that all non-root
 nodes of your tree are at least half filled.
 
 First, thanks Edward for identifying a specific problem with the
 current btrfs implementation.
 
 Hello Jamie.
 
 I've studied modified B-trees quite a lot and know enough to be sure
 that they are quite robust when you modify them in all sorts of ways.
 
 Which property is robust?
 
 Moreover, you are incorrect to say there's an intrinsic algorithmic
 problem with variable-length records.  It is not true; if Knuth said
 so, Knuth was mistaken.
 
 I didn't speak about intrinsic algorithmic problems :)
 I just repeat (after Knuth et al.) that B-trees with variable-length
 records don't have any sane bound on internal fragmentation. The
 common idea is that if we don't want Btrfs to be in an infinite
 development stage, then we should choose some *sane* strategy
 (for example, the paper of Ohad Rodeh) and strictly adhere to it in
 future.

Again, other than the inline file data, what exactly do you believe
needs to change?  Top-down balancing vs. balancing on insertion doesn't
impact our ability to maintain full leaves.  The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.

-chris