Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-26 Thread Ric Wheeler

On 06/26/2010 08:34 AM, Daniel Shiels wrote:

25.06.2010 22:58, Ric Wheeler wrote:
 

On 06/24/2010 06:06 PM, Daniel Taylor wrote:
   

[]
 

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
   wrote:

   

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K0 or smaller blocks (2K).  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).
 

Why it is not sensible to use btrfs on raid devices?
Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage.  Sometimes
it is nearly imposisble to _not_ use RAID, -- many servers
comes with a built-in RAID card which can't be turned off or
disabled.  And hardware raid is faster (at least in theory)
at least because it puts less load on various system busses.

To many "enterprise folks" a statement "we don't need hw raid,
we have better solution" sounds like "we're just a toy, don't
use".

Hmm?  ;)

/mjt, who always used and preferred _software_ raid due to
  multiple reasons, and never used btrfs so far.
 

Its not that you shouldn't use it on raid it's just it looses some value
from the file system.

Two nice features that btrfs provides are checksums and mirroring. If a
disk corrupts a block then btrfs will realize due to the strong checksum
and use the mirrored block. If you are using a raid system the raid won't
know the data is corrupted and raid doesn't provide a way for the file
system to get to the redundant block.

I read a paper from Sun a while back about the undetected read failure
rates for modern disks having not changed for many years. Disks are so
large now that undetected failures are unacceptably likely for many
systems. Hence zfs doing similar in file system raid schemes.

In my lab I used dd to clobber data in some of my mirrors. Btrfs logs lots
of checksum errors but never corrupted a file. Doing the same on a classic
raid with classic filesystem (solaris with veritas volume manager)
silently gave me bad data depending on what disk it felt like reading
from.

Daniel.
   


I was (one of many) people who worked at EMC on designing storage 
arrays. If you are using any high end, external hardware array, it will 
detect data corruption pro-actively for you. Most arrays do continual 
scans for latent errors and have internal data integrity checks that are 
used for this.


Note that DIF/DIX adds an extra 8 bytes of data integrity to newer 
standards disks. We don't do anything with that today in btrfs, but you 
could imagine ways to get even better data integrity protection.


If you are using software RAID (MD), you should also use its internal 
checks to do this kind of proactive detection of latent errors on a 
regular basis (say once every week or two).


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-26 Thread Ric Wheeler

On 06/26/2010 01:18 AM, Michael Tokarev wrote:

25.06.2010 22:58, Ric Wheeler wrote:
   

On 06/24/2010 06:06 PM, Daniel Taylor wrote:
 

[]
   

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
   wrote:

 

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K0 or smaller blocks (2K).  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).
   

Why it is not sensible to use btrfs on raid devices?
Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage.  Sometimes
it is nearly imposisble to _not_ use RAID, -- many servers
comes with a built-in RAID card which can't be turned off or
disabled.  And hardware raid is faster (at least in theory)
at least because it puts less load on various system busses.

To many "enterprise folks" a statement "we don't need hw raid,
we have better solution" sounds like "we're just a toy, don't
use".

Hmm?  ;)

/mjt, who always used and preferred _software_ raid due to
  multiple reasons, and never used btrfs so far.
   


Absolutely no reason that you would not use btrfs on hardware raid 
volumes (or software RAID for that matter).


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-25 Thread Michael Tokarev
25.06.2010 22:58, Ric Wheeler wrote:
> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
[]
>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>>   wrote:
>>> 
 Just an FYI reminder.  The original test (2K files) is utterly
 pathological for disk drives with 4K physical sectors, such as
 those now shipping from WD, Seagate, and others.  Some of the
 SSDs have larger (16K0 or smaller blocks (2K).  There is also
 the issue of btrfs over RAID (which I know is not entirely
 sensible, but which will happen).

Why it is not sensible to use btrfs on raid devices?
Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
large external arrays on iSCSI-attached storage.  Sometimes
it is nearly imposisble to _not_ use RAID, -- many servers
comes with a built-in RAID card which can't be turned off or
disabled.  And hardware raid is faster (at least in theory)
at least because it puts less load on various system busses.

To many "enterprise folks" a statement "we don't need hw raid,
we have better solution" sounds like "we're just a toy, don't
use".

Hmm?  ;)

/mjt, who always used and preferred _software_ raid due to
 multiple reasons, and never used btrfs so far.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-25 Thread Ric Wheeler

On 06/24/2010 06:06 PM, Daniel Taylor wrote:



   

-Original Message-
From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On
Behalf Of Mike Fedyk
Sent: Wednesday, June 23, 2010 9:51 PM
To: Daniel Taylor
Cc: Daniel J Blueman; Mat; LKML;
linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler;
Andrew Morton; Linus Torvalds; The development of BTRFS
Subject: Re: Btrfs: broken file system design (was Unbound(?)
internal fragmentation in Btrfs)

On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
  wrote:
 

Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K0 or smaller blocks (2K).  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks.
   

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.

 

As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.

Otherwise we end up with the damage for the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
the read-modify-write (RMW) cycles and and cost>10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
   


Also interesting is to note that you can get a significant overheard 
even with 0 byte length files. Path names, metadata overhead, etc can 
consume (depending on the pathname length) quite a bit of space per file.


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design

2010-06-25 Thread Andi Kleen
"Daniel Taylor"  writes:
>
> As long as no object smaller than the disk block size is ever
> flushed to media, and all flushed objects are aligned to the disk
> blocks, there should be no real performance hit from that.

The question is just how large such a block needs to be.
Traditionally some RAID controllers (and possibly some SSDs now) 
needed very large blocks upto MBs.

>
> Otherwise we end up with the damage for the ext[234] family, where
> the file blocks can be aligned, but the 1K inode updates cause
> the read-modify-write (RMW) cycles and and cost >10% performance
> hit for creation/update of large numbers of files.

Fixing that doesn't require a new file system layout, just some effort
to read/write inodes in batches of multiple of them. XFS did similar
things for a long time, I believe there were some efforts for this
for ext4 too.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread Daniel Taylor
 

> -Original Message-
> From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On 
> Behalf Of Mike Fedyk
> Sent: Wednesday, June 23, 2010 9:51 PM
> To: Daniel Taylor
> Cc: Daniel J Blueman; Mat; LKML; 
> linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler; 
> Andrew Morton; Linus Torvalds; The development of BTRFS
> Subject: Re: Btrfs: broken file system design (was Unbound(?) 
> internal fragmentation in Btrfs)
> 
> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor 
>  wrote:
> > Just an FYI reminder.  The original test (2K files) is utterly
> > pathological for disk drives with 4K physical sectors, such as
> > those now shipping from WD, Seagate, and others.  Some of the
> > SSDs have larger (16K0 or smaller blocks (2K).  There is also
> > the issue of btrfs over RAID (which I know is not entirely
> > sensible, but which will happen).
> >
> > The absolute minimum allocation size for data should be the same
> > as, and aligned with, the underlying disk block size.  If that
> > results in underutilization, I think that's a good thing for
> > performance, compared to read-modify-write cycles to update
> > partial disk blocks.
> 
> Block size = 4k
> 
> Btrfs packs smaller objects into the blocks in certain cases.
> 

As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.

Otherwise we end up with the damage for the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
the read-modify-write (RMW) cycles and and cost >10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread David Woodhouse
On Wed, 2010-06-23 at 20:43 -0700, Daniel Taylor wrote:
> There is also the issue of btrfs over RAID (which I know is not
> entirely sensible, but which will happen). 

Well, we could discourage that by merging the RAID support that's been
pending for a while but I suspect Chris is a bit busy right now :)

-- 
dwmw2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Mike Fedyk
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor  wrote:
> Just an FYI reminder.  The original test (2K files) is utterly
> pathological for disk drives with 4K physical sectors, such as
> those now shipping from WD, Seagate, and others.  Some of the
> SSDs have larger (16K0 or smaller blocks (2K).  There is also
> the issue of btrfs over RAID (which I know is not entirely
> sensible, but which will happen).
>
> The absolute minimum allocation size for data should be the same
> as, and aligned with, the underlying disk block size.  If that
> results in underutilization, I think that's a good thing for
> performance, compared to read-modify-write cycles to update
> partial disk blocks.

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Daniel Taylor
Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K0 or smaller blocks (2K).  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks. 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
> I'll try to help, but I am rather pessimistic here: working out
> algorithms is something, which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

> >Note that it's not necessarily a problem to have a few nodes with low
> >utilisation.  Deliberate violation of the classic balancing heuristic
> >is often useful for performance.[*]
> 
> Ok, let's violate this, but not in the detriment of utilisation:
> Internal fragmentation is a horror for file systems, the enemy #1
> (which is completely defeated in the last century BTW).
> 
> >  The problem you've found is only a
> >real problem when there are _too many_ nodes with low utilisation.
> 
> IMHO this is a problem, as we can not estimate number of such nodes.
> Do you have any sane upper boundary for this number? I don't know such one.
> Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary.  I'm really not sure, though.  I
haven't read the code.

> >[*] For example when filling a tree by inserting contiguously
> >ascending keys, the classic "split into two when full" heuristic gives
> >the worst possible results (50% lost space), and deliberately
> >underfilling the most actively updated nodes, which is not permitted
> >at all by the classic algorithm, gives denser packing in the end
> >(almost zero lost space).
> 
> At the end of what?

At the end of inserting keys in ascending order.  Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.

If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)

To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the "active insertion heads".

The above describes classic B-trees where updates are assumed to be
written immediately.  Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.

> I hope you don't speak about on-line defragmentation?

No, not at all.

> Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's
obvious when you think about it.  But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
> I have noticed that events in Btrfs develop by scenario not predicted
> by the paper of academic Ohad Rodeh (in spite of the announce that
> Btrfs is based on this paper). This is why I have started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
> Jamie Lokier wrote:
> >Edward Shishkin wrote:
> >>If you decide to base your file system on some algorithms then please
> >>use the original ones from proper academic papers. DO NOT modify the
> >>algorithms in solitude: this is very fragile thing! All such
> >>modifications must be reviewed by specialists in the theory of
> >>algorithms. Such review can be done in various scientific magazines of
> >>proper level.
> >>
> >>Personally I don't see any way to improve the situation with Btrfs
> >>except full redesigning the last one. If you want to base your file
> >>system on the paper of Ohad Rodeh, then please, use *exactly* the
> >>Bayer's B-trees that he refers to. That said, make sure that all
> >>records you put to the tree has equal length and all non-root nodes of
> >>your tree are at least half filled.
> >
> >First, thanks Edward for identifying a specific problem with the
> >current btrfs implementation.
> 
> Hello Jamie.
> 
> >I've studied modified B-trees quite a lot and know enough to be sure
> >that they are quite robust when you modify them in all sorts of ways.
> 
> Which property is robust?
> 
> >Moreover, you are incorrect to say there's an intrinsic algorithmic
> >problem with variable-length records.  It is not true; if Knuth said
> >so, Knuth was mistaken.
> 
> I didn't say about intrinsic algorithmic problems :)
> I just repeat (after Knuth et al) that B-trees with variable-length
> records don't
> have any sane boundary for internal fragmentation. The common idea
> is that if we
> don't want Btrfs to be in infinite development stage, then we should
> choose some
> *sane* strategy (for example the paper of Ohad Rodeh) and strictly
> adhere this in
> future.

Again, other than the inline file data, what exactly do you believe
needs to change?  Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves.  The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Jamie Lokier wrote:

Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.



First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  


Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  


Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  


I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length 
records don't
have any sane boundary for internal fragmentation. The common idea is 
that if we
don't want Btrfs to be in infinite development stage, then we should 
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly 
adhere this in

future.


This is easily shown by modelling variable-length records (key ->
string) as a range of fixed length records ([key,index] -> byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

  

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
"file systems".



Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.
  


Frankly, I would like to believe to such end, taking into account amount 
of my
contributions to the Btrfs project. At least to make sure I didn't do 
useless

work..


If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well,


Because of top-down balancing. It doesn't like "clever" things like 
"splitting"

and "merging". Currently top-down works properly only with stupid classic
Bayer's B-trees.


 and then propose a fix.
  


I'll try to help, but I am rather pessimistic here: working out 
algorithms is

something, which doesn't like timelines..


Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]


Ok, let's violate this, but not in the detriment of utilisation:
Internal fragmentation is a horror for file systems, the enemy #1
(which is completely defeated in the last century BTW).


  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
  


IMHO this is a problem, as we can not estimate number of such nodes.
Do you have any sane upper boundary for this number? I don't know such one.
Maybe I have missed something?


[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).


At the end of what? I hope you don't speak about on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!


  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  



--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Jamie Lokier wrote:

Edward Shishkin wrote:
   

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.
 

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
   


This is the point: Which kind of modified B-tree data structure is best 
suited?



Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed length records ([key,index] ->  byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

   

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
"file systems".
 

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.
   


I think this is the most important design question in relation with 
filesystems that use some kind of B-trees, which means, if the wrong 
modified B-tree as the fundamental data structure was chosen, then this 
is a deep design bug.



If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.
   


This is a general problem of filesystem design, especially the packing 
and balancing heurisitcs, and a special problem of the Btrfs filesystem. 
You can't simply say do it in this or that way. That's why another 
filesystem uses something exotic like a B*-tree in conjunction with 
dancing trees as fundamental data structure, which leads back to the 
deep design question/problem/decision/bug/ And after I followed the 
explanations of this exotic B-tree version by the main developer I knew 
just right from the start of the development of the Btrfs filesystem 
that it wasn't chosen the right modified B-tree data structure, because 
it was too simple and too general. And since some days I have the 
impression that there wasn't made a design decision at all with the only 
exception that there has to be some kind of a B-tree algorithm/data 
structure in the Btrfs filesystem.


And I also think that such a deep desgin decision can't simply be 
corrected in general (subjective opinion).



Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
   


The found problem is the first problem with the chosen modified B-tree 
data structure. I wouldn't call it only a problem in a special case.



[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...
   


Yes, but 
Firstly, maybe you are too focused on the classic B-tree algorithm here.
Secondly, a trick here, a split there, turning off a feature and then? 
Then we have complexity at then end, which brings us back to the start, 
the design decision.


But if you say there 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
  wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize<= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.
 

Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
   

I find this is somehow off-topic, but:
For sure, it isn't impossible. History showed and present shows that 
there are exceptions.

If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
   Daniel
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> >>Chris Mason wrote:
> >>>On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> Mat wrote:
> >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  
> >wrote:
> >>Hello everyone.
> >>
> >>I was asked to review/evaluate Btrfs for using in enterprise
> >>systems and the below are my first impressions (linux-2.6.33).
> >>
> >>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>btrfs partition (mounted to /mnt) with 2K files:
> >>
> >># for i in $(seq 100); \
> >>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>(terminated after getting "No space left on device" reports).
> >>
> >># ls /mnt | wc -l
> >>59480
> >>
> >>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>and the first obvious question is "hey, where are other 83% of my
> >>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>shocked with the situation on the leaf level. The Appendix B shows
> >>5 adjacent btrfs leafs, which have the same parent.
> >>
> >>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>attempts to write to the file system will result in "No space left
> >>on device".
> >>>There are two easy ways to fix this problem.  Turn off the inline
> >>>extents (max_inline=0) or allow splitting of the inline extents.  I
> >>>didn't put in the splitting simply because the complexity was high while
> >>>the benefits were low (in comparison with just turning off the inline
> >>>extents).
> >>Hello, Chris. Thanks for response!
> >>I afraid that both ways won't fix the problem. Look at this leaf:
> >>
> >>[...]
> >>leaf 29425664 items 1 free space 3892 generation 8 owner 5
> >>fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
> >>chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
> >>   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
> >>   location key (0 UNKNOWN 0) type 8
> >>   namelen 16 datalen 32 name: security.selinux
> >>[...]
> >>
> >>There is no inline extents, and what are you going to split here?
> >>All leafs must be at least a half filled, otherwise we loose all
> >>boundaries, which provides non-zero utilization..
> >
> >Right, there is no inline extent because we require them to fit entirely
> >in the leaf.  So we end up with mostly empty leaves because the inline
> >item is large enough to make it difficult to push around but not large
> >enough to fill the leaf.
> 
> How about left and right neighbors? They contain a lot of
> free space (1572 and 1901 respectively).
> I am not happy with the very fact of such shallow leafs which
> contain only one small (xattr) item..

Sure, the balancing can also be made more aggressive.  This should be
very easy to fix.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
 wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.



Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
  


Sure it is impossible. I believe in division of labour:
academics writes algorithms, and we (engineers) encode them.

I have noticed that events in Btrfs develop by scenario not predicted
by the paper of academic Ohad Rodeh (in spite of the announce that
Btrfs is based on this paper). This is why I have started to grumble..

Thanks.


If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
  

Chris Mason wrote:


On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".


There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  

Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..



Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.
  


How about left and right neighbors? They contain a lot of
free space (1572 and 1901 respectively).
I am not happy with the very fact of such shallow leafs which
contain only one small (xattr) item..
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Jamie Lokier
Edward Shishkin wrote:
> If you decide to base your file system on some algorithms then please
> use the original ones from proper academic papers. DO NOT modify the
> algorithms in solitude: this is very fragile thing! All such
> modifications must be reviewed by specialists in the theory of
> algorithms. Such review can be done in various scientific magazines of
> proper level.
> 
> Personally I don't see any way to improve the situation with Btrfs
> except full redesigning the last one. If you want to base your file
> system on the paper of Ohad Rodeh, then please, use *exactly* the
> Bayer's B-trees that he refers to. That said, make sure that all
> records you put to the tree has equal length and all non-root nodes of
> your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed length records ([key,index] -> byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
> "file systems".

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.

Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.

[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> >>Mat wrote:
> >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
> Hello everyone.
> 
> I was asked to review/evaluate Btrfs for using in enterprise
> systems and the below are my first impressions (linux-2.6.33).
> 
> The first test I have made was filling an empty 659M (/dev/sdb2)
> btrfs partition (mounted to /mnt) with 2K files:
> 
> # for i in $(seq 100); \
> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> (terminated after getting "No space left on device" reports).
> 
> # ls /mnt | wc -l
> 59480
> 
> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> and the first obvious question is "hey, where are other 83% of my
> disk space???" I looked at the btrfs storage tree (fs_tree) and was
> shocked with the situation on the leaf level. The Appendix B shows
> 5 adjacent btrfs leafs, which have the same parent.
> 
> For example, look at the leaf 29425664: "items 1 free space 3892"
> (of 4096!!). Note, that this "free" space (3892) is _dead_: any
> attempts to write to the file system will result in "No space left
> on device".
> >There are two easy ways to fix this problem.  Turn off the inline
> >extents (max_inline=0) or allow splitting of the inline extents.  I
> >didn't put in the splitting simply because the complexity was high while
> >the benefits were low (in comparison with just turning off the inline
> >extents).
> But then the benefits of splitting must be high, because it solves
> this problem if inline extents are turned on.

It depends, we might also argue that for larger inline extents like this
we are better off with much larger leaves or with using max_inline=0
instead.

> >>It must be a highly unexpected and difficult question for file system
> >>developers: "how efficiently does your file system manage disk space"?
> >>
> >>In the meanwhile I confirm that Btrfs design is completely broken:
> >>records stored in the B-tree differ greatly from each other (it is
> >>unacceptable!), and the balancing algorithms have been modified in
> >>insane manner. All these factors has led to loss of *all* boundaries
> >>holding internal fragmentation and to exhaustive waste of disk space
> >>(and memory!) in spite of the property "scaling in their ability to
> >>address large storage".
> >>
> >>This is not a large storage, this is a "scalable sieve": you can not
> >>rely on finding there some given amount of water even after infinite
> >>increasing the size of the sieve (read escalating the pool of Btrfs
> >>devices).
> >>
> >>It seems that nobody have reviewed Btrfs before its inclusion to the
> >>mainline. I have only found a pair of recommendations with a common
> >>idea that Btrfs maintainer is "not a crazy man". Plus a number of
> >>papers which admire with the "Btrfs phenomena". Sigh.
> >>
> >>Well, let's decide what can we do in current situation..
> >>The first obvious point here is that we *can not* put such file system
> >>to production. Just because it doesn't provide any guarantees for our
> >>users regarding disk space utilization.
> >Are you basing all of this on inline extents?  The other extents of
> >variable size are more flexible (taking up the room in the leaf), but
> >they can also easy be split during balancing.
> If we have to split everywhere, hasn't it then some (dramatic)
> impact on the performance of the Btrfs filesystem?
> As it was said above: splitting has a high complexity.

Yes.  Both Edward and I have worked on the reiserfs sources where inline
extents were split during balancing.  It made for a few surprises in the
file read/write code.  They aren't impossible by any means, but I wanted
to avoid them.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".
 

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
   
But then the benefits of splitting must be high, because it solves this 
problem if inline extents are turned on.
   

It must be a highly unexpected and difficult question for file system
developers: "how efficiently does your file system manage disk space"?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property "scaling in their ability to
address large storage".

This is not a large storage, this is a "scalable sieve": you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is "not a crazy man". Plus a number of
papers which admire with the "Btrfs phenomena". Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.
 

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.
   
If we have to split everywhere, hasn't it then some (dramatic) impact on 
the performance of the Btrfs filesystem?

As it was said above: splitting has a high complexity.

-chris
--
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
> >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> >>Mat wrote:
> >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
> Hello everyone.
> 
> I was asked to review/evaluate Btrfs for using in enterprise
> systems and the below are my first impressions (linux-2.6.33).
> 
> The first test I have made was filling an empty 659M (/dev/sdb2)
> btrfs partition (mounted to /mnt) with 2K files:
> 
> # for i in $(seq 100); \
> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> (terminated after getting "No space left on device" reports).
> 
> # ls /mnt | wc -l
> 59480
> 
> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> and the first obvious question is "hey, where are other 83% of my
> disk space???" I looked at the btrfs storage tree (fs_tree) and was
> shocked with the situation on the leaf level. The Appendix B shows
> 5 adjacent btrfs leafs, which have the same parent.
> 
> For example, look at the leaf 29425664: "items 1 free space 3892"
> (of 4096!!). Note, that this "free" space (3892) is _dead_: any
> attempts to write to the file system will result in "No space left
> on device".
> >
> >There are two easy ways to fix this problem.  Turn off the inline
> >extents (max_inline=0) or allow splitting of the inline extents.  I
> >didn't put in the splitting simply because the complexity was high while
> >the benefits were low (in comparison with just turning off the inline
> >extents).
> 
> Hello, Chris. Thanks for response!
> I afraid that both ways won't fix the problem. Look at this leaf:
> 
> [...]
> leaf 29425664 items 1 free space 3892 generation 8 owner 5
> fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
> chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
>item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
>location key (0 UNKNOWN 0) type 8
>namelen 16 datalen 32 name: security.selinux
> [...]
> 
> There is no inline extents, and what are you going to split here?
> All leafs must be at least a half filled, otherwise we loose all
> boundaries, which provides non-zero utilization..

Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".



There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  


Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..

Any ideas?

Thanks,
Edward.

  

It must be a highly unexpected and difficult question for file system
developers: "how efficiently does your file system manage disk space"?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property "scaling in their ability to
address large storage".

This is not a large storage, this is a "scalable sieve": you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is "not a crazy man". Plus a number of
papers which admire with the "Btrfs phenomena". Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.



Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> Mat wrote:
> >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
> >>Hello everyone.
> >>
> >>I was asked to review/evaluate Btrfs for using in enterprise
> >>systems and the below are my first impressions (linux-2.6.33).
> >>
> >>The first test I have made was filling an empty 659M (/dev/sdb2)
> >>btrfs partition (mounted to /mnt) with 2K files:
> >>
> >># for i in $(seq 100); \
> >>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
> >>(terminated after getting "No space left on device" reports).
> >>
> >># ls /mnt | wc -l
> >>59480
> >>
> >>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
> >>and the first obvious question is "hey, where are other 83% of my
> >>disk space???" I looked at the btrfs storage tree (fs_tree) and was
> >>shocked with the situation on the leaf level. The Appendix B shows
> >>5 adjacent btrfs leafs, which have the same parent.
> >>
> >>For example, look at the leaf 29425664: "items 1 free space 3892"
> >>(of 4096!!). Note, that this "free" space (3892) is _dead_: any
> >>attempts to write to the file system will result in "No space left
> >>on device".

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).

> 
> It must be a highly unexpected and difficult question for file system
> developers: "how efficiently does your file system manage disk space"?
> 
> In the meanwhile I confirm that Btrfs design is completely broken:
> records stored in the B-tree differ greatly from each other (it is
> unacceptable!), and the balancing algorithms have been modified in
> insane manner. All these factors has led to loss of *all* boundaries
> holding internal fragmentation and to exhaustive waste of disk space
> (and memory!) in spite of the property "scaling in their ability to
> address large storage".
> 
> This is not a large storage, this is a "scalable sieve": you can not
> rely on finding there some given amount of water even after infinite
> increasing the size of the sieve (read escalating the pool of Btrfs
> devices).
> 
> It seems that nobody have reviewed Btrfs before its inclusion to the
> mainline. I have only found a pair of recommendations with a common
> idea that Btrfs maintainer is "not a crazy man". Plus a number of
> papers which admire with the "Btrfs phenomena". Sigh.
> 
> Well, let's decide what can we do in current situation..
> The first obvious point here is that we *can not* put such file system
> to production. Just because it doesn't provide any guarantees for our
> users regarding disk space utilization.

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Daniel J Blueman
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
 wrote:
> Mat wrote:
>>
>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
>>>
>>> Hello everyone.
>>>
>>> I was asked to review/evaluate Btrfs for using in enterprise
>>> systems and the below are my first impressions (linux-2.6.33).
>>>
>>> The first test I have made was filling an empty 659M (/dev/sdb2)
>>> btrfs partition (mounted to /mnt) with 2K files:
>>>
>>> # for i in $(seq 100); \
>>> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
>>> (terminated after getting "No space left on device" reports).
>>>
>>> # ls /mnt | wc -l
>>> 59480
>>>
>>> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
>>> and the first obvious question is "hey, where are other 83% of my
>>> disk space???" I looked at the btrfs storage tree (fs_tree) and was
>>> shocked with the situation on the leaf level. The Appendix B shows
>>> 5 adjacent btrfs leafs, which have the same parent.
>>>
>>> For example, look at the leaf 29425664: "items 1 free space 3892"
>>> (of 4096!!). Note, that this "free" space (3892) is _dead_: any
>>> attempts to write to the file system will result in "No space left
>>> on device".
>>>
>>> Internal fragmentation (see Appendix A) of those 5 leafs is
>>> (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
>>> ext4 and xfs: The last ones in this example will show fragmentation
>>> near zero with blocksize <= 2K. Even with 4K blocksize they will
>>> show better utilization 0.50 (against 0.38 in btrfs)!
>>>
>>> I have a small question for btrfs developers: Why do you folks put
>>> "inline extents", xattr, etc items of variable size to the B-tree
>>> in spite of the fact that B-tree is a data structure NOT for variable
>>> sized records? This disadvantage of B-trees was widely discussed.
>>> For example, maestro D. Knuth warned about this issue long time
>>> ago (see Appendix C).
>>>
>>> It is a well known fact that internal fragmentation of classic Bayer's
>>> B-trees is restricted by the value 0.50 (see Appendix C). However it
>>> takes place only if your tree contains records of the _same_ length
>>> (for example, extent pointers). Once you put to your B-tree records
>>> of variable length (restricted only by leaf size, like btrfs "inline
>>> extents"), your tree LOSES this boundary. Moreover, even worse:
>>> it is clear, that in this case utilization of B-tree scales as zero(!).
>>> That said, for every small E and for every amount of data N we
>>> can construct a consistent B-tree, which contains data N and has
>>> utilization worse then E. I.e. from the standpoint of utilization
>>> such trees can be completely degenerated.
>>>
>>> That said, the very important property of B-trees, which guarantees
>>> non-zero utilization, has been lost, and I don't see in Btrfs code any
>>> substitution for this property. In other words, where is a formal
>>> guarantee that all disk space of our users won't be eaten by internal
>>> fragmentation? I consider such guarantee as a *necessary* condition
>>> for putting a file system to production.

Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.

If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
-- 
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Mat wrote:

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
---
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
---
a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
  item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
  item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
  inode generation 8 size 2048 block group 29360128 mode 100644
links 1
  item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
  inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
  inode generation 8 size 2048 block group 29360128 mode 100644
links 1
  item 1 key (321 INODE_