Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-23 Thread Edward Shishkin

Chris Mason wrote:

On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
  

Chris Mason wrote:


On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
  

I'll reproduce from your test case and provide a fix.  mount -o
max_inline=1500 would give us 50% usage in the worst case


This is a very strange statement: how did you calculate this lower bound?



We want room for the extent and the inode item and the inode backref.
It's a rough number that leaves some extra room.  But even with a 2K
item we're getting very close to 50% usage of the metadata area.
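
As a back-of-the-envelope check, assuming 4K leaves and the roughly
25-byte per-item header btrfs uses:

    2048-byte inline extent + ~25-byte item header  ~= 2073 bytes
    2073 / 4096                                     ~= 51% of the leaf

so a leaf left holding a single maximal 2K item after a split is already
about half full; capping inline data near 1500 bytes leaves room for the
inode item and backref alongside the extent.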

  

(minus the
balancing bug you hit).


Ok, the balancing bug was interesting.  What happens is we create all
the inodes and directory items nicely packed into the leaves.

Then delayed allocation kicks in and starts inserting the big fat inline
extents.  This often forces the balancing code to split a leaf twice in
order to make room for the new inline extent.  The double split code
wasn't balancing the bits that were left over on either side of our
desired slot.

The patch below fixes the problem for me, reducing the number of leaves
that have more than 2K of free space down from 6% of the total to about
74 leaves.  It could be zero, but the balancing code doesn't push
items around if our leaf is in the first or last slot of the parent
node (complexity vs benefit tradeoff).
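
A toy model in C may make the tradeoff concrete.  The real logic lives in
fs/btrfs/ctree.c around push_leaf_left()/push_leaf_right() and is far more
involved; everything below is a sketch, not the kernel code:

#include <stdio.h>

#define LEAF_SIZE 4096

/* Toy model: a leaf is just a count of used bytes.  Move bytes from
 * 'from' into 'to' until 'to' is full or 'from' is empty; stands in
 * for the item-by-item push_leaf_left/push_leaf_right. */
static void push(int *from, int *to)
{
	int room = LEAF_SIZE - *to;
	int moved = room < *from ? room : *from;
	*from -= moved;
	*to += moved;
}

/* After a split, try to merge the leftovers into both neighbours --
 * unless the leaf sits in the first or last slot of its parent,
 * which is skipped (the complexity vs benefit tradeoff above). */
static void rebalance(int used[], int n, int slot)
{
	if (slot == 0 || slot == n - 1)
		return;
	push(&used[slot], &used[slot - 1]);
	push(&used[slot], &used[slot + 1]);
}

int main(void)
{
	int used[] = { 4000, 900, 4000 };  /* middle leaf left sparse by a double split */

	rebalance(used, 3, 1);
	for (int i = 0; i < 3; i++)
		printf("leaf %d: %d bytes used\n", i, used[i]);
	return 0;
}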
  

Nevertheless, I see leaves which are not in the first or last slot,
but which are mergeable with their neighbors (the test case is the same):



Correct, but it was in the first or last slot when it was balanced (I
confirmed this with printk).

Then the higher layers were balanced and our leaves were no longer in
the first/last slot.  We don't rebalance leaves when we balance level 1.

  

leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
   key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
   key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
   key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
   key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
   key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
   key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
   key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
   key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
   key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
   key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
   key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
   key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
   key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
   key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
   key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
   key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
[...]



With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
That doesn't sound like a huge improvement, but the default from
mkfs.btrfs is to duplicate metadata.  After duplication, that's about
417MB or about 40% of the overall space.
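
(As a sanity check: 106894 files times ~2K of metadata each is about
209MB; doubled by metadata duplication that is roughly 417MB, which is
where the ~40% of the 1GB device comes from.)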

When you factor in the space that we reserve to avoid exploding on
enospc and the space that we have allocated for data extents, that's not
a very surprising number.

I'm still putting this patch through more testing; the double split code
is used in some difficult corners and I need to make sure I've tried
all of them.
  

Chris, for further code review we need documents that reflect
the original ideas behind the balancing, etc. Could you please provide them?
Obviously, I can not write them instead of you, as I don't know those ideas.




Which part are you most interested in?
  


Balancing.

What conditions do we want to maintain?
Why won't we get a utilization slump in the future?
How do we fix things when something goes wrong?
Meanwhile, judged against the single existing paper, the Btrfs
design looks really broken, because:
1. the balancing condition n <= number_of_keys <= 2n+1
  is not satisfied (indeed, when number_of_keys is 1,
  n <= 1 <= 2n+1 is false for any n > 1);
2. the trees mentioned in this paper don't allow keys of
  variable length.

Thanks,
Edward.


Re: [PATCH] Improve support for exporting btrfs subvolumes.

2010-06-23 Thread J. Bruce Fields
On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:
 
 If you export two subvolumes of a btrfs filesystem, they will both be
 given the same uuid so lookups will be confused.
 blkid cannot differentiate the two, so we must use the fsid from
 statfs64 to identify the filesystem.
 
 We cannot tell if blkid or statfs is best without knowing internal
 details of the filesystem in question, so we need to encode specific
 knowledge of btrfs in mountd.  This is unfortunate.
 
 To ensure smooth handling of this and possible future changes in uuid
 generation, we add infrastructure for multiple different uuids to be
 recognised on old filehandles, but only the preferred one is used on
 new filehandles.

Could you just concatenate the two (or hash them somehow)?  Or does that
just use up too much space in the filehandle?

--b.

 
 Signed-off-by: NeilBrown ne...@suse.de
 
 --
 This is a substantially revised version of a patch I posted a while
 ago.
 I tried to find a way to do it without hard-coding knowledge of btrfs in
 nfs-utils, but it isn't possible.  For some filesystems, f_fsid is
 best, for some it is worst.  No way to tell the difference.
 
 This patch adds infrastructure so that if we find a better way to get a
 good uuid (e.g. a new syscall), we can slot it in for new filehandles,
 but old filehandles using the old uuid will still work.
 
 I believe this is ready for inclusion upstream.
 Thanks,
 NeilBrown
 
 
 diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
 index caef5b2..85cd829 100644
 --- a/utils/mountd/cache.c
 +++ b/utils/mountd/cache.c
 @@ -170,13 +170,16 @@ void auth_unix_gid(FILE *f)
  #if USE_BLKID
  static const char *get_uuid_blkdev(char *path)
  {
 + /* We set *safe if we know that we need the
 +  * fsid from statfs too.
 +  */
   static blkid_cache cache = NULL;
   struct stat stb;
   char *devname;
   blkid_tag_iterate iter;
   blkid_dev dev;
   const char *type;
 - const char *val = NULL;
 + const char *val, *uuid = NULL;
  
   if (cache == NULL)
   blkid_get_cache(&cache, NULL);
 @@ -193,42 +196,29 @@ static const char *get_uuid_blkdev(char *path)
   iter = blkid_tag_iterate_begin(dev);
   if (!iter)
   return NULL;
 - while (blkid_tag_next(iter, &type, &val) == 0)
 + while (blkid_tag_next(iter, &type, &val) == 0) {
   if (strcmp(type, "UUID") == 0)
 + uuid = val;
 + if (strcmp(type, "TYPE") == 0 &&
 + strcmp(val, "btrfs") == 0) {
 + uuid = NULL;
   break;
 + }
 + }
   blkid_tag_iterate_end(iter);
 - return val;
 + return uuid;
  }
  #else
  #define get_uuid_blkdev(path) (NULL)
  #endif
  
 -int get_uuid(char *path, char *uuid, int uuidlen, char *u)
 +int get_uuid(const char *val, int uuidlen, char *u)
  {
   /* extract hex digits from uuidstr and compose a uuid
* of the given length (max 16), xoring bytes to make
 -  * a smaller uuid.  Then compare with uuid
 +  * a smaller uuid.
*/
   int i = 0;
 - const char *val = NULL;
 - char fsid_val[17];
 -
 - if (path) {
 - val = get_uuid_blkdev(path);
 - if (!val) {
 - struct statfs64 st;
 -
 - if (statfs64(path, &st))
 - return 0;
 - if (!st.f_fsid.__val[0] && !st.f_fsid.__val[1])
 - return 0;
 - snprintf(fsid_val, 17, "%08x%08x",
 -  st.f_fsid.__val[0], st.f_fsid.__val[1]);
 - val = fsid_val;
 - }
 - } else {
 - val = uuid;
 - }
   
   memset(u, 0, uuidlen);
   for ( ; *val ; val++) {
 @@ -252,6 +242,60 @@ int get_uuid(char *path, char *uuid, int uuidlen, char *u)
   return 1;
  }
  
 +int uuid_by_path(char *path, int type, int uuidlen, char *uuid)
 +{
 + /* get a uuid for the filesystem found at 'path'.
 +  * There are several possible ways of generating the
 +  * uuids (types).
 +  * Type 0 is used for new filehandles, while other types
 +  * may be used to interpret old filehandles - to ensure smooth
 +  * forward migration.
 +  * We return 1 if a uuid was found (and it might be worth 
 +  * trying the next type) or 0 if no more uuid types can be
 +  * extracted.
 +  */
 +
 + /* Possible sources of uuid are
 +  * - blkid uuid
 +  * - statfs64 uuid
 +  *
 +  * On some filesystems (e.g. vfat) the statfs64 uuid is simply an
 +  * encoding of the device that the filesystem is mounted from, so
 +  * it would be very bad to use that (as device numbers change).  blkid
 +  * must be preferred.
 +  * On other filesystems (e.g. btrfs) the statfs64 uuid contains
 +  * important info that the blkid uuid cannot contain:  This happens
 +  * when multiple subvolumes are 
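
For readers following along, the xor-folding that get_uuid()'s comment
describes can be sketched standalone.  This is a simplified reconstruction
for illustration; the helper name fold_uuid and the wrap-around detail are
assumptions, not the exact nfs-utils code:

#include <ctype.h>
#include <string.h>

/* Fold the hex digits of 'val' into a 'uuidlen'-byte buffer 'u'
 * (uuidlen at most 16).  Two hex digits fill one byte; once the
 * buffer is full, further digits are xored over it from the start,
 * producing a smaller uuid as described above. */
static void fold_uuid(const char *val, int uuidlen, unsigned char *u)
{
	int i = 0;

	memset(u, 0, uuidlen);
	for (; *val; val++) {
		int c;

		if (!isxdigit((unsigned char)*val))
			continue;	/* skip '-' and friends */
		if (isdigit((unsigned char)*val))
			c = *val - '0';
		else
			c = tolower((unsigned char)*val) - 'a' + 10;
		u[i / 2] ^= (i & 1) ? c : c << 4;
		i = (i + 1) % (uuidlen * 2);
	}
}

int main(void)
{
	unsigned char u[8];

	/* the fs uuid from the dump earlier in the digest */
	fold_uuid("614fb921-cfa9-403d-9feb-940021e72382", 8, u);
	return 0;
}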

Re: [PATCH] Improve support for exporting btrfs subvolumes.

2010-06-23 Thread Neil Brown
On Wed, 23 Jun 2010 14:28:38 -0400
J. Bruce Fields bfie...@fieldses.org wrote:

 On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:
  
  If you export two subvolumes of a btrfs filesystem, they will both be
  given the same uuid so lookups will be confused.
  blkid cannot differentiate the two, so we must use the fsid from
  statfs64 to identify the filesystem.
  
  We cannot tell if blkid or statfs is best without knowing internal
  details of the filesystem in question, so we need to encode specific
  knowledge of btrfs in mountd.  This is unfortunate.
  
  To ensure smooth handling of this and possible future changes in uuid
  generation, we add infrastructure for multiple different uuids to be
  recognised on old filehandles, but only the preferred one is used on
  new filehandles.
 
 Could you just concatenate the two (or hash them somehow)?  Or does that
 just use up too much space in the filehandle?

I did consider xoring them together but came to the conclusion that would
actually be a regression.
If you look down at the comment that I included in uuid_by_path, you will see
that some filesystems (e.g. VFAT) just use the major/minor device number for
the f_fsid from statfs.  As you know that is not necessarily stable over
reboots, while the UUID that blkid gives is.
So if we always combined the two uuids somehow, it would be an improvement for
btrfs, no change for e.g. ext3/XFS, and a regression for VFAT (and others I
think).  Not good.

Thanks,
NeilBrown


 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
 I have noticed that events in Btrfs develop according to a scenario not
 predicted by the paper of the academic Ohad Rodeh (in spite of the
 announcement that Btrfs is based on this paper). This is why I have
 started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".

-- Jamie


Re: [PATCH] Improve support for exporting btrfs subvolumes.

2010-06-23 Thread J. Bruce Fields
On Thu, Jun 24, 2010 at 07:31:57AM +1000, Neil Brown wrote:
 On Wed, 23 Jun 2010 14:28:38 -0400
 J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:
   
   If you export two subvolumes of a btrfs filesystem, they will both be
   given the same uuid so lookups will be confused.
   blkid cannot differentiate the two, so we must use the fsid from
   statfs64 to identify the filesystem.
   
   We cannot tell if blkid or statfs is best without knowing internal
   details of the filesystem in question, so we need to encode specific
   knowledge of btrfs in mountd.  This is unfortunate.
   
   To ensure smooth handling of this and possible future changes in uuid
   generation, we add infrastructure for multiple different uuids to be
   recognised on old filehandles, but only the preferred one is used on
   new filehandles.
  
  Could you just concatenate the two (or hash them somehow)?  Or does that
  just use up too much space in the filehandle?
 
 I did consider xoring them together but came to the conclusion that would
 actually be a regression.
 If you look down at the comment that I included in uuid_by_path, you will see
 that some filesystems (e.g. VFAT) just use the major/minor device number for
 the f_fsid from statfs.  As you know that is not necessarily stable over
 reboots, while the UUID that blkid gives is.
 So if we always combined the two uuids somehow, it would be an improvement for
 btrfs, no change for e.g. ext3/XFS, and a regression for VFAT (and others I
 think).  Not good.

OK, got it.

--b.


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
 I'll try to help, but I am rather pessimistic here: working out
 algorithms is something which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

 Note that it's not necessarily a problem to have a few nodes with low
 utilisation.  Deliberate violation of the classic balancing heuristic
 is often useful for performance.[*]
 
 Ok, let's violate this, but not to the detriment of utilisation:
 internal fragmentation is a horror for file systems, the enemy #1
 (which was completely defeated in the last century, BTW).
 
   The problem you've found is only a
 real problem when there are _too many_ nodes with low utilisation.
 
 IMHO this is a problem, as we can not estimate the number of such nodes.
 Do you have any sane upper boundary for this number? I don't know of one.
 Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary.  I'm really not sure, though.  I
haven't read the code.

 [*] For example when filling a tree by inserting contiguously
 ascending keys, the classic "split into two when full" heuristic gives
 the worst possible results (50% lost space), and deliberately
 underfilling the most actively updated nodes, which is not permitted
 at all by the classic algorithm, gives denser packing in the end
 (almost zero lost space).
 
 At the end of what?

At the end of inserting keys in ascending order.  Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.

If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)
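
A tiny simulation makes the gap concrete (illustrative C with
fixed-capacity leaves, nothing btrfs-specific):

#include <stdio.h>

#define CAP  100       /* keys per leaf in this toy model */
#define KEYS 1000000L  /* ascending keys to insert */

int main(void)
{
	/* Policy A: classic split-in-half on overflow.  Under purely
	 * ascending keys every split strands a leaf at 50%. */
	long leaves_a = 1;
	int cur = 0;
	for (long k = 0; k < KEYS; k++) {
		if (cur == CAP) {
			leaves_a++;     /* left half stays at CAP/2 forever */
			cur = CAP / 2;  /* insertion continues in the right half */
		}
		cur++;
	}

	/* Policy B: let the active leaf fill to 100%, then start a
	 * fresh (initially empty) leaf instead of splitting. */
	long leaves_b = (KEYS + CAP - 1) / CAP;

	printf("split-in-half:   %ld leaves, %.0f%% average utilisation\n",
	       leaves_a, 100.0 * KEYS / ((double)leaves_a * CAP));
	printf("fill-and-append: %ld leaves, %.0f%% average utilisation\n",
	       leaves_b, 100.0 * KEYS / ((double)leaves_b * CAP));
	return 0;
}

Policy B ends up with half as many leaves at twice the utilisation --
the same 2:1 ratio as the million-vs-half-a-million node count above.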

To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the active insertion heads.

The above describes classic B-trees where updates are assumed to be
written immediately.  Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.

 I hope you don't speak about on-line defragmentation?

No, not at all.

 Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's
obvious when you think about it.  But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?

-- Jamie


RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Daniel Taylor
Just an FYI reminder.  The original test (2K files) is utterly
pathological for disk drives with 4K physical sectors, such as
those now shipping from WD, Seagate, and others.  Some of the
SSDs have larger (16K) or smaller (2K) blocks.  There is also
the issue of btrfs over RAID (which I know is not entirely
sensible, but which will happen).

The absolute minimum allocation size for data should be the same
as, and aligned with, the underlying disk block size.  If that
results in underutilization, I think that's a good thing for
performance, compared to read-modify-write cycles to update
partial disk blocks. 
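
In code terms the rule is a round-up to the physical sector size;
a hypothetical helper (sector must be a power of two):

#include <stdint.h>
#include <stdio.h>

/* Round an allocation up to the device's physical sector size so a
 * write never covers a partial sector, which would otherwise force
 * the drive into a read-modify-write cycle. */
static uint64_t align_alloc(uint64_t bytes, uint64_t sector)
{
	return (bytes + sector - 1) & ~(sector - 1);
}

int main(void)
{
	/* a 2K file still consumes a full sector on a 4K-sector drive */
	printf("%llu\n", (unsigned long long)align_alloc(2048, 4096));
	return 0;
}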



Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Mike Fedyk
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor daniel.tay...@wdc.com wrote:
 Just an FYI reminder.  The original test (2K files) is utterly
 pathological for disk drives with 4K physical sectors, such as
 those now shipping from WD, Seagate, and others.  Some of the
 SSDs have larger (16K) or smaller (2K) blocks.  There is also
 the issue of btrfs over RAID (which I know is not entirely
 sensible, but which will happen).

 The absolute minimum allocation size for data should be the same
 as, and aligned with, the underlying disk block size.  If that
 results in underutilization, I think that's a good thing for
 performance, compared to read-modify-write cycles to update
 partial disk blocks.

Block size = 4k

Btrfs packs smaller objects into the blocks in certain cases.
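
The packing referred to is the inline-extent mechanism discussed earlier
in the digest: data small enough (bounded by the max_inline mount option)
is stored directly in a metadata leaf instead of in its own 4K data block.
A hedged sketch of the decision, with made-up names rather than the
kernel's:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096
#define MAX_INLINE 1500  /* cf. mount -o max_inline=1500 earlier in the digest */

/* Hypothetical sketch: data at or below the inline threshold is
 * packed into a b-tree leaf; anything larger gets its own
 * block-aligned data extent.  The real check lives in btrfs's
 * delayed-allocation path and considers more conditions. */
static bool store_inline(uint64_t size)
{
	return size <= MAX_INLINE;
}

/* With the thread's 2K test files the data was packed inline, which
 * is what produced the fragmentation under discussion; lowering
 * max_inline caps how much leaf space each file can claim. */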