Re: Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote:

On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:

Chris Mason wrote:

On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:

I'll reproduce from your test case and provide a fix. mount -o max_inline=1500 would give us 50% usage in the worst case.

This is a very strange statement: how did you calculate this lower bound?

We want room for the extent, the inode item and the inode backref. It's a rough number that leaves some extra room. But even with a 2K item we're getting very close to 50% usage of the metadata area (minus the balancing bug you hit).

Ok, the balancing bug was interesting. What happens is we create all the inodes and directory items nicely packed into the leaves. Then delayed allocation kicks in and starts inserting the big fat inline extents. This often forces the balancing code to split a leaf twice in order to make room for the new inline extent. The double-split code wasn't balancing the bits that were left over on either side of our desired slot.

The patch below fixes the problem for me, reducing the number of leaves that have more than 2K of free space from 6% of the total down to about 74 leaves. It could be zero, but the balancing code doesn't push items around if our leaf is in the first or last slot of the parent node (complexity vs benefit tradeoff).

Nevertheless, I see leaves which are not in the first or last slot, but which are mergeable with their neighbors (test case is the same):

Correct, but it was in the first or last slot when it was balanced (I confirmed this with printk). Then the higher layers were balanced and our leaves were no longer in the first/last slot. We don't rebalance leaves when we balance level 1.
leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
	key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
	key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
	key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
	key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
	key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
	key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
	key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
	key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
	key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
	key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
	key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
	key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
	key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
	key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
	key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
	key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
	[...]

With the patch, I'm able to create 106894 files (2K each) on a 1GB FS. That doesn't sound like a huge improvement, but the default from mkfs.btrfs is to duplicate metadata. After duplication, that's about 417MB, or about 40% of the overall space. When you factor in the space that we reserve to avoid exploding on enospc and the space that we have allocated for data extents, that's not a very surprising number.
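The "close to 50% usage" arithmetic above can be sketched as follows. This is a back-of-envelope illustration, not the btrfs disk format: the 101-byte leaf header and the function name are my assumptions for the sketch, and the point is only that once an item exceeds half the usable leaf space, each leaf can carry a single item.

```c
/* Worst-case metadata utilisation when every item has the same size.
 * With a 4K leaf and a ~2K inline extent, only one item fits per
 * leaf, so utilisation bottoms out near 50%.
 * Sizes here are illustrative assumptions, not btrfs's real layout. */
static double worst_case_util(int leaf_bytes, int header_bytes,
                              int item_bytes)
{
    int usable = leaf_bytes - header_bytes;   /* room for items */
    int per_leaf = usable / item_bytes;       /* items that fit */
    return 100.0 * per_leaf * item_bytes / leaf_bytes;
}
```

For example, `worst_case_util(4096, 101, 2048)` comes out at exactly 50%, while halving the item size to 1024 bytes lifts the worst case to 75%.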
I'm still putting this patch through more testing; the double-split code is used in some difficult corners and I need to make sure I've tried all of them.

Chris, for further code review we need documents that reflect the original ideas of the balancing, etc. Could you please provide them? Obviously, I cannot do it instead of you, as I don't know those ideas.

Which part are you most interested in?

Balancing. What conditions do we want to provide? Why won't we get a utilization slump in the future? How do we need to fix things when something goes wrong? Whereas, in accordance with the single existing paper, the Btrfs design looks really broken, because:

1. the balancing condition n <= number_of_keys <= 2n+1 is not satisfied (indeed, when number_of_keys is 1, n <= 1 fails for any n > 1);
2. the trees mentioned in this paper don't allow keys of variable length.

Thanks,
Edward.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Improve support for exporting btrfs subvolumes.
On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:

If you export two subvolumes of a btrfs filesystem, they will both be given the same uuid, so lookups will be confused. blkid cannot differentiate the two, so we must use the fsid from statfs64 to identify the filesystem. We cannot tell if blkid or statfs is best without knowing internal details of the filesystem in question, so we need to encode specific knowledge of btrfs in mountd. This is unfortunate. To ensure smooth handling of this and possible future changes in uuid generation, we add infrastructure for multiple different uuids to be recognised on old filehandles, but only the preferred one is used on new filehandles.

Could you just concatenate the two (or hash them somehow)? Or does that just use up too much space in the filehandle?

--b.

Signed-off-by: NeilBrown ne...@suse.de

--
This is a substantially revised version of a patch I posted a while ago. I tried to find a way to do it without hard coding knowledge of btrfs in nfs-utils, but it isn't possible. For some filesystems f_fsid is best, for some it is worst. No way to tell the difference. This patch adds infrastructure so that if we find a better way to get a good uuid (e.g. a new syscall), we can slot it in for new filehandles, but old filehandles using the old uuid will still work.

I believe this is ready for inclusion upstream.

Thanks,
NeilBrown

diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index caef5b2..85cd829 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -170,13 +170,16 @@ void auth_unix_gid(FILE *f)
 #if USE_BLKID
 static const char *get_uuid_blkdev(char *path)
 {
+	/* We set *safe if we know that we need the
+	 * fsid from statfs too.
+	 */
 	static blkid_cache cache = NULL;
 	struct stat stb;
 	char *devname;
 	blkid_tag_iterate iter;
 	blkid_dev dev;
 	const char *type;
-	const char *val = NULL;
+	const char *val, *uuid = NULL;

 	if (cache == NULL)
 		blkid_get_cache(&cache, NULL);
@@ -193,42 +196,29 @@ static const char *get_uuid_blkdev(char *path)
 	iter = blkid_tag_iterate_begin(dev);
 	if (!iter)
 		return NULL;
-	while (blkid_tag_next(iter, &type, &val) == 0)
+	while (blkid_tag_next(iter, &type, &val) == 0) {
 		if (strcmp(type, "UUID") == 0)
+			uuid = val;
+		if (strcmp(type, "TYPE") == 0 &&
+		    strcmp(val, "btrfs") == 0) {
+			uuid = NULL;
 			break;
+		}
+	}
 	blkid_tag_iterate_end(iter);
-	return val;
+	return uuid;
 }
 #else
 #define get_uuid_blkdev(path) (NULL)
 #endif

-int get_uuid(char *path, char *uuid, int uuidlen, char *u)
+int get_uuid(const char *val, int uuidlen, char *u)
 {
 	/* extract hex digits from uuidstr and compose a uuid
 	 * of the given length (max 16), xoring bytes to make
-	 * a smaller uuid.  Then compare with uuid
+	 * a smaller uuid.
 	 */
 	int i = 0;
-	const char *val = NULL;
-	char fsid_val[17];
-
-	if (path) {
-		val = get_uuid_blkdev(path);
-		if (!val) {
-			struct statfs64 st;
-
-			if (statfs64(path, &st))
-				return 0;
-			if (!st.f_fsid.__val[0] && !st.f_fsid.__val[1])
-				return 0;
-			snprintf(fsid_val, 17, "%08x%08x",
-				 st.f_fsid.__val[0], st.f_fsid.__val[1]);
-			val = fsid_val;
-		}
-	} else {
-		val = uuid;
-	}

 	memset(u, 0, uuidlen);
 	for ( ; *val ; val++) {
@@ -252,6 +242,60 @@ int get_uuid(char *path, char *uuid, int uuidlen, char *u)
 	return 1;
 }

+int uuid_by_path(char *path, int type, int uuidlen, char *uuid)
+{
+	/* get a uuid for the filesystem found at 'path'.
+	 * There are several possible ways of generating the
+	 * uuids (types).
+	 * Type 0 is used for new filehandles, while other types
+	 * may be used to interpret old filehandles - to ensure smooth
+	 * forward migration.
+	 * We return 1 if a uuid was found (and it might be worth
+	 * trying the next type) or 0 if no more uuid types can be
+	 * extracted.
+	 */
+
+	/* Possible sources of uuid are
+	 *  - blkid uuid
+	 *  - statfs64 uuid
+	 *
+	 * On some filesystems (e.g. vfat) the statfs64 uuid is simply an
+	 * encoding of the device that the filesystem is mounted from, so
+	 * it would be very bad to use that (as device numbers change). blkid
+	 * must be preferred.
+	 * On other filesystems (e.g. btrfs) the statfs64 uuid contains
+	 * important info that the blkid uuid cannot contain: this happens
+	 * when multiple subvolumes are
[...]
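The body of get_uuid's folding loop is not shown in the quoted hunks. The sketch below is an approximation of that xor-folding step, under assumptions of mine: the name fold_uuid is hypothetical, hex digits are packed two per output byte, and input beyond uuidlen bytes wraps around and is xored in, so an arbitrarily long uuid string still yields a fixed-size filehandle component.

```c
#include <ctype.h>
#include <string.h>

/* Approximation of the xor-fold in nfs-utils' get_uuid(): take the
 * hex digits of a uuid string and fold them into uuidlen bytes
 * (names and exact packing are assumptions, not the quoted patch). */
static int fold_uuid(const char *val, int uuidlen, unsigned char *u)
{
    int i = 0;

    memset(u, 0, uuidlen);
    for (; *val; val++) {
        int c = (unsigned char)*val;

        if (!isxdigit(c))
            continue;               /* skip '-' and other separators */
        if (isdigit(c))
            c -= '0';
        else
            c = tolower(c) - 'a' + 10;
        /* two hex digits per byte; wrap around and xor when full */
        u[i / 2] ^= (unsigned char)(c << ((i & 1) ? 0 : 4));
        i = (i + 1) % (uuidlen * 2);
    }
    return 1;
}
```

With uuidlen = 1, for example, "ff" folds to the single byte 0xff, and "ffff" folds back to 0x00 because the second pair xors the first away.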
Re: [PATCH] Improve support for exporting btrfs subvolumes.
On Wed, 23 Jun 2010 14:28:38 -0400 "J. Bruce Fields" bfie...@fieldses.org wrote:

On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:

If you export two subvolumes of a btrfs filesystem, they will both be given the same uuid, so lookups will be confused. blkid cannot differentiate the two, so we must use the fsid from statfs64 to identify the filesystem. We cannot tell if blkid or statfs is best without knowing internal details of the filesystem in question, so we need to encode specific knowledge of btrfs in mountd. This is unfortunate. To ensure smooth handling of this and possible future changes in uuid generation, we add infrastructure for multiple different uuids to be recognised on old filehandles, but only the preferred one is used on new filehandles.

Could you just concatenate the two (or hash them somehow)? Or does that just use up too much space in the filehandle?

--b.

I did consider xoring them together, but came to the conclusion that would actually be a regression. If you look down at the comment that I included in uuid_by_path, you will see that some filesystems (e.g. VFAT) just use the major/minor device number for the f_fsid from statfs. As you know, that is not necessarily stable over reboots, while the UUID that blkid gives is. So if we always combined the two uuids somehow, it would be an improvement for btrfs, no change for e.g. ext3/XFS, and a regression for VFAT (and others, I think). Not good.

Thanks,
NeilBrown

Signed-off-by: NeilBrown ne...@suse.de

--
This is a substantially revised version of a patch I posted a while ago. I tried to find a way to do it without hard coding knowledge of btrfs in nfs-utils, but it isn't possible. For some filesystems f_fsid is best, for some it is worst. No way to tell the difference. This patch adds infrastructure so that if we find a better way to get a good uuid (e.g. a new syscall), we can slot it in for new filehandles, but old filehandles using the old uuid will still work.
I believe this is ready for inclusion upstream.

Thanks,
NeilBrown

[patch snipped; quoted in full earlier in the thread]
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote:

I have noticed that events in Btrfs develop by a scenario not predicted by the paper of academic Ohad Rodeh (in spite of the announcement that Btrfs is based on this paper). This is why I have started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of", not "uses the same algorithm as".

-- Jamie
Re: [PATCH] Improve support for exporting btrfs subvolumes.
On Thu, Jun 24, 2010 at 07:31:57AM +1000, Neil Brown wrote:

[...]

I did consider xoring them together, but came to the conclusion that would actually be a regression. If you look down at the comment that I included in uuid_by_path, you will see that some filesystems (e.g. VFAT) just use the major/minor device number for the f_fsid from statfs. As you know, that is not necessarily stable over reboots, while the UUID that blkid gives is. So if we always combined the two uuids somehow, it would be an improvement for btrfs, no change for e.g. ext3/XFS, and a regression for VFAT (and others, I think). Not good.

OK, got it.

--b.
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote:

I'll try to help, but I am rather pessimistic here: working out algorithms is something which doesn't like timelines..

Nonsense. Working out algorithms is just work to an algorithm designer, just like programming is work to a programmer. Sure, some things are harder than others, and there's an element of creativity. But lots of it is just cranking the handle, even in algorithm design.

Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance. [*]

Ok, let's violate this, but not to the detriment of utilisation: internal fragmentation is a horror for file systems, the enemy #1 (which was completely defeated in the last century, BTW).

The problem you've found is only a real problem when there are _too many_ nodes with low utilisation.

IMHO this is a problem, as we cannot estimate the number of such nodes. Do you have any sane upper boundary for this number? I don't know of one. Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other implicit processes to guarantee a sane upper boundary for this. The impression I'm getting from the thread is that it relies on heuristic behaviour which is usually sufficient (and perhaps a bit faster than updating statistics on the disk would be), but that it does not provide a hard upper boundary. I'm really not sure, though. I haven't read the code.

[*] For example, when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space).

At the end of what?

At the end of inserting keys in ascending order.
Just think about the classic B-tree when that is done: node[1] fills to 100% utilisation, then is split into two nodes at 50% (call them node[1] and node[2]). Node[2] fills to 100%, then splits into node[2] at 50% and node[3]. Etc. etc.: the result is a million nodes at 50% utilisation, except the last one.

If instead you fill node[1], then *don't* split it but permit node[2] to have 0% to start with, let that fill to 100%, then don't split node[2] and let node[3] start with 0%, etc. etc.: the result is half a million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in practice, especially as it's common behaviour to fill with contiguous, ascending keys in a filesystem (or blob-filled database) where data extents are part of the tree :-)

To handle the case of multiple independent ascending runs of keys being written in parallel, you generalise to maintain more than one below-50% block, near the active insertion heads.

The above describes classic B-trees, where updates are assumed to be written immediately. Things are different when there's a memory cache and delayed allocation/writing, and packing can be reorganised in memory before sending to storage.

I hope you don't speak about on-line defragmentation?

No, not at all.

Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper. Imho it's obvious when you think about it. But maybe it's not obvious to everyone - perhaps this is even the heuristic modification missing from btrfs which it would need to avoid the scenario which started this thread?

-- Jamie
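The two scenarios above can be checked with a small simulation. This is a sketch under assumptions of mine: the function names are hypothetical, leaf capacity is an arbitrary illustrative number, and "classic" is modelled as the textbook split-in-half on overflow while "packed" simply opens a fresh rightmost leaf.

```c
/* Insert `n` ascending keys into leaves holding at most `cap` keys.
 * classic_fill: on overflow, split the full leaf into two halves
 * (each completed leaf is left behind at ~50%).
 * packed_fill: on overflow, start a new, initially empty rightmost
 * leaf (each completed leaf is left behind at 100%).
 * Both return average leaf utilisation in percent. */
static double classic_fill(int n, int cap)
{
    int leaves = 1, last = 0;       /* `last` = keys in rightmost leaf */

    for (int i = 0; i < n; i++) {
        if (last == cap) {          /* full: split in half */
            leaves++;
            last = cap / 2;         /* right node starts half full */
        }
        last++;                     /* ascending key -> rightmost leaf */
    }
    return 100.0 * n / ((double)leaves * cap);
}

static double packed_fill(int n, int cap)
{
    int leaves = (n + cap - 1) / cap;   /* every leaf filled to the brim */

    return 100.0 * n / ((double)leaves * cap);
}
```

With a million keys and 100-key leaves, classic_fill comes out at roughly 50% and packed_fill at 100%, matching the half-a-million-versus-a-million-nodes argument above.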
RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K) or smaller (2K) blocks. There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen).

The absolute minimum allocation size for data should be the same as, and aligned with, the underlying disk block size. If that results in underutilization, I think that's a good thing for performance, compared to read-modify-write cycles to update partial disk blocks.
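The alignment rule argued for here is the usual power-of-two round-up. A minimal sketch (the name align_up is mine, not from any patch in this thread):

```c
#include <stdint.h>

/* Round an allocation size up to a multiple of the physical sector
 * size, so no write straddles a partial sector and forces the drive
 * into a read-modify-write cycle. `block` must be a power of two. */
static uint64_t align_up(uint64_t size, uint64_t block)
{
    return (size + block - 1) & ~(block - 1);
}
```

So on a 4K-sector drive a 2K file would still consume a full 4K allocation: align_up(2048, 4096) is 4096, and the "underutilization" Daniel mentions is exactly that padding.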
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor daniel.tay...@wdc.com wrote:

[...]

The absolute minimum allocation size for data should be the same as, and aligned with, the underlying disk block size. If that results in underutilization, I think that's a good thing for performance, compared to read-modify-write cycles to update partial disk blocks.

Block size = 4k.

Btrfs packs smaller objects into the blocks in certain cases.
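A hypothetical back-of-envelope of what that packing buys, under assumptions of mine (fixed 4K blocks, uniform file sizes, simple back-to-back packing — not the btrfs disk format):

```c
/* Bytes of disk consumed by `n` small files of `sz` bytes each,
 * when every file gets its own 4K block versus when files are
 * packed back-to-back into shared 4K blocks.
 * Illustrative model only, not btrfs's actual item layout. */
enum { BLOCK = 4096 };

static long private_blocks(long n, long sz)
{
    (void)sz;                       /* each file costs a whole block */
    return n * BLOCK;
}

static long packed_blocks(long n, long sz)
{
    long per_block = BLOCK / sz;    /* files sharing one block */
    long blocks = (n + per_block - 1) / per_block;

    return blocks * BLOCK;
}
```

For 1000 files of 2K, the private scheme burns 4,096,000 bytes while packing needs 2,048,000 — which is why inline packing can beat block-granular allocation for the 2K-file test, at the cost of the alignment concerns raised above.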