[PATCH] Btrfs: cleanup some BUG_ON()
This patch changes some BUG_ON() calls into error returns. (But most callers still use BUG_ON().)

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/ctree.c       |  3 ++-
 fs/btrfs/disk-io.c     |  5 -
 fs/btrfs/extent-tree.c | 25 ++---
 fs/btrfs/file-item.c   |  3 ++-
 fs/btrfs/inode-map.c   |  3 ++-
 fs/btrfs/ioctl.c       |  5 -
 fs/btrfs/root-tree.c   |  6 --
 fs/btrfs/transaction.c | 12 +---
 fs/btrfs/tree-log.c    | 15 +--
 9 files changed, 54 insertions(+), 23 deletions(-)

diff -urNp linux-2.6.38/fs/btrfs/ctree.c linux-2.6.38.new/fs/btrfs/ctree.c
--- linux-2.6.38/fs/btrfs/ctree.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/ctree.c	2011-03-23 11:28:09.0 +0900
@@ -3840,7 +3840,8 @@ int btrfs_insert_item(struct btrfs_trans
 	unsigned long ptr;
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path)
+		return -ENOMEM;
 	ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
 	if (!ret) {
 		leaf = path->nodes[0];
diff -urNp linux-2.6.38/fs/btrfs/disk-io.c linux-2.6.38.new/fs/btrfs/disk-io.c
--- linux-2.6.38/fs/btrfs/disk-io.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/disk-io.c	2011-03-23 11:44:39.0 +0900
@@ -1160,7 +1160,10 @@ struct btrfs_root *btrfs_read_fs_root_no
 			 root, fs_info, location->objectid);
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path) {
+		kfree(root);
+		return ERR_PTR(-ENOMEM);
+	}
 	ret = btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
 	if (ret == 0) {
 		l = path->nodes[0];
diff -urNp linux-2.6.38/fs/btrfs/extent-tree.c linux-2.6.38.new/fs/btrfs/extent-tree.c
--- linux-2.6.38/fs/btrfs/extent-tree.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/extent-tree.c	2011-03-23 11:28:09.0 +0900
@@ -5444,7 +5444,8 @@ static int alloc_reserved_file_extent(st
 	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path)
+		return -ENOMEM;
 
 	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
@@ -6438,10 +6439,14 @@ int btrfs_drop_subtree(struct btrfs_tran
 	BUG_ON(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID);
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path)
+		return -ENOMEM;
 
 	wc = kzalloc(sizeof(*wc), GFP_NOFS);
-	BUG_ON(!wc);
+	if (!wc) {
+		btrfs_free_path(path);
+		return -ENOMEM;
+	}
 
 	btrfs_assert_tree_locked(parent);
 	parent_level = btrfs_header_level(parent);
@@ -6899,7 +6904,11 @@ static noinline int get_new_locations(st
 	}
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path) {
+		if (exts != *extents)
+			kfree(exts);
+		return -ENOMEM;
+	}
 
 	cur_pos = extent_key->objectid - offset;
 	last_byte = extent_key->objectid + extent_key->offset;
@@ -7423,7 +7432,8 @@ static noinline int replace_extents_in_l
 	int ret;
 
 	new_extent = kmalloc(sizeof(*new_extent), GFP_NOFS);
-	BUG_ON(!new_extent);
+	if (!new_extent)
+		return -ENOMEM;
 
 	ref = btrfs_lookup_leaf_ref(root, leaf->start);
 	BUG_ON(!ref);
@@ -7627,7 +7637,8 @@ static noinline int init_reloc_tree(stru
 		return 0;
 
 	root_item = kmalloc(sizeof(*root_item), GFP_NOFS);
-	BUG_ON(!root_item);
+	if (!root_item)
+		return -ENOMEM;
 
 	ret = btrfs_copy_root(trans, root, root->commit_root, &eb,
 			      BTRFS_TREE_RELOC_OBJECTID);
@@ -7653,7 +7664,7 @@ static noinline int init_reloc_tree(stru
 
 	reloc_root = btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
 						 &root_key);
-	BUG_ON(!reloc_root);
+	BUG_ON(IS_ERR(reloc_root));
 	reloc_root->last_trans = trans->transid;
 	reloc_root->commit_root = NULL;
 	reloc_root->ref_tree = &root->fs_info->reloc_ref_tree;
diff -urNp linux-2.6.38/fs/btrfs/file-item.c linux-2.6.38.new/fs/btrfs/file-item.c
--- linux-2.6.38/fs/btrfs/file-item.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/file-item.c	2011-03-23 11:28:09.0 +0900
@@ -48,7 +48,8 @@ int btrfs_insert_file_extent(struct btrf
 	struct extent_buffer *leaf;
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path)
+		return -ENOMEM;
 	file_key.objectid = objectid;
 	file_key.offset = pos;
 	btrfs_set_key_type(&file_key, BTRFS_EXTENT_DATA_KEY);
diff -urNp linux-2.6.38/fs/btrfs/inode-map.c
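The pattern in hunks like btrfs_drop_subtree() and get_new_locations() — a later allocation failure must unwind the earlier allocations before returning — can be sketched in user space. Here alloc_first()/alloc_second() and fail_second are made-up stand-ins for btrfs_alloc_path()/kzalloc(), not btrfs code:

```c
#include <errno.h>
#include <stdlib.h>

/* fail_second lets a caller simulate the second allocation failing. */
static int fail_second;

static void *alloc_first(void)  { return malloc(16); }
static void *alloc_second(void) { return fail_second ? NULL : malloc(16); }

/* Mirrors the btrfs_drop_subtree() hunk: if the second allocation
 * fails, the first must be released before returning -ENOMEM, or the
 * error path leaks it. */
static int do_op(void)
{
	void *path = alloc_first();
	void *wc;

	if (!path)
		return -ENOMEM;

	wc = alloc_second();
	if (!wc) {
		free(path);	/* btrfs_free_path(path) in the patch */
		return -ENOMEM;
	}

	/* ... the real work would happen here ... */
	free(wc);
	free(path);
	return 0;
}
```

The same unwinding obligation is why the patch's error paths grow with each additional allocation in a function.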
read-only subvolumes?
Hi all,

When I am creating subvolumes I get this strange behavior. If I create a subvolume with a name longer than 4 characters it is read-only; if the name is shorter than 5 characters the subvolume is writeable as expected. I think it started when I upgraded to kernel version 2.6.38 (I do not create subvolumes on a regular basis). I will compile one of the latest 2.6.37 kernels to see whether the problem exists there, too. Another interesting point is that previously created subvolumes are not affected.

Thanks,
Andreas Philipp

thor btrfs # btrfs subvolume create 123456789
Create subvolume './123456789'
thor btrfs # touch 123456789/lsdkfj
touch: cannot touch `123456789/lsdkfj': Read-only file system
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: read-only subvolumes?
On Wed, Mar 23, 2011 at 3:21 PM, Andreas Philipp philipp.andr...@gmail.com wrote:
> I think it is since I upgraded to kernel version 2.6.38 (I do not
> create subvolumes on a regular basis.).
>
> thor btrfs # btrfs subvolume create 123456789
> Create subvolume './123456789'
> thor btrfs # touch 123456789/lsdkfj
> touch: cannot touch `123456789/lsdkfj': Read-only file system

It works on my system:

# touch test1
# btrfs su cr 123456789
Create subvolume './123456789'
# touch 123456789/lsdkfj
# uname -a
Linux HP 2.6.38-020638-generic #201103151303 SMP Tue Mar 15 14:33:40 UTC 2011 i686 GNU/Linux

--
Fajar
Re: read-only subvolumes?
> Hi all,
>
> When I am creating subvolumes I get this strange behavior. If I create
> a subvolume with a name longer than 4 characters it is read-only; if
> the name is shorter than 5 characters the subvolume is writeable as
> expected. I think it started when I upgraded to kernel version 2.6.38
> (I do not create subvolumes on a regular basis). I will compile one of
> the latest 2.6.37 kernels to see whether the problem exists there,
> too. Another interesting point is that previously created subvolumes
> are not affected.
>
> Thanks,
> Andreas Philipp
>
> thor btrfs # btrfs subvolume create 123456789
> Create subvolume './123456789'
> thor btrfs # touch 123456789/lsdkfj
> touch: cannot touch `123456789/lsdkfj': Read-only file system

This is really odd, but I can't reproduce it. I created a btrfs filesystem on a 2.6.37 kernel, rebooted to the latest 2.6.38+, and tried the same procedure as you did, but nothing bad happened.
Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi, Kitayama-san

On Wed, 23 Mar 2011 13:19:18 +0900, Itaru Kitayama wrote:
> On Wed, 23 Mar 2011 12:00:38 +0800 Miao Xie mi...@cn.fujitsu.com wrote:
>> I am testing the new version, in which I fixed the slab shrinker
>> problem reported by Chris. In the new version, the delayed node is
>> removed before the relative inode is moved into the unused_inode list
>> (the slab shrinker will reclaim inodes in this list). Maybe this
>> method can also fix this bug. So could you tell me the steps to
>> reproduce it, or the options of mkfs and mount? I will check whether
>> the new patch fixes this bug or not.
>
> I used the default mkfs options for $TEST_DEV and enabled the
> space_cache and the compress=lzo options upon mounting the partition.

Unfortunately, I can trigger this warning. But by analyzing the following information,

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-xie+ #122
---------------------------------------------------------
kswapd0/49 just changed the state of lock:
 (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (delayed_node->mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.
[SNIP]
  RECLAIM_FS-ON-W at:
    [81074292] mark_held_locks+0x52/0x70
    [81074354] lockdep_trace_alloc+0xa4/0xc2
    [8110f206] __kmalloc+0x7f/0x154
    [811c2c21] kzalloc+0x14/0x16
    [811c5e83] cache_block_group+0x106/0x238
    [811c7069] find_free_extent+0x4e2/0xa86
    [811c76c1] btrfs_reserve_extent+0xb4/0x142
    [811c78b6] btrfs_alloc_free_block+0x167/0x2af
    [811be610] __btrfs_cow_block+0x103/0x346
    [811bedb8] btrfs_cow_block+0x101/0x110
    [811c05d8] btrfs_search_slot+0x143/0x513
    [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
    [81212405] btrfs_update_delayed_inode+0x75/0x135
    [8121340d] btrfs_run_delayed_items+0xd6/0x131
    [811d64d7] btrfs_commit_transaction+0x28b/0x668
    [811ba786] btrfs_sync_fs+0x6b/0x70
    [81140265] __sync_filesystem+0x6b/0x83
    [81140347] sync_filesystem+0x4c/0x50
    [8111fc56] generic_shutdown_super+0x27/0xd7
    [8111fd5b] kill_anon_super+0x16/0x54
    [8111effd] deactivate_locked_super+0x26/0x46
    [8111f495] deactivate_super+0x45/0x4a
    [81135962] mntput_no_expire+0xd6/0x104
    [81136a87] sys_umount+0x2c1/0x2ec
    [81002ddb] system_call_fastpath+0x16/0x1b

we found GFP_KERNEL was passed into kzalloc(). I think this flag triggers the above lockdep warning. The attached patch, which is against the delayed items operation patch, may fix this problem. Could you test it for me?

Thanks
Miao

From f84daee1d2060beae945a2774cda7af2ef7f3e61 Mon Sep 17 00:00:00 2001
From: Miao Xie mi...@cn.fujitsu.com
Date: Wed, 23 Mar 2011 16:01:16 +0800
Subject: [PATCH] btrfs: use GFP_NOFS instead of GFP_KERNEL

In the filesystem context, we must allocate memory with GFP_NOFS, or we may start another filesystem operation and trigger a lockdep warning.
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..fe50cff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	if (load_cache_only)
 		return 0;
 
-	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	BUG_ON(!caching_ctl);
 
 	INIT_LIST_HEAD(&caching_ctl->list);
-- 
1.7.3.1
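The reasoning behind the one-line patch can be illustrated with a toy user-space model (this is not kernel code — alloc_would_deadlock() is an invented stand-in for the page allocator's reclaim decision): GFP_KERNEL allows the allocator to recurse into filesystem reclaim, and if the caller already holds a btrfs lock such as the delayed node mutex, that recursion is exactly the inversion lockdep reports. GFP_NOFS forbids entering the filesystem from reclaim.

```c
#include <stdbool.h>

/* Toy model of the allocator's "may I enter filesystem reclaim?"
 * decision. delayed_node_mutex_held simulates the lock the btrfs
 * caller holds when cache_block_group() runs. */
enum gfp { GFP_KERNEL, GFP_NOFS };

static bool delayed_node_mutex_held;

/* Returns true if the (simulated) allocation could deadlock: the
 * allocator would re-enter the fs while an fs lock is already held. */
static bool alloc_would_deadlock(enum gfp flags)
{
	bool may_enter_fs = (flags == GFP_KERNEL);

	return may_enter_fs && delayed_node_mutex_held;
}
```

Under this model, switching the kzalloc() call from GFP_KERNEL to GFP_NOFS makes the allocation safe regardless of which locks the caller holds.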
Re: read-only subvolumes?
On 23.03.2011 10:25, Li Zefan wrote:
>> Hi all,
>>
>> When I am creating subvolumes I get this strange behavior. If I create
>> a subvolume with a name longer than 4 characters it is read-only; if
>> the name is shorter than 5 characters the subvolume is writeable as
>> expected. I think it started when I upgraded to kernel version 2.6.38
>> (I do not create subvolumes on a regular basis). I will compile one of
>> the latest 2.6.37 kernels to see whether the problem exists there,
>> too. Another interesting point is that previously created subvolumes
>> are not affected.
>>
>> thor btrfs # btrfs subvolume create 123456789
>> Create subvolume './123456789'
>> thor btrfs # touch 123456789/lsdkfj
>> touch: cannot touch `123456789/lsdkfj': Read-only file system
>
> This is really odd, but I can't reproduce it. I created a btrfs
> filesystem on a 2.6.37 kernel, rebooted to the latest 2.6.38+, and
> tried the same procedure as you did, but nothing bad happened.

While playing around I found the following three new points:

- Now the length of the subvolume name does not matter. So even the ones with short names are read-only.
- It also happens on a freshly created btrfs filesystem.
- If I take a snapshot of an old (= writeable) subvolume, it is writeable.

I will now reboot into 2.6.37.4, check there, and then report back.
Thanks,
Andreas Philipp
Re: [PATCH] btrfs scrub: make fixups sync, don't reuse fixup bios
Hi Ilya,

On 18.03.2011 17:21, Ilya Dryomov wrote:
> [CC'ing the list to make development public, the patch is against
> Arne's tree at kernel.org]
>
> Below is a quite long diff, the primary purpose of which is to make
> fixups totally sync. They are already sync for checksum failures; this
> patch makes them sync for the EIO case as well. This is required for
> integrating drive swap, since the idea is that I have to fix up
> everything first and then write the correct data to a new device.
> Obviously, to do that, fixups have to be sync. EIO is supposed to
> happen quite rarely, so the performance loss from making EIO sync
> should be minimal.
>
> The other significant change is that fixups now share the buffer with
> the parent sbio. So instead of allocating a page to do a fixup, we
> just grab the page from the sbio buffer. This is also required for
> drive swap, since when all the fixups are done, the sbio buffer will
> contain the right data, which I can write to a new device using a
> single bio. It doesn't affect scrub at all.
>
> The third change is that fixup bios are no longer reused. This is a
> change that I think should be added even if you don't like the rest.
> You were right in the first place: bios cannot be reused that simply,
> and since it's just a one-page bio, it's better to allocate it each
> time we need it. Since fixups are now sync, I don't embed spage into a
> fixup structure. Instead, a pointer is used.

thanks for the patch. The sync path looks good to me. I'd suggest to see if you can get rid of struct fixup completely. Also there is no need to increment scrubs_running anymore inside the recheck code.

Your patch reorders some functions, which makes it harder to read. Could you separate that into two steps?

Thanks,
Arne

> This is a preliminary version, it's meant to get us on the same page.
> But if you can give this code some quick testing on real hardware with
> your test cases, I'd appreciate that. I also plan to fix EIO handling
> in scrub_checksum, but that will happen only next week.
My disk should arrive Monday-Tuesday + a couple of days to play with it.

> I may have forgotten something, so ping me on IRC any time. Also
> disregard my debugging output at the end.
>
> Thanks,
>
> Ilya
>
> ---
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 85a4d4b..f3fe5a5 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -69,9 +69,6 @@ static int scrub_checksum_tree_block(struct scrub_dev *sdev,
>  			struct scrub_page *spag, u64 logical, void *buffer);
>  static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
> -static void scrub_recheck_end_io(struct bio *bio, int err);
> -static void scrub_fixup_worker(scrub_work_t *work);
> -static void scrub_fixup(struct scrub_fixup *fixup);
>  
>  #define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
>  #define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
>  
> @@ -117,13 +114,10 @@ struct scrub_dev {
>  
>  struct scrub_fixup {
>  	struct scrub_dev	*sdev;
> -	struct bio		*bio;
>  	u64			logical;
>  	u64			physical;
> -	struct scrub_page	spag;
> -	scrub_work_t		work;
> -	int			err;
> -	int			recheck;
> +	struct page		*page;
> +	struct scrub_page	*spag;
>  };
>  
>  static void scrub_free_csums(struct scrub_dev *sdev)
> @@ -230,115 +224,19 @@ nomem:
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> -/*
> - * scrub_recheck_error gets called when either verification of the page
> - * failed or the bio failed to read, e.g. with EIO. In the latter case,
> - * recheck_error gets called for every page in the bio, even though only
> - * one may be bad
> - */
> -static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
> -{
> -	struct scrub_dev *sdev = sbio->sdev;
> -	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> -	struct bio *bio = NULL;
> -	struct page *page = NULL;
> -	struct scrub_fixup *fixup = NULL;
> -	int ret;
> -
> -	/*
> -	 * while we're in here we do not want the transaction to commit.
> -	 * To prevent it, we increment scrubs_running. scrub_pause will
> -	 * have to wait until we're finished
> -	 * we can safely increment scrubs_running here, because we're
> -	 * in the context of the original bio which is still marked in_flight
> -	 */
> -	atomic_inc(&fs_info->scrubs_running);
> -
> -	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
> -	if (!fixup)
> -		goto malloc_error;
> -
> -	fixup->logical = sbio->logical + ix * PAGE_SIZE;
> -	fixup->physical = sbio->physical + ix * PAGE_SIZE;
> -	fixup->spag = sbio->spag[ix];
> -	fixup->sdev = sdev;
> -
> -	bio = bio_alloc(GFP_NOFS, 1);
> -	if (!bio)
> -		goto malloc_error;
> -	bio->bi_private = fixup;
> -	bio->bi_size = 0;
> -	bio->bi_bdev = sdev->dev->bdev;
> -
Re: read-only subvolumes?
On 23.03.2011 11:07, Andreas Philipp wrote:
> On 23.03.2011 10:25, Li Zefan wrote:
>>> Hi all,
>>>
>>> When I am creating subvolumes I get this strange behavior. If I
>>> create a subvolume with a name longer than 4 characters it is
>>> read-only; if the name is shorter than 5 characters the subvolume is
>>> writeable as expected. I think it started when I upgraded to kernel
>>> version 2.6.38 (I do not create subvolumes on a regular basis). I
>>> will compile one of the latest 2.6.37 kernels to see whether the
>>> problem exists there, too. Another interesting point is that
>>> previously created subvolumes are not affected.
>>>
>>> thor btrfs # btrfs subvolume create 123456789
>>> Create subvolume './123456789'
>>> thor btrfs # touch 123456789/lsdkfj
>>> touch: cannot touch `123456789/lsdkfj': Read-only file system
>>
>> This is really odd, but I can't reproduce it. I created a btrfs
>> filesystem on a 2.6.37 kernel, rebooted to the latest 2.6.38+, and
>> tried the same procedure as you did, but nothing bad happened.
>
> While playing around I found the following three new points:
>
> - Now the length of the subvolume name does not matter. So even the
>   ones with short names are read-only.
> - It also happens on a freshly created btrfs filesystem.
> - If I take a snapshot of an old (= writeable) subvolume, it is
>   writeable.
>
> I will now reboot into 2.6.37.4, check there, and then report back.

Well, this was fast. Everything works as expected on 2.6.37.4. See the output of uname -a for the exact kernel version below. I will now reboot into a differently configured kernel version 2.6.38 and check whether the problem is gone there.
Thanks,
Andreas Philipp

thor ~ # uname -a
Linux thor 2.6.37.4 #2 SMP Wed Mar 23 10:25:54 CET 2011 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
Re: [PATCH RFC] btrfs: Simplify locking
Hello, Chris.

On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
> Ok, the impact of this is really interesting. If we have very short
> waits where there is no IO at all, this patch tends to lose. I ran
> with dbench 10 and got about 20% slower tput. But, if we do any IO at
> all it wins by at least that much or more.
>
> I think we should take this patch and just work on getting rid of the
> scheduling with the mutex held where possible.

I see.

> Tejun, could you please send the mutex_tryspin stuff in? If we can get
> a sob for that I can send the whole thing.

I'm not sure whether mutex_tryspin() is justified at this point, and, even if so, how to proceed with it. Maybe we want to make mutex_trylock() perform owner spin by default without introducing a new API. Given that the difference between SIMPLE and SPIN is small, I think it would be best to simply use mutex_trylock() for now. It's not gonna make much difference either way.

How do you want to proceed? I can prep patches doing the following.

- Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.
- Drop locking.c and make the lock functions simple wrappers around mutex operations. This makes blocking/unblocking noops.
- Remove all blocking/unblocking calls along with the API.
- Remove the locking wrappers and use the mutex API directly.

What do you think?

Thanks.

--
tejun
Re: [PATCH RFC] btrfs: Simplify locking
Excerpts from Tejun Heo's message of 2011-03-23 06:46:14 -0400:
> Hello, Chris.
>
> On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
>> Ok, the impact of this is really interesting. If we have very short
>> waits where there is no IO at all, this patch tends to lose. I ran
>> with dbench 10 and got about 20% slower tput. But, if we do any IO at
>> all it wins by at least that much or more.
>>
>> I think we should take this patch and just work on getting rid of the
>> scheduling with the mutex held where possible.
>
> I see.
>
>> Tejun, could you please send the mutex_tryspin stuff in? If we can
>> get a sob for that I can send the whole thing.
>
> I'm not sure whether mutex_tryspin() is justified at this point, and,
> even if so, how to proceed with it. Maybe we want to make
> mutex_trylock() perform owner spin by default without introducing a
> new API.

I'll benchmark without it, but I think the cond_resched is going to have a pretty big impact. I'm digging up the related benchmarks I used during the initial adaptive spin work.

> Given that the difference between SIMPLE and SPIN is small, I think it
> would be best to simply use mutex_trylock() for now. It's not gonna
> make much difference either way.

mutex_trylock is a good start.

> How do you want to proceed? I can prep patches doing the following.
>
> - Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.
> - Drop locking.c and make the lock functions simple wrappers around
>   mutex operations. This makes blocking/unblocking noops.
> - Remove all blocking/unblocking calls along with the API.

I'd like to keep the blocking/unblocking calls for one release. I'd like to finally finish off my patches that do concurrent reads.

> - Remove the locking wrappers and use the mutex API directly.

I'd also like to keep the wrappers until the concurrent reader locking is done.

> What do you think?

Thanks for all the work.
-chris
Re: efficiency of btrfs cow
On 11-03-06 11:06 AM, Calvin Walton wrote:
> To see exactly what's going on, you should use the btrfs filesystem df
> command to see how space is being allocated for data and metadata
> separately:

OK. So with an empty filesystem, before my first copy (i.e. the base on which the next copy will CoW from), df reports:

Filesystem                           1K-blocks  Used  Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test  922746880    56  922746824   1% /mnt/btrfs-test

and btrfs fi df reports:

Data: total=8.00MB, used=0.00
Metadata: total=1.01GB, used=24.00KB
System: total=12.00MB, used=4.00KB

After the first copy, df and btrfs fi df report:

Filesystem                           1K-blocks       Used  Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test  922746880  121402328  801344552  14% /mnt/btrfs-test

root@linux:/mnt/btrfs-test# cat .snapshots/monthly.22/metadata/btrfs_df-stop
Data: total=110.01GB, used=109.26GB
Metadata: total=5.01GB, used=3.26GB
System: total=12.00MB, used=24.00KB

So it's clear that total usage (as reported by df) was 121,402,328KB, but Metadata has two values: total=5.01GB, used=3.26GB. What's the difference between total and used? And for that matter, what's the difference between the total and used for Data (total=110.01GB, used=109.26GB)?

Even if I take the largest values (i.e. the total values) for Data and Metadata (each converted to KB first) and add them up, they come to 120,607,211.52KB, which is not quite the 121,402,328KB that df reports. There is a 795,116.48KB discrepancy.

In any case, which value from btrfs fi df should I be subtracting from df's accounting to get a real accounting of the amount of data used?

Cheers,
b.
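The arithmetic in the question can be re-checked with a small stand-alone calculation (a user-space sketch, not btrfs code; the constants are copied straight from the df and btrfs fi df output above):

```c
/* Convert the "total" values reported by `btrfs fi df` to KB and
 * compare against df's Used column (all figures from the mail). */
static double data_total_kb(void) { return 110.01 * 1024.0 * 1024.0; }
static double meta_total_kb(void) { return 5.01 * 1024.0 * 1024.0; }

/* Data total + Metadata total, in KB: 120,607,211.52 KB */
static double fi_df_totals_kb(void)
{
	return data_total_kb() + meta_total_kb();
}

/* df reported 121,402,328 KB used after the copy, so the unexplained
 * gap is 795,116.48 KB, as stated in the mail. */
static double discrepancy_kb(void)
{
	return 121402328.0 - fi_df_totals_kb();
}
```

This confirms the poster's numbers: the Data and Metadata totals alone do not add up to df's Used figure.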
[RFC] Tree fragmentation and prefetching
While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise, I did some prototyping on how to improve on it.

The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute them over all disks. To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree:   140s
   reading both trees:  324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree:   20.4s
   reading both trees:  25.7s

The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled 28%. It is created with the current git tree + the round robin patch and filled with

   fs_mark -D 512 -t 16 -n 4096 -F -S0

The 'normal' read is done by enumerating the leaves with btrfs_next_leaf() and path->reada=2. Both trees are enumerated one after the other.

The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs.

While it is very promising to implement it for scrub, I think a more general interface which can be used for every enumeration would be beneficial. Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories, etc. I'd imagine an interface along the lines of

   int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
   int btrfs_readahead_add(struct btrfs_root *root,
                           struct btrfs_key *start,
                           struct btrfs_key *end,
                           struct btrfs_reada_ctx *reada);
   void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'.
Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decision what to read instead of the fixed start/end range.

For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion, or a waiting function that wakes up on each completion with the information which pages just completed. One way to achieve this would be to add a hook, but I gladly take any implementation hints.

--
Arne
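The calling pattern of the proposed interface can be mocked up in user space. Only the three function names and signatures come from the RFC; the struct layouts and bodies below are invented stubs purely to illustrate how a caller would queue both trees before waiting:

```c
/* Opaque placeholders for the kernel types named in the RFC. */
struct btrfs_root { int dummy; };
struct btrfs_key { unsigned long long objectid; };
struct btrfs_reada_ctx { int pending; };

static int btrfs_readahead_init(struct btrfs_reada_ctx *reada)
{
	reada->pending = 0;
	return 0;
}

/* Queue one key range; a real implementation would start reading the
 * covered tree blocks, sorted by physical position across all disks. */
static int btrfs_readahead_add(struct btrfs_root *root,
			       struct btrfs_key *start,
			       struct btrfs_key *end,
			       struct btrfs_reada_ctx *reada)
{
	(void)root; (void)start; (void)end;
	return ++reada->pending;
}

static void btrfs_readahead_wait(struct btrfs_reada_ctx *reada)
{
	/* A real implementation would block until all queued reads finish. */
	reada->pending = 0;
}

/* Scrub-style caller: queue both trees before waiting, so the reads of
 * the extent tree and the csum tree can be folded together, as in the
 * "reading both trees" result above. */
static void prefetch_both_trees(struct btrfs_root *extent_root,
				struct btrfs_root *csum_root,
				struct btrfs_key *start,
				struct btrfs_key *end)
{
	struct btrfs_reada_ctx reada;

	btrfs_readahead_init(&reada);
	btrfs_readahead_add(extent_root, start, end, &reada);
	btrfs_readahead_add(csum_root, start, end, &reada);
	btrfs_readahead_wait(&reada);
}
```

Issuing both `btrfs_readahead_add()` calls before the single wait is what lets the two enumerations share one sorted pass over the disks.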
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 09:57:56 +0800, Miao Xie wrote:
> On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
>>>> I also think that code is racing with the code that frees delayed
>>>> nodes, but haven't yet triggered my debugging printks to prove
>>>> either one.
>>>
>>> We free delayed nodes when we want to destroy the inode; at that
>>> time, just one task, which is destroying the inode, can access the
>>> delayed nodes, so I think ACCESS_ONCE() is enough. What do you think?
>>
>> Great, I see what you mean. The bigger problem right now is that we
>> may do a lot of operations in destroy_inode(), which can block the
>> slab shrinkers on our metadata operations. That same stress.sh -n 50
>> run is running into OOM. So we need to rework the part where the final
>> free is done. We could keep a ref on the inode until the delayed items
>> are complete, or we could let the inode go and make a way to look up
>> the delayed node when the inode is read.
>
> I find the slab shrinkers just reclaim inodes which are in the
> inode_unused list, so if we free delayed nodes before moving the inode
> into the inode_unused list, we can avoid blocking the slab shrinker.
> Thus maybe we can fix the above problem by moving
> btrfs_remove_delayed_node() from btrfs_destroy_inode() to
> btrfs_drop_inode(). What do you think?

This method is not good, because we may do delayed insertion/deletion and free the delayed node when we release the inode, but we may access its delayed node again soon. After reading the lockdep message reported by Kitayama, I guess the reason for the slab shrinker blocking may not be that we do a lot of operations in destroy_inode(); maybe the real reason is that we pass GFP_KERNEL into kzalloc() (in cache_block_group()), which makes the slab shrinkers hang. (I haven't triggered OOM by running stress.sh so far, so I can't locate this bug.)

Thanks
Miao

>> I'll read more today.
-chris
Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes
On 22.03.2011 17:38, David Sterba wrote:
>> David Sterba wrote:
>>> On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
>>>> This is the main scrub code.
>>>
>>> sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3
>>> allocation in scrub_setup_dev()
>>
>> Is this a problem? There are only few allocations of it, one per
>> device.
>
> High order allocations may fail when memory is fragmented, and should
> be avoided when possible. (And it is possible here: allocate each
> 'struct scrub_bio' separately and fill the bios array with pointers.)
> The scrub ioctl may fail to start until an order-3 allocation becomes
> available.

I updated this in my git repo.

Thanks,
Arne
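David's suggestion — keep an array of pointers in the containing struct and allocate each `struct scrub_bio` separately, so no single allocation exceeds one page — can be sketched like this. User-space calloc/free stand in for the kernel allocators, and the struct contents are invented for illustration; only the idea of replacing the embedded array with pointers comes from the mail:

```c
#include <stdlib.h>

#define SCRUB_BIOS_PER_DEV 16

struct scrub_bio { char payload[1024]; };	/* stand-in, ~1k each */

struct scrub_dev {
	/* Pointers only: the containing struct stays small (a single
	 * order-0 allocation) instead of embedding
	 * SCRUB_BIOS_PER_DEV * sizeof(struct scrub_bio) directly. */
	struct scrub_bio *bios[SCRUB_BIOS_PER_DEV];
};

static void scrub_free_dev(struct scrub_dev *sdev)
{
	for (int i = 0; i < SCRUB_BIOS_PER_DEV; i++)
		free(sdev->bios[i]);	/* free() tolerates NULL slots */
	free(sdev);
}

static struct scrub_dev *scrub_setup_dev(void)
{
	struct scrub_dev *sdev = calloc(1, sizeof(*sdev));

	if (!sdev)
		return NULL;
	for (int i = 0; i < SCRUB_BIOS_PER_DEV; i++) {
		sdev->bios[i] = calloc(1, sizeof(struct scrub_bio));
		if (!sdev->bios[i]) {
			scrub_free_dev(sdev);
			return NULL;
		}
	}
	return sdev;
}
```

Each element is now its own small allocation, so a fragmented heap that cannot satisfy one large contiguous request can still satisfy sixteen small ones.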
Re: [RFC] Tree fragmentation and prefetching
On 23.03.2011 14:06, Arne Jansen wrote: While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise, I did some prototyping on how to improve on it. The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute them over all disks. To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s (sorry, this number is wrong; it should be 384s, just the sum of both modulo rounding errors)

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled to 28%. It is created with the current git tree + the round-robin patch and filled with fs_mark -D 512 -t 16 -n 4096 -F -S0. The 'normal' read is done by enumerating the leaves by btrfs_next_leaf() with path->reada=2. Both trees are enumerated one after the other. The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs. While it is very promising to implement this for scrub, I think a more general interface which can be used for every enumeration would be beneficial. Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories, etc. I'd imagine an interface along the lines of

int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'. 
Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decision of what to read instead of the fixed start/end range. For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion, or a waiting function that wakes up on each completion with the information about which pages just completed. One way to achieve this would be to add a hook, but I'll gladly take any implementation hints. -- Arne
[RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
Hello, guys. I've been playing with locking in btrfs, which has developed custom locking to avoid excessive context switches in its btree implementation. Generally, doing away with the custom implementation and just using the mutex adaptive owner spinning seems better; however, there's an interesting distinction in the custom implementation of trylock. It distinguishes between simple trylock and tryspin, where the former just tries once and then fails while the latter does some spinning before giving up. Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial, and it seems so, at least for btrfs anyway. The following results are from a dbench 50 run on a two-socket eight-core Opteron machine with 4GiB of memory and an OCZ Vertex SSD. During the run, the disk stays mostly idle and all CPUs are fully occupied, so the difference in locking performance becomes quite visible. SIMPLE is with the locking simplification patch[1] applied, i.e. it basically just uses mutex. SPIN is with the patch at the end of this message applied on top: mutex_trylock() uses adaptive spinning.

        USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
SIMPLE  61107  354977  217   8099529  845.100 MB/sec
SPIN    63140  364888  214   6840527  879.077 MB/sec

I've been running various different configs since yesterday and the adaptive spinning trylock consistently posts higher throughput. The amount of difference varies, but it outperforms consistently. In general, I think using adaptive spinning on trylock makes sense, as trylock failure usually leads to a costly unlock-relock sequence. Also, contrary to the comment, the adaptive spinning doesn't seem to check whether there are pending waiters or not. Is this intended, or did the test get lost somehow? Thanks. 
NOT-Signed-off-by: Tejun Heo t...@kernel.org
---
 kernel/mutex.c | 98 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 61 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,33 @@ void __sched mutex_unlock(struct mutex *
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * we find that there are no pending waiters and the lock owner is
+ * currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-		    unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn't track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +171,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -190,7 +180,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we're an RT task that will
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote: Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial and it seems so, at least for btrfs anyway. Hmm. Seems reasonable to me. The patch looks clean, although part of that is just the mutex_spin() cleanup that is independent of actually using it in trylock. So no objections from me. Linus
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 08:48:01AM -0700, Linus Torvalds wrote: On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote: Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial and it seems so, at least for btrfs anyway. Hmm. Seems reasonable to me. The patch looks clean, although part of that is just the mutex_spin() cleanup that is independent of actually using it in trylock. Oh, I have two split patches. Posted the combined one for comments. So no objections from me. Awesome. Peter, what do you think? Are there some other tests which can be useful? Thanks. -- tejun
Re: efficiency of btrfs cow
I'm not a developer, but I think it goes something like this: btrfs doesn't write the filesystem on the entire device/partition at format time; rather, it dynamically increases the size of the filesystem as data is used. That's why formatting a disk in btrfs can be so fast. On Wed, Mar 23, 2011 at 12:39 PM, Brian J. Murrell br...@interlinx.bc.ca wrote: On 11-03-06 11:06 AM, Calvin Walton wrote: To see exactly what's going on, you should use the btrfs filesystem df command to see how space is being allocated for data and metadata separately: OK. So with an empty filesystem, before my first copy (i.e. the base on which the next copy will CoW from) df reports:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test 922746880 56 922746824 1% /mnt/btrfs-test

and btrfs fi df reports:

Data: total=8.00MB, used=0.00
Metadata: total=1.01GB, used=24.00KB
System: total=12.00MB, used=4.00KB

After the first copy, df and btrfs fi df report:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test 922746880 121402328 801344552 14% /mnt/btrfs-test

root@linux:/mnt/btrfs-test# cat .snapshots/monthly.22/metadata/btrfs_df-stop
Data: total=110.01GB, used=109.26GB
Metadata: total=5.01GB, used=3.26GB
System: total=12.00MB, used=24.00KB

So it's clear that total usage (as reported by df) was 121,402,328KB, but Metadata has two values: Metadata: total=5.01GB, used=3.26GB. What's the difference between total and used? And for that matter, what's the difference between the total and used for Data (total=110.01GB, used=109.26GB)? Even if I take the largest values (i.e. the total values) for Data and Metadata (each converted to KB first) and add them up, they come to 120,607,211.52KB, which is not quite the 121,402,328KB that df reports. There is a 795,116.48KB discrepancy. In any case, which value from btrfs fi df should I be subtracting from df's accounting to get a real accounting of the amount of data used? Cheers, b. 
Re: efficiency of btrfs cow
On 11-03-23 11:53 AM, Chester wrote: I'm not a developer, but I think it goes something like this: btrfs doesn't write the filesystem on the entire device/partition at format time; rather, it dynamically increases the size of the filesystem as data is used. That's why formatting a disk in btrfs can be so fast. Indeed, this much is understood, which is why I am using btrfs fi df to try to determine how much of the increase in raw device usage is the dynamic allocation of metadata. Cheers, b. 
stratified B-trees
I just noticed this today on the arXiv: http://xxx.lanl.gov/abs/1103.4282 The paper describes stratified B-trees; quoting from the abstract: We describe the `stratified B-tree', which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores. The paper mentions that a company called Acunu is developing an implementation. Are these stratified B-trees something which the btrfs project could use?
Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes
Hi, I'm reviewing the atomic counters and the wait/wake infrastructure, and just found two missed mutex_unlock()s in btrfs_scrub_dev() in error paths. On Fri, Mar 18, 2011 at 04:55:06PM +0100, Arne Jansen wrote: This is the main scrub code.

Updates v3:
- fixed EIO handling, need to reallocate bio instead of reusing it
- updated according to David Sterba's review
- don't clobber bi_flags on reuse, just set UPTODATE
- allocate long-living bios with bio_kmalloc instead of bio_alloc

Updates v4:
- don't restart chunk on commit
- each EIO leaked a bio
- the BIO_UPTODATE check was wrong
- removed some trailing whitespace
- nstripes int -> u64, %lld -> %llu
- extent_map reference not freed, leaking them on unmount
- removed unnecessary mutex locks around 'scrubs_running'

Thanks to Ilya Dryomov and Jan Schmidt for pointing most of those out.

---
 fs/btrfs/Makefile |    2 +-
 fs/btrfs/ctree.h  |   14 +
 fs/btrfs/scrub.c  | 1496 +
 3 files changed, 1511 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 31610ea..8fda313 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
-	   compression.o delayed-ref.o relocation.o
+	   compression.o delayed-ref.o relocation.o scrub.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fd2b92f..a89a205 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2619,4 +2619,18 @@ void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
 			      u64 *bytes_to_reserve);
 void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 			       struct btrfs_pending_snapshot *pending);
+
+/* scrub.c */
+int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
+		    struct btrfs_scrub_progress *progress);
+int btrfs_scrub_pause(struct btrfs_root *root);
+int btrfs_scrub_pause_super(struct btrfs_root *root);
+int btrfs_scrub_continue(struct btrfs_root *root);
+int btrfs_scrub_continue_super(struct btrfs_root *root);
+int btrfs_scrub_cancel(struct btrfs_root *root);
+int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
+int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
+int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
+			 struct btrfs_scrub_progress *progress);
+
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
new file mode 100644
index 000..85a4d4b
--- /dev/null
+++ b/fs/btrfs/scrub.c
@@ -0,0 +1,1496 @@
+/*
+ * Copyright (C) 2011 STRATO. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "ordered-data.h"
+
+/*
+ * This is only the first step towards a full-featured scrub. It reads all
+ * extents and super blocks and verifies the checksums. In case a bad checksum
+ * is found or the extent cannot be read, good data will be written back if
+ * any can be found.
+ *
+ * Future enhancements:
+ * - To enhance the performance, better read-ahead strategies for the
+ *   extent-tree can be employed.
+ * - In case an unrepairable extent is encountered, track which files are
+ *   affected and report them
+ * - In case of a read error on files with nodatasum, map the file and read
+ *   the extent to trigger a writeback of the good copy
+ * - track and record media errors, throw out bad devices
+ * - add a readonly mode
+ * - add a mode to also read unallocated space
+ * - make the prefetch cancellable
+ */
+
+#ifdef SCRUB_BTRFS_WORKER
+typedef struct btrfs_work scrub_work_t;
+#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
+#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker((wq), w); }
Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal
Hi, you are adding a new smp_mb(); can you please explain why it's needed and document it? thanks, dave On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote: This adds several synchronizations:
- for a transaction commit, the scrub gets paused before the tree roots are committed until the supers are safely on disk
- during a log commit, scrubbing of supers is disabled
- on unmount, the scrub gets cancelled
- on device removal, the scrub for the particular device gets cancelled

Signed-off-by: Arne Jansen sensi...@gmx.net
---
 fs/btrfs/disk-io.c     | 1 +
 fs/btrfs/transaction.c | 3 +++
 fs/btrfs/tree-log.c    | 2 ++
 fs/btrfs/volumes.c     | 2 ++
 4 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..924a366 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2493,6 +2493,7 @@ int close_ctree(struct btrfs_root *root)
 	fs_info->closing = 1;
 	smp_mb();
 
+	btrfs_scrub_cancel(root);
 	btrfs_put_block_group_cache(fs_info);
 
 	/*
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3d73c8d..5a43b20 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1310,6 +1310,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	WARN_ON(cur_trans != trans->transaction);
 
+	btrfs_scrub_pause(root);
 	/* btrfs_commit_tree_roots is responsible for getting the
 	 * various roots consistent with each other. Every pointer
 	 * in the tree of tree roots has to point to the most up to date
@@ -1391,6 +1392,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	mutex_unlock(&root->fs_info->trans_mutex);
 
+	btrfs_scrub_continue(root);
+
 	if (current->journal_info == trans)
 		current->journal_info = NULL;
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1f6788f..2be84fa 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2098,7 +2098,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * the running transaction open, so a full commit can't hop
 	 * in and cause problems either.
 	 */
+	btrfs_scrub_pause_super(root);
 	write_ctree_super(trans, root->fs_info->tree_root, 1);
+	btrfs_scrub_continue_super(root);
 	ret = 0;
 
 	mutex_lock(&root->log_mutex);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7dc9fa5..ad3ea88 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		goto error_undo;
 
 	device->in_fs_metadata = 0;
+	smp_mb();
+	btrfs_scrub_cancel_dev(root, device);
 
 	/*
 	 * the device list mutex makes sure that we don't change
-- 
1.7.3.4
Re: efficiency of btrfs cow
So it's clear that total usage (as reported by df) was 121,402,328KB but Metadata has two values: Metadata: total=5.01GB, used=3.26GB What's the difference between total and used? And for that matter, what's the difference between the total and used for Data (total=110.01GB, used=109.26GB)? total is the space allocated (reserved) for a kind of usage (metadata or data); the space allocated for one kind of usage can't be used for something else. The used value is the space actually used out of the space that has been allocated for that kind of usage. The wiki gives you an overview of how to interpret the values: https://btrfs.wiki.kernel.org/index.php/FAQ#btrfs_filesystem_df_.2Fmountpoint cheers Kolja. 
Re: stratified B-trees
Karn Kallio tierplusplusli...@gmail.com writes: Are these stratified B-trees something which the btrfs project could use? The current b*tree is pretty much hardcoded in the disk format, so it would be hard to change in a compatible way. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: stratified B-trees
On Wed, Mar 23, 2011 at 5:38 PM, Karn Kallio tierplusplusli...@gmail.com wrote: I just noticed this out today on the arXiv : http://xxx.lanl.gov/abs/1103.4282 The paper describes stratified B-trees and quoting from the abstract: LOL. It looks like this paper is generated by a robot: ... Stratified B-trees don’t need block-size tuning, unlike B-trees. One major advantage is that they are naturally good candidates for SSDs – the Intel X25M can perform 35,000 random 4K reads/s, but must write in units of many MBs in order to fully utilise its performance. This massive asymmetry in block size makes life very hard... How do you like: to utilise performance, massive asymmetry in block size.. We describe the `stratified B-tree', which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores. The paper mentions that a company called Acunu is developing an implementation. Are these stratified B-trees something which the btrfs project could use?
Btrfs wins Linux New Media Award
Hi everyone, During the last Cebit conference, Btrfs was presented with an award for the most innovative open source project. I'd like to thank everyone at Linux magazine involved with selecting us, and since we have so many contributors I wanted to share a picture of the shiny award: http://oss.oracle.com/~mason/btrfs/btrfs_award.jpg Thanks everyone! Chris
Recovering parent transid verify failed
Hi, I'm having the same issues as previously mentioned. Apparently the new fsck tool will be able to recover this? Few questions, is there a GIT version I can compile and use already for this? If not, is there any indication of when this will be released? --- Luke Sheldrick e: l...@sheldrick.co.uk p: 07880 725099
Re: [RFC] Tree fragmentation and prefetching
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote: While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise I did some prototyping on how to improve on it. The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute it over all disks. To keep you interested, some results first. a) by tree enumeration with reada=2 reading extent tree: 242s reading csum tree: 140s reading both trees: 324s b) prefetch prototype reading extent tree: 23.5s reading csum tree: 20.4s reading both trees: 25.7s 10x speed-up looks indeed impressive. Just for me to be sure, did I get you right in that you attribute this effect specifically to enumerating tree leaves in key address vs. disk addresses when these two are not aligned? Regards, Andrey The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled 28%. It is created with the current git tree + the round robin patch and filled with fs_mark -D 512 -t 16 -n 4096 -F -S0 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf() with path->reada=2. Both trees are being enumerated one after the other. The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs. While it is very promising to implement it for scrub, I think a more general interface which can be used for every enumeration would be beneficial. Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories etc.. 
I'd imagine an interface along the lines of int btrfs_readahead_init(struct btrfs_reada_ctx *reada); int btrfs_readahead_add(struct btrfs_root *root, struct btrfs_key *start, struct btrfs_key *end, struct btrfs_reada_ctx *reada); void btrfs_readahead_wait(struct btrfs_reada_ctx *reada); to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'. Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decisions what to read instead of the fixed start/end range. For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion or a waiting function that wakes up on each completion with the information which pages just completed. One way to achieve this would be to add a hook, but I gladly take any implementation hints. -- Arne
Re: [RFC] Tree fragmentation and prefetching
Excerpts from Arne Jansen's message of 2011-03-23 09:06:02 -0400: While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise I did some prototyping on how to improve on it. The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute it over all disks. To keep you interested, some results first. a) by tree enumeration with reada=2 reading extent tree: 242s reading csum tree: 140s reading both trees: 324s b) prefetch prototype reading extent tree: 23.5s reading csum tree: 20.4s reading both trees: 25.7s Very nice, btrfsck does something similar. The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled 28%. It is created with the current git tree + the round robin patch and filled with fs_mark -D 512 -t 16 -n 4096 -F -S0 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf() with path->reada=2. Both trees are being enumerated one after the other. The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs. While it is very promising to implement it for scrub, I think a more general interface which can be used for every enumeration would be beneficial. Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories etc.. I'd imagine an interface along the lines of int btrfs_readahead_init(struct btrfs_reada_ctx *reada); int btrfs_readahead_add(struct btrfs_root *root, struct btrfs_key *start, struct btrfs_key *end, struct btrfs_reada_ctx *reada); void btrfs_readahead_wait(struct btrfs_reada_ctx *reada); to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'. 
Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decisions what to read instead of the fixed start/end range. For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion or a waiting function that wakes up on each completion with the information which pages just completed. One way to achieve this would be to add a hook, but I gladly take any implementation hints. We have the bio endio call backs for this, I think that's the only thing you can use. -chris
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote: Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial and it seems so, at least for btrfs anyway. Hmm. Seems reasonable to me. TAS/spin with exponential back-off has been the preferred locking approach in Postgres (and, I believe, other DBMSes) for years, at least since '04 when I last touched the Postgres code. Even with the 'false negative' cost in user space being much higher than in the kernel, it's still just a question of scale (no wonder a measurable improvement is reported here from dbench on an SSD capable of a few dozen thousand IOPS). Regards, Andrey The patch looks clean, although part of that is just the mutex_spin() cleanup that is independent of actually using it in trylock. So no objections from me. Linus
Re: [RFC] Tree fragmentation and prefetching
On 23.03.2011 20:26, Andrey Kuzmin wrote:
> On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen <sensi...@gmx.net> wrote:
>> While looking into the performance of scrub I noticed that a
>> significant amount of time is being used for loading the extent tree
>> and the csum tree. While this is no surprise, I did some prototyping
>> on how to improve on it. The main idea is to load the tree (or parts
>> of it) top-down, order the needed blocks and distribute them over all
>> disks. To keep you interested, some results first.
>>
>> a) by tree enumeration with reada=2
>>    reading extent tree: 242s
>>    reading csum tree:   140s
>>    reading both trees:  324s
>>
>> b) prefetch prototype
>>    reading extent tree: 23.5s
>>    reading csum tree:   20.4s
>>    reading both trees:  25.7s
>
> 10x speed-up looks indeed impressive. Just for me to be sure, did I get
> you right in that you attribute this effect specifically to enumerating
> tree leaves in key address vs. disk addresses when these two are not
> aligned?

Yes. Leaves and the intermediate nodes tend to be quite scattered around
the disk with respect to their logical order. Reading them in logical
(ascending/descending) order requires lots of seeks.

> Regards,
> Andrey

>> The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks,
>> filled 28%. It is created with the current git tree + the round robin
>> patch and filled with
>>
>>    fs_mark -D 512 -t 16 -n 4096 -F -S0
>>
>> The 'normal' read is done by enumerating the leaves by
>> btrfs_next_leaf() with path->reada=2. Both trees are being enumerated
>> one after the other. The prototype currently just uses raw bios, does
>> not make use of the page cache and does not enter the read pages into
>> the cache. This will probably add some overhead. It also does not
>> check the crcs. While it is very promising to implement it for scrub,
>> I think a more general interface which can be used for every
>> enumeration would be beneficial. Use cases that come to mind are
>> rebalance, reflink, deletion of large files, listing of large
>> directories, etc.
>> I'd imagine an interface along the lines of
>>
>>    int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
>>    int btrfs_readahead_add(struct btrfs_root *root,
>>                            struct btrfs_key *start,
>>                            struct btrfs_key *end,
>>                            struct btrfs_reada_ctx *reada);
>>    void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);
>>
>> to trigger the readahead of parts of a tree. Multiple readahead
>> requests can be given before waiting. This would enable the very
>> beneficial folding seen above for 'reading both trees'. Also it would
>> be possible to add a cascading readahead, where the content of leaves
>> would trigger readaheads in other trees, maybe by giving a callback
>> for the decisions what to read instead of the fixed start/end range.
>>
>> For the implementation I'd need an interface which I haven't been
>> able to find yet. Currently I can trigger the read of several pages /
>> tree blocks and wait for the completion of each of them. What I'd
>> need would be an interface that gives me a callback on each
>> completion, or a waiting function that wakes up on each completion
>> with the information which pages just completed. One way to achieve
>> this would be to add a hook, but I'd gladly take any implementation
>> hints.

--
Arne
Re: [RFC] Tree fragmentation and prefetching
On Wed, Mar 23, 2011 at 11:28 PM, Arne Jansen <sensi...@gmx.net> wrote:
> On 23.03.2011 20:26, Andrey Kuzmin wrote:
>> On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen <sensi...@gmx.net> wrote:
>>> While looking into the performance of scrub I noticed that a
>>> significant amount of time is being used for loading the extent tree
>>> and the csum tree. [...]
>>>
>>> a) by tree enumeration with reada=2
>>>    reading extent tree: 242s
>>>    reading csum tree:   140s
>>>    reading both trees:  324s
>>>
>>> b) prefetch prototype
>>>    reading extent tree: 23.5s
>>>    reading csum tree:   20.4s
>>>    reading both trees:  25.7s
>>
>> 10x speed-up looks indeed impressive. Just for me to be sure, did I
>> get you right in that you attribute this effect specifically to
>> enumerating tree leaves in key address vs. disk addresses when these
>> two are not aligned?
>
> Yes. Leaves and the intermediate nodes tend to be quite scattered
> around the disk with respect to their logical order. Reading them in
> logical (ascending/descending) order requires lots of seeks.

And the patch actually does on-the-fly defragmentation, right? Why lose
it then :)?

Regards,
Andrey
Re: Recovering parent transid verify failed
Excerpts from Luke Sheldrick's message of 2011-03-23 14:12:45 -0400:
> Hi, I'm having the same issues as previously mentioned. Apparently the
> new fsck tool will be able to recover this? A few questions: is there
> a git version I can compile and use already for this? If not, is there
> any indication of when this will be released?

Yes, I'm still hammering out a reliable way to resolve most of these.
But please post the messages you're hitting; it is actually a very
generic problem and has many different causes.

What happened to your FS that made them come up? Which kernel were you
running, and what was the FS built on top of?

What happens when you grab the latest btrfsck from git and do:

btrfsck -s 1 /dev/xxx

-chris
Re: Btrfs wins Linux New Media Award
On 03/23/2011 02:17 PM, Chris Mason wrote:
> Hi everyone,
>
> During the last Cebit conference, Btrfs was presented with an award for
> the most innovative open source project. I'd like to thank everyone at
> Linux magazine involved with selecting us, and since we have so many
> contributors I wanted to share a picture of the shiny award:
>
> http://oss.oracle.com/~mason/btrfs/btrfs_award.jpg
>
> Thanks everyone!
>
> Chris

Congratulations! Very much deserved,

Ric
Re: [RFC] Tree fragmentation and prefetching
On Wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
> On 23.03.2011 20:26, Andrey Kuzmin wrote:
>> On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen <sensi...@gmx.net> wrote:
>>> While looking into the performance of scrub I noticed that a
>>> significant amount of time is being used for loading the extent tree
>>> and the csum tree. [...]
>>
>> 10x speed-up looks indeed impressive. [...]
>
> Yes. Leaves and the intermediate nodes tend to be quite scattered
> around the disk with respect to their logical order. Reading them in
> logical (ascending/descending) order requires lots of seeks.

I'm also dealing with the tree fragmentation problem; I am trying to
store the leaves which have the same parent close together.

Regards,
Miao
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Wed, 23 Mar 2011 17:47:01 +0800 Miao Xie <mi...@cn.fujitsu.com> wrote:
> we found GFP_KERNEL was passed into kzalloc(); I think this flag
> triggers the above lockdep warning. The attached patch, which is
> against the delayed items operation patch, may fix this problem.
> Could you test it for me?

The "possible irq lock inversion dependency" warning seems to go away.

itaru