Re: read-only subvolumes?
When I am creating subvolumes I get this strange behavior: if I create a subvolume with a name longer than 4 characters it is read-only; if the name is shorter than 5 characters the subvolume is writeable, as expected. I think it started when I upgraded to kernel version 2.6.38 (I do not create subvolumes on a regular basis). I will compile one of the latest 2.6.37 kernels to see whether the problem exists there, too. Another interesting point is that previously created subvolumes are not affected.

Thanks,
Andreas Philipp

thor btrfs # btrfs subvolume create 123456789
Create subvolume './123456789'
thor btrfs # touch 123456789/lsdkfj
touch: cannot touch `123456789/lsdkfj': Read-only file system

This is really odd, but I can't reproduce it. I created a btrfs filesystem on a 2.6.37 kernel, rebooted to the latest 2.6.38+, and tried the same procedure as you did, but nothing bad happened.

While playing around I found the following three new points:
- Now the length of the subvolume name does not matter, so even the ones with short names are read-only.
- It also happens on a freshly created btrfs filesystem.
- A snapshot taken of an old (= writeable) subvolume is writeable.

I will now reboot into 2.6.37.4, check there, and then report back.

Well, this was fast. Everything works as expected on 2.6.37.4; see the output of uname -a below for the exact kernel version. I will now reboot into a differently configured kernel version 2.6.38 and see whether the problem is gone there.

Thanks,
Andreas Philipp

thor ~ # uname -a
Linux thor 2.6.37.4 #2 SMP Wed Mar 23 10:25:54 CET 2011 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

IMHO, this is related to how the debug options of the kernel are configured. Attached you find two config files, both for kernel version 2.6.38: with the one named 2.6.38-debug everything works, and with the other one newly created subvolumes are read-only.

I'll see if I can reproduce the problem using your config. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: check return value of read_tree_block()
This patch checks the return value of read_tree_block() and, when it is NULL, handles the error instead of dereferencing the null pointer.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/ctree.c       |  3 +++
 fs/btrfs/extent-tree.c |  6 ++++++
 fs/btrfs/relocation.c  |  6 ++++++
 3 files changed, 15 insertions(+)

diff -urNp linux-2.6.38/fs/btrfs/ctree.c linux-2.6.38.new/fs/btrfs/ctree.c
--- linux-2.6.38/fs/btrfs/ctree.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/ctree.c	2011-03-24 11:12:54.0 +0900
@@ -686,6 +686,8 @@ int btrfs_realloc_node(struct btrfs_tran
 		if (!cur) {
 			cur = read_tree_block(root, blocknr, blocksize, gen);
+			if (!cur)
+				return -EIO;
 		} else if (!uptodate) {
 			btrfs_read_buffer(cur, gen);
 		}
@@ -4217,6 +4219,7 @@ find_next_key:
 		}
 		btrfs_set_path_blocking(path);
 		cur = read_node_slot(root, cur, slot);
+		BUG_ON(!cur);
 		btrfs_tree_lock(cur);
diff -urNp linux-2.6.38/fs/btrfs/extent-tree.c linux-2.6.38.new/fs/btrfs/extent-tree.c
--- linux-2.6.38/fs/btrfs/extent-tree.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/extent-tree.c	2011-03-24 11:32:55.0 +0900
@@ -6047,6 +6047,8 @@ static noinline int do_walk_down(struct
 		if (reada && level == 1)
 			reada_walk_down(trans, root, wc, path);
 		next = read_tree_block(root, bytenr, blocksize, generation);
+		if (!next)
+			return -EIO;
 		btrfs_tree_lock(next);
 		btrfs_set_lock_blocking(next);
 	}
@@ -7906,6 +7908,10 @@ static noinline int relocate_one_extent(
 			eb = read_tree_block(found_root, block_start,
 					     block_size, 0);
+			if (!eb) {
+				ret = -EIO;
+				goto out;
+			}
 			btrfs_tree_lock(eb);
 			BUG_ON(level != btrfs_header_level(eb));
diff -urNp linux-2.6.38/fs/btrfs/relocation.c linux-2.6.38.new/fs/btrfs/relocation.c
--- linux-2.6.38/fs/btrfs/relocation.c	2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/relocation.c	2011-03-24 11:43:53.0 +0900
@@ -1724,6 +1724,7 @@ again:
 		eb = read_tree_block(dest, old_bytenr, blocksize,
 				     old_ptr_gen);
+		BUG_ON(!eb);
 		btrfs_tree_lock(eb);
 		if (cow) {
 			ret = btrfs_cow_block(trans, dest, eb, parent,
@@ -2513,6 +2514,10 @@ static int do_relocation(struct btrfs_tr
 		blocksize = btrfs_level_size(root, node->level);
 		generation = btrfs_node_ptr_generation(upper->eb, slot);
 		eb = read_tree_block(root, bytenr, blocksize, generation);
+		if (!eb) {
+			err = -EIO;
+			goto next;
+		}
 		btrfs_tree_lock(eb);
 		btrfs_set_lock_blocking(eb);
@@ -2670,6 +2675,7 @@ static int get_tree_block_key(struct rel
 	BUG_ON(block->key_ready);
 	eb = read_tree_block(rc->extent_root, block->bytenr,
 			     block->key.objectid, block->key.offset);
+	BUG_ON(!eb);
 	WARN_ON(btrfs_header_level(eb) != block->level);
 	if (block->level == 0)
 		btrfs_item_key_to_cpu(eb, &block->key, 0);
Re: [RFC] Tree fragmentation and prefetching
On 24.03.2011 02:38, Miao Xie wrote:
On Wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
On 23.03.2011 20:26, Andrey Kuzmin wrote:
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:

The main idea is to load the tree (or parts of it) top-down, order the needed blocks, and distribute the reads over all disks. To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

A 10x speed-up looks indeed impressive. Just to be sure I got you right: do you attribute this effect specifically to enumerating tree leaves in key order vs. disk order when the two are not aligned?

Yes. Leaves and the intermediate nodes tend to be quite scattered around the disk with respect to their logical order. Reading them in logical (ascending/descending) order requires lots of seeks.

I'm also dealing with the tree fragmentation problem; I'm trying to store leaves which have the same parent close together.

That's good to hear. Do you already have anything I can repeat the test with?
-Arne

Regards
Miao
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
* Tejun Heo t...@kernel.org wrote:

NOT-Signed-off-by: Tejun Heo t...@kernel.org

s/NOT-// ?

	Ingo
2.6.38 defragment compression oops...
I found that I'm able to provoke undefined behaviour on 2.6.38 with extent defragmentation + recompression, e.g.:

mkfs.btrfs /dev/sdb
mount /dev/sdb /mnt
cp -xa / /mnt
find /mnt -print0 | xargs -0 btrfs filesystem defragment -vc

After a short time, I was seeing what looked like a secondary effect [1]. Reproducing with lock instrumentation reported recursive spinlock acquisition, probably a false positive from the locking scheme not being annotated, so better to report it now.

Daniel

---

[1] BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [a00e23cb] write_extent_buffer+0xbb/0x1b0 [btrfs]
PGD 0
Oops: [#1] SMP
last sysfs file: /sys/devices/pci:00/:00:1e.0/:06:04.0/local_cpus
CPU 1
Modules linked in: microcode psmouse serio_raw ioatdma i7core_edac joydev lp edac_core dca parport raid10 raid456 async_raid6_recov async_pq usbhid hid raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear ahci btrfs zlib_deflate libahci e1000e libcrc32c
Pid: 1119, comm: btrfs-delalloc- Tainted: GW 2.6.38-020638-generic #201103151303 Supermicro X8STi/X8STi
RIP: 0010:[a00e23cb] [a00e23cb] write_extent_buffer+0xbb/0x1b0 [btrfs]
RSP: 0018:880303a0bbc0 EFLAGS: 00010a86
RAX: db74 RBX: 0d26 RCX: 8800
RDX: RSI: 0002fa19 RDI: 88023c8353f8
RBP: 880303a0bc00 R08: 0001 R09:
R10: R11: 0017 R12: db738800
R13: 028c R14: 880303a0bfd8 R15:
FS: () GS:8800df48() knlGS:
CS: 0010 DS: ES: CR0: 8005003b CR2: CR3: 01a03000 CR4: 06e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Process btrfs-delalloc- (pid: 1119, threadinfo 880303a0a000, task 8803046cad80)
Stack:
 880280e63cc0 8802fd10ad26 0001 880303a0a000
 ea000a75ba30 0fb2 08f7 02da
 880303a0bcb0 a00c5bb0 002e0001
Call Trace:
 [a00c5bb0] insert_inline_extent+0x330/0x350 [btrfs]
 [a00c5cf6] cow_file_range_inline+0x126/0x160 [btrfs]
 [a00c68f0] compress_file_range+0x3b0/0x580 [btrfs]
 [a00c6af5] async_cow_start+0x35/0x50 [btrfs]
 [a00eac0c] worker_loop+0xac/0x260 [btrfs]
 [a00eab60] ? worker_loop+0x0/0x260 [btrfs]
 [81086317] kthread+0x97/0xa0
 [8100ce24] kernel_thread_helper+0x4/0x10
 [81086280] ? kthread+0x0/0xa0
 [8100ce20] ? kernel_thread_helper+0x0/0x10
Code: 16 00 00 48 8d 04 0a 48 b9 b7 6d db b6 6d db b6 6d 48 c1 f8 03 48 0f af c1 48 b9 00 00 00 00 00 88 ff ff 48 c1 e0 0c 4c 8d 24 08 48 8b 02 a8 08 0f 85 9c 00 00 00 be cb 0e 00 00 48 c7 c7 b8 7c
RIP [a00e23cb] write_extent_buffer+0xbb/0x1b0 [btrfs]
 RSP 880303a0bbc0
CR2:
---[ end trace a7919e7f17c0a728 ]---
note: btrfs-delalloc- exited with preempt_count 1

--
Daniel J Blueman
Re: read-only subvolumes?
IMHO, this is related to how the debug options of the kernel are configured. Attached you find two config files, both for kernel version 2.6.38: with the one named 2.6.38-debug everything works, and with the other one newly created subvolumes are read-only.

I've figured out what's wrong. The root cause is that the flags field of the root item for a new subvol is never _initialized_, so the on-disk root_item->flags can be of arbitrary value (so can root_item->byte_limit, btw).

I don't have a perfect solution at the moment, but I think a workaround is to use a flag in root_item->inode_item's flags to indicate whether root->flags has been initialized.
[PATCH 1/2] Subject: mutex: Separate out mutex_spin()
Separate mutex_spin() out of __mutex_lock_common(). The fat comment is converted to a docbook function description. While at it, drop the part of the comment which claims that adaptive spinning considers whether there are pending waiters, which doesn't match the code.

This patch only prepares for using adaptive spinning in mutex_trylock() and doesn't cause any behavior change.

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
---
Here are the split patches with SOB. Ingo, it's probably best to route this through -tip, I suppose? Thanks.

 kernel/mutex.c | 87 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,32 @@ void __sched mutex_unlock(struct mutex *
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * the lock owner is currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-		    unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn't track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +170,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -190,7 +179,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +191,30 @@ __mutex_lock_common(struct mutex *lock,
 		arch_mutex_cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+		    unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
[PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Adaptive owner spinning used to be applied only to mutex_lock(). This patch applies it also to mutex_trylock().

btrfs has developed custom locking to avoid excessive context switches in its btree implementation. Generally, doing away with the custom implementation and just using the mutex shows better behavior; however, there's an interesting distinction in the custom implementation of trylock. It distinguishes between simple trylock and tryspin: the former just tries once and then fails, while the latter does some spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning; it tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial, and it seems so, for btrfs anyway.

The following results are from dbench 50 run on a two-socket eight-core Opteron machine with 4GiB of memory and an OCZ Vertex SSD. During the run, the disk stays mostly idle and all CPUs are fully occupied, so the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied, i.e. it basically just uses the mutex. SPIN is with this patch applied on top: mutex_trylock() uses adaptive spinning.

        USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
SIMPLE  61107  354977  217   8099529  845.100 MB/sec
SPIN    63140  364888  214   6840527  879.077 MB/sec

Across various runs, the adaptive spinning trylock consistently posts higher throughput. The amount of difference varies, but it outperforms consistently. In general, using adaptive spinning on trylock makes sense, as trylock failure usually leads to a costly unlock-relock sequence.
[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
Cc: Chris Mason chris.ma...@oracle.com
---
 kernel/mutex.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }
Re: [PATCH 1/2] Subject: mutex: Separate out mutex_spin()
Ugh... please drop the extra "Subject:" from the subject line before applying. Thanks.

--
tejun
2.6.38 fs balance lock ordering...
While doing a filesystem balance, lockdep detected a potential lock ordering issue [1].

Thanks,
Daniel

---

[1]
===
[ INFO: possible circular locking dependency detected ]
2.6.38.1-341cd+ #10
---
btrfs/1101 is trying to acquire lock:
 (sb->s_type->i_mutex_key#12){+.+.+.}, at: [812cddb9] prealloc_file_extent_cluster+0x59/0x180

but task is already holding lock:
 (fs_info->cleaner_mutex){+.+.+.}, at: [812cfcb7] btrfs_relocate_block_group+0x197/0x2d0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (fs_info->cleaner_mutex){+.+.+.}:
 [8109628a] lock_acquire+0x5a/0x70
 [816c9cde] mutex_lock_nested+0x5e/0x390
 [812828e1] btrfs_commit_super+0x21/0xe0
 [812857a2] close_ctree+0x332/0x3a0
 [8125fd08] btrfs_put_super+0x18/0x30
 [8113ae7d] generic_shutdown_super+0x6d/0xf0
 [8113af91] kill_anon_super+0x11/0x60
 [8113b6b5] deactivate_locked_super+0x45/0x60
 [8113c2b5] deactivate_super+0x45/0x60
 [81158729] mntput_no_expire+0x99/0xf0
 [8115996c] sys_umount+0x7c/0x3c0
 [81002dfb] system_call_fastpath+0x16/0x1b

-> #1 (type->s_umount_key#24){++}:
 [8109628a] lock_acquire+0x5a/0x70
 [816ca372] down_read+0x42/0x60
 [8115e935] writeback_inodes_sb_nr_if_idle+0x35/0x60
 [812723ae] shrink_delalloc+0xee/0x180
 [81273253] btrfs_delalloc_reserve_metadata+0x163/0x180
 [812732ab] btrfs_delalloc_reserve_space+0x3b/0x60
 [8129563d] btrfs_file_aio_write+0x61d/0x9c0
 [81137f12] do_sync_write+0xd2/0x110
 [81138a88] vfs_write+0xc8/0x190
 [81138c3c] sys_write+0x4c/0x90
 [81002dfb] system_call_fastpath+0x16/0x1b

-> #0 (sb->s_type->i_mutex_key#12){+.+.+.}:
 [810961a8] __lock_acquire+0x1ba8/0x1c30
 [8109628a] lock_acquire+0x5a/0x70
 [816c9cde] mutex_lock_nested+0x5e/0x390
 [812cddb9] prealloc_file_extent_cluster+0x59/0x180
 [812ce0a1] relocate_file_extent_cluster+0x91/0x380
 [812ce44b] relocate_data_extent+0xbb/0xd0
 [812cf843] relocate_block_group+0x323/0x600
 [812cfcc8] btrfs_relocate_block_group+0x1a8/0x2d0
 [812b09c3] btrfs_relocate_chunk+0x83/0x600
 [812b160d] btrfs_balance+0x20d/0x280
 [812b8b86] btrfs_ioctl+0x1b6/0xa80
 [8114a43d] do_vfs_ioctl+0x9d/0x590
 [8114a97a] sys_ioctl+0x4a/0x80
 [81002dfb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

2 locks held by btrfs/1101:
 #0: (fs_info->volume_mutex){+.+.+.}, at: [812b148b] btrfs_balance+0x8b/0x280
 #1: (fs_info->cleaner_mutex){+.+.+.}, at: [812cfcb7] btrfs_relocate_block_group+0x197/0x2d0

stack backtrace:
Pid: 1101, comm: btrfs Tainted: GW 2.6.38.1-341cd+ #10
Call Trace:
 [810937fb] ? print_circular_bug+0xeb/0xf0
 [810961a8] ? __lock_acquire+0x1ba8/0x1c30
 [812a5fd1] ? map_private_extent_buffer+0xe1/0x210
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [8109628a] ? lock_acquire+0x5a/0x70
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [810565f5] ? add_preempt_count+0x75/0xd0
 [816c9cde] ? mutex_lock_nested+0x5e/0x390
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [81125fa3] ? init_object+0x43/0x80
 [81051121] ? get_parent_ip+0x11/0x50
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [812ce0a1] ? relocate_file_extent_cluster+0x91/0x380
 [812ce44b] ? relocate_data_extent+0xbb/0xd0
 [812cf843] ? relocate_block_group+0x323/0x600
 [812cfcc8] ? btrfs_relocate_block_group+0x1a8/0x2d0
 [812b09c3] ? btrfs_relocate_chunk+0x83/0x600
 [812a62d2] ? read_extent_buffer+0xf2/0x230
 [8126c286] ? btrfs_search_slot+0x886/0xa90
 [8105654d] ? sub_preempt_count+0x9d/0xd0
 [812a62d2] ? read_extent_buffer+0xf2/0x230
 [812b160d] ? btrfs_balance+0x20d/0x280
 [812b8b86] ? btrfs_ioctl+0x1b6/0xa80
 [8103146c] ? do_page_fault+0x1cc/0x440
 [8114a43d] ? do_vfs_ioctl+0x9d/0x590
 [8113943f] ? fget_light+0x1df/0x3c0
 [8114a97a] ? sys_ioctl+0x4a/0x80
 [81002dfb] ? system_call_fastpath+0x16/0x1b

--
Daniel J Blueman
recurring btrfs csum failed
I had a system freeze for some reason with 2.6.38. I did a hard reboot, only to discover that some of the files (KVM images that were in use when the crash happened) on a btrfs RAID-1 filesystem are corrupted:

btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329
btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329
btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329

Not being in the mood to check whether btrfs would try the other device of the mirror, I decided to remove the corrupted file and copy back a previous version stored on an ext3 filesystem. The file copied fine, but to my surprise, the new file is still corrupted:

# md5sum vm-113-disk-1.raw
md5sum: vm-113-disk-1.raw: Input/output error

The errors reported by btrfs are slightly different now:

btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 mirror 2
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 mirror 1
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 mirror 2
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 mirror 2

btrfs is mounted with these flags:

/dev/sdc on /mnt/btrfs type btrfs (rw,noatime,compress-force=lzo,device=/dev/sdc,device=/dev/sdd)

I don't need to recover the file; I'm just signalling that something doesn't work well here!

--
Tomasz Chmielewski
http://wpkg.org
Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes
On 23.03.2011 18:18, David Sterba wrote:

Hi,

I'm reviewing the atomic counters and the wait/wake infrastructure, and just found two missed mutex_unlock()s in btrfs_scrub_dev() in error paths.

On Fri, Mar 18, 2011 at 04:55:06PM +0100, Arne Jansen wrote:
This is the main scrub code.

	mutex_lock(&fs_info->scrub_lock);
	if (dev->scrub_device) {
		mutex_unlock(&fs_info->scrub_lock);
		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
		scrub_workers_put(root);
		return -EINPROGRESS;
	}
	sdev = scrub_setup_dev(dev);
	if (IS_ERR(sdev)) {
		mutex_unlock(&fs_info->scrub_lock);
		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
		scrub_workers_put(root);
		return PTR_ERR(sdev);
	}

(the device_list_mutex unlocks are the suggested additions)

Thanks, I'll add you as Reported-by if that's ok.
-Arne
[PATCH V4 4/4] Btrfs: add btrfs_trim_fs() to handle FITRIM
We take a free extent out of the allocator, trim it, then put it back. But before we trim the block group, we should make sure the block group is cached, so a small change is added to make cache_block_group() run without a transaction.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h            |  1 +
 fs/btrfs/extent-tree.c      | 50 +++++++++++++++-
 fs/btrfs/free-space-cache.c | 92 +++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.h |  2 +
 fs/btrfs/ioctl.c            | 46 +++++++++++++
 5 files changed, 190 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 94bb772..df206c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2232,6 +2232,7 @@ int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
 			       u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root, u64 type);
+int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10e542a..d876759 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -440,7 +440,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	 * allocate blocks for the tree root we can't do the fast caching since
 	 * we likely hold important locks.
 	 */
-	if (!trans->transaction->in_commit &&
+	if (trans && (!trans->transaction->in_commit) &&
 	    (root && root != root->fs_info->tree_root)) {
 		spin_lock(&cache->lock);
 		if (cache->cached != BTRFS_CACHE_NO) {
@@ -8739,3 +8739,51 @@ int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
 {
 	return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
 }
+
+int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_group_cache *cache = NULL;
+	u64 group_trimmed;
+	u64 start;
+	u64 end;
+	u64 trimmed = 0;
+	int ret = 0;
+
+	cache = btrfs_lookup_block_group(fs_info, range->start);
+
+	while (cache) {
+		if (cache->key.objectid >= (range->start + range->len)) {
+			btrfs_put_block_group(cache);
+			break;
+		}
+
+		start = max(range->start, cache->key.objectid);
+		end = min(range->start + range->len,
+			  cache->key.objectid + cache->key.offset);
+
+		if (end - start >= range->minlen) {
+			if (!block_group_cache_done(cache)) {
+				ret = cache_block_group(cache, NULL, root, 0);
+				if (!ret)
+					wait_block_group_cache_done(cache);
+			}
+			ret = btrfs_trim_block_group(cache,
+						     &group_trimmed,
+						     start,
+						     end,
+						     range->minlen);
+
+			trimmed += group_trimmed;
+			if (ret) {
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
+
+		cache = next_block_group(fs_info->tree_root, cache);
+	}
+
+	range->len = trimmed;
+	return ret;
+}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index a039065..d0dc812 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2154,3 +2154,95 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 	cluster->block_group = NULL;
 }
+
+int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
+			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+{
+	struct btrfs_free_space *entry = NULL;
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	u64 bytes = 0;
+	u64 actually_trimmed;
+	int ret = 0;
+
+	*trimmed = 0;
+
+	while (start < end) {
+		spin_lock(&block_group->tree_lock);
+
+		if (block_group->free_space < minlen) {
+			spin_unlock(&block_group->tree_lock);
+			break;
+		}
+
+		entry = tree_search_offset(block_group, start, 0, 1);
+		if (!entry)
+			entry = tree_search_offset(block_group,
+						   offset_to_bitmap(block_group,
+								    start),
+
[PATCH V4 2/4] Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
btrfs_map_block() will only return a single stripe length, but we want the full extent to be mapped to each disk when we are trimming the extent, so we add a length field to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/volumes.c | 150 ++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/volumes.h |   1 +
 2 files changed, 129 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd13eb8..e81cce6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2962,7 +2962,10 @@ static int __btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
 	struct extent_map_tree *em_tree = &map_tree->map_tree;
 	u64 offset;
 	u64 stripe_offset;
+	u64 stripe_end_offset;
 	u64 stripe_nr;
+	u64 stripe_nr_orig;
+	u64 stripe_nr_end;
 	int stripes_allocated = 8;
 	int stripes_required = 1;
 	int stripe_index;
@@ -2971,7 +2974,7 @@ static int __btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
 	int max_errors = 0;
 	struct btrfs_multi_bio *multi = NULL;
 
-	if (multi_ret && !(rw & REQ_WRITE))
+	if (multi_ret && !(rw & (REQ_WRITE | REQ_DISCARD)))
 		stripes_allocated = 1;
 again:
 	if (multi_ret) {
@@ -3017,7 +3020,15 @@ again:
 			max_errors = 1;
 		}
 	}
-	if (multi_ret && (rw & REQ_WRITE) &&
+	if (rw & REQ_DISCARD) {
+		if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
+				 BTRFS_BLOCK_GROUP_RAID1 |
+				 BTRFS_BLOCK_GROUP_DUP |
+				 BTRFS_BLOCK_GROUP_RAID10)) {
+			stripes_required = map->num_stripes;
+		}
+	}
+	if (multi_ret && (rw & (REQ_WRITE | REQ_DISCARD)) &&
 	    stripes_allocated < stripes_required) {
 		stripes_allocated = map->num_stripes;
 		free_extent_map(em);
@@ -3037,12 +3048,15 @@ again:
 	/* stripe_offset is the offset of this block in its stripe*/
 	stripe_offset = offset - stripe_offset;
 
-	if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
-			 BTRFS_BLOCK_GROUP_RAID10 |
-			 BTRFS_BLOCK_GROUP_DUP)) {
+	if (rw & REQ_DISCARD)
+		*length = min_t(u64, em->len - offset, *length);
+	else if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
+			      BTRFS_BLOCK_GROUP_RAID1 |
+			      BTRFS_BLOCK_GROUP_RAID10 |
+			      BTRFS_BLOCK_GROUP_DUP)) {
 		/* we limit the length of each bio to what fits in a stripe */
 		*length = min_t(u64, em->len - offset,
-				map->stripe_len - stripe_offset);
+			      map->stripe_len - stripe_offset);
 	} else {
 		*length = em->len - offset;
 	}
@@ -3052,8 +3066,19 @@ again:
 	num_stripes = 1;
 	stripe_index = 0;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
-		if (unplug_page || (rw & REQ_WRITE))
+	stripe_nr_orig = stripe_nr;
+	stripe_nr_end = (offset + *length + map->stripe_len - 1) &
+			(~(map->stripe_len - 1));
+	do_div(stripe_nr_end, map->stripe_len);
+	stripe_end_offset = stripe_nr_end * map->stripe_len -
+			    (offset + *length);
+	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
+		if (rw & REQ_DISCARD)
+			num_stripes = min_t(u64, map->num_stripes,
+					    stripe_nr_end - stripe_nr_orig);
+		stripe_index = do_div(stripe_nr, map->num_stripes);
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+		if (unplug_page || (rw & (REQ_WRITE | REQ_DISCARD)))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
@@ -3064,7 +3089,7 @@ again:
 		}
 
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
-		if (rw & REQ_WRITE)
+		if (rw & (REQ_WRITE | REQ_DISCARD))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
@@ -3077,6 +3102,10 @@ again:
 		if (unplug_page || (rw & REQ_WRITE))
 			num_stripes = map->sub_stripes;
+		else if (rw & REQ_DISCARD)
+			num_stripes = min_t(u64, map->sub_stripes *
+					    (stripe_nr_end - stripe_nr_orig),
+					    map->num_stripes);
 		else if (mirror_num)
 			stripe_index += mirror_num - 1;
 		else {
@@ -3094,24 +3123,101 @@ again:
 	}
 	BUG_ON(stripe_index >= map->num_stripes);
 
-	for (i
[PATCH V4 3/4] Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
Callers of btrfs_discard_extent() should check if we are mounted with -o discard, as we want fitrim to work even when the fs is not mounted with -o discard. Also, we should use REQ_DISCARD when mapping the free extent so we get a full mapping. Lastly, we only return an error if 1) the error is not EOPNOTSUPP, or 2) no device supports discard.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     |  5 ++++-
 fs/btrfs/extent-tree.c | 45 +++++++++++++++++++++++++++++---------------
 3 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c84551..94bb772 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2229,7 +2229,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_root *root,
 				   u64 start, u64 end);
 int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-			       u64 num_bytes);
+			       u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root, u64 type);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 100b07f..98b60b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,7 +2947,10 @@ static int btrfs_destroy_pinned_extent(struct btrfs_root *root,
 			break;
 
 		/* opt_discard */
-		ret = btrfs_error_discard_extent(root, start, end + 1 - start);
+		if (btrfs_test_opt(root, DISCARD))
+			ret = btrfs_error_discard_extent(root, start,
+							 end + 1 - start,
+							 NULL);
 
 		clear_extent_dirty(unpin, start, end, GFP_NOFS);
 		btrfs_error_unpin_extent_range(root, start, end);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index caa4254..10e542a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1738,40 +1738,44 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static void btrfs_issue_discard(struct block_device *bdev,
+static int btrfs_issue_discard(struct block_device *bdev,
 				u64 start, u64 len)
 {
-	blkdev_issue_discard(bdev, start << 9, len << 9, GFP_KERNEL, 0);
+	return blkdev_issue_discard(bdev, start << 9, len << 9, GFP_KERNEL, 0);
 }
 
 static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
-				u64 num_bytes)
+				u64 num_bytes, u64 *actual_bytes)
 {
 	int ret;
-	u64 map_length = num_bytes;
+	u64 discarded_bytes = 0;
 	struct btrfs_multi_bio *multi = NULL;
 
-	if (!btrfs_test_opt(root, DISCARD))
-		return 0;
-
 	/* Tell the block device(s) that the sectors can be discarded */
-	ret = btrfs_map_block(&root->fs_info->mapping_tree, READ,
-			      bytenr, &map_length, &multi, 0);
+	ret = btrfs_map_block(&root->fs_info->mapping_tree, REQ_DISCARD,
+			      bytenr, &num_bytes, &multi, 0);
 	if (!ret) {
 		struct btrfs_bio_stripe *stripe = multi->stripes;
 		int i;
 
-		if (map_length > num_bytes)
-			map_length = num_bytes;
-
 		for (i = 0; i < multi->num_stripes; i++, stripe++) {
-			btrfs_issue_discard(stripe->dev->bdev,
-					    stripe->physical,
-					    map_length);
+			ret = btrfs_issue_discard(stripe->dev->bdev,
+						  stripe->physical,
+						  stripe->length);
+			if (!ret)
+				discarded_bytes += stripe->length;
+			else if (ret != -EOPNOTSUPP)
+				break;
 		}
 		kfree(multi);
 	}
+	if (discarded_bytes && ret == -EOPNOTSUPP)
+		ret = 0;
+
+	if (actual_bytes)
+		*actual_bytes = discarded_bytes;
+
 	return ret;
 }
 
@@ -4361,7 +4365,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 		if (ret)
 			break;
 
-		ret = btrfs_discard_extent(root, start, end + 1 - start);
+		if (btrfs_test_opt(root, DISCARD))
+			ret = btrfs_discard_extent(root, start,
+						   end + 1 - start, NULL);
 
 		clear_extent_dirty(unpin, start, end, GFP_NOFS);
 		unpin_extent_range(root, start, end);
@@ -5410,7 +5416,8 @@ int btrfs_free_reserved_extent(struct btrfs_root *root,
[PATCH V4 0/4] Btrfs: batched discard support for btrfs
Dear list,

This is V4 of batched discard support. We now get a full mapping of the free space on each device for RAID0/1/10/DUP instead of just a single stripe length. Tested with xfstests 251. Thanks.

Changelog V4:
*Make btrfs_map_block() return the full mapping.

Changelog V3:
*Fix style problems.
*Rebase to 2.6.38-rc7.

Changelog V2:
*Check whether we have devices that support trim before trying to trim the fs, and adjust minlen according to the discard_granularity.
*Update reserved extent calculations in btrfs_trim_block_group().
*Call cond_resched() without checking need_resched().
*Use bitmap_clear_bits() and unlink_free_space() instead of btrfs_remove_free_space(), so we won't search for the same extent twice.
*Try harder in btrfs_discard_extent(): now we only report errors that are not EOPNOTSUPP.
*Make sure the block group is cached before trimming it, or we'll see an empty caching tree if the block group is not cached.
*Minor return value fix in btrfs_discard_block_group().
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V4 1/4] Btrfs: make update_reserved_bytes() public
Make the function public as we should update the reserved extents calculations after taking out an extent for trimming.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h       |    2 ++
 fs/btrfs/extent-tree.c |   16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)
 create mode 100644 fs/btrfs/Module.symvers

diff --git a/fs/btrfs/Module.symvers b/fs/btrfs/Module.symvers
new file mode 100644
index 000..e69de29
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f78cc7..2c84551 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2157,6 +2157,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 		      u64 root_objectid, u64 owner, u64 offset);
 int btrfs_free_reserved_extent(struct btrfs_root *root,
 			       u64 start, u64 len);
+int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
+				u64 num_bytes, int reserve, int sinfo);
 int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7b3089b..caa4254 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -36,8 +36,6 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      u64 bytenr, u64 num_bytes, int alloc);
-static int update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				 u64 num_bytes, int reserve, int sinfo);
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root,
 			       u64 bytenr, u64 num_bytes, u64 parent,
@@ -4223,8 +4221,8 @@ int btrfs_pin_extent(struct btrfs_root *root,
  * update size of reserved extents. this function may return -EAGAIN
  * if 'reserve' is true or 'sinfo' is false.
  */
-static int update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				 u64 num_bytes, int reserve, int sinfo)
+int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
+				u64 num_bytes, int reserve, int sinfo)
 {
 	int ret = 0;
 	if (sinfo) {
@@ -4704,10 +4702,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
-		ret = update_reserved_bytes(cache, buf->len, 0, 0);
+		ret = btrfs_update_reserved_bytes(cache, buf->len, 0, 0);
 		if (ret == -EAGAIN) {
 			/* block group became read-only */
-			update_reserved_bytes(cache, buf->len, 0, 1);
+			btrfs_update_reserved_bytes(cache, buf->len, 0, 1);
 			goto out;
 		}
@@ -5191,7 +5189,7 @@ checks:
 					     search_start - offset);
 		BUG_ON(offset > search_start);
 
-		ret = update_reserved_bytes(block_group, num_bytes, 1,
+		ret = btrfs_update_reserved_bytes(block_group, num_bytes, 1,
 					    (data & BTRFS_BLOCK_GROUP_DATA));
 		if (ret == -EAGAIN) {
 			btrfs_add_free_space(block_group, offset, num_bytes);
@@ -5415,7 +5413,7 @@ int btrfs_free_reserved_extent(struct btrfs_root *root,
 	ret = btrfs_discard_extent(root, start, len);
 
 	btrfs_add_free_space(cache, start, len);
-	update_reserved_bytes(cache, len, 0, 1);
+	btrfs_update_reserved_bytes(cache, len, 0, 1);
 	btrfs_put_block_group(cache);
 
 	return ret;
@@ -5614,7 +5612,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 		put_caching_control(caching_ctl);
 	}
 
-	ret = update_reserved_bytes(block_group, ins->offset, 1, 1);
+	ret = btrfs_update_reserved_bytes(block_group, ins->offset, 1, 1);
 	BUG_ON(ret);
 	btrfs_put_block_group(block_group);
 	ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
--
1.7.4.1
[PATCH] Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and can be greatly helpful for debugging, e.g.:

dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordered_extent:
	btrfs_ordered_extent_add
	btrfs_ordered_extent_remove
	btrfs_ordered_extent_start
	btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are updated.

2) extent_map:
	btrfs_get_extent
extent_map is used in both the read and write cases, and it is useful for tracking how btrfs-specific IO is running.

3) writepage:
	__extent_writepage
	btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback, so it is valuable to know how a page is written to disk.

4) inode:
	btrfs_inode_new
	btrfs_inode_request
	btrfs_inode_evict
These can show where and when an inode is created and when it is evicted.

5) sync:
	btrfs_sync_file
	btrfs_sync_fs
These show sync arguments.

6) transaction:
	btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and who performs the commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free
A chunk is the link between a physical offset and a logical offset and stands for space information in btrfs, so these are helpful for tracing space usage.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.c             |    3 +
 fs/btrfs/ctree.h             |    1 +
 fs/btrfs/delayed-ref.c       |    6 +
 fs/btrfs/extent-tree.c       |    4 +
 fs/btrfs/extent_io.c         |    2 +
 fs/btrfs/file.c              |    1 +
 fs/btrfs/inode.c             |   12 +
 fs/btrfs/ordered-data.c      |    8 +
 fs/btrfs/super.c             |    5 +
 fs/btrfs/transaction.c       |    2 +
 fs/btrfs/volumes.c           |   16 +-
 fs/btrfs/volumes.h           |   11 +
 include/trace/events/btrfs.h |  667 ++++++++++++++++++++++++++++++++++++
 13 files changed, 727 insertions(+), 11 deletions(-)
 create mode 100644 include/trace/events/btrfs.h

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b5baff0..351515d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -542,6 +542,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 	ret = __btrfs_cow_block(trans, root, buf, parent,
 				 parent_slot, cow_ret, search_start, 0);
+
+	trace_btrfs_cow_block(root, buf, *cow_ret);
+
 	return ret;
 }
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 28188a7..cd6906e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -28,6 +28,7 @@
 #include <linux/wait.h>
 #include <linux/slab.h>
 #include <linux/kobject.h>
+#include <trace/events/btrfs.h>
 #include <asm/kmap_types.h>
 #include "extent_io.h"
 #include "extent_map.h"
diff --git a/fs/btrfs/delayed-ref.c
[PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL
In the filesystem context, we must allocate memory with GFP_NOFS, or we may start another filesystem operation and make the kswapd thread hang up.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..42061d2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	if (load_cache_only)
 		return 0;
 
-	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	BUG_ON(!caching_ctl);
 
 	INIT_LIST_HEAD(&caching_ctl->list);
@@ -1743,7 +1743,7 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 static void btrfs_issue_discard(struct block_device *bdev,
 				u64 start, u64 len)
 {
-	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
+	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS,
 			BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
 }
--
1.7.3.1
Re: [RFC] Tree fragmentation and prefetching
On Thu, 24 Mar 2011 08:29:57 +0100, Arne Jansen wrote:
 On 24.03.2011 02:38, Miao Xie wrote:
 On Wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:
 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:

 The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute the reads over all disks. To keep you interested, some results first.

 a) by tree enumeration with reada=2
    reading extent tree: 242s
    reading csum tree: 140s
    reading both trees: 324s

 b) prefetch prototype
    reading extent tree: 23.5s
    reading csum tree: 20.4s
    reading both trees: 25.7s

 A 10x speed-up looks indeed impressive. Just for me to be sure, did I get you right in that you attribute this effect specifically to enumerating tree leaves in key order vs. disk order when these two are not aligned?

 Yes. Leaves and the intermediate nodes tend to be quite scattered around the disk with respect to their logical order. Reading them in logical (ascending/descending) order requires lots of seeks.

 I'm also dealing with the tree fragmentation problem; I try to store the leaves which have the same parent close together.

 That's good to hear. Do you have already anything I can repeat the test with?

 -Arne

It is still under development. ;)

Thanks
Miao
Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal
On 23.03.2011 18:28, David Sterba wrote:
 Hi,

 you are adding a new smp_mb, can you please explain why it's needed and document it?

 thanks, dave

 On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote:
 This adds several synchronizations:
 - for a transaction commit, the scrub gets paused before the tree roots are committed until the supers are safely on disk
 - during a log commit, scrubbing of supers is disabled
 - on unmount, the scrub gets cancelled
 - on device removal, the scrub for the particular device gets cancelled

 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
  		goto error_undo;
 
  	device->in_fs_metadata = 0;
 +	smp_mb();

The idea was to disallow any new scrubs to start beyond this point, but it turns out this is not strong enough. I have to move the check for in_fs_metadata in btrfs_scrub_dev inside the scrub_lock. In this case, the smp_mb is still needed, as in_fs_metadata is not protected by any lock. I'll add a comment. Thanks for forcing me to rethink this :)

-Arne

 +	btrfs_scrub_cancel_dev(root, device);
 
  	/*
  	 * the device list mutex makes sure that we don't change
Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal
On 24.03.2011 13:58, Arne Jansen wrote:
 On 23.03.2011 18:28, David Sterba wrote:
 Hi,

 you are adding a new smp_mb, can you please explain why it's needed and document it?

 thanks, dave

 On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote:
 This adds several synchronizations:
 - for a transaction commit, the scrub gets paused before the tree roots are committed until the supers are safely on disk
 - during a log commit, scrubbing of supers is disabled
 - on unmount, the scrub gets cancelled
 - on device removal, the scrub for the particular device gets cancelled

 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
  		goto error_undo;
 
  	device->in_fs_metadata = 0;
 +	smp_mb();

 The idea was to disallow any new scrubs to start beyond this point, but it turns out this is not strong enough. I have to move the check for in_fs_metadata in btrfs_scrub_dev inside the scrub_lock. In this case, the smp_mb is still needed, as in_fs_metadata is not protected by any lock. I'll add a comment.

Thinking more about locking... the smp_mb is not necessary, because the following cancel_dev acquires a spin_lock, which implies a barrier.

-Arne

 +	btrfs_scrub_cancel_dev(root, device);
 
  	/*
  	 * the device list mutex makes sure that we don't change
Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes
On Thu, Mar 24, 2011 at 11:25:29AM +0100, Arne Jansen wrote:
 Thanks, I'll add you as Reported-by if that's ok.

Ok it is :)

dave
drives with more than 2 TByte
Hello, linux-btrfs,

what about disks with more than 2 TByte? Other filesystems (?) need GPT.

When I use mkfs.btrfs /dev/sdc (e.g. on the whole drive sdc), does that work without problems with btrfs?

Best regards!
Helmut
[RFC PATCHSET] btrfs: Simplify extent_buffer locking
Hello,

This is a split patchset of the RFC patches[1] to simplify btrfs locking and contains the following three patches.

0001-btrfs-Cleanup-extent_buffer-lockdep-code.patch
0002-btrfs-Use-separate-lockdep-class-keys-for-different-.patch
0003-btrfs-Simplify-extent_buffer-locking.patch

For more info, please read the patch description on 0003 and the following two threads.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/9658
http://thread.gmane.org/gmane.linux.kernel/1116910

0001 and 0002 improve lockdep key assignment such that extent_buffer locks get different keys depending on the type (objectid) of the btrfs_root they belong to. I think this should provide enough lockdep annotation resolution to avoid spurious triggering, but after applying this patchset, btrfs triggers several different locking dependency warnings. I've followed a couple of them and, to my untrained eyes, they seem to indicate genuine locking order problems in btrfs which were hidden till now because the custom locking was invisible to lockdep.

Anyways, so, it seems locking fixes or at least lockdep annotation improvements will be needed. Chris, how do you want to proceed?

Thanks.

 fs/btrfs/Makefile      |    2 
 fs/btrfs/ctree.c       |   16 +-
 fs/btrfs/disk-io.c     |  105 ++++++++++++----
 fs/btrfs/disk-io.h     |   21 ++-
 fs/btrfs/extent-tree.c |    2 
 fs/btrfs/extent_io.c   |    3 
 fs/btrfs/extent_io.h   |   12 --
 fs/btrfs/locking.c     |  233 -------------------------------------
 fs/btrfs/locking.h     |   65 ++--------
 fs/btrfs/volumes.c     |    2 
 10 files changed, 154 insertions(+), 307 deletions(-)

--
tejun

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
[PATCH 2/3] btrfs: Use separate lockdep class keys for different roots
Due to the custom extent_buffer locking implementation, currently lockdep doesn't have visibility into btrfs locking when the locks are switched to blocking, hiding most of the lock ordering issues from lockdep. With the planned switch to mutex, all extent_buffer locking operations will be visible to lockdep.

As btrfs_roots used for different purposes can be lock-nested, sharing the same set of lockdep class keys leads to spurious locking dependency warnings. This patch makes btrfs_set_buffer_lockdep_class() take a @root parameter which indicates the btrfs_root the @eb belongs to and use different sets of keys according to the type of @root.

Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/disk-io.c     |   91 +++++++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.h     |   10 +++--
 fs/btrfs/extent-tree.c |    2 +-
 fs/btrfs/volumes.c     |    2 +-
 4 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e973e0b..710efbd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -99,42 +99,79 @@ struct async_submit_bio {
 
 #ifdef CONFIG_LOCKDEP
 /*
- * These are used to set the lockdep class on the extent buffer locks.
- * The class is set by the readpage_end_io_hook after the buffer has
- * passed csum validation but before the pages are unlocked.
+ * Lockdep class keys for extent_buffer->lock's in this root.  For a given
+ * eb, the lockdep key is determined by the btrfs_root it belongs to and
+ * the level the eb occupies in the tree.
  *
- * The lockdep class is also set by btrfs_init_new_buffer on freshly
- * allocated blocks.
+ * Different roots are used for different purposes and may nest inside each
+ * other and they require separate keysets.  As lockdep keys should be
+ * static, assign keysets according to the purpose of the root as indicated
+ * by btrfs_root->objectid.  This ensures that all special purpose roots
+ * have separate keysets.
  *
- * The class is based on the level in the tree block, which allows lockdep
- * to know that lower nodes nest inside the locks of higher nodes.
+ * Lock-nesting across peer nodes is always done with the immediate parent
+ * node locked thus preventing deadlock.  As lockdep doesn't know this, use
+ * subclass to avoid triggering lockdep warning in such cases.
  *
- * We also add a check to make sure the highest level of the tree is
- * the same as our lockdep setup here. If BTRFS_MAX_LEVEL changes, this
- * code needs update as well.
+ * The key is set by the readpage_end_io_hook after the buffer has passed
+ * csum validation but before the pages are unlocked.  It is also set by
+ * btrfs_init_new_buffer on freshly allocated blocks.
+ *
+ * We also add a check to make sure the highest level of the tree is the
+ * same as our lockdep setup here.  If BTRFS_MAX_LEVEL changes, this code
+ * needs update as well.
  */
 # if BTRFS_MAX_LEVEL != 8
 #  error
 # endif
-static struct lock_class_key btrfs_eb_class[BTRFS_MAX_LEVEL + 1];
-static const char *btrfs_eb_name[BTRFS_MAX_LEVEL + 1] = {
-	/* leaf */
-	"btrfs-extent-00",
-	"btrfs-extent-01",
-	"btrfs-extent-02",
-	"btrfs-extent-03",
-	"btrfs-extent-04",
-	"btrfs-extent-05",
-	"btrfs-extent-06",
-	"btrfs-extent-07",
-	/* highest possible level */
-	"btrfs-extent-08",
+
+static struct btrfs_lockdep_keyset {
+	u64			id;		/* root objectid */
+	const char		*name_stem;	/* lock name stem */
+	char			names[BTRFS_MAX_LEVEL + 1][20];
+	struct lock_class_key	keys[BTRFS_MAX_LEVEL + 1];
+} btrfs_lockdep_keysets[] = {
+	{ .id = BTRFS_ROOT_TREE_OBJECTID,	.name_stem = "root"	},
+	{ .id = BTRFS_EXTENT_TREE_OBJECTID,	.name_stem = "extent"	},
+	{ .id = BTRFS_CHUNK_TREE_OBJECTID,	.name_stem = "chunk"	},
+	{ .id = BTRFS_DEV_TREE_OBJECTID,	.name_stem = "dev"	},
+	{ .id = BTRFS_FS_TREE_OBJECTID,		.name_stem = "fs"	},
+	{ .id = BTRFS_CSUM_TREE_OBJECTID,	.name_stem = "csum"	},
+	{ .id = BTRFS_ORPHAN_OBJECTID,		.name_stem = "orphan"	},
+	{ .id = BTRFS_TREE_LOG_OBJECTID,	.name_stem = "log"	},
+	{ .id = BTRFS_TREE_RELOC_OBJECTID,	.name_stem = "treloc"	},
+	{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID,	.name_stem = "dreloc"	},
+	{ .id = 0,				.name_stem = "tree"	},
 };
 
-void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
+void __init btrfs_init_lockdep(void)
+{
+	int i, j;
+
+	/* initialize lockdep class names */
+	for (i = 0; i < ARRAY_SIZE(btrfs_lockdep_keysets); i++) {
+		struct btrfs_lockdep_keyset *ks = &btrfs_lockdep_keysets[i];
+
+		for (j = 0; j < ARRAY_SIZE(ks->names); j++)
+			snprintf(ks->names[j], sizeof(ks->names[j]),
+				 "btrfs-%s-%02d", ks->name_stem, j);
+
[PATCH 1/3] btrfs: Cleanup extent_buffer lockdep code
btrfs_set_buffer_lockdep_class() should be dependent upon CONFIG_LOCKDEP instead of CONFIG_DEBUG_LOCK_ALLOC. Collect the related code into one place, use CONFIG_LOCKDEP instead and make some cosmetic changes.

Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/disk-io.c |   22 ++++++++++++----------
 fs/btrfs/disk-io.h |   11 +++++------
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..e973e0b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -97,7 +97,9 @@ struct async_submit_bio {
 	struct btrfs_work work;
 };
 
-/* These are used to set the lockdep class on the extent buffer locks.
+#ifdef CONFIG_LOCKDEP
+/*
+ * These are used to set the lockdep class on the extent buffer locks.
  * The class is set by the readpage_end_io_hook after the buffer has
  * passed csum validation but before the pages are unlocked.
  *
@@ -111,7 +113,6 @@ struct async_submit_bio {
  * the same as our lockdep setup here. If BTRFS_MAX_LEVEL changes, this
  * code needs update as well.
  */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
 # if BTRFS_MAX_LEVEL != 8
 #  error
 # endif
@@ -129,7 +130,13 @@ static const char *btrfs_eb_name[BTRFS_MAX_LEVEL + 1] = {
 	/* highest possible level */
 	"btrfs-extent-08",
 };
-#endif
+
+void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
+{
+	lockdep_set_class_and_name(&eb->lock, &btrfs_eb_class[level],
+				   btrfs_eb_name[level]);
+}
+#endif /* CONFIG_LOCKDEP */
 
 /*
  * extents on the btree inode are pretty simple, there's one extent
@@ -419,15 +426,6 @@ static int check_tree_block_fsid(struct btrfs_root *root,
 	return ret;
 }
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
-{
-	lockdep_set_class_and_name(&eb->lock,
-				   &btrfs_eb_class[level],
-				   btrfs_eb_name[level]);
-}
-#endif
-
 static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
 			       struct extent_state *state)
 {
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 07b20dc..4ab3fa8 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -102,13 +102,12 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 		       struct btrfs_root *root);
 
 int btree_lock_page_hook(struct page *page);
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#ifdef CONFIG_LOCKDEP
 void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level);
 #else
 static inline void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb,
 						  int level)
-{
-}
-#endif
-#endif
+{ }
+#endif /* CONFIG_LOCKDEP */
+
+#endif /* __DISKIO__ */
--
1.7.1
[PATCH 3/3] btrfs: Simplify extent_buffer locking
extent_buffer implemented custom locking which required explicit distinction between non-sleepable and sleepable lockings. This was to prevent excessive context switches. For short non-blocking acquisitions, the lock was left non-blocking and other threads which wanted to lock the same eb would spin on it instead of scheduling out. If the lock owner wanted to perform blocking operations, it had to upgrade the locking to blocking mode by calling btrfs_set_lock_blocking().

The distinction is useful and leads to performance gains compared to a naive sleeping lock implementation; however, the standard mutex implementation already has adaptive owner spinning - CONFIG_MUTEX_SPIN_ON_OWNER - which addresses the same problem in a transparent manner. Compared to CONFIG_MUTEX_SPIN_ON_OWNER, the custom implementation has several disadvantages.

* It requires explicit blocking state management by the lock owner, which can be tedious, error-prone and has its own overhead.

* Although the default mutex lacks access to explicit information from the lock owner, it has direct visibility into scheduling which is often better information for deciding whether optimistic spinning would be useful.

* Lockdep annotation comes for free. This can be added to the custom implementation but hasn't been.

This patch removes the custom extent_buffer locking by replacing eb->lock with a mutex and making the locking API simple wrappers around mutex operations. The following is from dbench 50 runs on an 8-way opteron w/ 4GiB of memory and SSD. CONFIG_PREEMPT_VOLUNTARY is set.

        USER    SYSTEM  SIRQ    CXTSW   THROUGHPUT
 BEFORE 59898   504517  377     6814245 782.295
 AFTER  61090   493441  457     1631688 827.751

Other tests also show generally favorable results for the standard mutex based implementation. For more info, please read the following threads.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/9658
http://thread.gmane.org/gmane.linux.kernel/1116910

This patch makes all eb locking visible to lockdep and triggers various locking ordering warnings along the allocation path. At least some of them seem to indicate genuine locking bugs while it is possible that some are spuriously triggered and simply require better lockdep annotations. Note that this patch doesn't change locking ordering itself. Lockdep now just has more visibility into btrfs locking.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
---
 fs/btrfs/Makefile    |    2 +-
 fs/btrfs/ctree.c     |   16 ++--
 fs/btrfs/extent_io.c |    3 +-
 fs/btrfs/extent_io.h |   12 +--
 fs/btrfs/locking.c   |  233 --------------------------------------------
 fs/btrfs/locking.h   |   65 ++-----------
 6 files changed, 70 insertions(+), 261 deletions(-)
 delete mode 100644 fs/btrfs/locking.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 31610ea..8688f47 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -5,6 +5,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   file-item.o inode-item.o inode-map.o disk-io.o \
 	   transaction.o inode.o file.o tree-defrag.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
-	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
+	   extent_io.o volumes.o async-thread.o ioctl.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b5baff0..bc1627d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1074,7 +1074,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 	left = read_node_slot(root, parent, pslot - 1);
 	if (left) {
-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left);
@@ -1085,7 +1085,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 	}
 	right = read_node_slot(root, parent, pslot + 1);
 	if (right) {
-		btrfs_tree_lock(right);
+		btrfs_tree_lock_nested(right, 2);
 		btrfs_set_lock_blocking(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right);
@@ -1241,7 +1241,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 	if (left) {
 		u32 left_nr;
 
-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);
 
 		left_nr = btrfs_header_nritems(left);
@@ -1291,7 +1291,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 	if
Re: drives with more than 2 TByte
On 03/24/2011 05:43 PM, Helmut Hullen wrote:
 Hello, linux-btrfs,

 what about disks with more than 2 TByte? Other filesystems (?) need GPT.

The filesystems don't care about the partitioning scheme. The 2 TB limit is related to the maximum partition size an MBR partition table can describe. Of course, a filesystem cannot be greater than the partition where it is allocated.

 When I use mkfs.btrfs /dev/sdc (e.g. on the whole drive sdc), does that work without problems with btrfs?

It should. BTW, why don't you use a GPT partition table?
Re: drives with more than 2 TByte
Hello, Goffredo, you wrote on 24.03.11:
>> what about disks with more than 2 TByte? Other filesystems (?) need GPT.
> The filesystem doesn't care about the partitioning scheme.

Ok - thank you!

Best regards! Helmut
Re: [PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL
Hi,

On Thu, Mar 24, 2011 at 07:41:21PM +0800, Miao Xie wrote:
> In the filesystem context, we must allocate memory by GFP_NOFS, or we may
> start another filesystem operation and make the kswapd thread hang up.

indeed. Did you check for other GFP_KERNEL allocations? I've found 8 more of them, and at least these look like candidates for GFP_NOFS too:

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index de34bfa..76b9218 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -289,7 +289,7 @@ int btrfs_acl_chmod(struct inode *inode)
 	if (IS_ERR(acl) || !acl)
 		return PTR_ERR(acl);
-	clone = posix_acl_clone(acl, GFP_KERNEL);
+	clone = posix_acl_clone(acl, GFP_NOFS);
 	posix_acl_release(acl);
 	if (!clone)
 		return -ENOMEM;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f447b78..eb5c01d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -998,7 +998,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
 		     PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
 		     (sizeof(struct page *)));
-	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
+	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_NOFS);
 	if (!pages) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d1bace3..e9b9648 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1644,7 +1644,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 		goto out;
 	}
-	range = kzalloc(sizeof(*range), GFP_KERNEL);
+	range = kzalloc(sizeof(*range), GFP_NOFS);
 	if (!range) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d39a989..5e0fff7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -399,7 +399,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
 	 * strsep changes the string, duplicate it because parse_options
 	 * gets called twice
 	 */
-	opts = kstrdup(options, GFP_KERNEL);
+	opts = kstrdup(options, GFP_NOFS);
 	if (!opts)
 		return -ENOMEM;
 	orig = opts;
@@ -446,7 +446,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
 	 * mount path doesn't care if it's the default volume or another one.
 	 */
 	if (!*subvol_name) {
-		*subvol_name = kstrdup(".", GFP_KERNEL);
+		*subvol_name = kstrdup(".", GFP_NOFS);
 		if (!*subvol_name)
 			return -ENOMEM;
 	}

dave
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Hi,

there's one thing I want to bring up. It's not related to the delayed-items functionality itself but to the git tree the patch is based on: there's a merge conflict when your patch is applied directly onto Linus' tree, but not when it is applied onto Chris' tree.

On Thu, Mar 24, 2011 at 07:41:31PM +0800, Miao Xie wrote:
> ...
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6726,6 +6775,9 @@ void btrfs_destroy_inode(struct inode *inode)
>  	inode_tree_del(inode);
>  	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
>  free:
> +	ret = btrfs_remove_delayed_node(inode);
> +	BUG_ON(ret);
> +
>  	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
>  }

the call to kmem_cache_free has been replaced by

commit fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9
Author: Nick Piggin npig...@kernel.dk
Date:   Fri Jan 7 17:49:49 2011 +1100

    fs: icache RCU free inodes

relevant hunk:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6495,6 +6495,13 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	return inode;
 }

+static void btrfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -6564,7 +6571,7 @@ void btrfs_destroy_inode(struct inode *inode)
 	inode_tree_del(inode);
 	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }

I don't think this disqualifies all the testing already done, but maybe it's time to rebase btrfs-unstable.git to .38. Chris?

dave
Re: [PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL
On Fri, 25 Mar 2011 00:07:59 +0100, David Sterba wrote:
> On Thu, Mar 24, 2011 at 07:41:21PM +0800, Miao Xie wrote:
>> In the filesystem context, we must allocate memory by GFP_NOFS, or we may
>> start another filesystem operation and make the kswapd thread hang up.
>
> indeed. Did you check for other GFP_KERNEL allocations? I've found 8 more
> of them, and at least these look like candidates for GFP_NOFS too:

I just fixed the ones which should use GFP_NOFS. Not all of the GFP_KERNEL allocations are wrong: if we don't hold any btrfs lock except the relevant inode's i_mutex, and we are not in the context of a transaction, we can use GFP_KERNEL. So I think the following GFP_KERNEL allocations are correct.

Thanks
Miao

> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
> index de34bfa..76b9218 100644
> --- a/fs/btrfs/acl.c
> +++ b/fs/btrfs/acl.c
> @@ -289,7 +289,7 @@ int btrfs_acl_chmod(struct inode *inode)
>  	if (IS_ERR(acl) || !acl)
>  		return PTR_ERR(acl);
> -	clone = posix_acl_clone(acl, GFP_KERNEL);
> +	clone = posix_acl_clone(acl, GFP_NOFS);
>  	posix_acl_release(acl);
>  	if (!clone)
>  		return -ENOMEM;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f447b78..eb5c01d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -998,7 +998,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>  	nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
>  		     PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
>  		     (sizeof(struct page *)));
> -	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
> +	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_NOFS);
>  	if (!pages) {
>  		ret = -ENOMEM;
>  		goto out;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index d1bace3..e9b9648 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1644,7 +1644,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
>  		goto out;
>  	}
> -	range = kzalloc(sizeof(*range), GFP_KERNEL);
> +	range = kzalloc(sizeof(*range), GFP_NOFS);
>  	if (!range) {
>  		ret = -ENOMEM;
>  		goto out;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index d39a989..5e0fff7 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -399,7 +399,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  	 * strsep changes the string, duplicate it because parse_options
>  	 * gets called twice
>  	 */
> -	opts = kstrdup(options, GFP_KERNEL);
> +	opts = kstrdup(options, GFP_NOFS);
>  	if (!opts)
>  		return -ENOMEM;
>  	orig = opts;
> @@ -446,7 +446,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
>  	 * mount path doesn't care if it's the default volume or another one.
>  	 */
>  	if (!*subvol_name) {
> -		*subvol_name = kstrdup(".", GFP_KERNEL);
> +		*subvol_name = kstrdup(".", GFP_NOFS);
>  		if (!*subvol_name)
>  			return -ENOMEM;
>  	}
>
> dave
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote:
> * Tejun Heo t...@kernel.org wrote:
>> NOT-Signed-off-by: Tejun Heo t...@kernel.org
>
> s/NOT-// ?

Perhaps because it is still in RFC context?

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> Adaptive owner spinning used to be applied only to mutex_lock(). This
> patch applies it also to mutex_trylock().
>
> btrfs has developed custom locking to avoid excessive context switches
> in its btree implementation. Generally, doing away with the custom
> implementation and just using the mutex shows better behavior; however,
> there's an interesting distinction in the custom implementation of
> trylock. It distinguishes between simple trylock and tryspin, where the
> former just tries once and then fails while the latter does some
> spinning before giving up.
>
> Currently, mutex_trylock() doesn't use adaptive spinning. It tries just
> once. I got curious whether using adaptive spinning on mutex_trylock()
> would be beneficial, and it seems so, for btrfs anyway.
>
> The following results are from dbench 50 run on an Opteron two-socket,
> eight-core machine with 4GiB of memory and an OCZ Vertex SSD. During
> the run, the disk stays mostly idle and all CPUs are fully occupied, so
> the difference in locking performance becomes quite visible.
>
> SIMPLE is with the locking simplification patch[1] applied, i.e. it
> basically just uses the mutex. SPIN is with this patch applied on top:
> mutex_trylock() uses adaptive spinning.
>
>          USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
>  SIMPLE  61107  354977  217   8099529  845.100 MB/sec
>  SPIN    63140  364888  214   6840527  879.077 MB/sec
>
> On various runs, the adaptive spinning trylock consistently posts
> higher throughput. The amount of difference varies, but it outperforms
> consistently. In general, using adaptive spinning on trylock makes
> sense, as trylock failure usually leads to a costly unlock-relock
> sequence.
>
> [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
>
> Signed-off-by: Tejun Heo t...@kernel.org

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}

where the normal locking order is: B -> A.

If another location does:

	mutex_lock(B);
	[...]
	mutex_lock(A);

but another process has A already and is running, it may spin waiting for A as A's owner is still running. But now mutex_trylock(B) becomes a spinner too: since B's owner is running (spinning on A), the task holding A will spin waiting for B's owner to release B. Unfortunately, B's owner is in turn spinning waiting for A to be released. If both A's and B's owners are real-time tasks, then boom! deadlock.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:
> But now mutex_trylock(B) becomes a spinner too: since B's owner is
> running (spinning on A), the task holding A will spin waiting for B's
> owner to release B. Unfortunately, B's owner is in turn spinning waiting
> for A to be released. If both A's and B's owners are real-time tasks,
> then boom! deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a deadlock with two perfectly normal processes. Of course, for non-RT tasks, any other process will eventually disturb the situation and you'd get kicked out due to need_resched(), but even that might be avoided for a long time if there are other CPUs - leading to tons of wasted CPU time.

Linus