Re: btrfs: hanging processes - race condition?
On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua...@intel.com> wrote:
>> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>>> List,
>>>
>>> I ran into a hang issue (race condition: cpu is high when the server is
>>> idle, meaning that btrfs is hanging, and iowait is high as well) running
>>> 2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275 w/ 16GB ram).
>>> The btrfs filesystem lives on 18x300GB scsi spindles, configured as
>>> Raid-0, as shown below:
>>>
>>> Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
>>>  Total devices 18 FS bytes used 2.94TB
>>>  devid  5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
>>>  devid 17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
>>>  devid 16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
>>>  devid  4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
>>>  devid  1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
>>>  devid 13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
>>>  devid  8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
>>>  devid 12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
>>>  devid  3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
>>>  devid  9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
>>>  devid  6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
>>>  devid 11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
>>>  devid 14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
>>>  devid  2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
>>>  devid 15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
>>>  devid 10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
>>>  devid  7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
>>>  devid 18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
>>> Btrfs v0.19-16-g075587c-dirty
>>>
>>> The filesystem, mounted in /mnt/btrfs, is hanging: no existing or new
>>> process can access it, however 'df' still displays the disk usage (3TB
>>> out of 5). The disks appear to be physically healthy. Please note that a
>>> significant number of files were placed on this filesystem, between 20
>>> and 30 million files.
>>> The relevant kernel messages are displayed below:
>>>
>>> INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> btrfs-submit- D 00010042e12f     0  4220      2 0x00000000
>>>  8803e584ac70 0046 4000 00011680
>>>  8803f7349fd8 8803f7349fd8 8803e584ac70 00011680
>>>  0001 8803ff99d250 8149f020 81150ab0
>>> Call Trace:
>>>  [813089f3] ? io_schedule+0x71/0xb1
>>>  [811470be] ? get_request_wait+0xab/0x140
>>>  [810406f4] ? autoremove_wake_function+0x0/0x2e
>>>  [81143a4d] ? elv_rq_merge_ok+0x89/0x97
>>>  [8114a245] ? blk_recount_segments+0x17/0x27
>>>  [81147429] ? __make_request+0x2d6/0x3fc
>>>  [81145b16] ? generic_make_request+0x207/0x268
>>>  [81145c12] ? submit_bio+0x9b/0xa2
>>>  [a01aa081] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
>>>  [a01a5365] ? run_scheduled_bios+0x297/0x48f [btrfs]
>>>  [a01aa687] ? worker_loop+0x17c/0x452 [btrfs]
>>>  [a01aa50b] ? worker_loop+0x0/0x452 [btrfs]
>>>  [81040331] ? kthread+0x79/0x81
>>>  [81003674] ? kernel_thread_helper+0x4/0x10
>>>  [810402b8] ? kthread+0x0/0x81
>>>  [81003670] ? kernel_thread_helper+0x0/0x10
>>
>> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
>> This is reproducible in our setup.
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375. The code
> in the first do-while loop in btrfs_commit_transaction sets the current
> process to the TASK_UNINTERRUPTIBLE state, then calls
> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().

Hi,

When I test random writes, I see a lot of threads jump into
btree_writepages(), do nothing, and io throughput is zero for some time.
Looks like there is a live lock. See the code of btree_writepages():

	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
		if (num_dirty < thresh)		<<< the marked line
			return 0;
	}

The marked line is quite intrusive.
In my test, the live lock is caused by the thresh check: the
dirty_metadata_bytes stay below 32M, so btree_writepages() keeps
returning 0 without writing anything. Without the check, I can't see the
live lock. Not sure if this is related to the hang.

Thanks,
Shaohua
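The failure mode in Yan, Zheng's analysis quoted above can be sketched in
kernel-style pseudocode (schematic only: the btrfs function names are taken
from the analysis, while the loop shape and the transaction_ready condition
are illustrative stand-ins, not copied from the btrfs source):

```
	/* first do-while loop of btrfs_commit_transaction, schematically */
	do {
		/* mark ourselves asleep before waiting for writers */
		set_current_state(TASK_UNINTERRUPTIBLE);

		btrfs_start_delalloc_inodes(root);    /* may cond_resched() */
		btrfs_wait_ordered_extents(root, 1);  /* may cond_resched() */

		/*
		 * cond_resched() calls schedule().  With the task still in
		 * TASK_UNINTERRUPTIBLE, schedule() takes it off the run
		 * queue; if the wake_up() it was waiting for has already
		 * fired, nothing wakes it again, which matches the
		 * "blocked for more than 120 seconds" hung task above.
		 */
	} while (!transaction_ready);
```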
Re: [PATCH][RFC] Complex filesystem operations: split and join
Nikanth Karthikesan <knika...@novell.com> writes:

> I had a need to split a file into smaller files on a thumb drive with no
> free space on it or anywhere else in the system. When the filesystem
> supports sparse files (truncate_range), I could create files while
> punching holes in the original file. But when the underlying fs is FAT,
> I couldn't. Also, why should we do needless I/O when all I want is to
> split/join files? i.e., all the data are already on the disk, under the
> same filesystem; I just want to do some metadata changes.
>
> So I added two inode operations, namely split and join, that let me tell
> the OS that all I want is metadata changes, and the file-system can
> avoid doing lots of I/O when only metadata changes are needed.
>
> sys_split(fd1, n, fd2)
>  1. Attach the data of the file after n bytes in fd1 to fd2.
>  2. Truncate fd1 to n bytes.
>
> Roughly equivalent to the following commands:
>  1. dd if=file1 of=file2 skip=n
>  2. truncate -c -s n file1
>
> sys_join(fd1, fd2)
>  1. Extend fd1 with the data of fd2.
>  2. Truncate fd2 to 0.
>
> Roughly equivalent to the following commands:
>  1. dd if=file2 of=file1 seek=`filesize file1`
>  2. truncate -c -s 0 file2
>
> Attached is the patch that adds these new syscalls and support for them
> to the FAT filesystem.
>
> I guess this approach can be extended to a splice() kind of call between
> files, instead of pipes. On a COW fs, splice could simply set up blocks
> as shared between files, instead of doing I/O. It would be a kind of
> explicit online data-deduplication. Later, when a file modifies any of
> those blocks, we copy the blocks, i.e., COW.

[I'll just ignore the implementation for now, because the patch totally
ignores cache management.]

I have no objection to such operations (likewise make hole, truncate any
range, etc.). However, only if someone has enough motivation to
implement/maintain those operations, AND there are real users (i.e. a
real, sane use case). Otherwise, IMO it would be worse than nothing.
Because, of course, once such code is in the tree, we can't ignore it
anymore until it is removed completely, e.g. for security reasons. And
IMHO, the cache management for such operations is not so easy.

Thanks.
-- 
OGAWA Hirofumi <hirof...@mail.parknet.co.jp>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: default subvolume abilities/restrictions
On Sat, Jun 12, 2010 at 8:06 PM, C Anthony Risinger <anth...@extof.me> wrote:
> On Sat, Jun 12, 2010 at 7:22 PM, David Brown <bt...@davidb.org> wrote:
>> On Sat, Jun 12, 2010 at 06:06:23PM -0500, C Anthony Risinger wrote:
>>> # btrfs subvolume create new_root
>>> # mv . new_root/old_root
>>>
>>> can i at least get confirmation that the above is possible?
>>
>> I've had no problem with
>>
>>   # btrfs subvolume snapshot . new_root
>>   # mkdir old_root
>>   # mv * old_root
>>   # rm -rf old_root
>>
>> Make sure the 'mv' fails to move new_root, and I'd look at the new_root
>> before removing everything.
>>
>> David
>
> heh, yeah as i was writing the last email i realized that all i really
> wanted was to:
>
>   # mv * new_root
>
> for some reason i was convinced that i must snapshot the old_root (.)
> to new_root... and then remove the erroneous stuff from old_root (.).
> thus a way to re-parent the default subvol (old_root/.) seemed a better
> solution... but alas, a snapshot isn't necessary. i can create an empty
> subvol new_root, and then mv * new_root. i don't know how that escaped
> me :-), sorry for all the noise.
>
> however, there probably is a legitimate use case for wanting to replace
> the default subvolume, but this isn't it.
>
> C Anthony

ok i take it all back, i DO need this... i rewrote my initramfs hook to
do the following operations:

  # btrfs subvolume create /new_root
  # mv /* /new_root

instead of what i had:

  # btrfs subvolume snapshot / /new_root

and it resulted in scarily COPYING my entire system... several gigs
worth... to the newly created subvolume, which took forever and ground
on my HD for a while. i don't know how long because i went to bed.

this is why i need a way to re-parent the default subvolume. a snapshot
is nice and quick, but it leaves / full of erroneous folders
(dev/etc/usr/lib), an entire system that will no longer be used. this
space will in time become dead, wasted space unless my users manually
rm -rf it themselves.

so... any input on this?
how can i effectively, and efficiently, move a user's installation into a
dedicated subvolume when they have already installed into the default
subvolume? i think the best way is what i originally suggested: make an
empty subvolume the new top-level subvol, and place the old top-level
subvol INTO it with a new name.

thoughts?

C Anthony