Re: btrfs: hanging processes - race condition?

2010-06-13 Thread Shaohua Li
On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
 On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li shaohua...@intel.com wrote:
  On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
  List,
 
  I ran into a hang issue (likely a race condition: cpu usage is high while
  the server is idle, meaning that btrfs is hanging, and iowait is high as
  well) running 2.6.34 on debian/lenny on an x86_64 server (dual Opteron 275
  w/ 16GB ram). The btrfs filesystem lives on 18x300GB scsi spindles,
  configured as RAID-0, as shown below:
 
  Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
           Total devices 18 FS bytes used 2.94TB
           devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
           devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
           devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
           devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
           devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
           devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
           devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
           devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
           devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
           devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
           devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
           devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
           devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
           devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
           devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
           devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
           devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
           devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
  Btrfs v0.19-16-g075587c-dirty
 
  The filesystem, mounted at /mnt/btrfs, is hanging: no existing or new
  process can access it; however, 'df' still displays the disk usage (3TB out
  of 5TB). The disks appear to be physically healthy. Please note that a
  significant number of files, between 20 and 30 million, were placed on
  this filesystem.
 
  The relevant kernel messages are displayed below:
 
  INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
  btrfs-submit- D 00010042e12f     0  4220      2 0x
    8803e584ac70 0046 4000 00011680
    8803f7349fd8 8803f7349fd8 8803e584ac70 00011680
    0001 8803ff99d250 8149f020 81150ab0
  Call Trace:
    [813089f3] ? io_schedule+0x71/0xb1
    [811470be] ? get_request_wait+0xab/0x140
    [810406f4] ? autoremove_wake_function+0x0/0x2e
    [81143a4d] ? elv_rq_merge_ok+0x89/0x97
    [8114a245] ? blk_recount_segments+0x17/0x27
    [81147429] ? __make_request+0x2d6/0x3fc
    [81145b16] ? generic_make_request+0x207/0x268
    [81145c12] ? submit_bio+0x9b/0xa2
    [a01aa081] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
    [a01a5365] ? run_scheduled_bios+0x297/0x48f [btrfs]
    [a01aa687] ? worker_loop+0x17c/0x452 [btrfs]
    [a01aa50b] ? worker_loop+0x0/0x452 [btrfs]
    [81040331] ? kthread+0x79/0x81
    [81003674] ? kernel_thread_helper+0x4/0x10
    [810402b8] ? kthread+0x0/0x81
    [81003670] ? kernel_thread_helper+0x0/0x10
  This looks like the issue we saw too: http://lkml.org/lkml/2010/6/8/375.
  This is reproducible in our setup.
 
 I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
 The code in the first do-while loop in btrfs_commit_transaction
 sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
 btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
 btrfs_run_ordered_operations(). All of these functions may call
 cond_resched(), and scheduling between setting the task state and the
 eventual schedule() call breaks the prepare-to-wait pattern.
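
 A minimal sketch of the pattern at issue (illustrative kernel-style C,
 not the actual btrfs_commit_transaction code; 'root' and
 'transaction_done' are hypothetical stand-ins for the real state):

	/*
	 * Illustrative fragment only. In the prepare-to-wait pattern,
	 * nothing between setting the task state and the schedule()
	 * call may itself sleep or schedule.
	 */
	set_current_state(TASK_UNINTERRUPTIBLE);

	/* BUG: these helpers may call cond_resched(), i.e. schedule
	 * while we are already marked TASK_UNINTERRUPTIBLE */
	btrfs_start_delalloc_inodes(root);
	btrfs_wait_ordered_extents(root);

	if (!transaction_done)
		schedule();
	__set_current_state(TASK_RUNNING);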
Hi,
When testing random writes, I saw a lot of threads jump into
btree_writepages() and do nothing, and I/O throughput was zero for some
time. Looks like there is a livelock. See the code of btree_writepages():
	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
		if (num_dirty < thresh)
			return 0;
	}
The thresh check is the problem. In my test, the livelock is caused by that
check: dirty_metadata_bytes stays below 32M, so btree_writepages() returns
without writing anything. Without the check, I can't reproduce the livelock.
Not sure if this is related to the hang.
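
As a userspace analogue of this threshold-skip pattern (a minimal sketch,
not btrfs code; all names and numbers here are made up for illustration):

	/* A "flusher" that refuses to flush while dirty bytes are below
	 * a threshold never makes progress for waiters that need *any*
	 * flush at all: a livelock. */
	#include <stdio.h>

	#define THRESH (32UL * 1024 * 1024)

	static unsigned long dirty_bytes = 16UL * 1024 * 1024; /* < THRESH */

	/* stand-in for btree_writepages() with WB_SYNC_NONE */
	static int flush(void)
	{
		if (dirty_bytes < THRESH)
			return 0;	/* skip: "not enough dirty data" */
		dirty_bytes = 0;	/* pretend everything was written */
		return 1;
	}

	int main(void)
	{
		int attempts;

		/* a waiter kicking writeback until the dirty data drains */
		for (attempts = 1; dirty_bytes > 0 && attempts <= 5; attempts++)
			if (!flush())
				printf("attempt %d: flush skipped, %lu bytes dirty\n",
				       attempts, dirty_bytes);
		printf("no forward progress: livelock\n");
		return 0;
	}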

Thanks,
Shaohua

Re: [PATCH][RFC] Complex filesystem operations: split and join

2010-06-13 Thread OGAWA Hirofumi
Nikanth Karthikesan knika...@novell.com writes:

 I had a need to split a file into smaller files on a thumb drive with no
 free space on it or anywhere else in the system. When the filesystem
 supports sparse files (truncate_range), I could create files while
 punching holes in the original file. But when the underlying fs is FAT,
 I couldn't. Also, why should we do needless I/O when all I want is to
 split/join files? I.e., all the data is already on disk, under the
 same filesystem; I just want to make some metadata changes.

 So, I added two inode operations, namely split and join, that let me
 tell the OS that all I want is metadata changes. The filesystem can
 then avoid doing lots of I/O when only metadata changes are needed.

 sys_split(fd1, n, fd2)
 1. Attach the data of file after n bytes in fd1 to fd2.
 2. Truncate fd1 to n bytes.

 Roughly, it can be thought of as equivalent to the following commands:
 1. dd if=file1 of=file2 skip=n
 2. truncate -c -s n file1

 sys_join(fd1, fd2)
 1. Extend fd1 with data of fd2
 2. Truncate fd2 to 0.

 Roughly, it can be thought of as equivalent to the following commands:
 1. dd if=file2 of=file1 seek=`filesize file1`
 2. truncate -c -s 0 file2

 Attached is the patch that adds these new syscalls and support for them
 to the FAT filesystem.
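
 For illustration, userspace might invoke the proposed calls roughly like
 this (a sketch under the RFC's sys_split(fd1, n, fd2) / sys_join(fd1, fd2)
 signatures; __NR_split and __NR_join are hypothetical placeholders, not
 mainline syscall numbers):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* placeholders -- the real numbers would come from the RFC patch */
	#define __NR_split 337
	#define __NR_join  338

	int main(void)
	{
		int fd1 = open("file1", O_RDWR);
		int fd2 = open("file2", O_RDWR | O_CREAT, 0644);

		if (fd1 < 0 || fd2 < 0)
			return 1;

		/* move everything past the first 4096 bytes of file1 into
		 * file2, then truncate file1 to 4096 bytes */
		if (syscall(__NR_split, fd1, 4096L, fd2) < 0)
			perror("split");

		/* append file2's data to file1 and truncate file2 to 0 */
		if (syscall(__NR_join, fd1, fd2) < 0)
			perror("join");

		close(fd1);
		close(fd2);
		return 0;
	}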

 I guess, this approach can be extended to splice() kind of call, between
 files, instead of pipes. On a COW fs, splice could simply setup blocks
 as shared between files, instead of doing I/O. It would be a kind of
 explicit online data-deduplication. Later when a file modifies any of
 those blocks, we copy blocks. i.e., COW.

[I'll just ignore the implementation for now, because the patch totally
ignores cache management.]

I have no objection to such operations (likewise make-hole, truncate any
range, etc.), but only if someone has enough motivation to implement and
maintain them, AND there are real users (i.e. a real, sane use case).

Otherwise, IMO it would be worse than nothing. Because, of course, once
such code is in the tree, we can't ignore it anymore until it is removed
completely, e.g. for security reasons. And IMHO, the cache management
for such operations is not so easy.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp


Re: default subvolume abilities/restrictions

2010-06-13 Thread C Anthony Risinger
On Sat, Jun 12, 2010 at 8:06 PM, C Anthony Risinger anth...@extof.me wrote:
 On Sat, Jun 12, 2010 at 7:22 PM, David Brown bt...@davidb.org wrote:
 On Sat, Jun 12, 2010 at 06:06:23PM -0500, C Anthony Risinger wrote:

 # btrfs subvolume create new_root
 # mv . new_root/old_root

 can i at least get confirmation that the above is possible?

 I've had no problem with

  # btrfs subvolume snapshot . new_root
  # mkdir old_root
  # mv * old_root
  # rm -rf old_root

 Make sure the 'mv' fails to move new_root, and I'd look inside
 new_root before removing everything.

 David

 heh, yeah, as i was writing the last email i realized that all i
 really wanted was to:

 # mv * new_root

 for some reason i was convinced that i must snapshot the old_root (.)
 to new_root... and then remove the erroneous stuff from old_root (.).
 thus a way to parent the default subvol (old_root/.) seemed a better
 solution...

 but alas, a snapshot isn't necessary.  i can create an empty subvol
 new_root, and then mv * new_root.

 i don't know how that escaped me :-), sorry for all the noise.
 however, there probably is a legitimate use case for wanting to
 replace the default subvolume, but this isn't it.

 C Anthony

ok i take it all back, i DO need this...

i rewrote my initramfs hook to do the following operations:

# btrfs subvolume create /new_root
# mv /* /new_root

instead of what i had:

# btrfs subvolume snapshot / /new_root

and it resulted in scarily COPYING my entire system... several gigs
worth... to the newly created subvolume (mv can't rename(2) across
subvolume boundaries: the kernel returns EXDEV, so mv falls back to
copy-and-delete), which took forever and ground on my HD for a while.
i don't know how long because i went to bed.

this is why i need a way to parent the default subvolume.

a snapshot is nice and quick, but it leaves / full of leftover
folders (dev, etc, usr, lib), an entire system that will no longer be
used.  this space will in time become dead, wasted space unless my
users manually rm -rf it themselves.

so... any input on this?  how can i effectively, and efficiently, move
a user's installation into a dedicated subvolume when they have
already installed into the default subvolume?

i think the best way is what i originally suggested: make an empty
subvolume the new top-level subvol, and place the old top-level subvol
INTO it with a new name.
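
for comparison, a rough sketch of the snapshot-based route (assuming a
btrfs-progs with 'btrfs subvolume set-default'; <new_root_id> is whatever
'btrfs subvolume list' reports for new_root):

# btrfs subvolume snapshot / /new_root
# btrfs subvolume list /
# btrfs subvolume set-default <new_root_id> /
# (reboot into new_root, then mount the old top-level subvolume
#  somewhere and rm -rf the stale copies)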

thoughts?

C Anthony