Re: Confused by performance

2010-06-17 Thread David Brown

On 16/06/2010 21:35, Freddie Cash wrote:

snip a lot of fancy math that missed the point

That's all well and good, but you missed the part where he said ext2
on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
btrfs stripeset.

IOW, same 5-way stripeset, different filesystems and volume managers,
and very different performance.

And he's wondering why the btrfs method used for striping is so much
slower than the lvm method used for striping.



This could easily be explained by Roberto's theory and maths - if the 
lvm stripe set used large stripe sizes, so that the random reads were 
mostly served from a single disk, it would be fast.  If the btrfs stripes 
were small, then it would be slow due to all the extra seeks.


Do we know anything about the stripe sizes used?
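
For what it's worth, the lvm side is easy to check.  Something like the
following should show the stripe count and stripe size (vg0/lv0 is a
placeholder name, not taken from the original report):

# stripe count (#Str) and stripe size per segment of the logical volume
lvs --segments -o +stripesize vg0/lv0

# the same information in long form, including the per-device mapping
lvdisplay -m /dev/vg0/lv0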




Re: Odd block-count behaviour

2010-06-17 Thread Chris Mason
On Wed, Jun 16, 2010 at 01:47:30PM +0100, Hugo Mills wrote:
Hi,
 
I've just been copying large quantities of data from one btrfs
 volume to another, and while watching the progress of the copy, I've
 noticed something odd.
 
 $ mv source-dir dest-dir
 $ watch du -ms source-dir dest-dir
 
This gives me a count of the size of the source and target
 directories, every 2 seconds. As expected, the size of the source dir
 stays constant, and the size of the destination directory increases.
 Except when it doesn't.
 
Occasionally, while copying, the size of the dest-dir *drops* by
 several (tens of) megabytes. I'm not too worried about this, as it all
 seems to be copying the data OK, but it just seems a bit odd, and I was
 wondering if there was a sane explanation for this behaviour.

If the files are small, they can be packed into the metadata btree.  But
this doesn't happen until the file is actually written.

So we start with a worst-case estimate of the number of blocks the file
will consume (4k), and then when it is actually written we update the
metadata to reflect the number of blocks it is actually using (maybe 1
or 2).

You can see this with a test:

mkdir testdir
cd testdir
# create a single small (512 byte) file; until it is written out, space
# is accounted with the worst-case 4k-per-file estimate
dd if=/dev/zero of=foo bs=512 count=1
du -k .
# force the write; the small file is packed into the metadata btree and
# the accounting is updated to the blocks actually used
sync
du -k .
-chris



Re: Mysteriously changing RAID levels

2010-06-17 Thread Chris Mason
On Wed, Jun 16, 2010 at 08:12:03AM -0600, cwillu wrote:
 I just noticed today that btrfs fi df / no longer reports any raid
 level on Metadata or Data.  I know as of Jun 4 that Data was RAID1 and
 Metadata was DUP (I had posted my df output to irc).  I've already
 checked that I didn't revert to an old version of btrfs-progs, and am
 at a bit of a loss to explain it.
 
 I'm currently running 2.6.35rc1.
 
 Any ideas what could have caused this?

Thanks for pointing this out, the ioctl needs to be updated.  I'll get
it fixed up.

-chris


Re: Confused by performance

2010-06-17 Thread Chris Mason
On Wed, Jun 16, 2010 at 11:08:48AM -0700, K. Richard Pixley wrote:
 Once again I'm stumped by some performance numbers and hoping for
 some insight.
 
 Using an 8-core server, building in parallel, I'm building some
 code.  Using ext2 over a 5-way (5 disk) lvm partition, I can build
 that code in 35 minutes.  Tests with dd on the raw disk and lvm
 partitions show me that I'm getting near-linear improvement from the
 raw stripe, even with dd runs exceeding 10G, which convinces me that
 my disks and controller subsystem are capable of operating in
 parallel and in concert.  hdparm -t numbers seem to support what I'm
 seeing from dd.
 
 Running the same build, same parallelism, over a btrfs (defaults)
 partition on a single drive, I'm seeing very consistent build times
 around an hour, which is reasonable.  I get a little under an hour
 on ext4 single disk, again, very consistently.
 
 However, if I build a btrfs file system across the 5 disks, my build
 times increase to around 1.5 - 2 hrs, although there's about a 30 min
 variation between different runs.
 
 If I build a btrfs file system across the 5-way lvm stripe, I get
 even worse performance at around 2.5hrs per build, with about a
 45min variation between runs.
 
 I can't explain these last two results.  Any theories?

I suspect they come down to different raid levels done by btrfs, and
maybe barriers.

By default btrfs will duplicate metadata, so ext2 is doing much less
metadata IO than btrfs does.

Try mkfs.btrfs -m raid0 -d raid0 /dev/xxx /dev/xxy ...

Then try mount -o nobarrier /dev/xxx /mnt

Someone else mentioned blktrace; it would help explain things if you're
interested in tracking this down.
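
For reference, a minimal capture might look something like the following
(the device name and duration are placeholders, not values from this
thread; trace one of the btrfs member disks while the slow build runs):

# capture 60 seconds of block-layer events from one member disk
blktrace -d /dev/sdb -w 60 -o build-trace

# turn the per-CPU trace files into a readable event log
blkparse -i build-trace > build-trace.txt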

-chris



Re: Still ENOSPC problems with 2.6.35-rc3

2010-06-17 Thread Sander
Yan, Zheng  wrote (ao):
 what will happen if you keep deleting files using 2.6.35?

From the list: Things you don't want your fs developer to say ;-)

PS: I am a very happy btrfs user on several systems (including
ARM-based OpenRD-Client and SheevaPlug machines, and large 64-bit
servers), so no flame intended ;-)

Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net


Re: Still ENOSPC problems with 2.6.35-rc3

2010-06-17 Thread Johannes Hirte
On Thursday, 17 June 2010, 02:47:07, Yan, Zheng wrote:
 On Thu, Jun 17, 2010 at 7:56 AM, Johannes Hirte
 johannes.hi...@fem.tu-ilmenau.de wrote:
  On Thursday, 17 June 2010, 01:12:54, Yan, Zheng wrote:
  On Thu, Jun 17, 2010 at 1:48 AM, Johannes Hirte
  johannes.hi...@fem.tu-ilmenau.de wrote:
   With kernel 2.6.34 I ran into the ENOSPC problems that were reported
   on this list recently. The filesystem was somewhat over 90% full and
   most operations on it caused an Oops. I was able to delete files by
   trial and error and freed up half of the filesystem space. Operations
   on the other files still caused an Oops.
   
   Some patches that address this problem went into 2.6.35. Sadly they
   don't fix it but only avoid the Oops. A simple 'ls' on
   this filesystem results in
  
  To avoid the ENOSPC oops, btrfs in 2.6.35 reserves more metadata space
  for system use than older btrfs did. If the FS has already run out of
  metadata space, using btrfs in 2.6.35 doesn't help.
  
  Yan, Zheng
  
  So how can this be fixed/avoided? There must be some free metadata space,
  since I was able to delete files, more than 20Gig, mostly small files.
  Also, from my understanding, when freeing space by deleting files,
  metadata space should be freed. Or am I getting something wrong here?
  2.6.35 does change something, since I can delete more files where 2.6.34
  would Oops. But you're right, it doesn't help at all. So, where is this
  space and why can't it be used?
 
 what will happen if you keep deleting files using 2.6.35?

With 2.6.35 I'm able to continue deleting files, even those where 2.6.34 would 
Oops. It's slow and gives me many of these warnings in dmesg, but they get 
deleted. I didn't try it to the end, but I can do so if you want. I've saved 
the affected filesystem to a separate partition, so I can test with it.

regards,
  Johannes


Re: [PATCH][RFC] Complex filesystem operations: split and join

2010-06-17 Thread Hubert Kario
On Tuesday 15 June 2010 17:16:06 David Pottage wrote:
 On 15/06/10 11:41, Nikanth Karthikesan wrote:
  I had a one-off use-case, where I had no free-space, which made me
  think along this line.
  
  1. We have the GNU split tool, for example, which I guess many of us
  use to split larger files to be transferred via smaller thumb drives.
  We cat the many files back into one afterwards. [For this use case, one
  can simply dd with seek and skip and avoid split/cat completely, but we
  don't.]
 
 I am not sure how you gain here, as either way you have to do I/O to get
 the split files on and off the thumb drive. It might make sense if the
 thumb drive is formatted with btrfs, and the file needs to be copied to
 another filesystem that can't handle large files (e.g. FAT-16), but I
 would say that is unlikely.
 

But you only have to do half as much I/O with those features 
implemented.

The old way is:
1. Have a file
2. split the file (in effect using twice as much drive space)
3. copy fragments to flash disks

The btrfs way would be:
1. Have a file
2. split the file by using COW and referencing blocks in the original file (in 
effect using only a little more space after splitting)
3. copy fragments to flash disks

The amount of I/O in the second case is limited to metadata operations; 
in the first case, all the data must be duplicated.
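
As a rough illustration from the shell (the file name and sizes are made
up; whole-file reflink copies are the closest thing available today, and
range-granular sharing is exactly what the proposed split operation would
add):

# classic split: every byte is written a second time on the source fs
split -b 2G bigfile part.        # produces part.aa, part.ab, ~4GiB extra

# COW-style sharing: data blocks are shared, only metadata is written
# (whole-file only; a range-level split needs clone-range support)
cp --reflink=always bigfile bigfile.clone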

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with ISO 9001:2000


xfstests: Failed 4 of 36 tests

2010-06-17 Thread Brad Figg

I successfully ran check from xfstests on a 2.6.35 based system. I've
pastebin'd the test output as well as my dmesg.

Note, there is a backtrace in the dmesg.

output from check: http://pastebin.ubuntu.com/451219/
dmesg: http://pastebin.ubuntu.com/451222/

Feel free to let me know what other information I should have provided.
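
For anyone who wants to reproduce this, it was the standard xfstests check
script; a typical setup looks something like the following (device paths
and mount points are placeholders, not the ones actually used here):

# xfstests environment (placeholder devices and mount points)
export FSTYP=btrfs
export TEST_DEV=/dev/sdb1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdb2
export SCRATCH_MNT=/mnt/scratch

# depending on the xfstests version, TEST_DEV may also need to be
# mounted at TEST_DIR before running
mkfs.btrfs $TEST_DEV
./check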

Brad

P.S. In the future, is this the way the mailing list would like logs handled?
--
Brad Figg brad.f...@canonical.com http://www.canonical.com


Re: btrfs: hanging processes - race condition?

2010-06-17 Thread Shaohua Li
On Thu, Jun 17, 2010 at 09:41:18AM +0800, Shaohua Li wrote:
 On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote:
  On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
   On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li shaohua...@intel.com wrote:
 On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
 List,

 I ran into a hang issue (race condition: cpu is high when the server is
 idle, meaning that btrfs is hanging, and IOwait is high as well) running
 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
 The btrfs filesystem lives on 18x300GB scsi spindles, configured as
 Raid-0, as shown below:

 Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
          Total devices 18 FS bytes used 2.94TB
          devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
          devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
          devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
          devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
          devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
          devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
          devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
          devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
          devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
          devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
          devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
          devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
          devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
          devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
          devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
          devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
          devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
          devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
 Btrfs v0.19-16-g075587c-dirty

 The filesystem, mounted in /mnt/btrfs, is hanging: no existing or new
 process can access it, however 'df' still displays the disk usage
 (3TB out of 5). The disks appear to be physically healthy. Please note
 that a significant number of files were placed on this filesystem,
 between 20 and 30 million files.

 The relevant kernel messages are displayed below:

 INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
 echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 btrfs-submit- D 00010042e12f     0  4220      2 0x
   8803e584ac70 0046 4000 00011680
   8803f7349fd8 8803f7349fd8 8803e584ac70 00011680
   0001 8803ff99d250 8149f020 81150ab0
 Call Trace:
   [813089f3] ? io_schedule+0x71/0xb1
   [811470be] ? get_request_wait+0xab/0x140
   [810406f4] ? autoremove_wake_function+0x0/0x2e
   [81143a4d] ? elv_rq_merge_ok+0x89/0x97
   [8114a245] ? blk_recount_segments+0x17/0x27
   [81147429] ? __make_request+0x2d6/0x3fc
   [81145b16] ? generic_make_request+0x207/0x268
   [81145c12] ? submit_bio+0x9b/0xa2
   [a01aa081] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
   [a01a5365] ? run_scheduled_bios+0x297/0x48f [btrfs]
   [a01aa687] ? worker_loop+0x17c/0x452 [btrfs]
   [a01aa50b] ? worker_loop+0x0/0x452 [btrfs]
   [81040331] ? kthread+0x79/0x81
   [81003674] ? kernel_thread_helper+0x4/0x10
   [810402b8] ? kthread+0x0/0x81
   [81003670] ? kernel_thread_helper+0x0/0x10
 This looks like the issue we saw too: http://lkml.org/lkml/2010/6/8/375.
 This is reproducible in our setup.

I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
The code in the first do-while loop in btrfs_commit_transaction
sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
btrfs_run_ordered_operations(). All of these functions may call
cond_resched().
   Hi,
   When I test random writes, I see a lot of threads jump into
   btree_writepages() and do nothing, and IO throughput is zero for some
   time. It looks like there is a livelock. See the code of
   btree_writepages():
	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)