Re: interesting use case for multiple devices and delayed raid?
On Wed, Apr 1, 2009 at 8:17 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:

> I have a use case that I wonder if anyone might find interesting involving multiple device support and delayed raid. Let's say I have a system with two disks of equal size (to make it easy) which has sporadic, heavy write requirements. At some points in time there will be multiple files being appended to simultaneously and at other times there will be no activity at all. The write activity is time sensitive, however, so the filesystem must be able to provide guaranteed (only in a loose sense -- not looking for real QoS reservation semantics) bandwidth at times. Let's say slightly (but within the realm of reality) less than the bandwidth of the two disks combined.

I assume you mean read bandwidth, since write bandwidth cannot be increased by mirroring, only by striping. If you intend to stripe first, then mirror later as time permits, this is the kind of sophistication you will need to write into the program code itself. A filesystem is a handy abstraction, but you are by no means limited to using it. If you have very special needs, you can get pretty far by writing your own meta-filesystem to add semantics you don't have in your kernel filesystem of choice. That's what every single database application does. You can get even further by writing a complete user-space filesystem as part of your program, or a shared daemon, and the performance isn't really that bad.

> I also want both the metadata and file data mirrored between the two disks so that I can afford to lose one of the disks and not lose (most of) my data. It is not a strict requirement that all data be immediately mirrored, however.

This is handled by DragonFly BSD's HAMMER filesystem. A master gets written to and asynchronously updates a slave, even over a network. It is transactionally consistent and virtually impossible to corrupt as long as the disk media is stable.
However, as far as I know it won't spread reads, so you'll still get the performance of one disk.

A more complete solution, that requires no software changes, would be to have 3 or 4 disks: a stripe for really fast reads and writes, and another disk (or another stripe) to act as a slave to the data being written to the primary stripe. This seems to do what you want, at a small price premium.

-- 
Dmitri Nikulin
Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs Conf Call
Hello everyone,

There will be a btrfs conference call Wed April 1. Topics will include benchmarking and our current pending work.

Time: 1:30pm US Eastern (10:30am Pacific)

* Dial-in Number(s):
  * Toll Free: +1-888-967-2253
  * Toll: +1-650-607-2253
* Meeting id: 665734
* Passcode: 428737 (which hopefully spells 4Btrfs)

-chris
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> Hi Chris.
> I noticed performance of fsync() and write() with the O_SYNC flag on Btrfs is very slow compared to ext3/4. I used blktrace to try to investigate the cause of this. One cause is that the unplug is done by kblockd even if the I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug timeout is 3msec, so unplugging via kblockd can delay I/O completion. To increase fsync/osync write performance, the unplug should be sped up here.

I realized today that all of the async thread handling btrfs does for writes gives us plenty of time to queue up IO for the block device. If that's true, we can just unplug the block device in the async helper thread and get pretty good coverage for the problem you're describing.

Could you please try the patch below and see if it performs well? I did some O_DIRECT testing on a 5 drive array, and tput jumped from 386MB/s to 450MB/s for large writes.

Thanks again for digging through this problem.

-chris

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd06e18..bf377ab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long num_run = 0;
 	unsigned long limit;
 
-	bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
+	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
 	limit = btrfs_async_submit_limit(fs_info);
 	limit = limit * 2 / 3;
@@ -231,6 +231,19 @@ loop_lock:
 	if (device->pending_bios)
 		goto loop_lock;
 	spin_unlock(&device->io_lock);
+
+	/*
+	 * IO has already been through a long path to get here. Checksumming,
+	 * async helper threads, perhaps compression. We've done a pretty
+	 * good job of collecting a batch of IO and should just unplug
+	 * the device right away.
+	 *
+	 * This will help anyone who is waiting on the IO, they might have
+	 * already unplugged, but managed to do so before the bio they
+	 * cared about found its way down here.
+	 */
+	if (bdi->unplug_io_fn)
+		bdi->unplug_io_fn(bdi, NULL);
 done:
 	return 0;
 }
Re: Crash in 2.6.29-rc8
On Tue, 2009-03-31 at 13:26 +0200, Tom van Klinken / ISP Services BV wrote:
> Hi Chris,
>
> Chris Mason wrote:
>> On Tue, 2009-03-24 at 15:41 +0100, Tom van Klinken / ISP Services BV wrote: CUT
>> This is a metadata enospc oops. You actually had about 400MB free but it was pinned down and waiting for a commit to free it all.
>
> Today I had a similar issue. See attached kernel trace. I'm quite sure the filesystem is not full (I have around 15GB of free space). Is there anything I can test/do?
>
> plain text document attachment (btrfc-trace2.txt)
> Mar 31 00:20:07 db03b btrfs searching for 4096 bytes, num_bytes 4096, loop 2, allowed_alloc 0
> Mar 31 00:20:07 db03b btrfs allocation failed flags 36, wanted 4096
> Mar 31 00:20:07 db03b space_info has 204537856 free, is full

Well, the problem is that you've filled up the metadata areas of your FS. So, even though the FS reads 15GB free, the space available for metadata is full. I'm working on some patches to help.

-chris
Re: interesting use case for multiple devices and delayed raid?
On Wed, 01 Apr 2009 21:13:19 +1100, Dmitri Nikulin wrote:

> I assume you mean read bandwidth, since write bandwidth cannot be increased by mirroring, only striping.

No, I mean write bandwidth. You can get increased write bandwidth with RAID 0 if you only write to one side of the mirror (initially), effectively striping. You would update the other half of the mirror lazily (iow, delayed) when the filesystem has idle bandwidth. One of the stipulations was that the use pattern is peaks and valleys, not sustained usage. Yes, you would lose the data that was written to a failed mirror before the filesystem got a chance to do the lazy mirror updating later on. That was a stipulation in my original requirements too.

> If you intend to stripe first, then mirror later as time permits,

Yeah, that's one way to describe it.

> this is the kind of sophistication you will need to write in the program code itself.

Why? A filesystem that already does its own mirroring and striping (as I understand btrfs does) should be able to handle this itself. Much better in the filesystem than for each application to have to handle it itself.

> A filesystem is a handy abstraction, but you are by no means limited to using it. If you have very special needs, you can get pretty far by writing your own meta-filesystem to add semantics you don't have in your kernel filesystem of choice.

Of course. But I am floating this idea as a feature of btrfs given that it already has many of the components needed.

> This is handled by DragonFly BSD's HAMMER filesystem. A master gets written to, and asynchronously updates a slave, even over a network. It is transactionally consistent and virtually impossible to corrupt as long as the disk media is stable. However as far as I know it won't spread reads, so you'll still get the performance of one disk.

More importantly, it won't spread writes.
> A more complete solution, that requires no software changes, would be to have 3 or 4 disks. A stripe for really fast reads and writes, and another disk (or another stripe) to act as a slave to the data being written to the primary stripe. This seems to do what you want, at a small price premium.

No, that's not really what I am describing at all. I apologize if my original description was unclear; hopefully it is clearer now.

b.
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
At 20:27 09/03/31, Chris Mason wrote:
> On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> Hi Chris.
>> I noticed performance of fsync() and write() with the O_SYNC flag on Btrfs is very slow compared to ext3/4. I used blktrace to try to investigate the cause of this. One cause is that the unplug is done by kblockd even if the I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug timeout is 3msec, so unplugging via kblockd can delay I/O completion. To increase fsync/osync write performance, the unplug should be sped up here.
>> Btrfs's write I/O is issued via a kernel thread, not via the user application context that calls fsync(). While waiting for page writeback, wait_on_page_writeback() sometimes cannot unplug I/O on Btrfs because submit_bio is not called from the user application context, so when submit_bio is called from the kernel thread, wait_on_page_writeback() sleeps in io_schedule().
>
> This is exactly right, and one of the uglier side effects of the async helper kernel threads. I've been thinking for a while about a clean way to fix it.
>
>> I introduced btrfs_wait_on_page_writeback() in the following patch; this is a replacement for wait_on_page_writeback() on Btrfs. It does an unplug every tick while waiting for page writeback.
>> I did a performance test using sysbench:
>>
>> # sysbench --num-threads=4 --max-requests=1 --test=fileio --file-num=1 --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr --file-fsync-freq=5 run
>>
>> The result was:
>>
>> -2.6.29
>> Test execution summary:
>>     total time: 628.1047s
>>     total number of events: 1
>>     total time taken by event execution: 413.0834
>> per-request statistics:
>>     min: 0.s
>>     avg: 0.0413s
>>     max: 1.9075s
>>     approx. 95 percentile: 0.3712s
>> Threads fairness:
>>     events (avg/stddev): 2500./29.21
>>     execution time (avg/stddev): 103.2708/4.04
>>
>> -2.6.29-patched
>> Test execution summary:
>>     total time: 579.8049s
>>     total number of events: 10004
>>     total time taken by event execution: 355.3098
>> per-request statistics:
>>     min: 0.s
>>     avg: 0.0355s
>>     max: 1.7670s
>>     approx. 95 percentile: 0.3154s
>> Threads fairness:
>>     events (avg/stddev): 2501./8.03
>>     execution time (avg/stddev): 88.8274/1.94
>>
>> This patch has some effect as a performance improvement. I think there are other reasons, which should be fixed, why fsync() or write() with the O_SYNC flag is slow on Btrfs.
>
> Very nice. Could I trouble you to try one more experiment? The other way to fix this is to use WRITE_SYNC instead of WRITE. Could you please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark that? It doesn't cover as many cases as your patch, but it might have a lower overall impact.

Hi.
I wrote a hardcoded WRITE_SYNC patch for the btrfs submit_bio paths as shown below, and I did the sysbench test. Later, I will try your unplug patch.

diff -Nrup linux-2.6.29.org/fs/btrfs/disk-io.c linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c
--- linux-2.6.29.org/fs/btrfs/disk-io.c	2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c	2009-04-01 16:26:56.0 +0900
@@ -2068,7 +2068,7 @@ static int write_dev_supers(struct btrfs
 	}
 	if (i == last_barrier && do_barriers && device->barriers) {
-		ret = submit_bh(WRITE_BARRIER, bh);
+		ret = submit_bh(WRITE_BARRIER|WRITE_SYNC, bh);
 		if (ret == -EOPNOTSUPP) {
 			printk("btrfs: disabling barriers on dev %s\n",
 			       device->name);
@@ -2076,10 +2076,10 @@ static int write_dev_supers(struct btrfs
 			device->barriers = 0;
 			get_bh(bh);
 			lock_buffer(bh);
-			ret = submit_bh(WRITE, bh);
+			ret = submit_bh(WRITE_SYNC, bh);
 		}
 	} else {
-		ret = submit_bh(WRITE, bh);
+		ret = submit_bh(WRITE_SYNC, bh);
 	}
 	if (!ret && wait) {
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c	2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c	2009-04-01 14:48:08.0 +0900
@@ -1851,8 +1851,11 @@ static int submit_one_bio(int rw, struct
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
Re: interesting use case for multiple devices and delayed raid?
On Thu, Apr 2, 2009 at 8:04 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:
>> A more complete solution, that requires no software changes, would be to have 3 or 4 disks. A stripe for really fast reads and writes, and another disk (or another stripe) to act as a slave to the data being written to the primary stripe. This seems to do what you want, at a small price premium.
>
> No. That's not really what I am describing at all.

Well, you get the bandwidth of 2 disks when reading and writing, and it's still mirrored to a second stripe as time permits. Kind of like a delayed RAID 10.

> I apologize if my original description was unclear. Hopefully it is more so now.

Yes. It'll be up to the actual filesystem devs to weigh in on whether it's worth implementing.

-- 
Dmitri Nikulin
Centre for Synchrotron Science
Monash University
Victoria 3800, Australia