Re: interesting use case for multiple devices and delayed raid?

2009-04-01 Thread Dmitri Nikulin
On Wed, Apr 1, 2009 at 8:17 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:
 I have a use case that I wonder if anyone might find interesting
 involving multiple device support and delayed raid.

 Let's say I have a system with two disks of equal size (to make it easy)
 which has sporadic, heavy, write requirements.  At some points in time
 there will be multiple files being appended to simultaneously and at
 other times, there will be no activity at all.

 The write activity is time sensitive, however, so the filesystem must be
 able to provide guaranteed (only in a loose sense -- not looking for
 real QoS reservation semantics) bandwidths at times.  Let's say slightly
 (but within the realm of reality) less than the bandwidth of the two
 disks combined.

I assume you mean read bandwidth, since write bandwidth cannot be
increased by mirroring, only by striping. If you intend to stripe first
and then mirror later as time permits, that is the kind of sophistication
you will need to build into the program code itself.

A filesystem is a handy abstraction, but you are by no means limited
to using it. If you have very special needs, you can get pretty far by
writing your own meta-filesystem to add semantics you don't have in
your kernel filesystem of choice. That's what every single database
application does. You can get even further by writing a complete
user-space filesystem as part of your program, or as a shared daemon, and
the performance isn't really that bad.
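To make that concrete, here is a minimal user-space sketch of the idea: write a
burst to one disk immediately and copy it to the second disk during a quiet
period. It is hypothetical (the paths, names and sizes are invented) and leaves
out the error handling, locking and persistence a real tool would need:

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct pending {
	off_t off;
	size_t len;
	struct pending *next;
};

/* Regions already written to the primary but not yet copied to the mirror. */
static struct pending *backlog;

/* Fast path: write to the primary copy only and remember what needs mirroring. */
static ssize_t fast_write(int primary_fd, const void *buf, size_t len, off_t off)
{
	ssize_t n = pwrite(primary_fd, buf, len, off);

	if (n > 0) {
		struct pending *p = malloc(sizeof(*p));
		if (p) {		/* if this fails we simply lose the lazy copy */
			p->off = off;
			p->len = (size_t)n;
			p->next = backlog;
			backlog = p;
		}
	}
	return n;
}

/* Idle path: replay the backlog from the primary onto the mirror. */
static void lazy_mirror(int primary_fd, int mirror_fd)
{
	char buf[65536];

	while (backlog) {
		struct pending *p = backlog;
		size_t done = 0;

		while (done < p->len) {
			size_t chunk = p->len - done;
			ssize_t n;

			if (chunk > sizeof(buf))
				chunk = sizeof(buf);
			n = pread(primary_fd, buf, chunk, p->off + done);
			if (n <= 0)
				break;	/* real code would handle errors */
			pwrite(mirror_fd, buf, (size_t)n, p->off + done);
			done += (size_t)n;
		}
		backlog = p->next;
		free(p);
	}
	fsync(mirror_fd);
}

int main(void)
{
	/* /disk1 and /disk2 stand in for mount points of the two disks. */
	int primary = open("/disk1/data.log", O_RDWR | O_CREAT, 0644);
	int mirror  = open("/disk2/data.log", O_WRONLY | O_CREAT, 0644);
	static const char msg[] = "burst of time-sensitive data\n";

	if (primary < 0 || mirror < 0)
		return 1;
	fast_write(primary, msg, sizeof(msg) - 1, 0);	/* during the burst */
	lazy_mirror(primary, mirror);			/* during a quiet period */
	close(primary);
	close(mirror);
	return 0;
}

In real use the backlog would have to be persisted, or rebuilt from a journal,
so that a crash before the lazy copy doesn't silently leave regions unmirrored.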

 I also want both the metadata and file data mirrored between the two
 disks so that I can afford to lose one of the disks and not lose (most
 of) my data.  It is not a strict requirement that all data be
 immediately mirrored however.

This is handled by DragonFly BSD's HAMMER filesystem. A master gets
written to, and asynchronously updates a slave, even over a network.
It is transactionally consistent and virtually impossible to corrupt
as long as the disk media is stable. However, as far as I know, it won't
spread reads, so you'll still get the performance of one disk.

A more complete solution, one that requires no software changes, would be
to have 3 or 4 disks: a stripe for really fast reads and writes, and
another disk (or another stripe) acting as a slave to the data being
written to the primary stripe. This seems to do what you want, at a
small price premium.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Btrfs Conf Call

2009-04-01 Thread Chris Mason
Hello everyone,

There will be a btrfs conference call Wed April 1.  Topics will
include benchmarking and our current pending work.

Time: 1:30pm US Eastern (10:30am Pacific)

* Dial-in Number(s):
* Toll Free: +1-888-967-2253
* Toll: +1-650-607-2253
* Meeting id: 665734
* Passcode: 428737 (which hopefully spells 4Btrfs)

-chris





Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance

2009-04-01 Thread Chris Mason
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
 Hi Chris.
 
 I noticed that the performance of fsync() and write() with the O_SYNC flag
 on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
 the cause of this. One cause is that the unplug is done by kblockd even if
 the I/O is issued through fsync() or write() with the O_SYNC flag.
 kblockd's unplug timeout is 3 msec, so unplugging via kblockd can delay
 I/O completion. To improve fsync/osync write performance, the unplug
 needs to happen sooner here.
 

I realized today that all of the async thread handling btrfs does for
writes gives us plenty of time to queue up IO for the block device.  If
that's true, we can just unplug the block device in the async helper
thread and get pretty good coverage for the problem you're describing.

Could you please try the patch below and see if it performs well?  I did
some O_DIRECT testing on a 5-drive array, and throughput jumped from
386MB/s to 450MB/s for large writes.

Thanks again for digging through this problem.

-chris

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd06e18..bf377ab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long num_run = 0;
 	unsigned long limit;
 
-	bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
+	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
 	limit = btrfs_async_submit_limit(fs_info);
 	limit = limit * 2 / 3;
@@ -231,6 +231,19 @@ loop_lock:
 	if (device->pending_bios)
 		goto loop_lock;
 	spin_unlock(&device->io_lock);
+
+	/*
+	 * IO has already been through a long path to get here.  Checksumming,
+	 * async helper threads, perhaps compression.  We've done a pretty
+	 * good job of collecting a batch of IO and should just unplug
+	 * the device right away.
+	 *
+	 * This will help anyone who is waiting on the IO, they might have
+	 * already unplugged, but managed to do so before the bio they
+	 * cared about found its way down here.
+	 */
+	if (bdi->unplug_io_fn)
+		bdi->unplug_io_fn(bdi, NULL);
 done:
 	return 0;
 }




Re: Crash in 2.6.29-rc8

2009-04-01 Thread Chris Mason
On Tue, 2009-03-31 at 13:26 +0200, Tom van Klinken / ISP Services BV
wrote:
 Hi Chris,
 
 Chris Mason wrote:
  On Tue, 2009-03-24 at 15:41 +0100, Tom van Klinken / ISP Services BV
  wrote:
  CUT
  
  This is a metadata enospc oops.  You actually had about 400MB free but
  it was pinned down and waiting for a commit to free it all.
 
 Today I had a similar issue. See attached kernel trace.
 
 I'm quite sure the filesystem is not full (I have around 15GB of free 
 space).
 
 Is there anything I can test/do?
 
 plain text document attachment (btrfc-trace2.txt)
 Mar 31 00:20:07 db03b btrfs searching for 4096 bytes, num_bytes 4096, loop 2, allowed_alloc 0
 Mar 31 00:20:07 db03b btrfs allocation failed flags 36, wanted 4096
 Mar 31 00:20:07 db03b space_info has 204537856 free, is full

Well, the problem is that you've filled up the metadata areas of your
FS.  So, even though the FS reports 15GB free, the space available for
metadata is full.

I'm working on some patches to help.

-chris




Re: interesting use case for multiple devices and delayed raid?

2009-04-01 Thread Brian J. Murrell
On Wed, 2009-04-01 at 21:13 +1100, Dmitri Nikulin wrote:
 
 I assume you mean read bandwidth, since write bandwidth cannot be
 increased by mirroring, only striping.

No, I mean write bandwidth.  You can get increased write bandwidth, as
with RAID 0, if you (initially) write to only one side of the mirror --
effectively striping.  You would update the other half of the mirror
lazily (in other words, delayed) when the filesystem has idle bandwidth.
One of the stipulations was that the use pattern is peaks and valleys,
not sustained usage.

Yes, you would lose the data that was written to a failed mirror before
the filesystem got a chance to do the lazy mirror updating later on.
That was a stipulation in my original requirements too.
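
For what it's worth, the bookkeeping this needs is similar in spirit to md's
write-intent bitmap: mark fixed-size chunks dirty on the fast write path and
clear them as an idle resync copies them to the second device. The sketch
below is purely illustrative (in-memory "devices", made-up chunk size, no
locking), not a proposal for btrfs internals:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEV_SIZE   (1 << 20)		/* 1 MiB toy devices */
#define CHUNK_SIZE 4096
#define NCHUNKS    (DEV_SIZE / CHUNK_SIZE)

static uint8_t primary[DEV_SIZE], mirror[DEV_SIZE];
static uint8_t dirty[NCHUNKS];		/* 1 = written to primary only */

/* Fast path: touch only the primary, mark the affected chunks dirty.
 * Assumes len > 0 and off + len <= DEV_SIZE. */
static void fast_write(uint64_t off, const void *buf, size_t len)
{
	memcpy(primary + off, buf, len);
	for (uint64_t c = off / CHUNK_SIZE; c <= (off + len - 1) / CHUNK_SIZE; c++)
		dirty[c] = 1;
}

/* Idle path: bring the mirror up to date chunk by chunk. */
static void idle_resync(void)
{
	for (uint64_t c = 0; c < NCHUNKS; c++) {
		if (!dirty[c])
			continue;
		memcpy(mirror + c * CHUNK_SIZE, primary + c * CHUNK_SIZE, CHUNK_SIZE);
		dirty[c] = 0;
	}
}

int main(void)
{
	fast_write(8192, "burst", 5);	/* during the busy period */
	idle_resync();			/* later, when idle */
	printf("mirror now matches: %d\n", !memcmp(primary, mirror, DEV_SIZE));
	return 0;
}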

 If you intend to stripe first,
 then mirror later as time permits,

Yeah, that's one way to describe it.

 this is the kind of sophistication
 you will need to write in the program code itself.

Why?  A filesystem that already does its own mirroring and striping (as I
understand btrfs does) should be able to handle this itself.  It is much
better done in the filesystem than having every application handle it
itself.

 A filesystem is a handy abstraction, but you are by no means limited
 to using it. If you have very special needs, you can get pretty far by
 writing your own meta-filesystem to add semantics you don't have in
 your kernel filesystem of choice.

Of course.  But I am floating this idea as a feature of btrfs given that
it already has many of the components needed.

 This is handled by DragonFly BSD's HAMMER filesystem. A master gets
 written to, and asynchronously updates a slave, even over a network.
 It is transactionally consistent and virtually impossible to corrupt
 as long as the disk media is stable. However as far as I know it won't
 spread reads, so you'll still get the performance of one disk.

More importantly, it won't spread writes.

 A more complete solution, that requires no software changes, would be
 to have 3 or 4 disks. A stripe for really fast reads and writes, and
 another disk (or another stripe) to act as a slave to the data being
 written to the primary stripe. This seems to do what you want, at a
 small price premium.

No.  That's not really what I am describing at all.

I apologize if my original description was unclear.  Hopefully it is
more so now.

b.




Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance

2009-04-01 Thread Hisashi Hifumi

At 20:27 09/03/31, Chris Mason wrote:
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
 Hi Chris.
 
 I noticed that the performance of fsync() and write() with the O_SYNC flag
 on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
 the cause of this. One cause is that the unplug is done by kblockd even if
 the I/O is issued through fsync() or write() with the O_SYNC flag.
 kblockd's unplug timeout is 3 msec, so unplugging via kblockd can delay
 I/O completion. To improve fsync/osync write performance, the unplug
 needs to happen sooner here.
 

 Btrfs's write I/O is issued via a kernel thread, not via the user
 application context that calls fsync(). While waiting for page writeback,
 wait_on_page_writeback() sometimes cannot unplug the I/O on Btrfs, because
 submit_bio is not called from the user application context; when
 submit_bio is called from a kernel thread, wait_on_page_writeback() just
 sleeps in io_schedule().
 

This is exactly right, and one of the uglier side effects of the async
helper kernel threads.  I've been thinking for a while about a clean way
to fix it.

 I introduced btrfs_wait_on_page_writeback() in the following patch as a
 replacement for wait_on_page_writeback() on Btrfs. It unplugs once per
 tick while waiting for page writeback.
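
A minimal sketch of the kind of helper being described, assuming the
2.6.29-era unplug_io_fn interface (this is hypothetical, not the patch that
was actually posted), might look like:

#include <linux/backing-dev.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/*
 * Hypothetical sketch, not the posted patch: wait for writeback to clear,
 * but kick the backing device's unplug function once per tick so queued
 * IO gets dispatched instead of waiting for kblockd's ~3 msec unplug timer.
 * Assumes page->mapping stays valid while we wait.
 */
static void btrfs_wait_on_page_writeback(struct page *page)
{
	struct backing_dev_info *bdi;

	while (PageWriteback(page)) {
		bdi = page->mapping->backing_dev_info;
		if (bdi && bdi->unplug_io_fn)
			bdi->unplug_io_fn(bdi, page);
		/* sleep one tick, then re-check the writeback bit */
		set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(1);
	}
	__set_current_state(TASK_RUNNING);
}

The point is simply that every wait iteration actively kicks the device
queue instead of relying on kblockd's timer.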
 
 I did a performance test using sysbench.
 
 # sysbench --num-threads=4 --max-requests=1  --test=fileio --file-num=1 
 --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr 
 --file-fsync-freq=5  run
 
 The result was:
 -2.6.29
 
 Test execution summary:
 total time:  628.1047s
 total number of events:  1
 total time taken by event execution: 413.0834
 per-request statistics:
  min:0.s
  avg:0.0413s
  max:1.9075s
  approx.  95 percentile: 0.3712s
 
 Threads fairness:
 events (avg/stddev):   2500./29.21
 execution time (avg/stddev):   103.2708/4.04
 
 
 -2.6.29-patched
 
 Test execution summary:
 total time:  579.8049s
 total number of events:  10004
 total time taken by event execution: 355.3098
 per-request statistics:
  min:0.s
  avg:0.0355s
  max:1.7670s
  approx.  95 percentile: 0.3154s
 
 Threads fairness:
 events (avg/stddev):   2501./8.03
 execution time (avg/stddev):   88.8274/1.94
 
 
 This patch has some effect on improving performance.
 
 I think there are other reasons, which should also be fixed, why fsync()
 or write() with the O_SYNC flag is slow on Btrfs.
 

Very nice.  Could I trouble you to try one more experiment?  The other
way to fix this is to use WRITE_SYNC instead of WRITE.  Could you
please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark
that?

It doesn't cover as many cases as your patch, but it might have a lower
overall impact.


Hi.
I wrote a patch that hardcodes WRITE_SYNC in the btrfs submit_bio paths,
as shown below, and ran the sysbench test.
Later, I will try your unplug patch.

diff -Nrup linux-2.6.29.org/fs/btrfs/disk-io.c linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c
--- linux-2.6.29.org/fs/btrfs/disk-io.c 2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c  2009-04-01 16:26:56.0 +0900
@@ -2068,7 +2068,7 @@ static int write_dev_supers(struct btrfs
 		}
 
 		if (i == last_barrier && do_barriers && device->barriers) {
-			ret = submit_bh(WRITE_BARRIER, bh);
+			ret = submit_bh(WRITE_BARRIER|WRITE_SYNC, bh);
 			if (ret == -EOPNOTSUPP) {
 				printk("btrfs: disabling barriers on dev %s\n",
 				       device->name);
@@ -2076,10 +2076,10 @@ static int write_dev_supers(struct btrfs
 				device->barriers = 0;
 				get_bh(bh);
 				lock_buffer(bh);
-				ret = submit_bh(WRITE, bh);
+				ret = submit_bh(WRITE_SYNC, bh);
 			}
 		} else {
-			ret = submit_bh(WRITE, bh);
+			ret = submit_bh(WRITE_SYNC, bh);
 		}
 
 		if (!ret && wait) {
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c   2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c    2009-04-01 14:48:08.0 +0900
@@ -1851,8 +1851,11 @@ static int submit_one_bio(int rw, struct
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio,

Re: interesting use case for multiple devices and delayed raid?

2009-04-01 Thread Dmitri Nikulin
On Thu, Apr 2, 2009 at 8:04 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:
 A more complete solution, that requires no software changes, would be
 to have 3 or 4 disks. A stripe for really fast reads and writes, and
 another disk (or another stripe) to act as a slave to the data being
 written to the primary stripe. This seems to do what you want, at a
 small price premium.

 No.  That's not really what I am describing at all.

Well, you get the bandwidth of 2 disks when reading and writing, and the
data is still mirrored to a second stripe as time permits. Kind of like a
delayed RAID 10.

 I apologize if my original description was unclear.  Hopefully it is
 more so now.

Yes. It'll be up to the actual filesystem devs to weigh in on whether
it's worth implementing.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia