Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008, David Howells wrote:
> David Howells [EMAIL PROTECTED] wrote:
> > Have you got before/after benchmark results?
>
> See attached. Attached here are results using BTRFS (patched so that it'll
> work at all) rather than Ext3 on the client on the partition backing the
> cache.

Thanks for trying this; of course I'll ask you to try again with the latest v0.13 code, which has a number of optimizations, especially for CPU usage.

> Note that I didn't bother redoing the tests that didn't involve a cache as
> the choice of filesystem backing the cache should have no bearing on the
> result. Generally, completely cold caches shouldn't show much variation as
> all the writing can be done completely asynchronously, provided the client
> doesn't fill its RAM. The interesting case is where the disk cache is warm,
> but the pagecache is cold (ie: just after a reboot after filling the
> caches). Here, for the two big files case, BTRFS appears quite a bit better
> than Ext3, showing a 21% reduction in time for the smaller case and a 13%
> reduction for the larger case.

I'm afraid I don't have a good handle on the filesystem operations that result from this workload. Are we reading from the FS to fill the NFS page cache?

> For the many small/medium files case, BTRFS performed significantly better
> (15% reduction in time) in the case where the caches were completely cold.
> I'm not sure why, though - perhaps because it doesn't execute a
> write_begin() stage during the write_one_page() call and thus doesn't go
> allocating disk blocks to back the data, but instead allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get delayed allocation for anything bigger than 8k by default. Writes <= 8k will get packed into the btree leaves.

> More surprising is that BTRFS performed significantly worse (15% increase
> in time) in the case where the cache on disk was fully populated and then
> the machine had been rebooted to clear the pagecaches.

Which FS operations are included here?
Finding all the files or just an unmount? Btrfs defrags metadata in the background, and unmount has to wait for that defrag to finish.

Thanks again,
Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[ANNOUNCE] Btrfs v0.13
Hello everyone,

Btrfs v0.13 is now available for download from:

http://oss.oracle.com/projects/btrfs/

We took another short break from the multi-device code to make the minor mods required to compile on 2.6.25, fix some problematic bugs and do more tuning.

The most important fix is for file data checksumming errors. These might show up on .o files from compiles or other files where seeky writes were done internally to fill it up. The end result was a bunch of zeros in the file where people expected their data to be. Thanks to Yan Zheng for tracking it down.

GregKH provided most of the 2.6.25 port with some sysfs updates. Since the sysfs files are not used much and Greg has offered additional cleanups, I've disabled the btrfs sysfs interface on kernels older than 2.6.25. This way he won't have to backport any of his changes.

Optimizations and other fixes:

* File data checksumming done in larger chunks, resulting in fewer btree
  searches and fewer kmap calls.
* CPU optimizations for back reference removal.
* CPU optimizations for block allocation, and much more efficient searching
  through the free space cache.
* Allocation optimizations: the free space clustering code was not properly
  allocating from a cluster once it found it. For normal mounts the fix
  improves metadata writeback; for mount -o ssd it improves everything.
* Unaligned access fixes from Dave Miller.
* Btree reads are done in larger bios when possible.
* i_block accounting is fixed.

-chris
Re: [ANNOUNCE] Btrfs v0.13
On Thursday 21 February 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.13 is now available for download from:
> http://oss.oracle.com/projects/btrfs/
>
> We took another short break from the multi-device code to make the minor
> mods required to compile on 2.6.25, fix some problematic bugs and do more
> tuning.

Sorry, I should have added: v0.13 has no disk format changes since v0.12.

-chris
Re: very poor ext3 write performance on big filesystems?
On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
> (...)
> > The following ld_preload can help in some cases. Mutt has this hack
> > encoded in for maildir directories, which helps.
>
> It doesn't work very reliably for me. For some reason, it sometimes hangs
> (doesn't remove any files, rm -rf just stalls), or segfaults.

You can go the low-tech route (assuming your file names don't have spaces in them):

find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm

> As most of the ideas here in this thread assume (re)creating a new
> filesystem from scratch - would perhaps playing with
> /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio help a
> bit?

Probably not. You're seeking between all the inodes on the box, and probably not bound by the memory used.

-chris
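Though the answer above is "probably not", the two knobs are easy to poke at; this is just a read-only look (writing new percentages into the same files requires root and is omitted here):

```shell
# Writeback thresholds as a percentage of memory: background writeback
# kicks in at dirty_background_ratio, and processes dirtying pages are
# throttled once dirty pages reach dirty_ratio.
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio
```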
Re: very poor ext3 write performance on big filesystems?
On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Chris Mason schrieb:
> > You can go the low-tech route (assuming your file names don't have spaces
> > in them):
> >
> > find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm
>
> Why should it make a difference?

It does something similar to Ted's ld preload, sorting the results from readdir by inode number before using them. You will still seek quite a lot between the directory entries, but operations on the files themselves will go in a much more optimal order. It might help.

> Does find find filenames/paths faster than rm -r? Or is find once/remove
> once faster than find files/rm files/find files/rm files/..., which I
> suppose rm -r does?

rm -r removes things in the order that readdir returns. In your hard linked tree (on almost any FS), this will be very random. The sorting is probably the best you can do from userland to optimize the ordering.

-chris
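For what it's worth, a space-tolerant variant of the same trick is possible if you have GNU findutils and coreutils (sort -z and cut -z are GNU extensions); the demo tree created here is only for illustration:

```shell
# Build a throwaway demo tree; point TREE at the real directory instead.
TREE=$(mktemp -d)
touch "$TREE/a file" "$TREE/b"

# Emit NUL-separated "inode<TAB>path" records, sort numerically by inode,
# strip the inode column, and delete in inode order.
find "$TREE" -type f -printf '%i\t%p\0' \
  | sort -z -n \
  | cut -z -f2- \
  | xargs -0 -r rm --
```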
Re: BTRFS only works with PAGE_SIZE = 4K
On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason [EMAIL PROTECTED]
> Date: Wed, 6 Feb 2008 12:00:13 -0500
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs. All of the extent
> stuff assumes that PAGE_SIZE = sectorsize.

Yeah, there is definitely clean up to do in that area.

> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on sparc64
> and I was finally able to successfully mount a partition.

Nice

> With 4K there are zeros in the root tree node header, because its extent's
> location on disk is at a sub-PAGE_SIZE multiple and the extent code doesn't
> handle that. You really need to start validating this stuff on other
> platforms. Something that isn't little endian and something that doesn't
> use 4K pages. I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things working. There has been some churn since then... My first prio is the newest set of disk format changes, and then I'll sit down and work on stability on a bunch of arches.

> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:08, Chris Mason wrote:
> > So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs. (Yes, I had xfs on sparc before,
> so it's not like you NEED the whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, which may not matter for sparc, but I don't really plan on different locations for different arches.

> In x86, there is even more space for a bootloader (some 28k or so) even if
> your partition table is as closely packed as possible, from 0 to 7e00 IIRC.
> For sparc you could have something like
>
>          start lba   end lba   type
>   sda1   0           2         1   Boot
>   sda2   2           58        3   Whole disk
>   sda3   58          9         83  Linux
>
> and slap the bootloader into MBR, just like on x86. Or I am missing
> something..

It was a request from hpa, and he clearly had something in mind. He kindly offered to review the disk format for bootloaders and other lower level issues, but I asked him to wait until I firm it up a bit. From my point of view, 0 is a bad idea because it is very likely to conflict with other things.

There are lots of things in the FS that need deep thought, and the perfect system to fully use the first 64k of a 1TB filesystem isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where possible, and a very good idea to align things on some sector boundary. Even going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 08:49, Chris Mason wrote:
> > > This is a real issue on sparc where the default sun disk labels created
> > > use an initial partition where block zero aliases the disk label. It
> > > took me a few iterations before I figured out why every btrfs make
> > > would zero out my disk label :-/
> > >
> > > Actually it seems this is only a problem with mkfs.btrfs, it clears out
> > > the first 64 4K chunks of the disk for whatever reason.
> >
> > It is a good idea to remove supers from other filesystems. I also need
> > to add zeroing at the end of the device as well. Looks like I misread
> > the e2fs zeroing code. It zeros the whole external log device, and I
> > assumed it also zeroed out the start of the main FS.
> >
> > So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs. (Yes, I had xfs on sparc before,
> so it's not like you NEED the whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, which may not matter for sparc, but I don't really plan on different locations for different arches. 4k alignment is important given that sector sizes are growing.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, David Miller wrote:
> From: David Miller [EMAIL PROTECTED]
> Date: Mon, 11 Feb 2008 23:21:39 -0800 (PST)
> > Filesystems like ext2 put their superblock 1 block into the partition in
> > order to avoid overwriting disk labels and other uglies. UFS does this
> > too, as do several others. One of the few exceptions I've been able to
> > find is XFS.
>
> This is a real issue on sparc where the default sun disk labels created
> use an initial partition where block zero aliases the disk label. It took
> me a few iterations before I figured out why every btrfs make would zero
> out my disk label :-/
>
> Actually it seems this is only a problem with mkfs.btrfs, it clears out
> the first 64 4K chunks of the disk for whatever reason.

It is a good idea to remove supers from other filesystems. I also need to add zeroing at the end of the device as well. Looks like I misread the e2fs zeroing code. It zeros the whole external log device, and I assumed it also zeroed out the start of the main FS.

So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

-chris
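As a rough illustration of the 1k-offset zeroing being discussed (done on a file-backed stand-in rather than a real device; the offsets are the ones from this thread, not necessarily what mkfs.btrfs ended up doing):

```shell
# Stand-in "device": 64 1K blocks of random data, like a disk with an
# old filesystem on it and a label in block 0.
DEV=$(mktemp)
dd if=/dev/urandom of="$DEV" bs=1k count=64 2>/dev/null

# Zero from offset 1k onward, preserving block 0 (the disk label) -
# clearing old supers without clobbering the label.
dd if=/dev/zero of="$DEV" bs=1k seek=1 count=63 conv=notrunc 2>/dev/null
```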
[ANNOUNCE] Btrfs v0.12 released
Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some initial support for multiple devices. But, I have made a number of performance fixes and small bug fixes, and I wanted to get them out there before the (destabilizing) work on multiple devices took over.

So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain is dramatically better random writes to existing files. In testing here, the random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to change the way back references for file extents were hashed.

Other changes:

* Insert and delete multiple items at once in the btree where possible.
  Back references added more tree balances, and it showed up in a few
  benchmarks. With v0.12, backrefs have no real impact on performance.

* Optimize bio end_io routines. Btrfs was spending way too much CPU time in
  the bio end_io routines, leading to lock contention and other problems.

* Optimize read ahead during transaction commit. The old code was trying to
  read far too much at once, which made the end_io problems really stand
  out.

* mount -o ssd option, which clusters file data writes together regardless
  of the directory the files belong to. There are a number of other
  performance tweaks for SSD, aimed at clustering metadata and data writes
  to better take advantage of the hardware.

* mount -o max_inline=size option, to override the default max inline file
  data size (default is 8k). Any value up to the leaf size is allowed
  (default 16k).

* Simple -ENOSPC handling. Emphasis on simple, but it prevents accidentally
  filling the disk most of the time. With enough threads/procs banging on
  things, you can still easily crash the box.

-chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Jan Kara wrote:
> On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > On Thursday 31 January 2008, Al Boldi wrote:
> > > Andreas Dilger wrote:
> > > > On Wednesday 30 January 2008, Al Boldi wrote:
> > > > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > > > until about 1 minute (~180mb) of db-writeout activity, when the
> > > > > sync abruptly hangs for minutes on end, and io-wait shows almost
> > > > > 100%.
> > > >
> > > > How large is the journal in this filesystem? You can check via
> > > > debugfs -R 'stat <8>' /dev/XXX.
> > >
> > > 32mb.
> > >
> > > > Is this affected by increasing the journal size? You can set the
> > > > journal size via mke2fs -J size=400 at format time, or on an
> > > > unmounted filesystem by running tune2fs -O ^has_journal /dev/XXX
> > > > then tune2fs -J size=400 /dev/XXX.
> > >
> > > Setting size=400 doesn't help, nor does size=4.
> > >
> > > > I suspect that the stall is caused by the journal filling up, and
> > > > then waiting while the entire journal is checkpointed back to the
> > > > filesystem before the next transaction can start. It is possible to
> > > > improve this behaviour in JBD by reducing the amount of space that
> > > > is cleared if the journal becomes full, and also doing journal
> > > > checkpointing before it becomes full. While that may reduce
> > > > performance a small amount, it would help avoid such huge latency
> > > > problems. I believe we have such a patch in one of the Lustre
> > > > branches already, and while I'm not sure what kernel it is for, the
> > > > JBD code rarely changes much.
> > >
> > > The big difference between ordered and writeback is that once the
> > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > continues 100% user.
> >
> > Does data=ordered write buffers in the order they were dirtied? This
> > might explain the extreme problems in transactional workloads.
>
> Well, it does, but we submit them to the block layer all at once so the
> elevator should sort the requests for us...

nr_requests is fairly small, so a long stream of random requests should still end up being random IO.

Al, could you please compare the write throughput from vmstat for the data=ordered vs data=writeback runs?
I would guess the data=ordered one has a lower overall write throughput.

-chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Al Boldi wrote:
> Andreas Dilger wrote:
> > On Wednesday 30 January 2008, Al Boldi wrote:
> > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > until about 1 minute (~180mb) of db-writeout activity, when the sync
> > > abruptly hangs for minutes on end, and io-wait shows almost 100%.
> >
> > How large is the journal in this filesystem? You can check via
> > debugfs -R 'stat <8>' /dev/XXX.
>
> 32mb.
>
> > Is this affected by increasing the journal size? You can set the journal
> > size via mke2fs -J size=400 at format time, or on an unmounted
> > filesystem by running tune2fs -O ^has_journal /dev/XXX then
> > tune2fs -J size=400 /dev/XXX.
>
> Setting size=400 doesn't help, nor does size=4.
>
> > I suspect that the stall is caused by the journal filling up, and then
> > waiting while the entire journal is checkpointed back to the filesystem
> > before the next transaction can start. It is possible to improve this
> > behaviour in JBD by reducing the amount of space that is cleared if the
> > journal becomes full, and also doing journal checkpointing before it
> > becomes full. While that may reduce performance a small amount, it would
> > help avoid such huge latency problems. I believe we have such a patch in
> > one of the Lustre branches already, and while I'm not sure what kernel
> > it is for, the JBD code rarely changes much.
>
> The big difference between ordered and writeback is that once the slowdown
> starts, ordered goes into ~100% iowait, whereas writeback continues 100%
> user.

Does data=ordered write buffers in the order they were dirtied? This might explain the extreme problems in transactional workloads.

-chris
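The mke2fs/tune2fs recipe quoted above can be tried risk-free on a file-backed image (needs e2fsprogs; no root required since nothing is mounted; sizes here are just illustrative):

```shell
# 128MB scratch image standing in for /dev/XXX.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=128 2>/dev/null

# Format as ext3 (ext2 + journal) with a 16MB journal.
mke2fs -q -F -j -J size=16 "$IMG"

# Remove the journal, then recreate it at 32MB - the same two-step
# resize described in the thread.
tune2fs -O ^has_journal "$IMG" >/dev/null
tune2fs -j -J size=32 "$IMG" >/dev/null
```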
Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
On Friday 25 January 2008, Jan Kara wrote:
> > If ext3's DIO code only touches transactions in get_block, then it can
> > violate data=ordered rules. Basically the transaction that allocates the
> > blocks might commit before the DIO code gets around to writing them. A
> > crash in the wrong place will expose stale data on disk.
>
> Hmm, I've looked at it and I don't think so - look at the rationale in the
> patch below... That patch should fix the lock-inversion problem (at least
> I see no lockdep warnings on my test machine).

Ah ok, when I was looking at this I was allowing holes to get filled without falling back to buffered. But, with the orphan inode entry protecting things I see how you're safe with this patch.

-chris
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote:
> Ingo Molnar wrote:
> > * Oliver Pinter (Pintér Olivér) [EMAIL PROTECTED] wrote:
> > > and then please update to CFS-v24.1
> > > http://people.redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.6.22.15-v24.1.patch
> > >
> > > Yes with CFSv20.4, as in the log. It also hangs on 2.6.23.13
> >
> > my feeling is that this is some sort of timing dependent race in
> > konqueror/kde/qt that is exposed when a different scheduler is put in.
> > If it disappears with CFS-v24.1 it is probably just because the timings
> > will change again. Would be nice to debug this on the konqueror side and
> > analyze why it fails and how. You can probably tune the timings by
> > enabling SCHED_DEBUG and tweaking /proc/sys/kernel/*sched* values - in
> > particular sched_latency and the granularity settings. Setting wakeup
> > granularity to 0 might be one of the things that could make a
> > difference.
>
> Thanks Ingo, but Mike suggested that data=writeback may make a difference,
> which it does indeed. So the bug seems to be related to data=ordered,
> although I haven't gotten any feedback from the ext3 gurus yet. Seems
> rather critical though, as data=writeback is a dangerous mode to run.

Running fsync in data=ordered means that all of the dirty blocks on the FS will get written before fsync returns. Your original stack trace shows everyone either performing writeback for a log commit or waiting for the log commit to return.

The key task in your trace is kjournald, stuck in get_request_wait. It could be a block layer bug, not giving him requests quickly enough, or it could be the scheduler not giving him back the cpu fast enough. At any rate, that's where to concentrate the debugging.
You should be able to simulate this by running a few instances of the below loop and looking for stalls:

while true; do
        time dd if=/dev/zero of=foo bs=50M count=4 oflag=sync
done
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote:
> Chris Mason wrote:
> > Running fsync in data=ordered means that all of the dirty blocks on the
> > FS will get written before fsync returns.
>
> Hm, that's strange, I expected this kind of behaviour from data=journal.
> data=writeback should return immediately, which seems it does, but
> data=ordered should only wait for the metadata flush; it shouldn't wait
> for the file data flush. Are you sure it waits for both?

I oversimplified. data=ordered means that all data blocks are written before the metadata that references them commits. So, if you add 1GB to a fileA in a transaction and then run fsync(fileB) in the same transaction, the 1GB from fileA is sent to disk (and waited on) before the fsync on fileB returns.

-chris
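A crude way to watch for this effect (file names are illustrative; GNU dd's conv=fsync makes it fsync its output file before exiting, so the second command plays the role of fsync(fileB)):

```shell
# Dirty a pile of fileA data, then time a tiny synchronous write to fileB.
# On ext3 data=ordered the fsync of fileB can end up waiting on fileA's
# dirty blocks as well; on data=writeback it usually will not.
dd if=/dev/zero of=fileA bs=1M count=64 conv=notrunc 2>/dev/null
time dd if=/dev/zero of=fileB bs=4k count=1 conv=fsync 2>/dev/null
```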
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Christian Hesse wrote:
> On Thursday 17 January 2008, Chris Mason wrote:
> > So, I've put v0.11 out there.
>
> Ok, back to the suspend problem I mentioned:
>
> [ oopsen ]
>
> I get this after a suspend/resume cycle with mounted btrfs.

Looks like metadata corruption. How are you suspending?

-chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Tuesday 15 January 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.10 is now available for download from:
> http://oss.oracle.com/projects/btrfs/

Well, it turns out this release had a few small problems:

* data=ordered deadlock on older kernels (including 2.6.23)
* Compile problems when ACLs were not enabled in the kernel

So, I've put v0.11 out there. It fixes those two problems and will also compile on older (2.6.18) enterprise kernels. v0.11 does not have any disk format changes.

-chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Daniel Phillips wrote:
> On Jan 17, 2008 1:25 PM, Chris Mason [EMAIL PROTECTED] wrote:
> > So, I've put v0.11 out there. It fixes those two problems and will also
> > compile on older (2.6.18) enterprise kernels. v0.11 does not have any
> > disk format changes.
>
> Hi Chris,
>
> First, massive congratulations for bringing this to fruition in such a
> short time. Now back to the regular carping: why even support older
> kernels?

The general answer is that the backports are small and easy. I don't test them heavily, and I don't go out of my way to make things work. But, they do make it easier for people to try out, and to figure out how to use all these new features to solve problems. Small changes that enable more testers are always welcome.

In general, the core parts of the kernel that btrfs uses haven't had many interface changes since 2.6.18, so this isn't a huge deal.

-chris
[ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
Hello everyone,

Btrfs v0.10 is now available for download from:

http://oss.oracle.com/projects/btrfs/

Btrfs is still in an early alpha state, and the disk format is not finalized. v0.10 introduces a new disk format, and is not compatible with v0.9.

The core of this release is explicit back references for all metadata blocks, data extents, and directory items. These are a crucial building block for future features such as online fsck and migration between devices. The back references are verified during deletes, and the extent back references are checked by the existing offline fsck tool. For all of the details of how the back references are maintained, please see the design document:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html

Other new features (described in detail below):

* Online resizing (including shrinking)
* In place conversion from Ext3 to Btrfs
* data=ordered support
* Mount options to disable data COW and checksumming
* Barrier support for sata and IDE drives

[ Resizing ]

In order to demonstrate and test the back references, I've added an online resizer, which can both grow and shrink the filesystem:

mount -t btrfs /dev/xxx /mnt

# add 2GB to the FS
btrfsctl -r +2g /mnt

# shrink the FS by 4GB
btrfsctl -r -4g /mnt

# Explicitly set the FS size
btrfsctl -r 20g /mnt

# Use 'max' to grow the FS to the limit of the device
btrfsctl -r max /mnt

[ Conversion from Ext3 ]

This is an offline, in place, conversion program written by Yan Zheng. It has been through basic testing, but should not be trusted with critical data. To build the conversion program, run 'make convert' in the btrfs-progs tree. It depends on libe2fs and acl development libraries.

The conversion program uses the copy on write nature of Btrfs to preserve the original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata.
Btrfs metadata is created inside the free space of the Ext3 filesystem, and it is possible to either make the conversion permanent (reclaiming the space used by Ext3) or roll back the conversion to the original Ext3 filesystem. More details and example usage of the conversion program can be found here:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-converter.html

Thanks to Yan Zheng for all of his work on the converter.

[ New mount options ]

mount -o nodatacsum disables checksumming on data extents.

mount -o nodatacow disables copy on write of data extents, unless a given extent is referenced by more than one snapshot. This is targeted at database workloads, where copy on write is not optimal for performance. The explicit back references allow the nodatacow code to make sure copy on write is done when multiple snapshots reference the same file, maintaining snapshot consistency.

mount -o alloc_start=num forces allocation hints to start at least num bytes into the disk. This was introduced to test the resizer. Example usage:

mount -o alloc_start=16g /dev/ /mnt
(do something to the FS)
btrfsctl -r 12g /mnt

The btrfsctl command will resize the FS down to 12GB in size. Because the FS was mounted with -o alloc_start=16g, any allocations done after mounting will need to be relocated by the resizer. It is safe to specify a number past the end of the FS; if the alloc_start is too large, it is ignored.

mount -o nobarrier disables cache flushes during commit.

-chris
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, 15 Jan 2008 20:24:27 -0500 Daniel Phillips [EMAIL PROTECTED] wrote:
> On Jan 15, 2008 7:15 PM, Alan Cox [EMAIL PROTECTED] wrote:
> > > Writeback cache on disk in itself is not bad, it only gets bad if the
> > > disk is not engineered to save all its dirty cache on power loss,
> > > using the disk motor as a generator or alternatively a small battery.
> > > It would be awfully nice to know which brands fail here, if any,
> > > because writeback cache is a big performance booster.
> >
> > AFAIK no drive saves the cache. The worst case cache flush for drives is
> > several seconds with no retries and a couple of minutes if something
> > really bad happens. This is why the kernel has some knowledge of
> > barriers and uses them to issue flushes when needed.
>
> Indeed, you are right, which is supported by actual measurements:
>
> http://sr5tech.com/write_back_cache_experiments.htm
>
> Sorry for implying that anybody has engineered a drive that can do such a
> nice thing with writeback cache.
>
> The disk motor as a generator tale may not be purely folklore. When an
> IDE drive is not in writeback mode, something special needs to be done to
> ensure the last write to media is not a scribble.
>
> A small UPS can make writeback mode actually reliable, provided the
> system is smart enough to take the drives out of writeback mode when the
> line power is off.

We've had mount -o barrier=1 for ext3 for a while now; it makes writeback caching safe. XFS has this on by default, as does reiserfs.

-chris
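For anyone wanting to turn this on persistently, the option goes on the mount command line or in fstab; a typical (illustrative) fstab line:

```
# /etc/fstab - ext3 with write barriers enabled, so drive writeback
# caching stays safe across power loss
/dev/sda1  /home  ext3  defaults,barrier=1  0  2
```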
Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
On Mon, 14 Jan 2008 18:06:09 +0100 Jan Kara [EMAIL PROTECTED] wrote:
> On Wed 02-01-08 12:42:19, Zach Brown wrote:
> > Erez Zadok wrote:
> > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest
> > > tree. Kernel w/ SMP, preemption, and lockdep configured.
> >
> > This is a real lock ordering problem. Thanks for reporting it.
> >
> > The updating of atime inside sys_mmap() orders the mmap_sem in the vfs
> > outside of the journal handle in ext3's inode dirtying:
> >
> > [ lock inversion traces ]
> >
> > Two fixes come to mind:
> >
> > 1) use something like Peter's ->mmap_prepare() to update atime before
> > acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). I don't
> > know if this would leave more paths which do a journal_start() while
> > holding the mmap_sem.
> >
> > 2) rework ext3's dio to only hold the jbd handle in ext3_get_block().
> > Chris has a patch for this kicking around somewhere but I'm told it has
> > problems exposing old blocks in ordered data mode.
> >
> > Does anyone have preferences? I could go either way. I certainly don't
> > like the idea of journal handles being held across the entirety of
> > fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from
> > the buffered path :(.
>
> I've looked more into it and I think that 2) is the only way to go since
> transaction start ranks below page lock (standard buffered write path)
> and page lock ranks below mmap_sem. So we have at least one more
> dependency: mmap_sem must go before transaction start...

Just to clarify a little bit: if ext3's DIO code only touches transactions in get_block, then it can violate data=ordered rules. Basically the transaction that allocates the blocks might commit before the DIO code gets around to writing them. A crash in the wrong place will expose stale data on disk.

-chris
Re: [PATCH][RFC] fast file mapping for loop
On Fri, 11 Jan 2008 10:01:18 +1100 Neil Brown [EMAIL PROTECTED] wrote: On Thursday January 10, [EMAIL PROTECTED] wrote: On Thu, Jan 10 2008, Chris Mason wrote: On Thu, 10 Jan 2008 09:31:31 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Alasdair G Kergon wrote: Here's the latest version of dm-loop, for comparison. To try it out, ln -s dmsetup dmlosetup and supply similar basic parameters to losetup. (using dmsetup version 1.02.11 or higher) Why oh why does dm always insist to reinvent everything? That's bad enough in itself, but on top of that most of the extra stuff ends up being essentially unmaintained. I don't quite get how the dm version is reinventing things. They use Things like raid, now file mapping functionality. I'm sure there are more examples, it's how dm was always developed probably originating back to when they developed mostly out of tree. And I think it's a bad idea, we don't want duplicate functionality. If something is wrong with loop, fix it, don't write dm-loop. I'm with Jens here. We currently have two interfaces that interesting block devices can be written for: 'dm' and 'block'. We really should aim to have just one. I would call it 'block' and move anything really useful from dm into block. As far as I can tell, the important things that 'dm' has that 'block' doesn't have are: - a standard ioctl interface for assembling and creating interesting devices. For 'block', everybody just rolls their own. e.g. md, loop, and nbd all use totally different approaches for setup and tear down etc. - suspend/reconfigure/resume. This is something that I would really like to see in 'block'. If I had a filesystem mounted on /dev/sda1 and I wanted to make it a raid1, it would be cool if I could suspend /dev/sda1, build a raid1 from sda1 and something else, plug that raid1 in as 'sda1', and resume sda1. - Integrated 'linear' mapping. This is the bit of 'dm' that I think of as yucky.
If I read the code correctly, every dm device is a linear array of a bunch of targets. Each target can be a stripe-set(raid0) or a multipath or a raid1 or a plain block device or whatever. Having 'linear' at a different level to everything else seems a bit ugly, but it isn't really a big deal. DM is also a framework where you can introduce completely new types of block devices without having to go through the associated pain of finding major numbers. In terms of developing new things with greater flexibility, I think it is easier. I would really like to see every 'dm' target being just a regular 'block' device. Then a 'linear' block device could be used to assemble dm targets into a dm device. Or the targets could be used directly if the 'linear' function wasn't needed. Each target/device could respond to both dm ioctls and 'adhoc' ioctls. That is a bit ugly, but backwards compatibility always is, but it isn't a big cost. I think the way forward here is to put the important suspend/reconfig/resume functionality into the block layer, then work on making code work with multiple ioctl interfaces. I *don't* think the way forward is to duplicate current block devices as dm targets. This is duplication of effort (which I admit isn't always a bad thing) and a maintenance headache (which is). raid in dm aside (that's an entirely different debate ;), loop is a pile of things which dm can nicely layer out into pieces (dm-crypt vs loopback crypt). Also, dm doesn't have to jump through hoops to get a variable number of minors. Yes, the loop side was recently improved for # of minors, and it does have enough in there for userland to do variable number of minors, but this is one specific case where dm is just easier. At any rate, I'm all for ideas that make dm less of the evil stepchild of the block layer ;) I'm not saying everything should be dm, but I did want to point out that dm-loop isn't entirely silly. 
I have a version of Jens' patch in testing here that makes a new API with the FS for mapping extents and hope to post it later today. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 09:31:31 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Alasdair G Kergon wrote: Here's the latest version of dm-loop, for comparison. To try it out, ln -s dmsetup dmlosetup and supply similar basic parameters to losetup. (using dmsetup version 1.02.11 or higher) Why oh why does dm always insist to reinvent everything? That's bad enough in itself, but on top of that most of the extra stuff ends up being essentially unmaintained. I don't quite get how the dm version is reinventing things. They use the dmsetup command that they use for everything else and provide a small and fairly clean module for bio specific loop instead of piling it onto loop.c. Their code doesn't have the fancy hole handling that yours does, but neither did yours 4 days ago ;) If we instead improve loop, everyone wins. Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit outside your own sandbox. It is a natural fit in either place, as both loop and dm have a good infrastructure for it. I'm not picky about where it ends up, but dm wouldn't be a bad place. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 08:54:59 + Christoph Hellwig [EMAIL PROTECTED] wrote: On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote: IMHO this shouldn't be done in the loop driver anyway. Filesystems have their own efficient extent lookup trees (well, at least xfs and btrfs do), and we should leverage that instead of reinventing it. Completely agree, it's just needed right now for this solution since all we have is a crappy bmap() interface to get at those mappings. So let's fix the interface instead of piling crap on top of it. As I said I think Peter has something to start with so let's beat on it until we have something suitable. If we aren't done by end of Feb I'm happy to host a hackfest to get it sorted around the fs/storage summit.. Ok, I've been meaning to break my extent_map code up, and this is a very good reason. I'll work up a sample today based on Jens' code. The basic goals: * Loop (swap) calls into the FS for each mapping. Any caching happens on the FS side. * The FS returns an extent, filling any holes Swap would need to use an extra call early on for preallocation. Step two is having a call back into the FS to allow the FS to delay the bios until commit completion so that COW and delalloc blocks can be fully on disk when the bios are reported as done. Jens, can you add some way to queue the bio completions up? -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 14:03:24 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Thu, Jan 10 2008, Chris Mason wrote: On Thu, 10 Jan 2008 08:54:59 + Christoph Hellwig [EMAIL PROTECTED] wrote: On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote: IMHO this shouldn't be done in the loop driver anyway. Filesystems have their own efficient extent lookup trees (well, at least xfs and btrfs do), and we should leverage that instead of reinventing it. Completely agree, it's just needed right now for this solution since all we have is a crappy bmap() interface to get at those mappings. So let's fix the interface instead of piling crap on top of it. As I said I think Peter has something to start with so let's beat on it until we have something suitable. If we aren't done by end of Feb I'm happy to host a hackfest to get it sorted around the fs/storage summit.. Ok, I've been meaning to break my extent_map code up, and this is a very good reason. I'll work up a sample today based on Jens' code. Great! Grin, we'll see how the sample looks. The basic goals: * Loop (swap) calls into the FS for each mapping. Any caching happens on the FS side. * The FS returns an extent, filling any holes We don't want to fill holes for a read, but I guess that's a given? Right. Swap would need to use an extra call early on for preallocation. Step two is having a call back into the FS to allow the FS to delay the bios until commit completion so that COW and delalloc blocks can be fully on disk when the bios are reported as done. Jens, can you add some way to queue the bio completions up? Sure, a function to save a completed bio and a function to execute completions on those already stored? Sounds right, I'm mostly looking for a way to aggregate a few writes to make the commits a little larger. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Wed, 9 Jan 2008 10:43:21 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Christoph Hellwig wrote: On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... And the way this is done is simply broken. It means you have to get rid of things like delayed or unwritten extents beforehand, it'll be a complete pain for COW or non-block backed filesystems. COW is not that hard to handle, you just need to be notified of moving blocks. If you view the patch as just a tighter integration between loop and fs, I don't think it's necessarily that broken. Filling holes (delayed allocation) and COW are definitely a problem. But at least for the loop use case, most non-cow filesystems will want to preallocate the space for the loop file and be done with it. Sparse loop definitely has uses, but generally those users are willing to pay a little performance. Jens' patch falls back to buffered writes for the hole case and pretends cow doesn't exist. It's a good starting point that I hope to extend with something like the extent_map apis. I did consider these cases, and it can be done with the existing approach. The right way to do this is to allow direct I/O from kernel sources where the filesystem is in charge of submitting the actual I/O after the pages are handed to it. I think Peter Zijlstra has been looking into something like that for swap over nfs. That does sound like a nice approach, but a lot more work. It'll behave differently too, the advantage of what I proposed is that it behaves like a real device. The problem with O_DIRECT (or even O_SYNC) loop is that every write into loop becomes synchronous, and it really changes the performance of things like filemap_fdatawrite.
If we just hand ownership of the file over to loop entirely and prevent other openers (perhaps even forcing backups through the loop device), we get fewer corner cases and much better performance. -chris
[ANNOUNCE] Btrfs v0.9
Hello everyone, I've just tagged and released Btrfs v0.9. Special thanks to Yan Zheng and Josef Bacik for their work. This release includes a number of disk format changes from v0.8 and also a small change from recent btrfs-unstable HG trees. So, if you have existing Btrfs filesystems, you will need to backup, reformat and restore to try out v0.9. You can find download links and other details here: http://oss.oracle.com/projects/btrfs/ Since v0.8: * Support for btree blocks larger than the page size. mkfs.btrfs defaults to 8k blocks, but -l and -n can be used to set the block size for leaves and nodes. Powers of 2 are required, example: mkfs.btrfs -l 32768 -n 32768 /dev/ * Support for inline (packed into the btree) file data larger than the page size. Any file smaller than a btree block will probably be packed into the btree. * Xattr support (no ACLs yet) from Josef Bacik. This works for generic user xattrs and was tested with beagle among other things. * Stripe size parameter to mkfs.btrfs (-s size_in_bytes). Extents will be aligned to the stripe size for performance. * Many performance and stability fixes, especially on 32 bit x86 machines. Unfixed: ENOSPC handling. Things are much more predictable now, and Btrfs will work up until the disk is very close to full. Concurrency: Everything is still protected by a single mutex, which is held during IO. Multi-threaded benchmarks will not perform well. Database performance: Still very slow in database workloads. You can get an idea of where Btrfs is headed from the TODO list: http://oss.oracle.com/projects/btrfs/dist/documentation/todo.html -chris
Reminder: Last day for submissions to the Storage and Filesystem Workshop.
Hello everyone, The deadline for position statements to the Linux Storage and Filesystem Workshop is here. Submitting a position statement is an easy way for you to tell the organizers that you would like to attend, and which topics you are most interested in. You can find all the details about the workshop here: http://www.usenix.org/events/lsf08/ The Linux Storage and Filesystem Workshop is a small, tightly focused, by-invitation workshop. It is intended to bring together developers and researchers interested in implementing improvements in the Linux filesystem and storage subsystems that can find their way into the mainline kernel and into Linux distributions in the 1–2 year timeframe. The workshop will be two days and will be separated into storage and filesystem tracks, with some combined plenary sessions. The workshop will be held Feb 25 and 26 in San Jose. -chris
Reminder: Linux Storage and Filesystem Workshop
Hello everyone, The deadline for position statements to the Linux Storage and Filesystem Workshop is quickly approaching. The position statements are an easy way for you to tell the organizers that you would like to attend, and which topics you are most interested in. You can find all the details about the workshop here: http://www.usenix.org/events/lsf08/ The Linux Storage and Filesystem Workshop is a small, tightly focused, by-invitation workshop. It is intended to bring together developers and researchers interested in implementing improvements in the Linux filesystem and storage subsystems that can find their way into the mainline kernel and into Linux distributions in the 1–2 year timeframe. The workshop will be two days and will be separated into storage and filesystem tracks, with some combined plenary sessions. The workshop will be held Feb 25 and 26 in San Jose. -chris
Re: migratepage failures on reiserfs
On Mon, 5 Nov 2007 10:23:35 + [EMAIL PROTECTED] (Mel Gorman) wrote: On (01/11/07 10:10), Badari Pulavarty didst pronounce: Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. Unfortunately, these buffer pages are spread all around making those sections of memory non-removable. Of course, one can use ZONE_MOVABLE to make sure to guarantee the remove. But I am hoping we could easily group all these allocations and minimize spreading them around. Mel ? The grow_dev_page() pages should be reclaimable even though migration is not supported for those pages? They were marked movable as it was useful for lumpy reclaim taking back pages for hugepage allocations and the like. Would it make sense for memory remove to attempt migration first and reclaim second? In this case, reiserfs has the page pinned while it is doing journal magic. Not sure if ext3 has the same issues. -chris
2008 Linux Storage and Filesystem Workshop
Hello everyone, The position statement submission system for the 2008 storage and filesystem workshop is now online. This is how you let us know you're interested in attending and what topics are most important for discussion. For all the details, please see: http://www.usenix.org/events/lsf08/ -chris
Re: migratepage failures on reiserfs
On Thu, 01 Nov 2007 08:38:57 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote: On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? Ah !! I am so blind :( I have been suspecting reiserfs all along, since it's executing fallback_migrate_page(). Actually, these buffer heads are backing blockdev. I guess these are metadata buffers :( I am not sure we can do much with these.. Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. -chris
Re: migratepage failures on reiserfs
On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? -chris
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Anyone know why the b_count == 1 and not getting dropped to zero? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. -chris
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 13:54:05 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Anyone know why the b_count == 1 and not getting dropped to zero? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage(), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps it is currently under IO? The page isn't locked, but data=ordered does IO directly on the buffer heads, without taking the page lock. The easy way to narrow our search is to try without data=ordered, it is certainly complicating things. -chris
Re: [patch 4/6][RFC] Attempt to plug race with truncate
On Fri, 26 Oct 2007 16:37:36 -0700 Mike Waychison [EMAIL PROTECTED] wrote: Attempt to deal with races with truncate paths. I'm not really sure on the locking here, but these seem to be taken by the truncate path. BKL is left as some filesystem may(?) still require it. Signed-off-by: Mike Waychison [EMAIL PROTECTED]

 fs/ioctl.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6.23/fs/ioctl.c
===================================================================
--- linux-2.6.23.orig/fs/ioctl.c	2007-10-26 15:27:29.0 -0700
+++ linux-2.6.23/fs/ioctl.c	2007-10-26 16:16:28.0 -0700
@@ -43,13 +43,21 @@ static long do_ioctl(struct file *filp,
 static int do_fibmap(struct address_space *mapping, sector_t block,
 		sector_t *phys_block)
 {
+	struct inode *inode = mapping->host;
+
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
 	if (!mapping->a_ops->bmap)
 		return -EINVAL;
 	lock_kernel();
+	/* Avoid races with truncate */
+	mutex_lock(&inode->i_mutex);
+	/* FIXME: Do we really need i_alloc_sem? */
+	down_read(&inode->i_alloc_sem);

i_alloc_sem will avoid races with filesystems filling holes inside writepage (where i_mutex isn't held). I'd expect everyone to currently give some consistent result (either the old value or the new but not garbage), but I wouldn't expect taking the semaphore to hurt anything. -chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov [EMAIL PROTECTED] wrote: Hi, ->bmap is ugly and horrible! If you have to do this at the very least please cause ->bmap64 to be able to return error values in case the file system failed to get the information or indeed such information does not exist as is the case for compressed and encrypted files for example and also for small files that are inside the on-disk inode (NTFS resident files and reiserfs packed tails are examples of this). And another of my pet peeves with ->bmap is that it uses 0 to mean sparse which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. -chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Mon, 29 Oct 2007 12:18:22 -0700 Mike Waychison [EMAIL PROTECTED] wrote: Zach Brown wrote: And another of my pet peeves with ->bmap is that it uses 0 to mean sparse which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. And maybe we can step back and see what the callers of FIBMAP are doing with the results they're getting. One use is to discover the order in which to read file data that will result in efficient IO. If we had an interface specifically for this use case then perhaps a sparse block would be better reported as the position of the inode relative to other data blocks. Maybe the inode block number in ext* land. Can you clarify what you mean above with an example? I don't really follow. This is a larger topic of helping userland optimize access to groups of files. For example, during a readdir if we knew the next step was to delete all the files found, we could do one type of readahead (or even ordering the returned values). If we knew the next step would be to read all the files found, a different type of readahead would be useful. But, we shouldn't inflict all of this on fibmap/fiemap; we'll get lost trying to make the one true interface for all operations. For grouping operations on files, I think a read_tree syscall with hints for what userland will do (read, stat, delete, list filenames), and a better cookie than readdir should do it. -chris
[CFP] 2008 Linux Storage and Filesystem Workshop
Hello everyone, We are organizing another filesystem and storage workshop in San Jose next Feb 25 and 26. You can find some great writeups of last year's conference on LWN: http://lwn.net/Articles/226351/ This year we're trying to concentrate on more problem solving sessions, short term projects and joint sessions. You can find all the details on the conference webpages: http://www.usenix.org/events/lsf08/ Soon there will be a link for submitting your position statement, which is basically a note to the organizers that you are interested in attending and which topics you think should be covered. We're also looking for people to lead the discussion around the major topics, so please let us know if you're interested in that. The discussion leaders will have input into the people that get invited and the format of the discussion. Please let me know if there are any questions about the workshop. Thanks, Chris
Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
On Tue, 23 Oct 2007 19:56:20 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote: [ adding reiserfs devs to the CC ] Thank you. This fix is kind of crude - even when it fixed Maxim's problem, and survived my stress testing of a lot of patching and kernel compiling. I'd be glad to see better solutions. This should be safe, reiserfs has the buffer heads themselves clean and the page should get cleaned eventually. The cancel_dirty_page call was just an optimization to be VM friendly. -chris
Re: More Large blocksize benchmarks
On Tue, 2007-10-16 at 12:36 +1000, David Chinner wrote: On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote: Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. Apples to oranges, Chris ;) Grin, if the two were the same, there'd be no reason to write a new one. I didn't expect faster writes on btrfs, at least not for workloads that did not require reads. The basic idea is to show there are a variety of ways the larger blocks can improve (and hurt) performance. Also, vmap isn't the only implementation path. It's true the Btrfs changes for this were huge, but a big chunk of the changes were for different leaf/node blocksizes, something that may never get used in practice. -chris
More Large blocksize benchmarks
Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. The next step is a bunch more benchmarks. I've done the first round and posted it here: http://oss.oracle.com/~mason/blocksizes/ The Btrfs code makes it relatively easy to experiment, and so this may be a good step toward figuring out if some automagic solution is worth it in general. I can even use different sizes for nodes and leaves, although I haven't done much testing at all there yet. -chris
Correct behavior on O_DIRECT sparse file writes
Hello everyone,

The test below creates a sparse file and then fills a hole with O_DIRECT. As far as I can tell from reading generic_osync_inode, the filesystem metadata is only forced to disk if i_size changes during the file write. I've tested ext3, xfs and reiserfs and they all skip the commit when filling holes. I would argue that filling holes via O_DIRECT is supposed to commit the metadata required to find those file blocks later. At least on ext3, O_SYNC does force a commit when filling holes (haven't tested others). So, is the current behavior a bug or a feature?

dd if=/dev/zero of=foo bs=1M seek=1 count=1 oflag=direct
hexdump foo | head -n 2
000 62b1 ea2d 73e8 c64f f5ef 1af5 dd09 8ccd
010 75ec 9581 e0ea ae9b e28f b76d a700 4d5b

dd if=/dev/urandom of=foo bs=4k count=1 conv=notrunc oflag=direct
reboot -nf

(after reboot)
hexdump foo
000
*
020

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 00:55:30 +1000 David Chinner [EMAIL PROTECTED] wrote: On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote: On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote: On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote: On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:

Notes: (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?

They correspond to the exact location on disk on XFS. But XFS has its own inode clustering (see xfs_iflush), and it can't be moved up into the generic layers because of locking and integration into the transaction subsystem.

(2) It duplicates some function of elevators. Why is it necessary?

The elevators have no clue as to how the filesystem might treat adjacent inodes. In XFS, inode clustering is a fundamental feature of the inode reading and writing, and that is something no elevator can hope to achieve.

Thank you. That explains the linear write curve (perfect!) in Chris' graph. I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic.

When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down. Shouldn't the clustering still help to have delalloc done in inode order instead of in whatever random order pdflush sends things down now?

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 02:33:08 +1000 David Chinner [EMAIL PROTECTED] wrote: On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:

I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic.

When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down.

Sure, that's file data. I was talking about the inode writeback, not the data writeback.

I think we're trying to gain different things from inode based clustering... I'm not worried that the inode be next to the data. I'm going under the assumption that most of the time, the FS will try to allocate inodes in groups in a directory, and so most of the time the data blocks for inode N will be close to inode N+1. So what I'm really trying for here is data block clustering when writing multiple inodes at once. This matters most when files are relatively small and written in groups, which is a common workload. It may make the most sense to change the patch to supply some key for the data block clustering instead of the inode number, but it's an easy first pass.

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Thu, 23 Aug 2007 12:47:23 +1000 David Chinner [EMAIL PROTECTED] wrote: On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:

I think we should assume a full scan of s_dirty is impossible in the presence of concurrent writers. We want to be able to pick a start time (right now) and find all the inodes older than that start time. New things will come in while we're scanning. But perhaps that's what you're saying...

At any rate, we've got two types of lists now. One keeps track of age and the other two keep track of what is currently being written. I would try two things:

1) s_dirty stays a list for FIFO. s_io becomes a radix tree that indexes by inode number (or some arbitrary field the FS can set in the inode). Radix tree tags are used to indicate which things in s_io are already in progress or are pending (hand waving because I'm not sure exactly). Inodes are pulled off s_dirty and the corresponding slot in s_io is tagged to indicate IO has started. Any nearby inodes in s_io are also sent down.

The problem with this approach is that it only looks at inode locality. Data locality is ignored completely here and the data for all the inodes that are close together could be splattered all over the drive. In that case, clustering by inode location is exactly the wrong thing to do.

Usually it won't be less wrong than clustering by time.

For example, XFS changes allocation strategy at 1TB for 32-bit inode filesystems, which makes the data get placed way away from the inodes, i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering by inode number for data writeback is mostly useless in the over-1TB case.

I agree we'll want a way to let the FS provide the clustering key. But for the first cut on the patch, I would suggest keeping it simple.

The inode32-for-under-1TB and inode64 allocators both try to keep data close to the inode (i.e. in the same AG), so clustering by inode number might work better here.
Also, it might be worthwhile allowing the filesystem to supply a hint or mask for closeness for inode clustering. This would help the generic code only try to cluster inode writes to inodes that fall into the same cluster as the first inode.

Yes, also a good idea after things are working.

Notes: (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time. It may make sense to one day let the FS provide a clustering hint (corresponding to the first block in the file?), but for starters it makes sense to just go with the inode number. Perhaps multiple hints are needed - one for data locality and one for inode cluster locality.

So, my feature creep idea would have been more data clustering. I'm mainly trying to solve this graph: http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png Where background writing of the block device inode is making ext3 do seeky writes while creating directory trees. My simple idea was to kick off a 'I've just written block X' call back to the FS, where it may decide to send down dirty chunks of the block device inode that also happen to be dirty. But maintaining the kupdate max dirty time and congestion limits in the face of all this clustering gets tricky. So, I wasn't going to suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;) But lots of people will be interested in the results.

-chris
[ANNOUNCE] seekwatcher v0.3 IO graphing and animation
Hello everyone, I've tossed out seekwatcher v0.3. The major changes are using rolling averages to smooth out the seek and throughput graphs, and it can generate mpgs of the IO done by a given trace. Here's a sample of the smoother graphs (creating 20 kernel trees): http://oss.oracle.com/~mason/seekwatcher/ext3_vs_btrfs_vs_xfs.png There are details and sample movies of the kernel tree run at: http://oss.oracle.com/~mason/seekwatcher -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 04:36:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

[ are state trees a good idea? ]

One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).

That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock.

If fsblock records an extent of 200MB, and writepage is called on a page in the middle of the extent, how do you walk the radix backwards to find the first dirty up to date page in the range?

I'm looking at cleaning up some of these aops APIs so hopefully most of the deadlock problems go away. Should be useful to both our efforts. Will post patches hopefully when I get time to finish the draft this weekend.

Great.

O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.

Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS specific ops.

Probably, if you invalidate all the pagecache in the range beforehand you should be able to do it (and I guess you want to do the invalidate anyway). Although, below deadlock issues might still bite somewhere...

Well, O_DIRECT is French for deadlocks. But I shouldn't have to worry so much about evicting the pages themselves since I can tag the range.

But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me).
Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it.

Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think.

Racing is fine as long as we don't deadlock or expose garbage from disk.

The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.

Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.

Very noble of you to donate your colleague to such a worthy cause. Jens is always interested in helping solve such debates. It's a fantastic service he provides to the community.

-chris
Re: [PATCH RFC] extent mapped page cache
On Wed, 25 Jul 2007 04:32:17 +0200 Nick Piggin [EMAIL PROTECTED] wrote: On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote: On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky; it's something that will need deep thought once the rest of the interface is sane.

Just having the other tree and managing it is what makes me a little less positive of this approach, especially using it to store pagecache state when we already have the pagecache tree. Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (And I can see some possible disadvantages in locking and tree manipulation overhead.)

Yes, there are definitely costs with the state tree, it will take some careful benchmarking to convince me it is a feasible solution. But storing all the state in the pages themselves is impossible unless the block size equals the page size. So, we end up with something like fsblock/buffer heads or the state tree. One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. It also more naturally matches the way we want to do IO, making for easy clustering. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.
The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 03:37:28 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages.

I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful.

Yes, you could use fsblock for the state bits and make a separate API to map the actual pages.

It also more naturally matches the way we want to do IO, making for easy clustering.

Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see.

One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).

O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.

Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS specific ops.

But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages.
With the state tree, I can allow the page to be faulted in but still properly deal with it.

Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)).

;) The O_DIRECT benefit is a fringe thing. I've long wanted to help clean up that code, but the real point of the patch is to make general usage faster and less complex. If I can't get there, the O_DIRECT stuff doesn't matter.

The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.

Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.

-chris
[PATCH RFC] extent mapped page cache
On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote: This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2, I'll start on #3 a little later next week. Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram. This code still has lots of room for optimization, but it comes in at around 2-5% more cpu time for ext2 streaming reads and writes. I haven't done readpages or writepages yet, so this is more or less a worst case setup. I'm comparing against ext2 with readpages and writepages disabled. The new code has the added benefit of passing fsx-linux, and not triggering MCE's on my poor little test box. The basic idea is to store state in byte ranges in an rbtree, and to mirror that state down into individual pages. This allows us to store arbitrary state outside of the page struct, so we could include the pid of the process that dirtied a page range for cfq purposes. The example readpage and writepage code is probably the easiest way to understand the basic API. A separate rbtree stores a mapping of byte offset in the file to byte offset on disk. This allows the filesystem to fill in mapping information in bulk, and reduces the number of metadata lookups required to do common operations. Because the state and mapping information are separate from the page, pages can come and go and their corresponding metadata can still be cached (the current code drops mappings as the last page corresponding to that mapping disappears). 
Two patches follow, the core extent_map implementation and a sample user (ext2). This is pretty basic, implementing prepare/commit_write, read/writepage and a few other funcs to exercise the new code. Longer term, it should fit in with Nick's other extent work instead of prepare/commit_write.

My patch sets page->private to 1, really for no good reason. It is just a debugging aid I was using to make sure the page took the right path down the line. If this catches on, we might set it to a magic value so you can if (ExtentPage(page)) or just leave it as null.

-chris
[PATCH RFC] extent mapped page cache main code
Core Extentmap implementation

diff -r 126111346f94 -r 53cabea328f7 fs/Makefile
--- a/fs/Makefile	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile	Tue Jul 24 15:40:27 2007 -0400
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.
 	attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 	seq_file.o xattr.o libfs.o fs-writeback.o \
 	pnode.o drop_caches.o splice.o sync.o utimes.o \
-	stack.o
+	stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 -r 53cabea328f7 fs/extent_map.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/fs/extent_map.c	Tue Jul 24 15:40:27 2007 -0400
@@ -0,0 +1,1591 @@
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/extent_map.h>
+
+static struct kmem_cache *extent_map_cache;
+static struct kmem_cache *extent_state_cache;
+
+struct tree_entry {
+	u64 start;
+	u64 end;
+	int in_tree;
+	struct rb_node rb_node;
+};
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+void __init extent_map_init(void)
+{
+	extent_map_cache = kmem_cache_create("extent_map",
+					    sizeof(struct extent_map), 0,
+					    SLAB_RECLAIM_ACCOUNT |
+					    SLAB_DESTROY_BY_RCU,
+					    NULL, NULL);
+	extent_state_cache = kmem_cache_create("extent_state",
+					      sizeof(struct extent_state), 0,
+					      SLAB_RECLAIM_ACCOUNT |
+					      SLAB_DESTROY_BY_RCU,
+					      NULL, NULL);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree,
+			  struct address_space *mapping, gfp_t mask)
+{
+	tree->map.rb_node = NULL;
+	tree->state.rb_node = NULL;
+	rwlock_init(&tree->lock);
+	tree->mapping = mapping;
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map
*alloc_extent_map(gfp_t mask)
+{
+	struct extent_map *em;
+	em = kmem_cache_alloc(extent_map_cache, mask);
+	if (!em || IS_ERR(em))
+		return em;
+	em->in_tree = 0;
+	atomic_set(&em->refs, 1);
+	return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+	if (atomic_dec_and_test(&em->refs)) {
+		WARN_ON(em->in_tree);
+		kmem_cache_free(extent_map_cache, em);
+	}
+}
+EXPORT_SYMBOL(free_extent_map);
+
+struct extent_state *alloc_extent_state(gfp_t mask)
+{
+	struct extent_state *state;
+	state = kmem_cache_alloc(extent_state_cache, mask);
+	if (!state || IS_ERR(state))
+		return state;
+	state->state = 0;
+	state->in_tree = 0;
+	atomic_set(&state->refs, 1);
+	init_waitqueue_head(&state->wq);
+	return state;
+}
+EXPORT_SYMBOL(alloc_extent_state);
+
+void free_extent_state(struct extent_state *state)
+{
+	if (atomic_dec_and_test(&state->refs)) {
+		WARN_ON(state->in_tree);
+		kmem_cache_free(extent_state_cache, state);
+	}
+}
+EXPORT_SYMBOL(free_extent_state);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+				   struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct tree_entry *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct tree_entry, rb_node);
+
+		if (offset < entry->end)
+			p = &(*p)->rb_left;
+		else if (offset > entry->end)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct tree_entry, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+				     struct rb_node **prev_ret)
+{
+	struct rb_node *n = root->rb_node;
+	struct rb_node *prev = NULL;
+	struct tree_entry *entry;
+	struct tree_entry *prev_entry = NULL;
+
+	while (n) {
+		entry = rb_entry(n, struct tree_entry, rb_node);
+		prev = n;
+		prev_entry = entry;
+
+		if (offset
[PATCH RFC] ext2 extentmap support
mount -o extentmap to use the new stuff

diff -r 126111346f94 -r 53cabea328f7 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h	Tue Jul 24 15:40:27 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
 	struct posix_acl	*i_default_acl;
 #endif
 	rwlock_t i_meta_lock;
+	struct extent_map_tree extent_tree;
 	struct inode	vfs_inode;
 };
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 -r 53cabea328f7 fs/ext2/inode.c
--- a/fs/ext2/inode.c	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c	Tue Jul 24 15:40:27 2007 -0400
@@ -625,6 +625,84 @@ changed:
 	goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+				   size_t page_offset, u64 start, u64 end,
+				   int create)
+{
+	struct buffer_head bh;
+	sector_t iblock;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+	int ret = 0;
+	u64 max_end = (u64)-1;
+	u64 found_len;
+	u64 bh_start;
+	u64 bh_end;
+
+	bh.b_size = inode->i_sb->s_blocksize;
+	bh.b_state = 0;
+again:
+	em = lookup_extent_mapping(extent_tree, start, end);
+	if (em) {
+		return em;
+	}
+
+	iblock = start >> inode->i_blkbits;
+	if (!buffer_mapped(&bh)) {
+		ret = ext2_get_block(inode, iblock, &bh, create);
+		if (ret)
+			goto out;
+	}
+
+	found_len = min((u64)(bh.b_size), max_end - start);
+	if (!em)
+		em = alloc_extent_map(GFP_NOFS);
+
+	bh_start = start;
+	bh_end = start + found_len - 1;
+	em->start = start;
+	em->end = bh_end;
+	em->bdev = inode->i_sb->s_bdev;
+
+	if (!buffer_mapped(&bh)) {
+		em->block_start = 0;
+		em->block_end = 0;
+	} else {
+		em->block_start = bh.b_blocknr << inode->i_blkbits;
+		em->block_end = em->block_start + found_len - 1;
+	}
+	ret = add_extent_mapping(extent_tree, em);
+	if (ret == -EEXIST) {
+		free_extent_map(em);
+		em = NULL;
+		max_end = end;
+		goto again;
+	}
+out:
+	if (ret) {
+		if (em)
+			free_extent_map(em);
+		return ERR_PTR(ret);
+	} else if (em && buffer_new(&bh)) {
+		set_extent_new(extent_tree, bh_start, bh_end, GFP_NOFS);
+	}
+	return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+				     struct writeback_control *wbc)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_write_full_page(tree, page, ext2_get_extent, wbc);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, ext2_get_block, wbc);
@@ -633,6 +711,42 @@ static int ext2_readpage(struct file *fi
 static int ext2_readpage(struct file *file, struct page *page)
 {
 	return mpage_readpage(page, ext2_get_block);
+}
+
+static int
ext2_extent_map_readpage(struct file *file, struct page *page)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_read_full_page(tree, page, ext2_get_extent);
+}
+
+static int ext2_extent_map_releasepage(struct page *page,
+				       gfp_t unused_gfp_flags)
+{
+	struct extent_map_tree *tree;
+	int ret;
+
+	if (page->private != 1)
+		return try_to_free_buffers(page);
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	ret = try_release_extent_mapping(tree, page);
+	if (ret == 1) {
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		page_cache_release(page);
+	}
+	return ret;
+}
+
+static void ext2_extent_map_invalidatepage(struct page *page,
+					   unsigned long offset)
+{
+	struct extent_map_tree *tree;
+
+	tree =
Re: [PATCH RFC] extent mapped page cache
On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: On Tue, 2007-07-24 at 16:13 -0400, Trond Myklebust wrote: On Tue, 2007-07-24 at 16:00 -0400, Chris Mason wrote: On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote:

This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2, I'll start on #3 a little later next week.

Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram.

The problem with an rbtree is that you can't use it together with RCU to do lockless lookups. You can probably modify it to allocate nodes dynamically (like the radix tree does) and thus make it RCU-compatible, but then you risk losing the two main benefits that you list above.

The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky; it's something that will need deep thought once the rest of the interface is sane.

-chris
[ANNOUNCE] seekwatcher IO graphing v0.2
Hello everyone, Since doing the initial Btrfs benchmarks, I've made my blktrace graphing utility a little more generic and tossed it out on oss.oracle.com. This new version can easily graph two different runs, and has a few other tweaks that make the graphs look nicer. Docs, examples and other details are at: http://oss.oracle.com/~mason/seekwatcher -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 12 Jul 2007 00:00:28 -0700 Daniel Phillips [EMAIL PROTECTED] wrote: On Tuesday 10 July 2007 14:03, Chris Mason wrote:

This patch aims to demonstrate one way to replace buffer heads with a few extent trees...

Hi Chris, Quite terse commentary on algorithms and data structures, but I suppose that is not a problem because Jon has a whole week to reverse engineer it for us. What did you have in mind for subpages?

This partially depends on input here. The goal is to have one interface that works for subpages, highmem and superpages, and for the FS maintainers to not care if the mappings come magically from clameter's work or vmap or whatever. Given the whole extent based theme, I plan on something like this:

struct extent_ptr {
	char *ptr;
	/* some way to indicate size and type of map */
	struct page pages[];
};

struct extent_ptr *alloc_extent_ptr(struct extent_map_tree *tree, u64 start, u64 end);
void free_extent_ptr(struct extent_map_tree *tree, struct extent_ptr *ptr);

And then some calls along the lines of kmap/kunmap that gives you a pointer you can use for accessing the ram. read/write calls would also be fine by me, but harder to convert filesystems to use. The struct extent_ptr would increase the ref count on the pages, but the pages would have no back pointers to it. All dirty/locked/writeback state would go in the extent state tree and would not be stored in the struct extent_ptr. The idea is to make a simple mapping entity, and not complicate it by storing FS specific state in there. It could be variably sized to hold an array of pages, and allocated via kmalloc.

-chris
[PATCH RFC] extent mapped page cache
This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2; I'll start on #3 a little later next week. The file offset to disk block mapping is done in one radix tree, and the state is done in a second radix tree. Extent ranges are stored in the radix trees by inserting into the slot corresponding to the end of the range, and always using gang lookups for searching. The basic implementation mirrors the page and buffer bits already used, but allows state bits to be set on regions smaller or larger than a single page. Eventually I would like to use this mechanism to replace my DIO locking/placeholder patch. Ext2 is changed to use the extent mapping code when mounted with -o extentmap. DIO is not supported and readpages/writepages are not yet implemented, but this should be enough to get the basic idea across. Testing has been very, very light; I'm mostly sending this out for comments and to continue the discussion started by Nick's patch set.

diff -r 126111346f94 fs/Makefile
--- a/fs/Makefile	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile	Tue Jul 10 16:49:26 2007 -0400
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.
 	attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 	seq_file.o xattr.o libfs.o fs-writeback.o \
 	pnode.o drop_caches.o splice.o sync.o utimes.o \
-	stack.o
+	stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h	Tue Jul 10 16:49:26 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
 	struct posix_acl	*i_default_acl;
 #endif
 	rwlock_t		i_meta_lock;
+	struct extent_map_tree	extent_tree;
 	struct inode		vfs_inode;
 };
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 fs/ext2/inode.c
--- a/fs/ext2/inode.c	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c	Tue Jul 10 16:49:26 2007 -0400
@@ -625,6 +625,78 @@ changed:
 	goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+				   size_t page_offset, u64 start, u64 end,
+				   int create)
+{
+	struct buffer_head bh;
+	sector_t iblock;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+	int ret = 0;
+	u64 max_end = (u64)-1;
+	u64 found_len;
+
+	bh.b_size = inode->i_sb->s_blocksize;
+	bh.b_state = 0;
+again:
+	em = lookup_extent_mapping(extent_tree, start, end);
+	if (em)
+		return em;
+
+	iblock = start >> inode->i_blkbits;
+	if (!buffer_mapped(&bh)) {
+		ret = ext2_get_block(inode, iblock, &bh, create);
+		if (ret)
+			goto out;
+	}
+
+	found_len = min((u64)(bh.b_size), max_end - start);
+	if (!em)
+		em = alloc_extent_map(GFP_NOFS);
+
+	em->start = start;
+	em->end = start + found_len - 1;
+	em->bdev = inode->i_sb->s_bdev;
+
+	if (!buffer_mapped(&bh)) {
+		em->block_start = 0;
+		em->block_end = 0;
+	} else {
+		em->block_start = bh.b_blocknr << inode->i_blkbits;
+		em->block_end = em->block_start + found_len - 1;
+	}
+
+	ret = add_extent_mapping(extent_tree, em);
+	if (ret == -EEXIST) {
+		max_end = end;
+		goto again;
+	}
+out:
+	if (ret) {
+		if (em)
+			free_extent_map(em);
+		return ERR_PTR(ret);
+	} else if (em && buffer_new(&bh)) {
+		set_extent_new(extent_tree, start, end, GFP_NOFS);
+	}
+	return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+				     struct
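The end-of-range insertion plus gang-lookup scheme described above can be modeled in plain userspace C. This is an illustrative sketch, not the patch's radix-tree code: entries are kept sorted by their end offset, and a lookup scans forward from the first entry whose end reaches the query start -- exactly the property the gang lookup relies on to find intersecting ranges:

```c
#include <assert.h>
#include <stddef.h>

/* One mapped range, inclusive on both ends, mirroring the [start, end]
 * convention used by the extent_map code above. */
struct extent {
	unsigned long long start, end;
};

/* Entries in 'tab' are sorted by 'end', which models keying each range
 * by the radix slot of its last offset.  The forward scan stands in for
 * the gang lookup: skip everything that ends before the query, then the
 * first remaining entry either intersects or nothing does. */
static struct extent *range_lookup(struct extent *tab, size_t n,
				   unsigned long long start,
				   unsigned long long end)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].end < start)		/* ends before query: skip */
			continue;
		if (tab[i].start > end)		/* starts after query: hole */
			return NULL;
		return &tab[i];			/* first intersecting range */
	}
	return NULL;
}
```

The same shape works whether the payload is a disk mapping or a state record, which is why one mechanism can back both trees.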
Re: vm/fs meetup details
On Fri, 6 Jul 2007 23:42:01 +1000 David Chinner [EMAIL PROTECTED] wrote: On Fri, Jul 06, 2007 at 12:26:23PM +0200, Jörn Engel wrote: On Fri, 6 July 2007 20:01:10 +1000, David Chinner wrote: On Fri, Jul 06, 2007 at 04:26:51AM +0200, Nick Piggin wrote: But, surprisingly enough, the above work is relevent to this forum because of two things: - we've had to move to direct I/O and user space caching to work around deficiencies in kernel block device caching under memory pressure - we've exploited techniques that XFS supports but the VM does not. i.e. priority tagging of cached metadata so that less important metadata is tossed first (e.g. toss tree leaves before nodes and nodes before roots) when under memory pressure. And the latter is exactly what logfs needs as well. You certainly have me interested. I believe it applies to btrfs and any other cow-fs as well. The point is that higher levels get dirtied by writing lower layers. So perfect behaviour for sync is to write leaves first, then nodes, then the root. Any other order will either cause sync not to sync or cause unnecessary writes and cost performance. Hmmm - I guess you could use it for writeback ordering. I hadn't really thought about that. Doesn't seem a particularly efficient way of doing it, though. Why not just use multiple address spaces for this? i.e. one per level and flush in ascending order. At least in the case of btrfs, the perfect order for sync is disk order ;) COW happens when blocks are changed for the first time in a transaction, not when they are written out to disk. If logfs is writing things out in some form of tree order, you're going to have to group disk allocations such that tree order reflects disk order somehow. But, the part where we toss leaves first is definitely useful. -chris
Re: Versioning file system
On Thu, 5 Jul 2007 09:57:40 -0400 John Stoffel [EMAIL PROTECTED] wrote: Erik == Erik Mouw [EMAIL PROTECTED] writes: Erik (sorry for the late reply, just got back from holiday) Erik On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote: As I mentioned in my Linux.conf.au presentation a year and a half ago, the main use of Streams in Windows to date has been for system crackers to hide trojan horse code and rootkits so that system administrators couldn't find them. :-) Erik The only valid use of Streams in Windows I've seen was a virus Erik checker that stored a hash of the file in a separate Erik stream. Checking a file was a matter of rehashing it and Erik comparing against the hash stored in the special hash data Erik stream for that particular file. So what was stopping a virus from infecting a file, re-computing the hash and pushing the new hash into the stream? You need to keep the computed hashes on read-only media for true security; once you let the system change them, you're toast. I'm not a huge fan of streams, but I'm pretty sure there are various encryption tools that let us verify and validate the source of data. It's entirely possible the virus checker wasn't doing it right, but storing verification info in an EA or stream isn't entirely invalid. You still need an external key that you do trust of course. -chris
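The external-key point can be sketched concretely. This is illustrative only -- a real checker would use HMAC over a cryptographic hash, not the toy FNV-1a mix shown here: an unkeyed hash stored alongside the file can simply be recomputed by whatever modified the file, while a digest folded together with a secret that never touches the filesystem cannot be reproduced without that secret:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here only as a stand-in digest.  'seed' lets us chain a
 * previous hash state in, which is how the key gets mixed with the data. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = data;
	uint64_t h = seed ? seed : 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Fold the secret key in first, so the resulting value depends on both
 * the key and the file contents.  A verifier holding the key can always
 * recompute it; an attacker who can rewrite the stream cannot. */
static uint64_t keyed_digest(const char *key, const void *data, size_t len)
{
	uint64_t seed = fnv1a(key, strlen(key), 0);

	return fnv1a(data, len, seed);
}
```

The stream itself is then just a cache of the keyed value; trust comes entirely from the key kept off the system.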
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 01:28:57 -0400 Xin Zhao [EMAIL PROTECTED] wrote: Hi, If a file is already opened when snapshot command is issued, the file itself could be in an inconsistent state already. Before the file is closed, maybe part of the file contains old data, the rest contains new data. How does a versioning filesystem guarantee that the file snapshot is in a consistent state in this case? I googled it but didn't find any answer. Can someone explain it a little bit? It's the same answer as in most filesystem related questions...it depends ;) Consistent state means many different things. It may mean that the metadata accurately reflects the space on disk allocated to the file and that all data for the file is properly on disk (ie from an fsync). But, even this is less than useful because very few files on the filesystem stand alone. Applications spread their state across a number of files and so consistent means something different to every application. Getting a snapshot that is useful with respect to application data requires help from the application. The app needs to be shutdown or paused prior to the snapshot and then started up again after the snapshot is taken. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 12:31:49 -0400 Xin Zhao [EMAIL PROTECTED] wrote: That's a good point! But this sounds hopeless to take a real consistent snapshot from app perspective unless you shutdown the computer. Right? Many different applications support some form of pausing in order to facilitate live backups. You just have to keep it all in mind when designing the total backup solution. -chris
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 13:15:06 -0400 Xin Zhao [EMAIL PROTECTED] wrote: OK. From discussion above, can we reach a conclusion: from the application perspective, it is very hard, if not impossible, to take a transactional consistent snapshot without the help from applications? You definitely need help from the applications. They define what a transaction is. Chris, you mentioned that Many different applications support some form of pausing in order to facilitate live backups. Can you provide some examples? I mean popular apps. Oracle, db2, mysql, ldap, postgres, sleepycat databases...just search for online backup and most programs that involve something transactional have a way to do it. Finally, if we back up a little bit, say, we don't care the transaction level consistency ( a transaction that open/close many times), but we want a open/close consistency in snapshots. That is, a file in a snapshot must be in a single version, but it can be in a middle state of a transaction. Can we do that? Pausing apps itself does not solve this problem, because a file could be already opened and in the middle of write. As I mentioned earlier, some systems can backup old data every time new data is written, but I suspect that this will impact the system performance quite a bit. Any idea about that? This depends on the transaction engine in your filesystem. None of the existing linux filesystems have a way to start a transaction when the file opens and finish it when the file closes, or a way to roll back individual operations that have happened inside a given transaction. It certainly could be done, but it would also introduce a great deal of complexity to the FS. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote: On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote: On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote: Lets look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
    for each page:
        prepare_write()
            allocate contiguous chunks of disk
            attach buffers
        copy_from_user()
        commit_write()
            dirty buffers

pdflush:
    writepages()
        find pages with contiguous chunks of disk
        build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases. That's true but I don't think an extent data structure means we can become too far divorced from the pagecache or the native block size -- what will end up happening is that often we'll need stuff to map between all those as well, even if it is only at IO-time. I think the fundamental difference is that fsblock still does: mapping_info = page->something, where something is attached on a per page basis. What we really want is mapping_info = lookup_mapping(page), where that function goes and finds something stored on a per extent basis, with extra bits for tracking dirty and locked state. Ideally, in at least some of the cases the dirty and locked state could be at an extent granularity (streaming IO) instead of the block granularity (random IO). In my little brain, even block based filesystems should be able to take advantage of this...but such things are always easier to believe in before the coding starts. But the point is taken, and I do believe that at least for APIs, extent based seems like the best way to go. And that should allow fsblock to be replaced or augmented in future without _too_ much pain. Yup - I've been on the painful end of those dark corner cases several times in the last few months.
It's also worth pointing out that mpage_readpages() already works on an extent basis - it overloads bufferheads to provide a map_bh that can point to a range of blocks in the same state. The code then iterates the map_bh range a page at a time building bios (i.e. not even using buffer heads) from that map.. One issue I have with the current nobh and mpage stuff is that it requires multiple calls into get_block (first to prepare write, then to writepage), it doesn't allow filesystems to attach resources required for writeout at prepare_write time, and it doesn't play nicely with buffers in general. (not to mention that nobh error handling is buggy). I haven't done any mpage-like code for fsblocks yet, but I think they wouldn't be too much trouble, and wouldn't have any of the above problems... Could be, but the fundamental issue of sometimes pages have mappings attached and sometimes they don't is still there. The window is smaller, but non-zero. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote: On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote: On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote: [ ... fsblocks vs extent range mapping ] iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API. I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of attach mapping information to a page, and switch to lookup mapping information and range locking for a page. Well the get_block equivalent API is extent based one now, and I'll look at what is required in making map_fsblock a more generic call that could be used for an extent-based scheme. An extent based thing IMO really isn't appropriate as the main generic layer here though. If it is really useful and popular, then it could be turned into generic code and sit along side fsblock or underneath fsblock... 
Lets look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
    for each page:
        prepare_write()
            allocate contiguous chunks of disk
            attach buffers
        copy_from_user()
        commit_write()
            dirty buffers

pdflush:
    writepages()
        find pages with contiguous chunks of disk
        build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases. I do think fsblocks is a nice cleanup on its own, but Dave has a good point that it makes sense to look for ways to generalize things even more. -chris
Re: [patch 1/3] add the fsblock layer
On Tue, Jun 26, 2007 at 01:07:43PM +1000, Nick Piggin wrote: Neil Brown wrote: On Tuesday June 26, [EMAIL PROTECTED] wrote: Chris Mason wrote: The block device pagecache isn't special, and certainly isn't that much code. I would suggest keeping it buffer head specific and making a second variant that does only fsblocks. This is mostly to keep the semantics of PagePrivate sane, lets not fuzz the line. That would require a new inode and address_space for the fsblock type blockdev pagecache, wouldn't it? I just can't think of a better non-intrusive way of allowing a buffer_head filesystem and an fsblock filesystem to live on the same blkdev together. I don't think they would ever try to. Both filesystems would bd_claim the blkdev, and only one would win. Hmm OK, I might have confused myself thinking about partitions... The issue is more of a filesystem sharing a blockdev with the block-special device (i.e. open(/dev/sda1), read) isn't it? If a filesystem wants to attach information to the blockdev pagecache that is different to what blockdev want to attach, then I think Yes - a new inode and address space is what it needs to create. Then you get into consistency issues between the metadata and direct blockdevice access. Do we care about those? Yeah that issue is definitely a real one. The problem is not just consistency, but how do the block device aops even know that the PG_private page they have has buffer heads or fsblocks, so it is an oopsable condition rather than just a plain consistency issue (consistency is already not guaranteed). Since we're testing new code, I would just leave the blkdev address space alone. If a filesystem wants to use fsblocks, they allocate a new inode during mount, stuff it into their private super block (or in the generic super), and use that for everything. Basically ignoring the block device address space completely. 
It means there will be some inconsistency between what you get when reading the block device file and the filesystem metadata, but we've got that already (ext2 dir in page cache). -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote: [ ... fsblocks vs extent range mapping ] iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API. I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of attach mapping information to a page, and switch to lookup mapping information and range locking for a page. A btree could be used to hold the range mapping and locking, but it could just as easily be a radix tree where you do a gang lookup for the end of the range (the same way my placeholder patch did). It'll still find intersecting range locks but is much faster for random insertion/deletion than the btrees. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vm/fs meetup in september?
On Tue, Jun 26, 2007 at 12:35:09PM +1000, Nick Piggin wrote: Christoph Hellwig wrote: On Sun, Jun 24, 2007 at 06:23:45AM +0200, Nick Piggin wrote: I'd just like to take the chance also to ask about a VM/FS meetup some time around kernel summit (maybe take a bit of time during UKUUG or so). I won't be around until a day or two before KS, so I'd prefer to have it after KS if possible. I'd like to see you there, so I hope we can find a date that most people are happy with. I'll try to start working that out after we have a rough idea of who's interested. I'm game, but won't be staying past the end of KS (I'll arrive Sept 2nd or so though). Given debates so far, it probably makes sense to talk about things at KS too. -chris
Re: [RFC] fsblock
On Mon, Jun 25, 2007 at 04:58:48PM +1000, Nick Piggin wrote: Using buffer heads instead allows the FS to send file data down inside the transaction code, without taking the page lock. So, locking wrt data=ordered is definitely going to be tricky. The best long term option may be making the locking order transaction -> page lock, and change writepage to punt to some other queue when it needs to start a transaction. Yeah, that's what I would like, and I think it would come naturally if we move away from these pass down a single, locked page APIs in the VM, and let the filesystem do the locking and potentially batching of larger ranges. Definitely. write_begin/write_end is a step in that direction (and it helps OCFS and GFS quite a bit). I think there is also not much reason for writepage sites to lock the page and clear the dirty bit themselves (which seems ugly to me). If we keep the page mapping information with the page all the time (ie writepage doesn't have to call get_block ever), it may be possible to avoid sending down a locked page. But, I don't know the delayed allocation internals well enough to say for sure if that is true. Either way, writepage is the easiest of the bunch because it can be deferred. -chris
Re: [patch 1/3] add the fsblock layer
On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote: Neil Brown wrote: On Sunday June 24, [EMAIL PROTECTED] wrote: +#define PG_blocks 20 /* Page has block mappings */ + I've only had a very quick look, but this line looks *very* wrong. You should be using PG_private. There should never be any confusion about whether -private has buffers or blocks attached as the only routines that ever look in -private are address_space operations (or should be. I think 'NULL' is sometimes special cased, as in try_to_release_page. It would be good to do some preliminary work and tidy all that up). There is a lot of confusion, actually :) But as you see in the patch, I added a couple more aops APIs, and am working toward decoupling it as much as possible. It's pretty close after the fsblock patch... however: Why do you think you need PG_blocks? Block device pagecache (buffer cache) has to be able to accept attachment of either buffers or blocks for filesystem metadata, and call into either buffer.c or fsblock.c based on that. If the page flag is really important, we can do some awful hack like assuming the first long of the private data is flags, and those flags will tell us whether the structure is a buffer_head or fsblock ;) But for now it is just easier to use a page flag. The block device pagecache isn't special, and certainly isn't that much code. I would suggest keeping it buffer head specific and making a second variant that does only fsblocks. This is mostly to keep the semantics of PagePrivate sane, lets not fuzz the line. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] add the fsblock layer
On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote: Rewrite the buffer layer. Overall, I like the basic concepts, but it is hard to track the locking rules. Could you please write them up? I like the way you split out the assoc_buffers from the main fsblock code, but the list setup is still something of a wart. It also provides poor ordering of blocks for writeback. I think it makes sense to replace the assoc_buffers list head with a radix tree sorted by block number. mark_buffer_dirty_inode would up the reference count and put it into the radix, the various flushing routines would walk the radix etc. If you wanted to be able to drop the reference count once the block was written you could have a back pointer to the appropriate inode. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Thu, Jun 21, 2007 at 09:06:40PM -0400, James Morris wrote: On Thu, 21 Jun 2007, Chris Mason wrote: The incomplete mediation flows from the design, since the pathname-based mediation doesn't generalize to cover all objects unlike label- or attribute-based mediation. And the use the natural abstraction for each object type approach likewise doesn't yield any general model or anything that you can analyze systematically for data flow. This feels quite a lot like a repeat of the discussion at the kernel summit. There are valid uses for path based security, and if they don't fit your needs, please don't use them. But, path based semantics alone are not a valid reason to shut out AA. The validity or otherwise of pathname access control is not being discussed here. The point is that the pathname model does not generalize, and that AppArmor's inability to provide adequate coverage of the system is a design issue arising from this. I'm sorry, but I don't see where in the paragraphs above you aren't making a general argument against the pathname model. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Fri, Jun 22, 2007 at 10:23:03AM -0400, James Morris wrote: On Fri, 22 Jun 2007, Chris Mason wrote: But, this is a completely different discussion than if AA is solving problems in the wild for its intended audience, or if the code is somehow flawed and breaking other parts of the kernel. Is its intended audience aware of its limitations? Lars has just acknowledged that it does not implement mandatory access control, for one. Until people understand these issues, they certainly need to be addressed in the context of upstream merge. It is definitely useful to clearly understand the intended AA use cases during the merge. We've been over the AA is different discussion in threads about a billion times, and at the last kernel summit. I don't believe that people at the summit were adequately informed on the issue, and from several accounts I've heard, Stephen Smalley was effectively cut off before he could even get to his second slide. I'm sure people there will have different versions of events. The one part that was discussed was if pathname based security was useful, and a number of the people in the room (outside of Novell) said it was. Now, it could be that nobody wanted to argue anymore, since most opinions had come out on one list or another by then. But as someone who doesn't use either SELinux or AA, I really hope we can get past the part of the debate where:

while(1)
    AA) we think we're making users happy with pathname security
    SELINUX) pathname security sucks

So, yes Greg got it started and Lars is a well known trouble maker, and I completely understand if you want to say no thank you to an SELinux based AA ;) The models are different and it shouldn't be a requirement that they try to use the same underlying mechanisms. -chris
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Thu, Jun 21, 2007 at 04:59:54PM -0400, Stephen Smalley wrote: On Thu, 2007-06-21 at 21:54 +0200, Lars Marowsky-Bree wrote: On 2007-06-21T15:42:28, James Morris [EMAIL PROTECTED] wrote: A veto is not a technical argument. All technical arguments (except for path name is ugly, yuk yuk!) have been addressed, have they not? AppArmor doesn't actually provide confinement, because it only operates on filesystem objects. What you define in AppArmor policy does _not_ reflect the actual confinement properties of the policy. Applications can simply use other mechanisms to access objects, and the policy is effectively meaningless. Only if they have access to another process which provides them with that data. Or can access the data under a different path to which their profile does give them access, whether in its final destination or in some temporary file processed along the way. And now, yes, I know AA doesn't mediate IPC or networking (yet), but that's a missing feature, not broken by design. The incomplete mediation flows from the design, since the pathname-based mediation doesn't generalize to cover all objects unlike label- or attribute-based mediation. And the use the natural abstraction for each object type approach likewise doesn't yield any general model or anything that you can analyze systematically for data flow. This feels quite a lot like a repeat of the discussion at the kernel summit. There are valid uses for path based security, and if they don't fit your needs, please don't use them. But, path based semantics alone are not a valid reason to shut out AA. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 19, 2007 at 10:11:13AM +0100, Pádraig Brady wrote: Vladislav Bolkhovitin wrote: I would also suggest one more feature: support for block level de-duplication. I mean: 1. Ability for Btrfs to have blocks in several files to point to the same block on disk 2. Support for new syscall or IOCTL to de-duplicate as a single transaction two or more blocks on disk, i.e. link them to one of them and free others 3. De-de-duplicate blocks on disk, i.e. copy them on write I suppose that de-duplication itself would be done by some user space process that would scan files, determine blocks with the same data and then de-duplicate them by using syscall or IOCTL (2). That would be very usable feature, which in most cases would allow to shrink occupied disk space on 50-90%. Have you references for this number? In my experience one gets a lot of benefit from the much simpler process of de-duplication of files. Yes, I would expect simple hard links to be a better solution for this, but the feature request is not that out of line. I actually had plans on implementing auto duplicate block reuse earlier in btrfs. Snapshots already share duplicate blocks between files, and so all of the reference counting needed to implement this already exists. Snapshots are writable, and data mods are copy on write, and in general things work. But, to help fsck, the extent allocation tree has a back pointer to the inode that owns an extent. If you're doing snapshots, all of the owners of the extent have the same inode number. If you're sharing duplicate blocks, the owners can have any inode number, and fsck becomes much more complex. In general, when I have to decide between fsck and a feature, I'm going to pick fsck. The features are much more fun, but fsck is one of the main motivations for doing this work. 
Thanks for the input, Chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Sat, Jun 16, 2007 at 11:31:47AM +0200, Florian D. wrote: Chris Mason wrote: Strange, these numbers are not quite what I was expecting ;) Could you please post your fio job files? Also, how much ram does the machine have? Only writing doesn't seem like enough to fill the ram. -chris Sure:

[global]
directory=/mnt/temp/default
filename=testfile
size=300m
randrepeat=1
overwrite=1
end_fsync=1

[ very bad results on btrfs with these parameters ]

Ok, the numbers make more sense now. Basically what is happening is that during the random IO phase, fio is hitting every single block in the file. Btrfs will allocate new blocks in a sequential fashion, but the fsync does writeback in page order. So, the fsync sees completely random block ordering, and then we see it again on the reads. In ext3 even though the writes are random, the fsync uses the original (sequential) ordering of the blocks, and everything works nicely. The fix is either delayed allocation or defrag-on-writeback. Another option (which I'll have to do for O_SYNC performance) is to leave space in the blocks allocated to the file for COWs (basically strides of allocated blocks). I'll do the defrag-on-writeback right after enospc. -chris
Re: Versioning file system
On Mon, Jun 18, 2007 at 03:45:24AM -0600, Andreas Dilger wrote: Too bad everyone is spending time on 10 similar-but-slightly-different filesystems. This will likely end up with a bunch of filesystems that implement some easy subset of features, but will not get polished for users or have a full set of features implemented (e.g. ACL, quota, fsck, etc). While I don't think there is a single answer to every question, it does seem that the number of filesystem projects has climbed lately. Maybe there should be a BOF at OLS to merge these filesystem projects (btrfs, chunkfs, tilefs, logfs, etc) into a single project with multiple people working on getting it solid, scalable (parallel readers/writers on lots of CPUs), robust (checksums, failure localization), recoverable, etc. I thought Val's FS summits were designed to get developers to collaborate, but it seems everyone has gone back to their corners to work on their own filesystem? Unfortunately, I can't do OLS this year, but anyone who wants to talk on these things can drop me a line and we can setup phone calls or whatever for planning. Adding polish to any FS is not a one man show, and so I know I'll need to get more people on board to really finish btrfs off. One of my long term goals for btrfs is to figure out the features and layout people are most interested in for filesystems that don't have to be ext* backwards compatible. I've got a pretty good start, but I'm sure parts of it will change if I can get a big enough developer base. Working on getting hooks into DM/MD so that the filesystem and RAID layers can move beyond ignorance is bliss when talking to each other would be great. Not rebuilding empty parts of the fs, limit parity resync to parts of the fs that were in the previous transaction, use fs-supplied checksums to verify on-disk data is correct, use RAID geometry when doing allocations, etc. Definitely. There's a lot of work in the DM integration bits that are not FS specific. 
-chris
Updated Btrfs project site online
Hello everyone, I've moved the Btrfs pages here: http://oss.oracle.com/projects/btrfs which gives us a bugzilla, mailing lists, and a somewhat more orderly file download area. There are links to my HG trees for sources as well. The oss project area automagically creates a few different mailing lists. For now, [EMAIL PROTECTED] and [EMAIL PROTECTED] will be used. -chris
Re: Updated Btrfs project site online -git repo?
On Mon, Jun 18, 2007 at 09:53:39PM +0200, Maria Domenica Bertolucci wrote: Would it be possible to have a git repo as well so as to keep in sync with all git kernel projects? It also helps standardize things. Sorry, the repos will stay Mercurial based for now. These are small repos and not attached to the main kernel sources, so they will be easy to download (I promise). -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Fri, Jun 15, 2007 at 09:08:38PM +0200, Florian D. wrote: Chris Mason wrote: is it possible to test it on top of LVM2 on RAID at this stage? Yes, I haven't done much multi-spindle testing yet, so I'm definitely interested in these numbers. -chris I did not get very far:

# insmod btrfs.ko
# mkfs.btrfs /dev/brain_volume_group/btrfstest
on close 0 blocks are allocated
fs created on /dev/brain_volume_group/btrfstest blocksize 4096 blocks 4980736

(/dev/brain_volume_group/btrfstest is a 20GB logical volume on top of RAID6)

# mount /dev/brain_volume_group/btrfstest /mnt/temp/
(this gives these kernel-msgs:
[ 385.980358] btrfs: dm-6 checksum verify failed on 4
[ 385.980462] btrfs: dm-6 checksum verify failed on 12
[ 385.980559] btrfs: dm-6 checksum verify failed on 11

These are normal on the first mount, the mkfs doesn't set the csums on the blocks it creates (will fix ;) )

# touch /mnt/temp/default/testfile.txt
[ 445.445638] btrfs: dm-6 checksum verify failed on 10
# umount /mnt/temp/
[ 457.980372] [ cut here ]
[ 457.980377] kernel BUG at fs/buffer.c:2644!

Whoops. Please try this:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 15:12:26 2007 -0400
@@ -541,6 +541,7 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Fri, Jun 15, 2007 at 10:46:04PM +0200, Florian D. wrote: Chris Mason wrote:

# umount /mnt/temp/
[ 457.980372] [ cut here ]
[ 457.980377] kernel BUG at fs/buffer.c:2644!

Whoops. Please try this: [ bad patch ]

sorry, with the patch applied:
[ 147.475077] BUG: at /home/florian/system/btrfs_test/btrfs-0.2/disk-io.c:534

Well, apparently I can get the silly stuff wrong an infinite number of times. Sorry, let's try again:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 16:52:38 2007 -0400
@@ -541,6 +541,8 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		get_bh(bh);
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Sat, Jun 16, 2007 at 12:03:06AM +0200, Florian D. wrote: Chris Mason wrote: Well, apparently I can get the silly stuff wrong an infinite number of times. Sorry, let's try again:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 16:52:38 2007 -0400
@@ -541,6 +541,8 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		get_bh(bh);
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);

ha! it is working now. some numbers from here (with the fio tool): Great, I'll have a v0.3 out on Monday with that fix rolled in.

1. sequential read
2. random writes
3. sequential read again
filesize: 300MB, bs: 4K

     btrfs               reiserfs            ext3
     usr% sys% bw sec.   usr% sys% bw sec.   usr% sys% bw sec.
1    551 68.3 4.6        117 67.4 4.6        524 68.0 4.6
2    010.7 431           221 29.8 10.5       318 29.0 10.8
3    012.3 133           119 70.5 4.4        524 68.6 4.5
bw: MB/sec.

ext3: -o data=writeback,barrier=1
20GB LVM2 partition on a RAID6 (4 SATA-disks)

Strange, these numbers are not quite what I was expecting ;) Could you please post your fio job files? Also, how much ram does the machine have? Only writing doesn't seem like enough to fill the ram. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Thu, Jun 14, 2007 at 08:29:10PM +0200, Florian D. wrote: Chris Mason wrote: The basic list of features looks like this: [amazing stuff snipped] The current status is a very early alpha state, and the kernel code weighs in at a sparsely commented 10,547 lines. I'm releasing now in hopes of finding people interested in testing, benchmarking, documenting, and contributing to the code. ok, what kind of benchmarks would help you most? bonnie? compilebench? something else? Thanks! Let's start with a list of the things I know will go badly:
O_SYNC (not implemented)
O_DIRECT (not implemented)
aio (not implemented)
multi-threaded (brain dead tree locking)
things that fill the drive (will oops)
mmap() writes (not supported, mmap reads are ok)
Also, overlapping writes are not that well supported. For example, tar by default will write in 10k chunks, and btrfs_file_write currently cows on every single write. So, if your tar file has a bunch of 16k files, it'll go much faster if you tell tar to use 16k (or 8k) buffers. In general, I was hoping for a generic delayed allocation facility to magically appear in the kernel, and so I haven't spent a lot of time tuning btrfs_file_write for this yet. Any other workload is fair game, and I'm especially interested in seeing how badly the COW hurts. For example, on a big file, I'd like to see how much slower big sequential reads are after small random writes (fio is good for this). Or, writing to every file on the FS in random order and then seeing how much slower we are at reading. Benchmarks that stress the directory structure are interesting too, huge numbers of files + directories etc. Ric Wheeler's fs_mark has a lot of options and output. But, that's just my list, you can pick anything that you find interesting ;) Please try btrfsck after the run to see how well it keeps up.
If you use blktrace to generate io traces, graphs can be generated: http://oss.oracle.com/~mason/seekwatcher/ Not that well documented, but drop me a line if you need help running it. btt is a good alternative to the graphs too, and easier to run. is it possible to test it on top of LVM2 on RAID at this stage? Yes, I haven't done much multi-spindle testing yet, so I'm definitely interested in these numbers. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 04:08:30AM +0100, Christoph Hellwig wrote: On Tue, Jun 12, 2007 at 04:14:39PM -0400, Chris Mason wrote: Aside from folding snapshot history into the origin's namespace... It could be possible to have a mount.btrfs that allows subvolumes and/or snapshot volumes to be mounted as unique roots? I'd imagine a bind mount _could_ provide this too? Anyway, I'm just interested in understanding the vision for managing the potentially complex nature of a Btrfs namespace. One option is to put the real btrfs root into some directory in (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind outside of that. I wanted to wait to get fancy until I had a better idea of how people would use the feature. We already support mounting into subdirectories of a filesystem for nfs connection sharing. The patch below makes use of this to allow mounting any subdirectory of a btrfs filesystem by specifying it in the form of /dev/somedevice:directory and when no subdirectory is specified uses 'default'. Neat, thanks Christoph, this will be much nicer longer term. I'll integrate it after I finish off -enospc. To make this more useful btrfs directories should grow some way to be marked head of a subvolume. They are already different in the btree, but maybe I'm not 100% sure what you mean by marked as the head of a subvolume? and we'd need a more useful way to actually create subvolumes and snapshots without fugly ioctls. One way I can think of that doesn't involve an ioctl is to have a special subdir at the root of the subvolume:

cd /mnt/default/.snaps
mkdir new_snapshot
rmdir old_snapshot

cd /mnt
mkdir new_subvol
rmdir old_subvol

-chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 12, 2007 at 11:46:20PM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes:

Chris> After the last FS summit, I started working on a new filesystem
Chris> that maintains checksums of all file data and metadata. Many
Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
Chris> help on benchmarking analysis.

Chris> The basic list of features looks like this:
Chris> * Extent based file storage (2^64 max file size)
Chris> * Space efficient packing of small files
Chris> * Space efficient indexed directories
Chris> * Dynamic inode allocation
Chris> * Writable snapshots
Chris> * Subvolumes (separate internal filesystem roots)
Chris> - Object level mirroring and striping
Chris> * Checksums on data and metadata (multiple algorithms available)
Chris> - Strong integration with device mapper for multiple device support
Chris> - Online filesystem check
Chris> * Very fast offline filesystem check
Chris> - Efficient incremental backup and FS mirroring

So, can you resize a filesystem both bigger and smaller? Or is that implicit in the Object level mirroring and striping? Growing the FS is just either extending or adding a new extent tree. Shrinking is more complex. The extent trees do have back pointers to the objectids that own the extent, but snapshotting makes that a little non-deterministic. The good news is there are no fixed locations for any of the metadata. So it is at least possible to shrink and pop out arbitrary chunks. As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. So far, I'm not planning quotas beyond the subvolume level. Ie. being able to setup two pools, one being RAID6, the other being RAID1, where all currently accessed files are in the RAID1 setup, but if un-used get migrated to the RAID6 area. HSM in general is definitely interesting.
I'm afraid it is a long ways off, but it could be integrated into the scrubber that wanders the trees in the background. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: Neat! It's great to see somebody else waking up to the idea that storage media is NOT to be trusted. Judging by the design paper, it looks like your structs have some alignment problems. Actual defs are all packed, but I may still shuffle around the structs to optimize alignment. The keys are fixed, although I may make the u32 in the middle smaller. The usual wishlist:

* inode-to-pathnames mapping
This one I'll code, it will help with inode link count verification. I want to be able to detect at run time that an inode with a link count of zero is still actually in a directory. So there will be back pointers from the inode to the directory. Also, the incremental backup code will be able to walk the btree to find inodes that have changed, and the backpointers will help make a list of file names that need to be rsync'd or whatever.

* a subvolume that is a single file (disk image, database, etc.)
subvolumes can be made that have a single file in them, but they have to be directories right now. Doing otherwise would complicate mounts and other management tools (inside the btree, it doesn't really matter).

* directory indexes to better support Wine and Samba

* secure delete via destruction of per-file or per-block random crypto keys
I'd rather keep secure delete as a userland problem (or a layered FS problem). When you take backups and other copies of the file into account, it's a bigger problem than btrfs wants to tackle right now.

* fast (seekless) access to normal-sized SE Linux data
acls and xattrs will be adjacent to the inode in the tree. Most of the time it'll be seekless.

* atomic creation of copy-on-write directory trees
Do you mean something more fine grained than the current snapshotting system?

* immutable bits like UFS has
I'll do the ext2 chattr calls.

* hole punch ability
Hole punching isn't harder or easier in btrfs than most other filesystems that support holes. It's largely a VM issue.
* insert/delete ability (add/remove a chunk in the middle of a file)
The disk format makes this O(extent records past the chunk). It's possible to code but it would not be optimized.

-chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes: As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. Chris> So far, I'm not planning quotas beyond the subvolume level. So let me get this straight. Are you saying that quotas would only be on the volume level, and for the initial level of sub-volumes below that level? Or would *all* sub-volumes have quota support? And does that include snapshots as well? On disk, snapshots and subvolumes are identical...the only difference is their starting state (sorry, it's confusing, and it doesn't help that I interchange the terms when describing features). Every subvolume will have a quota on the number of blocks it can consume. I haven't yet decided on the best way to account for blocks that are actually shared between snapshots, but it'll be in there somehow. So if you wanted to make a snapshot readonly, you just set the quota to 1 block. But, I'm not planning on adding a way to say user X in subvolume Y has quota Z. It'll just be: this subvolume can't get bigger than a given size (at least for version 1.0). -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 12:12:23PM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes: [ nod ] Also, I think you're wrong here when you state that making a snapshot (sub-volume?) RO just requires you to set the quota to 1 block. What is to stop me from writing 1 block to a random file that already exists? It's copy on write, so changing one block means allocating a new one and putting the new contents there. The old blocks don't become available for reuse until the transaction commits. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote: On 6/13/07, Chris Mason [EMAIL PROTECTED] wrote: On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: The usual wishlist: * inode-to-pathnames mapping This one I'll code, it will help with inode link count verification. I want to be able to detect at run time that an inode with a link count of zero is still actually in a directory. So there will be back pointers from the inode to the directory. Great, but fsck improvement wasn't on my mind. This is a desirable feature for the NFS server, and for regular users. Think about a backup program trying to maintain hard links. Sure, it'll be there either way ;) Also, the incremental backup code will be able to walk the btree to find inodes that have changed, and the backpointers will help make a list of file names that need to be rsync'd or whatever. * a subvolume that is a single file (disk image, database, etc.) subvolumes can be made that have a single file in them, but they have to be directories right now. Doing otherwise would complicate mounts and other management tools (inside the btree, it doesn't really matter). Bummer. As I understand it, ZFS provides this. :-) Grin, when the pain of typing cd subvol is btrfs' biggest worry, I'll be doing very well. * directory indexes to better support Wine and Samba * secure delete via destruction of per-file or per-block random crypto keys I'd rather keep secure delete as a userland problem (or a layered FS problem). When you take backups and other copies of the file into account, it's a bigger problem than btrfs wants to tackle right now. It can't be a userland problem if you allow disk blocks to move. Volume resizing, logging/journalling, etc. -- they combine to make the userland solution essentially impossible. 
(one could wipe the whole partition, or maybe fill ALL space on the volume) Right about here is where I would insert a long story about ecryptfs, or encryption solutions that happen all in userland. At any rate, it is outside the scope of v1.0, even though I definitely agree it is an important problem for some people. * atomic creation of copy-on-write directory trees Do you mean something more fine grained than the current snapshotting system? I believe so. Example: I have a linux-2.6 directory. It's not a mount point or anything special like that. I want to copy it to a new directory called wip, without actually copying all the blocks. To all the normal POSIX API stuff, this copy should look like the result of cp -a, not hard links. This would be a snapshot, which has to be done on a subvolume right now. It is not as nice as being able to pick a random directory, but I've only been able to get this far by limiting the feature scope significantly. What I did do was make subvolumes very cheap...just make a bunch of them. Keep in mind that if you implement a cow directory tree without a snapshot, and you don't want to duplicate any blocks in the cow, you're going to have fun with inode numbers. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 12, 2007 at 03:53:03PM -0400, Mike Snitzer wrote: On 6/12/07, Chris Mason [EMAIL PROTECTED] wrote: Hello everyone, After the last FS summit, I started working on a new filesystem that maintains checksums of all file data and metadata. Many thanks to Zach Brown for his ideas, and to Dave Chinner for his help on benchmarking analysis. Chris, Given the substantial work that you've already put into btrfs and the direction your Todo list details; it feels as though Btrfs will quickly provide the features that only Sun's ZFS provides. Looking at your Btrfs benchmark and design pages it is clear that your motivation is a filesystem that addresses modern concerns (performance that doesn't degrade over time, online fsck, fast offline fsck, data/metadata checksums, unlimited snapshots, efficient remote mirroring, etc). There is still much Todo but you've made very impressive progress for the first announcement! I have some management oriented questions/comments. 1) Regarding the direction of Btrfs as it relates to integration with DM. The allocation policies, the ease of configuring DM-based striping/mirroring, management of large pools of storage all seem to indicate that Btrfs will manage the physical spindles internally. This is very ZFS-ish (ZFS pools) so I'd like to understand where you see Btrfs going in this area. There's quite a lot of hand waving in that section. What I'd like to do is work closely with the LVM/DM/MD maintainers and come up with something that leverages what linux already does. I don't want to rewrite LVM into the FS, but I do want to make better use of info about the underlying storage. Your initial benchmarks were all done on top of a single disk with an LVM stack yet your roadmap/todo and design speaks to a tighter integration of the volume management features. So long term is traditional LVM/MD functionality to be pulled directly into Btrfs?
2) The Btrfs notion of subvolumes and snapshots is very elegant and provides for a fluid management of the filesystem system data. It feels as though each subvolume/snapshot is just folded into the parent Btrfs volumes' namespace. Was there any particular reason you elected to do this? I can see that it lends itself to allowing snapshots of snapshots. If you could elaborate I'd appreciate it. Yes, I wanted snapshots to be writable and resnapshottable. It also lowers the complexity to keep each snapshot as a subvolume/tree. subvolumes are only slightly more expensive than a directory. So, even though a subvolume is a large grained unit for a snapshot, you can get around this by just making more subvolumes. In practice subvolumes and/or snapshots appear to be implicitly mounted upon creation (refcount of parent is incremented). Is this correct? For snapshots, this runs counter to mapping the snapshots' data into the namespace of the origin Btrfs (e.g. with a .snapshot dir, but this is only useful for read-only snaps). Having snapshot namespaces in terms of monolithic subvolumes puts a less intuitive face on N Btrfs snapshots. The history of a given file/dir feels to be lost with this model. That's somewhat true, the disk format does have enough information to show you that history, but cleanly expressing it to the user is a daunting task. Aside from folding snapshot history into the origin's namespace... It could be possible to have a mount.btrfs that allows subvolumes and/or snapshot volumes to be mounted as unique roots? I'd imagine a bind mount _could_ provide this too? Anyway, I'm just interested in understanding the vision for managing the potentially complex nature of a Btrfs namespace. One option is to put the real btrfs root into some directory in (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind outside of that. I wanted to wait to get fancy until I had a better idea of how people would use the feature. 
Thanks for doing all this work; I think the Linux community got a much needed shot in the arm with this Btrfs announcement. Thanks for the comments. -chris
Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote: On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote: The start and end points passed to block_prepare_write() delimit the region of the page that is going to be modified. This means that prepare_write() doesn't need to fill it in if the page is not up to date. Really? Is it _really_ going to be modified? Even if the pointer userspace gave to write() is bogus, and is going to fault half-way through the copy_from_user()? This is why there are so many variations on copy_from_user that zero on faults. One way or another, the prepare_write/commit_write pair are responsible for filling it in. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
On Wed, May 16, 2007 at 11:04:11PM +1000, Nick Piggin wrote: Chris Mason wrote: On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote: On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote: The start and end points passed to block_prepare_write() delimit the region of the page that is going to be modified. This means that prepare_write() doesn't need to fill it in if the page is not up to date. Really? Is it _really_ going to be modified? Even if the pointer userspace gave to write() is bogus, and is going to fault half-way through the copy_from_user()? This is why there are so many variations on copy_from_user that zero on faults. One way or another, the prepare_write/commit_write pair are responsible for filling it in. I'll add to David's question about David's comment on David's patch, yes it will be modified but in that case it would be zero-filled as Chris says. However I believe this is incorrect behaviour. It is possible to easily fix that so it would only happen via a tiny race window (where the source memory gets unmapped at just the right time) however nobody seemed too interested (just by checking the return value of fault_in_pages_readable). The buffered write patches I'm working on fix that (among other things) of course. But they do away with prepare_write and introduce new aops, and they indeed must not expect the full range to have been written to. I was also wrong to say prepare_write and commit_write are responsible, they work together with their callers to make the right things happen. Oh well, so much for trying to give a short answer for a chunk of code full of corner cases ;) -chris
Re: [PATCH 4 of 8] Add flags to control direct IO helpers
On Thu, Feb 08, 2007 at 09:33:05AM +0530, Suparna Bhattacharya wrote: On Wed, Feb 07, 2007 at 01:05:44PM -0500, Chris Mason wrote: On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:

+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)

A little more explanation about why these options are needed, and examples of when one would specify each of these options would be good. I'll extend the comments in the patch, but for discussion here: DIO_PLACEHOLDERS: placeholders are inserted into the page cache to synchronize the DIO with buffered writes. From a locking point of view, this is similar to inserting and locking pages in the address space corresponding to the DIO. placeholders guard against concurrent allocations and truncates during the DIO. You don't need placeholders if truncates and allocations are impossible (for example, on a block device). Likewise placeholders may not be needed if the underlying filesystem already takes care of locking to synchronize DIO vs buffered. True, although I don't think any FS covers 100% of the cases right now. DIO_CREATE: placeholders make it possible for filesystems to safely fill holes and extend the file via get_block during the DIO. If DIO_CREATE is turned on, get_block will be called with create=1, allowing the FS to allocate blocks during the DIO. When would one NOT specify DIO_CREATE, and what are the implications? The purpose of having an option of NOT allowing the FS to allocate blocks during DIO is not very intuitive from the standpoint of the caller (the block device case could be an example, but then create=1 could not do any harm or add extra overhead, so why bother?). DIO has fallen back to buffered IO for so long that I wanted filesystems to explicitly choose the create=1 for now.
A good example is my patch for ext3, where the ext3 get_block routine
needed to be changed to start a transaction instead of finding the
current trans in current->journal_info.  The reiserfs DIO get_block
needed to be told not to expect i_mutex to be held, etc etc.

> Is there still a valid case where we fall back to buffered IO to fill
> holes?  To me that seems to be the only situation where create=0 must
> be enforced.

Right, when create=0 we fall back, otherwise we don't.

> > DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is
> > dropped during the DIO and taken again before returning.
>
> Again an example of when one would not specify this (block device and
> XFS?) would be useful.

If the FS can't fill a hole or extend the file without i_mutex, or if
the caller has already dropped i_mutex themselves.  I think this is
only XFS right now; the long term goal is to make placeholders fast
enough for XFS to use.

-chris
Re: [PATCH 1 of 2] Implement generic block_page_mkwrite() functionality
On Thu, Feb 08, 2007 at 09:50:13AM +1100, David Chinner wrote:
> > You don't need to lock out all truncation, but you do need to lock
> > out truncation of the page in question.  Instead of your i_size
> > checks, check page->mapping isn't NULL after the lock_page?
>
> Yes, that can be done, but we still need to know if part of the page
> is beyond EOF for when we call block_commit_write() and mark buffers
> dirty.  Hence we need to check the inode size.  I guess if we block
> the truncate with the page lock, then the inode size is not going to
> change until we unlock the page.  If the inode size has already been
> changed but the page not yet removed from the mapping, we'll be beyond
> EOF.  So it seems to me that we can get away with not using the
> i_mutex in the generic code here.

vmtruncate changes the inode size before waiting on any pages.  So,
i_size could change at any time during page_mkwrite.  Since the patch
does:

	if (((page->index + 1) << PAGE_CACHE_SHIFT) > i_size_read(inode))
		end = i_size_read(inode) & ~PAGE_CACHE_MASK;
	else
		end = PAGE_CACHE_SIZE;

it would be a good idea to read i_size once and put it in a local var
instead.

The FS truncate op should be locking the last page in the file to make
sure it is properly zero filled.  The worst case should be that we zero
too many bytes in page_mkwrite (an expanding truncate past our current
i_size), but at least it won't expose stale data.

-chris
[PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls
XFS is changed to use blockdev_direct_IO flags instead of
DIO_OWN_LOCKING.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 1ab8a2112a7d -r f53fd3802dc9 fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1392,19 +1392,16 @@ xfs_vm_direct_IO(
 	iocb->private = xfs_alloc_ioend(inode, IOMAP_UNWRITTEN);
 
-	if (rw == WRITE) {
-		ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	} else {
-		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	}
+	/*
+	 * ask DIO not to do any special locking for us, and to always
+	 * pass create=1 to get_block on writes
+	 */
+	ret = blockdev_direct_IO_flags(rw, iocb, inode,
+			iomap.iomap_target->bt_bdev,
+			iov, offset, nr_segs,
+			xfs_get_blocks_direct,
+			xfs_end_io_direct,
+			DIO_CREATE);
 
 	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
 		xfs_destroy_ioend(iocb->private);
[PATCH 8 of 8] Avoid too many boundary buffers in DIO
Dave Chinner found a 10% performance regression with ext3 when using
DIO to fill holes instead of buffered IO.  On large IOs, the ext3
get_block routine will send more than a page worth of blocks back to
DIO via a single buffer_head with a large b_size value.

The DIO code iterates through this massive block and tests for a
boundary buffer over and over again.  For every block size unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.

There are two potential fixes; one is to ignore the boundary bit on
large regions returned by the FS.  DIO can't tell which part of the big
region was a boundary, and so it may not be a good idea to trust the
hint.  This patch just clears the boundary bit after using it once.  It
is 10% faster for a streaming DIO write w/blocksize of 512k on my sata
drive.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r f53fd3802dc9 -r d068ea378c04 fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -625,7 +625,6 @@ static int dio_new_bio(struct dio *dio, 
 	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
 	BUG_ON(nr_pages <= 0);
 	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
-	dio->boundary = 0;
 out:
 	return ret;
 }
@@ -679,12 +678,6 @@ static int dio_send_cur_page(struct dio 
 		 */
 		if (dio->final_block_in_bio != dio->cur_page_block)
 			dio_bio_submit(dio);
-		/*
-		 * Submit now if the underlying fs is about to perform a
-		 * metadata read
-		 */
-		if (dio->boundary)
-			dio_bio_submit(dio);
 	}
 
 	if (dio->bio == NULL) {
@@ -701,6 +694,12 @@ static int dio_send_cur_page(struct dio 
 			BUG_ON(ret != 0);
 		}
 	}
+	/*
+	 * Submit now if the underlying fs is about to perform a
+	 * metadata read
+	 */
+	if (dio->boundary)
+		dio_bio_submit(dio);
 out:
 	return ret;
 }
@@ -727,6 +726,10 @@ submit_page_section(struct dio *dio, str
 		unsigned offset, unsigned len, sector_t blocknr)
 {
 	int ret = 0;
+	int boundary = dio->boundary;
+
+	/* don't let dio_send_cur_page do the boundary too soon */
+	dio->boundary = 0;
 
 	if (dio->rw & WRITE) {
 		/*
@@ -743,17 +746,7 @@ submit_page_section(struct dio *dio, str
 	    (dio->cur_page_block +
 	     (dio->cur_page_len >> dio->blkbits) == blocknr)) {
 		dio->cur_page_len += len;
-
-		/*
-		 * If dio->boundary then we want to schedule the IO now to
-		 * avoid metadata seeks.
-		 */
-		if (dio->boundary) {
-			ret = dio_send_cur_page(dio);
-			page_cache_release(dio->cur_page);
-			dio->cur_page = NULL;
-		}
-		goto out;
+		goto out_send;
 	}
 
 	/*
@@ -772,6 +765,18 @@ submit_page_section(struct dio *dio, str
 	dio->cur_page_offset = offset;
 	dio->cur_page_len = len;
 	dio->cur_page_block = blocknr;
+
+out_send:
+	/*
+	 * If dio->boundary then we want to schedule the IO now to
+	 * avoid metadata seeks.
+	 */
+	if (boundary) {
+		dio->boundary = 1;
+		ret = dio_send_cur_page(dio);
+		page_cache_release(dio->cur_page);
+		dio->cur_page = NULL;
+	}
 out:
 	return ret;
 }
@@ -977,7 +982,16 @@ do_holes:
 			this_chunk_bytes = this_chunk_blocks << blkbits;
 			BUG_ON(this_chunk_bytes == 0);
 
-			dio->boundary = buffer_boundary(map_bh);
+			/*
+			 * get_block may return more than one page worth
+			 * of blocks.  Make sure only the last io we
+			 * send down for this region is a boundary
+			 */
+			if (dio->blocks_available == this_chunk_blocks)
+				dio->boundary = buffer_boundary(map_bh);
+			else
+				dio->boundary = 0;
+
 			ret = submit_page_section(dio, page, offset_in_page,
 				this_chunk_bytes, dio->next_block_for_io);
 			if (ret) {
[PATCH 5 of 8] Make ext3 safe for the new DIO locking rules
This creates a version of ext3_get_block that starts and ends a
transaction.  By starting and ending the transaction inside get_block,
this is able to avoid lock inversion problems when the DIO code tries
to take page locks inside blockdev_direct_IO (transaction locks must
always happen after page locks).

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 04dd7ddd593e -r 42596f5254ca fs/ext3/inode.c
--- a/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1673,6 +1673,30 @@ static int ext3_releasepage(struct page 
 	return journal_try_to_free_buffers(journal, page, wait);
 }
 
+static int ext3_get_block_direct_IO(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create)
+{
+	int ret = 0;
+	handle_t *handle = ext3_journal_start(inode, DIO_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+	ret = ext3_get_block(inode, iblock, bh_result, create);
+	/*
+	 * Reacquire the handle: ext3_get_block() can restart the transaction
+	 */
+	handle = journal_current_handle();
+	if (handle) {
+		int err;
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
+	}
+out:
+	return ret;
+}
+
 /*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
@@ -1693,39 +1717,58 @@ static ssize_t ext3_direct_IO(int rw, st
 	int orphan = 0;
 	size_t count = iov_length(iov, nr_segs);
 
-	if (rw == WRITE) {
-		loff_t final_size = offset + count;
-
+	if (rw == WRITE && (offset + count > inode->i_size)) {
 		handle = ext3_journal_start(inode, DIO_CREDITS);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			goto out;
 		}
-		if (final_size > inode->i_size) {
-			ret = ext3_orphan_add(handle, inode);
-			if (ret)
-				goto out_stop;
-			orphan = 1;
-			ei->i_disksize = inode->i_size;
-		}
-	}
-
+		ret = ext3_orphan_add(handle, inode);
+		if (ret) {
+			ext3_journal_stop(handle);
+			goto out;
+		}
+		ei->i_disksize = inode->i_size;
+		ret = ext3_journal_stop(handle);
+		if (ret) {
+			/* something has gone horribly wrong, cleanup
+			 * the orphan list in ram
+			 */
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+		orphan = 1;
+	}
+
+	/*
+	 * the placeholder page code may take a page lock, so we have
+	 * to stop any running transactions before calling
+	 * blockdev_direct_IO.  Use ext3_get_block_direct_IO to start
+	 * and stop a transaction on each get_block call.
+	 */
 	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
-				 ext3_get_block, NULL);
+				 ext3_get_block_direct_IO, NULL);
 
 	/*
 	 * Reacquire the handle: ext3_get_block() can restart the transaction
 	 */
 	handle = journal_current_handle();
 
-out_stop:
-	if (handle) {
+	if (orphan) {
 		int err;
-
-		if (orphan && inode->i_nlink)
+		handle = ext3_journal_start(inode, DIO_CREDITS);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+
+		if (inode->i_nlink)
 			ext3_orphan_del(handle, inode);
-		if (orphan && ret > 0) {
+		if (ret > 0) {
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;
[PATCH 1 of 8] Introduce a place holder page for the pagecache
mm/filemap.c is changed to wait on these before adding a page into the
page cache, and truncates are changed to wait for all of the place
holder pages to disappear.

Place holder pages can only be examined with the mapping lock held.
They cannot be locked, and cannot have references increased or
decreased on them.

Placeholders can span a range bigger than one page.  The placeholder is
inserted into the radix slot for the end of the range, and the flags
field in the page struct is used to record the start of the range.

A bit is added for the radix root (PAGECACHE_TAG_EXTENTS), and when
mm/filemap.c finds that bit set, lookups for an index in the pagecache
search forward to find any placeholders that index may intersect.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r fc2d683623bb -r 7819e6e3f674 drivers/mtd/devices/block2mtd.c
--- a/drivers/mtd/devices/block2mtd.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/drivers/mtd/devices/block2mtd.c	Tue Feb 06 19:45:28 2007 -0500
@@ -66,7 +66,7 @@ static void cache_readahead(struct addre
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
-		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		page = radix_tree_lookup_extent(&mapping->page_tree, pagei);
 		if (page && (!i))
 			break;
 		if (page)
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/fs.h
--- a/include/linux/fs.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/fs.h	Tue Feb 06 19:45:28 2007 -0500
@@ -490,6 +490,11 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+
+/*
+ * This tag is only valid on the root of the radix tree
+ */
+#define PAGE_CACHE_TAG_EXTENTS	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/page-flags.h	Tue Feb 06 19:45:28 2007 -0500
@@ -263,4 +263,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page, pgoff_t start, pgoff_t end);
+
 #endif	/* PAGE_FLAGS_H */
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/pagemap.h	Tue Feb 06 19:45:28 2007 -0500
@@ -76,6 +76,9 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+				unsigned long start, unsigned long end,
+				gfp_t gfp_mask, int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -86,6 +89,12 @@ unsigned find_get_pages_contig(struct ad
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+			unsigned long offset, unsigned long end);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			pgoff_t start, pgoff_t end);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
@@ -116,6 +125,8 @@ int add_to_page_cache_lru(struct page *p
 				unsigned long index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+				unsigned long index);
 
 /*
  * Return byte-offset into filesystem object for page.
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/radix-tree.h	Tue Feb 06 19:45:28 2007 -0500
@@ -53,6 +53,7 @@ static inline int radix_tree_is_direct_p
 /*** radix-tree API starts here ***/
 
 #define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_ROOT_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
@@ -168,6 +169,7 @@ radix_tree_gang_lookup_tag(struct radix_
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
+void radix_tree_root_tag_set
[PATCH 6 of 8] Make reiserfs safe for new DIO locking rules
reiserfs is changed to use a version of reiserfs_get_block that is safe
for filling holes without i_mutex held.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 42596f5254ca -r 1ab8a2112a7d fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -469,7 +469,8 @@ static int reiserfs_get_blocks_direct_io
 	bh_result->b_size = (1 << inode->i_blkbits);
 
 	ret = reiserfs_get_block(inode, iblock, bh_result,
-				 create | GET_BLOCK_NO_DANGLE);
+				 create | GET_BLOCK_NO_DANGLE |
+				 GET_BLOCK_NO_IMUX);
 	if (ret)
 		goto out;
[PATCH 4 of 8] Add flags to control direct IO helpers
This creates a number of flags so that filesystems can control
blockdev_direct_IO.  It is based on code from Russell Cettelan.

The new flags are:

DIO_CREATE -- always pass create=1 to get_block on writes.  This allows
DIO to fill holes in the file.

DIO_PLACEHOLDERS -- use placeholder pages to provide locking against
buffered io and truncates.

DIO_DROP_I_MUTEX -- drop i_mutex before starting the mapping, io
submission, or io waiting.  The mutex is still dropped for AIO as well.

Some API changes are made so that filesystems can have more control
over the DIO features.  __blockdev_direct_IO is more or less renamed to
blockdev_direct_IO_flags.  All waiting and invalidating of page cache
data is pushed down into blockdev_direct_IO_flags (and removed from
mm/filemap.c).

direct_io_worker is exported into the wild.  Filesystems that want to
be special can pull out the bits of blockdev_direct_IO_flags they care
about and then call direct_io_worker directly.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 1a7105ab9c19 -r 04dd7ddd593e fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:55 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1,4 +1,3 @@
-			GFP_KERNEL, 1);
 /*
  * fs/direct-io.c
  *
@@ -55,13 +54,6 @@
  *
  * If blkfactor is zero then the user's request was aligned to the filesystem's
  * blocksize.
- *
- * lock_type is DIO_LOCKING for regular files on direct-IO-naive filesystems.
- * This determines whether we need to do the fancy locking which prevents
- * direct-IO from being able to read uninitialised disk blocks.  If its zero
- * (blockdev) this locking is not done, and if it is DIO_OWN_LOCKING i_mutex is
- * not held for the entire direct write (taken briefly, initially, during a
- * direct read though, but its never held for the duration of a direct-IO).
  */
 
 struct dio {
@@ -70,8 +62,7 @@ struct dio {
 	struct inode *inode;
 	int rw;
 	loff_t i_size;		/* i_size when submitted */
-	int lock_type;		/* doesn't change */
-	int reacquire_i_mutex;	/* should we get i_mutex when done? */
+	unsigned flags;		/* locking and get_block flags */
 	unsigned blkbits;	/* doesn't change */
 	unsigned blkfactor;	/* When we're using an alignment which
 				   is finer than the filesystem's soft
@@ -211,7 +202,7 @@ out:
 static void dio_unlock_page_range(struct dio *dio)
 {
-	if (dio->lock_type != DIO_NO_LOCKING) {
+	if (dio->flags & DIO_PLACEHOLDERS) {
 		remove_placeholder_pages(dio->inode->i_mapping,
 					 dio->fspages_start_off,
 					 dio->fspages_end_off);
@@ -226,7 +217,7 @@ static int dio_lock_page_range(struct di
 	unsigned long max_size;
 	int ret = 0;
 
-	if (dio->lock_type == DIO_NO_LOCKING)
+	if (!(dio->flags & DIO_PLACEHOLDERS))
 		return 0;
 
 	while (index <= dio->fspages_end_off) {
@@ -310,9 +301,6 @@ static int dio_complete(struct dio *dio,
 			dio->map_bh.b_private);
 	dio_unlock_page_range(dio);
 
-	if (dio->reacquire_i_mutex)
-		mutex_lock(&dio->inode->i_mutex);
-
 	if (ret == 0)
 		ret = dio->page_errors;
 	if (ret == 0)
@@ -597,8 +585,9 @@ static int get_more_blocks(struct dio *d
 		map_bh->b_state = 0;
 		map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
-		create = dio->rw & WRITE;
-		if (dio->lock_type == DIO_NO_LOCKING)
+		if (dio->flags & DIO_CREATE)
+			create = dio->rw & WRITE;
+		else
 			create = 0;
 		index = fs_startblk >> (PAGE_CACHE_SHIFT -
 					dio->inode->i_blkbits);
@@ -1014,19 +1003,41 @@ out:
 	return ret;
 }
 
-static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+/*
+ * This does all the real work of the direct io.  Most filesystems want to
+ * call blockdev_direct_IO_flags instead, but if you have exotic locking
+ * routines you can call this directly.
+ *
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
+	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
 	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
-{
-	unsigned long user_addr;
+	int is_async, unsigned dioflags
Re: [PATCH 4 of 8] Add flags to control direct IO helpers
On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > + * The flags parameter is a bitmask of:
> > + *
> > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
>
> A little more explanation about why these options are needed, and
> examples of when one would specify each of these options would be good.

I'll extend the comments in the patch, but for discussion here:

DIO_PLACEHOLDERS: placeholders are inserted into the page cache to
synchronize the DIO with buffered writes.  From a locking point of
view, this is similar to inserting and locking pages in the address
space corresponding to the DIO.  Placeholders guard against concurrent
allocations and truncates during the DIO.  You don't need placeholders
if truncates and allocations are impossible (for example, on a block
device).

DIO_CREATE: placeholders make it possible for filesystems to safely
fill holes and extend the file via get_block during the DIO.  If
DIO_CREATE is turned on, get_block will be called with create=1,
allowing the FS to allocate blocks during the DIO.

DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
during the DIO and taken again before returning.

-chris