Re: [patch 1/3] add the fsblock layer

2007-06-30 Thread Christoph Hellwig
On Tue, Jun 26, 2007 at 08:26:50AM -0400, Chris Mason wrote:
> Since we're testing new code, I would just leave the blkdev address
> space alone.  If a filesystem wants to use fsblocks, they allocate a new
> inode during mount, stuff it into their private super block (or in the
> generic super), and use that for everything.  Basically ignoring the
> block device address space completely.

Exactly, same thing XFS does.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-30 Thread Christoph Hellwig
On Tue, Jun 26, 2007 at 12:34:26PM +1000, Nick Piggin wrote:
> That would require a new inode and address_space for the fsblock
> type blockdev pagecache, wouldn't it?

Yes.  That's easily possible, XFS already does it for its own
buffer cache.



Re: [patch 1/3] add the fsblock layer

2007-06-26 Thread Chris Mason
On Tue, Jun 26, 2007 at 01:07:43PM +1000, Nick Piggin wrote:
> Neil Brown wrote:
> >On Tuesday June 26, [EMAIL PROTECTED] wrote:
> >
> >>Chris Mason wrote:
> >>
> >>>The block device pagecache isn't special, and certainly isn't that much
> >>>code.  I would suggest keeping it buffer head specific and making a
> >>>second variant that does only fsblocks.  This is mostly to keep the
> >>>semantics of PagePrivate sane, lets not fuzz the line.
> >>
> >>That would require a new inode and address_space for the fsblock
> >>type blockdev pagecache, wouldn't it? I just can't think of a
> >>better non-intrusive way of allowing a buffer_head filesystem and
> >>an fsblock filesystem to live on the same blkdev together.
> >
> >
> >I don't think they would ever try to.  Both filesystems would bd_claim
> >the blkdev, and only one would win.
> 
> Hmm OK, I might have confused myself thinking about partitions...
> 
> >The issue is more of a filesystem sharing a blockdev with the
> >block-special device (i.e. open("/dev/sda1"), read) isn't it?
> >
> >If a filesystem wants to attach information to the blockdev pagecache
> >that is different to what blockdev want to attach, then I think "Yes"
> >- a new inode and address space is what it needs to create.
> >
> >Then you get into consistency issues between the metadata and direct
> >blockdevice access.  Do we care about those?
> 
> Yeah that issue is definitely a real one. The problem is not just
> consistency, but "how do the block device aops even know that the
> PG_private page they have has buffer heads or fsblocks", so it is
> an oopsable condition rather than just a plain consistency issue
> (consistency is already not guaranteed).

Since we're testing new code, I would just leave the blkdev address
space alone.  If a filesystem wants to use fsblocks, they allocate a new
inode during mount, stuff it into their private super block (or in the
generic super), and use that for everything.  Basically ignoring the
block device address space completely.

It means there will be some inconsistency between what you get when
reading the block device file and the filesystem metadata, but we've got
that already (ext2 dir in page cache).
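[Editorial sketch of the scheme Chris describes, in kernel-flavoured pseudocode; the myfs_* names, including myfs_meta_aops, are invented for illustration and are not real API:]

```
/* Allocate a private metadata inode at mount time; its address_space
 * caches filesystem metadata, independent of the blkdev pagecache. */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct myfs_sb_info *sbi = ...;

	sbi->meta_inode = new_inode(sb);
	if (!sbi->meta_inode)
		return -ENOMEM;
	sbi->meta_inode->i_mapping->a_ops = &myfs_meta_aops;

	/* All metadata I/O then goes through sbi->meta_inode->i_mapping;
	 * the block device address_space is ignored entirely. */
	...
	return 0;
}
```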

-chris


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Neil Brown wrote:
> On Tuesday June 26, [EMAIL PROTECTED] wrote:
> 
> > Chris Mason wrote:
> > 
> > > The block device pagecache isn't special, and certainly isn't that much
> > > code.  I would suggest keeping it buffer head specific and making a
> > > second variant that does only fsblocks.  This is mostly to keep the
> > > semantics of PagePrivate sane, lets not fuzz the line.
> > 
> > That would require a new inode and address_space for the fsblock
> > type blockdev pagecache, wouldn't it? I just can't think of a
> > better non-intrusive way of allowing a buffer_head filesystem and
> > an fsblock filesystem to live on the same blkdev together.
> 
> I don't think they would ever try to.  Both filesystems would bd_claim
> the blkdev, and only one would win.

Hmm OK, I might have confused myself thinking about partitions...

> The issue is more of a filesystem sharing a blockdev with the
> block-special device (i.e. open("/dev/sda1"), read) isn't it?
> 
> If a filesystem wants to attach information to the blockdev pagecache
> that is different to what blockdev want to attach, then I think "Yes"
> - a new inode and address space is what it needs to create.
> 
> Then you get into consistency issues between the metadata and direct
> blockdevice access.  Do we care about those?

Yeah that issue is definitely a real one. The problem is not just
consistency, but "how do the block device aops even know that the
PG_private page they have has buffer heads or fsblocks", so it is
an oopsable condition rather than just a plain consistency issue
(consistency is already not guaranteed).

--
SUSE Labs, Novell Inc.


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Neil Brown
On Tuesday June 26, [EMAIL PROTECTED] wrote:
> Chris Mason wrote:
> > 
> > The block device pagecache isn't special, and certainly isn't that much
> > code.  I would suggest keeping it buffer head specific and making a
> > second variant that does only fsblocks.  This is mostly to keep the
> > semantics of PagePrivate sane, lets not fuzz the line.
> 
> That would require a new inode and address_space for the fsblock
> type blockdev pagecache, wouldn't it? I just can't think of a
> better non-intrusive way of allowing a buffer_head filesystem and
> an fsblock filesystem to live on the same blkdev together.

I don't think they would ever try to.  Both filesystems would bd_claim
the blkdev, and only one would win.

The issue is more of a filesystem sharing a blockdev with the
block-special device (i.e. open("/dev/sda1"), read) isn't it?

If a filesystem wants to attach information to the blockdev pagecache
that is different to what blockdev want to attach, then I think "Yes"
- a new inode and address space is what it needs to create.

Then you get into consistency issues between the metadata and direct
blockdevice access.  Do we care about those?


NeilBrown


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Chris Mason wrote:
> On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:
> 
> > Rewrite the buffer layer.
> 
> Overall, I like the basic concepts, but it is hard to track the locking
> rules.  Could you please write them up?

Yeah I will do that.

Thanks for taking a look. One thing I am thinking about is to get
rid of the unmap_underlying_metadata calls from the generic code.
I found they were required for minix to prevent corruption, however
I don't know exactly what metadata is interfering here (maybe it
is indirect blocks or something?). Anyway, I think I will make it
a requirement that the filesystem has to already handle this before
returning a newly allocated block -- they can probably do it more
efficiently and we avoid the extra work on every block allocation.

> I like the way you split out the assoc_buffers from the main fsblock
> code, but the list setup is still something of a wart.  It also provides
> poor ordering of blocks for writeback.

Yeah, I didn't know how much effort to put in here because I don't
know whether modern filesystems are going to need to implement their
own management of this stuff or not.

I haven't actually instrumented this in something like ext2 to see
how much IO comes from the assoc buffers...

> I think it makes sense to replace the assoc_buffers list head with a
> radix tree sorted by block number.  mark_buffer_dirty_inode would up the
> reference count and put it into the radix, the various flushing routines
> would walk the radix etc.
> 
> If you wanted to be able to drop the reference count once the block was
> written you could have a back pointer to the appropriate inode.

I was actually thinking about a radix-tree :) One annoyance is that
unsigned long != sector_t :P rbtree would probably be OK.

--
SUSE Labs, Novell Inc.


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Chris Mason wrote:
> On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:
> 
> > Neil Brown wrote:
> > 
> > > Why do you think you need PG_blocks?
> > 
> > Block device pagecache (buffer cache) has to be able to accept
> > attachment of either buffers or blocks for filesystem metadata,
> > and call into either buffer.c or fsblock.c based on that.
> > 
> > If the page flag is really important, we can do some awful hack
> > like assuming the first long of the private data is flags, and
> > those flags will tell us whether the structure is a buffer_head
> > or fsblock ;) But for now it is just easier to use a page flag.
> 
> The block device pagecache isn't special, and certainly isn't that much
> code.  I would suggest keeping it buffer head specific and making a
> second variant that does only fsblocks.  This is mostly to keep the
> semantics of PagePrivate sane, lets not fuzz the line.

That would require a new inode and address_space for the fsblock
type blockdev pagecache, wouldn't it? I just can't think of a
better non-intrusive way of allowing a buffer_head filesystem and
an fsblock filesystem to live on the same blkdev together.

--
SUSE Labs, Novell Inc.


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Chris Mason
On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:
> Rewrite the buffer layer.

Overall, I like the basic concepts, but it is hard to track the locking
rules.  Could you please write them up?

I like the way you split out the assoc_buffers from the main fsblock
code, but the list setup is still something of a wart.  It also provides
poor ordering of blocks for writeback.

I think it makes sense to replace the assoc_buffers list head with a
radix tree sorted by block number.  mark_buffer_dirty_inode would up the
reference count and put it into the radix, the various flushing routines
would walk the radix etc.

If you wanted to be able to drop the reference count once the block was
written you could have a back pointer to the appropriate inode.

-chris


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Chris Mason
On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:
> Neil Brown wrote:
> >On Sunday June 24, [EMAIL PROTECTED] wrote:
> >
> >>
> >>+#define PG_blocks  20  /* Page has block mappings */
> >>+
> >
> >
> >I've only had a very quick look, but this line looks *very* wrong.
> >You should be using PG_private.
> >
> >There should never be any confusion about whether ->private has
> >buffers or blocks attached as the only routines that ever look in
> >->private are address_space operations  (or should be.  I think 'NULL'
> >is sometimes special cased, as in try_to_release_page.  It would be
> >good to do some preliminary work and tidy all that up).
> 
> There is a lot of confusion, actually :)
> But as you see in the patch, I added a couple more aops APIs, and
> am working toward decoupling it as much as possible. It's pretty
> close after the fsblock patch... however:
> 
> 
> >Why do you think you need PG_blocks?
> 
> Block device pagecache (buffer cache) has to be able to accept
> attachment of either buffers or blocks for filesystem metadata,
> and call into either buffer.c or fsblock.c based on that.
> 
> If the page flag is really important, we can do some awful hack
> like assuming the first long of the private data is flags, and
> those flags will tell us whether the structure is a buffer_head
> or fsblock ;) But for now it is just easier to use a page flag.

The block device pagecache isn't special, and certainly isn't that much
code.  I would suggest keeping it buffer head specific and making a
second variant that does only fsblocks.  This is mostly to keep the
semantics of PagePrivate sane, lets not fuzz the line.

-chris



Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Andi Kleen
On Sun, Jun 24, 2007 at 01:18:42PM -0700, Arjan van de Ven wrote:
> 
> > Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to
> > the smallest possible type for each architecture. And a couple of ugly
> > casts for set_bit et.al. but those could be also hidden in macros.
> > Should be relatively easy to do.
> 
> or make a "smallbit" type that is small/supported, so 64 bit if 32 bit
> isn't supported, otherwise 32

That wouldn't handle the case where you only need e.g. 8 bits.
That's fine for x86 too. It only hates atomic accesses crossing cache line
boundaries (but handles them too, just slow).

-Andi


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Neil Brown wrote:
> On Sunday June 24, [EMAIL PROTECTED] wrote:
> 
> > +#define PG_blocks  20  /* Page has block mappings */
> > +
> 
> I've only had a very quick look, but this line looks *very* wrong.
> You should be using PG_private.
> 
> There should never be any confusion about whether ->private has
> buffers or blocks attached as the only routines that ever look in
> ->private are address_space operations  (or should be.  I think 'NULL'
> is sometimes special cased, as in try_to_release_page.  It would be
> good to do some preliminary work and tidy all that up).

There is a lot of confusion, actually :)
But as you see in the patch, I added a couple more aops APIs, and
am working toward decoupling it as much as possible. It's pretty
close after the fsblock patch... however:

> Why do you think you need PG_blocks?

Block device pagecache (buffer cache) has to be able to accept
attachment of either buffers or blocks for filesystem metadata,
and call into either buffer.c or fsblock.c based on that.

If the page flag is really important, we can do some awful hack
like assuming the first long of the private data is flags, and
those flags will tell us whether the structure is a buffer_head
or fsblock ;) But for now it is just easier to use a page flag.

--
SUSE Labs, Novell Inc.


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Andi Kleen wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> 
> [haven't read everything, just commenting on something that caught my eye]
> 
> > +struct fsblock {
> > +   atomic_t        count;
> > +   union {
> > +           struct {
> > +                   unsigned long   flags; /* XXX: flags could be int for better packing */
> 
> int is not supported by many architectures, but works on x86 at least.

Yeah, that would be nice. We could actually use this for buffer_head as well,
but saving 4% there isn't so important as saving 20% for fsblock :)

> Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to
> the smallest possible type for each architecture. And a couple of ugly
> casts for set_bit et.al. but those could be also hidden in macros.
> Should be relatively easy to do.

Cool. It would probably be useful for other things as well.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Andi Kleen wrote:

Nick Piggin [EMAIL PROTECTED] writes:


[haven't read everything, just commenting on something that caught my eye]



+struct fsblock {
+   atomic_tcount;
+   union {
+   struct {
+   unsigned long   flags; /* XXX: flags could be int for 
better packing */



int is not supported by many architectures, but works on x86 at least.


Yeah, that would be nice. We could actually use this for buffer_head as well,
but saving 4% there isn't so important as saving 20% for fsblock :)



Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the 
smallest
possible type for each architecture. And a couple of ugly casts for set_bit 
et.al.
but those could be also hidden in macros. Should be relatively easy to do.


Cool. It would probably be useful for other things as well.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Neil Brown wrote:

On Sunday June 24, [EMAIL PROTECTED] wrote:



+#define PG_blocks  20  /* Page has block mappings */
+



I've only had a very quick look, but this line looks *very* wrong.
You should be using PG_private.

There should never be any confusion about whether -private has
buffers or blocks attached as the only routines that ever look in
-private are address_space operations  (or should be.  I think 'NULL'
is sometimes special cased, as in try_to_release_page.  It would be
good to do some preliminary work and tidy all that up).


There is a lot of confusion, actually :)
But as you see in the patch, I added a couple more aops APIs, and
am working toward decoupling it as much as possible. It's pretty
close after the fsblock patch... however:



Why do you think you need PG_blocks?


Block device pagecache (buffer cache) has to be able to accept
attachment of either buffers or blocks for filesystem metadata,
and call into either buffer.c or fsblock.c based on that.

If the page flag is really important, we can do some awful hack
like assuming the first long of the private data is flags, and
those flags will tell us whether the structure is a buffer_head
or fsblock ;) But for now it is just easier to use a page flag.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Andi Kleen
On Sun, Jun 24, 2007 at 01:18:42PM -0700, Arjan van de Ven wrote:
 
  Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the 
  smallest
  possible type for each architecture. And a couple of ugly casts for set_bit 
  et.al.
  but those could be also hidden in macros. Should be relatively easy to do.
 
 or make a smallbit type that is small/supported, so 64 bit if 32 bit
 isn't supported, otherwise 32

That wouldn't handle the case where you only need e.g. 8 bits
That's fine for x86 too. It only hates atomic accesses crossing cache line
boundaries (but handles them too, just slow) 

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Chris Mason
On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:
 Neil Brown wrote:
 On Sunday June 24, [EMAIL PROTECTED] wrote:
 
 
 +#define PG_blocks  20  /* Page has block mappings */
 +
 
 
 I've only had a very quick look, but this line looks *very* wrong.
 You should be using PG_private.
 
 There should never be any confusion about whether -private has
 buffers or blocks attached as the only routines that ever look in
 -private are address_space operations  (or should be.  I think 'NULL'
 is sometimes special cased, as in try_to_release_page.  It would be
 good to do some preliminary work and tidy all that up).
 
 There is a lot of confusion, actually :)
 But as you see in the patch, I added a couple more aops APIs, and
 am working toward decoupling it as much as possible. It's pretty
 close after the fsblock patch... however:
 
 
 Why do you think you need PG_blocks?
 
 Block device pagecache (buffer cache) has to be able to accept
 attachment of either buffers or blocks for filesystem metadata,
 and call into either buffer.c or fsblock.c based on that.
 
 If the page flag is really important, we can do some awful hack
 like assuming the first long of the private data is flags, and
 those flags will tell us whether the structure is a buffer_head
 or fsblock ;) But for now it is just easier to use a page flag.

The block device pagecache isn't special, and certainly isn't that much
code.  I would suggest keeping it buffer head specific and making a
second variant that does only fsblocks.  This is mostly to keep the
semantics of PagePrivate sane, lets not fuzz the line.

-chris

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Chris Mason
On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:
 Rewrite the buffer layer.

Overall, I like the basic concepts, but it is hard to track the locking
rules.  Could you please write them up?

I like the way you split out the assoc_buffers from the main fsblock
code, but the list setup is still something of a wart.  It also provides
poor ordering of blocks for writeback.

I think it makes sense to replace the assoc_buffers list head with a
radix tree sorted by block number.  mark_buffer_dirty_inode would up the
reference count and put it into the radix, the various flushing routines
would walk the radix etc.

If you wanted to be able to drop the reference count once the block was
written you could have a back pointer to the appropriate inode.

-chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Chris Mason wrote:

On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:


Neil Brown wrote:



Why do you think you need PG_blocks?


Block device pagecache (buffer cache) has to be able to accept
attachment of either buffers or blocks for filesystem metadata,
and call into either buffer.c or fsblock.c based on that.

If the page flag is really important, we can do some awful hack
like assuming the first long of the private data is flags, and
those flags will tell us whether the structure is a buffer_head
or fsblock ;) But for now it is just easier to use a page flag.



The block device pagecache isn't special, and certainly isn't that much
code.  I would suggest keeping it buffer head specific and making a
second variant that does only fsblocks.  This is mostly to keep the
semantics of PagePrivate sane, lets not fuzz the line.


That would require a new inode and address_space for the fsblock
type blockdev pagecache, wouldn't it? I just can't think of a
better non-intrusive way of allowing a buffer_head filesystem and
an fsblock filesystem to live on the same blkdev together.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Chris Mason wrote:

On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:


Rewrite the buffer layer.



Overall, I like the basic concepts, but it is hard to track the locking
rules.  Could you please write them up?


Yeah I will do that.

Thanks for taking a look. One thing I am thinking about is to get
rid of the unmap_underlying_metadata calls from the generic code.
I found they were required for minix to prevent corruption, however
I don't know exactly what metadata is interfering here (maybe it
is indirect blocks or something?). Anyway, I think I will make it
a requirement that the filesystem has to already handle this before
returning a newly allocated block -- they can probably do it more
efficiently and we avoid the extra work on every block allocation.



I like the way you split out the assoc_buffers from the main fsblock
code, but the list setup is still something of a wart.  It also provides
poor ordering of blocks for writeback.


Yeah, I didn't know how much effort to put in here because I don't
know whether modern filesystems are going to need to implement their
own management of this stuff or not.

I haven't actually instrumented this in something like ext2 to see
how much IO comes from the assoc buffers...



I think it makes sense to replace the assoc_buffers list head with a
radix tree sorted by block number.  mark_buffer_dirty_inode would up the
reference count and put it into the radix, the various flushing routines
would walk the radix etc.

If you wanted to be able to drop the reference count once the block was
written you could have a back pointer to the appropriate inode.


I was actually thinking about a radix-tree :) One annoyance is that
unsigned long != sector_t :P rbtree would probably be OK.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Neil Brown
On Tuesday June 26, [EMAIL PROTECTED] wrote:
 Chris Mason wrote:
  
  The block device pagecache isn't special, and certainly isn't that much
  code.  I would suggest keeping it buffer head specific and making a
  second variant that does only fsblocks.  This is mostly to keep the
  semantics of PagePrivate sane, lets not fuzz the line.
 
 That would require a new inode and address_space for the fsblock
 type blockdev pagecache, wouldn't it? I just can't think of a
 better non-intrusive way of allowing a buffer_head filesystem and
 an fsblock filesystem to live on the same blkdev together.

I don't think they would ever try to.  Both filesystems would bd_claim
the blkdev, and only one would win.

The issue is more of a filesystem sharing a blockdev with the
block-special device (i.e. open("/dev/sda1"), read) isn't it?

If a filesystem wants to attach information to the blockdev pagecache
that is different to what blockdev wants to attach, then I think "Yes
- a new inode and address space is what it needs to create".

Then you get into consistency issues between the metadata and direct
blockdevice access.  Do we care about those?


NeilBrown


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Nick Piggin

Neil Brown wrote:
> On Tuesday June 26, [EMAIL PROTECTED] wrote:
>> Chris Mason wrote:
>>>
>>> The block device pagecache isn't special, and certainly isn't that much
>>> code.  I would suggest keeping it buffer head specific and making a
>>> second variant that does only fsblocks.  This is mostly to keep the
>>> semantics of PagePrivate sane, let's not fuzz the line.
>>
>> That would require a new inode and address_space for the fsblock
>> type blockdev pagecache, wouldn't it? I just can't think of a
>> better non-intrusive way of allowing a buffer_head filesystem and
>> an fsblock filesystem to live on the same blkdev together.
> 
> I don't think they would ever try to.  Both filesystems would bd_claim
> the blkdev, and only one would win.

Hmm OK, I might have confused myself thinking about partitions...

> The issue is more of a filesystem sharing a blockdev with the
> block-special device (i.e. open("/dev/sda1"), read) isn't it?
> 
> If a filesystem wants to attach information to the blockdev pagecache
> that is different to what blockdev wants to attach, then I think "Yes
> - a new inode and address space is what it needs to create".
> 
> Then you get into consistency issues between the metadata and direct
> blockdevice access.  Do we care about those?

Yeah that issue is definitely a real one. The problem is not just
consistency, but how do the block device aops even know that the
PG_private page they have has buffer heads or fsblocks, so it is
an oopsable condition rather than just a plain consistency issue
(consistency is already not guaranteed).

--
SUSE Labs, Novell Inc.


Re: [patch 1/3] add the fsblock layer

2007-06-24 Thread Neil Brown
On Sunday June 24, [EMAIL PROTECTED] wrote:
>  
> +#define PG_blocks	20	/* Page has block mappings */
> +

I've only had a very quick look, but this line looks *very* wrong.
You should be using PG_private.

There should never be any confusion about whether ->private has
buffers or blocks attached as the only routines that ever look in
->private are address_space operations  (or should be.  I think 'NULL'
is sometimes special cased, as in try_to_release_page.  It would be
good to do some preliminary work and tidy all that up).

Why do you think you need PG_blocks?

NeilBrown


Re: [patch 1/3] add the fsblock layer

2007-06-24 Thread Arjan van de Ven

> Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to
> the smallest possible type for each architecture. And a couple of ugly
> casts for set_bit et al., but those could be also hidden in macros.
> Should be relatively easy to do.

or make a "smallbit" type that is small/supported, so 64 bit if 32 bit
isn't supported, otherwise 32




Re: [patch 1/3] add the fsblock layer

2007-06-24 Thread Andi Kleen
Nick Piggin <[EMAIL PROTECTED]> writes:


[haven't read everything, just commenting on something that caught my eye]

> +struct fsblock {
> +	atomic_t		count;
> +	union {
> +		struct {
> +			unsigned long	flags;	/* XXX: flags could be int for better packing */

Atomic bitops on an int are not supported by many architectures, but work on x86 at least.

Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to
the smallest possible type for each architecture. And a couple of ugly
casts for set_bit et al., but those could be also hidden in macros.
Should be relatively easy to do.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

