Re: [RFC] fsblock

2007-07-09 Thread Dave McCracken
On Monday 09 July 2007, Christoph Lameter wrote:
> On Tue, 10 Jul 2007, Nick Piggin wrote:
> > There are no changes to the filesystem API for large pages (although I
> > am adding a couple of helpers to do page based bitmap ops). And I don't
> > want to rely on contiguous memory. Why do you think handling of large
> > pages (presumably you mean larger than page sized blocks) is strange?
>
> We already have a way to handle large pages: Compound pages.

Um, no, we don't, assuming by compound pages you mean order > 0 pages.  None
of the stack of changes necessary to make these pages viable has yet been
accepted, i.e. antifrag, defrag, and variable page cache.  While these changes
may yet all go in and work wonderfully, I applaud Nick's alternative solution
that does not include a dependency on them.

Dave McCracken


Re: [RFC] fsblock

2007-07-09 Thread Nick Piggin
On Mon, Jul 09, 2007 at 05:59:47PM -0700, Christoph Lameter wrote:
> On Tue, 10 Jul 2007, Nick Piggin wrote:
> 
> > > Hmmm I did not notice that yet but then I have not done much work 
> > > there.
> > 
> > Notice what?
> 
> The bad code for the buffer heads.

Oh. Well my first mail in this thread listed some of the problems
with them.


> > > > - A real "nobh" mode. nobh was created I think mainly to avoid problems
> > > >   with buffer_head memory consumption, especially on lowmem machines. It
> > > >   is basically a hack (sorry), which requires special code in filesystems,
> > > >   and duplication of quite a bit of tricky buffer layer code (and bugs).
> > > >   It also doesn't work so well for buffers with non-trivial private data
> > > >   (like most journalling ones). fsblock implements this with basically a
> > > >   few lines of code, and it should work in situations like ext3.
> > > 
> > > Hmmm That means simply page struct are not working...
> > 
> > I don't understand you. jbd needs to attach private data to each bh, and
> > that can stay around for longer than the life of the page in the pagecache.
> 
> Right. So just using page struct alone won't work for the filesystems.
> 
> > There are no changes to the filesystem API for large pages (although I
> > am adding a couple of helpers to do page based bitmap ops). And I don't
> > want to rely on contiguous memory. Why do you think handling of large
> > pages (presumably you mean larger than page sized blocks) is strange?
> 
> We already have a way to handle large pages: Compound pages.

Yes but I don't want to use large pages and I am not going to use
them (at least, they won't be mandatory).

 
> > Conglomerating the constituent pages via the pagecache radix-tree seems
> > logical to me.
> 
> Meaning overhead to handle each page still exists? This scheme cannot 
> handle large contiguous blocks as a single entity?

Of course some things have to be done per-page if the pages are not
contiguous. I actually haven't seen that to be a problem or have much
reason to think it will suddenly become a problem (although I do like
Andrea's config page sizes approach for really big systems that cannot
change their HW page size).
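
[Illustration — a minimal sketch of what "conglomerating the constituent pages
via the pagecache radix-tree" could look like for a block backed by several
order-0 pages. The helper names and calling convention are hypothetical and
not taken from the fsblock patch; only find_get_pages() and
page_cache_release() are real kernel calls.]

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Gather the order-0 pagecache pages backing one multi-page block via the
 * pagecache radix-tree.  find_get_pages() takes a reference on each page
 * it returns; the caller drops them again when done with the block.
 */
static unsigned int gather_block_pages(struct address_space *mapping,
                                       pgoff_t first_index,
                                       unsigned int pages_per_block,
                                       struct page **pages)
{
        return find_get_pages(mapping, first_index, pages_per_block, pages);
}

static void release_block_pages(struct page **pages, unsigned int nr)
{
        unsigned int i;

        for (i = 0; i < nr; i++)
                page_cache_release(pages[i]);
}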


Re: [RFC] fsblock

2007-07-09 Thread Christoph Lameter
On Tue, 10 Jul 2007, Nick Piggin wrote:

> > Hmmm I did not notice that yet but then I have not done much work 
> > there.
> 
> Notice what?

The bad code for the buffer heads.

> > > - A real "nobh" mode. nobh was created I think mainly to avoid problems
> > >   with buffer_head memory consumption, especially on lowmem machines. It
> > >   is basically a hack (sorry), which requires special code in filesystems,
> > >   and duplication of quite a bit of tricky buffer layer code (and bugs).
> > >   It also doesn't work so well for buffers with non-trivial private data
> > >   (like most journalling ones). fsblock implements this with basically a
> > >   few lines of code, and it should work in situations like ext3.
> > 
> > Hmmm That means simply page struct are not working...
> 
> I don't understand you. jbd needs to attach private data to each bh, and
> that can stay around for longer than the life of the page in the pagecache.

Right. So just using page struct alone won't work for the filesystems.

> There are no changes to the filesystem API for large pages (although I
> am adding a couple of helpers to do page based bitmap ops). And I don't
> want to rely on contiguous memory. Why do you think handling of large
> pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: Compound pages.

> Conglomerating the constituent pages via the pagecache radix-tree seems
> logical to me.

Meaning overhead to handle each page still exists? This scheme cannot 
handle large contiguous blocks as a single entity?



Re: [RFC] fsblock

2007-07-09 Thread Nick Piggin
On Mon, Jul 09, 2007 at 10:14:06AM -0700, Christoph Lameter wrote:
> On Sun, 24 Jun 2007, Nick Piggin wrote:
> 
> > Firstly, what is the buffer layer?  The buffer layer isn't really a
> > buffer layer as in the buffer cache of unix: the block device cache
> > is unified with the pagecache (in terms of the pagecache, a blkdev
> > file is just like any other, but with a 1:1 mapping between offset
> > and block).
> 
> I thought that the buffer layer is essentially a method to index to sub 
> section of a page?

It converts pagecache addresses to block addresses I guess. The
current implementation cannot handle blocks larger than pages,
but not because use of larger pages for pagecache was anticipated
(likely because it is more work, and the APIs aren't really set up
for it).


> > Why rewrite the buffer layer?  Lots of people have had a desire to
> > completely rip out the buffer layer, but we can't do that[*] because
> > it does actually serve a useful purpose. Why the bad rap? Because
> > the code is old and crufty, and buffer_head is an awful name. It must 
> > be among the oldest code in the core fs/vm, and the main reason is
> > because of the inertia of so many and such complex filesystems.
> 
> Hmmm I did not notice that yet but then I have not done much work 
> there.

Notice what?


> > - data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on
> >   64-bit (could easily be 32 if we can have int bitops). Compare this
> >   to around 50 and 100ish for struct buffer_head. With a 4K page and 1K
> >   blocks, IO requires 10% RAM overhead in buffer heads alone. With
> >   fsblocks you're down to around 3%.
> 
> I thought we were going to simply use the page struct instead of having
> buffer heads? Would that not reduce the overhead to zero?

What do you mean by that? As I said, you couldn't use just the page
struct for anything except page sized blocks, and even then it would
require more fields or at least more flags in the page struct.

nobh mode actually tries to do something similar, however it requires
multiple calls into the filesystem to first allocate the block, and
then find its sector. It is also buggy and can't handle errors properly
(although I'm trying to fix that).
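
[Illustration — the RAM-overhead figures quoted above can be checked with
simple arithmetic; the sizes below are the approximate 64-bit numbers from
the mail (~100 bytes per buffer_head, 40 per fsblock), not measured values.]

#include <stdio.h>

int main(void)
{
        const double page_size = 4096;          /* 4K pages */
        const double block_size = 1024;         /* 1K filesystem blocks */
        const double per_page = page_size / block_size; /* 4 blocks per page */

        const double bh_size = 100;             /* struct buffer_head, "100ish" */
        const double fsblock_size = 40;         /* struct fsblock, 64-bit */

        /* prints roughly 9.8% and 3.9%, matching the ~10% vs ~3% claim */
        printf("buffer_head overhead: %.1f%%\n",
               100.0 * per_page * bh_size / page_size);
        printf("fsblock overhead:     %.1f%%\n",
               100.0 * per_page * fsblock_size / page_size);
        return 0;
}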


> > - A real "nobh" mode. nobh was created I think mainly to avoid problems
> >   with buffer_head memory consumption, especially on lowmem machines. It
> >   is basically a hack (sorry), which requires special code in filesystems,
> >   and duplication of quite a bit of tricky buffer layer code (and bugs).
> >   It also doesn't work so well for buffers with non-trivial private data
> >   (like most journalling ones). fsblock implements this with basically a
> >   few lines of code, and it should work in situations like ext3.
> 
> Hmmm That means simply page struct are not working...

I don't understand you. jbd needs to attach private data to each bh, and
that can stay around for longer than the life of the page in the pagecache.


> > - Large block support. I can mount and run an 8K block size minix3 fs on
> >   my 4K page system and it didn't require anything special in the fs. We
> >   can go up to about 32MB blocks now, and gigabyte+ blocks would only
> >   require  one more bit in the fsblock flags. fsblock_superpage blocks
> >   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> > 
> >   Core pagecache code is pretty creaky with respect to this. I think it is
> >   mostly race free, but it requires stupid unlocking and relocking hacks
> >   because the vm usually passes single locked pages to the fs layers, and we
> >   need to lock all pages of a block in offset ascending order. This could be
> >   avoided by doing locking on only the first page of a block for locking in
> >   the fsblock layer, but that's a bit scary too. Probably better would be to
> >   move towards offset,length rather than page based fs APIs where everything
> >   can be batched up nicely and this sort of non-trivial locking can be more
> >   optimal.
> > 
> >   Large blocks also have a performance black spot where an 8K sized and
> >   aligned write(2) would require an RMW in the filesystem. Again because of
> >   the page based nature of the fs API, and this too would be fixed if
> >   the APIs were better.
> 
> The simple solution would be to use a compound page and make the head page
> represent the status of all the pages in the vm. Logic for that is already 
> in place.

I do not consider that a solution because I explicitly want to allow
order-0 pages here. I know about your higher order pagecache, the anti-frag
and defrag work, I know about compound pages.  I'm not just ignoring them
because of NIH or something silly.

Anyway, I have thought about just using the first page in the block for
the locking, and that might be a reasonable optimisation. However for
now I'm keeping it simple.
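
[Illustration — a sketch of the "lock all pages of a block in offset ascending
order" rule from the quoted text: taking the page locks in one global order is
what keeps two tasks working on the same large block from deadlocking against
each other. The helper below is illustrative only and not from the fsblock
patch.]

#include <linux/errno.h>
#include <linux/pagemap.h>

/*
 * Lock every pagecache page backing one multi-page block, always in
 * ascending index order, so no two tasks can take the same page locks
 * in opposite orders.
 */
static int lock_block_pages(struct address_space *mapping, pgoff_t first,
                            unsigned int nr_pages, struct page **pages)
{
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                struct page *page = find_get_page(mapping, first + i);

                if (!page)
                        goto fail;
                lock_page(page);        /* may sleep */
                pages[i] = page;
        }
        return 0;

fail:
        while (i--) {                   /* undo in reverse order */
                unlock_page(pages[i]);
                page_cache_release(pages[i]);
        }
        return -ENOENT;
}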


> >   Large block memory access via filesystem uses vmap, but it will go back
> >   to kmap if the access doesn't cross a page. 

Re: [RFC] fsblock

2007-07-09 Thread Christoph Lameter
On Sun, 24 Jun 2007, Nick Piggin wrote:

> Firstly, what is the buffer layer?  The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).

I thought that the buffer layer is essentially a method to index to sub 
section of a page?

> Why rewrite the buffer layer?  Lots of people have had a desire to
> completely rip out the buffer layer, but we can't do that[*] because
> it does actually serve a useful purpose. Why the bad rap? Because
> the code is old and crufty, and buffer_head is an awful name. It must 
> be among the oldest code in the core fs/vm, and the main reason is
> because of the inertia of so many and such complex filesystems.

Hmmm I did not notice that yet but then I have not done much work 
there.

> - data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on
>   64-bit (could easily be 32 if we can have int bitops). Compare this
>   to around 50 and 100ish for struct buffer_head. With a 4K page and 1K
>   blocks, IO requires 10% RAM overhead in buffer heads alone. With
>   fsblocks you're down to around 3%.

I thought we were going to simply use the page struct instead of having
buffer heads? Would that not reduce the overhead to zero?

> - Structure packing. A page gets a number of buffer heads that are
>   allocated in a linked list. fsblocks are allocated contiguously, so
>   cacheline footprint is smaller in the above situation.

Good idea.

> - A real "nobh" mode. nobh was created I think mainly to avoid problems
>   with buffer_head memory consumption, especially on lowmem machines. It
>   is basically a hack (sorry), which requires special code in filesystems,
>   and duplication of quite a bit of tricky buffer layer code (and bugs).
>   It also doesn't work so well for buffers with non-trivial private data
>   (like most journalling ones). fsblock implements this with basically a
>   few lines of code, and it should work in situations like ext3.

Hmmm That means simply page struct are not working...
 
> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require  one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> 
>   Core pagecache code is pretty creaky with respect to this. I think it is
>   mostly race free, but it requires stupid unlocking and relocking hacks
>   because the vm usually passes single locked pages to the fs layers, and we
>   need to lock all pages of a block in offset ascending order. This could be
>   avoided by doing locking on only the first page of a block for locking in
>   the fsblock layer, but that's a bit scary too. Probably better would be to
>   move towards offset,length rather than page based fs APIs where everything
>   can be batched up nicely and this sort of non-trivial locking can be more
>   optimal.
> 
>   Large blocks also have a performance black spot where an 8K sized and
>   aligned write(2) would require an RMW in the filesystem. Again because of
>   the page based nature of the fs API, and this too would be fixed if
>   the APIs were better.

The simple solution would be to use a compound page and make the head page
represent the status of all the pages in the vm. Logic for that is already 
in place.

>   Large block memory access via filesystem uses vmap, but it will go back
>   to kmap if the access doesn't cross a page. Filesystems really should do
>   this because vmap is slow as anything. I've implemented a vmap cache
>   which basically wouldn't work on 32-bit systems (because of limited vmap
>   space) for performance testing (and yes it sometimes tries to unmap in
>   interrupt context, I know, I'm using loop). We could possibly do a self
>   limiting cache, but I'd rather build some helpers to hide the raw multi
>   page access for things like bitmap scanning and bit setting etc. and
>   avoid too much vmaps.

Argh. No. Too much overhead.
 
> So. Comments? Is this something we want? If yes, then how would we
> transition from buffer.c to fsblock.c?

I think many of the ideas are great, but the handling of large pages is 
rather strange. I would suggest using compound pages to represent larger 
pages and relying on Mel Gorman's antifrag/compaction work to get you the 
contiguous memory locations instead of using vmap. This may significantly 
simplify your patchset and avoid changes to the filesystem API. It's still 
pretty invasive though, and I am not sure that there is enough benefit from 
this one.
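
[Illustration — the kmap-versus-vmap choice described in the quoted text,
assuming a large block backed by an array of order-0 pages: use a cheap
kmap() when the access stays inside one page, and fall back to the much
slower vmap() only when it really crosses pages. Function and parameter
names are hypothetical; kmap()/kunmap() and vmap()/vunmap() are the real
kernel interfaces.]

#include <linux/highmem.h>
#include <linux/vmalloc.h>

/*
 * Map "len" bytes at byte offset "off" within a block made of nr_pages
 * order-0 pages.  The caller later does kunmap(*single_page) when a
 * single page was mapped, or vunmap() on the returned address otherwise.
 */
static void *map_block_range(struct page **pages, unsigned int nr_pages,
                             unsigned int off, unsigned int len,
                             struct page **single_page)
{
        unsigned int first = off >> PAGE_SHIFT;
        unsigned int last = (off + len - 1) >> PAGE_SHIFT;

        if (first == last) {
                /* access does not cross a page: kmap is enough */
                *single_page = pages[first];
                return (char *)kmap(pages[first]) + (off & ~PAGE_MASK);
        }

        /* crosses pages: build one virtually contiguous mapping */
        *single_page = NULL;
        return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}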


Re: [RFC] fsblock

2007-06-30 Thread Christoph Hellwig
On Sat, Jun 30, 2007 at 07:10:27AM -0400, Jeff Garzik wrote:
> >Not really, the current behaviour is a bug.  And it's not actually buffer
> >layer specific - XFS now has a fix for that bug and it's generic enough
> >that everyone could use it.
> 
> I'm not sure I follow.  If you require block allocation at mmap(2) time, 
> rather than when a page is actually dirtied, you are denying userspace 
> the ability to do sparse files with mmap.
> 
> A quick Google readily turns up people who have built upon the 
> mmap-sparse-file assumption, and I don't think we want to break those 
> assumptions as a "bug fix."
> 
> Where is the bug?

It's not mmap time but page dirtying time.  Currently the default behaviour
is not to allocate at page dirtying time but rather at writeout time in
some scenarios.

(and s/allocation/reservation/ applies for delalloc of course)
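
[Illustration — the generic fix referred to here is, as far as I know, built
around the ->page_mkwrite() callback in struct vm_operations_struct, which
lets a filesystem reserve or allocate blocks when a shared-writable mmap()ed
page first goes from clean to dirty, so ENOSPC is reported at dirtying time
rather than discovered at writeout. A rough sketch, with a hypothetical
reservation helper:]

#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical per-filesystem helper: reserve (or allocate) the blocks
 * backing this page, returning -ENOSPC if the filesystem is full */
extern int myfs_reserve_blocks_for_page(struct inode *inode, struct page *page);

/*
 * Called by the VM when a clean, shared-writable mapped page is about to
 * be dirtied for the first time.  An error here is reported to the
 * faulting task instead of surfacing (or being lost) at writeout time.
 */
static int myfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
        struct inode *inode = vma->vm_file->f_mapping->host;

        return myfs_reserve_blocks_for_page(inode, page);
}

/* wired up alongside the filesystem's usual fault handler */
static struct vm_operations_struct myfs_vm_ops = {
        .page_mkwrite   = myfs_page_mkwrite,
};

For delalloc the helper would only reserve space rather than allocate blocks,
per the s/allocation/reservation/ note above.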


Re: [RFC] fsblock

2007-06-30 Thread Jeff Garzik

Christoph Hellwig wrote:
> On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> > > - In line with the above item, filesystem block allocation is performed
> > >   before a page is dirtied. In the buffer layer, mmap writes can dirty a
> > >   page with no backing blocks which is a problem if the filesystem is
> > >   ENOSPC (patches exist for buffer.c for this).
> > 
> > This raises an eyebrow...  The handling of ENOSPC prior to mmap write is 
> > more an ABI behavior, so I don't see how this can be fixed with internal 
> > changes, yet without changing behavior currently exported to userland 
> > (and thus affecting code based on such assumptions).
> 
> Not really, the current behaviour is a bug.  And it's not actually buffer
> layer specific - XFS now has a fix for that bug and it's generic enough
> that everyone could use it.

I'm not sure I follow.  If you require block allocation at mmap(2) time, 
rather than when a page is actually dirtied, you are denying userspace 
the ability to do sparse files with mmap.

A quick Google readily turns up people who have built upon the 
mmap-sparse-file assumption, and I don't think we want to break those 
assumptions as a "bug fix."

Where is the bug?

Jeff




Re: [RFC] fsblock

2007-06-30 Thread Christoph Hellwig
Warning ahead:  I've only briefly skimmed over the patches, so the comments
in this mail are very high-level.

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...
> 
> Firstly, what is the buffer layer?  The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).
>
> There are filesystem APIs to access the block device, but these go
> through the block device pagecache as well. These don't exactly
> define the buffer layer either.
> 
> The buffer layer is a layer between the pagecache and the block
> device for block based filesystems. It keeps a translation between
> logical offset and physical block number, as well as meta
> information such as locks, dirtyness, and IO status of each block.
> This information is tracked via the buffer_head structure.
> 

The traditional unix buffer cache is always physical-block indexed and
used for all data/metadata/blockdevice node access.  There have been
a lot of variants of schemes where data, or some data, is in a separate
inode,logical block indexed scheme.  Most modern OSes, including Linux,
now always do the inode,logical block index with some noop substitute
for the metadata and block device node variants of operation.
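
[Illustration — purely hypothetical structs for the two indexing schemes being
contrasted here: a traditional buffer cache keyed by (device, physical block),
versus pagecache-style keying by (inode, logical block), with the filesystem
doing the logical-to-physical mapping separately.]

#include <linux/types.h>

/* traditional unix buffer cache: one cache for data, metadata and block
 * device nodes alike, indexed by physical location on the device */
struct physical_key {
        dev_t           dev;            /* which block device */
        sector_t        phys_block;     /* block number on that device */
};

/* pagecache-era indexing: per-file cache indexed by position in the file */
struct logical_key {
        unsigned long   ino;            /* which inode */
        pgoff_t         logical_block;  /* block offset within the file */
};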

Now what you replace is a really crappy hybrid of a traditional
unix buffercache implemented on top of the pagecache for the block
device node (for metadata), and a lot of abuse of the same data
structure as used in the buffercache for keeping metainformation
about the actual data mapping.

> Why rewrite the buffer layer?  Lots of people have had a desire to
> completely rip out the buffer layer, but we can't do that[*] because
> it does actually serve a useful purpose. Why the bad rap? Because
> the code is old and crufty, and buffer_head is an awful name. It must 
> be among the oldest code in the core fs/vm, and the main reason is
> because of the inertia of so many and such complex filesystems.

Actually most of the code is no older than 10 years.  Just compare
fs/buffer.c in 2.2 and 2.6.  buffer_head is a perfectly fine name
for one of its uses in the traditional buffercache.

I also think there is little to no reason to get rid of that use:
This buffercache is what most linux block-based filesystems (except
xfs and jfs most notably) are written to, and it fits them very nicely.

What I'd really like to see is to get rid of the abuse of struct buffer_head
in the data path, and the sometimes too intimate coupling of the buffer cache
with page cache internals.

> - Data / metadata separation. I have a struct fsblock and a struct
>   fsblock_meta, so we could put more stuff into the usually less used
>   fsblock_meta without bloating it up too much. After a few tricks, these
>   are no longer any different in my code, and dirty up the typing quite
>   a lot (and I'm aware it still has some warnings, thanks). So if not
>   useful this could be taken out.


That's what I mean.  And from a quick glimpse at your code they're still
far too deeply coupled in fsblock.  Really, we don't want to share anything
between the buffer cache and data mapping operations - they are so deeply
different that this sharing is what creates the enormous complexity we have
to deal with.

> - No deadlocks (hopefully). The buffer layer is technically deadlocky by
>   design, because it can require memory allocations at page writeout-time.
>   It also has one path that cannot tolerate memory allocation failures.
>   No such problems for fsblock, which keeps fsblock metadata around for as
>   long as a page is dirty (this still has problems vs get_user_pages, but
>   that's going to require an audit of all get_user_pages sites. Phew).

The whole concept of delayed allocation requires page allocations at
writeout time, as do various network protocols or even storage drivers.

> - In line with the above item, filesystem block allocation is performed
>   before a page is dirtied. In the buffer layer, mmap writes can dirty a
>   page with no backing blocks which is a problem if the filesystem is
>   ENOSPC (patches exist for buffer.c for this).

Not really something that is the block layer's fault but rather the laziness
of the filesystem maintainers.

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require  one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> 
>   Core pagecache code is pretty creaky with respect to this. I think it 

Re: [RFC] fsblock

2007-06-30 Thread Christoph Hellwig
On Mon, Jun 25, 2007 at 08:25:21AM -0400, Chris Mason wrote:
> > write_begin/write_end is a step in that direction (and it helps
> > OCFS and GFS quite a bit). I think there is also not much reason
> > for writepage sites to require the page to lock the page and clear
> > the dirty bit themselves (which has seemed ugly to me).
> 
> If we keep the page mapping information with the page all the time (ie
> writepage doesn't have to call get_block ever), it may be possible to
> avoid sending down a locked page.  But, I don't know the delayed
> allocation internals well enough to say for sure if that is true.

The point of delayed allocations is that the mapping information doesn't
even exist until writepage for new allocations :)
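
[Illustration — a condensed sketch of the delayed-allocation flow being
described: the write path only reserves space and tags blocks as delalloc,
and the real logical-to-physical mapping is created later, inside
writepage/writepages. All names here are hypothetical.]

#include <linux/types.h>

enum blk_state { BLK_HOLE, BLK_DELALLOC, BLK_MAPPED };

struct blk_info {
        enum blk_state  state;
        sector_t        phys;           /* valid only when BLK_MAPPED */
};

/* hypothetical filesystem helpers */
extern int myfs_reserve_space(unsigned int nr_blocks);  /* may return -ENOSPC */
extern sector_t myfs_allocate_block(void);
extern int myfs_submit_block_io(struct blk_info *blk);

/* write/dirtying path: only a space reservation, no mapping exists yet */
static int myfs_dirty_block(struct blk_info *blk)
{
        int err = myfs_reserve_space(1);

        if (!err)
                blk->state = BLK_DELALLOC;
        return err;
}

/* writepage path: the reservation is converted to a real on-disk block
 * only now, which is why a get_block-style lookup before this point
 * finds no mapping */
static int myfs_write_block(struct blk_info *blk)
{
        if (blk->state == BLK_DELALLOC) {
                blk->phys = myfs_allocate_block();
                blk->state = BLK_MAPPED;
        }
        return myfs_submit_block_io(blk);
}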


Re: [RFC] fsblock

2007-06-30 Thread Christoph Hellwig
On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> >- In line with the above item, filesystem block allocation is performed
> >  before a page is dirtied. In the buffer layer, mmap writes can dirty a
> >  page with no backing blocks which is a problem if the filesystem is
> >  ENOSPC (patches exist for buffer.c for this).
> 
> This raises an eyebrow...  The handling of ENOSPC prior to mmap write is 
> more an ABI behavior, so I don't see how this can be fixed with internal 
> changes, yet without changing behavior currently exported to userland 
> (and thus affecting code based on such assumptions).

Not really, the current behaviour is a bug.  And it's not actually buffer
layer specific - XFS now has a fix for that bug and it's generic enough
that everyone could use it.



Re: [RFC] fsblock

2007-06-28 Thread Nick Piggin
On Thu, Jun 28, 2007 at 08:20:31AM -0400, Chris Mason wrote:
> On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> > 
> > That's true but I don't think an extent data structure means we can
> > become too far divorced from the pagecache or the native block size
> > -- what will end up happening is that often we'll need "stuff" to map
> > between all those as well, even if it is only at IO-time.
> 
> I think the fundamental difference is that fsblock still does:
> mapping_info = page->something, where something is attached on a per
> page basis.  What we really want is mapping_info = lookup_mapping(page),
> where that function goes and finds something stored on a per extent
> basis, with extra bits for tracking dirty and locked state.
> 
> Ideally, in at least some of the cases the dirty and locked state could
> be at an extent granularity (streaming IO) instead of the block
> granularity (random IO).
> 
> In my little brain, even block based filesystems should be able to take
> advantage of this...but such things are always easier to believe in
> before the coding starts.

Now I wouldn't for a minute deny that at least some of the block
information would be worth storing in extent/tree format (if XFS 
does it, it must be good!).

And yes, I'm sure filesystems with even basic block based allocation
could get a reasonable ratio of blocks to extents.

However I think it is fundamentally another layer or at least
more complexity... fsblocks uses the existing pagecache mapping as
(much of) the data structure and uses the existing pagecache locking
for the locking. And it fundamentally just provides a block access
and IO layer into the pagecache for the filesystem, which I think will
often be needed anyway.

But that said, I would like to see a generic extent mapping layer
sitting between fsblock and the filesystem (I might even have a crack
at it myself)... and I could be proven completely wrong and it may be
that fsblock isn't required at all after such a layer goes in. So I
will try to keep all the APIs extent based.

The first thing I actually looked at for "get_blocks" was for the
filesystem to build up a tree of mappings itself, completely unconnected
from the pagecache. It just ended up being a little more work and
locking but the idea isn't insane :)


> > One issue I have with the current nobh and mpage stuff is that it
> > requires multiple calls into get_block (first to prepare write, then
> > to writepage), it doesn't allow filesystems to attach resources
> > required for writeout at prepare_write time, and it doesn't play nicely
> > with buffers in general. (not to mention that nobh error handling is
> > buggy).
> > 
> > I haven't done any mpage-like code for fsblocks yet, but I think they
> > wouldn't be too much trouble, and wouldn't have any of the above
> > problems...
> 
> Could be, but the fundamental issue of sometimes pages have mappings
> attached and sometimes they don't is still there.  The window is
> smaller, but non-zero.

The aim for fsblocks is that any page under IO will always have fsblocks,
which I hope is going to make this easy. In the fsblocks patch I sent out
there is a window (with mmapped pages), however that's a bug which can be
fixed rather than a fundamental problem. So writepages will be less of a problem.

Readpages may indeed be more efficient at block mapping with extents than
with individual fsblocks (or it could be, if it were an extent based API itself).

Well I don't know. Extents are always going to have benefits, but I don't
know if it means the fsblock part could go away completely. I'll keep
it in mind though.



Re: [RFC] fsblock

2007-06-28 Thread David Chinner
On Thu, Jun 28, 2007 at 08:20:31AM -0400, Chris Mason wrote:
> On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> > That's true but I don't think an extent data structure means we can
> > become too far divorced from the pagecache or the native block size
> > -- what will end up happening is that often we'll need "stuff" to map
> > between all those as well, even if it is only at IO-time.
> 
> I think the fundamental difference is that fsblock still does:
> mapping_info = page->something, where something is attached on a per
> page basis.  What we really want is mapping_info = lookup_mapping(page),

lookup_block_mapping(page) ;)

But yes, that is the essence of what I was saying. Thanks for
describing it so concisely, Chris.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] fsblock

2007-06-28 Thread Chris Mason
On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> > On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > > Lets look at a typical example of how IO actually gets done today,
> > > starting with sys_write():
> > > 
> > > sys_write(file, buffer, 1MB)
> > > for each page:
> > > prepare_write()
> > >   allocate contiguous chunks of disk
> > > attach buffers
> > > copy_from_user()
> > > commit_write()
> > > dirty buffers
> > > 
> > > pdflush:
> > > writepages()
> > > find pages with contiguous chunks of disk
> > >   build and submit large bios
> > > 
> > > So, we replace prepare_write and commit_write with an extent based api,
> > > but we keep the dirty each buffer part.  writepages has to turn that
> > > back into extents (bio sized), and the result is completely full of dark
> > > dark corner cases.
> 
> That's true but I don't think an extent data structure means we can
> become too far divorced from the pagecache or the native block size
> -- what will end up happening is that often we'll need "stuff" to map
> between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does:
mapping_info = page->something, where something is attached on a per
page basis.  What we really want is mapping_info = lookup_mapping(page),
where that function goes and finds something stored on a per extent
basis, with extra bits for tracking dirty and locked state.

Ideally, in at least some of the cases the dirty and locked state could
be at an extent granularity (streaming IO) instead of the block
granularity (random IO).

In my little brain, even block based filesystems should be able to take
advantage of this...but such things are always easier to believe in
before the coding starts.
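
To make that concrete, a minimal userspace sketch (every name here is
made up for illustration; this is not fsblock or any existing kernel
code) of a per-extent lookup with extent-granularity dirty state might
look something like:

    #include <stdbool.h>
    #include <stddef.h>

    /* One record describes a whole run of blocks, not a single page. */
    struct extent_map {
            unsigned long long file_start;  /* in blocks */
            unsigned long long disk_start;  /* in blocks */
            unsigned long long len;         /* in blocks */
            bool dirty;                     /* extent-granularity state */
    };

    /* Find the extent covering 'block', if any.  A real implementation
     * would use a btree or similar, not a linear scan. */
    static struct extent_map *lookup_mapping(struct extent_map *map,
                                             size_t n,
                                             unsigned long long block)
    {
            for (size_t i = 0; i < n; i++) {
                    if (block >= map[i].file_start &&
                        block < map[i].file_start + map[i].len)
                            return &map[i];
            }
            return NULL;
    }

    /* A streaming write over one 256-block extent dirties one record
     * instead of 256 per-block structures. */
    static void dirty_block(struct extent_map *map, size_t n,
                            unsigned long long block)
    {
            struct extent_map *em = lookup_mapping(map, n, block);

            if (em)
                    em->dirty = true;
    }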

> 
> But the point is taken, and I do believe that at least for APIs, extent
> based seems like the best way to go. And that should allow fsblock to
> be replaced or augmented in future without _too_ much pain.
> 
>  
> > Yup - I've been on the painful end of those dark corner cases several
> > times in the last few months.
> > 
> > It's also worth pointing out that mpage_readpages() already works on
> > an extent basis - it overloads bufferheads to provide a "map_bh" that
> > can point to a range of blocks in the same state. The code then iterates
> > the map_bh range a page at a time building bios (i.e. not even using
> > buffer heads) from that map..
> 
> One issue I have with the current nobh and mpage stuff is that it
> requires multiple calls into get_block (first to prepare write, then
> to writepage), it doesn't allow filesystems to attach resources
> required for writeout at prepare_write time, and it doesn't play nicely
> with buffers in general. (not to mention that nobh error handling is
> buggy).
> 
> I haven't done any mpage-like code for fsblocks yet, but I think they
> wouldn't be too much trouble, and wouldn't have any of the above
> problems...

Could be, but the fundamental issue that pages sometimes have mappings
attached and sometimes don't is still there.  The window is
smaller, but non-zero.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-27 Thread Nick Piggin
On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > Lets look at a typical example of how IO actually gets done today,
> > starting with sys_write():
> > 
> > sys_write(file, buffer, 1MB)
> > for each page:
> > prepare_write()
> > allocate contiguous chunks of disk
> > attach buffers
> > copy_from_user()
> > commit_write()
> > dirty buffers
> > 
> > pdflush:
> > writepages()
> > find pages with contiguous chunks of disk
> > build and submit large bios
> > 
> > So, we replace prepare_write and commit_write with an extent based api,
> > but we keep the dirty each buffer part.  writepages has to turn that
> > back into extents (bio sized), and the result is completely full of dark
> > dark corner cases.

That's true but I don't think an extent data structure means we can
become too far divorced from the pagecache or the native block size
-- what will end up happening is that often we'll need "stuff" to map
between all those as well, even if it is only at IO-time.

But the point is taken, and I do believe that at least for APIs, extent
based seems like the best way to go. And that should allow fsblock to
be replaced or augmented in future without _too_ much pain.

 
> Yup - I've been on the painful end of those dark corner cases several
> times in the last few months.
> 
> It's also worth pointing out that mpage_readpages() already works on
> an extent basis - it overloads bufferheads to provide a "map_bh" that
> can point to a range of blocks in the same state. The code then iterates
> the map_bh range a page at a time building bios (i.e. not even using
> buffer heads) from that map..

One issue I have with the current nobh and mpage stuff is that it
requires multiple calls into get_block (first to prepare write, then
to writepage), it doesn't allow filesystems to attach resources
required for writeout at prepare_write time, and it doesn't play nicely
with buffers in general. (not to mention that nobh error handling is
buggy).

I haven't done any mpage-like code for fsblocks yet, but I think they
wouldn't be too much trouble, and wouldn't have any of the above
problems...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-27 Thread David Chinner
On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> > On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > > 
> > > [ ... fsblocks vs extent range mapping ]
> > > 
> > > > iomaps can double as range locks simply because iomaps are
> > > > expressions of ranges within the file.  Seeing as you can only
> > > > access a given range exclusively to modify it, inserting an empty
> > > > mapping into the tree as a range lock gives an effective method of
> > > > allowing safe parallel reads, writes and allocation into the file.
> > > > 
> > > > The fsblocks and the vm page cache interface cannot be used to
> > > > facilitate this because a radix tree is the wrong type of tree to
> > > > store this information in. A sparse, range based tree (e.g. btree)
> > > > is the right way to do this and it matches very well with
> > > > a range based API.
> > > 
> > > I'm really not against the extent based page cache idea, but I kind of
> > > assumed it would be too big a change for this kind of generic setup.  At
> > > any rate, if we'd like to do it, it may be best to ditch the idea of
> > > "attach mapping information to a page", and switch to "lookup mapping
> > > information and range locking for a page".
> > 
> > Well the get_block equivalent API is extent based one now, and I'll
> > look at what is required in making map_fsblock a more generic call
> > that could be used for an extent-based scheme.
> > 
> > An extent based thing IMO really isn't appropriate as the main generic
> > layer here though. If it is really useful and popular, then it could
> > be turned into generic code and sit along side fsblock or underneath
> > fsblock...
> 
> Lets look at a typical example of how IO actually gets done today,
> starting with sys_write():
> 
> sys_write(file, buffer, 1MB)
> for each page:
> prepare_write()
>   allocate contiguous chunks of disk
> attach buffers
> copy_from_user()
> commit_write()
> dirty buffers
> 
> pdflush:
> writepages()
> find pages with contiguous chunks of disk
>   build and submit large bios
> 
> So, we replace prepare_write and commit_write with an extent based api,
> but we keep the dirty each buffer part.  writepages has to turn that
> back into extents (bio sized), and the result is completely full of dark
> dark corner cases.

Yup - I've been on the painful end of those dark corner cases several
times in the last few months.

It's also worth pointing out that mpage_readpages() already works on
an extent basis - it overloads bufferheads to provide a "map_bh" that
can point to a range of blocks in the same state. The code then iterates
the map_bh range a page at a time building bios (i.e. not even using
buffer heads) from that map..
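
Very roughly, the pattern is something like this toy model (invented
names, userspace only; the real thing in fs/mpage.c is considerably
hairier):

    #include <stdio.h>

    /* One "map" call describes a contiguous run of pages; we then walk
     * that run a page at a time, batching the pages into a single IO
     * instead of doing per-block buffer head work. */
    struct run {
            unsigned long long disk_block;  /* start block on disk */
            unsigned long nr_pages;         /* pages covered by the run */
    };

    /* Pretend mapping: every lookup maps 32 contiguous pages. */
    static struct run map_run(unsigned long long page_index)
    {
            struct run r;

            r.disk_block = 8 * page_index;  /* 4K pages, 512-byte blocks */
            r.nr_pages = 32;
            return r;
    }

    int main(void)
    {
            unsigned long long page = 0, last_page = 100;

            while (page < last_page) {
                    struct run r = map_run(page);
                    unsigned long batched = 0;

                    /* Iterate the mapped range a page at a time... */
                    while (batched < r.nr_pages &&
                           page + batched < last_page)
                            batched++;  /* ...adding each page to one IO */

                    printf("one bio: %lu pages at disk block %llu\n",
                           batched, r.disk_block);
                    page += batched;
            }
            return 0;
    }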

> I do think fsblocks is a nice cleanup on its own, but Dave has a good
> point that it makes sense to look for ways generalize things even more.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-27 Thread Anton Altaparmakov

On 27 Jun 2007, at 12:50, Chris Mason wrote:

On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:

On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:

On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:


[ ... fsblocks vs extent range mapping ]


iomaps can double as range locks simply because iomaps are
expressions of ranges within the file.  Seeing as you can only
access a given range exclusively to modify it, inserting an empty
mapping into the tree as a range lock gives an effective method of
allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to
facilitate this because a radix tree is the wrong type of tree to
store this information in. A sparse, range based tree (e.g. btree)
is the right way to do this and it matches very well with
a range based API.


I'm really not against the extent based page cache idea, but I  
kind of
assumed it would be too big a change for this kind of generic  
setup.  At

any rate, if we'd like to do it, it may be best to ditch the idea of
"attach mapping information to a page", and switch to "lookup  
mapping

information and range locking for a page".


Well the get_block equivalent API is extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.

An extent based thing IMO really isn't appropriate as the main  
generic

layer here though. If it is really useful and popular, then it could
be turned into generic code and sit along side fsblock or underneath
fsblock...


Lets look at a typical example of how IO actually gets done today,
starting with sys_write():


Yes, this is very inefficient, which is one of the reasons I don't use
the generic file write helpers in NTFS.  The other reasons are, first,
that supporting logical block sizes larger than PAGE_CACHE_SIZE becomes
a pain if it is not done this way when the write targets a hole: that
requires all pages in the hole to be locked simultaneously, which would
mean dropping the page lock to acquire the others that are of lower
page index and then re-taking the page lock, which is horrible - much
better to lock them all at once from the outset.  And second, that in
NTFS there is such a thing as the initialized size of an attribute,
which basically states "anything past this byte offset must be returned
as 0 on read", i.e. it does not have to be read from disk at all.  On a
write beyond the initialized_size you have to zero on disk everything
between the old initialized size and the start of the write before you
begin writing, and certainly before you update the initialized_size,
otherwise a concurrent read would see random old data from the disk.


For NTFS this effectively becomes:


sys_write(file, buffer, 1MB)

    allocate space for the entire 1MB write

    if write offset past the initialized_size, zero out on disk starting
    at initialized_size up to the start offset for the write and update
    the initialized size to be equal to the start offset of the write

    do {
            if (current position is in a hole and the NTFS logical
                block size is > PAGE_CACHE_SIZE) {
                    work on (NTFS logical block size / PAGE_CACHE_SIZE)
                    pages in one go;
                    do_pages = vol->cluster_size / PAGE_CACHE_SIZE;
            } else {
                    work on only one page;
                    do_pages = 1;
            }
            fault in for read (do_pages * PAGE_CACHE_SIZE) bytes worth
            of source pages
            grab do_pages worth of pages
            prepare_write - attach buffers to grabbed pages
            copy data from source to grabbed pages
            commit_write the copied pages by dirtying their buffers
    } while (data left to write);
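
Just to spell the initialized_size rule out in code form, a simplified
sketch (not the actual NTFS code; the names are invented):

    /* Work out the zero-fill range for a write of 'len' bytes at 'pos'
     * when 'init_size' lags behind.  The zeroing (and the write itself)
     * must hit disk before init_size is raised, or a concurrent reader
     * could see stale on-disk data. */
    struct write_plan {
            unsigned long long zero_start;  /* range to zero on disk */
            unsigned long long zero_len;    /* 0 if no zeroing needed */
            unsigned long long new_init_size;
    };

    static struct write_plan plan_write(unsigned long long init_size,
                                        unsigned long long pos,
                                        unsigned long long len)
    {
            struct write_plan p = { 0, 0, init_size };

            if (pos > init_size) {
                    p.zero_start = init_size;
                    p.zero_len = pos - init_size;
            }
            if (pos + len > init_size)
                    p.new_init_size = pos + len;  /* update only after IO */
            return p;
    }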

The allocation in advance is a huge win both in terms of avoiding  
fragmentation (NTFS still uses a very simple/stupid allocator so you  
get a lot of fragmentation if two processes write to different files  
simultaneously and do so in small chunks) and in terms of performance.


I have wondered whether I should perhaps turn on the "multi page" stuff
for all writes, rather than just for ones that go into a hole where the
logical block size is greater than PAGE_CACHE_SIZE, as that might
improve performance even further, but I haven't had the time/inclination
to experiment...


And I have also wondered whether to go direct to bio/whole pages at once
instead of bothering with dirtying each buffer.  But the buffers (which
are always 512 bytes on NTFS) allow me to easily support dirtying
smaller parts of the page, which is desired at least on volumes with a
logical block size < PAGE_CACHE_SIZE, as different bits of the page
could then reside at completely different locations on disk, so writing
out unneeded bits of the page could result in a lot of wasted disk head
seek time.
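
As a toy illustration of that point (hypothetical, nothing like the
real NTFS code): a small per-page bitmap is enough to remember which
512-byte sectors of a page are dirty, so writeback can skip the clean
ones:

    #include <stdint.h>

    #define PAGE_CACHE_SIZE  4096
    #define SECTOR_SIZE      512
    #define SECTORS_PER_PAGE (PAGE_CACHE_SIZE / SECTOR_SIZE)  /* 8 */

    /* One bit per 512-byte sector in the page. */
    typedef uint8_t sector_dirty_map;

    static void mark_dirty(sector_dirty_map *map, unsigned offset,
                           unsigned len)
    {
            unsigned first = offset / SECTOR_SIZE;
            unsigned last = (offset + len - 1) / SECTOR_SIZE;

            for (unsigned s = first; s <= last && s < SECTORS_PER_PAGE; s++)
                    *map |= 1u << s;
    }

    /* Writeback only submits sectors whose bit is set. */
    static int sector_is_dirty(sector_dirty_map map, unsigned sector)
    {
            return (map >> sector) & 1u;
    }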


Best regards,

Anton



Re: [RFC] fsblock

2007-06-27 Thread Kyle Moffett

On Jun 26, 2007, at 07:14:14, Nick Piggin wrote:

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
Can we call it a block mapping layer or something like that? e.g.  
struct blkmap?


I'm not fixed on fsblock, but blkmap doesn't grab me either. It is  
a map from the pagecache to the block layer, but blkmap sounds like  
it is a map from the block to somewhere.


fsblkmap ;)


vmblock? pgblock?

Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-27 Thread Chris Mason
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > 
> > [ ... fsblocks vs extent range mapping ]
> > 
> > > iomaps can double as range locks simply because iomaps are
> > > expressions of ranges within the file.  Seeing as you can only
> > > access a given range exclusively to modify it, inserting an empty
> > > mapping into the tree as a range lock gives an effective method of
> > > allowing safe parallel reads, writes and allocation into the file.
> > > 
> > > The fsblocks and the vm page cache interface cannot be used to
> > > facilitate this because a radix tree is the wrong type of tree to
> > > store this information in. A sparse, range based tree (e.g. btree)
> > > is the right way to do this and it matches very well with
> > > a range based API.
> > 
> > I'm really not against the extent based page cache idea, but I kind of
> > assumed it would be too big a change for this kind of generic setup.  At
> > any rate, if we'd like to do it, it may be best to ditch the idea of
> > "attach mapping information to a page", and switch to "lookup mapping
> > information and range locking for a page".
> 
> Well the get_block equivalent API is extent based one now, and I'll
> look at what is required in making map_fsblock a more generic call
> that could be used for an extent-based scheme.
> 
> An extent based thing IMO really isn't appropriate as the main generic
> layer here though. If it is really useful and popular, then it could
> be turned into generic code and sit along side fsblock or underneath
> fsblock...

Let's look at a typical example of how IO actually gets done today,
starting with sys_write():

sys_write(file, buffer, 1MB)
for each page:
prepare_write()
allocate contiguous chunks of disk
attach buffers
copy_from_user()
commit_write()
dirty buffers

pdflush:
writepages()
find pages with contiguous chunks of disk
build and submit large bios

So, we replace prepare_write and commit_write with an extent based api,
but we keep the dirty each buffer part.  writepages has to turn that
back into extents (bio sized), and the result is completely full of dark
dark corner cases.

I do think fsblocks is a nice cleanup on its own, but Dave has a good
point that it makes sense to look for ways to generalize things even more.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-27 Thread David Chinner
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> I think using fsblock to drive the IO and keep the pagecache flags
> uptodate and using a btree in the filesystem to manage extents of block
> allocations wouldn't be a bad idea though. Do any filesystems actually
> do this?

Yes. XFS. But we still need to hold state in buffer heads (BH_delay,
BH_unwritten) that is needed to determine what type of
allocation/extent conversion is necessary during writeback. i.e.
what we originally mapped the page as during the ->prepare_write
call.
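
A toy model of that distinction (the two flag names are from the
paragraph above; the rest is invented and is not the XFS code):

    enum block_state {
            BLK_MAPPED,     /* real blocks already allocated */
            BLK_DELALLOC,   /* space reserved, allocation delayed (BH_delay) */
            BLK_UNWRITTEN,  /* allocated but never written (BH_unwritten) */
    };

    /* Writeback needs the state recorded at ->prepare_write time to
     * know which conversion to perform. */
    static const char *writeback_action(enum block_state s)
    {
            switch (s) {
            case BLK_DELALLOC:
                    return "allocate real blocks, then write";
            case BLK_UNWRITTEN:
                    return "write, then convert the extent to written";
            default:
                    return "just write";
            }
    }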

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-26 Thread Nick Piggin
On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> 
> [ ... fsblocks vs extent range mapping ]
> 
> > iomaps can double as range locks simply because iomaps are
> > expressions of ranges within the file.  Seeing as you can only
> > access a given range exclusively to modify it, inserting an empty
> > mapping into the tree as a range lock gives an effective method of
> > allowing safe parallel reads, writes and allocation into the file.
> > 
> > The fsblocks and the vm page cache interface cannot be used to
> > facilitate this because a radix tree is the wrong type of tree to
> > store this information in. A sparse, range based tree (e.g. btree)
> > is the right way to do this and it matches very well with
> > a range based API.
> 
> I'm really not against the extent based page cache idea, but I kind of
> assumed it would be too big a change for this kind of generic setup.  At
> any rate, if we'd like to do it, it may be best to ditch the idea of
> "attach mapping information to a page", and switch to "lookup mapping
> information and range locking for a page".

Well the get_block equivalent API is an extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.

An extent based thing IMO really isn't appropriate as the main generic
layer here though. If it is really useful and popular, then it could
be turned into generic code and sit alongside fsblock or underneath
fsblock...

It definitely isn't trivial to drive the IO directly from something
like that which doesn't correspond to the filesystem block size:
splitting parts of your extent tree when things go dirty or uptodate or
partially under IO, etc., and joining things back up again when they
are mergeable. Not that it would be impossible, but it would be a lot
more heavyweight than fsblock.
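
For example, just to show the bookkeeping that implies (a minimal
sketch with invented types): dirtying a sub-range of a clean extent
means carving it into up to three pieces, which then have to be merged
back when their states match again:

    #include <stdbool.h>

    struct ext {
            unsigned long long start, len;  /* in blocks */
            bool dirty;
    };

    /* Split 'e' so that [dstart, dstart + dlen) can carry its own dirty
     * state.  Returns how many of out[0..2] were produced; the caller
     * is assumed to have checked that the range lies inside 'e'. */
    static int split_for_dirty(struct ext e, unsigned long long dstart,
                               unsigned long long dlen, struct ext out[3])
    {
            int n = 0;

            if (dstart > e.start)
                    out[n++] = (struct ext){ e.start, dstart - e.start,
                                             e.dirty };
            out[n++] = (struct ext){ dstart, dlen, true };
            if (dstart + dlen < e.start + e.len)
                    out[n++] = (struct ext){ dstart + dlen,
                                             e.start + e.len -
                                             (dstart + dlen),
                                             e.dirty };
            return n;
    }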

I think using fsblock to drive the IO and keep the pagecache flags
uptodate and using a btree in the filesystem to manage extents of block
allocations wouldn't be a bad idea though. Do any filesystems actually
do this?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-26 Thread Chris Mason
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

> iomaps can double as range locks simply because iomaps are
> expressions of ranges within the file.  Seeing as you can only
> access a given range exclusively to modify it, inserting an empty
> mapping into the tree as a range lock gives an effective method of
> allowing safe parallel reads, writes and allocation into the file.
> 
> The fsblocks and the vm page cache interface cannot be used to
> facilitate this because a radix tree is the wrong type of tree to
> store this information in. A sparse, range based tree (e.g. btree)
> is the right way to do this and it matches very well with
> a range based API.

I'm really not against the extent based page cache idea, but I kind of
assumed it would be too big a change for this kind of generic setup.  At
any rate, if we'd like to do it, it may be best to ditch the idea of
"attach mapping information to a page", and switch to "lookup mapping
information and range locking for a page".

A btree could be used to hold the range mapping and locking, but it
could just as easily be a radix tree where you do a gang lookup for the
end of the range (the same way my placeholder patch did).  It'll still
find intersecting range locks but is much faster for random
insertion/deletion than the btrees.
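
The question the lookup has to answer is just "does any existing
placeholder intersect [start, start + len)?", e.g. (a userspace
stand-in using a plain scan; the kernel radix tree gang lookup
obviously looks nothing like this):

    #include <stdbool.h>
    #include <stddef.h>

    struct range {
            unsigned long long start, len;
    };

    /* Return true if [start, start + len) intersects any held range. */
    static bool range_is_locked(const struct range *held, size_t n,
                                unsigned long long start,
                                unsigned long long len)
    {
            unsigned long long end = start + len;

            for (size_t i = 0; i < n; i++) {
                    unsigned long long hend = held[i].start + held[i].len;

                    if (held[i].start < end && start < hend)
                            return true;
            }
            return false;
    }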

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-26 Thread Nick Piggin
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > >
> > >Realistically, this is not about "filesystem blocks", this is
> > >about file offset to disk blocks. i.e. it's a mapping.
> > 
> > Yeah, fsblock ~= the layer between the fs and the block layers.
> 
> Sure, but it's not a "filesystem block" which is what you are
> calling it. IMO, it's overloading a well known term with something
> different, and that's just confusing.

Well it is the metadata used to manage the filesystem block for the
given bit of pagecache (even if the block is not actually allocated
or even a hole, it is deemed to be so by the filesystem).

> Can we call it a block mapping layer or something like that?
> e.g. struct blkmap?

I'm not fixed on fsblock, but blkmap doesn't grab me either. It
is a map from the pagecache to the block layer, but blkmap sounds
like it is a map from the block to somewhere.

fsblkmap ;)

 
> > >> Probably better would be to
> > >> move towards offset,length rather than page based fs APIs where 
> > >> everything
> > >> can be batched up nicely and this sort of non-trivial locking can be more
> > >> optimal.
> > >
> > >If we are going to turn over the API completely like this, can
> > >we seriously look at moving to this sort of interface at the same
> > >time?
> > 
> > Yeah we can move to anything. But note that fsblock is perfectly
> > happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
> > at >.
> 
> Extent based block mapping is entirely independent of block size.
> Please don't confuse the two

I'm not, but it seemed like you were confused that fsblock is tied
to changing the aops APIs. It is not, but they can be changed to
give improvements in a good number of areas (*including* better
large block support).


> > >With special "disk blocks" for indicating delayed allocation
> > >blocks (-1) and unwritten extents (-2). Worst case we end up
> > >with is an iomap per filesystem block.
> > 
> > I was thinking about doing an extent based scheme, but it has
> > some issues as well. Block based is light weight and simple, it
> > aligns nicely with the pagecache structures.
> 
> Yes. Block based is simple, but has flexibility and scalability
> problems.  e.g the number of fsblocks that are required to map large
> files.  It's not uncommon for use to have millions of bufferheads
> lying around after writing a single large file that only has a
> handful of extents. That's 5-6 orders of magnitude difference there
> in memory usage and as memory and disk sizes get larger, this will
> become more of a problem

I guess fsblock is 3 times smaller, and you would probably have 16
times fewer of them for such a filesystem (given a 4K page size), which
still leaves a few orders of magnitude ;)

However, fsblock has this nice feature where it can drop the blocks
when the last reference goes away, so you really only have fsblocks
around for dirty or currently-being-read blocks...

But you give me a good idea: I'll gear the filesystem-side APIs to
be more extent based as well (eg. fsblock's get_block equivalent).
That way it should be much easier to change over to such extents in
future or even have an extent based representation sitting in front
of the fsblock one and acting as a high density cache in your above
situation.
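
Something along these lines, as a completely hypothetical sketch (not
the actual fsblock interface), is what an extent based filesystem-side
API could look like: the filesystem gets asked once per range rather
than once per block:

    /* Hypothetical extent-based get_block-style callback: one call can
     * map many blocks at once. */
    struct block_extent {
            unsigned long long disk_block;  /* start block on disk */
            unsigned long nr_blocks;        /* blocks this extent covers */
            unsigned flags;                 /* hole, delalloc, unwritten, ... */
    };

    struct example_inode;                   /* opaque here */

    typedef int (*map_extent_fn)(struct example_inode *inode,
                                 unsigned long long file_block,
                                 unsigned long max_blocks,
                                 struct block_extent *out,
                                 int create);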


> > >If we allow iomaps to be split and combined along with range
> > >locking, we can parallelise read and write access to each
> > >file on an iomap basis, etc. There's plenty of goodness that
> > >comes from indexing by range
> > 
> > Some operations AFAIKS will always need to be per-page (eg. in
> > the core VM it wants to lock a single page to fault it in, or
> > wait for a single page to writeout etc). So I didn't see a huge
> > gain in a one-lock-per-extent type arrangement.
> 
> For VM operations, no, but they would continue to be locked on a
> per-page basis. However, we can do filesystem block operations
> without needing to hold page locks. e.g. space reservation and
> allocation..

You could do that without holding the page locks as well AFAIKS.
Actually again it might be a bit troublesome with the current
aops APIs, but I don't think fsblock stands in your way there
either.
 
> > If you're worried about parallelisability, then I don't see what
> > iomaps give you that buffer heads or fsblocks do not? In fact
> > they would be worse because there are fewer of them? :)
> 
> No, that's wrong. I'm not talking about VM parallelisation,
> I want to be able to support multiple writers to a single file.
> i.e. removing the i_mutex restriction on writes. To do that
> you've got to have a range locking scheme integrated into
> the block map for the file so that concurrent lookups and
> allocations don't trip over each other.
 
> iomaps can double as range locks simply because iomaps are
> expressions of ranges within the file.  Seeing as you can only
> access a given range exclusively to modify it, 

Re: [RFC] fsblock

2007-06-26 Thread David Chinner
On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> David Chinner wrote:
> >On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> >>I'm announcing "fsblock" now because it is quite intrusive and so I'd
> >>like to get some thoughts about significantly changing this core part
> >>of the kernel.
> >
> >Can you rename it to something other than shorthand for
> >"filesystem block"? e.g. When you say:
> >
> >>- In line with the above item, filesystem block allocation is performed
> >
> >What are we actually talking about here? filesystem block allocation
> >is something a filesystem does to allocate blocks on disk, not
> >allocate a mapping structure in memory.
> >
> >Realistically, this is not about "filesystem blocks", this is
> >about file offset to disk blocks. i.e. it's a mapping.
> 
> Yeah, fsblock ~= the layer between the fs and the block layers.

Sure, but it's not a "filesystem block" which is what you are
calling it. IMO, it's overloading a well known term with something
different, and that's just confusing.

Can we call it a block mapping layer or something like that?
e.g. struct blkmap?

> >> Probably better would be to
> >> move towards offset,length rather than page based fs APIs where 
> >> everything
> >> can be batched up nicely and this sort of non-trivial locking can be more
> >> optimal.
> >
> >If we are going to turn over the API completely like this, can
> >we seriously look at moving to this sort of interface at the same
> >time?
> 
> Yeah we can move to anything. But note that fsblock is perfectly
> happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
> at >.

Extent based block mapping is entirely independent of block size.
Please don't confuse the two

> >With a offset/len interface, we can start to track contiguous
> >ranges of blocks rather than persisting with a structure per
> >filesystem block. If you want to save memory, that's where
> >we need to go.
> >
> >XFS uses "iomaps" for this purpose - it's basically:
> >
> > - start offset into file
> > - start block on disk
> > - length of mapping
> > - state 
> >
> >With special "disk blocks" for indicating delayed allocation
> >blocks (-1) and unwritten extents (-2). Worst case we end up
> >with is an iomap per filesystem block.
> 
> I was thinking about doing an extent based scheme, but it has
> some issues as well. Block based is light weight and simple, it
> aligns nicely with the pagecache structures.

Yes. Block based is simple, but has flexibility and scalability
problems, e.g. the number of fsblocks that are required to map large
files.  It's not uncommon for us to have millions of bufferheads
lying around after writing a single large file that only has a
handful of extents. That's 5-6 orders of magnitude difference there
in memory usage, and as memory and disk sizes get larger, this will
become more of a problem.
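
Roughly like this (the field names are taken from the iomap description
quoted above; the rest is assumed and it is not the XFS code): one small
record per extent rather than one structure per block, so a file with a
handful of extents needs a handful of records no matter how large it is:

    /* Sketch of an iomap-style record. */
    #define IOMAP_DELALLOC  (-1LL)  /* delayed allocation */
    #define IOMAP_UNWRITTEN (-2LL)  /* unwritten extent */

    struct iomap_sketch {
            unsigned long long offset;   /* start offset into file */
            long long disk_block;        /* start block on disk, or special */
            unsigned long long length;   /* length of mapping */
            unsigned state;
    };

    /* A 1GB file in 10 extents: 10 such records, versus ~262144
     * per-block structures with 4K blocks. */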

> >If we allow iomaps to be split and combined along with range
> >locking, we can parallelise read and write access to each
> >file on an iomap basis, etc. There's plenty of goodness that
> >comes from indexing by range
> 
> Some operations AFAIKS will always need to be per-page (eg. in
> the core VM it wants to lock a single page to fault it in, or
> wait for a single page to writeout etc). So I didn't see a huge
> gain in a one-lock-per-extent type arrangement.

For VM operations, no, but they would continue to be locked on a
per-page basis. However, we can do filesystem block operations
without needing to hold page locks. e.g. space reservation and
allocation..

> If you're worried about parallelisability, then I don't see what
> iomaps give you that buffer heads or fsblocks do not? In fact
> they would be worse because there are fewer of them? :)

No, that's wrong. I'm not talking about VM parallelisation,
I want to be able to support multiple writers to a single file.
i.e. removing the i_mutex restriction on writes. To do that
you've got to have a range locking scheme integrated into
the block map for the file so that concurrent lookups and
allocations don't trip over each other.

iomaps can double as range locks simply because iomaps are
expressions of ranges within the file.  Seeing as you can only
access a given range exclusively to modify it, inserting an empty
mapping into the tree as a range lock gives an effective method of
allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to
facilitate this because a radix tree is the wrong type of tree to
store this information in. A sparse, range based tree (e.g. btree)
is the right way to do this and it matches very well with
a range based API.

None of what I'm talking about requires any changes to the existing
page cache or VM address space. I'm proposing that we should treat
the block mapping as an address space in its own right, i.e. perhaps
the struct page should not have block mapping objects attached to it
at all.


Re: [RFC] fsblock

2007-06-26 Thread David Chinner
On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
 David Chinner wrote:
 On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
 I'm announcing fsblock now because it is quite intrusive and so I'd
 like to get some thoughts about significantly changing this core part
 of the kernel.
 
 Can you rename it to something other than shorthand for
 filesystem block? e.g. When you say:
 
 - In line with the above item, filesystem block allocation is performed
 
 What are we actually talking aout here? filesystem block allocation
 is something a filesystem does to allocate blocks on disk, not
 allocate a mapping structure in memory.
 
 Realistically, this is not about filesystem blocks, this is
 about file offset to disk blocks. i.e. it's a mapping.
 
 Yeah, fsblock ~= the layer between the fs and the block layers.

Sure, but it's not a filesystem block which is what you are
calling it. IMO, it's overloading a well known term with something
different, and that's just confusing.

Can we call it a block mapping layer or something like that?
e.g. struct blkmap?

  Probably better would be to
  move towards offset,length rather than page based fs APIs where 
  everything
  can be batched up nicely and this sort of non-trivial locking can be more
  optimal.
 
 If we are going to turn over the API completely like this, can
 we seriously look at moving to this sort of interface at the same
 time?
 
 Yeah we can move to anything. But note that fsblock is perfectly
 happy with = PAGE_CACHE_SIZE blocks today, and isn't _terrible_
 at .

Extent based block mapping is entirely independent of block size.
Please don't confuse the two

 With a offset/len interface, we can start to track contiguous
 ranges of blocks rather than persisting with a structure per
 filesystem block. If you want to save memory, thet's where
 we need to go.
 
 XFS uses iomaps for this purpose - it's basically:
 
  - start offset into file
  - start block on disk
  - length of mapping
  - state 
 
 With special disk blocks for indicating delayed allocation
 blocks (-1) and unwritten extents (-2). Worst case we end up
 with is an iomap per filesystem block.
 
 I was thinking about doing an extent based scheme, but it has
 some issues as well. Block based is light weight and simple, it
 aligns nicely with the pagecache structures.

Yes. Block based is simple, but has flexibility and scalability
problems.  e.g. the number of fsblocks that are required to map large
files.  It's not uncommon for us to have millions of bufferheads
lying around after writing a single large file that only has a
handful of extents. That's 5-6 orders of magnitude difference there
in memory usage and as memory and disk sizes get larger, this will
become more of a problem

 If we allow iomaps to be split and combined along with range
 locking, we can parallelise read and write access to each
 file on an iomap basis, etc. There's plenty of goodness that
 comes from indexing by range
 
 Some operations AFAIKS will always need to be per-page (eg. in
 the core VM it wants to lock a single page to fault it in, or
 wait for a single page to writeout etc). So I didn't see a huge
 gain in a one-lock-per-extent type arrangement.

For VM operations, no, but they would continue to be locked on a
per-page basis. However, we can do filesystem block operations
without needing to hold page locks. e.g. space reservation and
allocation..

 If you're worried about parallelisability, then I don't see what
 iomaps give you that buffer heads or fsblocks do not? In fact
 they would be worse because there are fewer of them? :)

No, that's wrong. I'm not talking about VM parallelisation,
I want to be able to support multiple writers to a single file.
i.e. removing the i_mutex restriction on writes. To do that
you've got to have a range locking scheme integrated into
the block map for the file so that concurrent lookups and
allocations don't trip over each other.

iomaps can double as range locks simply because iomaps are
expressions of ranges within the file.  Seeing as you can only
access a given range exclusively to modify it, inserting an empty
mapping into the tree as a range lock gives an effective method of
allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to
facilitate this because a radix tree is the wrong type of tree to
store this information in. A sparse, range based tree (e.g. btree)
is the right way to do this and it matches very well with
a range based API.

None of what I'm talking about requires any changes to the existing
page cache or VM address space. I'm proposing that we should treat
the block mapping as an address space in its own right, i.e.
perhaps the struct page should not have block mapping objects
attached to it at all.

By separating out the block mapping from the page cache, we make the
page cache completely independent of filesystem block size, 

Re: [RFC] fsblock

2007-06-26 Thread Nick Piggin
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
 On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
  
  Realistically, this is not about filesystem blocks, this is
  about file offset to disk blocks. i.e. it's a mapping.
  
  Yeah, fsblock ~= the layer between the fs and the block layers.
 
 Sure, but it's not a filesystem block which is what you are
 calling it. IMO, it's overloading a well known term with something
 different, and that's just confusing.

Well it is the metadata used to manage the filesystem block for the
given bit of pagecache (even if the block is not actually allocated
or even a hole, it is deemed to be so by the filesystem).

 Can we call it a block mapping layer or something like that?
 e.g. struct blkmap?

I'm not fixed on fsblock, but blkmap doesn't grab me either. It
is a map from the pagecache to the block layer, but blkmap sounds
like it is a map from the block to somewhere.

fsblkmap ;)

 
   Probably better would be to
   move towards offset,length rather than page based fs APIs where everything
   can be batched up nicely and this sort of non-trivial locking can be more
   optimal.
  
  If we are going to turn over the API completely like this, can
  we seriously look at moving to this sort of interface at the same
  time?
  
  Yeah we can move to anything. But note that fsblock is perfectly
 happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
 at >.
 
 Extent based block mapping is entirely independent of block size.
 Please don't confuse the two

I'm not, but it seemed like you were confused that fsblock is tied
to changing the aops APIs. It is not, but they can be changed to
give improvements in a good number of areas (*including* better
large block support).


  With special disk blocks for indicating delayed allocation
  blocks (-1) and unwritten extents (-2). Worst case we end up
  with is an iomap per filesystem block.
  
  I was thinking about doing an extent based scheme, but it has
  some issues as well. Block based is light weight and simple, it
  aligns nicely with the pagecache structures.
 
 Yes. Block based is simple, but has flexibility and scalability
 problems.  e.g. the number of fsblocks that are required to map large
 files.  It's not uncommon for us to have millions of bufferheads
 lying around after writing a single large file that only has a
 handful of extents. That's 5-6 orders of magnitude difference there
 in memory usage and as memory and disk sizes get larger, this will
 become more of a problem

I guess fsblock is 3 times smaller and you would probably have 16
times fewer of them for such a filesystem (given a 4K page size), which
still leaves a few orders of magnitude ;)

However, fsblock has this nice feature where it can drop the blocks
when the last reference goes away, so you really only have fsblocks
around for dirty or currently-being-read blocks...

But you give me a good idea: I'll gear the filesystem-side APIs to
be more extent based as well (eg. fsblock's get_block equivalent).
That way it should be much easier to change over to such extents in
future or even have an extent based representation sitting in front
of the fsblock one and acting as a high density cache in your above
situation.
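
Purely as an illustration of what I mean by an extent-based hook (none of
these names are the real fsblock API):

#include <linux/fs.h>

struct fsblock_extent {
        loff_t   offset;        /* file offset this mapping starts at */
        sector_t blkno;         /* on-disk block, or a delalloc/unwritten marker */
        size_t   length;        /* bytes covered by this mapping */
};

/*
 * Instead of one get_block() call per block, the filesystem maps as much
 * of [offset, offset + len) as it can in one call and reports how much it
 * covered, so callers can batch allocation and IO submission.
 */
typedef int (*map_extent_fn)(struct inode *inode, loff_t offset,
                             size_t len, int create,
                             struct fsblock_extent *ext);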


  If we allow iomaps to be split and combined along with range
  locking, we can parallelise read and write access to each
  file on an iomap basis, etc. There's plenty of goodness that
  comes from indexing by range
  
  Some operations AFAIKS will always need to be per-page (eg. in
  the core VM it wants to lock a single page to fault it in, or
  wait for a single page to writeout etc). So I didn't see a huge
  gain in a one-lock-per-extent type arrangement.
 
 For VM operations, no, but they would continue to be locked on a
 per-page basis. However, we can do filesystem block operations
 without needing to hold page locks. e.g. space reservation and
 allocation..

You could do that without holding the page locks as well AFAIKS.
Actually again it might be a bit troublesome with the current
aops APIs, but I don't think fsblock stands in your way there
either.
 
  If you're worried about parallelisability, then I don't see what
  iomaps give you that buffer heads or fsblocks do not? In fact
  they would be worse because there are fewer of them? :)
 
 No, that's wrong. I'm not talking about VM parallelisation,
 I want to be able to support multiple writers to a single file.
 i.e. removing the i_mutex restriction on writes. To do that
 you've got to have a range locking scheme integrated into
 the block map for the file so that concurrent lookups and
 allocations don't trip over each other.
 
 iomaps can double as range locks simply because iomaps are
 expressions of ranges within the file.  Seeing as you can only
 access a given range exclusively to modify it, inserting an empty
 mapping into the tree as a range lock gives an effective method of
 allowing safe parallel reads, writes and allocation 

Re: [RFC] fsblock

2007-06-26 Thread Chris Mason
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
 On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

 iomaps can double as range locks simply because iomaps are
 expressions of ranges within the file.  Seeing as you can only
 access a given range exclusively to modify it, inserting an empty
 mapping into the tree as a range lock gives an effective method of
 allowing safe parallel reads, writes and allocation into the file.
 
 The fsblocks and the vm page cache interface cannot be used to
 facilitate this because a radix tree is the wrong type of tree to
 store this information in. A sparse, range based tree (e.g. btree)
 is the right way to do this and it matches very well with
 a range based API.

I'm really not against the extent based page cache idea, but I kind of
assumed it would be too big a change for this kind of generic setup.  At
any rate, if we'd like to do it, it may be best to ditch the idea of
attaching mapping information to a page, and switch to looking up mapping
information and range locking for a page.

A btree could be used to hold the range mapping and locking, but it
could just as easily be a radix tree where you do a gang lookup for the
end of the range (the same way my placeholder patch did).  It'll still
find intersecting range locks but is much faster for random
insertion/deletion than the btrees.
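
Roughly, the gang lookup trick looks like this (struct range_placeholder
and the index-by-end-offset convention are only illustrative, not the
actual placeholder patch):

#include <linux/types.h>
#include <linux/radix-tree.h>

struct range_placeholder {
        pgoff_t start;
        pgoff_t end;            /* also its index in the radix tree */
};

static struct range_placeholder *find_conflict(struct radix_tree_root *root,
                                               pgoff_t start, pgoff_t end)
{
        void *results[1];
        struct range_placeholder *p;

        /* first placeholder whose end index is >= our start */
        if (!radix_tree_gang_lookup(root, results, start, 1))
                return NULL;
        p = results[0];

        /* it intersects us only if it also begins at or before our end */
        return p->start <= end ? p : NULL;
}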

-chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-26 Thread Nick Piggin
On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
 On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
  On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
 
 [ ... fsblocks vs extent range mapping ]
 
  iomaps can double as range locks simply because iomaps are
  expressions of ranges within the file.  Seeing as you can only
  access a given range exclusively to modify it, inserting an empty
  mapping into the tree as a range lock gives an effective method of
  allowing safe parallel reads, writes and allocation into the file.
  
  The fsblocks and the vm page cache interface cannot be used to
  facilitate this because a radix tree is the wrong type of tree to
  store this information in. A sparse, range based tree (e.g. btree)
  is the right way to do this and it matches very well with
  a range based API.
 
 I'm really not against the extent based page cache idea, but I kind of
 assumed it would be too big a change for this kind of generic setup.  At
 any rate, if we'd like to do it, it may be best to ditch the idea of
 attaching mapping information to a page, and switch to looking up mapping
 information and range locking for a page.

Well the get_block equivalent API is an extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.

An extent based thing IMO really isn't appropriate as the main generic
layer here though. If it is really useful and popular, then it could
be turned into generic code and sit along side fsblock or underneath
fsblock...

It definitely isn't trivial to drive the IO directly from something
like that which doesn't correspond to filesystem block size. Splitting
parts of your extent tree when things go dirty or uptodate or partially
under IO, etc., and joining things back up again when they are mergeable.
Not that it would be impossible, but it would be a lot more heavyweight
than fsblock.

I think using fsblock to drive the IO and keep the pagecache flags
uptodate and using a btree in the filesystem to manage extents of block
allocations wouldn't be a bad idea though. Do any filesystems actually
do this?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-25 Thread Nick Piggin

David Chinner wrote:

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:


I'm announcing "fsblock" now because it is quite intrusive and so I'd
like to get some thoughts about significantly changing this core part
of the kernel.



Can you rename it to something other than shorthand for
"filesystem block"? e.g. When you say:



- In line with the above item, filesystem block allocation is performed



What are we actually talking about here? filesystem block allocation
is something a filesystem does to allocate blocks on disk, not
allocate a mapping structure in memory.

Realistically, this is not about "filesystem blocks", this is
about file offset to disk blocks. i.e. it's a mapping.


Yeah, fsblock ~= the layer between the fs and the block layers.
But don't take the name too literally, like a struct page isn't
actually a page of memory ;)



 Probably better would be to
 move towards offset,length rather than page based fs APIs where everything
 can be batched up nicely and this sort of non-trivial locking can be more
 optimal.



If we are going to turn over the API completely like this, can
we seriously look at moving to this sort of interface at the same
time?


Yeah we can move to anything. But note that fsblock is perfectly
happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
at >.



With an offset/len interface, we can start to track contiguous
ranges of blocks rather than persisting with a structure per
filesystem block. If you want to save memory, that's where
we need to go.

XFS uses "iomaps" for this purpose - it's basically:

- start offset into file
- start block on disk
- length of mapping
	- state 


With special "disk blocks" for indicating delayed allocation
blocks (-1) and unwritten extents (-2). Worst case we end up
with is an iomap per filesystem block.


I was thinking about doing an extent based scheme, but it has
some issues as well. Block based is light weight and simple, it
aligns nicely with the pagecache structures.



If we allow iomaps to be split and combined along with range
locking, we can parallelise read and write access to each
file on an iomap basis, etc. There's plenty of goodness that
comes from indexing by range


Some operations AFAIKS will always need to be per-page (eg. in
the core VM it wants to lock a single page to fault it in, or
wait for a single page to writeout etc). So I didn't see a huge
gain in a one-lock-per-extent type arrangement.

If you're worried about parallelisability, then I don't see what
iomaps give you that buffer heads or fsblocks do not? In fact
they would be worse because there are fewer of them? :)

But remember that once the filesystems have accessor APIs and
can handle multiple pages per fsblock, that would already be
most of the work done for the fs and the mm to go to an extent
based representation.



FWIW, I really see little point in making all the filesystems
work with fsblocks if the plan is to change the API again in
a major way a year down the track. Let's get all the changes
we think are necessary in one basket first, and then work out
a coherent plan to implement them ;)


The aops API changes and the fsblock layer are kind of two
separate things. I'm slowly implementing things as I go (eg.
see perform_write aop, which is exactly the offset,length
based API that I'm talking about).

fsblocks can be implemented on the old or the new APIs. New
APIs won't invalidate work to convert a filesystem to fsblocks.



- Large block support. I can mount and run an 8K block size minix3 fs on
 my 4K page system and it didn't require anything special in the fs. We
 can go up to about 32MB blocks now, and gigabyte+ blocks would only
 require  one more bit in the fsblock flags. fsblock_superpage blocks
 are > PAGE_CACHE_SIZE, midpage ==, and subpage <.



My 2c worth - this is a damn complex way of introducing large block
size support. It has all the problems I pointed out that it would
have (locking issues, vmap overhead, every filesystem needs
major changes and it's not very efficient) and it's going to take
quite some time to stabilise.


What locking issues? It locks pages in pagecache offset ascending
order, which already has precedent and is really the only sane way
to do it so it's not like it precludes other possible sane lock
orderings.
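
For illustration, the ordering amounts to something like this hypothetical
helper (not the actual fsblock code; missing pages and refcounting are
glossed over):

#include <linux/pagemap.h>

static void lock_block_pages(struct address_space *mapping,
                             pgoff_t first, int nr_pages)
{
        pgoff_t idx;

        /* Always take the page locks in ascending ->index order, so two
         * tasks working on overlapping superpage blocks cannot deadlock. */
        for (idx = first; idx < first + nr_pages; idx++) {
                struct page *page = find_get_page(mapping, idx);
                lock_page(page);
        }
        /* unlock (in any order) once the block operation is done */
}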

vmap overhead is an issue, however I did it mainly for ease of
conversion. I guess things like superblocks and such would make
use of it happily. Most other things should be able to be
implemented with page based helpers (just a couple of bitops
helpers would pretty much cover minix). If it is still a problem,
then I can implement a proper vmap cache.
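
The sort of helper I mean is tiny, e.g. (name invented, and it assumes the
caller has already worked out which page of the block holds the bit):

#include <linux/highmem.h>
#include <linux/bitops.h>

static inline int fsblock_test_and_set_bit(struct page *page, unsigned int bit)
{
        void *kaddr = kmap_atomic(page, KM_USER0);
        int old = __test_and_set_bit(bit, kaddr);

        kunmap_atomic(kaddr, KM_USER0);
        return old;
}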

But the major changes in the filesystem are not for vmaps, but for
page accessors. As I said, this allows blkdev to move out of
lowmem and also closes CPU cache coherency problems. (as well as
not having to carry around a vmem pointer of course).
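
By page accessors I mean something of this shape (made-up name; only a
sketch of the style, not the real helpers):

#include <linux/highmem.h>
#include <linux/string.h>

/* Write into a block through its backing page rather than through a
 * kernel-virtual b_data pointer, so the page may live in highmem and
 * the data cache is flushed explicitly. */
static void fsblock_memcpy_to(struct page *page, unsigned int offset,
                              const void *src, size_t len)
{
        void *kaddr = kmap_atomic(page, KM_USER0);

        memcpy(kaddr + offset, src, len);
        kunmap_atomic(kaddr, KM_USER0);
        flush_dcache_page(page);
}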



If this is the only real feature that 

Re: [RFC] fsblock

2007-06-25 Thread David Chinner
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> 
> I'm announcing "fsblock" now because it is quite intrusive and so I'd
> like to get some thoughts about significantly changing this core part
> of the kernel.

Can you rename it to something other than shorthand for
"filesystem block"? e.g. When you say:

> - In line with the above item, filesystem block allocation is performed

What are we actually talking about here? filesystem block allocation
is something a filesystem does to allocate blocks on disk, not
allocate a mapping structure in memory.

Realistically, this is not about "filesystem blocks", this is
about file offset to disk blocks. i.e. it's a mapping.

>   Probably better would be to
>   move towards offset,length rather than page based fs APIs where everything
>   can be batched up nicely and this sort of non-trivial locking can be more
>   optimal.

If we are going to turn over the API completely like this, can
we seriously look at moving to this sort of interface at the same
time?

With an offset/len interface, we can start to track contiguous
ranges of blocks rather than persisting with a structure per
filesystem block. If you want to save memory, that's where
we need to go.

XFS uses "iomaps" for this purpose - it's basically:

- start offset into file
- start block on disk
- length of mapping
- state 

With special "disk blocks" for indicating delayed allocation
blocks (-1) and unwritten extents (-2). Worst case we end up
with is an iomap per filesystem block.
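
For illustration only, that shape is roughly the following (field names
invented, not the actual struct xfs_iomap):

#include <linux/types.h>

#define IOMAP_DELALLOC   ((sector_t)-1)  /* delayed allocation, no disk block yet */
#define IOMAP_UNWRITTEN  ((sector_t)-2)  /* unwritten extent */

struct file_iomap {
        loff_t   offset;        /* start offset into the file */
        sector_t blkno;         /* start block on disk, or a sentinel above */
        size_t   length;        /* length of the mapping in bytes */
        unsigned state;         /* delalloc/unwritten/written, etc. */
};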

If we allow iomaps to be split and combined along with range
locking, we can parallelise read and write access to each
file on an iomap basis, etc. There's plenty of goodness that
comes from indexing by range

FWIW, I really see little point in making all the filesystems
work with fsblocks if the plan is to change the API again in
a major way a year down the track. Let's get all the changes
we think are necessary in one basket first, and then work out
a coherent plan to implement them ;)

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require  one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

My 2c worth - this is a damn complex way of introducing large block
size support. It has all the problems I pointed out that it would
have (locking issues, vmap overhead, every filesystem needs
major changes and it's not very efficient) and it's going to take
quite some time to stabilise.

If this is the only real feature that fsblocks are going to give us,
then I think this is a waste of time. If we are going to replace
buffer heads, lets do it with something that is completely
independent of filesystem block size and not introduce something
that is just a bufferhead on steroids.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-25 Thread Chris Mason
On Mon, Jun 25, 2007 at 04:58:48PM +1000, Nick Piggin wrote:
> 
> >Using buffer heads instead allows the FS to send file data down inside
> >the transaction code, without taking the page lock.  So, locking wrt
> >data=ordered is definitely going to be tricky.
> >
> >The best long term option may be making the locking order
> >transaction -> page lock, and change writepage to punt to some other
> >queue when it needs to start a transaction.
> 
> Yeah, that's what I would like, and I think it would come naturally
> if we move away from these "pass down a single, locked page APIs"
> in the VM, and let the filesystem do the locking and potentially
> batching of larger ranges.

Definitely.

> 
> write_begin/write_end is a step in that direction (and it helps
> OCFS and GFS quite a bit). I think there is also not much reason
> for writepage call sites to have to lock the page and clear
> the dirty bit themselves (which has seemed ugly to me).

If we keep the page mapping information with the page all the time (ie
writepage doesn't have to call get_block ever), it may be possible to
avoid sending down a locked page.  But, I don't know the delayed
allocation internals well enough to say for sure if that is true.

Either way, writepage is the easiest of the bunch because it can be
deferred.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-25 Thread Nick Piggin

Andi Kleen wrote:

Nick Piggin <[EMAIL PROTECTED]> writes:


- Structure packing. A page gets a number of buffer heads that are
 allocated in a linked list. fsblocks are allocated contiguously, so
 cacheline footprint is smaller in the above situation.



It would be interesting to test if that makes a difference for 
database benchmarks running over file systems. Databases

eat a lot of cache so in theory any cache improvements
in the kernel which often runs cache cold then should be beneficial. 


But I guess it would need at least ext2 to test; Minix is probably not
good enough.


Yeah, you are right. ext2 would be cool to port as it would be
a reasonable platform for basic performance testing and comparisons.


In general have you benchmarked the CPU overhead of old vs new code? 
e.g. when we went to BIO scalability went up, but CPU costs

of a single request also went up. It would be nice to not continue
or better reverse that trend.


At the moment there are still a few silly things in the code, such
as always calling the insert_mapping indirect function (which is
the get_block equivalent). And it does a bit more RMWing than it
should still.

Also, it always goes to the pagecache radix-tree to find fsblocks,
whereas the buffer layer has a per-CPU cache front-end... so in
that regard, fsblock is really designed with lockless pagecache
in mind, where find_get_page is much faster even in the serial case
(though fsblock shouldn't exactly be slow with the current pagecache).
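
For illustration, that lookup is basically just the following (a sketch,
not the actual fsblock code; lifetime and refcounting details are glossed
over):

#include <linux/mm.h>
#include <linux/pagemap.h>

struct fsblock;

static struct fsblock *lookup_fsblock(struct address_space *mapping,
                                      pgoff_t index)
{
        struct page *page = find_get_page(mapping, index);
        struct fsblock *fsb = NULL;

        if (page) {
                if (PagePrivate(page))
                        fsb = (struct fsblock *)page_private(page);
                page_cache_release(page);
        }
        return fsb;
}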

However, I don't think there are any fundamental performance
problems with fsblock. It even uses one less layer of locking to
do regular IO compared with buffer.c, so in theory it might even
have some advantage.

Single threaded performance of request submission is something I
will definitely try to keep optimal.



- Large block support. I can mount and run an 8K block size minix3 fs on
 my 4K page system and it didn't require anything special in the fs. We
 can go up to about 32MB blocks now, and gigabyte+ blocks would only
 require  one more bit in the fsblock flags. fsblock_superpage blocks
 are > PAGE_CACHE_SIZE, midpage ==, and subpage <.



Can it be cleanly ifdefed or optimized away?


Yeah, it pretty well stays out of the way when using <= PAGE_CACHE_SIZE
size blocks, generally just a single test and branch of an already-used
cacheline. It can be optimised away completely by commenting out
#define BLOCK_SUPERPAGE_SUPPORT from fsblock.h.



Unless the fragmentation
problem is solved, it would seem rather pointless to me. Also I personally
still think the right way to approach this is larger softpage size.


It does not suffer from a fragmentation problem. It will do scatter
gather IO if the pagecache of that block is not contiguous. My naming
may be a little confusing: fsblock_superpage (which is a function that
returns true if the given fsblock is larger than PAGE_CACHE_SIZE) is
just named as to whether the fsblock is larger than a page, rather than
having a connection to VM superpages.
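
That is, the constituent pages simply become separate bio segments, along
these lines (error handling and the surrounding state machine omitted; not
the actual submission path):

#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/pagemap.h>

static struct bio *block_bio(struct block_device *bdev, sector_t sector,
                             struct page **pages, int nr_pages)
{
        struct bio *bio = bio_alloc(GFP_NOFS, nr_pages);
        int i;

        bio->bi_bdev = bdev;
        bio->bi_sector = sector;
        /* the pages of a superpage block need not be physically contiguous */
        for (i = 0; i < nr_pages; i++)
                bio_add_page(bio, pages[i], PAGE_CACHE_SIZE, 0);
        return bio;
}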

Don't get me wrong, I think soft page size is a good idea for other
reasons as well (less page metadata and page operations), and that
8 or 16K would probably be a good sweet spot for today's x86 systems.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-25 Thread Nick Piggin

Chris Mason wrote:

On Sun, Jun 24, 2007 at 05:47:55AM +0200, Nick Piggin wrote:


My gut feeling is that there are several problem areas you haven't hit 
yet, with the new code.


I would agree with your gut :)




Without having read the code yet (light reading for monday morning ;),
ext3 and reiserfs use buffer heads for data=ordered to help them do
deadlock free writeback.  Basically they need to be able to write out
the pending data=ordered pages, potentially with the transaction lock
held (or if not held, while blocking new transactions from starting).

But, writepage, prepare_write and commit_write all need to start a
transaction with the page lock already held.  So, if the page lock were
used for data=ordered writeback, there would be a lock inversion between
the transaction lock and the page lock.


Ah, thanks for that information.



Using buffer heads instead allows the FS to send file data down inside
the transaction code, without taking the page lock.  So, locking wrt
data=ordered is definitely going to be tricky.

The best long term option may be making the locking order
transaction -> page lock, and change writepage to punt to some other
queue when it needs to start a transaction.


Yeah, that's what I would like, and I think it would come naturally
if we move away from these "pass down a single, locked page APIs"
in the VM, and let the filesystem do the locking and potentially
batching of larger ranges.

write_begin/write_end is a step in that direction (and it helps
OCFS and GFS quite a bit). I think there is also not much reason
for writepage call sites to have to lock the page and clear
the dirty bit themselves (which has seemed ugly to me).

So yes, I definitely want to move the aops API along with fsblock.

That I have tried to keep it within the existing API for the moment
is just because that makes things a bit easier...

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-24 Thread Chris Mason
On Sun, Jun 24, 2007 at 05:47:55AM +0200, Nick Piggin wrote:
> On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> 
> > >- Large block support. I can mount and run an 8K block size minix3 fs on
> > >  my 4K page system and it didn't require anything special in the fs. We
> > >  can go up to about 32MB blocks now, and gigabyte+ blocks would only
> > >  require  one more bit in the fsblock flags. fsblock_superpage blocks
> > >  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> > 
> > definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, 
> > like I've been planning.
> 
> Yeah, it wasn't the primary motivation for the rewrite, but it would
> be negligent to not even consider large blocks in such a rewrite, I
> think.

I'll join the cheering here, thanks for starting on this.

>  
> > My gut feeling is that there are several problem areas you haven't hit 
> > yet, with the new code.
> 
> I would agree with your gut :)
> 

Without having read the code yet (light reading for monday morning ;),
ext3 and reiserfs use buffer heads for data=ordered to help them do
deadlock free writeback.  Basically they need to be able to write out
the pending data=ordered pages, potentially with the transaction lock
held (or if not held, while blocking new transactions from starting).

But, writepage, prepare_write and commit_write all need to start a
transaction with the page lock already held.  So, if the page lock were
used for data=ordered writeback, there would be a lock inversion between
the transaction lock and the page lock.

Using buffer heads instead allows the FS to send file data down inside
the transaction code, without taking the page lock.  So, locking wrt
data=ordered is definitely going to be tricky.

The best long term option may be making the locking order
transaction -> page lock, and change writepage to punt to some other
queue when it needs to start a transaction.
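
Roughly like this sketch (would_need_transaction(), queue_deferred_page()
and write_out_page() are made-up helpers, not a tested patch):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Assumed helpers, not existing kernel functions: */
int would_need_transaction(struct page *page);
void queue_deferred_page(struct page *page);
int write_out_page(struct page *page, struct writeback_control *wbc);

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
        if (would_need_transaction(page)) {
                /* can't start a transaction under the page lock once the
                 * agreed order is transaction -> page lock, so defer */
                redirty_page_for_writepage(wbc, page);
                queue_deferred_page(page);
                unlock_page(page);
                return 0;
        }
        return write_out_page(page, wbc);
}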

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-24 Thread Andi Kleen
Nick Piggin <[EMAIL PROTECTED]> writes:
> 
> - Structure packing. A page gets a number of buffer heads that are
>   allocated in a linked list. fsblocks are allocated contiguously, so
>   cacheline footprint is smaller in the above situation.
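
(For concreteness, the layout difference quoted above is roughly the
following; neither struct is the real buffer_head or fsblock, they are
only an illustration.)

#include <linux/types.h>

/* buffer.c style: one object per block, separately allocated and chained
 * off page->private as a ring, so walking a page's blocks chases pointers
 * across scattered cachelines. */
struct bh_like {
        struct bh_like *b_this_page;
        sector_t        b_blocknr;
        unsigned long   b_state;
};

/* fsblock style: the blocks of a page sit in one contiguous array (a
 * single allocation), so they share cachelines. */
struct fsb_like {
        sector_t      block_nr;
        unsigned long flags;
};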

It would be interesting to test if that makes a difference for 
database benchmarks running over file systems. Databases
eat a lot of cache so in theory any cache improvements
in the kernel which often runs cache cold then should be beneficial. 

But I guess it would need at least ext2 to test; Minix is probably not
good enough.

In general have you benchmarked the CPU overhead of old vs new code? 
e.g. when we went to BIO scalability went up, but CPU costs
of a single request also went up. It would be nice to not continue
or better reverse that trend.

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require  one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Can it be cleanly ifdefed or optimized away?  Unless the fragmentation
problem is solved, it would seem rather pointless to me. Also I personally
still think the right way to approach this is larger softpage size.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-23 Thread William Lee Irwin III
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...

Long overdue. Thank you.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-23 Thread Nick Piggin
On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> Nick Piggin wrote:
> >- No deadlocks (hopefully). The buffer layer is technically deadlocky by
> >  design, because it can require memory allocations at page writeout-time.
> >  It also has one path that cannot tolerate memory allocation failures.
> >  No such problems for fsblock, which keeps fsblock metadata around for as
> >  long as a page is dirty (this still has problems vs get_user_pages, but
> >  that's going to require an audit of all get_user_pages sites. Phew).
> >
> >- In line with the above item, filesystem block allocation is performed
> >  before a page is dirtied. In the buffer layer, mmap writes can dirty a
> >  page with no backing blocks which is a problem if the filesystem is
> >  ENOSPC (patches exist for buffer.c for this).
> 
> This raises an eyebrow...  The handling of ENOSPC prior to mmap write is 
> more an ABI behavior, so I don't see how this can be fixed with internal 
> changes, yet without changing behavior currently exported to userland 
> (and thus affecting code based on such assumptions).

I believe people are happy to have it SIGBUS (which is how the VM
is already set up with page_mkwrite, and what fsblock does).
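
i.e. roughly along these lines (reserve_blocks_for_page() is an assumed
filesystem helper; this is only a sketch of the page_mkwrite approach, not
fsblock's actual code):

#include <linux/fs.h>
#include <linux/mm.h>

/* Assumed filesystem helper: allocate or reserve the backing blocks. */
int reserve_blocks_for_page(struct inode *inode, struct page *page);

static int example_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
        struct inode *inode = vma->vm_file->f_mapping->host;

        /* an error here (e.g. -ENOSPC) means the write fault gets SIGBUS
         * instead of dirtying a page that has no blocks behind it */
        return reserve_blocks_for_page(inode, page);
}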
 
 
> >- An inode's metadata must be tracked per-inode in order for fsync to
> >  work correctly. buffer contains helpers to do this for basic
> >  filesystems, but any block can be only the metadata for a single inode.
> >  This is not really correct for things like inode descriptor blocks.
> >  fsblock can track multiple inodes per block. (This is non trivial,
> >  and it may be overkill so it could be reverted to a simpler scheme
> >  like buffer).
> 
> hrm; no specific comment but this seems like an idea/area that needs to 
> be fleshed out more, by converting some of the more advanced filesystems.

Yep. It's conceptually fairly simple though, and it might be easier
than having filesystems implement their own complex syncing that finds
and syncs everything themselves.
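
Conceptually it is just a many-to-many link, e.g. (illustrative structures
only, not fsblock's real ones):

#include <linux/list.h>

struct meta_block;      /* one on-disk metadata block, e.g. an inode table block */

struct meta_link {
        struct list_head   inode_entry; /* on some inode's metadata list */
        struct meta_block *block;
};

/*
 * fsync(inode) walks the inode's list of meta_links and writes each
 * linked block; a block shared by several inodes (an inode descriptor
 * block, say) simply has one link on each of their lists.  buffer.c's
 * mark_buffer_dirty_inode() can associate a buffer with only one inode
 * at a time, which is the limitation described above.
 */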


> >- Large block support. I can mount and run an 8K block size minix3 fs on
> >  my 4K page system and it didn't require anything special in the fs. We
> >  can go up to about 32MB blocks now, and gigabyte+ blocks would only
> >  require  one more bit in the fsblock flags. fsblock_superpage blocks
> >  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> 
> definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, 
> like I've been planning.

Yeah, it wasn't the primary motivation for the rewrite, but it would
be negligent to not even consider large blocks in such a rewrite, I
think.


> >So. Comments? Is this something we want? If yes, then how would we
> >transition from buffer.c to fsblock.c?
> 
> Your work is definitely interesting, but I think it will be even more 
> interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are 
> converted.

Well minix has dir in pagecache ;) But you're completely right: ext2
will be the next step and then ext3 and things like XFS and NTFS
will be the real test. I think I could eventually get ext2 done (one
of the biggest headaches is simply just converting ->b_data accesses),
however unlikely a journalling one.

 
> My gut feeling is that there are several problem areas you haven't hit 
> yet, with the new code.

I would agree with your gut :)

 
> Also, once things are converted, the question of transitioning from 
> buffer.c will undoubtedly answer itself.  That's the way several of us 
> handle transitions:  finish all the work, then look with fresh eyes and 
> conceive a path from the current code to your enhanced code.

Yeah that would be nice. It's very difficult because of so much
filesystem code. I'd say it would be feasible to step buffer.c into
fsblock.c, however if we were to track all (or even the common)
filesystems along with that it would introduce a huge number of
kind-of-redundant changes that I don't think all fs maintainers would
have time to write (and as I said, I can't do it myself). Anyway,
let's cross that bridge if and when we come to it.

For now, the big thing that needs to be done is convert a "big" fs
and see if the results tell us that it's workable.

Thanks for the comments Jeff.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-23 Thread Jeff Garzik

Nick Piggin wrote:

- No deadlocks (hopefully). The buffer layer is technically deadlocky by
  design, because it can require memory allocations at page writeout-time.
  It also has one path that cannot tolerate memory allocation failures.
  No such problems for fsblock, which keeps fsblock metadata around for as
  long as a page is dirty (this still has problems vs get_user_pages, but
  that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed
  before a page is dirtied. In the buffer layer, mmap writes can dirty a
  page with no backing blocks which is a problem if the filesystem is
  ENOSPC (patches exist for buffer.c for this).


This raises an eyebrow...  The handling of ENOSPC prior to an mmap write is 
more of an ABI behavior, so I don't see how this can be fixed with internal 
changes alone without changing behavior currently exported to userland 
(and thus affecting code that relies on such assumptions).
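
(For concreteness, I assume the intent is a fault-time reservation in
the spirit of the existing ->page_mkwrite hook, so that an allocation
failure is reported to the writing process at fault time rather than
being discovered at writeout. Illustrative sketch only;
my_fs_reserve_blocks() is a made-up placeholder for whatever the
filesystem's reservation routine would be, not a real API:)

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* made-up helper: reserve/allocate backing blocks for [pos, pos+len) */
int my_fs_reserve_blocks(struct inode *inode, loff_t pos, unsigned int len);

static int myfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
        struct inode *inode = vma->vm_file->f_mapping->host;
        int err;

        /* reserve backing blocks before the page may be marked dirty */
        err = my_fs_reserve_blocks(inode, page_offset(page), PAGE_CACHE_SIZE);
        if (err)
                return err;     /* e.g. -ENOSPC; the fault path reports
                                   the failure to the faulting process */
        return 0;
}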




- An inode's metadata must be tracked per-inode in order for fsync to
  work correctly. buffer contains helpers to do this for basic
  filesystems, but any block can be metadata for only a single inode.
  This is not really correct for things like inode descriptor blocks.
  fsblock can track multiple inodes per block. (This is non-trivial,
  and it may be overkill, so it could be reverted to a simpler scheme
  like buffer).


hrm; no specific comment but this seems like an idea/area that needs to 
be fleshed out more, by converting some of the more advanced filesystems.




- Large block support. I can mount and run an 8K block size minix3 fs on
  my 4K page system and it didn't require anything special in the fs. We
  can go up to about 32MB blocks now, and gigabyte+ blocks would only
  require one more bit in the fsblock flags. fsblock_superpage blocks
  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.


definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, 
like I've been planning.




So. Comments? Is this something we want? If yes, then how would we
transition from buffer.c to fsblock.c?


Your work is definitely interesting, but I think it will be even more 
interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are 
converted.


My gut feeling is that there are several problem areas you haven't hit 
yet, with the new code.


Also, once things are converted, the question of transitioning from 
buffer.c will undoubtedly answer itself.  That's the way several of us 
handle transitions:  finish all the work, then look with fresh eyes and 
conceive a path from the current code to your enhanced code.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] fsblock

2007-06-23 Thread Nick Piggin
Just to clarify a few things. Don't you hate rereading a long work you
wrote? (oh, you're supposed to do that *before* you press send?).

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> 
> I'm announcing "fsblock" now because it is quite intrusive and so I'd
> like to get some thoughts about significantly changing this core part
> of the kernel.
> 
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...
> 
> Firstly, what is the buffer layer?  The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).

I mean, in Linux, the block device cache is unified. UNIX I believe
did all its caching in a buffer cache, below the filesystem.

 
> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We

Oh, and I don't have a Linux mkfs that makes minixv3 filesystems.
I had an image kindly made for me because I don't use minix. If
you want to test large block support, I won't email it to you though:
you can just convert ext2 or ext3 to fsblock ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] fsblock

2007-06-23 Thread Nick Piggin

I'm announcing "fsblock" now because it is quite intrusive and so I'd
like to get some thoughts about significantly changing this core part
of the kernel.

fsblock is a rewrite of the "buffer layer" (ding dong the witch is
dead), which I have been working on, on and off and is now at the stage
where some of the basics are working-ish. This email is going to be
long...

Firstly, what is the buffer layer?  The buffer layer isn't really a
buffer layer as in the buffer cache of unix: the block device cache
is unified with the pagecache (in terms of the pagecache, a blkdev
file is just like any other, but with a 1:1 mapping between offset
and block).

There are filesystem APIs to access the block device, but these go
through the block device pagecache as well. These don't exactly
define the buffer layer either.

The buffer layer is a layer between the pagecache and the block
device for block based filesystems. It keeps a translation between
logical offset and physical block number, as well as meta
information such as locks, dirtyness, and IO status of each block.
This information is tracked via the buffer_head structure.
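
To make that bookkeeping concrete: the translation is just "which
physical block backs byte N of this file". The logical block index is
simple shift arithmetic, and the filesystem's get_block callback
supplies the physical block number. The sketch below is illustrative
only (lookup_physical() is a made-up helper, not buffer.c code); for a
block device inode the result is the 1:1 mapping mentioned above.

#include <linux/buffer_head.h>
#include <linux/errno.h>
#include <linux/fs.h>

static sector_t logical_block(struct inode *inode, loff_t offset)
{
        return offset >> inode->i_blkbits;      /* block size = 1 << i_blkbits */
}

/* made-up helper: resolve the physical block backing 'offset', if any */
static int lookup_physical(struct inode *inode, get_block_t *get_block,
                           loff_t offset, sector_t *phys)
{
        struct buffer_head bh = { .b_size = 1 << inode->i_blkbits };
        int err;

        err = get_block(inode, logical_block(inode, offset), &bh, 0);
        if (err)
                return err;
        if (!buffer_mapped(&bh))
                return -ENOENT;         /* a hole: no backing block yet */
        *phys = bh.b_blocknr;           /* for a blkdev inode this equals
                                           the logical block number */
        return 0;
}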

Why rewrite the buffer layer?  Lots of people have had a desire to
completely rip out the buffer layer, but we can't do that[*] because
it does actually serve a useful purpose. Why the bad rap? Because
the code is old and crufty, and buffer_head is an awful name. It must 
be among the oldest code in the core fs/vm, and the main reason it has
survived is the inertia of so many and such complex filesystems.

[*] About the furthest we could go is use the struct page for the
information otherwise stored in the buffer_head, but this would be
tricky and suboptimal for filesystems with non page sized blocks and
would probably bloat the struct page as well.

So why rewrite rather than incremental improvements? Incremental
improvements are logically the correct way to do this, and we probably
could go from buffer.c to fsblock.c in steps. But I didn't do this
because: a) the blinding pace at which things move in this area would
make me an old man before it would be complete; b) I didn't actually
know exactly what it was going to look like before starting on it; c)
I wanted stable root filesystems and such when testing it; and d) I
found it reasonably easy to have both layers coexist (it uses an extra
page flag, but even that wouldn't be needed if the old buffer layer
was better decoupled from the page cache).

I started this as an exercise to see how the buffer layer could be
improved, and I think it is working out OK so far. The name is fsblock
because it basically ties the fs layer to the block layer. I think
Andrew has wanted to rename buffer_head to block before, but block is
too clashy, and it isn't a great deal more descriptive than buffer_head.
I believe fsblock is.

I'll go through a list of things where I have hopefully improved on the
buffer layer, off the top of my head. The big caveat here is that minix
is the only real filesystem I have converted so far, and complex
journalled filesystems might pose some problems that water down its
goodness (I don't know).

- Data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on
  64-bit (could easily be 32 if we can have int bitops). Compare this
  to around 50 and 100ish for struct buffer_head. With a 4K page and 1K
  blocks (four blocks, hence four descriptors, per page), IO requires 10%
  RAM overhead in buffer heads alone. With fsblocks you're down to around 3%.

- Structure packing. A page gets a number of buffer heads that are
  allocated in a linked list. fsblocks are allocated contiguously, so
  cacheline footprint is smaller in the above situation.

- Data / metadata separation. I have a struct fsblock and a struct
  fsblock_meta, so we could put more stuff into the usually less used
  fsblock_meta without bloating it up too much. After a few tricks, these
  are no longer any different in my code, and they dirty up the typing quite
  a lot (and I'm aware it still has some warnings, thanks). So if it's not
  useful, this could be taken out.

- Locking. fsblocks completely use the pagecache for locking and lookups.
  The page lock is used, but there is no extra per-inode lock that buffer
  has. Would go very nicely with lockless pagecache. RCU is used for one
  non-blocking fsblock lookup (find_get_block), but I'd really rather hope
  filesystems can tolerate a blocking lookup there, so RCU could be dropped
  completely. (Actually this is not quite true because mapping->private_lock
  is still used for the mark_buffer_dirty_inode equivalent, but that's a
  relatively rare operation.)

- Coupling with pagecache metadata. Pagecache pages contain some metadata
  that is logically redundant because it is tracked in buffers as well
  (eg. a page is dirty if one or more buffers are dirty, or uptodate if
  all buffers are uptodate). This is great because it means we can avoid
  that layer in some situations, but they can get out of sync. eg. if a
  filesystem writes a buffer out by hand, its pagecache page will stay
  dirty, 


Re: [RFC] fsblock

2007-06-23 Thread William Lee Irwin III
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...

Long overdue. Thank you.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/