Re: [RFC] generic IO write clustering

2001-01-20 Thread Marcelo Tosatti


On Sat, 20 Jan 2001, Christoph Hellwig wrote:

> On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > > True.  But you have to go through ext2_get_branch (under the big kernel
> > > lock) - if we can do only one logical->physical block translation,
> > > why do it multiple times?
> > 
> > You don't. If the metadata is cached and uptodate, there is no need to call
> > get_block().
> 
> Oops.  You are right for the stock tree - I was only looking at my kio tree,
> where it can't be cached due to the lack of buffer-cache usage...

Must be fixed.  

We need a higher level abstraction which can hold this (and other)
information.

Take a look at SGI's pagebuf (page_buf_t). 




Re: [RFC] generic IO write clustering

2001-01-20 Thread Christoph Hellwig

On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > True.  But you have to go through ext2_get_branch (under the big kernel
> > lock) - if we can do only one logical->physical block translation,
> > why do it multiple times?
> 
> You don't. If the metadata is cached and uptodate, there is no need to call
> get_block().

Oops.  You are right for the stock tree - I was only looking at my kio tree,
where it can't be cached due to the lack of buffer-cache usage...

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [RFC] generic IO write clustering

2001-01-20 Thread Marcelo Tosatti



On Sat, 20 Jan 2001, Christoph Hellwig wrote:

> On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> > In case the metadata was not already cached before ->cluster() (in this
> > case there is no disk IO at all), ->cluster() will cache it avoiding
> > further disk accesses by writepage (or writepages()).
> 
> True.  But you have to go through ext2_get_branch (under the big kernel
> lock) - if we can do only one logical->physical block translation,
> why do it multiple times?

You don't. If the metadata is cached and uptodate, there is no need to call
get_block().
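
As a minimal sketch (assuming a 2.4-style page cache where the attached
buffer_heads hang off page->buffers and one block per page; the helper name
is illustrative only), the check could look like this:

/*
 * If the buffer behind this page is already mapped and uptodate, the
 * logical->physical translation is cached in bh->b_blocknr and get_block()
 * does not need to be called (or take the BKL) again.
 */
static int page_block_cached(struct page *page, unsigned long *blocknr)
{
	struct buffer_head *bh = page->buffers;

	if (bh && buffer_mapped(bh) && buffer_uptodate(bh)) {
		*blocknr = bh->b_blocknr;	/* cached translation */
		return 1;
	}
	return 0;	/* caller has to go through get_block() */
}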





Re: [RFC] generic IO write clustering

2001-01-20 Thread Christoph Hellwig

On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> In case the metadata was not already cached before ->cluster() (in this
> case there is no disk IO at all), ->cluster() will cache it avoiding
> further disk accesses by writepage (or writepages()).

True.  But you have to go through ext2_get_branch (under the big kernel
lock) - if we can do only one logical->physical block translation,
why do it multiple times?

> > Another thing I dislike is that the flushing gets more complicated with
> > your VM-level clustering.  Now (and with my approach, described below)
> > flushing is 'write it out now and do whatever else you want'; with your
> > design it is 'find the pages beside this page and write out a bunch of
> > them' - much more complicated.  I'd like it abstracted out.
> 
> I don't see your point here. What am I missing?

It's just a matter of taste.
(I thought it was clear enough that there is no technical advantage...)

> [...] 
>
> IMHO replicating the code is the worst thing. 

This does not replicate the code.  The 'normal' filesystems share the
code, and the special filesystems want to do their own clustering anyway.
(See the discussion on xfs-devel yesterday).

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [RFC] generic IO write clustering

2001-01-20 Thread Marcelo Tosatti


On Sat, 20 Jan 2001, Christoph Hellwig wrote:



> I think there is a big disadvantage of this approach:
> To find out which pages are clusterable, we need to do bmap/get_block,
> and that means we have to go through the block-allocation functions, which
> is rather expensive, and then we have to do it again in writepage, for
> the pages that are actually clustered by the VM.

In case the metadata was not already cached before ->cluster() (in this
case there is no disk IO at all), ->cluster() will cache it avoiding
further disk accesses by writepage (or writepages()).

> Another thing I dislike is that the flushing gets more complicated with
> your VM-level clustering.  Now (and with my approach, described below)
> flushing is 'write it out now and do whatever else you want'; with your
> design it is 'find the pages beside this page and write out a bunch of
> them' - much more complicated.  I'd like it abstracted out.

I don't see your point here. What am I missing?

> > The idea is to work with delayed allocated pages, too. A filesystem which
> > has this feature can, at its "cluster" operation, allocate delayed pages
> > contiguously on disk, and then return to the VM code which now can
> > potentially write a bunch of dirty pages in a few big IO operations.
> 
> That also works nicely together with ->writepage-level IO clustering.
> 
> > I'm sure that a bit of tuning to know the optimal cluster size will be
> > needed. Also some fs locking problems will appear.
> 
> Sure, but again that's an issue for every kind of IO clustering...
>
> 
> Now to my proposal.  I prefer doing it in writepage, as stated above.
> Writepage loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
> behind the initial page; it first tests whether each page should be
> clustered (a callback from the VM, highly 'balanceable'...), then does
> a bmap/get_block to check whether it is contiguous.
>
> Finally the IO is submitted using a submit_bh loop, or, when using a
> kiobuf-based IO path, all clustered pages are passed down to ll_rw_kio
> in one piece.
> As you can see, easy integration with the new bulk-IO mechanisms is also
> an advantage of this proposal, without the need for a new multi-page a_op.

IMHO replicating the code is the worst thing. 




Re: [RFC] generic IO write clustering

2001-01-20 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:
> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.

> IMO there are some problems if we implement the write clustering at this
> level:

>   - The filesystem does not have information (and should not have) about
> limiting cluster size depending on memory shortage.

Agreed.

>   - By doing the write clustering at a higher level, we avoid a ton of
> filesystems duplicating the code.

Most filesystems share their writepage implementation, and most of the
others have special requirements on write clustering anyway.

For example, extent-based filesystems (xfs, jfs) usually want to write out
more pages even if the VM doesn't see a need, just for efficiency reasons.

Network-based filesystems also need special care with respect to write
clustering, because the network behaves differently from a typical disk...

> So what I suggest is to add a "cluster" operation to struct address_space
> which can be used by the VM code to know the optimal IO transfer unit in
> the storage device. Something like this (maybe we need an async flag but
> that's a minor detail now):

> int (*cluster)(struct page *, unsigned long *boffset, 
>   unsigned long *poffset);

> "page" is from where the filesystem code should start its search for
> contiguous pages. boffset and poffset are passed by the VM code to know
> the logical "backwards offset" (number of contiguous pages going backwards
> from "page") and "forward offset" (cont pages going forward from
> "page") in the inode.

I think there is a big disadvantage of this approach:
To find out which pages are clusterable, we need to do bmap/get_block,
and that means we have to go through the block-allocation functions, which
is rather expensive, and then we have to do it again in writepage, for
the pages that are actually clustered by the VM.

Another thing I dislike is that the flushing gets more complicated with
your VM-level clustering.  Now (and with my approach, described below)
flushing is 'write it out now and do whatever else you want'; with your
design it is 'find the pages beside this page and write out a bunch of
them' - much more complicated.  I'd like it abstracted out.

> The idea is to work with delayed allocated pages, too. A filesystem which
> has this feature can, at its "cluster" operation, allocate delayed pages
> contiguously on disk, and then return to the VM code which now can
> potentially write a bunch of dirty pages in a few big IO operations.

That also works nicely together with ->writepage-level IO clustering.

> I'm sure that a bit of tuning to know the optimal cluster size will be
> needed. Also some fs locking problems will appear.

Sure, but again that's an issue for every kind of IO clustering...

Now to my proposal.  I prefer doing it in writepage, as stated above.
Writepage loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
behind the initial page; it first tests whether each page should be
clustered (a callback from the VM, highly 'balanceable'...), then does
a bmap/get_block to check whether it is contiguous.

Finally the IO is submitted using a submit_bh loop, or, when using a
kiobuf-based IO path, all clustered pages are passed down to ll_rw_kio
in one piece.
As you can see, easy integration with the new bulk-IO mechanisms is also
an advantage of this proposal, without the need for a new multi-page a_op.
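
A rough sketch of that loop, going forward only and using 2.4-ish APIs
(MAX_CLUSTERED_PAGES and the page_is_clusterable() callback are placeholder
names, one buffer_head per page is assumed, and page/buffer refcounting and
locking are mostly elided):

static int cluster_forward(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct buffer_head *bhs[MAX_CLUSTERED_PAGES];
	struct buffer_head *bh = page->buffers;
	unsigned long next_blocknr = bh->b_blocknr + 1;
	unsigned long index = page->index + 1;
	int nr = 0, i;

	bhs[nr++] = bh;	/* the initial page, already mapped by writepage */

	while (nr < MAX_CLUSTERED_PAGES / 2) {
		struct page *p = find_get_page(mapping, index);
		struct buffer_head *pbh;

		if (!p)
			break;
		pbh = p->buffers;
		/* an unmapped buffer would need a bmap/get_block call here */
		if (!PageDirty(p) || !page_is_clusterable(p) ||
		    !pbh || !buffer_mapped(pbh) ||
		    pbh->b_blocknr != next_blocknr) {
			page_cache_release(p);
			break;
		}
		bhs[nr++] = pbh;
		next_blocknr++;
		index++;
		page_cache_release(p);
	}

	/* submit_bh() wants a locked buffer; real code also clears the
	 * dirty state and sets up b_end_io before submitting */
	for (i = 0; i < nr; i++) {
		lock_buffer(bhs[i]);
		submit_bh(WRITE, bhs[i]);
	}
	return nr;
}

The backward direction is symmetric, and with a kiobuf-based path the
collected pages would be handed to ll_rw_kio instead of the submit_bh loop.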

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [RFC] generic IO write clustering

2001-01-19 Thread Marcelo Tosatti


On Sat, 20 Jan 2001, Rik van Riel wrote:

> Is there ever a reason NOT to do the best possible IO
> clustering at write time ?
> 
> Remember that disk writes do not cost memory and have
> no influence on the resident set ... completely unlike
> read clustering, which does need to be limited.

You don't want to have too many ongoing writes at the same time, to avoid
complete starvation of the system. We already do this, and have to, in
quite a few places.

> >   - By doing the write clustering at a higher level, we avoid a ton of
> > filesystems duplicating the code.
> >
> > So what I suggest is to add a "cluster" operation to struct address_space
> > which can be used by the VM code to know the optimal IO transfer unit in
> > the storage device. Something like this (maybe we need an async flag but
> > that's a minor detail now):
> >
> > int (*cluster)(struct page *, unsigned long *boffset,
> > unsigned long *poffset);
> 
> Makes sense, except that I don't see how (or why) the _VM_
> should "know the optimal IO transfer unit". This sounds more
> like a job for the IO subsystem and/or the filesystem, IMHO.

The a_ops->cluster() operation will make the VM aware of the contiguous
pages which can be clustered.

The VM does not know about _any_ fs lowlevel details (which are hidden
behind ->cluster()), including buffer_head's.

> 
> > "page" is from where the filesystem code should start its search
> > for contiguous pages. boffset and poffset are passed by the VM
> > code to know the logical "backwards offset" (number of
> > contiguous pages going backwards from "page") and "forward
> > offset" (cont pages going forward from "page") in the inode.
> 
> Yes, this makes a LOT of sense. I really like a pagecache
> helper function so the filesystems can build their writeout
> clusters easier.

The address space owners (filesystems _and_ swap in this case) do not
need to implement the writeout clustering at all because we're doing it in
the VM _without_ having to know about low-level details.

Take a look at this somewhat pseudo-code:

int cluster_write(struct page *page)
{
	struct address_space *mapping = page->mapping;
	unsigned long boffset, poffset;
	int nr_pages;

	...
	/* How many pages can we write for free? */
	nr_pages = mapping->a_ops->cluster(page, &boffset, &poffset);
	...

	/* csize is derived from nr_pages and the VM's limits (elided) */
	page_cluster_flush(page, csize);
}

/*
 * @page: dirty page from where to start the search
 * @csize: maximum size of the cluster
 */
int page_cluster_flush(struct page *page, int csize)
{
	struct page *cpages[csize];
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
	unsigned long index = page->index;
	unsigned long curr_index = page->index;
	int i, count;

	cpages[0] = page;
	count = 1;

	/* Search for clusterable dirty pages behind */

	/* Search for clusterable dirty pages ahead */
	...
	/* Write all of them */
	for (i = 0; i < count; i++) {
		ClearPageDirty(cpages[i]);
		mapping->a_ops->writepage(cpages[i]);
		...
	}
}

This way we have _one_ clean implementation of write clustering without
any lowlevel crap involved. Try to imagine the amount of code people will
manage to write in their own fs's to implement write clustering.



Re: [RFC] generic IO write clustering

2001-01-19 Thread Rik van Riel

On Fri, 19 Jan 2001, Marcelo Tosatti wrote:

> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.
>
> IMO there are some problems if we implement the write clustering at this
> level:
>
>   - The filesystem does not have information (and should not have) about
> limiting cluster size depending on memory shortage.

Is there ever a reason NOT to do the best possible IO
clustering at write time ?

Remember that disk writes do not cost memory and have
no influence on the resident set ... completely unlike
read clustering, which does need to be limited.

>   - By doing the write clustering at a higher level, we avoid a ton of
> filesystems duplicating the code.
>
> So what I suggest is to add a "cluster" operation to struct address_space
> which can be used by the VM code to know the optimal IO transfer unit in
> the storage device. Something like this (maybe we need an async flag but
> that's a minor detail now):
>
> int (*cluster)(struct page *, unsigned long *boffset,
>   unsigned long *poffset);

Makes sense, except that I don't see how (or why) the _VM_
should "know the optimal IO transfer unit". This sounds more
like a job for the IO subsystem and/or the filesystem, IMHO.

> "page" is from where the filesystem code should start its search
> for contiguous pages. boffset and poffset are passed by the VM
> code to know the logical "backwards offset" (number of
> contiguous pages going backwards from "page") and "forward
> offset" (cont pages going forward from "page") in the inode.

Yes, this makes a LOT of sense. I really like a pagecache
helper function so the filesystems can build their writeout
clusters easier.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com.br/




[RFC] generic IO write clustering

2001-01-19 Thread Marcelo Tosatti


Hi, 

I'm starting to implement a generic write clustering scheme and I would
like to receive comments and suggestions.

The write clustering issue has already been discussed (mainly at Miami)
and the agreement, AFAIK, was to implement the write clustering at the
per-address-space writepage() operation.

IMO there are some problems if we implement the write clustering at this
level:

  - The filesystem does not have information (and should not have) about
limiting cluster size depending on memory shortage.
  - By doing the write clustering at a higher level, we avoid a ton of
filesystems duplicating the code.

So what I suggest is to add a "cluster" operation to struct address_space
which can be used by the VM code to know the optimal IO transfer unit in
the storage device. Something like this (maybe we need an async flag but
that's a minor detail now):

int (*cluster)(struct page *, unsigned long *boffset,
               unsigned long *poffset);

"page" is from where the filesystem code should start its search for
contiguous pages. boffset and poffset are passed by the VM code to know
the logical "backwards offset" (number of contiguous pages going backwards
from "page") and "forward offset" (cont pages going forward from
"page") in the inode.

The idea is to work with delayed allocated pages, too. A filesystem which
has this feature can, at its "cluster" operation, allocate delayed pages
contiguously on disk, and then return to the VM code which now can
potentially write a bunch of dirty pages in a few big IO operations.

I'm sure that a bit of tuning to know the optimal cluster size will be
needed. Also some fs locking problems will appear.

But it seems worthwhile to me, since we're avoiding a lot of code
replication in the future, and the performance gain will be _nice_.

Comments, thoughts?
