Re: [RFC] generic IO write clustering
On Sat, 20 Jan 2001, Christoph Hellwig wrote:

> On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > > True. But you have to go through ext2_get_branch (under the big kernel
> > > lock) - if we can do only one logical->physical block translation,
> > > why do it multiple times?
> >
> > You don't. If the metadata is cached and uptodate there is no need to
> > call get_block().
>
> Oops. You are right for the stock tree - I was only looking at my kio
> tree, where it can't be cached due to the lack of buffer-cache usage...

Must be fixed. We need a higher level abstraction which can hold this
(and other) information. Take a look at SGI's pagebuf page_buf_t.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] generic IO write clustering
On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > True. But you have to go through ext2_get_branch (under the big kernel
> > lock) - if we can do only one logical->physical block translation,
> > why do it multiple times?
>
> You don't. If the metadata is cached and uptodate there is no need to
> call get_block().

Oops. You are right for the stock tree - I was only looking at my kio
tree, where it can't be cached due to the lack of buffer-cache usage...

	Christoph
--
Whip me.  Beat me.  Make me maintain AIX.
Re: [RFC] generic IO write clustering
On Sat, 20 Jan 2001, Christoph Hellwig wrote:

> On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> > In case the metadata was not already cached before ->cluster() (in this
> > case there is no disk IO at all), ->cluster() will cache it, avoiding
> > further disk accesses by writepage() (or writepages()).
>
> True. But you have to go through ext2_get_branch (under the big kernel
> lock) - if we can do only one logical->physical block translation,
> why do it multiple times?

You don't. If the metadata is cached and uptodate there is no need to
call get_block().
Re: [RFC] generic IO write clustering
On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> In case the metadata was not already cached before ->cluster() (in this
> case there is no disk IO at all), ->cluster() will cache it, avoiding
> further disk accesses by writepage() (or writepages()).

True. But you have to go through ext2_get_branch (under the big kernel
lock) - if we can do only one logical->physical block translation,
why do it multiple times?

> > Another thing I dislike is that the flushing gets more complicated with
> > your VM-level clustering. Now (and with the approach I'll describe
> > below) flushing is "write it out now and do whatever else you want";
> > with your design it is "find the pages beside this page and write out
> > a bunch of them" - much more complicated. I'd like it abstracted out.
>
> I don't see your point here. What am I missing?

It's just a matter of taste. (I thought it was clear enough that there
is no technical advantage...)

> [...]
>
> IMHO replicating the code is the worst thing.

This does not replicate the code. The 'normal' filesystems share the
code, and the special filesystems want to do their own clustering
anyway. (See the discussion on xfs-devel yesterday.)

	Christoph
--
Whip me.  Beat me.  Make me maintain AIX.
Re: [RFC] generic IO write clustering
On Sat, 20 Jan 2001, Christoph Hellwig wrote:

> I think there is a big disadvantage of this approach:
> To find out which pages are clusterable, we need to do bmap/get_block,
> which means we have to go through the block-allocation functions, which
> is rather expensive, and then we have to do it again in writepage, for
> the pages that are actually clustered by the VM.

In case the metadata was not already cached before ->cluster() (in this
case there is no disk IO at all), ->cluster() will cache it, avoiding
further disk accesses by writepage() (or writepages()).

> Another thing I dislike is that the flushing gets more complicated with
> your VM-level clustering. Now (and with the approach I'll describe
> below) flushing is "write it out now and do whatever else you want";
> with your design it is "find the pages beside this page and write out
> a bunch of them" - much more complicated. I'd like it abstracted out.

I don't see your point here. What am I missing?

> > The idea is to work with delayed allocated pages, too. A filesystem
> > which has this feature can, at its "cluster" operation, allocate
> > delayed pages contiguously on disk, and then return to the VM code
> > which now can potentially write a bunch of dirty pages in a few big
> > IO operations.
>
> That does also work nicely together with ->writepage level IO
> clustering.
>
> > I'm sure that a bit of tuning to know the optimal cluster size will be
> > needed. Also some fs locking problems will appear.
>
> Sure, but again that's an issue for every kind of IO clustering...
>
> Now my proposal. I prefer doing it in writepage, as stated above.
> writepage() loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
> behind the initial page; it first uses a test of whether the page
> should be clustered (a callback from the VM, highly 'balanceable'...),
> then does a bmap/get_block to check whether it is contiguous.
>
> Finally the IO is submitted using a submit_bh loop, or, when using a
> kiobuf-based IO path, all clustered pages are passed down to ll_rw_kio
> in one piece.
>
> As you see, the easy integration with the new bulk-IO mechanisms is
> also an advantage of this proposal, without the need for a new
> multi-page a_op.

IMHO replicating the code is the worst thing.
Re: [RFC] generic IO write clustering
In article <[EMAIL PROTECTED]> you wrote:
> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.
>
> IMO there are some problems if we implement the write clustering in
> this level:
>
> - The filesystem does not have information (and should not have) about
>   limiting cluster size depending on memory shortage.

Agreed.

> - By doing the write clustering at a higher level, we avoid a ton of
>   filesystems duplicating the code.

Most filesystems share their writepage implementation, and most of the
others have special requirements on write clustering anyway. For
example, extent-based filesystems (xfs, jfs) usually want to write out
more pages even if the VM doesn't see a need, just for efficiency
reasons. Network-based filesystems also need special care vs. write
clustering, because the network behaves differently from a typical
disk...

> So what I suggest is to add a "cluster" operation to struct
> address_space which can be used by the VM code to know the optimal IO
> transfer unit in the storage device. Something like this (maybe we need
> an async flag but that's a minor detail now):
>
>	int (*cluster)(struct page *, unsigned long *boffset,
>			unsigned long *poffset);
>
> "page" is from where the filesystem code should start its search for
> contiguous pages. boffset and poffset are passed by the VM code to know
> the logical "backwards offset" (number of contiguous pages going
> backwards from "page") and "forward offset" (contiguous pages going
> forward from "page") in the inode.

I think there is a big disadvantage of this approach:
To find out which pages are clusterable, we need to do bmap/get_block,
which means we have to go through the block-allocation functions, which
is rather expensive, and then we have to do it again in writepage, for
the pages that are actually clustered by the VM.

Another thing I dislike is that the flushing gets more complicated with
your VM-level clustering. Now (and with the approach I'll describe
below) flushing is "write it out now and do whatever else you want";
with your design it is "find the pages beside this page and write out a
bunch of them" - much more complicated. I'd like it abstracted out.

> The idea is to work with delayed allocated pages, too. A filesystem
> which has this feature can, at its "cluster" operation, allocate
> delayed pages contiguously on disk, and then return to the VM code
> which now can potentially write a bunch of dirty pages in a few big IO
> operations.

That does also work nicely together with ->writepage level IO
clustering.

> I'm sure that a bit of tuning to know the optimal cluster size will be
> needed. Also some fs locking problems will appear.

Sure, but again that's an issue for every kind of IO clustering...

Now my proposal. I prefer doing it in writepage, as stated above.
writepage() loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
behind the initial page; it first uses a test of whether the page should
be clustered (a callback from the VM, highly 'balanceable'...), then
does a bmap/get_block to check whether it is contiguous.

Finally the IO is submitted using a submit_bh loop, or, when using a
kiobuf-based IO path, all clustered pages are passed down to ll_rw_kio
in one piece.

As you see, the easy integration with the new bulk-IO mechanisms is also
an advantage of this proposal, without the need for a new multi-page
a_op.

	Christoph
--
Whip me.  Beat me.  Make me maintain AIX.
Re: [RFC] generic IO write clustering
On Sat, 20 Jan 2001, Rik van Riel wrote:

> Is there ever a reason NOT to do the best possible IO
> clustering at write time ?
>
> Remember that disk writes do not cost memory and have
> no influence on the resident set ... completely unlike
> read clustering, which does need to be limited.

You don't want to have too many ongoing writes at the same time, to
avoid complete starvation of the system. We already do this, and have
to, in quite a few places.

> > - By doing the write clustering at a higher level, we avoid a ton of
> >   filesystems duplicating the code.
> >
> > So what I suggest is to add a "cluster" operation to struct
> > address_space which can be used by the VM code to know the optimal IO
> > transfer unit in the storage device. Something like this (maybe we
> > need an async flag but that's a minor detail now):
> >
> >	int (*cluster)(struct page *, unsigned long *boffset,
> >			unsigned long *poffset);
>
> Makes sense, except that I don't see how (or why) the _VM_
> should "know the optimal IO transfer unit". This sounds more
> like a job for the IO subsystem and/or the filesystem, IMHO.

The a_ops->cluster() operation will make the VM aware of the contiguous
pages which can be clustered. The VM does not know about _any_ fs
lowlevel details (which are hidden behind ->cluster()), including
buffer_head's.

> > "page" is from where the filesystem code should start its search
> > for contiguous pages. boffset and poffset are passed by the VM
> > code to know the logical "backwards offset" (number of
> > contiguous pages going backwards from "page") and "forward
> > offset" (contiguous pages going forward from "page") in the inode.
>
> Yes, this makes a LOT of sense. I really like a pagecache
> helper function so the filesystems can build their writeout
> clusters easier.

The address space owners (filesystems _and_ swap for this case) do not
need to implement the writeout clustering at all, because we're doing it
at the VM _without_ having to know about low-level details.

Take a look at this somewhat pseudo-code:

int cluster_write(struct page *page)
{
	struct address_space *mapping = page->mapping;
	unsigned long boffset, poffset;
	int nr_pages;
	...
	/* How many pages can we write for free? */
	nr_pages = mapping->a_ops->cluster(page, &boffset, &poffset);
	...
	page_cluster_flush(page, csize);
}

/*
 * @page: dirty page from where to start the search
 * @csize: maximum size of the cluster
 */
int page_cluster_flush(struct page *page, int csize)
{
	struct page *cpages[csize];
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
	unsigned long index = page->index;
	unsigned long curr_index = page->index;

	cpages[0] = page;
	count = 1;

	/* Search for clusterable dirty pages behind */
	/* Search for clusterable dirty pages ahead */
	...
	/* Write all of them */
	for (i = 0; i < count; i++) {
		ClearPageDirty(cpages[i]);
		writepage(cpages[i]);
		...
	}
}

This way we have _one_ clean implementation of write clustering without
any lowlevel crap involved.

Try to imagine the amount of code people will manage to write in their
own fs's to implement write clustering.
Re: [RFC] generic IO write clustering
On Fri, 19 Jan 2001, Marcelo Tosatti wrote:

> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.
>
> IMO there are some problems if we implement the write clustering in
> this level:
>
> - The filesystem does not have information (and should not have) about
>   limiting cluster size depending on memory shortage.

Is there ever a reason NOT to do the best possible IO
clustering at write time ?

Remember that disk writes do not cost memory and have
no influence on the resident set ... completely unlike
read clustering, which does need to be limited.

> - By doing the write clustering at a higher level, we avoid a ton of
>   filesystems duplicating the code.
>
> So what I suggest is to add a "cluster" operation to struct
> address_space which can be used by the VM code to know the optimal IO
> transfer unit in the storage device. Something like this (maybe we need
> an async flag but that's a minor detail now):
>
>	int (*cluster)(struct page *, unsigned long *boffset,
>			unsigned long *poffset);

Makes sense, except that I don't see how (or why) the _VM_
should "know the optimal IO transfer unit". This sounds more
like a job for the IO subsystem and/or the filesystem, IMHO.

> "page" is from where the filesystem code should start its search
> for contiguous pages. boffset and poffset are passed by the VM
> code to know the logical "backwards offset" (number of
> contiguous pages going backwards from "page") and "forward
> offset" (contiguous pages going forward from "page") in the inode.

Yes, this makes a LOT of sense. I really like a pagecache
helper function so the filesystems can build their writeout
clusters easier.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/
[RFC] generic IO write clustering
Hi,

I'm starting to implement a generic write clustering scheme and I would
like to receive comments and suggestions.

The write clustering issue has already been discussed (mainly at Miami)
and the agreement, AFAIK, was to implement the write clustering at the
per-address-space writepage() operation.

IMO there are some problems if we implement the write clustering in this
level:

- The filesystem does not have information (and should not have) about
  limiting cluster size depending on memory shortage.

- By doing the write clustering at a higher level, we avoid a ton of
  filesystems duplicating the code.

So what I suggest is to add a "cluster" operation to struct
address_space which can be used by the VM code to know the optimal IO
transfer unit in the storage device. Something like this (maybe we need
an async flag but that's a minor detail now):

	int (*cluster)(struct page *, unsigned long *boffset,
			unsigned long *poffset);

"page" is from where the filesystem code should start its search for
contiguous pages. boffset and poffset are passed by the VM code to know
the logical "backwards offset" (number of contiguous pages going
backwards from "page") and "forward offset" (contiguous pages going
forward from "page") in the inode.

The idea is to work with delayed allocated pages, too. A filesystem
which has this feature can, at its "cluster" operation, allocate delayed
pages contiguously on disk, and then return to the VM code which now can
potentially write a bunch of dirty pages in a few big IO operations.

I'm sure that a bit of tuning to know the optimal cluster size will be
needed. Also some fs locking problems will appear. But it seems
worthwhile to me, since we're avoiding a lot of code replication in the
future and the performance gain will also be _nice_.

Comments, thoughts?