Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
Christoph Hellwig wrote:
> On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> > On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
> > >
> > > -	index = *ppos >> PAGE_CACHE_SHIFT;
> > > -	offset = *ppos & ~PAGE_CACHE_MASK;
> > > +	index = page_cache_index(mapping, *ppos);
> > > +	offset = page_cache_offset(mapping, *ppos);
> >
> > Part of me feels inclined to merge this patch now because it makes the
> > code more readable, even if page_cache_index() is implemented as
> > #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
> >
> > I know there is little use in yet another global search'n'replace
> > wankfest and Andrew might wash my mouth just for mentioning it. Still,
> > hard to dislike this part of your patch.
>
> Yes, I suggested that before. Andrew seems to somehow hate this
> patchset, but even if we don't get it in, the lowercase macros are much,
> much better than the current PAGE_CACHE_* confusion.

I don't mind the change either. The open-coded macros are very
recognisable, but it isn't hard to have a typo and get one slightly
wrong. If it goes upstream now it wouldn't have the mapping argument
though, would it? Or the need to replace PAGE_CACHE_SIZE, I guess.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips
> ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No. Please go and read the description of the race again. A count gets
irretrievably lost because the write operation of the first decrement is
overwritten by the second. Data gets lost. Atomic operations exist to
prevent that sort of thing. You either need to use them or have a deep
understanding of SMP read and write ordering in order to preserve data
integrity by some equivalent algorithm.

> Let's solve problems in order of their appearance. If the bio structure
> will be allowed to grow, then the whole patches can be done better.

How about like the patch below. This throttles any block driver by
implementing a throttle metric method so that each block driver can keep
track of its own resource consumption in units of its choosing. As an
(important) example, it implements a simple metric for device mapper
devices. Other block devices will work as before, because they do not
define any metric.
Short, sweet and untested, which is why I have not posted it until now.
This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into struct
queue. We should be able to report on the efficacy of this patch in
terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait,
+			atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.0 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct requ
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, 28 Aug 2007 15:49:51 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-08-28 at 20:11 +0100, Christoph Hellwig wrote:
> > Sorry for not replying to the previous revisions, but I've been out
> > on vacation.
> >
> > I can't say I like this version. Now we've got callouts at two rather
> > close levels, which is not very nice from the interface POV.
>
> Agreed.
>
> > My preference is for the first scheme, where we simply move
> > interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> > routine and provide a nice helper for the normal filesystem to use.
> >
> > If people are really concerned about adding two lines of code to the
> > handful of setattr operations, there's a variant of this scheme that
> > can avoid it:
> >
> > - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
> >   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> > - disk filesystems stay unchanged and never look at
> >   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
> >   the ATTR_MODE flag in ia_valid in this case, and do the right thing
> >   on the server side.
>
> Hmm... There has to be an implicit promise here that nobody else will
> ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
> time. Currently, that assumption is not there:

That was my concern with this scheme as well...

> > 	if (ia_valid & ATTR_KILL_SGID) {
> > 		attr->ia_valid &= ~ATTR_KILL_SGID;
> > 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> > 			if (!(ia_valid & ATTR_MODE)) {
> > 				ia_valid = attr->ia_valid |= ATTR_MODE;
> > 				attr->ia_mode = inode->i_mode;
> > 			}
> > 			attr->ia_mode &= ~S_ISGID;
> > 		}
> > 	}
>
> Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
> into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Sounds reasonable. I'll also throw in a comment that explains this
reasoning...

--
Jeff Layton <[EMAIL PROTECTED]>
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
> >
> > -	index = *ppos >> PAGE_CACHE_SHIFT;
> > -	offset = *ppos & ~PAGE_CACHE_MASK;
> > +	index = page_cache_index(mapping, *ppos);
> > +	offset = page_cache_offset(mapping, *ppos);
>
> Part of me feels inclined to merge this patch now because it makes the
> code more readable, even if page_cache_index() is implemented as
> #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
>
> I know there is little use in yet another global search'n'replace
> wankfest and Andrew might wash my mouth just for mentioning it. Still,
> hard to dislike this part of your patch.

Yes, I suggested that before. Andrew seems to somehow hate this
patchset, but even if we don't get it in, the lowercase macros are much,
much better than the current PAGE_CACHE_* confusion.
Re: [35/36] Large blocksize support for ext2
On Tue, 28 Aug 2007, Christoph Hellwig wrote:
> On Tue, Aug 28, 2007 at 12:06:26PM -0700, [EMAIL PROTECTED] wrote:
> > Hmmm... Actually there is nothing additional to be done after the
> > earlier cleanup of the macros. So just modify copyright.
>
> So you get a copyright line for some trivial macro cleanups? Please
> drop this patch and rather put your copyright into places where you
> actually did major work.

Ok.
Re: [00/36] Large Blocksize Support V6
On Tue, 28 Aug 2007, Christoph Hellwig wrote:
> one patch per file is the most braindead and most unacceptable way
> to split a series. Please stop whatever you're doing right now and
> correct it and send out a patch that has one patch per logical change
> for the whole tree. This means people can actually read the patch,
> and it's bisectable.

The patches are per logical change, aside from the first patches that
introduce the page cache functions all over the kernel. It would be
unacceptably big and difficult to merge if I put them all together.
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
>
> -	index = *ppos >> PAGE_CACHE_SHIFT;
> -	offset = *ppos & ~PAGE_CACHE_MASK;
> +	index = page_cache_index(mapping, *ppos);
> +	offset = page_cache_offset(mapping, *ppos);

Part of me feels inclined to merge this patch now because it makes the
code more readable, even if page_cache_index() is implemented as
#define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)

I know there is little use in yet another global search'n'replace
wankfest and Andrew might wash my mouth just for mentioning it. Still,
hard to dislike this part of your patch.

Jörn

--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, Aug 28, 2007 at 03:49:51PM -0400, Trond Myklebust wrote:
> Hmm... There has to be an implicit promise here that nobody else will
> ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
> time. Currently, that assumption is not there:
>
> > 	if (ia_valid & ATTR_KILL_SGID) {
> > 		attr->ia_valid &= ~ATTR_KILL_SGID;
> > 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> > 			if (!(ia_valid & ATTR_MODE)) {
> > 				ia_valid = attr->ia_valid |= ATTR_MODE;
> > 				attr->ia_mode = inode->i_mode;
> > 			}
> > 			attr->ia_mode &= ~S_ISGID;
> > 		}
> > 	}
>
> Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
> into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Yes, sounds fine to me.
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, 2007-08-28 at 20:11 +0100, Christoph Hellwig wrote:
> Sorry for not replying to the previous revisions, but I've been out
> on vacation.
>
> I can't say I like this version. Now we've got callouts at two rather
> close levels, which is not very nice from the interface POV.

Agreed.

> My preference is for the first scheme, where we simply move
> interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> routine and provide a nice helper for the normal filesystem to use.
>
> If people are really concerned about adding two lines of code to the
> handful of setattr operations, there's a variant of this scheme that
> can avoid it:
>
> - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
>   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> - disk filesystems stay unchanged and never look at
>   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
>   the ATTR_MODE flag in ia_valid in this case, and do the right thing
>   on the server side.

Hmm... There has to be an implicit promise here that nobody else will
ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
time. Currently, that assumption is not there:

> 	if (ia_valid & ATTR_KILL_SGID) {
> 		attr->ia_valid &= ~ATTR_KILL_SGID;
> 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> 			if (!(ia_valid & ATTR_MODE)) {
> 				ia_valid = attr->ia_valid |= ATTR_MODE;
> 				attr->ia_mode = inode->i_mode;
> 			}
> 			attr->ia_mode &= ~S_ISGID;
> 		}
> 	}

Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Trond
Re: [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, Aug 28, 2007 at 08:11:14PM +0100, Christoph Hellwig wrote:
>
> Sorry for not replying to the previous revisions, but I've been out
> on vacation.
>
> I can't say I like this version. Now we've got callouts at two rather
> close levels, which is not very nice from the interface POV.
>
> My preference is for the first scheme, where we simply move
> interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> routine and provide a nice helper for the normal filesystem to use.
>
> If people are really concerned about adding two lines of code to the
> handful of setattr operations, there's a variant of this scheme that
> can avoid it:

It's not about adding 2 lines of code - it's about adding the
requirement for the fs to call a function.

> - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
>   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> - disk filesystems stay unchanged and never look at
>   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
>   the ATTR_MODE flag in ia_valid in this case, and do the right thing
>   on the server side.

Sounds reasonable.

Josef 'Jeff' Sipek.

--
I abhor a system designed for the "user", if that word is a coded
pejorative meaning "stupid and unsophisticated."
	- Ken Thompson
Re: [35/36] Large blocksize support for ext2
On Tue, Aug 28, 2007 at 12:06:26PM -0700, [EMAIL PROTECTED] wrote:
> Hmmm... Actually there is nothing additional to be done after the
> earlier cleanup of the macros. So just modify copyright.

So you get a copyright line for some trivial macro cleanups? Please
drop this patch and rather put your copyright into places where you
actually did major work.
Re: [00/36] Large Blocksize Support V6
Stop! This patch series is entirely unacceptable!

one patch per file is the most braindead and most unacceptable way to
split a series. Please stop whatever you're doing right now and correct
it and send out a patch that has one patch per logical change for the
whole tree. This means people can actually read the patch, and it's
bisectable.
Re: [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
Sorry for not replying to the previous revisions, but I've been out on
vacation.

I can't say I like this version. Now we've got callouts at two rather
close levels, which is not very nice from the interface POV.

My preference is for the first scheme, where we simply move
interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr routine
and provide a nice helper for the normal filesystem to use.

If people are really concerned about adding two lines of code to the
handful of setattr operations, there's a variant of this scheme that can
avoid it:

- notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
  but to update ia_mode and the ia_valid flag to include ATTR_MODE.
- disk filesystems stay unchanged and never look at
  ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
  the ATTR_MODE flag in ia_valid in this case, and do the right thing
  on the server side.
[01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user
Simplify page cache zeroing of segments of pages through three
functions:

zero_user_segments(page, start1, end1, start2, end2)
	Zeros two segments of the page. It takes the positions where to
	start and end the zeroing, which avoids length calculations.

zero_user_segment(page, start, end)
	Same for a single segment.

zero_user(page, start, length)
	Length variant for the case where we know the length.

We remove the zero_user_page macro. Issues:

1. It's a macro. Inline functions are preferable.

2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this
   special case everywhere makes the code needlessly complex. The
   parameter for zeroing is always KM_USER0 except in one single case
   that we open code. Avoiding KM_USER0 means a lot of code no longer
   has to deal with the special casing for HIGHMEM. Dealing with kmap
   is only necessary for HIGHMEM configurations; in those configurations
   we use KM_USER0 like we do for a series of other functions defined in
   highmem.h.

   Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
   function could not be a macro. The zero_user_* functions introduced
   here can be inline because that constant is not used when these
   functions are called.

Also extract the flushing of the caches to be outside of the kmap.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 drivers/block/loop.c       |    2 +-
 fs/buffer.c                |   48 +-
 fs/cifs/inode.c            |    2 +-
 fs/direct-io.c             |    4 +-
 fs/ecryptfs/mmap.c         |    7 ++---
 fs/ext3/inode.c            |    4 +-
 fs/ext4/inode.c            |    4 +-
 fs/gfs2/bmap.c             |    2 +-
 fs/gfs2/ops_address.c      |    2 +-
 fs/libfs.c                 |   11 +++--
 fs/mpage.c                 |    7 +
 fs/nfs/read.c              |   10
 fs/nfs/write.c             |    2 +-
 fs/ntfs/aops.c             |   18 +---
 fs/ntfs/file.c             |   32 +---
 fs/ocfs2/aops.c            |    6 ++--
 fs/reiserfs/inode.c        |    4 +-
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 include/linux/highmem.h    |   49 +++
 mm/filemap_xip.c           |    2 +-
 mm/truncate.c              |    2 +-
 21 files changed, 104 insertions(+), 116 deletions(-)

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/drivers/block/loop.c	2007-08-27 19:22:17.0 -0700
@@ -251,7 +251,7 @@ static int do_lo_send_aops(struct loop_d
 			 */
 			printk(KERN_ERR "loop: transfer error block %llu\n",
 			       (unsigned long long)index);
-			zero_user_page(page, offset, size, KM_USER0);
+			zero_user(page, offset, size);
 		}
 		flush_dcache_page(page);
 		ret = aops->commit_write(file, page, offset,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/fs/buffer.c	2007-08-27 19:22:17.0 -0700
@@ -1803,19 +1803,10 @@ static int __block_prepare_write(struct
 				set_buffer_uptodate(bh);
 				continue;
 			}
-			if (block_end > to || block_start < from) {
-				void *kaddr;
-
-				kaddr = kmap_atomic(page, KM_USER0);
-				if (block_end > to)
-					memset(kaddr+to, 0,
-						block_end-to);
-				if (block_start < from)
-					memset(kaddr+block_start,
-						0, from-block_start);
-				flush_dcache_page(page);
-				kunmap_atomic(kaddr, KM_USER0);
-			}
+			if (block_end > to || block_start < from)
+				zero_user_segments(page,
+					to, block_end,
+					block_start, from);
 			continue;
 		}
 	}
@@ -1863,7 +1854,7 @@ static int __block_prepare_write(struct
 			break;
 		if (buffer_new(bh)) {
 			clear_buffer_new(bh);
-			zero_user_page(page, block_start, bh->b_size, KM_USER0);
+			zero_user(page,
[06/36] Use page_cache_xxx in mm/rmap.c
Use page_cache_xxx in mm/rmap.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 41ac397..d6a1771 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -188,9 +188,14 @@ static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 static inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff;
 	unsigned long address;
 
+	if (PageAnon(page))
+		pgoff = page->index;
+	else
+		pgoff = page->index << mapping_order(page->mapping);
+
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		/* page should be within any vma from prio_tree_next */
@@ -335,7 +340,7 @@ static int page_referenced_file(struct page *page)
 {
 	unsigned int mapcount;
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int referenced = 0;
@@ -447,7 +452,7 @@ out:
 static int page_mkclean_file(struct address_space *mapping, struct page *page)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = 0;
@@ -863,7 +868,7 @@ static int try_to_unmap_anon(struct page *page, int migration)
 static int try_to_unmap_file(struct page *page, int migration)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = SWAP_AGAIN;
-- 
1.5.2.4
[05/36] Use page_cache_xxx in mm/truncate.c
Use page_cache_xxx in mm/truncate.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/truncate.c |   35 ++++++++++++++++-------------------
 1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index bf8068d..8c3d32e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -45,9 +45,10 @@ void do_invalidatepage(struct page *page, unsigned long offset)
 		(*invalidatepage)(page, offset);
 }
 
-static inline void truncate_partial_page(struct page *page, unsigned partial)
+static inline void truncate_partial_page(struct address_space *mapping,
+		struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+	zero_user_segment(page, partial, page_cache_size(mapping));
 	if (PagePrivate(page))
 		do_invalidatepage(page, partial);
 }
@@ -95,7 +96,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+	cancel_dirty_page(page, page_cache_size(mapping));
 
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
@@ -157,9 +158,9 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 void truncate_inode_pages_range(struct address_space *mapping,
 				loff_t lstart, loff_t lend)
 {
-	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+	const pgoff_t start = page_cache_next(mapping, lstart);
 	pgoff_t end;
-	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+	const unsigned partial = page_cache_offset(mapping, lstart);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i;
@@ -167,8 +168,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
-	end = (lend >> PAGE_CACHE_SHIFT);
+	BUG_ON(page_cache_offset(mapping, lend) !=
+		page_cache_size(mapping) - 1);
+	end = page_cache_index(mapping, lend);
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -194,8 +196,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 			if (page_mapped(page)) {
 				unmap_mapping_range(mapping,
-				  (loff_t)page->index << PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
+				  page_cache_pos(mapping, page->index, 0),
+				  page_cache_size(mapping), 0);
 			}
 			if (page->index > next)
 				next = page->index;
@@ -421,9 +423,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				 * Zap the rest of the file in one hit.
 				 */
 				unmap_mapping_range(mapping,
-				   (loff_t)page->index << PAGE_CACHE_SHIFT,
-				   PAGE_CACHE_SIZE, 0);
+				   page_cache_pos(mapping, page->index, 0),
+				   page_cache_size(mapping), 0);
[10/36] Use page_cache_xxx in fs/sync
Use page_cache_xxx in fs/sync.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/sync.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..f30d7eb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -260,8 +260,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 	ret = 0;
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 		if (ret < 0)
 			goto out;
 	}
@@ -275,8 +275,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 
 	if (flags & SYNC_FILE_RANGE_WAIT_AFTER) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 	}
 out:
 	return ret;
-- 
1.5.2.4
[12/36] Use page_cache_xxx in mm/mpage.c
Use page_cache_xxx in mm/mpage.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/mpage.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index a5e1385..2843ed7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev,
 static void
 map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	struct buffer_head *page_bh, *head;
 	int block = 0;
 
@@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 	 * don't make any buffers if there is only one buffer on
 	 * the page and the page just needs to be set up to date
 	 */
-	if (inode->i_blkbits == PAGE_CACHE_SHIFT &&
+	if (inode->i_blkbits == page_cache_shift(mapping) &&
 	    buffer_uptodate(bh)) {
-		SetPageUptodate(page);    
+		SetPageUptodate(page);
 		return;
 	}
 	create_empty_buffers(page, 1 << inode->i_blkbits, 0);
@@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
 		unsigned long *first_logical_block, get_block_t get_block)
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	sector_t block_in_file;
 	sector_t last_block;
@@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	if (page_has_buffers(page))
 		goto confused;
 
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
 	last_block = block_in_file + nr_pages * blocks_per_page;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
@@ -284,7 +286,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	}
 
 	if (first_hole != blocks_per_page) {
-		zero_user_segment(page, first_hole << blkbits, PAGE_CACHE_SIZE);
+		zero_user_segment(page, first_hole << blkbits,
+					page_cache_size(mapping));
 		if (first_hole == 0) {
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -468,7 +471,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	unsigned long end_index;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	sector_t last_block;
 	sector_t block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
@@ -537,7 +540,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	 * The page has no buffers: map it to disk
 	 */
 	BUG_ON(!PageUptodate(page));
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index <<
+			(page_cache_shift(mapping) - blkbits);
 	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_page = page;
 	for (page_block = 0; page_block < blocks_per_page; ) {
@@ -569,7 +573,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	first_unmapped = page_block;
 
 page_is_mapped:
-	end_index = i_size >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, i_size);
 	if (page->index >= end_index) {
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -579,11 +583,11 @@ page_is_mapped:
 		 * is zeroed when mapped, and writes to that region are not
 		 * written out to the file."
 		 */
-		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
+		unsigned offset = page_cache_offset(mapping, i_size);
 
 		if (page->index > end_index || !offset)
 			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		zero_user_segment(page, offset, page_cache_size(mappin
[07/36] Use page_cache_xxx in mm/filemap_xip.c
Use page_cache_xxx in mm/filemap_xip.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/filemap_xip.c |   28 ++++++++++++++--------------
 1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index ba6892d..5237e53 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -61,24 +61,24 @@ do_xip_mapping_read(struct address_space *mapping,
 
 	BUG_ON(!mapping->a_ops->get_xip_page);
 
-	index = *ppos >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	index = page_cache_index(mapping, *ppos);
+	offset = page_cache_offset(mapping, *ppos);
 
 	isize = i_size_read(inode);
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, isize - 1);
 	for (;;) {
 		struct page *page;
 		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -117,8 +117,8 @@ do_xip_mapping_read(struct address_space *mapping,
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 
 		if (ret == nr && desc->count)
 			continue;
@@ -131,7 +131,7 @@ no_xip_page:
 	}
 
 out:
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (filp)
 		file_accessed(filp);
 }
@@ -220,7 +220,7 @@ static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
 
 	/* XXX: are VM_FAULT_ codes OK? */
 
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = page_cache_next(mapping, i_size_read(inode));
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
 
@@ -289,9 +289,9 @@ __xip_file_write(struct file *filp, const char __user *buf,
 		unsigned long offset;
 		size_t copied;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = page_cache_offset(mapping, pos); /* Within page */
+		index = page_cache_index(mapping, pos);
+		bytes = page_cache_size(mapping) - offset;
 		if (bytes > count)
 			bytes = count;
 
@@ -405,8 +405,8 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	pgoff_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize;
 	unsigned length;
 	struct page *page;
-- 
1.5.2.4
[09/36] Use page_cache_xxx in fs/libfs.c
Use page_cache_xxx in fs/libfs.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/libfs.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index 53b3dc5..e90f894 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -16,7 +16,8 @@ int simple_getattr(struct vfsmount *mnt, struct dentry *dentry, { struct inode *inode = dentry->d_inode; generic_fillattr(inode, stat); - stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9); + stat->blocks = inode->i_mapping->nrpages << + (page_cache_shift(inode->i_mapping) - 9); return 0; } @@ -340,10 +341,10 @@ int simple_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) + if (to - from != page_cache_size(file->f_mapping)) zero_user_segments(page, 0, from, - to, PAGE_CACHE_SIZE); + to, page_cache_size(file->f_mapping)); } return 0; } @@ -351,8 +352,9 @@ int simple_prepare_write(struct file *file, struct page *page, int simple_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); if (!PageUptodate(page)) SetPageUptodate(page); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[20/36] Use page_cache_xxx in drivers/block/rd.c
Use page_cache_xxx in drivers/block/rd.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- drivers/block/rd.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/block/rd.c b/drivers/block/rd.c index 65150b5..e148b3b 100644 --- a/drivers/block/rd.c +++ b/drivers/block/rd.c @@ -121,7 +121,7 @@ static void make_page_uptodate(struct page *page) } } while ((bh = bh->b_this_page) != head); } else { - memset(page_address(page), 0, PAGE_CACHE_SIZE); + memset(page_address(page), 0, page_cache_size(page_mapping(page))); } flush_dcache_page(page); SetPageUptodate(page); @@ -201,9 +201,9 @@ static const struct address_space_operations ramdisk_aops = { static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, struct address_space *mapping) { - pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9); + pgoff_t index = sector >> (page_cache_shift(mapping) - 9); unsigned int vec_offset = vec->bv_offset; - int offset = (sector << 9) & ~PAGE_CACHE_MASK; + int offset = page_cache_offset(mapping, (sector << 9)); int size = vec->bv_len; int err = 0; @@ -213,7 +213,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, char *src; char *dst; - count = PAGE_CACHE_SIZE - offset; + count = page_cache_size(mapping) - offset; if (count > size) count = size; size -= count; -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[15/36] Use page_cache_xxx functions in fs/ext2
Use page_cache_xxx functions in fs/ext2 Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/ext2/dir.c | 40 +++- 1 files changed, 23 insertions(+), 17 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 2bf49d7..d72926f 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -43,7 +43,8 @@ static inline void ext2_put_page(struct page *page) static inline unsigned long dir_pages(struct inode *inode) { - return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT; + return (inode->i_size+page_cache_size(inode->i_mapping)-1)>> + page_cache_shift(inode->i_mapping); } /* @@ -54,10 +55,11 @@ static unsigned ext2_last_byte(struct inode *inode, unsigned long page_nr) { unsigned last_byte = inode->i_size; + struct address_space *mapping = inode->i_mapping; - last_byte -= page_nr << PAGE_CACHE_SHIFT; - if (last_byte > PAGE_CACHE_SIZE) - last_byte = PAGE_CACHE_SIZE; + last_byte -= page_nr << page_cache_shift(mapping); + if (last_byte > page_cache_size(mapping)) + last_byte = page_cache_size(mapping); return last_byte; } @@ -76,18 +78,19 @@ static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to) static void ext2_check_page(struct page *page) { - struct inode *dir = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *dir = mapping->host; struct super_block *sb = dir->i_sb; unsigned chunk_size = ext2_chunk_size(dir); char *kaddr = page_address(page); u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count); unsigned offs, rec_len; - unsigned limit = PAGE_CACHE_SIZE; + unsigned limit = page_cache_size(mapping); ext2_dirent *p; char *error; - if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) { - limit = dir->i_size & ~PAGE_CACHE_MASK; + if (page_cache_index(mapping, dir->i_size) == page->index) { + limit = page_cache_offset(mapping, dir->i_size); if (limit & (chunk_size - 1)) goto Ebadsize; if (!limit) @@ -139,7 +142,7 @@ Einumber: bad_entry: ext2_error (sb, "ext2_check_page", "bad entry in directory 
#%lu: %s - " "offset=%lu, inode=%lu, rec_len=%d, name_len=%d", - dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs, + dir->i_ino, error, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode), rec_len, p->name_len); goto fail; @@ -148,7 +151,7 @@ Eend: ext2_error (sb, "ext2_check_page", "entry in directory #%lu spans the page boundary" "offset=%lu, inode=%lu", - dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs, + dir->i_ino, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode)); fail: SetPageChecked(page); @@ -246,8 +249,9 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) loff_t pos = filp->f_pos; struct inode *inode = filp->f_path.dentry->d_inode; struct super_block *sb = inode->i_sb; - unsigned int offset = pos & ~PAGE_CACHE_MASK; - unsigned long n = pos >> PAGE_CACHE_SHIFT; + struct address_space *mapping = inode->i_mapping; + unsigned int offset = page_cache_offset(mapping, pos); + unsigned long n = page_cache_index(mapping, pos); unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); unsigned char *types = NULL; @@ -268,14 +272,14 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) ext2_error(sb, __FUNCTION__, "bad page in #%lu", inode->i_ino); - filp->f_pos += PAGE_CACHE_SIZE - offset; + filp->f_pos += page_cache_size(mapping) - offset; return -EIO; } kaddr = page_address(page); if (unlikely(need_revalidate)) { if (offset) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); - filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset; + filp->f_pos = page_cache_pos(mapping, n, offset); } filp->f_version = inode->i_version; need_revalidate = 0; @@ -298,7 +302,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) offset = (char *)de - kaddr; over = filldir(dirent, de->name, de->name_len, - (n
[22/36] compound pages: Add new support functions
compound_pages(page)-> Determines base pages of a compound page compound_shift(page)-> Determine the page shift of a compound page compound_size(page) -> Determine the size of a compound page Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mm.h | 15 +++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3e9e8fe..fa4cbab 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -362,6 +362,21 @@ static inline void set_compound_order(struct page *page, unsigned long order) page[1].lru.prev = (void *)order; } +static inline int compound_pages(struct page *page) +{ + return 1 << compound_order(page); +} + +static inline int compound_shift(struct page *page) +{ + return PAGE_SHIFT + compound_order(page); +} + +static inline int compound_size(struct page *page) +{ + return PAGE_SIZE << compound_order(page); +} + /* * Multiple processes may "see" the same page. E.g. for untouched * mappings of /dev/null, all processes see the same page full of -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[19/36] Use page_cache_xxx for fs/xfs
Use page_cache_xxx for fs/xfs Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_aops.c | 55 ++ fs/xfs/linux-2.6/xfs_lrw.c |6 ++-- 2 files changed, 32 insertions(+), 29 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index fd4105d..e48817a 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -74,7 +74,7 @@ xfs_page_trace( xfs_inode_t *ip; bhv_vnode_t *vp = vn_from_inode(inode); loff_t isize = i_size_read(inode); - loff_t offset = page_offset(page); + loff_t offset = page_cache_pos(page->mapping, page->index, 0); int delalloc = -1, unmapped = -1, unwritten = -1; if (page_has_buffers(page)) @@ -610,7 +610,7 @@ xfs_probe_page( break; } while ((bh = bh->b_this_page) != head); } else - ret = mapped ? 0 : PAGE_CACHE_SIZE; + ret = mapped ? 0 : page_cache_size(page->mapping); } return ret; @@ -637,7 +637,7 @@ xfs_probe_cluster( } while ((bh = bh->b_this_page) != head); /* if we reached the end of the page, sum forwards in following pages */ - tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT; + tlast = page_cache_index(inode->i_mapping, i_size_read(inode)); tindex = startpage->index + 1; /* Prune this back to avoid pathological behavior */ @@ -655,14 +655,14 @@ xfs_probe_cluster( size_t pg_offset, len = 0; if (tindex == tlast) { - pg_offset = - i_size_read(inode) & (PAGE_CACHE_SIZE - 1); + pg_offset = page_cache_offset(inode->i_mapping, + i_size_read(inode)); if (!pg_offset) { done = 1; break; } } else - pg_offset = PAGE_CACHE_SIZE; + pg_offset = page_cache_size(inode->i_mapping); if (page->index == tindex && !TestSetPageLocked(page)) { len = xfs_probe_page(page, pg_offset, mapped); @@ -744,7 +744,8 @@ xfs_convert_page( int bbits = inode->i_blkbits; int len, page_dirty; int count = 0, done = 0, uptodate = 1; - xfs_off_t offset = page_offset(page); + struct address_space*map = inode->i_mapping; + xfs_off_t offset = page_cache_pos(map, page->index, 0); if (page->index != tindex) goto fail; @@ -752,7 
+753,7 @@ xfs_convert_page( goto fail; if (PageWriteback(page)) goto fail_unlock_page; - if (page->mapping != inode->i_mapping) + if (page->mapping != map) goto fail_unlock_page; if (!xfs_is_delayed_page(page, (*ioendp)->io_type)) goto fail_unlock_page; @@ -764,20 +765,20 @@ xfs_convert_page( * Derivation: * * End offset is the highest offset that this page should represent. -* If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1)) -* will evaluate non-zero and be less than PAGE_CACHE_SIZE and +* If we are on the last page, (end_offset & page_cache_mask()) +* will evaluate non-zero and be less than page_cache_size() and * hence give us the correct page_dirty count. On any other page, * it will be zero and in that case we need page_dirty to be the * count of buffers on the page. */ end_offset = min_t(unsigned long long, - (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, + (xfs_off_t)(page->index + 1) << page_cache_shift(map), i_size_read(inode)); len = 1 << inode->i_blkbits; - p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1), - PAGE_CACHE_SIZE); - p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE; + p_offset = min_t(unsigned long, page_cache_offset(map, end_offset), + page_cache_size(map)); + p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map); page_dirty = p_offset / len; bh = head = page_buffers(page); @@ -933,6 +934,8 @@ xfs_page_state_convert( int page_dirty, count = 0; int trylock = 0; int all_bh = unmapped; + struct address_space*map = inode->i_
[26/36] compound pages: Allow freeing of compound pages via pagevec
Allow the freeing of compound pages via pagevec. In release_pages() we currently special-case compound pages in order to be sure to always decrement the page count of the head page and not the tail page. However, that redirection to the head page is only necessary for tail pages. So we can test PageTail instead of PageCompound there, avoiding the redirection to the first page. Tail page handling is not changed. The head page of a compound page now represents a single large page. We do the usual processing, including checking if it is on the LRU and removing it (not useful right now, but later, when compound pages are on the LRU, this will work). Then we add the compound page to the pagevec. Only head pages will end up on the pagevec, not tail pages. In __pagevec_free() we then check if we are freeing a head page and if so call the destructor for the compound page. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/page_alloc.c | 13 +++-- mm/swap.c |8 +++- 2 files changed, 18 insertions(+), 3 deletions(-) Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c 2007-08-27 20:59:38.0 -0700 +++ linux-2.6/mm/page_alloc.c 2007-08-27 21:05:34.0 -0700 @@ -1441,8 +1441,17 @@ void __pagevec_free(struct pagevec *pvec { int i = pagevec_count(pvec); - while (--i >= 0) - free_hot_cold_page(pvec->pages[i], pvec->cold); + while (--i >= 0) { + struct page *page = pvec->pages[i]; + + if (PageHead(page)) { + compound_page_dtor *dtor; + + dtor = get_compound_page_dtor(page); + (*dtor)(page); + } else + free_hot_cold_page(page, pvec->cold); + } } fastcall void __free_pages(struct page *page, unsigned int order) Index: linux-2.6/mm/swap.c === --- linux-2.6.orig/mm/swap.c2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/swap.c 2007-08-27 21:05:34.0 -0700 @@ -263,7 +263,13 @@ void release_pages(struct page **pages, for (i = 0; i < nr; i++) { struct page *page = pages[i]; - if (unlikely(PageCompound(page))) { + /* +* If we have a tail page on the LRU then we 
need to +* decrement the page count of the head page. There +* is no further need to do anything since tail pages +* cannot be on the LRU. +*/ + if (unlikely(PageTail(page))) { if (zone) { spin_unlock_irq(&zone->lru_lock); zone = NULL; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[31/36] Large Blocksize: Core piece
Provide an alternate definition for the page_cache_xxx(mapping, ...) functions that can determine the current page size from the mapping and generate the appropriate shifts, sizes and masks for the page cache operations. Change the basic functions that allocate pages for the page cache to be able to handle higher order allocations. Provide a new function mapping_setup(struct address_space *, gfp_t mask, int order) that allows the setup of a mapping of any compound page order. mapping_set_gfp_mask() is still provided but it sets mappings to order 0. Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in order for the filesystem to be able to use larger pages. For some key block devices and filesystems the conversion is done here. mapping_setup() for higher order is only allowed if the mapping does not use DMA mappings or HIGHMEM, since we do not support bouncing at the moment. Thus we BUG() on DMA mappings and clear the highmem bit of higher order mappings. Modify the set_blocksize() function so that an arbitrary blocksize can be set. Blocksizes up to order MAX_ORDER - 1 can be set. This is typically 8MB on many platforms (order 11). Typically file systems are limited not only by the core VM but also by the structure of their internal data structures. The core VM limitations fall away with this patch. The functionality provided here can do nothing about the internal limitations of filesystems. 
Known internal limitations: Ext2 64k, XFS 64k, Reiserfs 8k, Ext3 4k (rumor has it that changing a constant can remove the limit), Ext4 4k. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- block/Kconfig | 17 ++ drivers/block/rd.c |6 ++- fs/block_dev.c | 29 +++--- fs/buffer.c |4 +- fs/inode.c |7 ++- fs/xfs/linux-2.6/xfs_buf.c |3 +- include/linux/buffer_head.h | 12 - include/linux/fs.h |5 ++ include/linux/pagemap.h | 121 -- mm/filemap.c| 17 -- 10 files changed, 192 insertions(+), 29 deletions(-) Index: linux-2.6/block/Kconfig === --- linux-2.6.orig/block/Kconfig2007-08-27 19:22:13.0 -0700 +++ linux-2.6/block/Kconfig 2007-08-27 21:16:38.0 -0700 @@ -62,6 +62,20 @@ config BLK_DEV_BSG protocols (e.g. Task Management Functions and SMP in Serial Attached SCSI). +# +# The functions to switch on larger pages in a filesystem will return an error +# if the gfp flags for a mapping require only DMA pages. Highmem will always +# be switched off for higher order mappings. +# +config LARGE_BLOCKSIZE + bool "Support blocksizes larger than page size" + default n + depends on EXPERIMENTAL + help + Allows the page cache to support higher orders of pages. Higher + order page cache pages may be useful to increase I/O performance + and support special devices like CD or DVDs and Flash. 
+ endif # BLOCK source block/Kconfig.iosched Index: linux-2.6/drivers/block/rd.c === --- linux-2.6.orig/drivers/block/rd.c 2007-08-27 20:59:27.0 -0700 +++ linux-2.6/drivers/block/rd.c2007-08-27 21:10:38.0 -0700 @@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa } } while ((bh = bh->b_this_page) != head); } else { - memset(page_address(page), 0, page_cache_size(page_mapping(page))); + memset(page_address(page), 0, + page_cache_size(page_mapping(page))); } flush_dcache_page(page); SetPageUptodate(page); @@ -380,7 +381,8 @@ static int rd_open(struct inode *inode, gfp_mask = mapping_gfp_mask(mapping); gfp_mask &= ~(__GFP_FS|__GFP_IO); gfp_mask |= __GFP_HIGH; - mapping_set_gfp_mask(mapping, gfp_mask); + mapping_setup(mapping, gfp_mask, + page_cache_blkbits_to_order(inode->i_blkbits)); } return 0; Index: linux-2.6/fs/block_dev.c === --- linux-2.6.orig/fs/block_dev.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/fs/block_dev.c2007-08-27 21:10:38.0 -0700 @@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic return; invalidate_bh_lrus(); truncate_inode_pages(bdev->bd_inode->i_mapping, 0); -} +} int set_blocksize(struct block_device *bdev, int size) { - /* Size must be a power of two, and between 512 and PAGE_SIZE */ - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) + int order; + + if (size > (PAGE_SIZE << (MAX_ORDER - 1)) || +
[34/36] Large blocksize support in XFS
The only change needed to enable Large Block I/O in XFS is to remove the check for a too large blocksize ;-) Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/xfs/xfs_mount.c | 13 - 1 files changed, 0 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index a66b398..47ddc89 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -326,19 +326,6 @@ xfs_mount_validate_sb( return XFS_ERROR(ENOSYS); } - /* -* Until this is fixed only page-sized or smaller data blocks work. -*/ - if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) { - xfs_fs_mount_cmn_err(flags, - "file system with blocksize %d bytes", - sbp->sb_blocksize); - xfs_fs_mount_cmn_err(flags, - "only pagesize (%ld) or less will currently work.", - PAGE_SIZE); - return XFS_ERROR(ENOSYS); - } - return 0; } -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[17/36] Use page_cache_xxx in fs/ext4
Use page_cache_xxx in fs/ext4 Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/ext4/dir.c |3 ++- fs/ext4/inode.c | 31 --- 2 files changed, 18 insertions(+), 16 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index 3ab01c0..9d6cd51 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -136,7 +136,8 @@ static int ext4_readdir(struct file * filp, err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0); if (err > 0) { pgoff_t index = map_bh.b_blocknr >> - (PAGE_CACHE_SHIFT - inode->i_blkbits); + (page_cache_shift(inode->i_mapping) + - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3fe1e40..0be5bf8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1223,7 +1223,7 @@ static int ext4_ordered_commit_write(struct file *file, struct page *page, */ loff_t new_i_size; - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + new_i_size = page_cache_pos(page->mapping, page->index, to); if (new_i_size > EXT4_I(inode)->i_disksize) EXT4_I(inode)->i_disksize = new_i_size; ret = generic_commit_write(file, page, from, to); @@ -1242,7 +1242,7 @@ static int ext4_writeback_commit_write(struct file *file, struct page *page, int ret = 0, ret2; loff_t new_i_size; - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + new_i_size = page_cache_pos(page->mapping, page->index, to); if (new_i_size > EXT4_I(inode)->i_disksize) EXT4_I(inode)->i_disksize = new_i_size; @@ -1269,7 +1269,7 @@ static int ext4_journalled_commit_write(struct file *file, /* * Here we duplicate the generic_commit_write() functionality */ - pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + pos = page_cache_pos(page->mapping, page->index, to); ret = walk_page_buffers(handle, page_buffers(page), from, to, &partial, commit_write_fn); @@ -1421,6 +1421,7 @@ static int ext4_ordered_writepage(struct page *page, handle_t *handle = NULL; int ret = 0; int err; + int 
pagesize = page_cache_size(inode->i_mapping); J_ASSERT(PageLocked(page)); @@ -1443,8 +1444,7 @@ static int ext4_ordered_writepage(struct page *page, (1 << BH_Dirty)|(1 << BH_Uptodate)); } page_bufs = page_buffers(page); - walk_page_buffers(handle, page_bufs, 0, - PAGE_CACHE_SIZE, NULL, bget_one); + walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one); ret = block_write_full_page(page, ext4_get_block, wbc); @@ -1461,13 +1461,12 @@ static int ext4_ordered_writepage(struct page *page, * and generally junk. */ if (ret == 0) { - err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, + err = walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, jbd2_journal_dirty_data_fn); if (!ret) ret = err; } - walk_page_buffers(handle, page_bufs, 0, - PAGE_CACHE_SIZE, NULL, bput_one); + walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one); err = ext4_journal_stop(handle); if (!ret) ret = err; @@ -1519,6 +1518,7 @@ static int ext4_journalled_writepage(struct page *page, handle_t *handle = NULL; int ret = 0; int err; + int pagesize = page_cache_size(inode->i_mapping); if (ext4_journal_current_handle()) goto no_write; @@ -1535,17 +1535,17 @@ static int ext4_journalled_writepage(struct page *page, * doesn't seem much point in redirtying the page here. */ ClearPageChecked(page); - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, + ret = block_prepare_write(page, 0, pagesize, ext4_get_block); if (ret != 0) { ext4_journal_stop(handle); goto out_unlock; } ret = walk_page_buffers(handle, page_buffers(page), 0, - PAGE_CACHE_SIZE, NULL, do_journal_get_write_access); + pagesize, NULL, do_journal_get_write_access); err = walk_page_buffers(handle, page_buffers(page), 0, - PAGE_CACHE_SIZE, NULL, commit_
[36/36] Reiserfs: Fix up for mapping_set_gfp_mask
mapping_set_gfp_mask only works on order 0 page cache operations. Reiserfs can use 8k pages (order 1). Replace the mapping_set_gfp_mask with mapping_setup to make this work properly. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/reiserfs/xattr.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c index c86f570..5ca01f3 100644 --- a/fs/reiserfs/xattr.c +++ b/fs/reiserfs/xattr.c @@ -405,9 +405,10 @@ static struct page *reiserfs_get_page(struct inode *dir, unsigned long n) { struct address_space *mapping = dir->i_mapping; struct page *page; + /* We can deadlock if we try to free dentries, and an unlink/rmdir has just occurred - GFP_NOFS avoids this */ - mapping_set_gfp_mask(mapping, GFP_NOFS); + mapping_setup(mapping, GFP_NOFS, page_cache_shift(mapping)); page = read_mapping_page(mapping, n, NULL); if (!IS_ERR(page)) { kmap(page); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[27/36] Compound page zeroing and flushing
We may now have to zero and flush higher order pages. Implement clear_mapping_page and flush_mapping_page to do that job. Replace the flushing and clearing at some key locations for the pagecache. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/libfs.c |4 ++-- include/linux/highmem.h | 31 +-- mm/filemap.c|4 ++-- mm/filemap_xip.c|4 ++-- 4 files changed, 35 insertions(+), 8 deletions(-) Index: linux-2.6/fs/libfs.c === --- linux-2.6.orig/fs/libfs.c 2007-08-27 20:51:55.0 -0700 +++ linux-2.6/fs/libfs.c2007-08-27 21:08:04.0 -0700 @@ -330,8 +330,8 @@ int simple_rename(struct inode *old_dir, int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); - flush_dcache_page(page); + clear_mapping_page(page); + flush_mapping_page(page); SetPageUptodate(page); unlock_page(page); return 0; Index: linux-2.6/include/linux/highmem.h === --- linux-2.6.orig/include/linux/highmem.h 2007-08-27 19:22:17.0 -0700 +++ linux-2.6/include/linux/highmem.h 2007-08-27 21:08:04.0 -0700 @@ -124,14 +124,41 @@ static inline void clear_highpage(struct kunmap_atomic(kaddr, KM_USER0); } +/* + * Clear a higher order page + */ +static inline void clear_mapping_page(struct page *page) +{ + int nr_pages = compound_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + clear_highpage(page + i); +} + +/* + * Primitive support for flushing higher order pages. 
+ * + * A bit stupid: On many platforms flushing the first page + * will flush any TLB starting there + */ +static inline void flush_mapping_page(struct page *page) +{ + int nr_pages = compound_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + flush_dcache_page(page + i); +} + static inline void zero_user_segments(struct page *page, unsigned start1, unsigned end1, unsigned start2, unsigned end2) { void *kaddr = kmap_atomic(page, KM_USER0); - BUG_ON(end1 > PAGE_SIZE || - end2 > PAGE_SIZE); + BUG_ON(end1 > compound_size(page) || + end2 > compound_size(page)); if (end1 > start1) memset(kaddr + start1, 0, end1 - start1); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 19:31:13.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:08:04.0 -0700 @@ -941,7 +941,7 @@ page_ok: * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * When a sequential read accesses a page several times, @@ -1932,7 +1932,7 @@ generic_file_buffered_write(struct kiocb else copied = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes); - flush_dcache_page(page); + flush_mapping_page(page); status = a_ops->commit_write(file, page, offset, offset+bytes); if (status == AOP_TRUNCATED_PAGE) { page_cache_release(page); Index: linux-2.6/mm/filemap_xip.c === --- linux-2.6.orig/mm/filemap_xip.c 2007-08-27 20:51:40.0 -0700 +++ linux-2.6/mm/filemap_xip.c 2007-08-27 21:08:04.0 -0700 @@ -104,7 +104,7 @@ do_xip_mapping_read(struct address_space * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * Ok, we have the page, so now we can copy it to user space... 
@@ -320,7 +320,7 @@ __xip_file_write(struct file *filp, cons } copied = filemap_copy_from_user(page, offset, buf, bytes); - flush_dcache_page(page); + flush_mapping_page(page); if (likely(copied > 0)) { status = copied; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[29/36] Fix up reclaim counters
Compound pages of an arbitrary order may now be on the LRU and may be reclaimed. Adjust the counting in vmscan.c to count the number of base pages. Also change the active and inactive accounting to do the same. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mm_inline.h | 36 +++- mm/vmscan.c | 22 -- 2 files changed, 39 insertions(+), 19 deletions(-) Index: linux-2.6/include/linux/mm_inline.h === --- linux-2.6.orig/include/linux/mm_inline.h2007-08-27 19:22:13.0 -0700 +++ linux-2.6/include/linux/mm_inline.h 2007-08-27 21:08:27.0 -0700 @@ -2,39 +2,57 @@ static inline void add_page_to_active_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->active_list); - __inc_zone_state(zone, NR_ACTIVE); + if (!PageHead(page)) + __inc_zone_state(zone, NR_ACTIVE); + else + __inc_zone_page_state(page, NR_ACTIVE); } static inline void add_page_to_inactive_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->inactive_list); - __inc_zone_state(zone, NR_INACTIVE); + if (!PageHead(page)) + __inc_zone_state(zone, NR_INACTIVE); + else + __inc_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_active_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_ACTIVE); + if (!PageHead(page)) + __dec_zone_state(zone, NR_ACTIVE); + else + __dec_zone_page_state(page, NR_ACTIVE); } static inline void del_page_from_inactive_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_INACTIVE); + if (!PageHead(page)) + __dec_zone_state(zone, NR_INACTIVE); + else + __dec_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_lru(struct zone *zone, struct page *page) { + enum zone_stat_item counter = NR_ACTIVE; + list_del(&page->lru); - if (PageActive(page)) { + if (PageActive(page)) __ClearPageActive(page); - __dec_zone_state(zone, NR_ACTIVE); - } else { - __dec_zone_state(zone, NR_INACTIVE); - } + else + counter = NR_INACTIVE; + 
if (!PageHead(page)) + __dec_zone_state(zone, counter); + else + __dec_zone_page_state(page, counter); } + Index: linux-2.6/mm/vmscan.c === --- linux-2.6.orig/mm/vmscan.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/vmscan.c 2007-08-27 21:08:27.0 -0700 @@ -466,14 +466,14 @@ static unsigned long shrink_page_list(st VM_BUG_ON(PageActive(page)); - sc->nr_scanned++; + sc->nr_scanned += compound_pages(page); if (!sc->may_swap && page_mapped(page)) goto keep_locked; /* Double the slab pressure for mapped and swapcache pages */ if (page_mapped(page) || PageSwapCache(page)) - sc->nr_scanned++; + sc->nr_scanned += compound_pages(page); may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -590,7 +590,7 @@ static unsigned long shrink_page_list(st free_it: unlock_page(page); - nr_reclaimed++; + nr_reclaimed += compound_pages(page); if (!pagevec_add(&freed_pvec, page)) __pagevec_release_nonlru(&freed_pvec); continue; @@ -682,22 +682,23 @@ static unsigned long isolate_lru_pages(u unsigned long nr_taken = 0; unsigned long scan; - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { + for (scan = 0; scan < nr_to_scan && !list_empty(src); ) { struct page *page; unsigned long pfn; unsigned long end_pfn; unsigned long page_pfn; + int pages; int zone_id; page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); - + pages = compound_pages(page); VM_BUG_ON(!PageLRU(page)); switch (__isolate_lru_page(page, mode)) { case 0: list_move(&page->lru, dst); - nr_taken++; + nr_taken += pages; break; case -EBUSY: @@ -743,8 +744,8 @@ static unsigned long isolate_lru_pages(u switch (__isolate_lru_page(cursor_page, mode)) {
[02/36] Define functions for page cache handling
We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK and PAGE_CACHE_ALIGN in various places in the kernel. Many times common operations like calculating the offset or the index are coded using shifts and adds. This patch provides inline functions to get the calculations accomplished without having to explicitly shift and add constants.

All functions take an address_space pointer. The address space pointer will be used in the future to eventually support a variable size page cache. Information reachable via the mapping may then determine page size.

New function                            Related base page constant
page_cache_shift(a)                     PAGE_CACHE_SHIFT
page_cache_size(a)                      PAGE_CACHE_SIZE
page_cache_mask(a)                      PAGE_CACHE_MASK
page_cache_index(a, pos)                Calculate page number from position
page_cache_next(addr, pos)              Page number of next page
page_cache_offset(a, pos)               Calculate offset into a page
page_cache_pos(a, index, offset)        Form position based on page number and an offset.

This provides a basis that would allow the conversion of all page cache handling in the kernel and ultimately allow the removal of the PAGE_CACHE_* constants.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- include/linux/pagemap.h | 54 +++ 1 files changed, 54 insertions(+), 0 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 8a83537..836e9dd 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -52,12 +52,66 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) * space in smaller chunks for same flexibility). * * Or rather, it _will_ be done in larger chunks. + * + * The following constants can be used if a filesystem only supports a single + * page size. */ #define PAGE_CACHE_SHIFT PAGE_SHIFT #define PAGE_CACHE_SIZE PAGE_SIZE #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +/* + * Functions that are currently setup for a fixed PAGE_SIZEd.
The use of + * these will allow a variable page size pagecache in the future. + */ +static inline int mapping_order(struct address_space *a) +{ + return 0; +} + +static inline int page_cache_shift(struct address_space *a) +{ + return PAGE_SHIFT; +} + +static inline unsigned int page_cache_size(struct address_space *a) +{ + return PAGE_SIZE; +} + +static inline loff_t page_cache_mask(struct address_space *a) +{ + return (loff_t)PAGE_MASK; +} + +static inline unsigned int page_cache_offset(struct address_space *a, + loff_t pos) +{ + return pos & ~PAGE_MASK; +} + +static inline pgoff_t page_cache_index(struct address_space *a, + loff_t pos) +{ + return pos >> page_cache_shift(a); +} + +/* + * Index of the page starting on or after the given position. + */ +static inline pgoff_t page_cache_next(struct address_space *a, + loff_t pos) +{ + return page_cache_index(a, pos + page_cache_size(a) - 1); +} + +static inline loff_t page_cache_pos(struct address_space *a, + pgoff_t index, unsigned long offset) +{ + return ((loff_t)index << page_cache_shift(a)) + offset; +} + #define page_cache_get(page) get_page(page) #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
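The helpers in the patch all collapse to the PAGE_SIZE values today, since mapping_order() always returns 0. The sketch below shows the arithmetic once a mapping can carry a nonzero order, which is the stated end goal; the struct and the 4k base page are assumptions for illustration, not the kernel definitions:

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume a 4k base page for this sketch */

/* Hypothetical stand-in: an address_space that records its page order. */
struct address_space { unsigned int order; };

static unsigned int page_cache_shift(const struct address_space *a)
{
    return PAGE_SHIFT + a->order;
}

static unsigned long page_cache_size(const struct address_space *a)
{
    return 1UL << page_cache_shift(a);
}

static unsigned long page_cache_index(const struct address_space *a, long long pos)
{
    /* page number containing position pos */
    return (unsigned long)(pos >> page_cache_shift(a));
}

static unsigned int page_cache_offset(const struct address_space *a, long long pos)
{
    /* byte offset of pos within its page */
    return (unsigned int)(pos & (page_cache_size(a) - 1));
}

static long long page_cache_pos(const struct address_space *a,
                                unsigned long index, unsigned int offset)
{
    /* inverse: rebuild a file position from (index, offset) */
    return ((long long)index << page_cache_shift(a)) + offset;
}
```

Note how index/offset/pos form a round trip: page_cache_pos(a, page_cache_index(a, pos), page_cache_offset(a, pos)) == pos for any position, whatever the order.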
[30/36] Add VM_BUG_ONs to check for correct page order
Before allowing different page orders it may be wise to get some checkpoints in at various places. Checkpoints will help debugging whenever a wrong order page shows up in a mapping. Helps when converting new filesystems to utilize larger pages. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/buffer.c |1 + mm/filemap.c | 18 +++--- 2 files changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-08-27 20:52:34.0 -0700 +++ linux-2.6/fs/buffer.c 2007-08-27 21:09:19.0 -0700 @@ -893,6 +893,7 @@ struct buffer_head *alloc_page_buffers(s long offset; unsigned int page_size = page_cache_size(page->mapping); + BUG_ON(size > page_size); try_again: head = NULL; offset = page_size; Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 21:08:04.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:09:19.0 -0700 @@ -128,6 +128,7 @@ void remove_from_page_cache(struct page struct address_space *mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); write_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); @@ -269,6 +270,7 @@ int wait_on_page_writeback_range(struct if (page->index > end) continue; + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); wait_on_page_writeback(page); if (PageError(page)) ret = -EIO; @@ -440,6 +442,7 @@ int add_to_page_cache(struct page *page, { int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); if (error == 0) { write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); @@ -599,8 +602,10 @@ struct page * find_get_page(struct addre read_lock_irq(&mapping->tree_lock); page = radix_tree_lookup(&mapping->page_tree, offset); - if (page) + if (page) { + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); page_cache_get(page); + } read_unlock_irq(&mapping->tree_lock); return page; } @@ 
-625,6 +630,7 @@ struct page *find_lock_page(struct addre repeat: page = radix_tree_lookup(&mapping->page_tree, offset); if (page) { + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); @@ -715,8 +721,10 @@ unsigned find_get_pages(struct address_s read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); + } read_unlock_irq(&mapping->tree_lock); return ret; } @@ -746,6 +754,7 @@ unsigned find_get_pages_contig(struct ad if (pages[i]->mapping == NULL || pages[i]->index != index) break; + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); index++; } @@ -774,8 +783,10 @@ unsigned find_get_pages_tag(struct addre read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages, *index, nr_pages, tag); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); + } if (ret) *index = pages[ret - 1]->index + 1; read_unlock_irq(&mapping->tree_lock); @@ -2233,6 +2244,7 @@ int try_to_release_page(struct page *pag struct address_space * const mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); if (PageWriteback(page)) return 0; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
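Every VM_BUG_ON added above asserts the same invariant: a page found in a mapping must have exactly the mapping's order. A user-space sketch of that check, with hypothetical simplified types standing in for the kernel's address_space and page:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types, not the kernel structures. */
struct address_space { unsigned int order; };
struct page { unsigned int compound_order; };

/* The invariant the patch asserts at every page-cache insert and lookup. */
static int order_matches(const struct address_space *m, const struct page *p)
{
    return m->order == p->compound_order;
}

/* Sketch of a checked lookup: NULL results are fine, but any page returned
 * from the mapping must satisfy the invariant. */
static const struct page *checked_lookup(const struct address_space *m,
                                         const struct page *p)
{
    if (p)
        assert(order_matches(m, p));
    return p;
}
```

Placing the check at lookup time (rather than only at insert) is what makes it useful for converting filesystems: a page of the wrong order fires the assertion at the first access, close to the offending code path.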
[32/36] Readahead changes to support large blocksize.
Fix up readahead for large I/O operations. Only calculate the readahead until the 2M boundary then fall back to one page.

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

=== --- include/linux/mm.h |2 +- mm/fadvise.c |4 ++-- mm/filemap.c |5 ++--- mm/madvise.c |2 +- mm/readahead.c | 22 ++ 5 files changed, 20 insertions(+), 15 deletions(-) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h 2007-08-27 21:03:20.0 -0700 +++ linux-2.6/include/linux/mm.h2007-08-27 21:14:44.0 -0700 @@ -1142,7 +1142,7 @@ void page_cache_async_readahead(struct a pgoff_t offset, unsigned long size); -unsigned long max_sane_readahead(unsigned long nr); +unsigned long max_sane_readahead(unsigned long nr, int order); /* Do stack extension */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); Index: linux-2.6/mm/fadvise.c === --- linux-2.6.orig/mm/fadvise.c 2007-08-27 20:52:49.0 -0700 +++ linux-2.6/mm/fadvise.c 2007-08-27 21:14:44.0 -0700 @@ -86,10 +86,10 @@ asmlinkage long sys_fadvise64_64(int fd, nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, - max_sane_readahead(nrpages)); + nrpages); if (ret > 0) ret = 0; break; Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 21:10:38.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:14:44.0 -0700 @@ -1237,8 +1237,7 @@ do_readahead(struct address_space *mappi if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage) return -EINVAL; - force_page_cache_readahead(mapping, filp, index, - max_sane_readahead(nr)); + force_page_cache_readahead(mapping, filp, index, nr); return 0; } @@ -1373,7 +1372,7 @@ retry_find: count_vm_event(PGMAJFAULT); } did_readaround = 1; - ra_pages = max_sane_readahead(file->f_ra.ra_pages); + ra_pages = file->f_ra.ra_pages; if (ra_pages) { pgoff_t start = 0; Index: linux-2.6/mm/madvise.c === ---
linux-2.6.orig/mm/madvise.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/madvise.c 2007-08-27 21:14:44.0 -0700 @@ -124,7 +124,7 @@ static long madvise_willneed(struct vm_a end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; force_page_cache_readahead(file->f_mapping, - file, start, max_sane_readahead(end - start)); + file, start, end - start); return 0; } Index: linux-2.6/mm/readahead.c === --- linux-2.6.orig/mm/readahead.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/readahead.c2007-08-27 21:14:44.0 -0700 @@ -44,7 +44,8 @@ EXPORT_SYMBOL_GPL(default_backing_dev_in void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping) { - ra->ra_pages = mapping->backing_dev_info->ra_pages; + ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages, + page_cache_size(mapping)); ra->prev_index = -1; } EXPORT_SYMBOL_GPL(file_ra_state_init); @@ -84,7 +85,7 @@ int read_cache_pages(struct address_spac put_pages_list(pages); break; } - task_io_account_read(PAGE_CACHE_SIZE); + task_io_account_read(page_cache_size(mapping)); } pagevec_lru_add(&lru_pvec); return ret; @@ -151,7 +152,7 @@ __do_page_cache_readahead(struct address if (isize == 0) goto out; - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + end_index = page_cache_index(mapping, isize - 1); /* * Preallocate as many pages as we will need. @@ -204,10 +205,12 @@ int force_page_cache_readahead(struct ad if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages)) return -EINVAL; + nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping)); while (nr_to_read) { i
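A readahead budget expressed in base pages has to shrink when a mapping uses higher-order pages, because each page then covers more data. The helper below is a hypothetical illustration of that conversion (it is not the patch's actual max_sane_readahead(), whose body is truncated above); the DIV_ROUND_UP macro matches the kernel's definition:

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: convert a readahead budget counted in base pages
 * into units of order-`order` pages, rounding up so at least the same
 * amount of data is covered. */
static unsigned long ra_units_for_order(unsigned long base_pages, int order)
{
    return DIV_ROUND_UP(base_pages, 1UL << order);
}
```

For example, a 32-base-page window becomes 8 order-2 pages, and a 33-base-page window rounds up to 9 so the last partial large page is still read.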
[11/36] Use page_cache_xxx in fs/buffer.c
Use page_cache_xxx in fs/buffer.c. We have a special situation in set_bh_page() since reiserfs calls that function before setting up the mapping. So retrieve the page size from the page struct rather than the mapping. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/buffer.c | 110 +--- 1 file changed, 62 insertions(+), 48 deletions(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-08-28 11:37:13.0 -0700 +++ linux-2.6/fs/buffer.c 2007-08-28 11:37:58.0 -0700 @@ -257,7 +257,7 @@ __find_get_block_slow(struct block_devic struct page *page; int all_mapped = 1; - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); + index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits); page = find_get_page(bd_mapping, index); if (!page) goto out; @@ -697,7 +697,7 @@ static int __set_page_dirty(struct page if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - task_io_account_write(PAGE_CACHE_SIZE); + task_io_account_write(page_cache_size(mapping)); } radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -891,10 +891,11 @@ struct buffer_head *alloc_page_buffers(s { struct buffer_head *bh, *head; long offset; + unsigned int page_size = page_cache_size(page->mapping); try_again: head = NULL; - offset = PAGE_SIZE; + offset = page_size; while ((offset -= size) >= 0) { bh = alloc_buffer_head(GFP_NOFS); if (!bh) @@ -1426,7 +1427,7 @@ void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { bh->b_page = page; - BUG_ON(offset >= PAGE_SIZE); + BUG_ON(offset >= compound_size(page)); if (PageHighMem(page)) /* * This catches illegal uses and preserves the offset: @@ -1605,6 +1606,7 @@ static int __block_write_full_page(struc struct buffer_head *bh, *head; const unsigned blocksize = 1 << inode->i_blkbits; int nr_underway = 0; + struct address_space *mapping = inode->i_mapping; BUG_ON(!PageLocked(page)); @@ -1625,7 +1627,8 @@ static int 
__block_write_full_page(struc * handle that here by just cleaning them. */ - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + block = (sector_t)page->index << + (page_cache_shift(mapping) - inode->i_blkbits); head = page_buffers(page); bh = head; @@ -1742,7 +1745,7 @@ recover: } while ((bh = bh->b_this_page) != head); SetPageError(page); BUG_ON(PageWriteback(page)); - mapping_set_error(page->mapping, err); + mapping_set_error(mapping, err); set_page_writeback(page); do { struct buffer_head *next = bh->b_this_page; @@ -1767,8 +1770,8 @@ static int __block_prepare_write(struct struct buffer_head *bh, *head, *wait[2], **wait_bh=wait; BUG_ON(!PageLocked(page)); - BUG_ON(from > PAGE_CACHE_SIZE); - BUG_ON(to > PAGE_CACHE_SIZE); + BUG_ON(from > page_cache_size(inode->i_mapping)); + BUG_ON(to > page_cache_size(inode->i_mapping)); BUG_ON(from > to); blocksize = 1 << inode->i_blkbits; @@ -1777,7 +1780,8 @@ static int __block_prepare_write(struct head = page_buffers(page); bbits = inode->i_blkbits; - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); + block = (sector_t)page->index << + (page_cache_shift(inode->i_mapping) - bbits); for(bh = head, block_start = 0; bh != head || !block_start; block++, block_start=block_end, bh = bh->b_this_page) { @@ -1921,7 +1925,8 @@ int block_read_full_page(struct page *pa create_empty_buffers(page, blocksize, 0); head = page_buffers(page); - iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)page->index << + (page_cache_shift(page->mapping) - inode->i_blkbits); lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits; bh = head; nr = 0; @@ -2045,7 +2050,7 @@ int generic_cont_expand(struct inode *in pgoff_t index; unsigned int offset; - offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */ + offset = page_cache_offset(inode->i_mapping, size); /* Within page */ /* ugh. in prepare/commit_write, if from==to==start of block, we ** skip the prepare. 
make sure we never send an offset for the start @@ -2055,7 +2060,7 @@ int generi
[33/36] Large blocksize support in ramfs
The simplest file system to use for large blocksize support is ramfs. Note that ramfs does not use the lower layers (buffer I/O etc) so this case is useful for initial testing of changes to large buffer size support if one just wants to exercise the higher layers.

The patch adds the ability to specify a mount parameter to modify the order for the pages that are allocated by ramfs. Here is an example of how to mount a volume with order 10 pages:

	mount -tramfs -o10 none /media

Mounts a ramfs filesystem with 4MB sized pages. Then copy a file onto it:

	cp linux-2.6.21-rc7.tar.gz /media

This will populate the ramfs volume. Note that we allocated 14 pages of 4M each instead of 13508. Get rid of the large pages again:

	umount /media

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- fs/ramfs/inode.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index ef2b46d..b317f80 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -60,7 +60,8 @@ struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev) inode->i_blocks = 0; inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_setup(inode->i_mapping, GFP_HIGHUSER, + sb->s_blocksize_bits - PAGE_SHIFT); inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { default: @@ -164,10 +165,15 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent) { struct inode * inode; struct dentry * root; + int order = 0; + char *options = data; + + if (options && *options) + order = simple_strtoul(options, NULL, 10); sb->s_maxbytes = MAX_LFS_FILESIZE; - sb->s_blocksize = PAGE_CACHE_SIZE; - sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_blocksize = PAGE_CACHE_SIZE << order; + sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT; sb->s_magic = RAMFS_MAGIC; sb->s_op = &ramfs_ops; sb->s_time_gran = 1; --
1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
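The fill_super logic is just option parsing plus a shift. A user-space sketch, assuming a 4k base page and using strtoul in place of the kernel's simple_strtoul:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_CACHE_SHIFT 12                     /* assume a 4k base page */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

/* Mirrors ramfs_fill_super(): parse the mount option string as a page
 * order and derive the superblock block size from it. An absent or empty
 * option means order 0, i.e. the ordinary base page size. */
static unsigned long blocksize_for_option(const char *options)
{
    int order = 0;

    if (options && *options)
        order = (int)strtoul(options, NULL, 10);
    return PAGE_CACHE_SIZE << order;
}
```

With `-o10` this yields 4096 << 10 = 4194304 bytes, matching the 4MB pages the commit message describes.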
[14/36] Use page_cache_xxx in fs/splice.c
Use page_cache_xxx in fs/splice.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/splice.c | 27 ++- 1 files changed, 14 insertions(+), 13 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index c010a72..7910f32 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -279,9 +279,9 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, .ops = &page_cache_pipe_buf_ops, }; - index = *ppos >> PAGE_CACHE_SHIFT; - loff = *ppos & ~PAGE_CACHE_MASK; - req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); + loff = page_cache_offset(mapping, *ppos); + req_pages = page_cache_next(mapping, len + loff); nr_pages = min(req_pages, (unsigned)PIPE_BUFFERS); /* @@ -336,7 +336,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, * Now loop over the map and see if we need to start IO on any * pages, fill in the partial map, etc. */ - index = *ppos >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); nr_pages = spd.nr_pages; spd.nr_pages = 0; for (page_nr = 0; page_nr < nr_pages; page_nr++) { @@ -348,7 +348,8 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, /* * this_len is the max we'll use from this page */ - this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); + this_len = min_t(unsigned long, len, + page_cache_size(mapping) - loff); page = pages[page_nr]; if (PageReadahead(page)) @@ -408,7 +409,7 @@ fill_it: * i_size must be checked after PageUptodate. 
*/ isize = i_size_read(mapping->host); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); if (unlikely(!isize || index > end_index)) break; @@ -422,7 +423,7 @@ fill_it: /* * max good bytes in this page */ - plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + plen = page_cache_offset(mapping, isize - 1) + 1; if (plen <= loff) break; @@ -573,12 +574,12 @@ static int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, if (unlikely(ret)) return ret; - index = sd->pos >> PAGE_CACHE_SHIFT; - offset = sd->pos & ~PAGE_CACHE_MASK; + index = page_cache_index(mapping, sd->pos); + offset = page_cache_offset(mapping, sd->pos); this_len = sd->len; - if (this_len + offset > PAGE_CACHE_SIZE) - this_len = PAGE_CACHE_SIZE - offset; + if (this_len + offset > page_cache_size(mapping)) + this_len = page_cache_size(mapping) - offset; find_page: page = find_lock_page(mapping, index); @@ -839,7 +840,7 @@ generic_file_splice_write_nolock(struct pipe_inode_info *pipe, struct file *out, unsigned long nr_pages; *ppos += ret; - nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + nr_pages = page_cache_next(mapping, ret); /* * If file or inode is SYNC and we actually wrote some data, @@ -896,7 +897,7 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, unsigned long nr_pages; *ppos += ret; - nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + nr_pages = page_cache_next(mapping, ret); /* * If file or inode is SYNC and we actually wrote some data, -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
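The recurring pattern in the splice conversion above is clamping a transfer so it never crosses a page boundary: `this_len = min(len, page_cache_size(mapping) - loff)`. A user-space sketch of that step, with a hypothetical simplified address_space carrying its page order and a 4k base page assumed:

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume a 4k base page */

/* Hypothetical stand-in for a mapping with a variable page order. */
struct address_space { unsigned int order; };

static unsigned long page_cache_size(const struct address_space *a)
{
    return 1UL << (PAGE_SHIFT + a->order);
}

/* Mirrors pipe_to_file()/__generic_file_splice_read(): a transfer starting
 * at offset `loff` within a page may use at most the bytes remaining in
 * that page. */
static unsigned long clamp_to_page(const struct address_space *a,
                                   unsigned long len, unsigned long loff)
{
    unsigned long room = page_cache_size(a) - loff;

    return len < room ? len : room;
}
```

This shows why the conversion matters: with order-2 (16k) pages a transfer that would have been split at the 4k boundary can proceed whole, since the clamp is against the mapping's page size rather than the fixed PAGE_CACHE_SIZE.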
[21/36] compound pages: PageHead/PageTail instead of PageCompound
This patch enhances the handling of compound pages in the VM. It may also be important for the antifrag patches that need to manage a set of higher order free pages and also for other uses of compound pages. For now it simplifies accounting for SLUB pages but the groundwork here is important for the large block size patches and for allowing page migration of larger pages.

With this framework we may be able to get to a point where compound pages keep their flags while they are free and Mel may avoid having special functions for determining the page order of higher order freed pages. If we can avoid the setup and teardown of higher order pages then allocation and release of compound pages will be faster.

Looking at the handling of compound pages we see that the fact that a page is part of a higher order page is not that interesting. The differentiation is mainly for head pages and tail pages of higher order pages. Head pages keep the page state and it is usually sufficient to pass a pointer to a head page. It is usually an error if tail pages are encountered. Or they may need to be treated like PAGE_SIZE pages.

So a compound flag in the page flags is not what we need. Instead we introduce a flag for the head page and another for the tail page. The PageCompound test is preserved for backward compatibility and will test if either PageTail or PageHead has been set.

After this patchset the uses of PageCompound() will be reduced significantly in the core VM. The I/O layer will still use PageCompound() for direct I/O. However, if we at some point convert direct I/O to also support compound pages as a single unit then PageCompound() there may become unnecessary as well as the leftover check in mm/swap.c. We may end up mostly with checks for PageTail and PageHead.

This patch: Use two separate page flags for the head and tail of compound pages. PageHead() and PageTail() become more efficient. PageCompound then becomes a check for PageTail || PageHead.

Over time it is expected that PageCompound will mostly go away since the head page processing will be different from tail page processing in most situations.

We can remove the compound page check from set_page_refcounted since PG_reclaim is no longer overloaded. Also the check in __free_one_page can only be for PageHead. We cannot free a tail page.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- include/linux/page-flags.h | 41 +++-- mm/internal.h |2 +- mm/page_alloc.c|2 +- 3 files changed, 13 insertions(+), 32 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 209d3a4..2786693 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,13 +83,15 @@ #define PG_private 11 /* If pagecache, has fs-private data */ #define PG_writeback 12 /* Page is under writeback */ -#define PG_compound14 /* Part of a compound page */ #define PG_swapcache 15 /* Swap page: swp_entry_t in private */ #define PG_mappedtodisk16 /* Has blocks allocated on-disk */ #define PG_reclaim 17 /* To be reclaimed asap */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_head21 /* Page is head of a compound page */ +#define PG_tail22 /* Page is tail of a compound page */ + /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ #define PG_readahead PG_reclaim /* Reminder to do async read-ahead */ @@ -216,37 +218,16 @@ static inline void SetPageUptodate(struct page *page) #define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags) #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags) -#define PageCompound(page) test_bit(PG_compound, &(page)->flags) -#define __SetPageCompound(page)__set_bit(PG_compound, &(page)->flags) -#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags) - -/* - * PG_reclaim is used in combination with PG_compound to mark the - * head and tail of a compound page - * - * PG_compound & PG_reclaim=> Tail page - * PG_compound &
~PG_reclaim => Head page - */ - -#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) +#define PageHead(page) test_bit(PG_head, &(page)->flags) +#define __SetPageHead(page)__set_bit(PG_head, &(page)->flags) +#define __ClearPageHead(page) __clear_bit(PG_head, &(page)->flags) -#define PageTail(page) ((page->flags & PG_head_tail_mask) \ - == PG_head_tail_mask) - -static inline void __SetPageTail(struct page *page) -{ - page->flags |= PG_head_tail_mask; -} - -static inline void __ClearPageTail(struct page *page) -{ - page->flags &= ~PG_head_tail_mask; -} +#define PageTail(page) test_bit(PG_tail, &(page->flags)) +#define
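The before/after semantics of the flag change can be demonstrated in user space. This sketch only models the new encoding with two independent bits (the old scheme overloaded PG_reclaim together with PG_compound); the bit numbers follow the patch but the struct is a stand-in, not the kernel's:

```c
#include <assert.h>

/* Separate head/tail bits, as in the patch; bit positions illustrative. */
#define PG_head 21
#define PG_tail 22

struct page { unsigned long flags; };

static int PageHead(const struct page *p) { return (p->flags >> PG_head) & 1; }
static int PageTail(const struct page *p) { return (p->flags >> PG_tail) & 1; }

/* PageCompound() is kept for backward compatibility: head or tail. */
static int PageCompound(const struct page *p)
{
    return PageHead(p) || PageTail(p);
}

static void SetPageHead(struct page *p) { p->flags |= 1UL << PG_head; }
static void SetPageTail(struct page *p) { p->flags |= 1UL << PG_tail; }
```

With dedicated bits, PageHead() and PageTail() are each a single bit test, and PG_reclaim is free to mean only "reclaim this page", which is exactly what lets the later patches drop the compound check from set_page_refcounted().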
[18/36] Use page_cache_xxx in fs/reiserfs
Use page_cache_xxx in fs/reiserfs Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/reiserfs/file.c| 83 ++--- fs/reiserfs/inode.c | 33 ++-- fs/reiserfs/ioctl.c |2 +- fs/reiserfs/stree.c |8 ++- fs/reiserfs/tail_conversion.c |5 +- fs/reiserfs/xattr.c | 19 + 6 files changed, 84 insertions(+), 66 deletions(-) Index: linux-2.6/fs/reiserfs/file.c === --- linux-2.6.orig/fs/reiserfs/file.c 2007-08-27 21:22:40.0 -0700 +++ linux-2.6/fs/reiserfs/file.c2007-08-27 21:50:01.0 -0700 @@ -187,9 +187,11 @@ static int reiserfs_allocate_blocks_for_ int curr_block; // current block used to keep track of unmapped blocks. int i; // loop counter int itempos;// position in item - unsigned int from = (pos & (PAGE_CACHE_SIZE - 1)); // writing position in + struct address_space *mapping = prepared_pages[0]->mapping; + unsigned int from = page_cache_offset(mapping, pos);// writing position in // first page - unsigned int to = ((pos + write_bytes - 1) & (PAGE_CACHE_SIZE - 1)) + 1;/* last modified byte offset in last page */ + unsigned int to = page_cache_offset(mapping, pos + write_bytes - 1) + 1; + /* last modified byte offset in last page */ __u64 hole_size;// amount of blocks for a file hole, if it needed to be created. int modifying_this_item = 0;// Flag for items traversal code to keep track // of the fact that we already prepared @@ -731,19 +733,22 @@ static int reiserfs_copy_from_user_to_fi long page_fault = 0;// status of copy_from_user. int i; // loop counter. 
int offset; // offset in page + struct address_space *mapping = prepared_pages[0]->mapping; - for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages; + for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages; i++, offset = 0) { - size_t count = min_t(size_t, PAGE_CACHE_SIZE - offset, write_bytes);// How much of bytes to write to this page + size_t count = min_t(size_t, page_cache_size(mapping) - offset, + write_bytes); // How much of bytes to write to this page struct page *page = prepared_pages[i]; // Current page we process. fault_in_pages_readable(buf, count); /* Copy data from userspace to the current page */ kmap(page); - page_fault = __copy_from_user(page_address(page) + offset, buf, count); // Copy the data. + page_fault = __copy_from_user(page_address(page) + offset, buf, count); + // Copy the data. /* Flush processor's dcache for this page */ - flush_dcache_page(page); + flush_mapping_page(page); kunmap(page); buf += count; write_bytes -= count; @@ -763,11 +768,12 @@ int reiserfs_commit_page(struct inode *i int partial = 0; unsigned blocksize; struct buffer_head *bh, *head; - unsigned long i_size_index = inode->i_size >> PAGE_CACHE_SHIFT; + unsigned long i_size_index = + page_cache_offset(inode->i_mapping, inode->i_size); int new; int logit = reiserfs_file_data_log(inode); struct super_block *s = inode->i_sb; - int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize; + int bh_per_page = page_cache_size(inode->i_mapping) / s->s_blocksize; struct reiserfs_transaction_handle th; int ret = 0; @@ -839,10 +845,11 @@ static int reiserfs_submit_file_region_f int offset; // Writing offset in page. 
int orig_write_bytes = write_bytes; int sd_update = 0; + struct address_space *mapping = inode->i_mapping; - for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages; + for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages; i++, offset = 0) { - int count = min_t(int, PAGE_CACHE_SIZE - offset, write_bytes); // How much of bytes to write to this page + int count = min_t(int, page_cache_size(mapping) - offset, write_bytes); // How much of bytes to write to this page struct page *page = prepared_pages[i]; // Current page we process. status = @@ -985,12 +992,12 @@ static int reiserfs_prepare_file_region_ ) { int res = 0;// Return values of different functions we call. - unsigned long index = pos >> PAGE_CACHE_SHIFT; // Offset in file in pages. - int from = (pos & (PAGE_CA
[24/36] compound pages: Use new compound vmstat functions in SLUB
Use the new dec/inc functions to simplify SLUB's accounting of pages. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/slub.c | 13 - 1 files changed, 4 insertions(+), 9 deletions(-) Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/slub.c 2007-08-27 21:02:51.0 -0700 @@ -1038,7 +1038,6 @@ static inline void kmem_cache_open_debug static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) { struct page * page; - int pages = 1 << s->order; if (s->order) flags |= __GFP_COMP; @@ -1054,10 +1053,9 @@ static struct page *allocate_slab(struct if (!page) return NULL; - mod_zone_page_state(page_zone(page), + inc_zone_page_state(page, (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - pages); + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE); return page; } @@ -1124,8 +1122,6 @@ out: static void __free_slab(struct kmem_cache *s, struct page *page) { - int pages = 1 << s->order; - if (unlikely(SlabDebug(page))) { void *p; @@ -1135,10 +1131,9 @@ static void __free_slab(struct kmem_cach ClearSlabDebug(page); } - mod_zone_page_state(page_zone(page), + dec_zone_page_state(page, (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - - pages); + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE); page->mapping = NULL; __free_pages(page, s->order); -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[35/36] Large blocksize support for ext2
This adds support for a block size of up to 64k on any platform. It enables mounting filesystems that have a larger blocksize than the page size. F.e. the following is possible on x86_64 and i386 that have only a 4k page size:

	mke2fs -b 16384 /dev/hdd2
	mount /dev/hdd2 /media
	ls -l /media

Do more things with the volume that uses a 16k page cache size on a 4k page sized platform.

Hmmm... Actually there is nothing additional to be done after the earlier cleanup of the macros. So just modify the copyright.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- fs/ext2/inode.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 0079b2c..5ff775a 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -20,6 +20,9 @@ * ([EMAIL PROTECTED]) * * Assorted race fixes, rewrite of ext2_get_block() by Al Viro, 2000 + * + * (C) 2007 SGI. + * Large blocksize support by Christoph Lameter */ #include -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[16/36] Use page_cache_xxx in fs/ext3
Use page_cache_xxx in fs/ext3

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/ext3/dir.c   |  3 ++-
 fs/ext3/inode.c | 34 +-
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..a65b5a7 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -137,7 +137,8 @@ static int ext3_readdir(struct file * filp,
 				&map_bh, 0, 0);
 		if (err > 0) {
 			pgoff_t index = map_bh.b_blocknr >>
-				(PAGE_CACHE_SHIFT - inode->i_blkbits);
+				(page_cache_shift(inode->i_mapping)
+				- inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb3c264..986519b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1224,7 +1224,7 @@ static int ext3_ordered_commit_write(struct file *file, struct page *page,
 	 */
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(page->mapping, page->index, to);
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
 	ret = generic_commit_write(file, page, from, to);
@@ -1243,7 +1243,7 @@ static int ext3_writeback_commit_write(struct file *file, struct page *page,
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(inode->i_mapping, page->index, to);
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
@@ -1270,7 +1270,7 @@ static int ext3_journalled_commit_write(struct file *file,
 	/*
 	 * Here we duplicate the generic_commit_write() functionality
 	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	pos = page_cache_pos(page->mapping, page->index, to);
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
 				to, &partial, commit_write_fn);
@@ -1422,6 +1422,7 @@ static int ext3_ordered_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	J_ASSERT(PageLocked(page));
@@ -1444,8 +1445,7 @@ static int ext3_ordered_writepage(struct page *page,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
 	}
 	page_bufs = page_buffers(page);
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bget_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one);
 
 	ret = block_write_full_page(page, ext3_get_block, wbc);
@@ -1462,13 +1462,12 @@ static int ext3_ordered_writepage(struct page *page,
 	 * and generally junk.
 	 */
 	if (ret == 0) {
-		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
-					NULL, journal_dirty_data_fn);
+		err = walk_page_buffers(handle, page_bufs, 0, pagesize,
+					NULL, journal_dirty_data_fn);
 		if (!ret)
 			ret = err;
 	}
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bput_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one);
 	err = ext3_journal_stop(handle);
 	if (!ret)
 		ret = err;
@@ -1520,6 +1519,7 @@ static int ext3_journalled_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	if (ext3_journal_current_handle())
 		goto no_write;
@@ -1536,17 +1536,16 @@ static int ext3_journalled_writepage(struct page *page,
 	 * doesn't seem much point in redirtying the page here.
 	 */
 	ClearPageChecked(page);
-	ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
-			ext3_get_block);
+	ret = block_prepare_write(page, 0, pagesize, ext3_get_block);
 	if (ret != 0) {
 		ext3_journal_stop(handle);
 		goto out_unlock;
 	}
 	ret = walk_page_buffers(handle, page_buffers(page), 0,
-		PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
 	err = walk_page_buffers(handle, page_buffers(page), 0,
-
[23/36] compound pages: vmstat support
Add support for compound pages so that inc_ and dec_xxx will increment
the ZVCs by the number of base pages of the compound page.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/vmstat.h |  5 ++---
 mm/vmstat.c            | 18 +-
 2 files changed, 15 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===
--- linux-2.6.orig/include/linux/vmstat.h	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/include/linux/vmstat.h	2007-08-27 20:59:42.0 -0700
@@ -234,7 +234,7 @@ static inline void __inc_zone_state(stru
 static inline void __inc_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	__mod_zone_page_state(page_zone(page), item, compound_pages(page));
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
@@ -246,8 +246,7 @@ static inline void __dec_zone_state(stru
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	atomic_long_dec(&page_zone(page)->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	__mod_zone_page_state(page_zone(page), item, -compound_pages(page));
 }
 
 /*
Index: linux-2.6/mm/vmstat.c
===
--- linux-2.6.orig/mm/vmstat.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/mm/vmstat.c	2007-08-27 20:59:42.0 -0700
@@ -225,7 +225,12 @@ void __inc_zone_state(struct zone *zone,
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__inc_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, compound_pages(page));
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
@@ -246,7 +251,12 @@ void __dec_zone_state(struct zone *zone,
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__dec_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__dec_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, -compound_pages(page));
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
@@ -262,11 +272,9 @@ void inc_zone_state(struct zone *zone, e
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
-	struct zone *zone;
 
-	zone = page_zone(page);
 	local_irq_save(flags);
-	__inc_zone_state(zone, item);
+	__inc_zone_page_state(page, item);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(inc_zone_page_state);
--

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[25/36] compound pages: Allow use of get_page_unless_zero with compound pages
This is needed by slab defragmentation. The refcount of a page head may
be incremented to ensure that a compound page will not go away under us.
It also may be needed for defragmentation of higher order pages. The
moving of compound pages may require the establishment of a reference
before the use of page migration functions.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h	2007-08-27 20:59:40.0 -0700
+++ linux-2.6/include/linux/mm.h	2007-08-27 21:03:20.0 -0700
@@ -290,7 +290,7 @@ static inline int put_page_testzero(stru
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-	VM_BUG_ON(PageCompound(page));
+	VM_BUG_ON(PageTail(page));
 	return atomic_inc_not_zero(&page->_count);
 }
--
[28/36] Fix PAGE SIZE assumption in miscellaneous places
Fix PAGE SIZE assumption in miscellaneous places.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/futex.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a124250..c6102e8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -258,7 +258,7 @@ int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared,
 	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
 	if (err >= 0) {
 		key->shared.pgoff =
-			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+			page->index << (compound_order(page) - PAGE_SHIFT);
 		put_page(page);
 		return 0;
 	}
--
1.5.2.4
[00/36] Large Blocksize Support V6
[An update before the Kernel Summit, because of the numerous requests
that I have had for this patchset. Please speak up if you feel that we
need something like this.]

This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.

Support is added in a way that limits the changes to existing code. As a
result filesystems can support I/O using large buffers with minimal
changes.

The page cache functions are mostly unchanged. Instead of a page struct
representing a single page they take a head page struct (which looks the
same as a regular page struct apart from the compound flags) and operate
on those. Most page cache functions can stay as they are. No locking
protocols are added or modified.

The support is also fully transparent at the level of the OS. No
specialized heuristics are added to switch to larger pages. Large page
support is enabled by filesystems or device drivers when a device or
volume is mounted. Larger block sizes are usually set during volume
creation, although the patchset supports setting these sizes per file.
The formatted partition will then always be accessed with the configured
blocksize.

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
  pages with functions that do the same and allow the constants to be
  parameterized.

- Extend the capabilities of compound pages so that they can be put onto
  the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()

Rationales:
---

1. The ability to handle memory of an arbitrarily large size using a
single page struct "handle" is essential for scaling memory handling and
reducing overhead in multiple kernel subsystems. This patchset is a
strategic move that allows performance gains throughout the kernel.

2. Reduce fsck times. Larger block sizes mean faster file system
checking. Using a 64k block size will reduce the number of blocks to be
managed by a factor of 16 and produce much denser and more contiguous
metadata.

3. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due
to a smaller page size (4k vs 16k on IA64). Supporting larger block
sizes on all architectures allows a significant reduction in I/O
overhead and increases the size of I/O that can be performed by hardware
in a single request, since the number of scatter gather entries is
typically limited for one request. This is going to become increasingly
important for supporting the ever growing memory sizes, since we may
have to handle excessively large numbers of 4k requests for data sizes
that may become common soon. For example, to write a 1 terabyte file the
kernel would have to handle 256 million 4k chunks.

4. Cross arch compatibility: It is currently not possible to mount a 16k
blocksize ext2 filesystem created on IA64 on an x86_64 system. With this
patch this becomes possible. Note that this also means that some
filesystems are already capable of working with blocksizes of up to 64k
(ext2, XFS), which is currently only available on a select few arches.
This patchset enables that functionality on all arches. There are no
special modifications needed to the filesystems. The set_blocksize()
function call will simply support a larger blocksize.

5. VM scalability. Large block sizes mean less state keeping for the
information being transferred. For a 1TB file one needs to handle 256
million page structs in the VM if one uses 4k page size. A 64k page size
reduces that amount to 16 million. If the limitations in existing
filesystems are removed then even higher reductions become possible. For
very large files like that, a page size of 2 MB may be beneficial, which
will reduce the number of page structs to handle to 512k.

The variable nature of the block size means that the size can be tuned
at file system creation time for the anticipated needs on a volume.

6. IO scalability. The IO layer will receive large blocks of contiguous
memory with this patchset. This means that fewer scatter gather elements
are needed and the memory used is guaranteed to be contiguous. Instead
of having to handle 4k chunks we can f.e. handle 64k chunks in one go.

7. Limited scatter gather support restricts I/O sizes. A lot of I/O
controllers are limited in the number of scatter gather elements that
they support. For example, a controller that supports 128 entries in its
scatter gather lists can only perform I/O of 128*4k = 512k in one go. If
the blocksize is larger (f.e. 64k) then we can perform larger I/O
transfers. If we support 128 entries then 128*64k = 8M can be
transferred in one transaction. Dave Chinner measured a performance
increase of 50% when going to a 64k blocksize.
[13/36] Use page_cache_xxx in mm/fadvise.c
Use page_cache_xxx in mm/fadvise.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/fadvise.c | 8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 0df4c89..804c2a9 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 	}
 
 	/* First and last PARTIAL page! */
-	start_index = offset >> PAGE_CACHE_SHIFT;
-	end_index = endbyte >> PAGE_CACHE_SHIFT;
+	start_index = page_cache_index(mapping, offset);
+	end_index = page_cache_index(mapping, endbyte);
 
 	/* Careful about overflow on the "+1" */
 	nrpages = end_index - start_index + 1;
@@ -100,8 +100,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		filemap_flush(mapping);
 
 	/* First and last FULL page! */
-	start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
-	end_index = (endbyte >> PAGE_CACHE_SHIFT);
+	start_index = page_cache_next(mapping, offset);
+	end_index = page_cache_index(mapping, endbyte);
 
 	if (end_index >= start_index)
 		invalidate_mapping_pages(mapping, start_index,
--
1.5.2.4
[08/36] Use page_cache_xxx in mm/migrate.c
Use page_cache_xxx in mm/migrate.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/migrate.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 37c73b9..4949927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -195,7 +195,7 @@ static void remove_file_migration_ptes(struct page *old, struct page *new)
 	struct vm_area_struct *vma;
 	struct address_space *mapping = page_mapping(new);
 	struct prio_tree_iter iter;
-	pgoff_t pgoff = new->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = new->index << mapping_order(mapping);
 
 	if (!mapping)
 		return;
--
1.5.2.4
[03/36] Use page_cache_xxx functions in mm/filemap.c
Convert the uses of PAGE_CACHE_xxx to use page_cache_xxx instead.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/filemap.c | 56 
 1 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 19:31:13.0 -0700
@@ -303,8 +303,8 @@ int wait_on_page_writeback_range(struct
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -335,8 +335,8 @@ EXPORT_SYMBOL(sync_page_range);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -365,7 +365,7 @@ int filemap_fdatawait(struct address_spa
 		return 0;
 
 	return wait_on_page_writeback_range(mapping, 0,
-				(i_size - 1) >> PAGE_CACHE_SHIFT);
+				page_cache_index(mapping, i_size - 1));
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
@@ -413,8 +413,8 @@ int filemap_write_and_wait_range(struct
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
-						lstart >> PAGE_CACHE_SHIFT,
-						lend >> PAGE_CACHE_SHIFT);
+						page_cache_index(mapping, lstart),
+						page_cache_index(mapping, lend));
 			if (!err)
 				err = err2;
 		}
@@ -877,12 +877,12 @@ void do_generic_mapping_read(struct addr
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
-	index = *ppos >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(mapping, *ppos);
 	next_index = index;
 	prev_index = ra.prev_index;
 	prev_offset = ra.prev_offset;
-	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	last_index = page_cache_next(mapping, *ppos + desc->count);
+	offset = page_cache_offset(mapping, *ppos);
 
 	for (;;) {
 		struct page *page;
@@ -919,16 +919,16 @@ page_ok:
 		 */
 		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		end_index = page_cache_index(mapping, isize - 1);
 		if (unlikely(!isize || index > end_index)) {
 			page_cache_release(page);
 			goto out;
 		}
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index == end_index) {
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				page_cache_release(page);
 				goto out;
@@ -963,8 +963,8 @@ page_ok:
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 		prev_offset = offset;
 		ra.prev_offset = offset;
 
@@ -1058,7 +1058,7 @@ out:
 	*_ra = ra;
 	_ra->prev_index = prev_index;
 
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (cached_page)
 		page_cache_release(cached_page);
 	if (filp)
@@ -1240,8 +1240,8 @@ asmlinkage ssize_t sys_readahead(int fd,
 	if (file) {
 		if (file->f_mode & FMODE_READ) {
 			struct address_space *mapping = file->f_mapping;
-			unsigned long start = offset >> PAGE_CACHE_SHIFT;
-			unsigned long end =
[04/36] Use page_cache_xxx in mm/page-writeback.c
Use page_cache_xxx in mm/page-writeback.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/page-writeback.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63512a9..ebe76e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -624,8 +624,8 @@ int write_cache_pages(struct address_space *mapping,
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
 	} else {
-		index = wbc->range_start >> PAGE_CACHE_SHIFT;
-		end = wbc->range_end >> PAGE_CACHE_SHIFT;
+		index = page_cache_index(mapping, wbc->range_start);
+		end = page_cache_index(mapping, wbc->range_end);
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		scanned = 1;
@@ -827,7 +827,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				task_io_account_write(PAGE_CACHE_SIZE);
+				task_io_account_write(page_cache_size(mapping));
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
--
1.5.2.4
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit, such inaccuracy (maximum bios in flight thus
> > can be more than limit, difference is equal to the number of CPUs -
> > 1) is a price for removing atomic operation. I thought I pointed it
> > in the original description, but might forget, that if it will be an
> > issue, that atomic operations can be introduced there. Any
> > uber-precise measurements in the case when we are close to the edge
> > will not give us any benefit at all, since we are already in the
> > grey area.
>
> This is not just inaccurate, it is suicide. Keep leaking throttle
> counts and eventually all of them will be gone. No more IO
> on that block device!

First, because the number of increment and decrement operations is the
same, the counter will dance around the limit in both directions.

Second, I wrote about this race, and there are a number of ways to deal
with it, from atomic operations to separate counters for in-flight and
completed bios (which can be racy too, but from a different angle).

Third, if people can not agree even on a much higher-level detail
(whether the bio structure should be allowed to grow or not), how can we
discuss details of the preliminary implementation with known issues?

So I can not agree that the issue is fatal, but of course it exists, and
it was highlighted. Let's solve problems in the order of their
appearance. If the bio structure is allowed to grow, then the whole
patch series can be done better; if not, then there are issues with
performance (although the more I think about it, the more I become sure
that since a bio itself is very rarely shared, and thus requires cloning
and allocation/freeing, which is itself a much more costly operation
than atomic_sub/dec, it can safely host an additional operation).

--
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1				cpu2
> >
> > read q->bio_queued
> > 				q->bio_queued--
> > write q->bio_queued - 1
> >
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since we are already in the
> grey area.

This is not just inaccurate, it is suicide. Keep leaking throttle
counts and eventually all of them will be gone. No more IO on that
block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue. Yes, I think we need a nice simple approach like that,
and prove it is stable before worrying about the size cost.

Regards,

Daniel
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote:
> On 7/31/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
> > TODO list currently includes following main items:
> > * redundancy algorithm (drop me a request of your own, but it is highly
> >   unlikely that Reed-Solomon based will ever be used - it is too slow
> >   for distributed RAID, I consider WEAVER codes)
>
> LDPC codes[1][2] have been replacing Turbo code[3] with regards to
> communication links and we have been seeing that transition. (maybe
> helpful, came to mind seeing the mention of Turbo code) Don't know how
> weaver compares to LDPC, though found some comparisons [4][5] But
> looking at fault tolerance figures, i guess Weaver is much better.
>
> [1] http://www.ldpc-codes.com/
> [2] http://portal.acm.org/citation.cfm?id=1240497
> [3] http://en.wikipedia.org/wiki/Turbo_code
> [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf
> [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf

I've studied and implemented an LDPC encoder/decoder (hard-decoding
belief propagation algorithm only, though) in userspace and found that
such probabilistic codes generally are not suitable for redundant or
distributed data storages, because of their per-bit nature and
probabilistic error recovery.

An interested reader can find an iterative-decoding presentation similar
to Dr. Plank's, some of my analysis of the codes, and all sources at the
project homepage and in my blog:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=ldpc

So I consider Weaver codes the superior choice for distributed storages.

--
	Evgeniy Polyakov
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 02:33:08 +1000
David Chinner <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > > >
> > > > I wonder if XFS can benefit any more from the general writeback
> > > > clustering. How large would be a typical XFS cluster?
> > >
> > > Depends on inode size. Typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible)
> > > to make it more generic.
> >
> > When I talked to hch about this, he said the order file data pages
> > got written in XFS was still dictated by the order the higher
> > layers sent things down.
>
> Sure, that's file data. I was talking about the inode writeback, not
> the data writeback.

I think we're trying to gain different things from inode based
clustering... I'm not worried that the inode be next to the data.

I'm going under the assumption that most of the time, the FS will try to
allocate inodes in groups in a directory, and so most of the time the
data blocks for inode N will be close to inode N+1. So what I'm really
trying for here is data block clustering when writing multiple inodes at
once. This matters most when files are relatively small and written in
groups, which is a common workload.

It may make the most sense to change the patch to supply some key for
the data block clustering instead of the inode number, but it's an easy
first pass.

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> On Wed, 29 Aug 2007 00:55:30 +1000
> David Chinner <[EMAIL PROTECTED]> wrote:
> > On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > > Notes:
> > > > > (1) I'm not sure inode number is correlated to disk location in
> > > > > filesystems other than ext2/3/4. Or parent dir?
> > > >
> > > > They correspond to the exact location on disk on XFS. But XFS has
> > > > its own inode clustering (see xfs_iflush) and it can't be moved
> > > > up into the generic layers because of locking and integration
> > > > into the transaction subsystem.
> > > >
> > > > > (2) It duplicates some function of elevators. Why is it
> > > > > necessary?
> > > >
> > > > The elevators have no clue as to how the filesystem might treat
> > > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > > feature of the inode reading and writing and that is something no
> > > > elevator can hope to achieve.
> > >
> > > Thank you. That explains the linear write curve (perfect!) in
> > > Chris' graph.
> > >
> > > I wonder if XFS can benefit any more from the general writeback
> > > clustering. How large would be a typical XFS cluster?
> >
> > Depends on inode size. Typically they are 8k in size, so anything
> > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > integrated into the transaction subsystem and has some intricate
> > locking, so it's not likely to be easy (or perhaps even possible) to
> > make it more generic.
>
> When I talked to hch about this, he said the order file data pages got
> written in XFS was still dictated by the order the higher layers sent
> things down.

Sure, that's file data. I was talking about the inode writeback, not
the data writeback.

> Shouldn't the clustering still help to have delalloc done
> in inode order instead of in whatever random order pdflush sends things
> down now?

Depends on how things are being allocated. If you've got inode32
allocation and a >1TB filesystem, then data is nowhere near the inodes.
If you've got large allocation groups, then data is typically nowhere
near the inodes, either. If you've got full AGs, data will be nowhere
near the inodes. If you've got large files and lots of data to write,
then clustering multiple files together for writing is not needed.

So in many cases, clustering delalloc writes by inode number doesn't
provide any better I/O patterns than not clustering...

The only difference we may see is that if we flush all the data on
inodes in a single cluster, we can get away with a single inode cluster
write for all of the inodes.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 00:55:30 +1000
David Chinner <[EMAIL PROTECTED]> wrote:
> On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > They correspond to the exact location on disk on XFS. But XFS has
> > > its own inode clustering (see xfs_iflush) and it can't be moved
> > > up into the generic layers because of locking and integration into
> > > the transaction subsystem.
> > >
> > > > (2) It duplicates some function of elevators. Why is it
> > > > necessary?
> > >
> > > The elevators have no clue as to how the filesystem might treat
> > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > feature of the inode reading and writing and that is something no
> > > elevator can hope to achieve.
> >
> > Thank you. That explains the linear write curve (perfect!) in Chris'
> > graph.
> >
> > I wonder if XFS can benefit any more from the general writeback
> > clustering. How large would be a typical XFS cluster?
>
> Depends on inode size. Typically they are 8k in size, so anything
> from 4-32 inodes. The inode writeback clustering is pretty tightly
> integrated into the transaction subsystem and has some intricate
> locking, so it's not likely to be easy (or perhaps even possible) to
> make it more generic.

When I talked to hch about this, he said the order file data pages got
written in XFS was still dictated by the order the higher layers sent
things down. Shouldn't the clustering still help to have delalloc done
in inode order instead of in whatever random order pdflush sends things
down now?

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > Notes:
> > > (1) I'm not sure inode number is correlated to disk location in
> > > filesystems other than ext2/3/4. Or parent dir?
> >
> > They correspond to the exact location on disk on XFS. But XFS has
> > its own inode clustering (see xfs_iflush) and it can't be moved up
> > into the generic layers because of locking and integration into
> > the transaction subsystem.
> >
> > > (2) It duplicates some function of elevators. Why is it necessary?
> >
> > The elevators have no clue as to how the filesystem might treat
> > adjacent inodes. In XFS, inode clustering is a fundamental feature
> > of the inode reading and writing and that is something no elevator
> > can hope to achieve.
>
> Thank you. That explains the linear write curve (perfect!) in Chris'
> graph.
>
> I wonder if XFS can benefit any more from the general writeback
> clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from
4-32 inodes. The inode writeback clustering is pretty tightly
integrated into the transaction subsystem and has some intricate
locking, so it's not likely to be easy (or perhaps even possible) to
make it more generic.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> Say Evgeniy, something I was curious about but forgot to ask you
> earlier...
>
> On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > ...All operations are not atomic, since we do not care about precise
> > number of bios, but a fact, that we are close or close enough to the
> > limit.
> > ... in bio->endio
> > + q->bio_queued--;
>
> In your proposed patch, what prevents the race:
>
> cpu1				cpu2
>
> read q->bio_queued
> 				q->bio_queued--
> write q->bio_queued - 1
>
> Whoops! We leaked a throttle count.

We do not care about one cpu being able to increase its counter higher
than the limit, such inaccuracy (maximum bios in flight thus can be
more than limit, difference is equal to the number of CPUs - 1) is a
price for removing atomic operation. I thought I pointed it in the
original description, but might forget, that if it will be an issue,
that atomic operations can be introduced there. Any uber-precise
measurements in the case when we are close to the edge will not give us
any benefit at all, since we are already in the grey area.

Another possibility is to create a queue/device pointer in the bio
structure to hold the original device and then add a callback to its
backing dev structure to recalculate the limit, but that increases the
size of the bio. Do we need this?

> Regards,
>
> Daniel

--
	Evgeniy Polyakov