Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
Christoph Hellwig wrote:
> On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> > On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
> > >
> > > -	index = *ppos >> PAGE_CACHE_SHIFT;
> > > -	offset = *ppos & ~PAGE_CACHE_MASK;
> > > +	index = page_cache_index(mapping, *ppos);
> > > +	offset = page_cache_offset(mapping, *ppos);
> >
> > Part of me feels inclined to merge this patch now because it makes the
> > code more readable, even if page_cache_index() is implemented as
> > #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
> >
> > I know there is little use in yet another global search'n'replace
> > wankfest and Andrew might wash my mouth just for mentioning it. Still,
> > hard to dislike this part of your patch.
>
> Yes, I suggested that before. Andrew seems to somehow hate this
> patchset, but even if we don't get it in, the lowercase macros are much,
> much better than the current PAGE_CACHE_* confusion.

I don't mind the change either. The open-coded macros are very
recognisable, but it isn't hard to have a typo and get one slightly
wrong. If it goes upstream now it wouldn't have the mapping argument
though, would it? Or the need to replace PAGE_CACHE_SIZE, I guess.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips
> ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No. Please go and read the description of the race again. A count gets
irretrievably lost because the write operation of the first decrement is
overwritten by the second. Data gets lost. Atomic operations exist to
prevent that sort of thing. You either need to use them or have a deep
understanding of SMP read and write ordering in order to preserve data
integrity by some equivalent algorithm.

> Let's solve problems in order of their appearance. If the bio structure
> will be allowed to grow, then the whole patches can be done better.

How about like the patch below. This throttles any block driver by
implementing a throttle metric method so that each block driver can keep
track of its own resource consumption in units of its choosing. As an
(important) example, it implements a simple metric for device mapper
devices. Other block devices will work as before, because they do not
define any metric.
Short, sweet and untested, which is why I have not posted it until now.
This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into struct
queue. We should be able to report on the efficacy of this patch in
terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait,
+			atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.0 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.0 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct requ
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, 28 Aug 2007 15:49:51 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-08-28 at 20:11 +0100, Christoph Hellwig wrote:
> > Sorry for not replying to the previous revisions, but I've been out
> > on vacation.
> >
> > I can't say I like this version. Now we've got callouts at two rather
> > close levels, which is not very nice from the interface POV.
>
> Agreed.
>
> > My preference is for the first scheme, where we simply move
> > interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> > routine and provide a nice helper for the normal filesystem to use.
> >
> > If people are really concerned about adding two lines of code to the
> > handful of setattr operations, there's a variant of this scheme that
> > can avoid it:
> >
> > - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
> >   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> > - disk filesystems stay unchanged and never look at
> >   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
> >   the ATTR_MODE flag in ia_valid in this case, and do the right thing
> >   on the server side.
>
> Hmm... There has to be an implicit promise here that nobody else will
> ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
> time. Currently, that assumption is not there:

That was my concern with this scheme as well...

> > 	if (ia_valid & ATTR_KILL_SGID) {
> > 		attr->ia_valid &= ~ATTR_KILL_SGID;
> > 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> > 			if (!(ia_valid & ATTR_MODE)) {
> > 				ia_valid = attr->ia_valid |= ATTR_MODE;
> > 				attr->ia_mode = inode->i_mode;
> > 			}
> > 			attr->ia_mode &= ~S_ISGID;
> > 		}
> > 	}
>
> Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
> into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Sounds reasonable. I'll also throw in a comment that explains this
reasoning...

--
Jeff Layton <[EMAIL PROTECTED]>
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
> >
> > -	index = *ppos >> PAGE_CACHE_SHIFT;
> > -	offset = *ppos & ~PAGE_CACHE_MASK;
> > +	index = page_cache_index(mapping, *ppos);
> > +	offset = page_cache_offset(mapping, *ppos);
>
> Part of me feels inclined to merge this patch now because it makes the
> code more readable, even if page_cache_index() is implemented as
> #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
>
> I know there is little use in yet another global search'n'replace
> wankfest and Andrew might wash my mouth just for mentioning it. Still,
> hard to dislike this part of your patch.

Yes, I suggested that before. Andrew seems to somehow hate this
patchset, but even if we don't get it in, the lowercase macros are much,
much better than the current PAGE_CACHE_* confusion.
Re: [35/36] Large blocksize support for ext2
On Tue, 28 Aug 2007, Christoph Hellwig wrote:
> On Tue, Aug 28, 2007 at 12:06:26PM -0700, [EMAIL PROTECTED] wrote:
> > Hmmm... Actually there is nothing additional to be done after the
> > earlier cleanup of the macros. So just modify copyright.
>
> So you get a copyright line for some trivial macro cleanups? Please
> drop this patch and rather put your copyright into places where you
> actually did major work.

Ok.
Re: [00/36] Large Blocksize Support V6
On Tue, 28 Aug 2007, Christoph Hellwig wrote:
> one patch per file is the most braindead and most unacceptable way
> to split a series. Please stop whatever you're doing right now and
> correct it and send out a patch that has one patch per logical change
> for the whole tree. This means people can actually read the patch,
> and it's bisectable.

The patches are per logical change, aside from the first patches that
introduce the page cache functions all over the kernel. It would be
unacceptably big and difficult to merge if I put them all together.
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote:
>
> -	index = *ppos >> PAGE_CACHE_SHIFT;
> -	offset = *ppos & ~PAGE_CACHE_MASK;
> +	index = page_cache_index(mapping, *ppos);
> +	offset = page_cache_offset(mapping, *ppos);

Part of me feels inclined to merge this patch now because it makes the
code more readable, even if page_cache_index() is implemented as
#define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)

I know there is little use in yet another global search'n'replace
wankfest and Andrew might wash my mouth just for mentioning it. Still,
hard to dislike this part of your patch.

Jörn

--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, Aug 28, 2007 at 03:49:51PM -0400, Trond Myklebust wrote:
> Hmm... There has to be an implicit promise here that nobody else will
> ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
> time. Currently, that assumption is not there:
>
> > 	if (ia_valid & ATTR_KILL_SGID) {
> > 		attr->ia_valid &= ~ATTR_KILL_SGID;
> > 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> > 			if (!(ia_valid & ATTR_MODE)) {
> > 				ia_valid = attr->ia_valid |= ATTR_MODE;
> > 				attr->ia_mode = inode->i_mode;
> > 			}
> > 			attr->ia_mode &= ~S_ISGID;
> > 		}
> > 	}
>
> Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
> into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Yes, sounds fine to me.
Re: [NFS] [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, 2007-08-28 at 20:11 +0100, Christoph Hellwig wrote:
> Sorry for not replying to the previous revisions, but I've been out
> on vacation.
>
> I can't say I like this version. Now we've got callouts at two rather
> close levels, which is not very nice from the interface POV.

Agreed.

> My preference is for the first scheme, where we simply move
> interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> routine and provide a nice helper for the normal filesystem to use.
>
> If people are really concerned about adding two lines of code to the
> handful of setattr operations, there's a variant of this scheme that
> can avoid it:
>
> - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
>   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> - disk filesystems stay unchanged and never look at
>   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
>   the ATTR_MODE flag in ia_valid in this case, and do the right thing
>   on the server side.

Hmm... There has to be an implicit promise here that nobody else will
ever try to set ATTR_KILL_SUID/ATTR_KILL_SGID and ATTR_MODE at the same
time. Currently, that assumption is not there:

> 	if (ia_valid & ATTR_KILL_SGID) {
> 		attr->ia_valid &= ~ATTR_KILL_SGID;
> 		if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
> 			if (!(ia_valid & ATTR_MODE)) {
> 				ia_valid = attr->ia_valid |= ATTR_MODE;
> 				attr->ia_mode = inode->i_mode;
> 			}
> 			attr->ia_mode &= ~S_ISGID;
> 		}
> 	}

Should we perhaps just convert the above 'if (!(ia_valid & ATTR_MODE))'
into a 'BUG_ON(ia_valid & ATTR_MODE)'?

Trond
Re: [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
On Tue, Aug 28, 2007 at 08:11:14PM +0100, Christoph Hellwig wrote:
>
> Sorry for not replying to the previous revisions, but I've been out
> on vacation.
>
> I can't say I like this version. Now we've got callouts at two rather
> close levels, which is not very nice from the interface POV.
>
> My preference is for the first scheme, where we simply move
> interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr
> routine and provide a nice helper for the normal filesystem to use.
>
> If people are really concerned about adding two lines of code to the
> handful of setattr operations, there's a variant of this scheme that
> can avoid it:

It's not about adding 2 lines of code - it's about adding the
requirement for the fs to call a function.

> - notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
>   but to update ia_mode and the ia_valid flag to include ATTR_MODE.
> - disk filesystems stay unchanged and never look at
>   ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
>   the ATTR_MODE flag in ia_valid in this case, and do the right thing
>   on the server side.

Sounds reasonable.

Josef 'Jeff' Sipek.

--
I abhor a system designed for the "user", if that word is a coded
pejorative meaning "stupid and unsophisticated."
	- Ken Thompson
Re: [35/36] Large blocksize support for ext2
On Tue, Aug 28, 2007 at 12:06:26PM -0700, [EMAIL PROTECTED] wrote:
> Hmmm... Actually there is nothing additional to be done after the
> earlier cleanup of the macros. So just modify copyright.

So you get a copyright line for some trivial macro cleanups? Please
drop this patch and rather put your copyright into places where you
actually did major work.
Re: [00/36] Large Blocksize Support V6
Stop! This patch series is entirely unacceptable!

one patch per file is the most braindead and most unacceptable way to
split a series. Please stop whatever you're doing right now and correct
it and send out a patch that has one patch per logical change for the
whole tree. This means people can actually read the patch, and it's
bisectable.
Re: [PATCH 0/4] add killattr inode operation to allow filesystems to interpret ATTR_KILL_S*ID bits
Sorry for not replying to the previous revisions, but I've been out on
vacation.

I can't say I like this version. Now we've got callouts at two rather
close levels, which is not very nice from the interface POV.

My preference is for the first scheme, where we simply move
interpretation of ATTR_KILL_SUID/ATTR_KILL_SGID into the setattr routine
and provide a nice helper for the normal filesystem to use.

If people are really concerned about adding two lines of code to the
handful of setattr operations, there's a variant of this scheme that can
avoid it:

- notify_change is modified to not clear ATTR_KILL_SUID/ATTR_KILL_SGID
  but to update ia_mode and the ia_valid flag to include ATTR_MODE.
- disk filesystems stay unchanged and never look at
  ATTR_KILL_SUID/ATTR_KILL_SGID, but nfs can check for it, ignore
  the ATTR_MODE flag in ia_valid in this case, and do the right thing
  on the server side.
[01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user
Simplify page cache zeroing of segments of pages through three
functions:

zero_user_segments(page, start1, end1, start2, end2)
	Zeros two segments of the page. It takes the positions where to
	start and end the zeroing, which avoids length calculations.

zero_user_segment(page, start, end)
	Same for a single segment.

zero_user(page, start, length)
	Length variant for the case where we know the length.

We remove the zero_user_page macro. Issues:

1. It's a macro. Inline functions are preferable.

2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this
   special case everywhere makes the code needlessly complex. The
   parameter for zeroing is always KM_USER0 except in one single case
   that we open code. Avoiding KM_USER0 means a lot of code no longer
   has to deal with the special casing for HIGHMEM. Dealing with kmap
   is only necessary for HIGHMEM configurations; in those configurations
   we use KM_USER0 like we do for a series of other functions defined in
   highmem.h.

   Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
   function could not be a macro. The zero_user_* functions introduced
   here can be inline because that constant is not used when these
   functions are called.

Also extract the flushing of the caches to be outside of the kmap.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 drivers/block/loop.c       |    2 +-
 fs/buffer.c                |   48 +-
 fs/cifs/inode.c            |    2 +-
 fs/direct-io.c             |    4 +-
 fs/ecryptfs/mmap.c         |    7 ++---
 fs/ext3/inode.c            |    4 +-
 fs/ext4/inode.c            |    4 +-
 fs/gfs2/bmap.c             |    2 +-
 fs/gfs2/ops_address.c      |    2 +-
 fs/libfs.c                 |   11 +++--
 fs/mpage.c                 |    7 +
 fs/nfs/read.c              |   10
 fs/nfs/write.c             |    2 +-
 fs/ntfs/aops.c             |   18 +---
 fs/ntfs/file.c             |   32 +---
 fs/ocfs2/aops.c            |    6 ++--
 fs/reiserfs/inode.c        |    4 +-
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 include/linux/highmem.h    |   49 +++
 mm/filemap_xip.c           |    2 +-
 mm/truncate.c              |    2 +-
 21 files changed, 104 insertions(+), 116 deletions(-)

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/drivers/block/loop.c	2007-08-27 19:22:17.0 -0700
@@ -251,7 +251,7 @@ static int do_lo_send_aops(struct loop_d
 			 */
 			printk(KERN_ERR "loop: transfer error block %llu\n",
 			       (unsigned long long)index);
-			zero_user_page(page, offset, size, KM_USER0);
+			zero_user(page, offset, size);
 		}
 		flush_dcache_page(page);
 		ret = aops->commit_write(file, page, offset,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/fs/buffer.c	2007-08-27 19:22:17.0 -0700
@@ -1803,19 +1803,10 @@ static int __block_prepare_write(struct
 				set_buffer_uptodate(bh);
 				continue;
 			}
-			if (block_end > to || block_start < from) {
-				void *kaddr;
-
-				kaddr = kmap_atomic(page, KM_USER0);
-				if (block_end > to)
-					memset(kaddr+to, 0,
-						block_end-to);
-				if (block_start < from)
-					memset(kaddr+block_start,
-						0, from-block_start);
-				flush_dcache_page(page);
-				kunmap_atomic(kaddr, KM_USER0);
-			}
+			if (block_end > to || block_start < from)
+				zero_user_segments(page,
+					to, block_end,
+					block_start, from);
 			continue;
 		}
 	}
@@ -1863,7 +1854,7 @@ static int __block_prepare_write(struct
 			break;
 		if (buffer_new(bh)) {
 			clear_buffer_new(bh);
-			zero_user_page(page, block_start, bh->b_size, KM_USER0);
+			zero_user(page,
[06/36] Use page_cache_xxx in mm/rmap.c
Use page_cache_xxx in mm/rmap.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 41ac397..d6a1771 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -188,9 +188,14 @@ static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 static inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff;
 	unsigned long address;
 
+	if (PageAnon(page))
+		pgoff = page->index;
+	else
+		pgoff = page->index << mapping_order(page->mapping);
+
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		/* page should be within any vma from prio_tree_next */
@@ -335,7 +340,7 @@ static int page_referenced_file(struct page *page)
 {
 	unsigned int mapcount;
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int referenced = 0;
@@ -447,7 +452,7 @@ out:
 static int page_mkclean_file(struct address_space *mapping, struct page *page)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = 0;
@@ -863,7 +868,7 @@ static int try_to_unmap_anon(struct page *page, int migration)
 static int try_to_unmap_file(struct page *page, int migration)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = SWAP_AGAIN;
-- 
1.5.2.4
[05/36] Use page_cache_xxx in mm/truncate.c
Use page_cache_xxx in mm/truncate.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/truncate.c |   35 ++++++++++++++++-------------------
 1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index bf8068d..8c3d32e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -45,9 +45,10 @@ void do_invalidatepage(struct page *page, unsigned long offset)
 		(*invalidatepage)(page, offset);
 }
 
-static inline void truncate_partial_page(struct page *page, unsigned partial)
+static inline void truncate_partial_page(struct address_space *mapping,
+		struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+	zero_user_segment(page, partial, page_cache_size(mapping));
 	if (PagePrivate(page))
 		do_invalidatepage(page, partial);
 }
@@ -95,7 +96,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+	cancel_dirty_page(page, page_cache_size(mapping));
 
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
@@ -157,9 +158,9 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 void truncate_inode_pages_range(struct address_space *mapping,
 				loff_t lstart, loff_t lend)
 {
-	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+	const pgoff_t start = page_cache_next(mapping, lstart);
 	pgoff_t end;
-	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+	const unsigned partial = page_cache_offset(mapping, lstart);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i;
@@ -167,8 +168,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
-	end = (lend >> PAGE_CACHE_SHIFT);
+	BUG_ON(page_cache_offset(mapping, lend) !=
+		page_cache_size(mapping) - 1);
+	end = page_cache_index(mapping, lend);
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -194,8 +196,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 			if (page_mapped(page)) {
 				unmap_mapping_range(mapping,
-				  (loff_t)page->index << PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
+				  page_cache_pos(mapping, page->index, 0),
+				  page_cache_size(mapping), 0);
 			}
 			if (page->index > next)
 				next = page->index;
@@ -421,9 +423,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				 * Zap the rest of the file in one hit.
 				 */
 				unmap_mapping_range(mapping,
-				   (loff_t)page->index << PAGE_CACHE_SHIFT,
-				   PAGE_CACHE_SIZE, 0);
+				   page_cache_pos(mapping, page->index, 0),
+				   page_cache_size(mapping), 0);
[10/36] Use page_cache_xxx in fs/sync
Use page_cache_xxx in fs/sync.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/sync.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..f30d7eb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -260,8 +260,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 	ret = 0;
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 		if (ret < 0)
 			goto out;
 	}
@@ -275,8 +275,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 
 	if (flags & SYNC_FILE_RANGE_WAIT_AFTER) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 	}
 out:
 	return ret;
-- 
1.5.2.4
[12/36] Use page_cache_xxx in mm/mpage.c
Use page_cache_xxx in mm/mpage.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/mpage.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index a5e1385..2843ed7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev,
 static void
 map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	struct buffer_head *page_bh, *head;
 	int block = 0;
 
@@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 	 * don't make any buffers if there is only one buffer on
 	 * the page and the page just needs to be set up to date
 	 */
-	if (inode->i_blkbits == PAGE_CACHE_SHIFT &&
+	if (inode->i_blkbits == page_cache_shift(mapping) &&
 	    buffer_uptodate(bh)) {
-		SetPageUptodate(page);    
+		SetPageUptodate(page);
 		return;
 	}
 	create_empty_buffers(page, 1 << inode->i_blkbits, 0);
@@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
 		unsigned long *first_logical_block, get_block_t get_block)
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	sector_t block_in_file;
 	sector_t last_block;
@@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	if (page_has_buffers(page))
 		goto confused;
 
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
 	last_block = block_in_file + nr_pages * blocks_per_page;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
@@ -284,7 +286,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	}
 
 	if (first_hole != blocks_per_page) {
-		zero_user_segment(page, first_hole << blkbits, PAGE_CACHE_SIZE);
+		zero_user_segment(page, first_hole << blkbits,
+					page_cache_size(mapping));
 		if (first_hole == 0) {
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -468,7 +471,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	unsigned long end_index;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	sector_t last_block;
 	sector_t block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
@@ -537,7 +540,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	 * The page has no buffers: map it to disk
 	 */
 	BUG_ON(!PageUptodate(page));
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index <<
+			(page_cache_shift(mapping) - blkbits);
 	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_page = page;
 	for (page_block = 0; page_block < blocks_per_page; ) {
@@ -569,7 +573,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	first_unmapped = page_block;
 
 page_is_mapped:
-	end_index = i_size >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, i_size);
 	if (page->index >= end_index) {
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -579,11 +583,11 @@ page_is_mapped:
 		 * is zeroed when mapped, and writes to that region are not
 		 * written out to the file."
 		 */
-		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
+		unsigned offset = page_cache_offset(mapping, i_size);
 
 		if (page->index > end_index || !offset)
 			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		zero_user_segment(page, offset, page_cache_size(mappin
[07/36] Use page_cache_xxx in mm/filemap_xip.c
Use page_cache_xxx in mm/filemap_xip.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/filemap_xip.c |   28 ++++++++++++++--------------
 1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index ba6892d..5237e53 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -61,24 +61,24 @@ do_xip_mapping_read(struct address_space *mapping,
 
 	BUG_ON(!mapping->a_ops->get_xip_page);
 
-	index = *ppos >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	index = page_cache_index(mapping, *ppos);
+	offset = page_cache_offset(mapping, *ppos);
 
 	isize = i_size_read(inode);
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, isize - 1);
 	for (;;) {
 		struct page *page;
 		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -117,8 +117,8 @@ do_xip_mapping_read(struct address_space *mapping,
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 
 		if (ret == nr && desc->count)
 			continue;
@@ -131,7 +131,7 @@ no_xip_page:
 	}
 
 out:
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (filp)
 		file_accessed(filp);
 }
@@ -220,7 +220,7 @@ static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
 
 	/* XXX: are VM_FAULT_ codes OK? */
 
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = page_cache_next(mapping, i_size_read(inode));
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
 
@@ -289,9 +289,9 @@ __xip_file_write(struct file *filp, const char __user *buf,
 		unsigned long offset;
 		size_t copied;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = page_cache_offset(mapping, pos); /* Within page */
+		index = page_cache_index(mapping, pos);
+		bytes = page_cache_size(mapping) - offset;
 		if (bytes > count)
 			bytes = count;
 
@@ -405,8 +405,8 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	pgoff_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize;
 	unsigned length;
 	struct page *page;
-- 
1.5.2.4
[09/36] Use page_cache_xxx in fs/libfs.c
Use page_cache_xxx in fs/libfs.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/libfs.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index 53b3dc5..e90f894 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -16,7 +16,8 @@ int simple_getattr(struct vfsmount *mnt, struct dentry *dentry, { struct inode *inode = dentry->d_inode; generic_fillattr(inode, stat); - stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9); + stat->blocks = inode->i_mapping->nrpages << + (page_cache_shift(inode->i_mapping) - 9); return 0; } @@ -340,10 +341,10 @@ int simple_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) + if (to - from != page_cache_size(file->f_mapping)) zero_user_segments(page, 0, from, - to, PAGE_CACHE_SIZE); + to, page_cache_size(file->f_mapping)); } return 0; } @@ -351,8 +352,9 @@ int simple_prepare_write(struct file *file, struct page *page, int simple_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); if (!PageUptodate(page)) SetPageUptodate(page); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[20/36] Use page_cache_xxx in drivers/block/rd.c
Use page_cache_xxx in drivers/block/rd.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- drivers/block/rd.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/block/rd.c b/drivers/block/rd.c index 65150b5..e148b3b 100644 --- a/drivers/block/rd.c +++ b/drivers/block/rd.c @@ -121,7 +121,7 @@ static void make_page_uptodate(struct page *page) } } while ((bh = bh->b_this_page) != head); } else { - memset(page_address(page), 0, PAGE_CACHE_SIZE); + memset(page_address(page), 0, page_cache_size(page_mapping(page))); } flush_dcache_page(page); SetPageUptodate(page); @@ -201,9 +201,9 @@ static const struct address_space_operations ramdisk_aops = { static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, struct address_space *mapping) { - pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9); + pgoff_t index = sector >> (page_cache_shift(mapping) - 9); unsigned int vec_offset = vec->bv_offset; - int offset = (sector << 9) & ~PAGE_CACHE_MASK; + int offset = page_cache_offset(mapping, (sector << 9)); int size = vec->bv_len; int err = 0; @@ -213,7 +213,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, char *src; char *dst; - count = PAGE_CACHE_SIZE - offset; + count = page_cache_size(mapping) - offset; if (count > size) count = size; size -= count; -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[15/36] Use page_cache_xxx functions in fs/ext2
Use page_cache_xxx functions in fs/ext2 Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/ext2/dir.c | 40 +++- 1 files changed, 23 insertions(+), 17 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 2bf49d7..d72926f 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -43,7 +43,8 @@ static inline void ext2_put_page(struct page *page) static inline unsigned long dir_pages(struct inode *inode) { - return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT; + return (inode->i_size+page_cache_size(inode->i_mapping)-1)>> + page_cache_shift(inode->i_mapping); } /* @@ -54,10 +55,11 @@ static unsigned ext2_last_byte(struct inode *inode, unsigned long page_nr) { unsigned last_byte = inode->i_size; + struct address_space *mapping = inode->i_mapping; - last_byte -= page_nr << PAGE_CACHE_SHIFT; - if (last_byte > PAGE_CACHE_SIZE) - last_byte = PAGE_CACHE_SIZE; + last_byte -= page_nr << page_cache_shift(mapping); + if (last_byte > page_cache_size(mapping)) + last_byte = page_cache_size(mapping); return last_byte; } @@ -76,18 +78,19 @@ static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to) static void ext2_check_page(struct page *page) { - struct inode *dir = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *dir = mapping->host; struct super_block *sb = dir->i_sb; unsigned chunk_size = ext2_chunk_size(dir); char *kaddr = page_address(page); u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count); unsigned offs, rec_len; - unsigned limit = PAGE_CACHE_SIZE; + unsigned limit = page_cache_size(mapping); ext2_dirent *p; char *error; - if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) { - limit = dir->i_size & ~PAGE_CACHE_MASK; + if (page_cache_index(mapping, dir->i_size) == page->index) { + limit = page_cache_offset(mapping, dir->i_size); if (limit & (chunk_size - 1)) goto Ebadsize; if (!limit) @@ -139,7 +142,7 @@ Einumber: bad_entry: ext2_error (sb, "ext2_check_page", "bad entry in directory 
#%lu: %s - " "offset=%lu, inode=%lu, rec_len=%d, name_len=%d", - dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs, + dir->i_ino, error, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode), rec_len, p->name_len); goto fail; @@ -148,7 +151,7 @@ Eend: ext2_error (sb, "ext2_check_page", "entry in directory #%lu spans the page boundary" "offset=%lu, inode=%lu", - dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs, + dir->i_ino, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode)); fail: SetPageChecked(page); @@ -246,8 +249,9 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) loff_t pos = filp->f_pos; struct inode *inode = filp->f_path.dentry->d_inode; struct super_block *sb = inode->i_sb; - unsigned int offset = pos & ~PAGE_CACHE_MASK; - unsigned long n = pos >> PAGE_CACHE_SHIFT; + struct address_space *mapping = inode->i_mapping; + unsigned int offset = page_cache_offset(mapping, pos); + unsigned long n = page_cache_index(mapping, pos); unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); unsigned char *types = NULL; @@ -268,14 +272,14 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) ext2_error(sb, __FUNCTION__, "bad page in #%lu", inode->i_ino); - filp->f_pos += PAGE_CACHE_SIZE - offset; + filp->f_pos += page_cache_size(mapping) - offset; return -EIO; } kaddr = page_address(page); if (unlikely(need_revalidate)) { if (offset) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); - filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset; + filp->f_pos = page_cache_pos(mapping, n, offset); } filp->f_version = inode->i_version; need_revalidate = 0; @@ -298,7 +302,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) offset = (char *)de - kaddr; over = filldir(dirent, de->name, de->name_len, - (n
[22/36] compound pages: Add new support functions
compound_pages(page)-> Determines base pages of a compound page compound_shift(page)-> Determine the page shift of a compound page compound_size(page) -> Determine the size of a compound page Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mm.h | 15 +++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3e9e8fe..fa4cbab 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -362,6 +362,21 @@ static inline void set_compound_order(struct page *page, unsigned long order) page[1].lru.prev = (void *)order; } +static inline int compound_pages(struct page *page) +{ + return 1 << compound_order(page); +} + +static inline int compound_shift(struct page *page) +{ + return PAGE_SHIFT + compound_order(page); +} + +static inline int compound_size(struct page *page) +{ + return PAGE_SIZE << compound_order(page); +} + /* * Multiple processes may "see" the same page. E.g. for untouched * mappings of /dev/null, all processes see the same page full of -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[19/36] Use page_cache_xxx for fs/xfs
Use page_cache_xxx for fs/xfs Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_aops.c | 55 ++ fs/xfs/linux-2.6/xfs_lrw.c |6 ++-- 2 files changed, 32 insertions(+), 29 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index fd4105d..e48817a 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -74,7 +74,7 @@ xfs_page_trace( xfs_inode_t *ip; bhv_vnode_t *vp = vn_from_inode(inode); loff_t isize = i_size_read(inode); - loff_t offset = page_offset(page); + loff_t offset = page_cache_pos(page->mapping, page->index, 0); int delalloc = -1, unmapped = -1, unwritten = -1; if (page_has_buffers(page)) @@ -610,7 +610,7 @@ xfs_probe_page( break; } while ((bh = bh->b_this_page) != head); } else - ret = mapped ? 0 : PAGE_CACHE_SIZE; + ret = mapped ? 0 : page_cache_size(page->mapping); } return ret; @@ -637,7 +637,7 @@ xfs_probe_cluster( } while ((bh = bh->b_this_page) != head); /* if we reached the end of the page, sum forwards in following pages */ - tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT; + tlast = page_cache_index(inode->i_mapping, i_size_read(inode)); tindex = startpage->index + 1; /* Prune this back to avoid pathological behavior */ @@ -655,14 +655,14 @@ xfs_probe_cluster( size_t pg_offset, len = 0; if (tindex == tlast) { - pg_offset = - i_size_read(inode) & (PAGE_CACHE_SIZE - 1); + pg_offset = page_cache_offset(inode->i_mapping, + i_size_read(inode)); if (!pg_offset) { done = 1; break; } } else - pg_offset = PAGE_CACHE_SIZE; + pg_offset = page_cache_size(inode->i_mapping); if (page->index == tindex && !TestSetPageLocked(page)) { len = xfs_probe_page(page, pg_offset, mapped); @@ -744,7 +744,8 @@ xfs_convert_page( int bbits = inode->i_blkbits; int len, page_dirty; int count = 0, done = 0, uptodate = 1; - xfs_off_t offset = page_offset(page); + struct address_space*map = inode->i_mapping; + xfs_off_t offset = page_cache_pos(map, page->index, 0); if (page->index != tindex) goto fail; @@ -752,7 
+753,7 @@ xfs_convert_page( goto fail; if (PageWriteback(page)) goto fail_unlock_page; - if (page->mapping != inode->i_mapping) + if (page->mapping != map) goto fail_unlock_page; if (!xfs_is_delayed_page(page, (*ioendp)->io_type)) goto fail_unlock_page; @@ -764,20 +765,20 @@ xfs_convert_page( * Derivation: * * End offset is the highest offset that this page should represent. -* If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1)) -* will evaluate non-zero and be less than PAGE_CACHE_SIZE and +* If we are on the last page, (end_offset & page_cache_mask()) +* will evaluate non-zero and be less than page_cache_size() and * hence give us the correct page_dirty count. On any other page, * it will be zero and in that case we need page_dirty to be the * count of buffers on the page. */ end_offset = min_t(unsigned long long, - (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, + (xfs_off_t)(page->index + 1) << page_cache_shift(map), i_size_read(inode)); len = 1 << inode->i_blkbits; - p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1), - PAGE_CACHE_SIZE); - p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE; + p_offset = min_t(unsigned long, page_cache_offset(map, end_offset), + page_cache_size(map)); + p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map); page_dirty = p_offset / len; bh = head = page_buffers(page); @@ -933,6 +934,8 @@ xfs_page_state_convert( int page_dirty, count = 0; int trylock = 0; int all_bh = unmapped; + struct address_space*map = inode->i_
[26/36] compound pages: Allow freeing of compound pages via pagevec
Allow the freeing of compound pages via pagevec. In release_pages() we currently special-case compound pages in order to be sure to always decrement the page count of the head page and not the tail page. However, that redirection to the head page is only necessary for tail pages. So we can test PageTail instead of PageCompound there, avoiding the redirection to the first page. Tail page handling is not changed. The head page of a compound page now represents a single large page. We do the usual processing, including checking if it is on the LRU and removing it (not useful right now, but later, when compound pages are on the LRU, this will work). Then we add the compound page to the pagevec. Only head pages will end up on the pagevec, not tail pages. In __pagevec_free() we then check if we are freeing a head page and if so call the destructor for the compound page. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/page_alloc.c | 13 +++-- mm/swap.c |8 +++- 2 files changed, 18 insertions(+), 3 deletions(-) Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c 2007-08-27 20:59:38.0 -0700 +++ linux-2.6/mm/page_alloc.c 2007-08-27 21:05:34.0 -0700 @@ -1441,8 +1441,17 @@ void __pagevec_free(struct pagevec *pvec { int i = pagevec_count(pvec); - while (--i >= 0) - free_hot_cold_page(pvec->pages[i], pvec->cold); + while (--i >= 0) { + struct page *page = pvec->pages[i]; + + if (PageHead(page)) { + compound_page_dtor *dtor; + + dtor = get_compound_page_dtor(page); + (*dtor)(page); + } else + free_hot_cold_page(page, pvec->cold); + } } fastcall void __free_pages(struct page *page, unsigned int order) Index: linux-2.6/mm/swap.c === --- linux-2.6.orig/mm/swap.c2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/swap.c 2007-08-27 21:05:34.0 -0700 @@ -263,7 +263,13 @@ void release_pages(struct page **pages, for (i = 0; i < nr; i++) { struct page *page = pages[i]; - if (unlikely(PageCompound(page))) { + /* +* If we have a tail page on the LRU then we 
need to +* decrement the page count of the head page. There +* is no further need to do anything since tail pages +* cannot be on the LRU. +*/ + if (unlikely(PageTail(page))) { if (zone) { spin_unlock_irq(&zone->lru_lock); zone = NULL; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[31/36] Large Blocksize: Core piece
Provide an alternate definition for the page_cache_xxx(mapping, ...) functions that can determine the current page size from the mapping and generate the appropriate shifts, sizes and masks for the page cache operations. Change the basic functions that allocate pages for the page cache to be able to handle higher order allocations. Provide a new function mapping_setup(struct address_space *, gfp_t mask, int order) that allows the setup of a mapping of any compound page order. mapping_set_gfp_mask() is still provided but it sets mappings to order 0. Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in order for the filesystem to be able to use larger pages. For some key block devices and filesystems the conversion is done here. mapping_setup() for higher order is only allowed if the mapping does not use DMA mappings or HIGHMEM, since we do not support bouncing at the moment. Thus we BUG() on DMA mappings and clear the highmem bit of higher order mappings. Modify the set_blocksize() function so that an arbitrary blocksize can be set. Blocksizes up to order MAX_ORDER - 1 can be set. This is typically 8MB on many platforms (order 11). Typically file systems are limited not only by the core VM but also by the structure of their internal data structures. The core VM limitations fall away with this patch. The functionality provided here can do nothing about the internal limitations of filesystems. 
Known internal limitations: Ext2 64k, XFS 64k, Reiserfs 8k, Ext3 4k (rumor has it that changing a constant can remove the limit), Ext4 4k. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- block/Kconfig | 17 ++ drivers/block/rd.c |6 ++- fs/block_dev.c | 29 +++--- fs/buffer.c |4 +- fs/inode.c |7 ++- fs/xfs/linux-2.6/xfs_buf.c |3 +- include/linux/buffer_head.h | 12 - include/linux/fs.h |5 ++ include/linux/pagemap.h | 121 -- mm/filemap.c| 17 -- 10 files changed, 192 insertions(+), 29 deletions(-) Index: linux-2.6/block/Kconfig === --- linux-2.6.orig/block/Kconfig2007-08-27 19:22:13.0 -0700 +++ linux-2.6/block/Kconfig 2007-08-27 21:16:38.0 -0700 @@ -62,6 +62,20 @@ config BLK_DEV_BSG protocols (e.g. Task Management Functions and SMP in Serial Attached SCSI). +# +# The functions to switch on larger pages in a filesystem will return an error +# if the gfp flags for a mapping require only DMA pages. Highmem will always +# be switched off for higher order mappings. +# +config LARGE_BLOCKSIZE + bool "Support blocksizes larger than page size" + default n + depends on EXPERIMENTAL + help + Allows the page cache to support higher orders of pages. Higher + order page cache pages may be useful to increase I/O performance + and support special devices like CD or DVDs and Flash. 
+ endif # BLOCK source block/Kconfig.iosched Index: linux-2.6/drivers/block/rd.c === --- linux-2.6.orig/drivers/block/rd.c 2007-08-27 20:59:27.0 -0700 +++ linux-2.6/drivers/block/rd.c2007-08-27 21:10:38.0 -0700 @@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa } } while ((bh = bh->b_this_page) != head); } else { - memset(page_address(page), 0, page_cache_size(page_mapping(page))); + memset(page_address(page), 0, + page_cache_size(page_mapping(page))); } flush_dcache_page(page); SetPageUptodate(page); @@ -380,7 +381,8 @@ static int rd_open(struct inode *inode, gfp_mask = mapping_gfp_mask(mapping); gfp_mask &= ~(__GFP_FS|__GFP_IO); gfp_mask |= __GFP_HIGH; - mapping_set_gfp_mask(mapping, gfp_mask); + mapping_setup(mapping, gfp_mask, + page_cache_blkbits_to_order(inode->i_blkbits)); } return 0; Index: linux-2.6/fs/block_dev.c === --- linux-2.6.orig/fs/block_dev.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/fs/block_dev.c2007-08-27 21:10:38.0 -0700 @@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic return; invalidate_bh_lrus(); truncate_inode_pages(bdev->bd_inode->i_mapping, 0); -} +} int set_blocksize(struct block_device *bdev, int size) { - /* Size must be a power of two, and between 512 and PAGE_SIZE */ - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) + int order; + + if (size > (PAGE_SIZE << (MAX_ORDER - 1)) || +
[34/36] Large blocksize support in XFS
The only change needed to enable Large Block I/O in XFS is to remove the check for a too large blocksize ;-) Signed-off-by: Dave Chinner <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/xfs/xfs_mount.c | 13 - 1 files changed, 0 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index a66b398..47ddc89 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -326,19 +326,6 @@ xfs_mount_validate_sb( return XFS_ERROR(ENOSYS); } - /* -* Until this is fixed only page-sized or smaller data blocks work. -*/ - if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) { - xfs_fs_mount_cmn_err(flags, - "file system with blocksize %d bytes", - sbp->sb_blocksize); - xfs_fs_mount_cmn_err(flags, - "only pagesize (%ld) or less will currently work.", - PAGE_SIZE); - return XFS_ERROR(ENOSYS); - } - return 0; } -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[17/36] Use page_cache_xxx in fs/ext4
Use page_cache_xxx in fs/ext4 Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/ext4/dir.c |3 ++- fs/ext4/inode.c | 31 --- 2 files changed, 18 insertions(+), 16 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index 3ab01c0..9d6cd51 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -136,7 +136,8 @@ static int ext4_readdir(struct file * filp, err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0); if (err > 0) { pgoff_t index = map_bh.b_blocknr >> - (PAGE_CACHE_SHIFT - inode->i_blkbits); + (page_cache_shift(inode->i_mapping) + - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3fe1e40..0be5bf8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1223,7 +1223,7 @@ static int ext4_ordered_commit_write(struct file *file, struct page *page, */ loff_t new_i_size; - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + new_i_size = page_cache_pos(page->mapping, page->index, to); if (new_i_size > EXT4_I(inode)->i_disksize) EXT4_I(inode)->i_disksize = new_i_size; ret = generic_commit_write(file, page, from, to); @@ -1242,7 +1242,7 @@ static int ext4_writeback_commit_write(struct file *file, struct page *page, int ret = 0, ret2; loff_t new_i_size; - new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + new_i_size = page_cache_pos(page->mapping, page->index, to); if (new_i_size > EXT4_I(inode)->i_disksize) EXT4_I(inode)->i_disksize = new_i_size; @@ -1269,7 +1269,7 @@ static int ext4_journalled_commit_write(struct file *file, /* * Here we duplicate the generic_commit_write() functionality */ - pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + pos = page_cache_pos(page->mapping, page->index, to); ret = walk_page_buffers(handle, page_buffers(page), from, to, &partial, commit_write_fn); @@ -1421,6 +1421,7 @@ static int ext4_ordered_writepage(struct page *page, handle_t *handle = NULL; int ret = 0; int err; + int 
pagesize = page_cache_size(inode->i_mapping); J_ASSERT(PageLocked(page)); @@ -1443,8 +1444,7 @@ static int ext4_ordered_writepage(struct page *page, (1 << BH_Dirty)|(1 << BH_Uptodate)); } page_bufs = page_buffers(page); - walk_page_buffers(handle, page_bufs, 0, - PAGE_CACHE_SIZE, NULL, bget_one); + walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one); ret = block_write_full_page(page, ext4_get_block, wbc); @@ -1461,13 +1461,12 @@ static int ext4_ordered_writepage(struct page *page, * and generally junk. */ if (ret == 0) { - err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, + err = walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, jbd2_journal_dirty_data_fn); if (!ret) ret = err; } - walk_page_buffers(handle, page_bufs, 0, - PAGE_CACHE_SIZE, NULL, bput_one); + walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one); err = ext4_journal_stop(handle); if (!ret) ret = err; @@ -1519,6 +1518,7 @@ static int ext4_journalled_writepage(struct page *page, handle_t *handle = NULL; int ret = 0; int err; + int pagesize = page_cache_size(inode->i_mapping); if (ext4_journal_current_handle()) goto no_write; @@ -1535,17 +1535,17 @@ static int ext4_journalled_writepage(struct page *page, * doesn't seem much point in redirtying the page here. */ ClearPageChecked(page); - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, + ret = block_prepare_write(page, 0, pagesize, ext4_get_block); if (ret != 0) { ext4_journal_stop(handle); goto out_unlock; } ret = walk_page_buffers(handle, page_buffers(page), 0, - PAGE_CACHE_SIZE, NULL, do_journal_get_write_access); + pagesize, NULL, do_journal_get_write_access); err = walk_page_buffers(handle, page_buffers(page), 0, - PAGE_CACHE_SIZE, NULL, commit_
[36/36] Reiserfs: Fix up for mapping_set_gfp_mask
mapping_set_gfp_mask only works on order 0 page cache operations. Reiserfs can use 8k pages (order 1). Replace the mapping_set_gfp_mask with mapping_setup to make this work properly. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/reiserfs/xattr.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c index c86f570..5ca01f3 100644 --- a/fs/reiserfs/xattr.c +++ b/fs/reiserfs/xattr.c @@ -405,9 +405,10 @@ static struct page *reiserfs_get_page(struct inode *dir, unsigned long n) { struct address_space *mapping = dir->i_mapping; struct page *page; + /* We can deadlock if we try to free dentries, and an unlink/rmdir has just occurred - GFP_NOFS avoids this */ - mapping_set_gfp_mask(mapping, GFP_NOFS); + mapping_setup(mapping, GFP_NOFS, page_cache_shift(mapping)); page = read_mapping_page(mapping, n, NULL); if (!IS_ERR(page)) { kmap(page); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[27/36] Compound page zeroing and flushing
We may now have to zero and flush higher order pages. Implement clear_mapping_page and flush_mapping_page to do that job. Replace the flushing and clearing at some key locations for the pagecache. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/libfs.c |4 ++-- include/linux/highmem.h | 31 +-- mm/filemap.c|4 ++-- mm/filemap_xip.c|4 ++-- 4 files changed, 35 insertions(+), 8 deletions(-) Index: linux-2.6/fs/libfs.c === --- linux-2.6.orig/fs/libfs.c 2007-08-27 20:51:55.0 -0700 +++ linux-2.6/fs/libfs.c2007-08-27 21:08:04.0 -0700 @@ -330,8 +330,8 @@ int simple_rename(struct inode *old_dir, int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); - flush_dcache_page(page); + clear_mapping_page(page); + flush_mapping_page(page); SetPageUptodate(page); unlock_page(page); return 0; Index: linux-2.6/include/linux/highmem.h === --- linux-2.6.orig/include/linux/highmem.h 2007-08-27 19:22:17.0 -0700 +++ linux-2.6/include/linux/highmem.h 2007-08-27 21:08:04.0 -0700 @@ -124,14 +124,41 @@ static inline void clear_highpage(struct kunmap_atomic(kaddr, KM_USER0); } +/* + * Clear a higher order page + */ +static inline void clear_mapping_page(struct page *page) +{ + int nr_pages = compound_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + clear_highpage(page + i); +} + +/* + * Primitive support for flushing higher order pages. 
+ * + * A bit stupid: On many platforms flushing the first page + * will flush any TLB starting there + */ +static inline void flush_mapping_page(struct page *page) +{ + int nr_pages = compound_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + flush_dcache_page(page + i); +} + static inline void zero_user_segments(struct page *page, unsigned start1, unsigned end1, unsigned start2, unsigned end2) { void *kaddr = kmap_atomic(page, KM_USER0); - BUG_ON(end1 > PAGE_SIZE || - end2 > PAGE_SIZE); + BUG_ON(end1 > compound_size(page) || + end2 > compound_size(page)); if (end1 > start1) memset(kaddr + start1, 0, end1 - start1); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 19:31:13.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:08:04.0 -0700 @@ -941,7 +941,7 @@ page_ok: * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * When a sequential read accesses a page several times, @@ -1932,7 +1932,7 @@ generic_file_buffered_write(struct kiocb else copied = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes); - flush_dcache_page(page); + flush_mapping_page(page); status = a_ops->commit_write(file, page, offset, offset+bytes); if (status == AOP_TRUNCATED_PAGE) { page_cache_release(page); Index: linux-2.6/mm/filemap_xip.c === --- linux-2.6.orig/mm/filemap_xip.c 2007-08-27 20:51:40.0 -0700 +++ linux-2.6/mm/filemap_xip.c 2007-08-27 21:08:04.0 -0700 @@ -104,7 +104,7 @@ do_xip_mapping_read(struct address_space * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * Ok, we have the page, so now we can copy it to user space... 
@@ -320,7 +320,7 @@ __xip_file_write(struct file *filp, cons } copied = filemap_copy_from_user(page, offset, buf, bytes); - flush_dcache_page(page); + flush_mapping_page(page); if (likely(copied > 0)) { status = copied; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[29/36] Fix up reclaim counters
Compound pages of an arbitrary order may now be on the LRU and may be reclaimed. Adjust the counting in vmscan.c to count the number of base pages. Also change the active and inactive accounting to do the same. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/mm_inline.h | 36 +++- mm/vmscan.c | 22 -- 2 files changed, 39 insertions(+), 19 deletions(-) Index: linux-2.6/include/linux/mm_inline.h === --- linux-2.6.orig/include/linux/mm_inline.h2007-08-27 19:22:13.0 -0700 +++ linux-2.6/include/linux/mm_inline.h 2007-08-27 21:08:27.0 -0700 @@ -2,39 +2,57 @@ static inline void add_page_to_active_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->active_list); - __inc_zone_state(zone, NR_ACTIVE); + if (!PageHead(page)) + __inc_zone_state(zone, NR_ACTIVE); + else + __inc_zone_page_state(page, NR_ACTIVE); } static inline void add_page_to_inactive_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->inactive_list); - __inc_zone_state(zone, NR_INACTIVE); + if (!PageHead(page)) + __inc_zone_state(zone, NR_INACTIVE); + else + __inc_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_active_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_ACTIVE); + if (!PageHead(page)) + __dec_zone_state(zone, NR_ACTIVE); + else + __dec_zone_page_state(page, NR_ACTIVE); } static inline void del_page_from_inactive_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_INACTIVE); + if (!PageHead(page)) + __dec_zone_state(zone, NR_INACTIVE); + else + __dec_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_lru(struct zone *zone, struct page *page) { + enum zone_stat_item counter = NR_ACTIVE; + list_del(&page->lru); - if (PageActive(page)) { + if (PageActive(page)) __ClearPageActive(page); - __dec_zone_state(zone, NR_ACTIVE); - } else { - __dec_zone_state(zone, NR_INACTIVE); - } + else + counter = NR_INACTIVE; + 
if (!PageHead(page)) + __dec_zone_state(zone, counter); + else + __dec_zone_page_state(page, counter); } + Index: linux-2.6/mm/vmscan.c === --- linux-2.6.orig/mm/vmscan.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/vmscan.c 2007-08-27 21:08:27.0 -0700 @@ -466,14 +466,14 @@ static unsigned long shrink_page_list(st VM_BUG_ON(PageActive(page)); - sc->nr_scanned++; + sc->nr_scanned += compound_pages(page); if (!sc->may_swap && page_mapped(page)) goto keep_locked; /* Double the slab pressure for mapped and swapcache pages */ if (page_mapped(page) || PageSwapCache(page)) - sc->nr_scanned++; + sc->nr_scanned += compound_pages(page); may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); @@ -590,7 +590,7 @@ static unsigned long shrink_page_list(st free_it: unlock_page(page); - nr_reclaimed++; + nr_reclaimed += compound_pages(page); if (!pagevec_add(&freed_pvec, page)) __pagevec_release_nonlru(&freed_pvec); continue; @@ -682,22 +682,23 @@ static unsigned long isolate_lru_pages(u unsigned long nr_taken = 0; unsigned long scan; - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { + for (scan = 0; scan < nr_to_scan && !list_empty(src); ) { struct page *page; unsigned long pfn; unsigned long end_pfn; unsigned long page_pfn; + int pages; int zone_id; page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); - + pages = compound_pages(page); VM_BUG_ON(!PageLRU(page)); switch (__isolate_lru_page(page, mode)) { case 0: list_move(&page->lru, dst); - nr_taken++; + nr_taken += pages; break; case -EBUSY: @@ -743,8 +744,8 @@ static unsigned long isolate_lru_pages(u switch (__isolate_lru_page(cursor_page, mode)) {
[02/36] Define functions for page cache handling
We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK and PAGE_CACHE_ALIGN in various places in the kernel. Many times common operations like calculating the offset or the index are coded using shifts and adds. This patch provides inline functions to get the calculations accomplished without having to explicitly shift and add constants.

All functions take an address_space pointer. The address space pointer will be used in the future to eventually support a variable size page cache. Information reachable via the mapping may then determine page size.

New function                            Related base page constant
page_cache_shift(a)                     PAGE_CACHE_SHIFT
page_cache_size(a)                      PAGE_CACHE_SIZE
page_cache_mask(a)                      PAGE_CACHE_MASK
page_cache_index(a, pos)                Calculate page number from position
page_cache_next(addr, pos)              Page number of next page
page_cache_offset(a, pos)               Calculate offset into a page
page_cache_pos(a, index, offset)        Form position based on page number and an offset.

This provides a basis that would allow the conversion of all page cache handling in the kernel and ultimately allow the removal of the PAGE_CACHE_* constants.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- include/linux/pagemap.h | 54 +++ 1 files changed, 54 insertions(+), 0 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 8a83537..836e9dd 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -52,12 +52,66 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) * space in smaller chunks for same flexibility). * * Or rather, it _will_ be done in larger chunks. + * + * The following constants can be used if a filesystem only supports a single + * page size. */ #define PAGE_CACHE_SHIFT PAGE_SHIFT #define PAGE_CACHE_SIZE PAGE_SIZE #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +/* + * Functions that are currently setup for a fixed PAGE_SIZEd.
The use of + * these will allow a variable page size pagecache in the future. + */ +static inline int mapping_order(struct address_space *a) +{ + return 0; +} + +static inline int page_cache_shift(struct address_space *a) +{ + return PAGE_SHIFT; +} + +static inline unsigned int page_cache_size(struct address_space *a) +{ + return PAGE_SIZE; +} + +static inline loff_t page_cache_mask(struct address_space *a) +{ + return (loff_t)PAGE_MASK; +} + +static inline unsigned int page_cache_offset(struct address_space *a, + loff_t pos) +{ + return pos & ~PAGE_MASK; +} + +static inline pgoff_t page_cache_index(struct address_space *a, + loff_t pos) +{ + return pos >> page_cache_shift(a); +} + +/* + * Index of the page starting on or after the given position. + */ +static inline pgoff_t page_cache_next(struct address_space *a, + loff_t pos) +{ + return page_cache_index(a, pos + page_cache_size(a) - 1); +} + +static inline loff_t page_cache_pos(struct address_space *a, + pgoff_t index, unsigned long offset) +{ + return ((loff_t)index << page_cache_shift(a)) + offset; +} + #define page_cache_get(page) get_page(page) #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
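The helpers in the patch all collapse to the PAGE_SIZE values today, since mapping_order() always returns 0. The sketch below shows the arithmetic once a mapping can carry a nonzero order, which is the stated end goal; the struct and the 4k base page are assumptions for illustration, not the kernel definitions:

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume a 4k base page for this sketch */

/* Hypothetical stand-in: an address_space that records its page order. */
struct address_space { unsigned int order; };

static unsigned int page_cache_shift(const struct address_space *a)
{
    return PAGE_SHIFT + a->order;
}

static unsigned long page_cache_size(const struct address_space *a)
{
    return 1UL << page_cache_shift(a);
}

static unsigned long page_cache_index(const struct address_space *a, long long pos)
{
    /* page number containing position pos */
    return (unsigned long)(pos >> page_cache_shift(a));
}

static unsigned int page_cache_offset(const struct address_space *a, long long pos)
{
    /* byte offset of pos within its page */
    return (unsigned int)(pos & (page_cache_size(a) - 1));
}

static long long page_cache_pos(const struct address_space *a,
                                unsigned long index, unsigned int offset)
{
    /* inverse: rebuild a file position from (index, offset) */
    return ((long long)index << page_cache_shift(a)) + offset;
}
```

Note how index/offset/pos form a round trip: page_cache_pos(a, page_cache_index(a, pos), page_cache_offset(a, pos)) == pos for any position, whatever the order.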
[30/36] Add VM_BUG_ONs to check for correct page order
Before allowing different page orders it may be wise to get some checkpoints in at various places. Checkpoints will help debugging whenever a wrong order page shows up in a mapping. Helps when converting new filesystems to utilize larger pages. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/buffer.c |1 + mm/filemap.c | 18 +++--- 2 files changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-08-27 20:52:34.0 -0700 +++ linux-2.6/fs/buffer.c 2007-08-27 21:09:19.0 -0700 @@ -893,6 +893,7 @@ struct buffer_head *alloc_page_buffers(s long offset; unsigned int page_size = page_cache_size(page->mapping); + BUG_ON(size > page_size); try_again: head = NULL; offset = page_size; Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 21:08:04.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:09:19.0 -0700 @@ -128,6 +128,7 @@ void remove_from_page_cache(struct page struct address_space *mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); write_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); @@ -269,6 +270,7 @@ int wait_on_page_writeback_range(struct if (page->index > end) continue; + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); wait_on_page_writeback(page); if (PageError(page)) ret = -EIO; @@ -440,6 +442,7 @@ int add_to_page_cache(struct page *page, { int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); if (error == 0) { write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); @@ -599,8 +602,10 @@ struct page * find_get_page(struct addre read_lock_irq(&mapping->tree_lock); page = radix_tree_lookup(&mapping->page_tree, offset); - if (page) + if (page) { + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); page_cache_get(page); + } read_unlock_irq(&mapping->tree_lock); return page; } @@ 
-625,6 +630,7 @@ struct page *find_lock_page(struct addre repeat: page = radix_tree_lookup(&mapping->page_tree, offset); if (page) { + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); @@ -715,8 +721,10 @@ unsigned find_get_pages(struct address_s read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); + } read_unlock_irq(&mapping->tree_lock); return ret; } @@ -746,6 +754,7 @@ unsigned find_get_pages_contig(struct ad if (pages[i]->mapping == NULL || pages[i]->index != index) break; + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); index++; } @@ -774,8 +783,10 @@ unsigned find_get_pages_tag(struct addre read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages, *index, nr_pages, tag); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i])); page_cache_get(pages[i]); + } if (ret) *index = pages[ret - 1]->index + 1; read_unlock_irq(&mapping->tree_lock); @@ -2233,6 +2244,7 @@ int try_to_release_page(struct page *pag struct address_space * const mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping_order(mapping) != compound_order(page)); if (PageWriteback(page)) return 0; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
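Every VM_BUG_ON added above asserts the same invariant: a page found in a mapping must have exactly the mapping's order. A user-space sketch of that check, with hypothetical simplified types standing in for the kernel's address_space and page:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types, not the kernel structures. */
struct address_space { unsigned int order; };
struct page { unsigned int compound_order; };

/* The invariant the patch asserts at every page-cache insert and lookup. */
static int order_matches(const struct address_space *m, const struct page *p)
{
    return m->order == p->compound_order;
}

/* Sketch of a checked lookup: NULL results are fine, but any page returned
 * from the mapping must satisfy the invariant. */
static const struct page *checked_lookup(const struct address_space *m,
                                         const struct page *p)
{
    if (p)
        assert(order_matches(m, p));
    return p;
}
```

Placing the check at lookup time (rather than only at insert) is what makes it useful for converting filesystems: a page of the wrong order fires the assertion at the first access, close to the offending code path.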
[32/36] Readahead changes to support large blocksize.
Fix up readahead for large I/O operations. Only calculate the readahead until the 2M boundary then fall back to one page.

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

=== --- include/linux/mm.h |2 +- mm/fadvise.c |4 ++-- mm/filemap.c |5 ++--- mm/madvise.c |2 +- mm/readahead.c | 22 ++ 5 files changed, 20 insertions(+), 15 deletions(-) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h 2007-08-27 21:03:20.0 -0700 +++ linux-2.6/include/linux/mm.h2007-08-27 21:14:44.0 -0700 @@ -1142,7 +1142,7 @@ void page_cache_async_readahead(struct a pgoff_t offset, unsigned long size); -unsigned long max_sane_readahead(unsigned long nr); +unsigned long max_sane_readahead(unsigned long nr, int order); /* Do stack extension */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); Index: linux-2.6/mm/fadvise.c === --- linux-2.6.orig/mm/fadvise.c 2007-08-27 20:52:49.0 -0700 +++ linux-2.6/mm/fadvise.c 2007-08-27 21:14:44.0 -0700 @@ -86,10 +86,10 @@ asmlinkage long sys_fadvise64_64(int fd, nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, - max_sane_readahead(nrpages)); + nrpages); if (ret > 0) ret = 0; break; Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-08-27 21:10:38.0 -0700 +++ linux-2.6/mm/filemap.c 2007-08-27 21:14:44.0 -0700 @@ -1237,8 +1237,7 @@ do_readahead(struct address_space *mappi if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage) return -EINVAL; - force_page_cache_readahead(mapping, filp, index, - max_sane_readahead(nr)); + force_page_cache_readahead(mapping, filp, index, nr); return 0; } @@ -1373,7 +1372,7 @@ retry_find: count_vm_event(PGMAJFAULT); } did_readaround = 1; - ra_pages = max_sane_readahead(file->f_ra.ra_pages); + ra_pages = file->f_ra.ra_pages; if (ra_pages) { pgoff_t start = 0; Index: linux-2.6/mm/madvise.c === ---
linux-2.6.orig/mm/madvise.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/madvise.c 2007-08-27 21:14:44.0 -0700 @@ -124,7 +124,7 @@ static long madvise_willneed(struct vm_a end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; force_page_cache_readahead(file->f_mapping, - file, start, max_sane_readahead(end - start)); + file, start, end - start); return 0; } Index: linux-2.6/mm/readahead.c === --- linux-2.6.orig/mm/readahead.c 2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/readahead.c2007-08-27 21:14:44.0 -0700 @@ -44,7 +44,8 @@ EXPORT_SYMBOL_GPL(default_backing_dev_in void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping) { - ra->ra_pages = mapping->backing_dev_info->ra_pages; + ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages, + page_cache_size(mapping)); ra->prev_index = -1; } EXPORT_SYMBOL_GPL(file_ra_state_init); @@ -84,7 +85,7 @@ int read_cache_pages(struct address_spac put_pages_list(pages); break; } - task_io_account_read(PAGE_CACHE_SIZE); + task_io_account_read(page_cache_size(mapping)); } pagevec_lru_add(&lru_pvec); return ret; @@ -151,7 +152,7 @@ __do_page_cache_readahead(struct address if (isize == 0) goto out; - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + end_index = page_cache_index(mapping, isize - 1); /* * Preallocate as many pages as we will need. @@ -204,10 +205,12 @@ int force_page_cache_readahead(struct ad if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages)) return -EINVAL; + nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping)); while (nr_to_read) { i
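A readahead budget expressed in base pages has to shrink when a mapping uses higher-order pages, because each page then covers more data. The helper below is a hypothetical illustration of that conversion (it is not the patch's actual max_sane_readahead(), whose body is truncated above); the DIV_ROUND_UP macro matches the kernel's definition:

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: convert a readahead budget counted in base pages
 * into units of order-`order` pages, rounding up so at least the same
 * amount of data is covered. */
static unsigned long ra_units_for_order(unsigned long base_pages, int order)
{
    return DIV_ROUND_UP(base_pages, 1UL << order);
}
```

For example, a 32-base-page window becomes 8 order-2 pages, and a 33-base-page window rounds up to 9 so the last partial large page is still read.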
[11/36] Use page_cache_xxx in fs/buffer.c
Use page_cache_xxx in fs/buffer.c. We have a special situation in set_bh_page() since reiserfs calls that function before setting up the mapping. So retrieve the page size from the page struct rather than the mapping. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/buffer.c | 110 +--- 1 file changed, 62 insertions(+), 48 deletions(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-08-28 11:37:13.0 -0700 +++ linux-2.6/fs/buffer.c 2007-08-28 11:37:58.0 -0700 @@ -257,7 +257,7 @@ __find_get_block_slow(struct block_devic struct page *page; int all_mapped = 1; - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); + index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits); page = find_get_page(bd_mapping, index); if (!page) goto out; @@ -697,7 +697,7 @@ static int __set_page_dirty(struct page if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - task_io_account_write(PAGE_CACHE_SIZE); + task_io_account_write(page_cache_size(mapping)); } radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -891,10 +891,11 @@ struct buffer_head *alloc_page_buffers(s { struct buffer_head *bh, *head; long offset; + unsigned int page_size = page_cache_size(page->mapping); try_again: head = NULL; - offset = PAGE_SIZE; + offset = page_size; while ((offset -= size) >= 0) { bh = alloc_buffer_head(GFP_NOFS); if (!bh) @@ -1426,7 +1427,7 @@ void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { bh->b_page = page; - BUG_ON(offset >= PAGE_SIZE); + BUG_ON(offset >= compound_size(page)); if (PageHighMem(page)) /* * This catches illegal uses and preserves the offset: @@ -1605,6 +1606,7 @@ static int __block_write_full_page(struc struct buffer_head *bh, *head; const unsigned blocksize = 1 << inode->i_blkbits; int nr_underway = 0; + struct address_space *mapping = inode->i_mapping; BUG_ON(!PageLocked(page)); @@ -1625,7 +1627,8 @@ static int 
__block_write_full_page(struc * handle that here by just cleaning them. */ - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + block = (sector_t)page->index << + (page_cache_shift(mapping) - inode->i_blkbits); head = page_buffers(page); bh = head; @@ -1742,7 +1745,7 @@ recover: } while ((bh = bh->b_this_page) != head); SetPageError(page); BUG_ON(PageWriteback(page)); - mapping_set_error(page->mapping, err); + mapping_set_error(mapping, err); set_page_writeback(page); do { struct buffer_head *next = bh->b_this_page; @@ -1767,8 +1770,8 @@ static int __block_prepare_write(struct struct buffer_head *bh, *head, *wait[2], **wait_bh=wait; BUG_ON(!PageLocked(page)); - BUG_ON(from > PAGE_CACHE_SIZE); - BUG_ON(to > PAGE_CACHE_SIZE); + BUG_ON(from > page_cache_size(inode->i_mapping)); + BUG_ON(to > page_cache_size(inode->i_mapping)); BUG_ON(from > to); blocksize = 1 << inode->i_blkbits; @@ -1777,7 +1780,8 @@ static int __block_prepare_write(struct head = page_buffers(page); bbits = inode->i_blkbits; - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); + block = (sector_t)page->index << + (page_cache_shift(inode->i_mapping) - bbits); for(bh = head, block_start = 0; bh != head || !block_start; block++, block_start=block_end, bh = bh->b_this_page) { @@ -1921,7 +1925,8 @@ int block_read_full_page(struct page *pa create_empty_buffers(page, blocksize, 0); head = page_buffers(page); - iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)page->index << + (page_cache_shift(page->mapping) - inode->i_blkbits); lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits; bh = head; nr = 0; @@ -2045,7 +2050,7 @@ int generic_cont_expand(struct inode *in pgoff_t index; unsigned int offset; - offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */ + offset = page_cache_offset(inode->i_mapping, size); /* Within page */ /* ugh. in prepare/commit_write, if from==to==start of block, we ** skip the prepare. 
make sure we never send an offset for the start @@ -2055,7 +2060,7 @@ int generi
[33/36] Large blocksize support in ramfs
The simplest file system to use for large blocksize support is ramfs. Note that ramfs does not use the lower layers (buffer I/O etc) so this case is useful for initial testing of changes to large buffer size support if one just wants to exercise the higher layers.

The patch adds the ability to specify a mount parameter to modify the order for the pages that are allocated by ramfs. Here is an example of how to mount a volume with order 10 pages:

	mount -tramfs -o10 none /media

Mounts a ramfs filesystem with 4MB sized pages. Then copy a file onto it:

	cp linux-2.6.21-rc7.tar.gz /media

This will populate the ramfs volume. Note that we allocated 14 pages of 4M each instead of 13508. Get rid of the large pages again:

	umount /media

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- fs/ramfs/inode.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index ef2b46d..b317f80 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -60,7 +60,8 @@ struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev) inode->i_blocks = 0; inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_setup(inode->i_mapping, GFP_HIGHUSER, + sb->s_blocksize_bits - PAGE_SHIFT); inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { default: @@ -164,10 +165,15 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent) { struct inode * inode; struct dentry * root; + int order = 0; + char *options = data; + + if (options && *options) + order = simple_strtoul(options, NULL, 10); sb->s_maxbytes = MAX_LFS_FILESIZE; - sb->s_blocksize = PAGE_CACHE_SIZE; - sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_blocksize = PAGE_CACHE_SIZE << order; + sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT; sb->s_magic = RAMFS_MAGIC; sb->s_op = &ramfs_ops; sb->s_time_gran = 1; --
1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
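The fill_super logic is just option parsing plus a shift. A user-space sketch, assuming a 4k base page and using strtoul in place of the kernel's simple_strtoul:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_CACHE_SHIFT 12                     /* assume a 4k base page */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

/* Mirrors ramfs_fill_super(): parse the mount option string as a page
 * order and derive the superblock block size from it. An absent or empty
 * option means order 0, i.e. the ordinary base page size. */
static unsigned long blocksize_for_option(const char *options)
{
    int order = 0;

    if (options && *options)
        order = (int)strtoul(options, NULL, 10);
    return PAGE_CACHE_SIZE << order;
}
```

With `-o10` this yields 4096 << 10 = 4194304 bytes, matching the 4MB pages the commit message describes.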
[14/36] Use page_cache_xxx in fs/splice.c
Use page_cache_xxx in fs/splice.c Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/splice.c | 27 ++- 1 files changed, 14 insertions(+), 13 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index c010a72..7910f32 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -279,9 +279,9 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, .ops = &page_cache_pipe_buf_ops, }; - index = *ppos >> PAGE_CACHE_SHIFT; - loff = *ppos & ~PAGE_CACHE_MASK; - req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); + loff = page_cache_offset(mapping, *ppos); + req_pages = page_cache_next(mapping, len + loff); nr_pages = min(req_pages, (unsigned)PIPE_BUFFERS); /* @@ -336,7 +336,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, * Now loop over the map and see if we need to start IO on any * pages, fill in the partial map, etc. */ - index = *ppos >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); nr_pages = spd.nr_pages; spd.nr_pages = 0; for (page_nr = 0; page_nr < nr_pages; page_nr++) { @@ -348,7 +348,8 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, /* * this_len is the max we'll use from this page */ - this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); + this_len = min_t(unsigned long, len, + page_cache_size(mapping) - loff); page = pages[page_nr]; if (PageReadahead(page)) @@ -408,7 +409,7 @@ fill_it: * i_size must be checked after PageUptodate. 
*/ isize = i_size_read(mapping->host); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); if (unlikely(!isize || index > end_index)) break; @@ -422,7 +423,7 @@ fill_it: /* * max good bytes in this page */ - plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + plen = page_cache_offset(mapping, isize - 1) + 1; if (plen <= loff) break; @@ -573,12 +574,12 @@ static int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, if (unlikely(ret)) return ret; - index = sd->pos >> PAGE_CACHE_SHIFT; - offset = sd->pos & ~PAGE_CACHE_MASK; + index = page_cache_index(mapping, sd->pos); + offset = page_cache_offset(mapping, sd->pos); this_len = sd->len; - if (this_len + offset > PAGE_CACHE_SIZE) - this_len = PAGE_CACHE_SIZE - offset; + if (this_len + offset > page_cache_size(mapping)) + this_len = page_cache_size(mapping) - offset; find_page: page = find_lock_page(mapping, index); @@ -839,7 +840,7 @@ generic_file_splice_write_nolock(struct pipe_inode_info *pipe, struct file *out, unsigned long nr_pages; *ppos += ret; - nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + nr_pages = page_cache_next(mapping, ret); /* * If file or inode is SYNC and we actually wrote some data, @@ -896,7 +897,7 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, unsigned long nr_pages; *ppos += ret; - nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + nr_pages = page_cache_next(mapping, ret); /* * If file or inode is SYNC and we actually wrote some data, -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
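The recurring pattern in the splice conversion above is clamping a transfer so it never crosses a page boundary: `this_len = min(len, page_cache_size(mapping) - loff)`. A user-space sketch of that step, with a hypothetical simplified address_space carrying its page order and a 4k base page assumed:

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume a 4k base page */

/* Hypothetical stand-in for a mapping with a variable page order. */
struct address_space { unsigned int order; };

static unsigned long page_cache_size(const struct address_space *a)
{
    return 1UL << (PAGE_SHIFT + a->order);
}

/* Mirrors pipe_to_file()/__generic_file_splice_read(): a transfer starting
 * at offset `loff` within a page may use at most the bytes remaining in
 * that page. */
static unsigned long clamp_to_page(const struct address_space *a,
                                   unsigned long len, unsigned long loff)
{
    unsigned long room = page_cache_size(a) - loff;

    return len < room ? len : room;
}
```

This shows why the conversion matters: with order-2 (16k) pages a transfer that would have been split at the 4k boundary can proceed whole, since the clamp is against the mapping's page size rather than the fixed PAGE_CACHE_SIZE.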
[21/36] compound pages: PageHead/PageTail instead of PageCompound
This patch enhances the handling of compound pages in the VM. It may also be important for the antifrag patches that need to manage a set of higher order free pages and also for other uses of compound pages. For now it simplifies accounting for SLUB pages but the groundwork here is important for the large block size patches and for allowing page migration of larger pages.

With this framework we may be able to get to a point where compound pages keep their flags while they are free and Mel may avoid having special functions for determining the page order of higher order freed pages. If we can avoid the setup and teardown of higher order pages then allocation and release of compound pages will be faster.

Looking at the handling of compound pages we see that the fact that a page is part of a higher order page is not that interesting. The differentiation is mainly for head pages and tail pages of higher order pages. Head pages keep the page state and it is usually sufficient to pass a pointer to a head page. It is usually an error if tail pages are encountered. Or they may need to be treated like PAGE_SIZE pages.

So a compound flag in the page flags is not what we need. Instead we introduce a flag for the head page and another for the tail page. The PageCompound test is preserved for backward compatibility and will test if either PageTail or PageHead has been set.

After this patchset the uses of PageCompound() will be reduced significantly in the core VM. The I/O layer will still use PageCompound() for direct I/O. However, if we at some point convert direct I/O to also support compound pages as a single unit then PageCompound() there may become unnecessary as well as the leftover check in mm/swap.c. We may end up mostly with checks for PageTail and PageHead.

This patch: Use two separate page flags for the head and tail of compound pages. PageHead() and PageTail() become more efficient. PageCompound then becomes a check for PageTail || PageHead.

Over time it is expected that PageCompound will mostly go away since the head page processing will be different from tail page processing in most situations.

We can remove the compound page check from set_page_refcounted since PG_reclaim is no longer overloaded. Also the check in __free_one_page can only be for PageHead. We cannot free a tail page.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- include/linux/page-flags.h | 41 +++-- mm/internal.h |2 +- mm/page_alloc.c|2 +- 3 files changed, 13 insertions(+), 32 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 209d3a4..2786693 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,13 +83,15 @@ #define PG_private 11 /* If pagecache, has fs-private data */ #define PG_writeback 12 /* Page is under writeback */ -#define PG_compound14 /* Part of a compound page */ #define PG_swapcache 15 /* Swap page: swp_entry_t in private */ #define PG_mappedtodisk16 /* Has blocks allocated on-disk */ #define PG_reclaim 17 /* To be reclaimed asap */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_head21 /* Page is head of a compound page */ +#define PG_tail22 /* Page is tail of a compound page */ + /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ #define PG_readahead PG_reclaim /* Reminder to do async read-ahead */ @@ -216,37 +218,16 @@ static inline void SetPageUptodate(struct page *page) #define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags) #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags) -#define PageCompound(page) test_bit(PG_compound, &(page)->flags) -#define __SetPageCompound(page)__set_bit(PG_compound, &(page)->flags) -#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags) - -/* - * PG_reclaim is used in combination with PG_compound to mark the - * head and tail of a compound page - * - * PG_compound & PG_reclaim=> Tail page - * PG_compound &
~PG_reclaim => Head page - */ - -#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) +#define PageHead(page) test_bit(PG_head, &(page)->flags) +#define __SetPageHead(page)__set_bit(PG_head, &(page)->flags) +#define __ClearPageHead(page) __clear_bit(PG_head, &(page)->flags) -#define PageTail(page) ((page->flags & PG_head_tail_mask) \ - == PG_head_tail_mask) - -static inline void __SetPageTail(struct page *page) -{ - page->flags |= PG_head_tail_mask; -} - -static inline void __ClearPageTail(struct page *page) -{ - page->flags &= ~PG_head_tail_mask; -} +#define PageTail(page) test_bit(PG_tail, &(page->flags)) +#define
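The before/after semantics of the flag change can be demonstrated in user space. This sketch only models the new encoding with two independent bits (the old scheme overloaded PG_reclaim together with PG_compound); the bit numbers follow the patch but the struct is a stand-in, not the kernel's:

```c
#include <assert.h>

/* Separate head/tail bits, as in the patch; bit positions illustrative. */
#define PG_head 21
#define PG_tail 22

struct page { unsigned long flags; };

static int PageHead(const struct page *p) { return (p->flags >> PG_head) & 1; }
static int PageTail(const struct page *p) { return (p->flags >> PG_tail) & 1; }

/* PageCompound() is kept for backward compatibility: head or tail. */
static int PageCompound(const struct page *p)
{
    return PageHead(p) || PageTail(p);
}

static void SetPageHead(struct page *p) { p->flags |= 1UL << PG_head; }
static void SetPageTail(struct page *p) { p->flags |= 1UL << PG_tail; }
```

With dedicated bits, PageHead() and PageTail() are each a single bit test, and PG_reclaim is free to mean only "reclaim this page", which is exactly what lets the later patches drop the compound check from set_page_refcounted().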
[18/36] Use page_cache_xxx in fs/reiserfs
Use page_cache_xxx in fs/reiserfs Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- fs/reiserfs/file.c| 83 ++--- fs/reiserfs/inode.c | 33 ++-- fs/reiserfs/ioctl.c |2 +- fs/reiserfs/stree.c |8 ++- fs/reiserfs/tail_conversion.c |5 +- fs/reiserfs/xattr.c | 19 + 6 files changed, 84 insertions(+), 66 deletions(-) Index: linux-2.6/fs/reiserfs/file.c === --- linux-2.6.orig/fs/reiserfs/file.c 2007-08-27 21:22:40.0 -0700 +++ linux-2.6/fs/reiserfs/file.c2007-08-27 21:50:01.0 -0700 @@ -187,9 +187,11 @@ static int reiserfs_allocate_blocks_for_ int curr_block; // current block used to keep track of unmapped blocks. int i; // loop counter int itempos;// position in item - unsigned int from = (pos & (PAGE_CACHE_SIZE - 1)); // writing position in + struct address_space *mapping = prepared_pages[0]->mapping; + unsigned int from = page_cache_offset(mapping, pos);// writing position in // first page - unsigned int to = ((pos + write_bytes - 1) & (PAGE_CACHE_SIZE - 1)) + 1;/* last modified byte offset in last page */ + unsigned int to = page_cache_offset(mapping, pos + write_bytes - 1) + 1; + /* last modified byte offset in last page */ __u64 hole_size;// amount of blocks for a file hole, if it needed to be created. int modifying_this_item = 0;// Flag for items traversal code to keep track // of the fact that we already prepared @@ -731,19 +733,22 @@ static int reiserfs_copy_from_user_to_fi long page_fault = 0;// status of copy_from_user. int i; // loop counter. 
int offset; // offset in page + struct address_space *mapping = prepared_pages[0]->mapping; - for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages; + for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages; i++, offset = 0) { - size_t count = min_t(size_t, PAGE_CACHE_SIZE - offset, write_bytes);// How much of bytes to write to this page + size_t count = min_t(size_t, page_cache_size(mapping) - offset, + write_bytes); // How much of bytes to write to this page struct page *page = prepared_pages[i]; // Current page we process. fault_in_pages_readable(buf, count); /* Copy data from userspace to the current page */ kmap(page); - page_fault = __copy_from_user(page_address(page) + offset, buf, count); // Copy the data. + page_fault = __copy_from_user(page_address(page) + offset, buf, count); + // Copy the data. /* Flush processor's dcache for this page */ - flush_dcache_page(page); + flush_mapping_page(page); kunmap(page); buf += count; write_bytes -= count; @@ -763,11 +768,12 @@ int reiserfs_commit_page(struct inode *i int partial = 0; unsigned blocksize; struct buffer_head *bh, *head; - unsigned long i_size_index = inode->i_size >> PAGE_CACHE_SHIFT; + unsigned long i_size_index = + page_cache_offset(inode->i_mapping, inode->i_size); int new; int logit = reiserfs_file_data_log(inode); struct super_block *s = inode->i_sb; - int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize; + int bh_per_page = page_cache_size(inode->i_mapping) / s->s_blocksize; struct reiserfs_transaction_handle th; int ret = 0; @@ -839,10 +845,11 @@ static int reiserfs_submit_file_region_f int offset; // Writing offset in page. 
int orig_write_bytes = write_bytes; int sd_update = 0; + struct address_space *mapping = inode->i_mapping; - for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages; + for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages; i++, offset = 0) { - int count = min_t(int, PAGE_CACHE_SIZE - offset, write_bytes); // How much of bytes to write to this page + int count = min_t(int, page_cache_size(mapping) - offset, write_bytes); // How much of bytes to write to this page struct page *page = prepared_pages[i]; // Current page we process. status = @@ -985,12 +992,12 @@ static int reiserfs_prepare_file_region_ ) { int res = 0;// Return values of different functions we call. - unsigned long index = pos >> PAGE_CACHE_SHIFT; // Offset in file in pages. - int from = (pos & (PAGE_CA
[24/36] compound pages: Use new compound vmstat functions in SLUB
Use the new dec/inc functions to simplify SLUB's accounting of pages. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/slub.c | 13 - 1 files changed, 4 insertions(+), 9 deletions(-) Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c2007-08-27 19:22:13.0 -0700 +++ linux-2.6/mm/slub.c 2007-08-27 21:02:51.0 -0700 @@ -1038,7 +1038,6 @@ static inline void kmem_cache_open_debug static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) { struct page * page; - int pages = 1 << s->order; if (s->order) flags |= __GFP_COMP; @@ -1054,10 +1053,9 @@ static struct page *allocate_slab(struct if (!page) return NULL; - mod_zone_page_state(page_zone(page), + inc_zone_page_state(page, (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - pages); + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE); return page; } @@ -1124,8 +1122,6 @@ out: static void __free_slab(struct kmem_cache *s, struct page *page) { - int pages = 1 << s->order; - if (unlikely(SlabDebug(page))) { void *p; @@ -1135,10 +1131,9 @@ static void __free_slab(struct kmem_cach ClearSlabDebug(page); } - mod_zone_page_state(page_zone(page), + dec_zone_page_state(page, (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - - pages); + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE); page->mapping = NULL; __free_pages(page, s->order); -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[35/36] Large blocksize support for ext2
This adds support for a block size of up to 64k on any platform. It enables mounting filesystems that have a larger blocksize than the page size. F.e. the following is possible on x86_64 and i386 that have only a 4k page size:

	mke2fs -b 16384 /dev/hdd2
	mount /dev/hdd2 /media
	ls -l /media

Do more things with the volume that uses a 16k page cache size on a 4k page sized platform.

Hmmm... Actually there is nothing additional to be done after the earlier cleanup of the macros. So just modify the copyright.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

--- fs/ext2/inode.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 0079b2c..5ff775a 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -20,6 +20,9 @@ * ([EMAIL PROTECTED]) * * Assorted race fixes, rewrite of ext2_get_block() by Al Viro, 2000 + * + * (C) 2007 SGI. + * Large blocksize support by Christoph Lameter */ #include -- 1.5.2.4 -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[16/36] Use page_cache_xxx in fs/ext3
Use page_cache_xxx in fs/ext3

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/ext3/dir.c   |  3 ++-
 fs/ext3/inode.c | 34 +-
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..a65b5a7 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -137,7 +137,8 @@ static int ext3_readdir(struct file * filp,
 				&map_bh, 0, 0);
 		if (err > 0) {
 			pgoff_t index = map_bh.b_blocknr >>
-				(PAGE_CACHE_SHIFT - inode->i_blkbits);
+				(page_cache_shift(inode->i_mapping)
+				- inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb3c264..986519b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1224,7 +1224,7 @@ static int ext3_ordered_commit_write(struct file *file, struct page *page,
 	 */
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(page->mapping, page->index, to);
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
 	ret = generic_commit_write(file, page, from, to);
@@ -1243,7 +1243,7 @@ static int ext3_writeback_commit_write(struct file *file, struct page *page,
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(inode->i_mapping, page->index, to);
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
@@ -1270,7 +1270,7 @@ static int ext3_journalled_commit_write(struct file *file,
 	/*
 	 * Here we duplicate the generic_commit_write() functionality
 	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	pos = page_cache_pos(page->mapping, page->index, to);
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
 				to, &partial, commit_write_fn);
@@ -1422,6 +1422,7 @@ static int ext3_ordered_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	J_ASSERT(PageLocked(page));
@@ -1444,8 +1445,7 @@ static int ext3_ordered_writepage(struct page *page,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
 	}
 	page_bufs = page_buffers(page);
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bget_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one);
 
 	ret = block_write_full_page(page, ext3_get_block, wbc);
@@ -1462,13 +1462,12 @@ static int ext3_ordered_writepage(struct page *page,
 	 * and generally junk.
 	 */
 	if (ret == 0) {
-		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
-					NULL, journal_dirty_data_fn);
+		err = walk_page_buffers(handle, page_bufs, 0, pagesize,
+					NULL, journal_dirty_data_fn);
 		if (!ret)
 			ret = err;
 	}
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bput_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one);
 	err = ext3_journal_stop(handle);
 	if (!ret)
 		ret = err;
@@ -1520,6 +1519,7 @@ static int ext3_journalled_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	if (ext3_journal_current_handle())
 		goto no_write;
@@ -1536,17 +1536,16 @@ static int ext3_journalled_writepage(struct page *page,
 	 * doesn't seem much point in redirtying the page here.
 	 */
 	ClearPageChecked(page);
-	ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
-			ext3_get_block);
+	ret = block_prepare_write(page, 0, pagesize, ext3_get_block);
 	if (ret != 0) {
 		ext3_journal_stop(handle);
 		goto out_unlock;
 	}
 	ret = walk_page_buffers(handle, page_buffers(page), 0,
-		PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
 	err = walk_page_buffers(handle, page_buffers(page), 0,
-
[23/36] compound pages: vmstat support
Add support for compound pages so that inc_ and dec_xxx will increment
the ZVCs by the number of base pages of the compound page.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/vmstat.h |  5 ++---
 mm/vmstat.c            | 18 +-
 2 files changed, 15 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===
--- linux-2.6.orig/include/linux/vmstat.h	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/include/linux/vmstat.h	2007-08-27 20:59:42.0 -0700
@@ -234,7 +234,7 @@ static inline void __inc_zone_state(stru
 static inline void __inc_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	__mod_zone_page_state(page_zone(page), item, compound_pages(page));
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
@@ -246,8 +246,7 @@ static inline void __dec_zone_state(stru
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	atomic_long_dec(&page_zone(page)->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	__mod_zone_page_state(page_zone(page), item, -compound_pages(page));
 }
 
 /*
Index: linux-2.6/mm/vmstat.c
===
--- linux-2.6.orig/mm/vmstat.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/mm/vmstat.c	2007-08-27 20:59:42.0 -0700
@@ -225,7 +225,12 @@ void __inc_zone_state(struct zone *zone,
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__inc_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, compound_pages(page));
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
@@ -246,7 +251,12 @@ void __dec_zone_state(struct zone *zone,
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__dec_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__dec_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, -compound_pages(page));
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
@@ -262,11 +272,9 @@ void inc_zone_state(struct zone *zone, e
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
-	struct zone *zone;
 
-	zone = page_zone(page);
 	local_irq_save(flags);
-	__inc_zone_state(zone, item);
+	__inc_zone_page_state(page, item);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(inc_zone_page_state);
--

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[25/36] compound pages: Allow use of get_page_unless_zero with compound pages
This is needed by slab defragmentation. The refcount of a page head may
be incremented to ensure that a compound page will not go away under us.
It also may be needed for defragmentation of higher order pages. The
moving of compound pages may require the establishment of a reference
before the use of page migration functions.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h	2007-08-27 20:59:40.0 -0700
+++ linux-2.6/include/linux/mm.h	2007-08-27 21:03:20.0 -0700
@@ -290,7 +290,7 @@ static inline int put_page_testzero(stru
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-	VM_BUG_ON(PageCompound(page));
+	VM_BUG_ON(PageTail(page));
 	return atomic_inc_not_zero(&page->_count);
 }
--
[28/36] Fix PAGE SIZE assumption in miscellaneous places
Fix PAGE SIZE assumption in miscellaneous places.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/futex.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a124250..c6102e8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -258,7 +258,7 @@ int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared,
 	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
 	if (err >= 0) {
 		key->shared.pgoff =
-			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+			page->index << (compound_order(page) - PAGE_SHIFT);
 		put_page(page);
 		return 0;
 	}
--
1.5.2.4
[00/36] Large Blocksize Support V6
[An update before the Kernel Summit, because of the numerous requests
that I have had for this patchset. Please speak up if you feel that we
need something like this.]

This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.

Support is added in a way that limits the changes to existing code. As a
result filesystems can support I/O using large buffers with minimal
changes.

The page cache functions are mostly unchanged. Instead of a page struct
representing a single page they take a head page struct (which looks the
same as a regular page struct apart from the compound flags) and operate
on those. Most page cache functions can stay as they are. No locking
protocols are added or modified.

The support is also fully transparent at the level of the OS. No
specialized heuristics are added to switch to larger pages. Large page
support is enabled by filesystems or device drivers when a device or
volume is mounted. Larger block sizes are usually set during volume
creation, although the patchset supports setting these sizes per file.
The formatted partition will then always be accessed with the configured
blocksize.

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
  pages with functions that do the same and allow the constants to be
  parameterized.

- Extend the capabilities of compound pages so that they can be put onto
  the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()

Rationales:
---

1. The ability to handle memory of an arbitrarily large size using a
single page struct "handle" is essential for scaling memory handling and
reducing overhead in multiple kernel subsystems. This patchset is a
strategic move that allows performance gains throughout the kernel.

2. Reduce fsck times. Larger block sizes mean faster file system
checking. Using a 64k block size will reduce the number of blocks to be
managed by a factor of 16 and produce much denser and more contiguous
metadata.

3. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due
to a smaller page size (4k vs 16k on IA64). Supporting larger block
sizes on all architectures allows a significant reduction in I/O
overhead and increases the size of I/O that can be performed by hardware
in a single request, since the number of scatter gather entries is
typically limited for one request. This is going to become increasingly
important for supporting the ever growing memory sizes, since we may
have to handle excessively large numbers of 4k requests for data sizes
that may become common soon. For example, to write a 1 terabyte file the
kernel would have to handle 256 million 4k chunks.

4. Cross arch compatibility: It is currently not possible to mount a 16k
blocksize ext2 filesystem created on IA64 on an x86_64 system. With this
patch this becomes possible. Note that this also means that some
filesystems are already capable of working with blocksizes of up to 64k
(ext2, XFS), which is currently only available on a select few arches.
This patchset enables that functionality on all arches. There are no
special modifications needed to the filesystems. The set_blocksize()
function call will simply support a larger blocksize.

5. VM scalability. Large block sizes mean less state keeping for the
information being transferred. For a 1TB file one needs to handle 256
million page structs in the VM if one uses 4k page size. A 64k page size
reduces that amount to 16 million. If the limitations in existing
filesystems are removed then even higher reductions become possible. For
very large files like that, a page size of 2 MB may be beneficial, which
will reduce the number of page structs to handle to 512k.

The variable nature of the block size means that the size can be tuned
at file system creation time for the anticipated needs on a volume.

6. IO scalability. The IO layer will receive large blocks of contiguous
memory with this patchset. This means that fewer scatter gather elements
are needed and the memory used is guaranteed to be contiguous. Instead
of having to handle 4k chunks we can f.e. handle 64k chunks in one go.

7. Limited scatter gather support restricts I/O sizes. A lot of I/O
controllers are limited in the number of scatter gather elements that
they support. For example, a controller that supports 128 entries in its
scatter gather lists can only perform I/O of 128*4k = 512k in one go. If
the blocksize is larger (f.e. 64k) then we can perform larger I/O
transfers. If we support 128 entries then 128*64k = 8M can be
transferred in one transaction. Dave Chinner measured a performance
increase of 50% when going to a 64k blocksize.
[13/36] Use page_cache_xxx in mm/fadvise.c
Use page_cache_xxx in mm/fadvise.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/fadvise.c | 8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 0df4c89..804c2a9 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 	}
 
 	/* First and last PARTIAL page! */
-	start_index = offset >> PAGE_CACHE_SHIFT;
-	end_index = endbyte >> PAGE_CACHE_SHIFT;
+	start_index = page_cache_index(mapping, offset);
+	end_index = page_cache_index(mapping, endbyte);
 
 	/* Careful about overflow on the "+1" */
 	nrpages = end_index - start_index + 1;
@@ -100,8 +100,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		filemap_flush(mapping);
 
 	/* First and last FULL page! */
-	start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
-	end_index = (endbyte >> PAGE_CACHE_SHIFT);
+	start_index = page_cache_next(mapping, offset);
+	end_index = page_cache_index(mapping, endbyte);
 
 	if (end_index >= start_index)
 		invalidate_mapping_pages(mapping, start_index,
--
1.5.2.4
[08/36] Use page_cache_xxx in mm/migrate.c
Use page_cache_xxx in mm/migrate.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/migrate.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 37c73b9..4949927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -195,7 +195,7 @@ static void remove_file_migration_ptes(struct page *old, struct page *new)
 	struct vm_area_struct *vma;
 	struct address_space *mapping = page_mapping(new);
 	struct prio_tree_iter iter;
-	pgoff_t pgoff = new->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = new->index << mapping_order(mapping);
 
 	if (!mapping)
 		return;
--
1.5.2.4
[03/36] Use page_cache_xxx functions in mm/filemap.c
Convert the uses of PAGE_CACHE_xxx to use page_cache_xxx instead.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/filemap.c | 56 
 1 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2007-08-27 19:22:13.0 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 19:31:13.0 -0700
@@ -303,8 +303,8 @@ int wait_on_page_writeback_range(struct
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -335,8 +335,8 @@ EXPORT_SYMBOL(sync_page_range);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -365,7 +365,7 @@ int filemap_fdatawait(struct address_spa
 		return 0;
 
 	return wait_on_page_writeback_range(mapping, 0,
-				(i_size - 1) >> PAGE_CACHE_SHIFT);
+				page_cache_index(mapping, i_size - 1));
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
@@ -413,8 +413,8 @@ int filemap_write_and_wait_range(struct
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
-						lstart >> PAGE_CACHE_SHIFT,
-						lend >> PAGE_CACHE_SHIFT);
+						page_cache_index(mapping, lstart),
+						page_cache_index(mapping, lend));
 			if (!err)
 				err = err2;
 		}
@@ -877,12 +877,12 @@ void do_generic_mapping_read(struct addr
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
-	index = *ppos >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(mapping, *ppos);
 	next_index = index;
 	prev_index = ra.prev_index;
 	prev_offset = ra.prev_offset;
-	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	last_index = page_cache_next(mapping, *ppos + desc->count);
+	offset = page_cache_offset(mapping, *ppos);
 
 	for (;;) {
 		struct page *page;
@@ -919,16 +919,16 @@ page_ok:
 		 */
 		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		end_index = page_cache_index(mapping, isize - 1);
 		if (unlikely(!isize || index > end_index)) {
 			page_cache_release(page);
 			goto out;
 		}
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index == end_index) {
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				page_cache_release(page);
 				goto out;
@@ -963,8 +963,8 @@ page_ok:
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 		prev_offset = offset;
 		ra.prev_offset = offset;
 
@@ -1058,7 +1058,7 @@ out:
 	*_ra = ra;
 	_ra->prev_index = prev_index;
 
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (cached_page)
 		page_cache_release(cached_page);
 	if (filp)
@@ -1240,8 +1240,8 @@ asmlinkage ssize_t sys_readahead(int fd,
 	if (file) {
 		if (file->f_mode & FMODE_READ) {
 			struct address_space *mapping = file->f_mapping;
-			unsigned long start = offset >> PAGE_CACHE_SHIFT;
-			unsigned long end =
[04/36] Use page_cache_xxx in mm/page-writeback.c
Use page_cache_xxx in mm/page-writeback.c

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/page-writeback.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63512a9..ebe76e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -624,8 +624,8 @@ int write_cache_pages(struct address_space *mapping,
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
 	} else {
-		index = wbc->range_start >> PAGE_CACHE_SHIFT;
-		end = wbc->range_end >> PAGE_CACHE_SHIFT;
+		index = page_cache_index(mapping, wbc->range_start);
+		end = page_cache_index(mapping, wbc->range_end);
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		scanned = 1;
@@ -827,7 +827,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				task_io_account_write(PAGE_CACHE_SIZE);
+				task_io_account_write(page_cache_size(mapping));
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
--
1.5.2.4
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit, such inaccuracy (maximum bios in flight thus
> > can be more than limit, difference is equal to the number of CPUs -
> > 1) is a price for removing atomic operation. I thought I pointed it
> > in the original description, but might forget, that if it will be an
> > issue, that atomic operations can be introduced there. Any
> > uber-precise measurements in the case when we are close to the edge
> > will not give us any benefit at all, since we are already in the
> > grey area.
>
> This is not just inaccurate, it is suicide. Keep leaking throttle
> counts and eventually all of them will be gone. No more IO
> on that block device!

First, because the number of increment and decrement operations is the
same, the counter will dance around the limit in both directions.

Second, I wrote about this race, and there are a number of ways to deal
with it, from atomic operations to separate counters for in-flight and
completed bios (which can be racy too, but from a different angle).

Third, if people can not agree even on a much higher-level detail
(whether the bio structure should be allowed to grow or not), how can we
discuss details of the preliminary implementation with known issues?

So I can not agree that the issue is fatal, but of course it exists, and
it was highlighted. Let's solve problems in the order of their
appearance. If the bio structure is allowed to grow, then the whole
patch series can be done better; if not, then there are issues with
performance (although the more I think about it, the more I become sure
that since a bio itself is very rarely shared, and thus requires cloning
and allocation/freeing, which is itself a much more costly operation
than atomic_sub/dec, it can safely host an additional operation).

--
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1				cpu2
> >
> > read q->bio_queued
> > 				q->bio_queued--
> > write q->bio_queued - 1
> >
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since we are already in the
> grey area.

This is not just inaccurate, it is suicide. Keep leaking throttle
counts and eventually all of them will be gone. No more IO on that
block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue. Yes, I think we need a nice simple approach like that,
and prove it is stable before worrying about the size cost.

Regards,

Daniel
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote:
> On 7/31/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
> > TODO list currently includes following main items:
> > * redundancy algorithm (drop me a request of your own, but it is highly
> >   unlikely that Reed-Solomon based will ever be used - it is too slow
> >   for distributed RAID, I consider WEAVER codes)
>
> LDPC codes[1][2] have been replacing Turbo code[3] with regards to
> communication links and we have been seeing that transition. (maybe
> helpful, came to mind seeing the mention of Turbo code) Don't know how
> weaver compares to LDPC, though found some comparisons [4][5] But
> looking at fault tolerance figures, i guess Weaver is much better.
>
> [1] http://www.ldpc-codes.com/
> [2] http://portal.acm.org/citation.cfm?id=1240497
> [3] http://en.wikipedia.org/wiki/Turbo_code
> [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf
> [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf

I've studied and implemented an LDPC encoder/decoder (hard-decoding
belief propagation algorithm only, though) in userspace and found that
such probabilistic codes generally are not suitable for redundant or
distributed data storages, because of their per-bit nature and
probabilistic error recovery.

An interested reader can find an iterative-decoding presentation similar
to Dr. Plank's, some of my analysis of the codes, and all sources at the
project homepage and in my blog:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=ldpc

So I consider Weaver codes the superior choice for distributed storages.

--
	Evgeniy Polyakov
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 02:33:08 +1000
David Chinner <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > > >
> > > > I wonder if XFS can benefit any more from the general writeback
> > > > clustering. How large would be a typical XFS cluster?
> > >
> > > Depends on inode size. Typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible)
> > > to make it more generic.
> >
> > When I talked to hch about this, he said the order file data pages
> > got written in XFS was still dictated by the order the higher
> > layers sent things down.
>
> Sure, that's file data. I was talking about the inode writeback, not
> the data writeback.

I think we're trying to gain different things from inode based
clustering... I'm not worried that the inode be next to the data.

I'm going under the assumption that most of the time, the FS will try to
allocate inodes in groups in a directory, and so most of the time the
data blocks for inode N will be close to inode N+1. So what I'm really
trying for here is data block clustering when writing multiple inodes at
once. This matters most when files are relatively small and written in
groups, which is a common workload.

It may make the most sense to change the patch to supply some key for
the data block clustering instead of the inode number, but it's an easy
first pass.

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> On Wed, 29 Aug 2007 00:55:30 +1000
> David Chinner <[EMAIL PROTECTED]> wrote:
> > On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > > Notes:
> > > > > (1) I'm not sure inode number is correlated to disk location in
> > > > > filesystems other than ext2/3/4. Or parent dir?
> > > >
> > > > They correspond to the exact location on disk on XFS. But XFS has
> > > > its own inode clustering (see xfs_iflush) and it can't be moved
> > > > up into the generic layers because of locking and integration
> > > > into the transaction subsystem.
> > > >
> > > > > (2) It duplicates some function of elevators. Why is it
> > > > > necessary?
> > > >
> > > > The elevators have no clue as to how the filesystem might treat
> > > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > > feature of the inode reading and writing and that is something no
> > > > elevator can hope to achieve.
> > >
> > > Thank you. That explains the linear write curve (perfect!) in
> > > Chris' graph.
> > >
> > > I wonder if XFS can benefit any more from the general writeback
> > > clustering. How large would be a typical XFS cluster?
> >
> > Depends on inode size. Typically they are 8k in size, so anything
> > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > integrated into the transaction subsystem and has some intricate
> > locking, so it's not likely to be easy (or perhaps even possible) to
> > make it more generic.
>
> When I talked to hch about this, he said the order file data pages got
> written in XFS was still dictated by the order the higher layers sent
> things down.

Sure, that's file data. I was talking about the inode writeback, not
the data writeback.

> Shouldn't the clustering still help to have delalloc done
> in inode order instead of in whatever random order pdflush sends things
> down now?

Depends on how things are being allocated. If you've got inode32
allocation and a >1TB filesystem, then data is nowhere near the inodes.
If you've got large allocation groups, then data is typically nowhere
near the inodes, either. If you've got full AGs, data will be nowhere
near the inodes. If you've got large files and lots of data to write,
then clustering multiple files together for writing is not needed.

So in many cases, clustering delalloc writes by inode number doesn't
provide any better I/O patterns than not clustering...

The only difference we may see is that if we flush all the data on
inodes in a single cluster, we can get away with a single inode cluster
write for all of the inodes.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 00:55:30 +1000
David Chinner <[EMAIL PROTECTED]> wrote:
> On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > They correspond to the exact location on disk on XFS. But XFS has
> > > its own inode clustering (see xfs_iflush) and it can't be moved
> > > up into the generic layers because of locking and integration into
> > > the transaction subsystem.
> > >
> > > > (2) It duplicates some function of elevators. Why is it
> > > > necessary?
> > >
> > > The elevators have no clue as to how the filesystem might treat
> > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > feature of the inode reading and writing and that is something no
> > > elevator can hope to achieve.
> >
> > Thank you. That explains the linear write curve (perfect!) in Chris'
> > graph.
> >
> > I wonder if XFS can benefit any more from the general writeback
> > clustering. How large would be a typical XFS cluster?
>
> Depends on inode size. Typically they are 8k in size, so anything
> from 4-32 inodes. The inode writeback clustering is pretty tightly
> integrated into the transaction subsystem and has some intricate
> locking, so it's not likely to be easy (or perhaps even possible) to
> make it more generic.

When I talked to hch about this, he said the order file data pages got
written in XFS was still dictated by the order the higher layers sent
things down. Shouldn't the clustering still help to have delalloc done
in inode order instead of in whatever random order pdflush sends things
down now?

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > Notes:
> > > (1) I'm not sure inode number is correlated to disk location in
> > > filesystems other than ext2/3/4. Or parent dir?
> >
> > They correspond to the exact location on disk on XFS. But XFS has
> > its own inode clustering (see xfs_iflush) and it can't be moved up
> > into the generic layers because of locking and integration into
> > the transaction subsystem.
> >
> > > (2) It duplicates some function of elevators. Why is it necessary?
> >
> > The elevators have no clue as to how the filesystem might treat
> > adjacent inodes. In XFS, inode clustering is a fundamental feature
> > of the inode reading and writing and that is something no elevator
> > can hope to achieve.
>
> Thank you. That explains the linear write curve (perfect!) in Chris'
> graph.
>
> I wonder if XFS can benefit any more from the general writeback
> clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from
4-32 inodes. The inode writeback clustering is pretty tightly
integrated into the transaction subsystem and has some intricate
locking, so it's not likely to be easy (or perhaps even possible) to
make it more generic.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> Say Evgeniy, something I was curious about but forgot to ask you
> earlier...
>
> On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > ...All operations are not atomic, since we do not care about precise
> > number of bios, but a fact, that we are close or close enough to the
> > limit.
> > ... in bio->endio
> > + q->bio_queued--;
>
> In your proposed patch, what prevents the race:
>
> cpu1				cpu2
>
> read q->bio_queued
> 				q->bio_queued--
> write q->bio_queued - 1
>
> Whoops! We leaked a throttle count.

We do not care about one cpu being able to increase its counter higher
than the limit, such inaccuracy (maximum bios in flight thus can be
more than limit, difference is equal to the number of CPUs - 1) is a
price for removing atomic operation. I thought I pointed it in the
original description, but might forget, that if it will be an issue,
that atomic operations can be introduced there. Any uber-precise
measurements in the case when we are close to the edge will not give us
any benefit at all, since we are already in the grey area.

Another possibility is to create a queue/device pointer in the bio
structure to hold the original device and then add a callback to its
backing dev structure to recalculate the limit, but that increases the
size of the bio. Do we need this?

> Regards,
>
> Daniel

--
	Evgeniy Polyakov