Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-02-08 Thread Pete Zaitcev
On Tue, 5 Feb 2008 14:05:06 -0800, Andrew Morton <[EMAIL PROTECTED]> wrote:

> > > http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/

> I think ub.c is basically abandoned in favour of usb-storage.
> If so, perhaps we should remove or disable ub.c?

Looks like Tomo or Jens just made a mistake when converting to
the new s/g API. Nothing to be too concerned about. I know I should've
reviewed their patch more closely, but it seemed too simple...
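
As a rough illustration of why the table handed to the mapping call has to be the one that was initialised, here is a user-space toy model. It is a sketch only: `toy_sg`, `TOY_SG_END`, `toy_sg_init_table()` and `toy_sg_count()` are hypothetical names, not kernel API, and the model assumes only that initialisation zeroes the table and marks its last entry as the end.

```c
#include <stddef.h>
#include <string.h>

#define TOY_SG_END 0x1u            /* hypothetical "last entry" marker */

struct toy_sg {
	unsigned int flags;
	unsigned int length;
};

/* Zero the table and mark its last entry as the end (toy model). */
void toy_sg_init_table(struct toy_sg *sgl, size_t nents)
{
	memset(sgl, 0, sizeof(*sgl) * nents);
	sgl[nents - 1].flags |= TOY_SG_END;
}

/*
 * Walk at most `limit` entries looking for the end marker.
 * Returns the number of entries up to and including the marker,
 * or 0 if no marker is found, i.e. the table was never initialised.
 */
size_t toy_sg_count(const struct toy_sg *sgl, size_t limit)
{
	size_t i;

	for (i = 0; i < limit; i++)
		if (sgl[i].flags & TOY_SG_END)
			return i + 1;
	return 0;
}
```

In this model, initialising one table and then walking a different one finds no end marker, which is the shape of the mix-up between cmd->sgv and urq->sgv that the patch below corrects.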

-- Pete

Fix up the conversion to sg_init_table().

Signed-off-by: Pete Zaitcev <[EMAIL PROTECTED]>

--- a/drivers/block/ub.c
+++ b/drivers/block/ub.c
@@ -657,7 +657,6 @@ static int ub_request_fn_1(struct ub_lun *lun, struct request *rq)
if ((cmd = ub_get_cmd(lun)) == NULL)
return -1;
memset(cmd, 0, sizeof(struct ub_scsi_cmd));
-   sg_init_table(cmd->sgv, UB_MAX_REQ_SG);
 
blkdev_dequeue_request(rq);
 
@@ -668,6 +667,7 @@ static int ub_request_fn_1(struct ub_lun *lun, struct request *rq)
/*
 * get scatterlist from block layer
 */
+   sg_init_table(&urq->sgv[0], UB_MAX_REQ_SG);
n_elem = blk_rq_map_sg(lun->disk->queue, rq, &urq->sgv[0]);
if (n_elem < 0) {
/* Impossible, because blk_rq_map_sg should not hit ENOMEM. */
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] udf: move headers out include/linux/

2008-02-08 Thread Christoph Hellwig
There's really no reason to keep udf headers in include/linux as they're
not used by anything but fs/udf/.

This patch merges most of include/linux/udf_fs_i.h into fs/udf/udf_i.h,
include/linux/udf_fs_sb.h into fs/udf/udf_sb.h and
include/linux/udf_fs.h into fs/udf/udfdecl.h.

The only thing remaining in include/linux/ is a stub of udf_fs_i.h
defining the four user-visible udf ioctls.  It's also moved from
unifdef-y to header-y because it can be included unconditionally now.
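
For readers unfamiliar with how such ioctl numbers are built, the following Linux-only user-space sketch decodes them. The three definitions are copied from the stub shown in the diff below; the fourth ioctl is not visible in this excerpt and is deliberately left out. `is_udf_ioctl()` is a hypothetical helper, not part of any header.

```c
#include <linux/ioctl.h>

/* Copied from the include/linux/udf_fs_i.h stub in this patch. */
#ifndef UDF_GETEASIZE
#define UDF_GETEASIZE   _IOR('l', 0x40, int)
#define UDF_GETEABLOCK  _IOR('l', 0x41, void *)
#define UDF_GETVOLIDENT _IOR('l', 0x42, void *)
#endif

/*
 * Return 1 if `cmd` falls in the range the header comment reserves
 * for udf ('l', 0x40-0x7f), 0 otherwise.
 */
int is_udf_ioctl(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == 'l' &&
	       _IOC_NR(cmd) >= 0x40 && _IOC_NR(cmd) <= 0x7f;
}
```

Because the stub contains nothing but these macros, user space can include it without any __KERNEL__ guard, which is what makes the move to header-y possible.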


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/udf/udf_i.h
===
--- linux-2.6.orig/fs/udf/udf_i.h   2008-02-09 08:27:50.0 +0100
+++ linux-2.6/fs/udf/udf_i.h2008-02-09 08:31:34.0 +0100
@@ -1,10 +1,32 @@
-#ifndef __LINUX_UDF_I_H
-#define __LINUX_UDF_I_H
+#ifndef _UDF_I_H
+#define _UDF_I_H
+
+struct udf_inode_info {
+   struct timespec i_crtime;
+   /* Physical address of inode */
+   kernel_lb_addr  i_location;
+   __u64   i_unique;
+   __u32   i_lenEAttr;
+   __u32   i_lenAlloc;
+   __u64   i_lenExtents;
+   __u32   i_next_alloc_block;
+   __u32   i_next_alloc_goal;
+   unsignedi_alloc_type : 3;
+   unsignedi_efe : 1;
+   unsignedi_use : 1;
+   unsignedi_strat4096 : 1;
+   unsignedreserved : 26;
+   union {
+   short_ad*i_sad;
+   long_ad *i_lad;
+   __u8*i_data;
+   } i_ext;
+   struct inode vfs_inode;
+};
 
-#include 
 static inline struct udf_inode_info *UDF_I(struct inode *inode)
 {
return list_entry(inode, struct udf_inode_info, vfs_inode);
 }
 
-#endif /* !defined(_LINUX_UDF_I_H) */
+#endif /* _UDF_I_H */
Index: linux-2.6/include/linux/Kbuild
===
--- linux-2.6.orig/include/linux/Kbuild 2008-02-09 08:27:50.0 +0100
+++ linux-2.6/include/linux/Kbuild  2008-02-09 08:28:01.0 +0100
@@ -150,6 +150,7 @@ header-y += tiocl.h
 header-y += tipc.h
 header-y += tipc_config.h
 header-y += toshiba.h
+header-y += udf_fs_i.h
 header-y += ultrasound.h
 header-y += un.h
 header-y += utime.h
@@ -336,7 +337,6 @@ unifdef-y += time.h
 unifdef-y += timex.h
 unifdef-y += tty.h
 unifdef-y += types.h
-unifdef-y += udf_fs_i.h
 unifdef-y += udp.h
 unifdef-y += uinput.h
 unifdef-y += uio.h
Index: linux-2.6/include/linux/udf_fs_i.h
===
--- linux-2.6.orig/include/linux/udf_fs_i.h 2008-02-09 08:27:50.0 +0100
+++ linux-2.6/include/linux/udf_fs_i.h  2008-02-09 08:28:01.0 +0100
@@ -9,41 +9,10 @@
  * ftp://prep.ai.mit.edu/pub/gnu/GPL
  * Each contributing author retains all rights to their own work.
  */
-
 #ifndef _UDF_FS_I_H
 #define _UDF_FS_I_H 1
 
-#ifdef __KERNEL__
-
-struct udf_inode_info
-{
-   struct timespec i_crtime;
-   /* Physical address of inode */
-   kernel_lb_addr  i_location;
-   __u64   i_unique;
-   __u32   i_lenEAttr;
-   __u32   i_lenAlloc;
-   __u64   i_lenExtents;
-   __u32   i_next_alloc_block;
-   __u32   i_next_alloc_goal;
-   unsignedi_alloc_type : 3;
-   unsignedi_efe : 1;
-   unsignedi_use : 1;
-   unsignedi_strat4096 : 1;
-   unsignedreserved : 26;
-   union
-   {
-   short_ad*i_sad;
-   long_ad *i_lad;
-   __u8*i_data;
-   } i_ext;
-   struct inode vfs_inode;
-};
-
-#endif
-
 /* exported IOCTLs, we have 'l', 0x40-0x7f */
-
 #define UDF_GETEASIZE   _IOR('l', 0x40, int)
 #define UDF_GETEABLOCK  _IOR('l', 0x41, void *)
 #define UDF_GETVOLIDENT _IOR('l', 0x42, void *)
Index: linux-2.6/fs/udf/file.c
===
--- linux-2.6.orig/fs/udf/file.c2008-02-09 08:27:59.0 +0100
+++ linux-2.6/fs/udf/file.c 2008-02-09 08:28:01.0 +0100
@@ -27,7 +27,6 @@
 
 #include "udfdecl.h"
 #include 
-#include 
 #include 
 #include 
 #include  /* memset */
Index: linux-2.6/fs/udf/ialloc.c
===
--- linux-2.6.orig/fs/udf/ialloc.c  2008-02-09 08:27:50.0 +0100
+++ linux-2.6/fs/udf/ialloc.c   2008-02-09 08:28:01.0 +0100
@@ -21,7 +21,6 @@
 #include "udfdecl.h"
 #include 
 #include 
-#include 
 #include 
 #include 
 
Index: linux-2.6/fs/udf/lowlevel.c
===
--- linux-2.6.orig/fs/udf/lowlevel.c2008-0

[PATCH 1/3] udf: kill udf_set_blocksize

2008-02-08 Thread Christoph Hellwig
This helper has been quite useless since sb_min_blocksize was introduced
and is misnamed while we're at it.  Just open-code the few lines in the
caller instead.
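
To make the "quite useless" point concrete, here is a user-space toy model of what a helper like sb_min_blocksize is understood to do: raise the requested size to the device minimum, then reject anything that is not a power of two between 512 bytes and the page size. This is a sketch under those assumptions, not kernel code; `toy_min_blocksize()` and its parameters are hypothetical.

```c
#include <stdbool.h>

static bool is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Toy model: pick a block size no smaller than the device sector size.
 * Returns the chosen block size, or 0 if it is not acceptable.
 */
unsigned int toy_min_blocksize(unsigned int requested,
			       unsigned int hw_sector,
			       unsigned int page_size)
{
	unsigned int size = requested < hw_sector ? hw_sector : requested;

	if (size < 512 || size > page_size || !is_power_of_two(size))
		return 0;
	return size;
}
```

Since a zero return already signals failure, a wrapper that only adds the error message contributes little, which is why the check can live directly in the caller.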


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/udf/super.c
===
--- linux-2.6.orig/fs/udf/super.c   2008-02-09 07:48:41.0 +0100
+++ linux-2.6/fs/udf/super.c2008-02-09 07:56:18.0 +0100
@@ -587,44 +587,6 @@ static int udf_remount_fs(struct super_b
return 0;
 }
 
-/*
- * udf_set_blocksize
- *
- * PURPOSE
- * Set the block size to be used in all transfers.
- *
- * DESCRIPTION
- * To allow room for a DMA transfer, it is best to guess big when unsure.
- * This routine picks 2048 bytes as the blocksize when guessing. This
- * should be adequate until devices with larger block sizes become common.
- *
- * Note that the Linux kernel can currently only deal with blocksizes of
- * 512, 1024, 2048, 4096, and 8192 bytes.
- *
- * PRE-CONDITIONS
- * sb  Pointer to _locked_ superblock.
- *
- * POST-CONDITIONS
- * sb->s_blocksize Blocksize.
- * sb->s_blocksize_bitslog2 of blocksize.
- * 0   Blocksize is valid.
- * 1   Blocksize is invalid.
- *
- * HISTORY
- * July 1, 1997 - Andrew E. Mileski
- * Written, tested, and released.
- */
-static int udf_set_blocksize(struct super_block *sb, int bsize)
-{
-   if (!sb_min_blocksize(sb, bsize)) {
-   udf_debug("Bad block size (%d)\n", bsize);
-   printk(KERN_ERR "udf: bad block size (%d)\n", bsize);
-   return 0;
-   }
-
-   return sb->s_blocksize;
-}
-
 static int udf_vrs(struct super_block *sb, int silent)
 {
struct volStructDesc *vsd = NULL;
@@ -1776,8 +1738,11 @@ static int udf_fill_super(struct super_b
sbi->s_nls_map = uopt.nls_map;
 
/* Set the block size for all transfers */
-   if (!udf_set_blocksize(sb, uopt.blocksize))
+   if (!sb_min_blocksize(sb, uopt.blocksize)) {
+   udf_debug("Bad block size (%d)\n", uopt.blocksize);
+   printk(KERN_ERR "udf: bad block size (%d)\n", uopt.blocksize);
goto error_out;
+   }
 
if (uopt.session == 0xFFFFFFFF)
sbi->s_session = udf_get_last_session(sb);


[PATCH 2/3] udf: kill useless file header comments for vfs method implementations

2008-02-08 Thread Christoph Hellwig
There's no need to document vfs method invocation rules, we have
Documentation/filesystems/vfs.txt and Documentation/filesystems/Locking
for that.  Also a lot of these comments were either plain wrong or
horribly out of date.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/udf/dir.c
===
--- linux-2.6.orig/fs/udf/dir.c 2008-02-09 07:42:10.0 +0100
+++ linux-2.6/fs/udf/dir.c  2008-02-09 07:42:22.0 +0100
@@ -188,32 +188,6 @@ static int do_udf_readdir(struct inode *
return 0;
 }
 
-/*
- * udf_readdir
- *
- * PURPOSE
- * Read a directory entry.
- *
- * DESCRIPTION
- * Optional - sys_getdents() will return -ENOTDIR if this routine is not
- * available.
- *
- * Refer to sys_getdents() in fs/readdir.c
- * sys_getdents() -> .
- *
- * PRE-CONDITIONS
- * filpPointer to directory file.
- * buf Pointer to directory entry buffer.
- * filldir Pointer to filldir function.
- *
- * POST-CONDITIONS
- * >=0 on success.
- *
- * HISTORY
- * July 1, 1997 - Andrew E. Mileski
- * Written, tested, and released.
- */
-
 static int udf_readdir(struct file *filp, void *dirent, filldir_t filldir)
 {
struct inode *dir = filp->f_path.dentry->d_inode;
Index: linux-2.6/fs/udf/file.c
===
--- linux-2.6.orig/fs/udf/file.c2008-02-09 07:42:31.0 +0100
+++ linux-2.6/fs/udf/file.c 2008-02-09 07:43:11.0 +0100
@@ -144,40 +144,6 @@ static ssize_t udf_file_aio_write(struct
return retval;
 }
 
-/*
- * udf_ioctl
- *
- * PURPOSE
- * Issue an ioctl.
- *
- * DESCRIPTION
- * Optional - sys_ioctl() will return -ENOTTY if this routine is not
- * available, and the ioctl cannot be handled without filesystem help.
- *
- * sys_ioctl() handles these ioctls that apply only to regular files:
- * FIBMAP [requires udf_block_map()], FIGETBSZ, FIONREAD
- * These ioctls are also handled by sys_ioctl():
- * FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC
- * All other ioctls are passed to the filesystem.
- *
- * Refer to sys_ioctl() in fs/ioctl.c
- * sys_ioctl() -> .
- *
- * PRE-CONDITIONS
- * inode   Pointer to inode that ioctl was issued on.
- * filpPointer to file that ioctl was issued on.
- * cmd The ioctl command.
- * arg The ioctl argument [can be interpreted as a
- * user-space pointer if desired].
- *
- * POST-CONDITIONS
- * Success (>=0) or an error code (<=0) that
- * sys_ioctl() will return.
- *
- * HISTORY
- * July 1, 1997 - Andrew E. Mileski
- * Written, tested, and released.
- */
 int udf_ioctl(struct inode *inode, struct file *filp, unsigned int cmd,
  unsigned long arg)
 {
@@ -225,18 +191,6 @@ int udf_ioctl(struct inode *inode, struc
return result;
 }
 
-/*
- * udf_release_file
- *
- * PURPOSE
- *  Called when all references to the file are closed
- *
- * DESCRIPTION
- *  Discard prealloced blocks
- *
- * HISTORY
- *
- */
 static int udf_release_file(struct inode *inode, struct file *filp)
 {
if (filp->f_mode & FMODE_WRITE) {
Index: linux-2.6/fs/udf/inode.c
===
--- linux-2.6.orig/fs/udf/inode.c   2008-02-09 07:42:32.0 +0100
+++ linux-2.6/fs/udf/inode.c2008-02-09 07:45:45.0 +0100
@@ -66,22 +66,7 @@ static void udf_update_extents(struct in
   struct extent_position *);
 static int udf_get_block(struct inode *, sector_t, struct buffer_head *, int);
 
-/*
- * udf_delete_inode
- *
- * PURPOSE
- * Clean-up before the specified inode is destroyed.
- *
- * DESCRIPTION
- * This routine is called when the kernel destroys an inode structure
- * ie. when iput() finds i_count == 0.
- *
- * HISTORY
- * July 1, 1997 - Andrew E. Mileski
- * Written, tested, and released.
- *
- *  Called at the last iput() if i_nlink is zero.
- */
+
 void udf_delete_inode(struct inode *inode)
 {
truncate_inode_pages(&inode->i_data, 0);
@@ -1416,21 +1401,6 @@ static mode_t udf_convert_permissions(st
return mode;
 }
 
-/*
- * udf_write_inode
- *
- * PURPOSE
- * Write out the specified inode.
- *
- * DESCRIPTION
- * This routine is called whenever an inode is synced.
- * Currently this routine is just a placeholder.
- *
- * HISTORY
- * July 1, 1997 - Andrew E. Mileski
- * Written, tested, and released.
- */
-
 int udf_write_inode(struct inode *inode, int sync)
 {
int ret;
Index: linux-2.6/fs/udf/namei.c
===
--- linux-2.6.orig/fs/udf/namei.c   2008-02-09 07:42:32.000

Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread Jeff Layton
On Fri, 8 Feb 2008 16:12:28 -0500
"J. Bruce Fields" <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 08, 2008 at 03:54:14PM -0500, Jeff Layton wrote:
> > Interesting. It's not clear to me why the underlying filesystem would make
> > any difference there. Though now that I look, it looks like fl_grant
> > really only gets called from dlm code, and that queues up the block for
> > an immediate grant callback attempt. So perhaps that's the reason.
> 
> The asynchronous locking interface does something slightly cheesy for
> blocking locks--instead of waiting for the filesystem to respond, it
> just sends back a deny immediately (even if the lock might actually be
> available), then responds later with a granted message when it discovers
> it's available.
> 
> That works, but we should make it just wait to send the reply to the
> original lock request until we've got a real answer, as we do for
> nonblocking lock requests.  And in fact someone submitted a patch to do
> that--I just haven't gotten the time to review it.  Urp.
> 
> So anyway the effect is that on ext3 this particular lock wouldn't have
> required a grant reply, whereas on gfs2 it does.
> 
> Of course, what this means is that we'd hit the same problem on ext3 too
> if the lock request did in fact legitimately block.  So grant callbacks
> probably have never worked on ext3 over the loopback interface either.
> Oops!
> 

As best I can tell, the whole problem with rpc_pings was introduced
when we moved everything to the rpcbind stuff. Before that we generally
never did an rpc_ping when binding the client. This probably did work
until that was introduced.

> I bet nobody's ever noticed because we manage to recover by retrying the
> lock after it's available (whereas in the gfs2 case the retry hits the
> same problem).  So in practice for ext3 this probably just means
> blocking lock requests take a lot longer over loopback than they would
> otherwise.  And probably the only people that care about nlm performance
> don't usually do local mounts like that.
> 

-- 
Jeff Layton <[EMAIL PROTECTED]>


Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread J. Bruce Fields
On Fri, Feb 08, 2008 at 03:54:14PM -0500, Jeff Layton wrote:
> Interesting. It's not clear to me why the underlying filesystem would make
> any difference there. Though now that I look, it looks like fl_grant
> really only gets called from dlm code, and that queues up the block for
> an immediate grant callback attempt. So perhaps that's the reason.

The asynchronous locking interface does something slightly cheesy for
blocking locks--instead of waiting for the filesystem to respond, it
just sends back a deny immediately (even if the lock might actually be
available), then responds later with a granted message when it discovers
it's available.

That works, but we should make it just wait to send the reply to the
original lock request until we've got a real answer, as we do for
nonblocking lock requests.  And in fact someone submitted a patch to do
that--I just haven't gotten the time to review it.  Urp.

So anyway the effect is that on ext3 this particular lock wouldn't have
required a grant reply, whereas on gfs2 it does.

Of course, what this means is that we'd hit the same problem on ext3 too
if the lock request did in fact legitimately block.  So grant callbacks
probably have never worked on ext3 over the loopback interface either.
Oops!

I bet nobody's ever noticed because we manage to recover by retrying the
lock after it's available (whereas in the gfs2 case the retry hits the
same problem).  So in practice for ext3 this probably just means
blocking lock requests take a lot longer over loopback than they would
otherwise.  And probably the only people that care about nlm performance
don't usually do local mounts like that.

--b.


Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread Jeff Layton
On Fri, 8 Feb 2008 13:49:01 -0500 (EST)
"david m. richter" <[EMAIL PROTECTED]> wrote:

> On Fri, 8 Feb 2008, J. Bruce Fields wrote:
> 
> > On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > > On Thu, 7 Feb 2008 18:26:18 -0500
> > > "J. Bruce Fields" <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > > Hello!
> > > > >
> > > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > > >
> > > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > > >>> The problem seems to be with the fact that the client and server
> > > > >>> are on the same machine. This test work fine with or without an
> > > > >>> underlaying fs that supports locking when the client and the
> > > > >>> server are on a different machines. Like you said the server is
> > > > >>> trying to send the grant message to the client but for some
> > > > >>> reason it fails when the client is on the same machine.
> > > > >> That *shouldn't* make a difference, so we need to take another look
> > > > >> at this--Oleg, this problem is still unfixed, right?
> > > > >
> > > > > Yes, I just pulled your latest nfs tree and I still can reproduce the
> > > > > problem.
> > > > 
> > > > OK, we have finally reproduced this problem here, and David's working on
> > > > debugging.  It does indeed seem to only be reproduceable with client and
> > > > server on the same machine.  Thanks for the report
> > > > 
> > > > --b.
> > > 
> > > It might be worth testing this both with and without the patchset I
> > > posted to linux-nfs recently to take care of the lockd hang. If
> > > lockd is stuck trying to rpc_ping itself then it probably would hang
> > > like this, wouldn't it?
> > 
> > Of course!  Yes, that fits.
> > 
> > --b.
> 
>   right on, jeff, good catch and thanks for directing my attention 
> to your patches.
> 

Excellent! Glad that took care of it...

>   i applied them on top of 2.6.23.1 and tested them on a cluster 
> exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
> that lockd hang.
> 
>   in a bit more detail, oleg's reproducer basically gets a 
> whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
> lock, tests the lock, then unlocks.  the problem was that when getting 
> that exclusive lock things would hang.  this only happened when the client 
> and server were on the same machine, and i could reproduce it with NFS 
> exporting GFS2 but not NFS exporting EXT3.
> 
> 

Interesting. It's not clear to me why the underlying filesystem would make
any difference there. Though now that I look, it looks like fl_grant
really only gets called from dlm code, and that queues up the block for
an immediate grant callback attempt. So perhaps that's the reason.

-- 
Jeff Layton <[EMAIL PROTECTED]>


Re: [patch 01/26] mount options: add documentation

2008-02-08 Thread Miklos Szeredi
> > > Could also please explain why you want to go via user
> > > mounts. Other OS use a daemon for that, which e.g. can maintain
> > > access controls. How do you want to manage this?
> > 
> > The unprivileged mounts patches do contain a simple form of access
> > control.  I don't think anything more is needed, but of course, having
> > unprivileged mounts in the kernel does not prevent the use of a more
> > sophisticated access control daemon in userspace, if that becomes
> > necessary.
> 
> A "I don't think anything more is needed" sets off all sorts of warning 
> lights. Most things start out simple, so IMO it's well worth checking 
> where it might go, to know the limits beforehand. The main question here 
> is why should a kernel based solution be preferable over a daemon based 
> solution?

A daemon based solution would work for the "normal" case, where we
have a single mount namespace and a single /etc/mtab file, and we hope
it doesn't get too much out of sync with what is actually in the
kernel (on remount the mount options do get out of sync, but hey, we
seem to be able to live with that).

However, once you start using multiple namespaces, the daemon based
solution quickly becomes unusable, because you would need a separate
daemon for each namespace, and it would have to somehow keep track of
mount propagations in userspace (which is basically impossible), etc,
etc...

Does that answer your question?

Thanks,
Miklos


Re: [patch 01/26] mount options: add documentation

2008-02-08 Thread Roman Zippel
Hi,

On Wed, 30 Jan 2008, Miklos Szeredi wrote:

> > How does this deal with certain special cases:
> > - chroot: how will mount/df only show the for chroot relevant mounts?
> 
> That is a very good question.  Andreas Gruenbacher had some patches
> for fixing behavior of /proc/mounts under a chroot, but people are
> paranoid about userspace ABI changes (unwarranted in this case, IMO).
> 
>   http://lkml.org/lkml/2007/4/20/147
> 
> Anyway, if we are going to have a new 'mountinfo' file, this could be
> easily fixed as well.
> 
> > - loop: how is the connection between file and loop device maintained?
> 
> We also discussed this with Karel, maybe it didn't make it onto lkml.
> 
> The proposed solution was to store the "loop" flag separately in a
> file under /var.  It could just be an empty file for each such loop
> device:
> 
>   /var/lib/mount/loops/loop0
> 
> This file is created by mount(8) if the '-oloop' option is given.  And
> umount(8) automatically tears down the loop device if it finds this
> file.

My question was maybe a little short. I don't doubt that we can shove a 
lot into the kernel, the question is rather how much of this will be 
unnecessary information, which the kernel doesn't really need itself.

> > Could also please explain why you want to go via user mounts. Other OS
> > use a daemon for that, which e.g. can maintain access controls. How do
> > you want to manage this?
> 
> The unprivileged mounts patches do contain a simple form of access
> control.  I don't think anything more is needed, but of course, having
> unprivileged mounts in the kernel does not prevent the use of a more
> sophisticated access control daemon in userspace, if that becomes
> necessary.

A "I don't think anything more is needed" sets off all sorts of warning 
lights. Most things start out simple, so IMO it's well worth checking 
where it might go, to know the limits beforehand. The main question here 
is why should a kernel based solution be preferable over a daemon based 
solution?

If we look, for example, at OS X, it has no need for user mounts but 
has a daemon instead, which also provides an interesting notification 
system for new devices, mounts or unmount requests. All this could also be 
done in the kernel, but where would be the advantage in doing so? The 
kernel implementation would be either rather limited or only bloat the 
kernel. What is the feature that would make user mounts more than just a 
cool kernel hack?

bye, Roman


Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread david m. richter
On Fri, 8 Feb 2008, J. Bruce Fields wrote:

> On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > On Thu, 7 Feb 2008 18:26:18 -0500
> > "J. Bruce Fields" <[EMAIL PROTECTED]> wrote:
> > 
> > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > Hello!
> > > >
> > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > >
> > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > >>> The problem seems to be with the fact that the client and server
> > > >>> are on the same machine. This test work fine with or without an
> > > >>> underlaying fs that supports locking when the client and the
> > > >>> server are on a different machines. Like you said the server is
> > > >>> trying to send the grant message to the client but for some
> > > >>> reason it fails when the client is on the same machine.
> > > >> That *shouldn't* make a difference, so we need to take another look at
> > > >> this--Oleg, this problem is still unfixed, right?
> > > >
> > > > Yes, I just pulled your latest nfs tree and I still can reproduce the
> > > > problem.
> > > 
> > > OK, we have finally reproduced this problem here, and David's working on
> > > debugging.  It does indeed seem to only be reproduceable with client and
> > > server on the same machine.  Thanks for the report
> > > 
> > > --b.
> > 
> > It might be worth testing this both with and without the patchset I
> > posted to linux-nfs recently to take care of the lockd hang. If
> > lockd is stuck trying to rpc_ping itself then it probably would hang
> > like this, wouldn't it?
> 
> Of course!  Yes, that fits.
> 
> --b.

right on, jeff, good catch and thanks for directing my attention 
to your patches.

i applied them on top of 2.6.23.1 and tested them on a cluster 
exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
that lockd hang.

in a bit more detail, oleg's reproducer basically gets a 
whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
lock, tests the lock, then unlocks.  the problem was that when getting 
that exclusive lock things would hang.  this only happened when the client 
and server were on the same machine, and i could reproduce it with NFS 
exporting GFS2 but not NFS exporting EXT3.
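
The sequence described above can be sketched with plain POSIX fcntl() locks. This is not oleg's reproducer, only a minimal approximation of the steps as described; `lock_cycle()` and its helpers are hypothetical names, and the path passed in is assumed to be a file on the NFS mount under test.

```c
#include <fcntl.h>
#include <unistd.h>

/* Block until a whole-file lock of the given type is set (or cleared). */
static int set_whole_file_lock(int fd, short type)
{
	struct flock fl = { 0 };

	fl.l_type = type;           /* F_RDLCK, F_WRLCK or F_UNLCK */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;               /* 0 means "to end of file" */
	return fcntl(fd, F_SETLKW, &fl);
}

/*
 * Ask whether another process holds a lock conflicting with `type`.
 * Returns 1 if so, 0 if not, -1 on error.
 */
static int conflicting_lock_exists(int fd, short type)
{
	struct flock fl = { 0 };

	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;
	if (fcntl(fd, F_GETLK, &fl) == -1)
		return -1;
	return fl.l_type != F_UNLCK;
}

/* Read lock, test, upgrade to exclusive, test, unlock. 0 on success. */
int lock_cycle(const char *path)
{
	int fd = open(path, O_RDWR);
	int rc = -1;

	if (fd == -1)
		return -1;

	if (set_whole_file_lock(fd, F_RDLCK) == 0 &&
	    conflicting_lock_exists(fd, F_RDLCK) >= 0 &&
	    set_whole_file_lock(fd, F_WRLCK) == 0 &&   /* the step that hung */
	    conflicting_lock_exists(fd, F_WRLCK) >= 0 &&
	    set_whole_file_lock(fd, F_UNLCK) == 0)
		rc = 0;

	close(fd);
	return rc;
}
```

On a local filesystem this returns immediately; the hang reported in this thread would show up as the F_SETLKW call for the exclusive lock never returning.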


thanks,

d
.


[PATCH 37/37] NFS: Add mount options to enable local caching on NFS

2008-02-08 Thread David Howells
Add NFS mount options to allow the local caching support to be enabled.

The attached patch makes it possible for the NFS filesystem to be told to make
use of the network filesystem local caching service (FS-Cache).

To be able to use this, a recent nfs-utils package is required.

There are three variant NFS mount options that can be added to a mount command
to control caching for a mount.  Only the last one specified takes effect:

 (*) Adding "fsc" will request caching.

 (*) Adding "fsc=" will request caching and also specify a uniquifier.

 (*) Adding "nofsc" will disable caching.

For example:

mount warthog:/ /a -o fsc
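
The "only the last one specified takes effect" rule can be modelled in user space as follows. This is a sketch, not the kernel parser: `struct fsc_state`, `parse_fsc_options()` and the fixed 64-byte uniquifier buffer are hypothetical, and the handling of the uniquifier form is inferred from the description above because the corresponding hunk is cut off in this excerpt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct fsc_state {
	bool enabled;
	char uniq[64];              /* empty string: no uniquifier */
};

/* Apply one option; each caching option overrides the earlier ones. */
static void apply_option(struct fsc_state *st, const char *opt, size_t len)
{
	if (len == 3 && memcmp(opt, "fsc", 3) == 0) {
		st->enabled = true;
		st->uniq[0] = '\0';
	} else if (len == 5 && memcmp(opt, "nofsc", 5) == 0) {
		st->enabled = false;
		st->uniq[0] = '\0';
	} else if (len > 4 && memcmp(opt, "fsc=", 4) == 0) {
		size_t n = len - 4;

		if (n >= sizeof(st->uniq))
			n = sizeof(st->uniq) - 1;   /* truncate in this toy */
		st->enabled = true;
		memcpy(st->uniq, opt + 4, n);
		st->uniq[n] = '\0';
	}
	/* any other option is ignored by this model */
}

/* Walk a comma-separated option string from left to right. */
void parse_fsc_options(const char *opts, struct fsc_state *st)
{
	st->enabled = false;
	st->uniq[0] = '\0';

	while (*opts != '\0') {
		const char *comma = strchr(opts, ',');
		size_t len = comma ? (size_t)(comma - opts) : strlen(opts);

		apply_option(st, opts, len);
		opts += len;
		if (*opts == ',')
			opts++;
	}
}
```

With this model, "fsc,nofsc" leaves caching disabled, while "nofsc,fsc=abc" enables it with the uniquifier "abc".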


The cache of a particular superblock (NFS FSID) will be shared between all
mounts of that volume, provided they have the same connection parameters and
are not marked 'nosharecache'.

Where it is otherwise impossible to distinguish superblocks because all the
parameters are identical, but the 'nosharecache' option is supplied, a
uniquifying string must be supplied, else only the first mount will be
permitted to use the cache.

If there's a key collision, then the second mount will disable caching and
write a warning to the kernel log.


Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/client.c   |2 ++
 fs/nfs/internal.h |1 +
 fs/nfs/super.c|   25 +
 3 files changed, 28 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index d67d52f..8357f68 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -669,6 +669,7 @@ static int nfs_init_server(struct nfs_server *server,
 
/* Initialise the client representation from the mount data */
server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+   server->options = data->options;
 
if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
@@ -1056,6 +1057,7 @@ static int nfs4_init_server(struct nfs_server *server,
/* Initialise the client representation from the mount data */
server->flags = data->flags & NFS_MOUNT_FLAGMASK;
server->caps |= NFS_CAP_ATOMIC_OPEN;
+   server->options = data->options;
 
if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index e49cb6e..f427b35 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -38,6 +38,7 @@ struct nfs_parsed_mount_data {
int acregmin, acregmax,
acdirmin, acdirmax;
int namlen;
+   unsigned intoptions;
unsigned intbsize;
unsigned intauth_flavor_len;
rpc_authflavor_tauth_flavors[1];
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 437c3dd..96082a2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -76,6 +76,7 @@ enum {
Opt_acl, Opt_noacl,
Opt_rdirplus, Opt_nordirplus,
Opt_sharecache, Opt_nosharecache,
+   Opt_fscache, Opt_nofscache,
 
/* Mount options that take integer arguments */
Opt_port,
@@ -92,6 +93,7 @@ enum {
/* Mount options that take string arguments */
Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
Opt_addr, Opt_mountaddr, Opt_clientaddr,
+   Opt_fscache_uniq,
 
/* Mount options that are ignored */
Opt_userspace, Opt_deprecated,
@@ -125,6 +127,9 @@ static match_table_t nfs_mount_option_tokens = {
{ Opt_nordirplus, "nordirplus" },
{ Opt_sharecache, "sharecache" },
{ Opt_nosharecache, "nosharecache" },
+   { Opt_fscache, "fsc" },
+   { Opt_fscache_uniq, "fsc=%s" },
+   { Opt_nofscache, "nofsc" },
 
{ Opt_port, "port=%u" },
{ Opt_rsize, "rsize=%u" },
@@ -482,6 +487,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+   if (nfss->options & NFS_OPTION_FSCACHE)
+   seq_printf(m, ",fsc");
 }
 
 /*
@@ -776,6 +783,24 @@ static int nfs_parse_mount_options(char *raw,
case Opt_nosharecache:
mnt->flags |= NFS_MOUNT_UNSHARED;
break;
+   case Opt_fscache:
+   mnt->options |= NFS_OPTION_FSCACHE;
+   kfree(mnt->fscache_uniq);
+   mnt->fscache_uniq = NULL;
+   break;
+   case Opt_nofscache:
+   mnt->options &= ~NFS_OPTION_FSCACHE;
+   kfree(mnt->fscache_uniq);
+   mnt->fscache_uniq = NULL;
+   break;
+   case Opt_fscache_uniq:
+   string = match_strdup(args);
+  

[PATCH 36/37] NFS: Display local caching state

2008-02-08 Thread David Howells
Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/client.c  |7 ---
 fs/nfs/fscache.h |   15 +++
 2 files changed, 19 insertions(+), 3 deletions(-)


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 51e9346..d67d52f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1451,7 +1451,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
/* display header on line 1 */
if (v == &nfs_volume_list) {
-   seq_puts(m, "NV SERVER   PORT DEV FSID\n");
+   seq_puts(m, "NV SERVER   PORT DEV FSID  FSC\n");
return 0;
}
/* display one transport per line on subsequent lines */
@@ -1465,12 +1465,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 (unsigned long long) server->fsid.major,
 (unsigned long long) server->fsid.minor);
 
-   seq_printf(m, "v%u %s %s %-7s %-17s\n",
+   seq_printf(m, "v%u %s %s %-7s %-17s %s\n",
   clp->rpc_ops->version,
   rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_ADDR),
   rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_PORT),
   dev,
-  fsid);
+  fsid,
+  nfs_server_fscache_state(server));
 
return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 6264cd8..5f7806f 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -146,6 +146,16 @@ static inline void nfs_readpage_to_fscache(struct inode *inode,
__nfs_readpage_to_fscache(inode, page, sync);
 }
 
+/*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+   if (server->fscache && (server->options & NFS_OPTION_FSCACHE))
+   return "yes";
+   return "no ";
+}
+
 
 #else /* CONFIG_NFS_FSCACHE */
 static inline int nfs_fscache_register(void) { return 0; }
@@ -195,5 +205,10 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
 static inline void nfs_readpage_to_fscache(struct inode *inode,
   struct page *page, int sync) {}
 
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+   return "no ";
+}
+
 #endif /* CONFIG_NFS_FSCACHE */
 #endif /* _NFS_FSCACHE_H */



[PATCH 33/37] NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching

2008-02-08 Thread David Howells
nfs_readpage_async() needs to be non-static so that it can be used as a
fallback for the local on-disk caching should an EIO crop up when reading the
cache.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/read.c  |4 ++--
 include/linux/nfs_fs.h |2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 3d7d963..725a5a2 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -114,8 +114,8 @@ static void nfs_readpage_truncate_uninitialised_page(struct nfs_read_data *data)
}
 }
 
-static int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
-   struct page *page)
+int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
+  struct page *page)
 {
LIST_HEAD(one_request);
struct nfs_page *new;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d9adb53..d1d545e 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -505,6 +505,8 @@ extern int  nfs_readpages(struct file *, struct address_space *,
struct list_head *, unsigned);
 extern int  nfs_readpage_result(struct rpc_task *, struct nfs_read_data *);
 extern void nfs_readdata_release(void *data);
+extern int  nfs_readpage_async(struct nfs_open_context *, struct inode *,
+  struct page *);
 
 /*
  * Allocate nfs_read_data structures



[PATCH 31/37] NFS: FS-Cache page management

2008-02-08 Thread David Howells
FS-Cache page management for NFS.  This includes hooking the releasing and
invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for
completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2).
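
The release decision can be modelled in user space roughly as below.  This
is only a sketch under simplifying assumptions: demo_page and
demo_release_page() are made-up names, the two booleans stand in for
PG_fscache and PG_fscache_write, and "waiting" is reduced to clearing the
write flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_page {
	bool cached;        /* models PG_fscache */
	bool cache_writing; /* models PG_fscache_write */
};

/* Returns true if the page may be released, false if it is still busy. */
static bool demo_release_page(struct demo_page *page, bool may_wait)
{
	assert(page != NULL);

	if (page->cache_writing) {
		if (!may_wait)
			return false;	/* caller can't sleep: page is busy */
		/* Real code would sleep until the write to the cache is
		 * done; the sketch just marks the write as finished. */
		page->cache_writing = false;
	}

	if (page->cached)
		page->cached = false;	/* models uncaching the page */

	return true;
}
```

The may_wait argument plays the part of the __GFP_WAIT test: a caller that
may not sleep is simply told that the page cannot be released yet.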

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/file.c|   17 +
 fs/nfs/fscache.c |   49 +
 fs/nfs/fscache.h |   22 ++
 3 files changed, 84 insertions(+), 4 deletions(-)


diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 26a073b..60db3ea 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
 #include "delegation.h"
 #include "internal.h"
 #include "iostat.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY NFSDBG_FILE
 
@@ -358,7 +359,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
  * Partially or wholly invalidate a page
  * - Release the private state associated with a page if undergoing complete
  *   page invalidation
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
  * - Caller holds page lock
  */
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -367,30 +368,35 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
return;
/* Cancel any unstarted writes on this page */
nfs_wb_page_cancel(page->mapping->host, page);
+
+   nfs_fscache_invalidate_page(page, page->mapping->host);
 }
 
 /*
  * Attempt to release the private state associated with a page
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
  * - Caller holds page lock
  * - Return true (may release page) or false (may not)
  */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
/* If PagePrivate() is set, then the page is not freeable */
-   return 0;
+   if (PagePrivate(page))
+   return 0;
+   return nfs_fscache_release_page(page, gfp);
 }
 
 /*
  * Attempt to clear the private state associated with a page when an error
  * occurs that requires the cached contents of an inode to be written back or
  * destroyed
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
  * - Caller holds page lock
  * - Return 0 if successful, -error otherwise
  */
 static int nfs_launder_page(struct page *page)
 {
+   wait_on_page_fscache_write(page);
return nfs_wb_page(page->mapping->host, page);
 }
 
@@ -422,6 +428,9 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
int ret = -EINVAL;
struct address_space *mapping;
 
+   /* make sure the cache has finished storing the page */
+   wait_on_page_fscache_write(page);
+
lock_page(page);
mapping = page->mapping;
if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping)
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index c0e0320..d475ff5 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -19,6 +19,7 @@
 #include 
 
 #include "internal.h"
+#include "iostat.h"
 #include "fscache.h"
 
 #define NFSDBG_FACILITY NFSDBG_FSCACHE
@@ -297,3 +298,51 @@ void nfs_fscache_attr_changed(struct inode *inode)
 {
fscache_attr_changed(NFS_I(inode)->fscache);
 }
+
+/*
+ * Release the caching state associated with a page, if the page isn't busy
+ * interacting with the cache.
+ * - Returns true (can release page) or false (page busy).
+ */
+int nfs_fscache_release_page(struct page *page, gfp_t gfp)
+{
+   if (PageFsCacheWrite(page)) {
+   if (!(gfp & __GFP_WAIT))
+   return 0;
+   wait_on_page_fscache_write(page);
+   }
+
+   if (PageFsCache(page)) {
+   struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+   BUG_ON(!nfsi->fscache);
+
+   dfprintk(FSCACHE, "NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
+nfsi->fscache, page, nfsi);
+
+   fscache_uncache_page(nfsi->fscache, page);
+   nfs_add_stats(page->mapping->host, NFSIOS_FSCACHE_UNCACHE, 1);
+   }
+
+   return 1;
+}
+
+/*
+ * Release the caching state associated with a page if undergoing complete page
+ * invalidation.
+ */
+void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
+{
+   struct nfs_inode *nfsi = NFS_I(inode);
+
+   BUG_ON(!nfsi->fscache);
+
+   dfprintk(FSCACHE, "NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
+nfsi->fscache, page, nfsi);
+
+   wait_on_page_fscache_write(page);
+
+   BUG_ON(!PageLocked(page));
+   fscache_uncache_page(nfsi->fscache, page);
+   nfs_add_stats(page->mapping->host, NFSIOS_FSCACHE_UNCACHE, 1);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.

[PATCH 34/37] NFS: Read pages from FS-Cache into an NFS inode

2008-02-08 Thread David Howells
Read pages from an FS-Cache data storage object representing an inode into an
NFS inode.
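
The read helpers in this patch fold the cache's result into three outcomes
for the caller.  A minimal user-space sketch of that mapping, assuming the
cache result is either 0 or a negative errno (demo_map_cache_read_result()
is a made-up name):

```c
#include <assert.h>
#include <errno.h>

/*
 * Map a cache-layer result onto what the caller sees:
 *   0       - the read was submitted to the cache
 *   1       - not cached, so the caller should read from the server
 *   -errno  - any other failure, passed through unchanged
 */
static int demo_map_cache_read_result(int ret)
{
	assert(ret <= 0);	/* 0 or a negative errno is expected */

	switch (ret) {
	case 0:
		return 0;
	case -ENOBUFS:	/* object not in the cache */
	case -ENODATA:	/* page not in the cache */
		return 1;
	default:
		return ret;
	}
}
```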

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache.c |  112 ++
 fs/nfs/fscache.h |   47 +++
 fs/nfs/read.c|   18 +
 3 files changed, 176 insertions(+), 1 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index d475ff5..438cc9b 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -344,5 +344,115 @@ void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
 
BUG_ON(!PageLocked(page));
fscache_uncache_page(nfsi->fscache, page);
-   nfs_add_stats(page->mapping->host, NFSIOS_FSCACHE_UNCACHE, 1);
+   nfs_add_stats(inode, NFSIOS_FSCACHE_UNCACHE, 1);
+}
+
+/*
+ * Handle completion of a page being read from the cache.
+ * - Called in process (keventd) context.
+ */
+static void nfs_readpage_from_fscache_complete(struct page *page,
+  void *context,
+  int error)
+{
+   dfprintk(FSCACHE,
+"NFS: readpage_from_fscache_complete (0x%p/0x%p/%d)\n",
+page, context, error);
+
+   /* if the cache read fails, retry from the server; if that can't be
+* started, just unlock the page and let the VM reissue the readpage */
+   if (!error) {
+   SetPageUptodate(page);
+   unlock_page(page);
+   } else {
+   error = nfs_readpage_async(context, page->mapping->host, page);
+   if (error)
+   unlock_page(page);
+   }
+}
+
+/*
+ * Retrieve a page from fscache
+ */
+int __nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+   struct inode *inode, struct page *page)
+{
+   int ret;
+
+   dfprintk(FSCACHE,
+"NFS: readpage_from_fscache(fsc:%p/p:%p(i:%lx f:%lx)/0x%p)\n",
+NFS_I(inode)->fscache, page, page->index, page->flags, inode);
+
+   ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+page,
+nfs_readpage_from_fscache_complete,
+ctx,
+GFP_KERNEL);
+
+   switch (ret) {
+   case 0: /* read BIO submitted (page in fscache) */
+   dfprintk(FSCACHE,
+"NFS:readpage_from_fscache: BIO submitted\n");
+   nfs_add_stats(inode, NFSIOS_FSCACHE_READ_OK, 1);
+   return ret;
+
+   case -ENOBUFS: /* inode not in cache */
+   case -ENODATA: /* page not in cache */
+   nfs_add_stats(inode, NFSIOS_FSCACHE_READ_FAIL, 1);
+   dfprintk(FSCACHE,
+"NFS:readpage_from_fscache %d\n", ret);
+   return 1;
+
+   default:
+   dfprintk(FSCACHE, "NFS:readpage_from_fscache %d\n", ret);
+   nfs_add_stats(inode, NFSIOS_FSCACHE_READ_FAIL, 1);
+   }
+   return ret;
+}
+
+/*
+ * Retrieve a set of pages from fscache
+ */
+int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+struct inode *inode,
+struct address_space *mapping,
+struct list_head *pages,
+unsigned *nr_pages)
+{
+   int ret, npages = *nr_pages;
+
+   dfprintk(FSCACHE, "NFS: nfs_getpages_from_fscache (0x%p/%u/0x%p)\n",
+NFS_I(inode)->fscache, npages, inode);
+
+   ret = fscache_read_or_alloc_pages(NFS_I(inode)->fscache,
+ mapping, pages, nr_pages,
+ nfs_readpage_from_fscache_complete,
+ ctx,
+ mapping_gfp_mask(mapping));
+   if (*nr_pages < npages)
+   nfs_add_stats(inode, NFSIOS_FSCACHE_READ_OK, npages);
+   if (*nr_pages > 0)
+   nfs_add_stats(inode, NFSIOS_FSCACHE_READ_FAIL, *nr_pages);
+
+   switch (ret) {
+   case 0: /* read submitted to the cache for all pages */
+   BUG_ON(!list_empty(pages));
+   BUG_ON(*nr_pages != 0);
+   dfprintk(FSCACHE,
+"NFS: nfs_getpages_from_fscache: submitted\n");
+
+   return ret;
+
+   case -ENOBUFS: /* some pages aren't cached and can't be */
+   case -ENODATA: /* some pages aren't cached */
+   dfprintk(FSCACHE,
+"NFS: nfs_getpages_from_fscache: no page: %d\n", ret);
+   return 1;
+
+   default:
+   dfprintk(FSCACHE,
+"NFS: nfs_getpages_from_fscache: ret  %d\n", ret);
+   }
+
+   return ret;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 1cb7d96..4c1e1a8 100644
--- a/fs/nfs/fscache.h
+++ b/

[PATCH 30/37] NFS: Add some new I/O event counters for FS-Cache events

2008-02-08 Thread David Howells
Add some new NFS I/O event counters for FS-Cache events.  They have to be
added as byte counters because I may need to be able to increase the numbers
by more than 1 at a time.
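
To illustrate why an "add n" style counter is wanted here, a small
user-space model follows.  The enum, struct and demo_add_stats() are
hypothetical stand-ins, not the kernel's iostat interface.

```c
#include <assert.h>
#include <stddef.h>

enum demo_counter {
	DEMO_FSCACHE_READ_OK,
	DEMO_FSCACHE_READ_FAIL,
	DEMO_FSCACHE_WRITE_OK,
	DEMO_FSCACHE_WRITE_FAIL,
	DEMO_FSCACHE_UNCACHE,
	DEMO_COUNTER_MAX,
};

struct demo_iostats {
	unsigned long long counters[DEMO_COUNTER_MAX];
};

/* Bump a counter by an arbitrary amount, e.g. a whole batch of pages. */
static void demo_add_stats(struct demo_iostats *stats,
			   enum demo_counter which, unsigned long long n)
{
	assert(stats != NULL);
	assert(which < DEMO_COUNTER_MAX);
	stats->counters[which] += n;
}
```

A multi-page read can then be accounted with one call, for example
demo_add_stats(&stats, DEMO_FSCACHE_READ_OK, 16) for sixteen pages.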

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/iostat.h |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
index 6350ecb..0e3b170 100644
--- a/fs/nfs/iostat.h
+++ b/fs/nfs/iostat.h
@@ -60,6 +60,13 @@ enum nfs_stat_bytecounters {
NFSIOS_SERVERWRITTENBYTES,
NFSIOS_READPAGES,
NFSIOS_WRITEPAGES,
+#ifdef CONFIG_NFS_FSCACHE
+   NFSIOS_FSCACHE_READ_OK,
+   NFSIOS_FSCACHE_READ_FAIL,
+   NFSIOS_FSCACHE_WRITE_OK,
+   NFSIOS_FSCACHE_WRITE_FAIL,
+   NFSIOS_FSCACHE_UNCACHE,
+#endif
__NFSIOS_BYTESMAX,
 };
 



[PATCH 35/37] NFS: Store pages from an NFS inode into a local cache

2008-02-08 Thread David Howells
Store pages from an NFS inode into the cache data storage object associated
with that inode.
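
The bookkeeping around a store can be sketched in user space as follows.
This is a simplified model: demo_page, demo_store_stats and
demo_store_result() are made-up names, and the result of the write to the
cache is passed in as ret rather than produced by a real cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_page {
	bool cached;	/* models PG_fscache */
};

struct demo_store_stats {
	unsigned long write_ok;
	unsigned long write_fail;
	unsigned long uncache;
};

/* Account for the outcome of storing a page in the cache. */
static void demo_store_result(struct demo_page *page, int ret,
			      struct demo_store_stats *stats)
{
	assert(page != NULL && stats != NULL);

	if (!page->cached)
		return;		/* page was never marked for caching */

	if (ret != 0) {
		/* the store failed: stop tracking the page in the cache */
		page->cached = false;
		stats->write_fail++;
		stats->uncache++;
	} else {
		stats->write_ok++;
	}
}
```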

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache.c |   26 ++
 fs/nfs/fscache.h |   16 
 fs/nfs/read.c|5 +
 3 files changed, 47 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 438cc9b..50ae70f 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -456,3 +456,29 @@ int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,
 
return ret;
 }
+
+/*
+ * Store a newly fetched page in fscache
+ * - PG_fscache must be set on the page
+ */
+void __nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+   int ret;
+
+   dfprintk(FSCACHE,
+"NFS: readpage_to_fscache(fsc:%p/p:%p(i:%lx f:%lx)/%d)\n",
+NFS_I(inode)->fscache, page, page->index, page->flags, sync);
+
+   ret = fscache_write_page(NFS_I(inode)->fscache, page, GFP_KERNEL);
+   dfprintk(FSCACHE,
+"NFS: readpage_to_fscache: p:%p(i:%lu f:%lx) ret %d\n",
+page, page->index, page->flags, ret);
+
+   if (ret != 0) {
+   fscache_uncache_page(NFS_I(inode)->fscache, page);
+   nfs_add_stats(inode, NFSIOS_FSCACHE_WRITE_FAIL, 1);
+   nfs_add_stats(inode, NFSIOS_FSCACHE_UNCACHE, 1);
+   } else {
+   nfs_add_stats(inode, NFSIOS_FSCACHE_WRITE_OK, 1);
+   }
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 4c1e1a8..6264cd8 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -94,6 +94,7 @@ extern int __nfs_readpage_from_fscache(struct nfs_open_context *,
 extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
struct inode *, struct address_space *,
struct list_head *, unsigned *);
+extern void __nfs_readpage_to_fscache(struct inode *, struct page *, int);
 
 /*
  * release the caching state associated with a page if undergoing complete page
@@ -133,6 +134,19 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
return -ENOBUFS;
 }
 
+/*
+ * Store a page newly fetched from the server in an inode data storage object
+ * in the cache.
+ */
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+  struct page *page,
+  int sync)
+{
+   if (PageFsCache(page))
+   __nfs_readpage_to_fscache(inode, page, sync);
+}
+
+
 #else /* CONFIG_NFS_FSCACHE */
 static inline int nfs_fscache_register(void) { return 0; }
 static inline void nfs_fscache_unregister(void) {}
@@ -178,6 +192,8 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
 {
return -ENOBUFS;
 }
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+  struct page *page, int sync) {}
 
 #endif /* CONFIG_NFS_FSCACHE */
 #endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index db27b26..e09bdf9 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -143,6 +143,11 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
 
 static void nfs_readpage_release(struct nfs_page *req)
 {
+   struct inode *d_inode = req->wb_context->path.dentry->d_inode;
+
+   if (PageUptodate(req->wb_page))
+   nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+
unlock_page(req->wb_page);
 
dprintk("NFS: read done (%s/%Ld %d@%Ld)\n",



[PATCH 28/37] NFS: Use local disk inode cache

2008-02-08 Thread David Howells
Bind data storage objects in the local cache to NFS inodes.
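
The policy in nfs_fscache_set_inode_cookie() (for now, only files opened
read-only get to use the cache) can be modelled in user space like this.
It is a sketch only: demo_inode and demo_set_inode_cookie() are made-up
names and the two booleans stand in for the NFS_INO_FSCACHE flag and the
inode's cookie pointer.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_inode {
	bool cacheable;	 /* models the NFS_INO_FSCACHE flag */
	bool has_cookie; /* models a non-NULL per-inode cookie */
};

/* Decide whether to enable or disable caching when a file is opened. */
static void demo_set_inode_cookie(struct demo_inode *inode, int f_flags)
{
	assert(inode != NULL);

	if (!inode->cacheable)
		return;		/* caching already ruled out */

	if ((f_flags & O_ACCMODE) != O_RDONLY) {
		/* opened for writing: turn the cache off for this inode */
		inode->cacheable = false;
		inode->has_cookie = false;
	} else {
		inode->has_cookie = true;
	}
}
```

Note that once a writer has been seen the flag stays clear, so a later
read-only open does not turn caching back on in this model.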

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache.c   |  131 
 fs/nfs/fscache.h   |   19 +++
 fs/nfs/inode.c |   39 --
 include/linux/nfs_fs.h |   10 
 4 files changed, 193 insertions(+), 6 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index cbd09f0..c0e0320 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -166,3 +166,134 @@ void nfs_fscache_release_super_cookie(struct super_block *sb)
nfss->fscache_key = NULL;
}
 }
+
+/*
+ * Initialise the per-inode cache cookie pointer for an NFS inode.
+ */
+void nfs_fscache_init_inode_cookie(struct inode *inode)
+{
+   NFS_I(inode)->fscache = NULL;
+   if (S_ISREG(inode->i_mode))
+   set_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+}
+
+/*
+ * Get the per-inode cache cookie for an NFS inode.
+ */
+void nfs_fscache_enable_inode_cookie(struct inode *inode)
+{
+   struct super_block *sb = inode->i_sb;
+   struct nfs_inode *nfsi = NFS_I(inode);
+
+   if (nfsi->fscache || !NFS_FSCACHE(inode))
+   return;
+
+   if ((NFS_SB(sb)->options & NFS_OPTION_FSCACHE)) {
+   nfsi->fscache = fscache_acquire_cookie(
+   NFS_SB(sb)->fscache,
+   &nfs_cache_inode_object_def,
+   nfsi);
+
+   dfprintk(FSCACHE, "NFS: get FH cookie (0x%p/0x%p/0x%p)\n",
+sb, nfsi, nfsi->fscache);
+   }
+}
+
+/*
+ * Release a per-inode cookie.
+ */
+void nfs_fscache_release_inode_cookie(struct inode *inode)
+{
+   struct nfs_inode *nfsi = NFS_I(inode);
+
+   dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+nfsi, nfsi->fscache);
+
+   fscache_relinquish_cookie(nfsi->fscache, 0);
+   nfsi->fscache = NULL;
+}
+
+/*
+ * Retire a per-inode cookie, destroying the data attached to it.
+ */
+void nfs_fscache_zap_inode_cookie(struct inode *inode)
+{
+   struct nfs_inode *nfsi = NFS_I(inode);
+
+   dfprintk(FSCACHE, "NFS: zapping cookie (0x%p/0x%p)\n",
+nfsi, nfsi->fscache);
+
+   fscache_relinquish_cookie(nfsi->fscache, 1);
+   nfsi->fscache = NULL;
+}
+
+/*
+ * Turn off the cache with regard to a per-inode cookie if opened for writing,
+ * invalidating all the pages in the page cache relating to the associated
+ * inode to clear the per-page caching.
+ */
+void nfs_fscache_disable_inode_cookie(struct inode *inode)
+{
+   clear_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+
+   if (NFS_I(inode)->fscache) {
+   dfprintk(FSCACHE,
+"NFS: nfsi 0x%p turning cache off\n", NFS_I(inode));
+
+   /* Need to invalidate any mapped pages that were read in before
+* turning off the cache.
+*/
+   if (inode->i_mapping && inode->i_mapping->nrpages)
+   invalidate_inode_pages2(inode->i_mapping);
+
+   nfs_fscache_zap_inode_cookie(inode);
+   }
+}
+
+/*
+ * Decide if we should enable or disable local caching for this inode.
+ * - For now, with NFS, only regular files that are open read-only will be able
+ *   to use the cache.
+ */
+void nfs_fscache_set_inode_cookie(struct inode *inode, struct file *filp)
+{
+   if (NFS_FSCACHE(inode)) {
+   if ((filp->f_flags & O_ACCMODE) != O_RDONLY)
+   nfs_fscache_disable_inode_cookie(inode);
+   else
+   nfs_fscache_enable_inode_cookie(inode);
+   }
+}
+
+/*
+ * Replace a per-inode cookie due to revalidation detecting a file having
+ * changed on the server.
+ */
+void nfs_fscache_renew_inode_cookie(struct inode *inode)
+{
+   struct nfs_inode *nfsi = NFS_I(inode);
+   struct nfs_server *nfss = NFS_SERVER(inode);
+   struct fscache_cookie *old = nfsi->fscache;
+
+   if (nfsi->fscache) {
+   /* retire the current fscache cache and get a new one */
+   fscache_relinquish_cookie(nfsi->fscache, 1);
+
+   nfsi->fscache = fscache_acquire_cookie(
+   nfss->nfs_client->fscache,
+   &nfs_cache_inode_object_def,
+   nfsi);
+
+   dfprintk(FSCACHE,
+"NFS: revalidation new cookie (0x%p/0x%p/0x%p/0x%p)\n",
+nfss, nfsi, old, nfsi->fscache);
+   }
+}
+
+/*
+ * Update the filesize associated with a per-inode cookie.
+ */
+void nfs_fscache_attr_changed(struct inode *inode)
+{
+   fscache_attr_changed(NFS_I(inode)->fscache);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 7dcdf32..d730ec8 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -77,6 +77,15 @@ extern void nfs_fscache_get_super_cookie(struct super_block *,
 struct nfs_parsed_mount_data *)

[PATCH 32/37] NFS: Add read context retention for FS-Cache to call back with

2008-02-08 Thread David Howells
Add read context retention so that FS-Cache can call back into NFS when a read
operation on the cache fails with EIO rather than returning data.  This permits
NFS to fetch the data from the server instead, using the appropriate security
context.
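
The get/put pairing amounts to ordinary reference counting on the context
for as long as the cache holds on to it.  A user-space sketch, with
demo_open_context, demo_get_context() and demo_put_context() as made-up
names rather than the NFS functions:

```c
#include <assert.h>
#include <stddef.h>

struct demo_open_context {
	int refcount;
};

/* Take an extra reference while the cache holds on to the context. */
static void demo_get_context(struct demo_open_context *ctx)
{
	assert(ctx != NULL && ctx->refcount > 0);
	ctx->refcount++;
}

/* Drop a reference; returns 1 if that was the last one.  NULL is allowed. */
static int demo_put_context(struct demo_open_context *ctx)
{
	if (ctx == NULL)
		return 0;
	assert(ctx->refcount > 0);
	return --ctx->refcount == 0;
}
```

As in the patch, the put side tolerates a NULL context, since a completion
function need not have been given one.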

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache-index.c |   26 ++
 1 files changed, 26 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index eec8e7e..af9f06b 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -285,6 +285,30 @@ static void nfs_cache_inode_now_uncached(void *cookie_netfs_data)
 }
 
 /*
+ * Get an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ *   context.
+ * - The read context is passed back to NFS in the event that a data read on the
+ *   cache fails with EIO - in which case the server must be contacted to
+ *   retrieve the data, which requires the read context for security.
+ */
+static void nfs_fh_get_context(void *cookie_netfs_data, void *context)
+{
+   get_nfs_open_context(context);
+}
+
+/*
+ * Release an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ *   context.
+ */
+static void nfs_fh_put_context(void *cookie_netfs_data, void *context)
+{
+   if (context)
+   put_nfs_open_context(context);
+}
+
+/*
  * Define the inode object for FS-Cache.  This is used to describe an inode
  * object to fscache_acquire_cookie().  It is keyed by the NFS file handle for
  * an inode.
@@ -301,4 +325,6 @@ const struct fscache_cookie_def nfs_cache_inode_object_def = {
.get_aux= nfs_cache_inode_get_aux,
.check_aux  = nfs_cache_inode_check_aux,
.now_uncached   = nfs_cache_inode_now_uncached,
+   .get_context= nfs_fh_get_context,
+   .put_context= nfs_fh_put_context,
 };



[PATCH 23/37] NFS: Permit local filesystem caching to be enabled for NFS

2008-02-08 Thread David Howells
Permit local filesystem caching to be enabled for NFS in the kernel
configuration.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/Kconfig |8 
 1 files changed, 8 insertions(+), 0 deletions(-)


diff --git a/fs/Kconfig b/fs/Kconfig
index c42ec50..fa8e978 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1644,6 +1644,14 @@ config NFS_V4
 
  If unsure, say N.
 
+config NFS_FSCACHE
+   bool "Provide NFS client caching support (EXPERIMENTAL)"
+   depends on EXPERIMENTAL
+   depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+   help
+ Say Y here if you want NFS data to be cached locally on disc through
+ the general filesystem cache manager
+
 config NFS_DIRECTIO
bool "Allow direct I/O on NFS files"
depends on NFS_FS



[PATCH 29/37] NFS: Invalidate FsCache page flags when cache removed

2008-02-08 Thread David Howells
Invalidate the FsCache page flags on the pages belonging to an inode when the
cache backing that NFS inode is removed.

This allows a live cache to be withdrawn.
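
The unmarking loop walks the inode's pages a batch at a time.  A user-space
model of that pattern is sketched below; demo_page, demo_now_uncached() and
the DEMO_BATCH value are assumptions for illustration, with a plain array
standing in for the page cache lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BATCH 14	/* assumed batch size, standing in for PAGEVEC_SIZE */

struct demo_page {
	bool cached;	/* models PG_fscache */
};

/*
 * Clear the cached marker on every page, one batch at a time.
 * Returns the number of batches processed.
 */
static size_t demo_now_uncached(struct demo_page *pages, size_t nr_pages)
{
	size_t first = 0, batches = 0;

	assert(pages != NULL || nr_pages == 0);

	while (first < nr_pages) {
		size_t n = nr_pages - first;
		size_t i;

		if (n > DEMO_BATCH)
			n = DEMO_BATCH;

		for (i = 0; i < n; i++)
			pages[first + i].cached = false;

		first += n;	/* resume after the last page handled */
		batches++;	/* the kernel loop reschedules at this point */
	}

	return batches;
}
```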

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache-index.c |   40 
 1 files changed, 40 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index c3c63fa..eec8e7e 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -246,6 +246,45 @@ static enum fscache_checkaux nfs_cache_inode_check_aux(void *cookie_netfs_data,
 }
 
 /*
+ * Indication from FS-Cache that the cookie is no longer cached
+ * - This function is called when the backing store currently caching a cookie
+ *   is removed
+ * - The netfs should use this to clean up any markers indicating cached pages
+ * - This is mandatory for any object that may have data
+ */
+static void nfs_cache_inode_now_uncached(void *cookie_netfs_data)
+{
+   struct nfs_inode *nfsi = cookie_netfs_data;
+   struct pagevec pvec;
+   pgoff_t first;
+   int loop, nr_pages;
+
+   pagevec_init(&pvec, 0);
+   first = 0;
+
+   dprintk("NFS: nfs_inode_now_uncached: nfs_inode 0x%p\n", nfsi);
+
+   for (;;) {
+   /* grab a bunch of pages to unmark */
+   nr_pages = pagevec_lookup(&pvec,
+ nfsi->vfs_inode.i_mapping,
+ first,
+ PAGEVEC_SIZE - pagevec_count(&pvec));
+   if (!nr_pages)
+   break;
+
+   for (loop = 0; loop < nr_pages; loop++)
+   ClearPageFsCache(pvec.pages[loop]);
+
+   first = pvec.pages[nr_pages - 1]->index + 1;
+
+   pvec.nr = nr_pages;
+   pagevec_release(&pvec);
+   cond_resched();
+   }
+}
+
+/*
  * Define the inode object for FS-Cache.  This is used to describe an inode
  * object to fscache_acquire_cookie().  It is keyed by the NFS file handle for
  * an inode.
@@ -261,4 +300,5 @@ const struct fscache_cookie_def nfs_cache_inode_object_def = {
.get_attr   = nfs_cache_inode_get_attr,
.get_aux= nfs_cache_inode_get_aux,
.check_aux  = nfs_cache_inode_check_aux,
+   .now_uncached   = nfs_cache_inode_now_uncached,
 };



[PATCH 26/37] NFS: Define and create superblock-level objects

2008-02-08 Thread David Howells
Define and create superblock-level cache index objects (as managed by
nfs_server structs).

Each superblock object is created in a server level index object and is itself
an index into which inode-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The superblock object key is a sequence consisting of:

 (1) Certain superblock s_flags.

 (2) Various connection parameters that serve to distinguish superblocks for
 sget().

 (3) The volume FSID.

 (4) The security flavour.

 (5) The uniquifier length.

 (6) The uniquifier text.  This is normally an empty string, unless the fsc=xyz
 mount option was used to explicitly specify a uniquifier.

The key blob is of variable length, depending on the length of (6).
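
A variable-length key of this shape (fixed part followed by the uniquifier
text) might be packed as in the user-space sketch below.  The field list is
deliberately cut down and demo_key_fixed and demo_super_get_key() are
made-up names; the actual layout is the one defined by the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down fixed part of the key; the real key carries more parameters. */
struct demo_key_fixed {
	uint32_t s_flags;
	uint32_t fsid_major;
	uint32_t fsid_minor;
	uint32_t au_flavor;
	uint16_t uniq_len;
};

/*
 * Copy the fixed part followed by uniq_len bytes of uniquifier text into
 * buffer.  Returns the key length, or 0 if it does not fit in bufmax.
 */
static uint16_t demo_super_get_key(const struct demo_key_fixed *fixed,
				   const char *uniq,
				   void *buffer, uint16_t bufmax)
{
	size_t len;

	assert(fixed != NULL && uniq != NULL && buffer != NULL);

	len = sizeof(*fixed) + fixed->uniq_len;
	if (len > bufmax)
		return 0;	/* key too big for the buffer */

	memcpy(buffer, fixed, sizeof(*fixed));
	memcpy((char *)buffer + sizeof(*fixed), uniq, fixed->uniq_len);
	return (uint16_t)len;
}
```

With an empty uniquifier the key is just the fixed part, so two mounts that
agree on every parameter produce identical keys unless fsc=xyz sets them
apart.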

The superblock object is given no coherency data to carry in the auxiliary data
permitted by the cache.  It is assumed that the superblock is always coherent.


This patch also adds uniquification handling such that two otherwise identical
superblocks, at least one of which is marked "nosharecache", won't end up
trying to share the on-disk cache.  It will be possible to manually provide a
uniquifier through a mount option with a later patch to avoid the error
otherwise produced.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache-index.c|   34 +
 fs/nfs/fscache.c  |  116 +
 fs/nfs/fscache.h  |   49 +++
 fs/nfs/internal.h |3 +
 fs/nfs/super.c|8 ++-
 include/linux/nfs_fs_sb.h |5 ++
 6 files changed, 213 insertions(+), 2 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 25ac4a1..b5a52e3 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -116,3 +116,37 @@ const struct fscache_cookie_def nfs_cache_server_index_def = {
.type   = FSCACHE_COOKIE_TYPE_INDEX,
.get_key= nfs_server_get_key,
 };
+
+/*
+ * Generate a key to describe a superblock key in the main NFS index
+ */
+static uint16_t nfs_super_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+   const struct nfs_fscache_key *key;
+   const struct nfs_server *nfss = cookie_netfs_data;
+   uint16_t len;
+
+   key = nfss->fscache_key;
+   len = sizeof(key->key) + key->key.uniq_len;
+   if (len > bufmax) {
+   len = 0;
+   } else {
+   memcpy(buffer, &key->key, sizeof(key->key));
+   memcpy(buffer + sizeof(key->key),
+  key->key.uniquifier, key->key.uniq_len);
+   }
+
+   return len;
+}
+
+/*
+ * Define the superblock object for FS-Cache.  This is used to describe a
+ * superblock object to fscache_acquire_cookie().  It is keyed by all the NFS
+ * parameters that might cause a separate superblock.
+ */
+const struct fscache_cookie_def nfs_cache_super_index_def = {
+   .name   = "NFS.super",
+   .type   = FSCACHE_COOKIE_TYPE_INDEX,
+   .get_key= nfs_super_get_key,
+};
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index dcc1800..cbd09f0 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -23,6 +23,9 @@
 
 #define NFSDBG_FACILITY NFSDBG_FSCACHE
 
+static struct rb_root nfs_fscache_keys = RB_ROOT;
+static DEFINE_SPINLOCK(nfs_fscache_keys_lock);
+
 /*
  * Get the per-client index cookie for an NFS client if the appropriate mount
  * flag was set
@@ -50,3 +53,116 @@ void nfs_fscache_release_client_cookie(struct nfs_client *clp)
fscache_relinquish_cookie(clp->fscache, 0);
clp->fscache = NULL;
 }
+
+/*
+ * Get the cache cookie for an NFS superblock.  We have to handle
+ * uniquification here because the cache doesn't do it for us.
+ */
+void nfs_fscache_get_super_cookie(struct super_block *sb,
+ struct nfs_parsed_mount_data *data)
+{
+   struct nfs_fscache_key *key, *xkey;
+   struct nfs_server *nfss = NFS_SB(sb);
+   struct rb_node **p, *parent;
+   const char *uniq = data->fscache_uniq ?: "";
+   int diff, ulen;
+
+   ulen = strlen(uniq);
+   key = kzalloc(sizeof(*key) + ulen, GFP_KERNEL);
+   if (!key)
+   return;
+
+   key->nfs_client = nfss->nfs_client;
+   key->key.super.s_flags = sb->s_flags & NFS_MS_MASK;
+   key->key.nfs_server.flags = nfss->flags;
+   key->key.nfs_server.rsize = nfss->rsize;
+   key->key.nfs_server.wsize = nfss->wsize;
+   key->key.nfs_server.acregmin = nfss->acregmin;
+   key->key.nfs_server.acregmax = nfss->acregmax;
+   key->key.nfs_server.acdirmin = nfss->acdirmin;
+   key->key.nfs_server.acdirmax = nfss->acdirmax;
+   key->key.nfs_server.fsid = nfss->fsid;
+   key->key.rpc_auth.au_flavor = nfss->client->cl_auth->au_fla

[PATCH 21/37] NFS: Add comment banners to some NFS functions

2008-02-08 Thread David Howells
Add comment banners to some NFS functions so that the NFS fscache patches can
later modify them to add further information.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/file.c |   26 ++
 1 files changed, 26 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index ef57a5a..26a073b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -354,6 +354,13 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
return copied;
 }
 
+/*
+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ *   page invalidation
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ */
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
if (offset != 0)
@@ -362,12 +369,26 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
nfs_wb_page_cancel(page->mapping->host, page);
 }
 
+/*
+ * Attempt to release the private state associated with a page
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return true (may release page) or false (may not)
+ */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
/* If PagePrivate() is set, then the page is not freeable */
return 0;
 }
 
+/*
+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
 static int nfs_launder_page(struct page *page)
 {
return nfs_wb_page(page->mapping->host, page);
@@ -389,6 +410,11 @@ const struct address_space_operations nfs_file_aops = {
.launder_page = nfs_launder_page,
 };
 
+/*
+ * Notification that a PTE pointing to an NFS page is about to be made
+ * writable, implying that someone is about to modify the page through a
+ * shared-writable mapping
+ */
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
 {
struct file *filp = vma->vm_file;



[PATCH 27/37] NFS: Define and create inode-level cache objects

2008-02-08 Thread David Howells
Define and create inode-level cache data storage objects (as managed by
nfs_inode structs).

Each inode-level object is created in a superblock-level index object and is
itself a data storage object into which pages from the inode are stored.

The inode object key is the NFS file handle for the inode.

The inode object is given coherency data to carry in the auxiliary data
permitted by the cache.  This is a sequence made up of:

 (1) i_mtime from the NFS inode.

 (2) i_ctime from the NFS inode.

 (3) i_size from the NFS inode.

As the cache is a persistent cache, the auxiliary data is checked when a new
NFS in-memory inode is set up that matches an already existing data storage
object in the cache.  If the coherency data is the same, the on-disk object is
retained and used; if not, it is scrapped and a new one created.
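
The coherency check described above can be sketched in plain user-space C.
This is only an illustration under stated assumptions: the sketch_* names,
the flattened attribute struct and the result enum are hypothetical stand-ins
for the kernel types, not the code in this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the attributes taken from the in-memory inode. */
struct sketch_inode_attrs {
	int64_t mtime_sec;
	int64_t ctime_sec;
	int64_t size;
};

/* Result codes modelled on FSCACHE_CHECKAUX_OKAY / FSCACHE_CHECKAUX_OBSOLETE. */
enum sketch_checkaux {
	SKETCH_CHECKAUX_OKAY,
	SKETCH_CHECKAUX_OBSOLETE,
};

/* Store the coherency data in the buffer; return the number of bytes stored. */
static size_t sketch_get_aux(const struct sketch_inode_attrs *attrs,
			     void *buffer, size_t bufmax)
{
	struct sketch_inode_attrs aux;

	assert(attrs != NULL && buffer != NULL);

	/* Zero first so that any padding bytes compare equal later. */
	memset(&aux, 0, sizeof(aux));
	aux.mtime_sec = attrs->mtime_sec;
	aux.ctime_sec = attrs->ctime_sec;
	aux.size = attrs->size;

	if (bufmax > sizeof(aux))
		bufmax = sizeof(aux);
	memcpy(buffer, &aux, bufmax);
	return bufmax;
}

/* Compare stored coherency data against the current inode attributes. */
static enum sketch_checkaux
sketch_check_aux(const struct sketch_inode_attrs *attrs,
		 const void *data, size_t datalen)
{
	struct sketch_inode_attrs aux;

	if (datalen > sizeof(aux))
		return SKETCH_CHECKAUX_OBSOLETE;

	memset(&aux, 0, sizeof(aux));
	aux.mtime_sec = attrs->mtime_sec;
	aux.ctime_sec = attrs->ctime_sec;
	aux.size = attrs->size;

	return memcmp(data, &aux, datalen) == 0 ?
		SKETCH_CHECKAUX_OKAY : SKETCH_CHECKAUX_OBSOLETE;
}
```

A changed size or timestamp makes the stored blob differ from the freshly
built one, so the cached object would be treated as obsolete.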

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/fscache-index.c |  112 
 fs/nfs/fscache.h   |1 
 2 files changed, 113 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index b5a52e3..c3c63fa 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -150,3 +150,115 @@ const struct fscache_cookie_def nfs_cache_super_index_def = {
.type   = FSCACHE_COOKIE_TYPE_INDEX,
.get_key= nfs_super_get_key,
 };
+
+/*
+ * Definition of the auxiliary data attached to NFS inode storage objects
+ * within the cache.
+ *
+ * The contents of this struct are recorded in the on-disk local cache in the
+ * auxiliary data attached to the data storage object backing an inode.  This
+ * permits coherency to be managed when a new inode binds to an already extant
+ * cache object.
+ */
+struct nfs_cache_inode_auxdata {
+   struct timespec mtime;
+   struct timespec ctime;
+   loff_t  size;
+};
+
+/*
+ * Generate a key to describe an NFS inode in an NFS server's index
+ */
+static uint16_t nfs_cache_inode_get_key(const void *cookie_netfs_data,
+   void *buffer, uint16_t bufmax)
+{
+   const struct nfs_inode *nfsi = cookie_netfs_data;
+   uint16_t nsize;
+
+   /* use the inode's NFS filehandle as the key */
+   nsize = nfsi->fh.size;
+   memcpy(buffer, nfsi->fh.data, nsize);
+   return nsize;
+}
+
+/*
+ * Get certain file attributes from the netfs data
+ * - This function can be absent for an index
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is presented
+ */
+static void nfs_cache_inode_get_attr(const void *cookie_netfs_data, uint64_t *size)
+{
+   const struct nfs_inode *nfsi = cookie_netfs_data;
+
+   *size = nfsi->vfs_inode.i_size;
+}
+
+/*
+ * Get the auxiliary data from netfs data
+ * - This function can be absent if the index carries no state data
+ * - Should store the auxiliary data in the buffer
+ * - Should return the amount of data stored
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is presented
+ */
+static uint16_t nfs_cache_inode_get_aux(const void *cookie_netfs_data,
+   void *buffer, uint16_t bufmax)
+{
+   struct nfs_cache_inode_auxdata auxdata;
+   const struct nfs_inode *nfsi = cookie_netfs_data;
+
+   auxdata.size = nfsi->vfs_inode.i_size;
+   auxdata.mtime = nfsi->vfs_inode.i_mtime;
+   auxdata.ctime = nfsi->vfs_inode.i_ctime;
+
+   if (bufmax > sizeof(auxdata))
+   bufmax = sizeof(auxdata);
+
+   memcpy(buffer, &auxdata, bufmax);
+   return bufmax;
+}
+
+/*
+ * Consult the netfs about the state of an object
+ * - This function can be absent if the index carries no state data
+ * - The netfs data from the cookie being used as the target is
+ *   presented, as is the auxiliary data
+ */
+static enum fscache_checkaux nfs_cache_inode_check_aux(void *cookie_netfs_data,
+  const void *data,
+  uint16_t datalen)
+{
+   struct nfs_cache_inode_auxdata auxdata;
+   struct nfs_inode *nfsi = cookie_netfs_data;
+
+   if (datalen > sizeof(auxdata))
+   return FSCACHE_CHECKAUX_OBSOLETE;
+
+   auxdata.size = nfsi->vfs_inode.i_size;
+   auxdata.mtime = nfsi->vfs_inode.i_mtime;
+   auxdata.ctime = nfsi->vfs_inode.i_ctime;
+
+   if (memcmp(data, &auxdata, datalen) != 0)
+   return FSCACHE_CHECKAUX_OBSOLETE;
+
+   return FSCACHE_CHECKAUX_OKAY;
+}
+
+/*
+ * Define the inode object for FS-Cache.  This is used to describe an inode
+ * object to fscache_acquire_cookie().  It is keyed by the NFS file handle for
+ * an inode.
+ *
+ * Coherency is managed by comparing the copies of i_size, i_mtime and i_ctime
+ * held in the cache auxiliary data for the data storage object with those in
+ * the inode struct in memory.
+ */
+const struct

[PATCH 24/37] NFS: Register NFS for caching and retrieve the top-level index

2008-02-08 Thread David Howells
Register NFS for caching and retrieve the top-level cache index object cookie.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/Makefile|1 +
 fs/nfs/fscache-index.c |   53 
 fs/nfs/fscache.h   |   35 
 fs/nfs/inode.c |8 +++
 4 files changed, 97 insertions(+), 0 deletions(-)
 create mode 100644 fs/nfs/fscache-index.c
 create mode 100644 fs/nfs/fscache.h


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..6d7176d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)  += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
new file mode 100644
index 000..225ed5d
--- /dev/null
+++ b/fs/nfs/fscache-index.c
@@ -0,0 +1,53 @@
+/* NFS FS-Cache index structure definition
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY    NFSDBG_FSCACHE
+
+static const struct fscache_netfs_operations nfs_cache_ops = {
+};
+
+/*
+ * Define the NFS filesystem for FS-Cache.  Upon registration FS-Cache sticks
+ * the cookie for the top-level index object for NFS into this structure.  The
+ * top-level index can then have other cache objects inserted into it.
+ */
+struct fscache_netfs nfs_cache_netfs = {
+   .name   = "nfs",
+   .version= 0,
+   .ops= &nfs_cache_ops,
+};
+
+/*
+ * Register NFS for caching
+ */
+int nfs_fscache_register(void)
+{
+   return fscache_register_netfs(&nfs_cache_netfs);
+}
+
+/*
+ * Unregister NFS for caching
+ */
+void nfs_fscache_unregister(void)
+{
+   fscache_unregister_netfs(&nfs_cache_netfs);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
new file mode 100644
index 000..75e5a03
--- /dev/null
+++ b/fs/nfs/fscache.h
@@ -0,0 +1,35 @@
+/* NFS filesystem cache interface definitions
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _NFS_FSCACHE_H
+#define _NFS_FSCACHE_H
+
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_NFS_FSCACHE
+#include 
+
+/*
+ * fscache-index.c
+ */
+extern struct fscache_netfs nfs_cache_netfs;
+
+extern int nfs_fscache_register(void);
+extern void nfs_fscache_unregister(void);
+
+#else /* CONFIG_NFS_FSCACHE */
+static inline int nfs_fscache_register(void) { return 0; }
+static inline void nfs_fscache_unregister(void) {}
+
+#endif /* CONFIG_NFS_FSCACHE */
+#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 966a885..7254d5c 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -46,6 +46,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY    NFSDBG_VFS
 
@@ -1222,6 +1223,10 @@ static int __init init_nfs_fs(void)
 {
int err;
 
+   err = nfs_fscache_register();
+   if (err < 0)
+   goto out6;
+
err = nfs_fs_proc_init();
if (err)
goto out5;
@@ -1268,6 +1273,8 @@ out3:
 out4:
nfs_fs_proc_exit();
 out5:
+   nfs_fscache_unregister();
+out6:
return err;
 }
 
@@ -1278,6 +1285,7 @@ static void __exit exit_nfs_fs(void)
nfs_destroy_readpagecache();
nfs_destroy_inodecache();
nfs_destroy_nfspagecache();
+   nfs_fscache_unregister();
 #ifdef CONFIG_PROC_FS
rpc_proc_unregister("nfs");
 #endif



[PATCH 25/37] NFS: Define and create server-level objects

2008-02-08 Thread David Howells
Define and create server-level cache index objects (as managed by nfs_client
structs).

Each server object is created in the NFS top-level index object and is itself
an index into which superblock-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The server object key is a sequence consisting of:

 (1) NFS version.

 (2) Server address family (e.g. AF_INET or AF_INET6).

 (3) Server port.

 (4) Server IP address.

The key blob is of variable length, depending on the length of (4).

The server object is given no coherency data to carry in the auxiliary data
permitted by the cache.
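
The variable-length key layout described above can be sketched in user-space
C as follows.  This is a sketch under stated assumptions, not the patch's
code: sketch_key_hdr and sketch_server_key() are hypothetical names, and the
address is appended with memcpy() rather than through a zero-length union
array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Fixed-size leading part of the key: version, family, port. */
struct sketch_key_hdr {
	uint16_t nfsversion;
	uint16_t family;
	uint16_t port;
};

/*
 * Build the key into buffer and return its length, or 0 if the address
 * family is unknown or the buffer is too small.
 */
static size_t sketch_server_key(uint16_t nfsversion,
				const struct sockaddr_storage *ss,
				void *buffer, size_t bufmax)
{
	struct sketch_key_hdr hdr;
	unsigned char *out = buffer;
	size_t len = sizeof(hdr);

	assert(ss != NULL && buffer != NULL);

	memset(&hdr, 0, sizeof(hdr));
	hdr.nfsversion = nfsversion;
	hdr.family = ss->ss_family;

	switch (ss->ss_family) {
	case AF_INET: {
		const struct sockaddr_in *sin = (const struct sockaddr_in *)ss;

		if (bufmax < len + sizeof(sin->sin_addr))
			return 0;
		hdr.port = sin->sin_port;
		memcpy(out + len, &sin->sin_addr, sizeof(sin->sin_addr));
		len += sizeof(sin->sin_addr);
		break;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)ss;

		if (bufmax < len + sizeof(sin6->sin6_addr))
			return 0;
		hdr.port = sin6->sin6_port;
		memcpy(out + len, &sin6->sin6_addr, sizeof(sin6->sin6_addr));
		len += sizeof(sin6->sin6_addr);
		break;
	}
	default:
		return 0;	/* unknown family: no key can be generated */
	}

	memcpy(out, &hdr, sizeof(hdr));
	return len;
}
```

The returned length differs between IPv4 and IPv6 servers, which is why the
key blob is of variable length.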

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfs/Makefile   |2 +
 fs/nfs/client.c   |5 +++
 fs/nfs/fscache-index.c|   65 +
 fs/nfs/fscache.c  |   52 
 fs/nfs/fscache.h  |   10 +++
 include/linux/nfs_fs_sb.h |4 +++
 6 files changed, 137 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/fscache.c


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6d7176d..d848c97 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,4 +16,4 @@ nfs-$(CONFIG_NFS_V4)  += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
-nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c5c0175..51e9346 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -45,6 +45,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY    NFSDBG_CLIENT
 
@@ -151,6 +152,8 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+   nfs_fscache_get_client_cookie(clp);
+
return clp;
 
 error_3:
@@ -182,6 +185,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
nfs4_shutdown_client(clp);
 
+   nfs_fscache_release_client_cookie(clp);
+
/* -EIO all pending I/O */
if (!IS_ERR(clp->cl_rpcclient))
rpc_shutdown_client(clp->cl_rpcclient);
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 225ed5d..25ac4a1 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -51,3 +51,68 @@ void nfs_fscache_unregister(void)
 {
fscache_unregister_netfs(&nfs_cache_netfs);
 }
+
+/*
+ * Layout of the key for an NFS server cache object.
+ */
+struct nfs_server_key {
+   uint16_t    nfsversion; /* NFS protocol version */
+   uint16_t    family;     /* address family */
+   uint16_t    port;       /* IP port */
+   union {
+   struct in_addr  ipv4_addr;  /* IPv4 address */
+   struct in6_addr ipv6_addr;  /* IPv6 address */
+   } addr[0];
+};
+
+/*
+ * Generate a key to describe a server in the main NFS index
+ * - We return the length of the key, or 0 if we can't generate one
+ */
+static uint16_t nfs_server_get_key(const void *cookie_netfs_data,
+  void *buffer, uint16_t bufmax)
+{
+   const struct nfs_client *clp = cookie_netfs_data;
+   const struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *) &clp->cl_addr;
+   const struct sockaddr_in *sin = (struct sockaddr_in *) &clp->cl_addr;
+   struct nfs_server_key *key = buffer;
+   uint16_t len = 0;
+
+   key->nfsversion = clp->rpc_ops->version;
+   key->family = clp->cl_addr.ss_family;
+
+   len = sizeof(struct nfs_server_key);
+
+   switch (clp->cl_addr.ss_family) {
+   case AF_INET:
+   key->port = sin->sin_port;
+   key->addr[0].ipv4_addr = sin->sin_addr;
+   len += sizeof(key->addr[0].ipv4_addr);
+   break;
+
+   case AF_INET6:
+   key->port = sin6->sin6_port;
+   key->addr[0].ipv6_addr = sin6->sin6_addr;
+   len += sizeof(key->addr[0].ipv6_addr);
+   break;
+
+   default:
+   printk(KERN_WARNING "NFS: Unknown network family '%d'\n",
+  clp->cl_addr.ss_family);
+   len = 0;
+   break;
+   }
+
+   return len;
+}
+
+/*
+ * Define the server object for FS-Cache.  This is used to describe a server
+ * object to fscache_acquire_cookie().  It is keyed by the NFS protocol and
+ * server address parameters.
+ */
+const struct fscache_cookie_def nfs_cache_server_index_def = {
+   .name   = "NFS.server",
+   .type   = FSCACHE_COOKIE_TYPE_INDEX,
+   .get_key= nfs_server_get_key,
+};
diff --git a/fs/nfs/fscache.c b/f

[PATCH 22/37] NFS: Add FS-Cache option bit and debug bit

2008-02-08 Thread David Howells
Add FS-Cache option bit to nfs_server struct.  This is set to indicate local
on-disk caching is enabled for a particular superblock.

Also add debug bit for local caching operations.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/nfs_fs.h|1 +
 include/linux/nfs_fs_sb.h |2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a69ba80..14894c9 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -578,6 +578,7 @@ extern void * nfs_root_data(void);
 #define NFSDBG_CALLBACK    0x0100
 #define NFSDBG_CLIENT  0x0200
 #define NFSDBG_MOUNT   0x0400
+#define NFSDBG_FSCACHE 0x0800
 #define NFSDBG_ALL 0x
 
 #ifdef __KERNEL__
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 3423c67..e7c4cdd 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -99,6 +99,8 @@ struct nfs_server {
unsigned int    acdirmin;
unsigned int    acdirmax;
unsigned int    namelen;
+   unsigned int    options;    /* extra options enabled by mount */
+#define NFS_OPTION_FSCACHE 0x0001  /* - local caching enabled */
 
struct nfs_fsid fsid;
__u64   maxfilesize;/* maximum file size */



[PATCH 15/37] CacheFiles: Add missing copy_page export for ia64

2008-02-08 Thread David Howells
This one-line patch adds the missing export of copy_page, which the CacheFiles
patches rely on.  The patch is not yet upstream, but is required for CacheFiles
on ia64.  It will be pushed upstream when CacheFiles goes upstream.

Signed-off-by: Prarit Bhargava <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 arch/ia64/kernel/ia64_ksyms.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)


diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index 8e7193d..3e544f4 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -46,6 +46,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);



[PATCH 19/37] CacheFiles: Export things for CacheFiles

2008-02-08 Thread David Howells
Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/super.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)


diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
__fsync_super(sb);
return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  * generic_shutdown_super  -   common helper for ->kill_sb()



[PATCH 18/37] CacheFiles: Permit the page lock state to be monitored

2008-02-08 Thread David Howells
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
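
The monitoring idea can be modelled in user-space C as a callback that fires
when the page is unlocked.  Everything here is a hypothetical model: the
sketch_* types stand in for struct page and wait_queue_t, and no locking is
shown, whereas the kernel function takes the wait queue lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter: a callback plus a flag recording that it ran. */
struct sketch_waiter {
	void (*func)(struct sketch_waiter *self);
	struct sketch_waiter *next;
	int fired;
};

/* Hypothetical page: a lock bit and a list of waiters on it. */
struct sketch_page {
	int locked;
	struct sketch_waiter *waiters;
};

/* Install a monitor on the page's waiter list. */
static void sketch_add_page_waiter(struct sketch_page *page,
				   struct sketch_waiter *w)
{
	assert(page != NULL && w != NULL && w->func != NULL);
	w->next = page->waiters;
	page->waiters = w;
}

/* Unlock the page and notify every installed monitor. */
static void sketch_unlock_page(struct sketch_page *page)
{
	struct sketch_waiter *w;

	page->locked = 0;
	for (w = page->waiters; w != NULL; w = w->next)
		w->func(w);
}

/* Example monitor: note that the backing read has completed. */
static void sketch_read_done(struct sketch_waiter *self)
{
	self->fired = 1;
}
```

In the CacheFiles case the callback would be the point at which the data is
copied from the backing page to the waiting netfs page.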

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/pagemap.h |5 +
 mm/filemap.c|   18 ++
 2 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d22e975..eb08fb8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -242,6 +242,11 @@ static inline void wait_on_page_owner_priv_2(struct page *page)
 extern void end_page_owner_priv_2(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index 6c6cd76..5c0241c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -548,6 +548,24 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page: Page defining the wait queue of interest
+ * @waiter: Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+   wait_queue_head_t *q = page_waitqueue(page);
+   unsigned long flags;
+
+   spin_lock_irqsave(&q->lock, flags);
+   __add_wait_queue(q, waiter);
+   spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *



[PATCH 16/37] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3

2008-02-08 Thread David Howells
Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
 other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
 the generic hook in the next patch, which is used by CacheFiles to write
 pages to a file without setting up a file struct.
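
A small user-space model of the second point, assuming hypothetical sketch_*
structs in place of struct file, struct address_space and struct inode:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode {
	long i_size;
};

struct sketch_address_space {
	struct sketch_inode *host;
};

struct sketch_file {
	struct sketch_address_space *f_mapping;
};

/*
 * Find the inode for a write_end-style call.  Only the mapping argument is
 * dereferenced, so the caller may pass file == NULL.
 */
static struct sketch_inode *
sketch_write_end_inode(struct sketch_file *file,
		       struct sketch_address_space *mapping)
{
	(void)file;		/* deliberately unused: may be NULL */
	assert(mapping != NULL);
	return mapping->host;
}
```

Had the function used file->f_mapping->host instead, a NULL file pointer
would be dereferenced.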

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/ext3/inode.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb95670..c976123 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1215,7 +1215,7 @@ static int ext3_generic_write_end(struct file *file,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
 
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1240,7 +1240,7 @@ static int ext3_ordered_write_end(struct file *file,
struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
unsigned from, to;
int ret = 0, ret2;
 
@@ -1281,7 +1281,7 @@ static int ext3_writeback_write_end(struct file *file,
struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = file->f_mapping->host;
+   struct inode *inode = mapping->host;
int ret = 0, ret2;
loff_t new_i_size;
 



[PATCH 17/37] CacheFiles: Add a hook to write a single page of data to an inode

2008-02-08 Thread David Howells
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised).  The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.
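
The kernel-doc comment in this patch notes that the file is not extended when
the target page crosses EOF, so only the first part of the page is written.
The arithmetic behind that can be sketched as follows.  The function name,
the 4096-byte page size and the choice to return 0 for a page lying wholly
beyond EOF are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Number of bytes of the source page that would be written at page index
 * 'index' of a file whose size is 'i_size', without extending the file.
 */
static uint32_t sketch_write_len(uint64_t index, uint64_t i_size)
{
	uint64_t pos;

	/* Guard against overflow when converting the index to a byte offset. */
	assert(index <= UINT64_MAX / SKETCH_PAGE_SIZE);
	pos = index * SKETCH_PAGE_SIZE;

	if (pos >= i_size)
		return 0;			/* page lies beyond EOF */
	if (i_size - pos < SKETCH_PAGE_SIZE)
		return (uint32_t)(i_size - pos);	/* page crosses EOF */
	return SKETCH_PAGE_SIZE;		/* page lies wholly inside */
}
```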

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/ext2/inode.c|2 ++
 fs/ext3/inode.c|3 +++
 include/linux/fs.h |7 ++
 mm/filemap.c   |   61 
 4 files changed, 73 insertions(+), 0 deletions(-)


diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c620068..f483014 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -792,6 +792,7 @@ const struct address_space_operations ext2_aops = {
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 const struct address_space_operations ext2_aops_xip = {
@@ -810,6 +811,7 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 /*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index c976123..0209f3b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1776,6 +1776,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.releasepage= ext3_releasepage,
.direct_IO  = ext3_direct_IO,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
@@ -1790,6 +1791,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.releasepage= ext3_releasepage,
.direct_IO  = ext3_direct_IO,
.migratepage= buffer_migrate_page,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
@@ -1803,6 +1805,7 @@ static const struct address_space_operations ext3_journalled_aops = {
.bmap   = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
.releasepage= ext3_releasepage,
+   .write_one_page = generic_file_buffered_write_one_page,
 };
 
 void ext3_set_aops(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index feea65d..35525c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -481,6 +481,11 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+   /* write the contents of the source page over the page at the specified
+* index in the target address space (the source page does not need to
+* be related to the target address space) */
+   int (*write_one_page)(struct address_space *, pgoff_t, struct page *);
+
 };
 
 /*
@@ -1805,6 +1810,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_page(struct address_space *,
+   pgoff_t, struct page *);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 extern void do_generic_mapping_read(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 89f5a5e..6c6cd76 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2367,6 +2367,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
+/**
+ * generic_file_buffered_write_one_page - Write a single page of data to an
+ * inode
+ * @mapping: The address space of the target inode
+ * @index: The target page in the target inode to fill
+ * @source: The data to write into the target page
+ *
+ * Write the data from the source page to the page in the nominated address
+ * space at the @index specified.  Note that the file will not be extended if
+ * the page crosses the EOF marker, in which case only the first part of the
+ * page will be written.
+ *
+ * The @source page does not need to have any association wit

[PATCH 13/37] FS-Cache: Provide an add_wait_queue_tail() function

2008-02-08 Thread David Howells
Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.
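
The difference between the two insertion points can be sketched with a small
circular list in user-space C.  The sketch_* names are hypothetical, and the
locking and flag handling of the real function are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter on a circular, doubly linked queue. */
struct sketch_waiter {
	int id;
	struct sketch_waiter *prev, *next;
};

/* The queue head is a sentinel node that points at itself when empty. */
static void sketch_wq_init(struct sketch_waiter *head)
{
	head->prev = head->next = head;
}

static void sketch_insert_between(struct sketch_waiter *w,
				  struct sketch_waiter *prev,
				  struct sketch_waiter *next)
{
	w->prev = prev;
	w->next = next;
	prev->next = w;
	next->prev = w;
}

/* Front insertion: the waiter is reached first by a wake-up walk. */
static void sketch_add_front(struct sketch_waiter *head,
			     struct sketch_waiter *w)
{
	assert(head != NULL && w != NULL);
	sketch_insert_between(w, head, head->next);
}

/* Tail insertion: the waiter is reached last by a wake-up walk. */
static void sketch_add_tail(struct sketch_waiter *head,
			    struct sketch_waiter *w)
{
	assert(head != NULL && w != NULL);
	sketch_insert_between(w, head->prev, head);
}

/* Record the ids in the order a front-to-back wake-up would visit them. */
static size_t sketch_wake_order(const struct sketch_waiter *head,
				int *ids, size_t max)
{
	const struct sketch_waiter *w;
	size_t n = 0;

	for (w = head->next; w != head && n < max; w = w->next)
		ids[n++] = w->id;
	return n;
}
```

With waiters 1 and 2 added at the front and waiter 3 added at the tail, the
walk visits them in the order 2, 1, 3.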

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/pagemap.h |7 +--
 include/linux/wait.h|2 ++
 kernel/wait.c   |   18 ++
 mm/filemap.c|2 +-
 4 files changed, 26 insertions(+), 3 deletions(-)


diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 51c39f8..cecbace 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -225,8 +225,11 @@ static inline void wait_on_page_writeback(struct page *page)
 
 extern void end_page_writeback(struct page *page);
 
-/*
- * Wait for a PG_owner_priv_2 to become clear
+/**
+ * wait_on_page_owner_priv_2 - Wait for PG_owner_priv_2 to become clear
+ * @page: The page to monitor
+ *
+ * Wait for PG_owner_priv_2 to become clear on the specified page.
  */
 static inline void wait_on_page_owner_priv_2(struct page *page)
 {
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 33a2aa9..5032ebb 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait) (!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index f987688..a82b012 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+   unsigned long flags;
+
+   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+   spin_lock_irqsave(&q->lock, flags);
+   __add_wait_queue_tail(q, wait);
+   spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
unsigned long flags;
diff --git a/mm/filemap.c b/mm/filemap.c
index 138e791..8d64000 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -587,7 +587,7 @@ void end_page_writeback(struct page *page)
 EXPORT_SYMBOL(end_page_writeback);
 
 /**
- * end_page_own - Clear PG_owner_priv_2 and wake up any waiters
+ * end_page_owner_priv_2 - Clear PG_owner_priv_2 and wake up any waiters
  * @page: the page
  *
  * Clear PG_owner_priv_2 and wake up any processes waiting for that event.



[PATCH 12/37] FS-Cache: Recruit a couple of page flags for cache management

2008-02-08 Thread David Howells
Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_private_2)

 The marked page is backed by a local cache and is pinning resources in the
 cache driver.

 (2) PG_fscache_write (PG_owner_priv_2)

 The marked page is being written to the local cache.  The page may not be
 modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() has been added to check for both PG_private
and PG_private_2 at the same time.
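
A combined check of this kind can be sketched as a single mask test.  The bit
numbers are the ones defined in this patch (PG_private is 11, PG_private_2 is
13); the function body is an assumption for illustration, since the helper's
definition is not shown in this message, and it operates on a bare flags word
rather than a struct page.

```c
#include <assert.h>

#define SKETCH_PG_PRIVATE	11	/* PG_private in this patch */
#define SKETCH_PG_PRIVATE_2	13	/* PG_private_2 in this patch */

/* Return non-zero if either private flag is set in the flags word. */
static int sketch_page_has_private(unsigned long flags)
{
	const unsigned long mask = (1UL << SKETCH_PG_PRIVATE) |
				   (1UL << SKETCH_PG_PRIVATE_2);

	return (flags & mask) != 0;
}
```

Callers that previously tested only PG_private then cover pages marked
PG_fscache as well, with one test instead of two.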

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/splice.c|2 +-
 include/linux/page-flags.h |   39 +--
 include/linux/pagemap.h|   11 +++
 mm/filemap.c   |   18 ++
 mm/migrate.c   |2 +-
 mm/page_alloc.c|3 +++
 mm/readahead.c |9 +
 mm/swap.c  |4 ++--
 mm/swap_state.c|4 ++--
 mm/truncate.c  |   10 +-
 mm/vmscan.c|2 +-
 11 files changed, 86 insertions(+), 18 deletions(-)


diff --git a/fs/splice.c b/fs/splice.c
index 4ee49e8..01b8c43 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 */
wait_on_page_writeback(page);
 
-   if (PagePrivate(page))
+   if (page_has_private(page))
try_to_release_page(page, GFP_KERNEL);
 
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index bbad43f..cc16c23 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,32 @@
 #define PG_active               6
 #define PG_slab                 7      /* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1         8      /* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1         8      /* Owner use. fs may use in pagecache */
 #define PG_arch_1               9
 #define PG_reserved            10
 #define PG_private             11      /* If pagecache, has fs-private data */
 
 #define PG_writeback           12      /* Page is under writeback */
+#define PG_private_2           13      /* If pagecache, has fs aux data */
 #define PG_compound            14      /* Part of a compound page */
 #define PG_swapcache           15      /* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk        16      /* Has blocks allocated on-disk */
 #define PG_reclaim             17      /* To be reclaimed asap */
+#define PG_owner_priv_2        18      /* Owner use. fs may use in pagecache */
 #define PG_buddy               19      /* Page is free, on buddy lists */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead   PG_reclaim /* Reminder to do async read-ahead */
 
-/* PG_owner_priv_1 users should have descriptive aliases */
+/* PG_owner_priv_1/2 users should have descriptive aliases */
 #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned  PG_owner_priv_1 /* Xen pinned pagetable */
+#define PG_fscache_write   PG_owner_priv_2 /* Writing to local cache */
+
+/* PG_private_2 causes releasepage() and co to be invoked */
+#define PG_fscache PG_private_2/* Backed by local cache */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -235,6 +242,23 @@ static inline void SetPageUptodate(struct page *page)
 #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,  \
&(page)->flags)
 
+#define PagePrivate2(page)          test_bit(PG_private_2, &(page)->flags)
+#define SetPagePrivate2(page)       set_bit(PG_private_2, &(page)->flags)
+#define ClearPagePrivate2(page)     clear_bit(PG_private_2, &(page)->flags)
+#define TestSetPagePrivate2(page)   test_and_set_bit(PG_private_2, &(page)->flags)
+#define TestClearPagePrivate2(page) test_and_clear_bit(PG_private_2, \
+                                                       &(page)->flags)
+
+#define PageOwnerPriv2(page)          test_bit(PG_owner_priv_2, \
+                                               &(page)->flags)
+#define SetPageOwnerPriv2(page)       set_bit(PG_owner_priv_2, &(page)->flags)
+#define ClearPageOwnerPriv2(page)     clear_bit(PG_owner_priv_2, \
+                                                &(page)->flags)
+#define TestSetPageOwnerPriv2(page)   test_and_set_bit(PG_owner_priv_2, \
+                                                       &(page)->flags)
+#define TestClearPageOwnerPriv2(page) test_and_clear_bit(PG_owner_priv_2, \
+                                                         &(page)->flags)
+
 #define Page

[PATCH 10/37] Security: Make NFSD work with detached security

2008-02-08 Thread David Howells
Make NFSD work with detached security, using the patches that excise the
security information from task_struct to struct task_security as a base.

Each time NFSD wants a new security descriptor (to do NFS4 recovery or just to
do NFS operations), a task_security record is derived from NFSD's *objective*
security, modified and then applied as the *subjective* security.  This means
(a) the changes are not visible to anyone looking at NFSD through /proc, (b)
there is no leakage between two consecutive ops with different security
configurations.
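
The derive/modify/apply sequence can be sketched in userspace C as follows.
This is only an illustration under stated assumptions: struct task_security
is cut down to a reference count and two IDs, and derive_security(),
put_security() and set_fs_identity() are hypothetical names standing in for
what the patch does with get_kernel_security(), put_task_security() and
nfsd_setuser().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's struct task_security. */
struct task_security {
	int usage;		/* reference count */
	unsigned int fsuid;
	unsigned int fsgid;
};

struct task {
	struct task_security *sec;	/* objective: how others see the task */
	struct task_security *act_as;	/* subjective: how the task acts */
};

/* Copy the objective record; the caller owns the single reference. */
static struct task_security *derive_security(const struct task *t)
{
	struct task_security *sec = malloc(sizeof(*sec));

	if (!sec)
		return NULL;
	*sec = *t->sec;
	sec->usage = 1;
	return sec;
}

/* Drop one reference and free the record when the last one goes. */
static void put_security(struct task_security *sec)
{
	assert(sec->usage > 0);
	if (--sec->usage == 0)
		free(sec);
}

/* Derive from the objective record, modify, then install as subjective. */
static int set_fs_identity(struct task *t, unsigned int uid, unsigned int gid)
{
	struct task_security *sec = derive_security(t);
	struct task_security *old;

	if (!sec)
		return -1;
	sec->fsuid = uid;
	sec->fsgid = gid;

	old = t->act_as;
	t->act_as = sec;
	put_security(old);
	return 0;
}
```

Because each call starts again from the objective record, nothing set for one
operation carries over into the next, and the objective record that /proc
would report is never written to.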

Consideration should probably be given to caching the task_security record on
the basis that there'll probably be several ops that will want to use any
particular security configuration.

Furthermore, nfs4recover.c perhaps ought to set an appropriate LSM context on
the record pointed to by rec_security so that the disk is accessed
appropriately (see set_security_override[_from_ctx]()).

NOTE!  This patch must be rolled in to one of the earlier security patches to
make it compile fully.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/nfsd/auth.c        |   37 +++-
 fs/nfsd/nfs4recover.c |   64 +++--
 2 files changed, 65 insertions(+), 36 deletions(-)


diff --git a/fs/nfsd/auth.c b/fs/nfsd/auth.c
index 5586157..ebdc562 100644
--- a/fs/nfsd/auth.c
+++ b/fs/nfsd/auth.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,12 +27,17 @@ int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp)
 
 int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp)
 {
-   struct task_security *act_as = current->act_as;
+   struct task_security *sec, *old;
struct svc_cred cred = rqstp->rq_cred;
int i;
int flags = nfsexp_flags(rqstp, exp);
int ret;
 
+   /* derive the new security record from nfsd's objective security */
+   sec = get_kernel_security(current);
+   if (!sec)
+   return -ENOMEM;
+
if (flags & NFSEXP_ALLSQUASH) {
cred.cr_uid = exp->ex_anon_uid;
cred.cr_gid = exp->ex_anon_gid;
@@ -55,26 +61,33 @@ int nfsd_setuser(struct svc_rqst *rqstp, struct svc_export *exp)
get_group_info(cred.cr_group_info);
 
if (cred.cr_uid != (uid_t) -1)
-   act_as->fsuid = cred.cr_uid;
+   sec->fsuid = cred.cr_uid;
else
-   act_as->fsuid = exp->ex_anon_uid;
+   sec->fsuid = exp->ex_anon_uid;
if (cred.cr_gid != (gid_t) -1)
-   act_as->fsgid = cred.cr_gid;
+   sec->fsgid = cred.cr_gid;
else
-   act_as->fsgid = exp->ex_anon_gid;
+   sec->fsgid = exp->ex_anon_gid;
 
-   if (!cred.cr_group_info)
+   if (!cred.cr_group_info) {
+   put_task_security(sec);
return -ENOMEM;
-   ret = set_groups(act_as, cred.cr_group_info);
+   }
+   ret = set_groups(sec, cred.cr_group_info);
put_group_info(cred.cr_group_info);
if ((cred.cr_uid)) {
-   act_as->cap_effective =
-   cap_drop_nfsd_set(act_as->cap_effective);
+   sec->cap_effective =
+   cap_drop_nfsd_set(sec->cap_effective);
} else {
-   act_as->cap_effective =
-   cap_raise_nfsd_set(act_as->cap_effective,
-  act_as->cap_permitted);
+   sec->cap_effective =
+   cap_raise_nfsd_set(sec->cap_effective,
+  sec->cap_permitted);
}
+
+   /* set the new security as nfsd's subjective security */
+   old = current->act_as;
+   current->act_as = sec;
+   put_task_security(old);
return ret;
 }
 
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index bf0217a..ae91262 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -46,27 +46,37 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NFSDDBG_FACILITY                NFSDDBG_PROC
 
 /* Globals */
 static struct nameidata rec_dir;
 static int rec_dir_init = 0;
+static struct task_security *rec_security;
 
+/*
+ * switch the special recovery access security in on the current task's
+ * subjective security
+ */
 static void
-nfs4_save_user(uid_t *saveuid, gid_t *savegid)
+nfs4_begin_secure(struct task_security **saved_sec)
 {
-   *saveuid = current->act_as->fsuid;
-   *savegid = current->act_as->fsgid;
-   current->act_as->fsuid = 0;
-   current->act_as->fsgid = 0;
+   *saved_sec = current->act_as;
+   current->act_as = get_task_security(rec_security);
 }
 
+/*
+ * return the current task's subjective security to its former glory
+ */
 static void
-nfs4_reset_user(uid_t saveuid, gid_t savegid)
+nfs4_end_secure(struct task_security *saved_sec)
 {
-   current->act_as->fsuid = saveuid;
-   current

[PATCH 11/37] FS-Cache: Release page->private after failed readahead

2008-02-08 Thread David Howells
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 mm/readahead.c |   39 +--
 1 files changed, 37 insertions(+), 2 deletions(-)


diff --git a/mm/readahead.c b/mm/readahead.c
index c9c50ca..75aa6b6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+struct page *page)
+{
+   if (PagePrivate(page)) {
+   if (TestSetPageLocked(page))
+   BUG();
+   page->mapping = mapping;
+   do_invalidatepage(page, 0);
+   page->mapping = NULL;
+   unlock_page(page);
+   }
+   page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+ struct list_head *pages)
+{
+   struct page *victim;
+
+   while (!list_empty(pages)) {
+   victim = list_to_page(pages);
+   list_del(&victim->lru);
+   read_cache_pages_invalidate_page(mapping, victim);
+   }
+}
+
 /**
  * read_cache_pages - populate an address space with some pages & start reads against them
  * @mapping: the address_space
@@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
list_del(&page->lru);
if (add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
-   page_cache_release(page);
+   read_cache_pages_invalidate_page(mapping, page);
continue;
}
page_cache_release(page);
 
ret = filler(data, page);
if (unlikely(ret)) {
-   put_pages_list(pages);
+   read_cache_pages_invalidate_pages(mapping, pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);



[PATCH 04/37] KEYS: Add keyctl function to get a security label

2008-02-08 Thread David Howells
Add a keyctl() function to get the security label of a key.

The following is added to Documentation/keys.txt:

 (*) Get the LSM security context attached to a key.

long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
size_t buflen)

 This function returns a string that represents the LSM security context
 attached to a key in the buffer provided.

 Unless there's an error, it always returns the amount of data it could
 produce, even if that's too big for the buffer, but it won't copy more
 than requested to userspace. If the buffer pointer is NULL then no copy
 will take place.

 A NUL character is included at the end of the string if the buffer is
 sufficiently big.  This is included in the returned count.  If no LSM is
 in force then an empty string will be returned.

 A process must have view permission on the key for this function to be
 successful.
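
The return convention described above lends itself to a probe-then-fetch
pattern, sketched here in userspace C.  mock_get_security() is a hypothetical
stand-in that only imitates the documented behaviour (full length including
the NUL is returned, at most buflen bytes are copied, a NULL buffer copies
nothing); it is not the real keyctl() call, and fetch_label() is likewise an
illustrative helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for keyctl(KEYCTL_GET_SECURITY, ...): returns the
 * full length including the NUL, copies at most buflen bytes, and copies
 * nothing when buffer is NULL.
 */
static long mock_get_security(const char *label, char *buffer, size_t buflen)
{
	size_t len = strlen(label) + 1;

	if (buffer && buflen > 0)
		memcpy(buffer, label, len < buflen ? len : buflen);
	return (long)len;
}

/* Size the buffer with a NULL probe, then fetch; caller frees the result. */
static char *fetch_label(const char *label)
{
	long need = mock_get_security(label, NULL, 0);
	char *buf;

	assert(need >= 1);	/* the count always includes the NUL */
	buf = malloc((size_t)need);
	if (!buf)
		return NULL;
	if (mock_get_security(label, buf, (size_t)need) != need) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

Against a real key the label could in principle change between the two
calls, so a caller would want to compare the second return value with the
buffer size and retry if it grew.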

Signed-off-by: David Howells <[EMAIL PROTECTED]>
Acked-by:  Stephen Smalley <[EMAIL PROTECTED]>
---

 Documentation/keys.txt   |   21 +++
 include/linux/keyctl.h   |    1 +
 include/linux/security.h |   20 +-
 security/dummy.c         |    8 ++
 security/keys/compat.c   |    3 ++
 security/keys/keyctl.c   |   66 ++
 security/security.c      |    5 +++
 security/selinux/hooks.c |   21 +--
 8 files changed, 141 insertions(+), 4 deletions(-)


diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index b82d38d..be424b0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -711,6 +711,27 @@ The keyctl syscall functions are:
  The assumed authoritative key is inherited across fork and exec.
 
 
+ (*) Get the LSM security context attached to a key.
+
+   long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
+   size_t buflen)
+
+ This function returns a string that represents the LSM security context
+ attached to a key in the buffer provided.
+
+ Unless there's an error, it always returns the amount of data it could
+ produce, even if that's too big for the buffer, but it won't copy more
+ than requested to userspace. If the buffer pointer is NULL then no copy
+ will take place.
+
+ A NUL character is included at the end of the string if the buffer is
+ sufficiently big.  This is included in the returned count.  If no LSM is
+ in force then an empty string will be returned.
+
+ A process must have view permission on the key for this function to be
+ successful.
+
+
 ===
 KERNEL SERVICES
 ===
diff --git a/include/linux/keyctl.h b/include/linux/keyctl.h
index 3365945..656ee6b 100644
--- a/include/linux/keyctl.h
+++ b/include/linux/keyctl.h
@@ -49,5 +49,6 @@
 #define KEYCTL_SET_REQKEY_KEYRING      14      /* set default request-key keyring */
 #define KEYCTL_SET_TIMEOUT             15      /* set key timeout */
 #define KEYCTL_ASSUME_AUTHORITY        16      /* assume request_key() authorisation */
+#define KEYCTL_GET_SECURITY            17      /* get key security label */
 
 #endif /*  _LINUX_KEYCTL_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index fe52cde..a33fd03 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -970,6 +970,17 @@ struct request_sock;
  * @perm describes the combination of permissions required of this key.
  * Return 1 if permission granted, 0 if permission denied and -ve it the
  *  normal permissions model should be effected.
+ * @key_getsecurity:
+ * Get a textual representation of the security context attached to a key
+ * for the purposes of honouring KEYCTL_GETSECURITY.  This function
+ * allocates the storage for the NUL-terminated string and the caller
+ * should free it.
+ * @key points to the key to be queried.
+ * @_buffer points to a pointer that should be set to point to the
+ *  resulting string (if no label or an error occurs).
+ * Return the length of the string (including terminating NUL) or -ve if
+ *  an error.
+ * May also return 0 (and a NULL buffer pointer) if there is no label.
  *
  * Security hooks affecting all System V IPC operations.
  *
@@ -1459,7 +1470,7 @@ struct security_operations {
int (*key_permission)(key_ref_t key_ref,
  struct task_struct *context,
  key_perm_t perm);
-
+   int (*key_getsecurity)(struct key *key, char **_buffer);
 #endif /* CONFIG_KEYS */
 
 };
@@ -2600,6 +2611,7 @@ int security_key_alloc(struct key *key, struct task_struct *tsk, unsigned long f
 void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref,
struct task_struct *context, key_perm_t perm);
+int security_key_getsecurity(struct key *key, char **_buffer);
 
 #else
 
@@ -2621,6 +2633,12 @@ static inline int secur

[PATCH 05/37] Security: Change current->fs[ug]id to current_fs[ug]id()

2008-02-08 Thread David Howells
Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be
separated from the task_struct.
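
To show why the accessor form helps, here is a small userspace sketch.  The
structure layouts, the current_task variable and inode_init_owner_sketch()
are assumptions made for illustration; the real current_fs[ug]id()
definitions live in the include/linux/sched.h hunk, which is not shown in
this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record that the IDs may later be moved into. */
struct task_security {
	unsigned int fsuid;
	unsigned int fsgid;
};

struct task {
	struct task_security *act_as;
};

/* Stand-in for the kernel's "current" task pointer. */
static struct task *current_task;

/* Callers use accessors, so the storage location can change underneath. */
static unsigned int current_fsuid(void)
{
	assert(current_task && current_task->act_as);
	return current_task->act_as->fsuid;
}

static unsigned int current_fsgid(void)
{
	assert(current_task && current_task->act_as);
	return current_task->act_as->fsgid;
}

struct inode {
	unsigned int i_uid;
	unsigned int i_gid;
};

/* Typical call site after the conversion: no direct field access. */
static void inode_init_owner_sketch(struct inode *inode)
{
	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
}
```

Only the two accessors need to change when the fields move; the call sites
touched by this patch stay as they are.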

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 arch/ia64/kernel/perfmon.c|4 ++--
 arch/powerpc/platforms/cell/spufs/inode.c |4 ++--
 drivers/isdn/capi/capifs.c|4 ++--
 drivers/usb/core/inode.c  |4 ++--
 fs/9p/fid.c   |2 +-
 fs/9p/vfs_inode.c |4 ++--
 fs/9p/vfs_super.c |4 ++--
 fs/affs/inode.c   |4 ++--
 fs/anon_inodes.c  |4 ++--
 fs/attr.c |4 ++--
 fs/bfs/dir.c  |4 ++--
 fs/cifs/cifsproto.h   |2 +-
 fs/cifs/dir.c |   12 ++--
 fs/cifs/inode.c   |8 
 fs/cifs/misc.c|4 ++--
 fs/coda/cache.c   |6 +++---
 fs/coda/upcall.c  |4 ++--
 fs/devpts/inode.c |4 ++--
 fs/dquot.c|2 +-
 fs/exec.c |4 ++--
 fs/ext2/balloc.c  |2 +-
 fs/ext2/ialloc.c  |4 ++--
 fs/ext2/ioctl.c   |2 +-
 fs/ext3/balloc.c  |2 +-
 fs/ext3/ialloc.c  |4 ++--
 fs/ext4/balloc.c  |2 +-
 fs/ext4/ialloc.c  |4 ++--
 fs/fuse/dev.c |4 ++--
 fs/gfs2/inode.c   |   10 +-
 fs/hfs/inode.c|4 ++--
 fs/hfsplus/inode.c|4 ++--
 fs/hpfs/namei.c   |   24 
 fs/hugetlbfs/inode.c  |   16 
 fs/jffs2/fs.c |4 ++--
 fs/jfs/jfs_inode.c|4 ++--
 fs/locks.c|2 +-
 fs/minix/bitmap.c |4 ++--
 fs/namei.c|8 
 fs/nfsd/vfs.c |4 ++--
 fs/ocfs2/dlm/dlmfs.c  |8 
 fs/ocfs2/namei.c  |4 ++--
 fs/pipe.c |4 ++--
 fs/posix_acl.c|4 ++--
 fs/ramfs/inode.c  |4 ++--
 fs/reiserfs/namei.c   |4 ++--
 fs/sysv/ialloc.c  |4 ++--
 fs/udf/ialloc.c   |4 ++--
 fs/udf/namei.c|2 +-
 fs/ufs/ialloc.c   |4 ++--
 fs/xfs/linux-2.6/xfs_linux.h  |4 ++--
 fs/xfs/xfs_acl.c  |6 +++---
 fs/xfs/xfs_attr.c |2 +-
 fs/xfs/xfs_inode.c|4 ++--
 fs/xfs/xfs_vnodeops.c |8 
 include/linux/fs.h|2 +-
 include/linux/sched.h |3 +++
 ipc/mqueue.c  |4 ++--
 kernel/cgroup.c   |4 ++--
 mm/shmem.c|8 
 net/9p/client.c   |2 +-
 net/socket.c  |4 ++--
 net/sunrpc/auth.c |8 
 security/commoncap.c  |4 ++--
 security/keys/key.c   |2 +-
 security/keys/keyctl.c|2 +-
 security/keys/request_key.c   |   10 +-
 security/keys/request_key_auth.c  |2 +-
 67 files changed, 160 insertions(+), 157 deletions(-)


diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 78acd9f..9ef832c 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2206,8 +2206,8 @@ pfm_alloc_fd(struct file **cfile)
DPRINT(("new inode ino=%ld @%p\n", inode->i_ino, inode));
 
inode->i_mode = S_IFCHR|S_IRUGO;
-   inode->i_uid  = current->fsuid;
-   inode->i_gid  = current->fsgid;
+   inode->i_uid  = current_fsuid();
+   inode->i_gid  = current_fsgid();
 
sprintf(name, "[%lu]", inode->i_ino);
this.name = name;
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 90784c0..0c3838c 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -85,8 +85,8 @@ spufs_new_inode(struct super_block *sb, int mode)
goto out;
 
inode->i_mode = mode;
-   inode->i_uid = current->fsuid;
-   inode->i_gid = current->fsgid;
+   inode->i_uid = current_fsuid();
+  

[PATCH 09/37] Security: Allow kernel services to override LSM settings for task actions

2008-02-08 Thread David Howells
Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

 (*) security_kernel_act_as() which allows modification of the security datum
 with which a task acts on other objects (most notably files).

 (*) security_create_files_as() which allows modification of the security
 datum that is used to initialise the security data on a file that a task
 creates.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 include/linux/capability.h          |   12 ++--
 include/linux/cred.h                |   23 +++
 include/linux/security.h            |   43 +
 kernel/cred.c                       |  112 +++
 security/dummy.c                    |   17 +
 security/security.c                 |   15 -
 security/selinux/hooks.c            |   51
 security/selinux/include/security.h |    2 -
 security/selinux/ss/services.c      |    5 +-
 9 files changed, 265 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/cred.h


diff --git a/include/linux/capability.h b/include/linux/capability.h
index 7d50ff6..424de01 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -364,12 +364,12 @@ typedef struct kernel_cap_struct {
 # error Fix up hand-coded capability macro initializers
 #else /* HAND-CODED capability initializers */
 
-# define CAP_EMPTY_SET    {{ 0, 0 }}
-# define CAP_FULL_SET     {{ ~0, ~0 }}
-# define CAP_INIT_EFF_SET {{ ~CAP_TO_MASK(CAP_SETPCAP), ~0 }}
-# define CAP_FS_SET       {{ CAP_FS_MASK_B0, CAP_FS_MASK_B1 } }
-# define CAP_NFSD_SET     {{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), \
-                            CAP_FS_MASK_B1 } }
+# define CAP_EMPTY_SET    ((kernel_cap_t){{ 0, 0 }})
+# define CAP_FULL_SET     ((kernel_cap_t){{ ~0, ~0 }})
+# define CAP_INIT_EFF_SET ((kernel_cap_t){{ ~CAP_TO_MASK(CAP_SETPCAP), ~0 }})
+# define CAP_FS_SET       ((kernel_cap_t){{ CAP_FS_MASK_B0, CAP_FS_MASK_B1 } })
+# define CAP_NFSD_SET     ((kernel_cap_t){{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), \
+                                            CAP_FS_MASK_B1 } })
 
 #endif /* _LINUX_CAPABILITY_U32S != 2 */
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
new file mode 100644
index 000..497af5b
--- /dev/null
+++ b/include/linux/cred.h
@@ -0,0 +1,23 @@
+/* Credential management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CRED_H
+#define _LINUX_CRED_H
+
+struct task_security;
+struct inode;
+
+extern struct task_security *get_kernel_security(struct task_struct *);
+extern int set_security_override(struct task_security *, u32);
+extern int set_security_override_from_ctx(struct task_security *, const char *);
+extern int change_create_files_as(struct task_security *, struct inode *);
+
+#endif /* _LINUX_CRED_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index 9bf93c7..1c17b91 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -568,6 +568,19 @@ struct request_sock;
  * Duplicate and attach the security structure currently attached to the
  * p->security field.
  * Return 0 if operation was successful.
+ * @task_kernel_act_as:
+ * Set the credentials for a kernel service to act as (subjective context).
+ * @p points to the task that nominated @secid.
+ * @sec points to the task security record to be modified.
+ * @secid specifies the security ID to be set
+ * Return 0 if successful.
+ * @task_create_files_as:
+ * Set the file creation context in a task security record to be the same
+ * as the objective context of the specified inode.
+ * @p points to the task that nominated @inode.
+ * @sec points to the task security record to be modified.
+ * @inode points to the inode to use as a reference.
+ * Return 0 if successful.
  * @task_setuid:
  * Check permission before setting one or more of the user identity
  * attributes of the current process.  The @flags parameter indicates
@@ -1342,6 +1355,11 @@ struct security_operations {
int (*task_alloc_security) (struct task_struct *p);
void (*task_free_security) (struct task_security *p);
int

[PATCH 00/37] Permit filesystem local caching

2008-02-08 Thread David Howells


These patches add local caching for network filesystems such as NFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

  Three patches to the keyring code made to help the CIFS people.
  Included because of patches 05-08.

  (*) 04-keys-get-label.diff

  A patch to allow the security label of a key to be retrieved.
  Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-kernel_service-class.diff
  (*) 09-security-kernel-service.diff
  (*) 10-security-nfsd.diff

  Patches to permit the subjective security of a task to be overridden.
  All the security details in task_struct are decanted into a new struct
  that task_struct then has two pointers to: one that defines the
  objective security of that task (how other tasks may affect it) and one
  that defines the subjective security (how it may affect other objects).

  Note that I have dropped the idea of struct cred for the moment.  With
  the amount of stuff that was excluded from it, it wasn't actually any
  use to me.  However, it can be added later.

  Required for cachefiles.

  (*) 11-release-page.diff
  (*) 12-fscache-page-flags.diff
  (*) 13-add_wait_queue_tail.diff
  (*) 14-fscache.diff

  Patches to provide a local caching facility for network filesystems.

  (*) 15-cachefiles-ia64.diff
  (*) 16-cachefiles-ext3-f_mapping.diff
  (*) 17-cachefiles-write.diff
  (*) 18-cachefiles-monitor.diff
  (*) 19-cachefiles-export.diff
  (*) 20-cachefiles.diff

  Patches to provide a local cache in a directory of an already mounted
  filesystem.

  (*) 21-nfs-comment.diff
  (*) 22-nfs-fscache-option.diff
  (*) 23-nfs-fscache-kconfig.diff
  (*) 24-nfs-fscache-top-index.diff
  (*) 25-nfs-fscache-server-obj.diff
  (*) 26-nfs-fscache-super-obj.diff
  (*) 27-nfs-fscache-inode-obj.diff
  (*) 28-nfs-fscache-use-inode.diff
  (*) 29-nfs-fscache-invalidate-pages.diff
  (*) 30-nfs-fscache-iostats.diff
  (*) 31-nfs-fscache-page-management.diff
  (*) 32-nfs-fscache-read-context.diff
  (*) 33-nfs-fscache-read-fallback.diff
  (*) 34-nfs-fscache-read-from-cache.diff
  (*) 35-nfs-fscache-store-to-cache.diff
  (*) 36-nfs-fscache-mount.diff
  (*) 37-nfs-fscache-display.diff

  Patches to provide NFS with local caching.

  A couple of questions on the NFS iostat changes: (1) Should I update the
  iostat version number; (2) is it permitted to have conditional iostats?


I've massively split up the NFS patches as requested by Trond Myklebust and
Chuck Lever.  I've also brought the patches up to date with the patch window
turbulence.

--
A tarball of the patches is available at:


http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-29.tar.bz2


To use this version of CacheFiles, cachefilesd-0.9 is also required.  It
is available as an SRPM:

http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
http://people.redhat.com/~dhowells/fscache/cachefilesd.if
http://people.redhat.com/~dhowells/fscache/cachefilesd.te
http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David


[PATCH 08/37] Security: Add a kernel_service object class to SELinux

2008-02-08 Thread David Howells
Add a 'kernel_service' object class to SELinux and give this object class two
access vectors: 'use_as_override' and 'create_files_as'.

The first vector is used to grant a process the right to nominate an alternate
process security ID for the kernel to use as an override for the SELinux
subjective security when accessing stuff on behalf of another process.

For example, CacheFiles when accessing the cache on behalf of a process
accessing an NFS file needs to use a subjective security ID appropriate to the
cache rather than the one the calling process is using.  The cachefilesd
daemon will nominate the security ID to be used.

The second vector is used to grant a process the right to nominate a file
creation label for a kernel service to use.
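
For readers unfamiliar with access vectors, the two new permissions are
single bits within the kernel_service class, and a grant is tested by
masking.  The sketch below uses the bit values as they appear in the
av_permissions.h hunk of this patch; av_permits() is a hypothetical helper,
not SELinux's real checking path, which consults the loaded policy.

```c
#include <assert.h>
#include <stdbool.h>

/* Permission bits as they appear in this patch's av_permissions.h hunk. */
#define KERNEL_SERVICE__USE_AS_OVERRIDE	0x0001UL
#define KERNEL_SERVICE__CREATE_FILES_AS	0x0002UL

/* True only if every requested bit is present in the allowed vector. */
static bool av_permits(unsigned long allowed, unsigned long requested)
{
	assert(requested != 0);
	return (allowed & requested) == requested;
}
```

A daemon granted only use_as_override could therefore nominate a security ID
for the kernel to act as, but could not nominate a file creation label.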

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/selinux/include/av_perm_to_string.h |    2 ++
 security/selinux/include/av_permissions.h    |    2 ++
 security/selinux/include/class_to_string.h   |    1 +
 security/selinux/include/flask.h             |    1 +
 4 files changed, 6 insertions(+), 0 deletions(-)


diff --git a/security/selinux/include/av_perm_to_string.h b/security/selinux/include/av_perm_to_string.h
index 399f868..c68ec9c 100644
--- a/security/selinux/include/av_perm_to_string.h
+++ b/security/selinux/include/av_perm_to_string.h
@@ -168,3 +168,5 @@
S_(SECCLASS_DCCP_SOCKET, DCCP_SOCKET__NAME_CONNECT, "name_connect")
S_(SECCLASS_MEMPROTECT, MEMPROTECT__MMAP_ZERO, "mmap_zero")
S_(SECCLASS_PEER, PEER__RECV, "recv")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__USE_AS_OVERRIDE, "use_as_override")
+   S_(SECCLASS_KERNEL_SERVICE, KERNEL_SERVICE__CREATE_FILES_AS, "create_files_as")
diff --git a/security/selinux/include/av_permissions.h b/security/selinux/include/av_permissions.h
index 84c9abc..41cee9e 100644
--- a/security/selinux/include/av_permissions.h
+++ b/security/selinux/include/av_permissions.h
@@ -833,3 +833,5 @@
 #define DCCP_SOCKET__NAME_CONNECT         0x0080UL
 #define MEMPROTECT__MMAP_ZERO             0x0001UL
 #define PEER__RECV                        0x0001UL
+#define KERNEL_SERVICE__USE_AS_OVERRIDE   0x0001UL
+#define KERNEL_SERVICE__CREATE_FILES_AS   0x0002UL
diff --git a/security/selinux/include/class_to_string.h b/security/selinux/include/class_to_string.h
index b1b0d1d..efe9efa 100644
--- a/security/selinux/include/class_to_string.h
+++ b/security/selinux/include/class_to_string.h
@@ -71,3 +71,4 @@
 S_(NULL)
 S_(NULL)
 S_("peer")
+S_("kernel_service")
diff --git a/security/selinux/include/flask.h b/security/selinux/include/flask.h
index 09e9dd2..2bc251a 100644
--- a/security/selinux/include/flask.h
+++ b/security/selinux/include/flask.h
@@ -51,6 +51,7 @@
 #define SECCLASS_DCCP_SOCKET             60
 #define SECCLASS_MEMPROTECT              61
 #define SECCLASS_PEER                    68
+#define SECCLASS_KERNEL_SERVICE          69
 
 /*
  * Security identifier indices for initial entities



[PATCH 03/37] KEYS: Allow the callout data to be passed as a blob rather than a string

2008-02-08 Thread David Howells
Allow the callout data to be passed as a blob rather than a string for internal
kernel services that call any request_key_*() interface other than
request_key().  request_key() itself still takes a NUL-terminated string.

The functions that change are:

request_key_with_auxdata()
request_key_async()
request_key_async_with_auxdata()
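
The practical difference between the string and blob forms is how the length
is determined.  The userspace sketch below illustrates this with a
hypothetical struct callout and two hypothetical helpers; none of these
names come from the keys code, they only mirror the (pointer, length)
calling convention described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for callout data, copied with an explicit length. */
struct callout {
	void *data;
	size_t len;
};

/* Copy len bytes verbatim; embedded NUL bytes are preserved. */
static int callout_set_blob(struct callout *c, const void *info, size_t len)
{
	assert(c != NULL);
	c->data = NULL;
	c->len = 0;
	if (len == 0)
		return 0;	/* the length may legitimately be 0 */
	c->data = malloc(len);
	if (!c->data)
		return -1;
	memcpy(c->data, info, len);
	c->len = len;
	return 0;
}

/* String form: the length is implied by the terminating NUL. */
static int callout_set_string(struct callout *c, const char *info)
{
	return callout_set_blob(c, info, strlen(info));
}
```

With the four bytes 'a', 0, 'b', 'c', the blob form keeps all four, whereas
the string form stops at the first NUL and keeps one.  That is why binary
callout data needs the explicit length.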

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 Documentation/keys-request-key.txt |   11 +---
 Documentation/keys.txt             |   14 +++---
 include/linux/key.h                |    9 ---
 security/keys/internal.h           |    9 ---
 security/keys/keyctl.c             |    7 -
 security/keys/request_key.c        |   49 ++--
 security/keys/request_key_auth.c   |   12 +
 7 files changed, 70 insertions(+), 41 deletions(-)


diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt
index 266955d..09b55e4 100644
--- a/Documentation/keys-request-key.txt
+++ b/Documentation/keys-request-key.txt
@@ -11,26 +11,29 @@ request_key*():
 
struct key *request_key(const struct key_type *type,
const char *description,
-   const char *callout_string);
+   const char *callout_info);
 
 or:
 
struct key *request_key_with_auxdata(const struct key_type *type,
 const char *description,
-const char *callout_string,
+const char *callout_info,
+size_t callout_len,
 void *aux);
 
 or:
 
struct key *request_key_async(const struct key_type *type,
  const char *description,
- const char *callout_string);
+ const char *callout_info,
+ size_t callout_len);
 
 or:
 
struct key *request_key_async_with_auxdata(const struct key_type *type,
   const char *description,
-  const char *callout_string,
+  const char *callout_info,
+  size_t callout_len,
   void *aux);
 
 Or by userspace invoking the request_key system call:
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 51652d3..b82d38d 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -771,7 +771,7 @@ payload contents" for more information.
 
struct key *request_key(const struct key_type *type,
const char *description,
-   const char *callout_string);
+   const char *callout_info);
 
 This is used to request a key or keyring with a description that matches
 the description specified according to the key type's match function. This
@@ -793,24 +793,28 @@ payload contents" for more information.
 
struct key *request_key_with_auxdata(const struct key_type *type,
 const char *description,
-const char *callout_string,
+const void *callout_info,
+size_t callout_len,
 void *aux);
 
 This is identical to request_key(), except that the auxiliary data is
-passed to the key_type->request_key() op if it exists.
+passed to the key_type->request_key() op if it exists, and the callout_info
+is a blob of length callout_len, if given (the length may be 0).
 
 
 (*) A key can be requested asynchronously by calling one of:
 
struct key *request_key_async(const struct key_type *type,
  const char *description,
- const char *callout_string);
+ const void *callout_info,
+ size_t callout_len);
 
 or:
 
struct key *request_key_async_with_auxdata(const struct key_type *type,
   const char *description,
-  const char *callout_string,
+  const char *callout_info,
+  size_t callout_len,
   void *aux);
 
 which are asynchronous equivalents of request_key() and
diff --git a/include/linux/key.h b/include/linux/key.h
index a70b8a8..163f864 100644
--- a/include/linux/key.h
+++ b/include/li

[PATCH 01/37] KEYS: Increase the payload size when instantiating a key

2008-02-08 Thread David Howells
Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key().  This permits huge CIFS SPNEGO blobs to
be passed around.  The limit is raised to 1MB.  If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.
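
The fallback logic can be sketched in userspace C as below.  The two
allocator hooks are hypothetical stand-ins for kmalloc() and vmalloc() so
that a failure of the first can be simulated; SMALL_LIMIT plays the role of
PAGE_SIZE, and alloc_payload() is an illustrative name, not a function from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_PAYLOAD	(1024 * 1024 - 1)	/* limit from the patch */
#define SMALL_LIMIT	4096			/* stand-in for PAGE_SIZE */

/* Hypothetical allocator hooks standing in for kmalloc() and vmalloc(). */
struct allocators {
	void *(*contig)(size_t);
	void *(*virt)(size_t);
};

/*
 * Try the contiguous allocator first; fall back to the virtual one only
 * for requests larger than SMALL_LIMIT.  *vm records which one was used
 * so the caller can pick the matching free routine.
 */
static void *alloc_payload(const struct allocators *a, size_t plen, bool *vm)
{
	void *p;

	assert(a && vm);
	*vm = false;
	if (plen > MAX_PAYLOAD)
		return NULL;
	p = a->contig(plen);
	if (p || plen <= SMALL_LIMIT)
		return p;
	*vm = true;
	return a->virt(plen);
}

/* Allocator that always fails, for exercising the fallback path. */
static void *always_fail(size_t n)
{
	(void)n;
	return NULL;
}
```

The vm flag matters at cleanup time: as in the patch, the buffer has to be
released with the routine that matches the allocator that produced it.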

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyctl.c |   38 ++
 1 files changed, 30 insertions(+), 8 deletions(-)


diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <linux/vmalloc.h>
 #include 
 #include "internal.h"
 
@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
char type[32], *description;
void *payload;
long ret;
+   bool vm;
 
ret = -EINVAL;
-   if (plen > 32767)
+   if (plen > 1024 * 1024 - 1)
goto error;
 
/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
/* pull the payload in if one was supplied */
payload = NULL;
 
+   vm = false;
if (_payload) {
ret = -ENOMEM;
payload = kmalloc(plen, GFP_KERNEL);
-   if (!payload)
-   goto error2;
+   if (!payload) {
+   if (plen <= PAGE_SIZE)
+   goto error2;
+   vm = true;
+   payload = vmalloc(plen);
+   if (!payload)
+   goto error2;
+   }
 
ret = -EFAULT;
if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 
key_ref_put(keyring_ref);
  error3:
-   kfree(payload);
+   if (!vm)
+   kfree(payload);
+   else
+   vfree(payload);
  error2:
kfree(description);
  error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
key_ref_t keyring_ref;
void *payload;
long ret;
+   bool vm = false;
 
ret = -EINVAL;
-   if (plen > 32767)
+   if (plen > 1024 * 1024 - 1)
goto error;
 
/* the appropriate instantiation authorisation key must have been
@@ -843,8 +856,14 @@ long keyctl_instantiate_key(key_serial_t id,
if (_payload) {
ret = -ENOMEM;
payload = kmalloc(plen, GFP_KERNEL);
-   if (!payload)
-   goto error;
+   if (!payload) {
+   if (plen <= PAGE_SIZE)
+   goto error;
+   vm = true;
+   payload = vmalloc(plen);
+   if (!payload)
+   goto error;
+   }
 
ret = -EFAULT;
if (copy_from_user(payload, _payload, plen) != 0)
@@ -877,7 +896,10 @@ long keyctl_instantiate_key(key_serial_t id,
}
 
 error2:
-   kfree(payload);
+   if (!vm)
+   kfree(payload);
+   else
+   vfree(payload);
 error:
return ret;
 



[PATCH 02/37] KEYS: Check starting keyring as part of search

2008-02-08 Thread David Howells
Check the starting keyring as part of the search to (a) see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario:  User in process A does things that cause things to be
created in its process session keyring.  The user then does an su to
another user and starts a new process, B.  The two processes now
share the same process session keyring.

Process B does an NFS access which results in an upcall to gssd.
When gssd attempts to instantiate the context key (to be linked
into the process session keyring), it is denied access even though it
has an authorization key.

The order of calls is:

    keyctl_instantiate_key()
        lookup_user_key()                       (the default: case)
            search_process_keyrings(current)
                search_process_keyrings(rka->context)   (recursive call)
                    keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the
top-level keyring it is given, but that top-level keyring is neither
fully validated nor checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for
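
The two checks can be illustrated with a small userspace model. This is a sketch, not the kernel code: struct model_key, the MODEL_* flag bits and check_top_keyring() are hypothetical names, and strcmp() stands in for the key type comparison and the type's match function. The error codes follow the ones the patch assigns.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical flag bits; the kernel uses KEY_FLAG_* bit numbers. */
#define MODEL_REVOKED  (1UL << 0)
#define MODEL_NEGATIVE (1UL << 1)

struct model_key {
	const char *type;
	const char *description;
	unsigned long flags;
	time_t expiry;		/* 0 means the key never expires */
};

/*
 * Returns 1 if the top-level keyring is itself the match, 0 if the caller
 * should go on to search its contents, or a negative errno if the keyring
 * must not be used.
 */
static int check_top_keyring(const struct model_key *keyring,
			     const char *type, const char *description,
			     time_t now)
{
	unsigned long kflags = keyring->flags;
	bool expired = keyring->expiry && now >= keyring->expiry;

	/* first: is the top-level keyring the thing being searched for? */
	if (strcmp(keyring->type, type) == 0 &&
	    strcmp(keyring->description, description) == 0) {
		if (kflags & MODEL_REVOKED)
			return -EAGAIN;
		if (expired)
			return -EAGAIN;
		if (kflags & MODEL_NEGATIVE)
			return -ENOKEY;
		return 1;
	}

	/* otherwise it must still be valid before we descend into it */
	if ((kflags & (MODEL_REVOKED | MODEL_NEGATIVE)) || expired)
		return -EAGAIN;

	return 0;
}
```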


Signed-off-by: Kevin Coffman <[EMAIL PROTECTED]>
Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 security/keys/keyring.c |   35 +++
 1 files changed, 31 insertions(+), 4 deletions(-)


diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 
struct keyring_list *keylist;
struct timespec now;
-   unsigned long possessed;
+   unsigned long possessed, kflags;
struct key *keyring, *key;
key_ref_t key_ref;
long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
now = current_kernel_time();
err = -EAGAIN;
sp = 0;
+   
+   /* firstly we should check to see if this top-level keyring is what we
+* are looking for */
+   key_ref = ERR_PTR(-EAGAIN);
+   kflags = keyring->flags;
+   if (keyring->type == type && match(keyring, description)) {
+   key = keyring;
+
+   /* check it isn't negative and hasn't expired or been
+* revoked */
+   if (kflags & (1 << KEY_FLAG_REVOKED))
+   goto error_2;
+   if (key->expiry && now.tv_sec >= key->expiry)
+   goto error_2;
+   key_ref = ERR_PTR(-ENOKEY);
+   if (kflags & (1 << KEY_FLAG_NEGATIVE))
+   goto error_2;
+   goto found;
+   }
+
+   /* otherwise, the top keyring must not be revoked, expired, or
+* negatively instantiated if we are to search it */
+   key_ref = ERR_PTR(-EAGAIN);
+   if (kflags & ((1 << KEY_FLAG_REVOKED) | (1 << KEY_FLAG_NEGATIVE)) ||
+   (keyring->expiry && now.tv_sec >= keyring->expiry))
+   goto error_2;
 
/* start processing a new keyring */
 descend:
@@ -331,13 +357,14 @@ descend:
/* iterate through the keys in this keyring first */
for (kix = 0; kix < keylist->nkeys; kix++) {
key = keylist->keys[kix];
+   kflags = key->flags;
 
/* ignore keys not of this type */
if (key->type != type)
continue;
 
/* skip revoked keys and expired keys */
-   if (test_bit(KEY_FLAG_REVOKED, &key->flags))
+   if (kflags & (1 << KEY_FLAG_REVOKED))
continue;
 
if (key->expiry && now.tv_sec >= key->expiry)
@@ -352,8 +379,8 @@ descend:
context, KEY_SEARCH) < 0)
continue;
 
-   /* we set a different error code if we find a negative key */
-   if (test_bit(KEY_FLAG_NEGATIVE, &key->flags)) {
+   /* we set a different error code if we pass a negative key */
+   if (kflags & (1 << KEY_FLAG_NEGATIVE)) {
err = -ENOKEY;
continue;
}



Re: [RFC] ext3 freeze feature

2008-02-08 Thread Christoph Hellwig
On Fri, Feb 08, 2008 at 08:26:57AM -0500, Andreas Dilger wrote:
> You may as well make the common ioctl the same as the XFS version,
> both by number and parameters, so that applications which already
> understand the XFS ioctl will work on other filesystems.

Yes.  In fact you should be able to lift the implementations of
XFS_IOC_FREEZE and XFS_IOC_THAW to generic code, there's nothing
XFS-specific in there.



Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread J. Bruce Fields
On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> On Thu, 7 Feb 2008 18:26:18 -0500
> "J. Bruce Fields" <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > Hello!
> > >
> > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > >
> > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > >>> The problem seems to be with the fact that the client and server are
> > >>> on the same machine. This test works fine with or without an
> > >>> underlying fs that supports locking when the client and the server
> > >>> are on different machines. Like you said, the server is trying to
> > >>> send the grant message to the client but for some reason it fails
> > >>> when the client is on the same machine.
> > >> That *shouldn't* make a difference, so we need to take another look at
> > >> this--Oleg, this problem is still unfixed, right?
> > >
> > > Yes, I just pulled your latest nfs tree and I still can reproduce the  
> > > problem.
> > 
> > OK, we have finally reproduced this problem here, and David's working on
> > debugging.  It does indeed seem to only be reproducible with client and
> > server on the same machine.  Thanks for the report.
> > 
> > --b.
> 
> It might be worth testing this both with and without the patchset I
> posted to linux-nfs recently to take care of the lockd hang. If
> lockd is stuck trying to rpc_ping itself then it probably would hang
> like this, wouldn't it?

Of course!  Yes, that fits.

--b.


Re: [RFC] ext3 freeze feature

2008-02-08 Thread Andreas Dilger
On Feb 08, 2008  19:48 +0900, Takashi Sato wrote:
> OK, I would like to implement the freeze feature in the VFS
> as a filesystem-independent ioctl so that it is available on
> filesystems that already have write_super_lockfs() and unlockfs().
> The usage of the freeze ioctl is as follows.
>  int ioctl(int fd, int FIFREEZE, long *timeval);
>    fd: file descriptor of the mountpoint
>    FIFREEZE: request code for freeze
>    timeval: timeout period (seconds)
>
> And the unfreeze ioctl is as follows.
>  int ioctl(int fd, int FITHAW, NULL);
>    fd: file descriptor of the mountpoint
>    FITHAW: request code for unfreeze

You may as well make the common ioctl the same as the XFS version,
both by number and parameters, so that applications which already
understand the XFS ioctl will work on other filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: NFS client hang on attempt to do async blocking posix lock enqueue

2008-02-08 Thread Jeff Layton
On Thu, 7 Feb 2008 18:26:18 -0500
"J. Bruce Fields" <[EMAIL PROTECTED]> wrote:

> On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > Hello!
> >
> > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> >
> >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> >>> The problem seems to be with the fact that the client and server are
> >>> on the same machine. This test works fine with or without an
> >>> underlying fs that supports locking when the client and the server
> >>> are on different machines. Like you said, the server is trying to
> >>> send the grant message to the client but for some reason it fails
> >>> when the client is on the same machine.
> >> That *shouldn't* make a difference, so we need to take another look at
> >> this--Oleg, this problem is still unfixed, right?
> >
> > Yes, I just pulled your latest nfs tree and I still can reproduce the  
> > problem.
> 
> OK, we have finally reproduced this problem here, and David's working on
> debugging.  It does indeed seem to only be reproducible with client and
> server on the same machine.  Thanks for the report.
> 
> --b.

It might be worth testing this both with and without the patchset I
posted to linux-nfs recently to take care of the lockd hang. If
lockd is stuck trying to rpc_ping itself then it probably would hang
like this, wouldn't it?

-- 
Jeff Layton <[EMAIL PROTECTED]>


Re: [RFC] ext3 freeze feature

2008-02-08 Thread Takashi Sato

Hi,

Ted wrote:

> And I do agree that we probably should just implement this in
> filesystem independent way, in which case all of the filesystems that
> support this already have super_operations functions
> write_super_lockfs() and unlockfs().
>
> So if this is done using a new system call, there should be no
> filesystem-specific changes needed, and all filesystems which support
> those super_operations method functions would be able to provide this
> functionality to the new system call.


OK, I would like to implement the freeze feature in the VFS
as a filesystem-independent ioctl so that it is available on
filesystems that already have write_super_lockfs() and unlockfs().
The usage of the freeze ioctl is as follows.
 int ioctl(int fd, int FIFREEZE, long *timeval);
   fd: file descriptor of the mountpoint
   FIFREEZE: request code for freeze
   timeval: timeout period (seconds)

And the unfreeze ioctl is as follows.
 int ioctl(int fd, int FITHAW, NULL);
   fd: file descriptor of the mountpoint
   FITHAW: request code for unfreeze
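
From userspace, a caller of such an interface might look like the sketch below. The thread does not fix the request numbers, so this assumes the FIFREEZE and FITHAW definitions that later kernel headers provide in <linux/fs.h>; fs_freeze() and fs_thaw() are hypothetical wrapper names, and a given kernel may ignore the timeout argument entirely.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FIFREEZE and FITHAW, on kernels that define them */

/*
 * Ask the kernel to freeze the filesystem that fd lives on.
 * timeout_sec follows the proposal (timeout in seconds, passed by pointer).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int fs_freeze(int fd, long timeout_sec)
{
	return ioctl(fd, FIFREEZE, &timeout_sec);
}

/* Thaw a previously frozen filesystem. Returns 0 or -1 with errno set. */
static int fs_thaw(int fd)
{
	return ioctl(fd, FITHAW, NULL);
}
```

A caller would open the mountpoint, call fs_freeze(fd, 60), take its snapshot or backup, and then call fs_thaw(fd), checking errno after any failure.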

I think we need a timeout feature which thaws the filesystem
after a specified time has elapsed, as a fail-safe in case the freezer
accesses the frozen filesystem and causes a deadlock.
I intend to implement the timeout feature in the VFS.
(This is done by queuing delayed work which calls thaw_bdev()
on the delayed work queue.)
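
The intended fail-safe behaviour can be modelled in a few lines of userspace C. This is only an illustration: struct fs_model and the model_* functions are hypothetical, and model_timer_tick() stands in for the delayed work item that would call thaw_bdev() in the kernel.

```c
#include <stdbool.h>
#include <time.h>

struct fs_model {
	bool frozen;
	time_t thaw_at;		/* 0 means no timeout is armed */
};

/* Freeze, and arm the fail-safe if a positive timeout was given. */
static void model_freeze(struct fs_model *fs, time_t now, long timeout_sec)
{
	fs->frozen = true;
	fs->thaw_at = timeout_sec > 0 ? now + timeout_sec : 0;
}

/* Explicit thaw; also disarms any pending timeout. */
static void model_thaw(struct fs_model *fs)
{
	fs->frozen = false;
	fs->thaw_at = 0;
}

/* Stands in for the delayed work: thaw once the deadline has passed. */
static void model_timer_tick(struct fs_model *fs, time_t now)
{
	if (fs->frozen && fs->thaw_at != 0 && now >= fs->thaw_at)
		model_thaw(fs);
}
```

With a timeout armed, a freezer that deadlocks on the frozen filesystem is released when the deadline passes; with a timeout of 0 the filesystem stays frozen until an explicit thaw.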

Any comments are very welcome.

Cheers, Takashi