Re: [PATCH 4/4] fs/unionfs/: Don't duplicate the struct nameidata

2007-01-30 Thread Christoph Hellwig
On Mon, Jan 29, 2007 at 03:37:42PM -0500, Josef 'Jeff' Sipek wrote:
 The only fields that we have to watch out for are the dentry and vfsmount.
 Additionally, this makes Unionfs gentler on the stack as nameidata is rather
 large.

That's onviously not true at all.  To handle any filesystems using intents
(e.g. NFSv4) you need to do much more.  Then again doing things correctly
doesn't seem to be interesting to the stackable filesystems crowd an this
problem has been constantly ignored over the last year, including merging
ecryptfs which has been broken in the same way.

Folks, if you want your stackable filesystem toys taken seriously you
need to fix these kind of issues instead of talking them away.  And yes,
this will involve some surgery to the VFS.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] have pipefs ensure i_ino uniqueness by calling iunique and hashing the inode

2007-01-30 Thread Jeff Layton

Eric Dumazet wrote:

 For dentry name, we certainly could use [address of inode] instead
 of [inode number] to get unicity, but do we care ?

 For st_ino values on pipefs and sockets, I doubt any user application would
 care. I never had to fstat() a socket fd. Of course it's a file descriptor,
 but all we really want to do with this kind of file descriptor is to call
 socket API.

 And for some heavy loaded internet servers , the additional cost of
 insert/delete a node in a machine shared tree could be a problem.


Granted, I've never had to stat a pipe fd either, so maybe in the big scheme
of things, it's not that important. Still, I think we can probably do this
without a great deal of added complexity or performance impact.

If changing the stuff between the brackets in the dentry name to something
besides inode number is OK, then we can defer the assignment of the inode
number until it's actually needed. For pipefs calls, this means we can only
assign an inode number when a stat call is actually done. So anyone who needs
that info can get it, and anyone who doesn't care about it shouldn't be
greatly impacted by it.

I'll be following up this email with a couple of patches for comment...

-- Jeff

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] make pipefs do lazy i_ino assignment and hashing

2007-01-30 Thread Jeff Layton
This patch updates pipefs to do defer assigning an i_ino value to its inodes
until someone actually tries to stat it. This allows us to have unique i_ino
values for the inodes here, without the performance impact for anyone who
doesn't actually care about it.

Since we don't have an i_ino value at pipe creation time, we need something
else to stuff into the dentry name. Here, I'm using the pointer address of
the inode xor'ed with a random value. There are certainly better hashing
schemes, so if someone wants to propose a better way to do this, then I'm
open to looking at it (maybe halfmd4?).

This patch should apply cleanly to the current -mm.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/fs/pipe.c b/fs/pipe.c
index 9b3cb34..76dc0e1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@
 #include linux/uio.h
 #include linux/highmem.h
 #include linux/pagemap.h
+#include linux/random.h
 
 #include asm/uaccess.h
 #include asm/ioctls.h
@@ -35,6 +36,8 @@
  * -- Manfred Spraul [EMAIL PROTECTED] 2002-05-09
  */
 
+static unsigned long pipefs_dname_obfuscator;
+
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct pipe_inode_info *pipe)
 {
@@ -844,6 +847,10 @@ static struct dentry_operations pipefs_dentry_operations = 
{
.d_delete   = pipefs_delete_dentry,
 };
 
+static struct inode_operations pipefs_inode_operations = {
+   .getattr= lazy_getattr,
+};
+
 static struct inode * get_pipe_inode(void)
 {
struct inode *inode = new_inode(pipe_mnt-mnt_sb);
@@ -859,6 +866,7 @@ static struct inode * get_pipe_inode(void)
 
pipe-readers = pipe-writers = 1;
inode-i_fop = rdwr_pipe_fops;
+   inode-i_op = pipefs_inode_operations;
 
/*
 * Mark the inode dirty from the very beginning,
@@ -871,8 +879,7 @@ static struct inode * get_pipe_inode(void)
inode-i_uid = current-fsuid;
inode-i_gid = current-fsgid;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
-   inode-i_ino = iunique(pipe_mnt-mnt_sb, 1);
-   insert_inode_hash(inode);
+   inode-i_ino = 0;
 
return inode;
 
@@ -900,7 +907,8 @@ struct file *create_write_pipe(void)
if (!inode)
goto err_file;
 
-   this.len = sprintf(name, [%lu], inode-i_ino);
+   this.len = sprintf(name, [%lu],
+  (unsigned long) inode ^ pipefs_dname_obfuscator);
this.name = name;
this.hash = 0;
err = -ENOMEM;
@@ -1033,6 +1041,8 @@ static int __init init_pipe_fs(void)
 {
int err = register_filesystem(pipe_fs_type);
 
+   get_random_bytes(pipefs_dname_obfuscator, sizeof(unsigned long));
+
if (!err) {
pipe_mnt = kern_mount(pipe_fs_type);
if (IS_ERR(pipe_mnt)) {
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] add lazy_getattr and lazy_readdir patches that defer i_ino assignment

2007-01-30 Thread Jeff Layton
This patch adds 2 new libfs functions that allow for us to defer assignment of
an i_ino value until such time that it's actually used. This allows us to
ensure uniqueness without actually impacting the cases that don't really care
about it. With this i_ino == 0 has a special meaning of unassigned.

I have 2 things I'm not sure of with this patch. The first is the #defined
value for MAX_RESERVED_INODE. I'm not thrilled with that, but didn't see a way
to pass in a max_reserved value for iunique through the standard filesystem
call (maybe we could hang the value off of the superblock, but I kind of
don't like doing that either).

The other thing I'm not sure of is whether the locking I'm doing in case 1 of
__dcache_readdir (where I replace the parent_ino call) is correct.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/fs/libfs.c b/fs/libfs.c
index 9b77f89..f6cd382 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -11,6 +11,17 @@
 
 #include asm/uaccess.h
 
+/* highest inode number that should be assigned by simple_fill_super */
+#define MAX_RESERVED_INODE 100
+
+static inline void lazy_hash_inode(struct inode *inode)
+{
+   if (unlikely(inode-i_ino == 0)) {
+   inode-i_ino = iunique(inode-i_sb, MAX_RESERVED_INODE);
+   insert_inode_hash(inode);
+   }
+}
+
 int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
   struct kstat *stat)
 {
@@ -20,6 +31,13 @@ int simple_getattr(struct vfsmount *mnt, struct dentry 
*dentry,
return 0;
 }
 
+int lazy_getattr(struct vfsmount *mnt, struct dentry *dentry,
+  struct kstat *stat)
+{
+   lazy_hash_inode(dentry-d_inode);
+   return simple_getattr(mnt, dentry, stat);
+}
+
 int simple_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
buf-f_type = dentry-d_sb-s_magic;
@@ -124,24 +142,34 @@ static inline unsigned char dt_type(struct inode *inode)
  * both impossible due to the lock on directory.
  */
 
-int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
+static int __dcache_readdir(struct file * filp, void * dirent,
+   filldir_t filldir, bool lazy)
 {
struct dentry *dentry = filp-f_path.dentry;
struct dentry *cursor = filp-private_data;
+   struct inode *inode;
struct list_head *p, *q = cursor-d_u.d_child;
ino_t ino;
int i = filp-f_pos;
 
switch (i) {
case 0:
-   ino = dentry-d_inode-i_ino;
+   inode = dentry-d_inode;
+   if (lazy)
+   lazy_hash_inode(inode);
+   ino = inode-i_ino;
if (filldir(dirent, ., 1, i, ino, DT_DIR)  0)
break;
filp-f_pos++;
i++;
/* fallthrough */
case 1:
-   ino = parent_ino(dentry);
+   spin_lock(dentry-d_lock);
+   inode = dentry-d_parent-d_inode;
+   spin_unlock(dentry-d_lock);
+   if (lazy)
+   lazy_hash_inode(inode);
+   ino = inode-i_ino;
if (filldir(dirent, .., 2, i, ino, DT_DIR)  0)
break;
filp-f_pos++;
@@ -155,11 +183,17 @@ int dcache_readdir(struct file * filp, void * dirent, 
filldir_t filldir)
for (p=q-next; p != dentry-d_subdirs; p=p-next) {
struct dentry *next;
next = list_entry(p, struct dentry, 
d_u.d_child);
-   if (d_unhashed(next) || !next-d_inode)
+   inode = next-d_inode;
+   if (d_unhashed(next) || !inode)
continue;
 
spin_unlock(dcache_lock);
-   if (filldir(dirent, next-d_name.name, 
next-d_name.len, filp-f_pos, next-d_inode-i_ino, dt_type(next-d_inode))  
0)
+
+   if (lazy)
+   lazy_hash_inode(inode);
+   if (filldir(dirent, next-d_name.name,
+   next-d_name.len, filp-f_pos,
+   inode-i_ino, dt_type(inode))  0)
return 0;
spin_lock(dcache_lock);
/* next is still alive */
@@ -172,6 +206,16 @@ int dcache_readdir(struct file * filp, void * dirent, 
filldir_t filldir)
return 0;
 }
 
+int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
+{
+   return __dcache_readdir(filp, dirent, filldir, false);
+}
+
+int lazy_readdir(struct file * filp, void * dirent, filldir_t filldir)
+{

[PATCH] make iunique use a do/while loop rather than its obscure goto loop

2007-01-30 Thread Jeffrey Layton
While working on a case, Christoph mentioned that he thought that iunique
ought to be cleaned up to use a more conventional loop construct. This patch
does that, turning the strange goto loop into a do/while.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index 23fc1fd..90e7587 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -689,21 +689,18 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
struct inode *inode;
struct hlist_head * head;
ino_t res;
+
spin_lock(inode_lock);
-retry:
-   if (counter  max_reserved) {
-   head = inode_hashtable + hash(sb,counter);
+   do {
+   if (counter = max_reserved)
+   counter = max_reserved + 1;
res = counter++;
+   head = inode_hashtable + hash(sb, res);
inode = find_inode_fast(sb, head, res);
-   if (!inode) {
-   spin_unlock(inode_lock);
-   return res;
-   }
-   } else {
-   counter = max_reserved + 1;
-   }
-   goto retry;
-   
+   } while (inode != NULL);
+   spin_unlock(inode_lock);
+
+   return res;
 }
 
 EXPORT_SYMBOL(iunique);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] add lazy_getattr and lazy_readdir patches that defer i_ino assignment

2007-01-30 Thread Jeff Layton

Kirill Korotaev wrote:

Jeff,

taking into account the discussion about unawarness/uncertainty
of whether *unique* inode number is needed at all on pipe fds and such
do we need this at all?

Thanks,
Kirill



Fair enough, perhaps we should just not worry about it, and assume that there 
might be collisions.


If so, I should probably just have Andrew withdraw the patch I submitted earlier 
to hash the inodes for pipefs. I'll look at other callers of new_inode and fix 
up any of the ones that need fixing.


Does that seem like the most reasonable approach?

-- Jeff
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] add lazy_getattr and lazy_readdir patches that defer i_ino assignment

2007-01-30 Thread Kirill Korotaev
Jeff,

taking into account the discussion about unawarness/uncertainty
of whether *unique* inode number is needed at all on pipe fds and such
do we need this at all?

Thanks,
Kirill

 
 
 Fair enough, perhaps we should just not worry about it, and assume that there 
 might be collisions.
 
 If so, I should probably just have Andrew withdraw the patch I submitted 
 earlier 
 to hash the inodes for pipefs. I'll look at other callers of new_inode and 
 fix 
 up any of the ones that need fixing.
 
 Does that seem like the most reasonable approach?
yep!

Thanks,
Kirill

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] fs/unionfs/: Don't duplicate the struct nameidata

2007-01-30 Thread Josef Sipek
On Tue, Jan 30, 2007 at 09:42:33AM +, Christoph Hellwig wrote:
 On Mon, Jan 29, 2007 at 03:37:42PM -0500, Josef 'Jeff' Sipek wrote:
  The only fields that we have to watch out for are the dentry and vfsmount.
  Additionally, this makes Unionfs gentler on the stack as nameidata is rather
  large.
 
 That's onviously not true at all.  To handle any filesystems using intents
 (e.g. NFSv4) you need to do much more.  Then again doing things correctly
 doesn't seem to be interesting to the stackable filesystems crowd an this
 problem has been constantly ignored over the last year, including merging
 ecryptfs which has been broken in the same way.
 
 Folks, if you want your stackable filesystem toys taken seriously you
 need to fix these kind of issues instead of talking them away.  And yes,
 this will involve some surgery to the VFS.

Indeed. I asked around (#linuxfs) and it seemed that restoring the
dentry/vfsmount was sufficient for the purpose of passing intents down. If
this is not the case, I'll revert the patch to do the full namei allocation.

Josef Jeff Sipek.

-- 
Computer Science is no more about computers than astronomy is about
telescopes.
- Edsger Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] revert changes to pipefs for i_ino uniqueness

2007-01-30 Thread Jeff Layton
Since posting that patch to add i_ino uniqueness to pipefs, I've gotten some
comments that have convinced me that we should probably leave it as is for
now (at least until someone we a good reason).

So, please back out that patch from -mm.

The patch below should revert pipefs to its original state.

Thanks,

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/fs/pipe.c b/fs/pipe.c
index 9b3cb34..68090e8 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -871,8 +871,6 @@ static struct inode * get_pipe_inode(void)
inode-i_uid = current-fsuid;
inode-i_gid = current-fsgid;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
-   inode-i_ino = iunique(pipe_mnt-mnt_sb, 1);
-   insert_inode_hash(inode);
 
return inode;
 
@@ -1005,11 +1003,6 @@ int do_pipe(int *fd)
return error;
 }
 
-static struct super_operations pipefs_sops = {
-   .statfs = simple_statfs,
-   .drop_inode = generic_delete_inode,
-};
-
 /*
  * pipefs should _never_ be mounted by userland - too much of security hassle,
  * no real gain from having the whole whorehouse mounted. So we don't need
@@ -1020,7 +1013,7 @@ static int pipefs_get_sb(struct file_system_type *fs_type,
 int flags, const char *dev_name, void *data,
 struct vfsmount *mnt)
 {
-   return get_sb_pseudo(fs_type, pipe:, pipefs_sops, PIPEFS_MAGIC, mnt);
+   return get_sb_pseudo(fs_type, pipe:, NULL, PIPEFS_MAGIC, mnt);
 }
 
 static struct file_system_type pipe_fs_type = {
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] dlm/gfs2: indent help text

2007-01-30 Thread Randy Dunlap
From: Randy Dunlap [EMAIL PROTECTED]

Indent help text as expected.

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
---
 fs/dlm/Kconfig  |   16 
 fs/gfs2/Kconfig |   47 +++
 2 files changed, 31 insertions(+), 32 deletions(-)

--- linux-2620-rc6.orig/fs/dlm/Kconfig
+++ linux-2620-rc6/fs/dlm/Kconfig
@@ -7,17 +7,17 @@ config DLM
select CONFIGFS_FS
select IP_SCTP if DLM_SCTP
help
-   A general purpose distributed lock manager for kernel or userspace
-   applications.
+ A general purpose distributed lock manager for kernel or userspace
+ applications.
 
 choice
prompt Select DLM communications protocol
depends on DLM
default DLM_TCP
help
-   The DLM Can use TCP or SCTP for it's network communications.
-   SCTP supports multi-homed operations whereas TCP doesn't.
-   However, SCTP seems to have stability problems at the moment.
+ The DLM Can use TCP or SCTP for it's network communications.
+ SCTP supports multi-homed operations whereas TCP doesn't.
+ However, SCTP seems to have stability problems at the moment.
 
 config DLM_TCP
bool TCP/IP
@@ -31,8 +31,8 @@ config DLM_DEBUG
bool DLM debugging
depends on DLM
help
-   Under the debugfs mount point, the name of each lockspace will
-   appear as a file in the dlm directory.  The output is the
-   list of resource and locks the local node knows about.
+ Under the debugfs mount point, the name of each lockspace will
+ appear as a file in the dlm directory.  The output is the
+ list of resource and locks the local node knows about.
 
 endmenu
--- linux-2620-rc6.orig/fs/gfs2/Kconfig
+++ linux-2620-rc6/fs/gfs2/Kconfig
@@ -4,33 +4,33 @@ config GFS2_FS
select FS_POSIX_ACL
select CRC32
help
-   A cluster filesystem.
+ A cluster filesystem.
 
-   Allows a cluster of computers to simultaneously use a block device
-   that is shared between them (with FC, iSCSI, NBD, etc...).  GFS reads
-   and writes to the block device like a local filesystem, but also uses
-   a lock module to allow the computers coordinate their I/O so
-   filesystem consistency is maintained.  One of the nifty features of
-   GFS is perfect consistency -- changes made to the filesystem on one
-   machine show up immediately on all other machines in the cluster.
-
-   To use the GFS2 filesystem, you will need to enable one or more of
-   the below locking modules. Documentation and utilities for GFS2 can
-   be found here: http://sources.redhat.com/cluster
+ Allows a cluster of computers to simultaneously use a block device
+ that is shared between them (with FC, iSCSI, NBD, etc...).  GFS reads
+ and writes to the block device like a local filesystem, but also uses
+ a lock module to allow the computers coordinate their I/O so
+ filesystem consistency is maintained.  One of the nifty features of
+ GFS is perfect consistency -- changes made to the filesystem on one
+ machine show up immediately on all other machines in the cluster.
+
+ To use the GFS2 filesystem, you will need to enable one or more of
+ the below locking modules. Documentation and utilities for GFS2 can
+ be found here: http://sources.redhat.com/cluster
 
 config GFS2_FS_LOCKING_NOLOCK
tristate GFS2 \nolock\ locking module
depends on GFS2_FS
help
-   Single node locking module for GFS2.
+ Single node locking module for GFS2.
 
-   Use this module if you want to use GFS2 on a single node without
-   its clustering features. You can still take advantage of the
-   large file support, and upgrade to running a full cluster later on
-   if required.
+ Use this module if you want to use GFS2 on a single node without
+ its clustering features. You can still take advantage of the
+ large file support, and upgrade to running a full cluster later on
+ if required.
 
-   If you will only be using GFS2 in cluster mode, you do not need this
-   module.
+ If you will only be using GFS2 in cluster mode, you do not need this
+ module.
 
 config GFS2_FS_LOCKING_DLM
tristate GFS2 DLM locking module
@@ -39,9 +39,8 @@ config GFS2_FS_LOCKING_DLM
select CONFIGFS_FS
select DLM
help
-   Multiple node locking module for GFS2
-
-   Most users of GFS2 will require this module. It provides the locking
-   interface between GFS2 and the DLM, which is required to use GFS2
-   in a cluster environment.
+ Multiple node locking module for GFS2
 
+ Most users of GFS2 will require this module. It provides the locking
+ interface between GFS2 and the DLM, which is required to use GFS2
+ in a 

Re: [patch 0/9] buffered write deadlock fix

2007-01-30 Thread Andrew Morton
On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin [EMAIL PROTECTED] wrote:

 The following set of patches attempt to fix the buffered write
 locking problems 

y'know, four or five years back I fixed this bug by doing

current-locked_page = page;

in the write() code, and then teaching the pagefault code to avoid locking
the same page.  Patch below.

But then evil mean Hugh pointed out that the patch is still vulnerable to
ab/ba deadlocking so I dropped it.

But Hugh lied!  There is no ab/ba deadlock because both a and b need
i_mutex to get into the write() path.

This approach doesn't fix the writev() performance regresson which
nobody has measured yet but which the NFS guys reported.

But I think with this fix in place we can go back to a modified version of
the 2.6.17 filemap.c code and get that performance back, but I haven't
thought about that.

It's a heck of a lot simpler than your patches though ;)





There is a longstanding problem in which the kernel can deadlock if an
application performs a write(bf, buf, n), and `buf' points at a mmapped
page of `fd', and that page is the pagecache page into which this write
will write, and the page is not yet present.

The kernel will put a new page into pagecache, lock it, run
copy_from_user().  The copy will fault and filemap_nopage() will try to
lock the page in preparation for reading it in.  But it was already
locked up in generic_file_write().

We have had a workaround in there - touch the userspace address by hand
(to fault it in) before locking the pagecache page, and hope that it
doesn't get evicted before we try to lock it.

But there's a place in the kmap_atomic-for-generic_file_write code
where this race is detectable.  I put a printk in there and saw a
stream of them during heavy testing.  So the workaround doesn't work.


What this patch does is

- within generic_file_write(): record the locked page in current-locked_page

- within filemap_nopage, look to see if the page which we're faulting
  in is equal to current-locked_page.

  If it is, taken special action: instead of locking the page,
  reading it and unlocking it, we assume that the page is already
  locked.  Bring it uptodate and relock it before returning.


I tested this by disabling the fault-it-in-by-hand workaround, writing
a special test app and adding several printks.  Not pretty, but it does
the job.




 include/linux/sched.h |2 ++
 mm/filemap.c  |   26 ++
 2 files changed, 24 insertions(+), 4 deletions(-)

--- linux-mnm/mm/filemap.c~write-deadlock   Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/mm/filemap.c Thu Oct 31 22:42:38 2002
@@ -997,6 +997,7 @@ struct page * filemap_nopage(struct vm_a
struct page *page;
unsigned long size, pgoff, endoff;
int did_readahead;
+   struct page *locked_page = current-locked_page;
 
pgoff = ((address - area-vm_start)  PAGE_CACHE_SHIFT) + 
area-vm_pgoff;
endoff = ((area-vm_end - area-vm_start)  PAGE_CACHE_SHIFT) + 
area-vm_pgoff;
@@ -1092,10 +1093,12 @@ no_cached_page:
 
 page_not_uptodate:
inc_page_state(pgmajfault);
-   lock_page(page);
+   if (page != locked_page)
+   lock_page(page);
 
/* Did it get unhashed while we waited for it? */
if (!page-mapping) {
+   BUG_ON(page == locked_page);/* can't happen: i_sem */
unlock_page(page);
page_cache_release(page);
goto retry_all;
@@ -1103,12 +1106,15 @@ page_not_uptodate:
 
/* Did somebody else get it up-to-date? */
if (PageUptodate(page)) {
-   unlock_page(page);
+   if (page != locked_page)
+   unlock_page(page);
goto success;
}
 
if (!mapping-a_ops-readpage(file, page)) {
wait_on_page_locked(page);
+   if (page == locked_page)
+   lock_page(page);
if (PageUptodate(page))
goto success;
}
@@ -1119,10 +1125,12 @@ page_not_uptodate:
 * because there really aren't any performance issues here
 * and we need to check for errors.
 */
-   lock_page(page);
+   if (page != locked_page)
+   lock_page(page);
 
/* Somebody truncated the page on us? */
if (!page-mapping) {
+   BUG_ON(page == locked_page);/* can't happen: i_sem */
unlock_page(page);
page_cache_release(page);
goto retry_all;
@@ -1130,12 +1138,15 @@ page_not_uptodate:
 
/* Somebody else successfully read it in? */
if (PageUptodate(page)) {
-   unlock_page(page);
+   if (page != locked_page)
+   unlock_page(page);
goto success;
}
ClearPageError(page);
if (!mapping-a_ops-readpage(file, page)) {
wait_on_page_locked(page);
+

Re: [PATCH] pipefs unique inode numbers

2007-01-30 Thread Linus Torvalds


On Tue, 30 Jan 2007, Bodo Eggert wrote:

 change pipefs to use a unique inode number equal to the memory
 address unless it would be truncated.

I *really* wouldn't want to expose kernel addresses to user space, it just 
ends up being a piece of data that they shouldn't have. If we have some 
security issue, this is just too much kernel information that a bad user 
could get at.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pipefs unique inode numbers

2007-01-30 Thread Jeff Layton

Linus Torvalds wrote:


On Tue, 30 Jan 2007, Bodo Eggert wrote:

change pipefs to use a unique inode number equal to the memory
address unless it would be truncated.


I *really* wouldn't want to expose kernel addresses to user space, it just 
ends up being a piece of data that they shouldn't have. If we have some 
security issue, this is just too much kernel information that a bad user 
could get at.


Linus


Agreed. That was my reasoning for proposing the earlier patch that xor'ed
it with a random value (though that's pretty thin protection too).

I think in hindsight though, just pulling the patch that hashes pipefs
inodes is probably the best thing for now. At some point in the future,
if we decide it's enough of a problem, we can always revisit it.

I'm still planning to look over other callers of new_inode to make a
determination about them wrt to i_ino uniqueness. Many of them are not
as performance sensitive as pipefs, and it might not be such a big deal
to just hash those.

-- Jeff

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pipefs unique inode numbers

2007-01-30 Thread Linus Torvalds


On Tue, 30 Jan 2007, Jeff Layton wrote:
 
 Also, that patch would break many 32-bit programs not compiled with large
 offsets when run in compatibility mode on a 64-bit kernel. If they were to
 do a stat on this inode, it would likely generate an EOVERFLOW error since
 the pointer address would probably not fit in a 32 bit field.
 
 That problem was the whole impetus for this set of patches.

Well, we have that problem with the slowly incrementing last_ino too.

Should we make last_ino be static unsigned int instead of long?

Does anybody actually even use the old stat() with 32-bit interfaces? We 
warn for it, and we've done so for a long time.. I dont' remember people 
even complaining about the warning, so..

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/9] buffered write deadlock fix

2007-01-30 Thread Nick Piggin
On Tue, Jan 30, 2007 at 03:21:19PM -0800, Andrew Morton wrote:
 On Tue, 30 Jan 2007 12:55:58 -0800
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  y'know, four or five years back I fixed this bug by doing
  
  current-locked_page = page;
  
  in the write() code, and then teaching the pagefault code to avoid locking
  the same page.  Patch below.
  
  But then evil mean Hugh pointed out that the patch is still vulnerable to
  ab/ba deadlocking so I dropped it.
 
 And he was right, of course.  Task A holds file a's i_mutex and takes a
 fault against file b's page.  Task B holds file b's i_mutex and takes a
 fault against file a's page.  Drat.
 
 I wonder if there's a sane way of preventing that.

If you want to go down the path of carrying state around in task_struct,
you can take the mmap_sem and set a flag, then get_user_pages the source
page and lock both source and destination in ascending order, then your
page fault handler checks the flag and skips mmap_sem, and the rest of
your fault path checks both the page locks you're holding.

At which point you arrive at a horrible mess :)

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html