date:20070711

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Mingming Cao

On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote:
 On Tue, 10 Jul 2007 16:30:25 -0700
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 01 Jul 2007 03:36:48 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
  
On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
 The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
 create_proc_entry() does not do lookups on file names with more that 
 one
 directory deep.  This causes the entry creation to fail and hence, no 
 proc
 file is created.  This patch moves the file to /proc/jbd2-degug.
 
 The file could be move to /proc/fs/jbd2/jbd2-debug, but it would 
 require
 some minor alterations to the jbd-stats patch.

I don't think we really want to be adding top-level files in /proc.
What about using the debugfs filesystem (not to be confused with
the e2fsprogs 'debugfs' command)?
   
   How about this then?  Moved the file to use debugfs as well as having
   the nice effect of removing more lines than what it adds.
   
   Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
  
  Please clean up the changelog.
  
  The changelog should include information about the location and the content
  of these debugfs files.  it should provide any instructions which users
  will need to be able to create and use those files.
 
 Will fix.
 
  Alternatively (and preferably) do this via an update to
  Documentation/filesystems/ext4.txt.
 
 Seems like I also need to update the doc on Kconfig as well.  Do you
 prefer this in separate patches? (current patch, kconfig patch, ext4
 doc update patch?
 
fs/jbd2/journal.c|   6220 +42 -0 !
include/linux/jbd2.h |21 + 1 - 0 !
2 files changed, 21 insertions(+), 43 deletions(-)
  
  Again, this patch isn't in Ted's kernel.org directory and hasn't been in 
  -mm.
  
  Apart from the lack of testing and review which this causes, it means I
  can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
  I squint at the diff, but that's harder when the diff wasn't prepared with
  `diff -p'.  Oh well.
 
 Will fix.
 
  
   Index: linux-2.6.22-rc4/fs/jbd2/journal.c
   ===
   --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
   -0700
   @@ -35,6 +35,7 @@
#include linux/kthread.h
#include linux/poison.h
#include linux/proc_fs.h
   +#include linux/debugfs.h

#include asm/uaccess.h
#include asm/page.h
   @@ -1954,60 +1955,37 @@
 * /proc tunables
 */
#if defined(CONFIG_JBD2_DEBUG)
   -int jbd2_journal_enable_debug;
   +u16 jbd2_journal_enable_debug;
EXPORT_SYMBOL(jbd2_journal_enable_debug);
#endif

   -#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
   +#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
  
  Has this been compile-tested with CONFIG_DEBUGFS=n?
 
 I think I did, but honestly don't remember.  Will check with the new
 patch. :) 
 

Yes, I remember I did, that discovered some inconsistency in ext4 code,
which has already been fixed.


   -#define create_jbd_proc_entry() do {} while (0)
   -#define jbd2_remove_jbd_proc_entry() do {} while (0)
   +#define jbd2_create_debugfs_entry() do {} while (0)
   +#define jbd2_remove_debugfs_entry() do {} while (0)
  
  I suggest that these be converted to (preferable) inline functions while
  you're there.
 
 OK.
 
#endif

   @@ -2067,7 +2045,7 @@
 ret = journal_init_caches();
 if (ret != 0)
 jbd2_journal_destroy_caches();
   - create_jbd_proc_entry();
   + jbd2_create_debugfs_entry();
 return ret;
}

   @@ -2078,7 +2056,7 @@
 if (n)
 printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
#endif
   - jbd2_remove_jbd_proc_entry();
   + jbd2_remove_debugfs_entry();
 jbd2_journal_destroy_caches();
}

   Index: linux-2.6.22-rc4/include/linux/jbd2.h
   ===
   --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
   -0700
   @@ -57,7 +57,7 @@
 * CONFIG_JBD2_DEBUG is on.
 */
#define JBD_EXPENSIVE_CHECKING
  
  JBD2?
 
   -extern int jbd2_journal_enable_debug;
   +extern u16 jbd2_journal_enable_debug;
  
  Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
  going to do that.
 
 OK.
 
  Shoudln't all this debug info be a per-superblock thing rather than
  kernel-wide?
 
 I don't think it is worth pursuing this feature since this seems to
 have been broken for a while now (its been there since the first git
 revission in ext3) and nobody has noticed it until now.  It could be
 address on a later patch though, since the initial purpose of the patch
 was to fix the

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote:
 On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote:
   
   Sorry about this. I was using version 0.04. The latest one I can find
   for now is 0.05(searching lkml), but it didn't catch this codling style
   bug either. I appreciate if anyone can point me the version 0.07, thanks
 
 It's now in-tree in scripts/checkpatch.pl
 (linus' tree is still at 0.06 though, I suspect Andrew has something
  newer in -mm)
 

Thanks, Andy has uploaded the 0.07 version at
http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Journal checksum feature has been added to detect corruption of journal.

That was brief.  No description of what it does, how it does it, why it
does it, how one operates it, why (or why not) one would choose to enable
it nor of the costs of enabling it.

 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 
 diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c
 --- linux024/fs/ext4/super.c  2007-06-25 16:19:24.0 -0500
 +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500
 @@ -721,6 +721,7 @@ enum {
   Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
   Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
   Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
 + Opt_journal_checksum, Opt_journal_async_commit,

A new user-visible feature should be accompanied by an update to
Documentation/filesystems/ext4.txt?

   Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 @@ -760,6 +761,8 @@ static match_table_t tokens = {
   {Opt_journal_update, journal=update},
   {Opt_journal_inum, journal=%u},
   {Opt_journal_dev, journal_dev=%u},
 + {Opt_journal_checksum, journal_checksum},
 + {Opt_journal_async_commit, journal_async_commit},

What's journal_async_commit?

   {Opt_abort, abort},
   {Opt_data_journal, data=journal},
   {Opt_data_ordered, data=ordered},
 @@ -948,6 +951,13 @@ static int parse_options (char *options,
   return 0;
   *journal_devnum = option;
   break;
 + case Opt_journal_checksum:
 + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
 + break;
 + case Opt_journal_async_commit:
 + set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT);
 + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
 + break;
   case Opt_noload:
   set_opt (sbi-s_mount_opt, NOLOAD);
   break;
 @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super
   goto failed_mount4;
   }
  
 + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
 + jbd2_journal_set_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
 + jbd2_journal_set_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
 + jbd2_journal_clear_features(sbi-s_journal, 0, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + } else {
 + jbd2_journal_clear_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + }

Some discussion of the forward- and backward- compatibility design would be
appropriate.  Also a description of whether and how fsck can be used to fix
up these feature flags.

   /* We have now updated the journal if required, so we can
* validate the data journaling mode. */
   switch (test_opt(sb, DATA_FLAGS)) {
 diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c
 --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500
 +++ linux/fs/jbd2/commit.c2007-06-26 08:40:03.0 -0500
 @@ -21,6 +21,7 @@
  #include linux/mm.h
  #include linux/pagemap.h
  #include linux/jiffies.h
 +#include linux/crc32.h

I think we just broke CONFIG_EXT4=y, CONFIG_CRC32=n builds.

  /*
   * Default IO end handler for temporary BJ_IO buffer_heads.
 @@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour
   return 1;
  }
  
 -/* Done it all: now write the commit record.  We should have
 +/*
 + * Done it all: now submit the commit record.  We should have
   * cleaned up our previous buffers by now, so if we are in abort
   * mode we can now just skip the rest of the journal write
   * entirely.
   *
   * Returns 1 if the journal needs to be aborted or 0 on success
   */
 -static int journal_write_commit_record(journal_t *journal,
 - transaction_t *commit_transaction)
 +static int journal_submit_commit_record(journal_t *journal,
 + transaction_t *commit_transaction,
 + struct buffer_head **cbh,
 + __u32 crc32_sum)
  {
   struct journal_head *descriptor;
   struct buffer_head *bh;
 @@ -117,21 +121,36 @@ static int journal_write_commit_record(j
  
   bh

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  David Chinneer pointed that we need to journal the version number
  updates together with the operations that causes the change of the inode
  version number, in order to survive server crashes so clients won't see
  the counter go backwards.
  
  So increment i_version in fs code is probably the place to ensure the
  inode version changes are stored to disk. It's seems update the ext4
  inode version in every ext4_mark_inode_dirty() is the easiest way.
 
 That still makes us dependent upon _something_ changing the inode.  For
 overwrites the only something is mtime.
 
 If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
 I don't think we do) then I guess we'll need new code in or around
 file_update_time() to do this.

do you mean mark inode dirty all the times in file_update_time()? Not
sure about the overhead for ext3/4.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
 Date: Sun, 01 Jul 2007 03:38:51 -0400
 Sender: [EMAIL PROTECTED]
 Organization: IBM Linux Technology Center
 X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent compilation fixes

I hope this patch series will hit git with nice titles like ext4: extent
compilation fixes and not funny ones like
Morecleanups:ext4_extent_compilation_fixes.

 Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

conversions.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent macros cleanup
 
 - Replace math equation to it's macro equivalent

s/it's/its/;)

 - make ext4_ext_grow_indepth() indexes/leaf correct

hm, what was wrong with it?

 Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
 Acked-by: Alex Tomas [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 ---
  fs/ext4/extents.c |   11 +++
  1 files changed, 7 insertions(+), 4 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 12fe3d7..1fd00ac 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
 ext4_ext_path *path, int bloc
   ext_debug(binsearch for %d(idx):  , block);
 
   l = EXT_FIRST_INDEX(eh) + 1;
 - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_INDEX(eh);
   while (l = r) {
   m = l + (r - l) / 2;
   if (block  le32_to_cpu(m-ei_block))
 @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
 ext4_ext_path *path, int block)
   ext_debug(binsearch for %d:  , block);
 
   l = EXT_FIRST_EXTENT(eh) + 1;
 - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_EXTENT(eh);
 
   while (l = r) {
   m = l + (r - l) / 2;
 @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
 struct inode *inode,
   curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
   curp-p_hdr-eh_entries = cpu_to_le16(1);
   curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
 - /* FIXME: it works, but actually path[0] can be index */
 - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 + 
 + if (path[0].p_hdr-eh_depth)
 +   curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
 + else
 +   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;

whitespace bustage there.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 00:38:09 -0500 Jose R. Santos [EMAIL PROTECTED] wrote:

 
  Alternatively (and preferably) do this via an update to
  Documentation/filesystems/ext4.txt.
 
 Seems like I also need to update the doc on Kconfig as well.  Do you
 prefer this in separate patches? (current patch, kconfig patch, ext4
 doc update patch?

All these changes are logically connected (aren't they?).  A single patch
is fine.

  Shoudln't all this debug info be a per-superblock thing rather than
  kernel-wide?
 
 I don't think it is worth pursuing this feature since this seems to
 have been broken for a while now (its been there since the first git
 revission in ext3) and nobody has noticed it until now.  It could be
 address on a later patch though, since the initial purpose of the patch
 was to fix the broken JBD2_DEBUG option. Of course, this may not be
 clearly express in the changelog. :)
 

I don't think that making it all per-superblock is worth the effort - it's
a developer-only thing and developer will have the knowledge to test ext4
on an otherwise-ext3 setup if they're really fussed about the accuracy.

So yes, a bare make-it-work patch sounds appropriate.  Or remove it, but
hey, it might be useful.  The timestamping stuff certainly looks useful.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andrew Morton

On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
  On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  
   David Chinneer pointed that we need to journal the version number
   updates together with the operations that causes the change of the inode
   version number, in order to survive server crashes so clients won't see
   the counter go backwards.
   
   So increment i_version in fs code is probably the place to ensure the
   inode version changes are stored to disk. It's seems update the ext4
   inode version in every ext4_mark_inode_dirty() is the easiest way.
  
  That still makes us dependent upon _something_ changing the inode.  For
  overwrites the only something is mtime.
  
  If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
  I don't think we do) then I guess we'll need new code in or around
  file_update_time() to do this.
 
 do you mean mark inode dirty all the times in file_update_time()? Not
 sure about the overhead for ext3/4.
 

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}
 
+   if (IS_I_VERSION_64(inode)) {
+   inode_inc_iversion(inode);  /* Takes i_lock on 32-bit */
+   sync_it = 1;
+   }
+
if (sync_it)
mark_inode_dirty_sync(inode);
 }
_

?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-11 Thread Amit K. Arora

On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
 On Wed, 11 Jul 2007 01:50:00 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
  +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
  @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
  return sys_fadvise64_64(fd, ((u64)offset_hi  32) | offset_lo,
  len, advice);
   }
  +
  +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
  +   unsigned offset_hi, unsigned len_lo,
  +   unsigned len_hi)
 
 Please call this compat_sys_fallocate in line with the powerpc version -
 it gives us a hint that maybe we should think about how to consolidate
 them.  I know other stuff in that file is called sys32_ ... but it is time
 for a change :-)

I think this can be handled as a separate patch once this patchset
is in mainline. Since, anyhow we will need to do this for other sys32_
calls which are already there...

--
Regards,
Amit Arora


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Christoph Hellwig

On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
 This patch is on top of i_version_update_vfs.
 The i_version field of the inode is set on inode creation and incremented when
 the inode is being modified.

Which is not what i_version is supposed to do.  It'll get you tons of misses
for NFSv3 filehandles that rely on the generation staying the same for the
same file.  Please add a new field for the NFSv4 sequence counter instead
of making i_version unuseable.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC/PATCH 1/5] revoke: special mmap handling

2007-07-11 Thread Pekka J Enberg

From: Pekka Enberg [EMAIL PROTECTED]

This adds special handling for revoked memory mappings.  We want to raise
SIGBUS when accessing revoked mappings and return ENODEV when trying to remap
with mmap(2).

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 include/linux/mm.h |1 +
 mm/memory.c|3 +++
 mm/mmap.c  |   12 
 3 files changed, 12 insertions(+), 4 deletions(-)

Index: 2.6/include/linux/mm.h
===
--- 2.6.orig/include/linux/mm.h 2007-07-06 10:19:51.0 +0300
+++ 2.6/include/linux/mm.h  2007-07-11 11:48:28.0 +0300
@@ -169,6 +169,7 @@ #define VM_NONLINEAR0x0080  /* Is no
 #define VM_MAPPED_COPY 0x0100  /* T if mapped copy of data (nommu 
mmap) */
 #define VM_INSERTPAGE  0x0200  /* The vma has had vm_insert_page() 
done on it */
 #define VM_ALWAYSDUMP  0x0400  /* Always include in core dumps */
+#define VM_REVOKED 0x0800  /* Mapping has been revoked */
 
 #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: 2.6/mm/memory.c
===
--- 2.6.orig/mm/memory.c2007-07-06 10:19:53.0 +0300
+++ 2.6/mm/memory.c 2007-07-11 11:48:28.0 +0300
@@ -2597,6 +2597,9 @@ int __handle_mm_fault(struct mm_struct *
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, write_access);
 
+   if (unlikely(vma-vm_flags  VM_REVOKED))
+   return VM_FAULT_SIGBUS;
+
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
Index: 2.6/mm/mmap.c
===
--- 2.6.orig/mm/mmap.c  2007-07-06 10:19:53.0 +0300
+++ 2.6/mm/mmap.c   2007-07-11 11:48:28.0 +0300
@@ -1031,10 +1031,14 @@ accountable = 0;
error = -ENOMEM;
 munmap_back:
vma = find_vma_prepare(mm, addr, prev, rb_link, rb_parent);
-   if (vma  vma-vm_start  addr + len) {
-   if (do_munmap(mm, addr, len))
-   return -ENOMEM;
-   goto munmap_back;
+   if (vma) {
+   if (unlikely(vma-vm_flags  VM_REVOKED))
+   return -ENODEV;
+   if (vma-vm_start  addr + len) {
+   if (do_munmap(mm, addr, len))
+   return -ENOMEM;
+   goto munmap_back;
+   }
}
 
/* Check against address space limit. */
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg

From: Pekka Enberg [EMAIL PROTECTED]

The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
and shared mappings of an inode.  After an successful revocation, operations
on file descriptors fail with the EBADF or ENXIO error code for regular and
device files, respectively.  Attempting to read from or write to a revoked
mapping causes SIGBUS.

The actual operation is done in two passes:

 1. Revoke all file descriptors that point to the given inode. We do
this under tasklist_lock so that after this pass, we don't need
to worry about racing with close(2) or dup(2).

 2. Take down shared memory mappings of the inode and close all file
pointers.

The file descriptors and memory mapping ranges are preserved until the
owning task does close(2) and munmap(2), respectively.


You use revoke() (with chown, for example) to gain exclusive access to 
an inode that might be in use by other processes. This means that we must 
mke sure that:

  - operations on opened file descriptors pointing to that inode fail
  - there are no shared mappings visible to other processes
  - in-progress system calls are either completed (writes) or abort 
(reads)

After revoke() system call returns, you are guaranteed to have revoked 
access to an inode for any processes that had access to it when you 
started the operation. The caller is responsible for blocking any future 
open(2) calls that might occur while revoke() takes care of fork(2) and 
dup(2) during the operation.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 fs/Makefile  |1 
 fs/revoke.c  |  777 +++
 fs/revoked_inode.c   |  417 +++
 include/linux/fs.h   |8 
 include/linux/magic.h|1 
 include/linux/mm.h   |1 
 include/linux/revoked_fs_i.h |   18 
 include/linux/syscalls.h |3 
 mm/mmap.c|   11 
 9 files changed, 1237 insertions(+)

Index: 2.6/fs/Makefile
===
--- 2.6.orig/fs/Makefile2007-05-21 15:38:14.0 +0300
+++ 2.6/fs/Makefile 2007-07-11 11:48:35.0 +0300
@@ -19,6 +19,7 @@ else
 obj-y +=   no-block.o
 endif
 
+obj-$(CONFIG_MMU)  += revoke.o revoked_inode.o
 obj-$(CONFIG_INOTIFY)  += inotify.o
 obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
 obj-$(CONFIG_EPOLL)+= eventpoll.o
Index: 2.6/fs/revoke.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ 2.6/fs/revoke.c 2007-07-11 11:48:35.0 +0300
@@ -0,0 +1,777 @@
+/*
+ * fs/revoke.c - Invalidate all current open file descriptors of an inode.
+ *
+ * Copyright (C) 2006-2007  Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include linux/file.h
+#include linux/fs.h
+#include linux/namei.h
+#include linux/magic.h
+#include linux/mm.h
+#include linux/mman.h
+#include linux/module.h
+#include linux/mount.h
+#include linux/sched.h
+#include linux/revoked_fs_i.h
+#include linux/syscalls.h
+
+/**
+ * fileset - an array of file pointers.
+ * @files:the array of file pointers
+ * @nr:   number of elements in the array
+ * @end:  index to next unused file pointer
+ */
+struct fileset {
+   struct file **files;
+   unsigned long   nr;
+   unsigned long   end;
+};
+
+/**
+ * revoke_details - details of the revoke operation
+ * @inode:invalidate open file descriptors of this inode
+ * @fset: set of files that point to a revoked inode
+ * @restore_start:index to the first file pointer that is currently in
+ *use by a file descriptor but the real file has not
+ *been revoked
+ */
+struct revoke_details {
+   struct fileset  *fset;
+   unsigned long   restore_start;
+};
+
+static struct kmem_cache *revokefs_inode_cache;
+
+static inline bool fset_is_full(struct fileset *set)
+{
+   return set-nr == set-end;
+}
+
+static inline struct file *fset_get_filp(struct fileset *set)
+{
+   return set-files[set-end++];
+}
+
+static struct fileset *alloc_fset(unsigned long size)
+{
+   struct fileset *fset;
+
+   fset = kzalloc(sizeof *fset, GFP_KERNEL);
+   if (!fset)
+   return NULL;
+
+   fset-files = kcalloc(size, sizeof(struct file *), GFP_KERNEL);
+   if (!fset-files) {
+   kfree(fset);
+   return NULL;
+   }
+   fset-nr = size;
+   return fset;
+}
+
+static void free_fset(struct fileset *fset)
+{
+  int i;
+
+  for (i = fset-end; i  fset-nr; i++)
+  fput(fset-files[i]);
+
+  kfree(fset-files);
+  kfree(fset);
+}
+
+/*
+ * Revoked file descriptors point to inodes in the revokefs filesystem.
+ */
+static struct vfsmount *revokefs_mnt;
+
+static struct file *get_revoked_file(void)
+{
+   struct dentry

[RFC/PATCH 3/5] revoke: wire up i386 system calls

2007-07-11 Thread Pekka J Enberg

From: Pekka Enberg [EMAIL PROTECTED]

Make revokeat and frevoke system calls available to user-space on i386.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 arch/i386/kernel/syscall_table.S |2 ++
 arch/x86_64/ia32/ia32entry.S |2 ++
 include/asm-i386/unistd.h|4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

Index: 2.6/arch/i386/kernel/syscall_table.S
===
--- 2.6.orig/arch/i386/kernel/syscall_table.S   2007-05-21 15:38:00.0 
+0300
+++ 2.6/arch/i386/kernel/syscall_table.S2007-07-11 11:48:39.0 
+0300
@@ -323,3 +323,5 @@ .long sys_utimensat /* 320 */
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_revokeat
+   .long sys_frevoke   /* 325 */
Index: 2.6/include/asm-i386/unistd.h
===
--- 2.6.orig/include/asm-i386/unistd.h  2007-05-21 15:38:15.0 +0300
+++ 2.6/include/asm-i386/unistd.h   2007-07-11 11:48:39.0 +0300
@@ -329,10 +329,12 @@ #define __NR_utimensat320
 #define __NR_signalfd  321
 #define __NR_timerfd   322
 #define __NR_eventfd   323
+#define __NR_revokeat  324
+#define __NR_frevoke   325
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 326
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: 2.6/arch/x86_64/ia32/ia32entry.S
===
--- 2.6.orig/arch/x86_64/ia32/ia32entry.S   2007-07-06 10:19:44.0 
+0300
+++ 2.6/arch/x86_64/ia32/ia32entry.S2007-07-11 11:48:40.0 +0300
@@ -719,4 +719,6 @@ .quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys_revokeat
+   .quad sys_frevoke   /* 325 */
 ia32_syscall_end:
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC/PATCH 4/5] revoke: support for ext2 and ext3

2007-07-11 Thread Pekka J Enberg

From: Pekka Enberg [EMAIL PROTECTED]

Add revoke support to ext2, ext3 and ext4 by wiring f_ops-revoke with
generic_file_revoke.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 fs/ext2/file.c |1 +
 fs/ext3/file.c |1 +
 fs/ext4/file.c |1 +
 3 files changed, 3 insertions(+)

Index: 2.6/fs/ext2/file.c
===
--- 2.6.orig/fs/ext2/file.c 2007-05-04 13:49:05.0 +0300
+++ 2.6/fs/ext2/file.c  2007-07-11 11:48:43.0 +0300
@@ -56,6 +56,7 @@ const struct file_operations ext2_file_o
.sendfile   = generic_file_sendfile,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .revoke = generic_file_revoke,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
Index: 2.6/fs/ext3/file.c
===
--- 2.6.orig/fs/ext3/file.c 2007-05-04 13:49:05.0 +0300
+++ 2.6/fs/ext3/file.c  2007-07-11 11:48:43.0 +0300
@@ -123,6 +123,7 @@ const struct file_operations ext3_file_o
.sendfile   = generic_file_sendfile,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .revoke = generic_file_revoke,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
Index: 2.6/fs/ext4/file.c
===
--- 2.6.orig/fs/ext4/file.c 2007-05-04 13:49:05.0 +0300
+++ 2.6/fs/ext4/file.c  2007-07-11 11:48:43.0 +0300
@@ -123,6 +123,7 @@ const struct file_operations ext4_file_o
.sendfile   = generic_file_sendfile,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .revoke = generic_file_revoke,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC/PATCH 5/5] revoke: add documentation

2007-07-11 Thread Pekka J Enberg

From: Pekka Enberg [EMAIL PROTECTED]

This documents revoke file operation in Documentation/filesystems/vfs.txt.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---

 Documentation/filesystems/vfs.txt |5 +
 1 file changed, 5 insertions(+)

Index: 2.6/Documentation/filesystems/vfs.txt
===
--- 2.6.orig/Documentation/filesystems/vfs.txt  2007-05-21 15:37:59.0 
+0300
+++ 2.6/Documentation/filesystems/vfs.txt   2007-07-11 11:48:46.0 
+0300
@@ -732,6 +732,7 @@ struct file_operations {
 int);
ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, 
unsigned  
 int);
+   int (*revoke)(struct file *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -805,6 +806,10 @@ otherwise noted.
   splice_read: called by the VFS to splice data from file to a pipe. This
   method is used by the splice(2) system call
 
+  revoke: called by revokeat(2) and frevoke(2) system calls to revoke access
+ to an open file. This method must ensure that all currently blocked
+ writes are flushed and reads will fail.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig

On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
 Well, if you see the modes proposed using above flags :
 
 #define FA_ALLOCATE   0
 #define FA_DEALLOCATE FA_FL_DEALLOC
 #define FA_RESV_SPACE FA_FL_KEEP_SIZE
 #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
 FA_FL_DEL_DATA)
 
 FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
 for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
 flag. Hence prealloction will never delete data.
 This mode is required only for FA_UNRESV_SPACE, which is a deallocation
 mode, to support any existing XFS aware applications/usage-scenarios.

Sorry, but this doesn't make any sense.  There is no need to put every
feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
be supported forever anyway - as I suggested before they really should
be moved to generic code.

What needs to be supported is what makes sense as an interface.
A punch a hole interface does make sense, but trying to hack this into
a preallocation system call is just madness.  We're not IRIX or windows
that fit things into random subcall just because there was some space
left to squeeze them in.

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data 
change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data 
change) */
  
  NACK to these aswell.  If i_size changes c/mtime need updates, if the size
  doesn't chamge they don't.  No need to add more flags for this.
 
 This requirement was from the point of view of HSM applications. Hope
 you saw Andreas previous post and are keeping that in mind.

HSMs needs this basically for every system call, which screams for an
open flag like O_INVISIBLE anyway.  Adding this in a generic way is
a good idea, but hacking bits and pieces that won't fit into the global
design is completely wrong.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig

On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
 Given the current behaviour for posix_fallocate() in glibc, I think
 that retaining the same error semantic and punting the cleanup to
 userspace (where the app will fail with ENOSPC anyway) is the only
 sane thing we can do here. Trying to undo this in the kernel leads
 to lots of extra rarely used code in error handling paths...

Agreed, looks like we should stay with the user has to clean up behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig

On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote:
 We use this capability in XFS at the moment.
 I think this is mainly for DMF (HSM) but is done via the xfs handle 
 interface
 (xfs_open_by_handle) AFAICT.
 

You're not :)  You're using an O_INVIBLE equivalent (as described below),
which would be a useful thing to have at the VFS level, but adding hacks
to some system calls only wouldn't help any HSM system.  It's just useless
API clutter.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] manpage for fallocate

2007-07-11 Thread Amit K. Arora

On Wed, Jul 11, 2007 at 12:37:01AM +0300, Heikki Orsila wrote:
 On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote:
  .BI int syscall(int, int fd, int mode, loff_t offset, loff_t len);
 
 Correction: int syscall(int fd, int mode, ...),

Here, we have syscall() with first argument being the system call number
- so what you suggested is not correct.

But, yes, the synopsis should change at some time. Maybe to something
like:

#include fcntl.h

long fallocate(int fd, int mode, loff_t offset, loff_t len);

  .TP
  .B ENOSPC
  There is not enough space left on the device containing the file
  referred to by
  .IR fd.
  .TP
  .B ESPIPE
  .I fd
  refers to a pipe of file descriptor.
  .B ENOSYS
  The filesystem underlying the file descriptor does not support this
  operation.
 
 EINTR?

Will add following errors:

  EINTR A signal was caught during execution
  EIO   An I/O error occurred while reading from or writing to
a file system.
  EOPNOTSUPPThe mode is not supported on the file descriptor.

and will update following :

  EINVALoffset was less than 0, or len was less than or equal to 0.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro

On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
 From: Pekka Enberg [EMAIL PROTECTED]
 
 The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
 and shared mappings of an inode.  After an successful revocation, operations
 on file descriptors fail with the EBADF or ENXIO error code for regular and
 device files, respectively.  Attempting to read from or write to a revoked
 mapping causes SIGBUS.
 
 The actual operation is done in two passes:
 
  1. Revoke all file descriptors that point to the given inode. We do
 this under tasklist_lock so that after this pass, we don't need
 to worry about racing with close(2) or dup(2).

How does that deal with the following:

task A gets its descriptor table cleansed
task B sends a descriptor to task A via SCM_RIGHTS datagram
task B gets its descriptor table cleansed
task A receives the datagram and gets the descriptor installed
task A does any kind of IO on that descriptor
-f_mapping gets replaced in the most inconvenient time.

Come to think of that, what do you do if I create a socketpair,
stuff the descriptor into SCM_RIGHTS datagram and send it to
myself?  Then receive it a day after you've called revoke() and
voila - I've got an almost undamaged struct file back...  At
the very least, it allows me to fchmod().  Or fchdir() if that
had been a directory...

BTW, read() or write() in progress might get rather unhappy if your
live replacement of -f_mapping races with them...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro

On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote:
 On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
  From: Pekka Enberg [EMAIL PROTECTED]
  
  The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
  and shared mappings of an inode.  After an successful revocation, operations
  on file descriptors fail with the EBADF or ENXIO error code for regular and
  device files, respectively.  Attempting to read from or write to a revoked
  mapping causes SIGBUS.
  
  The actual operation is done in two passes:
  
   1. Revoke all file descriptors that point to the given inode. We do
  this under tasklist_lock so that after this pass, we don't need
  to worry about racing with close(2) or dup(2).
 
 How does that deal with the following:
 
 task A gets its descriptor table cleansed
 task B sends a descriptor to task A via SCM_RIGHTS datagram
 task B gets its descriptor table cleansed
 task A receives the datagram and gets the descriptor installed
 task A does any kind of IO on that descriptor
 -f_mapping gets replaced in the most inconvenient time.
 
 Come to think of that, what do you do if I create a socketpair,
 stuff the descriptor into SCM_RIGHTS datagram and send it to
 myself?  Then receive it a day after you've called revoke() and
 voila - I've got an almost undamaged struct file back...  At
 the very least, it allows me to fchmod().  Or fchdir() if that
 had been a directory...
 
 BTW, read() or write() in progress might get rather unhappy if your
 live replacement of -f_mapping races with them...

Better: I have the only opened descriptor for foo.  I send it to myself
as described above.  I close it.  revoke() is called, finds no opened
instances of foo in any descriptor tables and cheerfully does nothing.
I call recvmsg() and I have completely undamaged opened file back.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg

Hi Al,

On Wed, 11 Jul 2007, Al Viro wrote:
 Better: I have the only opened descriptor for foo.  I send it to myself
 as described above.  I close it.  revoke() is called, finds no opened
 instances of foo in any descriptor tables and cheerfully does nothing.
 I call recvmsg() and I have completely undamaged opened file back.

Uhm, nice. So, revoke() needs a proper inode - struct files mapping 
somewhere. Can we add a list of files to struct inode? Are there other 
cases where a file can point to an inode but the file is not attached to 
any file descriptor?

Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg

On Wed, 11 Jul 2007, Al Viro wrote:
 BTW, read() or write() in progress might get rather unhappy if your
 live replacement of -f_mapping races with them...

For writes, we (1) never start any new operations after we've cleaned up 
the file descriptor tables so (2) after we're done with do_fsync() we 
never touch -f_mapping again.

But for reads, I think there's a problem if we're in 
do_generic_mapping_read() doing invalidate_inode_pages2() is not enough 
because we're hanging on to the real mapping. Hmm.

Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro

On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote:
 Hi Al,
 
 On Wed, 11 Jul 2007, Al Viro wrote:
  Better: I have the only opened descriptor for foo.  I send it to myself
  as described above.  I close it.  revoke() is called, finds no opened
  instances of foo in any descriptor tables and cheerfully does nothing.
  I call recvmsg() and I have completely undamaged opened file back.
 
 Uhm, nice. So, revoke() needs a proper inode - struct files mapping 
 somewhere. Can we add a list of files to struct inode? Are there other 
 cases where a file can point to an inode but the file is not attached to 
 any file descriptor?

Umm...  Any number, really - it might be in the middle of syscall
while another task sharing descriptor table has closed the descriptor.
Then there's quota, then there's process accounting, then there's
execve() in progress, then there's knfsd working with that struct
file, etc.

The fundamental issue here is that even if you do find struct file,
you can't blindly rip its -f_mapping since it can be in the middle
of -read(), -write(), pageout, etc.  And even if you do manage
that, you still have the ability to do fchmod() later.

I don't see how the ability to find all instances in SCM_RIGHTS
datagrams (for example) will help you with the race I've described
first.  Original state: task B has the only reference to file.
revoke() is called, passes task A.  B sends datagram to A and closes
file.  A receives datagram.  Now the only reference is in A's table
and you've already passed that.

So you can't avoid processes keeping pointers to struct file.
If you could find all struct file over given inode (which, I suspect,
will lead to interesting locking), you could call something on
that struct file, but you'd have zero exclusion with processes
calling methods on it.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro

On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote:
 On Wed, 11 Jul 2007, Al Viro wrote:
  BTW, read() or write() in progress might get rather unhappy if your
  live replacement of -f_mapping races with them...
 
 For writes, we (1) never start any new operations after we've cleaned up 
 the file descriptor tables so (2) after we're done with do_fsync() we 
 never touch -f_mapping again.

Er, no.  do_fsync() won't hit the sys_write() that is yet to enter
-write().  And you can't get rid of new callers _anyway_ (see
previous mail).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] nfs: disable leases over NFS

2007-07-11 Thread Christoph Hellwig

On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote:
 OK, after looking at this a little more, I'm less happy about the idea
 of erroring out by default:
 
   - There are a ton of filesystems that probably should allow
 leases, and only a few (network filesystems) that shouldn't,
 so leaving leases on by default seems simpler.

But it gets you possible wrong behaviour by default.  I'm not a big
fan of non-trivial default methods as you see :)

   - We already fall back on the local method by default in the case
 of locks, and I don't see a reason to treat leases differently.
   - The patch to add
   .setlease = setlease,
 to all the file_operations is going to be a big patch that
 changes behavior in a way that might be easy to miss (because
 it changes behavior exactly on those filesystems it *doesn't*
 touch.)  I think it'll be easier to get better review on the
 patch that adds a method just to those filesystems that we're
 disabling leases for.

Anyway, feel free to go ahead with the simpler version for now, I'll do
the switchover for locks and leases when I get some time.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()

2007-07-11 Thread Christoph Hellwig

On Thu, Jul 05, 2007 at 03:43:32PM -0700, Dave Hansen wrote:
 This takes care of all of the direct callers of vfs_mknod().
 Since a few of these cases also handle normal file creation
 as well, this also covers some calls to vfs_create().
 
 So that we don't have to make three mnt_want/drop_write()
 calls inside of the switch statement, we move some of its
 logic outside of the switch.

Looks good to me.

 One thing I noticed: do we actually _need_ the first
 S_ISDIR() check at the top of the function?  

The EPERM is required by Posix.

 + if (!S_ISREG(mode)   !S_ISCHR(mode)  !S_ISBLK(mode) 
 + !S_ISFIFO(mode)  !S_ISSOCK(mode)  (mode != 0)) {

no need for braces around the mode != 0

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()

2007-07-11 Thread Christoph Hellwig

On Sat, Jul 07, 2007 at 08:25:31PM +0200, Jan Engelhardt wrote:
 Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the
 others?

Posix.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree

2007-07-11 Thread Christoph Hellwig

On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote:
 Is that conjecture, or do you have evidence to that effect?  Most users 
 of this file are using it via the glibc interfaces, and there probably 
 aren't all that many users of it in the first place.

I have written parsers for personal projects that might not have been
happy to deal with additional fields myself for example..

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Pekka J Enberg

Hi,

On Wed, 11 Jul 2007, Al Viro wrote:
 The fundamental issue here is that even if you do find struct file,
 you can't blindly rip its -f_mapping since it can be in the middle
 of -read(), -write(), pageout, etc.  And even if you do manage
 that, you still have the ability to do fchmod() later.

Then we would need to change the VFS and relevant parts so that we can 
take down -f_mapping. I don't see how we could do that without affecting 
current hotpaths. Hmm. I suppose what we really need to do is cannibalize 
the actual inode (remove from inode cache, detach from dentry and take 
down the mapping) so that we don't have to touch existing struct file 
pointers at all.

Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  22:00 -0400, Mingming Cao wrote:
 On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
  On Sun, 01 Jul 2007 03:36:56 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
  
   This patch is a spinoff of the old nanosecond patches.
  
  I don't know what the old nanosecond patches are.  A link to a suitable
  changlog for those patches would do in a pinch.  Preferable would be to
  write a proper changelog for this patch.
  
 I found the original patch
 http://marc.info/?l=linux-ext4m=115091699809181w=2
 
 Andreas or Kalpak, is changelog from the original patch is accurate to
 apply here?

Mostly, yes, but the name of the feature flag has changed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  16:30 -0700, Andrew Morton wrote:
  +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)  \
  +   ((offsetof(typeof(*ext4_inode), field) +\
  + sizeof((ext4_inode)-field))  \
  +   = (EXT4_GOOD_OLD_INODE_SIZE +  \
  +   (einode)-i_extra_isize))   \
 
 Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
 under what circumstances something will not fit in an inode and what the
 consequences of this are.

/* Extended fields will fit into an inode if the filesystem was formatted
 * with large inodes (-I 256 or larger) and there are not currently EAs
 * consuming all of the available space.  For new inodes we always reserve
 * enough space for the kernel's known extended fields, but for inodes
 * created with an old kernel this might not have been the case.  None of
 * the extended inode fields is critical for correct filesystem operation.
 */

  +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)  
 \
  +do {   
 \
  +   (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
  +   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
  +   ext4_decode_extra_time((inode)-xtime,\
  +  raw_inode-xtime ## _extra);\
  +} while (0)
 
 Ugly.  I expect these could be implemented as plain old C functions. 
 Caller could pass in the address of the ext4_inode field which the function
 is to operate upon.

We thought about that also, but then the caller needs to do all of the
pointer gymnastics themselves like:

 ext4_inode_get_xtime(inode-i_ctime, inode-i_ctime_extra,
 raw_inode-i_ctime, raw_inode-i_ctime_extra)

instead of the current:

 EXT4_INODE_GET_XTIME(ctime, inode, raw_inode);

IMHO it is preferrable to make the multiple callsites more readable than
the macros.

   #if defined(__KERNEL__) || defined(__linux__)
 
 (What's the __linux__ for?)
 
   #define i_reserved1osd1.linux1.l_i_reserved1
   #define i_frag osd2.linux2.l_i_frag

This is actually unrelated to the current patch, just part of the context.
AFAIK, this is historical, so that the kernel and e2fsprogs can use the
same ext2_fs.h header.  I don't think it is really needed, but such cleanup
shouldn't be a part of this patch either.

  +static inline struct timespec ext4_current_time(struct inode *inode)
  +{
  +   return (inode-i_sb-s_time_gran  NSEC_PER_SEC) ?
  +   current_fs_time(inode-i_sb) : CURRENT_TIME_SEC;
  +}
 
 Now, I've forgotten how this works.  Remind me, please.  Can ext4
 filesystems ever have one-second timestamp granularity?  If so, how does
 one cause that to come about?

Yes, this is possible if an ext2/3/4 filesystem is formatted with 128-byte
inodes (which is the default for all but ext4) and this fs is mounted as
ext4dev.  The inodes can never hold the extra time information (FITS_IN_INODE
check above) so the superblock limits the timestamp resolution to 1s in that
case.

  @@ -153,6 +153,7 @@
   
  unsigned long i_ext_generation;
  struct ext4_ext_cache i_cached_extent;
  +   struct timespec i_crtime;
   };
 
 It is unobvious what this field does.  Please prefer to add commentary to
 _all_ struct fields - it really helps.

It is the inode creation time.  This is useful for debug/forensic purposes,
and at some point there will be an API so that Samba can use it also.

   #endif /* _LINUX_EXT4_FS_I */
  Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 
  17:28:15.0 -0700
  +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 
  17:39:05.0 -0700
  @@ -79,6 +79,7 @@
  char *s_qf_names[MAXQUOTAS];/* Names of quota files with 
  journalled quota */
  int s_jquota_fmt;   /* Format of quota to use */
   #endif
  +   unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */
 
 OK, I can kind-of see how this is working, but some overall description of
 how the inode sizing design operates would be helpful.  It would certainly
 make reviewing of this proposed change more fruitful.  Perhaps that new
 comment over EXT4_FITS_IN_INODE() would be a suitable place.

Hmm, I'm sure there were emails on the topic, but they aren't attached to
the patch.  s_want_extra_isize is just an override for sizeof(ext4_inode)
in case the sysadmin wants to reserve more fields in new inodes.  There is
also s_min_extra_isize which is what the kernel and e2fsck guarantee that
will be available in all in-use inodes, if RO_COMPAT_EXTRA_ISIZE is set
(ro-compat so that older kernels can't create inodes with a smaller
extra_isize).  That feature is only enabled

Re: -mm merge plans for 2.6.23

2007-07-11 Thread David Woodhouse

On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote:
 On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote:
  romfs-printk-format-warnings.patch
 
 NACK on this one. 

The rest of it is nacked anyway, until we unify the point and
get_unmapped_area methods of the MTD API.

-- 
dwmw2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  23:34 -0400, Trond Myklebust wrote:
 On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
  So my vote is to increment i_version in common code every time any
  change is made to the file, and alloc_inode should initialise it to
  current time, which might be changed by the filesystem before it calls
  unlock_new_inode. 
  ... but doesn't lustre want to control its i_version... so maybe not :-(
 
 If lustre wants to be exportable via pNFS (as Peter Braam has suggested
 it should), then it had better be able to return a change attribute that
 is compatible with the NFSv4.1 spec...

The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre
version can be used to compare the updates of two inodes.  It is set
to be the Lustre transaction number (which is sequentially incremented
on a per filesystem basis), so that we can use the per-inode version
to do replay of client operations even if they have been disconnected for
a long time, which is why we want to be able to control the actual value.
We don't want the version to be updated for e.g. file defragmentation
or other similar internal-only changes which need ext4_mark_inode_dirty().

We had a patch to disable ext4 inode versioning by a flag the superblock,
but we dropped it at the last minute because it needed some updates and
we didn't want to wait on that for submitting these changes upstream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  16:30 -0700, Andrew Morton wrote:
  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
  ---
  Index: linux-2.6.21/include/linux/ext4_fs.h
  ===
  --- linux-2.6.21.orig/include/linux/ext4_fs.h
  +++ linux-2.6.21/include/linux/ext4_fs.h
  @@ -342,6 +342,7 @@ struct ext4_inode {
  __le32  i_atime_extra;  /* extra Access time  (nsec  2 | epoch) */
  __le32  i_crtime;   /* File Creation time */
  __le32  i_crtime_extra; /* extra FileCreationtime (nsec  2 | epoch) */
  +   __le32  i_version_hi;   /* high 32 bits for 64-bit version */
   };
 
 Aren't there forward- backward-compatibility issues here?  How does the
 filesystem driver work out whether this field is present and valid?

This uses the same EXT4_FITS_IN_INODE() check as any other, so the
compatibility issues are handled.  NFSv4 could live with 32-bit versions
with only a small danger of overflow, so we can still export ext3
filesystems with 128-byte inodes that have been updated to ext4.  For
Lustre (which requires 64-bit versions), we will enforce that space is
available with s_min_extra_isize and RO_COMPAT_EXTRA_ISIZE.

In the case where an older ext3/ext4 filesystem with large inodes does
not have enough space for i_version_hi the EAs that follow i_extra_isize
will be shifted to make room for it (if possible, which is likely).  There
are no critical fields inside i_extra_isize so in the rare case of a
failure to enlarge the i_extra_isize is not a cause for alarm.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
 On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
  This patch is on top of i_version_update_vfs.
  The i_version field of the inode is set on inode creation and incremented
  when the inode is being modified.
 
 Which is not what i_version is supposed to do.  It'll get you tons of misses
 for NFSv3 filehandles that rely on the generation staying the same for the
 same file.  Please add a new field for the NFSv4 sequence counter instead
 of making i_version unuseable.

You are confusing i_generation (the instance of this inode number) with
i_version (whether this file has been modified)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Trond Myklebust

On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote:
 On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
  This patch is on top of i_version_update_vfs.
  The i_version field of the inode is set on inode creation and incremented 
  when
  the inode is being modified.
 
 Which is not what i_version is supposed to do.  It'll get you tons of misses
 for NFSv3 filehandles that rely on the generation staying the same for the
 same file.  Please add a new field for the NFSv4 sequence counter instead
 of making i_version unuseable.

Aren't you confusing i_version and i_generation here? Those are two
separate inode fields.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

fallocate, Re: -mm merge plans for 2.6.23

2007-07-11 Thread Christoph Hellwig

 fallocate-implementation-on-i86-x86_64-and-powerpc.patch
 fallocate-on-s390.patch
 fallocate-on-ia64.patch
 fallocate-on-ia64-fix.patch
 
  Merge.

Hopefull this will be done during the 2.6.23 merge window, but right now
it's not (yet).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Christoph Hellwig

On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote:
 On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
  On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
   This patch is on top of i_version_update_vfs.
   The i_version field of the inode is set on inode creation and incremented
   when the inode is being modified.
  
  Which is not what i_version is supposed to do.  It'll get you tons of misses
  for NFSv3 filehandles that rely on the generation staying the same for the
  same file.  Please add a new field for the NFSv4 sequence counter instead
  of making i_version unuseable.
 
 You are confusing i_generation (the instance of this inode number) with
 i_version (whether this file has been modified)?

Yes, sorry.  Objection dropped.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Christoph Hellwig

On Wed, Jul 11, 2007 at 05:57:17AM -0600, Andreas Dilger wrote:
 Ah, this is the patch to disable i_version updates for Lustre.  I don't
 think any normal user would use this mount option, so I don't know if
 there is a need to document it.

This is a reason to not merge it at all.  If the only user of this is
the out of tree lustre code there is no need to put this in.  I should
rather stay in clusterfs' patchkit.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
  err = ext4_reserve_inode_write(handle, inode, iloc);
  +   if (EXT4_I(inode)-i_extra_isize 
  +   EXT4_SB(inode-i_sb)-s_want_extra_isize 
  +   !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
  +   /* We need extra buffer credits since we may write into EA block
  +* with this same handle */
  +   if ((jbd2_journal_extend(handle,
  +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
  +   ret = ext4_expand_extra_isize(inode,
  +   EXT4_SB(inode-i_sb)-s_want_extra_isize,
  +   iloc, handle);
  +   if (ret) {
  +   EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
  +   if (!expand_message) {
  +   ext4_warning(inode-i_sb, __FUNCTION__,
  +   Unable to expand inode %lu. Delete
  +some EAs or run e2fsck.,
  +   inode-i_ino);
  +   expand_message = 1;
  +   }
  +   }
  +   }
  +   }
 
 Maybe that message should come out once per mount rather than once per boot
 (or once per modprobe)?

Probably true.

 What are the consequences of a jbd2_journal_extend() failure in here?

Not fatal, just like every user of i_extra_isize.  If the inode isn't a
large inode, or it can't be expanded then there will be a minor loss of
functionality on that inode.  If the i_extra_isize is critical, then
the sysadmin will have run e2fsck to force s_min_extra_isize large enough.

Note that this is only applicable for filesystems which are upgraded.  For
new inodes (i.e. all inodes that exist in the filesystem if it was always
run on a kernel with the currently understood extra fields) then this will
never be invoked (until such a time new extra fields are added).

  +static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  +int value_offs_shift, void *to,
  +void *from, size_t n, int blocksize)
  +{
  +   /* Adjust the value offsets of the entries */
  +   for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
  +   if (!last-e_value_block  last-e_value_size) {
  +   new_offs = le16_to_cpu(last-e_value_offs) +
  +   value_offs_shift;
  +   BUG_ON(new_offs + le32_to_cpu(last-e_value_size)
  + blocksize);
  +   last-e_value_offs = cpu_to_le16(new_offs);
  +   }
  +   }
  +   /* Shift the entries by n bytes */
  +   memmove(to, from, n);
  +}
 
 Under what circumstances will that new BUG_ON trigger?  Can it be triggered
 via incorrect disk contents?  If so, it should not be there.

Only under code defect or memory corruption.  The needed extra space was
already checked in ext4_expand_extra_isize_ea(), but I asked for this
BUG_ON() because it would otherwise seem that there was a chance from
reading just the above code that the shifted EA might overflow the buffer.

  +int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
  +   struct ext4_inode *raw_inode, handle_t *handle)
  +{
  +   down_write(EXT4_I(inode)-xattr_sem);
  +retry:
  +   if (EXT4_I(inode)-i_extra_isize = new_extra_isize) {
  +   up_write(EXT4_I(inode)-xattr_sem);
  +   return 0;
  +   }
 
 So..  xattr_sem is a lock which protects i_extra_isize?  That seems a bit
 strange to me - I'd have expected i_mutex.

Well, since this is the only code that ever changes i_extra_isize, and it
also needs to move the EAs around, it makes sense to use the EA lock for it.
This is also a relatively low-contention lock, and it needs to be held to
access any of the EAs (which are the only things beyond i_extra_isize).

  +   if (EXT4_I(inode)-i_file_acl) {
  +   bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
  +   error = -EIO;
  +   if (!bh)
  +   goto cleanup;
  +   if (ext4_xattr_check_block(bh)) {
  +   ext4_error(inode-i_sb, __FUNCTION__,
  +   inode %lu: bad block %llu, inode-i_ino,
  +   EXT4_I(inode)-i_file_acl);
  +   error = -EIO;
  +   goto cleanup;
  +   }
  +   base = BHDR(bh);
  +   first = BFIRST(bh);
  +   end = bh-b_data + bh-b_size;
  +   min_offs = end - base;
  +   free = ext4_xattr_free_space(first, min_offs, base,
  +total_blk);
  +   if (free  new_extra_isize) {
  +   if (!tried_min_extra_isize  s_min_extra_isize) {
  +   tried_min_extra_isize++;
  +

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Girish Shilamkar

On Tue, 2007-07-10 at 23:16 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Journal checksum feature has been added to detect corruption of journal.
 
 That was brief.  No description of what it does, how it does it, why it
 does it, how one operates it, why (or why not) one would choose to enable
 it nor of the costs of enabling it.
Somehow the description was lost due to the way the original patchset
was sent to ext4 list.

Here is the original description:

The journal checksum feature adds two new flags i.e 
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

For recovery in pass scan checksums are verified to ensure the sanity
and completeness(in case of _ASYNC_COMMIT) of every transaction.


 
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]
  Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
  
  diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c
  --- linux024/fs/ext4/super.c2007-06-25 16:19:24.0 -0500
  +++ linux/fs/ext4/super.c   2007-06-26 08:35:16.0 -0500
  @@ -721,6 +721,7 @@ enum {
  Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
  Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
  Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
  +   Opt_journal_checksum, Opt_journal_async_commit,
 
 A new user-visible feature should be accompanied by an update to
 Documentation/filesystems/ext4.txt?
Ok.

  Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
  Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
  Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
  @@ -760,6 +761,8 @@ static match_table_t tokens = {
  {Opt_journal_update, journal=update},
  {Opt_journal_inum, journal=%u},
  {Opt_journal_dev, journal_dev=%u},
  +   {Opt_journal_checksum, journal_checksum},
  +   {Opt_journal_async_commit, journal_async_commit},
 
 What's journal_async_commit?
I hope the description has answered this question.

  {Opt_abort, abort},
  {Opt_data_journal, data=journal},
  {Opt_data_ordered, data=ordered},
  @@ -948,6 +951,13 @@ static int parse_options (char *options,
  return 0;
  *journal_devnum = option;
  break;
  +   case Opt_journal_checksum:
  +   set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
  +   break;
  +   case Opt_journal_async_commit:
  +   set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT);
  +   set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
  +   break;
  case Opt_noload:
  set_opt (sbi-s_mount_opt, NOLOAD);
  break;
  @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super
  goto failed_mount4;
  }
   
  +   if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
  +   jbd2_journal_set_features(sbi-s_journal,
  +   JBD2_FEATURE_COMPAT_CHECKSUM, 0,
  +   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
  +   } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
  +   jbd2_journal_set_features(sbi-s_journal,
  +   JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
  +   jbd2_journal_clear_features(sbi-s_journal, 0, 0,
  +   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
  +   } else {
  +   jbd2_journal_clear_features(sbi-s_journal,
  +   JBD2_FEATURE_COMPAT_CHECKSUM, 0,
  +   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
  +   }
 
 Some discussion of the forward- and backward- compatibility design would be
 appropriate.  Also a description of whether and how fsck can be used to fix
 up these feature flags.
  /* We have now updated the journal if required, so we can
   * validate the data journaling mode. */
  switch (test_opt(sb, DATA_FLAGS)) {
  diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c
  --- linux024/fs/jbd2/commit.c   2007-06-25 16:19:25.0 -0500
  +++ linux/fs/jbd2/commit.c  2007-06-26 08:40:03.0 -0500
  @@ -21,6 +21,7 @@
   #include linux/mm.h
   #include linux/pagemap.h
   #include linux/jiffies.h
  +#include linux/crc32.h
 
 I think we just broke

Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  22:40 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
  the subdir count for any directory crosses 65000.
 
 Would I be correct in assuming that a later fsck will clear
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir
 directories?

Correct.

 If so, that is worth a mention in the changelog, perhaps?
 
 Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
 old ext4, ext3 and ext2 can only mount this fs read-only, yes?

Also correct.  The COMPAT flag behaviour is described in detail in
Documentation/filesystems/ext[234].txt

  +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
  +{
  +   inc_nlink(inode);
  +   if (is_dx(inode)  inode-i_nlink  1) {
  +   /* limit is 16-bit i_links_count */
  +   if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) {
  +   inode-i_nlink = 1;
  +   EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb,
  + EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
  +   }
  +   }
  +}
 
 Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

Because that means it was previously 1 (inc_nlink() was already called).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  17:16 +0530, Girish Shilamkar wrote:
   + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
   + jbd2_journal_set_features(sbi-s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
   + jbd2_journal_set_features(sbi-s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
   + jbd2_journal_clear_features(sbi-s_journal, 0, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + } else {
   + jbd2_journal_clear_features(sbi-s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + }
  
  Some discussion of the forward- and backward- compatibility design would be
  appropriate.  Also a description of whether and how fsck can be used to fix
  up these feature flags.

It is forward  backward compatible to enable COMPAT_CHECKSUM.  That just
means the commit blocks will have checksums in them, but older kernels
will just ignore them.  Hmm, I suppose there might be an issue with upgrade,
downgrade, upgrade in that the commit blocks would not have checksums
even though the superblock says they will...

Does that mean we should accept a checksum == 0 as being valid (which
is not very nice, given that 0 is an oft-hit bad value), or that we need
a flag in every commit block which indicates if it actually has a checksum?

The INCOMPAT_ASYNC_COMMIT can't be handled safely by older kernels, since
they would assume commit block == complete transaction, which isn't
true if the commit block didn't wait for the rest of the blocks to make
it to the disk.

I don't think e2fsck can be used to individually clean up the feature flags,
but it is always possible to remove and recreate the journal...

   - /* AKPM: buglet - add `i' to tmp! */
  
  Damn.  After, what, seven years, someone actually fixed it?
  
 for (i = 0; i  bh-b_size; i += 512) {
   - journal_header_t *tmp = (journal_header_t*)bh-b_data;
   + struct commit_header *tmp =
   + (struct commit_header *)(bh-b_data + i);
 tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
 tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
 tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
   +
   + if (JBD2_HAS_COMPAT_FEATURE(journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM)) {
   + tmp-h_chksum_type  = JBD2_CRC32_CHKSUM;
   + tmp-h_chksum_size  = JBD2_CRC32_CHKSUM_SIZE;
   + tmp-h_chksum[0]= cpu_to_be32(crc32_sum);
   + }
 }
  
  And in doing so, changed the on-disk format of the journal commit blocks.
  
  Surely this was worth a mention in the changelog, if not a standalone patch?
  
  I don't think this is worth doing, really.  Why not just leave the format
  as it was, take the loop out and run this code once rather than eight
  times?

Well, we aren't using the rest of the commit block in any case.  I think
the original intention was that we'd get 8 copies of the commit block so
we would be sure to get a good one.

I don't know whether we'd rather have 8 copies of the commit block, or
more potential to expand the commit block?  I don't personally have any
preference, since the checksum should be a more robust way of checking
validity than having multiple copies, so we may as well remove the loop
and stick with a single copy for now.

   @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
 unsigned intsequence;
 int blocktype;
 int tag_bytes = journal_tag_bytes(journal);
   + __u32   crc32_sum = ~0; /* Transactional Checksums */
  
  We normally use __u32 for visible-to-userspace stuff.  Kernel code would
  use plain old u32.
 Ok.

Since the checksum is saved to disk, it seems more appropriate to use __u32
or maybe even __be32, though I'm not sure if the crc32 functions do that
correctly or not.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Dave Kleikamp

On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
 It just occurred to me:
 
  If i_version is 64bit, then knfsd would need to be careful when
  reading it on a 32bit host.  What are the locking rules?

How does knfsd use i_version?  I would think that if all it was doing
was to compare (i_version == previous_version), then locking wouldn't
really matter.  Well, theoretically, previous_version could be
0x1, and i_version could be 0x1, knfsd checks the high
word, then ext4 updates i_version to 0x2, then knfsd checks the
low word, detecting no change.  How likely is this?  (I don't understand
why i_version even needs to be 64 bits in the first place.)

  Presumably it is only updated under i_mutex protection, but having to
  get i_mutex to read it would seem a little heavy handed.

How does knfsd protect itself from the inode changing after i_version is
checked?  Is any locking being done otherwise?

  Should it use a seqlock like i_size?
  Could we use the same seqlock that i_size uses, or would we need a
  separate one?
 
 NeilBrown

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree

2007-07-11 Thread Ram Pai

On Wed, 2007-07-11 at 11:24 +0100, Christoph Hellwig wrote:
 On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote:
  Is that conjecture, or do you have evidence to that effect?  Most users 
  of this file are using it via the glibc interfaces, and there probably 
  aren't all that many users of it in the first place.
 
 I have written parsers for personal projects that might not have been
 happy to deal with additional fields myself for example..

I modified the patch to add fields towards the end of each line.
i.e after 'freq, passno' fields. And symlinked /etc/mtab
to /proc/mounts.
mount,df  and friends were all perfectly happy.  

I imagine your script may also be happy with the additional fields
**towards the end**. I would like to avoid one more mount interface if
we can help it.

RP



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Theodore Tso

On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:53 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Add a noversion mount option to disable inode version updates.
 
 Why is this option being offered to our users?  To reduce disk traffic,
 like noatime?
 
 If so, what are the implications of this?  What would the user lose?

This has been removed in the latest patch set; it's needed only for
Lustre, because they set the version field themselves.  Lustre needs
the inode version to be globally monotonically increasing, so it can
order updates between two different files, so it does this itself.
NFSv4 only uses i_version to detect changes, and so there's no need to
use a global atomic counter for i_version.  So the thinking was that
there was no point doing the global atomic counter if it was not necessary.

Since noversion is Lustre specific, we've dropped that from the list
of patches that we'll push, and so the inode version will only have
local per-inode significance, and not have any global ordering
properties.  

We have not actually benchmarked whether or not doing the global
ordering actually *matters* in terms of being actually noticeable.  If
it isn't noticeable, I wouldn't mind changing things so that we always
make i_version globally significant (without a mount option), and make
life a bit easier for the Lustre folks.  Or if someone other
distributed filesystem requests a globally significant i_version.  But
we can cross that bridge when we get to it

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: -mm merge plans for 2.6.23

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse [EMAIL PROTECTED] wrote:

 On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote:
  On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote:
   romfs-printk-format-warnings.patch
  
  NACK on this one. 
 
 The rest of it is nacked anyway, until we unify the point and
 get_unmapped_area methods of the MTD API.
 

Methinks you meant
nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not
romfs-printk-format-warnings.patch.

I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: -mm merge plans for 2.6.23

2007-07-11 Thread Randy Dunlap

On Wed, 11 Jul 2007 10:21:03 -0700 Andrew Morton wrote:

 On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse [EMAIL PROTECTED] wrote:
 
  On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote:
   On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote:
romfs-printk-format-warnings.patch
   
   NACK on this one. 
  
  The rest of it is nacked anyway, until we unify the point and
  get_unmapped_area methods of the MTD API.
  
 
 Methinks you meant
 nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not
 romfs-printk-format-warnings.patch.
 
 I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks.

Thanks.  I was certainly getting confused.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread J. Bruce Fields

On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote:
 And just by-the-way, the server doesn't really have the option of not
 sending the attribute.  If i_version isn't defined, it has to fake
 something using mtime, and hope that is good enough.

ctime, actually--the change attribute is also supposed to be updated on
attribute updates.

 Alternately we could mandate that i_version is always kept up-to-date
 and if a filesystem doesn't have anything to load from storage, it
 just sets it to the current time in nanoseconds.
 
 That would mean that a client would need to flush it's cache whenever
 the inode fell out of cache on the server, but I don't think we can
 reliably do better than that.
 
 I think I like that approach.
 
 So my vote is to increment i_version in common code every time any
 change is made to the file, and alloc_inode should initialise it to
 current time, which might be changed by the filesystem before it calls
 unlock_new_inode. 

So the client would be invalidating its cache more often than necessary,
rather than failing to invalidate it when it should.  I agree that
that's probably the better tradeoff, although I wish I had a better idea
of the downside.  I don't know, for example, whether users might see
unpleasant results if every client has to reread its cached data on a
reboot.

The currently proposed change--just providing a model change attribute
implementation for ext4 and leaving other filesystems untouched--is a
more conservative step.

So I'm inclined to just do this ext4 thing first, and then look into
further change attribute experiments next time around

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

-   /* AKPM: buglet - add `i' to tmp! */
   
   Damn.  After, what, seven years, someone actually fixed it?
   
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   struct commit_header *tmp =
+   (struct commit_header *)(bh-b_data + i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = 
cpu_to_be32(commit_transaction-t_tid);
+
+   if (JBD2_HAS_COMPAT_FEATURE(journal,
+   
JBD2_FEATURE_COMPAT_CHECKSUM)) {
+   tmp-h_chksum_type  = JBD2_CRC32_CHKSUM;
+   tmp-h_chksum_size  = 
JBD2_CRC32_CHKSUM_SIZE;
+   tmp-h_chksum[0]= 
cpu_to_be32(crc32_sum);
+   }
}
   
   And in doing so, changed the on-disk format of the journal commit blocks.
   
   Surely this was worth a mention in the changelog, if not a standalone 
   patch?
   
   I don't think this is worth doing, really.  Why not just leave the format
   as it was, take the loop out and run this code once rather than eight
   times?
 
 Well, we aren't using the rest of the commit block in any case.  I think
 the original intention was that we'd get 8 copies of the commit block so
 we would be sure to get a good one.
 
 I don't know whether we'd rather have 8 copies of the commit block, or
 more potential to expand the commit block?  I don't personally have any
 preference, since the checksum should be a more robust way of checking
 validity than having multiple copies, so we may as well remove the loop
 and stick with a single copy for now.

We've never altered any commit block sectors apart from the zeroeth one
(eight times) due to the above bug.  So I'd suggest that we should formalise
the old bug and leave the format as-is.  That'll leave lots of space spare in
the commit block.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
 err = ext4_reserve_inode_write(handle, inode, iloc);
   + if (EXT4_I(inode)-i_extra_isize 
   + EXT4_SB(inode-i_sb)-s_want_extra_isize 
   + !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
   + /* We need extra buffer credits since we may write into EA block
   +  * with this same handle */
   + if ((jbd2_journal_extend(handle,
   +  EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
   + ret = ext4_expand_extra_isize(inode,
   + EXT4_SB(inode-i_sb)-s_want_extra_isize,
   + iloc, handle);
   + if (ret) {
   + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
   + if (!expand_message) {
   + ext4_warning(inode-i_sb, __FUNCTION__,
   + Unable to expand inode %lu. Delete
   +  some EAs or run e2fsck.,
   + inode-i_ino);
   + expand_message = 1;
   + }
   + }
   + }
   + }
  
  Maybe that message should come out once per mount rather than once per boot
  (or once per modprobe)?
 
 Probably true.
 
  What are the consequences of a jbd2_journal_extend() failure in here?
 
 Not fatal, just like every user of i_extra_isize.  If the inode isn't a
 large inode, or it can't be expanded then there will be a minor loss of
 functionality on that inode.  If the i_extra_isize is critical, then
 the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
 
 Note that this is only applicable for filesystems which are upgraded.  For
 new inodes (i.e. all inodes that exist in the filesystem if it was always
 run on a kernel with the currently understood extra fields) then this will
 never be invoked (until such a time new extra fields are added).

I'd suggest that we get a comment in the code explaining this: this
unchecked error does rather stand out.

   + if (EXT4_I(inode)-i_file_acl) {
   + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + if (!bh)
   + goto cleanup;
   + if (ext4_xattr_check_block(bh)) {
   + ext4_error(inode-i_sb, __FUNCTION__,
   + inode %lu: bad block %llu, inode-i_ino,
   + EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + goto cleanup;
   + }
   + base = BHDR(bh);
   + first = BFIRST(bh);
   + end = bh-b_data + bh-b_size;
   + min_offs = end - base;
   + free = ext4_xattr_free_space(first, min_offs, base,
   +  total_blk);
   + if (free  new_extra_isize) {
   + if (!tried_min_extra_isize  s_min_extra_isize) {
   + tried_min_extra_isize++;
   + new_extra_isize = s_min_extra_isize;
  
  Aren't we missing a brelse(bh) here?
 
 Seems likely, yes.

OK - could we get a positive ack from someone indicating that this will get
looked at?  Because I am about to forget about it.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs

2007-07-11 Thread Jose R. Santos

Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
than 32bit block sizes during mount time.  This ensure proper record
lenth when writing to the journal.

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
---
 fs/ext4/super.c |7 7 + 0 - 0 !
 1 file changed, 7 insertions(+)

Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-07-11 09:36:00.0 -0500
+++ linux-2.6/fs/ext4/super.c   2007-07-11 09:36:20.0 -0500
@@ -1804,6 +1804,13 @@ static int ext4_fill_super (struct super
goto failed_mount3;
}
 
+   if (ext4_blocks_count(es)  0xULL 
+   !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0,
+  JBD2_FEATURE_INCOMPAT_64BIT)) {
+   printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n);
+   goto failed_mount4;
+   }
+
/* We have now updated the journal if required, so we can
 * validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Jose R. Santos

The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
create_proc_entry() does not do lookups on file names that are more that one
directory deep.  This causes the entry creation to fail and hence, no proc
file is created.

Instead of fixing this on procfs might as well move the jbd2-debug file to
debugfs which would be the preferred location for this kind of tunable.  The
new location is now /sys/kernel/debug/jbd2/jbd2-debug.


Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
---
 fs/Kconfig   |   105 + 5 - 0 !
 fs/jbd2/journal.c|   6727 +40 -0 !
 include/linux/jbd2.h |21 + 1 - 0 !
 3 files changed, 33 insertions(+), 46 deletions(-)

Index: linux-2.6/fs/jbd2/journal.c
===
--- linux-2.6.orig/fs/jbd2/journal.c2007-07-11 09:46:25.0 -0500
+++ linux-2.6/fs/jbd2/journal.c 2007-07-11 11:31:30.0 -0500
@@ -35,6 +35,7 @@
 #include linux/kthread.h
 #include linux/poison.h
 #include linux/proc_fs.h
+#include linux/debugfs.h
 
 #include asm/uaccess.h
 #include asm/page.h
@@ -1951,64 +1952,50 @@ void jbd2_journal_put_journal_head(struc
 }
 
 /*
- * /proc tunables
+ * debugfs tunables
  */
 #if defined(CONFIG_JBD2_DEBUG)
-int jbd2_journal_enable_debug;
+u8 jbd2_journal_enable_debug;
 EXPORT_SYMBOL(jbd2_journal_enable_debug);
 #endif
 
-#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
+#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
 
-static struct proc_dir_entry *proc_jbd_debug;
+#define JBD2_DEBUG_NAME jbd2-debug
 
-static int read_jbd_debug(char *page, char **start, off_t off,
- int count, int *eof, void *data)
-{
-   int ret;
+struct dentry *jbd2_debugfs_dir, *jbd2_debug;
 
-   ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug);
-   *eof = 1;
-   return ret;
+static void __init jbd2_create_debugfs_entry(void)
+{
+   jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL);
+   if (jbd2_debugfs_dir)
+   jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO,
+  jbd2_debugfs_dir,
+  jbd2_journal_enable_debug);
 }
 
-static int write_jbd_debug(struct file *file, const char __user *buffer,
-  unsigned long count, void *data)
+static void __exit jbd2_remove_debugfs_entry(void)
 {
-   char buf[32];
-
-   if (count  ARRAY_SIZE(buf) - 1)
-   count = ARRAY_SIZE(buf) - 1;
-   if (copy_from_user(buf, buffer, count))
-   return -EFAULT;
-   buf[ARRAY_SIZE(buf) - 1] = '\0';
-   jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10);
-   return count;
+   if (jbd2_debug)
+   debugfs_remove(jbd2_debug);
+   if (jbd2_debugfs_dir)
+   debugfs_remove(jbd2_debugfs_dir);
 }
 
-#define JBD_PROC_NAME sys/fs/jbd2-debug
+#else
 
-static void __init create_jbd_proc_entry(void)
+static void __init jbd2_create_debugfs_entry(void)
 {
-   proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL);
-   if (proc_jbd_debug) {
-   /* Why is this so hard? */
-   proc_jbd_debug-read_proc = read_jbd_debug;
-   proc_jbd_debug-write_proc = write_jbd_debug;
-   }
+   do {
+   } while (0);
 }
 
-static void __exit jbd2_remove_jbd_proc_entry(void)
+static void __exit jbd2_remove_debugfs_entry(void)
 {
-   if (proc_jbd_debug)
-   remove_proc_entry(JBD_PROC_NAME, NULL);
+   do {
+   } while (0);
 }
 
-#else
-
-#define create_jbd_proc_entry() do {} while (0)
-#define jbd2_remove_jbd_proc_entry() do {} while (0)
-
 #endif
 
 struct kmem_cache *jbd2_handle_cache;
@@ -2067,7 +2054,7 @@ static int __init journal_init(void)
ret = journal_init_caches();
if (ret != 0)
jbd2_journal_destroy_caches();
-   create_jbd_proc_entry();
+   jbd2_create_debugfs_entry();
return ret;
 }
 
@@ -2078,7 +2065,7 @@ static void __exit journal_exit(void)
if (n)
printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
 #endif
-   jbd2_remove_jbd_proc_entry();
+   jbd2_remove_debugfs_entry();
jbd2_journal_destroy_caches();
 }
 
Index: linux-2.6/include/linux/jbd2.h
===
--- linux-2.6.orig/include/linux/jbd2.h 2007-07-11 09:46:25.0 -0500
+++ linux-2.6/include/linux/jbd2.h  2007-07-11 10:37:06.0 -0500
@@ -57,7 +57,7 @@
  * CONFIG_JBD2_DEBUG is on.
  */
 #define JBD_EXPENSIVE_CHECKING
-extern int jbd2_journal_enable_debug;
+extern u8 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...)  \
do {\
Index: linux-2.6/fs/Kconfig
===
---

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Kalpak Shah

On Wed, 2007-07-11 at 10:34 -0700, Andrew Morton wrote:
 On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:
 
  On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
err = ext4_reserve_inode_write(handle, inode, iloc);
+   if (EXT4_I(inode)-i_extra_isize 
+   EXT4_SB(inode-i_sb)-s_want_extra_isize 
+   !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
+   /* We need extra buffer credits since we may write into 
EA block
+* with this same handle */
+   if ((jbd2_journal_extend(handle,
+EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 
0) {
+   ret = ext4_expand_extra_isize(inode,
+   
EXT4_SB(inode-i_sb)-s_want_extra_isize,
+   iloc, handle);
+   if (ret) {
+   EXT4_I(inode)-i_state |= 
EXT4_STATE_NO_EXPAND;
+   if (!expand_message) {
+   ext4_warning(inode-i_sb, 
__FUNCTION__,
+   Unable to expand inode %lu. 
Delete
+some EAs or run e2fsck.,
+   inode-i_ino);
+   expand_message = 1;
+   }
+   }
+   }
+   }
   
   Maybe that message should come out once per mount rather than once per 
   boot
   (or once per modprobe)?
  
  Probably true.
  
   What are the consequences of a jbd2_journal_extend() failure in here?
  
  Not fatal, just like every user of i_extra_isize.  If the inode isn't a
  large inode, or it can't be expanded then there will be a minor loss of
  functionality on that inode.  If the i_extra_isize is critical, then
  the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
  
  Note that this is only applicable for filesystems which are upgraded.  For
  new inodes (i.e. all inodes that exist in the filesystem if it was always
  run on a kernel with the currently understood extra fields) then this will
  never be invoked (until such a time new extra fields are added).
 
 I'd suggest that we get a comment in the code explaining this: this
 unchecked error does rather stand out.
 
+   if (EXT4_I(inode)-i_file_acl) {
+   bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
+   error = -EIO;
+   if (!bh)
+   goto cleanup;
+   if (ext4_xattr_check_block(bh)) {
+   ext4_error(inode-i_sb, __FUNCTION__,
+   inode %lu: bad block %llu, 
inode-i_ino,
+   EXT4_I(inode)-i_file_acl);
+   error = -EIO;
+   goto cleanup;
+   }
+   base = BHDR(bh);
+   first = BFIRST(bh);
+   end = bh-b_data + bh-b_size;
+   min_offs = end - base;
+   free = ext4_xattr_free_space(first, min_offs, base,
+total_blk);
+   if (free  new_extra_isize) {
+   if (!tried_min_extra_isize  
s_min_extra_isize) {
+   tried_min_extra_isize++;
+   new_extra_isize = s_min_extra_isize;
   
   Aren't we missing a brelse(bh) here?
  
  Seems likely, yes.
 
 OK - could we get a positive ack from someone indicating that this will get
 looked at?  Because I am about to forget about it.

I will send a patch to add the comments and make the suggested
corrections.

Thanks,
Kalpak.

 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread J. Bruce Fields

On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote:
 On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
  It just occurred to me:
  
   If i_version is 64bit, then knfsd would need to be careful when
   reading it on a 32bit host.  What are the locking rules?
 
 How does knfsd use i_version?  I would think that if all it was doing
 was to compare (i_version == previous_version)

That's correct.  (Though it's the client that's doing the comparison,
actually--the server is just reporting the value.)

 then locking wouldn't really matter.  Well, theoretically,
 previous_version could be 0x1, and i_version could be
 0x1, knfsd checks the high word, then ext4 updates i_version
 to 0x2, then knfsd checks the low word, detecting no change.
 How likely is this?

The choice of upper word in your example is arbitrary, but other than
that I believe your example is essentially the only one.  So this would
only happen when *both*

- the read of the new value of the low word happens precisely
  2^32 i_version updates after the word was read on the client's
  previous cache revalidation, and
- the value of i_version itself is close enough to a 32-bit
  boundary that wraparound can happen between the reads of the
  high and low words.

 (I don't understand why i_version even needs to be 64 bits in the
 first place.)

A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
That's not a problem in itself--the problem would only arise if two
subsequent client queries of the change attribute happened a multiple of
2^32 i_version increments apart.

This is more likely than the previous scenario, but still very unlikely.
I would have guessed that even in situations with a very high rate of
updates and a low rate of client revalidations, the chance of two
revalidations happening exactly 2^32 updates apart would still be no
more than 1 in 2^32.  (Could odd characteristics of the workloads (like
updates that tend to happen in power-of-2 groups?) make it any more
likely?)

I'd be happier if ext4 at least allowed the possibility of 64 bits in
the future.  And there's always the chance someone would find a use for
an i_version that was nondecreasing, even if nfs didn't care.

   Presumably it is only updated under i_mutex protection, but having to
   get i_mutex to read it would seem a little heavy handed.
 
 How does knfsd protect itself from the inode changing after i_version is
 checked?  Is any locking being done otherwise?

If the client always requests the change attribute before reading, and
the i_version is always updated after data is modified, I think we're
OK.  Admittedly this is a little subtle.

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-11 Thread Ric Wheeler



[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a I'm a high-end array bit
in the sense data that I'm unaware of?



There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not 
sure it is worth the trouble to do any automatic detection/handling of this.


In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its 
exported LUNs.


ric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] nfs: disable leases over NFS

2007-07-11 Thread J. Bruce Fields

On Wed, Jul 11, 2007 at 11:20:18AM +0100, Christoph Hellwig wrote:
 On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote:
  OK, after looking at this a little more, I'm less happy about the idea
  of erroring out by default:
  
  - There are a ton of filesystems that probably should allow
leases, and only a few (network filesystems) that shouldn't,
so leaving leases on by default seems simpler.
 
 But it gets you possible wrong behaviour by default.  I'm not a big
 fan of non-trivial default methods as you see :)

Yeah, I do understand the attraction of doing it your way.  With some
quick grepping, what I found was:

- about 28 on-disk filesystems, all of which I assume should
  support leases.
- about 12 filesystems that either are network filesystems (9p,
  afs, cifs, coda, ncpfs, nfs, nfs4, ocfs2, smbfs) or that don't
  have control over all opens/data modifications for whatever
  reason (ecryptfs, fuse, hostfs), so shouldn't be giving out
  leases by default.
- A bunch of synthetic filesystems (proc, sysfs...).

The latter category being the strongest argument for your approach,
since it's sort of ludicrous to allow leases on those filesystems, but
currently we do just out of laziness.  (Or, in any case, I don't see any
reason the current code wouldn't allow them; I haven't actually tested).

  - We already fall back on the local method by default in the case
of locks, and I don't see a reason to treat leases differently.
  - The patch to add
  .setlease = setlease,
to all the file_operations is going to be a big patch that
changes behavior in a way that might be easy to miss (because
it changes behavior exactly on those filesystems it *doesn't*
touch.)  I think it'll be easier to get better review on the
patch that adds a method just to those filesystems that we're
disabling leases for.
 
 Anyway, feel free to go ahead with the simpler version for now, I'll do
 the switchover for locks and leases when I get some time.

That would be great.  For now I think I'll also add another simple
EINVAL-returning setlease() at least to cifs (partly just as an attempt
to goad Steve French into following up on a promise at OLS to look into
proper lease support for cifs)

But I've appended my first attempt at your suggestion for leases, in
case it's of use.  (Untested.)

--b.

From: J. Bruce Fields [EMAIL PROTECTED]
Subject: [PATCH] Turn off support for fcntl leases by default

A lease enforces mutual exclusion with conflicting opens.  On
filesystems such as network filesystems where not all opens happen under
the control of this kernel, the default setlease() operation, which can
only check for local conflicts, is incorrect.  So in most cases the
correct thing is probably to disable leases for those filesystems until
they can implement something more sophisticated.

The only users of leases that I'm aware of (samba and nfsd) are actually
using them primarly to get synchronous notification of changes to files
so that they can update their caches.  So, more generally, we should
disable leases for any filesystem which might allow file contents to
change without a local open for write occuring.  That includes most of
the synthetic filesystems (like proc), which the file servers probably
don't want to export anyway.

Previously we explicitly disabled leases for some network filesystems,
but with this patch we disable by default and add an explicit setlease
method for those filesystems on which leases will be allowed.

Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 fs/adfs/file.c  |1 +
 fs/affs/file.c  |1 +
 fs/bfs/file.c   |1 +
 fs/ext2/file.c  |2 ++
 fs/ext3/file.c  |1 +
 fs/ext4/file.c  |1 +
 fs/fat/file.c   |1 +
 fs/hfs/inode.c  |1 +
 fs/hfsplus/inode.c  |1 +
 fs/hppfs/hppfs_kern.c   |1 +
 fs/jffs2/file.c |1 +
 fs/jfs/file.c   |1 +
 fs/locks.c  |8 
 fs/minix/file.c |1 +
 fs/nfs/file.c   |   12 
 fs/ntfs/file.c  |1 +
 fs/qnx4/file.c  |1 +
 fs/ramfs/file-mmu.c |1 +
 fs/ramfs/file-nommu.c   |1 +
 fs/read_write.c |1 +
 fs/reiserfs/file.c  |1 +
 fs/sysv/file.c  |1 +
 fs/udf/file.c   |1 +
 fs/ufs/file.c   |1 +
 fs/xfs/linux-2.6/xfs_file.c |2 ++
 25 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index f544a28..4e99861 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -34,6 +34,7 @@ const struct file_operations adfs_file_operations = {
.write  = do_sync_write,
.aio_write  =

[PATCH 00/23] Mount writer count API (read-only bind mounts prep)

2007-07-11 Thread Dave Hansen

The most contentious part of the r/o bind mount patches is
actually implementing the count tracking.  It has NUMA and
SMP implications, and is going to need to have a whole
discussion on that one patch.

These patches, on the other hand, simply introduce a new
API: mnt_want_write() and mnt_drop_write().  They do not
functionally change the kernel, just alter the way in
which we check filesystems for our ability to write to
them.

These functions should be used in place of IS_RDONLY(inode)
as they explicitly spell out when a mount is expected to
_stay_ r/w instead of simply checking at a single point
in time.

It should take a very small number (like 3) of small
patches to actually implement read-only bind mounts on
top of this new API.

These apply to current -git (as of July 11th, 2007).

---

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of
filesystems writable.  It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems.  This also replaces 
patches that vserver has had out of the tree for several years.

It allows security enhancement by making sure that parts of
your filesystem read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using the following script to test that the feature is
working as desired.  It takes a directory and makes a regular
bind and a r/o bind mount of it.  It then performs some normal
filesystem operations on the three directories, including ones
that are expected to fail, like creating a file on the r/o
mount.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 01/23] rearrange may_open() to be r/o friendly

2007-07-11 Thread Dave Hansen


may_open() calls vfs_permission() before it does checks for
IS_RDONLY(inode).  It checks _again_ inside of vfs_permission().

The check inside of vfs_permission() is going away eventually.
With the mnt_want/drop_write() functions, all of the r/o
checks (except for this one) are consistently done before
calling permission().  Because of this, I'd like to use
permission() to hold a debugging check to make sure that
the mnt_want/drop_write() calls are actually being made.

So, to do this:
1. remove the IS_RDONLY() check from permission()
2. enforce that you must mnt_want_write() before
   even calling permission()
3. enable a debugging in permission()

We need to rearrange may_open().  Here's the patch.

---

 lxc-dave/fs/namei.c |   14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff -puN fs/namei.c~rearrange-permission-and-ro-checks-in-may_open fs/namei.c
--- lxc/fs/namei.c~rearrange-permission-and-ro-checks-in-may_open   
2007-07-10 12:46:01.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:01.0 -0700
@@ -228,6 +228,10 @@ int permission(struct inode *inode, int 
 {
umode_t mode = inode-i_mode;
int retval, submask;
+   struct vfsmount *mnt = NULL;
+
+   if (nd)
+   mnt = nd-mnt;
 
if (mask  MAY_WRITE) {
 
@@ -252,7 +256,7 @@ int permission(struct inode *inode, int 
 * the fs is mounted with the noexec flag.
 */
if ((mask  MAY_EXEC)  S_ISREG(mode)  (!(mode  S_IXUGO) ||
-   (nd  nd-mnt  (nd-mnt-mnt_flags  MNT_NOEXEC
+   (mnt  (mnt-mnt_flags  MNT_NOEXEC
return -EACCES;
 
/* Ordinary permission routines do not understand MAY_APPEND. */
@@ -1546,10 +1550,6 @@ int may_open(struct nameidata *nd, int a
if (S_ISDIR(inode-i_mode)  (flag  FMODE_WRITE))
return -EISDIR;
 
-   error = vfs_permission(nd, acc_mode);
-   if (error)
-   return error;
-
/*
 * FIFO's, sockets and device files are special: they don't
 * actually live on the filesystem itself, and as such you
@@ -1564,6 +1564,10 @@ int may_open(struct nameidata *nd, int a
flag = ~O_TRUNC;
} else if (IS_RDONLY(inode)  (flag  FMODE_WRITE))
return -EROFS;
+
+   error = vfs_permission(nd, acc_mode);
+   if (error)
+   return error;
/*
 * An append-only file must be opened in append mode for writing.
 */
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 02/23] create cleanup helper svc_msnfs()

2007-07-11 Thread Dave Hansen


I'm going to be modifying nfsd_rename() shortly to support
read-only bind mounts.  This #ifdef is around the area I'm
patching, and it starts to get really ugly if I just try
to add my new code by itself.  Using this little helper
makes things a lot cleaner to use.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/nfsd/vfs.c |   15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff -puN fs/nfsd/vfs.c~create-svc_msnfs-helper fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~create-svc_msnfs-helper   2007-07-10 12:46:02.0 
-0700
+++ lxc-dave/fs/nfsd/vfs.c  2007-07-10 12:46:02.0 -0700
@@ -837,6 +837,15 @@ nfsd_read_actor(read_descriptor_t *desc,
return size;
 }
 
+static inline int svc_msnfs(struct svc_fh *ffhp)
+{
+#ifdef MSNFS
+   return (ffhp-fh_export-ex_flags  NFSEXP_MSNFS);
+#else
+   return 0;
+#endif
+}
+
 static __be32
 nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
   loff_t offset, struct kvec *vec, int vlen, unsigned long *count)
@@ -849,11 +858,9 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
 
err = nfserr_perm;
inode = file-f_path.dentry-d_inode;
-#ifdef MSNFS
-   if ((fhp-fh_export-ex_flags  NFSEXP_MSNFS) 
-   (!lock_may_read(inode, offset, *count)))
+
+   if (svc_msnfs(fhp)  !lock_may_read(inode, offset, *count))
goto out;
-#endif
 
/* Get readahead parameters */
ra = nfsd_get_raparms(inode-i_sb-s_dev, inode-i_ino);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 04/23] r/o bind mounts: stub functions

2007-07-11 Thread Dave Hansen


This patch adds two function mnt_want_write() and
mnt_drop_write().  These are used like a lock pair around
and fs operations that might cause a write to the filesystem.

Before these can become useful, we must first cover each
place in the VFS where writes are performed with a
want/drop pair.  When that is complete, we can actually
introduce code that will safely check the counts before
allowing r/w-r/o transitions to occur.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namespace.c|   46 +
 lxc-dave/include/linux/mount.h |3 ++
 2 files changed, 49 insertions(+)

diff -puN fs/namespace.c~add-vfsmount-writer-count fs/namespace.c
--- lxc/fs/namespace.c~add-vfsmount-writer-count2007-07-10 
12:46:04.0 -0700
+++ lxc-dave/fs/namespace.c 2007-07-10 12:46:04.0 -0700
@@ -76,6 +76,52 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
 }
 
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/*
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success.  When
+ * the write operation is finished, mnt_drop_write()
+ * must be called.  This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+   if (__mnt_is_readonly(mnt))
+   return -EROFS;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+/*
+ * Tells the low-level filesystem that we are done
+ * performing a write to it.  Must be matched with
+ * mnt_want_write() call above.
+ */
+void mnt_drop_write(struct vfsmount *mnt)
+{
+}
+EXPORT_SYMBOL_GPL(mnt_drop_write);
+
+/*
+ * This shouldn't be used directly ouside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*e
+ * mnt_want/drop_write() will _keep_ the filesystem
+ * r/w.
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+   return (mnt-mnt_sb-s_flags  MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
 int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
 {
mnt-mnt_sb = sb;
diff -puN include/linux/mount.h~add-vfsmount-writer-count include/linux/mount.h
--- lxc/include/linux/mount.h~add-vfsmount-writer-count 2007-07-10 
12:46:04.0 -0700
+++ lxc-dave/include/linux/mount.h  2007-07-10 12:46:04.0 -0700
@@ -70,9 +70,12 @@ static inline struct vfsmount *mntget(st
return mnt;
 }
 
+extern int mnt_want_write(struct vfsmount *mnt);
+extern void mnt_drop_write(struct vfsmount *mnt);
 extern void mntput_no_expire(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
+extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 12/23] elevate mount count for extended attributes

2007-07-11 Thread Dave Hansen


This basically audits the callers of xattr_permission(), which
calls permission() and can perform writes to the filesystem.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/nfsd/nfs4proc.c |7 ++-
 lxc-dave/fs/xattr.c |   16 ++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff -puN fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes 
fs/nfsd/nfs4proc.c
--- lxc/fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes  
2007-07-10 12:46:10.0 -0700
+++ lxc-dave/fs/nfsd/nfs4proc.c 2007-07-10 12:46:10.0 -0700
@@ -626,14 +626,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+   status = mnt_want_write(cstate-current_fh.fh_export-ex_mnt);
+   if (status)
+   return status;
status = nfs_ok;
if (setattr-sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, cstate-current_fh,
setattr-sa_acl);
if (status)
-   return status;
+   goto out;
status = nfsd_setattr(rqstp, cstate-current_fh, setattr-sa_iattr,
0, (time_t)0);
+out:
+   mnt_drop_write(cstate-current_fh.fh_export-ex_mnt);
return status;
 }
 
diff -puN fs/xattr.c~elevate-mount-count-for-extended-attributes fs/xattr.c
--- lxc/fs/xattr.c~elevate-mount-count-for-extended-attributes  2007-07-10 
12:46:10.0 -0700
+++ lxc-dave/fs/xattr.c 2007-07-10 12:46:10.0 -0700
@@ -11,6 +11,7 @@
 #include linux/slab.h
 #include linux/file.h
 #include linux/xattr.h
+#include linux/mount.h
 #include linux/namei.h
 #include linux/security.h
 #include linux/syscalls.h
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
 * filesystem  or on an immutable / append-only inode.
 */
if (mask  MAY_WRITE) {
-   if (IS_RDONLY(inode))
-   return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return -EPERM;
}
@@ -236,7 +235,11 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, nd);
if (error)
return error;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   return error;
error = setxattr(nd.dentry, name, value, size, flags);
+   mnt_drop_write(nd.mnt);
path_release(nd);
return error;
 }
@@ -251,7 +254,11 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, nd);
if (error)
return error;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   return error;
error = setxattr(nd.dentry, name, value, size, flags);
+   mnt_drop_write(nd.mnt);
path_release(nd);
return error;
 }
@@ -267,9 +274,14 @@ sys_fsetxattr(int fd, char __user *name,
f = fget(fd);
if (!f)
return error;
+   error = mnt_want_write(f-f_vfsmnt);
+   if (error)
+   goto out_fput;
dentry = f-f_path.dentry;
audit_inode(NULL, dentry-d_inode);
error = setxattr(dentry, name, value, size, flags);
+   mnt_drop_write(f-f_vfsmnt);
+out_fput:
fput(f);
return error;
 }
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 11/23] elevate write count for link and symlink calls

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c |   10 ++
 1 file changed, 10 insertions(+)

diff -puN fs/namei.c~elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-for-link-and-symlink-calls   
2007-07-10 12:46:09.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:09.0 -0700
@@ -2266,7 +2266,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;
 
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_symlink(nd.dentry-d_inode, dentry, from, S_IALLUGO);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
@@ -2361,7 +2366,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry-d_inode, new_dentry);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
 out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 13/23] elevate write count for file_update_time()

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/inode.c |   13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-07-10 
12:46:10.0 -0700
+++ lxc-dave/fs/inode.c 2007-07-10 12:46:10.0 -0700
@@ -1221,10 +1221,19 @@ void file_update_time(struct file *file)
struct inode *inode = file-f_path.dentry-d_inode;
struct timespec now;
int sync_it = 0;
+   int err = 0;
 
if (IS_NOCMTIME(inode))
return;
-   if (IS_RDONLY(inode))
+   /*
+* Ideally, we want to guarantee that 'f_vfsmnt'
+* is non-NULL here.  But, NFS exports need to
+* be fixed up before we can do that.  So, check
+* it for now. - Dave Hansen
+*/
+   if (file-f_vfsmnt)
+   err = mnt_want_write(file-f_vfsmnt);
+   if (err)
return;
 
now = current_fs_time(inode-i_sb);
@@ -1240,6 +1249,8 @@ void file_update_time(struct file *file)
 
if (sync_it)
mark_inode_dirty_sync(inode);
+   if (file-f_vfsmnt)
+   mnt_drop_write(file-f_vfsmnt);
 }
 
 EXPORT_SYMBOL(file_update_time);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 14/23] mount_is_safe(): add comment

2007-07-11 Thread Dave Hansen


This area of code is currently #ifdef'd out, so add a comment
for the time when it is actually used.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namespace.c |4 
 1 file changed, 4 insertions(+)

diff -puN fs/namespace.c~mount-is-safe-add-comment fs/namespace.c
--- lxc/fs/namespace.c~mount-is-safe-add-comment2007-07-10 
12:46:11.0 -0700
+++ lxc-dave/fs/namespace.c 2007-07-10 12:46:11.0 -0700
@@ -728,6 +728,10 @@ static int mount_is_safe(struct nameidat
if (current-uid != nd-dentry-d_inode-i_uid)
return -EPERM;
}
+   /*
+* We will eventually check for the mnt-writer_count here,
+* but since the code is not used now, skip it - Dave Hansen
+*/
if (vfs_permission(nd, MAY_WRITE))
return -EPERM;
return 0;
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 09/23] elevate mnt writers for callers of vfs_mkdir()

2007-07-11 Thread Dave Hansen


Pretty self-explanatory.  Fits in with the rest of the series.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c|5 +
 lxc-dave/fs/nfsd/nfs4recover.c |4 
 2 files changed, 9 insertions(+)

diff -puN fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-07-10 
12:46:07.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:07.0 -0700
@@ -1993,7 +1993,12 @@ asmlinkage long sys_mkdirat(int dfd, con
 
if (!IS_POSIXACL(nd.dentry-d_inode))
mode = ~current-fs-umask;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_mkdir(nd.dentry-d_inode, dentry, mode);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
diff -puN fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 
fs/nfsd/nfs4recover.c
--- lxc/fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir  
2007-07-10 12:46:07.0 -0700
+++ lxc-dave/fs/nfsd/nfs4recover.c  2007-07-10 12:46:07.0 -0700
@@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk(NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n);
goto out_put;
}
+   status = mnt_want_write(rec_dir.mnt);
+   if (status)
+   goto out_put;
status = vfs_mkdir(rec_dir.dentry-d_inode, dentry, S_IRWXU);
+   mnt_drop_write(rec_dir.mnt);
 out_put:
dput(dentry);
 out_unlock:
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 15/23] unix_find_other() elevate write count for touch_atime()

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/net/unix/af_unix.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff -puN 
net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime 
net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime  
2007-07-10 12:46:11.0 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:11.0 -0700
@@ -702,21 +702,27 @@ static struct sock *unix_find_other(stru
err = path_lookup(sunname-sun_path, LOOKUP_FOLLOW, nd);
if (err)
goto fail;
+
+   err = mnt_want_write(nd.mnt);
+   if (err)
+   goto put_path_fail;
+
err = vfs_permission(nd, MAY_WRITE);
if (err)
-   goto put_fail;
+   goto mnt_drop_write_fail;
 
err = -ECONNREFUSED;
if (!S_ISSOCK(nd.dentry-d_inode-i_mode))
-   goto put_fail;
+   goto mnt_drop_write_fail;
u=unix_find_socket_byinode(nd.dentry-d_inode);
if (!u)
-   goto put_fail;
+   goto mnt_drop_write_fail;
 
if (u-sk_type == type)
touch_atime(nd.mnt, nd.dentry);
 
path_release(nd);
+   mnt_drop_write(nd.mnt);
 
err=-EPROTOTYPE;
if (u-sk_type != type) {
@@ -736,7 +742,9 @@ static struct sock *unix_find_other(stru
}
return u;
 
-put_fail:
+mnt_drop_write_fail:
+   mnt_drop_write(nd.mnt);
+put_path_fail:
path_release(nd);
 fail:
*error=err;
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 16/23] elevate write count over calls to vfs_rename()

2007-07-11 Thread Dave Hansen


This also uses the little helper in the NFS code to
make an if() a little bit less ugly.  We introduced
the helper at the beginning of the series.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c|4 
 lxc-dave/fs/nfsd/vfs.c |   15 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff -puN fs/namei.c~elevate-write-count-over-calls-to-vfs-rename fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-over-calls-to-vfs-rename 2007-07-10 
12:46:12.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:12.0 -0700
@@ -2597,8 +2597,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
 
+   error = mnt_want_write(oldnd.mnt);
+   if (error)
+   goto exit5;
error = vfs_rename(old_dir-d_inode, old_dentry,
   new_dir-d_inode, new_dentry);
+   mnt_drop_write(oldnd.mnt);
 exit5:
dput(new_dentry);
 exit4:
diff -puN fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename 
fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename  
2007-07-10 12:46:12.0 -0700
+++ lxc-dave/fs/nfsd/vfs.c  2007-07-10 12:46:12.0 -0700
@@ -1622,13 +1622,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;
 
-#ifdef MSNFS
-   if ((ffhp-fh_export-ex_flags  NFSEXP_MSNFS) 
+   if (svc_msnfs(ffhp) 
((atomic_read(odentry-d_count)  1)
 || (atomic_read(ndentry-d_count)  1))) {
host_err = -EPERM;
-   } else
-#endif
+   goto out_dput_new;
+   }
+
+   host_err = -EXDEV;
+   if (ffhp-fh_export-ex_mnt != tfhp-fh_export-ex_mnt)
+   goto out_dput_new;
+   host_err = mnt_want_write(ffhp-fh_export-ex_mnt);
+   if (host_err)
+   goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err  EX_ISSYNC(tfhp-fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 18/23] elevate writer count for do_sys_truncate()

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/open.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff -puN fs/open.c~elevate-writer-count-for-do-sys-truncate fs/open.c
--- lxc/fs/open.c~elevate-writer-count-for-do-sys-truncate  2007-07-10 
12:46:13.0 -0700
+++ lxc-dave/fs/open.c  2007-07-10 12:46:13.0 -0700
@@ -243,28 +243,28 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode-i_mode))
goto dput_and_out;
 
-   error = vfs_permission(nd, MAY_WRITE);
+   error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;
 
-   error = -EROFS;
-   if (IS_RDONLY(inode))
-   goto dput_and_out;
+   error = vfs_permission(nd, MAY_WRITE);
+   if (error)
+   goto mnt_drop_write_and_out;
 
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
/*
 * Make sure that there are no leases.
 */
error = break_lease(inode, FMODE_WRITE);
if (error)
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
error = get_write_access(inode);
if (error)
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
error = locks_verify_truncate(inode, NULL, length);
if (!error) {
@@ -273,6 +273,8 @@ static long do_sys_truncate(const char _
}
put_write_access(inode);
 
+mnt_drop_write_and_out:
+   mnt_drop_write(nd.mnt);
 dput_and_out:
path_release(nd);
 out:
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 20/23] elevate write count for do_sys_utime() and touch_atime()

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/inode.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff -puN fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 
fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 
2007-07-10 12:46:15.0 -0700
+++ lxc-dave/fs/inode.c 2007-07-10 12:46:15.0 -0700
@@ -1165,22 +1165,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry-d_inode;
struct timespec now;
 
-   if (inode-i_flags  S_NOATIME)
+   if (mnt  mnt_want_write(mnt))
return;
+   if (inode-i_flags  S_NOATIME)
+   goto out;
if (IS_NOATIME(inode))
-   return;
+   goto out;
if ((inode-i_sb-s_flags  MS_NODIRATIME)  S_ISDIR(inode-i_mode))
-   return;
+   goto out;
 
/*
 * We may have a NULL vfsmount when coming from NFSD
 */
if (mnt) {
if (mnt-mnt_flags  MNT_NOATIME)
-   return;
+   goto out;
if ((mnt-mnt_flags  MNT_NODIRATIME)  S_ISDIR(inode-i_mode))
-   return;
-
+   goto out;
if (mnt-mnt_flags  MNT_RELATIME) {
/*
 * With relative atime, only update atime if the
@@ -1191,16 +1192,19 @@ void touch_atime(struct vfsmount *mnt, s
inode-i_atime)  0 
timespec_compare(inode-i_ctime,
inode-i_atime)  0)
-   return;
+   goto out;
}
}
 
now = current_fs_time(inode-i_sb);
if (timespec_equal(inode-i_atime, now))
-   return;
+   goto out;
 
inode-i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+   if (mnt)
+   mnt_drop_write(mnt);
 }
 EXPORT_SYMBOL(touch_atime);
 
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 21/23] sys_mknodat(): elevate write count for vfs_mknod/create()

2007-07-11 Thread Dave Hansen


This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c |   32 +---
 lxc-dave/fs/nfsd/vfs.c  |4 
 lxc-dave/net/unix/af_unix.c |4 
 3 files changed, 29 insertions(+), 11 deletions(-)

diff -puN fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 
fs/namei.c
--- lxc/fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 
2007-07-10 12:46:15.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-11 10:10:55.0 -0700
@@ -1912,12 +1912,25 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(nd, 0);
-   error = PTR_ERR(dentry);
-
+   if (IS_ERR(dentry)) {
+   error = PTR_ERR(dentry);
+   goto out_unlock;
+   }
if (!IS_POSIXACL(nd.dentry-d_inode))
mode = ~current-fs-umask;
-   if (!IS_ERR(dentry)) {
-   switch (mode  S_IFMT) {
+   if (S_ISDIR(mode)) {
+   error = -EPERM;
+   goto out_dput;
+   }
+   if (!S_ISREG(mode)   !S_ISCHR(mode)   !S_ISBLK(mode) 
+   !S_ISFIFO(mode)  !S_ISSOCK(mode)  mode != 0) {
+   error = -EINVAL;
+   goto out_dput;
+   }
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
+   switch (mode  S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry-d_inode,dentry,mode,nd);
break;
@@ -1928,14 +1941,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry-d_inode,dentry,mode,0);
break;
-   case S_IFDIR:
-   error = -EPERM;
-   break;
-   default:
-   error = -EINVAL;
-   }
-   dput(dentry);
}
+   mnt_drop_write(nd.mnt);
+out_dput:
+   dput(dentry);
+out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
path_release(nd);
 out:
diff -puN fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 
fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create  
2007-07-10 12:46:15.0 -0700
+++ lxc-dave/fs/nfsd/vfs.c  2007-07-10 12:46:15.0 -0700
@@ -1199,7 +1199,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
case S_IFBLK:
case S_IFIFO:
case S_IFSOCK:
+   host_err = mnt_want_write(fhp-fh_export-ex_mnt);
+   if (host_err)
+   break;
host_err = vfs_mknod(dirp, dchild, iap-ia_mode, rdev);
+   mnt_drop_write(fhp-fh_export-ex_mnt);
break;
default:
printk(nfsd: bad file type %o in nfsd_create\n, type);
diff -puN 
net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 
net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 
2007-07-10 12:46:15.0 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:15.0 -0700
@@ -815,7 +815,11 @@ static int unix_bind(struct socket *sock
 */
mode = S_IFSOCK |
   (SOCK_INODE(sock)-i_mode  ~current-fs-umask);
+   err = mnt_want_write(nd.mnt);
+   if (err)
+   goto out_mknod_dput;
err = vfs_mknod(nd.dentry-d_inode, dentry, mode, 0);
+   mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(nd.dentry-d_inode-i_mutex);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 23/23] do_rmdir(): elevate write count

2007-07-11 Thread Dave Hansen


Elevate the write count during the vfs_rmdir() call.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c |5 +
 1 file changed, 5 insertions(+)

diff -puN fs/namei.c~do-rmdir-elevate-write-count fs/namei.c
--- lxc/fs/namei.c~do-rmdir-elevate-write-count 2007-07-10 12:46:16.0 
-0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700
@@ -2115,7 +2115,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto exit3;
error = vfs_rmdir(nd.dentry-d_inode, dentry);
+   mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
 exit2:
mutex_unlock(nd.dentry-d_inode-i_mutex);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 22/23] elevate mnt writers for vfs_unlink() callers

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/namei.c   |4 
 lxc-dave/ipc/mqueue.c |5 -
 2 files changed, 8 insertions(+), 1 deletion(-)

diff -puN fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers   2007-07-10 
12:46:16.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700
@@ -2195,7 +2195,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry-d_inode;
if (inode)
atomic_inc(inode-i_count);
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto exit2;
error = vfs_unlink(nd.dentry-d_inode, dentry);
+   mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c
--- lxc/ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-07-10 
12:46:16.0 -0700
+++ lxc-dave/ipc/mqueue.c   2007-07-10 12:46:16.0 -0700
@@ -750,8 +750,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry-d_inode;
if (inode)
atomic_inc(inode-i_count);
-
+   err = mnt_want_write(mqueue_mnt);
+   if (err)
+   goto out_err;
err = vfs_unlink(dentry-d_parent-d_inode, dentry);
+   mnt_drop_write(mqueue_mnt);
 out_err:
dput(dentry);
 
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 19/23] elevate write count for do_utimes()

2007-07-11 Thread Dave Hansen



Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/utimes.c |   15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff -puN fs/utimes.c~elevate-write-count-for-do-utimes fs/utimes.c
--- lxc/fs/utimes.c~elevate-write-count-for-do-utimes   2007-07-10 
12:46:14.0 -0700
+++ lxc-dave/fs/utimes.c2007-07-10 12:46:14.0 -0700
@@ -2,6 +2,7 @@
 #include linux/file.h
 #include linux/fs.h
 #include linux/linkage.h
+#include linux/mount.h
 #include linux/namei.h
 #include linux/sched.h
 #include linux/stat.h
@@ -75,8 +76,8 @@ long do_utimes(int dfd, char __user *fil
 
inode = dentry-d_inode;
 
-   error = -EROFS;
-   if (IS_RDONLY(inode))
+   error = mnt_want_write(nd.mnt);
+   if (error)
goto dput_and_out;
 
/* Don't worry, the checks are done in inode_change_ok() */
@@ -84,7 +85,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
 if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
if (times[0].tv_nsec == UTIME_OMIT)
newattrs.ia_valid = ~ATTR_ATIME;
@@ -104,22 +105,24 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
 if (IS_IMMUTABLE(inode))
-goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
if (current-fsuid != inode-i_uid) {
if (f) {
if (!(f-f_mode  FMODE_WRITE))
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
} else {
error = vfs_permission(nd, MAY_WRITE);
if (error)
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
}
}
}
mutex_lock(inode-i_mutex);
error = notify_change(dentry, newattrs);
mutex_unlock(inode-i_mutex);
+mnt_drop_write_and_out:
+   mnt_drop_write(nd.mnt);
 dput_and_out:
if (f)
fput(f);
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 10/23] elevate write count during entire ncp_ioctl()

2007-07-11 Thread Dave Hansen


Some ioctls need write access, but others don't.  Make a helper
function to decide when write access is needed, and take it.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/ncpfs/ioctl.c |   55 +-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff -puN fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl 
fs/ncpfs/ioctl.c
--- lxc/fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl
2007-07-10 12:46:08.0 -0700
+++ lxc-dave/fs/ncpfs/ioctl.c   2007-07-10 12:46:08.0 -0700
@@ -14,6 +14,7 @@
 #include linux/ioctl.h
 #include linux/time.h
 #include linux/mm.h
+#include linux/mount.h
 #include linux/highuid.h
 #include linux/smp_lock.h
 #include linux/vmalloc.h
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
 }
 #endif /* CONFIG_NCPFS_NLS */
 
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
  unsigned int cmd, unsigned long arg)
 {
struct ncp_server *server = NCP_SERVER(inode);
@@ -822,6 +823,58 @@ outrel:
return -EINVAL;
 }
 
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+   switch (cmd) {
+   case NCP_IOC_GET_FS_INFO:
+   case NCP_IOC_GET_FS_INFO_V2:
+   case NCP_IOC_NCPREQUEST:
+   case NCP_IOC_SETDENTRYTTL:
+   case NCP_IOC_SIGN_INIT:
+   case NCP_IOC_LOCKUNLOCK:
+   case NCP_IOC_SET_SIGN_WANTED:
+   return 1;
+   case NCP_IOC_GETOBJECTNAME:
+   case NCP_IOC_SETOBJECTNAME:
+   case NCP_IOC_GETPRIVATEDATA:
+   case NCP_IOC_SETPRIVATEDATA:
+   case NCP_IOC_SETCHARSETS:
+   case NCP_IOC_GETCHARSETS:
+   case NCP_IOC_CONN_LOGGED_IN:
+   case NCP_IOC_GETDENTRYTTL:
+   case NCP_IOC_GETMOUNTUID2:
+   case NCP_IOC_SIGN_WANTED:
+   case NCP_IOC_GETROOT:
+   case NCP_IOC_SETROOT:
+   return 0;
+   default:
+   /* unkown IOCTL command, assume write */
+   WARN_ON(1);
+   }
+   return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+   int ret;
+
+   if (ncp_ioctl_need_write(cmd)) {
+   /*
+* inside the ioctl(), any failures which
+* are because of file_permission() are
+* -EACCESS, so it seems consistent to keep
+*  that here.
+*/
+   if (mnt_want_write(filp-f_vfsmnt))
+   return -EACCES;
+   }
+   ret = __ncp_ioctl(inode, filp, cmd, arg);
+   if (ncp_ioctl_need_write(cmd))
+   mnt_drop_write(filp-f_vfsmnt);
+   return ret;
+}
+
 #ifdef CONFIG_COMPAT
 long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 08/23] make access() use mnt check

2007-07-11 Thread Dave Hansen


It is OK to let access() go without using a mnt_want/drop_write()
pair because it doesn't actually do writes to the filesystem,
and it is inherently racy anyway.  This is a rare case when it is
OK to use __mnt_is_readonly() directly.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/open.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/open.c~make-access-use-helper fs/open.c
--- lxc/fs/open.c~make-access-use-helper2007-07-10 12:46:07.0 
-0700
+++ lxc-dave/fs/open.c  2007-07-10 12:46:07.0 -0700
@@ -396,8 +396,17 @@ asmlinkage long sys_faccessat(int dfd, c
if(res || !(mode  S_IWOTH) ||
   special_file(nd.dentry-d_inode-i_mode))
goto out_path_release;
-
-   if(IS_RDONLY(nd.dentry-d_inode))
+   /*
+* This is a rare case where using __mnt_is_readonly()
+* is OK without a mnt_want/drop_write() pair.  Since
+* no actual write to the fs is performed here, we do
+* not need to telegraph to that to anyone.
+*
+* By doing this, we accept that this access is
+* inherently racy and know that the fs may change
+* state before we even see this result.
+*/
+   if (__mnt_is_readonly(nd.mnt))
res = -EROFS;
 
 out_path_release:
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 06/23] r/o bind mounts: elevate write count for some ioctls

2007-07-11 Thread Dave Hansen


Some ioctl()s can cause writes to the filesystem.  Take
these, and make them use mnt_want/drop_write() instead.

We need to pass the filp one layer deeper in XFS, but
somebody _just_ pulled it out in February because nobody
was using it, so I don't feel guilty for adding it back.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/ext2/ioctl.c  |   46 +-
 lxc-dave/fs/ext3/ioctl.c  |  100 +---
 lxc-dave/fs/ext4/ioctl.c  |  105 +-
 lxc-dave/fs/fat/file.c|   10 +--
 lxc-dave/fs/hfsplus/ioctl.c   |   39 +++-
 lxc-dave/fs/jfs/ioctl.c   |   33 ++
 lxc-dave/fs/ocfs2/ioctl.c |   11 +--
 lxc-dave/fs/reiserfs/ioctl.c  |   53 +++--
 lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c |   15 +++-
 lxc-dave/fs/xfs/linux-2.6/xfs_iops.c  |7 --
 lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c   |9 ++
 11 files changed, 272 insertions(+), 156 deletions(-)

diff -puN fs/ext2/ioctl.c~ioctl-mnt-takers fs/ext2/ioctl.c
--- lxc/fs/ext2/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 
-0700
+++ lxc-dave/fs/ext2/ioctl.c2007-07-10 12:46:05.0 -0700
@@ -12,6 +12,7 @@
 #include linux/time.h
 #include linux/sched.h
 #include linux/compat.h
+#include linux/mount.h
 #include linux/smp_lock.h
 #include asm/current.h
 #include asm/uaccess.h
@@ -22,6 +23,7 @@ int ext2_ioctl (struct inode * inode, st
 {
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
+   int ret;
 
ext2_debug (cmd = %u, arg = %lu\n, cmd, arg);
 
@@ -33,14 +35,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
-
-   if ((current-fsuid != inode-i_uid)  !capable(CAP_FOWNER))
-   return -EACCES;
+   ret = mnt_want_write(filp-f_vfsmnt);
+   if (ret)
+   return ret;
+
+   if ((current-fsuid != inode-i_uid)  !capable(CAP_FOWNER)) {
+   ret = -EACCES;
+   goto setflags_out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   ret = -EFAULT;
+   goto setflags_out;
+   }
 
if (!S_ISDIR(inode-i_mode))
flags = ~EXT2_DIRSYNC_FL;
@@ -57,7 +64,8 @@ int ext2_ioctl (struct inode * inode, st
if ((flags ^ oldflags)  (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(inode-i_mutex);
-   return -EPERM;
+   ret = -EPERM;
+   goto setflags_out;
}
}
 
@@ -69,20 +77,26 @@ int ext2_ioctl (struct inode * inode, st
ext2_set_inode_flags(inode);
inode-i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
-   return 0;
+setflags_out:
+   mnt_drop_write(filp-f_vfsmnt);
+   return ret;
}
case EXT2_IOC_GETVERSION:
return put_user(inode-i_generation, (int __user *) arg);
case EXT2_IOC_SETVERSION:
if ((current-fsuid != inode-i_uid)  !capable(CAP_FOWNER))
return -EPERM;
-   if (IS_RDONLY(inode))
-   return -EROFS;
-   if (get_user(inode-i_generation, (int __user *) arg))
-   return -EFAULT; 
-   inode-i_ctime = CURRENT_TIME_SEC;
-   mark_inode_dirty(inode);
-   return 0;
+   ret = mnt_want_write(filp-f_vfsmnt);
+   if (ret)
+   return ret;
+   if (get_user(inode-i_generation, (int __user *) arg)) {
+   ret = -EFAULT;
+   } else {
+   inode-i_ctime = CURRENT_TIME_SEC;
+   mark_inode_dirty(inode);
+   }
+   mnt_drop_write(filp-f_vfsmnt);
+   return ret;
default:
return -ENOTTY;
}
diff -puN fs/ext3/ioctl.c~ioctl-mnt-takers fs/ext3/ioctl.c
--- lxc/fs/ext3/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 
-0700
+++ lxc-dave/fs/ext3/ioctl.c2007-07-10 12:46:05.0 -0700
@@ -12,6 +12,7 @@
 #include linux/capability.h
 #include linux/ext3_fs.h
 #include linux/ext3_jbd.h
+#include linux/mount.h
 #include linux/time.h
 #include linux/compat.h
 #include linux/smp_lock.h
@@ -38,14 +39,19 @@ int ext3_ioctl (struct inode * inode, st
unsigned int oldflags;

[PATCH 03/23] filesystem helpers for custom 'struct file's

2007-07-11 Thread Dave Hansen


Christoph H. says this stands on its own and can go in before the
rest of the r/o bind mount set.  

---

Some filesystems forego the vfs and may_open() and create their
own 'struct file's.

This patch creates a couple of helper functions which can be
used by these filesystems, and will provide a unified place
which the r/o bind mount code may patch.

Also, rename an existing, static-scope init_file() to a less
generic name.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/configfs/dir.c|5 +++--
 lxc-dave/fs/file_table.c  |   34 ++
 lxc-dave/fs/hugetlbfs/inode.c |   22 +-
 lxc-dave/include/linux/file.h |9 +
 lxc-dave/ipc/shm.c|   13 +
 lxc-dave/mm/shmem.c   |7 ++-
 lxc-dave/mm/tiny-shmem.c  |   19 +++
 lxc-dave/net/socket.c |   18 +-
 8 files changed, 78 insertions(+), 49 deletions(-)

diff -puN fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s 
fs/configfs/dir.c
--- lxc/fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s   
2007-07-10 12:46:03.0 -0700
+++ lxc-dave/fs/configfs/dir.c  2007-07-10 12:46:03.0 -0700
@@ -142,7 +142,7 @@ static int init_dir(struct inode * inode
return 0;
 }
 
-static int init_file(struct inode * inode)
+static int configfs_init_file(struct inode * inode)
 {
inode-i_size = PAGE_SIZE;
inode-i_fop = configfs_file_operations;
@@ -283,7 +283,8 @@ static int configfs_attach_attr(struct c
 
dentry-d_fsdata = configfs_get(sd);
sd-s_dentry = dentry;
-   error = configfs_create(dentry, (attr-ca_mode  S_IALLUGO) | S_IFREG, 
init_file);
+   error = configfs_create(dentry, (attr-ca_mode  S_IALLUGO) | S_IFREG,
+   configfs_init_file);
if (error) {
configfs_put(sd);
return error;
diff -puN fs/file_table.c~filesystem-helpers-for-custom-struct-file-s 
fs/file_table.c
--- lxc/fs/file_table.c~filesystem-helpers-for-custom-struct-file-s 
2007-07-10 12:46:03.0 -0700
+++ lxc-dave/fs/file_table.c2007-07-10 12:46:03.0 -0700
@@ -138,6 +138,40 @@ fail:
 
 EXPORT_SYMBOL(get_empty_filp);
 
+struct file *alloc_file(struct vfsmount *mnt, struct dentry *dentry,
+   mode_t mode, const struct file_operations *fop)
+{
+   struct file *file;
+   struct path;
+
+   file = get_empty_filp();
+   if (!file)
+   return NULL;
+
+   init_file(file, mnt, dentry, mode, fop);
+   return file;
+}
+EXPORT_SYMBOL(alloc_file);
+
+/*
+ * Note: This is a crappy interface.  It is here to make
+ * merging with the existing users of get_empty_filp()
+ * who have complex failure logic easier.  All users
+ * of this should be moving to alloc_file().
+ */
+int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
+  mode_t mode, const struct file_operations *fop)
+{
+   int error = 0;
+   file-f_path.dentry = dentry;
+   file-f_path.mnt = mntget(mnt);
+   file-f_mapping = dentry-d_inode-i_mapping;
+   file-f_mode = mode;
+   file-f_op = fop;
+   return error;
+}
+EXPORT_SYMBOL(init_file);
+
 void fastcall fput(struct file *file)
 {
if (atomic_dec_and_test(file-f_count))
diff -puN fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s 
fs/hugetlbfs/inode.c
--- lxc/fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s
2007-07-10 12:46:03.0 -0700
+++ lxc-dave/fs/hugetlbfs/inode.c   2007-07-10 12:46:03.0 -0700
@@ -761,16 +761,11 @@ struct file *hugetlb_file_setup(const ch
if (!dentry)
goto out_shm_unlock;
 
-   error = -ENFILE;
-   file = get_empty_filp();
-   if (!file)
-   goto out_dentry;
-
error = -ENOSPC;
inode = hugetlbfs_get_inode(root-d_sb, current-fsuid,
current-fsgid, S_IFREG | S_IRWXUGO, 0);
if (!inode)
-   goto out_file;
+   goto out_dentry;
 
error = -ENOMEM;
if (hugetlb_reserve_pages(inode, 0, size  HPAGE_SHIFT))
@@ -779,17 +774,18 @@ struct file *hugetlb_file_setup(const ch
d_instantiate(dentry, inode);
inode-i_size = size;
inode-i_nlink = 0;
-   file-f_path.mnt = mntget(hugetlbfs_vfsmount);
-   file-f_path.dentry = dentry;
-   file-f_mapping = inode-i_mapping;
-   file-f_op = hugetlbfs_file_operations;
-   file-f_mode = FMODE_WRITE | FMODE_READ;
+
+   error = -ENFILE;
+   file = alloc_file(hugetlbfs_vfsmount, dentry,
+   FMODE_WRITE | FMODE_READ,
+   hugetlbfs_file_operations);
+   if (!file)
+   goto out_inode;
+
return file;
 
 out_inode:
iput(inode);
-out_file:
-   put_filp(file);
 out_dentry:
dput(dentry);
 out_shm_unlock:
diff -puN

[PATCH 05/23] elevate write count open()'d files

2007-07-11 Thread Dave Hansen


This is the first really tricky patch in the series.  It
elevates the writer count on a mount each time a
non-special file is opened for write.

This is not completely apparent in the patch because the
two if() conditions in may_open() above the
mnt_want_write() call are, combined, equivalent to
special_file().

There is also an elevated count around the vfs_create()
call in open_namei().  The count needs to be kept elevated
all the way into the may_open() call.  Otherwise, when the
write is dropped, a ro-rw transisition could occur.  This
would lead to having rw access on the newly created file,
while the vfsmount is ro.  That is bad.

Some filesystems forego the use of normal vfs calls to create
struct files.  Make sure that these users elevate the mnt writer
count because they will get __fput(), and we need to make
sure they're balanced.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---

 lxc-dave/fs/file_table.c |9 -
 lxc-dave/fs/namei.c  |   20 
 lxc-dave/ipc/mqueue.c|3 +++
 3 files changed, 27 insertions(+), 5 deletions(-)

diff -puN fs/file_table.c~tricky-elevate-write-count-files-are-open-ed 
fs/file_table.c
--- lxc/fs/file_table.c~tricky-elevate-write-count-files-are-open-ed
2007-07-10 12:46:04.0 -0700
+++ lxc-dave/fs/file_table.c2007-07-10 12:46:04.0 -0700
@@ -168,6 +168,10 @@ int init_file(struct file *file, struct 
file-f_mapping = dentry-d_inode-i_mapping;
file-f_mode = mode;
file-f_op = fop;
+   if (mode  FMODE_WRITE) {
+   error = mnt_want_write(mnt);
+   WARN_ON(error);
+   }
return error;
 }
 EXPORT_SYMBOL(init_file);
@@ -205,8 +209,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode-i_mode)  inode-i_cdev != NULL))
cdev_put(inode-i_cdev);
fops_put(file-f_op);
-   if (file-f_mode  FMODE_WRITE)
+   if (file-f_mode  FMODE_WRITE) {
put_write_access(inode);
+   if (!special_file(inode-i_mode))
+   mnt_drop_write(mnt);
+   }
put_pid(file-f_owner.pid);
file_kill(file);
file-f_path.dentry = NULL;
diff -puN fs/namei.c~tricky-elevate-write-count-files-are-open-ed fs/namei.c
--- lxc/fs/namei.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 
12:46:04.0 -0700
+++ lxc-dave/fs/namei.c 2007-07-10 12:46:04.0 -0700
@@ -1562,8 +1562,15 @@ int may_open(struct nameidata *nd, int a
return -EACCES;
 
flag = ~O_TRUNC;
-   } else if (IS_RDONLY(inode)  (flag  FMODE_WRITE))
-   return -EROFS;
+   } else if (flag  FMODE_WRITE) {
+   /*
+* effectively: !special_file()
+* balanced by __fput()
+*/
+   error = mnt_want_write(nd-mnt);
+   if (error)
+   return error;
+   }
 
error = vfs_permission(nd, acc_mode);
if (error)
@@ -1706,14 +1713,17 @@ do_last:
}
 
if (IS_ERR(nd-intent.open.file)) {
-   mutex_unlock(dir-d_inode-i_mutex);
error = PTR_ERR(nd-intent.open.file);
-   goto exit_dput;
+   goto exit_mutex_unlock;
}
 
/* Negative dentry, just create the file */
if (!path.dentry-d_inode) {
+   error = mnt_want_write(nd-mnt);
+   if (error)
+   goto exit_mutex_unlock;
error = open_namei_create(nd, path, flag, mode);
+   mnt_drop_write(nd-mnt);
if (error)
goto exit;
return 0;
@@ -1751,6 +1761,8 @@ ok:
goto exit;
return 0;
 
+exit_mutex_unlock:
+   mutex_unlock(dir-d_inode-i_mutex);
 exit_dput:
dput_path(path, nd);
 exit:
diff -puN ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed ipc/mqueue.c
--- lxc/ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed   
2007-07-10 12:46:04.0 -0700
+++ lxc-dave/ipc/mqueue.c   2007-07-10 12:46:04.0 -0700
@@ -686,6 +686,9 @@ asmlinkage long sys_mq_open(const char _
goto out;
filp = do_open(dentry, oflag);
} else {
+   error = mnt_want_write(mqueue_mnt);
+   if (error)
+   goto out;
filp = do_create(mqueue_mnt-mnt_root, dentry,
oflag, mode, u_attr);
}
_
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  From: Dmitry Monakhov [EMAIL PROTECTED]
  Subject: ext4: extent macros cleanup
  
  - Replace math equation to it's macro equivalent
 
 s/it's/its/;)
Okay.

 
  - make ext4_ext_grow_indepth() indexes/leaf correct
 
 hm, what was wrong with it?
 
Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing
procedure. It allocates a new index block, moves the top-level data of
the tree(root or leaf blocks in i_data) into the new block, initializes
the new root, creating index that points to the just created index block

The original top-level data (in i_data) could be extent tree root (index
block) or extents (leaf block). The current code (without the patch)
treats the top-level data always be the leaf block, which is incorrect.


assumes when the tree is growing the extent structure pass in is always
  @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
  struct inode *inode,
  curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
  curp-p_hdr-eh_entries = cpu_to_le16(1);
  curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
  -   /* FIXME: it works, but actually path[0] can be index */
  -   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
  +   
  +   if (path[0].p_hdr-eh_depth)
  + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
  +   else
  + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 
 whitespace bustage there.
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))

2007-07-11 Thread David Chinner

On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote:
 Andrew Morton wrote:
  The fault-vs-invalidate race fix.  I have belatedly learned that these 
  need
  more work, so their state is uncertain.
 
 The more work may turn out being too much for you (although it is nothing
 exactly tricky that would introduce subtle bugs, it is a fair amont of 
 churn).

OK, so does that mean we can finally get the block_page_mkwrite
patches merged?

i.e.:

http://marc.info/?l=linux-kernelm=117426058311032w=2
http://marc.info/?l=linux-kernelm=11742607036w=2

I've got up-to-date versions of them ready to go and they've been
consistently tested thanks to the XFSQA test I wrote for the bug
that it fixes. I've been holding them out-of-tree for months now
because -fault was supposed to supercede this interface.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))

2007-07-11 Thread Nick Piggin


David Chinner wrote:

On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote:


Andrew Morton wrote:

The fault-vs-invalidate race fix.  I have belatedly learned that these 
need

more work, so their state is uncertain.


The more work may turn out being too much for you (although it is nothing
exactly tricky that would introduce subtle bugs, it is a fair amont of 
churn).



OK, so does that mean we can finally get the block_page_mkwrite
patches merged?

i.e.:

http://marc.info/?l=linux-kernelm=117426058311032w=2
http://marc.info/?l=linux-kernelm=11742607036w=2

I've got up-to-date versions of them ready to go and they've been
consistently tested thanks to the XFSQA test I wrote for the bug
that it fixes. I've been holding them out-of-tree for months now
because -fault was supposed to supercede this interface.


Yeah, as I've said, don't hold them back because of me. They are
relatively simple enough that I don't see why they couldn't be
merged in this window.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1 of 2] block_page_mkwrite V2

2007-07-11 Thread David Chinner

Generic page_mkwrite functionality.

Filesystems that make use of the VM -page_mkwrite() callout will generally use
the same core code to implement it. There are several tricky truncate-related
issues that we need to deal with here as we cannot take the i_mutex as we
normally would for these paths.  These issues are not documented anywhere yet
so block_page_mkwrite() seems like the best place to start.

Version 2:

- read inode size only once
- more comments explaining implementation restrictions

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 fs/buffer.c |   47 
 include/linux/buffer_head.h |2 +
 2 files changed, 49 insertions(+)

Index: 2.6.x-xfs-new/fs/buffer.c
===
--- 2.6.x-xfs-new.orig/fs/buffer.c  2007-03-17 10:55:32.291414968 +1100
+++ 2.6.x-xfs-new/fs/buffer.c   2007-03-19 08:13:54.519909087 +1100
@@ -2194,6 +2194,52 @@ int generic_commit_write(struct file *fi
return 0;
 }
 
+/*
+ * block_page_mkwrite() is not allowed to change the file size as it gets
+ * called from a page fault handler when a page is first dirtied. Hence we must
+ * be careful to check for EOF conditions here. We set the page up correctly
+ * for a written page which means we get ENOSPC checking when writing into
+ * holes and correct delalloc and unwritten extent mapping on filesystems that
+ * support these features.
+ *
+ * We are not allowed to take the i_mutex here so we have to play games to
+ * protect against truncate races as the page could now be beyond EOF.  Because
+ * vmtruncate() writes the inode size before removing pages, once we have the
+ * page lock we can determine safely if the page is beyond EOF. If it is not
+ * beyond EOF, then the page is guaranteed safe against truncation until we
+ * unlock the page.
+ */
+int
+block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
+  get_block_t get_block)
+{
+   struct inode *inode = vma-vm_file-f_path.dentry-d_inode;
+   unsigned long end;
+   loff_t size;
+   int ret = -EINVAL;
+
+   lock_page(page);
+   size = i_size_read(inode);
+   if ((page-mapping != inode-i_mapping) ||
+   ((page-index  PAGE_CACHE_SHIFT)  size)) {
+   /* page got truncated out from underneath us */
+   goto out_unlock;
+   }
+
+   /* page is wholly or partially inside EOF */
+   if (((page-index + 1)  PAGE_CACHE_SHIFT)  size)
+   end = size  ~PAGE_CACHE_MASK;
+   else
+   end = PAGE_CACHE_SIZE;
+
+   ret = block_prepare_write(page, 0, end, get_block);
+   if (!ret)
+   ret = block_commit_write(page, 0, end);
+
+out_unlock:
+   unlock_page(page);
+   return ret;
+}
 
 /*
  * nobh_prepare_write()'s prereads are special: the buffer_heads are freed
@@ -2997,6 +3043,7 @@ EXPORT_SYMBOL(__brelse);
 EXPORT_SYMBOL(__wait_on_buffer);
 EXPORT_SYMBOL(block_commit_write);
 EXPORT_SYMBOL(block_prepare_write);
+EXPORT_SYMBOL(block_page_mkwrite);
 EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
Index: 2.6.x-xfs-new/include/linux/buffer_head.h
===
--- 2.6.x-xfs-new.orig/include/linux/buffer_head.h  2007-03-17 
10:55:32.135435539 +1100
+++ 2.6.x-xfs-new/include/linux/buffer_head.h   2007-03-17 10:55:32.567378573 
+1100
@@ -206,6 +206,8 @@ int cont_prepare_write(struct page*, uns
 int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
+   get_block_t get_block);
 void block_sync_page(struct page *);
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2 of 2] Make XFS use block_page_mkwrite

2007-07-11 Thread David Chinner

Implement -page_mkwrite in XFS.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 fs/xfs/linux-2.6/xfs_file.c |   16 
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c  2007-02-07 
23:00:10.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c   2007-02-07 23:15:20.170880823 
+1100
@@ -446,6 +446,20 @@ xfs_file_open_exec(
 }
 #endif /* HAVE_FOP_OPEN_EXEC */
 
+/*
+ * mmap()d file has taken write protection fault and is being made
+ * writable. We can set the page state up correctly for a writable
+ * page, which means we can do correct delalloc accounting (ENOSPC
+ * checking!) and unwritten extent mapping.
+ */
+STATIC int
+xfs_vm_page_mkwrite(
+   struct vm_area_struct   *vma,
+   struct page *page)
+{
+   return block_page_mkwrite(vma, page, xfs_get_blocks);
+}
+
 const struct file_operations xfs_file_operations = {
.llseek = generic_file_llseek,
.read   = do_sync_read,
@@ -503,12 +517,14 @@ const struct file_operations xfs_dir_fil
 static struct vm_operations_struct xfs_file_vm_ops = {
.nopage = filemap_nopage,
.populate   = filemap_populate,
+   .page_mkwrite   = xfs_vm_page_mkwrite,
 };
 
 #ifdef HAVE_DMAPI
 static struct vm_operations_struct xfs_dmapi_file_vm_ops = {
.nopage = xfs_vm_nopage,
.populate   = filemap_populate,
+   .page_mkwrite   = xfs_vm_page_mkwrite,
 #ifdef HAVE_VMOP_MPROTECT
.mprotect   = xfs_vm_mprotect,
 #endif
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  16:04 -0400, J. Bruce Fields wrote:
 A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
 That's not a problem in itself--the problem would only arise if two
 subsequent client queries of the change attribute happened a multiple of
 2^32 i_version increments apart.
 
 This is more likely than the previous scenario, but still very unlikely.
 I would have guessed that even in situations with a very high rate of
 updates and a low rate of client revalidations, the chance of two
 revalidations happening exactly 2^32 updates apart would still be no
 more than 1 in 2^32.  (Could odd characteristics of the workloads (like
 updates that tend to happen in power-of-2 groups?) make it any more
 likely?)
 
 I'd be happier if ext4 at least allowed the possibility of 64 bits in
 the future.  And there's always the chance someone would find a use for
 an i_version that was nondecreasing, even if nfs didn't care.

This would indeed be the case for ext3 filesystems updated to ext4.
They will only be able to store the low 32 bits of the version, which
is itself normally enough for NFSv4 because it only uses the inequality
check.  Having the full 64 bits available eliminates the risk of
collisions, and given that the spec mandates a 64-bit version I'm sure
someone will take full advantage of it in NFS at some point.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH, RESEND] Teach do_mpage_readpage() about unwritten buffers

2007-07-11 Thread David Chinner

Teach do_mpage_readpage() about unwritten extents so we can
always map them in get_blocks rather than they are are holes on
read. Allows setup_swap_extents() to use preallocated files on XFS
filesystems for swap files without ever needing to convert them.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 fs/mpage.c  |5 +++--
 fs/xfs/linux-2.6/xfs_aops.c |   13 +++--
 2 files changed, 6 insertions(+), 12 deletions(-)

Index: 2.6.x-xfs-new/fs/mpage.c
===
--- 2.6.x-xfs-new.orig/fs/mpage.c   2007-05-29 16:17:59.0 +1000
+++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000
@@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc
 * Map blocks using the result from the previous get_blocks call first.
 */
nblocks = map_bh-b_size  blkbits;
-   if (buffer_mapped(map_bh)  block_in_file  *first_logical_block 
+   if (buffer_mapped(map_bh)  !buffer_unwritten(map_bh) 
+   block_in_file  *first_logical_block 
block_in_file  (*first_logical_block + nblocks)) {
unsigned map_offset = block_in_file - *first_logical_block;
unsigned last = nblocks - map_offset;
@@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc
*first_logical_block = block_in_file;
}
 
-   if (!buffer_mapped(map_bh)) {
+   if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) {
fully_mapped = 0;
if (first_hole == blocks_per_page)
first_hole = page_block;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c  2007-06-05 
22:14:39.0 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c   2007-06-27 22:39:29.545636749 
+1000
@@ -1340,16 +1340,9 @@ __xfs_get_blocks(
return 0;
 
if (iomap.iomap_bn != IOMAP_DADDR_NULL) {
-   /*
-* For unwritten extents do not report a disk address on
-* the read case (treat as if we're reading into a hole).
-*/
-   if (create || !(iomap.iomap_flags  IOMAP_UNWRITTEN)) {
-   xfs_map_buffer(bh_result, iomap, offset,
-  inode-i_blkbits);
-   }
-   if (create  (iomap.iomap_flags  IOMAP_UNWRITTEN)) {
-   if (direct)
+   xfs_map_buffer(bh_result, iomap, offset, inode-i_blkbits);
+   if (iomap.iomap_flags  IOMAP_UNWRITTEN) {
+   if (create  direct)
bh_result-b_private = inode;
set_buffer_unwritten(bh_result);
}
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

87 matches

Mail list logo