Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote: On Tue, 10 Jul 2007 16:30:25 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Sun, 01 Jul 2007 03:36:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Jun 07, 2007 23:45 -0500, Jose R. Santos wrote: The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names with more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. This patch moves the file to /proc/jbd2-degug. The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require some minor alterations to the jbd-stats patch. I don't think we really want to be adding top-level files in /proc. What about using the debugfs filesystem (not to be confused with the e2fsprogs 'debugfs' command)? How about this then? Moved the file to use debugfs as well as having the nice effect of removing more lines than what it adds. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Please clean up the changelog. The changelog should include information about the location and the content of these debugfs files. it should provide any instructions which users will need to be able to create and use those files. Will fix. Alternatively (and preferably) do this via an update to Documentation/filesystems/ext4.txt. Seems like I also need to update the doc on Kconfig as well. Do you prefer this in separate patches? (current patch, kconfig patch, ext4 doc update patch? fs/jbd2/journal.c| 6220 +42 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 2 files changed, 21 insertions(+), 43 deletions(-) Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm. Apart from the lack of testing and review which this causes, it means I can't just do `pushpatch name-of-this-patch' and look at it in tkdiff. So I squint at the diff, but that's harder when the diff wasn't prepared with `diff -p'. Oh well. Will fix. Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 -0700 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1954,60 +1955,37 @@ * /proc tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u16 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) Has this been compile-tested with CONFIG_DEBUGFS=n? I think I did, but honestly don't remember. Will check with the new patch. :) Yes, I remember I did, that discovered some inconsistency in ext4 code, which has already been fixed. -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) +#define jbd2_create_debugfs_entry() do {} while (0) +#define jbd2_remove_debugfs_entry() do {} while (0) I suggest that these be converted to (preferable) inline functions while you're there. OK. #endif @@ -2067,7 +2045,7 @@ ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2056,7 @@ if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING JBD2? -extern int jbd2_journal_enable_debug; +extern u16 jbd2_journal_enable_debug; Why was this made 16-bit? To save 2 bytes? Could have saved 3 if we're going to do that. OK. Shoudln't all this debug info be a per-superblock thing rather than kernel-wide? I don't think it is worth pursuing this feature since this seems to have been broken for a while now (its been there since the first git revission in ext3) and nobody has noticed it until now. It could be address on a later patch though, since the initial purpose of the patch was to fix the
Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote: On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote: Sorry about this. I was using version 0.04. The latest one I can find for now is 0.05(searching lkml), but it didn't catch this codling style bug either. I appreciate if anyone can point me the version 0.07, thanks It's now in-tree in scripts/checkpatch.pl (linus' tree is still at 0.06 though, I suspect Andrew has something newer in -mm) Thanks, Andy has uploaded the 0.07 version at http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07 Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Journal checksum feature has been added to detect corruption of journal. That was brief. No description of what it does, how it does it, why it does it, how one operates it, why (or why not) one would choose to enable it nor of the costs of enabling it. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c --- linux024/fs/ext4/super.c 2007-06-25 16:19:24.0 -0500 +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500 @@ -721,6 +721,7 @@ enum { Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, + Opt_journal_checksum, Opt_journal_async_commit, A new user-visible feature should be accompanied by an update to Documentation/filesystems/ext4.txt? Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, @@ -760,6 +761,8 @@ static match_table_t tokens = { {Opt_journal_update, journal=update}, {Opt_journal_inum, journal=%u}, {Opt_journal_dev, journal_dev=%u}, + {Opt_journal_checksum, journal_checksum}, + {Opt_journal_async_commit, journal_async_commit}, What's journal_async_commit? {Opt_abort, abort}, {Opt_data_journal, data=journal}, {Opt_data_ordered, data=ordered}, @@ -948,6 +951,13 @@ static int parse_options (char *options, return 0; *journal_devnum = option; break; + case Opt_journal_checksum: + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; + case Opt_journal_async_commit: + set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT); + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; case Opt_noload: set_opt (sbi-s_mount_opt, NOLOAD); break; @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super goto failed_mount4; } + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); + jbd2_journal_clear_features(sbi-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else { + jbd2_journal_clear_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } Some discussion of the forward- and backward- compatibility design would be appropriate. Also a description of whether and how fsck can be used to fix up these feature flags. /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500 +++ linux/fs/jbd2/commit.c2007-06-26 08:40:03.0 -0500 @@ -21,6 +21,7 @@ #include linux/mm.h #include linux/pagemap.h #include linux/jiffies.h +#include linux/crc32.h I think we just broke CONFIG_EXT4=y, CONFIG_CRC32=n builds. /* * Default IO end handler for temporary BJ_IO buffer_heads. @@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour return 1; } -/* Done it all: now write the commit record. We should have +/* + * Done it all: now submit the commit record. We should have * cleaned up our previous buffers by now, so if we are in abort * mode we can now just skip the rest of the journal write * entirely. * * Returns 1 if the journal needs to be aborted or 0 on success */ -static int journal_write_commit_record(journal_t *journal, - transaction_t *commit_transaction) +static int journal_submit_commit_record(journal_t *journal, + transaction_t *commit_transaction, + struct buffer_head **cbh, + __u32 crc32_sum) { struct journal_head *descriptor; struct buffer_head *bh; @@ -117,21 +121,36 @@ static int journal_write_commit_record(j bh
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes Date: Sun, 01 Jul 2007 03:38:51 -0400 Sender: [EMAIL PROTECTED] Organization: IBM Linux Technology Center X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent compilation fixes I hope this patch series will hit git with nice titles like ext4: extent compilation fixes and not funny ones like Morecleanups:ext4_extent_compilation_fixes. Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions. conversions. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] Acked-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 12fe3d7..1fd00ac 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc ext_debug(binsearch for %d(idx): , block); l = EXT_FIRST_INDEX(eh) + 1; - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_INDEX(eh); while (l = r) { m = l + (r - l) / 2; if (block le32_to_cpu(m-ei_block)) @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) ext_debug(binsearch for %d: , block); l = EXT_FIRST_EXTENT(eh) + 1; - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_EXTENT(eh); while (l = r) { m = l + (r - l) / 2; @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Wed, 11 Jul 2007 00:38:09 -0500 Jose R. Santos [EMAIL PROTECTED] wrote: Alternatively (and preferably) do this via an update to Documentation/filesystems/ext4.txt. Seems like I also need to update the doc on Kconfig as well. Do you prefer this in separate patches? (current patch, kconfig patch, ext4 doc update patch? All these changes are logically connected (aren't they?). A single patch is fine. Shoudln't all this debug info be a per-superblock thing rather than kernel-wide? I don't think it is worth pursuing this feature since this seems to have been broken for a while now (its been there since the first git revission in ext3) and nobody has noticed it until now. It could be address on a later patch though, since the initial purpose of the patch was to fix the broken JBD2_DEBUG option. Of course, this may not be clearly express in the changelog. :) I don't think that making it all per-superblock is worth the effort - it's a developer-only thing and developer will have the knowledge to test ext4 on an otherwise-ext3 setup if they're really fussed about the accuracy. So yes, a bare make-it-work patch sounds appropriate. Or remove it, but hey, it might be useful. The timestamping stuff certainly looks useful. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. hm, I hadn't thought about it in any detail. Maybe something like --- a/fs/inode.c~a +++ a/fs/inode.c @@ -1238,6 +1238,11 @@ void file_update_time(struct file *file) sync_it = 1; } + if (IS_I_VERSION_64(inode)) { + inode_inc_iversion(inode); /* Takes i_lock on 32-bit */ + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } _ ? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote: On Wed, 11 Jul 2007 01:50:00 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, return sys_fadvise64_64(fd, ((u64)offset_hi 32) | offset_lo, len, advice); } + +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo, + unsigned offset_hi, unsigned len_lo, + unsigned len_hi) Please call this compat_sys_fallocate in line with the powerpc version - it gives us a hint that maybe we should think about how to consolidate them. I know other stuff in that file is called sys32_ ... but it is time for a change :-) I think this can be handled as a separate patch once this patchset is in mainline. Since, anyhow we will need to do this for other sys32_ calls which are already there... -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 1/5] revoke: special mmap handling
From: Pekka Enberg [EMAIL PROTECTED] This adds special handling for revoked memory mappings. We want to raise SIGBUS when accessing revoked mappings and return ENODEV when trying to remap with mmap(2). Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- include/linux/mm.h |1 + mm/memory.c|3 +++ mm/mmap.c | 12 3 files changed, 12 insertions(+), 4 deletions(-) Index: 2.6/include/linux/mm.h === --- 2.6.orig/include/linux/mm.h 2007-07-06 10:19:51.0 +0300 +++ 2.6/include/linux/mm.h 2007-07-11 11:48:28.0 +0300 @@ -169,6 +169,7 @@ #define VM_NONLINEAR0x0080 /* Is no #define VM_MAPPED_COPY 0x0100 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x0200 /* The vma has had vm_insert_page() done on it */ #define VM_ALWAYSDUMP 0x0400 /* Always include in core dumps */ +#define VM_REVOKED 0x0800 /* Mapping has been revoked */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS Index: 2.6/mm/memory.c === --- 2.6.orig/mm/memory.c2007-07-06 10:19:53.0 +0300 +++ 2.6/mm/memory.c 2007-07-11 11:48:28.0 +0300 @@ -2597,6 +2597,9 @@ int __handle_mm_fault(struct mm_struct * if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, write_access); + if (unlikely(vma-vm_flags VM_REVOKED)) + return VM_FAULT_SIGBUS; + pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); if (!pud) Index: 2.6/mm/mmap.c === --- 2.6.orig/mm/mmap.c 2007-07-06 10:19:53.0 +0300 +++ 2.6/mm/mmap.c 2007-07-11 11:48:28.0 +0300 @@ -1031,10 +1031,14 @@ accountable = 0; error = -ENOMEM; munmap_back: vma = find_vma_prepare(mm, addr, prev, rb_link, rb_parent); - if (vma vma-vm_start addr + len) { - if (do_munmap(mm, addr, len)) - return -ENOMEM; - goto munmap_back; + if (vma) { + if (unlikely(vma-vm_flags VM_REVOKED)) + return -ENODEV; + if (vma-vm_start addr + len) { + if (do_munmap(mm, addr, len)) + return -ENOMEM; + goto munmap_back; + } } /* Check against address space limit. */ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 2/5] revoke: core code
From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). 2. Take down shared memory mappings of the inode and close all file pointers. The file descriptors and memory mapping ranges are preserved until the owning task does close(2) and munmap(2), respectively. You use revoke() (with chown, for example) to gain exclusive access to an inode that might be in use by other processes. This means that we must mke sure that: - operations on opened file descriptors pointing to that inode fail - there are no shared mappings visible to other processes - in-progress system calls are either completed (writes) or abort (reads) After revoke() system call returns, you are guaranteed to have revoked access to an inode for any processes that had access to it when you started the operation. The caller is responsible for blocking any future open(2) calls that might occur while revoke() takes care of fork(2) and dup(2) during the operation. Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- fs/Makefile |1 fs/revoke.c | 777 +++ fs/revoked_inode.c | 417 +++ include/linux/fs.h |8 include/linux/magic.h|1 include/linux/mm.h |1 include/linux/revoked_fs_i.h | 18 include/linux/syscalls.h |3 mm/mmap.c| 11 9 files changed, 1237 insertions(+) Index: 2.6/fs/Makefile === --- 2.6.orig/fs/Makefile2007-05-21 15:38:14.0 +0300 +++ 2.6/fs/Makefile 2007-07-11 11:48:35.0 +0300 @@ -19,6 +19,7 @@ else obj-y += no-block.o endif +obj-$(CONFIG_MMU) += revoke.o revoked_inode.o obj-$(CONFIG_INOTIFY) += inotify.o obj-$(CONFIG_INOTIFY_USER) += inotify_user.o obj-$(CONFIG_EPOLL)+= eventpoll.o Index: 2.6/fs/revoke.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ 2.6/fs/revoke.c 2007-07-11 11:48:35.0 +0300 @@ -0,0 +1,777 @@ +/* + * fs/revoke.c - Invalidate all current open file descriptors of an inode. + * + * Copyright (C) 2006-2007 Pekka Enberg + * + * This file is released under the GPLv2. + */ + +#include linux/file.h +#include linux/fs.h +#include linux/namei.h +#include linux/magic.h +#include linux/mm.h +#include linux/mman.h +#include linux/module.h +#include linux/mount.h +#include linux/sched.h +#include linux/revoked_fs_i.h +#include linux/syscalls.h + +/** + * fileset - an array of file pointers. + * @files:the array of file pointers + * @nr: number of elements in the array + * @end: index to next unused file pointer + */ +struct fileset { + struct file **files; + unsigned long nr; + unsigned long end; +}; + +/** + * revoke_details - details of the revoke operation + * @inode:invalidate open file descriptors of this inode + * @fset: set of files that point to a revoked inode + * @restore_start:index to the first file pointer that is currently in + *use by a file descriptor but the real file has not + *been revoked + */ +struct revoke_details { + struct fileset *fset; + unsigned long restore_start; +}; + +static struct kmem_cache *revokefs_inode_cache; + +static inline bool fset_is_full(struct fileset *set) +{ + return set-nr == set-end; +} + +static inline struct file *fset_get_filp(struct fileset *set) +{ + return set-files[set-end++]; +} + +static struct fileset *alloc_fset(unsigned long size) +{ + struct fileset *fset; + + fset = kzalloc(sizeof *fset, GFP_KERNEL); + if (!fset) + return NULL; + + fset-files = kcalloc(size, sizeof(struct file *), GFP_KERNEL); + if (!fset-files) { + kfree(fset); + return NULL; + } + fset-nr = size; + return fset; +} + +static void free_fset(struct fileset *fset) +{ + int i; + + for (i = fset-end; i fset-nr; i++) + fput(fset-files[i]); + + kfree(fset-files); + kfree(fset); +} + +/* + * Revoked file descriptors point to inodes in the revokefs filesystem. + */ +static struct vfsmount *revokefs_mnt; + +static struct file *get_revoked_file(void) +{ + struct dentry
[RFC/PATCH 3/5] revoke: wire up i386 system calls
From: Pekka Enberg [EMAIL PROTECTED] Make revokeat and frevoke system calls available to user-space on i386. Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- arch/i386/kernel/syscall_table.S |2 ++ arch/x86_64/ia32/ia32entry.S |2 ++ include/asm-i386/unistd.h|4 +++- 3 files changed, 7 insertions(+), 1 deletion(-) Index: 2.6/arch/i386/kernel/syscall_table.S === --- 2.6.orig/arch/i386/kernel/syscall_table.S 2007-05-21 15:38:00.0 +0300 +++ 2.6/arch/i386/kernel/syscall_table.S2007-07-11 11:48:39.0 +0300 @@ -323,3 +323,5 @@ .long sys_utimensat /* 320 */ .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_revokeat + .long sys_frevoke /* 325 */ Index: 2.6/include/asm-i386/unistd.h === --- 2.6.orig/include/asm-i386/unistd.h 2007-05-21 15:38:15.0 +0300 +++ 2.6/include/asm-i386/unistd.h 2007-07-11 11:48:39.0 +0300 @@ -329,10 +329,12 @@ #define __NR_utimensat320 #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_revokeat 324 +#define __NR_frevoke 325 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 326 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: 2.6/arch/x86_64/ia32/ia32entry.S === --- 2.6.orig/arch/x86_64/ia32/ia32entry.S 2007-07-06 10:19:44.0 +0300 +++ 2.6/arch/x86_64/ia32/ia32entry.S2007-07-11 11:48:40.0 +0300 @@ -719,4 +719,6 @@ .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_revokeat + .quad sys_frevoke /* 325 */ ia32_syscall_end: - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 4/5] revoke: support for ext2 and ext3
From: Pekka Enberg [EMAIL PROTECTED] Add revoke support to ext2, ext3 and ext4 by wiring f_ops-revoke with generic_file_revoke. Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- fs/ext2/file.c |1 + fs/ext3/file.c |1 + fs/ext4/file.c |1 + 3 files changed, 3 insertions(+) Index: 2.6/fs/ext2/file.c === --- 2.6.orig/fs/ext2/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext2/file.c 2007-07-11 11:48:43.0 +0300 @@ -56,6 +56,7 @@ const struct file_operations ext2_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; #ifdef CONFIG_EXT2_FS_XIP Index: 2.6/fs/ext3/file.c === --- 2.6.orig/fs/ext3/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext3/file.c 2007-07-11 11:48:43.0 +0300 @@ -123,6 +123,7 @@ const struct file_operations ext3_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; const struct inode_operations ext3_file_inode_operations = { Index: 2.6/fs/ext4/file.c === --- 2.6.orig/fs/ext4/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext4/file.c 2007-07-11 11:48:43.0 +0300 @@ -123,6 +123,7 @@ const struct file_operations ext4_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; const struct inode_operations ext4_file_inode_operations = { - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 5/5] revoke: add documentation
From: Pekka Enberg [EMAIL PROTECTED] This documents revoke file operation in Documentation/filesystems/vfs.txt. Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- Documentation/filesystems/vfs.txt |5 + 1 file changed, 5 insertions(+) Index: 2.6/Documentation/filesystems/vfs.txt === --- 2.6.orig/Documentation/filesystems/vfs.txt 2007-05-21 15:37:59.0 +0300 +++ 2.6/Documentation/filesystems/vfs.txt 2007-07-11 11:48:46.0 +0300 @@ -732,6 +732,7 @@ struct file_operations { int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*revoke)(struct file *); }; Again, all methods are called without any locks being held, unless @@ -805,6 +806,10 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + revoke: called by revokeat(2) and frevoke(2) system calls to revoke access + to an open file. This method must ensure that all currently blocked + writes are flushed and reads will fail. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote: Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Agreed, looks like we should stay with the user has to clean up behaviour. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote: We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. You're not :) You're using an O_INVIBLE equivalent (as described below), which would be a useful thing to have at the VFS level, but adding hacks to some system calls only wouldn't help any HSM system. It's just useless API clutter. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] manpage for fallocate
On Wed, Jul 11, 2007 at 12:37:01AM +0300, Heikki Orsila wrote: On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote: .BI int syscall(int, int fd, int mode, loff_t offset, loff_t len); Correction: int syscall(int fd, int mode, ...), Here, we have syscall() with first argument being the system call number - so what you suggested is not correct. But, yes, the synopsis should change at some time. Maybe to something like: #include fcntl.h long fallocate(int fd, int mode, loff_t offset, loff_t len); .TP .B ENOSPC There is not enough space left on the device containing the file referred to by .IR fd. .TP .B ESPIPE .I fd refers to a pipe of file descriptor. .B ENOSYS The filesystem underlying the file descriptor does not support this operation. EINTR? Will add following errors: EINTR A signal was caught during execution EIO An I/O error occurred while reading from or writing to a file system. EOPNOTSUPPThe mode is not supported on the file descriptor. and will update following : EINVALoffset was less than 0, or len was less than or equal to 0. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). How does that deal with the following: task A gets its descriptor table cleansed task B sends a descriptor to task A via SCM_RIGHTS datagram task B gets its descriptor table cleansed task A receives the datagram and gets the descriptor installed task A does any kind of IO on that descriptor -f_mapping gets replaced in the most inconvenient time. Come to think of that, what do you do if I create a socketpair, stuff the descriptor into SCM_RIGHTS datagram and send it to myself? Then receive it a day after you've called revoke() and voila - I've got an almost undamaged struct file back... At the very least, it allows me to fchmod(). Or fchdir() if that had been a directory... BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote: On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). How does that deal with the following: task A gets its descriptor table cleansed task B sends a descriptor to task A via SCM_RIGHTS datagram task B gets its descriptor table cleansed task A receives the datagram and gets the descriptor installed task A does any kind of IO on that descriptor -f_mapping gets replaced in the most inconvenient time. Come to think of that, what do you do if I create a socketpair, stuff the descriptor into SCM_RIGHTS datagram and send it to myself? Then receive it a day after you've called revoke() and voila - I've got an almost undamaged struct file back... At the very least, it allows me to fchmod(). Or fchdir() if that had been a directory... BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
Hi Al, On Wed, 11 Jul 2007, Al Viro wrote: Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. Uhm, nice. So, revoke() needs a proper inode - struct files mapping somewhere. Can we add a list of files to struct inode? Are there other cases where a file can point to an inode but the file is not attached to any file descriptor? Pekka - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, 11 Jul 2007, Al Viro wrote: BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... For writes, we (1) never start any new operations after we've cleaned up the file descriptor tables so (2) after we're done with do_fsync() we never touch -f_mapping again. But for reads, I think there's a problem if we're in do_generic_mapping_read() doing invalidate_inode_pages2() is not enough because we're hanging on to the real mapping. Hmm. Pekka - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote: Hi Al, On Wed, 11 Jul 2007, Al Viro wrote: Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. Uhm, nice. So, revoke() needs a proper inode - struct files mapping somewhere. Can we add a list of files to struct inode? Are there other cases where a file can point to an inode but the file is not attached to any file descriptor? Umm... Any number, really - it might be in the middle of syscall while another task sharing descriptor table has closed the descriptor. Then there's quota, then there's process accounting, then there's execve() in progress, then there's knfsd working with that struct file, etc. The fundamental issue here is that even if you do find struct file, you can't blindly rip its -f_mapping since it can be in the middle of -read(), -write(), pageout, etc. And even if you do manage that, you still have the ability to do fchmod() later. I don't see how the ability to find all instances in SCM_RIGHTS datagrams (for example) will help you with the race I've described first. Original state: task B has the only reference to file. revoke() is called, passes task A. B sends datagram to A and closes file. A receives datagram. Now the only reference is in A's table and you've already passed that. So you can't avoid processes keeping pointers to struct file. If you could find all struct file over given inode (which, I suspect, will lead to interesting locking), you could call something on that struct file, but you'd have zero exclusion with processes calling methods on it. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote: On Wed, 11 Jul 2007, Al Viro wrote: BTW, read() or write() in progress might get rather unhappy if your live replacement of -f_mapping races with them... For writes, we (1) never start any new operations after we've cleaned up the file descriptor tables so (2) after we're done with do_fsync() we never touch -f_mapping again. Er, no. do_fsync() won't hit the sys_write() that is yet to enter -write(). And you can't get rid of new callers _anyway_ (see previous mail). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] nfs: disable leases over NFS
On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote: OK, after looking at this a little more, I'm less happy about the idea of erroring out by default: - There are a ton of filesystems that probably should allow leases, and only a few (network filesystems) that shouldn't, so leaving leases on by default seems simpler. But it gets you possible wrong behaviour by default. I'm not a big fan of non-trivial default methods as you see :) - We already fall back on the local method by default in the case of locks, and I don't see a reason to treat leases differently. - The patch to add .setlease = setlease, to all the file_operations is going to be a big patch that changes behavior in a way that might be easy to miss (because it changes behavior exactly on those filesystems it *doesn't* touch.) I think it'll be easier to get better review on the patch that adds a method just to those filesystems that we're disabling leases for. Anyway, feel free to go ahead with the simpler version for now, I'll do the switchover for locks and leases when I get some time. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
On Thu, Jul 05, 2007 at 03:43:32PM -0700, Dave Hansen wrote: This takes care of all of the direct callers of vfs_mknod(). Since a few of these cases also handle normal file creation as well, this also covers some calls to vfs_create(). So that we don't have to make three mnt_want/drop_write() calls inside of the switch statement, we move some of its logic outside of the switch. Looks good to me. One thing I noticed: do we actually _need_ the first S_ISDIR() check at the top of the function? The EPERM is required by Posix. + if (!S_ISREG(mode) !S_ISCHR(mode) !S_ISBLK(mode) + !S_ISFIFO(mode) !S_ISSOCK(mode) (mode != 0)) { no need for braces around the mode != 0 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
On Sat, Jul 07, 2007 at 08:25:31PM +0200, Jan Engelhardt wrote: Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the others? Posix. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote: Is that conjecture, or do you have evidence to that effect? Most users of this file are using it via the glibc interfaces, and there probably aren't all that many users of it in the first place. I have written parsers for personal projects that might not have been happy to deal with additional fields myself for example.. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
Hi, On Wed, 11 Jul 2007, Al Viro wrote: The fundamental issue here is that even if you do find struct file, you can't blindly rip its -f_mapping since it can be in the middle of -read(), -write(), pageout, etc. And even if you do manage that, you still have the ability to do fchmod() later. Then we would need to change the VFS and relevant parts so that we can take down -f_mapping. I don't see how we could do that without affecting current hotpaths. Hmm. I suppose what we really need to do is cannibalize the actual inode (remove from inode cache, detach from dentry and take down the mapping) so that we don't have to touch existing struct file pointers at all. Pekka - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Jul 10, 2007 22:00 -0400, Mingming Cao wrote: On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is a spinoff of the old nanosecond patches. I don't know what the old nanosecond patches are. A link to a suitable changlog for those patches would do in a pinch. Preferable would be to write a proper changelog for this patch. I found the original patch http://marc.info/?l=linux-ext4m=115091699809181w=2 Andreas or Kalpak, is changelog from the original patch is accurate to apply here? Mostly, yes, but the name of the feature flag has changed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Jul 10, 2007 16:30 -0700, Andrew Morton wrote: +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field) \ + ((offsetof(typeof(*ext4_inode), field) +\ + sizeof((ext4_inode)-field)) \ + = (EXT4_GOOD_OLD_INODE_SIZE + \ + (einode)-i_extra_isize)) \ Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers under what circumstances something will not fit in an inode and what the consequences of this are. /* Extended fields will fit into an inode if the filesystem was formatted * with large inodes (-I 256 or larger) and there are not currently EAs * consuming all of the available space. For new inodes we always reserve * enough space for the kernel's known extended fields, but for inodes * created with an old kernel this might not have been the case. None of * the extended inode fields is critical for correct filesystem operation. */ +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do { \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + ext4_decode_extra_time((inode)-xtime,\ + raw_inode-xtime ## _extra);\ +} while (0) Ugly. I expect these could be implemented as plain old C functions. Caller could pass in the address of the ext4_inode field which the function is to operate upon. We thought about that also, but then the caller needs to do all of the pointer gymnastics themselves like: ext4_inode_get_xtime(inode-i_ctime, inode-i_ctime_extra, raw_inode-i_ctime, raw_inode-i_ctime_extra) instead of the current: EXT4_INODE_GET_XTIME(ctime, inode, raw_inode); IMHO it is preferrable to make the multiple callsites more readable than the macros. #if defined(__KERNEL__) || defined(__linux__) (What's the __linux__ for?) #define i_reserved1osd1.linux1.l_i_reserved1 #define i_frag osd2.linux2.l_i_frag This is actually unrelated to the current patch, just part of the context. AFAIK, this is historical, so that the kernel and e2fsprogs can use the same ext2_fs.h header. I don't think it is really needed, but such cleanup shouldn't be a part of this patch either. +static inline struct timespec ext4_current_time(struct inode *inode) +{ + return (inode-i_sb-s_time_gran NSEC_PER_SEC) ? + current_fs_time(inode-i_sb) : CURRENT_TIME_SEC; +} Now, I've forgotten how this works. Remind me, please. Can ext4 filesystems ever have one-second timestamp granularity? If so, how does one cause that to come about? Yes, this is possible if an ext2/3/4 filesystem is formatted with 128-byte inodes (which is the default for all but ext4) and this fs is mounted as ext4dev. The inodes can never hold the extra time information (FITS_IN_INODE check above) so the superblock limits the timestamp resolution to 1s in that case. @@ -153,6 +153,7 @@ unsigned long i_ext_generation; struct ext4_ext_cache i_cached_extent; + struct timespec i_crtime; }; It is unobvious what this field does. Please prefer to add commentary to _all_ struct fields - it really helps. It is the inode creation time. This is useful for debug/forensic purposes, and at some point there will be an API so that Samba can use it also. #endif /* _LINUX_EXT4_FS_I */ Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h === --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 17:28:15.0 -0700 +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 17:39:05.0 -0700 @@ -79,6 +79,7 @@ char *s_qf_names[MAXQUOTAS];/* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */ OK, I can kind-of see how this is working, but some overall description of how the inode sizing design operates would be helpful. It would certainly make reviewing of this proposed change more fruitful. Perhaps that new comment over EXT4_FITS_IN_INODE() would be a suitable place. Hmm, I'm sure there were emails on the topic, but they aren't attached to the patch. s_want_extra_isize is just an override for sizeof(ext4_inode) in case the sysadmin wants to reserve more fields in new inodes. There is also s_min_extra_isize which is what the kernel and e2fsck guarantee that will be available in all in-use inodes, if RO_COMPAT_EXTRA_ISIZE is set (ro-compat so that older kernels can't create inodes with a smaller extra_isize). That feature is only enabled
Re: -mm merge plans for 2.6.23
On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: romfs-printk-format-warnings.patch NACK on this one. The rest of it is nacked anyway, until we unify the point and get_unmapped_area methods of the MTD API. -- dwmw2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 10, 2007 23:34 -0400, Trond Myklebust wrote: On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: So my vote is to increment i_version in common code every time any change is made to the file, and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. ... but doesn't lustre want to control its i_version... so maybe not :-( If lustre wants to be exportable via pNFS (as Peter Braam has suggested it should), then it had better be able to return a change attribute that is compatible with the NFSv4.1 spec... The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre version can be used to compare the updates of two inodes. It is set to be the Lustre transaction number (which is sequentially incremented on a per filesystem basis), so that we can use the per-inode version to do replay of client operations even if they have been disconnected for a long time, which is why we want to be able to control the actual value. We don't want the version to be updated for e.g. file defragmentation or other similar internal-only changes which need ext4_mark_inode_dirty(). We had a patch to disable ext4 inode versioning by a flag the superblock, but we dropped it at the last minute because it needed some updates and we didn't want to wait on that for submitting these changes upstream. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode
On Jul 10, 2007 16:30 -0700, Andrew Morton wrote: Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] --- Index: linux-2.6.21/include/linux/ext4_fs.h === --- linux-2.6.21.orig/include/linux/ext4_fs.h +++ linux-2.6.21/include/linux/ext4_fs.h @@ -342,6 +342,7 @@ struct ext4_inode { __le32 i_atime_extra; /* extra Access time (nsec 2 | epoch) */ __le32 i_crtime; /* File Creation time */ __le32 i_crtime_extra; /* extra FileCreationtime (nsec 2 | epoch) */ + __le32 i_version_hi; /* high 32 bits for 64-bit version */ }; Aren't there forward- backward-compatibility issues here? How does the filesystem driver work out whether this field is present and valid? This uses the same EXT4_FITS_IN_INODE() check as any other, so the compatibility issues are handled. NFSv4 could live with 32-bit versions with only a small danger of overflow, so we can still export ext3 filesystems with 128-byte inodes that have been updated to ext4. For Lustre (which requires 64-bit versions), we will enforce that space is available with s_min_extra_isize and RO_COMPAT_EXTRA_ISIZE. In the case where an older ext3/ext4 filesystem with large inodes does not have enough space for i_version_hi the EAs that follow i_extra_isize will be shifted to make room for it (if possible, which is likely). There are no critical fields inside i_extra_isize so in the rare case of a failure to enlarge the i_extra_isize is not a cause for alarm. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. You are confusing i_generation (the instance of this inode number) with i_version (whether this file has been modified)? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. Aren't you confusing i_version and i_generation here? Those are two separate inode fields. Cheers Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
fallocate, Re: -mm merge plans for 2.6.23
fallocate-implementation-on-i86-x86_64-and-powerpc.patch fallocate-on-s390.patch fallocate-on-ia64.patch fallocate-on-ia64-fix.patch Merge. Hopefull this will be done during the 2.6.23 merge window, but right now it's not (yet). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote: On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. You are confusing i_generation (the instance of this inode number) with i_version (whether this file has been modified)? Yes, sorry. Objection dropped. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Wed, Jul 11, 2007 at 05:57:17AM -0600, Andreas Dilger wrote: Ah, this is the patch to disable i_version updates for Lustre. I don't think any normal user would use this mount option, so I don't know if there is a need to document it. This is a reason to not merge it at all. If the only user of this is the out of tree lustre code there is no need to put this in. I should rather stay in clusterfs' patchkit. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block +* with this same handle */ + if ((jbd2_journal_extend(handle, +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete +some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). +static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry, +int value_offs_shift, void *to, +void *from, size_t n, int blocksize) +{ + /* Adjust the value offsets of the entries */ + for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) { + if (!last-e_value_block last-e_value_size) { + new_offs = le16_to_cpu(last-e_value_offs) + + value_offs_shift; + BUG_ON(new_offs + le32_to_cpu(last-e_value_size) + blocksize); + last-e_value_offs = cpu_to_le16(new_offs); + } + } + /* Shift the entries by n bytes */ + memmove(to, from, n); +} Under what circumstances will that new BUG_ON trigger? Can it be triggered via incorrect disk contents? If so, it should not be there. Only under code defect or memory corruption. The needed extra space was already checked in ext4_expand_extra_isize_ea(), but I asked for this BUG_ON() because it would otherwise seem that there was a chance from reading just the above code that the shifted EA might overflow the buffer. +int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize, + struct ext4_inode *raw_inode, handle_t *handle) +{ + down_write(EXT4_I(inode)-xattr_sem); +retry: + if (EXT4_I(inode)-i_extra_isize = new_extra_isize) { + up_write(EXT4_I(inode)-xattr_sem); + return 0; + } So.. xattr_sem is a lock which protects i_extra_isize? That seems a bit strange to me - I'd have expected i_mutex. Well, since this is the only code that ever changes i_extra_isize, and it also needs to move the EAs around, it makes sense to use the EA lock for it. This is also a relatively low-contention lock, and it needs to be held to access any of the EAs (which are the only things beyond i_extra_isize). + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, +total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; +
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Tue, 2007-07-10 at 23:16 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Journal checksum feature has been added to detect corruption of journal. That was brief. No description of what it does, how it does it, why it does it, how one operates it, why (or why not) one would choose to enable it nor of the costs of enabling it. Somehow the description was lost due to the way the original patchset was sent to ext4 list. Here is the original description: The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c --- linux024/fs/ext4/super.c2007-06-25 16:19:24.0 -0500 +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500 @@ -721,6 +721,7 @@ enum { Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, + Opt_journal_checksum, Opt_journal_async_commit, A new user-visible feature should be accompanied by an update to Documentation/filesystems/ext4.txt? Ok. Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, @@ -760,6 +761,8 @@ static match_table_t tokens = { {Opt_journal_update, journal=update}, {Opt_journal_inum, journal=%u}, {Opt_journal_dev, journal_dev=%u}, + {Opt_journal_checksum, journal_checksum}, + {Opt_journal_async_commit, journal_async_commit}, What's journal_async_commit? I hope the description has answered this question. {Opt_abort, abort}, {Opt_data_journal, data=journal}, {Opt_data_ordered, data=ordered}, @@ -948,6 +951,13 @@ static int parse_options (char *options, return 0; *journal_devnum = option; break; + case Opt_journal_checksum: + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; + case Opt_journal_async_commit: + set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT); + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; case Opt_noload: set_opt (sbi-s_mount_opt, NOLOAD); break; @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super goto failed_mount4; } + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); + jbd2_journal_clear_features(sbi-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else { + jbd2_journal_clear_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } Some discussion of the forward- and backward- compatibility design would be appropriate. Also a description of whether and how fsck can be used to fix up these feature flags. /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500 +++ linux/fs/jbd2/commit.c 2007-06-26 08:40:03.0 -0500 @@ -21,6 +21,7 @@ #include linux/mm.h #include linux/pagemap.h #include linux/jiffies.h +#include linux/crc32.h I think we just broke
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Jul 10, 2007 22:40 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote: A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. Would I be correct in assuming that a later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir directories? Correct. If so, that is worth a mention in the changelog, perhaps? Please remind us what is the behaviour of an RO_COMPAT flag? It means that old ext4, ext3 and ext2 can only mount this fs read-only, yes? Also correct. The COMPAT flag behaviour is described in detail in Documentation/filesystems/ext[234].txt +static inline void ext4_inc_count(handle_t *handle, struct inode *inode) +{ + inc_nlink(inode); + if (is_dx(inode) inode-i_nlink 1) { + /* limit is 16-bit i_links_count */ + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) { + inode-i_nlink = 1; + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb, + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); + } + } +} Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2? Because that means it was previously 1 (inc_nlink() was already called). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Jul 11, 2007 17:16 +0530, Girish Shilamkar wrote: + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); + jbd2_journal_clear_features(sbi-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else { + jbd2_journal_clear_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } Some discussion of the forward- and backward- compatibility design would be appropriate. Also a description of whether and how fsck can be used to fix up these feature flags. It is forward backward compatible to enable COMPAT_CHECKSUM. That just means the commit blocks will have checksums in them, but older kernels will just ignore them. Hmm, I suppose there might be an issue with upgrade, downgrade, upgrade in that the commit blocks would not have checksums even though the superblock says they will... Does that mean we should accept a checksum == 0 as being valid (which is not very nice, given that 0 is an oft-hit bad value), or that we need a flag in every commit block which indicates if it actually has a checksum? The INCOMPAT_ASYNC_COMMIT can't be handled safely by older kernels, since they would assume commit block == complete transaction, which isn't true if the commit block didn't wait for the rest of the blocks to make it to the disk. I don't think e2fsck can be used to individually clean up the feature flags, but it is always possible to remove and recreate the journal... - /* AKPM: buglet - add `i' to tmp! */ Damn. After, what, seven years, someone actually fixed it? for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + struct commit_header *tmp = + (struct commit_header *)(bh-b_data + i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + if (JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM)) { + tmp-h_chksum_type = JBD2_CRC32_CHKSUM; + tmp-h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; + tmp-h_chksum[0]= cpu_to_be32(crc32_sum); + } } And in doing so, changed the on-disk format of the journal commit blocks. Surely this was worth a mention in the changelog, if not a standalone patch? I don't think this is worth doing, really. Why not just leave the format as it was, take the loop out and run this code once rather than eight times? Well, we aren't using the rest of the commit block in any case. I think the original intention was that we'd get 8 copies of the commit block so we would be sure to get a good one. I don't know whether we'd rather have 8 copies of the commit block, or more potential to expand the commit block? I don't personally have any preference, since the checksum should be a more robust way of checking validity than having multiple copies, so we may as well remove the loop and stick with a single copy for now. @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa unsigned intsequence; int blocktype; int tag_bytes = journal_tag_bytes(journal); + __u32 crc32_sum = ~0; /* Transactional Checksums */ We normally use __u32 for visible-to-userspace stuff. Kernel code would use plain old u32. Ok. Since the checksum is saved to disk, it seems more appropriate to use __u32 or maybe even __be32, though I'm not sure if the crc32 functions do that correctly or not. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version), then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? (I don't understand why i_version even needs to be 64 bits in the first place.) Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? Should it use a seqlock like i_size? Could we use the same seqlock that i_size uses, or would we need a separate one? NeilBrown -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
On Wed, 2007-07-11 at 11:24 +0100, Christoph Hellwig wrote: On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote: Is that conjecture, or do you have evidence to that effect? Most users of this file are using it via the glibc interfaces, and there probably aren't all that many users of it in the first place. I have written parsers for personal projects that might not have been happy to deal with additional fields myself for example.. I modified the patch to add fields towards the end of each line. i.e after 'freq, passno' fields. And symlinked /etc/mtab to /proc/mounts. mount,df and friends were all perfectly happy. I imagine your script may also be happy with the additional fields **towards the end**. I would like to avoid one more mount interface if we can help it. RP - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:53 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a noversion mount option to disable inode version updates. Why is this option being offered to our users? To reduce disk traffic, like noatime? If so, what are the implications of this? What would the user lose? This has been removed in the latest patch set; it's needed only for Lustre, because they set the version field themselves. Lustre needs the inode version to be globally monotonically increasing, so it can order updates between two different files, so it does this itself. NFSv4 only uses i_version to detect changes, and so there's no need to use a global atomic counter for i_version. So the thinking was that there was no point doing the global atomic counter if it was not necessary. Since noversion is Lustre specific, we've dropped that from the list of patches that we'll push, and so the inode version will only have local per-inode significance, and not have any global ordering properties. We have not actually benchmarked whether or not doing the global ordering actually *matters* in terms of being actually noticeable. If it isn't noticeable, I wouldn't mind changing things so that we always make i_version globally significant (without a mount option), and make life a bit easier for the Lustre folks. Or if someone other distributed filesystem requests a globally significant i_version. But we can cross that bridge when we get to it - Ted - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse [EMAIL PROTECTED] wrote: On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: romfs-printk-format-warnings.patch NACK on this one. The rest of it is nacked anyway, until we unify the point and get_unmapped_area methods of the MTD API. Methinks you meant nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not romfs-printk-format-warnings.patch. I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Wed, 11 Jul 2007 10:21:03 -0700 Andrew Morton wrote: On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse [EMAIL PROTECTED] wrote: On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: romfs-printk-format-warnings.patch NACK on this one. The rest of it is nacked anyway, until we unify the point and get_unmapped_area methods of the MTD API. Methinks you meant nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not romfs-printk-format-warnings.patch. I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks. Thanks. I was certainly getting confused. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote: And just by-the-way, the server doesn't really have the option of not sending the attribute. If i_version isn't defined, it has to fake something using mtime, and hope that is good enough. ctime, actually--the change attribute is also supposed to be updated on attribute updates. Alternately we could mandate that i_version is always kept up-to-date and if a filesystem doesn't have anything to load from storage, it just sets it to the current time in nanoseconds. That would mean that a client would need to flush it's cache whenever the inode fell out of cache on the server, but I don't think we can reliably do better than that. I think I like that approach. So my vote is to increment i_version in common code every time any change is made to the file, and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. So the client would be invalidating its cache more often than necessary, rather than failing to invalidate it when it should. I agree that that's probably the better tradeoff, although I wish I had a better idea of the downside. I don't know, for example, whether users might see unpleasant results if every client has to reread its cached data on a reboot. The currently proposed change--just providing a model change attribute implementation for ext4 and leaving other filesystems untouched--is a more conservative step. So I'm inclined to just do this ext4 thing first, and then look into further change attribute experiments next time around --b. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: - /* AKPM: buglet - add `i' to tmp! */ Damn. After, what, seven years, someone actually fixed it? for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + struct commit_header *tmp = + (struct commit_header *)(bh-b_data + i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + if (JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM)) { + tmp-h_chksum_type = JBD2_CRC32_CHKSUM; + tmp-h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; + tmp-h_chksum[0]= cpu_to_be32(crc32_sum); + } } And in doing so, changed the on-disk format of the journal commit blocks. Surely this was worth a mention in the changelog, if not a standalone patch? I don't think this is worth doing, really. Why not just leave the format as it was, take the loop out and run this code once rather than eight times? Well, we aren't using the rest of the commit block in any case. I think the original intention was that we'd get 8 copies of the commit block so we would be sure to get a good one. I don't know whether we'd rather have 8 copies of the commit block, or more potential to expand the commit block? I don't personally have any preference, since the checksum should be a more robust way of checking validity than having multiple copies, so we may as well remove the loop and stick with a single copy for now. We've never altered any commit block sectors apart from the zeroeth one (eight times) due to the above bug. So I'd suggest that we should formalise the old bug and leave the format as-is. That'll leave lots of space spare in the commit block. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block + * with this same handle */ + if ((jbd2_journal_extend(handle, + EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete + some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, + total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; + new_extra_isize = s_min_extra_isize; Aren't we missing a brelse(bh) here? Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] --- fs/ext4/super.c |7 7 + 0 - 0 ! 1 file changed, 7 insertions(+) Index: linux-2.6/fs/ext4/super.c === --- linux-2.6.orig/fs/ext4/super.c 2007-07-11 09:36:00.0 -0500 +++ linux-2.6/fs/ext4/super.c 2007-07-11 09:36:20.0 -0500 @@ -1804,6 +1804,13 @@ static int ext4_fill_super (struct super goto failed_mount3; } + if (ext4_blocks_count(es) 0xULL + !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_64BIT)) { + printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n); + goto failed_mount4; + } + /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names that are more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. Instead of fixing this on procfs might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd2/jbd2-debug. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] --- fs/Kconfig | 105 + 5 - 0 ! fs/jbd2/journal.c| 6727 +40 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 3 files changed, 33 insertions(+), 46 deletions(-) Index: linux-2.6/fs/jbd2/journal.c === --- linux-2.6.orig/fs/jbd2/journal.c2007-07-11 09:46:25.0 -0500 +++ linux-2.6/fs/jbd2/journal.c 2007-07-11 11:31:30.0 -0500 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1951,64 +1952,50 @@ void jbd2_journal_put_journal_head(struc } /* - * /proc tunables + * debugfs tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u8 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) -static struct proc_dir_entry *proc_jbd_debug; +#define JBD2_DEBUG_NAME jbd2-debug -static int read_jbd_debug(char *page, char **start, off_t off, - int count, int *eof, void *data) -{ - int ret; +struct dentry *jbd2_debugfs_dir, *jbd2_debug; - ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug); - *eof = 1; - return ret; +static void __init jbd2_create_debugfs_entry(void) +{ + jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL); + if (jbd2_debugfs_dir) + jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO, + jbd2_debugfs_dir, + jbd2_journal_enable_debug); } -static int write_jbd_debug(struct file *file, const char __user *buffer, - unsigned long count, void *data) +static void __exit jbd2_remove_debugfs_entry(void) { - char buf[32]; - - if (count ARRAY_SIZE(buf) - 1) - count = ARRAY_SIZE(buf) - 1; - if (copy_from_user(buf, buffer, count)) - return -EFAULT; - buf[ARRAY_SIZE(buf) - 1] = '\0'; - jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10); - return count; + if (jbd2_debug) + debugfs_remove(jbd2_debug); + if (jbd2_debugfs_dir) + debugfs_remove(jbd2_debugfs_dir); } -#define JBD_PROC_NAME sys/fs/jbd2-debug +#else -static void __init create_jbd_proc_entry(void) +static void __init jbd2_create_debugfs_entry(void) { - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL); - if (proc_jbd_debug) { - /* Why is this so hard? */ - proc_jbd_debug-read_proc = read_jbd_debug; - proc_jbd_debug-write_proc = write_jbd_debug; - } + do { + } while (0); } -static void __exit jbd2_remove_jbd_proc_entry(void) +static void __exit jbd2_remove_debugfs_entry(void) { - if (proc_jbd_debug) - remove_proc_entry(JBD_PROC_NAME, NULL); + do { + } while (0); } -#else - -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) - #endif struct kmem_cache *jbd2_handle_cache; @@ -2067,7 +2054,7 @@ static int __init journal_init(void) ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2065,7 @@ static void __exit journal_exit(void) if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6/include/linux/jbd2.h === --- linux-2.6.orig/include/linux/jbd2.h 2007-07-11 09:46:25.0 -0500 +++ linux-2.6/include/linux/jbd2.h 2007-07-11 10:37:06.0 -0500 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING -extern int jbd2_journal_enable_debug; +extern u8 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ do {\ Index: linux-2.6/fs/Kconfig === ---
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 2007-07-11 at 10:34 -0700, Andrew Morton wrote: On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block +* with this same handle */ + if ((jbd2_journal_extend(handle, +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete +some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, +total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; + new_extra_isize = s_min_extra_isize; Aren't we missing a brelse(bh) here? Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. I will send a patch to add the comments and make the suggested corrections. Thanks, Kalpak. To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote: On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version) That's correct. (Though it's the client that's doing the comparison, actually--the server is just reporting the value.) then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? The choice of upper word in your example is arbitrary, but other than that I believe your example is essentially the only one. So this would only happen when *both* - the read of the new value of the low word happens precisely 2^32 i_version updates after the word was read on the client's previous cache revalidation, and - the value of i_version itself is close enough to a 32-bit boundary that wraparound can happen between the reads of the high and low words. (I don't understand why i_version even needs to be 64 bits in the first place.) A 32-bit i_version could in theory wrap pretty quickly, couldn't it? That's not a problem in itself--the problem would only arise if two subsequent client queries of the change attribute happened a multiple of 2^32 i_version increments apart. This is more likely than the previous scenario, but still very unlikely. I would have guessed that even in situations with a very high rate of updates and a low rate of client revalidations, the chance of two revalidations happening exactly 2^32 updates apart would still be no more than 1 in 2^32. (Could odd characteristics of the workloads (like updates that tend to happen in power-of-2 groups?) make it any more likely?) I'd be happier if ext4 at least allowed the possibility of 64 bits in the future. And there's always the chance someone would find a use for an i_version that was nondecreasing, even if nfs didn't care. Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? If the client always requests the change attribute before reading, and the i_version is always updated after data is modified, I think we're OK. Admittedly this is a little subtle. --b. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote: On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-) OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there a I'm a high-end array bit in the sense data that I'm unaware of? There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this. In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs. ric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] nfs: disable leases over NFS
On Wed, Jul 11, 2007 at 11:20:18AM +0100, Christoph Hellwig wrote: On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote: OK, after looking at this a little more, I'm less happy about the idea of erroring out by default: - There are a ton of filesystems that probably should allow leases, and only a few (network filesystems) that shouldn't, so leaving leases on by default seems simpler. But it gets you possible wrong behaviour by default. I'm not a big fan of non-trivial default methods as you see :) Yeah, I do understand the attraction of doing it your way. With some quick grepping, what I found was: - about 28 on-disk filesystems, all of which I assume should support leases. - about 12 filesystems that either are network filesystems (9p, afs, cifs, coda, ncpfs, nfs, nfs4, ocfs2, smbfs) or that don't have control over all opens/data modifications for whatever reason (ecryptfs, fuse, hostfs), so shouldn't be giving out leases by default. - A bunch of synthetic filesystems (proc, sysfs...). The latter category being the strongest argument for your approach, since it's sort of ludicrous to allow leases on those filesystems, but currently we do just out of laziness. (Or, in any case, I don't see any reason the current code wouldn't allow them; I haven't actually tested). - We already fall back on the local method by default in the case of locks, and I don't see a reason to treat leases differently. - The patch to add .setlease = setlease, to all the file_operations is going to be a big patch that changes behavior in a way that might be easy to miss (because it changes behavior exactly on those filesystems it *doesn't* touch.) I think it'll be easier to get better review on the patch that adds a method just to those filesystems that we're disabling leases for. Anyway, feel free to go ahead with the simpler version for now, I'll do the switchover for locks and leases when I get some time. That would be great. For now I think I'll also add another simple EINVAL-returning setlease() at least to cifs (partly just as an attempt to goad Steve French into following up on a promise at OLS to look into proper lease support for cifs) But I've appended my first attempt at your suggestion for leases, in case it's of use. (Untested.) --b. From: J. Bruce Fields [EMAIL PROTECTED] Subject: [PATCH] Turn off support for fcntl leases by default A lease enforces mutual exclusion with conflicting opens. On filesystems such as network filesystems where not all opens happen under the control of this kernel, the default setlease() operation, which can only check for local conflicts, is incorrect. So in most cases the correct thing is probably to disable leases for those filesystems until they can implement something more sophisticated. The only users of leases that I'm aware of (samba and nfsd) are actually using them primarly to get synchronous notification of changes to files so that they can update their caches. So, more generally, we should disable leases for any filesystem which might allow file contents to change without a local open for write occuring. That includes most of the synthetic filesystems (like proc), which the file servers probably don't want to export anyway. Previously we explicitly disabled leases for some network filesystems, but with this patch we disable by default and add an explicit setlease method for those filesystems on which leases will be allowed. Signed-off-by: J. Bruce Fields [EMAIL PROTECTED] --- fs/adfs/file.c |1 + fs/affs/file.c |1 + fs/bfs/file.c |1 + fs/ext2/file.c |2 ++ fs/ext3/file.c |1 + fs/ext4/file.c |1 + fs/fat/file.c |1 + fs/hfs/inode.c |1 + fs/hfsplus/inode.c |1 + fs/hppfs/hppfs_kern.c |1 + fs/jffs2/file.c |1 + fs/jfs/file.c |1 + fs/locks.c |8 fs/minix/file.c |1 + fs/nfs/file.c | 12 fs/ntfs/file.c |1 + fs/qnx4/file.c |1 + fs/ramfs/file-mmu.c |1 + fs/ramfs/file-nommu.c |1 + fs/read_write.c |1 + fs/reiserfs/file.c |1 + fs/sysv/file.c |1 + fs/udf/file.c |1 + fs/ufs/file.c |1 + fs/xfs/linux-2.6/xfs_file.c |2 ++ 25 files changed, 29 insertions(+), 16 deletions(-) diff --git a/fs/adfs/file.c b/fs/adfs/file.c index f544a28..4e99861 100644 --- a/fs/adfs/file.c +++ b/fs/adfs/file.c @@ -34,6 +34,7 @@ const struct file_operations adfs_file_operations = { .write = do_sync_write, .aio_write =
[PATCH 00/23] Mount writer count API (read-only bind mounts prep)
The most contentious part of the r/o bind mount patches is actually implementing the count tracking. It has NUMA and SMP implications, and is going to need to have a whole discussion on that one patch. These patches, on the other hand, simply introduce a new API: mnt_want_write() and mnt_drop_write(). They do not functionally change the kernel, just alter the way in which we check filesystems for our ability to write to them. These functions should be used in place of IS_RDONLY(inode) as they explicitly spell out when a mount is expected to _stay_ r/w instead of simply checking at a single point in time. It should take a very small number (like 3) of small patches to actually implement read-only bind mounts on top of this new API. These apply to current -git (as of July 11th, 2007). --- Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. Signed-off-by: Dave Hansen [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/23] rearrange may_open() to be r/o friendly
may_open() calls vfs_permission() before it does checks for IS_RDONLY(inode). It checks _again_ inside of vfs_permission(). The check inside of vfs_permission() is going away eventually. With the mnt_want/drop_write() functions, all of the r/o checks (except for this one) are consistently done before calling permission(). Because of this, I'd like to use permission() to hold a debugging check to make sure that the mnt_want/drop_write() calls are actually being made. So, to do this: 1. remove the IS_RDONLY() check from permission() 2. enforce that you must mnt_want_write() before even calling permission() 3. enable a debugging in permission() We need to rearrange may_open(). Here's the patch. --- lxc-dave/fs/namei.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff -puN fs/namei.c~rearrange-permission-and-ro-checks-in-may_open fs/namei.c --- lxc/fs/namei.c~rearrange-permission-and-ro-checks-in-may_open 2007-07-10 12:46:01.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:01.0 -0700 @@ -228,6 +228,10 @@ int permission(struct inode *inode, int { umode_t mode = inode-i_mode; int retval, submask; + struct vfsmount *mnt = NULL; + + if (nd) + mnt = nd-mnt; if (mask MAY_WRITE) { @@ -252,7 +256,7 @@ int permission(struct inode *inode, int * the fs is mounted with the noexec flag. */ if ((mask MAY_EXEC) S_ISREG(mode) (!(mode S_IXUGO) || - (nd nd-mnt (nd-mnt-mnt_flags MNT_NOEXEC + (mnt (mnt-mnt_flags MNT_NOEXEC return -EACCES; /* Ordinary permission routines do not understand MAY_APPEND. */ @@ -1546,10 +1550,6 @@ int may_open(struct nameidata *nd, int a if (S_ISDIR(inode-i_mode) (flag FMODE_WRITE)) return -EISDIR; - error = vfs_permission(nd, acc_mode); - if (error) - return error; - /* * FIFO's, sockets and device files are special: they don't * actually live on the filesystem itself, and as such you @@ -1564,6 +1564,10 @@ int may_open(struct nameidata *nd, int a flag = ~O_TRUNC; } else if (IS_RDONLY(inode) (flag FMODE_WRITE)) return -EROFS; + + error = vfs_permission(nd, acc_mode); + if (error) + return error; /* * An append-only file must be opened in append mode for writing. */ _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/23] create cleanup helper svc_msnfs()
I'm going to be modifying nfsd_rename() shortly to support read-only bind mounts. This #ifdef is around the area I'm patching, and it starts to get really ugly if I just try to add my new code by itself. Using this little helper makes things a lot cleaner to use. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/nfsd/vfs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff -puN fs/nfsd/vfs.c~create-svc_msnfs-helper fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~create-svc_msnfs-helper 2007-07-10 12:46:02.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:02.0 -0700 @@ -837,6 +837,15 @@ nfsd_read_actor(read_descriptor_t *desc, return size; } +static inline int svc_msnfs(struct svc_fh *ffhp) +{ +#ifdef MSNFS + return (ffhp-fh_export-ex_flags NFSEXP_MSNFS); +#else + return 0; +#endif +} + static __be32 nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, loff_t offset, struct kvec *vec, int vlen, unsigned long *count) @@ -849,11 +858,9 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st err = nfserr_perm; inode = file-f_path.dentry-d_inode; -#ifdef MSNFS - if ((fhp-fh_export-ex_flags NFSEXP_MSNFS) - (!lock_may_read(inode, offset, *count))) + + if (svc_msnfs(fhp) !lock_may_read(inode, offset, *count)) goto out; -#endif /* Get readahead parameters */ ra = nfsd_get_raparms(inode-i_sb-s_dev, inode-i_ino); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/23] r/o bind mounts: stub functions
This patch adds two function mnt_want_write() and mnt_drop_write(). These are used like a lock pair around and fs operations that might cause a write to the filesystem. Before these can become useful, we must first cover each place in the VFS where writes are performed with a want/drop pair. When that is complete, we can actually introduce code that will safely check the counts before allowing r/w-r/o transitions to occur. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namespace.c| 46 + lxc-dave/include/linux/mount.h |3 ++ 2 files changed, 49 insertions(+) diff -puN fs/namespace.c~add-vfsmount-writer-count fs/namespace.c --- lxc/fs/namespace.c~add-vfsmount-writer-count2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/namespace.c 2007-07-10 12:46:04.0 -0700 @@ -76,6 +76,52 @@ struct vfsmount *alloc_vfsmnt(const char return mnt; } +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes are able to occur to + * a filesystem. + */ +/* + * This tells the low-level filesystem that a write is + * about to be performed to it, and makes sure that + * writes are allowed before returning success. When + * the write operation is finished, mnt_drop_write() + * must be called. This is effectively a refcount. + */ +int mnt_want_write(struct vfsmount *mnt) +{ + if (__mnt_is_readonly(mnt)) + return -EROFS; + return 0; +} +EXPORT_SYMBOL_GPL(mnt_want_write); + +/* + * Tells the low-level filesystem that we are done + * performing a write to it. Must be matched with + * mnt_want_write() call above. + */ +void mnt_drop_write(struct vfsmount *mnt) +{ +} +EXPORT_SYMBOL_GPL(mnt_drop_write); + +/* + * This shouldn't be used directly ouside of the VFS. + * It does not guarantee that the filesystem will stay + * r/w, just that it is right *now*e + * mnt_want/drop_write() will _keep_ the filesystem + * r/w. + */ +int __mnt_is_readonly(struct vfsmount *mnt) +{ + return (mnt-mnt_sb-s_flags MS_RDONLY); +} +EXPORT_SYMBOL_GPL(__mnt_is_readonly); + int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb) { mnt-mnt_sb = sb; diff -puN include/linux/mount.h~add-vfsmount-writer-count include/linux/mount.h --- lxc/include/linux/mount.h~add-vfsmount-writer-count 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/include/linux/mount.h 2007-07-10 12:46:04.0 -0700 @@ -70,9 +70,12 @@ static inline struct vfsmount *mntget(st return mnt; } +extern int mnt_want_write(struct vfsmount *mnt); +extern void mnt_drop_write(struct vfsmount *mnt); extern void mntput_no_expire(struct vfsmount *mnt); extern void mnt_pin(struct vfsmount *mnt); extern void mnt_unpin(struct vfsmount *mnt); +extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/23] elevate mount count for extended attributes
This basically audits the callers of xattr_permission(), which calls permission() and can perform writes to the filesystem. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/nfsd/nfs4proc.c |7 ++- lxc-dave/fs/xattr.c | 16 ++-- 2 files changed, 20 insertions(+), 3 deletions(-) diff -puN fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c --- lxc/fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/nfsd/nfs4proc.c 2007-07-10 12:46:10.0 -0700 @@ -626,14 +626,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st return status; } } + status = mnt_want_write(cstate-current_fh.fh_export-ex_mnt); + if (status) + return status; status = nfs_ok; if (setattr-sa_acl != NULL) status = nfsd4_set_nfs4_acl(rqstp, cstate-current_fh, setattr-sa_acl); if (status) - return status; + goto out; status = nfsd_setattr(rqstp, cstate-current_fh, setattr-sa_iattr, 0, (time_t)0); +out: + mnt_drop_write(cstate-current_fh.fh_export-ex_mnt); return status; } diff -puN fs/xattr.c~elevate-mount-count-for-extended-attributes fs/xattr.c --- lxc/fs/xattr.c~elevate-mount-count-for-extended-attributes 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/xattr.c 2007-07-10 12:46:10.0 -0700 @@ -11,6 +11,7 @@ #include linux/slab.h #include linux/file.h #include linux/xattr.h +#include linux/mount.h #include linux/namei.h #include linux/security.h #include linux/syscalls.h @@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co * filesystem or on an immutable / append-only inode. */ if (mask MAY_WRITE) { - if (IS_RDONLY(inode)) - return -EROFS; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) return -EPERM; } @@ -236,7 +235,11 @@ sys_setxattr(char __user *path, char __u error = user_path_walk(path, nd); if (error) return error; + error = mnt_want_write(nd.mnt); + if (error) + return error; error = setxattr(nd.dentry, name, value, size, flags); + mnt_drop_write(nd.mnt); path_release(nd); return error; } @@ -251,7 +254,11 @@ sys_lsetxattr(char __user *path, char __ error = user_path_walk_link(path, nd); if (error) return error; + error = mnt_want_write(nd.mnt); + if (error) + return error; error = setxattr(nd.dentry, name, value, size, flags); + mnt_drop_write(nd.mnt); path_release(nd); return error; } @@ -267,9 +274,14 @@ sys_fsetxattr(int fd, char __user *name, f = fget(fd); if (!f) return error; + error = mnt_want_write(f-f_vfsmnt); + if (error) + goto out_fput; dentry = f-f_path.dentry; audit_inode(NULL, dentry-d_inode); error = setxattr(dentry, name, value, size, flags); + mnt_drop_write(f-f_vfsmnt); +out_fput: fput(f); return error; } _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/23] elevate write count for link and symlink calls
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c | 10 ++ 1 file changed, 10 insertions(+) diff -puN fs/namei.c~elevate-write-count-for-link-and-symlink-calls fs/namei.c --- lxc/fs/namei.c~elevate-write-count-for-link-and-symlink-calls 2007-07-10 12:46:09.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:09.0 -0700 @@ -2266,7 +2266,12 @@ asmlinkage long sys_symlinkat(const char if (IS_ERR(dentry)) goto out_unlock; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_symlink(nd.dentry-d_inode, dentry, from, S_IALLUGO); + mnt_drop_write(nd.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(nd.dentry-d_inode-i_mutex); @@ -2361,7 +2366,12 @@ asmlinkage long sys_linkat(int olddfd, c error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto out_unlock; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_link(old_nd.dentry, nd.dentry-d_inode, new_dentry); + mnt_drop_write(nd.mnt); +out_dput: dput(new_dentry); out_unlock: mutex_unlock(nd.dentry-d_inode-i_mutex); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/23] elevate write count for file_update_time()
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/inode.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c --- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/inode.c 2007-07-10 12:46:10.0 -0700 @@ -1221,10 +1221,19 @@ void file_update_time(struct file *file) struct inode *inode = file-f_path.dentry-d_inode; struct timespec now; int sync_it = 0; + int err = 0; if (IS_NOCMTIME(inode)) return; - if (IS_RDONLY(inode)) + /* +* Ideally, we want to guarantee that 'f_vfsmnt' +* is non-NULL here. But, NFS exports need to +* be fixed up before we can do that. So, check +* it for now. - Dave Hansen +*/ + if (file-f_vfsmnt) + err = mnt_want_write(file-f_vfsmnt); + if (err) return; now = current_fs_time(inode-i_sb); @@ -1240,6 +1249,8 @@ void file_update_time(struct file *file) if (sync_it) mark_inode_dirty_sync(inode); + if (file-f_vfsmnt) + mnt_drop_write(file-f_vfsmnt); } EXPORT_SYMBOL(file_update_time); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/23] mount_is_safe(): add comment
This area of code is currently #ifdef'd out, so add a comment for the time when it is actually used. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namespace.c |4 1 file changed, 4 insertions(+) diff -puN fs/namespace.c~mount-is-safe-add-comment fs/namespace.c --- lxc/fs/namespace.c~mount-is-safe-add-comment2007-07-10 12:46:11.0 -0700 +++ lxc-dave/fs/namespace.c 2007-07-10 12:46:11.0 -0700 @@ -728,6 +728,10 @@ static int mount_is_safe(struct nameidat if (current-uid != nd-dentry-d_inode-i_uid) return -EPERM; } + /* +* We will eventually check for the mnt-writer_count here, +* but since the code is not used now, skip it - Dave Hansen +*/ if (vfs_permission(nd, MAY_WRITE)) return -EPERM; return 0; _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/23] elevate mnt writers for callers of vfs_mkdir()
Pretty self-explanatory. Fits in with the rest of the series. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c|5 + lxc-dave/fs/nfsd/nfs4recover.c |4 2 files changed, 9 insertions(+) diff -puN fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c --- lxc/fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:07.0 -0700 @@ -1993,7 +1993,12 @@ asmlinkage long sys_mkdirat(int dfd, con if (!IS_POSIXACL(nd.dentry-d_inode)) mode = ~current-fs-umask; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_mkdir(nd.dentry-d_inode, dentry, mode); + mnt_drop_write(nd.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(nd.dentry-d_inode-i_mutex); diff -puN fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/nfsd/nfs4recover.c --- lxc/fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/nfsd/nfs4recover.c 2007-07-10 12:46:07.0 -0700 @@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client dprintk(NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n); goto out_put; } + status = mnt_want_write(rec_dir.mnt); + if (status) + goto out_put; status = vfs_mkdir(rec_dir.dentry-d_inode, dentry, S_IRWXU); + mnt_drop_write(rec_dir.mnt); out_put: dput(dentry); out_unlock: _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 15/23] unix_find_other() elevate write count for touch_atime()
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/net/unix/af_unix.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff -puN net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime net/unix/af_unix.c --- lxc/net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime 2007-07-10 12:46:11.0 -0700 +++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:11.0 -0700 @@ -702,21 +702,27 @@ static struct sock *unix_find_other(stru err = path_lookup(sunname-sun_path, LOOKUP_FOLLOW, nd); if (err) goto fail; + + err = mnt_want_write(nd.mnt); + if (err) + goto put_path_fail; + err = vfs_permission(nd, MAY_WRITE); if (err) - goto put_fail; + goto mnt_drop_write_fail; err = -ECONNREFUSED; if (!S_ISSOCK(nd.dentry-d_inode-i_mode)) - goto put_fail; + goto mnt_drop_write_fail; u=unix_find_socket_byinode(nd.dentry-d_inode); if (!u) - goto put_fail; + goto mnt_drop_write_fail; if (u-sk_type == type) touch_atime(nd.mnt, nd.dentry); path_release(nd); + mnt_drop_write(nd.mnt); err=-EPROTOTYPE; if (u-sk_type != type) { @@ -736,7 +742,9 @@ static struct sock *unix_find_other(stru } return u; -put_fail: +mnt_drop_write_fail: + mnt_drop_write(nd.mnt); +put_path_fail: path_release(nd); fail: *error=err; _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 16/23] elevate write count over calls to vfs_rename()
This also uses the little helper in the NFS code to make an if() a little bit less ugly. We introduced the helper at the beginning of the series. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c|4 lxc-dave/fs/nfsd/vfs.c | 15 +++ 2 files changed, 15 insertions(+), 4 deletions(-) diff -puN fs/namei.c~elevate-write-count-over-calls-to-vfs-rename fs/namei.c --- lxc/fs/namei.c~elevate-write-count-over-calls-to-vfs-rename 2007-07-10 12:46:12.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:12.0 -0700 @@ -2597,8 +2597,12 @@ static int do_rename(int olddfd, const c if (new_dentry == trap) goto exit5; + error = mnt_want_write(oldnd.mnt); + if (error) + goto exit5; error = vfs_rename(old_dir-d_inode, old_dentry, new_dir-d_inode, new_dentry); + mnt_drop_write(oldnd.mnt); exit5: dput(new_dentry); exit4: diff -puN fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename 2007-07-10 12:46:12.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:12.0 -0700 @@ -1622,13 +1622,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru if (ndentry == trap) goto out_dput_new; -#ifdef MSNFS - if ((ffhp-fh_export-ex_flags NFSEXP_MSNFS) + if (svc_msnfs(ffhp) ((atomic_read(odentry-d_count) 1) || (atomic_read(ndentry-d_count) 1))) { host_err = -EPERM; - } else -#endif + goto out_dput_new; + } + + host_err = -EXDEV; + if (ffhp-fh_export-ex_mnt != tfhp-fh_export-ex_mnt) + goto out_dput_new; + host_err = mnt_want_write(ffhp-fh_export-ex_mnt); + if (host_err) + goto out_dput_new; + host_err = vfs_rename(fdir, odentry, tdir, ndentry); if (!host_err EX_ISSYNC(tfhp-fh_export)) { host_err = nfsd_sync_dir(tdentry); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 18/23] elevate writer count for do_sys_truncate()
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/open.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff -puN fs/open.c~elevate-writer-count-for-do-sys-truncate fs/open.c --- lxc/fs/open.c~elevate-writer-count-for-do-sys-truncate 2007-07-10 12:46:13.0 -0700 +++ lxc-dave/fs/open.c 2007-07-10 12:46:13.0 -0700 @@ -243,28 +243,28 @@ static long do_sys_truncate(const char _ if (!S_ISREG(inode-i_mode)) goto dput_and_out; - error = vfs_permission(nd, MAY_WRITE); + error = mnt_want_write(nd.mnt); if (error) goto dput_and_out; - error = -EROFS; - if (IS_RDONLY(inode)) - goto dput_and_out; + error = vfs_permission(nd, MAY_WRITE); + if (error) + goto mnt_drop_write_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; /* * Make sure that there are no leases. */ error = break_lease(inode, FMODE_WRITE); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; error = get_write_access(inode); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; error = locks_verify_truncate(inode, NULL, length); if (!error) { @@ -273,6 +273,8 @@ static long do_sys_truncate(const char _ } put_write_access(inode); +mnt_drop_write_and_out: + mnt_drop_write(nd.mnt); dput_and_out: path_release(nd); out: _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 20/23] elevate write count for do_sys_utime() and touch_atime()
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/inode.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff -puN fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime fs/inode.c --- lxc/fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/inode.c 2007-07-10 12:46:15.0 -0700 @@ -1165,22 +1165,23 @@ void touch_atime(struct vfsmount *mnt, s struct inode *inode = dentry-d_inode; struct timespec now; - if (inode-i_flags S_NOATIME) + if (mnt mnt_want_write(mnt)) return; + if (inode-i_flags S_NOATIME) + goto out; if (IS_NOATIME(inode)) - return; + goto out; if ((inode-i_sb-s_flags MS_NODIRATIME) S_ISDIR(inode-i_mode)) - return; + goto out; /* * We may have a NULL vfsmount when coming from NFSD */ if (mnt) { if (mnt-mnt_flags MNT_NOATIME) - return; + goto out; if ((mnt-mnt_flags MNT_NODIRATIME) S_ISDIR(inode-i_mode)) - return; - + goto out; if (mnt-mnt_flags MNT_RELATIME) { /* * With relative atime, only update atime if the @@ -1191,16 +1192,19 @@ void touch_atime(struct vfsmount *mnt, s inode-i_atime) 0 timespec_compare(inode-i_ctime, inode-i_atime) 0) - return; + goto out; } } now = current_fs_time(inode-i_sb); if (timespec_equal(inode-i_atime, now)) - return; + goto out; inode-i_atime = now; mark_inode_dirty_sync(inode); +out: + if (mnt) + mnt_drop_write(mnt); } EXPORT_SYMBOL(touch_atime); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 21/23] sys_mknodat(): elevate write count for vfs_mknod/create()
This takes care of all of the direct callers of vfs_mknod(). Since a few of these cases also handle normal file creation as well, this also covers some calls to vfs_create(). So that we don't have to make three mnt_want/drop_write() calls inside of the switch statement, we move some of its logic outside of the switch. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c | 32 +--- lxc-dave/fs/nfsd/vfs.c |4 lxc-dave/net/unix/af_unix.c |4 3 files changed, 29 insertions(+), 11 deletions(-) diff -puN fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c --- lxc/fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-11 10:10:55.0 -0700 @@ -1912,12 +1912,25 @@ asmlinkage long sys_mknodat(int dfd, con if (error) goto out; dentry = lookup_create(nd, 0); - error = PTR_ERR(dentry); - + if (IS_ERR(dentry)) { + error = PTR_ERR(dentry); + goto out_unlock; + } if (!IS_POSIXACL(nd.dentry-d_inode)) mode = ~current-fs-umask; - if (!IS_ERR(dentry)) { - switch (mode S_IFMT) { + if (S_ISDIR(mode)) { + error = -EPERM; + goto out_dput; + } + if (!S_ISREG(mode) !S_ISCHR(mode) !S_ISBLK(mode) + !S_ISFIFO(mode) !S_ISSOCK(mode) mode != 0) { + error = -EINVAL; + goto out_dput; + } + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; + switch (mode S_IFMT) { case 0: case S_IFREG: error = vfs_create(nd.dentry-d_inode,dentry,mode,nd); break; @@ -1928,14 +1941,11 @@ asmlinkage long sys_mknodat(int dfd, con case S_IFIFO: case S_IFSOCK: error = vfs_mknod(nd.dentry-d_inode,dentry,mode,0); break; - case S_IFDIR: - error = -EPERM; - break; - default: - error = -EINVAL; - } - dput(dentry); } + mnt_drop_write(nd.mnt); +out_dput: + dput(dentry); +out_unlock: mutex_unlock(nd.dentry-d_inode-i_mutex); path_release(nd); out: diff -puN fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:15.0 -0700 @@ -1199,7 +1199,11 @@ nfsd_create(struct svc_rqst *rqstp, stru case S_IFBLK: case S_IFIFO: case S_IFSOCK: + host_err = mnt_want_write(fhp-fh_export-ex_mnt); + if (host_err) + break; host_err = vfs_mknod(dirp, dchild, iap-ia_mode, rdev); + mnt_drop_write(fhp-fh_export-ex_mnt); break; default: printk(nfsd: bad file type %o in nfsd_create\n, type); diff -puN net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create net/unix/af_unix.c --- lxc/net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:15.0 -0700 @@ -815,7 +815,11 @@ static int unix_bind(struct socket *sock */ mode = S_IFSOCK | (SOCK_INODE(sock)-i_mode ~current-fs-umask); + err = mnt_want_write(nd.mnt); + if (err) + goto out_mknod_dput; err = vfs_mknod(nd.dentry-d_inode, dentry, mode, 0); + mnt_drop_write(nd.mnt); if (err) goto out_mknod_dput; mutex_unlock(nd.dentry-d_inode-i_mutex); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 23/23] do_rmdir(): elevate write count
Elevate the write count during the vfs_rmdir() call. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c |5 + 1 file changed, 5 insertions(+) diff -puN fs/namei.c~do-rmdir-elevate-write-count fs/namei.c --- lxc/fs/namei.c~do-rmdir-elevate-write-count 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700 @@ -2115,7 +2115,12 @@ static long do_rmdir(int dfd, const char error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto exit2; + error = mnt_want_write(nd.mnt); + if (error) + goto exit3; error = vfs_rmdir(nd.dentry-d_inode, dentry); + mnt_drop_write(nd.mnt); +exit3: dput(dentry); exit2: mutex_unlock(nd.dentry-d_inode-i_mutex); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 22/23] elevate mnt writers for vfs_unlink() callers
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/namei.c |4 lxc-dave/ipc/mqueue.c |5 - 2 files changed, 8 insertions(+), 1 deletion(-) diff -puN fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c --- lxc/fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700 @@ -2195,7 +2195,11 @@ static long do_unlinkat(int dfd, const c inode = dentry-d_inode; if (inode) atomic_inc(inode-i_count); + error = mnt_want_write(nd.mnt); + if (error) + goto exit2; error = vfs_unlink(nd.dentry-d_inode, dentry); + mnt_drop_write(nd.mnt); exit2: dput(dentry); } diff -puN ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c --- lxc/ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/ipc/mqueue.c 2007-07-10 12:46:16.0 -0700 @@ -750,8 +750,11 @@ asmlinkage long sys_mq_unlink(const char inode = dentry-d_inode; if (inode) atomic_inc(inode-i_count); - + err = mnt_want_write(mqueue_mnt); + if (err) + goto out_err; err = vfs_unlink(dentry-d_parent-d_inode, dentry); + mnt_drop_write(mqueue_mnt); out_err: dput(dentry); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 19/23] elevate write count for do_utimes()
Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/utimes.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff -puN fs/utimes.c~elevate-write-count-for-do-utimes fs/utimes.c --- lxc/fs/utimes.c~elevate-write-count-for-do-utimes 2007-07-10 12:46:14.0 -0700 +++ lxc-dave/fs/utimes.c2007-07-10 12:46:14.0 -0700 @@ -2,6 +2,7 @@ #include linux/file.h #include linux/fs.h #include linux/linkage.h +#include linux/mount.h #include linux/namei.h #include linux/sched.h #include linux/stat.h @@ -75,8 +76,8 @@ long do_utimes(int dfd, char __user *fil inode = dentry-d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(nd.mnt); + if (error) goto dput_and_out; /* Don't worry, the checks are done in inode_change_ok() */ @@ -84,7 +85,7 @@ long do_utimes(int dfd, char __user *fil if (times) { error = -EPERM; if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) -goto dput_and_out; + goto mnt_drop_write_and_out; if (times[0].tv_nsec == UTIME_OMIT) newattrs.ia_valid = ~ATTR_ATIME; @@ -104,22 +105,24 @@ long do_utimes(int dfd, char __user *fil } else { error = -EACCES; if (IS_IMMUTABLE(inode)) -goto dput_and_out; + goto mnt_drop_write_and_out; if (current-fsuid != inode-i_uid) { if (f) { if (!(f-f_mode FMODE_WRITE)) - goto dput_and_out; + goto mnt_drop_write_and_out; } else { error = vfs_permission(nd, MAY_WRITE); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; } } } mutex_lock(inode-i_mutex); error = notify_change(dentry, newattrs); mutex_unlock(inode-i_mutex); +mnt_drop_write_and_out: + mnt_drop_write(nd.mnt); dput_and_out: if (f) fput(f); _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/23] elevate write count during entire ncp_ioctl()
Some ioctls need write access, but others don't. Make a helper function to decide when write access is needed, and take it. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/ncpfs/ioctl.c | 55 +- 1 file changed, 54 insertions(+), 1 deletion(-) diff -puN fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl fs/ncpfs/ioctl.c --- lxc/fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl 2007-07-10 12:46:08.0 -0700 +++ lxc-dave/fs/ncpfs/ioctl.c 2007-07-10 12:46:08.0 -0700 @@ -14,6 +14,7 @@ #include linux/ioctl.h #include linux/time.h #include linux/mm.h +#include linux/mount.h #include linux/highuid.h #include linux/smp_lock.h #include linux/vmalloc.h @@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv } #endif /* CONFIG_NCPFS_NLS */ -int ncp_ioctl(struct inode *inode, struct file *filp, +static int __ncp_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg) { struct ncp_server *server = NCP_SERVER(inode); @@ -822,6 +823,58 @@ outrel: return -EINVAL; } +static int ncp_ioctl_need_write(unsigned int cmd) +{ + switch (cmd) { + case NCP_IOC_GET_FS_INFO: + case NCP_IOC_GET_FS_INFO_V2: + case NCP_IOC_NCPREQUEST: + case NCP_IOC_SETDENTRYTTL: + case NCP_IOC_SIGN_INIT: + case NCP_IOC_LOCKUNLOCK: + case NCP_IOC_SET_SIGN_WANTED: + return 1; + case NCP_IOC_GETOBJECTNAME: + case NCP_IOC_SETOBJECTNAME: + case NCP_IOC_GETPRIVATEDATA: + case NCP_IOC_SETPRIVATEDATA: + case NCP_IOC_SETCHARSETS: + case NCP_IOC_GETCHARSETS: + case NCP_IOC_CONN_LOGGED_IN: + case NCP_IOC_GETDENTRYTTL: + case NCP_IOC_GETMOUNTUID2: + case NCP_IOC_SIGN_WANTED: + case NCP_IOC_GETROOT: + case NCP_IOC_SETROOT: + return 0; + default: + /* unkown IOCTL command, assume write */ + WARN_ON(1); + } + return 1; +} + +int ncp_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + int ret; + + if (ncp_ioctl_need_write(cmd)) { + /* +* inside the ioctl(), any failures which +* are because of file_permission() are +* -EACCESS, so it seems consistent to keep +* that here. +*/ + if (mnt_want_write(filp-f_vfsmnt)) + return -EACCES; + } + ret = __ncp_ioctl(inode, filp, cmd, arg); + if (ncp_ioctl_need_write(cmd)) + mnt_drop_write(filp-f_vfsmnt); + return ret; +} + #ifdef CONFIG_COMPAT long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/23] make access() use mnt check
It is OK to let access() go without using a mnt_want/drop_write() pair because it doesn't actually do writes to the filesystem, and it is inherently racy anyway. This is a rare case when it is OK to use __mnt_is_readonly() directly. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/open.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff -puN fs/open.c~make-access-use-helper fs/open.c --- lxc/fs/open.c~make-access-use-helper2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/open.c 2007-07-10 12:46:07.0 -0700 @@ -396,8 +396,17 @@ asmlinkage long sys_faccessat(int dfd, c if(res || !(mode S_IWOTH) || special_file(nd.dentry-d_inode-i_mode)) goto out_path_release; - - if(IS_RDONLY(nd.dentry-d_inode)) + /* +* This is a rare case where using __mnt_is_readonly() +* is OK without a mnt_want/drop_write() pair. Since +* no actual write to the fs is performed here, we do +* not need to telegraph to that to anyone. +* +* By doing this, we accept that this access is +* inherently racy and know that the fs may change +* state before we even see this result. +*/ + if (__mnt_is_readonly(nd.mnt)) res = -EROFS; out_path_release: _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/23] r/o bind mounts: elevate write count for some ioctls
Some ioctl()s can cause writes to the filesystem. Take these, and make them use mnt_want/drop_write() instead. We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled it out in February because nobody was using it, so I don't feel guilty for adding it back. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/ext2/ioctl.c | 46 +- lxc-dave/fs/ext3/ioctl.c | 100 +--- lxc-dave/fs/ext4/ioctl.c | 105 +- lxc-dave/fs/fat/file.c| 10 +-- lxc-dave/fs/hfsplus/ioctl.c | 39 +++- lxc-dave/fs/jfs/ioctl.c | 33 ++ lxc-dave/fs/ocfs2/ioctl.c | 11 +-- lxc-dave/fs/reiserfs/ioctl.c | 53 +++-- lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 +++- lxc-dave/fs/xfs/linux-2.6/xfs_iops.c |7 -- lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c |9 ++ 11 files changed, 272 insertions(+), 156 deletions(-) diff -puN fs/ext2/ioctl.c~ioctl-mnt-takers fs/ext2/ioctl.c --- lxc/fs/ext2/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 -0700 +++ lxc-dave/fs/ext2/ioctl.c2007-07-10 12:46:05.0 -0700 @@ -12,6 +12,7 @@ #include linux/time.h #include linux/sched.h #include linux/compat.h +#include linux/mount.h #include linux/smp_lock.h #include asm/current.h #include asm/uaccess.h @@ -22,6 +23,7 @@ int ext2_ioctl (struct inode * inode, st { struct ext2_inode_info *ei = EXT2_I(inode); unsigned int flags; + int ret; ext2_debug (cmd = %u, arg = %lu\n, cmd, arg); @@ -33,14 +35,19 @@ int ext2_ioctl (struct inode * inode, st case EXT2_IOC_SETFLAGS: { unsigned int oldflags; - if (IS_RDONLY(inode)) - return -EROFS; - - if ((current-fsuid != inode-i_uid) !capable(CAP_FOWNER)) - return -EACCES; + ret = mnt_want_write(filp-f_vfsmnt); + if (ret) + return ret; + + if ((current-fsuid != inode-i_uid) !capable(CAP_FOWNER)) { + ret = -EACCES; + goto setflags_out; + } - if (get_user(flags, (int __user *) arg)) - return -EFAULT; + if (get_user(flags, (int __user *) arg)) { + ret = -EFAULT; + goto setflags_out; + } if (!S_ISDIR(inode-i_mode)) flags = ~EXT2_DIRSYNC_FL; @@ -57,7 +64,8 @@ int ext2_ioctl (struct inode * inode, st if ((flags ^ oldflags) (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) { if (!capable(CAP_LINUX_IMMUTABLE)) { mutex_unlock(inode-i_mutex); - return -EPERM; + ret = -EPERM; + goto setflags_out; } } @@ -69,20 +77,26 @@ int ext2_ioctl (struct inode * inode, st ext2_set_inode_flags(inode); inode-i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp-f_vfsmnt); + return ret; } case EXT2_IOC_GETVERSION: return put_user(inode-i_generation, (int __user *) arg); case EXT2_IOC_SETVERSION: if ((current-fsuid != inode-i_uid) !capable(CAP_FOWNER)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - if (get_user(inode-i_generation, (int __user *) arg)) - return -EFAULT; - inode-i_ctime = CURRENT_TIME_SEC; - mark_inode_dirty(inode); - return 0; + ret = mnt_want_write(filp-f_vfsmnt); + if (ret) + return ret; + if (get_user(inode-i_generation, (int __user *) arg)) { + ret = -EFAULT; + } else { + inode-i_ctime = CURRENT_TIME_SEC; + mark_inode_dirty(inode); + } + mnt_drop_write(filp-f_vfsmnt); + return ret; default: return -ENOTTY; } diff -puN fs/ext3/ioctl.c~ioctl-mnt-takers fs/ext3/ioctl.c --- lxc/fs/ext3/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 -0700 +++ lxc-dave/fs/ext3/ioctl.c2007-07-10 12:46:05.0 -0700 @@ -12,6 +12,7 @@ #include linux/capability.h #include linux/ext3_fs.h #include linux/ext3_jbd.h +#include linux/mount.h #include linux/time.h #include linux/compat.h #include linux/smp_lock.h @@ -38,14 +39,19 @@ int ext3_ioctl (struct inode * inode, st unsigned int oldflags;
[PATCH 03/23] filesystem helpers for custom 'struct file's
Christoph H. says this stands on its own and can go in before the rest of the r/o bind mount set. --- Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/configfs/dir.c|5 +++-- lxc-dave/fs/file_table.c | 34 ++ lxc-dave/fs/hugetlbfs/inode.c | 22 +- lxc-dave/include/linux/file.h |9 + lxc-dave/ipc/shm.c| 13 + lxc-dave/mm/shmem.c |7 ++- lxc-dave/mm/tiny-shmem.c | 19 +++ lxc-dave/net/socket.c | 18 +- 8 files changed, 78 insertions(+), 49 deletions(-) diff -puN fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s fs/configfs/dir.c --- lxc/fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/configfs/dir.c 2007-07-10 12:46:03.0 -0700 @@ -142,7 +142,7 @@ static int init_dir(struct inode * inode return 0; } -static int init_file(struct inode * inode) +static int configfs_init_file(struct inode * inode) { inode-i_size = PAGE_SIZE; inode-i_fop = configfs_file_operations; @@ -283,7 +283,8 @@ static int configfs_attach_attr(struct c dentry-d_fsdata = configfs_get(sd); sd-s_dentry = dentry; - error = configfs_create(dentry, (attr-ca_mode S_IALLUGO) | S_IFREG, init_file); + error = configfs_create(dentry, (attr-ca_mode S_IALLUGO) | S_IFREG, + configfs_init_file); if (error) { configfs_put(sd); return error; diff -puN fs/file_table.c~filesystem-helpers-for-custom-struct-file-s fs/file_table.c --- lxc/fs/file_table.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/file_table.c2007-07-10 12:46:03.0 -0700 @@ -138,6 +138,40 @@ fail: EXPORT_SYMBOL(get_empty_filp); +struct file *alloc_file(struct vfsmount *mnt, struct dentry *dentry, + mode_t mode, const struct file_operations *fop) +{ + struct file *file; + struct path; + + file = get_empty_filp(); + if (!file) + return NULL; + + init_file(file, mnt, dentry, mode, fop); + return file; +} +EXPORT_SYMBOL(alloc_file); + +/* + * Note: This is a crappy interface. It is here to make + * merging with the existing users of get_empty_filp() + * who have complex failure logic easier. All users + * of this should be moving to alloc_file(). + */ +int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry, + mode_t mode, const struct file_operations *fop) +{ + int error = 0; + file-f_path.dentry = dentry; + file-f_path.mnt = mntget(mnt); + file-f_mapping = dentry-d_inode-i_mapping; + file-f_mode = mode; + file-f_op = fop; + return error; +} +EXPORT_SYMBOL(init_file); + void fastcall fput(struct file *file) { if (atomic_dec_and_test(file-f_count)) diff -puN fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s fs/hugetlbfs/inode.c --- lxc/fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/hugetlbfs/inode.c 2007-07-10 12:46:03.0 -0700 @@ -761,16 +761,11 @@ struct file *hugetlb_file_setup(const ch if (!dentry) goto out_shm_unlock; - error = -ENFILE; - file = get_empty_filp(); - if (!file) - goto out_dentry; - error = -ENOSPC; inode = hugetlbfs_get_inode(root-d_sb, current-fsuid, current-fsgid, S_IFREG | S_IRWXUGO, 0); if (!inode) - goto out_file; + goto out_dentry; error = -ENOMEM; if (hugetlb_reserve_pages(inode, 0, size HPAGE_SHIFT)) @@ -779,17 +774,18 @@ struct file *hugetlb_file_setup(const ch d_instantiate(dentry, inode); inode-i_size = size; inode-i_nlink = 0; - file-f_path.mnt = mntget(hugetlbfs_vfsmount); - file-f_path.dentry = dentry; - file-f_mapping = inode-i_mapping; - file-f_op = hugetlbfs_file_operations; - file-f_mode = FMODE_WRITE | FMODE_READ; + + error = -ENFILE; + file = alloc_file(hugetlbfs_vfsmount, dentry, + FMODE_WRITE | FMODE_READ, + hugetlbfs_file_operations); + if (!file) + goto out_inode; + return file; out_inode: iput(inode); -out_file: - put_filp(file); out_dentry: dput(dentry); out_shm_unlock: diff -puN
[PATCH 05/23] elevate write count open()'d files
This is the first really tricky patch in the series. It elevates the writer count on a mount each time a non-special file is opened for write. This is not completely apparent in the patch because the two if() conditions in may_open() above the mnt_want_write() call are, combined, equivalent to special_file(). There is also an elevated count around the vfs_create() call in open_namei(). The count needs to be kept elevated all the way into the may_open() call. Otherwise, when the write is dropped, a ro-rw transisition could occur. This would lead to having rw access on the newly created file, while the vfsmount is ro. That is bad. Some filesystems forego the use of normal vfs calls to create struct files. Make sure that these users elevate the mnt writer count because they will get __fput(), and we need to make sure they're balanced. Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- lxc-dave/fs/file_table.c |9 - lxc-dave/fs/namei.c | 20 lxc-dave/ipc/mqueue.c|3 +++ 3 files changed, 27 insertions(+), 5 deletions(-) diff -puN fs/file_table.c~tricky-elevate-write-count-files-are-open-ed fs/file_table.c --- lxc/fs/file_table.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/file_table.c2007-07-10 12:46:04.0 -0700 @@ -168,6 +168,10 @@ int init_file(struct file *file, struct file-f_mapping = dentry-d_inode-i_mapping; file-f_mode = mode; file-f_op = fop; + if (mode FMODE_WRITE) { + error = mnt_want_write(mnt); + WARN_ON(error); + } return error; } EXPORT_SYMBOL(init_file); @@ -205,8 +209,11 @@ void fastcall __fput(struct file *file) if (unlikely(S_ISCHR(inode-i_mode) inode-i_cdev != NULL)) cdev_put(inode-i_cdev); fops_put(file-f_op); - if (file-f_mode FMODE_WRITE) + if (file-f_mode FMODE_WRITE) { put_write_access(inode); + if (!special_file(inode-i_mode)) + mnt_drop_write(mnt); + } put_pid(file-f_owner.pid); file_kill(file); file-f_path.dentry = NULL; diff -puN fs/namei.c~tricky-elevate-write-count-files-are-open-ed fs/namei.c --- lxc/fs/namei.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:04.0 -0700 @@ -1562,8 +1562,15 @@ int may_open(struct nameidata *nd, int a return -EACCES; flag = ~O_TRUNC; - } else if (IS_RDONLY(inode) (flag FMODE_WRITE)) - return -EROFS; + } else if (flag FMODE_WRITE) { + /* +* effectively: !special_file() +* balanced by __fput() +*/ + error = mnt_want_write(nd-mnt); + if (error) + return error; + } error = vfs_permission(nd, acc_mode); if (error) @@ -1706,14 +1713,17 @@ do_last: } if (IS_ERR(nd-intent.open.file)) { - mutex_unlock(dir-d_inode-i_mutex); error = PTR_ERR(nd-intent.open.file); - goto exit_dput; + goto exit_mutex_unlock; } /* Negative dentry, just create the file */ if (!path.dentry-d_inode) { + error = mnt_want_write(nd-mnt); + if (error) + goto exit_mutex_unlock; error = open_namei_create(nd, path, flag, mode); + mnt_drop_write(nd-mnt); if (error) goto exit; return 0; @@ -1751,6 +1761,8 @@ ok: goto exit; return 0; +exit_mutex_unlock: + mutex_unlock(dir-d_inode-i_mutex); exit_dput: dput_path(path, nd); exit: diff -puN ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed ipc/mqueue.c --- lxc/ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/ipc/mqueue.c 2007-07-10 12:46:04.0 -0700 @@ -686,6 +686,9 @@ asmlinkage long sys_mq_open(const char _ goto out; filp = do_open(dentry, oflag); } else { + error = mnt_want_write(mqueue_mnt); + if (error) + goto out; filp = do_create(mqueue_mnt-mnt_root, dentry, oflag, mode, u_attr); } _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) Okay. - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing procedure. It allocates a new index block, moves the top-level data of the tree(root or leaf blocks in i_data) into the new block, initializes the new root, creating index that points to the just created index block The original top-level data (in i_data) could be extent tree root (index block) or extents (leaf block). The current code (without the patch) treats the top-level data always be the leaf block, which is incorrect. assumes when the tree is growing the extent structure pass in is always @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: Andrew Morton wrote: The fault-vs-invalidate race fix. I have belatedly learned that these need more work, so their state is uncertain. The more work may turn out being too much for you (although it is nothing exactly tricky that would introduce subtle bugs, it is a fair amont of churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernelm=117426058311032w=2 http://marc.info/?l=linux-kernelm=11742607036w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because -fault was supposed to supercede this interface. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
David Chinner wrote: On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: Andrew Morton wrote: The fault-vs-invalidate race fix. I have belatedly learned that these need more work, so their state is uncertain. The more work may turn out being too much for you (although it is nothing exactly tricky that would introduce subtle bugs, it is a fair amont of churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernelm=117426058311032w=2 http://marc.info/?l=linux-kernelm=11742607036w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because -fault was supposed to supercede this interface. Yeah, as I've said, don't hold them back because of me. They are relatively simple enough that I don't see why they couldn't be merged in this window. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1 of 2] block_page_mkwrite V2
Generic page_mkwrite functionality. Filesystems that make use of the VM -page_mkwrite() callout will generally use the same core code to implement it. There are several tricky truncate-related issues that we need to deal with here as we cannot take the i_mutex as we normally would for these paths. These issues are not documented anywhere yet so block_page_mkwrite() seems like the best place to start. Version 2: - read inode size only once - more comments explaining implementation restrictions Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- fs/buffer.c | 47 include/linux/buffer_head.h |2 + 2 files changed, 49 insertions(+) Index: 2.6.x-xfs-new/fs/buffer.c === --- 2.6.x-xfs-new.orig/fs/buffer.c 2007-03-17 10:55:32.291414968 +1100 +++ 2.6.x-xfs-new/fs/buffer.c 2007-03-19 08:13:54.519909087 +1100 @@ -2194,6 +2194,52 @@ int generic_commit_write(struct file *fi return 0; } +/* + * block_page_mkwrite() is not allowed to change the file size as it gets + * called from a page fault handler when a page is first dirtied. Hence we must + * be careful to check for EOF conditions here. We set the page up correctly + * for a written page which means we get ENOSPC checking when writing into + * holes and correct delalloc and unwritten extent mapping on filesystems that + * support these features. + * + * We are not allowed to take the i_mutex here so we have to play games to + * protect against truncate races as the page could now be beyond EOF. Because + * vmtruncate() writes the inode size before removing pages, once we have the + * page lock we can determine safely if the page is beyond EOF. If it is not + * beyond EOF, then the page is guaranteed safe against truncation until we + * unlock the page. + */ +int +block_page_mkwrite(struct vm_area_struct *vma, struct page *page, + get_block_t get_block) +{ + struct inode *inode = vma-vm_file-f_path.dentry-d_inode; + unsigned long end; + loff_t size; + int ret = -EINVAL; + + lock_page(page); + size = i_size_read(inode); + if ((page-mapping != inode-i_mapping) || + ((page-index PAGE_CACHE_SHIFT) size)) { + /* page got truncated out from underneath us */ + goto out_unlock; + } + + /* page is wholly or partially inside EOF */ + if (((page-index + 1) PAGE_CACHE_SHIFT) size) + end = size ~PAGE_CACHE_MASK; + else + end = PAGE_CACHE_SIZE; + + ret = block_prepare_write(page, 0, end, get_block); + if (!ret) + ret = block_commit_write(page, 0, end); + +out_unlock: + unlock_page(page); + return ret; +} /* * nobh_prepare_write()'s prereads are special: the buffer_heads are freed @@ -2997,6 +3043,7 @@ EXPORT_SYMBOL(__brelse); EXPORT_SYMBOL(__wait_on_buffer); EXPORT_SYMBOL(block_commit_write); EXPORT_SYMBOL(block_prepare_write); +EXPORT_SYMBOL(block_page_mkwrite); EXPORT_SYMBOL(block_read_full_page); EXPORT_SYMBOL(block_sync_page); EXPORT_SYMBOL(block_truncate_page); Index: 2.6.x-xfs-new/include/linux/buffer_head.h === --- 2.6.x-xfs-new.orig/include/linux/buffer_head.h 2007-03-17 10:55:32.135435539 +1100 +++ 2.6.x-xfs-new/include/linux/buffer_head.h 2007-03-17 10:55:32.567378573 +1100 @@ -206,6 +206,8 @@ int cont_prepare_write(struct page*, uns int generic_cont_expand(struct inode *inode, loff_t size); int generic_cont_expand_simple(struct inode *inode, loff_t size); int block_commit_write(struct page *page, unsigned from, unsigned to); +int block_page_mkwrite(struct vm_area_struct *vma, struct page *page, + get_block_t get_block); void block_sync_page(struct page *); sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *); int generic_commit_write(struct file *, struct page *, unsigned, unsigned); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2 of 2] Make XFS use block_page_mkwrite
Implement -page_mkwrite in XFS. Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- fs/xfs/linux-2.6/xfs_file.c | 16 1 file changed, 16 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c 2007-02-07 23:00:10.0 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c 2007-02-07 23:15:20.170880823 +1100 @@ -446,6 +446,20 @@ xfs_file_open_exec( } #endif /* HAVE_FOP_OPEN_EXEC */ +/* + * mmap()d file has taken write protection fault and is being made + * writable. We can set the page state up correctly for a writable + * page, which means we can do correct delalloc accounting (ENOSPC + * checking!) and unwritten extent mapping. + */ +STATIC int +xfs_vm_page_mkwrite( + struct vm_area_struct *vma, + struct page *page) +{ + return block_page_mkwrite(vma, page, xfs_get_blocks); +} + const struct file_operations xfs_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -503,12 +517,14 @@ const struct file_operations xfs_dir_fil static struct vm_operations_struct xfs_file_vm_ops = { .nopage = filemap_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, }; #ifdef HAVE_DMAPI static struct vm_operations_struct xfs_dmapi_file_vm_ops = { .nopage = xfs_vm_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, #ifdef HAVE_VMOP_MPROTECT .mprotect = xfs_vm_mprotect, #endif - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 11, 2007 16:04 -0400, J. Bruce Fields wrote: A 32-bit i_version could in theory wrap pretty quickly, couldn't it? That's not a problem in itself--the problem would only arise if two subsequent client queries of the change attribute happened a multiple of 2^32 i_version increments apart. This is more likely than the previous scenario, but still very unlikely. I would have guessed that even in situations with a very high rate of updates and a low rate of client revalidations, the chance of two revalidations happening exactly 2^32 updates apart would still be no more than 1 in 2^32. (Could odd characteristics of the workloads (like updates that tend to happen in power-of-2 groups?) make it any more likely?) I'd be happier if ext4 at least allowed the possibility of 64 bits in the future. And there's always the chance someone would find a use for an i_version that was nondecreasing, even if nfs didn't care. This would indeed be the case for ext3 filesystems updated to ext4. They will only be able to store the low 32 bits of the version, which is itself normally enough for NFSv4 because it only uses the inequality check. Having the full 64 bits available eliminates the risk of collisions, and given that the spec mandates a 64-bit version I'm sure someone will take full advantage of it in NFS at some point. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH, RESEND] Teach do_mpage_readpage() about unwritten buffers
Teach do_mpage_readpage() about unwritten extents so we can always map them in get_blocks rather than they are are holes on read. Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- fs/mpage.c |5 +++-- fs/xfs/linux-2.6/xfs_aops.c | 13 +++-- 2 files changed, 6 insertions(+), 12 deletions(-) Index: 2.6.x-xfs-new/fs/mpage.c === --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000 +++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000 @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc * Map blocks using the result from the previous get_blocks call first. */ nblocks = map_bh-b_size blkbits; - if (buffer_mapped(map_bh) block_in_file *first_logical_block + if (buffer_mapped(map_bh) !buffer_unwritten(map_bh) + block_in_file *first_logical_block block_in_file (*first_logical_block + nblocks)) { unsigned map_offset = block_in_file - *first_logical_block; unsigned last = nblocks - map_offset; @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc *first_logical_block = block_in_file; } - if (!buffer_mapped(map_bh)) { + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-06-05 22:14:39.0 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-06-27 22:39:29.545636749 +1000 @@ -1340,16 +1340,9 @@ __xfs_get_blocks( return 0; if (iomap.iomap_bn != IOMAP_DADDR_NULL) { - /* -* For unwritten extents do not report a disk address on -* the read case (treat as if we're reading into a hole). -*/ - if (create || !(iomap.iomap_flags IOMAP_UNWRITTEN)) { - xfs_map_buffer(bh_result, iomap, offset, - inode-i_blkbits); - } - if (create (iomap.iomap_flags IOMAP_UNWRITTEN)) { - if (direct) + xfs_map_buffer(bh_result, iomap, offset, inode-i_blkbits); + if (iomap.iomap_flags IOMAP_UNWRITTEN) { + if (create direct) bh_result-b_private = inode; set_buffer_unwritten(bh_result); } - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html