Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Fri, 2007-07-13 at 09:59 +0530, Aneesh Kumar K.V wrote:
> Kalpak Shah wrote:
> > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
> > > > This patch is a spinoff of the old nanosecond patches.
> > >
> > > I don't know what the old nanosecond patches are. A link to a
> > > suitable changelog for those patches would do in a pinch. Preferable
> > > would be to write a proper changelog for this patch.
> >
> > The incremental patch contains a proper changelog describing the patch.
>
> Instead of putting incremental patches it would be nice if we can have
> replacement patches for the already existing patches, with the comments
> addressed. For example, if we have a review comment on the patch message
> (commit log), then adding an incremental patch doesn't help.

I think that it would be easier to review just the changes that have been
made to the patches instead of having people go through the entire patch
again. I was hoping that someone with write access to ext4-git would update
the commit logs. If replacement patches are preferred, then I will send
them again.

Thanks,
Kalpak.

> -aneesh

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
>  /*
> + * sys_fallocate - preallocate blocks or free preallocated blocks
> + * @fd: the file descriptor
> + * @mode: mode specifies the behavior of allocation.
> + * @offset: The offset within file, from where allocation is being
> + *	requested. It should not have a negative value.
> + * @len: The amount of space in bytes to be allocated, from the offset.
> + *	This can not be zero or a negative value.

kerneldoc comments are for in-kernel APIs, which syscalls aren't. I'd say
just remove this comment; the manpage is much better documentation anyway.

> + * TBD Generic fallocate to be added for file systems that do not
> + *	support fallocate.

Please remove the comment. Adding a generic fallback in kernelspace is a
very dumb idea, as we already discussed a long time ago.

> --- linux-2.6.22.orig/include/linux/fs.h
> +++ linux-2.6.22/include/linux/fs.h
> @@ -266,6 +266,21 @@ extern int dir_notify_enable;
>  #define SYNC_FILE_RANGE_WRITE		2
>  #define SYNC_FILE_RANGE_WAIT_AFTER	4
>
> +/*
> + * sys_fallocate modes
> + * Currently sys_fallocate supports two modes:
> + * FALLOC_ALLOCATE :	This is the preallocate mode, using which an application
> + *			may request reservation of space for a particular file.
> + *			The file size will be changed if the allocation is
> + *			beyond EOF.
> + * FALLOC_RESV_SPACE :	This is same as the above mode, with only one difference
> + *			that the file size will not be modified.
> + */
> +#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend/shrink size */
> +
> +#define FALLOC_ALLOCATE	0
> +#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE

Just remove FALLOC_ALLOCATE; 0 flags should be the default. I'm also not
sure there is any point in having two namespaces now that we have a
flags-based ABI.

Also, please don't add this to fs.h. fs.h is a complete mess and the falloc
flags are a new user ABI. Add a linux/falloc.h instead, which can be added
to headers-y so the ABI constant can be exported to userspace.

[PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment
From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger.
This patch changes the EXT_MAX_LEN value and the extent code which
marks/checks uninitialized extents. With this change it will be possible to
have initialized extents with 2^15 blocks (earlier the max blocks we could
have was 2^15 - 1). This way we can have better extent-to-group alignment.
Now, the maximum number of blocks we can have in an initialized extent is
2^15 and in an uninitialized extent is 2^15 - 1.

This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	unsigned short ext1_ee_len, ext2_ee_len;
+	unsigned short ext1_ee_len, ext2_ee_len, max_len;

 	/*
 	 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode
 	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
 		return 0;

+	if (ext4_ext_is_uninitialized(ex1))
+		max_len = EXT_UNINIT_MAX_LEN;
+	else
+		max_len = EXT_INIT_MAX_LEN;
+
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
 	ext2_ee_len = ext4_ext_get_actual_len(ex2);

@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > max_len)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc

 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
-		if (uninitialized)
+		/*
+		 * Do not mark uninitialized if all the blocks in the
+		 * extent have been removed.
+		 */
+		if (uninitialized && num)
 			ext4_ext_mark_uninitialized(ex);

 		err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,19 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);

+	/*
+	 * See if request is beyond maximum number of blocks we can have in
+	 * a single extent. For an initialized extent this limit is
+	 * EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+	 * EXT_UNINIT_MAX_LEN.
+	 */
+	if (max_blocks > EXT_INIT_MAX_LEN &&
+	    create != EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_INIT_MAX_LEN;
+	else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+		 create == EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_UNINIT_MAX_LEN;
+
 	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
 	newex.ee_block = cpu_to_le32(iblock);
 	newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru

 #define EXT_MAX_BLOCK	0x

-#define EXT_MAX_LEN	((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN	(1UL << 15)
+#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)

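The ee_len encoding described in the header comment of this patch can be illustrated with a small userspace sketch. The two limit macros are taken from the patch; the sketch_* helpers are hypothetical names made up for this illustration (the kernel's own helpers are ext4_ext_is_uninitialized() and ext4_ext_get_actual_len(), whose bodies are not shown in this posting), and the sketch assumes ee_len has already been converted to CPU byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Limits as defined in the patch. */
#define EXT_INIT_MAX_LEN	(1UL << 15)
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)

/*
 * Hypothetical helper: nonzero if a CPU-endian ee_len value denotes an
 * uninitialized (preallocated) extent. 0x8000 itself is the special case
 * that still means "initialized, 32768 blocks".
 */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return ee_len > EXT_INIT_MAX_LEN;
}

/* Hypothetical helper: number of blocks the extent really covers. */
static unsigned int sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= EXT_INIT_MAX_LEN ? ee_len : ee_len - EXT_INIT_MAX_LEN;
}

/* Hypothetical helper: encode an uninitialized extent of 'len' blocks. */
static uint16_t sketch_mark_uninitialized(uint16_t len)
{
	/* Zero-length uninitialized extents cannot be represented. */
	assert(len >= 1 && len <= EXT_UNINIT_MAX_LEN);
	return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

With this encoding, 0x8000 decodes as an initialized extent of 32768 blocks, while 0x8001 decodes as an uninitialized extent of one block, which is why the uninitialized limit is one block smaller.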
[PATCH 5/6][TAKE7] ext4: write support for preallocated blocks
From: Amit Arora [EMAIL PROTECTED]

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode
 }

 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1374,7 @@ has_space:

 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-				+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-				* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);

 	/* try to merge extents to the left */

@@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block
 #endif
 }

+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex =

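The three cases listed in the comment above ext4_ext_convert_to_initialized() can be sketched as a plain range calculation. This is a simplified userspace illustration, not the code from the patch (whose body is cut off in this posting): sketch_split_count() is a hypothetical name, and it assumes the written range lies entirely inside the uninitialized extent.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many extents result when the blocks
 * [wr_start, wr_start + wr_len) of an uninitialized extent covering
 * [ex_start, ex_start + ex_len) are written.
 *
 *   1 -> case a: the whole extent becomes initialized
 *   2 -> case b: the write touches one end of the extent
 *   3 -> case c: the write is in the middle of the extent
 */
static int sketch_split_count(unsigned long ex_start, unsigned long ex_len,
			      unsigned long wr_start, unsigned long wr_len)
{
	unsigned long ex_end = ex_start + ex_len;
	unsigned long wr_end = wr_start + wr_len;
	int pieces = 1;		/* the newly initialized part always exists */

	/* Assumption of this sketch: the write stays inside the extent. */
	assert(wr_len > 0 && wr_start >= ex_start && wr_end <= ex_end);

	if (wr_start > ex_start)
		pieces++;	/* uninitialized remainder on the left */
	if (wr_end < ex_end)
		pieces++;	/* uninitialized remainder on the right */
	return pieces;
}
```

For an extent covering blocks 100..109, writing all ten blocks gives one extent, writing blocks 100..103 or 106..109 gives two, and writing blocks 103..106 gives three.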
[PATCH 4/6][TAKE7] ext4: fallocate support in ext4
From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation. The current implementation only supports
preallocation for regular files (directories are not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only the FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported as of now.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug(" %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug(" []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in

 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode,
 	ext_debug(" -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));

 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;

@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif

-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;

 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;

-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+

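The merge test that this patch extends checks four things in order: both extents are in the same initialization state, they are logically adjacent, the combined length still fits, and they are physically adjacent. Below is a rough userspace sketch of that sequence. The struct and the sketch_* names are hypothetical simplifications (the real code works on little-endian on-disk fields through helpers such as ext_pblock()), and the AGGRESSIVE_TEST branch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, already-decoded view of an extent (not the on-disk layout). */
struct sketch_extent {
	uint32_t lblock;	/* first logical block */
	uint64_t pblock;	/* first physical block */
	uint16_t len;		/* actual length in blocks */
	int uninit;		/* nonzero if preallocated but not yet written */
};

/* EXT_MAX_LEN as it stands at this point of the series: 2^15 - 1. */
#define SKETCH_MAX_LEN	((1U << 15) - 1)

/* Returns 1 if 'b' can be appended to 'a', 0 otherwise. */
static int sketch_can_merge(const struct sketch_extent *a,
			    const struct sketch_extent *b)
{
	/* Either both extents are uninitialized, or both are not. */
	if (!a->uninit != !b->uninit)
		return 0;
	/* 'b' must start at the logical block right after 'a'. */
	if (a->lblock + a->len != b->lblock)
		return 0;
	/* The merged length must not spill into the top bit of ee_len. */
	if ((unsigned int)a->len + b->len > SKETCH_MAX_LEN)
		return 0;
	/* Finally, the physical blocks must be contiguous as well. */
	return a->pblock + a->len == b->pblock;
}
```

Two adjacent extents of 8 blocks each merge only if they agree on the uninitialized state, so writing into half of a preallocated range keeps the written and unwritten halves as separate extents.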
[PATCH 3/6][TAKE7] revalidate write permissions for fallocate
From: David P. Quigley [EMAIL PROTECTED]

Revalidate the write permissions for fallocate(2), in case security policy
has changed since the files were opened.

Acked-by: James Morris [EMAIL PROTECTED]
Signed-off-by: David P. Quigley [EMAIL PROTECTED]

---
 fs/open.c |    3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
 		goto out;
 	if (!(file->f_mode & FMODE_WRITE))
 		goto out_fput;
+	ret = security_file_permission(file, MAY_WRITE);
+	if (ret)
+		goto out_fput;

 	inode = file->f_path.dentry->d_inode;

[PATCH 1/6][TAKE7] manpage for fallocate
Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

This includes changes suggested by Heikki Orsila and Barry Naujok.

.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include <fcntl.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len);
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are two modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space. If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe of file descriptor.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR syscall (2),
.BR posix_fadvise (3),
.BR ftruncate (3).

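The file-size rule that separates the two modes in this manpage can be written down as a small pure function. This is only a sketch of the documented behaviour, not kernel code: the SKETCH_* constants mirror the values used in this patch series (FALLOC_ALLOCATE = 0, FALLOC_RESV_SPACE = FALLOC_FL_KEEP_SIZE = 0x01), and sketch_size_after() is a hypothetical name.

```c
#include <assert.h>

/* Mode values as proposed in this patch series (they were still under review). */
#define SKETCH_FALLOC_ALLOCATE		0	/* may extend the file size */
#define SKETCH_FALLOC_RESV_SPACE	0x01	/* never changes the file size */

/*
 * Hypothetical helper: file size after a successful call, following the
 * DESCRIPTION section above. With FALLOC_ALLOCATE the size grows to
 * offset + len if it was smaller; with FALLOC_RESV_SPACE it is untouched.
 */
static long long sketch_size_after(long long old_size, int mode,
				   long long offset, long long len)
{
	/* The manpage lists offset < 0 or len <= 0 as EINVAL. */
	assert(offset >= 0 && len > 0);

	if (mode == SKETCH_FALLOC_ALLOCATE && old_size < offset + len)
		return offset + len;
	return old_size;
}
```

For a 4096-byte file and a request of 8192 bytes at offset 0, the allocate mode leaves the file at 8192 bytes, while the reserve mode leaves it at 4096 bytes with the extra blocks held beyond end of file.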
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
> Except lockdep doesn't know about journal_start(), which has ranking
> requirements similar to a semaphore.

Something like so?

Or can journal_stop() be done by a different task than the one that did
journal_start()? - in which case nothing much can be done :-/

This seems to boot... albeit I did not push it hard.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/jbd/transaction.c |    9 +++++++++
 include/linux/jbd.h  |    5 +++++
 2 files changed, 14 insertions(+)

Index: linux-2.6/fs/jbd/transaction.c
===
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -233,6 +233,8 @@ out:
 	return ret;
 }

+static struct lock_class_key jbd_handle_key;
+
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
@@ -243,6 +245,8 @@ static handle_t *new_handle(int nblocks)
 	handle->h_buffer_credits = nblocks;
 	handle->h_ref = 1;

+	lockdep_init_map(&handle->h_lockdep_map, "jbd_handle", &jbd_handle_key, 0);
+
 	return handle;
 }

@@ -286,6 +290,9 @@ handle_t *journal_start(journal_t *journ
 		current->journal_info = NULL;
 		handle = ERR_PTR(err);
 	}
+
+	lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
+
 	return handle;
 }

@@ -1411,6 +1418,8 @@ int journal_stop(handle_t *handle)
 		spin_unlock(&journal->j_state_lock);
 	}

+	lock_release(&handle->h_lockdep_map, 1, _THIS_IP_);
+
 	jbd_free_handle(handle);
 	return err;
 }
Index: linux-2.6/include/linux/jbd.h
===
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -30,6 +30,7 @@
 #include <linux/bit_spinlock.h>
 #include <linux/mutex.h>
 #include <linux/timer.h>
+#include <linux/lockdep.h>

 #include <asm/semaphore.h>
 #endif
@@ -405,6 +406,10 @@ struct handle_s
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
 	unsigned int	h_aborted:	1;	/* fatal error on handle */
+
+#ifdef CONFIG_LOCKDEP
+	struct lockdep_map	h_lockdep_map;
+#endif
 };

[PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain level
and thus get faster access speed. With preallocation, applications also get
a guarantee of space for particular file(s), even if the system later
becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to each
block that has to be preallocated). Without a doubt, file systems can do
this more efficiently within the kernel, by implementing the proposed
fallocate() system call. It is expected that posix_fallocate() will be
modified to call this new system call first and, in case the
kernel/filesystem does not implement it, fall back to the current
implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }

+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif

 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies the behavior of allocation.
+ * @offset: The offset within file, from where allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount of space in bytes to be allocated, from the offset.
+ *	 This can not be zero or a negative value.
+ *
+ * This system call preallocates space for a file. The range of blocks
+ * allocated depends on the value of offset and len arguments provided
+ * by the user/application. With FALLOC_ALLOCATE or FALLOC_RESV_SPACE
+ * modes, if the system call succeeds, subsequent writes to the file in
+ * the given range (specified by offset & len) should not fail - even if
+ * the file system later becomes full. Hence the preallocation done is
+ * persistent (valid even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by
+ * the individual file systems will update the file size and/or
+ * ctime/mtime depending on the mode and also on the success of the
+ * operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On

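The 32-bit compat wrapper in this patch receives each 64-bit argument as two 32-bit halves and joins them again before calling sys_fallocate(). The sketch below shows that joining step in portable userspace C; sketch_join_halves() is a hypothetical name, and it widens through uint64_t first, whereas the patch casts directly to loff_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit offset or length from the two
 * 32-bit halves that a 32-bit caller passes in separate registers.
 * The high half must be widened *before* shifting; shifting a 32-bit
 * value by 32 bits is undefined behaviour in C.
 */
static int64_t sketch_join_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

A request for an offset of 4 GiB therefore arrives as hi = 1, lo = 0, and a plain 4096-byte offset as hi = 0, lo = 4096.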
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On 7/13/07, Kalpak Shah [EMAIL PROTECTED] wrote:
> > EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.
>
> #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)

[snip]

> Sorry, I didn't understand what is the problem with this macro?

The expression represented by 'dir' is evaluated twice (think dir++
here). It's safer to make it a static inline function.

                        Pekka

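The double-evaluation problem Pekka describes is easy to reproduce outside the kernel. The sketch below does not use the real EXT4_DIR_LINK_MAX() or is_dx(); all sketch_* and SKETCH_* names are hypothetical stand-ins, chosen so that the macro mentions its argument twice just like the original does.

```c
#include <assert.h>

static int sketch_calls;	/* counts how often the argument expression runs */

/* Stand-in for an argument expression with a side effect (think dir++). */
static int sketch_next_nlink(void)
{
	sketch_calls++;
	return 40000;
}

#define SKETCH_LINK_MAX	32000

/* Macro form: the argument text appears twice, so it may run twice. */
#define SKETCH_OVER_LIMIT_MACRO(n)	((n) > 0 && (n) >= SKETCH_LINK_MAX)

/* Inline function form: the argument is evaluated once, at the call site. */
static inline int sketch_over_limit(int n)
{
	return n > 0 && n >= SKETCH_LINK_MAX;
}
```

Passing sketch_next_nlink() to the macro bumps the counter twice, while passing it to the inline function bumps it once; both forms return the same result, which is what makes this kind of bug easy to miss.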
[PATCH 0/6][TAKE7] fallocate system call
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer following post:
   http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).

Todos:
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed by
   platform maintainers when fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.

Following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The
__NR_fallocate will need to be changed depending on the system call
number in the kernel (it may get changed due to merge) and also
depending on the architecture.

--
Regards,
Amit Arora

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>

#define VERBOSE 0
#define __NR_fallocate	324

#define FALLOC_FL_KEEP_SIZE	0x01
#define FALLOC_ALLOCATE		0x0
#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE

int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
	int ret;

	if (VERBOSE)
		printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n",
			offset, len);
	ret = syscall(__NR_fallocate, fd, mode, offset, len);

	if (ret < 0) {
		printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
		close(fd);
		return(1);
	}

	if (VERBOSE)
		printf("fallocate system call succeeded! ret=%d\n", ret);

	return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
	int ret, blocks;
	struct stat statbuf1, statbuf2;

	fstat(fd, &statbuf1);

	ret = do_fallocate(fd, mode, offset, len);

	fstat(fd, &statbuf2);

	/* check file size after preallocation */
	if (mode == FALLOC_ALLOCATE) {
		if (!ret && statbuf1.st_size < (offset + len) &&
		    statbuf2.st_size != (offset + len)) {
			printf("Error: fallocate succeeded, but the file size did not "
				"change, where it should have!\n");
			ret = 1;
		}
	} else if (statbuf1.st_size != statbuf2.st_size) {
		printf("Error : File size changed, when it should not have!\n");
		ret = 1;
	}

	blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512)/ statbuf2.st_blksize;

	/* Print report */
	printf("# FALLOCATE TEST REPORT #\n");
	printf("\tNew blocks preallocated = %d.\n", blocks);
	printf("\tNumber of bytes preallocated = %d\n", blocks * statbuf2.st_blksize);
	printf("\tOld file size = %d, New file size %d.\n",
		statbuf1.st_size, statbuf2.st_size);
	printf("\tOld num blocks = %d, New num blocks %d.\n",
		(statbuf1.st_blocks * 512)/1024, (statbuf2.st_blocks * 512)/1024);

	return ret;
}

int do_write(int fd, loff_t offset, loff_t len)
{
	int ret;
	char *buf;

	buf = (char *)malloc(len);
	if (!buf) {
		printf("error: malloc failed.\n");
		return(-1);
	}

	if (VERBOSE)
		printf("Trying to write to file (offset=%llu, len=%llu)\n",
			offset, len);

	ret = lseek(fd, offset, SEEK_SET);
	if (ret != offset) {
		printf("lseek() failed error=%d, ret=%d\n", errno, ret);
		close(fd);
		return(-1);
	}

	ret = write(fd, buf, len);
	if (ret != len) {
		printf("write() failed error=%d, ret=%d\n", errno, ret);
		close(fd);
		return(-1);
	}

	if (VERBOSE)
		printf("Write succeeded! Written %llu bytes ret=%d\n", len, ret);

	return ret;
}

int test_write(int fd, loff_t offset, loff_t len)
{
	int ret;

	ret = do_write(fd, offset, len);
	printf("#

Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> From: David P. Quigley [EMAIL PROTECTED]
>
> Revalidate the write permissions for fallocate(2), in case security policy
> has changed since the files were opened.
>
> Acked-by: James Morris [EMAIL PROTECTED]
> Signed-off-by: David P. Quigley [EMAIL PROTECTED]

This should be merged into the main falloc patch.

Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
On Fri, Jul 13, 2007 at 02:21:19PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
> > /*
> > + * sys_fallocate - preallocate blocks or free preallocated blocks
> > + * @fd: the file descriptor
> > + * @mode: mode specifies the behavior of allocation.
> > + * @offset: The offset within file, from where allocation is being
> > + *          requested. It should not have a negative value.
> > + * @len: The amount of space in bytes to be allocated, from the offset.
> > + *       This can not be zero or a negative value.
>
> kerneldoc comments are for in-kernel APIs which syscalls aren't. I'd say
> just remove this comment, the manpage is a much better documentation anyway.

Ok. I will remove this entire comment.

> > + * TBD Generic fallocate to be added for file systems that do not
> > + *     support fallocate.
>
> Please remove the comment, adding a generic fallback in kernelspace is a
> very dumb idea as we already discussed long time ago.
>
> > --- linux-2.6.22.orig/include/linux/fs.h
> > +++ linux-2.6.22/include/linux/fs.h
> > @@ -266,6 +266,21 @@ extern int dir_notify_enable;
> >  #define SYNC_FILE_RANGE_WRITE 2
> >  #define SYNC_FILE_RANGE_WAIT_AFTER 4
> >
> > +/*
> > + * sys_fallocate modes
> > + * Currently sys_fallocate supports two modes:
> > + * FALLOC_ALLOCATE : This is the preallocate mode, using which an application
> > + *                   may request reservation of space for a particular file.
> > + *                   The file size will be changed if the allocation is
> > + *                   beyond EOF.
> > + * FALLOC_RESV_SPACE : This is same as the above mode, with only one difference
> > + *                     that the file size will not be modified.
> > + */
> > +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend/shrink size */
> > +
> > +#define FALLOC_ALLOCATE 0
> > +#define FALLOC_RESV_SPACE FALLOC_FL_KEEP_SIZE
>
> Just remove FALLOC_ALLOCATE, 0 flags should be the default. I'm also not
> sure there is any point in having two namespaces now that we have a
> flags-based ABI.

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this flag and remove the other mode too (FALLOC_RESV_SPACE). Is this what you are suggesting?

> Also please don't add this to fs.h. fs.h is a complete mess and the falloc
> flags are a new user ABI. Add a linux/falloc.h instead which can be added
> to headers-y so the ABI constant can be exported to userspace.

Do we need a header file just to declare one flag, i.e. FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two modes)? If linux/fs.h is not a good place, will asm-generic/fcntl.h be a sane place for this flag?

Thanks!
--
Regards,
Amit Arora
Re: [PATCH 1/6][TAKE7] manpage for fallocate
On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote:
> Following is the modified version of the manpage originally submitted by
> David Chinner. Please use `nroff -man fallocate.2 | less` to view.
>
> This includes changes suggested by Heikki Orsila and Barry Naujok.

Can we get itemised change logs for all these patches from now on?

> .TH fallocate 2
> .SH NAME
> fallocate \- allocate or remove file space

If fallocate is just being used for allocating space this is wrong.
maybe - manipulate file space instead?

dd

> .TP
> .B FALLOC_RESV_SPACE
> provides the same functionality as
> .B FALLOC_ALLOCATE
> except it does not ever change the file size. This allows allocation of
> zero blocks beyond the end of file and is useful for optimising

of zeroed blocks

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Guy Watkins wrote: } -Original Message- } From: [EMAIL PROTECTED] [mailto:linux-raid- } [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] } Sent: Thursday, July 12, 2007 1:35 PM } To: [EMAIL PROTECTED] } Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper } development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED]; } [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger } Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for } devices, filesystems, and dm/md. } } On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said: } [EMAIL PROTECTED] wrote: } On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: } } All of the high end arrays have non-volatile cache (read, on power } loss, it is a } promise that it will get all of your data out to permanent storage). } You don't } need to ask this kind of array to drain the cache. In fact, it might } just ignore } you if you send it that kind of request ;-) } } OK, I'll bite - how does the kernel know whether the other end of that } fiberchannel cable is attached to a DMX-3 or to some no-name product } that } may not have the same assurances? Is there a I'm a high-end array } bit } in the sense data that I'm unaware of? } } } There are ways to query devices (think of hdparm -I in S-ATA/P-ATA } drives, SCSI } has similar queries) to see what kind of device you are talking to. I am } not } sure it is worth the trouble to do any automatic detection/handling of } this. } } In this specific case, it is more a case of when you attach a high end } (or } mid-tier) device to a server, you should configure it without barriers } for its } exported LUNs. } } I don't have a problem with the sysadmin *telling* the system the other } end of } that fiber cable has characteristics X, Y and Z. What worried me was } that it } looked like conflating device reported writeback cache with device } actually } has enough battery/hamster/whatever backup to flush everything on a power } loss. 
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while the power is off. When the power is restored it flushes the cache to disk. If the power failure lasts longer than the batteries then the cache data is lost, but the batteries last 24+ hours I believe.

Most mid-range and high end arrays actually use that battery to ensure that data is all written out to permanent media when the power is lost. I won't go into how that is done, but it clearly would not be safe to assume that your power outage is only going to be a certain length of time (and if not, you would lose data).

A big EMC array we had had enough battery power to power about 400 disks while the 16 Gig of cache was flushed. I think EMC told me the batteries would last about 20 minutes. I don't recall if the array was usable during the 20 minutes. We never tested a power failure.

Guy

I worked on the team that designed that big array. At one point, we had an array on loan to a partner who tried to put it in a very small data center. A few weeks later, they brought in an electrician who needed to run more power into the center. It was pretty funny - he tried to find a power button to turn it off and then just walked over and dropped power trying to get the Symm to turn off. When that didn't work, he was really, really confused ;-)

ric
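The back-of-envelope figure quoted in this message (a 1 ms seek per 4K block, 1G of cache, roughly 4 1/2 minutes) can be checked with a few lines of C. This is only a sketch of that arithmetic: the function names are made up for illustration, and the inputs are the assumptions stated in the quoted estimate, not measurements of any real array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Worst-case time, in seconds, to drain a write cache when every block
 * costs one full seek. The parameters are illustrative assumptions taken
 * from the quoted estimate.
 */
static double worst_case_flush_seconds(uint64_t cache_bytes,
                                       uint64_t block_bytes,
                                       double seek_ms)
{
    assert(block_bytes > 0);
    uint64_t blocks = cache_bytes / block_bytes;   /* one seek per block */
    return (double)blocks * seek_ms / 1000.0;
}

/* Print the estimate for a 1 GiB cache, 4 KiB blocks and 1 ms seeks. */
static void print_flush_estimate(void)
{
    double secs = worst_case_flush_seconds(1ULL << 30, 4096, 1.0);
    printf("%.1f s (about %.1f minutes)\n", secs, secs / 60.0);
}
```

Calling `print_flush_estimate()` from a `main()` gives 262144 seeks, or about 262 seconds, which is in line with the "up to 4 1/2 minutes" in the quoted text.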
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
The updated patch is attached. comments inline... On Tue, 2007-07-10 at 22:40 -0700, Andrew Morton wrote: If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. Would I be correct in assuming that a later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir directories? If so, that is worth a mention in the changelog, perhaps? The changelog has been updated to include this. +static inline void ext4_inc_count(handle_t *handle, struct inode *inode) +{ + inc_nlink(inode); + if (is_dx(inode) inode-i_nlink 1) { + /* limit is 16-bit i_links_count */ + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) { + inode-i_nlink = 1; + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb, + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); + } + } +} Looks too big to be inlined. Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2? I have added a comment for this. (since it indicates that nlinks==1 previously). +static inline void ext4_dec_count(handle_t *handle, struct inode *inode) +{ + drop_nlink(inode); + if (S_ISDIR(inode-i_mode) inode-i_nlink == 0) + inc_nlink(inode); +} Probably too big to inline. Removed the inline. - if (inode-i_nlink = EXT4_LINK_MAX) + if (EXT4_DIR_LINK_MAX(inode)) return -EMLINK; argh. WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED as_lower_case_inlines? Sigh. It's all the old-timers, I guess. EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice. #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) (dir)-i_nlink = EXT4_LINK_MAX) This just checks if directory has hash indexing in which case we need not worry about EXT4_LINK_MAX subdir limit. 
If directory is not hash indexed then we will need to enforce a max subdir limit. Sorry, I didn't understand what is the problem with this macro? Thanks, Kalpak. This patch adds support to ext4 for allowing more than 65000 subdirectories. Currently the maximum number of subdirectories is capped at 32000. If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. A later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory with 65000 subdirs. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] --- fs/ext4/namei.c | 52 +++- include/linux/ext4_fs.h |4 ++- 2 files changed, 41 insertions(+), 15 deletions(-) Index: linux-2.6.22/fs/ext4/namei.c === --- linux-2.6.22.orig/fs/ext4/namei.c +++ linux-2.6.22/fs/ext4/namei.c @@ -1617,6 +1617,35 @@ static int ext4_delete_entry (handle_t * return -ENOENT; } +/* + * DIR_NLINK feature is set if 1) nlinks EXT4_LINK_MAX or 2) nlinks == 2, + * since this indicates that nlinks count was previously 1. + */ +static void ext4_inc_count(handle_t *handle, struct inode *inode) +{ + inc_nlink(inode); + if (is_dx(inode) inode-i_nlink 1) { + /* limit is 16-bit i_links_count */ + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) { + inode-i_nlink = 1; + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb, + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); + } + } +} + +/* + * If a directory had nlink == 1, then we should let it be 1. This indicates + * directory has EXT4_LINK_MAX subdirs. 
+ */ +static void ext4_dec_count(handle_t *handle, struct inode *inode) +{ + drop_nlink(inode); + if (S_ISDIR(inode-i_mode) inode-i_nlink == 0) + inc_nlink(inode); +} + + static int ext4_add_nondir(handle_t *handle, struct dentry *dentry, struct inode *inode) { @@ -1713,7 +1742,7 @@ static int ext4_mkdir(struct inode * dir struct ext4_dir_entry_2 * de; int err, retries = 0; - if (dir-i_nlink = EXT4_LINK_MAX) + if (EXT4_DIR_LINK_MAX(dir)) return -EMLINK; retry: @@ -1736,7 +1765,7 @@ retry: inode-i_size = EXT4_I(inode)-i_disksize = inode-i_sb-s_blocksize; dir_block = ext4_bread (handle, inode, 0, 1, err); if (!dir_block) { - drop_nlink(inode); /* is this nlink == 0? */ + ext4_dec_count(handle, inode); /* is this nlink == 0? */
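The link-count behaviour described in the changelog above can be modelled in userspace. The sketch below is not the kernel code: `struct model_dir` and the `model_*` names are hypothetical stand-ins, the limit of 65000 is taken from the changelog text, and the comparison operators (lost in the archived patch) are assumed to be `&&`, `>` and `>=`.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_LINK_MAX 65000u   /* stands in for EXT4_LINK_MAX (assumed) */

/* Hypothetical userspace stand-in for the inode fields the patch touches. */
struct model_dir {
    unsigned int nlink;       /* i_nlink */
    bool is_dx;               /* directory uses hash (htree) indexing */
    bool dir_nlink_feature;   /* EXT4_FEATURE_RO_COMPAT_DIR_NLINK is set */
};

/* Models ext4_inc_count(): called when a subdirectory is added. */
static void model_inc_count(struct model_dir *d)
{
    d->nlink++;
    if (d->is_dx && d->nlink > 1) {
        /* nlink == 2 here means the count was pinned at 1 before */
        if (d->nlink >= MODEL_LINK_MAX || d->nlink == 2) {
            d->nlink = 1;
            d->dir_nlink_feature = true;
        }
    }
}

/* Models ext4_dec_count(): a directory's count never drops below 1. */
static void model_dec_count(struct model_dir *d)
{
    assert(d->nlink > 0);
    d->nlink--;
    if (d->nlink == 0)
        d->nlink = 1;
}
```

Starting from `nlink = 2` on an indexed directory, the count rises normally until it would reach the limit, at which point it is pinned at 1 and stays there through further additions and removals.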
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote:
> > + brelse(bh);
> > + up_write(&EXT4_I(inode)->xattr_sem);
> > + return error;
> > +}
> > +
>
> We're doing GFP_KERNEL memory allocations while holding xattr_sem. This
> can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
> i_truncate_sem and/or journal_start() (I forget whether this still
> happens).
>
> Have we checked whether this can occur and if so, whether we are OK from a
> lock ranking POV? Bear in mind that journalled-data mode is more complex
> in this regard.

I notice that everyone carefully avoided addressing this ;)

Oh well, hopefully people are testing with lockdep enabled. As long as the fs is put under extreme memory pressure, most bugs should be reported.

Except lockdep doesn't know about journal_start(), which has ranking requirements similar to a semaphore. Nor does it know about lock_page(). We already have hard-to-hit but deadlockable bugs in this area.
Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
On Fri, Jul 13, 2007 at 07:48:58PM +0530, Amit K. Arora wrote:
> Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
> to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
> flag and remove the other mode too (FALLOC_RESV_SPACE).
> Is this what you are suggesting ?

Yes.

> Should we need a header file just to declare one flag - i.e.
> FALLOC_FL_KEEP_SIZE (since now there is no point of declaring the two
> modes) ? If linux/fs.h is not a good place, will asm-generic/fcntl.h be
> a sane place for this flag ?

It might sound a little silly but is the cleanest thing we could do by far. And I suspect there will be more flags soon..
Re: [PATCH 1/6][TAKE7] manpage for fallocate
On Sat, Jul 14, 2007 at 12:06:51AM +1000, David Chinner wrote:
> On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote:
> > Following is the modified version of the manpage originally submitted by
> > David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> >
> > This includes changes suggested by Heikki Orsila and Barry Naujok.
>
> Can we get itemised change logs for all these patches from now on?

Sure.

> > .TH fallocate 2
> > .SH NAME
> > fallocate \- allocate or remove file space
>
> If fallocate is just being used for allocating space this is wrong.
> maybe - manipulate file space instead?

Yes, it needs to be changed.

> dd
>
> > .TP
> > .B FALLOC_RESV_SPACE
> > provides the same functionality as
> > .B FALLOC_ALLOCATE
> > except it does not ever change the file size. This allows allocation of
> > zero blocks beyond the end of file and is useful for optimising
>
> of zeroed blocks

Ok.

--
Regards,
Amit Arora

> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
On Fri, Jul 13, 2007 at 02:21:37PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> > From: David P. Quigley [EMAIL PROTECTED]
> >
> > Revalidate the write permissions for fallocate(2), in case security policy
> > has changed since the files were opened.
> >
> > Acked-by: James Morris [EMAIL PROTECTED]
> > Signed-off-by: David P. Quigley [EMAIL PROTECTED]
>
> This should be merged into the main falloc patch.

Ok. Will merge it...

--
Regards,
Amit Arora
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Jul 13, 2007 15:33 +0200, Peter Zijlstra wrote:
> On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
> Or can journal_stop() be done by a different task than the one that did
> journal_start()? - in which case nothing much can be done :-/

The call to journal_stop() has to be in the same process, since the journal handle is also held in current->journal_info so the handle does not need to be passed as an argument all over the VFS.

> This seems to boot... albeit I did not push it hard.

Can you please also make a patch for jbd2.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Fri, 13 Jul 2007 16:00:48 +0530 Kalpak Shah [EMAIL PROTECTED] wrote:
> > > - if (inode->i_nlink >= EXT4_LINK_MAX)
> > > + if (EXT4_DIR_LINK_MAX(inode))
> > >       return -EMLINK;
> >
> > argh. WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED
> > as_lower_case_inlines? Sigh. It's all the old-timers, I guess.
> >
> > EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.
>
> #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)
>
> This just checks if directory has hash indexing in which case we need not
> worry about EXT4_LINK_MAX subdir limit. If directory is not hash indexed
> then we will need to enforce a max subdir limit.
>
> Sorry, I didn't understand what is the problem with this macro?

Macros should never evaluate their argument more than once, because if they do they will misbehave when someone passes them an expression-with-side-effects:

    struct inode *p = q;

    EXT4_DIR_LINK_MAX(p++);

one expects `p' to have the value q+1 here. But it might be q+2.

and

    EXT4_DIR_LINK_MAX(some_function());

might cause some_function() to be called twice.

This is one of the many problems which gets fixed when we write code in C rather than in cpp.
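The double-evaluation problem described in this reply can be reproduced outside the kernel. The following is a minimal sketch, assuming a simplified `struct demo_inode` and a made-up limit; every `demo_*` / `DEMO_*` name is hypothetical and none of them come from the ext4 sources.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_LINK_MAX 32000   /* illustrative limit, not the kernel's value */

struct demo_inode {
    unsigned int nlink;
    bool is_dx;
};

static int lookups;   /* counts how many times demo_lookup() runs */

/* An argument expression with a side effect. */
static struct demo_inode *demo_lookup(struct demo_inode *dir)
{
    lookups++;
    return dir;
}

/* Macro form: `dir` appears twice, so it may be evaluated twice. */
#define DEMO_DIR_LINK_MAX(dir) \
    (!(dir)->is_dx && (dir)->nlink >= DEMO_LINK_MAX)

/* Inline form: the argument is evaluated exactly once, at the call site. */
static inline bool demo_dir_link_max(const struct demo_inode *dir)
{
    return !dir->is_dx && dir->nlink >= DEMO_LINK_MAX;
}

/* Returns how many extra evaluations the macro form caused. */
static int demo_extra_evaluations(void)
{
    struct demo_inode d = { .nlink = 5, .is_dx = false };

    lookups = 0;
    (void)DEMO_DIR_LINK_MAX(demo_lookup(&d));
    int macro_calls = lookups;

    lookups = 0;
    (void)demo_dir_link_max(demo_lookup(&d));
    int inline_calls = lookups;

    assert(macro_calls >= inline_calls);
    return macro_calls - inline_calls;
}
```

Because of `&&` short-circuiting, the macro only evaluates its argument a second time when the first test passes, which is why the reply says the result "might be" q+2 rather than always being so. The inline function has no such dependence on the data.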
[PATCH 0/5][TAKE8] fallocate system call
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE7:
1) Updated the man page.
2) Merged revalidate write permissions patch with the main falloc patch.
3) Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it. Also removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
4) Removed comment above sys_fallocate definition.
5) Updated the testcase below to use FALLOC_FL_KEEP_SIZE flag instead of previous two modes.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group alignment in ext4 (Patch 6/6). Please refer following post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by David Chinner).

Todos:
1) Implementation on other architectures (other than i386, x86_64, and ppc64). s390(x) and ia64 patches are ready and will be pushed by platform maintainers when the fallocate is in mainline.
2) A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3) Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4) Patch to e2fsprogs to recognize and display uninitialized extents.
Following patches follow: Patch 1/5 : manpage for fallocate Patch 2/5 : fallocate() implementation in i386, x86_64 and powerpc Patch 3/5 : ext4: fallocate support in ext4 Patch 4/5 : ext4: write support for preallocated blocks Patch 5/5 : ext4: change for better extent-to-group alignment ** Attached below is a small testcase to test fallocate. The __NR_fallocate will need to be changed depending on the system call number in the kernel (it may get changed due to merge) and also depending on the architecture. -- Regards, Amit Arora #include stdio.h #include stdlib.h #include fcntl.h #include errno.h #include linux/unistd.h #include sys/vfs.h #include sys/stat.h #define VERBOSE 0 #define __NR_fallocate324 #define FALLOC_FL_KEEP_SIZE 0x01 int do_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret; if (VERBOSE) printf(Trying to preallocate blocks (offset=%llu, len=%llu)\n, offset, len); ret = syscall(__NR_fallocate, fd, mode, offset, len); if (ret 0) { printf(SYSCALL: received error %d, ret=%d\n, errno, ret); close(fd); return(1); } if (VERBOSE) printf(fallocate system call succedded ! 
ret=%d\n, ret); return ret; } int test_fallocate(int fd, int mode, loff_t offset, loff_t len) { int ret, blocks; struct stat statbuf1, statbuf2; fstat(fd, statbuf1); ret = do_fallocate(fd, mode, offset, len); fstat(fd, statbuf2); /* check file size after preallocation */ if (!mode) { if (!ret statbuf1.st_size (offset + len) statbuf2.st_size != (offset + len)) { printf(Error: fallocate succeeded, but the file size did not change, where it should have!\n); ret = 1; } } else if (statbuf1.st_size != statbuf2.st_size) { printf(Error : File size changed, when it should not have!\n); ret = 1; } blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512)/ statbuf2.st_blksize; /* Print report */ printf(# FALLOCATE TEST REPORT #\n); printf(\tNew blocks preallocated = %d.\n, blocks); printf(\tNumber of bytes preallocated = %d\n, blocks * statbuf2.st_blksize); printf(\tOld file size = %d, New file size %d.\n, statbuf1.st_size, statbuf2.st_size); printf(\tOld num blocks = %d, New num blocks %d.\n, (statbuf1.st_blocks * 512)/1024, (statbuf2.st_blocks * 512)/1024); return ret; } int do_write(int fd, loff_t offset, loff_t len) { int ret; char *buf; buf = (char *)malloc(len); if (!buf) { printf(error: malloc failed.\n); return(-1); } if (VERBOSE) printf(Trying to write to file (offset=%llu, len=%llu)\n, offset, len); ret = lseek(fd, offset, SEEK_SET); if (ret != offset) { printf(lseek() failed error=%d, ret=%d\n, errno, ret); close(fd); return(-1); } ret = write(fd, buf, len); if (ret != len) { printf(write() failed error=%d, ret=%d\n, errno, ret);
[PATCH 1/5][TAKE8] manpage for fallocate
Following is the modified version of the manpage originally submitted by David Chinner. Please use `nroff -man fallocate.2 | less` to view.

Following changed from TAKE7:
* Removed FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes.
* Described only single flag for mode, i.e. FALLOC_FL_KEEP_SIZE.
* s/zero blocks/zeroed blocks/ as suggested by Dave.
* Included linux/falloc.h instead of fcntl.h.

Following changed from TAKE6 to TAKE7:
* Included changes suggested by Heikki Orsila and Barry Naujok.

.TH fallocate 2
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len ");"
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there is only one flag supported for the mode argument.
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space. Even if the size of the file is less than
.IR offset + len ,
the file size is not changed. This allows allocation of zeroed blocks beyond
the end of file and is useful for optimising append workloads.
.PP
If
.B FALLOC_FL_KEEP_SIZE
flag is not specified in the mode argument, the default behavior of this system
call is almost same as when this flag is passed. The only difference is that
on success, the file size will be changed if the
.IR offset + len
is greater than the file size. This default behavior closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.PP
.B fallocate
may allocate a larger range than that was specified.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe of file descriptor.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR posix_fallocate (3),
.BR posix_fadvise (3),
.BR ftruncate (3).
[PATCH 2/5][TAKE8] fallocate() implementation in i386, x86_64 and powerpc
From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate().

Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take.
2. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

CHANGELOG:
Following changed from TAKE7:
1. Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
2. Removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
3. Merged revalidate write permissions patch from David P. Quigley to this patch.
4. Deleted comment above sys_fallocate definition, as suggested by Christoph.
Signed-off-by: Amit Arora [EMAIL PROTECTED] Index: linux-2.6.22/arch/i386/kernel/syscall_table.S === --- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c === --- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, +u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi 32) | offlo, +((loff_t)lenhi 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys32_fallocate ia32_syscall_end: Index: linux-2.6.22/fs/open.c === --- linux-2.6.22.orig/fs/open.c +++ linux-2.6.22/fs/open.c @@ -26,6 +26,7 @@ #include linux/syscalls.h #include linux/rcupdate.h #include linux/audit.h +#include linux/falloc.h int vfs_statfs(struct dentry *dentry, struct kstatfs *buf) { @@ -352,6 +353,64 @@ asmlinkage long sys_ftruncate64(unsigned } #endif +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset 0 || len = 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode !(mode FALLOC_FL_KEEP_SIZE)) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file-f_mode FMODE_WRITE)) + goto out_fput; + /* +* Revalidate the write 
permissions, in case security policy has +* changed since the files were opened. +*/ + ret = security_file_permission(file, MAY_WRITE); + if (ret) + goto out_fput; + + inode = file-f_path.dentry-d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode-i_mode)) + goto out_fput; + + ret = -ENODEV; + /* +* Let individual file system
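The powerpc compat wrapper in the patch above receives each 64-bit argument as two 32-bit halves (offhi/offlo, lenhi/lenlo) and reassembles them before calling sys_fallocate(). The shift operators were lost in this archive copy, so the sketch below shows the presumed idea in portable userspace C. The helper names `join_hi_lo` and `split_hi_lo` are hypothetical, and `int64_t` stands in for the kernel's loff_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebuild a signed 64-bit value from the two 32-bit halves a 32-bit
 * caller passes. The widening cast must happen before the shift:
 * shifting a 32-bit value left by 32 is undefined behaviour in C.
 */
static int64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Inverse operation, as a 32-bit caller would split an offset. */
static void split_hi_lo(int64_t v, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *hi = (uint32_t)((uint64_t)v >> 32);
    *lo = (uint32_t)((uint64_t)v & 0xffffffffu);
}
```

A round trip through `split_hi_lo()` and `join_hi_lo()` returns the original value, which is the property the compat path relies on for offsets and lengths above 4 GiB.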
[PATCH 3/5][TAKE8] ext4: fallocate support in ext4
From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.

CHANGELOG:
---------
Following changed from TAKE7:
1. Removed usage of FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes and used
   FALLOC_FL_KEEP_SIZE mode flag instead.
2. Included linux/falloc.h new header file, which defines above flag.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -39,6 +39,7 @@
 #include <linux/quotaops.h>
 #include <linux/string.h>
 #include <linux/slab.h>
+#include <linux/falloc.h>
 #include <linux/ext4_fs_extents.h>
 #include <asm/uaccess.h>
@@ -282,7 +283,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug(" %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug(" []");
@@ -305,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode,
 	ext_debug(" -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));

 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1106,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;

@@ -1115,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif

-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1144,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;

 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1191,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;

-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
[PATCH 4/5][TAKE8] ext4: write support for preallocated blocks
From: Amit Arora [EMAIL PROTECTED]

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

CHANGELOG:
---------
This patch did not change from TAKE7 (besides offsets ;).

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1141,6 +1141,53 @@ ext4_can_extents_be_merged(struct inode
 }

 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+				struct ext4_ext_path *path,
+				struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1375,7 @@ has_space:

 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);

 	/* try to merge extents to the left */

@@ -2012,15 +2041,158 @@ void ext4_ext_release(struct super_block
 #endif
 }

+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+
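
The three cases listed in the comment on ext4_ext_convert_to_initialized()
(no split, two extents, three extents) come down to simple range
arithmetic. The following is only a userspace sketch of that arithmetic,
not the patch's logic: the body of the function is cut off in this
archive, and the struct and function names here (sketch_split,
sketch_split_extent) are hypothetical. It assumes the write starts inside
the extent and clamps it at the extent's end.

```c
#include <assert.h>

/* Hypothetical result type: how one uninitialized extent is carved up. */
struct sketch_split {
	unsigned int left;    /* blocks left uninitialized before the write */
	unsigned int written; /* blocks that become initialized */
	unsigned int right;   /* blocks left uninitialized after the write */
	unsigned int pieces;  /* resulting number of extents: 1, 2 or 3 */
};

/*
 * The extent covers logical blocks [ee_block, ee_block + ee_len) and the
 * write covers max_blocks blocks starting at iblock.
 */
static struct sketch_split sketch_split_extent(unsigned int ee_block,
					       unsigned int ee_len,
					       unsigned int iblock,
					       unsigned int max_blocks)
{
	struct sketch_split s;

	/* Assumption: the write begins somewhere inside the extent. */
	assert(ee_len > 0 && max_blocks > 0);
	assert(iblock >= ee_block && iblock - ee_block < ee_len);

	s.left = iblock - ee_block;
	/* Clamp the write to the end of the extent. */
	s.written = max_blocks < ee_len - s.left ? max_blocks
						 : ee_len - s.left;
	s.right = ee_len - s.left - s.written;
	/* Case a: pieces == 1, case b: pieces == 2, case c: pieces == 3. */
	s.pieces = 1 + (s.left > 0) + (s.right > 0);
	return s;
}
```

For example, writing 4 blocks at block 103 into an extent covering blocks
100..109 would leave 3 uninitialized blocks on each side, i.e. case c.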
[PATCH 5/5][TAKE8] ext4: change for better extent-to-group alignment
From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger.
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.

CHANGELOG:
---------
This patch did not change from TAKE7 (besides offsets ;).
Following changed from TAKE6 to TAKE7:
1. Taken care of Andreas's suggestion of using EXT_INIT_MAX_LEN instead of
   0x8000 at some places.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1107,7 +1107,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	unsigned short ext1_ee_len, ext2_ee_len;
+	unsigned short ext1_ee_len, ext2_ee_len, max_len;

 	/*
 	 * Make sure that either both extents are uninitialized, or
@@ -1116,6 +1116,11 @@ ext4_can_extents_be_merged(struct inode
 	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
 		return 0;

+	if (ext4_ext_is_uninitialized(ex1))
+		max_len = EXT_UNINIT_MAX_LEN;
+	else
+		max_len = EXT_INIT_MAX_LEN;
+
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
 	ext2_ee_len = ext4_ext_get_actual_len(ex2);

@@ -1128,7 +1133,7 @@ ext4_can_extents_be_merged(struct inode
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > max_len)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1815,7 +1820,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc

 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
-		if (uninitialized)
+		/*
+		 * Do not mark uninitialized if all the blocks in the
+		 * extent have been removed.
+		 */
+		if (uninitialized && num)
 			ext4_ext_mark_uninitialized(ex);

 		err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2308,6 +2317,19 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);

+	/*
+	 * See if request is beyond maximum number of blocks we can have in
+	 * a single extent. For an initialized extent this limit is
+	 * EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+	 * EXT_UNINIT_MAX_LEN.
+	 */
+	if (max_blocks > EXT_INIT_MAX_LEN &&
+	    create != EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_INIT_MAX_LEN;
+	else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+		 create == EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_UNINIT_MAX_LEN;
+
 	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
 	newex.ee_block = cpu_to_le32(iblock);
 	newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru

 #define EXT_MAX_BLOCK	0x

-#define EXT_MAX_LEN	((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1
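
The ee_len encoding described in the comment above can be illustrated
with a small userspace sketch. This is an assumption-based illustration,
not the patch's code: the bodies of ext4_ext_is_uninitialized(),
ext4_ext_get_actual_len() and ext4_ext_mark_uninitialized() do not appear
in this excerpt, so the sketch_* helpers below are hypothetical stand-ins
derived only from the comment text, and they work on a host-endian value
rather than the on-disk little-endian field.

```c
#include <assert.h>
#include <stdint.h>

/* Limits as stated in the comment: 2^15 and 2^15 - 1. */
#define EXT_INIT_MAX_LEN	(1U << 15)
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)

/* ee_len <= 0x8000 is initialized; anything larger is uninitialized. */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return ee_len > EXT_INIT_MAX_LEN;
}

/* Strip the marker bit, except for the special value 0x8000 itself. */
static unsigned int sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= EXT_INIT_MAX_LEN ? ee_len
					  : ee_len - EXT_INIT_MAX_LEN;
}

/* An uninitialized extent must hold between 1 and 2^15 - 1 blocks. */
static uint16_t sketch_mark_uninitialized(uint16_t len)
{
	assert(len >= 1 && len <= EXT_UNINIT_MAX_LEN);
	return (uint16_t)(len | EXT_INIT_MAX_LEN);
}

/* Length part of the merge test: same state, and sum within max_len. */
static int sketch_merged_len_ok(uint16_t ee_len1, uint16_t ee_len2)
{
	unsigned int max_len;

	if (sketch_is_uninitialized(ee_len1) !=
	    sketch_is_uninitialized(ee_len2))
		return 0;
	max_len = sketch_is_uninitialized(ee_len1) ? EXT_UNINIT_MAX_LEN
						   : EXT_INIT_MAX_LEN;
	return sketch_actual_len(ee_len1) + sketch_actual_len(ee_len2)
		<= max_len;
}
```

Under these assumptions two initialized extents of 16384 blocks each may
be merged into one 32768-block extent, while two uninitialized extents of
the same size may not, because the merged length would need the value
0x8000 that is reserved for the initialized case.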
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Fri, 13 Jul 2007 15:33:41 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

> On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
>
> > Except lockdep doesn't know about journal_start(), which has ranking
> > requirements similar to a semaphore.
>
> Something like so?

Looks OK.

> Or can journal_stop() be done by a different task than the one that did
> journal_start()? - in which case nothing much can be done :-/

Yeah, journal_start() and journal_stop() are well-behaved.

> This seems to boot... albeit I did not push it hard.

I fear the consequences of this change :(

Oh well, please keep it alive, maybe beat on it a bit, resend it later on?
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Fri, 2007-07-13 at 12:35 +0530, Kalpak Shah wrote:
> On Fri, 2007-07-13 at 09:59 +0530, Aneesh Kumar K.V wrote:
> > Kalpak Shah wrote:
> > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > > On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
> > > >
> > > > > This patch is a spinoff of the old nanosecond patches.
> > > >
> > > > I don't know what the old nanosecond patches are. A link to a
> > > > suitable changelog for those patches would do in a pinch.
> > > > Preferable would be to write a proper changelog for this patch.
> > >
> > > The incremental patch contains a proper changelog describing the patch.
> >
> > Instead of putting incremental patches it would be nice if we can have
> > replacement patches for the already existing patches with the comments
> > addressed.
> >
> > For example if we have a review comment on the patch message (commit
> > log) then adding an incremental patch doesn't help.
>
> I think that it would be easier to review just the changes that have been
> made to the patches instead of having people go through the entire patch
> again. I was hoping that someone with write access to ext4-git would
> update the commit logs. If replacement patches are preferred, then I will
> send them again.

No need, I have already folded your fix patch into the parent patches, so
the updated ext4-patch-queue contains the updated nanosecond patch.

> Thanks,
> Kalpak.
>
> > -aneesh
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
> I fear the consequences of this change :(

I love it. In the past I've lost time by working with patches which
didn't quite realize that ext3 holds a transaction open during
->direct_IO.

> Oh well, please keep it alive, maybe beat on it a bit, resend it later on?

I can test the patch to make sure that it catches mistakes I've made in
the past.

Peter, do you have any interest in seeing how far we can get at tracking
lock_page()? I'm not holding my breath, but any little bit would probably
help.

- z
lease and lock patches
Please pull from the 'for-linus' branch at

	git://linux-nfs.org/~bfields/linux.git for-linus

for a series of patches which add a setlease() file method. The
longer-term goal is to allow cluster and network filesystems to give out
consistent leases when possible, in particular to allow nfsd to give out
delegations on cluster filesystems. For now, though, we're using this
just to disallow leases selectively on certain filesystems (nfs and gfs2
for now) where they don't make sense.

Also includes some minor locks.c cleanup.

J. Bruce Fields (9):
      locks: convert an -EINVAL return to a BUG
      locks: clean up lease_alloc()
      locks: share more common lease code
      locks: rename lease functions to reflect locks.c conventions
      locks: provide a file lease method enabling cluster-coherent leases
      locks: export setlease to filesystems
      nfs: disable leases over NFS
      locks: make posix_test_lock() interface more consistent
      locks: fix vfs_test_lock() comment

Marc Eshel (1):
      gfs2: stop giving out non-cluster-coherent leases

david m. richter (1):
      leases: minor break_lease() comment clarification

 fs/gfs2/ops_file.c  |   24 +++
 fs/locks.c          |  112 ++
 fs/nfs/file.c       |   16 +++-
 fs/nfsd/nfs4state.c |   10 ++--
 include/linux/fs.h  |    4 +-
 5 files changed, 105 insertions(+), 61 deletions(-)
[PATCH] isofs: mounting to regular file may succeed
It turned out that mounting a corrupted ISO image to a regular file may
succeed, e.g. if an image was prepared as follows:

$ dd if=correct.iso of=bad.iso bs=4k count=8

We then can mount it to a regular file:

# mount -o loop -t iso9660 bad.iso /tmp/file

But mounting it to a directory fails with -ENOTDIR, simply because
the root directory inode doesn't have S_IFDIR set and the condition
in graft_tree() is met:

	if (S_ISDIR(nd->dentry->d_inode->i_mode) !=
	      S_ISDIR(mnt->mnt_root->d_inode->i_mode))
		return -ENOTDIR

This is because the root directory inode was read from an incorrect
block. It's supposed to be read from sbi->s_firstdatazone, which is an
absolute value and gets messed up in the case of an incorrect image.

In order to somehow circumvent this we have to check that the root
directory inode is actually a directory after all.

Signed-off-by: Kirill Kuvaldin [EMAIL PROTECTED]

diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 5c3eecf..ce5062a 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -840,6 +840,15 @@ root_found:
 		goto out_no_root;
 	if (!inode->i_op)
 		goto out_bad_root;
+
+	/* Make sure the root inode is a directory */
+	if (!S_ISDIR(inode->i_mode)) {
+		printk(KERN_WARNING
+			"isofs_fill_super: root inode is not a directory. "
+			"Corrupted media?\n");
+		goto out_iput;
+	}
+
 	/* get the root dentry */
 	s->s_root = d_alloc_root(inode);
 	if (!(s->s_root))
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Jul 13, 2007 02:05 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote:
>
> > > +	brelse(bh);
> > > +	up_write(&EXT4_I(inode)->xattr_sem);
> > > +	return error;
> > > +}
> > > +
> >
> > We're doing GFP_KERNEL memory allocations while holding xattr_sem. This
> > can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
> > i_truncate_sem and/or journal_start() (I forget whether this still
> > happens). Have we checked whether this can occur and if so, whether we
> > are OK from a lock ranking POV? Bear in mind that journalled-data mode is
> > more complex in this regard.
>
> I notice that everyone carefully avoided addressing this ;)
>
> Oh well, hopefully people are testing with lockdep enabled. As long as the
> fs is put under extreme memory pressure, most bugs should be reported.

I have no objection to changing these to GFP_NOFS or GFP_ATOMIC, because
the number of times this function is called is really quite small (only
for existing inodes when the size of the fixed fields in the inode is
increasing) and the buffers are freed immediately so this won't put any
undue strain on the atomic memory pools.

That said, there is also a GFP_KERNEL allocation in
ext3_xattr_block_set() under xattr_sem, so the same problem would exist
there.

I also just noticed that "buffer" and "b_entry_name" are leaked in
ext4_expand_extra_isize() if the while loop is run more than one time
(again a relatively rare event).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.