Re: [PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Fri, 2008-02-22 at 20:09 +0530, Aneesh Kumar K.V wrote: We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/file.c | 19 ++- fs/ext4/inode.c | 60 +++ include/linux/ext4_fs.h |1 + 3 files changed, 79 insertions(+), 1 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 20507a2..77341c1 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -123,6 +123,23 @@ force_commit: return ret; } +static struct vm_operations_struct ext4_file_vm_ops = { + .fault = filemap_fault, + .page_mkwrite = ext4_page_mkwrite, +}; + +static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct address_space *mapping = file-f_mapping; + + if (!mapping-a_ops-readpage) + return -ENOEXEC; + file_accessed(file); + vma-vm_ops = ext4_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + return 0; +} + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext4_file_mmap, .open = generic_file_open, .release= ext4_release_file, .fsync = ext4_sync_file, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5b5d63d..00af97d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3490,3 +3490,63 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) return err; } + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + unsigned long end; + loff_t size; + handle_t *handle; + int ret = -EINVAL, needed_blocks; + struct file *file = vma-vm_file; + struct inode *inode = file-f_path.dentry-d_inode; + + needed_blocks = ext4_writepage_trans_blocks(inode); + /* We need to take inode mutex to prevent parallel write */ + mutex_lock(inode-i_mutex); + lock_page(page); + size = i_size_read(inode); + if ((page-mapping != inode-i_mapping) || + (page_offset(page) size)) { + /* page got truncated out from underneath us */ + goto out_unlock; + } + + /* page is wholly or partially inside EOF */ + if (((page-index + 1) PAGE_CACHE_SHIFT) size) + end = size ~PAGE_CACHE_MASK; + else + end = PAGE_CACHE_SIZE; + + handle = ext4_journal_start(inode, needed_blocks); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto out_unlock; + } + /* Will zero out the pages if buffer is marked new */ + ret = block_prepare_write(page, 0, end, ext4_get_block); + + if (!ret ext4_should_journal_data(inode)) { + ret = walk_page_buffers(handle, page_buffers(page), + 0, end, NULL, do_journal_get_write_access); + if (!ret) + ret = walk_page_buffers(handle, page_buffers(page), + 0, end, NULL, write_end_fn); + /* + * we don't want to call block_commit_write in journalled mode + */ + ext4_journal_stop(handle); + goto out_unlock; + } + if (!ret ext4_should_order_data(inode)) { + ret = walk_page_buffers(handle, page_buffers(page), + 0, end, NULL, ext4_journal_dirty_data); + } + if (!ret) + ret = block_commit_write(page, 0, end); + Hmm, it seems wired to do commit_write when the page is about becoming writable, but maybe that's the way it needs to? Don't we need to update the i_size somewhere? + ext4_journal_stop(handle); +out_unlock: + unlock_page(page); + mutex_unlock(inode-i_mutex); + return ret; +} It seems this combined the three journalling mode prepare_write() code here:( Since prepare_write() and commit_write() is going to sunset, why not simply calling mappings-a_ops-write_begin() and then write_end()? that should take care of pretty much the journalling and the page operations, no? Mingming diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index 22810b1..8f5a563 100644 --- a/include/linux/ext4_fs.h +++
Re: [PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Fri, 2008-02-22 at 23:53 +0530, Aneesh Kumar K.V wrote: On Fri, Feb 22, 2008 at 10:10:48AM -0800, Mingming Cao wrote: On Fri, 2008-02-22 at 20:09 +0530, Aneesh Kumar K.V wrote: . + ext4_journal_stop(handle); + goto out_unlock; + } + if (!ret ext4_should_order_data(inode)) { + ret = walk_page_buffers(handle, page_buffers(page), + 0, end, NULL, ext4_journal_dirty_data); + } + if (!ret) + ret = block_commit_write(page, 0, end); + Hmm, it seems wired to do commit_write when the page is about becoming writable, but maybe that's the way it needs to? Don't we need to update the i_size somewhere? ah, i_size didn't change with mapped IO. block_commit_write simply iterate over buffer_head of page and mark them dirty. That is why we don't want to call that for data=journalled mode. Right, but it still seems odd to mark the buffer_heard dirty *before* the write happens. I am confused, if i_size is not changing, then what we are journalling about? Keep journal ordering? but we haven't write anything yet Mingming + ext4_journal_stop(handle); +out_unlock: + unlock_page(page); + mutex_unlock(inode-i_mutex); + return ret; +} It seems this combined the three journalling mode prepare_write() code here:( Since prepare_write() and commit_write() is going to sunset, why not simply calling mappings-a_ops-write_begin() and then write_end()? that should take care of pretty much the journalling and the page operations, no? write_begin and write_end works with the user space buffer. In this case we don't have one. Also what ext4_page_mkwrite does is mostly what write_begin/write_end does except the copy of user space buffer. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC]ext4: block reservations to handle delalloc ENOSPC error
In delayed allocation, blocks to be allocated need to be reserved before user buffers being copied to memory, otherwise later at page writeout time we could hit ENOSPC error. In this patch, blocks(data and metadata) are reserved at da_write_begin() time, the free blocks counter is updated by then, and the number of reserved blocks is store in inode structure in memory. Then later, when new_blocks() is being called, if the caller is already reserved the blocks for allocation, the free blocks_counter is not being updated, otherwise, in DIO case, which could call get_blocks() without the blocks being reserved, proper accounting still need to be done. At the writepage() time, the unused reserved blocks are returned back. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/balloc.c| 25 +++-- fs/ext4/dir.c |3 - fs/ext4/extents.c | 78 +++--- fs/ext4/inode.c | 102 +--- fs/ext4/mballoc.c | 11 +++- fs/ext4/migrate.c |2 fs/ext4/super.c |3 + fs/ext4/xattr.c |2 include/linux/ext4_fs.h | 17 -- include/linux/ext4_fs_extents.h |4 - include/linux/ext4_fs_i.h |4 + 11 files changed, 195 insertions(+), 56 deletions(-) Index: linux-2.6.25-rc2/fs/ext4/super.c === --- linux-2.6.25-rc2.orig/fs/ext4/super.c 2008-02-21 17:30:59.0 -0800 +++ linux-2.6.25-rc2/fs/ext4/super.c2008-02-21 19:47:00.0 -0800 @@ -573,6 +573,8 @@ static struct inode *ext4_alloc_inode(st memset(ei-i_cached_extent, 0, sizeof(struct ext4_ext_cache)); INIT_LIST_HEAD(ei-i_prealloc_list); spin_lock_init(ei-i_prealloc_lock); + ei-i_reserved_data_blocks = 0; + ei-i_reserved_meta_blocks = 0; return ei-vfs_inode; } @@ -2190,6 +2192,7 @@ static int ext4_fill_super (struct super err = percpu_counter_init(sbi-s_dirs_counter, ext4_count_dirs(sb)); } + if (err) { printk(KERN_ERR EXT4-fs: insufficient memory\n); goto failed_mount3; Index: linux-2.6.25-rc2/fs/ext4/inode.c === --- linux-2.6.25-rc2.orig/fs/ext4/inode.c 2008-02-21 17:31:03.0 -0800 +++ linux-2.6.25-rc2/fs/ext4/inode.c2008-02-21 19:52:34.0 -0800 @@ -36,6 +36,7 @@ #include linux/mpage.h #include linux/uio.h #include linux/bio.h +#include linux/ext4_fs_extents.h #include xattr.h #include acl.h @@ -508,10 +509,11 @@ static int ext4_blks_to_allocate(Indirec * the indirect blocks(if needed) and the first direct block, * @blks: on return it will store the total number of allocated * direct blocks + * @reserved: flag if blocks is pre-reserved by delayed allocation */ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t goal, int indirect_blks, int blks, - ext4_fsblk_t new_blocks[4], int *err) + ext4_fsblk_t new_blocks[4], int reserved, int *err) { int target, i; unsigned long count = 0; @@ -532,7 +534,8 @@ static int ext4_alloc_blocks(handle_t *h while (1) { count = target; /* allocating blocks for indirect blocks and direct blocks */ - current_block = ext4_new_blocks(handle,inode,goal,count,err); + current_block = ext4_new_blocks(handle, inode, goal, + count, reserved, err); if (*err) goto failed_out; @@ -567,6 +570,8 @@ failed_out: * @blks: number of allocated direct blocks * @offsets: offsets (in the blocks) to store the pointers to next. * @branch: place to store the chain in. + * @reserved: flagging if blocks already reserved for allocation + *avoid doing accounting on free blocks * * This function allocates blocks, zeroes out all but the last one, * links them into chain and (if we are synchronous) writes them to disk. @@ -587,7 +592,7 @@ failed_out: */ static int ext4_alloc_branch(handle_t *handle, struct inode *inode, int indirect_blks, int *blks, ext4_fsblk_t goal, - ext4_lblk_t *offsets, Indirect *branch) + ext4_lblk_t *offsets, Indirect *branch, int reserved) { int blocksize = inode-i_sb-s_blocksize; int i, n = 0; @@ -598,7 +603,7 @@ static int ext4_alloc_branch(handle_t *h ext4_fsblk_t current_block; num = ext4_alloc_blocks(handle, inode, goal, indirect_blks, - *blks, new_blocks, err); + *blks, new_blocks
[RFC] Delayed allocation updates
Hi, ext4 patch queue has keeping a set of delayed allocation for a while. The know todo is add reservation and ordered mode support. FYI, The base delayed allocation patches are located at VFS changes: http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=delalloc-vfs.patch;h=713004ca3f01ae9e7ace289c596ce25901b6c28a;hb=dc9fa932e786df0f2e18402e990926be667484ac ext4 support: http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=delalloc-ext4.patch;h=67f8e2802f52e52497a9a8360a1d979bc60c2142;hb=dc9fa932e786df0f2e18402e990926be667484ac Here is the updates to the delayed allocation patches in the ext4 patch queue. - update delayed allocation with current VFS API write_begin/write_end - general handling preallocation - add block reservation to handle ENOSPC error Patches will follow. compiles, plan to add to the patch later. Comments? I looked at ordered mode briefly but not entirely clear how that could be done without a new callback function. Perhaps Alex has some thoughts? Cheers, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH]ext4: use write_begin/write_end in delalloc instead of prepare_write()
Since generally the current mainline VFS API changed to use write-begin() and write_end(), updating ext4 delayed allocation to adopt this change. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/inode.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) Index: linux-2.6.25-rc2/fs/ext4/inode.c === --- linux-2.6.25-rc2.orig/fs/ext4/inode.c 2008-02-21 17:30:59.0 -0800 +++ linux-2.6.25-rc2/fs/ext4/inode.c2008-02-22 12:17:16.0 -0800 @@ -1421,13 +1421,6 @@ static int ext4_da_get_block_prep(struct return ret; } - -static int ext4_da_prepare_write(struct file *file, struct page *page, -unsigned from, unsigned to) -{ - return block_prepare_write(page, from, to, ext4_da_get_block_prep); -} - static int ext4_da_get_block_write(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) { @@ -1485,11 +1478,34 @@ out: } static int ext4_da_writepages(struct address_space *mapping, - struct writeback_control *wbc) + struct writeback_control *wbc) { return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write); } +static int ext4_da_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + int ret; + struct page *page; + pgoff_t index; + unsigned from, to; + + index = pos PAGE_CACHE_SHIFT; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + + ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ext4_da_get_block_prep); + return ret; +} + static void ext4_da_invalidatepage(struct page *page, unsigned long offset) { struct buffer_head *head, *bh; @@ -1990,8 +2006,8 @@ static const struct address_space_operat .writepage = ext4_writeback_writepage, .writepages = ext4_da_writepages, .sync_page = block_sync_page, - .prepare_write = ext4_da_prepare_write, - .commit_write = generic_commit_write, + .write_begin= ext4_da_write_begin, + .write_end = generic_write_end, .bmap = ext4_bmap, .invalidatepage = ext4_da_invalidatepage, .releasepage= ext4_releasepage, - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: general handling preallocated blocks in delayed allocation
With delayed allocation, get_block() is only doing block map at the write_begin() time. If the blocks are prea-allocated, the result bh is not mapped, but the blocks are actually being allocated. delalloc should not treat it as other regular unallocated area, thus mark it as need block allocation later, and doing extra reservation incorrectly. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/inode.c | 26 -- 1 file changed, 12 insertions(+), 14 deletions(-) Index: linux-2.6.25-rc2/fs/ext4/inode.c === --- linux-2.6.25-rc2.orig/fs/ext4/inode.c 2008-02-19 14:57:00.0 -0800 +++ linux-2.6.25-rc2/fs/ext4/inode.c2008-02-19 15:04:56.0 -0800 @@ -1401,20 +1401,18 @@ static int ext4_da_get_block_prep(struct * XXX: when the filesystem has a lot of free blocks, we could * reserve even allocated blocks to save this lookup */ ret = ext4_get_blocks_wrap(NULL, inode, iblock, 1, bh_result, 0, 0); - if (ret = 0) { - if (buffer_mapped(bh_result)) { - bh_result-b_size = (ret inode-i_blkbits); - } else { - /* the block isn't allocated yet, let's reserve space */ - /* XXX: call reservation here */ - /* -* XXX: __block_prepare_write() unmaps passed block, -* is it OK? -*/ - map_bh(bh_result, inode-i_sb, 0); - set_buffer_new(bh_result); - set_buffer_delay(bh_result); - } + if (ret == 0) { + /* the block isn't allocated yet, let's reserve space */ + /* XXX: call reservation here */ + /* +* XXX: __block_prepare_write() unmaps passed block, +* is it OK? +*/ + map_bh(bh_result, inode-i_sb, 0); + set_buffer_new(bh_result); + set_buffer_delay(bh_result); + } else if ((ret 0) (buffer_mapped(bh_result))) { + bh_result-b_size = (ret inode-i_blkbits); ret = 0; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Tue, 2008-02-19 at 09:13 +0530, Aneesh Kumar K.V wrote: We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Hi Aneesh, I have a concern, it seems we missed journalling the allocation activity for the mmaped write. See comments below... Another thing, perhaps similar patch should be ported to ext2/3, as this also addressed the mmaped write ENOSPC error without preallocation/deleyed allocation. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/file.c | 19 ++- fs/ext4/inode.c |6 ++ include/linux/ext4_fs.h |1 + 3 files changed, 25 insertions(+), 1 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 20507a2..77341c1 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -123,6 +123,23 @@ force_commit: return ret; } +static struct vm_operations_struct ext4_file_vm_ops = { + .fault = filemap_fault, + .page_mkwrite = ext4_page_mkwrite, +}; + +static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct address_space *mapping = file-f_mapping; + + if (!mapping-a_ops-readpage) + return -ENOEXEC; + file_accessed(file); + vma-vm_ops = ext4_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + return 0; +} + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext4_file_mmap, .open = generic_file_open, .release= ext4_release_file, .fsync = ext4_sync_file, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 34f3eb6..81faa67 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) return err; } + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + return block_page_mkwrite(vma, page, ext4_get_block); +} + I don't see block allocation being journalled here. block_page_mkwrite() eventually calling block_prepare_write() which invokes ext4_get_block() without starting a new journal handle. Perhaps call ext4 write_begin() and write_end() inode operations, that would taking care of different write_begin and write_end for three different journalling mode. Mingming diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index 22810b1..8f5a563 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_block_truncate_page(handle_t *handle, struct page *page, struct address_space *mapping, loff_t from); +extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full
Hi Aneesh, It's a good start, a few comments below.. On Fri, 2008-02-22 at 00:47 +0530, Aneesh Kumar K.V wrote: From 6a73edd4dbb32344e6a83ebdc07edd0e96d376bd Mon Sep 17 00:00:00 2001 From: Aneesh Kumar K.V [EMAIL PROTECTED] Date: Thu, 21 Feb 2008 23:57:38 +0530 Subject: [PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full A write to prealloc area cause the split of unititalized extent into a initialized and uninitialized extent. If we don't have space to add new extent information instead of returning error convert the existing uninitialized extent to initialized one. We need to zero out the blocks corresponding to the extent to prevent wrong data reaching userspace. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 135 - 1 files changed, 133 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index b179b03..d37c14e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2137,6 +2137,103 @@ void ext4_ext_release(struct super_block *sb) #endif } +static int ext4_ext_zero_out(handle_t *handle, struct inode *inode, + ext4_lblk_t iblock, struct ext4_extent *ex) +{ + ext4_lblk_t ee_block; + unsigned int ee_len, blkcount, blocksize; + loff_t pos; + pgoff_t index, skip_index; + unsigned long offset; + struct page *page; + struct address_space *mapping = inode-i_mapping; + struct buffer_head *head, *bh; + int err = 0; + + ee_block = le32_to_cpu(ex-ee_block); + ee_len = blkcount = ext4_ext_get_actual_len(ex); + blocksize = inode-i_sb-s_blocksize; + + /* + * find the skip index. We can't call __grab_cache_page for this + * because we are in the writeout of this page and we already have + * taken the lock on this page + */ + pos = iblock inode-i_blkbits; + skip_index = pos PAGE_CACHE_SHIFT; + We should not need to look up the page cache to do the zero out. The approach I had thought is just zero it out on disk. + while (blkcount) { + pos = (ee_block + ee_len - blkcount) inode-i_blkbits; + index = pos PAGE_CACHE_SHIFT; + offset = (pos (PAGE_CACHE_SIZE - 1)); + if (index == skip_index) { + /* Page will already be locked in the writepage */ + read_lock_irq(mapping-tree_lock); + page = radix_tree_lookup(mapping-page_tree, index); + read_unlock_irq(mapping-tree_lock); + if (page) + page_cache_get(page); + else + return -ENOMEM; + } else { + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + } + I the page is already locked before calling get_block() via writepage(), isn't it? and the journal transaction already started... + if (!page_has_buffers(page)) + create_empty_buffers(page, blocksize, 0); + + head = page_buffers(page); + /* Look for the buffer_head which map the block */ + bh = head; + while (offset 0) { + bh = bh-b_this_page; + offset -= blocksize; + } + offset = (pos (PAGE_CACHE_SIZE - 1)); + + /* Now write all the buffer_heads in the page */ + do { + set_buffer_uptodate(bh); + if (ext4_should_journal_data(inode)) { + err = ext4_journal_get_write_access(handle, bh); + /* do we have that many credits ??*/ + if (err) + goto err_out; + } + zero_user(page, offset, blocksize); Ah oh, you are trying to zero out the pages in the page cache, that's seems wrong to me. By the time get_block() is called from writepages(), the pages should have meaningful content that needs to flush to disk, zero the pages out will lost the data. + offset += blocksize; + if (ext4_should_journal_data(inode)) { + err = ext4_journal_dirty_metadata(handle, bh); + if (err) + goto err_out; + } else { + if (ext4_should_order_data(inode)) { + err = ext4_journal_dirty_data(handle, + bh); + if (err) + goto
[PATCH] ext4: remove extra defination of ext4_new_blocks_old()
ext4_new_blocks_old() is already externed in ext4_fs.h, remove the extra define from mballoc.c. Trivial cleanup, but this cause extra maintain work if later we need to modify this function. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/mballoc.c |2 -- 1 file changed, 2 deletions(-) Index: linux-2.6.25-rc2/fs/ext4/mballoc.c === --- linux-2.6.25-rc2.orig/fs/ext4/mballoc.c 2008-02-21 17:23:26.0 -0800 +++ linux-2.6.25-rc2/fs/ext4/mballoc.c 2008-02-21 17:23:54.0 -0800 @@ -576,8 +576,6 @@ static void ext4_mb_store_history(struct static struct proc_dir_entry *proc_root_ext4; struct buffer_head *read_block_bitmap(struct super_block *, ext4_group_t); -ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode, - ext4_fsblk_t goal, unsigned long *count, int *errp); static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, ext4_group_t group); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: ext4_find_next_zero_bit needs an aligned address on some arch
On Wed, 2008-02-20 at 08:49 -0600, Eric Sandeen wrote: Aneesh Kumar K.V wrote: ext4_find_next_zero_bit and ext4_find_next_bit needs a long aligned address on x8_64. Add mb_find_next_zero_bit and mb_find_next_bit and use them in the mballoc. Fix: https://bugzilla.redhat.com/show_bug.cgi?id=433286 Eric Sandeen debugged the problem and suggested the fix. Also, Ted Mingming: we probably should get this into 2.6.25; at least w/ the way the Fedora kernel is configured, ext4 is pretty much DOA w/o this change. I'm not sure why it started showing up now, but it is, in a big way. :) Acked, and added to ext4 patch queue. Thanks, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] ext4: use ext4_group_first_block_no()
On Sun, 2008-02-17 at 15:08 +0900, Akinobu Mita wrote: Use ext4_group_first_block_no() and assign the return values to ext2_fsblk_t variables. Acked for ext2/3/4 patches(except a little typo in above change log: ext4_fsblk_t). ext4 patch is queued in ext4 patch queue http://repo.or.cz/w/ext4-patch-queue.git Mingming Signed-off-by: Akinobu Mita [EMAIL PROTECTED] Cc: Stephen Tweedie [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Andrew Morton [EMAIL PROTECTED] Cc: Mingming Cao [EMAIL PROTECTED] Cc: Theodore Tso [EMAIL PROTECTED] --- fs/ext4/balloc.c |6 +++--- fs/ext4/xattr.c |6 ++ 2 files changed, 5 insertions(+), 7 deletions(-) Index: 2.6-rc/fs/ext4/xattr.c === --- 2.6-rc.orig/fs/ext4/xattr.c +++ 2.6-rc/fs/ext4/xattr.c @@ -808,10 +808,8 @@ inserted: get_bh(new_bh); } else { /* We need to allocate a new block */ - ext4_fsblk_t goal = le32_to_cpu( - EXT4_SB(sb)-s_es-s_first_data_block) + - (ext4_fsblk_t)EXT4_I(inode)-i_block_group * - EXT4_BLOCKS_PER_GROUP(sb); + ext4_fsblk_t goal = ext4_group_first_block_no(sb, + EXT4_I(inode)-i_block_group); ext4_fsblk_t block = ext4_new_block(handle, inode, goal, error); if (error) Index: 2.6-rc/fs/ext4/balloc.c === --- 2.6-rc.orig/fs/ext4/balloc.c +++ 2.6-rc/fs/ext4/balloc.c @@ -48,7 +48,6 @@ void ext4_get_group_no_and_offset(struct unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh, ext4_group_t block_group, struct ext4_group_desc *gdp) { - unsigned long start; int bit, bit_max; unsigned free_blocks, group_blocks; struct ext4_sb_info *sbi = EXT4_SB(sb); @@ -106,11 +105,12 @@ unsigned ext4_init_block_bitmap(struct s free_blocks = group_blocks - bit_max; if (bh) { + ext4_fsblk_t start; + for (bit = 0; bit bit_max; bit++) ext4_set_bit(bit, bh-b_data); - start = block_group * EXT4_BLOCKS_PER_GROUP(sb) + - le32_to_cpu(sbi-s_es-s_first_data_block); + start = ext4_group_first_block_no(sb, block_group); /* Set bits for block and inode bitmaps, and inode table */ ext4_set_bit(ext4_block_bitmap(sb, gdp) - start, bh-b_data); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] ext4: use ext4_get_group_desc()
On Sun, 2008-02-17 at 15:13 +0900, Akinobu Mita wrote: Use ext4_get_group_desc() Acked and added to ext4 patch queue. Signed-off-by: Akinobu Mita [EMAIL PROTECTED] Cc: Stephen Tweedie [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Andrew Morton [EMAIL PROTECTED] Cc: Mingming Cao [EMAIL PROTECTED] Cc: Theodore Tso [EMAIL PROTECTED] --- fs/ext4/inode.c | 20 +++- 1 file changed, 3 insertions(+), 17 deletions(-) Index: 2.6-rc/fs/ext4/inode.c === --- 2.6-rc.orig/fs/ext4/inode.c +++ 2.6-rc/fs/ext4/inode.c @@ -2455,12 +2455,10 @@ out_stop: static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb, unsigned long ino, struct ext4_iloc *iloc) { - unsigned long desc, group_desc; ext4_group_t block_group; unsigned long offset; ext4_fsblk_t block; - struct buffer_head *bh; - struct ext4_group_desc * gdp; + struct ext4_group_desc *gdp; if (!ext4_valid_inum(sb, ino)) { /* @@ -2472,22 +2470,10 @@ static ext4_fsblk_t ext4_get_inode_block } block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); - if (block_group = EXT4_SB(sb)-s_groups_count) { - ext4_error(sb,ext4_get_inode_block,group = groups count); + gdp = ext4_get_group_desc(sb, block_group, NULL); + if (!gdp) return 0; - } - smp_rmb(); - group_desc = block_group EXT4_DESC_PER_BLOCK_BITS(sb); - desc = block_group (EXT4_DESC_PER_BLOCK(sb) - 1); - bh = EXT4_SB(sb)-s_group_desc[group_desc]; - if (!bh) { - ext4_error (sb, ext4_get_inode_block, - Descriptor not loaded); - return 0; - } - gdp = (struct ext4_group_desc *)((__u8 *)bh-b_data + - desc * EXT4_DESC_SIZE(sb)); /* * Figure out the offset within the block group inode table */ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] ext4: check ext4_journal_get_write_access() errors
On Sun, 2008-02-17 at 15:15 +0900, Akinobu Mita wrote: Check ext4_journal_get_write_access() errors. Acked and added to ext4 patch queue Mingming Signed-off-by: Akinobu Mita [EMAIL PROTECTED] Cc: Stephen Tweedie [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Andrew Morton [EMAIL PROTECTED] Cc: Mingming Cao [EMAIL PROTECTED] Cc: Theodore Tso [EMAIL PROTECTED] --- fs/ext4/namei.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) Index: 2.6-rc/fs/ext4/namei.c === --- 2.6-rc.orig/fs/ext4/namei.c +++ 2.6-rc/fs/ext4/namei.c @@ -57,10 +57,15 @@ static struct buffer_head *ext4_append(h *block = inode-i_size inode-i_sb-s_blocksize_bits; - if ((bh = ext4_bread(handle, inode, *block, 1, err))) { + bh = ext4_bread(handle, inode, *block, 1, err); + if (bh) { inode-i_size += inode-i_sb-s_blocksize; EXT4_I(inode)-i_disksize = inode-i_size; - ext4_journal_get_write_access(handle,bh); + *err = ext4_journal_get_write_access(handle, bh); + if (*err) { + brelse(bh); + bh = NULL; + } } return bh; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: When reading from fallocated blocks make sure we return zero.
On Tue, 2008-02-19 at 08:49 +0530, Aneesh Kumar K.V wrote: On Mon, Feb 18, 2008 at 04:14:34PM -0800, Mingming Cao wrote: On Sat, 2008-02-16 at 08:53 +0530, Aneesh Kumar K.V wrote: How about the following patch? Regards, Mingming ext4: ext4_get_blocks_wrap fix for writing to preallocated From: Mingming Cao [EMAIL PROTECTED] This patch fixed a issue with wrting to a preallocated blocks. A write hit a BUG_ON() in fs/buffer.c saying the buffer is not mapped. On the write path, ext4_get_block_wrap() is called with create=1, but it will pass create=0 down to the underlying ext4ext_get_blocks() to do a look up first. In the preallocation case, ext4_ext_get_blocks() with create = 0, will return number of blocks pre-allocated and buffer head unmapped. ext4_get_blocks_wrap() thinks it succeeds too early, without checking if it needs again call ext4_ext_get_blocks with create = 1 which would do proper handling for writing to preallocated blocks: split the extent to initialized and uninitialized one and returns the mapped buffer head. Treating preallocated blocks as holes equally(i.e. ignoring the number of blocks pre-allocated and returns 0) when get_blocks() is called with create = 0 is not enough. ext4_ext_get_blocks() needs to differentiate these two cases for delayed allocation purpose, as for holes it need to do reservation and prepare for later delayed allocation, but for pre-allocated blocks it needs skip that work. It would makes things more clear if we have clear definition of what get_blocks() return value means. Similar to ext4_get_blocks_handle(), the following * return 0, # of blocks already allocated * if these are pre-allocated blocks and create = 0 * buffer head is unmapped * otherwise blocks are mapped. * * return = 0, if plain look up failed (blocks have not been allocated) * buffer head is unmapped * * return 0, error case. The for the write path, at ext4_ext_get_blocks_wrap(), it could check the buffer_mapped() status for preallocated extent before quit too early. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Reviewed-by: Aneesh Kumar K.V [EMAIL PROTECTED] I guess we also need to make sure the buffer head have the mapped bit set. Something like the patch below. Good point. I modified the patch with clear_buffer_mapped() added at the begining of the wrapper function. Mingming diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index bc7081f..69ccda9 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2294,6 +2294,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, struct ext4_allocation_request ar; __clear_bit(BH_New, bh_result-b_state); + __clear_bit(BH_Mapped, bh_result-b_state); ext_debug(blocks %u/%lu requested for inode %u\n, iblock, max_blocks, inode-i_ino); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Tue, 2008-02-19 at 09:13 +0530, Aneesh Kumar K.V wrote: We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Acked. Added to patch queue. Thanks, Mingming Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/file.c | 19 ++- fs/ext4/inode.c |6 ++ include/linux/ext4_fs.h |1 + 3 files changed, 25 insertions(+), 1 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 20507a2..77341c1 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -123,6 +123,23 @@ force_commit: return ret; } +static struct vm_operations_struct ext4_file_vm_ops = { + .fault = filemap_fault, + .page_mkwrite = ext4_page_mkwrite, +}; + +static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct address_space *mapping = file-f_mapping; + + if (!mapping-a_ops-readpage) + return -ENOEXEC; + file_accessed(file); + vma-vm_ops = ext4_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + return 0; +} + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext4_file_mmap, .open = generic_file_open, .release= ext4_release_file, .fsync = ext4_sync_file, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 34f3eb6..81faa67 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) return err; } + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + return block_page_mkwrite(vma, page, ext4_get_block); +} + diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index 22810b1..8f5a563 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_block_truncate_page(handle_t *handle, struct page *page, struct address_space *mapping, loff_t from); +extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] ext4: Don't mark the filesystem with errors if we fail to fallocate.
On Tue, 2008-02-19 at 09:14 +0530, Aneesh Kumar K.V wrote: IF we fail allocate blocks don't call ext4_error. Also don't hide errors from ext4_get_blocks_wrap Aced. And added to patch queue. Thanks, Mingming Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 ++- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 5b22f71..9a27e88 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2653,13 +2653,14 @@ retry: ret = ext4_get_blocks_wrap(handle, inode, block, max_blocks, map_bh, EXT4_CREATE_UNINITIALIZED_EXT, 0); - WARN_ON(ret = 0); if (ret = 0) { - ext4_error(inode-i_sb, ext4_fallocate, - ext4_ext_get_blocks returned error: - inode#%lu, block=%u, max_blocks=%lu, +#ifdef EXT4FS_DEBUG + WARN_ON(ret = 0); + printk(KERN_ERR %s: ext4_ext_get_blocks + returned error inode#%lu, block=%u, + max_blocks=%lu, __FUNCTION__, inode-i_ino, block, max_blocks); - ret = -EIO; +#endif ext4_mark_inode_dirty(handle, inode); ret2 = ext4_journal_stop(handle); break; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] don't set extents flag for _any_ symlinks
Thanks, Added to patch queue On Mon, 2008-02-18 at 16:27 -0600, Eric Sandeen wrote: As symlinks are limited to a single block anyway, and e2fsck doesn't expect to find it set, don't set the extents flag on any type of symlinks at all: fast/in-inode, or the external-block flavor. There are a lot of filesystems out there by now w/ exent-style symlink blocks though, so e2fsck should probably be able to repair that at some point... Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24/fs/ext4/namei.c === --- linux-2.6.24.orig/fs/ext4/namei.c +++ linux-2.6.24/fs/ext4/namei.c @@ -2223,7 +2226,6 @@ retry: inode-i_op = ext4_fast_symlink_inode_operations; memcpy((char*)EXT4_I(inode)-i_data,symname,l); inode-i_size = l-1; - EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; } EXT4_I(inode)-i_disksize = inode-i_size; err = ext4_add_nondir(handle, dentry, inode); Index: linux-2.6.24/fs/ext4/ialloc.c === --- linux-2.6.24.orig/fs/ext4/ialloc.c +++ linux-2.6.24/fs/ext4/ialloc.c @@ -744,7 +744,7 @@ got: ext4_std_error(sb, err); goto fail_free_drop; } - if (test_opt(sb, EXTENTS)) { + if (test_opt(sb, EXTENTS) !S_ISLNK(mode)) { EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; ext4_ext_tree_init(handle, inode); err = ext4_update_incompat_feature(handle, sb, - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] clear extents flag on inodes created in ext4_mknod
Acked and added to the patch queue. On Mon, 2008-02-18 at 15:00 -0600, Eric Sandeen wrote: e2fsck doesn't expect to find char, block, fifo, or socket files with the extent flag set, so clear that in ext4_mknod. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24/fs/ext4/namei.c === --- linux-2.6.24.orig/fs/ext4/namei.c +++ linux-2.6.24/fs/ext4/namei.c @@ -1766,6 +1766,7 @@ retry: #ifdef CONFIG_EXT4DEV_FS_XATTR inode-i_op = ext4_special_inode_operations; #endif + EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; err = ext4_add_nondir(handle, dentry, inode); } ext4_journal_stop(handle); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] ext4: add missing ext4_journal_stop()
On Sun, 2008-02-17 at 15:11 +0900, Akinobu Mita wrote: Add missing ext4_journal_stop() in error handling. Reviewed and Acked ext2/3/4 changes. ext4 patch is also queued in ext4 patch queue http://repo.or.cz/w/ext4-patch-queue.git Mingming Signed-off-by: Akinobu Mita [EMAIL PROTECTED] Cc: Stephen Tweedie [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Andrew Morton [EMAIL PROTECTED] Cc: Mingming Cao [EMAIL PROTECTED] Cc: Theodore Tso [EMAIL PROTECTED] --- fs/ext4/resize.c |1 + 1 file changed, 1 insertion(+) Index: 2.6-rc/fs/ext4/resize.c === --- 2.6-rc.orig/fs/ext4/resize.c +++ 2.6-rc/fs/ext4/resize.c @@ -1037,6 +1037,7 @@ int ext4_group_extend(struct super_block ext4_warning(sb, __FUNCTION__, multiple resizers run on filesystem!); unlock_super(sb); + ext4_journal_stop(handle); err = -EBUSY; goto exit_put; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext4: Fix the memory leak and buffer head leak with respect to ext4_ext_path
On Thu, 2008-02-14 at 16:00 +0530, Aneesh Kumar K.V wrote: The path variable returned via ext4_ext_find_extent is a kmalloc variable and need to be freeded. It also contain refrences to buffer_head which need to be dropped. Acked. Mingming Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |6 +++--- fs/ext4/migrate.c |5 + include/linux/ext4_fs_extents.h |1 + 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index e856f66..995ac16 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -349,7 +349,7 @@ static void ext4_ext_show_leaf(struct inode *inode, struct ext4_ext_path *path) #define ext4_ext_show_leaf(inode,path) #endif -static void ext4_ext_drop_refs(struct ext4_ext_path *path) +void ext4_ext_drop_refs(struct ext4_ext_path *path) { int depth = path-p_depth; int i; @@ -2200,10 +2200,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newdepth = ext_depth(inode); if (newdepth != depth) { depth = newdepth; - path = ext4_ext_find_extent(inode, iblock, NULL); + ext4_ext_drop_refs(path); + path = ext4_ext_find_extent(inode, iblock, path); if (IS_ERR(path)) { err = PTR_ERR(path); - path = NULL; goto out; } eh = path[depth].p_hdr; diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 8c6c685..5c1e27d 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -43,6 +43,7 @@ static int finish_range(handle_t *handle, struct inode *inode, if (IS_ERR(path)) { retval = PTR_ERR(path); + path = NULL; goto err_out; } @@ -74,6 +75,10 @@ static int finish_range(handle_t *handle, struct inode *inode, } retval = ext4_ext_insert_extent(handle, inode, path, newext); err_out: + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } lb-first_pblock = 0; return retval; } diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h index 697da4b..1285c58 100644 --- a/include/linux/ext4_fs_extents.h +++ b/include/linux/ext4_fs_extents.h @@ -227,5 +227,6 @@ extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); extern int ext4_ext_search_right(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); +extern void ext4_ext_drop_refs(struct ext4_ext_path *); #endif /* _LINUX_EXT4_EXTENTS */ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext4: Request for journal write access early.
On Thu, 2008-02-14 at 16:00 +0530, Aneesh Kumar K.V wrote: In ext4_ext_convert_to_initialized before we need to request for journal write access before we even modify the extent length. Acked. Thanks Mingming Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 --- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 995ac16..c4d6f19 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2168,6 +2168,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newblock = iblock - ee_block + ext_pblock(ex); ex2 = ex; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* ex1: ee_block to iblock - 1 : uninitialized */ if (iblock ee_block) { ex1 = ex; @@ -2210,6 +2214,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex = path[depth].p_ext; if (ex2 != newex) ex2 = ex; + + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; } allocated = max_blocks; } @@ -2230,9 +2238,6 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex2-ee_len = cpu_to_le16(allocated); if (ex2 != ex) goto insert; - err = ext4_ext_get_access(handle, inode, path + depth); - if (err) - goto out; /* * New (initialized) extent starts from the first block * in the current extent. i.e., ex2 == ex - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext4: Fix the memory leak and buffer head leak with respect to ext4_ext_path
Added in ext4 patch queue, Thanks On Thu, 2008-02-14 at 16:00 +0530, Aneesh Kumar K.V wrote: The path variable returned via ext4_ext_find_extent is a kmalloc variable and need to be freeded. It also contain refrences to buffer_head which need to be dropped. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |6 +++--- fs/ext4/migrate.c |5 + include/linux/ext4_fs_extents.h |1 + 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index e856f66..995ac16 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -349,7 +349,7 @@ static void ext4_ext_show_leaf(struct inode *inode, struct ext4_ext_path *path) #define ext4_ext_show_leaf(inode,path) #endif -static void ext4_ext_drop_refs(struct ext4_ext_path *path) +void ext4_ext_drop_refs(struct ext4_ext_path *path) { int depth = path-p_depth; int i; @@ -2200,10 +2200,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newdepth = ext_depth(inode); if (newdepth != depth) { depth = newdepth; - path = ext4_ext_find_extent(inode, iblock, NULL); + ext4_ext_drop_refs(path); + path = ext4_ext_find_extent(inode, iblock, path); if (IS_ERR(path)) { err = PTR_ERR(path); - path = NULL; goto out; } eh = path[depth].p_hdr; diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 8c6c685..5c1e27d 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -43,6 +43,7 @@ static int finish_range(handle_t *handle, struct inode *inode, if (IS_ERR(path)) { retval = PTR_ERR(path); + path = NULL; goto err_out; } @@ -74,6 +75,10 @@ static int finish_range(handle_t *handle, struct inode *inode, } retval = ext4_ext_insert_extent(handle, inode, path, newext); err_out: + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } lb-first_pblock = 0; return retval; } diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h index 697da4b..1285c58 100644 --- a/include/linux/ext4_fs_extents.h +++ b/include/linux/ext4_fs_extents.h @@ -227,5 +227,6 @@ extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); extern int ext4_ext_search_right(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); +extern void ext4_ext_drop_refs(struct ext4_ext_path *); #endif /* _LINUX_EXT4_EXTENTS */ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext4: Request for journal write access early.
Added in ext4 patch queue, Thanks On Thu, 2008-02-14 at 16:00 +0530, Aneesh Kumar K.V wrote: In ext4_ext_convert_to_initialized before we need to request for journal write access before we even modify the extent length. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 --- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 995ac16..c4d6f19 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2168,6 +2168,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newblock = iblock - ee_block + ext_pblock(ex); ex2 = ex; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* ex1: ee_block to iblock - 1 : uninitialized */ if (iblock ee_block) { ex1 = ex; @@ -2210,6 +2214,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex = path[depth].p_ext; if (ex2 != newex) ex2 = ex; + + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; } allocated = max_blocks; } @@ -2230,9 +2238,6 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex2-ee_len = cpu_to_le16(allocated); if (ex2 != ex) goto insert; - err = ext4_ext_get_access(handle, inode, path + depth); - if (err) - goto out; /* * New (initialized) extent starts from the first block * in the current extent. i.e., ex2 == ex - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH resend] ext2/3/4: convert byte order of constant instead of variable
On Thu, 2008-02-14 at 14:20 -0800, Andrew Morton wrote: On Sun, 10 Feb 2008 11:10:15 +0100 Marcin Slusarz [EMAIL PROTECTED] wrote: fs/ext2/super.c |8 +++- fs/ext3/super.c |2 +- fs/ext4/super.c |2 +- Please don't bundle the filesystem patches in this manner. I split it into three patches. Andrew, Ted, I added the ext4 patch in the ext4 patch queue. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: modify block allocation algorithm for the last group
On Wed, 2008-02-13 at 17:26 +0100, Valerie Clement wrote: Modify the block allocation algorithm for the last group From: Valerie Clement [EMAIL PROTECTED] When a directory inode is allocated in the last group and the last group contains less than s_blocks_per_group blocks, the initial block allocated for the directory is not always allocated in the same group as the directory inode, but in one of the first groups of the filesystem (group 1 for example). Depending on the current process's pid, ext4_find_near() and ext4_ext_find_goal() can return a block number greater than the maximum blocks count in the filesystem and in that case the block will be not allocated in the same group as the inode. The following patch fixes the problem. Comments? Looks sane to me. Should the modification also be done in ext2/3 code? I think so. Mingming Signed-off-by: Valerie Clement [EMAIL PROTECTED] --- fs/ext4/extents.c |8 +++- fs/ext4/inode.c |8 +++- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 5c4af51..1391ded 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -148,6 +148,7 @@ ext4_fsblk_t ext4_ext_find_goal(struct inode *inode, { struct ext4_inode_info *ei = EXT4_I(inode); ext4_fsblk_t bg_start; + ext4_fsblk_t last_block; ext4_grpblk_t colour; int depth; @@ -169,8 +170,13 @@ ext4_fsblk_t ext4_ext_find_goal(struct inode *inode, /* OK. use inode's group */ bg_start = (ei-i_block_group * EXT4_BLOCKS_PER_GROUP(inode-i_sb)) + le32_to_cpu(EXT4_SB(inode-i_sb)-s_es-s_first_data_block); - colour = (current-pid % 16) * + last_block = ext4_blocks_count(EXT4_SB(inode-i_sb)-s_es) - 1; + + if (bg_start + EXT4_BLOCKS_PER_GROUP(inode-i_sb) = last_block) + colour = (current-pid % 16) * (EXT4_BLOCKS_PER_GROUP(inode-i_sb) / 16); + else + colour = (current-pid % 16) * ((last_block - bg_start) / 16); return bg_start + colour + block; } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 51bf3b5..c4f8ace 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -405,6 +405,7 @@ static ext4_fsblk_t ext4_find_near(struct inode *inode, Indirect *ind) __le32 *start = ind-bh ? (__le32*) ind-bh-b_data : ei-i_data; __le32 *p; ext4_fsblk_t bg_start; + ext4_fsblk_t last_block; ext4_grpblk_t colour; /* Try to find previous block */ @@ -422,8 +423,13 @@ static ext4_fsblk_t ext4_find_near(struct inode *inode, Indirect *ind) * into the same cylinder group then. */ bg_start = ext4_group_first_block_no(inode-i_sb, ei-i_block_group); - colour = (current-pid % 16) * + last_block = ext4_blocks_count(EXT4_SB(inode-i_sb)-s_es) - 1; + + if (bg_start + EXT4_BLOCKS_PER_GROUP(inode-i_sb) = last_block) + colour = (current-pid % 16) * (EXT4_BLOCKS_PER_GROUP(inode-i_sb) / 16); + else + colour = (current-pid % 16) * ((last_block - bg_start) / 16); return bg_start + colour; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] allocate struct ext4_allocation_context from a kmem cache to save stack space
On Wed, 2008-02-06 at 11:05 -0600, Eric Sandeen wrote: struct ext4_allocation_context is rather large, and this bloats the stack of many functions which use it. Allocating it from a named slab cache will alleviate this. For example, with this change (on top of the noinline patch sent earlier): -ext4_mb_new_blocks 200 +ext4_mb_new_blocks40 -ext4_mb_free_blocks 344 +ext4_mb_free_blocks 168 -ext4_mb_release_inode_pa 216 +ext4_mb_release_inode_pa 40 -ext4_mb_release_group_pa 192 +ext4_mb_release_group_pa 24 Most of these stack-allocated structs are actually used only for mballoc history; and in those cases often a smaller struct would do. So changing that may be another way around it, at least for those functions, if preferred. For now, in those cases where the ac is only for history, an allocation failure simply skips the history recording, and does not cause any other failures. Comments? Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24-rc6-mm1/fs/ext4/mballoc.c === --- linux-2.6.24-rc6-mm1.orig/fs/ext4/mballoc.c +++ linux-2.6.24-rc6-mm1/fs/ext4/mballoc.c @@ -419,6 +419,7 @@ #define MB_DEFAULT_GROUP_PREALLOC512 static struct kmem_cache *ext4_pspace_cachep; +static struct kmem_cache *ext4_ac_cachep; #ifdef EXT4_BB_MAX_BLOCKS #undef EXT4_BB_MAX_BLOCKS @@ -2958,11 +2959,18 @@ int __init init_ext4_mballoc(void) if (ext4_pspace_cachep == NULL) return -ENOMEM; -#ifdef CONFIG_PROC_FS + ext4_ac_cachep = + kmem_cache_create(ext4_alloc_context, + sizeof(struct ext4_allocation_context), + 0, SLAB_RECLAIM_ACCOUNT, NULL); + if (ext4_ac_cachep == NULL) { + kmem_cache_destroy(ext4_pspace_cachep); + return -ENOMEM; + } + proc_root_ext4 = proc_mkdir(EXT4_ROOT, proc_root_fs); if (proc_root_ext4 == NULL) printk(KERN_ERR EXT4-fs: Unable to create %s\n, EXT4_ROOT); -#endif return 0; } @@ -2971,9 +2979,8 @@ void exit_ext4_mballoc(void) { /* XXX: synchronize_rcu(); */ kmem_cache_destroy(ext4_pspace_cachep); -#ifdef CONFIG_PROC_FS + kmem_cache_destroy(ext4_ac_cachep); remove_proc_entry(EXT4_ROOT, proc_root_fs); -#endif } Do you intend to remove the #ifdef CONFIG_PROC_FS, or it's a accident? I think we need keep that to allow ext4 build without procfs configured. Other than this, the patch looks fine to me.:) Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG_ON at mballoc.c:3752
On Thu, 2008-02-07 at 18:25 +0530, Aneesh Kumar K.V wrote: ext4: Don't panic in case of corrupt bitmap From: Aneesh Kumar K.V [EMAIL PROTECTED] Multiblock allocator was calling BUG_ON in many case if the free and used blocks count obtained looking at the bitmap is different from what the allocator internally accounted for. Use ext4_error in such case and don't panic the system. There seems a lot of BUG_ON() and BUG() in mballoc code, other than this case. Should it always panic the whole system in those cases? Perhaps replacing with ext4_error() or some cases just WARN_ON is enough. Mingming Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 35 +-- 1 files changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 06d1f52..656729b 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -680,7 +680,6 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) { char *bb; - /* FIXME!! is this needed */ BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b)); BUG_ON(max == NULL); @@ -964,7 +963,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb, grp-bb_fragments = fragments; if (free != grp-bb_free) { - printk(KERN_DEBUG + ext4_error(sb, __FUNCTION__, EXT4-fs: group %lu: %u blocks in bitmap, %u in gd\n, group, free, grp-bb_free); grp-bb_free = free; @@ -1821,13 +1820,24 @@ static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i); if (i = EXT4_BLOCKS_PER_GROUP(sb)) { - BUG_ON(free != 0); + /* + * IF we corrupt the bitmap we won't find any + * free blocks even though group info says we + * we have free blocks + */ + ext4_error(sb, __FUNCTION__, %d free blocks as per + group info. But bitmap says 0\n, + free); break; } mb_find_extent(e4b, 0, i, ac-ac_g_ex.fe_len, ex); BUG_ON(ex.fe_len = 0); - BUG_ON(free ex.fe_len); + if (free ex.fe_len) { + ext4_error(sb, __FUNCTION__, %d free blocks as per + group info. But got %d blocks\n, + free, ex.fe_len); + } ext4_mb_measure_extent(ac, ex, e4b); @@ -3354,13 +3364,10 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac, ac-ac_pa = pa; /* we don't correct pa_pstart or pa_plen here to avoid - * possible race when tte group is being loaded concurrently + * possible race when the group is being loaded concurrently * instead we correct pa later, after blocks are marked - * in on-disk bitmap -- see ext4_mb_release_context() */ - /* - * FIXME!! but the other CPUs can look at this particular - * pa and think that it have enought free blocks if we - * don't update pa_free here right ? + * in on-disk bitmap -- see ext4_mb_release_context() + * Other CPUs are prevented from allocating from this pa by lg_mutex */ mb_debug(use %u/%u from group pa %p\n, pa-pa_lstart-len, len, pa); } @@ -3743,13 +3750,13 @@ static int ext4_mb_release_inode_pa(struct ext4_buddy *e4b, bit = next + 1; } if (free != pa-pa_free) { - printk(KERN_ERR pa %p: logic %lu, phys. %lu, len %lu\n, + printk(KERN_CRIT pa %p: logic %lu, phys. %lu, len %lu\n, pa, (unsigned long) pa-pa_lstart, (unsigned long) pa-pa_pstart, (unsigned long) pa-pa_len); - printk(KERN_ERR free %u, pa_free %u\n, free, pa-pa_free); + ext4_error(sb, __FUNCTION__, free %u, pa_free %u\n, + free, pa-pa_free); } - BUG_ON(free != pa-pa_free); atomic_add(free, sbi-s_mb_discarded); return err; @@ -4405,7 +4412,7 @@ void ext4_mb_free_blocks(handle_t *handle, struct inode *inode, unsigned long block, unsigned long count, int metadata, unsigned long *freed) { - struct buffer_head *bitmap_bh = 0; + struct buffer_head *bitmap_bh = NULL; struct super_block *sb = inode-i_sb; struct ext4_allocation_context ac; struct ext4_group_desc *gdp; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL
Re: [PATCH] allocate struct ext4_allocation_context from a kmem cache to save stack space
On Thu, 2008-02-07 at 19:06 -0600, Eric Sandeen wrote: Mingming Cao wrote: Do you intend to remove the #ifdef CONFIG_PROC_FS, or it's a accident? I think we need keep that to allow ext4 build without procfs configured. Other than this, the patch looks fine to me.:) oh, it kind of snuck in. It actually should still build, as remove_proc_entry is a no-op function w/o the config option. Oh, I mean the proc_mkdir(EXT4_ROOT, proc_root_fs) will complain w/o CONFIG_PROC_FS configured. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
On Tue, 2008-02-05 at 13:59 -0500, Josef Bacik wrote: On Tuesday 05 February 2008 12:27:31 pm Jan Kara wrote: Hello, Sorry for replying a bit late but I'm currently falling behind in maling-list reading... The way jbd tries to determine if there is enough space left on the journal in order to start a new transaction is looking at the space left in the journal and the space needed for the committing transaction, which is 1/4 of the journal + the number of t_outstanding_credits for that transaction. In this case its assumed that t_outstanding_credits accurately represents the number of credits Yes. waiting to be written to the journal, but this sometimes isn't the case. The transaction has two counters for buffers, t_outstanding_credits which is used in conjunction with handles that are added to the transaction, and t_nr_buffers which is only incremented/decremented when buffers are added/removed from the transaction and are actually destined to be journaled. Normally these two t_nr_buffers actually represents number of buffers on BJ_Metadata list and nobody uses it (except for the assertion in __journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be *the* counter making sure we don't write more than we have credits for. counters are the same, however there are cases where the committing transaction can have buffers moved to the next running transaction, for example any buffers on the committing transactions t_reserved list would be moved to the next (running) transaction, and if it had been dirtied in the process it would immediately make it onto the t_updates list, which would increment t_nr_buffers You probably mean t_buffers list here... but not t_outstanding_credits. So you get into this situation where But which moving and dirtying do you mean? The caller which dirties the buffer must make sure that he has acquired enough credits for the transaction where the buffer ends up... So if there were not enough buffers in the running transaction where we refiled the buffer it is a bug in the caller which dirties the buffer. You know now that you say that I feel like an idiot, you are right the only way for something to actually end up on that list was if somebody dirtied it and if they did it would have had to been accounted for at some point on the running transaction. t_nr_buffers (the actual number of buffers that are on the transaction) is greater than the number of buffers accounted for via t_outstanding_credits. This presents a problem since as we loop through writting buffers to the journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than t_outstanding_credits then we end up with a negative number for t_outstanding_credits, which means we start saying we need less than 1/4 of the journal for our committing transaction and allow more transactions than we can handle to start, and then bam we fail because journal_next_log_block doesn't have enough free blocks in order to handle the request. This has been tested and fixes the issue (which could not be reproduced by me but several other people could get it to reproduce using postmark), and although I couldn't reproduce the assertion, I could very easily reproduce the situation where t_outstanding_credits was than t_nr_buffers. I suppose you see the assertion J_ASSERT(journal-j_free 1); to fail, right? I don't see how your patch could help avoid that assertion. You've just removed accounting of t_outstanding_credits which has no impact on the real number of free blocks in the journal stored in j_free. Anyway, if you can reproduce t_outstanding_credits t_nr_buffers, then there's something fishy. Are you able to reproduce it also with a current kernel? Thanks for looking into the problem :) Well my patch helped avoid the assertion because t_outstanding_credits was going negative therefore we were letting transactions start when we shouldn't be, and eventually we would end up with too much of the journal in use and we'd assert. Course I can't reproduce where t_outstanding_credits t_nr_buffers upstream (again I feel like an idiot, should have tested that first). Thanks for looking at this Jan. Mingming, would you mind pulling this patch out of the patch queue please since its wrong? Thanks much, Sure, done! Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix DIO locking
On Wed, 2008-02-06 at 17:06 +0100, Jan Kara wrote: Hi, this patch fixes lock inversion in ext4 direct IO path. The similar patch has already been accepted for ext3. Mingming, will you put it into ext4 patch queue? Thanks. Thanks! Added to the ext4 patch queue. Mingming Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- We cannot start transaction in ext4_direct_IO() and just let it last during the whole write because dio_get_page() acquires mmap_sem which ranks above transaction start (e.g. because we have dependency chain mmap_sem-PageLock-journal_start, or because we update atime while holding mmap_sem) and thus deadlocks could happen. We solve the problem by starting a transaction separately for each ext4_get_block() call. We *could* have a problem that we allocate a block and before its data are written out the machine crashes and thus we expose stale data. But that does not happen because for hole-filling generic code falls back to buffered writes and for file extension, we add inode to orphan list and thus in case of crash, journal replay will truncate inode back to the original size. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bb717cb..1948b97 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -895,7 +895,16 @@ out: return err; } -#define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32) +/* Maximum number of blocks we map for direct IO at once. */ +#define DIO_MAX_BLOCKS 4096 +/* + * Number of credits we need for writing DIO_MAX_BLOCKS: + * We need sb + group descriptor + bitmap + inode - 4 + * For B blocks with A block pointers per block we need: + * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect). + * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25. + */ +#define DIO_CREDITS 25 int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, @@ -942,49 +951,31 @@ static int ext4_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) { handle_t *handle = ext4_journal_current_handle(); - int ret = 0; + int ret = 0, started = 0; unsigned max_blocks = bh_result-b_size inode-i_blkbits; - if (!create) - goto get_block; /* A read */ - - if (max_blocks == 1) - goto get_block; /* A single block get */ - - if (handle-h_transaction-t_state == T_LOCKED) { - /* - * Huge direct-io writes can hold off commits for long - * periods of time. Let this commit run. - */ - ext4_journal_stop(handle); - handle = ext4_journal_start(inode, DIO_CREDITS); - if (IS_ERR(handle)) + + if (create !handle) {/* Direct IO write... */ + if (max_blocks DIO_MAX_BLOCKS) + max_blocks = DIO_MAX_BLOCKS; + handle = ext4_journal_start(inode, DIO_CREDITS + + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb)); + if (IS_ERR(handle)) { ret = PTR_ERR(handle); - goto get_block; - } - - if (handle-h_buffer_credits = EXT4_RESERVE_TRANS_BLOCKS) { - /* - * Getting low on buffer credits... - */ - ret = ext4_journal_extend(handle, DIO_CREDITS); - if (ret 0) { - /* - * Couldn't extend the transaction. Start a new one. - */ - ret = ext4_journal_restart(handle, DIO_CREDITS); + goto out; } + started = 1; } -get_block: - if (ret == 0) { - ret = ext4_get_blocks_wrap(handle, inode, iblock, + ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks, bh_result, create, 0); - if (ret 0) { - bh_result-b_size = (ret inode-i_blkbits); - ret = 0; - } + if (ret 0) { + bh_result-b_size = (ret inode-i_blkbits); + ret = 0; } + if (started) + ext4_journal_stop(handle); +out: return ret; } @@ -1674,7 +1665,8 @@ static int ext4_releasepage(struct page *page, gfp_t wait) * if the machine crashes during the write. * * If the O_DIRECT write is intantiating holes inside i_size and the machine - * crashes then stale disk data _may_ be exposed inside the file. + * crashes then stale disk data _may_ be exposed inside the file. But current + * VFS code falls back into buffered path in that case so we are safe. */ static ssize_t ext4_direct_IO(int rw,
Re: - disable-ext4.patch removed from -mm tree
On Mon, 2008-02-04 at 10:00 -0500, Theodore Tso wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back. OK, if you're not planning on pushing David's changes to Linus right away, what if I pull in David's iget-stop-ext4-from-using-iget-and-read_inode-try.patch and push it plus some other ext4 bug fixes directly to Linus, and let you know when that has happened so you can drop David's patch from your queue? David's changes to ext4 can be applied standalone without the rest of his series, so it would be safe to push that to Linus independently and in advance of the rest of his series. I get compile error when builing ext4 patch queue with iget-stop-ext4-from-using-iget-and-read_inode-try.patch applied, against 2.6.24-git14. It seems iget-stop-ext4-from-using-iget-and-read_inode-try.patch depends on patches: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [PATCH 03/32] IGET: Introduce a function to register iget failure Mingming That should also help reduce the number of inter-patch queue dependencies. Regards, - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Replace iget with iget_locked in defrag-free-space-fragementation
On Sat, 2008-02-02 at 11:40 -0500, Theodore Ts'o wrote: FYI, since iget() is going to be disappearing in 2.6.25, and akpm has disabled ext4 in the -mm tree as a result, I'm folding the following patch into the defrag-free-space-fragmentation.patch in the ext4 tree. I noticed that you have merged the changes in the patch queue... diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c index d9a14e4..6370fb8 100644 --- a/fs/ext4/defrag.c +++ b/fs/ext4/defrag.c @@ -287,9 +287,13 @@ static int ext4_ext_extents_info(struct ext4_extents_info *ext_info, int entries = 0; int err = 0; - inode = iget(sb, ext_info-ino); + inode = iget_locked(sb, ext_info-ino); if (!inode) return -EACCES; + if (inode-i_state I_NEW) { + sb-s_op-read_inode(inode); + unlock_new_inode(inode); + } down_write(EXT4_I(inode)-i_data_sem); read_inode() also sunset, as removed in http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24/2.6.24-mm1/broken-out/iget-remove-iget-and-the-read_inode-super-op-as.patch so above changes won't compile in mm tree. I have fixed it in patch queue, using ext4_iget() to replace iget() and read-inode() as suggested in iget-stop-ext4-from-using-iget-and-read_inode-try.patch http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=ext4-online-defrag-iget-read-inode-fix.patch;h=94660ca7371680fcd02bb5804016b4cae4f20846;hb=85589744f9d53579a12aa463ccd42fb450aec786 We also have an issue where the read-only bind patches from Dave Hansen are making changes which conflict with the ext4 patch tree, which is making akpm very grumpy. I'll try to take a look at this later in the weekend... - Ted @@ -588,9 +592,13 @@ static int ext4_ext_defrag_victim(struct file *target_filp, ext.len = 0; /* Get the inode of the victim file */ - victim_inode = iget(sb, ex_info-ino); + victim_inode = iget_locked(sb, ex_info-ino); if (!victim_inode) return -EACCES; + if (victim_inode-i_state I_NEW) { + sb-s_op-read_inode(victim_inode); + unlock_new_inode(victim_inode); + } /* Setup file for the victim file */ victim_dent.d_inode = victim_inode; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Replace iget with iget_locked in defrag-free-space-fragementation
On Sat, 2008-02-02 at 11:40 -0500, Theodore Ts'o wrote: We also have an issue where the read-only bind patches from Dave Hansen are making changes which conflict with the ext4 patch tree, which is making akpm very grumpy. I'll try to take a look at this later in the weekend... - Ted BTW, Ted, could you clarify about this one? I am able to build the patch queue on akpm's 2.6.24-mm1 tree with Dave Hansen's read-only bind changes. Maybe I missed something. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: - disable-ext4.patch removed from -mm tree
On Tue, 2008-02-05 at 13:26 -0800, Mingming Cao wrote: On Mon, 2008-02-04 at 10:00 -0500, Theodore Tso wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back. OK, if you're not planning on pushing David's changes to Linus right away, what if I pull in David's iget-stop-ext4-from-using-iget-and-read_inode-try.patch and push it plus some other ext4 bug fixes directly to Linus, and let you know when that has happened so you can drop David's patch from your queue? David's changes to ext4 can be applied standalone without the rest of his series, so it would be safe to push that to Linus independently and in advance of the rest of his series. I get compile error when builing ext4 patch queue with iget-stop-ext4-from-using-iget-and-read_inode-try.patch applied, against 2.6.24-git14. It seems iget-stop-ext4-from-using-iget-and-read_inode-try.patch depends on patches: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [PATCH 03/32] IGET: Introduce a function to register iget failure It seems to me the easiest way to bring ext4 patches back to mm tree, is to carry above two patches in ext4 patch queue, like we did for other ext4 patches that depend on generic code in the past. So I added above two patches to ext4 patch queue, now that ext4 patches could apply cleanly to linus git tree, and Andrew should able to easily pull ext4 patches after removing the duplicated patches. Ted, I have the ext4 patch queue updated for this. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG_ON at mballoc.c:3752
On Thu, 2008-01-31 at 15:01 +0100, Eric Sesterhenn wrote: hi, while running a modified version of fsfuzzer i triggered the BUG() in ext4_mb_release_inode_pa(). Sadly I am not able to reproduce this using the generated image, but running the fuzzer will usually trigger this in less than 40 attempts. Increasing the JBD2 Debug level didnt give more information. The kernel is current git with ext4-fix-null-pointer-deref-in-journal_wait_on_commit_record.patch applied. Greetings, Eric Thanks for reporting this. [ 1570.971980] EXT4-fs error (device loop0) in ext4_reserve_inode_write: Journal has aborted Is there any more info about why jbd has aborted? [ 1570.972077] pa c6512330: logic 16, phys. 2337, len 16 [ 1570.972103] free 2, pa_free 1 looks like free!=pa_free. Aneesh, could you take a look? Thanks! Mingming [ 1570.972191] [ cut here ] [ 1570.972217] kernel BUG at fs/ext4/mballoc.c:3752! [ 1570.972241] invalid opcode: [#1] PREEMPT DEBUG_PAGEALLOC [ 1570.972386] Modules linked in: [ 1570.972425] [ 1570.972509] Pid: 6629, comm: fstest Not tainted (2.6.24-05749-g8af03e7-dirty #19) [ 1570.972534] EIP: 0060:[c02266b9] EFLAGS: 00010202 CPU: 0 [ 1570.972570] EIP is at ext4_mb_release_inode_pa+0x169/0x1a0 [ 1570.972595] EAX: 0001 EBX: 0930 ECX: 0001 EDX: 0001 [ 1570.972678] ESI: 0930 EDI: c6512330 EBP: cb638b28 ESP: cb638a84 [ 1570.972703] DS: 007b ES: 007b FS: GS: 0033 SS: 0068 [ 1570.972728] Process fstest (pid: 6629, ti=cb638000 task=cb698000 task.ti=cb638000) [ 1570.972751] Stack: c07c21bb 0002 0001 0921 0010 cbff34e0 cb638b54 0002 [ 1570.972899]cb5fa430 c64824a0 cb5fb920 0022 cbff34e0 cb638ad4 0246 0400 [ 1570.972899]cbfa2000 cb5fb920 cbff34e0 092e 0002 cbfa2000 [ 1570.972899] Call Trace: [ 1570.972899] [c020e3a4] ? read_block_bitmap+0x54/0x120 [ 1570.972899] [c022bd24] ? ext4_mb_discard_inode_preallocations+0x124/0x300 [ 1570.972899] [c022bda5] ? ext4_mb_discard_inode_preallocations+0x1a5/0x300 [ 1570.972899] [c0223577] ? ext4_ext_get_blocks+0x3a7/0x4b0 [ 1570.972899] [c0213928] ? ext4_get_blocks_wrap+0xe8/0x130 [ 1570.972899] [c0213bce] ? ext4_get_block+0x7e/0xf0 [ 1570.972899] [c019d7ba] ? __block_prepare_write+0x17a/0x3a0 [ 1570.972899] [c019da68] ? block_write_begin+0x48/0xe0 [ 1570.972899] [c0213b50] ? ext4_get_block+0x0/0xf0 [ 1570.972899] [c0215127] ? ext4_write_begin+0xb7/0x190 [ 1570.972899] [c0213b50] ? ext4_get_block+0x0/0xf0 [ 1570.972899] [c01557e9] ? generic_perform_write+0xa9/0x190 [ 1570.972899] [c01575bd] ? generic_file_buffered_write+0x6d/0x130 [ 1570.972899] [c01578c1] ? __generic_file_aio_write_nolock+0x241/0x550 [ 1570.972899] [c0144b44] ? trace_hardirqs_on+0xc4/0x150 [ 1570.972899] [c0157c2c] ? generic_file_aio_write+0x5c/0xd0 [ 1570.972899] [c015ab0d] ? free_one_page+0x1ed/0x220 [ 1570.972899] [c0210410] ? ext4_file_write+0x50/0x160 [ 1570.972899] [c017b11d] ? do_sync_write+0xcd/0x110 [ 1570.972899] [c01096f9] ? native_sched_clock+0x69/0xc0 [ 1570.972899] [c01373c0] ? autoremove_wake_function+0x0/0x50 [ 1570.972899] [c01070e5] ? do_softirq+0x55/0xd0 [ 1570.972899] [c01050d3] ? restore_nocheck+0x12/0x15 [ 1570.972899] [c0144b44] ? trace_hardirqs_on+0xc4/0x150 [ 1570.972899] [c017b959] ? vfs_write+0x99/0x130 [ 1570.972899] [c017b050] ? do_sync_write+0x0/0x110 [ 1570.972899] [c017c048] ? sys_pwrite64+0x68/0x70 [ 1570.972899] [c0104fea] ? sysenter_past_esp+0x5f/0xa5 [ 1570.972899] === [ 1570.972899] Code: ff 0f b7 47 4e 89 44 24 08 8b 85 78 ff ff ff c7 04 24 bb 21 7c c0 89 44 24 04 e8 c3 e0 ef ff 0f b7 47 4e 39 85 78 ff ff ff 74 07 0f 0b eb fe 8d 76 00 8b 85 78 ff ff ff 8b 95 7c ff ff ff 01 82 [ 1570.972899] EIP: [c02266b9] ext4_mb_release_inode_pa+0x169/0x1a0 SS:ESP 0068:cb638a84 [ 1570.972942] ---[ end trace 51819e80cd9431da ]--- [ 1570.972969] note: fstest[6629] exited with preempt_count 1 [ 1570.973013] BUG: sleeping function called from invalid context at kernel/rwsem.c:21 [ 1570.973039] in_atomic():1, irqs_disabled():0 [ 1570.973077] INFO: lockdep is turned off. [ 1570.973104] Pid: 6629, comm: fstest Tainted: G D 2.6.24-05749-g8af03e7-dirty #19 [ 1570.973159] [c011e1a6] __might_sleep+0xc6/0xf0 [ 1570.973224] [c06b2c99] down_read+0x19/0x80 [ 1570.973295] [c013a7fd] ? hrtimer_try_to_cancel+0x3d/0x80 [ 1570.973396] [c0125a27] exit_mm+0x27/0xd0 [ 1570.973467] [c01272f3] do_exit+0x133/0x2e0 [ 1570.973529] [c010611c] die+0x13c/0x140 [ 1570.973590] [c0135197] ? search_exception_tables+0x27/0x30 [ 1570.973993] [c01061b1] do_trap+0x91/0xc0 [ 1570.974054] [c0106440] ? do_invalid_op+0x0/0xa0 [ 1570.974133] [c01064c9] do_invalid_op+0x89/0xa0 [ 1570.974195] [c02266b9] ? ext4_mb_release_inode_pa+0x169/0x1a0 [ 1570.974290] [c012007b] ? account_system_time+0x9b/0xd0 [ 1570.974370]
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
On Thu, 2008-01-31 at 16:52 -0500, Josef Bacik wrote: On Thu, Jan 31, 2008 at 12:35:43PM -0700, Andreas Dilger wrote: On Jan 31, 2008 11:14 -0500, Josef Bacik wrote: [snip excellent analysis] So you get into this situation where t_nr_buffers (the actual number of buffers that are on the transaction) is greater than the number of buffers accounted for via t_outstanding_credits. This presents a problem since as we loop through writting buffers to the journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than t_outstanding_credits then we end up with a negative number for t_outstanding_credits Signed-off-by: Josef Bacik [EMAIL PROTECTED] Do you know what kernel this problem was introduced in, or is this a long standing problem? Presumably the same is needed for jbd2? Once we have some decent amount of testing going on with ext4, I think it makes sense to merge the jbd2 changes back into jbd and return to a single code base, since there is nothing in the jbd2 code that ext3 can't also work with (i.e. all of the changes are properly isolated with compatibility flags and such). @@ -1056,7 +1056,7 @@ static inline int jbd_space_needed(journal_t *journal) int nblocks = journal-j_max_transaction_buffers; if (journal-j_committing_transaction) nblocks += journal-j_committing_transaction- - t_outstanding_credits; + t_nr_buffers; (trivial) this can be moved back onto the previous line. @@ -1168,7 +1168,7 @@ static inline int jbd_space_needed(journal_t *journal) int nblocks = journal-j_max_transaction_buffers; if (journal-j_committing_transaction) nblocks += journal-j_committing_transaction- - t_outstanding_credits; + t_nr_buffers; Same... The original issue was reported on RHEL4, so thats 2.6.9, and looking through the old-bkcvs git tree I can't see where this was introduced, so it's probably existed before that. The same problem looks to exist in jbd2 though I haven't tested it myself, I just went ahead and included the fixes. Here is the updated patch, thanks much for the comments. Added to ext4 patch queue. Thanks, Mingming Signed-off-by: Josef Bacik [EMAIL PROTECTED] diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 31853eb..e385a5c 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -561,13 +561,6 @@ void journal_commit_transaction(journal_t *journal) continue; } - /* - * start_this_handle() uses t_outstanding_credits to determine - * the free space in the log, but this counter is changed - * by journal_next_log_block() also. - */ - commit_transaction-t_outstanding_credits--; - /* Bump b_count to prevent truncate from stumbling over the shadowed buffer! @@@ This can go if we ever get rid of the BJ_IO/BJ_Shadow pairing of buffers. */ diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 4f302d2..c0f93f5 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -580,7 +580,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) stats.u.run.rs_logging = jiffies; stats.u.run.rs_flushing = jbd2_time_diff(stats.u.run.rs_flushing, stats.u.run.rs_logging); - stats.u.run.rs_blocks = commit_transaction-t_outstanding_credits; + stats.u.run.rs_blocks = commit_transaction-t_nr_buffers; stats.u.run.rs_blocks_logged = 0; descriptor = NULL; @@ -655,13 +655,6 @@ void jbd2_journal_commit_transaction(journal_t *journal) continue; } - /* - * start_this_handle() uses t_outstanding_credits to determine - * the free space in the log, but this counter is changed - * by jbd2_journal_next_log_block() also. - */ - commit_transaction-t_outstanding_credits--; - /* Bump b_count to prevent truncate from stumbling over the shadowed buffer! @@@ This can go if we ever get rid of the BJ_IO/BJ_Shadow pairing of buffers. */ diff --git a/include/linux/jbd.h b/include/linux/jbd.h index d9ecd13..eaeb3db 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -1055,8 +1055,7 @@ static inline int jbd_space_needed(journal_t *journal) { int nblocks = journal-j_max_transaction_buffers; if (journal-j_committing_transaction) - nblocks += journal-j_committing_transaction- - t_outstanding_credits; + nblocks += journal-j_committing_transaction-t_nr_buffers; return nblocks; } diff --git
Re: Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record
On Wed, 2008-01-30 at 12:00 -0800, Andrew Morton wrote: Begin forwarded message: Date: Wed, 30 Jan 2008 03:24:08 -0800 (PST) From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record http://bugzilla.kernel.org/show_bug.cgi?id=9849 Summary: NULL pointer deref in journal_wait_on_commit_record Product: File System Version: 2.5 KernelVersion: 2.6.24-03997-g85004cc Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext4 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Latest working kernel version: - Earliest failing kernel version: 2.6.24-03863-g0ba6c33 Distribution: Ubuntu Problem Description: using a corrupted image causes an oops in unmount, seems as if journal_wait_on_commit_record() gets passed a NULL pointer The buufer head pointer passed to journal_wait_on_commit_record() could be NULL if the previous journal_submit_commit_record() failed or journal has already aborted. Looking at the jbd2 debug messages, before the oops happen, the jbd2 is aborted due to trying to access the next log block beyond the end of device. This might be caused by using a corrupted image. We need to check the error returns from journal_submit_commit_record() and avoid calling journal_wait_on_commit_record() in the failure case. Signed-off-by: Mingming Cao [EMAIL PROTECTED] The buufer head pointer passed to journal_wait_on_commit_record() could be NULL if the previous journal_submit_commit_record() failed or journal has already aborted. We need to check the error returns from journal_submit_commit_record() and avoid calling journal_wait_on_commit_record() in the failure case. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/commit.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24-rc8/fs/jbd2/commit.c === --- linux-2.6.24-rc8.orig/fs/jbd2/commit.c 2008-01-30 14:12:10.0 -0800 +++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-30 15:09:50.0 -0800 @@ -872,7 +872,8 @@ wait_for_iobuf: if (err) __jbd2_journal_abort_hard(journal); } - err = journal_wait_on_commit_record(cbh); + if (!err !is_journal_aborted(journal)) + err = journal_wait_on_commit_record(cbh); if (err) jbd2_journal_abort(journal, err); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 33/49] ext4: Add the journal checksum feature
, +* just skip over the blocks it describes. */ if (pass != PASS_REPLAY) { + if (pass == PASS_SCAN + JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM) + !info-end_transaction) { + if (calc_chksums(journal, bh, + next_log_block, + crc32_sum)) { put_bh() + brelse(bh); + break; + } + brelse(bh); + continue; put_bh() + } next_log_block += count_tags(journal, bh); wrap(journal, next_log_block); brelse(bh); @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal, continue; + brelse(bh); etc Thanks, Updated patch below: ext4: Add the journal checksum feature From: Girish Shilamkar [EMAIL PROTECTED] The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- Documentation/filesystems/ext4.txt | 10 + fs/Kconfig |1 fs/ext4/super.c| 25 fs/jbd2/commit.c | 198 +++-- fs/jbd2/journal.c | 26 fs/jbd2/recovery.c | 151 ++-- include/linux/ext4_fs.h|3 include/linux/jbd2.h | 36 +- 8 files changed, 388 insertions(+), 62 deletions(-) Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt === --- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt2008-01-24 11:18:08.0 -0800 +++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt 2008-01-24 13:00:44.0 -0800 @@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo extentsext4 will use extents to address file data. The file system will no longer be mountable by ext3. +journal_checksum Enable checksumming of the journal transactions. + This will allow the recovery code in e2fsck and the + kernel to detect corruption in the kernel. It is a + compatible change and will be ignored by older kernels. + +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. + journal=update Update the ext4 file system's journal to the current format. Index: linux-2.6.24-rc8/fs/Kconfig === --- linux-2.6.24-rc8.orig/fs/Kconfig2008-01-24 11:18:08.0 -0800 +++ linux-2.6.24-rc8/fs/Kconfig 2008-01-24 11:18:55.0 -0800 @@ -236,6 +236,7 @@ config JBD_DEBUG config JBD2 tristate + select CRC32 help This is a generic journaling layer for block devices that support both 32-bit and 64-bit block numbers. It is currently used by Index: linux-2.6.24-rc8/fs/ext4/super.c === --- linux-2.6.24-rc8.orig/fs/ext4/super.c 2008-01-24 11:18:52.0 -0800 +++ linux-2.6.24-rc8/fs/ext4/super.c2008-01-24 13:00:45.0 -0800 @@ -869,6 +869,7 @@ enum { Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh
Re: [PATCH] Shrink ext3_inode_info by 8 bytes for !POSIX_ACL.
On Sat, 2008-01-12 at 21:35 +0100, Indan Zupancic wrote: i_file_acl and i_dir_acl aren't always needed. With certain configs this makes 10 ext3_inode_cache objects fit in one slab instead of the current 9, as the size shrinks from 416 to 408 bytes for 32 bit, !POSIX_ACL and !EXT3_FS_XATTR configs. Signed-off-by: Indan Zupancic [EMAIL PROTECTED] --- fs/ext3/ialloc.c |2 ++ fs/ext3/inode.c | 29 +++-- include/linux/ext3_fs_i.h |2 ++ 3 files changed, 23 insertions(+), 10 deletions(-) diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c index 1bc8cd8..01745bc 100644 --- a/fs/ext3/ialloc.c +++ b/fs/ext3/ialloc.c @@ -574,8 +574,10 @@ got: ei-i_frag_no = 0; ei-i_frag_size = 0; #endif +#ifdef CONFIG_EXT3_FS_POSIX_ACL ei-i_file_acl = 0; ei-i_dir_acl = 0; +#endif For regular file, i_dir_acl is being reused as i_size_high to support large file. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4]JBD2: user of the jiffies rounding code
Ported from JBD changes from Arjan van de Ven [EMAIL PROTECTED] --- Date: Sun, 10 Dec 2006 10:21:26 + (-0800) Subject: [PATCH] user of the jiffies rounding code: JBD X-Git-Tag: v2.6.20-rc1~15^2~43 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=44d306e1508fef6fa7a6eb15a1aba86ef68389a6 [PATCH] user of the jiffies rounding code: JBD This patch introduces a user: of the round_jiffies() function; the 5 second ext3/jbd wakeup. While every 5 seconds doesn't sound as a problem, there can be many of these (and these timers do add up over all the kernel). The 5 second wakeup isn't really timing sensitive; in addition even with rounding it'll still happen every 5 seconds (with the exception of the very first time, which is likely to be rounded up to somewhere closer to 6 seconds) Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- --- fs/jbd2/transaction.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24-rc7/fs/jbd2/transaction.c === --- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 15:41:14.0 -0800 +++ linux-2.6.24-rc7/fs/jbd2/transaction.c 2008-01-16 17:49:48.0 -0800 @@ -54,7 +54,7 @@ jbd2_get_transaction(journal_t *journal, spin_lock_init(transaction-t_handle_lock); /* Set up the commit timer for the new transaction. */ - journal-j_commit_timer.expires = transaction-t_expires; + journal-j_commit_timer.expires = round_jiffies(transaction-t_expires); add_timer(journal-j_commit_timer); J_ASSERT(journal-j_running_transaction == NULL); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4]jbd2: port jbd lockdep support to jbd2
Hi Andrew, Ted, I walked through the linus's git tree history and found 4 patches should port from ext3/jbd to ext4/jbd2, since the day ext4 was forked (2006.10.11) to today. I have already queued the ported patches in ext4 patch queue and verified they seems fine. Here is the first one. jbd2: port jbd lockdep support to jbd2 Except lockdep doesn't know about journal_start(), which has ranking requirements similar to a semaphore. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/transaction.c | 11 +++ include/linux/jbd2.h |4 2 files changed, 15 insertions(+) Index: linux-2.6.24-rc7/fs/jbd2/transaction.c === --- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 15:30:24.0 -0800 +++ linux-2.6.24-rc7/fs/jbd2/transaction.c 2008-01-16 15:41:14.0 -0800 @@ -241,6 +241,8 @@ out: return ret; } +static struct lock_class_key jbd2_handle_key; + /* Allocate a new handle. This should probably be in a slab... */ static handle_t *new_handle(int nblocks) { @@ -251,6 +253,9 @@ static handle_t *new_handle(int nblocks) handle-h_buffer_credits = nblocks; handle-h_ref = 1; + lockdep_init_map(handle-h_lockdep_map, jbd2_handle, + jbd2_handle_key, 0); + return handle; } @@ -293,7 +298,11 @@ handle_t *jbd2_journal_start(journal_t * jbd2_free_handle(handle); current-journal_info = NULL; handle = ERR_PTR(err); + goto out; } + + lock_acquire(handle-h_lockdep_map, 0, 0, 0, 2, _THIS_IP_); +out: return handle; } @@ -1419,6 +1428,8 @@ int jbd2_journal_stop(handle_t *handle) spin_unlock(journal-j_state_lock); } + lock_release(handle-h_lockdep_map, 1, _THIS_IP_); + jbd2_free_handle(handle); return err; } Index: linux-2.6.24-rc7/include/linux/jbd2.h === --- linux-2.6.24-rc7.orig/include/linux/jbd2.h 2008-01-16 15:29:03.0 -0800 +++ linux-2.6.24-rc7/include/linux/jbd2.h 2008-01-16 15:29:54.0 -0800 @@ -418,6 +418,10 @@ struct handle_s unsigned inth_sync: 1; /* sync-on-close */ unsigned inth_jdata:1; /* force data journaling */ unsigned inth_aborted: 1; /* fatal error on handle */ + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map h_lockdep_map; +#endif }; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4]JBD2: sparse pointer use of zero as null
Ported from upstream jbd changes to jbd2 sparse pointer use of zero as null Get rid of sparse related warnings from places that use integer as NULL pointer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/transaction.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux-2.6.24-rc7/fs/jbd2/transaction.c === --- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 17:49:48.0 -0800 +++ linux-2.6.24-rc7/fs/jbd2/transaction.c 2008-01-16 18:06:33.0 -0800 @@ -1182,7 +1182,7 @@ int jbd2_journal_dirty_metadata(handle_t } /* That test should have eliminated the following case: */ - J_ASSERT_JH(jh, jh-b_frozen_data == 0); + J_ASSERT_JH(jh, jh-b_frozen_data == NULL); JBUFFER_TRACE(jh, file as BJ_Metadata); spin_lock(journal-j_list_lock); @@ -1532,7 +1532,7 @@ void __jbd2_journal_temp_unlink_buffer(s J_ASSERT_JH(jh, jh-b_jlist BJ_Types); if (jh-b_jlist != BJ_None) - J_ASSERT_JH(jh, transaction != 0); + J_ASSERT_JH(jh, transaction != NULL); switch (jh-b_jlist) { case BJ_None: @@ -1601,11 +1601,11 @@ __journal_try_to_free_buffer(journal_t * if (buffer_locked(bh) || buffer_dirty(bh)) goto out; - if (jh-b_next_transaction != 0) + if (jh-b_next_transaction != NULL) goto out; spin_lock(journal-j_list_lock); - if (jh-b_transaction != 0 jh-b_cp_transaction == 0) { + if (jh-b_transaction != NULL jh-b_cp_transaction == NULL) { if (jh-b_jlist == BJ_SyncData || jh-b_jlist == BJ_Locked) { /* A written-back ordered data buffer */ JBUFFER_TRACE(jh, release data); @@ -1613,7 +1613,7 @@ __journal_try_to_free_buffer(journal_t * jbd2_journal_remove_journal_head(bh); __brelse(bh); } - } else if (jh-b_cp_transaction != 0 jh-b_transaction == 0) { + } else if (jh-b_cp_transaction != NULL jh-b_transaction == NULL) { /* written-back checkpointed metadata buffer */ if (jh-b_jlist == BJ_None) { JBUFFER_TRACE(jh, remove from checkpoint list); @@ -1973,7 +1973,7 @@ void __jbd2_journal_file_buffer(struct j J_ASSERT_JH(jh, jh-b_jlist BJ_Types); J_ASSERT_JH(jh, jh-b_transaction == transaction || - jh-b_transaction == 0); + jh-b_transaction == NULL); if (jh-b_transaction jh-b_jlist == jlist) return; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix oops in mballoc caused by a variable overflow
On Wed, 2008-01-16 at 20:11 +0100, Valerie Clement wrote: A simple dd oopses the kernel (2.6.24-rc7 with the latest patch queue): dd if=/dev/zero of=/mnt/test/foo bs=1M count=8096 EXT4-fs: mballoc enabled [ cut here ] kernel BUG at fs/ext4/mballoc.c:3148! The BUG_ON is: BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); where the value of size is 4293920768. This is due to the overflow of the variable start in the ext4_mb_normalize_request() function. The patch below fixes it. Thanks! Signed-off-by: Valerie Clement [EMAIL PROTECTED] --- mballoc.c | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) Index: linux-2.6.24-rc7/fs/ext4/mballoc.c === --- linux-2.6.24-rc7.orig/fs/ext4/mballoc.c 2008-01-16 19:22:45.0 +0100 +++ linux-2.6.24-rc7/fs/ext4/mballoc.c2008-01-16 19:25:04.0 +0100 @@ -2990,6 +2990,7 @@ static void ext4_mb_normalize_request(st struct list_head *cur; loff_t size, orig_size; ext4_lblk_t start, orig_start; + ext4_fsblk_t pstart; ext4_fsblk_t is used for fs physical block number, here I think pstart is pointing to some logical block location.. struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); /* do normalize only data requests, metadata requests @@ -3029,7 +3030,7 @@ static void ext4_mb_normalize_request(st /* first, try to predict filesize */ /* XXX: should this table be tunable? */ - start = 0; + pstart = 0; if (size = 16 * 1024) { size = 16 * 1024; } else if (size = 32 * 1024) { @@ -3045,25 +3046,25 @@ static void ext4_mb_normalize_request(st } else if (size = 1024 * 1024) { size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (1024 * 1024)) * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical bsbits; + pstart = (pstart / (1024 * 1024)) * (1024 * 1024); How about using shift... - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (1024 * 1024)) * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (20-bsbits)) 20; That would be more efficient and should fix the overflow issue size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical bsbits; + pstart = (pstart / (4 * (1024 * 1024))) * 4 * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (22-bsbits)) 22; size = 4 * 1024 * 1024; } else if(NRL_CHECK_SIZE(ac-ac_o_ex.fe_len,(820)bsbits,max,bsbits)){ - start = ac-ac_o_ex.fe_logical; - start = start bsbits; - start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical; + pstart = pstart bsbits; + pstart = (pstart / (8 * (1024 * 1024))) * 8 * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (23-bsbits)) 23; size = 8 * 1024 * 1024; } else { - start = ac-ac_o_ex.fe_logical; - start = start bsbits; + pstart = ac-ac_o_ex.fe_logical; + pstart = pstart bsbits; size = ac-ac_o_ex.fe_len bsbits; } orig_size = size = size bsbits; - orig_start = start = start bsbits; + orig_start = start = pstart bsbits; /* don't cover already allocated blocks in selected range */ if (ar-pleft start = ar-lleft) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Do not try lock_acquire after handle made invalid
On Tue, 2008-01-15 at 15:59 -0800, Andrew Morton wrote: On Wed, 16 Jan 2008 00:39:26 +0100 Jonas Bonn [EMAIL PROTECTED] wrote: This likely fixes the oops in __lock_acquire reported as: http://www.kerneloops.org/raw.php?rawid=2753msgid= http://www.kerneloops.org/raw.php?rawid=2749msgid= In these reported oopses, start_this_handle is returning -EROFS. Signed-off-by: Jonas Bonn [EMAIL PROTECTED] --- fs/jbd/transaction.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 08ff6c7..038ed74 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -288,10 +288,12 @@ handle_t *journal_start(journal_t *journal, int nblocks) jbd_free_handle(handle); current-journal_info = NULL; handle = ERR_PTR(err); + goto out; } lock_acquire(handle-h_lockdep_map, 0, 0, 0, 2, _THIS_IP_); +out: return handle; } Yeah, thanks. It looks like we forgot to port this lockdep support into ext4. This is bad. What else got lost? Here is the ext4/jbd2 port of the original lockdep patch with the above fix. Please let me know if I missed anything. It's likely there are other jbd/ext3 changes missed out in ext4/jbd2. I will take a look at the git tree and do a cleanup. Thanks for pointing this out. - jbd2: lockdep support for jbd2 Except lockdep doesn't know about journal_start(), which has ranking requirements similar to a semaphore. Ported from lockdep jbd patch. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/transaction.c | 11 +++ include/linux/jbd2.h |4 2 files changed, 15 insertions(+) Index: linux-2.6.24-rc7/fs/jbd2/transaction.c === --- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 15:30:24.0 -0800 +++ linux-2.6.24-rc7/fs/jbd2/transaction.c 2008-01-16 15:41:14.0 -0800 @@ -241,6 +241,8 @@ out: return ret; } +static struct lock_class_key jbd2_handle_key; + /* Allocate a new handle. This should probably be in a slab... */ static handle_t *new_handle(int nblocks) { @@ -251,6 +253,9 @@ static handle_t *new_handle(int nblocks) handle-h_buffer_credits = nblocks; handle-h_ref = 1; + lockdep_init_map(handle-h_lockdep_map, jbd2_handle, + jbd2_handle_key, 0); + return handle; } @@ -293,7 +298,11 @@ handle_t *jbd2_journal_start(journal_t * jbd2_free_handle(handle); current-journal_info = NULL; handle = ERR_PTR(err); + goto out; } + + lock_acquire(handle-h_lockdep_map, 0, 0, 0, 2, _THIS_IP_); +out: return handle; } @@ -1419,6 +1428,8 @@ int jbd2_journal_stop(handle_t *handle) spin_unlock(journal-j_state_lock); } + lock_release(handle-h_lockdep_map, 1, _THIS_IP_); + jbd2_free_handle(handle); return err; } Index: linux-2.6.24-rc7/include/linux/jbd2.h === --- linux-2.6.24-rc7.orig/include/linux/jbd2.h 2008-01-16 15:29:03.0 -0800 +++ linux-2.6.24-rc7/include/linux/jbd2.h 2008-01-16 15:29:54.0 -0800 @@ -418,6 +418,10 @@ struct handle_s unsigned inth_sync: 1; /* sync-on-close */ unsigned inth_jdata:1; /* force data journaling */ unsigned inth_aborted: 1; /* fatal error on handle */ + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map h_lockdep_map; +#endif }; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: checkpatch.pl warnings
On Tue, 2008-01-15 at 18:22 +0530, Aneesh Kumar K.V wrote: On Mon, Jan 14, 2008 at 12:49:27PM -0800, Mingming Cao wrote: Hi Guys, Could you check the checkpatch.pl warnings and see if it make sense to fix them? Thanks! [EMAIL PROTECTED]:~/fs/ext4/stylecheck$ grep has style problems * linux-2.6.24-rc7-48-bit-i_blocks.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext3-4-migrate.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_export_iov_shorten_from_kernel_for_ext4.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4-journal_chksum-2.6.20.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_rec_len_overflow_with_64kblk_fix-v2.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_store_maxbytes_for_bitmaped_files.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-inode-version-ext4.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-jbd-stats-through-procfs.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-large-file-blocktype.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-mballoc-core.patch.out:Your patch has style problems, please review. If any of these errors Fixed the checkpatch.pl warning for all the patches in the patch queue. The diff is attached below for review. Thanks! patch queue has been updated. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: fix uniniatilized extend splitting error.
Thanks for catching this, I added your fix to the ext4 patch queue http://repo.or.cz/w/ext4-patch-queue.git On Thu, 2008-01-10 at 17:31 +0300, Dmitry Monakhov wrote: Hi, While playing with new fancy fallocate interface on ext4 i've triggered bug which corrupted my grub :). My testcase: blksize = 0x1000; fd = open(argv[1], O_RDWR|O_CREAT, 0700); unsigned long long sz = 0x1000UL; /* allocating big blocks chunk */ syscall(__NR_fallocate, fd, 0, 0UL, sz) /* grab all other available filesystem space */ tfd = open(tmp, O_RDWR|O_CREAT|O_DIRECT, 0700); while( write(tfd, buf, 4096) 0); /* loop untill ENOSPC */ fsync(fd); /* just in case */ while (pos sz) { /* each seek+ write operation result in splits uninitialized extent in three extents. Splitting may result in new extent allocation which probably will fail because of ENOSPC*/ lseek(fd, blksize*2 -1, SEEK_CUR); if ((ret = write(fd, 'a', 1)) != 1) exit(1); pos += blksize * 2; } Buggy place: ext4_ext_get_blocks(..., bh_result,..) { err = 0; allocated = 0; ret = ext4_ext_convert_to_initialized(...) if (ret 0) By occasion real error code was lost here. goto out2 out2: return err? err: allocated; Wow.. exit with 0, and caller assumes what bh_result was properly filled and then will submit it for write. But in fact bh contains random data in -b_bdev, -b_blocknr fileds :). } Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] --- fs/ext4/extents.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 8528774..fc8e508 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2320,9 +2320,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ret = ext4_ext_convert_to_initialized(handle, inode, path, iblock, max_blocks); - if (ret = 0) + if (ret = 0) { + err = ret; goto out2; - else + } else allocated = ret; goto outnew; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: fix uniniatilized extend splitting error.
On Thu, 2008-01-10 at 14:31 -0700, Andreas Dilger wrote: On Jan 10, 2008 17:31 +0300, Dmitry Monakhov wrote: While playing with new fancy fallocate interface on ext4 i've triggered bug which corrupted my grub :). I notice I'm CC'd on this, but in fact Amit wrote the code. I've CC'd him even though I expect he will notice it anyways. My testcase: blksize = 0x1000; fd = open(argv[1], O_RDWR|O_CREAT, 0700); unsigned long long sz = 0x1000UL; /* allocating big blocks chunk */ syscall(__NR_fallocate, fd, 0, 0UL, sz) /* grab all other available filesystem space */ tfd = open(tmp, O_RDWR|O_CREAT|O_DIRECT, 0700); while( write(tfd, buf, 4096) 0); /* loop untill ENOSPC */ fsync(fd); /* just in case */ while (pos sz) { /* each seek+ write operation result in splits uninitialized extent in three extents. Splitting may result in new extent allocation which probably will fail because of ENOSPC*/ lseek(fd, blksize*2 -1, SEEK_CUR); if ((ret = write(fd, 'a', 1)) != 1) exit(1); pos += blksize * 2; } Interesting test, and well thought out... Buggy place: ext4_ext_get_blocks(..., bh_result,..) { err = 0; allocated = 0; ret = ext4_ext_convert_to_initialized(...) if (ret 0) By occasion real error code was lost here. goto out2 out2: return err? err: allocated; Wow.. exit with 0, and caller assumes what bh_result was properly filled and then will submit it for write. But in fact bh contains random data in -b_bdev, -b_blocknr fileds :). } The other item that Amit and I discussed in the past is in the case of ENOSPC it would be possible instead of splitting the extent to zero-fill the smaller extent (1 block in your test case) and write the whole thing as an initialized extent. This could then either be merged with the previous or following allocated extent, or the whole extent zeroed if that was not possible. It would add some latency in the worst case to do this in the kernel, but this would only happen if there is no free space at all. It might even be desirable to always zero-fill small extents instead of splitting uninitialized extents, because a random write of 64kB is not more expensive than 4kB and avoids overhead of splitting the nicely contiguous extent tree. Zero-fill uninitialized extents all the time can cause extra IO, as that zero-out (via get_block()) is happened at prepare write time rather than page write out time. I'd prefer the latency zero-out happens only when it needed. Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] --- fs/ext4/extents.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 8528774..fc8e508 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2320,9 +2320,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ret = ext4_ext_convert_to_initialized(handle, inode, path, iblock, max_blocks); - if (ret = 0) + if (ret = 0) { + err = ret; goto out2; - else + } else allocated = ret; goto outnew; } -- 1.5.3.1.40.g6972-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.
On Thu, 2008-01-10 at 00:41 +0530, Aneesh Kumar K.V wrote: On Wed, Jan 09, 2008 at 07:44:30PM +0100, Jan Kara wrote: On Wed 09-01-08 23:54:28, Aneesh Kumar K.V wrote: On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote: With the multi block allocator when we don't have prealloc space we discard @@ -3790,7 +3782,9 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { + busy = 0; ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } Hmm, wouldn't just schedule() be enough here? That would give a good chance to other processes to proceed and we would avoid this artificial wait of 1s which is quite ugly IMO. But then who will wake up the task ?. I have the below comment added to the patch in the patch queue. As far as I know, you don't have to wake-up the task explicitely. Scheduler will simply schedule the task sometime in future (it is a similar situation as if the task got preempted in the kernel). I missed the current-state = TASK_UNINTERRUPTIBLE; in that code piece. So yes without setting the task state yes it would put it back to the run queue.Infact schedule_timeout() without changing the task state also behaves similarly. Now that that we know that it if fine just to have a schedule() there since schedule_timeout() was just behaving as schedule(). I guess we should make the change you suggested. In that case we can remove the comment which says we need to add the wait queue. Mingming, Do you want me to send a patch or can you make the modification ? I could make the changes and update the mballoc patch in the queue. Mingming -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix up vfs inode version patch for modules
On Wed, 2007-11-28 at 14:33 -0600, Eric Sandeen wrote: ext4 calls inode_inc_iversion(), but it's not exported, so modular ext4 has trouble. Any reason not to just make it an inline, as below? I agree. Mingming Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24-rc1/fs/inode.c === --- linux-2.6.24-rc1.orig/fs/inode.c +++ linux-2.6.24-rc1/fs/inode.c @@ -1243,23 +1243,6 @@ void touch_atime(struct vfsmount *mnt, s EXPORT_SYMBOL(touch_atime); /** - * inode_inc_iversion - increments i_version - * @inode: inode that need to be updated - * - * Every time the inode is modified, the i_version field - * will be incremented. - * The filesystem has to be mounted with i_version flag - * - */ - -void inode_inc_iversion(struct inode *inode) -{ - spin_lock(inode-i_lock); - inode-i_version++; - spin_unlock(inode-i_lock); -} - -/** * file_update_time- update mtime and ctime time * @file: file accessed * Index: linux-2.6.24-rc1/include/linux/fs.h === --- linux-2.6.24-rc1.orig/include/linux/fs.h +++ linux-2.6.24-rc1/include/linux/fs.h @@ -1396,7 +1396,21 @@ static inline void inode_dec_link_count( mark_inode_dirty(inode); } -extern void inode_inc_iversion(struct inode *inode); +/** + * inode_inc_iversion - increments i_version + * @inode: inode that need to be updated + * + * Every time the inode is modified, the i_version field will be incremented. + * The filesystem has to be mounted with i_version flag + */ + +static inline void inode_inc_iversion(struct inode *inode) +{ + spin_lock(inode-i_lock); + inode-i_version++; + spin_unlock(inode-i_lock); +} + extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry); static inline void file_accessed(struct file *file) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/8] ext4: fix MB_DEBUG format warnings
On Fri, 2007-11-30 at 16:22 -0800, Andrew Morton wrote: On Fri, 30 Nov 2007 16:17:32 -0800 Mingming Cao [EMAIL PROTECTED] wrote: Thanks for resending the patches. The first three patches in this series were missed in the ext4 patch queue, the rest of them are already queued there. They are all in ext4 patch queue now. I will check other ext4 patches you plan to merge to mm tree and get them sync in ext4 tree. http://repo.or.cz/w/ext4-patch-queue.git git://repo.or.cz/ext4-patch-queue.git What's all this about? That is the latest ext4 patch queue ( quilt format patches, not full ext4 source tree). I've always been pulling ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ (and continuously fixing the same rejects each time). Ted periodically pull patches from the latest ext4 patch queue on repo.or.cz, after they had survive fs tests, then, release the ext4 patches on kernel.org (the one you pulled from). I believe Ted is on vacation now. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/8] ext4: fix MB_DEBUG format warnings
Hi Andrew, Thanks for resending the patches. The first three patches in this series were missed in the ext4 patch queue, the rest of them are already queued there. They are all in ext4 patch queue now. I will check other ext4 patches you plan to merge to mm tree and get them sync in ext4 tree. http://repo.or.cz/w/ext4-patch-queue.git git://repo.or.cz/ext4-patch-queue.git Regards, Mingming On Thu, 2007-11-29 at 12:52 -0800, [EMAIL PROTECTED] wrote: From: Eric Sandeen [EMAIL PROTECTED] (applies at end of current ext4 patch series) Fix a slew of compile warnings from mismatched format specifiers when MB_DEBUG is turned on. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff -puN fs/ext4/mballoc.c~ext4-fix-mb_debug-format-warnings fs/ext4/mballoc.c --- a/fs/ext4/mballoc.c~ext4-fix-mb_debug-format-warnings +++ a/fs/ext4/mballoc.c @@ -954,7 +954,7 @@ static int ext4_mb_init_cache(struct pag get_bh(bh[i]); bh[i]-b_end_io = end_buffer_read_sync; submit_bh(READ, bh[i]); - mb_debug(read bitmap for group %u\n, first_group + i); + mb_debug(read bitmap for group %lu\n, first_group + i); } /* wait for I/O completion */ @@ -2670,7 +2670,7 @@ static void ext4_mb_free_committed_block if (md == NULL) break; - mb_debug(gonna free %u blocks in group %u (0x%p):, + mb_debug(gonna free %u blocks in group %lu (0x%p):, md-num, md-group, md); err = ext4_mb_load_buddy(sb, md-group, e4b); @@ -3029,7 +3029,7 @@ static void ext4_mb_normalize_group_requ BUG_ON(lg == NULL); ac-ac_g_ex.fe_len = EXT4_SB(sb)-s_mb_group_prealloc; - mb_debug(#%u: goal %u blocks for locality group\n, + mb_debug(#%u: goal %lu blocks for locality group\n, current-pid, ac-ac_g_ex.fe_len); } @@ -3255,7 +3255,7 @@ static void ext4_mb_use_inode_pa(struct BUG_ON(pa-pa_free len); pa-pa_free -= len; - mb_debug(use %lu/%lu from inode pa %p\n, start, len, pa); + mb_debug(use %llu/%lu from inode pa %p\n, start, len, pa); } /* @@ -3282,7 +3282,7 @@ static void ext4_mb_use_group_pa(struct * pa and think that it have enought free blocks if we * don't update pa_free here right ? */ - mb_debug(use %lu/%lu from group pa %p\n, pa-pa_lstart-len, len, pa); + mb_debug(use %u/%u from group pa %p\n, pa-pa_lstart-len, len, pa); } /* @@ -3513,7 +3513,7 @@ static int ext4_mb_new_inode_pa(struct e pa-pa_deleted = 0; pa-pa_linear = 0; - mb_debug(new inode pa %p: %lu/%lu for %lu\n, pa, + mb_debug(new inode pa %p: %llu/%u for %u\n, pa, pa-pa_pstart, pa-pa_len, pa-pa_lstart); ext4_mb_use_inode_pa(ac, pa); @@ -3569,7 +3569,7 @@ static int ext4_mb_new_group_pa(struct e pa-pa_deleted = 0; pa-pa_linear = 1; - mb_debug(new group pa %p: %lu/%lu for %lu\n, pa, + mb_debug(new group pa %p: %llu/%u for %u\n, pa, pa-pa_pstart, pa-pa_len, pa-pa_lstart); ext4_mb_use_group_pa(ac, pa); @@ -4072,8 +4072,8 @@ static int ext4_mb_initialize_context(st * locality group. this is a policy, actually */ ext4_mb_group_or_file(ac); - mb_debug(init ac: %u blocks @ %llu, goal %llu, flags %x, 2^%d, - left: %llu/%llu, right %llu/%llu to %swritable\n, + mb_debug(init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, + left: %u/%u, right %u/%u to %swritable\n, (unsigned) ar-len, (unsigned) ar-logical, (unsigned) ar-goal, ac-ac_flags, ac-ac_2order, (unsigned) ar-lleft, (unsigned) ar-pleft, @@ -4288,7 +4288,7 @@ static int ext4_mb_free_metadata(handle_ page_cache_get(e4b-bd_bitmap_page); db-bb_md_cur = md; db-bb_tid = handle-h_transaction-t_tid; - mb_debug(new md 0x%p for group %u\n, + mb_debug(new md 0x%p for group %lu\n, md, md-group); } else { kfree(md); _ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Fri, 2007-11-30 at 17:08 -0600, Eric Sandeen wrote: Mingming Cao wrote: [PATCH] jbd2 stats through procfs The patch below updates the jbd stats patch to 2.6.20/jbd2. The initial patch was posted by Alex Tomas in December 2005 (http://marc.info/?l=linux-ext4m=113538565128617w=2). It provides statistics via procfs such as transaction lifetime and size. [ This probably should be rewritten to use debugfs? -- Ted] Signed-off-by: Johann Lombardi [EMAIL PROTECTED] I've started going through this one to clean it up to the point where it can go forward. It's been sitting at the top of the unstable portion of the patch queue for long enough, I think :) Thanks! I've converted the msecs to jiffies until the user boundary, changed the union #defines as suggested by Andrew, and various other little issues etc. Remaining to do is a generic time-difference calculator (instead of jbd2_time_diff), and looking into whether it should be made a config option; I tend to think it should, but it's fairly well sprinkled through the code, so I'll see how well that works. Also did we ever decided if this should go to debugfs? I thought it was decided to keep it on procfs as debugfs is not always on... Thanks, -Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext4 still broken on arm (at least)
Andrew Morton wrote: On Wed, 28 Nov 2007 11:00:20 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote: On Tue, Nov 27, 2007 at 09:14:44PM -0800, Andrew Morton wrote: fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy': fs/ext4/mballoc.c:836: error: implicit declaration of function `ext2_find_next_bit' I actually sent in a patch which changes asking for review to linux-arch. I haven't got the response yet. Don't expect one... Attaching the patch below Introduce ext4_find_next_bit From: Aneesh Kumar K.V [EMAIL PROTECTED] This gets used by the ext4 multi block allocator patches. Also add generic_find_next_le_bit Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- include/asm-arm/bitops.h |2 + include/asm-generic/bitops/ext2-non-atomic.h |2 + include/asm-generic/bitops/le.h |4 ++ include/asm-m68k/bitops.h|2 + include/asm-m68knommu/bitops.h |2 + include/asm-powerpc/bitops.h |4 ++ include/asm-s390/bitops.h|2 + include/linux/ext4_fs.h |1 + lib/find_next_bit.c | 43 ++ May as well merge it I guess. I'll do that. Possibly it should be merged into the ext4 tree instead, so that people can build the ext4 devel tree on other architectures but obviously there's no call for that. I will merge it to ext4 development tree. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: dir inode reservation V3
On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote: Basic idea of my dir inode reservation patch can be found here, http://lists.openwall.net/linux-ext4/2007/11/05/3 1, What does dir inode reservation do Dir inode reservation tries to reserve several inodes in inodes table for a directory when this directory is created. When create new file under this directory, try to allocate inode from the reserved inodes area. This is called as dir_ireserve inode allocator. Thanks for the update. Let me try to understand your method: So the basic idea is not do linear inode allocation for directory? Inode structure block for directory file is only coming from block 0, N, N +N,... where the number of skipped blocks N is stored in the in-core superblock structure. When ever need to allocate an inode for directory, skip N reserved bits (space for N*16 inodes) if the previous block is already allocated. That way place two directories with the hole of N*16 inodes structures, then allow files under the first directory stay closer with their parent directory. Is this correct? 4, Dir inode reservation is optional Dir inode reservation is optional, you can use -o followed by one of these options to enable dir inode reservation during mount ext4 file system: dir_ireserve=low dir_ireserve=normal dir_ireserve=high Would be nice to pass the tuning info low/normal/high(16/64/128 blocks correspondingly) via something else rather than mount options. Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously. 5, Performance number On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create 5 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, normal ext4 ext4 with dir inode reservation mount options: -o data=writeback -o data=writeback,dir_ireserve=low Create dirs:real0m49.101s real2m59.703s Create files: real24m17.962s real21m8.161s Unlink all: real24m43.788s real17m29.862s Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Hmm...I suspect there is bug in your patch, the extra seek should not contribute to 4 times slower #include linux/time.h @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, return -1; } +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, + int mode, ext4_group_t *group, unsigned long *ino) +{ + struct super_block *sb; + struct ext4_sb_info *sbi; + struct ext4_group_desc *gdp = NULL; + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; + ext4_group_t ires_group = *group; + unsigned long ires_ino; + int i, bit; + + sb = dir-i_sb; + sbi = EXT4_SB(sb); + + /* if the inode number is not for directory, + * only try to allocate after directory's inode + */ + if (!S_ISDIR(mode)) { + *ino = dir-i_ino % EXT4_INODES_PER_GROUP(sb); + return 0; + } + + /* reserve inodes for new directory */ + for (i = 0; i sbi-s_groups_count; i++) { + gdp = ext4_get_group_desc(sb, ires_group, gdp_bh); + if (!gdp) + goto fail; + bit = 0; +try_same_group: + if (bit EXT4_INODES_PER_GROUP(sb)) { + brelse(bitmap_bh); + bitmap_bh = read_inode_bitmap(sb, ires_group); + if (!bitmap_bh) + goto fail; + + BUFFER_TRACE(bitmap_bh, get_write_access); + if (ext4_journal_get_write_access( + handle, bitmap_bh) != 0) + goto fail; + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), + bit, bitmap_bh-b_data)) { + /* we won it */ + BUFFER_TRACE(bitmap_bh, + call ext4_journal_dirty_metadata); + if (ext4_journal_dirty_metadata(handle, + bitmap_bh) != 0) + goto fail; + ires_ino = bit; + goto find; + } + /* we lost it */ +
Re: [2.6 patch] ext4/super.c: fix #ifdef's
Acked-by: Mingmming Cao [EMAIL PROTECTED] Ted, I added this patch in ext4 patch queue. On Mon, 2007-11-05 at 18:07 +0100, Adrian Bunk wrote: This patch fixes the names of two variables in #ifdef's. Based on a report by Robert P. J. Day. Signed-off-by: Adrian Bunk [EMAIL PROTECTED] --- fs/ext4/super.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) 44e9889e6a3952ea225704b2e494d31e00f34a6b diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8031dc0..6673672 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -646,7 +646,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ,debug); if (test_opt(sb, OLDALLOC)) seq_puts(seq, ,oldalloc); -#ifdef CONFIG_EXT4_FS_XATTR +#ifdef CONFIG_EXT4DEV_FS_XATTR if (test_opt(sb, XATTR_USER)) seq_puts(seq, ,user_xattr); if (!test_opt(sb, XATTR_USER) @@ -654,7 +654,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ,nouser_xattr); } #endif -#ifdef CONFIG_EXT4_FS_POSIX_ACL +#ifdef CONFIG_EXT4DEV_FS_POSIX_ACL if (test_opt(sb, POSIX_ACL)) seq_puts(seq, ,acl); if (!test_opt(sb, POSIX_ACL) (def_mount_opts EXT4_DEFM_ACL)) - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
On Mon, 2007-11-05 at 10:07 -0800, Badari Pulavarty wrote: On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote: On Nov 05, 2007 08:04 -0800, Badari Pulavarty wrote: On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote: But... this implies that every user of bh-b_data needs to kmap, and I don't see that in the code anywhere else. That makes me think something else is going wrong here. Most cases, this is handled in ll_rw_block() code - when we submit the buffer head for IO. If the page is in highmem, we will end up creating a bounce bufer for it. In our case, JBD code is trying to look at the data to do checksum on it. Thats why we have to kmap() the page before looking. My point is that there is a LOT of code in ext[234] that dereferences bh-b_data without kmap() (e.g. group descriptors, bitmaps, superblock, inode tables, etc). Does that imply that something is forcing those bh pages into lowmem, or is the journal bh page in question being allocated in some different way that allows it to be in highmem? Yes. You are right. Its been a while since I had to deal with HIGHMEM. All the meta-data should be in LOWMEM. I asked Mingming to verify what the buffer-head is pointing to when it has HIGHMEM page. The buffer_heads with NULL bh-b_data(under the start_journal_io branch in jbd2_journal_commit_transaction() code) is created by jbd2_journal_write_metadata_buffer(). Noticed that in jbd2_journal_write_metadata_buffer(), there are multiple places which do kmap_atomic() to access the journal bh page (new_page). In the normal case the new_page is pointing to the bh pages, which(the page) was initially allocated by _page_cache_alloc() (sb_bread-__bread()-_...find_or_create_page()-_page_cache_alloc() In the case it need a data copy (the buffer start with the JBD2_MAGIC_NUMBER?), a new page is allocated by by __get_free_pages()(via jbd2_alloc, which is possible allocated in highmem. __get_free_pages calls alloc_pages() directly, doesn't seem to have highmem handling like __page_cache_alloc(). I am not sure why we saw this issue on 2.6.23 kernel, where jbd2_slab_alloc()-kmem_cache_alloc() is used. Isn't all slab pages under lowmem? Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RFC]Ext4: Use get_cpu()/put_cpu() in preemptible context
Fix the warning of: printk: 369 messages suppressed. BUG: using smp_processor_id() in preemptible [0001] code: fsx-linux/31702 caller is ext4_mb_new_blocks+0x2aa/0x1319 Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/mballoc.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24-rc1/fs/ext4/mballoc.c === --- linux-2.6.24-rc1.orig/fs/ext4/mballoc.c 2007-11-02 17:22:18.0 -0700 +++ linux-2.6.24-rc1/fs/ext4/mballoc.c 2007-11-02 17:23:02.0 -0700 @@ -4006,7 +4006,8 @@ static void ext4_mb_group_or_file(struct return; BUG_ON(ac-ac_lg != NULL); - ac-ac_lg = sbi-s_locality_groups[smp_processor_id()]; + ac-ac_lg = sbi-s_locality_groups[get_cpu()]; + put_cpu(); /* we're going to use group allocation */ ac-ac_flags |= EXT4_MB_HINT_GROUP_ALLOC; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
JBD2: Fix NULL pointer bh-b_data on NUMA box with journal checksumming. Current journal checksumming patch failed fsstress test on NUMA. The bh-b_data passed to the crc32_be () function could be NULL pointer, which caused kernel oops immediately when running fsstress with -o journal_checksum. It is because the page is part of highmem on NUMA box. We need to kmap the page before access the bh-b_data to calculate the checksums. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/commit.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6.24-rc1/fs/jbd2/commit.c === --- linux-2.6.24-rc1.orig/fs/jbd2/commit.c 2007-11-01 11:15:08.0 -0700 +++ linux-2.6.24-rc1/fs/jbd2/commit.c 2007-11-01 12:27:02.0 -0700 @@ -352,6 +352,20 @@ write_out_data: journal_do_submit_data(wbuf, bufs); } +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh) +{ + struct page *page = bh-b_page; + char *addr; + __u32 checksum; + + addr = kmap(page); + checksum = crc32_be(crc32_sum, + (void *)(addr + offset_in_page(bh-b_data)), bh-b_size); + kunmap(page); + + return checksum; +} + static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag, unsigned long long block) { @@ -715,9 +729,8 @@ start_journal_io: */ if (JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM)) { - crc32_sum = crc32_be(crc32_sum, - (void *)bh-b_data, - bh-b_size); + crc32_sum = + jbd2_checksum_data(crc32_sum, bh); } lock_buffer(bh); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problem with delayed allocation option
On Fri, 2007-10-26 at 14:28 +0200, Valerie Clement wrote: Hi all, Hi Valerie, I ran a small test which creates one directory and 2O 8-KB size files in it. When the filesystem is mounted without the delalloc option, here is the output of the command dumpe2fs for the group in which the directory and the files are created: Group 532 : (Blocks 17432576-17465343) Block bitmap at 17432576 (+0), Inode bitmap at 17432577 (+1) Inode table at 17432578-17433089 (+2) 32213 free blocks, 16363 free inodes, 1 directories Free blocks : 17433090-17459199, 17459241-17465343 Free inodes : 8716310-8732672 When the filesystem is mounted with the delalloc option, the same test gives a different result: Group 395 : (Blocks 12943360-12976127) Block bitmap at 12943360 (+0), Inode bitmap at 12943361 (+1) Inode table at 12943362-12943873 (+2) 32213 free blocks, 16363 free inodes, 1 directories Free blocks : 12943874-12955647, 12955650-12955655, 12955658-12955663, 12955666-12955671, 12955674-12955679, 12955682-12955687, 12955690-12955695, 12955698-12955703, 12955706-12955711, 12955714-12955719, 12955722-12955727, 12955730-12955735, 12955738-12955743, 12955746-12955751, 12955754-12955759, 12955762-12955767, 12955770-12955775, 12955778-12955783, 12955786-12955791, 12955794-12955799, 12955802-12961791, 12961793-12976127 Free inodes : 6471702-6488064 In the first case, the allocated blocks are contiguous whereas they are not in the second case. After adding traces in the code to understand why the behavior is different with the delalloc option, I found that the problem is related to the inode reservation window. To simplify, without the delalloc option we have the following scenario: For each inode, - call alloc_new_reservation() to allocate a new reservation window - allocate blocks for data - write data to disk - ext4_discard_reservation() when the inode is closed. With the delalloc option, when the data are written to disk we have: For each inode, - call alloc_new_reservation() to allocate a new reservation window - allocate blocks for data - write data to disk I think a call to ext4_discard_reservation() is missing somewhere and the question is where. Oh, that should be block reservation, not inode reservation window. The problem with delayed allocation and block reservation is, we don't know when suppose to close the window, as the file maybe closed with diry data in cache,and the blocks has not be allocated yet. We would like to keep the window open so that later delayed allocation happens, the allocation could take advantage of the reservation. But on the other hand, that may leads fs external fragmentation. with mballoc, ext3 block reservation should be turned off and replaced with the group-in-core-preallocation. Has the new delayed allocation integrated with mballoc yet? I tried to add this call at the end of the ext4_da_get_block_write() function. This seems to fix the problem as the blocks are allocated contiguously on disk but the function seems to be called too many times so I think it is perhaps not the right place to call it. Who could look into this problem? I've got a few days off so I couldn't help more next days, but the problem is easily reproductible. Wouldn't this also explain why the compilebench results posted by Chris Mason are bad in some cases? Valérie - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix oops with jbd-stats-through-procfs and external journal
Thanks, Added to ext4 patch queue at http://repo.or.cz/w/ext4-patch-queue.git On Thu, 2007-10-25 at 11:39 -0500, Eric Sandeen wrote: When using an external device for the journal, jbd2_stats_proc_init() wants to use journal-j_dev in its call to bdevname() but it's not assigned yet, resulting in an oops. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- linux.orig/fs/jbd2/journal.c 2007-10-25 11:36:25.772354262 -0500 +++ linux/fs/jbd2/journal.c 2007-10-25 11:36:35.058278242 -0500 @@ -1035,11 +1035,11 @@ journal = NULL; goto out; } - jbd2_stats_proc_init(journal); journal-j_dev = bdev; journal-j_fs_dev = fs_dev; journal-j_blk_offset = start; journal-j_maxlen = len; + jbd2_stats_proc_init(journal); bh = __getblk(journal-j_dev, start, journal-j_blocksize); J_ASSERT(bh != NULL); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] i_version update - vfs part
On Thu, 2007-10-25 at 19:04 +0200, Cordenner jean noel wrote: Hi, This is an update of the previous patches on the ext4 git tree, the 2 coming patches applies at the end of the current ext4-patch-queue, and replaces the inode-version related patches: 64-bit-i_version.patch i_version_hi.patch ext4_i_version_hi_2.patch i_version_update_ext4.patch The first part deals with the vfs part. The i_version field of the inode is changed to be a 64-bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530. This first part concerns the vfs, it converts the 32-bit i_version in the generic inode to a 64-bit, a flag is added in the super block in order to check if the feature is enabled and the i_version is incremented in the vfs. Thanks for reposting it. Index: linux-2.6.23-ext4-1/fs/inode.c === --- linux-2.6.23-ext4-1.orig/fs/inode.c 2007-10-25 16:15:52.0 +0200 +++ linux-2.6.23-ext4-1/fs/inode.c2007-10-25 16:25:53.0 +0200 @@ -1216,6 +1216,24 @@ EXPORT_SYMBOL(touch_atime); /** + * inode_inc_iversion - increments i_version + * @inode: inode that need to be updated + * + * Every time the inode is modified, the i_version field + * will be incremented. + * The filesystem has to be mounted with i_version flag + * + */ + +void inode_inc_iversion(struct inode *inode) +{ + spin_lock(inode-i_lock); + inode-i_version++; + spin_unlock(inode-i_lock); +} I wonder do we really need i_lock here for inode versioning update? Understand this is a 64 bit counter, but file_update_time() and ext4_mark_inode_dirty() (where the inode version is updated) is called on the file write path so i_mutex should be hold all the time. As long as the read patch holding i_mutex everything should be fine, isn't it? Have you get a chance to check the performance impact to ext4? + +/** * file_update_time- update mtime and ctime time * @file: file accessed * @@ -1249,6 +1267,11 @@ sync_it = 1; } + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: One request for people editing the patch queue....
On Thu, 2007-10-25 at 15:49 -0400, Theodore Ts'o wrote: Could people please use more descriptive names for patch names? Names like: ext4-cleanup-2.patch ext4-cleanup-3.patch ext4-cleanup-4.patch aren't very useful when browsing through the series file. Thanks, Hi Ted, I was me being lazy, I just took an updated tar ball from Aneesh without carefully rename them to a proper names. Updated ext4-patch-queue patches with meaningful names now. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext4 patch queue updated(2.6.23-git17)
http://repo.or.cz/w/ext4-patch-queue.git This patch queue is against mainline. Changes: Rebase ext4 patch queue after Linus pull patches from ext4 git tree. Added ext4-Return-after-ext4_error-in-case-of-failures.patch from Aneesh Added mballoc-compilebench-fix.patch from Aneesh Reordered series to move some cleanup patches to stable series. # This series applies on 2.6.24-git17 # # # Large blocksize support for ext4 # ext4_large_blocksize_support.patch ext4_rec_len_overflow_with_64kblk_fix-v2.patch large-file-blocktype.patch #block group varibles ext4_grpnum_t ext4_grpnum_t.patch ext4_grpnum_t_int_fix.patch #large file support ext4-cleanup.patch ext4-cleanup-2.patch ext4-cleanup-3.patch ext4-cleanup-4.patch 48-bit-i_blocks.patch large-file.patch ext2_fix_max_size.patch ext3_fix_max_size.patch ext4_sync_group_desciptor_with_e2fsprogs.patch ext4-Return-after-ext4_error-in-case-of-failures.patch # unstable patches still have outstanding comments from lkml stable-boundary stable-boundary-undo.patch # Export jbd stats through procfs # Fold together: #ext4-jbd-stats-through-procfs.patch #ext4-jbd-stats-through-procfs_fix.patch #ext4-fs-jbd2-journalc-kmalloc-memset-conversion-to-kzalloc.patch # still have comments from akpm to be addressed jbd-stats-through-procfs # Add journal checksums ext4-journal_chksum-2.6.20.patch ext4-journal-chksum-review-fix.patch # inode verion patch series # inode versioning is needed for NFSv4 # Need to address overwrite case and performance concerns # Need to repost to lkml for comments before submit # # vfs changes, 64 bit inode-i_version 64-bit-i_version.patch # reserve hi 32 bit inode version on ext4 on-disk inode i_version_hi.patch # ext4 inode version read/store ext4_i_version_hi_2.patch # ext4 inode version update i_version_update_ext4.patch # New delayed allocation patch delalloc-vfs.patch delalloc-ext4.patch ext-truncate-mutex.patch ext3-4-migrate.patch ### #mballoc ### # n.b. in the rc7-mm1 tree as introduce-ext4_find_next_bit.patch generic-find-next-le-bit new-extent-function.patch mballoc-core.patch mballoc-bug-workaround.patch ext4_grpnumt-mballoc-fix.patch mballoc-compilebench-fix.patch # Large block support for blocksize pagesize # Needed for Christoph Lameter's largeblock patchset # to support large block on system that # blocksize pagesize jbd-blocks-reservation-fix-for-large-blk.patch jbd2-blocks-reservation-fix-for-large-blk.patch - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: compilebench numbers for ext4
On Mon, 2007-10-22 at 19:31 -0400, Chris Mason wrote: Hello everyone, I recently posted some performance numbers for Btrfs with different blocksizes, and to help establish a baseline I did comparisons with Ext3. Thanks for doing this, Chris! The graphs, numbers and a basic description of compilebench are here: http://oss.oracle.com/~mason/blocksizes/ Ext3 easily wins the read phase, but scores poorly while creating files and deleting them. Since ext3 is winning the read phase, we can assume the file layout is fairly good. I think most of the problems during the write phase are caused by pdflush doing metadata writeback. The file data and metadata are written separately, and so we end up seeking between things that are actually close together. Andreas asked me to give ext4 a try, so I grabbed the patch queue from Friday along with the latest Linus kernel. The FS was created with: mkfs.ext3 -I 256 /dev/ mount -o delalloc,mballoc,data=ordered -t ext4dev /dev/ I did expect delayed allocation to help the write phases of compilebench, especially the parts where it writes out .o files in random order (basically writing medium sized files all over the directory tree). Unfortunately delayed allocation support for ordered mode is not there yet. But, every phase except reads showed huge improvements. http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png To match the ext4 numbers with Btrfs, I'd probably have to turn off data checksumming... But oddly enough I saw very bad ext4 read throughput even when reading a single kernel tree (outside of compilebench). The time to read the tree was almost 2x ext3. Have others seen similar problems? thanks for point this out, will run compilebench. Trying to understand the Disk IO graph http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png it looks like ext3 the blocks are spread over the disk, while ext4 is more around the same place, is this right? I think the ext4 delete times are so much better than ext3 because this is a single threaded test. delayed allocation is able to get everything into a few extents, and these all end up in the inode. So, the delete phase only needs to seek around in small directories and seek to well grouped inodes. ext3 probably had to seek all over for the direct/indirect blocks. So, tomorrow I'll run a few tests with delalloc and mballoc independently, but if there are other numbers people are interested in, please let me know. (test box was a desktop machine with single sata drive, barriers were not used). -chris - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Wed, 2007-10-17 at 21:09 -0700, Andrew Morton wrote: On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara [EMAIL PROTECTED] wrote: With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. btw, this changes ext2's on-disk format. Just to clarify this is only changes the directory entries format on ext2/3/4 fs with 64k block size. But currently without kernel changes ext2/3/4 does not support 64 block size. a) is the ext2 format documented anywhere? If so, that document will need updating. The e2fsprogs needs to be changed to sync up with this change. Ted has a paper a while back to show ext2 disk format http://web.mit.edu/tytso/www/linux/ext2intro.html The Documentation/filesystems/ext2.txt doesn't have the ext2 format documented. That document is out-dated need to be reviewed and cleaned up. b) what happens when an old ext2 driver tries to read and/or write this directory entry? Do we need a compat flag for it? c) what happens when old and new ext3 or ext4 try to read/write this directory entry? Without the first patch in this series: ext2 large blocksize support patches, it fails to mount a ext2 filesystem with 64k block size. [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE http://lkml.org/lkml/2007/10/1/361 So the old ext2/3/4 driver will not get access the directory entry with 64k block size format changes. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: question about patch queue and ext4-git
On Mon, 2007-10-15 at 22:29 +0800, Coly Li wrote: Jose R. Santos wrote: On Mon, 15 Oct 2007 20:46:32 +0800 Coly Li [EMAIL PROTECTED] wrote: Thanks for the replying :-) Jose R. Santos wrote: On Mon, 15 Oct 2007 13:36:05 +0800 Coly Li [EMAIL PROTECTED] wrote: Now in my mind there are several words for ext4 patches, most frequently one are patch queue. I see the patches in patch queue from http://www2.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ . Also I confirm some of the patches are in ext4 git tree now, but I am not sure for two questions: 1) Whether all the patches are in ext4 git tree ? Yes, all these patches should be in ext4 git tree. 2) This patch queue is only used to push ext4 patch into upstream ? The patch queue series is divided into stable and unstable patches. The stable patches are the one usually the ones used to push back upstream, while the unstable section has the patches for development purposes only and are not ready for pushing upstream (and some may never make it in). How to recognize which patch is stable patch and which one is unstable patch ? Look in the series file. The mark where the stable patches end is documented there. Find it now. Thanks for your explaining :-) when a new ext4 patch is being submitted, it first added to the patch queue on repo.or.cz, where the whole series (stable and unstable patches) will be tested automatically. After the patches being stablized a bit, Ted will pull ext4 patches from patch queue from repo.or.cz to his git tree, and public the current patches in the git tree on kernel.org. Also there is a stable-boundary patch being added in repo.or.cz patch queue to indicating where the unstable patches starts. Mingming Also there is a patch-queue git at http://repo.or.cz/w/ext4-patch-queue.git , is it same to the patches in http://www2.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/ ? Same thing bug in git format. I believe Ted updates his patch queue from the patches in the git tree repo, so if you want latest/greatest the git tree is what you want. Thanks for clarifying :-) -JRS -JRS - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: patch queue updated
On Wed, 2007-10-10 at 17:35 +0530, Aneesh Kumar K.V wrote: I have updated the patch queue at http://www.radian.org/~kvaneesh/ext4/patch-series/ http://www.radian.org/~kvaneesh/ext4/patch-series/ext4-patch-queue.tar Changes: a) Add large-file-blocktype.patch large-file.patch patch. Could you send them out separately? I noticed that your queue is not based on the last ext4 patch queue. You modified mballoc-core.patch directly with large file block type changes...but your updated mballoc-core.patch doesn't have the latest proc fs fix from Ted. I would appreciate if you separete the largeblocktype changes for mballoc out of the core mballoc changes, as that patch is getting really big already. b) Add the stable-boundary-patch with dummy change c) Add stable-boundary-undo.patch which undo the above dummy change. The stable-boundary patch is being removed from ext4 patch queue,as some tools expect a meaning code diff to apply any patch. d) Update the series to point to the right file name ext4_grpnum_t_int_fix.patch instead of ext4_grpnum_t_negative_value_fix.patch Fixed it in ext4-patch-queue.Thanks, The diff between the content of the current patch-queu and the this is below -aneesh diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index 2ba49ff..9ffae96 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -124,7 +124,7 @@ static int ext4_readdir(struct file * filp, offset = filp-f_pos (sb-s_blocksize - 1); while (!error !stored filp-f_pos inode-i_size) { - unsigned long blk = filp-f_pos EXT4_BLOCK_SIZE_BITS(sb); + ext4_lblk_t blk = filp-f_pos EXT4_BLOCK_SIZE_BITS(sb); struct buffer_head map_bh; struct buffer_head *bh = NULL; diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 0ea6122..392b76e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -144,7 +144,7 @@ static int ext4_ext_dirty(handle_t *handle, struct inode *inode, static ext4_fsblk_t ext4_ext_find_goal(struct inode *inode, struct ext4_ext_path *path, - ext4_fsblk_t block) + ext4_lblk_t block) { struct ext4_inode_info *ei = EXT4_I(inode); ext4_fsblk_t bg_start; @@ -367,13 +367,14 @@ static void ext4_ext_drop_refs(struct ext4_ext_path *path) * the header must be checked before calling this */ static void -ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int block) +ext4_ext_binsearch_idx(struct inode *inode, + struct ext4_ext_path *path, ext4_lblk_t block) { struct ext4_extent_header *eh = path-p_hdr; struct ext4_extent_idx *r, *l, *m; - ext_debug(binsearch for %d(idx): , block); + ext_debug(binsearch for %lu(idx): , (unsigned long)block); l = EXT_FIRST_INDEX(eh) + 1; r = EXT_LAST_INDEX(eh); @@ -425,7 +426,8 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc * the header must be checked before calling this */ static void -ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) +ext4_ext_binsearch(struct inode *inode, + struct ext4_ext_path *path, ext4_lblk_t block) { struct ext4_extent_header *eh = path-p_hdr; struct ext4_extent *r, *l, *m; @@ -438,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) return; } - ext_debug(binsearch for %d: , block); + ext_debug(binsearch for %lu: , (unsigned long)block); l = EXT_FIRST_EXTENT(eh) + 1; r = EXT_LAST_EXTENT(eh); @@ -494,7 +496,8 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode) } struct ext4_ext_path * -ext4_ext_find_extent(struct inode *inode, int block, struct ext4_ext_path *path) +ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block, + struct ext4_ext_path *path) { struct ext4_extent_header *eh; struct buffer_head *bh; @@ -979,8 +982,8 @@ repeat: /* refill path */ ext4_ext_drop_refs(path); path = ext4_ext_find_extent(inode, - le32_to_cpu(newext-ee_block), - path); + (ext4_lblk_t )le32_to_cpu(newext-ee_block), + path); if (IS_ERR(path)) err = PTR_ERR(path); } else { @@ -992,8 +995,8 @@ repeat: /* refill path */ ext4_ext_drop_refs(path); path = ext4_ext_find_extent(inode, - le32_to_cpu(newext-ee_block), - path); +(ext4_lblk_t)le32_to_cpu(newext-ee_block), + path); if
Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set
On Tue, 2007-10-09 at 09:31 -0700, Badari Pulavarty wrote: On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote: plain text document attachment (ext4-add-init_ext4_proc-stub.patch) init_ext4_fs() calls init_ext4_proc() so we need a stub init_ext4_proc() for the case that CONFIG_PROC_FS is not set. Without the stub we get a build error: fs/ext4/mballoc.c: In function 'init_ext4_proc': fs/ext4/mballoc.c:2837: error: 'proc_root_fs' undeclared (first use in this function) fs/ext4/mballoc.c:2837: error: (Each undeclared identifier is reported only once fs/ext4/mballoc.c:2837: error: for each function it appears in.) Add a stub init_ext4_proc() function that does nothing but return 0 Signed-off-by: Mark Nelson [EMAIL PROTECTED] --- fs/ext4/mballoc.c |7 +++ 1 file changed, 7 insertions(+) Index: ext4/fs/ext4/mballoc.c === --- ext4.orig/fs/ext4/mballoc.c +++ ext4/fs/ext4/mballoc.c @@ -2825,6 +2825,7 @@ static int ext4_mb_destroy_per_dev_proc( return 0; } +#ifdef CONFIG_PROC_FS int __init init_ext4_proc(void) { ext4_pspace_cachep = @@ -2840,6 +2841,12 @@ int __init init_ext4_proc(void) return 0; } +#else +int __init init_ext4_proc(void) +{ + return 0; +} +#endif void exit_ext4_proc(void) { Nope. I don't think we can do this :( For example, we need to create ext4_pspace_cachep kmem cache for the pre-allocation to work. We can't ifdef it out. Mingming/Amit, can you take a look at this ? It looks like we NEED procfs support to make mballoc work. If so, we need to add it to the dependency. I guess our testing did not catch this up because we have CONFIG_PROC_FS enabled all the time. mballoc needs procfs for exporting some stats info and tunable paramenters to optimize/customize multiple allocation. We could select CONFIG_PROC_FS at kconfig when ext4dev is enabled. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] jbd/jbd2: JBD memory allocation cleanups
On Thu, 2007-10-04 at 07:52 +0100, Christoph Hellwig wrote: On Thu, Oct 04, 2007 at 01:50:36AM -0400, Theodore Ts'o wrote: From: Mingming Cao [EMAIL PROTECTED] JBD: Replace slab allocations with page cache allocations It's page allocations, not page cache allocations. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly That sounds like it should be a different patch.. Okay. Will sent the patches, that also separate JBD2 changes to a different patch. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] jbd: JBD slab allocation cleanups
JBD: JBD slab allocation cleanups From: Mingming Cao [EMAIL PROTECTED] JBD: Replace slab allocations with page allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/commit.c |6 +-- fs/jbd/journal.c | 88 ++- fs/jbd/transaction.c |8 ++-- include/linux/jbd.h | 13 +-- 4 files changed, 21 insertions(+), 94 deletions(-) Index: linux-2.6.23-rc9/fs/jbd/commit.c === --- linux-2.6.23-rc9.orig/fs/jbd/commit.c 2007-10-05 12:03:43.0 -0700 +++ linux-2.6.23-rc9/fs/jbd/commit.c2007-10-05 12:08:08.0 -0700 @@ -375,7 +375,7 @@ void journal_commit_transaction(journal_ struct buffer_head *bh = jh2bh(jh); jbd_lock_bh_state(bh); - jbd_slab_free(jh-b_committed_data, bh-b_size); + jbd_free(jh-b_committed_data, bh-b_size); jh-b_committed_data = NULL; jbd_unlock_bh_state(bh); } @@ -792,14 +792,14 @@ restart_loop: * Otherwise, we can just throw away the frozen data now. */ if (jh-b_committed_data) { - jbd_slab_free(jh-b_committed_data, bh-b_size); + jbd_free(jh-b_committed_data, bh-b_size); jh-b_committed_data = NULL; if (jh-b_frozen_data) { jh-b_committed_data = jh-b_frozen_data; jh-b_frozen_data = NULL; } } else if (jh-b_frozen_data) { - jbd_slab_free(jh-b_frozen_data, bh-b_size); + jbd_free(jh-b_frozen_data, bh-b_size); jh-b_frozen_data = NULL; } Index: linux-2.6.23-rc9/fs/jbd/journal.c === --- linux-2.6.23-rc9.orig/fs/jbd/journal.c 2007-10-05 12:03:43.0 -0700 +++ linux-2.6.23-rc9/fs/jbd/journal.c 2007-10-05 12:08:08.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* -* Create a slab for this blocksize -*/ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1624,77 +1616,6 @@ void * __jbd_kmalloc (const char *where, } /* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen and commit buffers from these slabs. - * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could - * cause bh to cross page boundary. - */ - -#define JBD_MAX_SLABS 5 -#define JBD_SLAB_INDEX(size) (size 11) - -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS]; -static const char *jbd_slab_names[JBD_MAX_SLABS] = { - jbd_1k, jbd_2k, jbd_4k, NULL, jbd_8k -}; - -static void journal_destroy_jbd_slabs(void) -{ - int i; - - for (i = 0; i JBD_MAX_SLABS; i++) { - if (jbd_slab[i]) - kmem_cache_destroy(jbd_slab[i]); - jbd_slab[i] = NULL; - } -} - -static int journal_create_jbd_slab(size_t slab_size) -{ - int i = JBD_SLAB_INDEX(slab_size); - - BUG_ON(i = JBD_MAX_SLABS); - - /* -* Check if we already have a slab created for this size -*/ - if (jbd_slab[i]) - return 0; - - /* -* Create a slab and force alignment to be same as slabsize - -* this will make sure that allocations won't cross the page -* boundary. -*/ - jbd_slab[i] = kmem_cache_create(jbd_slab_names[i
Re: [PATCH 2/2] i_version update - ext4 part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote: This patch update the i_version field of the inode and add a mount option to enable this feature. The other condition to enable this feature is that the inode size should be 256-bytes. Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] --- fs/ext4/inode.c |4 +++- fs/ext4/super.c |7 ++- include/linux/ext4_fs.h |1 + 3 files changed, 10 insertions(+), 2 deletions(-) Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/inode.c 2007-10-03 18:11:17.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c 2007-10-05 10:26:42.0 +0200 @@ -3173,7 +3173,9 @@ { int err = 0; - inode-i_version++; + if (test_opt(inode-i_sb, I_VERSION)) + inode_inc_iversion(inode); + /* the do_update_inode consumes one bh-b_count */ get_bh(iloc-bh); Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/super.c 2007-10-03 18:11:17.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c 2007-10-03 18:17:44.0 +0200 @@ -742,7 +742,7 @@ Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, Opt_grpquota, Opt_extents, Opt_noextents, Opt_delalloc, - Opt_mballoc, Opt_nomballoc, Opt_stripe, + Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_i_version, }; static match_table_t tokens = { @@ -800,6 +800,7 @@ {Opt_mballoc, mballoc}, {Opt_nomballoc, nomballoc}, {Opt_stripe, stripe=%u}, + {Opt_i_version, i_version}, {Opt_err, NULL}, {Opt_resize, resize}, }; @@ -1161,6 +1162,10 @@ return 0; sbi-s_stripe = option; break; + case Opt_i_version: + set_opt (sbi-s_mount_opt, I_VERSION); + sb-s_flags |= MS_I_VERSION; + break; Need to make sure this flag is cleared if remounted fs without I_VERSION default: printk (KERN_ERR EXT4-fs: Unrecognized mount option \%s\ Index: linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h === --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/ext4_fs.h 2007-10-03 18:11:17.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h 2007-10-03 18:11:54.0 +0200 @@ -500,6 +500,7 @@ #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x100 /* Journal Async Commit */ #define EXT4_MOUNT_DELALLOC 0x200 /* Delalloc support */ #define EXT4_MOUNT_MBALLOC 0x400 /* Buddy allocation support */ +#define EXT4_MOUNT_I_VERSION 0x800 /* i_version support */ /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H #define clear_opt(o, opt)o = ~EXT4_MOUNT_##opt I don't see places where this counter is being stored/load to/from disk, so I assume this is the not the full patch series? Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] i_version update - vfs part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote: Hi, Hi Jean Noel, This is an update of the i_version patch. Just to make sure, is this vfs patch and next ext4 patch together going to replace the 4 inode-version related patches currently in ext4-patch-queue (and git tree)? The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530: 5.5. Mandatory Attributes - Definitions Name#DataType Access Description ___ change3uint64 READ A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The servermay return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata. This first part deals with adding a flag in the super block and incrementing the i_version in the vfs. Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] --- fs/inode.c | 23 +++ fs/libfs.c | 12 include/linux/fs.h |3 +++ 3 files changed, 38 insertions(+) Index: linux-2.6.23-rc8-ext4-i_version/fs/inode.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/inode.c 2007-09-26 14:41:41.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/inode.c2007-10-05 16:14:41.0 +0200 @@ -1216,6 +1216,24 @@ EXPORT_SYMBOL(touch_atime); /** + * inode_inc_iversion - increments i_version + * @inode: inode that need to be updated + * + * Every time the inode is modified, the i_version field + * will be incremented. + * The filesystem has to be mounted with i_version flag + * + */ + +void inode_inc_iversion(struct inode *inode) +{ + spin_lock(inode-i_lock); + inode-i_version++; + spin_unlock(inode-i_lock); +} I suspect we need a lock here, the places where need to update the inode-i_version are already doing update for inode, mostly protected by i_mutex. You could remove the above function and update the counter directly at the places it need to. +EXPORT_SYMBOL(inode_inc_iversion); + Seems unnecessary. +/** * file_update_time- update mtime and ctime time * @file: file accessed * @@ -1249,6 +1267,11 @@ sync_it = 1; } + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } Index: linux-2.6.23-rc8-ext4-i_version/fs/libfs.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/libfs.c 2007-07-09 01:32:17.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/libfs.c2007-09-26 14:51:08.0 +0200 @@ -255,6 +255,10 @@ struct inode *inode = old_dentry-d_inode; inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME; + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + inode_inc_iversion(dir); + } inc_nlink(inode); atomic_inc(inode-i_count); dget(dentry); @@ -287,6 +291,10 @@ struct inode *inode = dentry-d_inode; inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME; + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + inode_inc_iversion(dir); + } drop_nlink(inode); dput(dentry); return 0; @@ -323,6 +331,10 @@ old_dir-i_ctime = old_dir-i_mtime = new_dir-i_ctime = new_dir-i_mtime = inode-i_ctime = CURRENT_TIME; + if (IS_I_VERSION(old_dir)) { + inode_inc_iversion(old_dir); + inode_inc_iversion(new_dir); + } return 0; } Need to update the counter in libfs.c? Index: linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h === --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/fs.h 2007-09-26 14:46:15.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h2007-09-26 14:51:08.0 +0200 @@ -123,6 +123,7 @@ #define MS_SLAVE (119) /* change to slave */ #define MS_SHARED(120) /* change to shared */ #define MS_RELATIME (121) /* Update atime relative to mtime/ctime. */ +#define MS_I_VERSION (122) /* Update inode i_version field */ #define MS_ACTIVE(130) #define MS_NOUSER(131) @@ -172,6 +173,7 @@ ((inode)-i_flags (S_SYNC|S_DIRSYNC)))
[PATCH 1/2] ext4: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext4. From: Takashi Sato [EMAIL PROTECTED] This patch set supports large block size(4k, =64k) in ext4, just enlarging the block size limit. But it is NOT possible to have 64kB blocksize on ext4 without some changes to the directory handling code. The reason is that an empty 64kB directory block would have a rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the filesystem. The proposed solution is treat 64k rec_len with a an impossible value like rec_len = 0x to handle this. The Patch-set consists of the following 2 patches. [1/2] ext4: enlarge blocksize - Allow blocksize up to pagesize [2/2] ext4: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize Now on 64k page ppc64 box runs with this patch set we could create a 64k block size ext4dev, and able to handle empty directory block. Patch consider to be merge to 2.6.24-rc1. Signed-off-by: Takashi Sato [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/super.c |5 + include/linux/ext4_fs.h |4 ++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 619db84..d8bb279 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1548,6 +1548,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) goto out_fail; } + if (!sb_set_blocksize(sb, blocksize)) { + printk(KERN_ERR EXT4-fs: bad blocksize %d.\n, blocksize); + goto out_fail; + } + /* * The ext4 superblock will not be buffer aligned for other than 1kB * block sizes. We need to calculate the offset from buffer start. diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index f9881b6..d15a15e 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -77,8 +77,8 @@ * Macro-instructions used to manage several block sizes */ #define EXT4_MIN_BLOCK_SIZE1024 -#defineEXT4_MAX_BLOCK_SIZE 4096 -#define EXT4_MIN_BLOCK_LOG_SIZE 10 +#defineEXT4_MAX_BLOCK_SIZE 65536 +#define EXT4_MIN_BLOCK_LOG_SIZE10 #ifdef __KERNEL__ # define EXT4_BLOCK_SIZE(s)((s)-s_blocksize) #else - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size
ext4: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext4_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/dir.c | 12 --- fs/ext4/namei.c | 76 ++- include/linux/ext4_fs.h | 20 3 files changed, 62 insertions(+), 46 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index 3ab01c0..20b1e28 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir, unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de-rec_len); + const int rlen = ext4_rec_len_from_disk(de-rec_len); if (rlen EXT4_DIR_REC_LEN(1)) error_msg = rec_len is smaller than minimal; @@ -176,10 +176,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de-rec_len) - EXT4_DIR_REC_LEN(1)) + if (ext4_rec_len_from_disk(de-rec_len) +EXT4_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de-rec_len); + i += ext4_rec_len_from_disk(de-rec_len); } offset = i; filp-f_pos = (filp-f_pos ~(sb-s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de-rec_len); + offset += ext4_rec_len_from_disk(de-rec_len); if (le32_to_cpu(de-inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext4_rec_len_from_disk(de-rec_len); } offset = 0; brelse (bh); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 5fdb862..96e8a85 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -281,7 +281,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent space += EXT4_DIR_REC_LEN(de-name_len); names++; } - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } printk((%i)\n, names); return (struct stats) { names, space, 1 }; @@ -552,7 +552,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash, */ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p) { - return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len)); + return (struct ext4_dir_entry_2 *)((char*)p + + ext4_rec_len_from_disk(p-rec_len)); } /* @@ -721,7 +722,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size, cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } return count; } @@ -823,7 +824,7 @@ static inline int search_dirblock(struct buffer_head * bh, return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de-rec_len); + de_len = ext4_rec_len_from_disk(de-rec_len); if (de_len = 0) return -1; offset += de_len; @@ -1136,7 +1137,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count) rec_len = EXT4_DIR_REC_LEN(de-name_len); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)-rec_len = - cpu_to_le16(rec_len); + ext4_rec_len_to_disk(rec_len); de-inode = 0; map
[PATCH 1/2] ext2: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext2 From: Takashi Sato [EMAIL PROTECTED] This patch set supports large block size(4k, =64k) in ext2, just enlarging the block size limit. But it is NOT possible to have 64kB blocksize on ext2 without some changes to the directory handling code. The reason is that an empty 64kB directory block would have a rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the filesystem. The proposed solution is treat 64k rec_len with a an impossible value like rec_len = 0x to handle this. The Patch-set consists of the following 2 patches. [1/2] ext2: enlarge blocksize - Allow blocksize up to pagesize [2/2] ext2: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize Now on 64k page ppc64 box runs with this patch set we could create a 64k block size ext2, and able to handle empty directory block. Please consider to include to mm tree. Signed-off-by: Takashi Sato [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext2/super.c |3 ++- include/linux/ext2_fs.h |4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 639a32c..765c805 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) brelse(bh); if (!sb_set_blocksize(sb, blocksize)) { - printk(KERN_ERR EXT2-fs: blocksize too small for device.\n); + printk(KERN_ERR EXT2-fs: bad blocksize %d.\n, + blocksize); goto failed_sbi; } diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h index 153d755..910a705 100644 --- a/include/linux/ext2_fs.h +++ b/include/linux/ext2_fs.h @@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb) * Macro-instructions used to manage several block sizes */ #define EXT2_MIN_BLOCK_SIZE1024 -#defineEXT2_MAX_BLOCK_SIZE 4096 -#define EXT2_MIN_BLOCK_LOG_SIZE 10 +#define EXT2_MAX_BLOCK_SIZE65536 +#define EXT2_MIN_BLOCK_LOG_SIZE10 #ifdef __KERNEL__ # define EXT2_BLOCK_SIZE(s)((s)-s_blocksize) #else - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
ext2: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext2/dir.c | 43 +++ include/linux/ext2_fs.h |1 + 2 files changed, 32 insertions(+), 12 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 2bf49d7..1329bdb 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -26,6 +26,24 @@ typedef struct ext2_dir_entry_2 ext2_dirent; +static inline unsigned ext2_rec_len_from_disk(__le16 dlen) +{ + unsigned len = le16_to_cpu(dlen); + + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len (1 16)) + BUG(); + return cpu_to_le16(len); +} + /* * ext2 uses block-sized chunks. Arguably, sector-sized ones would be * more robust, but we have what we have @@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page) } for (offs = 0; offs = limit - EXT2_DIR_REC_LEN(1); offs += rec_len) { p = (ext2_dirent *)(kaddr + offs); - rec_len = le16_to_cpu(p-rec_len); + rec_len = ext2_rec_len_from_disk(p-rec_len); if (rec_len EXT2_DIR_REC_LEN(1)) goto Eshort; @@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name, */ static inline ext2_dirent *ext2_next_entry(ext2_dirent *p) { - return (ext2_dirent *)((char*)p + le16_to_cpu(p-rec_len)); + return (ext2_dirent *)((char*)p + + ext2_rec_len_from_disk(p-rec_len)); } static inline unsigned @@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) return 0; } } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext2_rec_len_from_disk(de-rec_len); } ext2_put_page(page); } @@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de, struct page *page, struct inode *inode) { unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + le16_to_cpu(de-rec_len); + unsigned to = from + ext2_rec_len_from_disk(de-rec_len); int err; lock_page(page); @@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) /* We hit i_size */ name_len = 0; rec_len = chunk_size; - de-rec_len = cpu_to_le16(chunk_size); + de-rec_len = ext2_rec_len_to_disk(chunk_size); de-inode = 0; goto got_it; } @@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) if (ext2_match (namelen, name, de)) goto out_unlock; name_len = EXT2_DIR_REC_LEN(de-name_len); - rec_len = le16_to_cpu(de-rec_len); + rec_len = ext2_rec_len_from_disk(de-rec_len); if (!de-inode rec_len = reclen) goto got_it; if (rec_len = name_len + reclen) @@ -504,8 +523,8 @@ got_it: goto out_unlock; if (de-inode) { ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len); - de1-rec_len = cpu_to_le16(rec_len - name_len); - de-rec_len = cpu_to_le16(name_len); + de1-rec_len = ext2_rec_len_to_disk(rec_len - name_len); + de-rec_len = ext2_rec_len_to_disk(name_len); de = de1; } de-name_len = namelen; @@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page ) struct inode *inode = mapping-host; char *kaddr = page_address(page); unsigned from = ((char*)dir - kaddr) ~(ext2_chunk_size(inode)-1); - unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir-rec_len); + unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir-rec_len); ext2_dirent * pde = NULL; ext2_dirent * de = (ext2_dirent *) (kaddr + from); int err; @@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page ) err = mapping-a_ops-prepare_write
[PATCH 1/2] ext3: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext3 From: Takashi Sato [EMAIL PROTECTED] This patch set supports large block size(4k, =64k) in ext3 just enlarging the block size limit. But it is NOT possible to have 64kB blocksize on ext3 without some changes to the directory handling code. The reason is that an empty 64kB directory block would have a rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the filesystem. The proposed solution is treat 64k rec_len with a an impossible value like rec_len = 0x to handle this. The Patch-set consists of the following 2 patches. [1/2] ext3: enlarge blocksize - Allow blocksize up to pagesize [2/2] ext3: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize Now on 64k page ppc64 box runs with this patch set we could create a 64k block size ext3, and able to handle empty directory block. Signed-off-by: Takashi Sato [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext3/super.c |6 +- include/linux/ext3_fs.h |4 ++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 9537316..b4bfd36 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) } brelse (bh); - sb_set_blocksize(sb, blocksize); + if (!sb_set_blocksize(sb, blocksize)) { + printk(KERN_ERR EXT3-fs: bad blocksize %d.\n, + blocksize); + goto out_fail; + } logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize; offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize; bh = sb_bread(sb, logic_sb_block); diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index ece49a8..7aa5556 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -76,8 +76,8 @@ * Macro-instructions used to manage several block sizes */ #define EXT3_MIN_BLOCK_SIZE1024 -#defineEXT3_MAX_BLOCK_SIZE 4096 -#define EXT3_MIN_BLOCK_LOG_SIZE 10 +#defineEXT3_MAX_BLOCK_SIZE 65536 +#define EXT3_MIN_BLOCK_LOG_SIZE10 #ifdef __KERNEL__ # define EXT3_BLOCK_SIZE(s)((s)-s_blocksize) #else - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size
ext3: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext3_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext3/dir.c | 10 +++-- fs/ext3/namei.c | 90 ++- include/linux/ext3_fs.h | 20 ++ 3 files changed, 68 insertions(+), 52 deletions(-) diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c index c00723a..3c4c43a 100644 --- a/fs/ext3/dir.c +++ b/fs/ext3/dir.c @@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir, unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de-rec_len); + const int rlen = ext3_rec_len_from_disk(de-rec_len); if (rlen EXT3_DIR_REC_LEN(1)) error_msg = rec_len is smaller than minimal; @@ -177,10 +177,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de-rec_len) + if (ext3_rec_len_from_disk(de-rec_len) EXT3_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de-rec_len); + i += ext3_rec_len_from_disk(de-rec_len); } offset = i; filp-f_pos = (filp-f_pos ~(sb-s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de-rec_len); + offset += ext3_rec_len_from_disk(de-rec_len); if (le32_to_cpu(de-inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext3_rec_len_from_disk(de-rec_len); } offset = 0; brelse (bh); diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index c1fa190..2c38eb6 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -144,6 +144,15 @@ struct dx_map_entry u16 size; }; +/* + * p is at least 6 bytes before the end of page + */ +static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) +{ + return (struct ext3_dir_entry_2 *)((char*)p + + ext3_rec_len_from_disk(p-rec_len)); +} + #ifdef CONFIG_EXT3_INDEX static inline unsigned dx_get_block (struct dx_entry *entry); static void dx_set_block (struct dx_entry *entry, unsigned value); @@ -281,7 +290,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext3_dir_ent space += EXT3_DIR_REC_LEN(de-name_len); names++; } - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext3_next_entry(de); } printk((%i)\n, names); return (struct stats) { names, space, 1 }; @@ -548,14 +557,6 @@ static int ext3_htree_next_block(struct inode *dir, __u32 hash, /* - * p is at least 6 bytes before the end of page - */ -static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) -{ - return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len)); -} - -/* * This function fills a red-black tree with information from a * directory block. It returns the number directory entries loaded * into the tree. If there is an error it is returned in err. @@ -721,7 +722,7 @@ static int dx_make_map (struct ext3_dir_entry_2 *de, int size, cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext3_next_entry(de); } return count; } @@ -825,7 +826,7 @@ static inline int search_dirblock(struct buffer_head * bh, return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de-rec_len); + de_len = ext3_rec_len_from_disk(de
Re: kernel Oops in ext3 code
BUG: unable to handle kernel paging request at virtual address 104b printing eip: c0195bd3 *pde = Oops: [#1] PREEMPT SMP Modules linked in: vboxdrv binfmt_misc fuse coretemp hwmon gspca videodev v4l2_common v4l1_compat iwl3945 mac80211 tifm_7xx1 tifm_core joydev irda crc_ccitt 8250_pnp 8250 serial_core firewire_ohci firewire_core crc_itu_t CPU:0 EIP:0060:[c0195bd3]Not tainted VLI EFLAGS: 00010206 (2.6.23-rc6 #1) EIP is at ext3_discard_reservation+0x18/0x4d eax: dff23800 ebx: 1033 ecx: dfc15ec0 edx: esi: c0007c44 edi: 1033 ebp: dfc2bef4 esp: dfc2beac ds: 007b es: 007b fs: 00d8 gs: ss: 0068 Process kswapd0 (pid: 261, ti=dfc2a000 task=dfcac570 task.ti=dfc2a000) Stack: c0007ba4 c0007c44 1033 c019ec51 c0007c44 c0007d8c 002c c0171b1b 002c c0007c44 c0007c4c c0171da2 c050880c 0080 0080 c0171fb8 0080 c0007e48 df9e3910 7404 c03f5634 0080 00d0 Call Trace: [c019ec51] ext3_clear_inode+0x5d/0x76 [c0171b1b] clear_inode+0x6b/0xb9 [c0171da2] dispose_list+0x48/0xc9 [c0171fb8] shrink_icache_memory+0x195/0x1bd [c014f5ec] shrink_slab+0xe2/0x159 [c014f9a0] kswapd+0x2d3/0x431 [c0132520] autoremove_wake_function+0x0/0x33 [c014f6cd] kswapd+0x0/0x431 [c0132453] kthread+0x38/0x5d [c013241b] kthread+0x0/0x5d [c0104b73] kernel_thread_helper+0x7/0x10 === Code: 83 f8 01 19 c0 f7 d0 83 e0 08 89 42 0c 89 56 b4 5b 5e c3 57 56 89 c6 53 8b 58 b4 8b 80 a4 00 00 00 85 db 8b 80 78 01 00 00 74 30 83 7b 18 00 74 2a 8d b8 00 03 00 00 89 f8 e8 b8 ca 1a 00 83 7b EIP: [c0195bd3] ext3_discard_reservation+0x18/0x4d SS:ESP 0068:dfc2beac On Fri, 2007-09-28 at 17:00 +0200, Norbert Preining wrote: On Fr, 28 Sep 2007, Badari Pulavarty wrote: objdump -DlS balloc.o Here it is Thanks Looks like kernel oops at 1753(173b+0x18): 173b ext3_discard_reservation: ext3_discard_reservation(): 173b: 57 push %edi 173c: 56 push %esi 173d: 89 c6 mov%eax,%esi 173f: 53 push %ebx 1740: 8b 58 b4mov-0x4c(%eax),%ebx 1743: 8b 80 a4 00 00 00 mov0xa4(%eax),%eax 1749: 85 db test %ebx,%ebx 174b: 8b 80 78 01 00 00 mov0x178(%eax),%eax 1751: 74 30 je 1783 ext3_discard_reservation+0x48 1753: 83 7b 18 00 cmpl $0x0,0x18(%ebx) == Kernel oops here, ebx=1033, match bad page location 104b(=1033+0x18) 1757: 74 2a je 1783 ext3_discard_reservation+0x48 1759: 8d b8 00 03 00 00 lea0x300(%eax),%edi 175f: 89 f8 mov%edi,%eax 1761: e8 fc ff ff ff call 1762 ext3_discard_reservation+0x27 1766: 83 7b 18 00 cmpl $0x0,0x18(%ebx) 176a: 74 0d je 1779 ext3_discard_reservation+0x3e 176c: 8b 86 a4 00 00 00 mov0xa4(%esi),%eax 1772: 89 da mov%ebx,%edx 1774: e8 dc eb ff ff call 355 rsv_window_remove 1779: 89 f8 mov%edi,%eax 177b: 5b pop%ebx 177c: 5e pop%esi 177d: 5f pop%edi 177e: e9 fc ff ff ff jmp177f ext3_discard_reservation+0x44 1783: 5b pop%ebx 1784: 5e pop%esi 1785: 5f pop%edi 1786: c3 ret And trying to matching to the code: void ext3_discard_reservation(struct inode *inode) { struct ext3_inode_info *ei = EXT3_I(inode); struct ext3_block_alloc_info *block_i = ei-i_block_alloc_info; struct ext3_reserve_window_node *rsv; spinlock_t *rsv_lock = EXT3_SB(inode-i_sb)-s_rsv_window_lock; if (!block_i) return; rsv = block_i-rsv_window_node; if (!rsv_is_empty(rsv-rsv_window)) { = kernel oops here spin_lock(rsv_lock); if (!rsv_is_empty(rsv-rsv_window)) rsv_window_remove(inode-i_sb, rsv); spin_unlock(rsv_lock); } } It seems ebx points to block_i(i_block_alloc_info), and that is bad memory location, so that leads to bad paging request when try to get the rsv_window structure. But it confused me why the rsv_window offset is 0x18 to i_block_alloc_info, it should be 0x14(20 bytes)...Are you running a vanilla 2.6.23-rc6? No clue how i_block_alloc_info pointing to a bad location for now. ext3_alloc_inode() clearly init this field to
Re: kernel Oops in ext3 code
Hi, Could you please sent the objdump of the ext4_discard_reservation function? It doesn't match what I see here. Thanks, Mingming On Thu, 2007-09-27 at 12:31 +0200, [EMAIL PROTECTED] wrote: Hi all! (Please Cc) kernel 2.6.23-rc6 Debian/sid kernel ooops: BUG: unable to handle kernel paging request at virtual address 104b printing eip: c0195bd3 *pde = Oops: [#1] PREEMPT SMP Modules linked in: vboxdrv binfmt_misc fuse coretemp hwmon gspca videodev v4l2_common v4l1_compat iwl3945 mac80211 tifm_7xx1 tifm_core joydev irda crc_ccitt 8250_pnp 8250 serial_core firewire_ohci firewire_core crc_itu_t CPU:0 EIP:0060:[c0195bd3]Not tainted VLI EFLAGS: 00010206 (2.6.23-rc6 #1) EIP is at ext3_discard_reservation+0x18/0x4d eax: dff23800 ebx: 1033 ecx: dfc15ec0 edx: esi: c0007c44 edi: 1033 ebp: dfc2bef4 esp: dfc2beac ds: 007b es: 007b fs: 00d8 gs: ss: 0068 Process kswapd0 (pid: 261, ti=dfc2a000 task=dfcac570 task.ti=dfc2a000) Stack: c0007ba4 c0007c44 1033 c019ec51 c0007c44 c0007d8c 002c c0171b1b 002c c0007c44 c0007c4c c0171da2 c050880c 0080 0080 c0171fb8 0080 c0007e48 df9e3910 7404 c03f5634 0080 00d0 Call Trace: [c019ec51] ext3_clear_inode+0x5d/0x76 [c0171b1b] clear_inode+0x6b/0xb9 [c0171da2] dispose_list+0x48/0xc9 [c0171fb8] shrink_icache_memory+0x195/0x1bd [c014f5ec] shrink_slab+0xe2/0x159 [c014f9a0] kswapd+0x2d3/0x431 [c0132520] autoremove_wake_function+0x0/0x33 [c014f6cd] kswapd+0x0/0x431 [c0132453] kthread+0x38/0x5d [c013241b] kthread+0x0/0x5d [c0104b73] kernel_thread_helper+0x7/0x10 === Code: 83 f8 01 19 c0 f7 d0 83 e0 08 89 42 0c 89 56 b4 5b 5e c3 57 56 89 c6 53 8b 58 b4 8b 80 a4 00 00 00 85 db 8b 80 78 01 00 00 74 30 83 7b 18 00 74 2a 8d b8 00 03 00 00 89 f8 e8 b8 ca 1a 00 83 7b EIP: [c0195bd3] ext3_discard_reservation+0x18/0x4d SS:ESP 0068:dfc2beac Sysrq did work, so the oops was saved. Good. Any ideas? Best wishes Norbert --- Dr. Norbert Preining [EMAIL PROTECTED]Vienna University of Technology Debian Developer [EMAIL PROTECTED] Debian TeX Group gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094 --- As he came into the light they could see his black and gold uniform on which the buttons were so highly polished that they shone with an intensity that would have made an approaching motorist flash his lights in annoyance. --- Douglas Adams, The Hitchhikers Guide to the Galaxy - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc
On Wed, 2007-09-26 at 12:54 -0700, Andrew Morton wrote: On Fri, 21 Sep 2007 16:13:56 -0700 Mingming Cao [EMAIL PROTECTED] wrote: Convert kmalloc to kzalloc() and get rid of the memset(). I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally better to raise separate patches, please - the ext3 patches I'll merge directly but the ext4 patches should go through (and be against) the ext4 devel tree. Sure. The patches(including ext3/jbd and ext4/jbd2) were merged into ext4 devel tree already, I will remove the ext3/jbd part out of the ext4 devel tree. I fixed lots of rejects against the already-pending changes to these filesystems. You forgot to remove the memsets in both start_this_handle()s. Thanks for catching this. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext4 patch queue updated(2.6.23-rc8)
http://repo.or.cz/w/ext4-patch-queue.git Changes + Add large block support back, updated patches to fix rec_len overflow issue with 64k blk size from Jan Kara split ext234_enlarge_blocksize.patch to three patches, one for each filesystem type Fix compile warning with jbd-stats-through-procfs split jbd_kmalloc_kzalloc.patch into two patches for ext3/jbd and ext4/jbd2 to sync with what in mm tree +Add jbd2-fix-commit-code-to-properly-abort-journal.patch from Jan kara +Added New patch set :Introduce le32_t and le16_t, from Aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] JBD2/ext4 naming cleanup
JBD2 naming cleanup From: Mingming Cao [EMAIL PROTECTED] change micros name from JBD_XXX to JBD2_XXX in JBD2/Ext4 Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/extents.c |2 +- fs/ext4/super.c |2 +- fs/jbd2/commit.c |2 +- fs/jbd2/journal.c |8 fs/jbd2/recovery.c|2 +- fs/jbd2/revoke.c |4 ++-- include/linux/ext4_jbd2.h |6 +++--- include/linux/jbd2.h | 30 +++--- 8 files changed, 28 insertions(+), 28 deletions(-) Index: linux-2.6.23-rc6/fs/ext4/super.c === --- linux-2.6.23-rc6.orig/fs/ext4/super.c 2007-09-21 16:27:31.0 -0700 +++ linux-2.6.23-rc6/fs/ext4/super.c2007-09-21 16:27:46.0 -0700 @@ -966,7 +966,7 @@ static int parse_options (char *options, if (option 0) return 0; if (option == 0) - option = JBD_DEFAULT_MAX_COMMIT_AGE; + option = JBD2_DEFAULT_MAX_COMMIT_AGE; sbi-s_commit_interval = HZ * option; break; case Opt_data_journal: Index: linux-2.6.23-rc6/include/linux/ext4_jbd2.h === --- linux-2.6.23-rc6.orig/include/linux/ext4_jbd2.h 2007-09-10 19:50:29.0 -0700 +++ linux-2.6.23-rc6/include/linux/ext4_jbd2.h 2007-09-21 16:27:46.0 -0700 @@ -12,8 +12,8 @@ * Ext4-specific journaling extensions. */ -#ifndef _LINUX_EXT4_JBD_H -#define _LINUX_EXT4_JBD_H +#ifndef _LINUX_EXT4_JBD2_H +#define _LINUX_EXT4_JBD2_H #include linux/fs.h #include linux/jbd2.h @@ -228,4 +228,4 @@ static inline int ext4_should_writeback_ return 0; } -#endif /* _LINUX_EXT4_JBD_H */ +#endif /* _LINUX_EXT4_JBD2_H */ Index: linux-2.6.23-rc6/include/linux/jbd2.h === --- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-21 09:07:09.0 -0700 +++ linux-2.6.23-rc6/include/linux/jbd2.h 2007-09-21 16:27:46.0 -0700 @@ -13,8 +13,8 @@ * filesystem journaling support. */ -#ifndef _LINUX_JBD_H -#define _LINUX_JBD_H +#ifndef _LINUX_JBD2_H +#define _LINUX_JBD2_H /* Allow this file to be included directly into e2fsprogs */ #ifndef __KERNEL__ @@ -37,26 +37,26 @@ #define journal_oom_retry 1 /* - * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds + * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds * certain classes of error which can occur due to failed IOs. Under - * normal use we want ext3 to continue after such errors, because + * normal use we want ext4 to continue after such errors, because * hardware _can_ fail, but for debugging purposes when running tests on * known-good hardware we may want to trap these errors. */ -#undef JBD_PARANOID_IOFAIL +#undef JBD2_PARANOID_IOFAIL /* * The default maximum commit age, in seconds. */ -#define JBD_DEFAULT_MAX_COMMIT_AGE 5 +#define JBD2_DEFAULT_MAX_COMMIT_AGE 5 #ifdef CONFIG_JBD2_DEBUG /* - * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal + * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal * consistency checks. By default we don't do this unless * CONFIG_JBD2_DEBUG is on. */ -#define JBD_EXPENSIVE_CHECKING +#define JBD2_EXPENSIVE_CHECKING extern u8 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ @@ -163,8 +163,8 @@ typedef struct journal_block_tag_s __be32 t_blocknr_high; /* most-significant high 32bits. */ } journal_block_tag_t; -#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) -#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t)) +#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) +#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) /* * The revoke descriptor: used on disk to describe a series of blocks to @@ -256,8 +256,8 @@ typedef struct journal_superblock_s #include linux/fs.h #include linux/sched.h -#define JBD_ASSERTIONS -#ifdef JBD_ASSERTIONS +#define JBD2_ASSERTIONS +#ifdef JBD2_ASSERTIONS #define J_ASSERT(assert) \ do { \ if (!(assert)) {\ @@ -284,9 +284,9 @@ void buffer_assertion_failure(struct buf #else #define J_ASSERT(assert) do { } while (0) -#endif /* JBD_ASSERTIONS */ +#endif /* JBD2_ASSERTIONS */ -#if defined(JBD_PARANOID_IOFAIL) +#if defined(JBD2_PARANOID_IOFAIL) #define J_EXPECT(expr, why...) J_ASSERT(expr) #define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) #define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote: On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote: JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly __GFP_NOFAIL should only be used when we have no way of recovering from failure. The allocation in journal_init_common() (at least) _can_ recover and hence really shouldn't be using __GFP_NOFAIL. (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is there as a marker which says we really shouldn't be doing this but we don't know how to fix it). So sometime it'd be good if you could review all the __GFP_NOFAILs in there and see if we can remove some, thanks. Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all cases except one handles memory allocation failure so I get rid of those GFP_NOFAIL flags. Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to cleanup that. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |2 +- fs/jbd/transaction.c |3 +-- fs/jbd2/journal.c |2 +- fs/jbd2/transaction.c |3 +-- 4 files changed, 4 insertions(+), 6 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:48:40.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); Index: linux-2.6.23-rc6/fs/jbd/transaction.c === --- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-19 11:48:05.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-19 11:49:10.0 -0700 @@ -96,8 +96,7 @@ static int start_this_handle(journal_t * alloc_transaction: if (!journal-j_running_transaction) { - new_transaction = kmalloc(sizeof(*new_transaction), - GFP_NOFS|__GFP_NOFAIL); + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS); if (!new_transaction) { ret = -ENOMEM; goto out; Index: linux-2.6.23-rc6/fs/jbd2/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.0 -0700 +++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:49:46.0 -0700 @@ -654,7 +654,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); Index: linux-2.6.23-rc6/fs/jbd2/transaction.c === --- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-19 11:48:08.0 -0700 +++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-19 11:50:12.0 -0700 @@ -96,8 +96,7 @@ static int start_this_handle(journal_t * alloc_transaction: if (!journal-j_running_transaction) { - new_transaction = kmalloc(sizeof(*new_transaction), - GFP_NOFS|__GFP_NOFAIL); + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS); if (!new_transaction) { ret = -ENOMEM; goto out; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] JBD: use GFP_NOFS in kmalloc
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |6 +++--- fs/jbd/revoke.c |8 fs/jbd2/journal.c |6 +++--- fs/jbd2/revoke.c |8 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ while((tmp = 1UL) != 0UL) shift++; - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[0]) return -ENOMEM; journal-j_revoke = journal-j_revoke_table[0]; @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ for (tmp = 0; tmp hash_size; tmp++) INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]); - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[1]) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); Index: linux-2.6.23-rc6/fs/jbd2/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.0 -0700 +++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:53:12.0 -0700 @@ -654,7 +654,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 19:28 +, Dave Kleikamp wrote: On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote: On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote: Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all cases except one handles memory allocation failure so I get rid of those GFP_NOFAIL flags. Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to cleanup that. No. GFP_NOFS avoids deadlock. It prevents the allocation from making recursive calls back into the file system that could end up blocking on jbd code. Oh, I see your patch now. You mean use GFP_NOFS instead of GFP_KERNEL. :-) OK then. oops, I did mean what you say here.:-) Shaggy - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD: use GFP_NOFS in kmalloc
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote: On Wed, 19 Sep 2007 12:22:09 -0700 Mingming Cao [EMAIL PROTECTED] wrote: Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |6 +++--- fs/jbd/revoke.c |8 fs/jbd2/journal.c |6 +++--- fs/jbd2/revoke.c |8 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ while((tmp = 1UL) != 0UL) shift++; - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[0]) return -ENOMEM; journal-j_revoke = journal-j_revoke_table[0]; @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ for (tmp = 0; tmp hash_size; tmp++) INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]); - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[1]) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); These were all OK using GFP_KERNEL. GFP_NOFS should only be used when the caller is holding some fs locks which might cause a deadlock if that caller reentered the fs in -writepage (and maybe put_inode and such). That isn't the case in any of the above code, which is all mount time stuff (I think). You are right they are all occur at initialization time. ext3/4 should be using GFP_NOFS when the caller has a transaction open, has a page locked, is holding i_mutex, etc. Thanks for your feedback. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote: On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote: On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy Okay, Points taken, Here is the updated patch to get rid of slab management and jbd_kmalloc from jbd totally. This patch is intend to replace the patch in mm tree, Andrew, could you pick up this one instead? Thanks, Mingming jbd/jbd2: JBD memory allocation cleanups From: Christoph Lameter [EMAIL PROTECTED] JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/commit.c |6 +-- fs/jbd/journal.c | 99 ++ fs/jbd/transaction.c | 12 +++--- fs/jbd2/commit.c |6 +-- fs/jbd2/journal.c | 99 ++ fs/jbd2/transaction.c | 18 - include/linux/jbd.h | 18 + include/linux/jbd2.h | 21 +- 8 files changed, 52 insertions(+), 227 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -654,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* -* Create a slab for this blocksize -*/ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen and commit buffers from these slabs. - * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
Re: [PATCH] JBD slab cleanups
On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? Mingming Tested on 2.6.23-rc6 with fsx runs fine. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/checkpoint.c |2 fs/jbd/commit.c |6 +- fs/jbd/journal.c | 107 - fs/jbd/transaction.c | 10 ++-- fs/jbd2/checkpoint.c |2 fs/jbd2/commit.c |6 +- fs/jbd2/journal.c | 109 -- fs/jbd2/transaction.c | 18 include/linux/jbd.h | 23 +- include/linux/jbd2.h | 28 ++-- 10 files changed, 83 insertions(+), 228 deletions(-) Index: linux-2.6.23-rc5/fs/jbd/journal.c === --- linux-2.6.23-rc5.orig/fs/jbd/journal.c2007-09-13 13:37:57.0 -0700 +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -679,7 +678,7 @@ static journal_t * journal_init_common ( /* Set up a default-sized revoke table for the new mount. */ err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH); if (err) { - kfree(journal); + jbd_kfree(journal); goto fail; } return journal; @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); journal = NULL; goto out; } @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i if (err) { printk(KERN_ERR %s: Cannnot locate journal superblock\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* - * Create a slab for this blocksize - */ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal) if (journal-j_revoke) journal_destroy_revoke(journal); kfree(journal-j_wbuf); - kfree(journal); + jbd_kfree(journal); } @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote: On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote: On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? Mingming Looks good. Small suggestion is to get rid of all kmalloc() usages and consistently use jbd_kmalloc() or jbd2_kmalloc(). Thanks, Badari Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |8 +--- fs/jbd/revoke.c | 12 ++-- fs/jbd2/journal.c |8 +--- fs/jbd2/revoke.c | 12 ++-- 4 files changed, 22 insertions(+), 18 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-17 14:32:16.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-17 14:33:59.0 -0700 @@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*), + GFP_KERNEL); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*), + GFP_KERNEL); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal) iput(journal-j_inode); if (journal-j_revoke) journal_destroy_revoke(journal); - kfree(journal-j_wbuf); + jbd_kfree(journal-j_wbuf); jbd_kfree(journal); } Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-17 14:32:22.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-17 14:35:13.0 -0700 @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); if (!journal-j_revoke_table[1]) { - kfree(journal-j_revoke_table[0]-hash_table); + jbd_kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); return -ENOMEM; } @@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); if (!journal-j_revoke-hash_table) { - kfree(journal-j_revoke_table[0]-hash_table); + jbd_kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); kmem_cache_free
Ext4 patch queue updated(2.6.23-rc6)
Changes: Rebase to 2.6.23-rc6 updated jbd_slab_cleanup.patch. (from Mingming) updated commit messages to different patches in the patch-queue (from Aneesh.) Added JBD-blocks-reservation-fix-for-large-blk.patch (part of 4k block support) (from Mingming) http://repo.or.cz/w/ext4-patch-queue.git Ted, could you please update the ext4 git tree with the patches before the unstable mark, so we could get more air time in mm tree for testing? Thanks, Mingming current patch series # Rebased the patches to 2.6.23-rc6 jbd_slab_cleanup.patch sparse-fix.patch # Add journal checksums ext4-journal_chksum-2.6.20.patch ext4-journal-chksum-review-fix.patch # Add unused inode watermark and checksum to blockgroup descriptors # Need e2fsprogs changes to mkfs to use this featur ext4_uninit_blockgroup.patch # Large blocksize support # ext234_enlarge_blocksize.patch ext2_rec_len_overflow_with_64kblk_fix.patch ext3_rec_len_overflow_with_64kblk_fix.patch ext4_rec_len_overflow_with_64kblk_fix.patch JBD-blocks-reservation-fix-for-large-blk.patch # patches still have outstanding comments from lkml # inode verion patch series # inode versioning is needed for NFSv4 # Need to address overwrite case and performance concerns # Need to repost to lkml for comments before submit # # vfs changes, 64 bit inode-i_version 64-bit-i_version.patch # reserve hi 32 bit inode version on ext4 on-disk inode i_version_hi.patch # ext4 inode version read/store ext4_i_version_hi_2.patch # ext4 inode version update i_version_update_ext4.patch # Export jbd stats through procfs # Shall this move to debugfs? # Still need to address akpm's comments. jbd-stats-through-procfs jbd-stats-through-procfs_fix.patch #Unstable patches # New delayed allocation patch delalloc-vfs.patch delalloc-ext4.patch ext-truncate-mutex.patch ext3-4-migrate.patch new-extent-function.patch ### #mballoc ### generic-find-next-le-bit mballoc-core.patch # #FLEX_BG flex_bg_kernel.patch - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new mballoc patches.
On Tue, 2007-09-11 at 13:59 +0530, Aneesh Kumar K.V wrote: I have updated the mballoc patches. The same can be found at http://www.radian.org/~kvaneesh/ext4/patch-series/ The series file is # This series applies on GIT commit b07d68b5ca4d55a16fab223d63d5fb36f89ff42f ext4-journal_chksum-2.6.20.patch ext4-journal-chksum-review-fix.patch ext4_uninit_blockgroup.patch 64-bit-i_version.patch i_version_hi.patch ext4_i_version_hi_2.patch i_version_update_ext4.patch jbd-stats-through-procfs jbd-stats-through-procfs_fix.patch delalloc-vfs.patch delalloc-ext4.patch ext-truncate-mutex.patch ext3-4-migrate.patch new-extent-function.patch generic-find-next-le-bit mballoc-core.patch ext234_enlarge_blocksize.patch ext2_rec_len_overflow_with_64kblk_fix.patch ext3_rec_len_overflow_with_64kblk_fix.patch ext4_rec_len_overflow_with_64kblk_fix.patch RFC-1-2-JBD-slab-management-support-for-large-b.patch RFC-2-2-JBD-blocks-reservation-fix-for-large-bl.patch Changes: a) deleted mballoc-fixup.patch; merged with mballoc-core.path b) deleted mballoc-mkdir-oops-fix.patch; merged with mballoc-core.patch c) merged Uninitialized group related changes for mballoc d) merged the checkpatch.pl error fix patch for mballoc e) merged the fix for ext4_ext_search_right for error EXT4-fs error (device sdc): ext4_ext_search_right: bad header in inode #3745797: unexpected eh_depth - magic f30a, entries 81, max 84(0), depth 0(1) f) PPC build fix by introducing generic_find_next_le_bit() g) Document the code. Rest of the places that needs more comments are marked with FIXME!! h) 48 bit block usage by converting bg_*_bitmap and bg_inode_table to respective ext4_*_bitmap and ext4_inode_table() function i) Added jbd fixes patches from Mingming for large block size with the comment from hch. (this is added towards the end of the series) k) Updated the ext234_enlarge_blocksize.patch ext2_rec_len_overflow_with_64kblk_fix.patch ext3_rec_len_overflow_with_64kblk_fix.patch ext4_rec_len_overflow_with_64kblk_fix.patch to make sure they can be applied with stgit. They didn't had a valid email id in the From: field. ( This is only commit message update ) Test status: Minor testing with KVM. I also didn't do a PPC build. Mingming/Avantika, Can we update the patch queue with the series ? Thanks for the update. I just back to home from London late last night, still catching up emails, so I haven't get a chance to update the ext4 patch queue yet. Will do so shortly. Cheers, Mingming -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] placate checkpatch.pl to some degree for mballoc.c
On Fri, 2007-08-31 at 16:28 -0500, Eric Sandeen wrote: Here's a patch for mballoc.c, to make checkpatch happier with it. I was about as pedantic as I could be, except for a few things it complained about which I just could not agree with. :) One of the comments in the series file in git says checkpatch doesn't like mballoc, so, here you go, if it's helpful. Very helpful, thanks! Applies to the bottom of the patch stack - should some of these mballoc patches get rolled together by now? Okay, will do. There are still a issue with mballoc, it doesn't handle big endian when generating free extents map from on-disk bitmaps. Thus cause failure compile on ppc64. ppc64 needs to implement generic_find_next_le_bit. Andreas pointed a patch already in cfs bug report, https://bugzilla.lustre.org/show_bug.cgi?id=10634, the patch for this is in the last attachment. -Eric Make checkpatch happier with mballoc.c Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: ext4.git/fs/ext4/mballoc.c === --- ext4.git.orig/fs/ext4/mballoc.c +++ ext4.git/fs/ext4/mballoc.c @@ -247,9 +247,9 @@ */ #define MB_DEBUG__ #ifdef MB_DEBUG -#define mb_debug(fmt,a...) printk(fmt, ##a) +#define mb_debug(fmt, a...) printk(fmt, ##a) #else -#define mb_debug(fmt,a...) +#define mb_debug(fmt, a...) #endif /* @@ -303,7 +303,7 @@ */ #define MB_DEFAULT_STRIPE256 -static struct kmem_cache *ext4_pspace_cachep = NULL; +static struct kmem_cache *ext4_pspace_cachep; #ifdef EXT4_BB_MAX_BLOCKS #undef EXT4_BB_MAX_BLOCKS @@ -353,7 +353,7 @@ struct ext4_prealloc_space { unsigned short pa_len; /* len of preallocated chunk */ unsigned short pa_free;/* how many blocks are free */ unsigned short pa_linear; /* consumed in one direction - * strictly, for group prealloc */ + * strictly, for grp prealloc */ spinlock_t *pa_obj_lock; struct inode*pa_inode; /* hack, for history only */ }; @@ -456,8 +456,8 @@ static void ext4_mb_store_history(struct static struct proc_dir_entry *proc_root_ext4; -int ext4_create (struct inode *, struct dentry *, int, struct nameidata *); -struct buffer_head * read_block_bitmap(struct super_block *, unsigned int); +int ext4_create(struct inode *, struct dentry *, int, struct nameidata *); +struct buffer_head *read_block_bitmap(struct super_block *, unsigned int); ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode, ext4_fsblk_t goal, unsigned long *count, int *errp); void ext4_mb_release_blocks(struct super_block *, int); @@ -465,11 +465,13 @@ void ext4_mb_poll_new_transaction(struct void ext4_mb_free_committed_blocks(struct super_block *); void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, int group); void ext4_mb_free_consumed_preallocations(struct ext4_allocation_context *ac); -void ext4_mb_return_to_preallocation(struct inode *inode, struct ext4_buddy *e3b, - sector_t block, int count); +void ext4_mb_return_to_preallocation(struct inode *inode, + struct ext4_buddy *e3b, sector_t block, + int count); void ext4_mb_show_ac(struct ext4_allocation_context *ac); void ext4_mb_check_with_pa(struct ext4_buddy *e3b, int first, int count); -void ext4_mb_put_pa(struct ext4_allocation_context *, struct super_block *, struct ext4_prealloc_space *pa); +void ext4_mb_put_pa(struct ext4_allocation_context *, struct super_block *, + struct ext4_prealloc_space *pa); int ext4_mb_init_per_dev_proc(struct super_block *sb); int ext4_mb_destroy_per_dev_proc(struct super_block *sb); @@ -507,13 +509,13 @@ unsigned long ext4_grp_offs_to_block(str } #if BITS_PER_LONG == 64 -#define mb_correct_addr_and_bit(bit,addr)\ +#define mb_correct_addr_and_bit(bit, addr) \ {\ bit += ((unsigned long) addr 7UL) 3; \ addr = (void *) ((unsigned long) addr ~7UL); \ } #elif BITS_PER_LONG == 32 -#define mb_correct_addr_and_bit(bit,addr)\ +#define mb_correct_addr_and_bit(bit, addr) \ {\ bit += ((unsigned long) addr 3UL) 3; \ addr = (void *) ((unsigned long) addr ~3UL); \ @@ -524,31 +526,31 @@ unsigned long ext4_grp_offs_to_block(str static inline int mb_test_bit(int bit, void *addr) { - mb_correct_addr_and_bit(bit,addr); + mb_correct_addr_and_bit(bit, addr); return ext2_test_bit(bit, addr); } static inline void
Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote: Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. Next steps: Need a e2fsprogs changes to able test this feature. As mkfs needs to be educated not assuming rec_len to be blocksize all the time. Will try it with Christoph Lameter's large block patch next. Two problems were found when testing largeblock on ext3. Patches to follow. Good news is, with your changes, plus all these extN changes, I am able to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k page size. fsx test runs fine for an hour on ext3 with 16k blocksize on x86:-) Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/2] JBD: slab management support for large block(8k)
From clameter: Teach jbd/jbd2 slab management to support 8k block size. Without this, it refused to mount on 8k ext3. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: my2.6/fs/jbd/journal.c === --- my2.6.orig/fs/jbd/journal.c 2007-08-30 18:40:02.0 -0700 +++ my2.6/fs/jbd/journal.c 2007-08-31 11:01:18.0 -0700 @@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where, * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed * and allocate frozen and commit buffers from these slabs. * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could - * cause bh to cross page boundary. + * (Note: We only seem to need the definitions here for the SLAB_DEBUG + * case. In non debug operations SLUB will find the corresponding kmalloc + * cache and create an alias. --clameter) */ - -#define JBD_MAX_SLABS 5 -#define JBD_SLAB_INDEX(size) (size 11) +#define JBD_MAX_SLABS 7 +#define JBD_SLAB_INDEX(size) get_order((size) (PAGE_SHIFT - 10)) static struct kmem_cache *jbd_slab[JBD_MAX_SLABS]; static const char *jbd_slab_names[JBD_MAX_SLABS] = { - jbd_1k, jbd_2k, jbd_4k, NULL, jbd_8k + jbd_1k, jbd_2k, jbd_4k, jbd_8k, + jbd_16k, jbd_32k, jbd_64k }; static void journal_destroy_jbd_slabs(void) Index: my2.6/fs/jbd2/journal.c === --- my2.6.orig/fs/jbd2/journal.c2007-08-30 18:40:02.0 -0700 +++ my2.6/fs/jbd2/journal.c 2007-08-31 11:04:37.0 -0700 @@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed * and allocate frozen and commit buffers from these slabs. * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could - * cause bh to cross page boundary. + * (Note: We only seem to need the definitions here for the SLAB_DEBUG + * case. In non debug operations SLUB will find the corresponding kmalloc + * cache and create an alias. --clameter) */ -#define JBD_MAX_SLABS 5 -#define JBD_SLAB_INDEX(size) (size 11) +#define JBD_MAX_SLABS 7 +#define JBD_SLAB_INDEX(size) get_order((size) (PAGE_SHIFT - 10)) static struct kmem_cache *jbd_slab[JBD_MAX_SLABS]; static const char *jbd_slab_names[JBD_MAX_SLABS] = { - jbd2_1k, jbd2_2k, jbd2_4k, NULL, jbd2_8k + jbd2_1k, jbd2_2k, jbd2_4k, jbd2_8k, +jbd2_16k, jbd2_32k, jbd2_64k }; static void jbd2_journal_destroy_jbd_slabs(void) - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 2/2] JBD: blocks reservation fix for large block support
The blocks per page could be less or quals to 1 with the large block support in VM. The patch fixed the way to calculate the number of blocks to reserve in journal in the case blocksize pagesize. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: my2.6/fs/jbd/journal.c === --- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.0 -0700 +++ my2.6/fs/jbd/journal.c 2007-08-31 13:28:18.0 -0700 @@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal) int journal_blocks_per_page(struct inode *inode) { - return 1 (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits); + int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits; + + if (bits 0) + return 1 bits; + else + return 1; } /* Index: my2.6/fs/jbd2/journal.c === --- my2.6.orig/fs/jbd2/journal.c2007-08-31 13:32:21.0 -0700 +++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.0 -0700 @@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou int jbd2_journal_blocks_per_page(struct inode *inode) { - return 1 (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits); + int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits; + + if (bits 0) + return 1 bits; + else + return 1; } /* - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 2007-08-01 at 12:34 +0530, Girish Shilamkar wrote: On Wed, 2007-07-11 at 17:16 +0530, Girish Shilamkar wrote: I will make the changes and send an incremental patch. Hi, I have made the changes and attached the incremental patch as per the review. This is the actual changelog which was missing in the original patch. -- The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. - Hit kernel oops when run fsstress test on ext4-patch-queue with -o journal_checksum. BUG: unable to handle kernel NULL pointer dereference at virtual address printing eip: c118ba5d *pdpt = 2560a001 *pde = Oops: [#1] SMP Modules linked in: CPU:1 EIP:0060:[c118ba5d]Not tainted VLI EFLAGS: 00010257 (2.6.23-rc4-autokern1 #1) EIP is at crc32_be+0x3d/0x9c eax: 7e78a276 ebx: 76a2787e ecx: 0400 edx: esi: edi: ebp: f56e5200 esp: e61f9e90 ds: 007b es: 007b fs: 00d8 gs: ss: 0068 Process kjournald2 (pid: 5388, ti=e61f8000 task=e3efc000 task.ti=e61f8000) Stack: ef5fffc0 0016 c10c762d 055e 1000 f50e3e80 7e78a276 0008 0544 eb46aab4 eb46aabc 0155 f585f800 e1e1c968 eb059428 055e e3efc000 Call Trace: [c10c762d] jbd2_journal_commit_transaction+0x92a/0x128d [c1029e51] autoremove_wake_function+0x0/0x33 [c1029e51] autoremove_wake_function+0x0/0x33 [c10214f4] try_to_del_timer_sync+0x42/0x48 [c10ca4fd] kjournald2+0x130/0x307 [c1029e51] autoremove_wake_function+0x0/0x33 [c129171c] __sched_text_start+0x364/0x3ff [c1029e51] autoremove_wake_function+0x0/0x33 [c10ca3cd] kjournald2+0x0/0x307 [c1029a27] kthread+0x34/0x55 [c10299f3] kthread+0x0/0x55 [c1003173] kernel_thread_helper+0x7/0x10 === Code: 42 30 d8 0f b6 c0 c1 eb 08 33 1c 85 e0 ae 2a c1 49 74 05 f6 c2 03 75 e5 83 f9 03 76 4c 89 ce 83 ea 04 83 e6 03 c1 e9 02 83 c2 04 33 1a 0f b6 c3 c1 eb 08 33 1c 85 e0 ae 2a c1 0f b6 c3 c1 eb 08 EIP: [c118ba5d] crc32_be+0x3d/0x9c SS:ESP 0068:e61f9e90 - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8][e2fsprogs] add new -D desc-size option to mkfs
On Thu, 2007-08-30 at 17:37 +0200, Valerie Clement wrote: From: Valerie Clement [EMAIL PROTECTED] This patch adds a new option -D descriptor-size to mkfs to allow creating ext4 filesystems with larger group descriptor size. By default, the group descriptor size is 32 bytes. To support 64-bit physical block numbers and thus to be able to use filesystems 16TB, the group descriptor size must be increased to 64 bytes. Could you explain a bit more why we need this option? It seems to me we could make the block group descriptor size to 64bytes whenever creating a ext4 filesystem by default. Is there any case we need to create ext4 with old 32 bytes block group descriptors? Use mkfs -t ext4 -D 64 to create an ext4 filesystem with 64-byte group descriptor size. This set the EXT4_FEATURE_INCOMPAT_64BIT flag on device. The descriptor size is limited to a power-of-two value and to the block size. Signed-off-by: Valerie Clement [EMAIL PROTECTED] --- lib/ext2fs/ext2_fs.h|7 +++ lib/ext2fs/initialize.c |1 + misc/mke2fs.c | 33 +++-- 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..f6e8b36 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -217,6 +217,13 @@ struct ext2_dx_countlimit { /* * Macro-instructions used to manage group descriptors */ +#define EXT2_MIN_DESC_SIZE 32 +#define EXT2_MIN_DESC_SIZE_64BIT 64 +#define EXT2_MAX_DESC_SIZE EXT2_MIN_BLOCK_SIZE +#define EXT2_DESC_SIZE(s)\ + ((EXT2_SB(s)-s_feature_incompat EXT4_FEATURE_INCOMPAT_64BIT) ? \ + (s)-s_desc_size : EXT2_MIN_DESC_SIZE) + #define EXT2_BLOCKS_PER_GROUP(s) (EXT2_SB(s)-s_blocks_per_group) #define EXT2_INODES_PER_GROUP(s) (EXT2_SB(s)-s_inodes_per_group) #define EXT2_INODES_PER_BLOCK(s) (EXT2_BLOCK_SIZE(s)/EXT2_INODE_SIZE(s)) diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c index 16e9eaa..ee7af93 100644 --- a/lib/ext2fs/initialize.c +++ b/lib/ext2fs/initialize.c @@ -149,6 +149,7 @@ errcode_t ext2fs_initialize(const char *name, int flags, set_field(s_log_block_size, 0); /* default blocksize: 1024 bytes */ set_field(s_log_frag_size, 0); /* default fragsize: 1024 bytes */ + set_field(s_desc_size, 0); set_field(s_first_data_block, super-s_log_block_size ? 0 : 1); set_field(s_max_mnt_count, EXT2_DFL_MAX_MNT_COUNT); set_field(s_errors, EXT2_ERRORS_DEFAULT); diff --git a/misc/mke2fs.c b/misc/mke2fs.c index 4a6cace..115eef9 100644 --- a/misc/mke2fs.c +++ b/misc/mke2fs.c @@ -96,7 +96,7 @@ static void usage(void) { fprintf(stderr, _(Usage: %s [-c|-t|-l filename] [-b block-size] [-f fragment-size]\n\t[-i bytes-per-inode] [-I inode-size] - [-j] [-J journal-options]\n + [-D descriptor-size] [-j] [-J journal-options]\n \t[-N number-of-inodes] [-m reserved-blocks-percentage] [-o creator-os]\n\t[-g blocks-per-group] [-L volume-label] [-M last-mounted-directory]\n\t[-O feature[,...]] @@ -911,6 +911,7 @@ static void PRS(int argc, char *argv[]) int blocksize = 0; int inode_ratio = 0; int inode_size = 0; + int desc_size = 0; double reserved_ratio = 5.0; int sector_size = 0; int show_version_only = 0; @@ -993,7 +994,7 @@ static void PRS(int argc, char *argv[]) } while ((c = getopt (argc, argv, - b:cf:g:i:jl:m:no:qr:s:tvE:FI:J:L:M:N:O:R:ST:V)) != EOF) { + b:cf:g:i:jl:m:no:qr:s:tvD:E:FI:J:L:M:N:O:R:ST:V)) != EOF) { switch (c) { case 'b': blocksize = strtol(optarg, tmp, 0); @@ -1110,6 +,14 @@ static void PRS(int argc, char *argv[]) exit(1); } break; + case 'D': + desc_size = strtoul(optarg, tmp, 0); + if (*tmp) { + com_err(program_name, 0, + _(invalid descriptor size - %s), optarg); + exit(1); + } + break; case 'v': verbose = 1; break; @@ -1439,6 +1448,26 @@ static void PRS(int argc, char *argv[]) } } + if (desc_size) { + if (desc_size EXT2_MIN_DESC_SIZE || + desc_size EXT2_BLOCK_SIZE(fs_param) || + desc_size (desc_size -1)) { + com_err(program_name, 0, + _(invalid descriptor size %d (min %d/max %d)), + desc_size, EXT2_MIN_DESC_SIZE, +
Re: [PATCH V2] fix mballoc oopses on mkdir
On Tue, 2007-08-28 at 00:25 -0500, Eric Sandeen wrote: Ugh. brain clearly not fully engaged... here we go: -- Tried out Ted's git tree + all pending patches today, and immediately oopsed on a mkdir thanks to this in ext4_mb_new_group_pa() BUG_ON(!S_ISREG(ac-ac_inode-i_mode)); (there are 54 BUGs and BUG_ONs in this file...!) I think something like this patch is needed? Thanks for catching this and the fix. I saw the same kernel oops today when rum mballoc on fsstress (mkdir). I just added this fix to the patch queue with some comments. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: ext4.git/fs/ext4/extents.c === --- ext4.git.orig/fs/ext4/extents.c +++ ext4.git/fs/ext4/extents.c @@ -2535,7 +2535,10 @@ int ext4_ext_get_blocks(handle_t *handle ar.goal = ext4_ext_find_goal(inode, path, iblock); ar.logical = iblock; ar.len = allocated; - ar.flags = EXT4_MB_HINT_DATA; + if (S_ISREG(inode-i_mode)) + ar.flags = EXT4_MB_HINT_DATA; + else + ar.flags = 0; I see. We set the flags to zero here to avoid doing in-core preallocation for directory files later. Well, I am not sure whether it's worth the effort to support in-core preallocation for directory files. But I don't see hard reason we cannot do this. My understanding is: this is the new in-core preallocation which does a should not depend on extents or any on-disk format changes. Alex, any comments? Mingming newblock = ext4_mb_new_blocks(handle, ar, err); if (!newblock) goto out2; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html