Re: [patch 12/26] mount options: fix ext4
On Thu, 2008-01-24 at 20:33 +0100, Miklos Szeredi wrote: plain text document attachment (ext4_opts.patch) From: Miklos Szeredi [EMAIL PROTECTED] Add stripe= option to /proc/mounts for ext4 filesystems. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/ext4/super.c === --- linux.orig/fs/ext4/super.c2008-01-23 12:57:07.0 +0100 +++ linux/fs/ext4/super.c 2008-01-23 21:43:51.0 +0100 @@ -742,7 +742,8 @@ static int ext4_show_options(struct seq_ seq_puts(seq, ,nomballoc); if (!test_opt(sb, DELALLOC)) seq_puts(seq, ,nodelalloc); - + if (sbi-s_stripe) + seq_printf(seq, ,stripe=%lu, sbi-s_stripe); /* * journal mode get enabled in different ways Added to ext4 patch queue. Thanks! http://repo.or.cz/w/ext4-patch-queue.git Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Fri, 2007-11-30 at 17:08 -0600, Eric Sandeen wrote: Mingming Cao wrote: [PATCH] jbd2 stats through procfs The patch below updates the jbd stats patch to 2.6.20/jbd2. The initial patch was posted by Alex Tomas in December 2005 (http://marc.info/?l=linux-ext4m=113538565128617w=2). It provides statistics via procfs such as transaction lifetime and size. [ This probably should be rewritten to use debugfs? -- Ted] Signed-off-by: Johann Lombardi [EMAIL PROTECTED] I've started going through this one to clean it up to the point where it can go forward. It's been sitting at the top of the unstable portion of the patch queue for long enough, I think :) Thanks! I've converted the msecs to jiffies until the user boundary, changed the union #defines as suggested by Andrew, and various other little issues etc. Remaining to do is a generic time-difference calculator (instead of jbd2_time_diff), and looking into whether it should be made a config option; I tend to think it should, but it's fairly well sprinkled through the code, so I'll see how well that works. Also did we ever decided if this should go to debugfs? I thought it was decided to keep it on procfs as debugfs is not always on... Thanks, -Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/4] 64k pagesize/blocksize fixes
On Tue, 2007-09-25 at 16:30 -0700, Christoph Lameter wrote: Attached the fixes necessary to support 64k pagesize/blocksize. I think these are useful independent of the large blocksize patchset since there are architectures that support 64k page size and that could use these large buffer sizes without the large buffersize patchset. Are these patches in the right shape to be merged? I rediffed these against 2.6.32-rc8-mm1. I had to fix some things in the second patch (ext2) that may need some review since the way that commits work changed. Thanks, As Andreas mentioned in another email there is a better way to deal with rec_len issue on 64k block size. I will get those patches merged in ext4 patch queue. I think the updated patches could get merged. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] JBD: use GFP_NOFS in kmalloc
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |6 +++--- fs/jbd/revoke.c |8 fs/jbd2/journal.c |6 +++--- fs/jbd2/revoke.c |8 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ while((tmp = 1UL) != 0UL) shift++; - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[0]) return -ENOMEM; journal-j_revoke = journal-j_revoke_table[0]; @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ for (tmp = 0; tmp hash_size; tmp++) INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]); - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[1]) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); Index: linux-2.6.23-rc6/fs/jbd2/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.0 -0700 +++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:53:12.0 -0700 @@ -654,7 +654,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 19:28 +, Dave Kleikamp wrote: On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote: On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote: Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all cases except one handles memory allocation failure so I get rid of those GFP_NOFAIL flags. Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to cleanup that. No. GFP_NOFS avoids deadlock. It prevents the allocation from making recursive calls back into the file system that could end up blocking on jbd code. Oh, I see your patch now. You mean use GFP_NOFS instead of GFP_KERNEL. :-) OK then. oops, I did mean what you say here.:-) Shaggy - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD: use GFP_NOFS in kmalloc
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote: On Wed, 19 Sep 2007 12:22:09 -0700 Mingming Cao [EMAIL PROTECTED] wrote: Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |6 +++--- fs/jbd/revoke.c |8 fs/jbd2/journal.c |6 +++--- fs/jbd2/revoke.c |8 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ while((tmp = 1UL) != 0UL) shift++; - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[0]) return -ENOMEM; journal-j_revoke = journal-j_revoke_table[0]; @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ for (tmp = 0; tmp hash_size; tmp++) INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]); - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[1]) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); These were all OK using GFP_KERNEL. GFP_NOFS should only be used when the caller is holding some fs locks which might cause a deadlock if that caller reentered the fs in -writepage (and maybe put_inode and such). That isn't the case in any of the above code, which is all mount time stuff (I think). You are right they are all occur at initialization time. ext3/4 should be using GFP_NOFS when the caller has a transaction open, has a page locked, is holding i_mutex, etc. Thanks for your feedback. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote: On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote: On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy Okay, Points taken, Here is the updated patch to get rid of slab management and jbd_kmalloc from jbd totally. This patch is intend to replace the patch in mm tree, Andrew, could you pick up this one instead? Thanks, Mingming jbd/jbd2: JBD memory allocation cleanups From: Christoph Lameter [EMAIL PROTECTED] JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/commit.c |6 +-- fs/jbd/journal.c | 99 ++ fs/jbd/transaction.c | 12 +++--- fs/jbd2/commit.c |6 +-- fs/jbd2/journal.c | 99 ++ fs/jbd2/transaction.c | 18 - include/linux/jbd.h | 18 + include/linux/jbd2.h | 21 +- 8 files changed, 52 insertions(+), 227 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -654,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* -* Create a slab for this blocksize -*/ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen and commit buffers from these slabs. - * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
Re: [PATCH] JBD slab cleanups
On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? Mingming Tested on 2.6.23-rc6 with fsx runs fine. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/checkpoint.c |2 fs/jbd/commit.c |6 +- fs/jbd/journal.c | 107 - fs/jbd/transaction.c | 10 ++-- fs/jbd2/checkpoint.c |2 fs/jbd2/commit.c |6 +- fs/jbd2/journal.c | 109 -- fs/jbd2/transaction.c | 18 include/linux/jbd.h | 23 +- include/linux/jbd2.h | 28 ++-- 10 files changed, 83 insertions(+), 228 deletions(-) Index: linux-2.6.23-rc5/fs/jbd/journal.c === --- linux-2.6.23-rc5.orig/fs/jbd/journal.c2007-09-13 13:37:57.0 -0700 +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -679,7 +678,7 @@ static journal_t * journal_init_common ( /* Set up a default-sized revoke table for the new mount. */ err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH); if (err) { - kfree(journal); + jbd_kfree(journal); goto fail; } return journal; @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); journal = NULL; goto out; } @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i if (err) { printk(KERN_ERR %s: Cannnot locate journal superblock\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* - * Create a slab for this blocksize - */ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal) if (journal-j_revoke) journal_destroy_revoke(journal); kfree(journal-j_wbuf); - kfree(journal); + jbd_kfree(journal); } @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote: On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote: On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? Mingming Looks good. Small suggestion is to get rid of all kmalloc() usages and consistently use jbd_kmalloc() or jbd2_kmalloc(). Thanks, Badari Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |8 +--- fs/jbd/revoke.c | 12 ++-- fs/jbd2/journal.c |8 +--- fs/jbd2/revoke.c | 12 ++-- 4 files changed, 22 insertions(+), 18 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-17 14:32:16.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-17 14:33:59.0 -0700 @@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*), + GFP_KERNEL); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*), + GFP_KERNEL); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal) iput(journal-j_inode); if (journal-j_revoke) journal_destroy_revoke(journal); - kfree(journal-j_wbuf); + jbd_kfree(journal-j_wbuf); jbd_kfree(journal); } Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-17 14:32:22.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-17 14:35:13.0 -0700 @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); if (!journal-j_revoke_table[1]) { - kfree(journal-j_revoke_table[0]-hash_table); + jbd_kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); return -ENOMEM; } @@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); if (!journal-j_revoke-hash_table) { - kfree(journal-j_revoke_table[0]-hash_table); + jbd_kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); kmem_cache_free
[PATCH] JBD slab cleanups
jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Tested on 2.6.23-rc6 with fsx runs fine. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/checkpoint.c |2 fs/jbd/commit.c |6 +- fs/jbd/journal.c | 107 - fs/jbd/transaction.c | 10 ++-- fs/jbd2/checkpoint.c |2 fs/jbd2/commit.c |6 +- fs/jbd2/journal.c | 109 -- fs/jbd2/transaction.c | 18 include/linux/jbd.h | 23 +- include/linux/jbd2.h | 28 ++-- 10 files changed, 83 insertions(+), 228 deletions(-) Index: linux-2.6.23-rc5/fs/jbd/journal.c === --- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.0 -0700 +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -679,7 +678,7 @@ static journal_t * journal_init_common ( /* Set up a default-sized revoke table for the new mount. */ err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH); if (err) { - kfree(journal); + jbd_kfree(journal); goto fail; } return journal; @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); journal = NULL; goto out; } @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i if (err) { printk(KERN_ERR %s: Cannnot locate journal superblock\n, __FUNCTION__); - kfree(journal); + jbd_kfree(journal); return NULL; } @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* -* Create a slab for this blocksize -*/ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal) if (journal-j_revoke) journal_destroy_revoke(journal); kfree(journal-j_wbuf); - kfree(journal); + jbd_kfree(journal); } @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen and commit buffers from these slabs. - * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could - * cause bh to cross page boundary. - */ - -#define JBD_MAX_SLABS 5 -#define JBD_SLAB_INDEX(size) (size 11) - -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS]; -static const char *jbd_slab_names[JBD_MAX_SLABS] = { - jbd_1k, jbd_2k, jbd_4k, NULL, jbd_8k -}; - -static void journal_destroy_jbd_slabs(void) -{ - int i; - - for (i = 0; i JBD_MAX_SLABS; i++) { - if (jbd_slab[i]) - kmem_cache_destroy(jbd_slab[i]); - jbd_slab[i] = NULL
Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote: Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. Next steps: Need a e2fsprogs changes to able test this feature. As mkfs needs to be educated not assuming rec_len to be blocksize all the time. Will try it with Christoph Lameter's large block patch next. Two problems were found when testing largeblock on ext3. Patches to follow. Good news is, with your changes, plus all these extN changes, I am able to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k page size. fsx test runs fine for an hour on ext3 with 16k blocksize on x86:-) Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 2/2] JBD: blocks reservation fix for large block support
The blocks per page could be less or quals to 1 with the large block support in VM. The patch fixed the way to calculate the number of blocks to reserve in journal in the case blocksize pagesize. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: my2.6/fs/jbd/journal.c === --- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.0 -0700 +++ my2.6/fs/jbd/journal.c 2007-08-31 13:28:18.0 -0700 @@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal) int journal_blocks_per_page(struct inode *inode) { - return 1 (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits); + int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits; + + if (bits 0) + return 1 bits; + else + return 1; } /* Index: my2.6/fs/jbd2/journal.c === --- my2.6.orig/fs/jbd2/journal.c2007-08-31 13:32:21.0 -0700 +++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.0 -0700 @@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou int jbd2_journal_blocks_per_page(struct inode *inode) { - return 1 (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits); + int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits; + + if (bits 0) + return 1 bits; + else + return 1; } /* - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 2007-08-01 at 12:34 +0530, Girish Shilamkar wrote: On Wed, 2007-07-11 at 17:16 +0530, Girish Shilamkar wrote: I will make the changes and send an incremental patch. Hi, I have made the changes and attached the incremental patch as per the review. Thanks, I merged your changes to ext4-patch-queue. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote: On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote: Sigh, we HAVE a patch that was only adding delalloc to ext4, but it was rejected because that functionality should go into the VFS. Since the performance improvement of delalloc is quite large, we'd like to get this into the kernel one way or another. Can we make a decision if the ext4-specific delalloc is acceptable? I'm a big proponent of having proper common delalloc code, but the one proposed here is not generic for the existing filesystem using delalloc. To be fair, what Alex have so far is probably good enough for ext2/3 delayed allocation. It's still on my todo list to revamp the xfs code to get rid of some of the existing mess and make it useable genericly. If the ext4 users are fine with the end result we could move to generic code. Are you okay with having a ext4 delayed allocation implementation (i.e. moving the code proposed in this thread to fs/ext4) first? Then later when you come up with a generic delayed allocation for both ext4 and xfs we could make use of that generic implementation. Is that a acceptable approach? Andrew, what do you think? Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] ext4_ext_put_in_cache uses __u32 to receive physical block number.
On Fri, 2007-07-27 at 13:16 +0800, Yan Zheng wrote: Hi, all I think I found a bug in ext4/extents.c, ext4_ext_put_in_cache uses __u32 to receive physical block number. ext4_ext_put_in_cache is used in ext4_ext_get_blocks, it sets ext4 inode's extent cache according most recently tree lookup (higher 16 bits of saved physical block number are always zero). when serving a mapping request, ext4_ext_get_blocks first check whether the logical block is in inode's extent cache. if the logical block is in the cache and the cached region isn't a gap, ext4_ext_get_blocks gets physical block number by using cached region's physical block number and offset in the cached region. as described above, ext4_ext_get_blocks may return wrong result when there are physical block numbers bigger than 0x. Regards YZ You are right. Thanks for reporting this! Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22/fs/ext4/extents.c === --- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-27 08:31:02.0 -0700 +++ linux-2.6.22/fs/ext4/extents.c 2007-07-27 08:31:48.0 -0700 @@ -1544,7 +1544,7 @@ int ext4_ext_walk_space(struct inode *in static void ext4_ext_put_in_cache(struct inode *inode, __u32 block, - __u32 len, __u32 start, int type) + __u32 len, ext4_fsblk_t start, int type) { struct ext4_ext_cache *cex; BUG_ON(len == 0); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new ext4 build warnings
On Wed, 2007-07-18 at 17:37 -0400, Jeff Garzik wrote: It seems jbd_debug() might need modification: fs/ext4/inode.c: In function ‘ext4_write_inode’: fs/ext4/inode.c:2906: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’: fs/jbd2/recovery.c:254: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c:257: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’: fs/jbd2/recovery.c:301: warning: comparison is always true due to limited range of data type I'm surprised this was not noticed in a test build before pushing upstream. Hmm, I am not sure what happened. I get the compile warning on linus latest git tree, but could not get the same compile warning on Ted's ext4 git tree. In both build CONFIG_JBD2_DEBUG and DEBUG_FS is enabled. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fix ext4/JBD2 build warnings
On Wed, 2007-07-18 at 17:37 -0400, Jeff Garzik wrote: It seems jbd_debug() might need modification: Looking at the current linus-git tree jbd_debug() define in include/linux/jbd2.h extern u8 journal_enable_debug; #define jbd_debug(n, f, a...) \ do {\ if ((n) = journal_enable_debug) { \ printk (KERN_DEBUG (%s, %d): %s: ,\ __FILE__, __LINE__, __FUNCTION__); \ printk (f, ## a); \ } \ } while (0) fs/ext4/inode.c: In function ‘ext4_write_inode’: fs/ext4/inode.c:2906: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’: fs/jbd2/recovery.c:254: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c:257: warning: comparison is always true due to limited range of data type fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’: fs/jbd2/recovery.c:301: warning: comparison is always true due to limited range of data type Noticed all warnings are occurs when the debug level is 0. Then found the jbd2: Move jbd2-debug file to debugfs patch http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b changed the jbd2_journal_enable_debug from int type to u8, makes the jbd_debug comparision is always true when the debugging level is 0. Thus the compile warning occurs. Thought about changing the jbd2_journal_enable_debug data type back to int, but can't, because the jbd2-debug is moved to debug fs, where calling debugfs_create_u8() to create the debugfs entry needs the value to be u8 type. Even if we changed the data type back to int, the code is still buggy, kernel should not print jbd2 debug message if the jbd2_journal_enable_debug is set to 0. But this is not the case. The fix is change the level of debugging to 1. The same should fixed in ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we probably should fix it all together. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/inode.c|2 +- fs/jbd2/recovery.c |6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6.22/fs/ext4/inode.c === --- linux-2.6.22.orig/fs/ext4/inode.c 2007-07-18 19:23:58.0 -0700 +++ linux-2.6.22/fs/ext4/inode.c2007-07-18 19:25:08.0 -0700 @@ -2903,7 +2903,7 @@ int ext4_write_inode(struct inode *inode return 0; if (ext4_journal_current_handle()) { - jbd_debug(0, called recursively, non-PF_MEMALLOC!\n); + jbd_debug(1, called recursively, non-PF_MEMALLOC!\n); dump_stack(); return -EIO; } Index: linux-2.6.22/fs/jbd2/recovery.c === --- linux-2.6.22.orig/fs/jbd2/recovery.c2007-07-18 19:25:51.0 -0700 +++ linux-2.6.22/fs/jbd2/recovery.c 2007-07-18 19:26:13.0 -0700 @@ -251,10 +251,10 @@ int jbd2_journal_recover(journal_t *jour if (!err) err = do_one_pass(journal, info, PASS_REPLAY); - jbd_debug(0, JBD: recovery, exit status %d, + jbd_debug(1, JBD: recovery, exit status %d, recovered transactions %u to %u\n, err, info.start_transaction, info.end_transaction); - jbd_debug(0, JBD: Replayed %d and revoked %d/%d blocks\n, + jbd_debug(1, JBD: Replayed %d and revoked %d/%d blocks\n, info.nr_replays, info.nr_revoke_hits, info.nr_revokes); /* Restart the log at the next transaction ID, thus invalidating @@ -298,7 +298,7 @@ int jbd2_journal_skip_recovery(journal_t #ifdef CONFIG_JBD2_DEBUG int dropped = info.end_transaction - be32_to_cpu(sb-s_sequence); #endif - jbd_debug(0, + jbd_debug(1, JBD: ignoring %d transaction%s from the journal.\n, dropped, (dropped == 1) ? : s); journal-j_transaction_sequence = ++info.end_transaction; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Tue, 2007-07-17 at 15:29 +0530, Kalpak Shah wrote: On Mon, 2007-07-16 at 17:49 -0700, Mingming Cao wrote: On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: +static inline __le32 ext4_encode_extra_time(struct timespec *time) +{ + return cpu_to_le32((sizeof(time-tv_sec) 4 ? + time-tv_sec 32 : 0) | + ((time-tv_nsec 2) EXT4_NSEC_MASK)); +} + +static inline void ext4_decode_extra_time(struct timespec *time, __le32 extra) +{ + if (sizeof(time-tv_sec) 4) + time-tv_sec |= (__u64)(le32_to_cpu(extra) EXT4_EPOCH_MASK) + 32; + time-tv_nsec = (le32_to_cpu(extra) EXT4_NSEC_MASK) 2; +} Consider uninlining these functions. I got compile warining after apply Kalpal's update nanosecond patch, which makes these two function inline. It complains these functions are defined but not used. It's being used only in the following micros(EXT4_INODE_SET_XTIME etc). So if the .c file included the ext4_fs.h but not using the micros, the compile will think these two functions are not used. The compile warnings were introduced because the functions were uninlined. So we can either keep these functions inlined or consider adding a __used attribute to these two functions. okay for now I keep these functions inlined. Thanks, Kalpak. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
On Tue, 2007-07-10 at 23:18 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes Date: Sun, 01 Jul 2007 03:38:51 -0400 Sender: [EMAIL PROTECTED] Organization: IBM Linux Technology Center X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent compilation fixes I hope this patch series will hit git with nice titles like ext4: extent compilation fixes and not funny ones like Morecleanups:ext4_extent_compilation_fixes. Sure, the changelog keep the nice tile Subject: ext4: extent compilation fixes Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions. conversions. Done. Thanks, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] ext4: JBD-JBD2 naming cleanups
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING JBD2? Some cleanups in ext4/JBD2 to follow the naming fules:change micros name from JBD_XXX to JBD2_XXX. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/extents.c |2 +- fs/ext4/super.c |2 +- fs/jbd2/commit.c |2 +- fs/jbd2/journal.c | 26 +- fs/jbd2/recovery.c|2 +- fs/jbd2/revoke.c |4 ++-- include/linux/ext4_jbd2.h |6 +++--- include/linux/jbd2.h | 30 +++--- 8 files changed, 37 insertions(+), 37 deletions(-) Index: linux-2.6.22/fs/ext4/super.c === --- linux-2.6.22.orig/fs/ext4/super.c 2007-07-13 17:17:29.0 -0700 +++ linux-2.6.22/fs/ext4/super.c2007-07-13 17:17:33.0 -0700 @@ -967,7 +967,7 @@ static int parse_options (char *options, if (option 0) return 0; if (option == 0) - option = JBD_DEFAULT_MAX_COMMIT_AGE; + option = JBD2_DEFAULT_MAX_COMMIT_AGE; sbi-s_commit_interval = HZ * option; break; case Opt_data_journal: Index: linux-2.6.22/include/linux/ext4_jbd2.h === --- linux-2.6.22.orig/include/linux/ext4_jbd2.h 2007-07-13 17:02:08.0 -0700 +++ linux-2.6.22/include/linux/ext4_jbd2.h 2007-07-13 17:39:41.0 -0700 @@ -12,8 +12,8 @@ * Ext4-specific journaling extensions. */ -#ifndef _LINUX_EXT4_JBD_H -#define _LINUX_EXT4_JBD_H +#ifndef _LINUX_EXT4_JBD2_H +#define _LINUX_EXT4_JBD2_H #include linux/fs.h #include linux/jbd2.h @@ -228,4 +228,4 @@ static inline int ext4_should_writeback_ return 0; } -#endif /* _LINUX_EXT4_JBD_H */ +#endif /* _LINUX_EXT4_JBD2_H */ Index: linux-2.6.22/include/linux/jbd2.h === --- linux-2.6.22.orig/include/linux/jbd2.h 2007-07-13 17:17:28.0 -0700 +++ linux-2.6.22/include/linux/jbd2.h 2007-07-13 17:31:33.0 -0700 @@ -13,8 +13,8 @@ * filesystem journaling support. */ -#ifndef _LINUX_JBD_H -#define _LINUX_JBD_H +#ifndef _LINUX_JBD2_H +#define _LINUX_JBD2_H /* Allow this file to be included directly into e2fsprogs */ #ifndef __KERNEL__ @@ -37,26 +37,26 @@ #define journal_oom_retry 1 /* - * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds + * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds * certain classes of error which can occur due to failed IOs. Under - * normal use we want ext3 to continue after such errors, because + * normal use we want ext4 to continue after such errors, because * hardware _can_ fail, but for debugging purposes when running tests on * known-good hardware we may want to trap these errors. */ -#undef JBD_PARANOID_IOFAIL +#undef JBD2_PARANOID_IOFAIL /* * The default maximum commit age, in seconds. */ -#define JBD_DEFAULT_MAX_COMMIT_AGE 5 +#define JBD2_DEFAULT_MAX_COMMIT_AGE 5 #ifdef CONFIG_JBD2_DEBUG /* - * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal + * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal * consistency checks. By default we don't do this unless * CONFIG_JBD2_DEBUG is on. */ -#define JBD_EXPENSIVE_CHECKING +#define JBD2_EXPENSIVE_CHECKING extern u8 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ @@ -185,8 +185,8 @@ typedef struct journal_block_tag_s __be32 t_blocknr_high; /* most-significant high 32bits. */ } journal_block_tag_t; -#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) -#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t)) +#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) +#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) /* * The revoke descriptor: used on disk to describe a series of blocks to @@ -282,8 +282,8 @@ typedef struct journal_superblock_s #include linux/fs.h #include linux/sched.h -#define JBD_ASSERTIONS -#ifdef JBD_ASSERTIONS +#define JBD2_ASSERTIONS +#ifdef JBD2_ASSERTIONS #define J_ASSERT(assert) \ do { \ if (!(assert)) {\ @@ -310,9 +310,9 @@ void
Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:22 -0400 Mingming Cao [EMAIL PROTECTED] wrote: with the patch all headers are checked. the code should become more resistant to on-disk corruptions. needless BUG_ON() have been removed. please, review for inclusion. ... @@ -269,6 +239,70 @@ return size; } +static inline int +ext4_ext_max_entries(struct inode *inode, int depth) Please remove the `inline'. done `inline' is usually wrong. It is wrong here. If this function has a single callsite (it has) then the compiler will inline it. If the function later gains more callsites then the compiler will know (correctly) not to inline it. We hope. If we find the compiler still inlines a function as large as this one then we'd need to use noinline and complain at the gcc developers. +{ + int max; + + if (depth == ext_depth(inode)) { + if (depth == 0) + max = ext4_ext_space_root(inode); + else + max = ext4_ext_space_root_idx(inode); + } else { + if (depth == 0) + max = ext4_ext_space_block(inode); + else + max = ext4_ext_space_block_idx(inode); + } + + return max; +} + +static int __ext4_ext_check_header(const char *function, struct inode *inode, + struct ext4_extent_header *eh, + int depth) +{ + const char *error_msg = NULL; Unneeded initialisation. done + int max = 0; + + if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { + error_msg = invalid magic; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) { + error_msg = unexpected eh_depth; + goto corrupted; + } + if (unlikely(eh-eh_max == 0)) { + error_msg = invalid eh_max; + goto corrupted; + } + max = ext4_ext_max_entries(inode, depth); + if (unlikely(le16_to_cpu(eh-eh_max) max)) { + error_msg = too large eh_max; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { + error_msg = invalid eh_entries; + goto corrupted; + } + return 0; + +corrupted: + ext4_error(inode-i_sb, function, + bad header in inode #%lu: %s - magic %x, + entries %u, max %u(%u), depth %u(%u), + inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), + le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), + max, le16_to_cpu(eh-eh_depth), depth); + + return -EIO; +} + ... + i = depth = ext_depth(inode); Mulitple assignments like this are somewhat unpopular from the coding-style POV. okay. We have a local variable called `i' which is not used as a simple counter and which has no explanatory comment. This is plain bad. Please find a better name for this variable. Review the code for other such lack of clarity - I'm sure there's more. i is is loop counter. I have moved the i= depth to before the while loop. - BUG_ON(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max)); - BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC); Yeah, this patch improves things a lot. How well-tested is it? Have any of these errors actually been triggered? path[i].p_hdr = ext_block_hdr(path[i].p_bh); - if (ext4_ext_check_header(__FUNCTION__, inode, - path[i].p_hdr)) { - err = -EIO; - goto out; - } } - BUG_ON(le16_to_cpu(path[i].p_hdr-eh_entries) - le16_to_cpu(path[i].p_hdr-eh_max)); - BUG_ON(path[i].p_hdr-eh_magic != EXT4_EXT_MAGIC); - if (!path[i].p_idx) { /* this level hasn't been touched yet */ path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr); @@ -1873,17 +1890,24 @@ i, EXT_FIRST_INDEX(path[i].p_hdr), path[i].p_idx); if (ext4_ext_more_to_rm(path + i)) { + struct buffer_head *bh; /* go to the next level */ ext_debug(move to level %d (block %llu)\n, i + 1, idx_pblock(path[i].p_idx)); memset(path + i + 1, 0, sizeof(*path)); - path[i+1].p_bh = - sb_bread(sb, idx_pblock(path[i].p_idx)); - if (!path[i+1].p_bh) { + bh = sb_bread(sb, idx_pblock(path[i].p_idx)); + if (!bh) { /* should we reset i_size
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Tue, 2007-07-10 at 19:31 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao [EMAIL PROTECTED] wrote: [PATCH] jbd2 stats through procfs The patch below updates the jbd stats patch to 2.6.20/jbd2. The initial patch was posted by Alex Tomas in December 2005 (http://marc.info/?l=linux-ext4m=113538565128617w=2). It provides statistics via procfs such as transaction lifetime and size. [ This probably should be rewritten to use debugfs? -- Ted] I suppose that given that we're creating a spot in debugfs for the jbd2 debug code, yes, this also should be moved over. But the jbd2 debug debugfs entries were kernel-wide whereas this is per-superblock. I think. I take this comment as we still keep the stats info in proc fs... That email from Alex contains pretty important information. I suggest that it be added to the changelog after accuracy-checking. Addition to Documentation/filesystems/ext4.txt would be good. Done. I hope Alex can help address the rest of comments. -- Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h 2007-06-11 17:28:17.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-13 10:45:21.0 -0700 @@ -408,6 +408,16 @@ }; +/* + * Some stats for checkpoint phase + */ +struct transaction_chp_stats_s { + unsigned long cs_chp_time; + unsigned long cs_forced_to_close; + unsigned long cs_written; + unsigned long cs_dropped; +}; It would be nice to document what units all these fields are in. Jiffies, I assume. /* The transaction_t type is the guts of the journaling mechanism. It * tracks a compound transaction through its various states: * @@ -543,6 +553,21 @@ spinlock_t t_handle_lock; /* +* Longest time some handle had to wait for running transaction +*/ + unsigned long t_max_wait; + + /* +* When transaction started +*/ + unsigned long t_start; + + /* +* Checkpointing stats [j_checkpoint_sem] +*/ + struct transaction_chp_stats_s t_chp_stats; + + /* * Number of outstanding updates running on this transaction * [t_handle_lock] */ @@ -573,6 +598,57 @@ }; +struct transaction_run_stats_s { + unsigned long rs_wait; + unsigned long rs_running; + unsigned long rs_locked; + unsigned long rs_flushing; + unsigned long rs_logging; + + unsigned long rs_handle_count; + unsigned long rs_blocks; + unsigned long rs_blocks_logged; +}; + +struct transaction_stats_s +{ + int ts_type; + unsigned long ts_tid; + union { + struct transaction_run_stats_s run; + struct transaction_chp_stats_s chp; + } u; +}; + +#define JBD2_STATS_RUN 1 +#define JBD2_STATS_CHECKPOINT 2 + +#define ts_waitu.run.rs_wait +#define ts_running u.run.rs_running +#define ts_locked u.run.rs_locked +#define ts_flushingu.run.rs_flushing +#define ts_logging u.run.rs_logging +#define ts_handle_countu.run.rs_handle_count +#define ts_blocks u.run.rs_blocks +#define ts_blocks_logged u.run.rs_blocks_logged + +#define ts_chp_timeu.chp.cs_chp_time +#define ts_forced_to_close u.chp.cs_forced_to_close +#define ts_written u.chp.cs_written +#define ts_dropped u.chp.cs_dropped That's a bit sleazy. We can drop the u from 'struct transaction_stats_s' and make it an anonymous union, then open-code foo.run.rs_wait everywhere. But that's a bit sleazy too, because the reader of the code would not know that a write to foo.run.rs_wait will stomp on the value of foo.chp.cs_chp_time. So to minimize reader confusion it would be best I think to just open-code the full u.run.rs_wait at all code-sites. The macros above are the worst possible choice: they hide information from the code-reader just to save the code-writer a bit of typing. But we very much want to optimise for code-readers, not for code-writers. +#define CURRENT_MSECS (jiffies_to_msecs(jiffies)) hm, that isn't something which should be in an ext4 header file. And it shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a global scalar). IOW: yuk. How's about raising a separate, standalone patch which creates a new kernel-wide coded-in-C function such as unsigned long jiffies_in_msecs(void); ? (That's assuming we don't already have one. Most likely we have seven of them hiding in various dark corners). +static inline
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Tue, 2007-07-10 at 21:42 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 23:21:49 -0400 Cédric Augonnet [EMAIL PROTECTED] wrote: 2007/7/10, Andrew Morton [EMAIL PROTECTED]: Hi all, + size = sizeof(struct transaction_stats_s); + s-stats = kmalloc(size, GFP_KERNEL); + if (s == NULL) { + kfree(s); + return -EIO; ENOMEM I'm sorry if i missed some point, but i just don't see the use of such a kfree here, not sure Andrew meant you should only return ENOMEM instead, but why issuing those kfree(NULL), instead of just a if (!s) return ENOMEM ? You found a bug. It was meant to be if (s-stats == NULL) The incremental fix patch is attached just FYI. It will be folded to the parent patch once the parent patch is being updated. --- fs/jbd2/journal.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.22/fs/jbd2/journal.c === --- linux-2.6.22.orig/fs/jbd2/journal.c 2007-07-16 00:05:03.0 -0700 +++ linux-2.6.22/fs/jbd2/journal.c 2007-07-16 00:05:59.0 -0700 @@ -751,7 +751,7 @@ static int jbd2_seq_history_open(struct return -EIO; size = sizeof(struct transaction_stats_s) * journal-j_history_max; s-stats = kmalloc(size, GFP_KERNEL); - if (s == NULL) { + if (s-stats == NULL) { kfree(s); return -EIO; } - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote: + brelse(bh); + up_write(EXT4_I(inode)-xattr_sem); + return error; +} + We're doing GFP_KERNEL memory allocations while holding xattr_sem. This can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or i_truncate_sem and/or journal_start() (I forget whether this still happens). Have we checked whether this can occur and if so, whether we are OK from a lock ranking POV? Bear in mind that journalled-data mode is more complex in this regard. I notice that everyone carefully avoided addressing this ;) I am not sure why we need GFP_KERNEL flag here. I think we should use GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as fixing memory leak issue introduced by the ext4 expand inode extra isize patch. Fixing memory allocation issue with expand inode extra isize patch. - use GFP_NOFS instead of GFP_KERNEL flag for memory allocation - use kzalloc instead of kmalloc - fix memory leak in the success case, at the end of while loop. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/xattr.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Index: linux-2.6.22/fs/ext4/xattr.c === --- linux-2.6.22.orig/fs/ext4/xattr.c 2007-07-16 16:12:18.0 -0700 +++ linux-2.6.22/fs/ext4/xattr.c2007-07-16 16:35:59.0 -0700 @@ -1204,8 +1204,8 @@ retry: unsigned int shift_bytes; /* No. of bytes to shift EAs by? */ unsigned int min_total_size = ~0U; - is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_KERNEL); - bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_KERNEL); + is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS); + bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS); if (!is || !bs) { error = -ENOMEM; goto cleanup; @@ -1251,8 +1251,8 @@ retry: size = le32_to_cpu(entry-e_value_size); entry_size = EXT4_XATTR_LEN(entry-e_name_len); i.name_index = entry-e_name_index, - buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_KERNEL); - b_entry_name = kmalloc(entry-e_name_len + 1, GFP_KERNEL); + buffer = kzalloc(EXT4_XATTR_SIZE(size), GFP_NOFS); + b_entry_name = kzalloc(entry-e_name_len + 1, GFP_NOFS); if (!buffer || !b_entry_name) { error = -ENOMEM; goto cleanup; @@ -1302,7 +1302,15 @@ retry: error = ext4_xattr_block_set(handle, inode, i, bs); if (error) goto cleanup; + kfree(b_entry_name); + kfree(buffer); + brelse(is-iloc.bh); + kfree(is); + kfree(bs); + brelse(bh); } + up_write(EXT4_I(inode)-xattr_sem); +return 0; cleanup: kfree(b_entry_name); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Mon, 2007-07-16 at 18:06 -0600, Andreas Dilger wrote: On Jul 16, 2007 16:52 -0700, Mingming Cao wrote: I am not sure why we need GFP_KERNEL flag here. I think we should use GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as fixing memory leak issue introduced by the ext4 expand inode extra isize patch. Fixing memory allocation issue with expand inode extra isize patch. - use GFP_NOFS instead of GFP_KERNEL flag for memory allocation - use kzalloc instead of kmalloc This doesn't need kzalloc() for buffer and b_entry_name, since they are immediately overwritten by memcpy(). ok. - fix memory leak in the success case, at the end of while loop. goto cleanup; @@ -1302,7 +1302,15 @@ retry: error = ext4_xattr_block_set(handle, inode, i, bs); if (error) goto cleanup; + kfree(b_entry_name); + kfree(buffer); + brelse(is-iloc.bh); + kfree(is); + kfree(bs); + brelse(bh); } + up_write(EXT4_I(inode)-xattr_sem); +return 0; cleanup: kfree(b_entry_name); I don't think you should have brelse(bh) inside the loop, since it is allocated before the loop starts. Ahh right. thanks. Updated fix. Fixing memory allocation issue with expand inode extra isize patch. - use GFP_NOFS instead of GFP_KERNEL to prevent fs reenter - fix memory leak in the success case, at the end of while loop. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/ext4/xattr.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Index: linux-2.6.22/fs/ext4/xattr.c === --- linux-2.6.22.orig/fs/ext4/xattr.c 2007-07-16 17:21:14.0 -0700 +++ linux-2.6.22/fs/ext4/xattr.c2007-07-16 17:22:25.0 -0700 @@ -1204,8 +1204,8 @@ retry: unsigned int shift_bytes; /* No. of bytes to shift EAs by? */ unsigned int min_total_size = ~0U; - is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_KERNEL); - bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_KERNEL); + is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS); + bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS); if (!is || !bs) { error = -ENOMEM; goto cleanup; @@ -1251,8 +1251,8 @@ retry: size = le32_to_cpu(entry-e_value_size); entry_size = EXT4_XATTR_LEN(entry-e_name_len); i.name_index = entry-e_name_index, - buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_KERNEL); - b_entry_name = kmalloc(entry-e_name_len + 1, GFP_KERNEL); + buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_NOFS); + b_entry_name = kmalloc(entry-e_name_len + 1, GFP_NOFS); if (!buffer || !b_entry_name) { error = -ENOMEM; goto cleanup; @@ -1302,7 +1302,15 @@ retry: error = ext4_xattr_block_set(handle, inode, i, bs); if (error) goto cleanup; + kfree(b_entry_name); + kfree(buffer); + brelse(is-iloc.bh); + kfree(is); + kfree(bs); } + brelse(bh); + up_write(EXT4_I(inode)-xattr_sem); + return 0; cleanup: kfree(b_entry_name); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is a spinoff of the old nanosecond patches. I don't know what the old nanosecond patches are. A link to a suitable changlog for those patches would do in a pinch. Preferable would be to write a proper changelog for this patch. It includes some cleanups and addition of a creation timestamp. The EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with s_{min, want}_extra_isize fields in struct ext3_super_block. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/ialloc.c Please include diffstat output when preparing patches. + +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field) \ + ((offsetof(typeof(*ext4_inode), field) +\ + sizeof((ext4_inode)-field)) \ + = (EXT4_GOOD_OLD_INODE_SIZE + \ + (einode)-i_extra_isize)) \ Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers under what circumstances something will not fit in an inode and what the consequences of this are. +static inline __le32 ext4_encode_extra_time(struct timespec *time) +{ + return cpu_to_le32((sizeof(time-tv_sec) 4 ? + time-tv_sec 32 : 0) | + ((time-tv_nsec 2) EXT4_NSEC_MASK)); +} + +static inline void ext4_decode_extra_time(struct timespec *time, __le32 extra) +{ + if (sizeof(time-tv_sec) 4) + time-tv_sec |= (__u64)(le32_to_cpu(extra) EXT4_EPOCH_MASK) + 32; + time-tv_nsec = (le32_to_cpu(extra) EXT4_NSEC_MASK) 2; +} Consider uninlining these functions. I got compile warining after apply Kalpal's update nanosecond patch, which makes these two function inline. It complains these functions are defined but not used. It's being used only in the following micros(EXT4_INODE_SET_XTIME etc). So if the .c file included the ext4_fs.h but not using the micros, the compile will think these two functions are not used. Mingming +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \ +do { \ + (raw_inode)-xtime = cpu_to_le32((inode)-xtime.tv_sec); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + (raw_inode)-xtime ## _extra = \ + ext4_encode_extra_time((inode)-xtime); \ +} while (0) + +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode) \ +do { \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ + (raw_inode)-xtime = cpu_to_le32((einode)-xtime.tv_sec); \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\ + (raw_inode)-xtime ## _extra = \ + ext4_encode_extra_time((einode)-xtime); \ +} while (0) + +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do { \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + ext4_decode_extra_time((inode)-xtime,\ + raw_inode-xtime ## _extra);\ +} while (0) + +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode) \ +do { \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ + (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\ + ext4_decode_extra_time((einode)-xtime, \ + raw_inode-xtime ## _extra);\ +} while (0) Ugly. I expect these could be implemented as plain old C functions. Caller could pass in the address of the ext4_inode field which the function is to operate upon. #if defined(__KERNEL__) || defined(__linux__) (What's the __linux__ for?) #define i_reserved1osd1.linux1.l_i_reserved1 #define i_frag osd2.linux2.l_i_frag @@ -539,6 +603,13 @@ return container_of(inode, struct
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Fri, 2007-07-13 at 12:35 +0530, Kalpak Shah wrote: On Fri, 2007-07-13 at 09:59 +0530, Aneesh Kumar K.V wrote: Kalpak Shah wrote: On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is a spinoff of the old nanosecond patches. I don't know what the old nanosecond patches are. A link to a suitable changlog for those patches would do in a pinch. Preferable would be to write a proper changelog for this patch. The incremental patch contains a proper changelog describing the patch. Instead of putting incremental patches it would be nice if we can have replacement patches. for the already existing patches with the comments addressed. For example if we have a review comment on the patch message ( commit log ) then adding an incremental patch doesn't help. I think that it would be easier to review just the changes that have been made to the patches instead of having people go through the entire patch again. I was hoping that someone with write access to ext4-git would update the commit logs. If replacement patches are preferred, then I will send them again. No need, I already fold your fix patch to the parent patches, so in the updated ext4-patch-queue it saved the updated nanosecond patch. Thanks, Kalpak. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote: On Tue, 10 Jul 2007 16:30:25 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Sun, 01 Jul 2007 03:36:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Jun 07, 2007 23:45 -0500, Jose R. Santos wrote: The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names with more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. This patch moves the file to /proc/jbd2-degug. The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require some minor alterations to the jbd-stats patch. I don't think we really want to be adding top-level files in /proc. What about using the debugfs filesystem (not to be confused with the e2fsprogs 'debugfs' command)? How about this then? Moved the file to use debugfs as well as having the nice effect of removing more lines than what it adds. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Please clean up the changelog. The changelog should include information about the location and the content of these debugfs files. it should provide any instructions which users will need to be able to create and use those files. Will fix. Alternatively (and preferably) do this via an update to Documentation/filesystems/ext4.txt. Seems like I also need to update the doc on Kconfig as well. Do you prefer this in separate patches? (current patch, kconfig patch, ext4 doc update patch? fs/jbd2/journal.c| 6220 +42 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 2 files changed, 21 insertions(+), 43 deletions(-) Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm. Apart from the lack of testing and review which this causes, it means I can't just do `pushpatch name-of-this-patch' and look at it in tkdiff. So I squint at the diff, but that's harder when the diff wasn't prepared with `diff -p'. Oh well. Will fix. Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 -0700 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1954,60 +1955,37 @@ * /proc tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u16 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) Has this been compile-tested with CONFIG_DEBUGFS=n? I think I did, but honestly don't remember. Will check with the new patch. :) Yes, I remember I did, that discovered some inconsistency in ext4 code, which has already been fixed. -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) +#define jbd2_create_debugfs_entry() do {} while (0) +#define jbd2_remove_debugfs_entry() do {} while (0) I suggest that these be converted to (preferable) inline functions while you're there. OK. #endif @@ -2067,7 +2045,7 @@ ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2056,7 @@ if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING JBD2? -extern int jbd2_journal_enable_debug; +extern u16 jbd2_journal_enable_debug; Why was this made 16-bit? To save 2 bytes? Could have saved 3 if we're going to do that. OK. Shoudln't all this debug info be a per-superblock thing rather than kernel-wide? I don't think it is worth pursuing this feature since this seems to have been broken for a while now (its been there since the first git revission in ext3) and nobody has noticed it until now. It could be address on a later patch though, since the initial purpose of the patch was to fix
Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote: On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote: Sorry about this. I was using version 0.04. The latest one I can find for now is 0.05(searching lkml), but it didn't catch this codling style bug either. I appreciate if anyone can point me the version 0.07, thanks It's now in-tree in scripts/checkpatch.pl (linus' tree is still at 0.06 though, I suspect Andrew has something newer in -mm) Thanks, Andy has uploaded the 0.07 version at http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07 Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) Okay. - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing procedure. It allocates a new index block, moves the top-level data of the tree(root or leaf blocks in i_data) into the new block, initializes the new root, creating index that points to the just created index block The original top-level data (in i_data) could be extent tree root (index block) or extents (leaf block). The current code (without the patch) treats the top-level data always be the leaf block, which is incorrect. assumes when the tree is growing the extent structure pass in is always @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:35:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a mount option to turn off extents. Please update the changelog to describe the reason for making this change. Sure, I will update the changelog, mainly this is for wide testing of extents feature in ext4dev. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:02:22.0 -0700 @@ -725,7 +725,7 @@ Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, - Opt_grpquota, Opt_extents, + Opt_grpquota, Opt_extents, Opt_noextents, }; static match_table_t tokens = { @@ -776,6 +776,7 @@ {Opt_usrquota, usrquota}, {Opt_barrier, barrier=%u}, {Opt_extents, extents}, + {Opt_noextents, noextents}, {Opt_err, NULL}, {Opt_resize, resize}, }; @@ -,6 +1112,9 @@ case Opt_extents: set_opt (sbi-s_mount_opt, EXTENTS); break; + case Opt_noextents: + clear_opt (sbi-s_mount_opt, EXTENTS); + break; default: printk (KERN_ERR EXT4-fs: Unrecognized mount option \%s\ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:32 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. This patch isn't in Ted's kernel.org directory and hasn't been in -mm. Where did it come from? Is something amiss with ext4 patch management? Jose Santo posted it to linux-ext4 mailing list. I agree this bug fix should included in Ted's git tree or mm tree. There are other ext4 cleanups in this series that should goes to mm tree also. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] --- fs/ext4/super.c | 11 +++ 1 file changed, 11 insertions(+) Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 16:15:54.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 16:16:10.0 -0700 @@ -1804,6 +1804,13 @@ Please prepare patches using `diff -p' Will do. goto failed_mount3; } + if (ext4_blocks_count(es) 0xULL + !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_64BIT)) { + printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n); + goto failed_mount4; + } It is not appropriate for the text ext4 to appear in a JBD2 message. This is part of ext4 code. Ext4 will set the 64-bit JBD2 flag if the fs is larger than 32 bit blocks. /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:01 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Turn on extents feature by default in ext4 filesystem. User could use -o noextents to turn it off. Oh, there you go. Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:03:09.0 -0700 @@ -1546,6 +1546,12 @@ set_opt(sbi-s_mount_opt, RESERVATION); + /* +* turn on extents feature by default in ext4 filesystem +* User -o noextents to turn it off +*/ + set_opt (sbi-s_mount_opt, EXTENTS); + Broken coding style. Please feed all the ext4 patches through scripts/checkpatch.pl (preferably version 0.07 - see Andy's patch on lkml) and then consider addressing the (quite large) number of mistakes which are detected. Sorry about this. I was using version 0.04. The latest one I can find for now is 0.05(searching lkml), but it didn't catch this codling style bug either. I appreciate if anyone can point me the version 0.07, thanks Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:04 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. That's obvious from the patch. But what was the reason for making this (unrelated to ext4) change? The need is came from NFSv4 On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: Hi, This is an update of the i_version patch. The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530: 5.5. Mandatory Attributes - Definitions Name # DataType Access Description ___ change3 uint64 READ A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The servermay return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata. Please update the changelog for this. Is above description clear to you? Index: linux-2.6.21/include/linux/fs.h === --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -549,7 +549,7 @@ struct inode { uid_t i_uid; gid_t i_gid; dev_t i_rdev; - unsigned long i_version; + u64 i_version; loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:04 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. That's obvious from the patch. But what was the reason for making this (unrelated to ext4) change? The need is came from NFSv4 On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: Hi, This is an update of the i_version patch. The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530: 5.5. Mandatory Attributes - Definitions Name # DataType Access Description ___ change3 uint64 READ A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The servermay return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata. Please update the changelog for this. Is above description clear to you? Yes, thanks. It doesn't actually tell us why we want to implement this attribute and it doesn't tell us what the implications of failing to do so are, but I guess we can take that on trust from the NFS guys. But I suspect the ext4 implementation doesn't actually do this. afaict we won't update i_version for file overwrites (especially if s_time_gran can indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What would be the implications of this? In the case of overwrite (file date updated), I assume the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()-.. -__generic_file_aio_write_nolock() -file_update_time() -mark_inode_dirty_sync() -__mark_inode_dirty(I_DIRTY_SYNC) -ext4_dirty_inode() -ext4_mark_inode_dirty() And how does the NFS server know that the filesystem implements i_version? Will a zero-value of i_version have special significance, telling the server to not send this attribute, perhaps? Bruce raised up this question a few days back when he reviewed this patch, I think the solution is add a superblock flag for fs support inode versioning, probably at VFS layer? Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is a spinoff of the old nanosecond patches. I don't know what the old nanosecond patches are. A link to a suitable changlog for those patches would do in a pinch. Preferable would be to write a proper changelog for this patch. I found the original patch http://marc.info/?l=linux-ext4m=115091699809181w=2 Andreas or Kalpak, is changelog from the original patch is accurate to apply here? Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: On Tuesday July 10, [EMAIL PROTECTED] wrote: Yes, thanks. It doesn't actually tell us why we want to implement this attribute and it doesn't tell us what the implications of failing to do so are, but I guess we can take that on trust from the NFS guys. You would like to think so, but remember NFSv4 was designed by a committee :-) The 'change' number is used for cache consistency, and as the spec makes very strong statements about the 'change' number, it is very hard (or impossible) to implement a server correctly without storing a change number in stable storage (just one of my grips about V4). But I suspect the ext4 implementation doesn't actually do this. afaict we won't update i_version for file overwrites (especially if s_time_gran can indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What would be the implications of this? The first part sounds like a bug - i_version should really be updated by every call to -commit_write (if that is still what it is called). The MAP_SHARED thing is less obvious. I guess every time we notice that the page might have been changed, we need to increment i_version. And how does the NFS server know that the filesystem implements i_version? Will a zero-value of i_version have special significance, telling the server to not send this attribute, perhaps? That is a very important question. Zero probably makes sense, but what ever it is needs to be agreed and documented. And just by-the-way, the server doesn't really have the option of not sending the attribute. If i_version isn't defined, it has to fake something using mtime, and hope that is good enough. Alternately we could mandate that i_version is always kept up-to-date and if a filesystem doesn't have anything to load from storage, it just sets it to the current time in nanoseconds. That would mean that a client would need to flush it's cache whenever the inode fell out of cache on the server, but I don't think we can reliably do better than that. I think I like that approach. So my vote is to increment i_version in common code every time any change is made to the file, David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. ... but doesn't lustre want to control its i_version... so maybe not :-( NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Tue, 2007-07-10 at 21:42 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 23:21:49 -0400 Cédric Augonnet [EMAIL PROTECTED] wrote: 2007/7/10, Andrew Morton [EMAIL PROTECTED]: Hi all, + size = sizeof(struct transaction_stats_s); + s-stats = kmalloc(size, GFP_KERNEL); + if (s == NULL) { + kfree(s); + return -EIO; ENOMEM I'm sorry if i missed some point, but i just don't see the use of such a kfree here, not sure Andrew meant you should only return ENOMEM instead, but why issuing those kfree(NULL), instead of just a if (!s) return ENOMEM ? You found a bug. It was meant to be if (s-stats == NULL) Thanks, I will make sure this get fixed in ext4 patch queue. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 21:22 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:04 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. That's obvious from the patch. But what was the reason for making this (unrelated to ext4) change? The need is came from NFSv4 On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: Hi, This is an update of the i_version patch. The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530: 5.5. Mandatory Attributes - Definitions Name # DataType Access Description ___ change3 uint64 READ A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The servermay return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata. Please update the changelog for this. Is above description clear to you? Yes, thanks. It doesn't actually tell us why we want to implement this attribute and it doesn't tell us what the implications of failing to do so are, but I guess we can take that on trust from the NFS guys. But I suspect the ext4 implementation doesn't actually do this. afaict we won't update i_version for file overwrites (especially if s_time_gran can indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What would be the implications of this? In the case of overwrite (file date updated), I assume the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()-.. -__generic_file_aio_write_nolock() -file_update_time() -mark_inode_dirty_sync() -__mark_inode_dirty(I_DIRTY_SYNC) -ext4_dirty_inode() -ext4_mark_inode_dirty() That assumes an mtime update for every write(). OK, so two writes in a single nanosecond won't be happening. But in that case why is this code: static inline struct timespec ext4_current_time(struct inode *inode) { return (inode-i_sb-s_time_gran NSEC_PER_SEC) ? current_fs_time(inode-i_sb) : CURRENT_TIME_SEC; } checking (s_time_gran NSEC_PER_SEC) ?? Ext4 can still load/read ext3 fs (which by default with 128 bytes old inode size, means doens't have support nanosecond timestamps), so it's not always gurantee nanosecond timestamps granularity.(it depends on the size of the inode (128 bytes), by default, a fresh ext4 increase inode size to 256 bytes to have the room to store nanoseond timestamps, inode versioning etc) Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS server implementation: if we were to later decrease s_time_gran (as we might do, for performance reasons), the NFS server implementation starts reporting incorrect information. :( that is a problem... And how does the NFS server know that the filesystem implements i_version? Will a zero-value of i_version have special significance, telling the server to not send this attribute, perhaps? Bruce raised up this question a few days back when he reviewed this patch, I think the solution is add a superblock flag for fs support inode versioning, probably at VFS layer? That would work. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Fri, 2007-07-06 at 16:53 -0600, Andreas Dilger wrote: On Jul 06, 2007 09:51 -0400, J. Bruce Fields wrote: The use of a mount option means the change attribute could be inconsistent across mounts. If we really need this, wouldn't it make more sense for it to be a persistent feature of the filesystem, set at mkfs time? Yes, having it stored into the superblock in s_flags is probably a good idea. Kalpak, do you think you could get a patch that adds e.g. EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs). Per our ext4 interlock discussion today, I have dropped the ext4-no-inode version-mount-option patch from ext4 patch queue. Thanks, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote: On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote: + +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do { \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + ext4_decode_extra_time((inode)-xtime,\ + raw_inode-xtime ## _extra);\ +} while (0) + +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode) \ +do { \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ + (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\ + ext4_decode_extra_time((einode)-xtime, \ + raw_inode-xtime ## _extra);\ +} while (0) + This nanosecond patch seems to be missing the fix below which is required for http://bugzilla.kernel.org/show_bug.cgi?id=5079 If the timestamp is set to before epoch i.e. a negative timestamp then the file may have its date set into the future on 64-bit systems. So when the timestamp is read it must be cast as signed. Missed this one. Thanks. Will update ext4 patch queue tonight with this fix. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
Trond or Bruce, can you please review these patch series and ack if you agrees? Thanks. As to performance concerns that raise before the inode version counter (at least for ext4) is done inside ext4_mark_inode_dirty), so there is no extra IO work to store this counter to disk. Mingming On Sun, 2007-07-01 at 03:37 -0400, Mingming Cao wrote: This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Index: linux-2.6.21/include/linux/fs.h === --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -549,7 +549,7 @@ struct inode { uid_t i_uid; gid_t i_gid; dev_t i_rdev; - unsigned long i_version; + u64 i_version; loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext4 patches for 2.6.22-rc6
On Fri, 2007-06-29 at 13:57 -0700, Andrew Morton wrote: On Fri, 29 Jun 2007 11:50:04 -0400 Mingming Caoc [EMAIL PROTECTED] wrote: I think the ext4 patch queue is in good shape now. Which ext4 patches are you intending to merge into 2.6.23? Please send all those out to lkml for review? Hi Andrew, Here are the patches in ext4-patch-queue that I think can be considered to be merged to upstream. Please review. All of the patches have been posted on ext4 mailinglist before. Some are bug fixes, some are features, to summaries: - make extents on by default in ext4dev - nanosecond timestamp - 64 bit inode versioning support - remove 32k subdir limits - journal checksumming - journal stats via procfs - delayed allocation for ext4 writeback mode - fallocate() All the patches can be found at http://repo.or.cz/w/ext4-patch-queue.git and have been tested(with fsx ,dbench, FFSB, iozone) on x86,x86_64,ppc64, with extents and delayed allocation enabled And the full series can be found at http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=series;h=2f43431db28778ce8d2149bce7a51566a2d2517c;hb=56e27e20cf228b32f5162a76b3bad154d1d3b730 I will post the patches-in-good-shape (in 9 set of patches) to lkml in the following emails, except for the bottom two feature: *the fallocate() patches, which Amit just posted a few days ago and are under review (hopefully we can reach a agreement on the interface and the modes before 2.6.23-rc1 window closed). *Another one is the delayed allocation patches in ext4 patch queue. Alex mentioned in another email that he is working on another version of delalloc that can handle block size page size, and move some work to vfs. So it's probably not very useful to post this version for people to review. So, here is the series file. # Rebased the patches to 2.6.22-rc6 # Add mount option to turn off extents ext4_noextent_mount_opt.patch # Mounted ext4dev fs with extents by default for testing purpose, # for Ext4 product release, extents mount option # will be turn on only if the fs has EXTENTS feature on ext4_extents_on_by_default.patch # Propagate inode flags ext4-propagate_flags.patch # Add extent sanity checks ext4-extent-sanity-checks.patch # Bug fix:set 64bit JBD2 feature on 32bit ext4 fs ext4_set_jbd2_64bit_feature.patch # Fix: Rename CONFIG_JBD_DEBUG to CONFIG_JBD2_DEBUG jbd2_config_jbd2_debug_fix.patch # Export jbd2-debug via debugfs ext4_CONFIG_JBD2_DEBUG.patch jbd2_move_jbd2_debug_to_debugfs.patch # Nanosecond timestamp support ext4-nanosecond-patch # inode verion patch series # inode versioning is needed for NFSv4 # vfs changes, 64 bit inode-i_version 64-bit-i_version.patch # reserve hi 32 bit inode version on ext4 on-disk inode i_version_hi.patch # ext4 inode version read/store ext4_i_version_hi_2.patch # ext4 inode version update i_version_update_ext4.patch # add a noversion mount option to disable inode version updates ext4_no_version.patch # New patch to expand inode i_extra_isize to support features # in high part of inode (128 bytes) ext4_expand_inode_extra_isize.patch # Export jbd stats through procfs # Shall this move to debugfs? jbd-stats-through-procfs # Remove 32000 subdirs limit. ext4_remove_subdirs_limit.patch # Add journal checksums ext4-journal_chksum-2.6.20.patch # Various Cleanups ext4-zero_user_page.patch is_power_of_2-ext4-superc.patch ext4-remove-extra-is_rdonly-check.patch ext4_extent_compilation_fixes.patch ext4_extent_macros_cleanup.patch - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
Turn on extents feature by default in ext4 filesystem. User could use -o noextents to turn it off. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:03:09.0 -0700 @@ -1546,6 +1546,12 @@ set_opt(sbi-s_mount_opt, RESERVATION); + /* +* turn on extents feature by default in ext4 filesystem +* User -o noextents to turn it off +*/ + set_opt (sbi-s_mount_opt, EXTENTS); + if (!parse_options ((char *) data, sb, journal_inum, journal_devnum, NULL, 0)) goto failed_mount; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks
with the patch all headers are checked. the code should become more resistant to on-disk corruptions. needless BUG_ON() have been removed. please, review for inclusion. Signed-off-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/extents.c === --- linux-2.6.22-rc4.orig/fs/ext4/extents.c 2007-06-11 17:22:15.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/extents.c 2007-06-11 17:27:57.0 -0700 @@ -91,36 +91,6 @@ ix-ei_leaf_hi = cpu_to_le16((unsigned long) ((pb 31) 1) 0x); } -static int ext4_ext_check_header(const char *function, struct inode *inode, - struct ext4_extent_header *eh) -{ - const char *error_msg = NULL; - - if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { - error_msg = invalid magic; - goto corrupted; - } - if (unlikely(eh-eh_max == 0)) { - error_msg = invalid eh_max; - goto corrupted; - } - if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { - error_msg = invalid eh_entries; - goto corrupted; - } - return 0; - -corrupted: - ext4_error(inode-i_sb, function, - bad header in inode #%lu: %s - magic %x, - entries %u, max %u, depth %u, - inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), - le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), - le16_to_cpu(eh-eh_depth)); - - return -EIO; -} - static handle_t *ext4_ext_journal_restart(handle_t *handle, int needed) { int err; @@ -269,6 +239,70 @@ return size; } +static inline int +ext4_ext_max_entries(struct inode *inode, int depth) +{ + int max; + + if (depth == ext_depth(inode)) { + if (depth == 0) + max = ext4_ext_space_root(inode); + else + max = ext4_ext_space_root_idx(inode); + } else { + if (depth == 0) + max = ext4_ext_space_block(inode); + else + max = ext4_ext_space_block_idx(inode); + } + + return max; +} + +static int __ext4_ext_check_header(const char *function, struct inode *inode, + struct ext4_extent_header *eh, + int depth) +{ + const char *error_msg = NULL; + int max = 0; + + if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { + error_msg = invalid magic; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) { + error_msg = unexpected eh_depth; + goto corrupted; + } + if (unlikely(eh-eh_max == 0)) { + error_msg = invalid eh_max; + goto corrupted; + } + max = ext4_ext_max_entries(inode, depth); + if (unlikely(le16_to_cpu(eh-eh_max) max)) { + error_msg = too large eh_max; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { + error_msg = invalid eh_entries; + goto corrupted; + } + return 0; + +corrupted: + ext4_error(inode-i_sb, function, + bad header in inode #%lu: %s - magic %x, + entries %u, max %u(%u), depth %u(%u), + inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), + le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), + max, le16_to_cpu(eh-eh_depth), depth); + + return -EIO; +} + +#define ext4_ext_check_header(inode, eh, depth)\ + __ext4_ext_check_header(__FUNCTION__, inode, eh, depth) + #ifdef EXT_DEBUG static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path) { @@ -329,6 +363,7 @@ /* * ext4_ext_binsearch_idx: * binary search for the closest index of the given block + * the header must be checked before calling this */ static void ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int block) @@ -336,9 +371,6 @@ struct ext4_extent_header *eh = path-p_hdr; struct ext4_extent_idx *r, *l, *m; - BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC); - BUG_ON(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max)); - BUG_ON(le16_to_cpu(eh-eh_entries) = 0); ext_debug(binsearch for %d(idx): , block); @@ -388,6 +420,7 @@ /* * ext4_ext_binsearch: * binary search for closest extent of the given block + * the header must be checked before calling this */ static void ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) @@ -395,9 +428,6 @@ struct ext4_extent_header *eh = path-p_hdr; struct
[EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] --- fs/ext4/super.c | 11 +++ 1 file changed, 11 insertions(+) Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 16:15:54.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 16:16:10.0 -0700 @@ -1804,6 +1804,13 @@ goto failed_mount3; } + if (ext4_blocks_count(es) 0xULL + !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_64BIT)) { + printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n); + goto failed_mount4; + } + /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 2][PATCH 4/5] cleanups: Rename CONFIG_JBD_DEBUG to CONFIG_JBD2_DEBUG
When the JBD code was forked to create the new JBD2 code base, the references to CONFIG_JBD_DEBUG where never changed to CONFIG_JBD2_DEBUG. This patch fixes that. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] --- Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:15:49.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 @@ -528,7 +528,7 @@ { int err = 0; -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG spin_lock(journal-j_state_lock); if (!tid_geq(journal-j_commit_request, tid)) { printk(KERN_EMERG @@ -1709,7 +1709,7 @@ * Journal_head storage management */ static struct kmem_cache *jbd2_journal_head_cache; -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG static atomic_t nr_journal_heads = ATOMIC_INIT(0); #endif @@ -1747,7 +1747,7 @@ struct journal_head *ret; static unsigned long last_warning; -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG atomic_inc(nr_journal_heads); #endif ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS); @@ -1768,7 +1768,7 @@ static void journal_free_journal_head(struct journal_head *jh) { -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG atomic_dec(nr_journal_heads); memset(jh, JBD_POISON_FREE, sizeof(*jh)); #endif @@ -1953,12 +1953,12 @@ /* * /proc tunables */ -#if defined(CONFIG_JBD_DEBUG) +#if defined(CONFIG_JBD2_DEBUG) int jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) static struct proc_dir_entry *proc_jbd_debug; @@ -2073,7 +2073,7 @@ static void __exit journal_exit(void) { -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG int n = atomic_read(nr_journal_heads); if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); Index: linux-2.6.22-rc4/fs/jbd2/recovery.c === --- linux-2.6.22-rc4.orig/fs/jbd2/recovery.c2007-06-04 17:57:25.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/recovery.c 2007-06-11 16:16:18.0 -0700 @@ -295,7 +295,7 @@ printk(KERN_ERR JBD: error %d scanning journal\n, err); ++journal-j_transaction_sequence; } else { -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG int dropped = info.end_transaction - be32_to_cpu(sb-s_sequence); #endif jbd_debug(0, Index: linux-2.6.22-rc4/include/linux/ext4_fs.h === --- linux-2.6.22-rc4.orig/include/linux/ext4_fs.h 2007-06-11 16:15:59.0 -0700 +++ linux-2.6.22-rc4/include/linux/ext4_fs.h2007-06-11 16:16:18.0 -0700 @@ -237,7 +237,7 @@ #define EXT4_IOC_GROUP_ADD _IOW('f', 8,struct ext4_new_group_input) #defineEXT4_IOC_GETVERSION_OLD FS_IOC_GETVERSION #defineEXT4_IOC_SETVERSION_OLD FS_IOC_SETVERSION -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG #define EXT4_IOC_WAIT_FOR_READONLY _IOR('f', 99, long) #endif #define EXT4_IOC_GETRSVSZ _IOR('f', 5, long) @@ -253,7 +253,7 @@ #define EXT4_IOC32_GETRSVSZ_IOR('f', 5, int) #define EXT4_IOC32_SETRSVSZ_IOW('f', 6, int) #define EXT4_IOC32_GROUP_EXTEND_IOW('f', 7, unsigned int) -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG #define EXT4_IOC32_WAIT_FOR_READONLY _IOR('f', 99, int) #endif #define EXT4_IOC32_GETVERSION_OLD FS_IOC32_GETVERSION Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h === --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 16:15:55.0 -0700 +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 16:16:18.0 -0700 @@ -71,7 +71,7 @@ struct list_head s_orphan; unsigned long s_commit_interval; struct block_device *journal_bdev; -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG struct timer_list turn_ro_timer;/* For turning read-only (crash simulation) */ wait_queue_head_t ro_wait_queue;/* For people waiting for the fs to go read-only */ #endif Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h 2007-06-11 16:15:49.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:16:18.0 -0700 @@ -50,11 +50,11 @@ */ #define JBD_DEFAULT_MAX_COMMIT_AGE 5 -#ifdef CONFIG_JBD_DEBUG +#ifdef CONFIG_JBD2_DEBUG /* * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal * consistency checks. By default we don't do this unless - * CONFIG_JBD_DEBUG
[EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Jun 07, 2007 23:45 -0500, Jose R. Santos wrote: The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names with more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. This patch moves the file to /proc/jbd2-degug. The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require some minor alterations to the jbd-stats patch. I don't think we really want to be adding top-level files in /proc. What about using the debugfs filesystem (not to be confused with the e2fsprogs 'debugfs' command)? How about this then? Moved the file to use debugfs as well as having the nice effect of removing more lines than what it adds. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] --- fs/jbd2/journal.c| 6220 +42 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 2 files changed, 21 insertions(+), 43 deletions(-) Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c 2007-06-11 16:36:10.0 -0700 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1954,60 +1955,37 @@ * /proc tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u16 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) -static struct proc_dir_entry *proc_jbd_debug; +#define JBD2_DEBUG_NAME jbd2-debug -static int read_jbd_debug(char *page, char **start, off_t off, - int count, int *eof, void *data) -{ - int ret; - - ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug); - *eof = 1; - return ret; -} +struct dentry *jbd2_debugfs_dir, *jbd2_debug; -static int write_jbd_debug(struct file *file, const char __user *buffer, - unsigned long count, void *data) +static void __init jbd2_create_debugfs_entry(void) { - char buf[32]; - - if (count ARRAY_SIZE(buf) - 1) - count = ARRAY_SIZE(buf) - 1; - if (copy_from_user(buf, buffer, count)) - return -EFAULT; - buf[ARRAY_SIZE(buf) - 1] = '\0'; - jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10); - return count; -} - -#define JBD_PROC_NAME sys/fs/jbd2-debug - -static void __init create_jbd_proc_entry(void) -{ - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL); - if (proc_jbd_debug) { - /* Why is this so hard? */ - proc_jbd_debug-read_proc = read_jbd_debug; - proc_jbd_debug-write_proc = write_jbd_debug; - } + jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL); + if (jbd2_debugfs_dir) + jbd2_debug = debugfs_create_u16(JBD2_DEBUG_NAME, S_IRUGO, + jbd2_debugfs_dir, + jbd2_journal_enable_debug); } -static void __exit jbd2_remove_jbd_proc_entry(void) +static void __exit jbd2_remove_debugfs_entry(void) { - if (proc_jbd_debug) - remove_proc_entry(JBD_PROC_NAME, NULL); + if (jbd2_debug) + debugfs_remove(jbd2_debug); + if (jbd2_debugfs_dir) + debugfs_remove(jbd2_debugfs_dir); } #else -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) +#define jbd2_create_debugfs_entry() do {} while (0) +#define jbd2_remove_debugfs_entry() do {} while (0) #endif @@ -2067,7 +2045,7 @@ ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2056,7 @@ if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING -extern int jbd2_journal_enable_debug; +extern u16 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ do {\ - To unsubscribe from this list: send the line
[EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
This patch is a spinoff of the old nanosecond patches. It includes some cleanups and addition of a creation timestamp. The EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with s_{min, want}_extra_isize fields in struct ext3_super_block. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/ialloc.c === --- linux-2.6.22-rc4.orig/fs/ext4/ialloc.c 2007-06-11 17:22:15.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/ialloc.c 2007-06-11 17:39:05.0 -0700 @@ -563,7 +563,8 @@ inode-i_ino = ino; /* This is the optimal IO size (for stat), not the fs block size */ inode-i_blocks = 0; - inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME_SEC; + inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime = + ext4_current_time(inode); memset(ei-i_data, 0, sizeof(ei-i_data)); ei-i_dir_start_lookup = 0; @@ -595,9 +596,8 @@ spin_unlock(sbi-s_next_gen_lock); ei-i_state = EXT4_STATE_NEW; - ei-i_extra_isize = - (EXT4_INODE_SIZE(inode-i_sb) EXT4_GOOD_OLD_INODE_SIZE) ? - sizeof(struct ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE : 0; + + ei-i_extra_isize = EXT4_SB(sb)-s_want_extra_isize; ret = inode; if(DQUOT_ALLOC_INODE(inode)) { Index: linux-2.6.22-rc4/fs/ext4/inode.c === --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-11 17:24:28.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/inode.c2007-06-11 17:39:05.0 -0700 @@ -726,7 +726,7 @@ /* We are done with atomic stuff, now do the rest of housekeeping */ - inode-i_ctime = CURRENT_TIME_SEC; + inode-i_ctime = ext4_current_time(inode); ext4_mark_inode_dirty(handle, inode); /* had we spliced it onto indirect block? */ @@ -2375,7 +2375,7 @@ ext4_discard_reservation(inode); mutex_unlock(ei-truncate_mutex); - inode-i_mtime = inode-i_ctime = CURRENT_TIME_SEC; + inode-i_mtime = inode-i_ctime = ext4_current_time(inode); ext4_mark_inode_dirty(handle, inode); /* @@ -2629,10 +2629,6 @@ } inode-i_nlink = le16_to_cpu(raw_inode-i_links_count); inode-i_size = le32_to_cpu(raw_inode-i_size); - inode-i_atime.tv_sec = (signed)le32_to_cpu(raw_inode-i_atime); - inode-i_ctime.tv_sec = (signed)le32_to_cpu(raw_inode-i_ctime); - inode-i_mtime.tv_sec = (signed)le32_to_cpu(raw_inode-i_mtime); - inode-i_atime.tv_nsec = inode-i_ctime.tv_nsec = inode-i_mtime.tv_nsec = 0; ei-i_state = 0; ei-i_dir_start_lookup = 0; @@ -2708,6 +2704,11 @@ } else ei-i_extra_isize = 0; + EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode); + EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); + if (S_ISREG(inode-i_mode)) { inode-i_op = ext4_file_inode_operations; inode-i_fop = ext4_file_operations; @@ -2789,9 +2790,12 @@ } raw_inode-i_links_count = cpu_to_le16(inode-i_nlink); raw_inode-i_size = cpu_to_le32(ei-i_disksize); - raw_inode-i_atime = cpu_to_le32(inode-i_atime.tv_sec); - raw_inode-i_ctime = cpu_to_le32(inode-i_ctime.tv_sec); - raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec); + + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode); + raw_inode-i_blocks = cpu_to_le32(inode-i_blocks); raw_inode-i_dtime = cpu_to_le32(ei-i_dtime); raw_inode-i_flags = cpu_to_le32(ei-i_flags); Index: linux-2.6.22-rc4/fs/ext4/ioctl.c === --- linux-2.6.22-rc4.orig/fs/ext4/ioctl.c 2007-06-11 17:25:11.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/ioctl.c2007-06-11 17:39:05.0 -0700 @@ -97,7 +97,7 @@ ei-i_flags = flags; ext4_set_inode_flags(inode); - inode-i_ctime = CURRENT_TIME_SEC; + inode-i_ctime = ext4_current_time(inode); err = ext4_mark_iloc_dirty(handle, inode, iloc); flags_err: @@ -134,7 +134,7 @@ return PTR_ERR(handle); err = ext4_reserve_inode_write(handle, inode, iloc); if (err == 0) { - inode-i_ctime = CURRENT_TIME_SEC; + inode-i_ctime
[EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Index: linux-2.6.21/include/linux/fs.h === --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -549,7 +549,7 @@ struct inode { uid_t i_uid; gid_t i_gid; dev_t i_rdev; - unsigned long i_version; + u64 i_version; loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
This patch is on top of the nanosecond timestamp and i_version_hi patches. We need to make sure that existing filesystems can also avail the new fields that have been added to the inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question about whether users should be able to set s_*_extra_isize smaller than the known fields or not. This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if its fails we try to expand by s_min_extra_isize bytes. This is done by changing the i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs then 1 or more EAs are shifted to the external EA block. In the worst case when even the external EA block does not have enough space we inform the user that some EA would need to be deleted or s_min_extra_isize would have to be reduced. This would be online expansion of inodes. I am also working on adding an expand_inodes option to e2fsck which will expand all the used inodes. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/inode.c === --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-14 17:32:27.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/inode.c2007-06-14 17:32:41.0 -0700 @@ -3120,6 +3120,40 @@ ext4_reserve_inode_write(handle_t *handl } /* + * Expand an inode by new_extra_isize bytes. + * Returns 0 on success or negative error number on failure. + */ +int ext4_expand_extra_isize(struct inode *inode, unsigned int new_extra_isize, + struct ext4_iloc iloc, handle_t *handle) +{ + struct ext4_inode *raw_inode; + struct ext4_xattr_ibody_header *header; + struct ext4_xattr_entry *entry; + + if (EXT4_I(inode)-i_extra_isize = new_extra_isize) { + return 0; + } + + raw_inode = ext4_raw_inode(iloc); + + header = IHDR(inode, raw_inode); + entry = IFIRST(header); + + /* No extended attributes present */ + if (!(EXT4_I(inode)-i_state EXT4_STATE_XATTR) || + header-h_magic != cpu_to_le32(EXT4_XATTR_MAGIC)) { + memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0, + new_extra_isize); + EXT4_I(inode)-i_extra_isize = new_extra_isize; + return 0; + } + + /* try to expand with EA present */ + return ext4_expand_extra_isize_ea(inode, new_extra_isize, + raw_inode, handle); +} + +/* * What we do here is to mark the in-core inode as clean with respect to inode * dirtiness (it may still be data-dirty). * This means that the in-core inode may be reaped by prune_icache @@ -3143,10 +3177,33 @@ ext4_reserve_inode_write(handle_t *handl int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode) { struct ext4_iloc iloc; - int err; + int err, ret; + static int expand_message; might_sleep(); err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block +* with this same handle */ + if ((jbd2_journal_extend(handle, +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete +some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } if (!err) err = ext4_mark_iloc_dirty(handle, inode, iloc); return err; Index: linux-2.6.22-rc4/include/linux/ext4_fs.h
[EXT4 set 8][PATCH 1/1]Add journal checksums
Journal checksum feature has been added to detect corruption of journal. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c --- linux024/fs/ext4/super.c2007-06-25 16:19:24.0 -0500 +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500 @@ -721,6 +721,7 @@ enum { Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, + Opt_journal_checksum, Opt_journal_async_commit, Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, @@ -760,6 +761,8 @@ static match_table_t tokens = { {Opt_journal_update, journal=update}, {Opt_journal_inum, journal=%u}, {Opt_journal_dev, journal_dev=%u}, + {Opt_journal_checksum, journal_checksum}, + {Opt_journal_async_commit, journal_async_commit}, {Opt_abort, abort}, {Opt_data_journal, data=journal}, {Opt_data_ordered, data=ordered}, @@ -948,6 +951,13 @@ static int parse_options (char *options, return 0; *journal_devnum = option; break; + case Opt_journal_checksum: + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; + case Opt_journal_async_commit: + set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT); + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM); + break; case Opt_noload: set_opt (sbi-s_mount_opt, NOLOAD); break; @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super goto failed_mount4; } + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { + jbd2_journal_set_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); + jbd2_journal_clear_features(sbi-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } else { + jbd2_journal_clear_features(sbi-s_journal, + JBD2_FEATURE_COMPAT_CHECKSUM, 0, + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); + } + /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500 +++ linux/fs/jbd2/commit.c 2007-06-26 08:40:03.0 -0500 @@ -21,6 +21,7 @@ #include linux/mm.h #include linux/pagemap.h #include linux/jiffies.h +#include linux/crc32.h /* * Default IO end handler for temporary BJ_IO buffer_heads. @@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour return 1; } -/* Done it all: now write the commit record. We should have +/* + * Done it all: now submit the commit record. We should have * cleaned up our previous buffers by now, so if we are in abort * mode we can now just skip the rest of the journal write * entirely. * * Returns 1 if the journal needs to be aborted or 0 on success */ -static int journal_write_commit_record(journal_t *journal, - transaction_t *commit_transaction) +static int journal_submit_commit_record(journal_t *journal, + transaction_t *commit_transaction, + struct buffer_head **cbh, + __u32 crc32_sum) { struct journal_head *descriptor; struct buffer_head *bh; @@ -117,21 +121,36 @@ static int journal_write_commit_record(j bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + struct commit_header *tmp = + (struct commit_header *)(bh-b_data + i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + if (JBD2_HAS_COMPAT_FEATURE(journal, +
[EXT4 set 9][PATCH 1/5]Morecleanups:ext4-zero_user_page
Use zero_user_page() in ext4 where possible. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: linux-2.6.22-rc4-mm2/fs/ext4/inode.c === --- linux-2.6.22-rc4-mm2.orig/fs/ext4/inode.c +++ linux-2.6.22-rc4-mm2/fs/ext4/inode.c @@ -1830,7 +1830,6 @@ int ext4_block_truncate_page(handle_t *h struct inode *inode = mapping-host; struct buffer_head *bh; int err = 0; - void *kaddr; if ((EXT4_I(inode)-i_flags EXT4_EXTENTS_FL) test_opt(inode-i_sb, EXTENTS) @@ -1847,10 +1846,7 @@ int ext4_block_truncate_page(handle_t *h */ if (!page_has_buffers(page) test_opt(inode-i_sb, NOBH) ext4_should_writeback_data(inode) PageUptodate(page)) { - kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, length); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); + zero_user_page(page, offset, length, KM_USER0); set_page_dirty(page); goto unlock; } @@ -1903,10 +1899,7 @@ int ext4_block_truncate_page(handle_t *h goto unlock; } - kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, length); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); + zero_user_page(page, offset, length, KM_USER0); BUFFER_TRACE(bh, zeroed end of block); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 9][PATCH 3/5]Morecleanups:ext4-remove-extra-is_rdonly-check
Subject: ext4: remove extra IS_RDONLY() check From: Dave Hansen [EMAIL PROTECTED] ext4_change_inode_journal_flag() is only called from one location: ext4_ioctl(EXT3_IOC_SETFLAGS). That ioctl case already has a IS_RDONLY() call in it so this one is superfluous. Signed-off-by: Dave Hansen [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/inode.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext4/inode.c~ext4-remove-extra-is_rdonly-check fs/ext4/inode.c --- a/fs/ext4/inode.c~ext4-remove-extra-is_rdonly-check +++ a/fs/ext4/inode.c @@ -3352,7 +3352,7 @@ int ext4_change_inode_journal_flag(struc */ journal = EXT4_JOURNAL(inode); - if (is_journal_aborted(journal) || IS_RDONLY(inode)) + if (is_journal_aborted(journal)) return -EROFS; jbd2_journal_lock_updates(journal); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent compilation fixes Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions. Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] Acked-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/extents.c | 18 ++ 1 files changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 6f72dcb..12fe3d7 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -382,13 +382,14 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc r = m - 1; else l = m + 1; - ext_debug(%p(%u):%p(%u):%p(%u) , l, l-ei_block, - m, m-ei_block, r, r-ei_block); + ext_debug(%p(%u):%p(%u):%p(%u) , l, le32_to_cpu(l-ei_block), + m, le32_to_cpu(m-ei_block), + r, le32_to_cpu(r-ei_block)); } path-p_idx = l - 1; ext_debug( - %d-%lld , le32_to_cpu(path-p_idx-ei_block), - idx_block(path-p_idx)); + idx_pblock(path-p_idx)); #ifdef CHECK_BINSEARCH { @@ -447,8 +448,9 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) r = m - 1; else l = m + 1; - ext_debug(%p(%u):%p(%u):%p(%u) , l, l-ee_block, - m, m-ee_block, r, r-ee_block); + ext_debug(%p(%u):%p(%u):%p(%u) , l, le32_to_cpu(l-ee_block), + m, le32_to_cpu(m-ee_block), + r, le32_to_cpu(r-ee_block)); } path-p_ext = l - 1; @@ -580,7 +582,7 @@ static int ext4_ext_insert_index(handle_t *handle, struct inode *inode, if (curp-p_idx != EXT_LAST_INDEX(curp-p_hdr)) { len = (len - 1) * sizeof(struct ext4_extent_idx); len = len 0 ? 0 : len; - ext_debug(insert new index %d after: %d. + ext_debug(insert new index %d after: %llu. move %d from 0x%p to 0x%p\n, logical, ptr, len, (curp-p_idx + 1), (curp-p_idx + 2)); @@ -591,7 +593,7 @@ static int ext4_ext_insert_index(handle_t *handle, struct inode *inode, /* insert before */ len = len * sizeof(struct ext4_extent_idx); len = len 0 ? 0 : len; - ext_debug(insert new index %d before: %d. + ext_debug(insert new index %d before: %llu. move %d from 0x%p to 0x%p\n, logical, ptr, len, curp-p_idx, (curp-p_idx + 1)); @@ -791,7 +793,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode, BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) != EXT_LAST_INDEX(path[i].p_hdr)); while (path[i].p_idx = EXT_MAX_INDEX(path[i].p_hdr)) { - ext_debug(%d: move %d:%d in new index %llu\n, i, + ext_debug(%d: move %d:%llu in new index %llu\n, i, le32_to_cpu(path[i].p_idx-ei_block), idx_pblock(path[i].p_idx), newblock); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[EXT4 set 9][PATCH 5/5]Extent micro cleanups
From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent - make ext4_ext_grow_indepth() indexes/leaf correct Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] Acked-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 12fe3d7..1fd00ac 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc ext_debug(binsearch for %d(idx): , block); l = EXT_FIRST_INDEX(eh) + 1; - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_INDEX(eh); while (l = r) { m = l + (r - l) / 2; if (block le32_to_cpu(m-ei_block)) @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) ext_debug(binsearch for %d: , block); l = EXT_FIRST_EXTENT(eh) + 1; - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_EXTENT(eh); while (l = r) { m = l + (r - l) / 2; @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; ext4_idx_store_pblock(curp-p_idx, newblock); neh = ext_inode_hdr(inode); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, 2007-06-28 at 02:55 -0700, Andrew Morton wrote: Please drop the non-ext4 patches from the ext4 tree and send incremental patches against the (non-ext4) fallocate patches in -mm. The ext4 fallocate() patches are dependent on the core fallocate() patches, so ext4 patch-queue and git tree won't compile (it's not based on mm tree) without the core changes. We can send ext4 fallocate patches (incremental patches against mm tree) and drop the full fallocate patches(ext4 and non ext4 part) from ext4 patch queue if you prefer this way. And try to get the code finished? Time is pressing. I looked at the mm tree, there are other ext4 features/changes that are currently in ext4-patch-queue(not ext4 git tree) that not in part of ext4 series yet. Ted, can you merge those patches to your git tree? Thanks! Thanks for your patience. Mingming. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote: On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote: On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: Hi, This is an update of the i_version patch. The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). My understanding (please correct me if I'm wrong) is that the requirements are much more rigourous than simply incrementing an in memory counter on every change. i.e. the this counter has to survive server crashes intact so clients never see the counter go backwards. That means version number changes need to be journalled along with the operation that caused the change of the version number. Yeah, the i_version is the in memeory counter. From the patch it looks like the counter is being updated inside ext4_mark_iloc_dirty(), so it is being journalled and being flush to on-disk ext4 inode structure immediately (On-disk ext4 inode structure is being modified/expanded to store the counter in the first patch). Ok, that catches most things (I missed that), but the version number still needs to change on file data changes, right? So if we are overwriting the file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't get the callout and so the version number doesn't change or get logged. In that case, the version number is not doing what it needs to do, right? Hmm, maybe I missed something... but looking at the code again, in the case of overwrite (file date updated),it seems the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()-.. -__generic_file_aio_write_nolock() -file_update_time() -mark_inode_dirty_sync() -__mark_inode_dirty(I_DIRTY_SYNC) -ext4_dirty_inode() -ext4_mark_inode_dirty() Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: Hi, This is an update of the i_version patch. The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). My understanding (please correct me if I'm wrong) is that the requirements are much more rigourous than simply incrementing an in memory counter on every change. i.e. the this counter has to survive server crashes intact so clients never see the counter go backwards. That means version number changes need to be journalled along with the operation that caused the change of the version number. Yeah, the i_version is the in memeory counter. From the patch it looks like the counter is being updated inside ext4_mark_iloc_dirty(), so it is being journalled and being flush to on-disk ext4 inode structure immediately (On-disk ext4 inode structure is being modified/expanded to store the counter in the first patch). This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c 2007-05-25 18:05:40.0 +0200 @@ -565,6 +565,7 @@ inode-i_blocks = 0; inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime = ext4_current_time(inode); + inode-i_version = 1; memset(ei-i_data, 0, sizeof(ei-i_data)); ei-i_dir_start_lookup = 0; Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c 2007-05-25 18:05:40.0 +0200 @@ -3082,6 +3082,7 @@ { int err = 0; + inode-i_version++; /* the do_update_inode consumes one bh-b_count */ get_bh(iloc-bh); Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c 2007-05-25 18:05:40.0 +0200 @@ -2839,8 +2839,8 @@ i_size_write(inode, off+len-towrite); EXT4_I(inode)-i_disksize = inode-i_size; } - inode-i_version++; inode-i_mtime = inode-i_ctime = CURRENT_TIME; + inode-i_version = 1; ext4_mark_inode_dirty(handle, inode); mutex_unlock(inode-i_mutex); return len - towrite; I am not very clear about this part -- what is i_version being used for with quota? And shall we reset the counter to 1 for ext4_quota_write()? Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/2] i_version update - ext4 part
On Tue, 2007-05-29 at 16:58 -0600, Andreas Dilger wrote: On May 29, 2007 12:44 -0700, Mingming Cao wrote: I am a little bit confused about the two patches. It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a new 64 bit i_fs_version field is added to ext4 inode structure for inode versioning support. read/store of this counter are properly handled, but missing the inode versioning counter update. For the Lustre use of the inode version we don't care about the VFS changes to i_version. In fact - we want to be able to control the changes to inode version ourselves so that e.g. file defragmenting or atime updates don't change the inode version, and that recovery can restore the version to a known state along with the rest of the metadata. It's a pity that VFS i_version can't serve Lustre need completely. :( If the unnecessary inode version update is a concern, then, with current implementation (i_version being updated in ext4_mark_inode_dirty()- ext4_mark_iloc_dirty()), the i_version can be increased multiple times during one write() operation (unlike ctime update). I know doing the update in ext4_mark_inode_dirty() (rather than update inode version on every mtime/ctime update) clearly simplified the code changes. So I am not sure if the increased update is a concern or not. In any case, does the VFS inode version get update when atime updates? That said, since Lustre isn't in the kernel and we patch our version of ext3 anyways it doesn't really matter what is done for NFS. We will just patch in our own behaviour if the final ext4 code isn't suitable in all of the details. Having 99% of the code the same at least makes this a lot less work. But later in the second patch by Jean Noel, he re-used the VFS inode- i_version for ext4 inode versioning, the counter is being updated every time the file is being changed. I don't know what the NFS requirements for the version are. There may also be some complaints from others if the i_version is 64 bits because this contributes to generic inode growth and isn't used for other filesystems. That should benefit for other filesystems, as I thought this NFS requirements apply to all filesystems. To me, i_fs_version and inode_version are the same thing, right? Shouldn't we choose one(I assume inode i_version?), and combine these two patch together? How about split the inode versioning part from the ext4_expand_inode_extra_isize patch(it does multiple things, and i_versioning doesn't longs there) and put it together with the rest of i_version update patches? I don't have an objection to that, but I don't think it is required. BTW, how could NFS/user space to access the inode version counter? If the Bull patch uses i_version then knfsd can just access it directly. I don't think there is any API to access it from userspace. One option is to add a virtual EA like user.inode_version and have the kernel fill this in from i_version. Lustre will manipulate the ei-i_fs_version directly. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/2] i_version update - ext4 part
On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: The patch is on top of the ext4 tree: http://repo.or.cz/w/ext4-patch-queue.git In this part, the i_version counter is stored into 2 32bit fields of the ext4_inode structure osd1.linux1.l_i_version and i_version_hi. I included the ext4_expand_inode_extra_isize patch, which does part of the job, checking if there is enough room for extra fields in the inode (i_version_hi). The other patch increments the counter on inode modifications and set it on inode creation. plain text document attachment (i_version_update_ext4) This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. I am a little bit confused about the two patches. It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a new 64 bit i_fs_version field is added to ext4 inode structure for inode versioning support. read/store of this counter are properly handled, but missing the inode versioning counter update. But later in the second patch by Jean Noel, he re-used the VFS inode- i_version for ext4 inode versioning, the counter is being updated every time the file is being changed. To me, i_fs_version and inode_version are the same thing, right? Shouldn't we choose one(I assume inode i_version?), and combine these two patch together? How about split the inode versioning part from the ext4_expand_inode_extra_isize patch(it does multiple things, and i_versioning doesn't longs there) and put it together with the rest of i_version update patches? BTW, how could NFS/user space to access the inode version counter? Thanks, Mingming Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c 2007-05-25 18:05:40.0 +0200 @@ -565,6 +565,7 @@ inode-i_blocks = 0; inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime = ext4_current_time(inode); + inode-i_version = 1; memset(ei-i_data, 0, sizeof(ei-i_data)); ei-i_dir_start_lookup = 0; Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c 2007-05-25 18:05:40.0 +0200 @@ -3082,6 +3082,7 @@ { int err = 0; + inode-i_version++; /* the do_update_inode consumes one bh-b_count */ get_bh(iloc-bh); Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c === --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c 2007-05-25 18:05:28.0 +0200 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c 2007-05-25 18:05:40.0 +0200 @@ -2839,8 +2839,8 @@ i_size_write(inode, off+len-towrite); EXT4_I(inode)-i_disksize = inode-i_size; } - inode-i_version++; inode-i_mtime = inode-i_ctime = CURRENT_TIME; + inode-i_version = 1; ext4_mark_inode_dirty(handle, inode); mutex_unlock(inode-i_mutex); return len - towrite; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE4] fallocate system call
On Fri, 2007-05-18 at 23:44 -0700, Andrew Morton wrote: On Thu, 17 May 2007 19:41:15 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. I merged the first three patches into -mm, thanks. All the system call numbers got changed due to recent additions. They may change in the future, too - nothing is stable until the code lands in mainline. In case you haven't realize it, the ia64 fallocate() patch comes with Amit's takes 4 fallocate patch series (3/6) missing one line change, thus fail to compile on ia64. Here is the updated one. Patch tested on ia64. (compile and fsx) fallocate() on ia64 ia64 fallocate syscall support. Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- arch/ia64/kernel/entry.S |1 + include/asm-ia64/unistd.h |3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S === --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-18 16:30:16.0 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-18 16:32:45.0 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h === --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-18 16:30:16.0 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-18 17:34:58.0 -0700 @@ -296,11 +296,12 @@ #define __NR_getcpu1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ -#define NR_syscalls283 /* length of syscall table */ +#define NR_syscalls285 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION #define __ARCH_WANT_SYS_RT_SIGSUSPEND I didn't merge any of the ext4 changes as they appear to be in Ted's devel tree. Although I didn't check that they are 100% the same in that tree. Since both Amit and Ted are traveling, I will jump in... Most likely it's not the same one. What in Ted's devel tree is takes 2 patches. I have incorporated takes 4 patches in the backing ext4 patch git tree here: http://repo.or.cz/w/ext4-patch-queue.git I have tested these patch series on ia64,ppc64,x86 and x86_64. I am not sure if Ted got a chance to update his ext4 git tree from this patch queue git tree yet. What's the plan to get some ext4 updates into mainline, btw? Things seem to be rather gradual. Last time Ted and I discussed we all agree fallocate patches should go into mainline. Actually most patches marked before the unstable patches can get into mainline, especially the following patches (contains a few bug fixes patches) # New patch to fix whitespace before applying new patches whitespace.patch #New patch to remove unnecessary exported symbols ext4_remove_exported_symbles.patch # New patch to add mount option to turn off extents ext4_noextent_mount_opt.patch # Now Turn on extents feature by default ext4_extents_on_by_default.patch #New patch to propagate inode flags ext4-propagate_flags.patch #New patch to add extent sanity checks ext4-extent-sanity-checks.patch #New patch to free blocks when failed to insert an extent ext4-free-blocks-on-insert-extent-failure.patch We already missed rc-1 window, but if possible, I would like to see ext4 fallocate patches and above patches in mainline 2.6.22. The nanosecond timestamp patch is probably good to go also. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5][TAKE3] fallocate system call
On Wed, 2007-05-16 at 01:07 +0530, Amit K. Arora wrote: ToDos: - 1 Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64. Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64 From: David Chinner [EMAIL PROTECTED] Subject: [PATCH] ia64 fallocate syscall Cc: Amit K. Arora [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] ia64 fallocate syscall support. Signed-Off-By: Dave Chinner [EMAIL PROTECTED] --- arch/ia64/kernel/entry.S |1 + include/asm-ia64/unistd.h |3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S === --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-12 18:45:56.0 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-15 15:36:48.0 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h === --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.0 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-15 15:37:51.0 -0700 @@ -296,6 +296,7 @@ #define __NR_getcpu1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote: I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. If we chose not to update the file size beyong EOF, then for filesystem without fallocate() support (ext2,3 currently), posix_fallocate() will follow the hard way(zero-out) to do preallocation. Then we will get different behavior on filesystems w/o fallocate() support. It make sense to be consistent, IMO. My point of view, preallocation is just a efficient way to allocating blocks for files without zero-out, other than this, the new behavior should be consistent with the old way: file size update,mtime/ctime, ENOSPC etc. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote: On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. Think it again, this check is useful when preallocate blocks at EOF. It's not much useful is preallocating a range with holes. In that case 2GB space might be enough if the application tries to preallocate a 10GB. But it won't help if there are multiple processes trying to allocate blocks the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case. On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. Actually, the kernel could do it, in that could simply release all unitialized extents back to the system. The problem is distinguishing between the unitialized extents that had just been newly added, versus the ones that had there from before. True, the new uninitialized extents can be merged to the near old uninitialized extents, there is no way to distinguish the just added unintialized extents from the merged one. (On the other hand, if the filesystem was completely full, releasing unitialized blocks wouldn't be the worse thing in the world to do, although releasing previously fallocated blocks probably does violate the princple of least surprise, even if it's what the user would have wanted.) On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: If there is enough free space, we could make a reservation window that have at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes? You are right. One reservation window per file and there is limit to the maximum window size). So yeah this way it's not going to prevent ENOSPC for sure:( Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) +- block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() succeeded and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } I agree. There is already a loop in Amit's current's patch to call ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to ask for in each journal_start()? +/* + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. + */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunk of blocks(extents) for this single preallocation request when fs is fragmented (or part of preallocation request is already fulfilled) I think we should move this calculation inside the loop as well,and we really do not need to grab the lock to calculate the credit if the @path is always NULL, all the function does is mathmatics. I can't think of any good way to estimate the total credits needed for this whole preallocation request. Looked at ext4_get_block(), which is used for DIO code to deal with large amount of block allocation. The credit reservation is quite weak there too. The DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32) + handle=ext4_journal_start(inode, credits + + EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1); + if (IS_ERR(handle)) + return PTR_ERR(handle); +retry: + ret = 0; + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. Does the proposed implementation handle quotas
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously-bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even-better in-kernel. In this case, since the number of blocks to preallocate (eg. N=10GB) is clear, we could improve the current reservation code, to allow callers explicitly ask for a new window that have the minimum N free blocks for the blocks-to-preallocated(rather than just have at least 1 free blocks). Before the ext4_fallocate() is called, the right reservation window size is set with the flag to indicating please spend time if needed to find a window covers at least N free blocks. So for ex4 block mapped files, later when glibc is doing allocation and zeroing, the ext4 block-mapped allocator will knows to reserve the right amount of free blocks before allocating and zeroing 10GB space. I am not sure whether this worth the effort though. But then, a smart application would bypass the glibc() fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). This wlil result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. If we think applications programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced what do folks think? An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao [EMAIL PROTECTED] wrote: + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. I agree your point, that's why I mention it only helped the fragmentation issue but not the ENOSPC case. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Not totally useless I think. If only half of the space is preallocated because run out of space, the application can decide whether it's good enough to start to use this preallocated space or wait for the fs to have more free space. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. In fact Amit and I had raised this issue before, whether it's okay to do allow partial preallocation. At that moment the feedback is it's no much different than the current zero-out-preallocation behavior: people might preallocating half-way then later deal with ENOSPC. We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that have at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. The fs free blocks account is just a estimate though. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
Jan Kara wrote: On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott [EMAIL PROTECTED] wrote: On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); ... I'd agree with Eric on the command flag extension. Seems like a separate syscall would be better, command sounds a bit ioctl like, especially if that command is passed into the filesystems.. madvise, fadvise, lseek, etc seem to work OK. I get repeatedly traumatised by patch rejects whenever a new syscall gets added, so I'm biased. The advantage of a command flag is that we can add new modes in the future without causing lots of churn, waiting for arch maintainers to catch up, potentially adding new compat code, etc. Rename it to mode? ;) I am wondering if it is useful to add another mode to advise block allocation policy? Something like indicating which physical block/block group to allocate from (goal), and whether ask for strict contigous blocks. This will help preallocation or reservation to choose the right blocks for the file. Yes, I also think this would be useful so you can guide preallocation for things like defragmentation (e.g. preallocate space for the file being defragmented and move the file to it). Honza Yep, I think it makes sense to use preallocation for defragmentation. After all both preallocation and defragmentation shall call underlying filesystem multiple block allocator to try to allocate a chunk of contiguous blocks on disk. ext4 online defrag implementation by Takashi already support to choose a goal allocation block to guide the ext4 block allocator to place the defraged file is a specific location. Passing a little bit more hint to sys_fallocate() (i.e, goal block, and/or whether the goal block is important over the size of prealloc extent), might make it more useful for the orginial goal (get contigous and guranteed blocks) and for defragmentation. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
Dave Kleikamp wrote: On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote: On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp [EMAIL PROTECTED] wrote: On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + file = fget(fd); + if (!file) + goto out; + inode = file-f_path.dentry-d_inode; + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, offset, len); + else + ret = -ENOTTY; + fput(file); +out: +return ret; +} ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better... Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation? Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. Given that glibc already implements fallocate for all filesystems, it will need to continue to do so for filesystems which don't implement this syscall - otherwise applications would start breaking. I didn't make it clear, but my point was to call generic_fallocate if the file system did not define i_op-allocate(). if (inode-i_op inode-i_op-fallocate) ret = inode-i_op-fallocate(inode, offset, len); else ret = generic_fallocate(inode, offset, len); I'm not sure it's worth the effort, but I thought I'd throw the idea out there. I think this is useful. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: a question regarding Ext3 file truncate
On Wed, 2007-02-07 at 12:16 -0500, Xin Zhao wrote: Hi, Please forgive me if the question is dumb. I am modifying ext3 to add some new features, but was confused by the implementation of ext3_truncate(). In ext3_truncate(): we first use n = ext3_block_to_path(inode, last_block, offsets, NULL); to get the path of the last block. If the number of blocks is smaller than 12, all blocks are then direct blocks. We then need to clear them. But the interesting thing happens: if (n == 1) /* direct blocks */ { ext3_free_data(handle, inode, NULL, i_data+offsets[0], i_data + EXT3_NDIR_BLOCKS); goto do_indirects; } This code seems to free data blocks right after the blocks used by the file. I think it should be ext3_free_data(handle, inode, NULL, i_data, i_data+offsets[0]); Last_block is the last logical block after the truncate, so ext3_truncate() free data blocks after this point. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH]Add memory barrier before clear bit in unlock_buffer()
We are runnin SDET benchmark and saw double free issue for ext3 extended attributes block, which complains the same xattr block already being freed (in ext3_xattr_release_block()). The problem could also been triggered by multiple threads loop untar/rm a kernel tree. The race is caused by missing a memory barrier at unlock_buffer() before the lock bit being cleared, resulting in possible concurrent h_refcounter update. That causes a reference counter leak, then later leads to the double free that we have seen. Inside unlock_buffer(), there is a memory barrier is placed *after* the lock bit is being cleared, however, there is no memory barrier *before* the bit is cleared. On some arch the h_refcount update instruction and the clear bit instruction could be reordered, thus leave the critical section re-entered. The race is like this: For example, if the h_refcount is initialized as 1, cpu 0: cpu1 -- --- lock_buffer() /* test_and_set_bit */ clear_buffer_locked(bh); lock_buffer() /* test_and_set_bit */ h_refcount = h_refcount+1; /* = 2*/ h_refcount = h_refcount + 1; /*= 2 */ clear_buffer_locked(bh); .. We lost a h_refcount here. We need a memory barrier before the buffer head lock bit being cleared to force the order of the two writes. Please apply. Signed-Off-By: Mingming Cao [EMAIL PROTECTED] --- linux/fs/buffer.c.orig 2007-02-04 11:37:50.0 -0600 +++ linux/fs/buffer.c 2007-02-04 11:38:14.0 -0600 @@ -77,6 +77,7 @@ void fastcall unlock_buffer(struct buffer_head *bh) { + smp_mb__before_clear_bit(); clear_buffer_locked(bh); smp_mb__after_clear_bit(); wake_up_bit(bh-b_state, BH_Lock); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: a question about ext4_fsblk_t
On Sun, 2006-10-22 at 16:50 +0800, guomingyang wrote: I also find many places where the block number type is not changed in namei.c and dir.c. Why? They are all file logical blocks, ext4_fsblk_t is for on disk blocks. Takashi Sato has a patch to define all file logical blocks as ext3_fileblk_t, it did not make to mainline before ext4 was forked. Probably we should bring the to ext3/4 to clarify the confusions. Mingming Thank you to anyone who offer help Hello everyone: I am a student interesting in linux filesystem, I have a problem about the replace of block number type to ext4_fsblk_t. Is it complete? Or has it passed all the test? Because I have found a place like this in linux-2.6.19-rc2/fs/ext4/inode.c +struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode, + int block, int create, int *err) the block's type is not changed. Although I think it will bring no mistake, I doubt about why not change it. Thank you ! guomingyang - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html guomingyang - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3
On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote: On Jul 17, 2005 10:40 -0700, Mingming Cao wrote: @@ -373,6 +373,7 @@ struct ext3_inode { #define EXT3_MOUNT_BARRIER 0x2 /* Use block barriers */ #define EXT3_MOUNT_NOBH0x4 /* No bufferheads */ #define EXT3_MOUNT_QUOTA 0x8 /* Some quota option set */ + #define EXT3_MOUNT_DELAYED_ALLOC 0xC /* Delayed Allocation */ This doesn't make sense. DELAYED_ALLOC == QUOTA | NOBH? Ah...:-) Badari used 0x8 for DELAYED_ALLOC in the previous patch (2.6.11 based).When moving those patches forward to 2.6.13-rc3, we found the conflict with QUOTA, and obviously picked up a wrong value. + {Opt_delayed_alloc, delalloc}, Is this a replacement for Alex's delalloc code? We also use delalloc for that code and if they are not interchangeable it will cause confusion about which one is in use. Okey, will think a new name for this feature to avoid confusion. Alex's delalloc is bond to extent tree structure so it's hard to be adopted directly to the current ext3 layout, so, I'd say this work done by Badari(inspired by Alex's work) is a different implementation. + if (test_opt(sb, DELAYED_ALLOC)) { + if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) { + printk(KERN_WARNING EXT3-fs: Ignoring delall option - + its supported only with writeback mode\n); Should be ignoring delalloc option. Fixed. Thanks for looking at this. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3
Hi All, Here are the updated patches to support multiple block allocation and delayed allocation for ext3 done by me, Badari and Suparna. [PATCH 1/4] -- multiple block allocation for current ext3. (ext3_get_blocks()). [PATCH 2/4] -- adding delayed allocation for writeback mode [PATCH 3/4] -- generic support for cluster pages together in mapge_writepages() to make use of getblocks() [PATCH 4/4] -- support multiple block allocation for ext3 writeback mode through writepages(). Have done initial testing on dbench and tiobench on a 2.6.11 version of this patch set. Dbench 8 thread throughput result is increased by 20% with this patch set. dbench comparison: (ext3-dm represents ext3+thispatchset) http://www.sudhaa.com/~ram/ols2005presentation/dbench.jpg tiobench comparison: http://www.sudhaa.com/~ram/ols2005presentation/tio_seq_write.jpg Todo: - bmap() support for delayed allocation - page reserve flag to indicate the delayed allocation - ordered mode support for delayed allocation - bh support to enable blocksize = 1k/2k filesystems Cheers, Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] [PATCH 4/4]add ext3 writeback writpages
support multiple block allocation for ext3 writeback mode through writepages(). --- linux-2.6.12-ming/fs/ext3/inode.c | 131 linux-2.6.12-ming/fs/mpage.c|8 + linux-2.6.12-ming/include/linux/mpage.h | 17 3 files changed, 153 insertions(+), 3 deletions(-) diff -puN fs/ext3/inode.c~writepages fs/ext3/inode.c --- linux-2.6.12/fs/ext3/inode.c~writepages 2005-07-17 17:11:43.239274864 -0700 +++ linux-2.6.12-ming/fs/ext3/inode.c 2005-07-17 17:11:43.259271824 -0700 @@ -36,6 +36,7 @@ #include linux/writeback.h #include linux/mpage.h #include linux/uio.h +#include linux/pagevec.h #include xattr.h #include acl.h @@ -1719,6 +1720,135 @@ out_fail: return ret; } +static int +ext3_writeback_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping-host; + const unsigned blkbits = inode-i_blkbits; + int err = 0; + int ret = 0; + int done = 0; + unsigned int max_pages_to_cluster = 0; + struct pagevec pvec; + int nr_pages; + pgoff_t index; + pgoff_t end = -1; /* Inclusive */ + int scanned = 0; + int is_range = 0; + struct page *page; + struct mpageio mio = { + .bio = NULL + }; + + pagevec_init(pvec, 0); + if (wbc-sync_mode == WB_SYNC_NONE) { + index = mapping-writeback_index; /* Start from prev offset */ + } else { + index = 0;/* whole-file sweep */ + scanned = 1; + } + if (wbc-start || wbc-end) { + index = wbc-start PAGE_CACHE_SHIFT; + end = wbc-end PAGE_CACHE_SHIFT; + is_range = 1; + scanned = 1; + } + max_pages_to_cluster = min(EXT3_MAX_TRANS_DATA, (pgoff_t)PAGEVEC_SIZE); + +retry: + down_read(inode-i_alloc_sem); + while (!done (index = end) + (nr_pages = pagevec_contig_lookup_tag(pvec, mapping, + index, PAGECACHE_TAG_DIRTY, + min(end - index, max_pages_to_cluster-1) + 1))) { + unsigned i; + + scanned = 1; + for (i = 0; i nr_pages; i++) { + page = pvec.pages[i]; + + lock_page(page); + + if (unlikely(page-mapping != mapping)) { + unlock_page(page); + break; + } + + if (unlikely(is_range) page-index end) { + unlock_page(page); + break; + } + + if (wbc-sync_mode != WB_SYNC_NONE) + wait_on_page_writeback(page); + + if (PageWriteback(page) || + !clear_page_dirty_for_io(page)) { + unlock_page(page); + break; + } + } + + if (i) { + unsigned j; + handle_t *handle; + + page = pvec.pages[i-1]; + index = page-index + 1; + mio.final_block_in_request = + min(index, end) (PAGE_CACHE_SHIFT - blkbits); + + handle = ext3_journal_start(inode, + i + ext3_writepage_trans_blocks(inode)); + + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + done = 1; + } + for (j = 0; j i; j++) { + page = pvec.pages[j]; + if (!done) { + ret = __mpage_writepage(mio, page, + ext3_writepages_get_blocks, wbc, + ext3_writeback_writepage_helper); + if (ret || (--(wbc-nr_to_write) = 0)) + done = 1; + } else { + redirty_page_for_writepage(wbc, page); + unlock_page(page); + } + } + if (!err mio.bio) + mio.bio = mpage_bio_submit(WRITE, mio.bio); + if (!err) + err = ext3_journal_stop(handle); + if (!ret) { + ret = err; + if (ret) + done = 1; +
[RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages
Updated patch from Suparna for generic support for cluster pages together in mapge_writepages() to make use of getblocks() --- linux-2.6.12-ming/fs/buffer.c | 49 - linux-2.6.12-ming/fs/ext2/inode.c | 15 - linux-2.6.12-ming/fs/ext3/inode.c | 15 + linux-2.6.12-ming/fs/ext3/super.c |3 linux-2.6.12-ming/fs/hfs/inode.c |2 linux-2.6.12-ming/fs/hfsplus/inode.c |2 linux-2.6.12-ming/fs/jfs/inode.c | 24 ++ linux-2.6.12-ming/fs/mpage.c | 214 ++ linux-2.6.12-ming/include/linux/buffer_head.h |4 linux-2.6.12-ming/include/linux/fs.h |2 linux-2.6.12-ming/include/linux/mpage.h | 11 - linux-2.6.12-ming/include/linux/pagemap.h |3 linux-2.6.12-ming/include/linux/pagevec.h |3 linux-2.6.12-ming/include/linux/radix-tree.h | 14 + linux-2.6.12-ming/lib/radix-tree.c| 25 ++- linux-2.6.12-ming/mm/filemap.c|9 - linux-2.6.12-ming/mm/swap.c | 11 + 17 files changed, 270 insertions(+), 136 deletions(-) diff -puN fs/buffer.c~mpage_writepages_getblocks fs/buffer.c --- linux-2.6.12/fs/buffer.c~mpage_writepages_getblocks 2005-07-15 00:11:01.0 -0700 +++ linux-2.6.12-ming/fs/buffer.c 2005-07-15 00:11:01.0 -0700 @@ -2509,53 +2509,10 @@ EXPORT_SYMBOL(nobh_commit_write); * that it tries to operate without attaching bufferheads to * the page. */ -int nobh_writepage(struct page *page, get_block_t *get_block, - struct writeback_control *wbc) +int nobh_writepage(struct page *page, get_blocks_t *get_blocks, + struct writeback_control *wbc, writepage_t bh_writepage_fn) { - struct inode * const inode = page-mapping-host; - loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size PAGE_CACHE_SHIFT; - unsigned offset; - void *kaddr; - int ret; - - /* Is the page fully inside i_size? */ - if (page-index end_index) - goto out; - - /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size (PAGE_CACHE_SIZE-1); - if (page-index = end_index+1 || !offset) { - /* -* The page may have dirty, unmapped buffers. For example, -* they may have been added in ext3_writepage(). Make them -* freeable here, so the page does not leak. -*/ -#if 0 - /* Not really sure about this - do we need this ? */ - if (page-mapping-a_ops-invalidatepage) - page-mapping-a_ops-invalidatepage(page, offset); -#endif - unlock_page(page); - return 0; /* don't care */ - } - - /* -* The page straddles i_size. It must be zeroed out on each and every -* writepage invocation because it may be mmapped. A file is mapped -* in multiples of the page size. For a file that is not a multiple of -* the page size, the remaining memory is zeroed when mapped, and -* writes to that region are not written out to the file. -*/ - kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); -out: - ret = mpage_writepage(page, get_block, wbc); - if (ret == -EAGAIN) - ret = __block_write_full_page(inode, page, get_block, wbc); - return ret; + return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn); } EXPORT_SYMBOL(nobh_writepage); diff -puN fs/ext2/inode.c~mpage_writepages_getblocks fs/ext2/inode.c --- linux-2.6.12/fs/ext2/inode.c~mpage_writepages_getblocks 2005-07-15 00:11:01.0 -0700 +++ linux-2.6.12-ming/fs/ext2/inode.c 2005-07-15 00:11:01.0 -0700 @@ -650,12 +650,6 @@ ext2_nobh_prepare_write(struct file *fil return nobh_prepare_write(page,from,to,ext2_get_block); } -static int ext2_nobh_writepage(struct page *page, - struct writeback_control *wbc) -{ - return nobh_writepage(page, ext2_get_block, wbc); -} - static sector_t ext2_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping,block,ext2_get_block); @@ -673,6 +667,12 @@ ext2_get_blocks(struct inode *inode, sec return ret; } +static int ext2_nobh_writepage(struct page *page, + struct writeback_control *wbc) +{ + return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage); +} + static ssize_t ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) @@ -687,7 +687,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { -
Re: [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3
On Sun, 2005-07-17 at 10:40 -0700, Mingming Cao wrote: Hi All, Here are the updated patches to support multiple block allocation and delayed allocation for ext3 done by me, Badari and Suparna. Patches are against 2.6.13-rc3. Mingming - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] [PATCH 1/4]Multiple block allocation for ext3
Here is the patch support multiple block allocation for ext3. Current ext3 allocates one block at a time, not efficient for large sequential write IO. This patch implements a simply multiple block allocation with current ext3. The basic idea is allocating the 1st block in the existing way, and attempting to allocate the next adjacent blocks on a best effort basis. If contiguous allocation is blocked by an already allocated block, the current number of free blocks are allocated and no futhur search is tried. This implementation makes uses of block reservation. With the knowledge of how many blocks to allocate, the reservation window size is being enlargedaccordin before block allocation to increase the chance to get contiguous blocks. Previous post of this patch with more description could be found here: http://marc.theaimsgroup.com/?l=ext2-develm=111471578328685w=2 --- linux-2.6.12-ming/fs/ext3/balloc.c| 121 +++-- linux-2.6.12-ming/fs/ext3/inode.c | 380 -- linux-2.6.12-ming/fs/ext3/xattr.c |3 linux-2.6.12-ming/include/linux/ext3_fs.h |2 4 files changed, 458 insertions(+), 48 deletions(-) diff -puN fs/ext3/balloc.c~ext3-get-blocks fs/ext3/balloc.c --- linux-2.6.12/fs/ext3/balloc.c~ext3-get-blocks 2005-07-14 21:55:55.110385896 -0700 +++ linux-2.6.12-ming/fs/ext3/balloc.c 2005-07-14 22:40:32.265396472 -0700 @@ -20,6 +20,7 @@ #include linux/quotaops.h #include linux/buffer_head.h +#defineNBS_DEBUG 0 /* * balloc.c contains the blocks allocation and deallocation routines */ @@ -652,9 +653,11 @@ claim_block(spinlock_t *lock, int block, */ static int ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group, - struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv) + struct buffer_head *bitmap_bh, int goal, unsigned long *count, + struct ext3_reserve_window *my_rsv) { int group_first_block, start, end; + unsigned long num = 0; /* we do allocation within the reservation window if we have a window */ if (my_rsv) { @@ -712,8 +715,22 @@ repeat: goto fail_access; goto repeat; } - return goal; + num++; + goal++; + if (NBS_DEBUG) + printk(ext3_new_block: first block allocated:block %d,num %d\n, goal, num); + while (num *count goal end +ext3_test_allocatable(goal, bitmap_bh) +claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) { + num++; + goal++; + } + *count = num; + if (NBS_DEBUG) + printk(ext3_new_block: additional block allocated:block %d,num %d,goal-num %d\n, goal, num, goal-num); + return goal - num; fail_access: + *count = num; return -1; } @@ -998,6 +1015,28 @@ retry: goto retry; } +static void try_to_extend_reservation(struct ext3_reserve_window_node *my_rsv, + struct super_block *sb, int size) +{ + struct ext3_reserve_window_node *next_rsv; + struct rb_node *next; + spinlock_t *rsv_lock = EXT3_SB(sb)-s_rsv_window_lock; + + spin_lock(rsv_lock); + next = rb_next(my_rsv-rsv_node); + + if (!next) + my_rsv-rsv_end += size; + else { + next_rsv = list_entry(next, struct ext3_reserve_window_node, rsv_node); + + if ((next_rsv-rsv_start - my_rsv-rsv_end) size) + my_rsv-rsv_end += size; + else + my_rsv-rsv_end = next_rsv-rsv_start -1 ; + } + spin_unlock(rsv_lock); +} /* * This is the main function used to allocate a new block and its reservation * window. @@ -1023,11 +1062,12 @@ static int ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle, unsigned int group, struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window_node * my_rsv, - int *errp) + unsigned long *count, int *errp) { unsigned long group_first_block; int ret = 0; int fatal; + unsigned long num = *count; *errp = 0; @@ -1050,7 +1090,8 @@ ext3_try_to_allocate_with_rsv(struct sup * or last attempt to allocate a block with reservation turned on failed */ if (my_rsv == NULL ) { - ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); + ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, + count, NULL); goto out; } /* @@ -1080,6 +1121,10 @@ ext3_try_to_allocate_with_rsv(struct sup while (1) { if (rsv_is_empty(my_rsv-rsv_window) || (ret 0) ||
Re: Lazy block allocation and block_prepare_write?
On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote: Badari Pulavarty [EMAIL PROTECTED] writes: On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote: Badari Pulavarty [EMAIL PROTECTED] writes: [...] Yes. Its possible to do what you want to. I am currently working on adding delayed allocation support to ext3. As part of that, We As you most likely already know, Alex Thomas already implemented delayed block allocation for ext3. Yep. I reviewed Alex Thomas patches for delayed allocation. He handled all the cases in his code and did NOT use any mpage* routines to do the work. I was hoping to change the mpage infrastructure to handle these, so that every filesystem doesn't have to do their thing. Just keep in mind that filesystem != ext3. :-) Generic support makes sense only when it is usable by multiple file systems. This is not always possible, e.g., there is no generic block allocator for precisely the same reason: disk space allocation policies are tightly intertwined with the rest of file system internals. This generic support should be useful for ext2 and xfs. From delayed allocation point of view, it should not aware any filesystem specific block allocation policies, and it should not care.:) It just simply gathering all pages that need to map block on disk, and asking the filesystem get_blocks() call back function, which will take care of the filesystem-specific multiple blocks mapping for it. Current get_blocks() function for ext3 is just simply loop calling ext3_get_block(). I am trying to add a real ext3_get_blocks() to reduce the cpu cost, reduce the number of metadata updates and increase the possibility to get contiguous blocks on disk. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html