Re: [RFC][Patch 1/1] Persistent preallocation in ext4
On Tue, 2006-12-12 at 11:53 +0530, Amit K. Arora wrote: > Hi Mingming, > Hi Amit, > On Mon, Dec 11, 2006 at 05:28:15PM -0800, Mingming Cao wrote: > > On Wed, 2006-12-06 at 11:28 +0530, Amit K. Arora wrote: > > > > > @@ -1142,13 +1155,22 @@ > > > /* try to insert block into found extent and return */ > > > if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { > > > ext_debug("append %d block to %d:%d (from %llu)\n", > > > - le16_to_cpu(newext->ee_len), > > > + ext4_ext_get_actual_len(newext), > > > le32_to_cpu(ex->ee_block), > > > - le16_to_cpu(ex->ee_len), ext_pblock(ex)); > > > + ext4_ext_get_actual_len(ex), ext_pblock(ex)); > > > if ((err = ext4_ext_get_access(handle, inode, path + depth))) > > > return err; > > > - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len) > > > - + le16_to_cpu(newext->ee_len)); > > > + > > > + /* ext4_can_extents_be_merged should have checked that either > > > + * both extents are uninitialized, or both aren't. Thus we > > > + * need to check any of them here. > > > + */ > > > + if (ext4_ext_is_uninitialized(ex)) > > > + uninitialized = 1; > > > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > > > + + ext4_ext_get_actual_len(newext)); > > Above line will "remove" the uninitialized bit from "ex", if it was set. > We call ext4_ext_get_actual_len() to get the "actual" lengths of the two > extents, which removes this MSB in ee_len (MSB in ee_len is used to mark > an extent uninitialized). Now, we do this because if lengths of two > uninitialized extents will be added as it is (i.e. without masking out > the MSB in the length), it will result in removing the MSB in ee_len. > For example, 0x8002 + 0x8003 => 0x10005 => 0x5 (since ee_len is 16 bit). > > That is why just before this line, we save the "state" of this extent, > whether it was uninitialized or not. And, we restore this "state" below. Ah...you are right:) More comments below... > Index: linux-2.6.19/fs/ext4/ioctl.c > === > --- linux-2.6.19.orig/fs/ext4/ioctl.c 2006-12-06 10:18:12.0 +0530 > +++ linux-2.6.19/fs/ext4/ioctl.c 2006-12-06 10:18:15.0 +0530 > @@ -248,6 +248,47 @@ > return err; > } > > + case EXT4_IOC_PREALLOCATE: { > + struct ext4_falloc_input input; > + handle_t *handle; > + ext4_fsblk_t block, max_blocks; > + int ret = 0; > + struct buffer_head map_bh; > + unsigned int blkbits = inode->i_blkbits; > + > + if (IS_RDONLY(inode)) > +return -EROFS; > + > + if (copy_from_user(&input, > + (struct ext4_falloc_input __user *) arg, sizeof(input))) > +return -EFAULT; > + > + if (input.len == 0) > + return -EINVAL; > + > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > + return -ENOTTY; > Supporting preallocation for extent based files seems fairly straightforward. I agree we should look at this first. After get this done, it probably worth re-consider whether to support preallocation for non-extent based files on ext4. I could imagine user upgrade from ext3 to ext4, and expecting to use preallocation on those existing files > + > + block = EXT4_BLOCK_ALIGN(input.offset, blkbits) >> blkbits; > + max_blocks = EXT4_BLOCK_ALIGN(input.len, blkbits) >> blkbits; > + handle=ext4_journal_start(inode, > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+max_blocks); > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > + while(ret>=0 && ret + { > + block = block + ret; > + max_blocks = max_blocks - ret; > + ret = ext4_ext_get_blocks(handle, inode, block, > + max_blocks, &map_bh, > + EXT4_CREATE_UNINITIALIZED_EXT, 1); Since the interface takes offset and number of blocks to allocate, I assuming we are going to handle holes in preallocation, thus, we cannot not mark the extend_size flag to 1 when calling ext4_ext_get_blocks. I think we should update i_size and i_disksize after preallocation. Oh, to protect parallel updating i_size, we have to take i_mutex down. > + } > + ext4_mark_inode_dirty(handle, inode); > + ext4_journal_stop(handle); > + Error code returned by ext4_journal_stop() is being ignored here, is this right? Well, there are other places in ext34/ioctl.c which ignore the return returned
Re: [RFC] ext4-locality-groups patch
> Valerie Clement (VC) writes: VC> The crash occurs because ei->i_locality_group is not well initialized. VC> The patch in attachment fixes the problem on my system (x86_64). thanks! strange that I didn't hit this on i386 ... and gcc doesn't warn. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext4-locality-groups patch
Alex Tomas wrote: this patch implements locality groups idea in a very simplified form. the policy is silly and ->sync_inodes() not very well tested on different workloads. thanks, Alex Hi Alex, I applied your patches to a 2.6.19-rc6 kernel. After creating and mounting an ext4 filesystem with options extents,mballoc,delalloc, the system crashes while copying a file into the ext4 partition. I set some kernel debugging options and rebuilt the kernel. Here is the oops I've got: EXT4 FS on sdc1, internal journal EXT4-fs: mounted filesystem with ordered data mode. EXT4-fs: file extents enabled EXT4-fs: mballoc enabled EXT4-fs: delayed allocation enabled BUG: spinlock bad magic on CPU#2, pdflush/264 Unable to handle kernel NULL pointer dereference at 01ac RIP: [] spin_bug+0x67/0xa5 PGD 27d97067 PUD 28370067 PMD 0 Oops: [1] SMP CPU 2 Modules linked in: qla2xxx Pid: 264, comm: pdflush Not tainted 2.6.19-rc6 #3 RIP: 0010:[] [] spin_bug+0x67/0xa5 RSP: 0018:81007d6a7db0 EFLAGS: 00010206 RAX: 0031 RBX: 810076630020 RCX: 806f48d8 RDX: 806f48d8 RSI: 0046 RDI: 806f48c0 RBP: 00a0 R08: R09: 8000 R10: 808bfc78 R11: 808bfc78 R12: 805bfdc9 R13: 81002dbd R14: 81007663 R15: 81007d6a7e60 FS: () GS:81007df14740() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 01ac CR3: 282cd000 CR4: 06e0 Process pdflush (pid: 264, threadinfo 81007d6a6000, task 81007db3b140) Stack: 810076630020 81002daa4160 81007663 803600ef 810076630020 802f2bf4 81007d6a7e60 81002dbd 81002dbd0070 81007d6a7e60 fffc 81007df19d70 Call Trace: [] _raw_spin_lock+0x19/0x7d [] ext4_lg_sync_inodes+0xec/0x196 [] keventd_create_kthread+0x0/0x64 [<00010004a823>] Code: 44 8b 85 0c 01 00 00 8b 53 04 48 85 ed 48 8d 85 98 02 00 00 RIP [] spin_bug+0x67/0xa5 RSP CR2: 01ac The crash occurs because ei->i_locality_group is not well initialized. The patch in attachment fixes the problem on my system (x86_64). Valérie Index: linux-2.6.19-rc6/fs/ext4/lg.c === --- linux-2.6.19-rc6.orig/fs/ext4/lg.c 2006-12-12 17:23:46.0 +0100 +++ linux-2.6.19-rc6/fs/ext4/lg.c 2006-12-12 17:24:53.0 +0100 @@ -66,7 +66,7 @@ extern int __writeback_single_inode(stru struct ext4_locality_group *ext4_lg_find_group(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); - struct ext4_locality_group *lg; + struct ext4_locality_group *lg = NULL; struct list_head *cur; rcu_read_lock();
[PATCH] process orphan list if device transitions to readonly
I have a patch in -mm now to skip orphan inode processing on a read-only device (where IO may fail if issued), but Stephen points out that if the device ever transitions back to readwrite, and the filesystem is remounted as rewrite, we should process the orphan inode list at that point. Today, I'm not sure how we can easily get into this situation - for example, lvm apparently cannot transform a RO snapshot into RW. But it is at least possible to do this. For example I tested by: create orphan list crash mark block device RO via BLKROSET mount fs RO mark block device RW via BLKROSET remount fs RW so it's within the realm of possibility, certainly. here's what I came up with; I'm not very pleased with it though, but I need some way to let ext3_orphan_del know that the sb is already locked and it's safe to do this orphan processing w/o re-taking it, so I added a state flag... It's worked in my testing. Comments or commit would be appreciated. Thanks, -Eric Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> Index: linux-2.6.19/fs/ext3/namei.c === --- linux-2.6.19.orig/fs/ext3/namei.c +++ linux-2.6.19/fs/ext3/namei.c @@ -1942,15 +1942,15 @@ int ext3_orphan_del(handle_t *handle, st struct ext3_iloc iloc; int err = 0; - lock_super(inode->i_sb); - if (list_empty(&ei->i_orphan)) { - unlock_super(inode->i_sb); - return 0; - } + sbi = EXT3_SB(inode->i_sb); + if (!(sbi->s_mount_state & EXT3_REMOUNTING_FS)) + lock_super(inode->i_sb); + + if (list_empty(&ei->i_orphan)) + goto out; ino_next = NEXT_ORPHAN(inode); prev = ei->i_orphan.prev; - sbi = EXT3_SB(inode->i_sb); jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino); @@ -1996,7 +1996,8 @@ int ext3_orphan_del(handle_t *handle, st out_err: ext3_std_error(inode->i_sb, err); out: - unlock_super(inode->i_sb); + if (!(sbi->s_mount_state & EXT3_REMOUNTING_FS)) + unlock_super(inode->i_sb); return err; out_brelse: Index: linux-2.6.19/fs/ext3/super.c === --- linux-2.6.19.orig/fs/ext3/super.c +++ linux-2.6.19/fs/ext3/super.c @@ -2358,6 +2358,10 @@ static int ext3_remount (struct super_bl goto restore_opts; if (!ext3_setup_super (sb, es, 0)) sb->s_flags &= ~MS_RDONLY; + EXT3_SB(sb)->s_mount_state |= (EXT3_ORPHAN_FS|EXT3_REMOUNTING_FS); + ext3_orphan_cleanup(sb, es); + EXT3_SB(sb)->s_mount_state &= ~(EXT3_ORPHAN_FS|EXT3_REMOUNTING_FS); + } } #ifdef CONFIG_QUOTA Index: linux-2.6.19/include/linux/ext3_fs.h === --- linux-2.6.19.orig/include/linux/ext3_fs.h +++ linux-2.6.19/include/linux/ext3_fs.h @@ -356,6 +356,7 @@ struct ext3_inode { #defineEXT3_VALID_FS 0x0001 /* Unmounted cleanly */ #defineEXT3_ERROR_FS 0x0002 /* Errors detected */ #defineEXT3_ORPHAN_FS 0x0004 /* Orphans being recovered */ +#defineEXT3_REMOUNTING_FS 0x0008 /* Filesystem remounting, sb locked */ /* * Mount flags - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html