Re: [RFC][Patch 1/1] Persistent preallocation in ext4

2006-12-12 Thread Mingming Cao
On Tue, 2006-12-12 at 11:53 +0530, Amit K. Arora wrote:
> Hi Mingming,
> 
Hi Amit,

> On Mon, Dec 11, 2006 at 05:28:15PM -0800, Mingming Cao wrote:
> > On Wed, 2006-12-06 at 11:28 +0530, Amit K. Arora wrote:
> >
> > > @@ -1142,13 +1155,22 @@
> > >   /* try to insert block into found extent and return */
> > >   if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
> > >   ext_debug("append %d block to %d:%d (from %llu)\n",
> > > - le16_to_cpu(newext->ee_len),
> > > + ext4_ext_get_actual_len(newext),
> > >   le32_to_cpu(ex->ee_block),
> > > - le16_to_cpu(ex->ee_len), ext_pblock(ex));
> > > + ext4_ext_get_actual_len(ex), ext_pblock(ex));
> > >   if ((err = ext4_ext_get_access(handle, inode, path + depth)))
> > >   return err;
> > > - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
> > > -  + le16_to_cpu(newext->ee_len));
> > > +
> > > + /* ext4_can_extents_be_merged should have checked that either
> > > +  * both extents are uninitialized, or both aren't. Thus we
> > > +  * need to check any of them here.
> > > +  */
> > > + if (ext4_ext_is_uninitialized(ex))
> > > + uninitialized = 1;
> > > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > > +  + ext4_ext_get_actual_len(newext));
> 
> Above line will "remove" the uninitialized bit from "ex", if it was set.
> We call ext4_ext_get_actual_len() to get the "actual" lengths of the two
> extents, which removes this MSB in ee_len (MSB in ee_len is used to mark
> an extent uninitialized). Now, we do this because if lengths of two
> uninitialized extents will be added as it is (i.e. without masking out
> the MSB in the length), it will result in removing the MSB in ee_len.
> For example, 0x8002 + 0x8003 => 0x10005 => 0x5 (since ee_len is 16 bit).
> 
> That is why just before this line, we save the "state" of this extent,
> whether it was uninitialized or not. And, we restore this "state" below.

Ah...you are right:)

More comments below...

> Index: linux-2.6.19/fs/ext4/ioctl.c
> ===
> --- linux-2.6.19.orig/fs/ext4/ioctl.c 2006-12-06 10:18:12.0 +0530
> +++ linux-2.6.19/fs/ext4/ioctl.c  2006-12-06 10:18:15.0 +0530
> @@ -248,6 +248,47 @@
>   return err;
>   }
> 
> + case EXT4_IOC_PREALLOCATE: {
> + struct ext4_falloc_input input;
> + handle_t *handle;
> + ext4_fsblk_t block, max_blocks;
> + int ret = 0;
> + struct buffer_head map_bh;
> + unsigned int blkbits = inode->i_blkbits;
> +
> + if (IS_RDONLY(inode))
> +return -EROFS;
> +
> + if (copy_from_user(&input,
> + (struct ext4_falloc_input __user *) arg, sizeof(input)))
> +return -EFAULT;
> +
> + if (input.len == 0)
> + return -EINVAL;
> +
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + return -ENOTTY;
> 

Supporting preallocation for extent based files seems fairly
straightforward.  I agree we should look at this first.  After get this
done, it probably worth re-consider whether to support preallocation for
non-extent based files on ext4. I could imagine user upgrade from ext3
to ext4, and expecting to use preallocation on those existing files

> +
> + block = EXT4_BLOCK_ALIGN(input.offset, blkbits) >> blkbits;
> + max_blocks = EXT4_BLOCK_ALIGN(input.len, blkbits) >> blkbits;
> + handle=ext4_journal_start(inode,
> + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+max_blocks);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> + while(ret>=0 && ret + {
> + block = block + ret;
> + max_blocks = max_blocks - ret;
> + ret = ext4_ext_get_blocks(handle, inode, block,
> + max_blocks, &map_bh,
> + EXT4_CREATE_UNINITIALIZED_EXT, 1);

Since the interface takes offset and number of blocks to allocate, I
assuming we are going to handle holes in preallocation, thus, we cannot
not mark the extend_size flag to 1 when calling ext4_ext_get_blocks.

I think we should update i_size and i_disksize after preallocation. Oh,
to protect parallel updating i_size, we have to take i_mutex down.

> + }
> + ext4_mark_inode_dirty(handle, inode);
> + ext4_journal_stop(handle);
> +

Error code returned by ext4_journal_stop() is being ignored here, is
this right?
Well, there are other places in ext34/ioctl.c which ignore the return
returned 

Re: [RFC] ext4-locality-groups patch

2006-12-12 Thread Alex Tomas
> Valerie Clement (VC) writes:


 VC> The crash occurs because ei->i_locality_group is not well initialized.
 VC> The patch in attachment fixes the problem on my system (x86_64).

thanks!

strange that I didn't hit this on i386 ... and gcc doesn't warn.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext4-locality-groups patch

2006-12-12 Thread Valerie Clement

Alex Tomas wrote:

this patch implements locality groups idea in a very
simplified form. the policy is silly and ->sync_inodes()
not very well tested on different workloads. 


thanks, Alex


Hi Alex,
I applied your patches to a 2.6.19-rc6 kernel. After creating and 
mounting an ext4 filesystem with options extents,mballoc,delalloc,

the system crashes while copying a file into the ext4 partition.
I set some kernel debugging options and rebuilt the kernel. Here is the 
oops I've got:


EXT4 FS on sdc1, internal journal
EXT4-fs: mounted filesystem with ordered data mode.
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled
EXT4-fs: delayed allocation enabled
BUG: spinlock bad magic on CPU#2, pdflush/264
Unable to handle kernel NULL pointer dereference at 01ac RIP:
 [] spin_bug+0x67/0xa5
PGD 27d97067 PUD 28370067 PMD 0
Oops:  [1] SMP
CPU 2
Modules linked in: qla2xxx
Pid: 264, comm: pdflush Not tainted 2.6.19-rc6 #3
RIP: 0010:[]  [] spin_bug+0x67/0xa5
RSP: 0018:81007d6a7db0  EFLAGS: 00010206
RAX: 0031 RBX: 810076630020 RCX: 806f48d8
RDX: 806f48d8 RSI: 0046 RDI: 806f48c0
RBP: 00a0 R08:  R09: 8000
R10: 808bfc78 R11: 808bfc78 R12: 805bfdc9
R13: 81002dbd R14: 81007663 R15: 81007d6a7e60
FS:  () GS:81007df14740() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 01ac CR3: 282cd000 CR4: 06e0
Process pdflush (pid: 264, threadinfo 81007d6a6000, task 
81007db3b140)

Stack:  810076630020 81002daa4160 81007663 803600ef
 810076630020 802f2bf4 81007d6a7e60 81002dbd
 81002dbd0070 81007d6a7e60 fffc 81007df19d70
Call Trace:
 [] _raw_spin_lock+0x19/0x7d
 [] ext4_lg_sync_inodes+0xec/0x196
 [] keventd_create_kthread+0x0/0x64
 [<00010004a823>]


Code: 44 8b 85 0c 01 00 00 8b 53 04 48 85 ed 48 8d 85 98 02 00 00
RIP  [] spin_bug+0x67/0xa5
 RSP 
CR2: 01ac


The crash occurs because ei->i_locality_group is not well initialized.
The patch in attachment fixes the problem on my system (x86_64).

  Valérie


Index: linux-2.6.19-rc6/fs/ext4/lg.c
===
--- linux-2.6.19-rc6.orig/fs/ext4/lg.c  2006-12-12 17:23:46.0 +0100
+++ linux-2.6.19-rc6/fs/ext4/lg.c   2006-12-12 17:24:53.0 +0100
@@ -66,7 +66,7 @@ extern int __writeback_single_inode(stru
 struct ext4_locality_group *ext4_lg_find_group(struct super_block *sb)
 {
struct ext4_sb_info *sbi = EXT4_SB(sb);
-   struct ext4_locality_group *lg;
+   struct ext4_locality_group *lg = NULL;
struct list_head *cur;
 
rcu_read_lock();


[PATCH] process orphan list if device transitions to readonly

2006-12-12 Thread Eric Sandeen
I have a patch in -mm now to skip orphan inode processing on a read-only
device (where IO may fail if issued), but Stephen points out that if
the device ever transitions back to readwrite, and the filesystem is
remounted as rewrite, we should process the orphan inode list at that point.

Today, I'm not sure how we can easily get into this situation - for example,
lvm apparently cannot transform a RO snapshot into RW.  But it is at least
possible to do this.  For example I tested by:

create orphan list
crash
mark block device RO via BLKROSET
mount fs RO
mark block device RW via BLKROSET
remount fs RW

so it's within the realm of possibility, certainly.

here's what I came up with; I'm not very pleased with it though, but I need
some way to let ext3_orphan_del know that the sb is already locked and it's
safe to do this orphan processing w/o re-taking it, so I added a state flag...
It's worked in my testing.  Comments or commit would be appreciated.

Thanks,

-Eric

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: linux-2.6.19/fs/ext3/namei.c
===
--- linux-2.6.19.orig/fs/ext3/namei.c
+++ linux-2.6.19/fs/ext3/namei.c
@@ -1942,15 +1942,15 @@ int ext3_orphan_del(handle_t *handle, st
struct ext3_iloc iloc;
int err = 0;
 
-   lock_super(inode->i_sb);
-   if (list_empty(&ei->i_orphan)) {
-   unlock_super(inode->i_sb);
-   return 0;
-   }
+   sbi = EXT3_SB(inode->i_sb);
+   if (!(sbi->s_mount_state & EXT3_REMOUNTING_FS))
+   lock_super(inode->i_sb);
+
+   if (list_empty(&ei->i_orphan))
+   goto out;
 
ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
-   sbi = EXT3_SB(inode->i_sb);
 
jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
 
@@ -1996,7 +1996,8 @@ int ext3_orphan_del(handle_t *handle, st
 out_err:
ext3_std_error(inode->i_sb, err);
 out:
-   unlock_super(inode->i_sb);
+   if (!(sbi->s_mount_state & EXT3_REMOUNTING_FS))
+   unlock_super(inode->i_sb);
return err;
 
 out_brelse:
Index: linux-2.6.19/fs/ext3/super.c
===
--- linux-2.6.19.orig/fs/ext3/super.c
+++ linux-2.6.19/fs/ext3/super.c
@@ -2358,6 +2358,10 @@ static int ext3_remount (struct super_bl
goto restore_opts;
if (!ext3_setup_super (sb, es, 0))
sb->s_flags &= ~MS_RDONLY;
+   EXT3_SB(sb)->s_mount_state |= 
(EXT3_ORPHAN_FS|EXT3_REMOUNTING_FS);
+   ext3_orphan_cleanup(sb, es);
+   EXT3_SB(sb)->s_mount_state &= 
~(EXT3_ORPHAN_FS|EXT3_REMOUNTING_FS);
+
}
}
 #ifdef CONFIG_QUOTA
Index: linux-2.6.19/include/linux/ext3_fs.h
===
--- linux-2.6.19.orig/include/linux/ext3_fs.h
+++ linux-2.6.19/include/linux/ext3_fs.h
@@ -356,6 +356,7 @@ struct ext3_inode {
 #defineEXT3_VALID_FS   0x0001  /* Unmounted cleanly */
 #defineEXT3_ERROR_FS   0x0002  /* Errors detected */
 #defineEXT3_ORPHAN_FS  0x0004  /* Orphans being 
recovered */
+#defineEXT3_REMOUNTING_FS  0x0008  /* Filesystem 
remounting, sb locked */
 
 /*
  * Mount flags


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html