Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Mingming Cao
On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote:
 On Tue, 10 Jul 2007 16:30:25 -0700
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 01 Jul 2007 03:36:48 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
  
On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
 The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
 create_proc_entry() does not do lookups on file names with more that 
 one
 directory deep.  This causes the entry creation to fail and hence, no 
 proc
 file is created.  This patch moves the file to /proc/jbd2-degug.
 
 The file could be move to /proc/fs/jbd2/jbd2-debug, but it would 
 require
 some minor alterations to the jbd-stats patch.

I don't think we really want to be adding top-level files in /proc.
What about using the debugfs filesystem (not to be confused with
the e2fsprogs 'debugfs' command)?
   
   How about this then?  Moved the file to use debugfs as well as having
   the nice effect of removing more lines than what it adds.
   
   Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
  
  Please clean up the changelog.
  
  The changelog should include information about the location and the content
  of these debugfs files.  it should provide any instructions which users
  will need to be able to create and use those files.
 
 Will fix.
 
  Alternatively (and preferably) do this via an update to
  Documentation/filesystems/ext4.txt.
 
 Seems like I also need to update the doc on Kconfig as well.  Do you
 prefer this in separate patches? (current patch, kconfig patch, ext4
 doc update patch?
 
fs/jbd2/journal.c|   6220 +42 -0 !
include/linux/jbd2.h |21 + 1 - 0 !
2 files changed, 21 insertions(+), 43 deletions(-)
  
  Again, this patch isn't in Ted's kernel.org directory and hasn't been in 
  -mm.
  
  Apart from the lack of testing and review which this causes, it means I
  can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
  I squint at the diff, but that's harder when the diff wasn't prepared with
  `diff -p'.  Oh well.
 
 Will fix.
 
  
   Index: linux-2.6.22-rc4/fs/jbd2/journal.c
   ===
   --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
   -0700
   @@ -35,6 +35,7 @@
#include linux/kthread.h
#include linux/poison.h
#include linux/proc_fs.h
   +#include linux/debugfs.h

#include asm/uaccess.h
#include asm/page.h
   @@ -1954,60 +1955,37 @@
 * /proc tunables
 */
#if defined(CONFIG_JBD2_DEBUG)
   -int jbd2_journal_enable_debug;
   +u16 jbd2_journal_enable_debug;
EXPORT_SYMBOL(jbd2_journal_enable_debug);
#endif

   -#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
   +#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
  
  Has this been compile-tested with CONFIG_DEBUGFS=n?
 
 I think I did, but honestly don't remember.  Will check with the new
 patch. :) 
 

Yes, I remember I did, that discovered some inconsistency in ext4 code,
which has already been fixed.


   -#define create_jbd_proc_entry() do {} while (0)
   -#define jbd2_remove_jbd_proc_entry() do {} while (0)
   +#define jbd2_create_debugfs_entry() do {} while (0)
   +#define jbd2_remove_debugfs_entry() do {} while (0)
  
  I suggest that these be converted to (preferable) inline functions while
  you're there.
 
 OK.
 
#endif

   @@ -2067,7 +2045,7 @@
 ret = journal_init_caches();
 if (ret != 0)
 jbd2_journal_destroy_caches();
   - create_jbd_proc_entry();
   + jbd2_create_debugfs_entry();
 return ret;
}

   @@ -2078,7 +2056,7 @@
 if (n)
 printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
#endif
   - jbd2_remove_jbd_proc_entry();
   + jbd2_remove_debugfs_entry();
 jbd2_journal_destroy_caches();
}

   Index: linux-2.6.22-rc4/include/linux/jbd2.h
   ===
   --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
   -0700
   @@ -57,7 +57,7 @@
 * CONFIG_JBD2_DEBUG is on.
 */
#define JBD_EXPENSIVE_CHECKING
  
  JBD2?
 
   -extern int jbd2_journal_enable_debug;
   +extern u16 jbd2_journal_enable_debug;
  
  Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
  going to do that.
 
 OK.
 
  Shoudln't all this debug info be a per-superblock thing rather than
  kernel-wide?
 
 I don't think it is worth pursuing this feature since this seems to
 have been broken for a while now (its been there since the first git
 revission in ext3) and nobody has noticed it until now.  It could be
 address on a later patch though, since the initial purpose of the patch
 was to fix the 

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-11 Thread Mingming Cao
On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote:
 On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote:
   
   Sorry about this. I was using version 0.04. The latest one I can find
   for now is 0.05(searching lkml), but it didn't catch this codling style
   bug either. I appreciate if anyone can point me the version 0.07, thanks
 
 It's now in-tree in scripts/checkpatch.pl
 (linus' tree is still at 0.06 though, I suspect Andrew has something
  newer in -mm)
 

Thanks, Andy has uploaded the 0.07 version at
http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Mingming Cao
On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  David Chinneer pointed that we need to journal the version number
  updates together with the operations that causes the change of the inode
  version number, in order to survive server crashes so clients won't see
  the counter go backwards.
  
  So increment i_version in fs code is probably the place to ensure the
  inode version changes are stored to disk. It's seems update the ext4
  inode version in every ext4_mark_inode_dirty() is the easiest way.
 
 That still makes us dependent upon _something_ changing the inode.  For
 overwrites the only something is mtime.
 
 If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
 I don't think we do) then I guess we'll need new code in or around
 file_update_time() to do this.

do you mean mark inode dirty all the times in file_update_time()? Not
sure about the overhead for ext3/4.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-11 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
 Date: Sun, 01 Jul 2007 03:38:51 -0400
 Sender: [EMAIL PROTECTED]
 Organization: IBM Linux Technology Center
 X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 
 
 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent compilation fixes

I hope this patch series will hit git with nice titles like ext4: extent
compilation fixes and not funny ones like
Morecleanups:ext4_extent_compilation_fixes.

 Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

conversions.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent macros cleanup
 
 - Replace math equation to it's macro equivalent

s/it's/its/;)

 - make ext4_ext_grow_indepth() indexes/leaf correct

hm, what was wrong with it?

 Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
 Acked-by: Alex Tomas [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 ---
  fs/ext4/extents.c |   11 +++
  1 files changed, 7 insertions(+), 4 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 12fe3d7..1fd00ac 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
 ext4_ext_path *path, int bloc
   ext_debug(binsearch for %d(idx):  , block);
 
   l = EXT_FIRST_INDEX(eh) + 1;
 - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_INDEX(eh);
   while (l = r) {
   m = l + (r - l) / 2;
   if (block  le32_to_cpu(m-ei_block))
 @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
 ext4_ext_path *path, int block)
   ext_debug(binsearch for %d:  , block);
 
   l = EXT_FIRST_EXTENT(eh) + 1;
 - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_EXTENT(eh);
 
   while (l = r) {
   m = l + (r - l) / 2;
 @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
 struct inode *inode,
   curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
   curp-p_hdr-eh_entries = cpu_to_le16(1);
   curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
 - /* FIXME: it works, but actually path[0] can be index */
 - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 + 
 + if (path[0].p_hdr-eh_depth)
 +   curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
 + else
 +   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;

whitespace bustage there.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andrew Morton
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
  On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  
   David Chinneer pointed that we need to journal the version number
   updates together with the operations that causes the change of the inode
   version number, in order to survive server crashes so clients won't see
   the counter go backwards.
   
   So increment i_version in fs code is probably the place to ensure the
   inode version changes are stored to disk. It's seems update the ext4
   inode version in every ext4_mark_inode_dirty() is the easiest way.
  
  That still makes us dependent upon _something_ changing the inode.  For
  overwrites the only something is mtime.
  
  If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
  I don't think we do) then I guess we'll need new code in or around
  file_update_time() to do this.
 
 do you mean mark inode dirty all the times in file_update_time()? Not
 sure about the overhead for ext3/4.
 

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}
 
+   if (IS_I_VERSION_64(inode)) {
+   inode_inc_iversion(inode);  /* Takes i_lock on 32-bit */
+   sync_it = 1;
+   }
+
if (sync_it)
mark_inode_dirty_sync(inode);
 }
_

?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-11 Thread Amit K. Arora
On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
 On Wed, 11 Jul 2007 01:50:00 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
  +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
  @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
  return sys_fadvise64_64(fd, ((u64)offset_hi  32) | offset_lo,
  len, advice);
   }
  +
  +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
  +   unsigned offset_hi, unsigned len_lo,
  +   unsigned len_hi)
 
 Please call this compat_sys_fallocate in line with the powerpc version -
 it gives us a hint that maybe we should think about how to consolidate
 them.  I know other stuff in that file is called sys32_ ... but it is time
 for a change :-)

I think this can be handled as a separate patch once this patchset
is in mainline. Since, anyhow we will need to do this for other sys32_
calls which are already there...

--
Regards,
Amit Arora


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
 Well, if you see the modes proposed using above flags :
 
 #define FA_ALLOCATE   0
 #define FA_DEALLOCATE FA_FL_DEALLOC
 #define FA_RESV_SPACE FA_FL_KEEP_SIZE
 #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
 FA_FL_DEL_DATA)
 
 FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
 for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
 flag. Hence prealloction will never delete data.
 This mode is required only for FA_UNRESV_SPACE, which is a deallocation
 mode, to support any existing XFS aware applications/usage-scenarios.

Sorry, but this doesn't make any sense.  There is no need to put every
feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
be supported forever anyway - as I suggested before they really should
be moved to generic code.

What needs to be supported is what makes sense as an interface.
A punch a hole interface does make sense, but trying to hack this into
a preallocation system call is just madness.  We're not IRIX or windows
that fit things into random subcall just because there was some space
left to squeeze them in.

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data 
change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data 
change) */
  
  NACK to these aswell.  If i_size changes c/mtime need updates, if the size
  doesn't chamge they don't.  No need to add more flags for this.
 
 This requirement was from the point of view of HSM applications. Hope
 you saw Andreas previous post and are keeping that in mind.

HSMs needs this basically for every system call, which screams for an
open flag like O_INVISIBLE anyway.  Adding this in a generic way is
a good idea, but hacking bits and pieces that won't fit into the global
design is completely wrong.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
 Given the current behaviour for posix_fallocate() in glibc, I think
 that retaining the same error semantic and punting the cleanup to
 userspace (where the app will fail with ENOSPC anyway) is the only
 sane thing we can do here. Trying to undo this in the kernel leads
 to lots of extra rarely used code in error handling paths...

Agreed, looks like we should stay with the user has to clean up behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote:
 We use this capability in XFS at the moment.
 I think this is mainly for DMF (HSM) but is done via the xfs handle 
 interface
 (xfs_open_by_handle) AFAICT.
 

You're not :)  You're using an O_INVIBLE equivalent (as described below),
which would be a useful thing to have at the VFS level, but adding hacks
to some system calls only wouldn't help any HSM system.  It's just useless
API clutter.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ext4 online resizing

2007-07-11 Thread Ph. Marek
Hello everybody,

sorry for this question if it has been asked before; I couldn't find 
information about this.

Please keep me CC'ed - I'm not subscribed. Thank you.


Ext2/3 have ext2resize/resize2fs, ext2prepare and ext3online for resizing.
But they don't work for ext4.

Here's a sample output for a 64MB loopback-mounted file:
# ext2online -C -f -d tst
ext2online v1.1.19 - 2001/03/18 for EXT2FS 0.5b
ext2_open
ext2_bcache_init
ext2_determine_itoffset
setting itoffset to +259
ext2_get_reserved
Found 255 blocks in s_reserved_gdt_blocks

8 old groups, 1 blocks
32 new groups, 1 blocks
ext2_ioctl: EXTEND group to 65537 blocks

using itoffset of 259
new block bitmap is at 0x10001
new inode bitmap is at 0x10002
new inode table is at 0x10104-0x10203
new group has 7934 free blocks
new group has 2048 free inodes (256 blocks)
ext2_ioctl: ADD group 8
ext2online: ext2_ioctl: Inappropriate ioctl for device

ext2online: unable to resize /tmp/tst

That this doesn't work could be caused by my old ext2online version;
# ext2online -V
ext2online v1.1.19 - 2001/03/18 for EXT2FS 0.5b
but what makes me wonder is that the sf project 
http://ext2resize.sourceforge.net/download.html still lists 1.19 as current.
Should that work for ext4 too?

# cat /proc/version
Linux version 2.6.22-rc5-686 (Debian 2.6.22~rc5-1~experimental.1)
([EMAIL PROTECTED]) (gcc version 4.1.3 20070601 (prerelease) 
(Debian
4.1.2-12)) #1 SMP Wed Jun 27 22:21:14 UTC 2007

There's a resize.c in the ext4 directory 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=fs/ext4;h=e509c1bf9448c343b7a4666dae4dfed0afe228ad;hb=HEAD
so I'd thought that it should include that functionality.

What am I doing wrong?


Thank you for all answers.


Regards,

Phil
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


e2fsprogs: Undo I/O manager

2007-07-11 Thread Aneesh Kumar K.V
Hi Ted,
Following patch implement the Undo I/O manager and undoe2fs.
The patches are on top of latest git(23edf9b4674ced1cdf8625bd609d95dbd62923b3)

-aneesh


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] e2fsprogs: Add undoe2fs

2007-07-11 Thread Aneesh Kumar K.V
From: Aneesh Kumar K.V [EMAIL PROTECTED]

undoe2fs can be used to replay the transaction saved
in the transaction file using undo I/O Manager

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 misc/Makefile.in |   10 --
 misc/undoe2fs.c  |   48 
 2 files changed, 56 insertions(+), 2 deletions(-)
 create mode 100644 misc/undoe2fs.c

diff --git a/misc/Makefile.in b/misc/Makefile.in
index ccad78c..51bb17a 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -15,7 +15,7 @@ INSTALL = @INSTALL@
 @[EMAIL PROTECTED] e2image.8
 
 SPROGS=mke2fs badblocks tune2fs dumpe2fs blkid logsave \
-   $(E2IMAGE_PROG) @FSCK_PROG@ 
+   $(E2IMAGE_PROG) @FSCK_PROG@  undoe2fs
 USPROGS=   mklost+found filefrag
 SMANPAGES= tune2fs.8 mklost+found.8 mke2fs.8 dumpe2fs.8 badblocks.8 \
e2label.8 findfs.8 blkid.8 $(E2IMAGE_MAN) \
@@ -39,6 +39,7 @@ E2IMAGE_OBJS= e2image.o
 FSCK_OBJS= fsck.o base_device.o
 BLKID_OBJS=blkid.o
 FILEFRAG_OBJS= filefrag.o
+UNDOE2FS_OBJS=  undoe2fs.o
 
 XTRA_CFLAGS=   -I$(srcdir)/../e2fsck -I.
 
@@ -47,7 +48,7 @@ SRCS= $(srcdir)/tune2fs.c $(srcdir)/mklost+found.c 
$(srcdir)/mke2fs.c \
$(srcdir)/badblocks.c $(srcdir)/fsck.c $(srcdir)/util.c \
$(srcdir)/uuidgen.c $(srcdir)/blkid.c $(srcdir)/logsave.c \
$(srcdir)/filefrag.c $(srcdir)/base_device.c \
-   $(srcdir)/../e2fsck/profile.c
+   $(srcdir)/../e2fsck/profile.c $(srcdir)/undoe2fs.c
 
 LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) 
 DEPLIBS= $(LIBEXT2FS) $(LIBCOM_ERR) 
@@ -108,6 +109,10 @@ e2image: $(E2IMAGE_OBJS) $(DEPLIBS)
@echo  LD $@
@$(CC) $(ALL_LDFLAGS) -o e2image $(E2IMAGE_OBJS) $(LIBS) $(LIBINTL)
 
+undoe2fs: $(UNDOE2FS_OBJS) $(DEPLIBS)
+   @echo  LD $@
+   @$(CC) $(ALL_LDFLAGS) -o undoe2fs $(UNDOE2FS_OBJS) $(LIBS)
+
 base_device: base_device.c
@echo  LD $@
@$(CC) $(ALL_CFLAGS) $(ALL_LDFLAGS) $(srcdir)/base_device.c \
@@ -434,3 +439,4 @@ filefrag.o: $(srcdir)/filefrag.c
 base_device.o: $(srcdir)/base_device.c $(srcdir)/fsck.h
 profile.o: $(srcdir)/../e2fsck/profile.c $(top_srcdir)/lib/et/com_err.h \
  $(srcdir)/../e2fsck/profile.h prof_err.h
+undoe2fs.o: $(srcdir)/undoe2fs.c $(top_srcdir)/lib/ext2fs/tdb.h
diff --git a/misc/undoe2fs.c b/misc/undoe2fs.c
new file mode 100644
index 000..c209878
--- /dev/null
+++ b/misc/undoe2fs.c
@@ -0,0 +1,48 @@
+#include stdio.h
+#include stdlib.h
+#include fcntl.h
+#if HAVE_ERRNO_H
+#include errno.h
+#endif
+#include ext2fs/tdb.h
+
+void usage(char *prg_name)
+{
+   fprintf(stderr, Usage: %s transaction file filesystem\n, 
prg_name);
+   exit(1);
+
+}
+
+
+main(int argc, char *argv[])
+{
+   TDB_CONTEXT *tdb;
+   TDB_DATA key, data;
+   unsigned long  blk_num;
+   unsigned long long int location;
+   int fd;
+
+   if (argc != 3)
+   usage(argv[0]);
+
+   tdb = tdb_open(argv[1], 0, 0, O_RDONLY, 0600);
+
+   if (!tdb) {
+   printf(Failed tdb_open %s\n,  strerror(errno));
+   exit(1);
+   }
+
+   fd = open(argv[2], O_WRONLY);
+
+   for (key = tdb_firstkey(tdb); key.dptr; key = tdb_nextkey(tdb, key)) {
+   data = tdb_fetch(tdb, key);
+   blk_num = *(unsigned long *)key.dptr;
+   location = blk_num * data.dsize;
+   printf(Replayed transaction of size %d at location %ld\n, 
data.dsize, blk_num);
+   lseek(fd, location, SEEK_SET);
+   write(fd, data.dptr, data.dsize);
+   }
+   close(fd);
+   tdb_close(tdb);
+
+}
-- 
1.5.3.rc0.63.gc956-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:30 -0700, Andrew Morton wrote:
  +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)  \
  +   ((offsetof(typeof(*ext4_inode), field) +\
  + sizeof((ext4_inode)-field))  \
  +   = (EXT4_GOOD_OLD_INODE_SIZE +  \
  +   (einode)-i_extra_isize))   \
 
 Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
 under what circumstances something will not fit in an inode and what the
 consequences of this are.

/* Extended fields will fit into an inode if the filesystem was formatted
 * with large inodes (-I 256 or larger) and there are not currently EAs
 * consuming all of the available space.  For new inodes we always reserve
 * enough space for the kernel's known extended fields, but for inodes
 * created with an old kernel this might not have been the case.  None of
 * the extended inode fields is critical for correct filesystem operation.
 */

  +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)  
 \
  +do {   
 \
  +   (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
  +   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
  +   ext4_decode_extra_time((inode)-xtime,\
  +  raw_inode-xtime ## _extra);\
  +} while (0)
 
 Ugly.  I expect these could be implemented as plain old C functions. 
 Caller could pass in the address of the ext4_inode field which the function
 is to operate upon.

We thought about that also, but then the caller needs to do all of the
pointer gymnastics themselves like:

 ext4_inode_get_xtime(inode-i_ctime, inode-i_ctime_extra,
 raw_inode-i_ctime, raw_inode-i_ctime_extra)

instead of the current:

 EXT4_INODE_GET_XTIME(ctime, inode, raw_inode);

IMHO it is preferrable to make the multiple callsites more readable than
the macros.

   #if defined(__KERNEL__) || defined(__linux__)
 
 (What's the __linux__ for?)
 
   #define i_reserved1osd1.linux1.l_i_reserved1
   #define i_frag osd2.linux2.l_i_frag

This is actually unrelated to the current patch, just part of the context.
AFAIK, this is historical, so that the kernel and e2fsprogs can use the
same ext2_fs.h header.  I don't think it is really needed, but such cleanup
shouldn't be a part of this patch either.

  +static inline struct timespec ext4_current_time(struct inode *inode)
  +{
  +   return (inode-i_sb-s_time_gran  NSEC_PER_SEC) ?
  +   current_fs_time(inode-i_sb) : CURRENT_TIME_SEC;
  +}
 
 Now, I've forgotten how this works.  Remind me, please.  Can ext4
 filesystems ever have one-second timestamp granularity?  If so, how does
 one cause that to come about?

Yes, this is possible if an ext2/3/4 filesystem is formatted with 128-byte
inodes (which is the default for all but ext4) and this fs is mounted as
ext4dev.  The inodes can never hold the extra time information (FITS_IN_INODE
check above) so the superblock limits the timestamp resolution to 1s in that
case.

  @@ -153,6 +153,7 @@
   
  unsigned long i_ext_generation;
  struct ext4_ext_cache i_cached_extent;
  +   struct timespec i_crtime;
   };
 
 It is unobvious what this field does.  Please prefer to add commentary to
 _all_ struct fields - it really helps.

It is the inode creation time.  This is useful for debug/forensic purposes,
and at some point there will be an API so that Samba can use it also.

   #endif /* _LINUX_EXT4_FS_I */
  Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 
  17:28:15.0 -0700
  +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 
  17:39:05.0 -0700
  @@ -79,6 +79,7 @@
  char *s_qf_names[MAXQUOTAS];/* Names of quota files with 
  journalled quota */
  int s_jquota_fmt;   /* Format of quota to use */
   #endif
  +   unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */
 
 OK, I can kind-of see how this is working, but some overall description of
 how the inode sizing design operates would be helpful.  It would certainly
 make reviewing of this proposed change more fruitful.  Perhaps that new
 comment over EXT4_FITS_IN_INODE() would be a suitable place.

Hmm, I'm sure there were emails on the topic, but they aren't attached to
the patch.  s_want_extra_isize is just an override for sizeof(ext4_inode)
in case the sysadmin wants to reserve more fields in new inodes.  There is
also s_min_extra_isize which is what the kernel and e2fsck guarantee that
will be available in all in-use inodes, if RO_COMPAT_EXTRA_ISIZE is set
(ro-compat so that older kernels can't create inodes with a smaller
extra_isize).  That feature is only enabled 

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  23:34 -0400, Trond Myklebust wrote:
 On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
  So my vote is to increment i_version in common code every time any
  change is made to the file, and alloc_inode should initialise it to
  current time, which might be changed by the filesystem before it calls
  unlock_new_inode. 
  ... but doesn't lustre want to control its i_version... so maybe not :-(
 
 If lustre wants to be exportable via pNFS (as Peter Braam has suggested
 it should), then it had better be able to return a change attribute that
 is compatible with the NFSv4.1 spec...

The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre
version can be used to compare the updates of two inodes.  It is set
to be the Lustre transaction number (which is sequentially incremented
on a per filesystem basis), so that we can use the per-inode version
to do replay of client operations even if they have been disconnected for
a long time, which is why we want to be able to control the actual value.
We don't want the version to be updated for e.g. file defragmentation
or other similar internal-only changes which need ext4_mark_inode_dirty().

We had a patch to disable ext4 inode versioning by a flag the superblock,
but we dropped it at the last minute because it needed some updates and
we didn't want to wait on that for submitting these changes upstream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:31 -0700, Andrew Morton wrote:
  This patch adds 64-bit inode version support to ext4. The lower 32 bits
  are stored in the osd1.linux1.l_i_version field while the high 32 bits
  are stored in the i_version_hi field newly created in the ext4_inode.
 
 So reading the code here does serve to answer the question I raised against
 the earlier patch.  A bit.
 
 I'd have thought that this patch and the one which adds i_version_hi should
 be folded into a single diff?

It could be - the original patch to reserve i_version_hi was submitted to
before the patches were ready to avoid that space being used by something
else.

  +   if (EXT4_INODE_SIZE(inode-i_sb)  EXT4_GOOD_OLD_INODE_SIZE) {
  +   if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
  +   inode-i_version |=
  +   (__u64)(le32_to_cpu(raw_inode-i_version_hi))  32;
 
 checks the precedence of (type) versus 
 
 OK
 
  +   }
 
 I don't quite see how the above two tests are sufficient to unambiguously
 determine that the i_version_hi field is present on-disk.
 
 I guess we're implicitly assuming that if the on-disk inode is big enough
 then it _must_ have i_version_hi in there?  If so, why is the comparison
 with EXT4_GOOD_OLD_INODE_SIZE needed?

The GOOT_OLD_INODE_SIZE check is needed to know if i_extra_isize is even
present or valid in the on-disk inode.

  @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t
  +   raw_inode-i_disk_version = cpu_to_le32(inode-i_version);
  +   if (ei-i_extra_isize) {
  +   if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {
 
 There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here...

Because this is the in-memory version and it is always valid (set to zero
if there is extra space in the on-disk inode).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
 On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
  This patch is on top of i_version_update_vfs.
  The i_version field of the inode is set on inode creation and incremented
  when the inode is being modified.
 
 Which is not what i_version is supposed to do.  It'll get you tons of misses
 for NFSv3 filehandles that rely on the generation staying the same for the
 same file.  Please add a new field for the NFSv4 sequence counter instead
 of making i_version unuseable.

You are confusing i_generation (the instance of this inode number) with
i_version (whether this file has been modified)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Trond Myklebust
On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote:
 On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
  This patch is on top of i_version_update_vfs.
  The i_version field of the inode is set on inode creation and incremented 
  when
  the inode is being modified.
 
 Which is not what i_version is supposed to do.  It'll get you tons of misses
 for NFSv3 filehandles that rely on the generation staying the same for the
 same file.  Please add a new field for the NFSv4 sequence counter instead
 of making i_version unuseable.

Aren't you confusing i_version and i_generation here? Those are two
separate inode fields.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:31 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:53 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
  Add a noversion mount option to disable inode version updates.
 
 Why is this option being offered to our users?  To reduce disk traffic,
 like noatime?
 
 If so, what are the implications of this?  What would the user lose?

Ah, this is the patch to disable i_version updates for Lustre.  I don't
think any normal user would use this mount option, so I don't know if
there is a need to document it.  There are no performance implications,
unless we end up changing the mtime granularity JUST to update i_version,
in which case we can avoid some overhead if not exporting with NFSv4.

If we want to go in the direction of forcing extra inode updates just
for this, then we might even consider making i_version updates on disk
default to OFF unless NFSv4 has exported the filesystem at least once,
and then it should set a persistent flag in the superblock indicating
that i_version updates are needed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote:
 On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
  On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
   This patch is on top of i_version_update_vfs.
   The i_version field of the inode is set on inode creation and incremented
   when the inode is being modified.
  
  Which is not what i_version is supposed to do.  It'll get you tons of misses
  for NFSv3 filehandles that rely on the generation staying the same for the
  same file.  Please add a new field for the NFSv4 sequence counter instead
  of making i_version unuseable.
 
 You are confusing i_generation (the instance of this inode number) with
 i_version (whether this file has been modified)?

Yes, sorry.  Objection dropped.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
  err = ext4_reserve_inode_write(handle, inode, iloc);
  +   if (EXT4_I(inode)-i_extra_isize 
  +   EXT4_SB(inode-i_sb)-s_want_extra_isize 
  +   !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
  +   /* We need extra buffer credits since we may write into EA block
  +* with this same handle */
  +   if ((jbd2_journal_extend(handle,
  +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
  +   ret = ext4_expand_extra_isize(inode,
  +   EXT4_SB(inode-i_sb)-s_want_extra_isize,
  +   iloc, handle);
  +   if (ret) {
  +   EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
  +   if (!expand_message) {
  +   ext4_warning(inode-i_sb, __FUNCTION__,
  +   Unable to expand inode %lu. Delete
  +some EAs or run e2fsck.,
  +   inode-i_ino);
  +   expand_message = 1;
  +   }
  +   }
  +   }
  +   }
 
 Maybe that message should come out once per mount rather than once per boot
 (or once per modprobe)?

Probably true.

 What are the consequences of a jbd2_journal_extend() failure in here?

Not fatal, just like every user of i_extra_isize.  If the inode isn't a
large inode, or it can't be expanded then there will be a minor loss of
functionality on that inode.  If the i_extra_isize is critical, then
the sysadmin will have run e2fsck to force s_min_extra_isize large enough.

Note that this is only applicable for filesystems which are upgraded.  For
new inodes (i.e. all inodes that exist in the filesystem if it was always
run on a kernel with the currently understood extra fields) then this will
never be invoked (until such a time new extra fields are added).

  +static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  +int value_offs_shift, void *to,
  +void *from, size_t n, int blocksize)
  +{
  +   /* Adjust the value offsets of the entries */
  +   for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) {
  +   if (!last-e_value_block  last-e_value_size) {
  +   new_offs = le16_to_cpu(last-e_value_offs) +
  +   value_offs_shift;
  +   BUG_ON(new_offs + le32_to_cpu(last-e_value_size)
  + blocksize);
  +   last-e_value_offs = cpu_to_le16(new_offs);
  +   }
  +   }
  +   /* Shift the entries by n bytes */
  +   memmove(to, from, n);
  +}
 
 Under what circumstances will that new BUG_ON trigger?  Can it be triggered
 via incorrect disk contents?  If so, it should not be there.

Only under code defect or memory corruption.  The needed extra space was
already checked in ext4_expand_extra_isize_ea(), but I asked for this
BUG_ON() because it would otherwise seem that there was a chance from
reading just the above code that the shifted EA might overflow the buffer.

  +int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
  +   struct ext4_inode *raw_inode, handle_t *handle)
  +{
  +   down_write(EXT4_I(inode)-xattr_sem);
  +retry:
  +   if (EXT4_I(inode)-i_extra_isize = new_extra_isize) {
  +   up_write(EXT4_I(inode)-xattr_sem);
  +   return 0;
  +   }
 
 So..  xattr_sem is a lock which protects i_extra_isize?  That seems a bit
 strange to me - I'd have expected i_mutex.

Well, since this is the only code that ever changes i_extra_isize, and it
also needs to move the EAs around, it makes sense to use the EA lock for it.
This is also a relatively low-contention lock, and it needs to be held to
access any of the EAs (which are the only things beyond i_extra_isize).

  +   if (EXT4_I(inode)-i_file_acl) {
  +   bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
  +   error = -EIO;
  +   if (!bh)
  +   goto cleanup;
  +   if (ext4_xattr_check_block(bh)) {
  +   ext4_error(inode-i_sb, __FUNCTION__,
  +   inode %lu: bad block %llu, inode-i_ino,
  +   EXT4_I(inode)-i_file_acl);
  +   error = -EIO;
  +   goto cleanup;
  +   }
  +   base = BHDR(bh);
  +   first = BFIRST(bh);
  +   end = bh-b_data + bh-b_size;
  +   min_offs = end - base;
  +   free = ext4_xattr_free_space(first, min_offs, base,
  +total_blk);
  +   if (free  new_extra_isize) {
  +   if (!tried_min_extra_isize  s_min_extra_isize) {
  +   tried_min_extra_isize++;
  + 

Re: Initial results of FLEX_BG feature.

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  00:30 -0500, Jose R. Santos wrote:
 On Tue, 10 Jul 2007 22:12:14 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
  You might also want to test out placement of the journal in the middle
  of the filesystem, the U. Wisconsin folks tested this in one of their
  papers and showed some noticable improvements.  That isn't exactly
  related, but it is a relatively simple tweak to mke2fs/tune2fs to give
  it an allocation goal of group_desc[s_groups_count / 2].bg_inode_table
  (to put it past inode table in middle group).
 
 Make sense.  Do you have a link to the paper?

I don't have a URL, but if you search for IRON ext3 and go to Remzi's
site it is called Analysis and Evolution of Journaling File Systems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  22:40 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
  the subdir count for any directory crosses 65000.
 
 Would I be correct in assuming that a later fsck will clear
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir
 directories?

Correct.

 If so, that is worth a mention in the changelog, perhaps?
 
 Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
 old ext4, ext3 and ext2 can only mount this fs read-only, yes?

Also correct.  The COMPAT flag behaviour is described in detail in
Documentation/filesystems/ext[234].txt

  +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
  +{
  +   inc_nlink(inode);
  +   if (is_dx(inode)  inode-i_nlink  1) {
  +   /* limit is 16-bit i_links_count */
  +   if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) {
  +   inode-i_nlink = 1;
  +   EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb,
  + EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
  +   }
  +   }
  +}
 
 Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

Because that means it was previously 1 (inc_nlink() was already called).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ext3 onlie resize failure due to small journal size

2007-07-11 Thread Suzuki

Hi Andreas,

Trying to resize a mounted ext3 filesystem fails due to small journal size.

Background :

The filesystem was created with default values, except blocksize = 4K on 
a LV partition. Later we tried extended the partition to +16M and tried 
to resize the fs using resize2fs, while it was mounted.


[EMAIL PROTECTED] ~]# resize2fs  /dev/mapper/testvg-tmp
resize2fs 1.39 (29-May-2006)
Filesystem at /dev/mapper/testvg-tmp is mounted on /tmp; on-line 
resizing required
Performing an on-line resize of /dev/mapper/testvg-tmp to 186368 (4k) 
blocks.

resize2fs: No space left on device While trying to add group #4

Analysis :

While adding the new blockgroup, inside setup_new_group_blocks() we hit 
the limit because we are requesting for a a credit value of 2 + 
sbi-s_itb_per_group which in the case of the file system below is 1026 
while the max_transaction credits possible is 1024 for the fs.


journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K

journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K

journal-j_max_transaction_buffers = 1024.




# dumpe2fs  /dev/mapper/testvg-tmp
dumpe2fs 1.39 (29-May-2006)
Filesystem volume name:   none
Last mounted on:  not available
Filesystem UUID:  7d82f07b-cb22-4c7d-b290-a35749121bbb
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal resize_inode dir_index filetype 
needs_recovery sparse_super large_file

Default mount options:(none)
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  131072
Block count:  131072
Reserved block count: 6553
Free blocks:  122753
Free inodes:  131056
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  31
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 32768
Inode blocks per group:   1024
Filesystem created:   Mon Jul  9 00:09:23 2007
Last mount time:  Tue Jul 10 02:44:44 2007
Last write time:  Tue Jul 10 02:44:44 2007
Mount count:  7
Maximum mount count:  33
Last checked: Mon Jul  9 00:09:23 2007
Check interval:   15552000 (6 months)
Next check after: Sat Jan  5 00:09:23 2008
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   128
Journal inode:8
Default directory hash:   tea
Directory Hash Seed:  0a57077d-9778-4f3a-8605-7f0c86f18d8a
Journal backup:   inode blocks
Journal size: 16M


Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-32
Block bitmap at 33 (+33), Inode bitmap at 34 (+34)
Inode table at 35-1058 (+35)
27600 free blocks, 32755 free inodes, 2 directories
Free blocks: 5168-32767
Free inodes: 14-32768
Group 1: (Blocks 32768-65535)
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32800
Block bitmap at 32801 (+33), Inode bitmap at 32802 (+34)
Inode table at 32803-33826 (+35)
31708 free blocks, 32767 free inodes, 1 directories
Free blocks: 33828-65535
Free inodes: 32770-65536
Group 2: (Blocks 65536-98303)
Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
Inode table at 65538-66561 (+2)
31740 free blocks, 32764 free inodes, 2 directories
Free blocks: 66562-83967, 83969-94207, 94209-98303
Free inodes: 65541-98304
Group 3: (Blocks 98304-131071)
Backup superblock at 98304, Group descriptors at 98305-98305
Reserved GDT blocks at 98306-98336
Block bitmap at 98337 (+33), Inode bitmap at 98338 (+34)
Inode table at 98339-99362 (+35)
31704 free blocks, 32766 free inodes, 1 directories
Free blocks: 99363-126976, 126982-131071
Free inodes: 98305, 98308-131072



Is this a supported operation ? If yes, what could be the best way to 
fix it ?


Resizing the journal is not supported at the moment :(.

Thoughts ?

Thanks

Suzuki
IBM Linux Technology Centre


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Theodore Tso
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:53 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Add a noversion mount option to disable inode version updates.
 
 Why is this option being offered to our users?  To reduce disk traffic,
 like noatime?
 
 If so, what are the implications of this?  What would the user lose?

This has been removed in the latest patch set; it's needed only for
Lustre, because they set the version field themselves.  Lustre needs
the inode version to be globally monotonically increasing, so it can
order updates between two different files, so it does this itself.
NFSv4 only uses i_version to detect changes, and so there's no need to
use a global atomic counter for i_version.  So the thinking was that
there was no point doing the global atomic counter if it was not necessary.

Since noversion is Lustre specific, we've dropped that from the list
of patches that we'll push, and so the inode version will only have
local per-inode significance, and not have any global ordering
properties.  

We have not actually benchmarked whether or not doing the global
ordering actually *matters* in terms of being actually noticeable.  If
it isn't noticeable, I wouldn't mind changing things so that we always
make i_version globally significant (without a mount option), and make
life a bit easier for the Lustre folks.  Or if someone other
distributed filesystem requests a globally significant i_version.  But
we can cross that bridge when we get to it

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread J. Bruce Fields
On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote:
 And just by-the-way, the server doesn't really have the option of not
 sending the attribute.  If i_version isn't defined, it has to fake
 something using mtime, and hope that is good enough.

ctime, actually--the change attribute is also supposed to be updated on
attribute updates.

 Alternately we could mandate that i_version is always kept up-to-date
 and if a filesystem doesn't have anything to load from storage, it
 just sets it to the current time in nanoseconds.
 
 That would mean that a client would need to flush it's cache whenever
 the inode fell out of cache on the server, but I don't think we can
 reliably do better than that.
 
 I think I like that approach.
 
 So my vote is to increment i_version in common code every time any
 change is made to the file, and alloc_inode should initialise it to
 current time, which might be changed by the filesystem before it calls
 unlock_new_inode. 

So the client would be invalidating its cache more often than necessary,
rather than failing to invalidate it when it should.  I agree that
that's probably the better tradeoff, although I wish I had a better idea
of the downside.  I don't know, for example, whether users might see
unpleasant results if every client has to reread its cached data on a
reboot.

The currently proposed change--just providing a model change attribute
implementation for ext4 and leaving other filesystems untouched--is a
more conservative step.

So I'm inclined to just do this ext4 thing first, and then look into
further change attribute experiments next time around

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

-   /* AKPM: buglet - add `i' to tmp! */
   
   Damn.  After, what, seven years, someone actually fixed it?
   
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   struct commit_header *tmp =
+   (struct commit_header *)(bh-b_data + i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = 
cpu_to_be32(commit_transaction-t_tid);
+
+   if (JBD2_HAS_COMPAT_FEATURE(journal,
+   
JBD2_FEATURE_COMPAT_CHECKSUM)) {
+   tmp-h_chksum_type  = JBD2_CRC32_CHKSUM;
+   tmp-h_chksum_size  = 
JBD2_CRC32_CHKSUM_SIZE;
+   tmp-h_chksum[0]= 
cpu_to_be32(crc32_sum);
+   }
}
   
   And in doing so, changed the on-disk format of the journal commit blocks.
   
   Surely this was worth a mention in the changelog, if not a standalone 
   patch?
   
   I don't think this is worth doing, really.  Why not just leave the format
   as it was, take the loop out and run this code once rather than eight
   times?
 
 Well, we aren't using the rest of the commit block in any case.  I think
 the original intention was that we'd get 8 copies of the commit block so
 we would be sure to get a good one.
 
 I don't know whether we'd rather have 8 copies of the commit block, or
 more potential to expand the commit block?  I don't personally have any
 preference, since the checksum should be a more robust way of checking
 validity than having multiple copies, so we may as well remove the loop
 and stick with a single copy for now.

We've never altered any commit block sectors apart from the zeroeth one
(eight times) due to the above bug.  So I'd suggest that we should formalise
the old bug and leave the format as-is.  That'll leave lots of space spare in
the commit block.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andrew Morton
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
 err = ext4_reserve_inode_write(handle, inode, iloc);
   + if (EXT4_I(inode)-i_extra_isize 
   + EXT4_SB(inode-i_sb)-s_want_extra_isize 
   + !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
   + /* We need extra buffer credits since we may write into EA block
   +  * with this same handle */
   + if ((jbd2_journal_extend(handle,
   +  EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
   + ret = ext4_expand_extra_isize(inode,
   + EXT4_SB(inode-i_sb)-s_want_extra_isize,
   + iloc, handle);
   + if (ret) {
   + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
   + if (!expand_message) {
   + ext4_warning(inode-i_sb, __FUNCTION__,
   + Unable to expand inode %lu. Delete
   +  some EAs or run e2fsck.,
   + inode-i_ino);
   + expand_message = 1;
   + }
   + }
   + }
   + }
  
  Maybe that message should come out once per mount rather than once per boot
  (or once per modprobe)?
 
 Probably true.
 
  What are the consequences of a jbd2_journal_extend() failure in here?
 
 Not fatal, just like every user of i_extra_isize.  If the inode isn't a
 large inode, or it can't be expanded then there will be a minor loss of
 functionality on that inode.  If the i_extra_isize is critical, then
 the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
 
 Note that this is only applicable for filesystems which are upgraded.  For
 new inodes (i.e. all inodes that exist in the filesystem if it was always
 run on a kernel with the currently understood extra fields) then this will
 never be invoked (until such a time new extra fields are added).

I'd suggest that we get a comment in the code explaining this: this
unchecked error does rather stand out.

   + if (EXT4_I(inode)-i_file_acl) {
   + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + if (!bh)
   + goto cleanup;
   + if (ext4_xattr_check_block(bh)) {
   + ext4_error(inode-i_sb, __FUNCTION__,
   + inode %lu: bad block %llu, inode-i_ino,
   + EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + goto cleanup;
   + }
   + base = BHDR(bh);
   + first = BFIRST(bh);
   + end = bh-b_data + bh-b_size;
   + min_offs = end - base;
   + free = ext4_xattr_free_space(first, min_offs, base,
   +  total_blk);
   + if (free  new_extra_isize) {
   + if (!tried_min_extra_isize  s_min_extra_isize) {
   + tried_min_extra_isize++;
   + new_extra_isize = s_min_extra_isize;
  
  Aren't we missing a brelse(bh) here?
 
 Seems likely, yes.

OK - could we get a positive ack from someone indicating that this will get
looked at?  Because I am about to forget about it.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random corruption test for e2fsck

2007-07-11 Thread Theodore Tso
On Wed, Jul 11, 2007 at 03:44:11AM -0600, Andreas Dilger wrote:
 I've already found some kind of memory corruption in e2fsck as a result
 of running this as a regular user.  It segfaults in qsort() when freeing
 memory.  The image that causes this problem is attached, and it happens
 with the unpatched 1.39-WIP Mercurial tree of 2007-05-22.  Unfortunately,
 I don't have any decent memory debugging tools handy, so it isn't easy to
 see what is happening.  This is on an FC3 i686 system, in case it matters.

Thanks for sending me the test case!  Here's the patch, which will
probably cause me to do a 1.40.2 release sooner rather than later...

- Ted

commit 5e9ba85c2694926eb784531d81ba107200cf1a51
Author: Theodore Ts'o [EMAIL PROTECTED]
Date:   Wed Jul 11 13:42:43 2007 -0400

Fix e2fsck segfault on very badly damaged filesystems

A recent change to e2fsck_add_dir_info() to use tdb files to check
filesystems with a very large number of filesystems had a typo which
caused us to resize the wrong data structure.  This would cause a
array overrun leading to malloc pointer corruptions.  Since we
normally can very accurately predict how big the the dirinfo array
needs to be, this bug only got triggered on very badly corrupted
filesystems.

Thanks to Andreas Dilger for submitting the test case which discovered
this problem, and to Kalpak Shah for writing a random testing script
which created the test case.

Signed-off-by: Theodore Ts'o [EMAIL PROTECTED]

diff --git a/e2fsck/dirinfo.c b/e2fsck/dirinfo.c
index aaa4d09..f583c62 100644
--- a/e2fsck/dirinfo.c
+++ b/e2fsck/dirinfo.c
@@ -126,7 +126,7 @@ void e2fsck_add_dir_info(e2fsck_t ctx, ext2_ino_t ino, 
ext2_ino_t parent)
ctx-dir_info-size += 10;
retval = ext2fs_resize_mem(old_size, ctx-dir_info-size *
   sizeof(struct dir_info),
-  ctx-dir_info);
+  ctx-dir_info-array);
if (retval) {
ctx-dir_info-size -= 10;
return;
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs

2007-07-11 Thread Jose R. Santos
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
than 32bit block sizes during mount time.  This ensure proper record
lenth when writing to the journal.

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
---
 fs/ext4/super.c |7 7 + 0 - 0 !
 1 file changed, 7 insertions(+)

Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-07-11 09:36:00.0 -0500
+++ linux-2.6/fs/ext4/super.c   2007-07-11 09:36:20.0 -0500
@@ -1804,6 +1804,13 @@ static int ext4_fill_super (struct super
goto failed_mount3;
}
 
+   if (ext4_blocks_count(es)  0xULL 
+   !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0,
+  JBD2_FEATURE_INCOMPAT_64BIT)) {
+   printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n);
+   goto failed_mount4;
+   }
+
/* We have now updated the journal if required, so we can
 * validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


+ jbd2-commit-fix-transaction-dropping.patch added to -mm tree

2007-07-11 Thread akpm

The patch titled
 jbd2 commit: fix transaction dropping
has been added to the -mm tree.  Its filename is
 jbd2-commit-fix-transaction-dropping.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

--
Subject: jbd2 commit: fix transaction dropping
From: Jan Kara [EMAIL PROTECTED]

We have to check that also the second checkpoint list is non-empty before
dropping the transaction.

Signed-off-by: Jan Kara [EMAIL PROTECTED]
Cc: Chuck Ebbert [EMAIL PROTECTED]
Cc: Kirill Korotaev [EMAIL PROTECTED]
Cc: linux-ext4@vger.kernel.org
Cc: [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 fs/jbd2/commit.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletion(-)

diff -puN fs/jbd2/commit.c~jbd2-commit-fix-transaction-dropping fs/jbd2/commit.c
--- a/fs/jbd2/commit.c~jbd2-commit-fix-transaction-dropping
+++ a/fs/jbd2/commit.c
@@ -896,7 +896,8 @@ restart_loop:
journal-j_committing_transaction = NULL;
spin_unlock(journal-j_state_lock);
 
-   if (commit_transaction-t_checkpoint_list == NULL) {
+   if (commit_transaction-t_checkpoint_list == NULL 
+   commit_transaction-t_checkpoint_io_list == NULL) {
__jbd2_journal_drop_transaction(journal, commit_transaction);
} else {
if (journal-j_checkpoint_transactions == NULL) {
_

Patches currently in -mm which might be from [EMAIL PROTECTED] are

jbd-commit-fix-transaction-dropping.patch
jbd2-commit-fix-transaction-dropping.patch
ext2-fix-a-comment-when-ext2_release_file-is-called.patch
ext3-fix-deadlock-in-ext3_remount-and-orphan-list-handling.patch
ext4-fix-deadlock-in-ext4_remount-and-orphan-list-handling.patch

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Jose R. Santos
The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
create_proc_entry() does not do lookups on file names that are more that one
directory deep.  This causes the entry creation to fail and hence, no proc
file is created.

Instead of fixing this on procfs might as well move the jbd2-debug file to
debugfs which would be the preferred location for this kind of tunable.  The
new location is now /sys/kernel/debug/jbd2/jbd2-debug.


Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
---
 fs/Kconfig   |   105 + 5 - 0 !
 fs/jbd2/journal.c|   6727 +40 -0 !
 include/linux/jbd2.h |21 + 1 - 0 !
 3 files changed, 33 insertions(+), 46 deletions(-)

Index: linux-2.6/fs/jbd2/journal.c
===
--- linux-2.6.orig/fs/jbd2/journal.c2007-07-11 09:46:25.0 -0500
+++ linux-2.6/fs/jbd2/journal.c 2007-07-11 11:31:30.0 -0500
@@ -35,6 +35,7 @@
 #include linux/kthread.h
 #include linux/poison.h
 #include linux/proc_fs.h
+#include linux/debugfs.h
 
 #include asm/uaccess.h
 #include asm/page.h
@@ -1951,64 +1952,50 @@ void jbd2_journal_put_journal_head(struc
 }
 
 /*
- * /proc tunables
+ * debugfs tunables
  */
 #if defined(CONFIG_JBD2_DEBUG)
-int jbd2_journal_enable_debug;
+u8 jbd2_journal_enable_debug;
 EXPORT_SYMBOL(jbd2_journal_enable_debug);
 #endif
 
-#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
+#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
 
-static struct proc_dir_entry *proc_jbd_debug;
+#define JBD2_DEBUG_NAME jbd2-debug
 
-static int read_jbd_debug(char *page, char **start, off_t off,
- int count, int *eof, void *data)
-{
-   int ret;
+struct dentry *jbd2_debugfs_dir, *jbd2_debug;
 
-   ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug);
-   *eof = 1;
-   return ret;
+static void __init jbd2_create_debugfs_entry(void)
+{
+   jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL);
+   if (jbd2_debugfs_dir)
+   jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO,
+  jbd2_debugfs_dir,
+  jbd2_journal_enable_debug);
 }
 
-static int write_jbd_debug(struct file *file, const char __user *buffer,
-  unsigned long count, void *data)
+static void __exit jbd2_remove_debugfs_entry(void)
 {
-   char buf[32];
-
-   if (count  ARRAY_SIZE(buf) - 1)
-   count = ARRAY_SIZE(buf) - 1;
-   if (copy_from_user(buf, buffer, count))
-   return -EFAULT;
-   buf[ARRAY_SIZE(buf) - 1] = '\0';
-   jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10);
-   return count;
+   if (jbd2_debug)
+   debugfs_remove(jbd2_debug);
+   if (jbd2_debugfs_dir)
+   debugfs_remove(jbd2_debugfs_dir);
 }
 
-#define JBD_PROC_NAME sys/fs/jbd2-debug
+#else
 
-static void __init create_jbd_proc_entry(void)
+static void __init jbd2_create_debugfs_entry(void)
 {
-   proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL);
-   if (proc_jbd_debug) {
-   /* Why is this so hard? */
-   proc_jbd_debug-read_proc = read_jbd_debug;
-   proc_jbd_debug-write_proc = write_jbd_debug;
-   }
+   do {
+   } while (0);
 }
 
-static void __exit jbd2_remove_jbd_proc_entry(void)
+static void __exit jbd2_remove_debugfs_entry(void)
 {
-   if (proc_jbd_debug)
-   remove_proc_entry(JBD_PROC_NAME, NULL);
+   do {
+   } while (0);
 }
 
-#else
-
-#define create_jbd_proc_entry() do {} while (0)
-#define jbd2_remove_jbd_proc_entry() do {} while (0)
-
 #endif
 
 struct kmem_cache *jbd2_handle_cache;
@@ -2067,7 +2054,7 @@ static int __init journal_init(void)
ret = journal_init_caches();
if (ret != 0)
jbd2_journal_destroy_caches();
-   create_jbd_proc_entry();
+   jbd2_create_debugfs_entry();
return ret;
 }
 
@@ -2078,7 +2065,7 @@ static void __exit journal_exit(void)
if (n)
printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
 #endif
-   jbd2_remove_jbd_proc_entry();
+   jbd2_remove_debugfs_entry();
jbd2_journal_destroy_caches();
 }
 
Index: linux-2.6/include/linux/jbd2.h
===
--- linux-2.6.orig/include/linux/jbd2.h 2007-07-11 09:46:25.0 -0500
+++ linux-2.6/include/linux/jbd2.h  2007-07-11 10:37:06.0 -0500
@@ -57,7 +57,7 @@
  * CONFIG_JBD2_DEBUG is on.
  */
 #define JBD_EXPENSIVE_CHECKING
-extern int jbd2_journal_enable_debug;
+extern u8 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...)  \
do {\
Index: linux-2.6/fs/Kconfig
===
--- 

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Kalpak Shah
On Wed, 2007-07-11 at 10:34 -0700, Andrew Morton wrote:
 On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:
 
  On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
err = ext4_reserve_inode_write(handle, inode, iloc);
+   if (EXT4_I(inode)-i_extra_isize 
+   EXT4_SB(inode-i_sb)-s_want_extra_isize 
+   !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
+   /* We need extra buffer credits since we may write into 
EA block
+* with this same handle */
+   if ((jbd2_journal_extend(handle,
+EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 
0) {
+   ret = ext4_expand_extra_isize(inode,
+   
EXT4_SB(inode-i_sb)-s_want_extra_isize,
+   iloc, handle);
+   if (ret) {
+   EXT4_I(inode)-i_state |= 
EXT4_STATE_NO_EXPAND;
+   if (!expand_message) {
+   ext4_warning(inode-i_sb, 
__FUNCTION__,
+   Unable to expand inode %lu. 
Delete
+some EAs or run e2fsck.,
+   inode-i_ino);
+   expand_message = 1;
+   }
+   }
+   }
+   }
   
   Maybe that message should come out once per mount rather than once per 
   boot
   (or once per modprobe)?
  
  Probably true.
  
   What are the consequences of a jbd2_journal_extend() failure in here?
  
  Not fatal, just like every user of i_extra_isize.  If the inode isn't a
  large inode, or it can't be expanded then there will be a minor loss of
  functionality on that inode.  If the i_extra_isize is critical, then
  the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
  
  Note that this is only applicable for filesystems which are upgraded.  For
  new inodes (i.e. all inodes that exist in the filesystem if it was always
  run on a kernel with the currently understood extra fields) then this will
  never be invoked (until such a time new extra fields are added).
 
 I'd suggest that we get a comment in the code explaining this: this
 unchecked error does rather stand out.
 
+   if (EXT4_I(inode)-i_file_acl) {
+   bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
+   error = -EIO;
+   if (!bh)
+   goto cleanup;
+   if (ext4_xattr_check_block(bh)) {
+   ext4_error(inode-i_sb, __FUNCTION__,
+   inode %lu: bad block %llu, 
inode-i_ino,
+   EXT4_I(inode)-i_file_acl);
+   error = -EIO;
+   goto cleanup;
+   }
+   base = BHDR(bh);
+   first = BFIRST(bh);
+   end = bh-b_data + bh-b_size;
+   min_offs = end - base;
+   free = ext4_xattr_free_space(first, min_offs, base,
+total_blk);
+   if (free  new_extra_isize) {
+   if (!tried_min_extra_isize  
s_min_extra_isize) {
+   tried_min_extra_isize++;
+   new_extra_isize = s_min_extra_isize;
   
   Aren't we missing a brelse(bh) here?
  
  Seems likely, yes.
 
 OK - could we get a positive ack from someone indicating that this will get
 looked at?  Because I am about to forget about it.

I will send a patch to add the comments and make the suggested
corrections.

Thanks,
Kalpak.

 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 onlie resize failure due to small journal size

2007-07-11 Thread Kalpak Shah
On Wed, 2007-07-11 at 19:30 +0530, Suzuki wrote:
 Hi Andreas,
 
 Trying to resize a mounted ext3 filesystem fails due to small journal size.

A similar problem was discussed in an earlier thread JBD: ext2online
wants too many credits (744  256). The problem here is that the
maximum allowed transaction size is 4MB while a bigger transaction is
required to resize the filesystem.

In your case, you can delete the journal using tune2fs -O ^has_journal
dev and add a bigger journal tune2fs -J size=16M dev. I think
this will solve your problem.

Thanks,
Kalpak.

 
 Background :
 
 The filesystem was created with default values, except blocksize = 4K on 
 a LV partition. Later we tried extended the partition to +16M and tried 
 to resize the fs using resize2fs, while it was mounted.
 
 [EMAIL PROTECTED] ~]# resize2fs  /dev/mapper/testvg-tmp
 resize2fs 1.39 (29-May-2006)
 Filesystem at /dev/mapper/testvg-tmp is mounted on /tmp; on-line 
 resizing required
 Performing an on-line resize of /dev/mapper/testvg-tmp to 186368 (4k) 
 blocks.
 resize2fs: No space left on device While trying to add group #4
 
 Analysis :
 
 While adding the new blockgroup, inside setup_new_group_blocks() we hit 
 the limit because we are requesting for a a credit value of 2 + 
 sbi-s_itb_per_group which in the case of the file system below is 1026 
 while the max_transaction credits possible is 1024 for the fs.
 
 journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K
 
 journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K
 
 journal-j_max_transaction_buffers = 1024.
 
 
 
 
 # dumpe2fs  /dev/mapper/testvg-tmp
 dumpe2fs 1.39 (29-May-2006)
 Filesystem volume name:   none
 Last mounted on:  not available
 Filesystem UUID:  7d82f07b-cb22-4c7d-b290-a35749121bbb
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal resize_inode dir_index filetype 
 needs_recovery sparse_super large_file
 Default mount options:(none)
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 Inode count:  131072
 Block count:  131072
 Reserved block count: 6553
 Free blocks:  122753
 Free inodes:  131056
 First block:  0
 Block size:   4096
 Fragment size:4096
 Reserved GDT blocks:  31
 Blocks per group: 32768
 Fragments per group:  32768
 Inodes per group: 32768
 Inode blocks per group:   1024
 Filesystem created:   Mon Jul  9 00:09:23 2007
 Last mount time:  Tue Jul 10 02:44:44 2007
 Last write time:  Tue Jul 10 02:44:44 2007
 Mount count:  7
 Maximum mount count:  33
 Last checked: Mon Jul  9 00:09:23 2007
 Check interval:   15552000 (6 months)
 Next check after: Sat Jan  5 00:09:23 2008
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   128
 Journal inode:8
 Default directory hash:   tea
 Directory Hash Seed:  0a57077d-9778-4f3a-8605-7f0c86f18d8a
 Journal backup:   inode blocks
 Journal size: 16M
 
 
 Group 0: (Blocks 0-32767)
   Primary superblock at 0, Group descriptors at 1-1
   Reserved GDT blocks at 2-32
   Block bitmap at 33 (+33), Inode bitmap at 34 (+34)
   Inode table at 35-1058 (+35)
   27600 free blocks, 32755 free inodes, 2 directories
   Free blocks: 5168-32767
   Free inodes: 14-32768
 Group 1: (Blocks 32768-65535)
   Backup superblock at 32768, Group descriptors at 32769-32769
   Reserved GDT blocks at 32770-32800
   Block bitmap at 32801 (+33), Inode bitmap at 32802 (+34)
   Inode table at 32803-33826 (+35)
   31708 free blocks, 32767 free inodes, 1 directories
   Free blocks: 33828-65535
   Free inodes: 32770-65536
 Group 2: (Blocks 65536-98303)
   Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
   Inode table at 65538-66561 (+2)
   31740 free blocks, 32764 free inodes, 2 directories
   Free blocks: 66562-83967, 83969-94207, 94209-98303
   Free inodes: 65541-98304
 Group 3: (Blocks 98304-131071)
   Backup superblock at 98304, Group descriptors at 98305-98305
   Reserved GDT blocks at 98306-98336
   Block bitmap at 98337 (+33), Inode bitmap at 98338 (+34)
   Inode table at 98339-99362 (+35)
   31704 free blocks, 32766 free inodes, 1 directories
   Free blocks: 99363-126976, 126982-131071
   Free inodes: 98305, 98308-131072
 
 
 
 Is this a supported operation ? If yes, what could be the best way to 
 fix it ?
 
 Resizing the journal is not supported at the moment :(.
 
 Thoughts ?
 
 Thanks
 
 Suzuki
 IBM Linux Technology Centre
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread J. Bruce Fields
On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote:
 On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
  It just occurred to me:
  
   If i_version is 64bit, then knfsd would need to be careful when
   reading it on a 32bit host.  What are the locking rules?
 
 How does knfsd use i_version?  I would think that if all it was doing
 was to compare (i_version == previous_version)

That's correct.  (Though it's the client that's doing the comparison,
actually--the server is just reporting the value.)

 then locking wouldn't really matter.  Well, theoretically,
 previous_version could be 0x1, and i_version could be
 0x1, knfsd checks the high word, then ext4 updates i_version
 to 0x2, then knfsd checks the low word, detecting no change.
 How likely is this?

The choice of upper word in your example is arbitrary, but other than
that I believe your example is essentially the only one.  So this would
only happen when *both*

- the read of the new value of the low word happens precisely
  2^32 i_version updates after the word was read on the client's
  previous cache revalidation, and
- the value of i_version itself is close enough to a 32-bit
  boundary that wraparound can happen between the reads of the
  high and low words.

 (I don't understand why i_version even needs to be 64 bits in the
 first place.)

A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
That's not a problem in itself--the problem would only arise if two
subsequent client queries of the change attribute happened a multiple of
2^32 i_version increments apart.

This is more likely than the previous scenario, but still very unlikely.
I would have guessed that even in situations with a very high rate of
updates and a low rate of client revalidations, the chance of two
revalidations happening exactly 2^32 updates apart would still be no
more than 1 in 2^32.  (Could odd characteristics of the workloads (like
updates that tend to happen in power-of-2 groups?) make it any more
likely?)

I'd be happier if ext4 at least allowed the possibility of 64 bits in
the future.  And there's always the chance someone would find a use for
an i_version that was nondecreasing, even if nfs didn't care.

   Presumably it is only updated under i_mutex protection, but having to
   get i_mutex to read it would seem a little heavy handed.
 
 How does knfsd protect itself from the inode changing after i_version is
 checked?  Is any locking being done otherwise?

If the client always requests the change attribute before reading, and
the i_version is always updated after data is modified, I think we're
OK.  Admittedly this is a little subtle.

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Initial results of FLEX_BG feature.

2007-07-11 Thread Theodore Tso
On Wed, Jul 11, 2007 at 12:30:04AM -0500, Jose R. Santos wrote:
  i think in the spirit of the original META_BG option, Ted had wanted to
  put all the bitmaps from EXT4_DESC_PER_BLOCK groups somewhere within the
  metagroup.  It would also be interesting to see if moving ALL of the
  group metadata to a single location in the filesystem makes a bit 
  difference.
  If not, then we may as well keep it spread out for safety.

My original intention was that META_BG would place the bitmaps and
inode tables at the beginning of each metagroup by default, but that
the constraints about where to put the bitmaps and inode tables would
be completely relaxed from the point of view of requirements by the
kernel and e2fsck.  Unfortunately while I had patches which removed
the constraints checking, they never made it into mainline of either
the kernel or e2fsprogs, sigh.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Initial results of FLEX_BG feature.

2007-07-11 Thread Theodore Tso
On Wed, Jul 11, 2007 at 12:30:04AM -0500, Jose R. Santos wrote:
 Right now what I've done is allocate the bitmaps and inode tables at the
 beginning of each group of 64 BG.  Still need to work on fsck since just
 removing the restriction on were the bitmaps and inode table are
 located still gives me errors of uninitialized inodes with dtime set.
 Seems like fsck still expect inode information to be located at
 specific locations within the disk.

Can you send me the patch which you were playing with?  I might be
able to help you with this.  It should be pretty straightforward to
remove the constraint on the inode table location.  

It really should only be a check in e2fsck/super.c:check_super_block(),
as far as I know.

If you're seeing errors of unitialized inodes with dtime set, that
sounds like maybe something else is going on.  All of e2fsprogs should
be referencing the inode table via fs-group_desc[group_num].bg_inode_table.  
See lib/ext2fs/inode.c, functions ext2fs_open_inode_scan(), 
get_next_blockgroup(), and ext2fs_read_inode_full().

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Mingming Cao
On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  From: Dmitry Monakhov [EMAIL PROTECTED]
  Subject: ext4: extent macros cleanup
  
  - Replace math equation to it's macro equivalent
 
 s/it's/its/;)
Okay.

 
  - make ext4_ext_grow_indepth() indexes/leaf correct
 
 hm, what was wrong with it?
 
Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing
procedure. It allocates a new index block, moves the top-level data of
the tree(root or leaf blocks in i_data) into the new block, initializes
the new root, creating index that points to the just created index block

The original top-level data (in i_data) could be extent tree root (index
block) or extents (leaf block). The current code (without the patch)
treats the top-level data always be the leaf block, which is incorrect.


assumes when the tree is growing the extent structure pass in is always
  @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
  struct inode *inode,
  curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
  curp-p_hdr-eh_entries = cpu_to_le16(1);
  curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
  -   /* FIXME: it works, but actually path[0] can be index */
  -   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
  +   
  +   if (path[0].p_hdr-eh_depth)
  + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
  +   else
  + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 
 whitespace bustage there.
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 onlie resize failure due to small journal size

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  19:30 +0530, Suzuki wrote:
 Trying to resize a mounted ext3 filesystem fails due to small journal size.
 
 Background :
 
 The filesystem was created with default values, except blocksize = 4K on 
 a LV partition. Later we tried extended the partition to +16M and tried 
 to resize the fs using resize2fs, while it was mounted.
 
 While adding the new blockgroup, inside setup_new_group_blocks() we hit 
 the limit because we are requesting for a a credit value of 2 + 
 sbi-s_itb_per_group which in the case of the file system below is 1026 
 while the max_transaction credits possible is 1024 for the fs.
 
 journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K
 
 journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K
 
 journal-j_max_transaction_buffers = 1024.
 
 Is this a supported operation ? If yes, what could be the best way to 
 fix it ?
 
 Resizing the journal is not supported at the moment :(.

You can't do a journal resize online, but you can wait until your next
outage and resize the journal at that time.  Even a few extra blocks
would be enough.  I guess this is a corner case that hasn't been hit
before.  It might make sense to have the ext2fs_figure_journal_size()
take this into account when making the filesystem?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 onlie resize failure due to small journal size

2007-07-11 Thread Aneesh Kumar K.V



Andreas Dilger wrote:

On Jul 11, 2007  19:30 +0530, Suzuki wrote:

Trying to resize a mounted ext3 filesystem fails due to small journal size.

Background :

The filesystem was created with default values, except blocksize = 4K on 
a LV partition. Later we tried extended the partition to +16M and tried 
to resize the fs using resize2fs, while it was mounted.


While adding the new blockgroup, inside setup_new_group_blocks() we hit 
the limit because we are requesting for a a credit value of 2 + 
sbi-s_itb_per_group which in the case of the file system below is 1026 
while the max_transaction credits possible is 1024 for the fs.


journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K

journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K

journal-j_max_transaction_buffers = 1024.

Is this a supported operation ? If yes, what could be the best way to 
fix it ?


Resizing the journal is not supported at the moment :(.


You can't do a journal resize online, but you can wait until your next
outage and resize the journal at that time.  Even a few extra blocks
would be enough.  I guess this is a corner case that hasn't been hit
before.  It might make sense to have the ext2fs_figure_journal_size()
take this into account when making the filesystem?




That't true. I was looking at it. I guess we should make sure we can
ask for a credit same as inode tables block per group + some extra.

Will try to see i can cook a patch.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random corruption test for e2fsck

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  13:43 -0400, Theodore Tso wrote:
 Fix e2fsck segfault on very badly damaged filesystems
 
 --- a/e2fsck/dirinfo.c
 +++ b/e2fsck/dirinfo.c
 @@ -126,7 +126,7 @@ void e2fsck_add_dir_info(e2fsck_t ctx, ext2_ino_t ino, 
 ext2_ino_t parent)
   ctx-dir_info-size += 10;
   retval = ext2fs_resize_mem(old_size, ctx-dir_info-size *
  sizeof(struct dir_info),
 -ctx-dir_info);
 +ctx-dir_info-array);
   if (retval) {
   ctx-dir_info-size -= 10;
   return;

This appears to fix the problem.  I was previously able to crash e2fsck
within a couple of runs, now it is running in a loop w/o problems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random corruption test for e2fsck

2007-07-11 Thread Andreas Dilger
I've got another one, but it isn't a show stopper I think.

If you format a filesystem with both resize_inode and meta_bg you get an
unfixable filesystem.  The bad news is that it appears that running
e2fsck on the filesystem is actually _causing_ the corruption in this case
(trying to rebuild the resize inode)?  For now, the answer is don't do that.
The resize inode was never intended to be in use when meta_bg is enabled,
so we should just prevent this from the start.


mke2fs  -j -b 4096 -I 512 -O 
sparse_super,filetype,resize_inode,dir_index,lazy_bg,meta_bg -F /tmp/test.img 
57852

I can run e2fsck on this repeatedly and it always complains the same way:

$ e2fsck -fy /tmp/test.img
e2fsck 1.39.cfs9 (7-Apr-2007)
Resize inode not valid.  Recreate? yes

Pass 1: Checking inodes, blocks, and sizes
Inode 8, i_blocks is 0, should be 32816.  Fix? yes

Reserved inode 9 (Reserved inode 9) has invalid mode.  Clear? yes
Deleted inode 17 has zero dtime.  Fix? yes
Deleted inode 25 has zero dtime.  Fix? yes
Deleted inode 33 has zero dtime.  Fix? yes
Deleted inode 41 has zero dtime.  Fix? yes
Deleted inode 49 has zero dtime.  Fix? yes
Deleted inode 57 has zero dtime.  Fix? yes
Deleted inode 65 has zero dtime.  Fix? yes
Deleted inode 73 has zero dtime.  Fix? yes
Deleted inode 81 has zero dtime.  Fix? yes
Deleted inode 89 has zero dtime.  Fix? yes

Pass 2: Checking directory structure
Inode 2 (???) has invalid mode (00).
Clear? yes

Entry '..' in ??? (2) has deleted/unused inode 2.  Clear? yes

Inode 11 (???) has invalid mode (00).
Clear? yes

Pass 3: Checking directory connectivity
Root inode not allocated.  Allocate? yes

/lost+found not found.  Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +0 +(2--14) +(16--3619) +(3621--3622) +(3625--7727)
Fix? yes

Free blocks count wrong for group #0 (25039, counted=25041).
Fix? yes

Free blocks count wrong (46503, counted=46505).
Fix? yes

Inode bitmap differences:  +(3--10) -16
Fix? yes

Free inodes count wrong for group #0 (28918, counted=28917).
Fix? yes

Directories count wrong for group #0 (3, counted=2).
Fix? yes

Free inodes count wrong (57846, counted=57845).
Fix? yes


/tmp/test.img: * FILE SYSTEM WAS MODIFIED *
/tmp/test.img: 11/57856 files (9.1% non-contiguous), 11347/57852 blocks


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html