Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote: On Tue, 10 Jul 2007 16:30:25 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Sun, 01 Jul 2007 03:36:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Jun 07, 2007 23:45 -0500, Jose R. Santos wrote: The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names with more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. This patch moves the file to /proc/jbd2-degug. The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require some minor alterations to the jbd-stats patch. I don't think we really want to be adding top-level files in /proc. What about using the debugfs filesystem (not to be confused with the e2fsprogs 'debugfs' command)? How about this then? Moved the file to use debugfs as well as having the nice effect of removing more lines than what it adds. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Please clean up the changelog. The changelog should include information about the location and the content of these debugfs files. it should provide any instructions which users will need to be able to create and use those files. Will fix. Alternatively (and preferably) do this via an update to Documentation/filesystems/ext4.txt. Seems like I also need to update the doc on Kconfig as well. Do you prefer this in separate patches? (current patch, kconfig patch, ext4 doc update patch? fs/jbd2/journal.c| 6220 +42 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 2 files changed, 21 insertions(+), 43 deletions(-) Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm. Apart from the lack of testing and review which this causes, it means I can't just do `pushpatch name-of-this-patch' and look at it in tkdiff. So I squint at the diff, but that's harder when the diff wasn't prepared with `diff -p'. Oh well. Will fix. Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 -0700 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1954,60 +1955,37 @@ * /proc tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u16 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) Has this been compile-tested with CONFIG_DEBUGFS=n? I think I did, but honestly don't remember. Will check with the new patch. :) Yes, I remember I did, that discovered some inconsistency in ext4 code, which has already been fixed. -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) +#define jbd2_create_debugfs_entry() do {} while (0) +#define jbd2_remove_debugfs_entry() do {} while (0) I suggest that these be converted to (preferable) inline functions while you're there. OK. #endif @@ -2067,7 +2045,7 @@ ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2056,7 @@ if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING JBD2? -extern int jbd2_journal_enable_debug; +extern u16 jbd2_journal_enable_debug; Why was this made 16-bit? To save 2 bytes? Could have saved 3 if we're going to do that. OK. Shoudln't all this debug info be a per-superblock thing rather than kernel-wide? I don't think it is worth pursuing this feature since this seems to have been broken for a while now (its been there since the first git revission in ext3) and nobody has noticed it until now. It could be address on a later patch though, since the initial purpose of the patch was to fix the
Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote: On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote: Sorry about this. I was using version 0.04. The latest one I can find for now is 0.05(searching lkml), but it didn't catch this codling style bug either. I appreciate if anyone can point me the version 0.07, thanks It's now in-tree in scripts/checkpatch.pl (linus' tree is still at 0.06 though, I suspect Andrew has something newer in -mm) Thanks, Andy has uploaded the 0.07 version at http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07 Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes Date: Sun, 01 Jul 2007 03:38:51 -0400 Sender: [EMAIL PROTECTED] Organization: IBM Linux Technology Center X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent compilation fixes I hope this patch series will hit git with nice titles like ext4: extent compilation fixes and not funny ones like Morecleanups:ext4_extent_compilation_fixes. Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions. conversions. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] Acked-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 12fe3d7..1fd00ac 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc ext_debug(binsearch for %d(idx): , block); l = EXT_FIRST_INDEX(eh) + 1; - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_INDEX(eh); while (l = r) { m = l + (r - l) / 2; if (block le32_to_cpu(m-ei_block)) @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) ext_debug(binsearch for %d: , block); l = EXT_FIRST_EXTENT(eh) + 1; - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_EXTENT(eh); while (l = r) { m = l + (r - l) / 2; @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. hm, I hadn't thought about it in any detail. Maybe something like --- a/fs/inode.c~a +++ a/fs/inode.c @@ -1238,6 +1238,11 @@ void file_update_time(struct file *file) sync_it = 1; } + if (IS_I_VERSION_64(inode)) { + inode_inc_iversion(inode); /* Takes i_lock on 32-bit */ + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } _ ? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote: On Wed, 11 Jul 2007 01:50:00 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, return sys_fadvise64_64(fd, ((u64)offset_hi 32) | offset_lo, len, advice); } + +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo, + unsigned offset_hi, unsigned len_lo, + unsigned len_hi) Please call this compat_sys_fallocate in line with the powerpc version - it gives us a hint that maybe we should think about how to consolidate them. I know other stuff in that file is called sys32_ ... but it is time for a change :-) I think this can be handled as a separate patch once this patchset is in mainline. Since, anyhow we will need to do this for other sys32_ calls which are already there... -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote: Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Agreed, looks like we should stay with the user has to clean up behaviour. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote: We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. You're not :) You're using an O_INVIBLE equivalent (as described below), which would be a useful thing to have at the VFS level, but adding hacks to some system calls only wouldn't help any HSM system. It's just useless API clutter. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext4 online resizing
Hello everybody, sorry for this question if it has been asked before; I couldn't find information about this. Please keep me CC'ed - I'm not subscribed. Thank you. Ext2/3 have ext2resize/resize2fs, ext2prepare and ext3online for resizing. But they don't work for ext4. Here's a sample output for a 64MB loopback-mounted file: # ext2online -C -f -d tst ext2online v1.1.19 - 2001/03/18 for EXT2FS 0.5b ext2_open ext2_bcache_init ext2_determine_itoffset setting itoffset to +259 ext2_get_reserved Found 255 blocks in s_reserved_gdt_blocks 8 old groups, 1 blocks 32 new groups, 1 blocks ext2_ioctl: EXTEND group to 65537 blocks using itoffset of 259 new block bitmap is at 0x10001 new inode bitmap is at 0x10002 new inode table is at 0x10104-0x10203 new group has 7934 free blocks new group has 2048 free inodes (256 blocks) ext2_ioctl: ADD group 8 ext2online: ext2_ioctl: Inappropriate ioctl for device ext2online: unable to resize /tmp/tst That this doesn't work could be caused by my old ext2online version; # ext2online -V ext2online v1.1.19 - 2001/03/18 for EXT2FS 0.5b but what makes me wonder is that the sf project http://ext2resize.sourceforge.net/download.html still lists 1.19 as current. Should that work for ext4 too? # cat /proc/version Linux version 2.6.22-rc5-686 (Debian 2.6.22~rc5-1~experimental.1) ([EMAIL PROTECTED]) (gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)) #1 SMP Wed Jun 27 22:21:14 UTC 2007 There's a resize.c in the ext4 directory http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=fs/ext4;h=e509c1bf9448c343b7a4666dae4dfed0afe228ad;hb=HEAD so I'd thought that it should include that functionality. What am I doing wrong? Thank you for all answers. Regards, Phil - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
e2fsprogs: Undo I/O manager
Hi Ted, Following patch implement the Undo I/O manager and undoe2fs. The patches are on top of latest git(23edf9b4674ced1cdf8625bd609d95dbd62923b3) -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] e2fsprogs: Add undoe2fs
From: Aneesh Kumar K.V [EMAIL PROTECTED] undoe2fs can be used to replay the transaction saved in the transaction file using undo I/O Manager Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- misc/Makefile.in | 10 -- misc/undoe2fs.c | 48 2 files changed, 56 insertions(+), 2 deletions(-) create mode 100644 misc/undoe2fs.c diff --git a/misc/Makefile.in b/misc/Makefile.in index ccad78c..51bb17a 100644 --- a/misc/Makefile.in +++ b/misc/Makefile.in @@ -15,7 +15,7 @@ INSTALL = @INSTALL@ @[EMAIL PROTECTED] e2image.8 SPROGS=mke2fs badblocks tune2fs dumpe2fs blkid logsave \ - $(E2IMAGE_PROG) @FSCK_PROG@ + $(E2IMAGE_PROG) @FSCK_PROG@ undoe2fs USPROGS= mklost+found filefrag SMANPAGES= tune2fs.8 mklost+found.8 mke2fs.8 dumpe2fs.8 badblocks.8 \ e2label.8 findfs.8 blkid.8 $(E2IMAGE_MAN) \ @@ -39,6 +39,7 @@ E2IMAGE_OBJS= e2image.o FSCK_OBJS= fsck.o base_device.o BLKID_OBJS=blkid.o FILEFRAG_OBJS= filefrag.o +UNDOE2FS_OBJS= undoe2fs.o XTRA_CFLAGS= -I$(srcdir)/../e2fsck -I. @@ -47,7 +48,7 @@ SRCS= $(srcdir)/tune2fs.c $(srcdir)/mklost+found.c $(srcdir)/mke2fs.c \ $(srcdir)/badblocks.c $(srcdir)/fsck.c $(srcdir)/util.c \ $(srcdir)/uuidgen.c $(srcdir)/blkid.c $(srcdir)/logsave.c \ $(srcdir)/filefrag.c $(srcdir)/base_device.c \ - $(srcdir)/../e2fsck/profile.c + $(srcdir)/../e2fsck/profile.c $(srcdir)/undoe2fs.c LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) DEPLIBS= $(LIBEXT2FS) $(LIBCOM_ERR) @@ -108,6 +109,10 @@ e2image: $(E2IMAGE_OBJS) $(DEPLIBS) @echo LD $@ @$(CC) $(ALL_LDFLAGS) -o e2image $(E2IMAGE_OBJS) $(LIBS) $(LIBINTL) +undoe2fs: $(UNDOE2FS_OBJS) $(DEPLIBS) + @echo LD $@ + @$(CC) $(ALL_LDFLAGS) -o undoe2fs $(UNDOE2FS_OBJS) $(LIBS) + base_device: base_device.c @echo LD $@ @$(CC) $(ALL_CFLAGS) $(ALL_LDFLAGS) $(srcdir)/base_device.c \ @@ -434,3 +439,4 @@ filefrag.o: $(srcdir)/filefrag.c base_device.o: $(srcdir)/base_device.c $(srcdir)/fsck.h profile.o: $(srcdir)/../e2fsck/profile.c $(top_srcdir)/lib/et/com_err.h \ $(srcdir)/../e2fsck/profile.h prof_err.h +undoe2fs.o: $(srcdir)/undoe2fs.c $(top_srcdir)/lib/ext2fs/tdb.h diff --git a/misc/undoe2fs.c b/misc/undoe2fs.c new file mode 100644 index 000..c209878 --- /dev/null +++ b/misc/undoe2fs.c @@ -0,0 +1,48 @@ +#include stdio.h +#include stdlib.h +#include fcntl.h +#if HAVE_ERRNO_H +#include errno.h +#endif +#include ext2fs/tdb.h + +void usage(char *prg_name) +{ + fprintf(stderr, Usage: %s transaction file filesystem\n, prg_name); + exit(1); + +} + + +main(int argc, char *argv[]) +{ + TDB_CONTEXT *tdb; + TDB_DATA key, data; + unsigned long blk_num; + unsigned long long int location; + int fd; + + if (argc != 3) + usage(argv[0]); + + tdb = tdb_open(argv[1], 0, 0, O_RDONLY, 0600); + + if (!tdb) { + printf(Failed tdb_open %s\n, strerror(errno)); + exit(1); + } + + fd = open(argv[2], O_WRONLY); + + for (key = tdb_firstkey(tdb); key.dptr; key = tdb_nextkey(tdb, key)) { + data = tdb_fetch(tdb, key); + blk_num = *(unsigned long *)key.dptr; + location = blk_num * data.dsize; + printf(Replayed transaction of size %d at location %ld\n, data.dsize, blk_num); + lseek(fd, location, SEEK_SET); + write(fd, data.dptr, data.dsize); + } + close(fd); + tdb_close(tdb); + +} -- 1.5.3.rc0.63.gc956-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Jul 10, 2007 16:30 -0700, Andrew Morton wrote: +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field) \ + ((offsetof(typeof(*ext4_inode), field) +\ + sizeof((ext4_inode)-field)) \ + = (EXT4_GOOD_OLD_INODE_SIZE + \ + (einode)-i_extra_isize)) \ Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers under what circumstances something will not fit in an inode and what the consequences of this are. /* Extended fields will fit into an inode if the filesystem was formatted * with large inodes (-I 256 or larger) and there are not currently EAs * consuming all of the available space. For new inodes we always reserve * enough space for the kernel's known extended fields, but for inodes * created with an old kernel this might not have been the case. None of * the extended inode fields is critical for correct filesystem operation. */ +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do { \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + ext4_decode_extra_time((inode)-xtime,\ + raw_inode-xtime ## _extra);\ +} while (0) Ugly. I expect these could be implemented as plain old C functions. Caller could pass in the address of the ext4_inode field which the function is to operate upon. We thought about that also, but then the caller needs to do all of the pointer gymnastics themselves like: ext4_inode_get_xtime(inode-i_ctime, inode-i_ctime_extra, raw_inode-i_ctime, raw_inode-i_ctime_extra) instead of the current: EXT4_INODE_GET_XTIME(ctime, inode, raw_inode); IMHO it is preferrable to make the multiple callsites more readable than the macros. #if defined(__KERNEL__) || defined(__linux__) (What's the __linux__ for?) #define i_reserved1osd1.linux1.l_i_reserved1 #define i_frag osd2.linux2.l_i_frag This is actually unrelated to the current patch, just part of the context. AFAIK, this is historical, so that the kernel and e2fsprogs can use the same ext2_fs.h header. I don't think it is really needed, but such cleanup shouldn't be a part of this patch either. +static inline struct timespec ext4_current_time(struct inode *inode) +{ + return (inode-i_sb-s_time_gran NSEC_PER_SEC) ? + current_fs_time(inode-i_sb) : CURRENT_TIME_SEC; +} Now, I've forgotten how this works. Remind me, please. Can ext4 filesystems ever have one-second timestamp granularity? If so, how does one cause that to come about? Yes, this is possible if an ext2/3/4 filesystem is formatted with 128-byte inodes (which is the default for all but ext4) and this fs is mounted as ext4dev. The inodes can never hold the extra time information (FITS_IN_INODE check above) so the superblock limits the timestamp resolution to 1s in that case. @@ -153,6 +153,7 @@ unsigned long i_ext_generation; struct ext4_ext_cache i_cached_extent; + struct timespec i_crtime; }; It is unobvious what this field does. Please prefer to add commentary to _all_ struct fields - it really helps. It is the inode creation time. This is useful for debug/forensic purposes, and at some point there will be an API so that Samba can use it also. #endif /* _LINUX_EXT4_FS_I */ Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h === --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 17:28:15.0 -0700 +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 17:39:05.0 -0700 @@ -79,6 +79,7 @@ char *s_qf_names[MAXQUOTAS];/* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */ OK, I can kind-of see how this is working, but some overall description of how the inode sizing design operates would be helpful. It would certainly make reviewing of this proposed change more fruitful. Perhaps that new comment over EXT4_FITS_IN_INODE() would be a suitable place. Hmm, I'm sure there were emails on the topic, but they aren't attached to the patch. s_want_extra_isize is just an override for sizeof(ext4_inode) in case the sysadmin wants to reserve more fields in new inodes. There is also s_min_extra_isize which is what the kernel and e2fsck guarantee that will be available in all in-use inodes, if RO_COMPAT_EXTRA_ISIZE is set (ro-compat so that older kernels can't create inodes with a smaller extra_isize). That feature is only enabled
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 10, 2007 23:34 -0400, Trond Myklebust wrote: On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: So my vote is to increment i_version in common code every time any change is made to the file, and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. ... but doesn't lustre want to control its i_version... so maybe not :-( If lustre wants to be exportable via pNFS (as Peter Braam has suggested it should), then it had better be able to return a change attribute that is compatible with the NFSv4.1 spec... The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre version can be used to compare the updates of two inodes. It is set to be the Lustre transaction number (which is sequentially incremented on a per filesystem basis), so that we can use the per-inode version to do replay of client operations even if they have been disconnected for a long time, which is why we want to be able to control the actual value. We don't want the version to be updated for e.g. file defragmentation or other similar internal-only changes which need ext4_mark_inode_dirty(). We had a patch to disable ext4 inode versioning by a flag the superblock, but we dropped it at the last minute because it needed some updates and we didn't want to wait on that for submitting these changes upstream. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store
On Jul 10, 2007 16:31 -0700, Andrew Morton wrote: This patch adds 64-bit inode version support to ext4. The lower 32 bits are stored in the osd1.linux1.l_i_version field while the high 32 bits are stored in the i_version_hi field newly created in the ext4_inode. So reading the code here does serve to answer the question I raised against the earlier patch. A bit. I'd have thought that this patch and the one which adds i_version_hi should be folded into a single diff? It could be - the original patch to reserve i_version_hi was submitted to before the patches were ready to avoid that space being used by something else. + if (EXT4_INODE_SIZE(inode-i_sb) EXT4_GOOD_OLD_INODE_SIZE) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) + inode-i_version |= + (__u64)(le32_to_cpu(raw_inode-i_version_hi)) 32; checks the precedence of (type) versus OK + } I don't quite see how the above two tests are sufficient to unambiguously determine that the i_version_hi field is present on-disk. I guess we're implicitly assuming that if the on-disk inode is big enough then it _must_ have i_version_hi in there? If so, why is the comparison with EXT4_GOOD_OLD_INODE_SIZE needed? The GOOT_OLD_INODE_SIZE check is needed to know if i_extra_isize is even present or valid in the on-disk inode. @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t + raw_inode-i_disk_version = cpu_to_le32(inode-i_version); + if (ei-i_extra_isize) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) { There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here... Because this is the in-memory version and it is always valid (set to zero if there is extra space in the on-disk inode). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. You are confusing i_generation (the instance of this inode number) with i_version (whether this file has been modified)? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. Aren't you confusing i_version and i_generation here? Those are two separate inode fields. Cheers Trond - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Jul 10, 2007 16:31 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:53 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a noversion mount option to disable inode version updates. Why is this option being offered to our users? To reduce disk traffic, like noatime? If so, what are the implications of this? What would the user lose? Ah, this is the patch to disable i_version updates for Lustre. I don't think any normal user would use this mount option, so I don't know if there is a need to document it. There are no performance implications, unless we end up changing the mtime granularity JUST to update i_version, in which case we can avoid some overhead if not exporting with NFSv4. If we want to go in the direction of forcing extra inode updates just for this, then we might even consider making i_version updates on disk default to OFF unless NFSv4 has exported the filesystem at least once, and then it should set a persistent flag in the superblock indicating that i_version updates are needed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote: On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. You are confusing i_generation (the instance of this inode number) with i_version (whether this file has been modified)? Yes, sorry. Objection dropped. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block +* with this same handle */ + if ((jbd2_journal_extend(handle, +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete +some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). +static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry, +int value_offs_shift, void *to, +void *from, size_t n, int blocksize) +{ + /* Adjust the value offsets of the entries */ + for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) { + if (!last-e_value_block last-e_value_size) { + new_offs = le16_to_cpu(last-e_value_offs) + + value_offs_shift; + BUG_ON(new_offs + le32_to_cpu(last-e_value_size) + blocksize); + last-e_value_offs = cpu_to_le16(new_offs); + } + } + /* Shift the entries by n bytes */ + memmove(to, from, n); +} Under what circumstances will that new BUG_ON trigger? Can it be triggered via incorrect disk contents? If so, it should not be there. Only under code defect or memory corruption. The needed extra space was already checked in ext4_expand_extra_isize_ea(), but I asked for this BUG_ON() because it would otherwise seem that there was a chance from reading just the above code that the shifted EA might overflow the buffer. +int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize, + struct ext4_inode *raw_inode, handle_t *handle) +{ + down_write(EXT4_I(inode)-xattr_sem); +retry: + if (EXT4_I(inode)-i_extra_isize = new_extra_isize) { + up_write(EXT4_I(inode)-xattr_sem); + return 0; + } So.. xattr_sem is a lock which protects i_extra_isize? That seems a bit strange to me - I'd have expected i_mutex. Well, since this is the only code that ever changes i_extra_isize, and it also needs to move the EAs around, it makes sense to use the EA lock for it. This is also a relatively low-contention lock, and it needs to be held to access any of the EAs (which are the only things beyond i_extra_isize). + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, +total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; +
Re: Initial results of FLEX_BG feature.
On Jul 11, 2007 00:30 -0500, Jose R. Santos wrote: On Tue, 10 Jul 2007 22:12:14 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: You might also want to test out placement of the journal in the middle of the filesystem, the U. Wisconsin folks tested this in one of their papers and showed some noticable improvements. That isn't exactly related, but it is a relatively simple tweak to mke2fs/tune2fs to give it an allocation goal of group_desc[s_groups_count / 2].bg_inode_table (to put it past inode table in middle group). Make sense. Do you have a link to the paper? I don't have a URL, but if you search for IRON ext3 and go to Remzi's site it is called Analysis and Evolution of Journaling File Systems. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Jul 10, 2007 22:40 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote: A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. Would I be correct in assuming that a later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir directories? Correct. If so, that is worth a mention in the changelog, perhaps? Please remind us what is the behaviour of an RO_COMPAT flag? It means that old ext4, ext3 and ext2 can only mount this fs read-only, yes? Also correct. The COMPAT flag behaviour is described in detail in Documentation/filesystems/ext[234].txt +static inline void ext4_inc_count(handle_t *handle, struct inode *inode) +{ + inc_nlink(inode); + if (is_dx(inode) inode-i_nlink 1) { + /* limit is 16-bit i_links_count */ + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) { + inode-i_nlink = 1; + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb, + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); + } + } +} Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2? Because that means it was previously 1 (inc_nlink() was already called). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext3 onlie resize failure due to small journal size
Hi Andreas, Trying to resize a mounted ext3 filesystem fails due to small journal size. Background : The filesystem was created with default values, except blocksize = 4K on a LV partition. Later we tried extended the partition to +16M and tried to resize the fs using resize2fs, while it was mounted. [EMAIL PROTECTED] ~]# resize2fs /dev/mapper/testvg-tmp resize2fs 1.39 (29-May-2006) Filesystem at /dev/mapper/testvg-tmp is mounted on /tmp; on-line resizing required Performing an on-line resize of /dev/mapper/testvg-tmp to 186368 (4k) blocks. resize2fs: No space left on device While trying to add group #4 Analysis : While adding the new blockgroup, inside setup_new_group_blocks() we hit the limit because we are requesting for a a credit value of 2 + sbi-s_itb_per_group which in the case of the file system below is 1026 while the max_transaction credits possible is 1024 for the fs. journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K journal-j_max_transaction_buffers = 1024. # dumpe2fs /dev/mapper/testvg-tmp dumpe2fs 1.39 (29-May-2006) Filesystem volume name: none Last mounted on: not available Filesystem UUID: 7d82f07b-cb22-4c7d-b290-a35749121bbb Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file Default mount options:(none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 131072 Block count: 131072 Reserved block count: 6553 Free blocks: 122753 Free inodes: 131056 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 31 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 32768 Inode blocks per group: 1024 Filesystem created: Mon Jul 9 00:09:23 2007 Last mount time: Tue Jul 10 02:44:44 2007 Last write time: Tue Jul 10 02:44:44 2007 Mount count: 7 Maximum mount count: 33 Last checked: Mon Jul 9 00:09:23 2007 Check interval: 15552000 (6 months) Next check after: Sat Jan 5 00:09:23 2008 Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode:8 Default directory hash: tea Directory Hash Seed: 0a57077d-9778-4f3a-8605-7f0c86f18d8a Journal backup: inode blocks Journal size: 16M Group 0: (Blocks 0-32767) Primary superblock at 0, Group descriptors at 1-1 Reserved GDT blocks at 2-32 Block bitmap at 33 (+33), Inode bitmap at 34 (+34) Inode table at 35-1058 (+35) 27600 free blocks, 32755 free inodes, 2 directories Free blocks: 5168-32767 Free inodes: 14-32768 Group 1: (Blocks 32768-65535) Backup superblock at 32768, Group descriptors at 32769-32769 Reserved GDT blocks at 32770-32800 Block bitmap at 32801 (+33), Inode bitmap at 32802 (+34) Inode table at 32803-33826 (+35) 31708 free blocks, 32767 free inodes, 1 directories Free blocks: 33828-65535 Free inodes: 32770-65536 Group 2: (Blocks 65536-98303) Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1) Inode table at 65538-66561 (+2) 31740 free blocks, 32764 free inodes, 2 directories Free blocks: 66562-83967, 83969-94207, 94209-98303 Free inodes: 65541-98304 Group 3: (Blocks 98304-131071) Backup superblock at 98304, Group descriptors at 98305-98305 Reserved GDT blocks at 98306-98336 Block bitmap at 98337 (+33), Inode bitmap at 98338 (+34) Inode table at 98339-99362 (+35) 31704 free blocks, 32766 free inodes, 1 directories Free blocks: 99363-126976, 126982-131071 Free inodes: 98305, 98308-131072 Is this a supported operation ? If yes, what could be the best way to fix it ? Resizing the journal is not supported at the moment :(. Thoughts ? Thanks Suzuki IBM Linux Technology Centre - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:53 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a noversion mount option to disable inode version updates. Why is this option being offered to our users? To reduce disk traffic, like noatime? If so, what are the implications of this? What would the user lose? This has been removed in the latest patch set; it's needed only for Lustre, because they set the version field themselves. Lustre needs the inode version to be globally monotonically increasing, so it can order updates between two different files, so it does this itself. NFSv4 only uses i_version to detect changes, and so there's no need to use a global atomic counter for i_version. So the thinking was that there was no point doing the global atomic counter if it was not necessary. Since noversion is Lustre specific, we've dropped that from the list of patches that we'll push, and so the inode version will only have local per-inode significance, and not have any global ordering properties. We have not actually benchmarked whether or not doing the global ordering actually *matters* in terms of being actually noticeable. If it isn't noticeable, I wouldn't mind changing things so that we always make i_version globally significant (without a mount option), and make life a bit easier for the Lustre folks. Or if someone other distributed filesystem requests a globally significant i_version. But we can cross that bridge when we get to it - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote: And just by-the-way, the server doesn't really have the option of not sending the attribute. If i_version isn't defined, it has to fake something using mtime, and hope that is good enough. ctime, actually--the change attribute is also supposed to be updated on attribute updates. Alternately we could mandate that i_version is always kept up-to-date and if a filesystem doesn't have anything to load from storage, it just sets it to the current time in nanoseconds. That would mean that a client would need to flush it's cache whenever the inode fell out of cache on the server, but I don't think we can reliably do better than that. I think I like that approach. So my vote is to increment i_version in common code every time any change is made to the file, and alloc_inode should initialise it to current time, which might be changed by the filesystem before it calls unlock_new_inode. So the client would be invalidating its cache more often than necessary, rather than failing to invalidate it when it should. I agree that that's probably the better tradeoff, although I wish I had a better idea of the downside. I don't know, for example, whether users might see unpleasant results if every client has to reread its cached data on a reboot. The currently proposed change--just providing a model change attribute implementation for ext4 and leaving other filesystems untouched--is a more conservative step. So I'm inclined to just do this ext4 thing first, and then look into further change attribute experiments next time around --b. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: - /* AKPM: buglet - add `i' to tmp! */ Damn. After, what, seven years, someone actually fixed it? for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + struct commit_header *tmp = + (struct commit_header *)(bh-b_data + i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + if (JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM)) { + tmp-h_chksum_type = JBD2_CRC32_CHKSUM; + tmp-h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; + tmp-h_chksum[0]= cpu_to_be32(crc32_sum); + } } And in doing so, changed the on-disk format of the journal commit blocks. Surely this was worth a mention in the changelog, if not a standalone patch? I don't think this is worth doing, really. Why not just leave the format as it was, take the loop out and run this code once rather than eight times? Well, we aren't using the rest of the commit block in any case. I think the original intention was that we'd get 8 copies of the commit block so we would be sure to get a good one. I don't know whether we'd rather have 8 copies of the commit block, or more potential to expand the commit block? I don't personally have any preference, since the checksum should be a more robust way of checking validity than having multiple copies, so we may as well remove the loop and stick with a single copy for now. We've never altered any commit block sectors apart from the zeroeth one (eight times) due to the above bug. So I'd suggest that we should formalise the old bug and leave the format as-is. That'll leave lots of space spare in the commit block. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block + * with this same handle */ + if ((jbd2_journal_extend(handle, + EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete + some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, + total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; + new_extra_isize = s_min_extra_isize; Aren't we missing a brelse(bh) here? Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Random corruption test for e2fsck
On Wed, Jul 11, 2007 at 03:44:11AM -0600, Andreas Dilger wrote: I've already found some kind of memory corruption in e2fsck as a result of running this as a regular user. It segfaults in qsort() when freeing memory. The image that causes this problem is attached, and it happens with the unpatched 1.39-WIP Mercurial tree of 2007-05-22. Unfortunately, I don't have any decent memory debugging tools handy, so it isn't easy to see what is happening. This is on an FC3 i686 system, in case it matters. Thanks for sending me the test case! Here's the patch, which will probably cause me to do a 1.40.2 release sooner rather than later... - Ted commit 5e9ba85c2694926eb784531d81ba107200cf1a51 Author: Theodore Ts'o [EMAIL PROTECTED] Date: Wed Jul 11 13:42:43 2007 -0400 Fix e2fsck segfault on very badly damaged filesystems A recent change to e2fsck_add_dir_info() to use tdb files to check filesystems with a very large number of filesystems had a typo which caused us to resize the wrong data structure. This would cause a array overrun leading to malloc pointer corruptions. Since we normally can very accurately predict how big the the dirinfo array needs to be, this bug only got triggered on very badly corrupted filesystems. Thanks to Andreas Dilger for submitting the test case which discovered this problem, and to Kalpak Shah for writing a random testing script which created the test case. Signed-off-by: Theodore Ts'o [EMAIL PROTECTED] diff --git a/e2fsck/dirinfo.c b/e2fsck/dirinfo.c index aaa4d09..f583c62 100644 --- a/e2fsck/dirinfo.c +++ b/e2fsck/dirinfo.c @@ -126,7 +126,7 @@ void e2fsck_add_dir_info(e2fsck_t ctx, ext2_ino_t ino, ext2_ino_t parent) ctx-dir_info-size += 10; retval = ext2fs_resize_mem(old_size, ctx-dir_info-size * sizeof(struct dir_info), - ctx-dir_info); + ctx-dir_info-array); if (retval) { ctx-dir_info-size -= 10; return; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] --- fs/ext4/super.c |7 7 + 0 - 0 ! 1 file changed, 7 insertions(+) Index: linux-2.6/fs/ext4/super.c === --- linux-2.6.orig/fs/ext4/super.c 2007-07-11 09:36:00.0 -0500 +++ linux-2.6/fs/ext4/super.c 2007-07-11 09:36:20.0 -0500 @@ -1804,6 +1804,13 @@ static int ext4_fill_super (struct super goto failed_mount3; } + if (ext4_blocks_count(es) 0xULL + !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_64BIT)) { + printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n); + goto failed_mount4; + } + /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
+ jbd2-commit-fix-transaction-dropping.patch added to -mm tree
The patch titled jbd2 commit: fix transaction dropping has been added to the -mm tree. Its filename is jbd2-commit-fix-transaction-dropping.patch *** Remember to use Documentation/SubmitChecklist when testing your code *** See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this -- Subject: jbd2 commit: fix transaction dropping From: Jan Kara [EMAIL PROTECTED] We have to check that also the second checkpoint list is non-empty before dropping the transaction. Signed-off-by: Jan Kara [EMAIL PROTECTED] Cc: Chuck Ebbert [EMAIL PROTECTED] Cc: Kirill Korotaev [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Cc: [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- fs/jbd2/commit.c |3 ++- 1 files changed, 2 insertions(+), 1 deletion(-) diff -puN fs/jbd2/commit.c~jbd2-commit-fix-transaction-dropping fs/jbd2/commit.c --- a/fs/jbd2/commit.c~jbd2-commit-fix-transaction-dropping +++ a/fs/jbd2/commit.c @@ -896,7 +896,8 @@ restart_loop: journal-j_committing_transaction = NULL; spin_unlock(journal-j_state_lock); - if (commit_transaction-t_checkpoint_list == NULL) { + if (commit_transaction-t_checkpoint_list == NULL + commit_transaction-t_checkpoint_io_list == NULL) { __jbd2_journal_drop_transaction(journal, commit_transaction); } else { if (journal-j_checkpoint_transactions == NULL) { _ Patches currently in -mm which might be from [EMAIL PROTECTED] are jbd-commit-fix-transaction-dropping.patch jbd2-commit-fix-transaction-dropping.patch ext2-fix-a-comment-when-ext2_release_file-is-called.patch ext3-fix-deadlock-in-ext3_remount-and-orphan-list-handling.patch ext4-fix-deadlock-in-ext4_remount-and-orphan-list-handling.patch - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names that are more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. Instead of fixing this on procfs might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd2/jbd2-debug. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] --- fs/Kconfig | 105 + 5 - 0 ! fs/jbd2/journal.c| 6727 +40 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 3 files changed, 33 insertions(+), 46 deletions(-) Index: linux-2.6/fs/jbd2/journal.c === --- linux-2.6.orig/fs/jbd2/journal.c2007-07-11 09:46:25.0 -0500 +++ linux-2.6/fs/jbd2/journal.c 2007-07-11 11:31:30.0 -0500 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1951,64 +1952,50 @@ void jbd2_journal_put_journal_head(struc } /* - * /proc tunables + * debugfs tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u8 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) -static struct proc_dir_entry *proc_jbd_debug; +#define JBD2_DEBUG_NAME jbd2-debug -static int read_jbd_debug(char *page, char **start, off_t off, - int count, int *eof, void *data) -{ - int ret; +struct dentry *jbd2_debugfs_dir, *jbd2_debug; - ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug); - *eof = 1; - return ret; +static void __init jbd2_create_debugfs_entry(void) +{ + jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL); + if (jbd2_debugfs_dir) + jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO, + jbd2_debugfs_dir, + jbd2_journal_enable_debug); } -static int write_jbd_debug(struct file *file, const char __user *buffer, - unsigned long count, void *data) +static void __exit jbd2_remove_debugfs_entry(void) { - char buf[32]; - - if (count ARRAY_SIZE(buf) - 1) - count = ARRAY_SIZE(buf) - 1; - if (copy_from_user(buf, buffer, count)) - return -EFAULT; - buf[ARRAY_SIZE(buf) - 1] = '\0'; - jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10); - return count; + if (jbd2_debug) + debugfs_remove(jbd2_debug); + if (jbd2_debugfs_dir) + debugfs_remove(jbd2_debugfs_dir); } -#define JBD_PROC_NAME sys/fs/jbd2-debug +#else -static void __init create_jbd_proc_entry(void) +static void __init jbd2_create_debugfs_entry(void) { - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL); - if (proc_jbd_debug) { - /* Why is this so hard? */ - proc_jbd_debug-read_proc = read_jbd_debug; - proc_jbd_debug-write_proc = write_jbd_debug; - } + do { + } while (0); } -static void __exit jbd2_remove_jbd_proc_entry(void) +static void __exit jbd2_remove_debugfs_entry(void) { - if (proc_jbd_debug) - remove_proc_entry(JBD_PROC_NAME, NULL); + do { + } while (0); } -#else - -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) - #endif struct kmem_cache *jbd2_handle_cache; @@ -2067,7 +2054,7 @@ static int __init journal_init(void) ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2065,7 @@ static void __exit journal_exit(void) if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6/include/linux/jbd2.h === --- linux-2.6.orig/include/linux/jbd2.h 2007-07-11 09:46:25.0 -0500 +++ linux-2.6/include/linux/jbd2.h 2007-07-11 10:37:06.0 -0500 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING -extern int jbd2_journal_enable_debug; +extern u8 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ do {\ Index: linux-2.6/fs/Kconfig === ---
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 2007-07-11 at 10:34 -0700, Andrew Morton wrote: On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block +* with this same handle */ + if ((jbd2_journal_extend(handle, +EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete +some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, +total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; + new_extra_isize = s_min_extra_isize; Aren't we missing a brelse(bh) here? Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. I will send a patch to add the comments and make the suggested corrections. Thanks, Kalpak. To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 onlie resize failure due to small journal size
On Wed, 2007-07-11 at 19:30 +0530, Suzuki wrote: Hi Andreas, Trying to resize a mounted ext3 filesystem fails due to small journal size. A similar problem was discussed in an earlier thread JBD: ext2online wants too many credits (744 256). The problem here is that the maximum allowed transaction size is 4MB while a bigger transaction is required to resize the filesystem. In your case, you can delete the journal using tune2fs -O ^has_journal dev and add a bigger journal tune2fs -J size=16M dev. I think this will solve your problem. Thanks, Kalpak. Background : The filesystem was created with default values, except blocksize = 4K on a LV partition. Later we tried extended the partition to +16M and tried to resize the fs using resize2fs, while it was mounted. [EMAIL PROTECTED] ~]# resize2fs /dev/mapper/testvg-tmp resize2fs 1.39 (29-May-2006) Filesystem at /dev/mapper/testvg-tmp is mounted on /tmp; on-line resizing required Performing an on-line resize of /dev/mapper/testvg-tmp to 186368 (4k) blocks. resize2fs: No space left on device While trying to add group #4 Analysis : While adding the new blockgroup, inside setup_new_group_blocks() we hit the limit because we are requesting for a a credit value of 2 + sbi-s_itb_per_group which in the case of the file system below is 1026 while the max_transaction credits possible is 1024 for the fs. journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K journal-j_max_transaction_buffers = 1024. # dumpe2fs /dev/mapper/testvg-tmp dumpe2fs 1.39 (29-May-2006) Filesystem volume name: none Last mounted on: not available Filesystem UUID: 7d82f07b-cb22-4c7d-b290-a35749121bbb Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file Default mount options:(none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 131072 Block count: 131072 Reserved block count: 6553 Free blocks: 122753 Free inodes: 131056 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 31 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 32768 Inode blocks per group: 1024 Filesystem created: Mon Jul 9 00:09:23 2007 Last mount time: Tue Jul 10 02:44:44 2007 Last write time: Tue Jul 10 02:44:44 2007 Mount count: 7 Maximum mount count: 33 Last checked: Mon Jul 9 00:09:23 2007 Check interval: 15552000 (6 months) Next check after: Sat Jan 5 00:09:23 2008 Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode:8 Default directory hash: tea Directory Hash Seed: 0a57077d-9778-4f3a-8605-7f0c86f18d8a Journal backup: inode blocks Journal size: 16M Group 0: (Blocks 0-32767) Primary superblock at 0, Group descriptors at 1-1 Reserved GDT blocks at 2-32 Block bitmap at 33 (+33), Inode bitmap at 34 (+34) Inode table at 35-1058 (+35) 27600 free blocks, 32755 free inodes, 2 directories Free blocks: 5168-32767 Free inodes: 14-32768 Group 1: (Blocks 32768-65535) Backup superblock at 32768, Group descriptors at 32769-32769 Reserved GDT blocks at 32770-32800 Block bitmap at 32801 (+33), Inode bitmap at 32802 (+34) Inode table at 32803-33826 (+35) 31708 free blocks, 32767 free inodes, 1 directories Free blocks: 33828-65535 Free inodes: 32770-65536 Group 2: (Blocks 65536-98303) Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1) Inode table at 65538-66561 (+2) 31740 free blocks, 32764 free inodes, 2 directories Free blocks: 66562-83967, 83969-94207, 94209-98303 Free inodes: 65541-98304 Group 3: (Blocks 98304-131071) Backup superblock at 98304, Group descriptors at 98305-98305 Reserved GDT blocks at 98306-98336 Block bitmap at 98337 (+33), Inode bitmap at 98338 (+34) Inode table at 98339-99362 (+35) 31704 free blocks, 32766 free inodes, 1 directories Free blocks: 99363-126976, 126982-131071 Free inodes: 98305, 98308-131072 Is this a supported operation ? If yes, what could be the best way to fix it ? Resizing the journal is not supported at the moment :(. Thoughts ? Thanks Suzuki IBM Linux Technology Centre - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote: On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version) That's correct. (Though it's the client that's doing the comparison, actually--the server is just reporting the value.) then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? The choice of upper word in your example is arbitrary, but other than that I believe your example is essentially the only one. So this would only happen when *both* - the read of the new value of the low word happens precisely 2^32 i_version updates after the word was read on the client's previous cache revalidation, and - the value of i_version itself is close enough to a 32-bit boundary that wraparound can happen between the reads of the high and low words. (I don't understand why i_version even needs to be 64 bits in the first place.) A 32-bit i_version could in theory wrap pretty quickly, couldn't it? That's not a problem in itself--the problem would only arise if two subsequent client queries of the change attribute happened a multiple of 2^32 i_version increments apart. This is more likely than the previous scenario, but still very unlikely. I would have guessed that even in situations with a very high rate of updates and a low rate of client revalidations, the chance of two revalidations happening exactly 2^32 updates apart would still be no more than 1 in 2^32. (Could odd characteristics of the workloads (like updates that tend to happen in power-of-2 groups?) make it any more likely?) I'd be happier if ext4 at least allowed the possibility of 64 bits in the future. And there's always the chance someone would find a use for an i_version that was nondecreasing, even if nfs didn't care. Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? If the client always requests the change attribute before reading, and the i_version is always updated after data is modified, I think we're OK. Admittedly this is a little subtle. --b. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Initial results of FLEX_BG feature.
On Wed, Jul 11, 2007 at 12:30:04AM -0500, Jose R. Santos wrote: i think in the spirit of the original META_BG option, Ted had wanted to put all the bitmaps from EXT4_DESC_PER_BLOCK groups somewhere within the metagroup. It would also be interesting to see if moving ALL of the group metadata to a single location in the filesystem makes a bit difference. If not, then we may as well keep it spread out for safety. My original intention was that META_BG would place the bitmaps and inode tables at the beginning of each metagroup by default, but that the constraints about where to put the bitmaps and inode tables would be completely relaxed from the point of view of requirements by the kernel and e2fsck. Unfortunately while I had patches which removed the constraints checking, they never made it into mainline of either the kernel or e2fsprogs, sigh. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Initial results of FLEX_BG feature.
On Wed, Jul 11, 2007 at 12:30:04AM -0500, Jose R. Santos wrote: Right now what I've done is allocate the bitmaps and inode tables at the beginning of each group of 64 BG. Still need to work on fsck since just removing the restriction on were the bitmaps and inode table are located still gives me errors of uninitialized inodes with dtime set. Seems like fsck still expect inode information to be located at specific locations within the disk. Can you send me the patch which you were playing with? I might be able to help you with this. It should be pretty straightforward to remove the constraint on the inode table location. It really should only be a check in e2fsck/super.c:check_super_block(), as far as I know. If you're seeing errors of unitialized inodes with dtime set, that sounds like maybe something else is going on. All of e2fsprogs should be referencing the inode table via fs-group_desc[group_num].bg_inode_table. See lib/ext2fs/inode.c, functions ext2fs_open_inode_scan(), get_next_blockgroup(), and ext2fs_read_inode_full(). - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) Okay. - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing procedure. It allocates a new index block, moves the top-level data of the tree(root or leaf blocks in i_data) into the new block, initializes the new root, creating index that points to the just created index block The original top-level data (in i_data) could be extent tree root (index block) or extents (leaf block). The current code (without the patch) treats the top-level data always be the leaf block, which is incorrect. assumes when the tree is growing the extent structure pass in is always @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 onlie resize failure due to small journal size
On Jul 11, 2007 19:30 +0530, Suzuki wrote: Trying to resize a mounted ext3 filesystem fails due to small journal size. Background : The filesystem was created with default values, except blocksize = 4K on a LV partition. Later we tried extended the partition to +16M and tried to resize the fs using resize2fs, while it was mounted. While adding the new blockgroup, inside setup_new_group_blocks() we hit the limit because we are requesting for a a credit value of 2 + sbi-s_itb_per_group which in the case of the file system below is 1026 while the max_transaction credits possible is 1024 for the fs. journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K journal-j_max_transaction_buffers = 1024. Is this a supported operation ? If yes, what could be the best way to fix it ? Resizing the journal is not supported at the moment :(. You can't do a journal resize online, but you can wait until your next outage and resize the journal at that time. Even a few extra blocks would be enough. I guess this is a corner case that hasn't been hit before. It might make sense to have the ext2fs_figure_journal_size() take this into account when making the filesystem? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 onlie resize failure due to small journal size
Andreas Dilger wrote: On Jul 11, 2007 19:30 +0530, Suzuki wrote: Trying to resize a mounted ext3 filesystem fails due to small journal size. Background : The filesystem was created with default values, except blocksize = 4K on a LV partition. Later we tried extended the partition to +16M and tried to resize the fs using resize2fs, while it was mounted. While adding the new blockgroup, inside setup_new_group_blocks() we hit the limit because we are requesting for a a credit value of 2 + sbi-s_itb_per_group which in the case of the file system below is 1026 while the max_transaction credits possible is 1024 for the fs. journal-j_maxlen = inode-i_size / blocksize = 16M/4K = 4K journal-j_max_transaction_buffers = journal-j_maxlen / 4 = 1K journal-j_max_transaction_buffers = 1024. Is this a supported operation ? If yes, what could be the best way to fix it ? Resizing the journal is not supported at the moment :(. You can't do a journal resize online, but you can wait until your next outage and resize the journal at that time. Even a few extra blocks would be enough. I guess this is a corner case that hasn't been hit before. It might make sense to have the ext2fs_figure_journal_size() take this into account when making the filesystem? That't true. I was looking at it. I guess we should make sure we can ask for a credit same as inode tables block per group + some extra. Will try to see i can cook a patch. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Random corruption test for e2fsck
On Jul 11, 2007 13:43 -0400, Theodore Tso wrote: Fix e2fsck segfault on very badly damaged filesystems --- a/e2fsck/dirinfo.c +++ b/e2fsck/dirinfo.c @@ -126,7 +126,7 @@ void e2fsck_add_dir_info(e2fsck_t ctx, ext2_ino_t ino, ext2_ino_t parent) ctx-dir_info-size += 10; retval = ext2fs_resize_mem(old_size, ctx-dir_info-size * sizeof(struct dir_info), -ctx-dir_info); +ctx-dir_info-array); if (retval) { ctx-dir_info-size -= 10; return; This appears to fix the problem. I was previously able to crash e2fsck within a couple of runs, now it is running in a loop w/o problems. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Random corruption test for e2fsck
I've got another one, but it isn't a show stopper I think. If you format a filesystem with both resize_inode and meta_bg you get an unfixable filesystem. The bad news is that it appears that running e2fsck on the filesystem is actually _causing_ the corruption in this case (trying to rebuild the resize inode)? For now, the answer is don't do that. The resize inode was never intended to be in use when meta_bg is enabled, so we should just prevent this from the start. mke2fs -j -b 4096 -I 512 -O sparse_super,filetype,resize_inode,dir_index,lazy_bg,meta_bg -F /tmp/test.img 57852 I can run e2fsck on this repeatedly and it always complains the same way: $ e2fsck -fy /tmp/test.img e2fsck 1.39.cfs9 (7-Apr-2007) Resize inode not valid. Recreate? yes Pass 1: Checking inodes, blocks, and sizes Inode 8, i_blocks is 0, should be 32816. Fix? yes Reserved inode 9 (Reserved inode 9) has invalid mode. Clear? yes Deleted inode 17 has zero dtime. Fix? yes Deleted inode 25 has zero dtime. Fix? yes Deleted inode 33 has zero dtime. Fix? yes Deleted inode 41 has zero dtime. Fix? yes Deleted inode 49 has zero dtime. Fix? yes Deleted inode 57 has zero dtime. Fix? yes Deleted inode 65 has zero dtime. Fix? yes Deleted inode 73 has zero dtime. Fix? yes Deleted inode 81 has zero dtime. Fix? yes Deleted inode 89 has zero dtime. Fix? yes Pass 2: Checking directory structure Inode 2 (???) has invalid mode (00). Clear? yes Entry '..' in ??? (2) has deleted/unused inode 2. Clear? yes Inode 11 (???) has invalid mode (00). Clear? yes Pass 3: Checking directory connectivity Root inode not allocated. Allocate? yes /lost+found not found. Create? yes Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: +0 +(2--14) +(16--3619) +(3621--3622) +(3625--7727) Fix? yes Free blocks count wrong for group #0 (25039, counted=25041). Fix? yes Free blocks count wrong (46503, counted=46505). Fix? yes Inode bitmap differences: +(3--10) -16 Fix? yes Free inodes count wrong for group #0 (28918, counted=28917). Fix? yes Directories count wrong for group #0 (3, counted=2). Fix? yes Free inodes count wrong (57846, counted=57845). Fix? yes /tmp/test.img: * FILE SYSTEM WAS MODIFIED * /tmp/test.img: 11/57856 files (9.1% non-contiguous), 11347/57852 blocks Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html