Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Tue 06-11-07 10:04:47, H. Peter Anvin wrote:
> Arjan van de Ven wrote:
> > On Tue, 6 Nov 2007 18:19:45 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:
> > > Implement recursive mtime (rtime) feature for ext3. The feature works
> > > as follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable
> > > by a user) determining whether rtime should be updated. In case a
> > > directory or a file in it is modified and the flag is set, the
> > > directory's rtime is updated, the flag is cleared, and we move to the
> > > parent. If the flag is set there, we clear it, update rtime, and
> > > continue upwards up to the root of the filesystem. In case a regular
> > > file or symlink is modified, we pick an arbitrary one of its parents
> > > (actually the one that happens to be at the head of the i_dentry list)
> > > and start the rtime update algorithm there.
> >
> > Ok, since mtime (and rtime) are part of the inode and not the dentry...
> > how do you deal with hardlinks? And with cases of files that have been
> > unlinked? (ok, the latter is a wash obviously other than not crashing)

Unlinked files are easy - you just don't propagate the rtime anywhere.
For hardlinks see below.

> There is only one possible answer... he only updates the directory path
> that was used to touch the particular file involved. Thus, the semantics
> get grotty not just in the presence of hard links, but also in the
> presence of bind- and other non-root mounts.

Update of recursive mtime does not cross filesystem boundaries (i.e.
mountpoints), so bind mounts and such are a non-issue (hmm, at least that
was my original idea, but as I'm looking now I don't handle bind mounts
properly, so that needs to be fixed). With hardlinks, you are right that
the behaviour is nondeterministic - I tried to argue in the text of the
mail that this does not actually matter: there are not many hardlinks on a
usual system, so an application can check hardlinked files the old way,
i.e. look at mtime.
								Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Tue 06-11-07 18:01:00, Al Viro wrote:
> On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
> > Implement recursive mtime (rtime) feature for ext3. The feature works
> > as follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable
> > by a user) determining whether rtime should be updated. In case a
> > directory or a file in it is modified and the flag is set, the
> > directory's rtime is updated, the flag is cleared, and we move to the
> > parent. If the flag is set there, we clear it, update rtime, and
> > continue upwards up to the root of the filesystem. In case a regular
> > file or symlink is modified, we pick an arbitrary one of its parents
> > (actually the one that happens to be at the head of the i_dentry list)
> > and start the rtime update algorithm there.
>
> *e* Nothing like nondeterministic behaviour, is there?

Oh yes, there is :) But I tried to argue it does not really matter - an
application would have to handle hardlinks in a special way, but I find
that acceptable given how rare they are...

> > Intended use case is that an application which wants to watch any
> > modification in a subtree scans the subtree and sets the flags for all
> > inodes there. Next time, it just needs to recurse into directories
> > having rtime newer than the start of the previous scan. There it can
> > handle modifications and set the flag again. It is up to the
> > application to watch out for hardlinked files. It can e.g. build a
> > list of them and check their mtime separately (when a hardlink to a
> > file is created, its inode is modified and rtimes are properly
> > updated, and thus any application has an effective way of finding new
> > hardlinked files).
>
> You know, you can do that with aush^H^Hdit right now...

Interesting idea - no, I have not thought about this. I guess you mean
watching all the VFS modification events and then doing the checking and
propagation from user space... My first feeling is that the performance
penalty would be considerably higher (currently I am at a 1% performance
penalty for a quite pessimistic test case), but in case the current patch
is considered unacceptable, I can try to measure how large the penalty
would be. Thanks for the suggestion.

								Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize
Hello,

  sorry for replying to myself, but I've just found out that the patch I
sent was an old version which had some problems. Attached is a new
version.

On Tue 06-11-07 12:31:42, Jan Kara wrote:
>   it seems the attached patch still did not get your attention. It makes
> e2fsprogs properly handle filesystems with 64KB block size. Could you
> put it into the e2fsprogs git? Thanks.

								Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR

Subject: Support for 64KB blocksize in ext2-4 directories.

When the block size is 64KB, we have to take care that rec_len does not
overflow its 16-bit on-disk field. The kernel stores 0xFFFF in case
0x10000 should be stored - perform the appropriate conversion when
reading from / writing to disk.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c
index fb20fa0..db73edd 100644
--- a/lib/ext2fs/dirblock.c
+++ b/lib/ext2fs/dirblock.c
@@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi
 		dirent = (struct ext2_dir_entry *) p;
 #ifdef WORDS_BIGENDIAN
 		dirent->inode = ext2fs_swab32(dirent->inode);
-		dirent->rec_len = ext2fs_swab16(dirent->rec_len);
 		dirent->name_len = ext2fs_swab16(dirent->name_len);
 #endif
+		dirent->rec_len = ext2fs_rec_len_from_disk(dirent->rec_len);
 		name_len = dirent->name_len;
 #ifdef WORDS_BIGENDIAN
 		if (flags & EXT2_DIRBLOCK_V2_STRUCT)
@@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil
 errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block,
 				  void *inbuf, int flags EXT2FS_ATTR((unused)))
 {
-#ifdef WORDS_BIGENDIAN
 	errcode_t	retval;
 	char		*p, *end;
 	char		*buf = 0;
 	struct ext2_dir_entry *dirent;
 
+#ifndef WORDS_BIGENDIAN
+	if (fs->blocksize < EXT2_MAX_REC_LEN)
+		goto just_write;
+#endif
 	retval = ext2fs_get_mem(fs->blocksize, &buf);
 	if (retval)
 		return retval;
@@ -88,19 +91,18 @@ errcode_t ext2fs_write_dir_block2(ext2_f
 			return (EXT2_ET_DIR_CORRUPTED);
 		}
 		p += dirent->rec_len;
+		dirent->rec_len = ext2fs_rec_len_to_disk(dirent->rec_len);
+#ifdef WORDS_BIGENDIAN
 		dirent->inode = ext2fs_swab32(dirent->inode);
-		dirent->rec_len = ext2fs_swab16(dirent->rec_len);
-		dirent->name_len = ext2fs_swab16(dirent->name_len);
-
-		if (flags & EXT2_DIRBLOCK_V2_STRUCT)
+		if (!(flags & EXT2_DIRBLOCK_V2_STRUCT))
 			dirent->name_len = ext2fs_swab16(dirent->name_len);
+#endif
 	}
 	retval = io_channel_write_blk(fs->io, block, 1, buf);
 	ext2fs_free_mem(&buf);
 	return retval;
-#else
+just_write:
 	return io_channel_write_blk(fs->io, block, 1, (char *) inbuf);
-#endif
 }
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..21747c2 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -717,6 +718,32 @@ struct ext2_dir_entry_2 {
 #define EXT2_DIR_ROUND			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND) & \
 					 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN		((1<<16)-1)
+
+static inline unsigned ext2fs_rec_len_from_disk(unsigned len)
+{
+#ifdef WORDS_BIGENDIAN
+	len = ext2fs_swab16(len);
+#endif
+	if (len == EXT2_MAX_REC_LEN)
+		return 1 << 16;
+	return len;
+}
+
+static inline unsigned ext2fs_rec_len_to_disk(unsigned len)
+{
+	if (len == (1 << 16))
+#ifdef WORDS_BIGENDIAN
+		return ext2fs_swab16(EXT2_MAX_REC_LEN);
+#else
+		return EXT2_MAX_REC_LEN;
+#endif
+#ifdef WORDS_BIGENDIAN
+	return ext2fs_swab16(len);
+#else
+	return len;
+#endif
+}
 
 /*
  * This structure will be used for multiple mount protection. It will be
diff --git a/misc/e2image.c b/misc/e2image.c
index 1fbb267..4e2c9fb 100644
--- a/misc/e2image.c
+++ b/misc/e2image.c
@@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils
 	end = buf + fs->blocksize;
 	for (p = buf; p < end-8; p += rec_len) {
 		dirent = (struct ext2_dir_entry_2 *) p;
-		rec_len = dirent->rec_len;
-#ifdef WORDS_BIGENDIAN
-		rec_len = ext2fs_swab16(rec_len);
-#endif
+		rec_len = ext2fs_rec_len_from_disk(dirent->rec_len);
 #if 0
 		printf("rec_len = %d, name_len = %d\n", rec_len,
 		       dirent->name_len);
 #endif
@@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils
 				"bad rec_len (%d)\n", (unsigned long) blk,
 				rec_len);
 			rec_len = end - p;
-#ifdef WORDS_BIGENDIAN
-			dirent->rec_len = ext2fs_swab16(rec_len);
-#endif
+			dirent->rec_len = ext2fs_rec_len_to_disk(rec_len);
 			continue;
 		}
 		if (dirent->name_len + 8 > rec_len) {
Re: [PATCH] allow tune2fs to set/clear resize_inode
On Nov 06, 2007  13:51 -0500, Theodore Tso wrote:
> On Tue, Nov 06, 2007 at 09:12:55AM +0800, Andreas Dilger wrote:
> > What is needed is an ext2prepare-like step that invokes resize2fs code
> > to move the file/dir blocks and then move the inode table, as if the
> > filesystem were going to be resized to the new maximum resize limit,
> > and then create the resize inode but not actually add new blocks/groups
> > at the end of the filesystem.
>
> Yeah, the plan was to eventually add ext2prepare-like code into tune2fs,
> using the undo I/O manager for safety. But that's been relatively low
> priority.
>
> BTW, I've gotten ~2 bug reports from Debian users claiming that
> ext2prepare had trashed their filesystem. I don't have any clean
> evidence about whether it was a userspace error or some kind of bug in
> ext2prepare, possibly conflicting with some new ext3 feature that we've
> since added that ext2prepare doesn't properly account for (extended
> attributes, maybe?). I have not had time to look into it, but the
> thought has crossed my mind that a quick hack would be to splice the
> undo manager into ext2prepare, have it run e2fsck, and if it fails, do a
> rollback, create an e2image file, and then instruct the user to send in
> a bug report. :-)

I don't think it would be very easy to splice the undo manager into
ext2prepare. I'd rather see the time spent making resize2fs handle the
prepare functionality, so that ext2resize can be obsoleted entirely.

Aneesh, adding the undo manager to resize2fs would be an excellent use of
that library, and going from "resize2fs" to "resize2fs --prepare-only"
(or whatever) would be trivial, I think. Is that something you'd be
interested to work on?

Cheers, Andreas
-- 
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: delalloc fragmenting files?
Andreas Dilger wrote:
> On Nov 06, 2007  13:54 -0600, Eric Sandeen wrote:
> > Hmm bad news is when I add uninit_groups into the mix, it goes a little
> > south again, with some out-of-order extents. Not the end of the world,
> > but a little unexpected?
>
> I think part of the issue is that by default the groups marked
> BLOCK_UNINIT are skipped, to avoid dirtying those groups if they have
> never been used before. This policy could be changed in the mballoc code
> pretty easily if you think it is a net loss.
>
> Note that the size of the extents is large enough (120MB or more) that
> some small reordering is probably not going to affect the performance in
> any meaningful way.

You're probably right; on the other hand, this is about the simplest test
an allocator could wish for - a single-threaded large linear write in big
IO chunks. In this case it's probably not a big deal; I do wonder how it
might affect the bigger picture, though, with more writing threads, aged
filesystems, and the like.

Just thought it was worth pointing out, as I started looking at allocator
behavior in the simple/isolated/unrealistic :) cases.

-Eric
Re: delalloc fragmenting files?
On Nov 06, 2007  13:54 -0600, Eric Sandeen wrote:
> Hmm bad news is when I add uninit_groups into the mix, it goes a little
> south again, with some out-of-order extents. Not the end of the world,
> but a little unexpected?
>
> Discontinuity: Block 1430784 is at 24183810 (was 24181761)
> Discontinuity: Block 1461760 is at 24216578 (was 24214785)
> Discontinuity: Block 1492480 is at 37888 (was 24247297)
> Discontinuity: Block 1519616 is at 850944 (was 65023)
> Discontinuity: Block 1520640 is at 883712 (was 851967)
> Discontinuity: Block 1521664 is at 1670144 (was 884735)
> Discontinuity: Block 1522688 is at 2685952 (was 1671167)
> Discontinuity: Block 1523712 is at 4226048 (was 2686975)
> Discontinuity: Block 1524736 is at 11271168 (was 4227071)
> Discontinuity: Block 1525760 is at 23952384 (was 11272191)

I think part of the issue is that by default the groups marked
BLOCK_UNINIT are skipped, to avoid dirtying those groups if they have
never been used before. This policy could be changed in the mballoc code
pretty easily if you think it is a net loss.

Note that the size of the extents is large enough (120MB or more) that
some small reordering is probably not going to affect the performance in
any meaningful way.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [PATCH] fix oops on corrupted ext4 mount
Eric Sandeen wrote:
> When mounting an ext4 filesystem with corrupted s_first_data_block,
> things can go very wrong and oops. Because blocks_count in
> ext4_fill_super is a u64, and we must use do_div, the calculation of
> db_count is done differently than on ext4.

Urgh... "than on ext3"

-Eric
More testing: 4x parallel 2G writes, sequential reads
I tried ext4 vs. xfs doing 4 parallel 2G IO writes in 1M units to 4
different subdirectories of the root of the filesystem:

http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads.png
http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads.png
http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_threads.png

and then read them back sequentially:

http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads_read.png
http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads_read.png
http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_read_threads.png

At the end of the write, ext4 had on the order of 400 extents/file; xfs
had on the order of 30 extents/file. It's clear, especially from the read
graph, that ext4 is interleaving the 4 files, in about 5M chunks on
average. Throughput seems comparable between ext4 and xfs nonetheless.
Again, this was on a decent HW raid, so seek penalties are probably not
too bad.

-Eric
Re: More testing: 4x parallel 2G writes, sequential reads
On Nov 07, 2007  16:42 -0600, Eric Sandeen wrote:
> I tried ext4 vs. xfs doing 4 parallel 2G IO writes in 1M units to 4
> different subdirectories of the root of the filesystem:
>
> http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads.png
> http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads.png
> http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_threads.png
>
> and then read them back sequentially:
>
> http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads_read.png
> http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads_read.png
> http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_read_threads.png
>
> At the end of the write, ext4 had on the order of 400 extents/file; xfs
> had on the order of 30 extents/file. It's clear, especially from the
> read graph, that ext4 is interleaving the 4 files, in about 5M chunks on
> average. Throughput seems comparable between ext4 and xfs nonetheless.

The question is what the best result is for this kind of workload? In HPC
applications the common case is that you will also have the data files
read back in parallel instead of serially.

The test shows ext4 finishing marginally faster in the write case, and
marginally slower in the read case. What happens if you have 4 parallel
readers?

Cheers, Andreas
-- 
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: More testing: 4x parallel 2G writes, sequential reads
Hi,

could you try larger preallocation - like 512/1024/2048 blocks, please?

thanks, Alex

Eric Sandeen wrote:
> I tried ext4 vs. xfs doing 4 parallel 2G IO writes in 1M units to 4
> different subdirectories of the root of the filesystem:
>
> http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads.png
> http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads.png
> http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_threads.png
>
> and then read them back sequentially:
>
> http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads_read.png
> http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads_read.png
> http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_read_threads.png
>
> At the end of the write, ext4 had on the order of 400 extents/file; xfs
> had on the order of 30 extents/file. It's clear, especially from the
> read graph, that ext4 is interleaving the 4 files, in about 5M chunks on
> average. Throughput seems comparable between ext4 and xfs nonetheless.
> Again, this was on a decent HW raid, so seek penalties are probably not
> too bad.
>
> -Eric
Re: More testing: 4x parallel 2G writes, sequential reads
Andreas Dilger wrote:
> The question is what the best result is for this kind of workload? In
> HPC applications the common case is that you will also have the data
> files read back in parallel instead of serially.

Agreed, I'm not trying to argue what's better or worse, I'm just seeing
what it's doing. The main reason I did sequential read-back is that it
more clearly shows the file layout for each file on the graph. :) I'm
just getting a handle on how the allocations are going for various types
of writes.

> The test shows ext4 finishing marginally faster in the write case, and
> marginally slower in the read case. What happens if you have 4 parallel
> readers?

I'll test that a bit later (have to run now); I expect parallel readers
may go faster, since the blocks are interleaved, and it might be able to
suck them up pretty much in order across all 4 files.

I'd also like to test some of this under a single head, rather than on HW
raid...

-Eric
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote:
> > What if more than one application wants to use this facility?
>
> That should be fine - let's see: Each application keeps somewhere a time
> when it started a scan of a subtree (or it can actually remember the
> time when it set the flag for each directory); during the scan, it sets
> the flag on each directory. When it wakes up to recheck the subtree, it
> just compares the rtime against the stored time - if rtime is greater,
> the subtree has been modified since the last scan, so we recurse into
> it, and when we are finished with it we set the flag. Now notice that we
> don't care about the flag when we check for changes - we care only about
> rtime - so if there are several applications interested in the same
> subtree, the flag just gets set more often, and thus the update of rtime
> happens more often, but the same scheme still works fine.

OK, so in this case you don't need to set rtime on every single file
inode, but only on directory inodes, right? Because you're only checking
the rtime at the directory level, and not the flag. And it's just as easy
for you to check the rtime flag for the file's containing directory
(modulo magic vis-a-vis hard links) as the file's inode.

I'm just really wishing that rtime and the rtime flag didn't have to live
on disk, but could rather be in memory. If you only needed to save the
directory flags and rtimes, that might actually be doable.

Note by the way that since you need to own the file/directory to set
flags, this means that only programs that are running as root or running
as the uid who owns the entire subtree will be able to use this scheme.
One advantage of doing it in kernel memory is that you might be able to
support watching a tree that is not owned by the watcher.

> I don't get it here - you need to scan the whole subtree and set the
> flag only during the initial scan. Later, you need to scan and set the
> flag only for directories in whose subtree something changed. Similarly,
> rtime needs to be updated for each inode at most once after the scan.

OK, so in the worst case every single file in a kernel source tree might
change after doing an extreme git checkout. That means around 36k files
get updated. So if you have to set/clear the rtime flag during the
checkout process, 36k file inodes would have to have their rtime flag
cleared, plus 2k worth of directory inodes; but those would probably be
folded into other changes made to the inodes anyway. But then when
trackerd goes back and scans the subtree, if you are actually setting
rtime flags for every single file inode, then that's 38k inodes that need
updating. If you only need to set the rtime flags for directories, that's
only 2k worth of extra gratuitous inode updates.

- Ted
Re: More testing: 4x parallel 2G writes, sequential reads
Shapor Naghibzadeh wrote:
> On Wed, Nov 07, 2007 at 04:42:59PM -0600, Eric Sandeen wrote:
> > Again, this was on a decent HW raid, so seek penalties are probably
> > not too bad.
>
> You may want to verify that by doing a benchmark on the raw device. I
> recently did some benchmarks doing random I/O on a Dell 2850 w/ a PERC
> (megaraid) RAID5 w/ 128MB onboard writeback cache and 6x 15krpm drives
> and noticed approximately an order of magnitude throughput drop on
> small (stripe-sized) random reads versus linear. It maxed out at ~100
> random read IOPs or seeks/sec (surprisingly low).
>
> Out of curiosity, how are you counting the seeks?

Chris Mason's seekwatcher (google can find it for you) is doing the
graphing; it uses blktrace for the raw data.

-Eric
Re: More testing: 4x parallel 2G writes, sequential reads
On Wed, Nov 07, 2007 at 04:42:59PM -0600, Eric Sandeen wrote:
> Again, this was on a decent HW raid, so seek penalties are probably not
> too bad.

You may want to verify that by doing a benchmark on the raw device. I
recently did some benchmarks doing random I/O on a Dell 2850 w/ a PERC
(megaraid) RAID5 w/ 128MB onboard writeback cache and 6x 15krpm drives
and noticed approximately an order of magnitude throughput drop on small
(stripe-sized) random reads versus linear. It maxed out at ~100 random
read IOPs or seeks/sec (surprisingly low).

Out of curiosity, how are you counting the seeks?

Shapor
Re: More testing: 4x parallel 2G writes, sequential reads
Andreas Dilger wrote:
> The test shows ext4 finishing marginally faster in the write case, and
> marginally slower in the read case. What happens if you have 4 parallel
> readers?

http://people.redhat.com/esandeen/seekwatcher/ext4_4_thread_par_read.png
http://people.redhat.com/esandeen/seekwatcher/xfs_4_thread_par_read.png
http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_thread_par_read.png

-Eric