Re: e2fsprogs-interim scm tree?
On Tue, November 20, 2007 04:11, Theodore Tso wrote:
> There is a git tree. It's at:
>
>   http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=summary
>
> The bleeding edge branch is 'pu' (proposed updates), which you need if
> you want the full ext4 features.

Ah, I've seen the git tree, but missed the 'pu' branch. So it's

  # git-checkout -b pu

to get the pu branch?

> Note that the userspace code still needs a lot of work. It's for this
> reason that I haven't been recommending people use it for production
> systems just yet.

Understood. But the more people testing your stuff the better, right?

Thanks for your time,
Christian.
--
BOFH excuse #442:

Trojan horse ran out of hay
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
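One detail worth noting: `git-checkout -b pu` on its own would only fork a new local branch off whatever is currently checked out; to actually get the remote's pu branch you want to start the new branch from the remote-tracking ref. A sketch of the whole sequence, using a throwaway local repository as a stand-in for the kernel.org URL so the commands can be tried anywhere:

```shell
# Create a throwaway "upstream" repo with a pu branch, standing in for
# git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
workdir=$(mktemp -d)
cd "$workdir"
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C upstream branch pu

# Clone it, then create a local 'pu' branch starting from the
# remote-tracking branch (the modern spelling of "git-checkout -b pu"):
git clone -q upstream e2fsprogs
cd e2fsprogs
git checkout -q -b pu origin/pu
git rev-parse --abbrev-ref HEAD   # prints: pu
```

With a real clone you would simply substitute the kernel.org URL in the `git clone` step; everything after that is identical.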
Re: e2fsprogs-interim scm tree?
On Tue, Nov 20, 2007 at 11:24:03AM +0100, Christian Kujau wrote:
> > Note that the userspace code still needs a lot of work. It's for this
> > reason that I haven't been recommending people use it for production
> > systems just yet.
>
> Understood. But the more people testing your stuff the better, right?

Oh, absolutely. I just don't think it's fair to encourage people to use
something that might cause them to risk their data unless they are going
into it with their eyes wide open.

The 'pu' branch gets very minimal testing. I do run the regression test
suite(*), so it's a bit more than "it builds, ship it!", but essentially
almost any patch that will apply gets thrown into 'pu', and as I review
patches and fix up issues, I move them to the 'next' branch, and then a
little bit later to the 'master' branch.

(*) With only 3-4 test failures, but at least some of them are tests
that need to be fixed up, not necessarily outright bugs in the 'pu'
branch.

At the moment in the git tree only the 'pu' branch has extents support,
and to be honest the support in e2fsprogs-interim is better in terms of
being able to detect and fix corrupted filesystems without crashing.
(Some of the newer features like uninit groups and flexbg aren't in
e2fsprogs-interim, though.)

Fixing up the extents support is very high on my priority list over the
next couple of weeks, but at the moment e2fsck on the 'pu' branch will
correctly check an ext4 filesystem with extents that isn't too badly
corrupted; a badly corrupted one will cause e2fsck to crash. It will
get better, I promise you! :-)

Regards,

- Ted
Re: e2fsprogs-interim scm tree?
Theodore Tso wrote:
> On Tue, Nov 20, 2007 at 11:24:03AM +0100, Christian Kujau wrote:
> > > Note that the userspace code still needs a lot of work. It's for
> > > this reason that I haven't been recommending people use it for
> > > production systems just yet.
> >
> > Understood. But the more people testing your stuff the better, right?
>
> Oh, absolutely. I just don't think it's fair to encourage people to
> use something that might cause them to risk their data unless they are
> going into it with their eyes wide open.

Please do report any problems you find, though. Ted has said that the
sourceforge bugtracker is the right place to do this for now.

-Eric
(remembering he has a bug to report, too)
Re: [PATCH] ext4: dir inode reservation V3
Hi Coly,

finally I've found some time to have a look at a new version of your
patch.

> 5, Performance number
> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> partition, and tried to create 5 directories, and create 15 (1KB)
> files in each directory alternatively. After a remount, I tried to
> remove all the directories and files recursively by a 'rm -rf'. Below
> is the benchmark result,
>
>                normal ext4          ext4 with dir inode reservation
> mount options: -o data=writeback    -o data=writeback,dir_ireserve=low
> Create dirs:   real 0m49.101s       real 2m59.703s
> Create files:  real 24m17.962s      real 21m8.161s
> Unlink all:    real 24m43.788s      real 17m29.862s
>
> Creating dirs with dir inode reservation is slower than normal ext4 as
> predicted, because allocating directory inodes in non-linear order
> will cause extra hard disk seeking and block I/O. Creating files with
> dir inode reservation is 13% faster than normal ext4. Unlinking all
> the directories and files is 29.2% faster as expected. When the number
> of directories is increased, the performance improvement will be more
> considerable. More benchmark results will be posted here if necessary,
> because I need more time to run more test cases.

Maybe to get some better idea - could you compare how long an untar of
a kernel tree takes, a find through the whole kernel tree, and removal
of the tree? Also measuring CPU load during your benchmarks would be
useful so that we can see whether we don't increase too much the cost
of searching in bitmaps...

The patch is nicely short ;)

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 17b5df1..f838a72 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -10,6 +10,8 @@
>   *  Stephen Tweedie ([EMAIL PROTECTED]), 1993
>   *  Big-endian to little-endian byte-swapping/bitmaps by
>   *        David S. Miller ([EMAIL PROTECTED]), 1995
> + *  Directory inodes reservation by
> + *        Coly Li ([EMAIL PROTECTED]), 2007
>   */
>
>  #include <linux/time.h>
> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>  	return -1;
>  }
>
> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> +				  int mode, ext4_group_t *group,
> +				  unsigned long *ino)
> +{
> +	struct super_block *sb;
> +	struct ext4_sb_info *sbi;
> +	struct ext4_group_desc *gdp = NULL;
> +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> +	ext4_group_t ires_group = *group;
> +	unsigned long ires_ino;
> +	int i, bit;
> +
> +	sb = dir->i_sb;
> +	sbi = EXT4_SB(sb);
> +
> +	/* if the inode number is not for directory,
> +	 * only try to allocate after directory's inode
> +	 */
> +	if (!S_ISDIR(mode)) {
> +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> +		return 0;
> +	}
  ^^^ You don't set a group here - is this intentional? It means we get
the group from find_group_other() and there we start searching at a
place corresponding to parent's inode number... That would be an
interesting heuristic but I'm not sure if that's what you want.

> +
> +	/* reserve inodes for new directory */
> +	for (i = 0; i < sbi->s_groups_count; i++) {
> +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> +		if (!gdp)
> +			goto fail;
> +		bit = 0;
> +try_same_group:
> +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
> +			brelse(bitmap_bh);
> +			bitmap_bh = read_inode_bitmap(sb, ires_group);
> +			if (!bitmap_bh)
> +				goto fail;
> +
> +			BUFFER_TRACE(bitmap_bh, "get_write_access");
> +			if (ext4_journal_get_write_access(
> +					handle, bitmap_bh) != 0)
> +				goto fail;
> +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> +						 bit, bitmap_bh->b_data)) {
> +				/* we won it */
> +				BUFFER_TRACE(bitmap_bh,
> +					"call ext4_journal_dirty_metadata");
> +				if (ext4_journal_dirty_metadata(handle,
> +						bitmap_bh) != 0)
> +					goto fail;
> +				ires_ino = bit;
> +				goto find;
> +			}
> +			/* we lost it */
> +			jbd2_journal_release_buffer(handle, bitmap_bh);
> +			bit += sbi->s_dir_ireserve_nr;
> +			goto try_same_group;
> +		}
  So this above is just a while loop coded with goto... While loop
would be IMO better.

								Honza
--
Jan Kara
Re: [PATCH] ext4: dir inode reservation V3
On Wed 21-11-07 00:40:17, Coly Li wrote:
> Jan Kara wrote:
> > > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > > index 17b5df1..f838a72 100644
> > > --- a/fs/ext4/ialloc.c
> > > +++ b/fs/ext4/ialloc.c
> > > @@ -10,6 +10,8 @@
> > >   *  Stephen Tweedie ([EMAIL PROTECTED]), 1993
> > >   *  Big-endian to little-endian byte-swapping/bitmaps by
> > >   *        David S. Miller ([EMAIL PROTECTED]), 1995
> > > + *  Directory inodes reservation by
> > > + *        Coly Li ([EMAIL PROTECTED]), 2007
> > >   */
> > >
> > >  #include <linux/time.h>
> > > @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> > >  	return -1;
> > >  }
> > >
> > > +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> > > +				  int mode, ext4_group_t *group,
> > > +				  unsigned long *ino)
> > > +{
> > > +	struct super_block *sb;
> > > +	struct ext4_sb_info *sbi;
> > > +	struct ext4_group_desc *gdp = NULL;
> > > +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> > > +	ext4_group_t ires_group = *group;
> > > +	unsigned long ires_ino;
> > > +	int i, bit;
> > > +
> > > +	sb = dir->i_sb;
> > > +	sbi = EXT4_SB(sb);
> > > +
> > > +	/* if the inode number is not for directory,
> > > +	 * only try to allocate after directory's inode
> > > +	 */
> > > +	if (!S_ISDIR(mode)) {
> > > +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> > > +		return 0;
> > > +	}
> >   ^^^ You don't set a group here - is this intentional? It means we
> > get the group from find_group_other() and there we start searching
> > at a place corresponding to parent's inode number... That would be an
> > interesting heuristic but I'm not sure if that's what you want.
>
> Yes, if allocating a file inode, just return the first inode offset in
> the reserved area of the parent directory. In this case, the group is
> decided by find_group_other() or find_group_orlov();
> ext4_ino_from_ireserve() just tries to persuade the linear inode
> allocator to search for a free inode slot after the parent's inode.

But what I mean is: the parent directory is in group 1, with inode
number 10. Now find_group_other() will set the group to 2 and you set
the inode number to 10, so the linear allocator will start searching in
group 2, inode number 10 - which is *not* just after the directory
inode.

> > > +
> > > +	/* reserve inodes for new directory */
> > > +	for (i = 0; i < sbi->s_groups_count; i++) {
> > > +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> > > +		if (!gdp)
> > > +			goto fail;
> > > +		bit = 0;
> > > +try_same_group:
> > > +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
> > > +			brelse(bitmap_bh);
> > > +			bitmap_bh = read_inode_bitmap(sb, ires_group);
> > > +			if (!bitmap_bh)
> > > +				goto fail;
> > > +
> > > +			BUFFER_TRACE(bitmap_bh, "get_write_access");
> > > +			if (ext4_journal_get_write_access(
> > > +					handle, bitmap_bh) != 0)
> > > +				goto fail;
> > > +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> > > +						 bit, bitmap_bh->b_data)) {
> > > +				/* we won it */
> > > +				BUFFER_TRACE(bitmap_bh,
> > > +					"call ext4_journal_dirty_metadata");
> > > +				if (ext4_journal_dirty_metadata(handle,
> > > +						bitmap_bh) != 0)
> > > +					goto fail;
> > > +				ires_ino = bit;
> > > +				goto find;
> > > +			}
> > > +			/* we lost it */
> > > +			jbd2_journal_release_buffer(handle, bitmap_bh);
> > > +			bit += sbi->s_dir_ireserve_nr;
> > > +			goto try_same_group;
> > > +		}
> >   So this above is just a while loop coded with goto... While loop
> > would be IMO better.
>
> The only reason for me to use a goto is the 80-column limitation :)
> BTW, goto does not hurt performance and readability here. IMHO, it's
> acceptable :-)

But you could just remove "goto try_same_group;" and change the 'if' to
a 'while'.

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
Re: [PATCH] ext4: dir inode reservation V3
Jan,

Thanks for taking time to review the patch :-)

Jan Kara wrote:
> Hi Coly,
>
> finally I've found some time to have a look at a new version of your
> patch.
>
> > 5, Performance number
> > On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> > partition, and tried to create 5 directories, and create 15 (1KB)
> > files in each directory alternatively. After a remount, I tried to
> > remove all the directories and files recursively by a 'rm -rf'.
> > Below is the benchmark result,
> >
> >                normal ext4          ext4 with dir inode reservation
> > mount options: -o data=writeback    -o data=writeback,dir_ireserve=low
> > Create dirs:   real 0m49.101s       real 2m59.703s
> > Create files:  real 24m17.962s      real 21m8.161s
> > Unlink all:    real 24m43.788s      real 17m29.862s
> >
> > Creating dirs with dir inode reservation is slower than normal ext4
> > as predicted, because allocating directory inodes in non-linear
> > order will cause extra hard disk seeking and block I/O. Creating
> > files with dir inode reservation is 13% faster than normal ext4.
> > Unlinking all the directories and files is 29.2% faster as expected.
> > When the number of directories is increased, the performance
> > improvement will be more considerable. More benchmark results will
> > be posted here if necessary, because I need more time to run more
> > test cases.
>
> Maybe to get some better idea - could you compare how long an untar of
> a kernel tree takes, a find through the whole kernel tree, and removal
> of the tree? Also measuring CPU load during your benchmarks would be
> useful so that we can see whether we don't increase too much the cost
> of searching in bitmaps...

Sure, good ideas, I will add these items to my benchmark list.

> The patch is nicely short ;)

Thanks for the encouragement. The first version was more than 2000
lines, including kernel and e2fsprogs patches. Finally I found that only
around 100 lines of kernel patch can achieve the same performance too.
Really nice :-)

> > +	/* if the inode number is not for directory,
> > +	 * only try to allocate after directory's inode
> > +	 */
> > +	if (!S_ISDIR(mode)) {
> > +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> > +		return 0;
> > +	}
>   ^^^ You don't set a group here - is this intentional? It means we
> get the group from find_group_other() and there we start searching at
> a place corresponding to parent's inode number... That would be an
> interesting heuristic but I'm not sure if that's what you want.

Yes, if allocating a file inode, just return the first inode offset in
the reserved area of the parent directory. In this case, the group is
decided by find_group_other() or find_group_orlov();
ext4_ino_from_ireserve() just tries to persuade the linear inode
allocator to search for a free inode slot after the parent's inode.

> > +
> > +	/* reserve inodes for new directory */
> > +	for (i = 0; i < sbi->s_groups_count; i++) {
> > +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> > +		if (!gdp)
> > +			goto fail;
> > +		bit = 0;
> > +try_same_group:
> > +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
> > +			brelse(bitmap_bh);
> > +			bitmap_bh = read_inode_bitmap(sb, ires_group);
> > +			if (!bitmap_bh)
> > +				goto fail;
> > +
> > +			BUFFER_TRACE(bitmap_bh, "get_write_access");
> > +			if (ext4_journal_get_write_access(
> > +					handle, bitmap_bh) != 0)
> > +				goto fail;
> > +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> > +						 bit, bitmap_bh->b_data)) {
> > +				/* we won it */
> > +				BUFFER_TRACE(bitmap_bh,
> > +					"call ext4_journal_dirty_metadata");
> > +				if (ext4_journal_dirty_metadata(handle,
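Jan's proposed untar/find/remove comparison could be scripted roughly like the sketch below. The generated tree here is a tiny stand-in for a kernel tarball so the script is self-contained; for a real measurement you would substitute an actual kernel source tarball on a freshly mounted test filesystem, and wrap each step in `time` (or `/usr/bin/time -v` for CPU figures) exactly as shown.

```shell
# Stand-in benchmark tree: 5 directories of 15 one-KB files each
# (shape borrowed from Coly's test; sizes are illustrative only).
cd "$(mktemp -d)"
mkdir src
for d in 1 2 3 4 5; do
    mkdir "src/dir$d"
    for f in $(seq 1 15); do
        head -c 1024 /dev/zero > "src/dir$d/file$f"
    done
done
tar cf tree.tar src
rm -rf src

time tar xf tree.tar                  # 1. untar the tree
time find src -type f > /dev/null     # 2. walk every inode
time rm -rf src                       # 3. unlink everything
```

On a stand-in this small the times are meaningless; the point is only the shape of the three timed phases Jan asked for.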
Re: e2fsprogs-interim scm tree?
On Tue, 20 Nov 2007, Theodore Tso wrote:
> > Please do report any problems you find, though. Ted has said that
> > the sourceforge bugtracker is the right place to do this for now.
>
> BTW, when reporting bugs against the git tree, please include the
> branch and the git ID (i.e., that which is reported by git-describe),
> so there is no question which version you were running at the time.

Thanks for all the hints. But I don't intend to run e2fsprogs more than
needed (just the occasional tune2fs/e2fsck on an already created fs).
As soon as I have a testbox again, I can go crazy with these tools :)

Christian.
--
BOFH excuse #324:

Your packets were eaten by the terminator
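For reference, a git-describe string encodes the nearest annotated tag, the number of commits on top of it, and an abbreviated commit id - which is exactly why it pins down the version unambiguously. A throwaway demonstration (the tag name v1.40.2 is just an example release-style tag, not a claim about any real e2fsprogs release):

```shell
# Build a one-tag, two-commit repo and see what git-describe reports.
cd "$(mktemp -d)"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "release"
git -c user.name=demo -c user.email=demo@example.com \
    tag -a v1.40.2 -m "release tag"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "one patch later"

git describe
# prints something like: v1.40.2-1-g1234abc
#   tag  - commits since tag - 'g' + abbreviated commit id
```

So a bug report saying "pu branch, v1.40.2-1-g1234abc" identifies the exact tree state, even if the branch has since moved.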
Re: [PATCH] ext4: dir inode reservation V3
On Nov 20, 2007 12:22 -0800, Mingming Cao wrote:
> On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
> > Mingming Cao wrote:
> > > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> >
> > The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr.
> > Because the directory inode will also use an inode slot from the
> > reserved area, the remaining number of slots for files is
> > (s_dir_ireserve_nr - 1). Except for the reserved inodes number, your
> > understanding exactly matches my idea.
>
> The performance gain on a large number of directories looks
> interesting, but I am afraid this makes the uninitialized block group
> feature less useful (to speed up fsck) than before :( The reserved
> inodes will cause the unused inode watermark to be higher than before,
> and spread the directories over to other block groups earlier than
> before. Maybe it's worth it... not sure.

My original thoughts on the design for this were slightly different:
- that the per-directory reserved window would scale with the size of
  the directory, so that even (or especially) with htree directories
  the inodes would be kept in hash-ordered sets to speed up stat/unlink
- it would be possible/desirable to re-use the existing block bitmap
  reservation code to handle inode bitmap reservation for directories
  while those directories are in-core. We already have the mechanisms
  for this; all that would need to change is to have the reservation
  code point at the inode bitmaps, but I don't know how easy that is
- after an unmount/remount it would be desirable to re-use the same
  blocks for getting good hash-inode mappings, wherein lies the problem
  of compatibility

One possible solution for the restart problem is searching the
directory leaf block in which an inode is being added for the inode
numbers, and trying to use those as a goal for the inode allocation...
This has a minor problem with ordering, because usually the inode is
allocated before the dirent is created, but it isn't impossible to fix
(e.g. find the dirent slot first, keep a pointer to it, check for inode
goals, and then fill in the dirent inum after allocating the inode).

> > 5, Performance number
> > On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> > partition, and tried to create 5 directories, and create 15 (1KB)
> > files in each directory alternatively. After a remount, I tried to
> > remove all the directories and files recursively by a 'rm -rf'.
> > Below is the benchmark result,
> >
> >                normal ext4          ext4 with dir inode reservation
> > mount options: -o data=writeback    -o data=writeback,dir_ireserve=low
> > Create dirs:   real 0m49.101s       real 2m59.703s
> > Create files:  real 24m17.962s      real 21m8.161s
> > Unlink all:    real 24m43.788s      real 17m29.862s

One likely reason that the "create dirs" step is slower is that this is
doing a lot more IO than in the past. Only a single inode in each inode
table block is being used, so that means a lot of empty bytes are being
read and written (maybe 16x as much data in this case).

Also, in what order are you creating files in the directories? If you
are creating them in directory order like:

	for (f = 0; f < 15; f++)
		for (i = 0; i < 5; i++)
			touch dir$i/f$f

then it is completely unsurprising that directory reservation is faster
at file create/unlink, because those inodes are now contiguous at the
expense of having gaps in the inode sequence. Creating 15 files per
directory is of course the optimum test case also. How does this patch
behave with benchmarks like dbench, mongo, postmark?

> It would be nice to remember the last allocated bit for each block
> group, so we don't have to start from bit 0 (in the worst case, scan
> the entire block group) for every single directory. Probably this can
> be saved in the in-core block group descriptor. Right now the in-core
> block group descriptor structure is the same as the on-disk block
> group structure, though; it might be worth separating them and
> providing load/store helper functions.

Note that mballoc already creates an in-memory struct for each group.
I think the initialization of this should be moved outside of mballoc
so that it can be used for other purposes as you propose. Eric had a
benchmark where creating many files/subdirs would cause a huge slowdown
because of bitmap searching, and having a per-group pointer to the
first free inode (or the last allocated inode, which might be less work
to track) would speed this up a lot.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.