Re: e2fsprogs-interim scm tree?

2007-11-20 Thread Christian Kujau
On Tue, November 20, 2007 04:11, Theodore Tso wrote:
 There is a git tree.  It's at:
 http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=summary

 The bleeding edge branch is the 'pu' (proposed updates), which you
 need if you want the full ext4 features.

Ah, I've seen the git tree, but missed the 'pu' branch. So it's

# git-checkout -b pu

to get the pu branch?
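
Presumably something like the following, assuming a fresh clone with the
default 'origin' remote:

# git fetch origin
# git-checkout -b pu origin/pu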

 Note that the userspace code still needs a lot of work.
 It's for this reason that I haven't been recommending people use it for
 production systems just yet.

Understood. But the more ppl testing your stuff the better, right?

Thanks for your time,
Christian.
-- 
BOFH excuse #442:

Trojan horse ran out of hay



Re: e2fsprogs-interim scm tree?

2007-11-20 Thread Theodore Tso
On Tue, Nov 20, 2007 at 11:24:03AM +0100, Christian Kujau wrote:
  Note that the userspace code still needs a lot of work.
  It's for this reason that I haven't been recommending people use it for
  production systems just yet.
 
 Understood. But the more ppl testing your stuff the better, right?

Oh, absolutely.  I just don't think it's fair to encourage people to
use something that might cause them to risk their data unless they are
going into it with their eyes wide open.  The 'pu' branch gets very
minimal testing.  I do run the regression test suite(*), so it's a bit
more than "it builds, ship it!", but essentially almost any patch that
will apply gets thrown into 'pu', and as I review patches and fix up
issues, I move them to the 'next' branch, and then a little bit later
to the 'master' branch.

(*) With only 3-4 test failures, but at least some of them are tests
that need to be fixed up, not necessarily outright bugs in the 'pu'
branch.
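
For reference, a sketch of running that suite from a source tree,
assuming the standard e2fsprogs build layout:

# ./configure
# make
# make check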

At the moment in the git tree only the 'pu' branch has extents
support, and to be honest the support in e2fsprogs-interim is better
in terms of being able to detect and fix corrupted filesystems without
crashing.  (Some of the newer features like uninit groups and flexbg
aren't in e2fsprogs-interim, though.)  Fixing up the extents support is
very high on my priority list over the next couple of weeks, but at
the moment e2fsck on the 'pu' branch will correctly check an ext4
filesystem with extents that isn't too badly corrupted; a badly
corrupted one will cause e2fsck to crash.  It will get better, I
promise you!  :-)

Regards,

- Ted


Re: e2fsprogs-interim scm tree?

2007-11-20 Thread Eric Sandeen
Theodore Tso wrote:
 On Tue, Nov 20, 2007 at 11:24:03AM +0100, Christian Kujau wrote:
 Note that the userspace code still needs a lot of work.
 It's for this reason that I haven't been recommending people use it for
 production systems just yet.
 Understood. But the more ppl testing your stuff the better, right?
 
 Oh, absolutely.  I just don't think it's fair to encourage people to
 use something that might cause them to risk their data unless they are
 going into it with their eyes wide open.

Please do report any problems you find, though.  Ted has said that the
sourceforge bugtracker is the right place to do this for now.

-Eric (remembering he has a bug to report, too)


Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Jan Kara
  Hi Coly,

  finally I've found some time to have a look at a new version of your
patch.

 5, Performance number
 On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
 partition, and tried to create 5 directories, and create 15 (1KB)
 files in each directory alternately. After a remount, I tried to
 remove all the directories and files recursively with 'rm -rf'. Below
 is the benchmark result,
                 normal ext4          ext4 with dir inode reservation
 mount options:  -o data=writeback    -o data=writeback,dir_ireserve=low
 Create dirs:    real 0m49.101s       real 2m59.703s
 Create files:   real 24m17.962s      real 21m8.161s
 Unlink all:     real 24m43.788s      real 17m29.862s
 Creating dirs with dir inode reservation is slower than normal ext4, as
 predicted, because allocating directory inodes in non-linear order
 causes extra hard disk seeking and block I/O. Creating files with
 dir inode reservation is 13% faster than normal ext4. Unlinking all the
 directories and files is 29.2% faster, as expected.  When the number of
 directories is increased, the performance improvement will be more
 considerable. More benchmark results will be posted here if necessary;
 I need more time to run more test cases.
  Maybe to get a better idea - could you compare how long it takes to
untar a kernel tree, run find through the whole kernel tree, and remove
the tree? Also, measuring CPU load during your benchmarks would be
useful, so that we can see whether we increase the cost of searching
the bitmaps too much...

  The patch is nicely short ;)

 diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
 index 17b5df1..f838a72 100644
 --- a/fs/ext4/ialloc.c
 +++ b/fs/ext4/ialloc.c
 @@ -10,6 +10,8 @@
   *  Stephen Tweedie ([EMAIL PROTECTED]), 1993
   *  Big-endian to little-endian byte-swapping/bitmaps by
   *David S. Miller ([EMAIL PROTECTED]), 1995
 + *  Directory inodes reservation by
 + *Coly Li ([EMAIL PROTECTED]), 2007
   */
 
  #include <linux/time.h>
 @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
   return -1;
  }
 
 +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
 +		int mode, ext4_group_t *group, unsigned long *ino)
 +{
 +	struct super_block *sb;
 +	struct ext4_sb_info *sbi;
 +	struct ext4_group_desc *gdp = NULL;
 +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
 +	ext4_group_t ires_group = *group;
 +	unsigned long ires_ino;
 +	int i, bit;
 +
 +	sb = dir->i_sb;
 +	sbi = EXT4_SB(sb);
 +
 +	/* if the inode number is not for a directory,
 +	 * only try to allocate after the directory's inode
 +	 */
 +	if (!S_ISDIR(mode)) {
 +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
 +		return 0;
 +	}
  ^^^ You don't set a group here - is this intentional? It means we get
the group from find_group_other() and there we start searching at a
place corresponding to parent's inode number... That would be an
interesting heuristic but I'm not sure if that's what you want.

 +
 +	/* reserve inodes for new directory */
 +	for (i = 0; i < sbi->s_groups_count; i++) {
 +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
 +		if (!gdp)
 +			goto fail;
 +		bit = 0;
 +try_same_group:
 +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
 +			brelse(bitmap_bh);
 +			bitmap_bh = read_inode_bitmap(sb, ires_group);
 +			if (!bitmap_bh)
 +				goto fail;
 +
 +			BUFFER_TRACE(bitmap_bh, "get_write_access");
 +			if (ext4_journal_get_write_access(
 +					handle, bitmap_bh) != 0)
 +				goto fail;
 +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
 +					bit, bitmap_bh->b_data)) {
 +				/* we won it */
 +				BUFFER_TRACE(bitmap_bh,
 +					"call ext4_journal_dirty_metadata");
 +				if (ext4_journal_dirty_metadata(handle,
 +						bitmap_bh) != 0)
 +					goto fail;
 +				ires_ino = bit;
 +				goto find;
 +			}
 +			/* we lost it */
 +			jbd2_journal_release_buffer(handle, bitmap_bh);
 +			bit += sbi->s_dir_ireserve_nr;
 +			goto try_same_group;
 +		}
 So this above is just a while loop coded with a goto... A while loop
would IMO be better.

Honza
-- 
Jan Kara 

Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Jan Kara
On Wed 21-11-07 00:40:17, Coly Li wrote:
 Jan Kara wrote:
  diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
  index 17b5df1..f838a72 100644
  --- a/fs/ext4/ialloc.c
  +++ b/fs/ext4/ialloc.c
  @@ -10,6 +10,8 @@
*  Stephen Tweedie ([EMAIL PROTECTED]), 1993
*  Big-endian to little-endian byte-swapping/bitmaps by
*David S. Miller ([EMAIL PROTECTED]), 1995
  + *  Directory inodes reservation by
  + *Coly Li ([EMAIL PROTECTED]), 2007
*/
 
    #include <linux/time.h>
  @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
 return -1;
   }
 
  +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
  +		int mode, ext4_group_t *group, unsigned long *ino)
  +{
  +	struct super_block *sb;
  +	struct ext4_sb_info *sbi;
  +	struct ext4_group_desc *gdp = NULL;
  +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
  +	ext4_group_t ires_group = *group;
  +	unsigned long ires_ino;
  +	int i, bit;
  +
  +	sb = dir->i_sb;
  +	sbi = EXT4_SB(sb);
  +
  +	/* if the inode number is not for a directory,
  +	 * only try to allocate after the directory's inode
  +	 */
  +	if (!S_ISDIR(mode)) {
  +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
  +		return 0;
  +	}
^^^ You don't set a group here - is this intentional? It means we get
  the group from find_group_other() and there we start searching at a
  place corresponding to parent's inode number... That would be an
  interesting heuristic but I'm not sure if that's what you want.
 Yes, if allocating a file inode, we just return the first inode offset
 in the reserved area of the parent directory. In this case, the group
 is decided by find_group_other() or find_group_orlov();
 ext4_ino_from_ireserve() just tries to persuade the linear inode
 allocator to search for a free inode slot after the parent's inode.
  But what I mean is: the parent directory is in group 1, with inode
number 10; now find_group_other() will set the group to 2 and you set
the inode number to 10, so the linear allocator will start searching at
group 2, inode number 10, which is *not* just after the directory's
inode.

  +
  +	/* reserve inodes for new directory */
  +	for (i = 0; i < sbi->s_groups_count; i++) {
  +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
  +		if (!gdp)
  +			goto fail;
  +		bit = 0;
  +try_same_group:
  +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
  +			brelse(bitmap_bh);
  +			bitmap_bh = read_inode_bitmap(sb, ires_group);
  +			if (!bitmap_bh)
  +				goto fail;
  +
  +			BUFFER_TRACE(bitmap_bh, "get_write_access");
  +			if (ext4_journal_get_write_access(
  +					handle, bitmap_bh) != 0)
  +				goto fail;
  +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
  +					bit, bitmap_bh->b_data)) {
  +				/* we won it */
  +				BUFFER_TRACE(bitmap_bh,
  +					"call ext4_journal_dirty_metadata");
  +				if (ext4_journal_dirty_metadata(handle,
  +						bitmap_bh) != 0)
  +					goto fail;
  +				ires_ino = bit;
  +				goto find;
  +			}
  +			/* we lost it */
  +			jbd2_journal_release_buffer(handle, bitmap_bh);
  +			bit += sbi->s_dir_ireserve_nr;
  +			goto try_same_group;
  +		}
   So this above is just a while loop coded with a goto... A while loop
  would IMO be better.
 
 The only reason for me to use a goto is the 80-column limitation :) BTW,
 the goto does not hurt performance or readability here. IMHO, it's
 acceptable :-)
  But you could just remove the "goto try_same_group;" and change the
'if' to a 'while'.
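
A minimal sketch of that rewrite (same helpers and declarations as in
the patch above; only the control flow changes, no behavior change):

	bit = 0;
	while (bit < EXT4_INODES_PER_GROUP(sb)) {
		brelse(bitmap_bh);
		bitmap_bh = read_inode_bitmap(sb, ires_group);
		if (!bitmap_bh)
			goto fail;

		BUFFER_TRACE(bitmap_bh, "get_write_access");
		if (ext4_journal_get_write_access(handle, bitmap_bh) != 0)
			goto fail;
		if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
				bit, bitmap_bh->b_data)) {
			/* we won it */
			BUFFER_TRACE(bitmap_bh,
				"call ext4_journal_dirty_metadata");
			if (ext4_journal_dirty_metadata(handle,
					bitmap_bh) != 0)
				goto fail;
			ires_ino = bit;
			goto find;
		}
		/* we lost it: advance to the next reserved window */
		jbd2_journal_release_buffer(handle, bitmap_bh);
		bit += sbi->s_dir_ireserve_nr;
	}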

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR


Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Coly Li
Jan,

Thanks for taking time to review the patch :-)

Jan Kara wrote:
   Hi Coly,
 
   finally I've found some time to have a look at a new version of your
 patch.
 
 5, Performance number
 On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
 partition, and tried to create 5 directories, and create 15 (1KB)
 files in each directory alternately. After a remount, I tried to
 remove all the directories and files recursively with 'rm -rf'. Below
 is the benchmark result,
                 normal ext4          ext4 with dir inode reservation
 mount options:  -o data=writeback    -o data=writeback,dir_ireserve=low
 Create dirs:    real 0m49.101s       real 2m59.703s
 Create files:   real 24m17.962s      real 21m8.161s
 Unlink all:     real 24m43.788s      real 17m29.862s
 Creating dirs with dir inode reservation is slower than normal ext4, as
 predicted, because allocating directory inodes in non-linear order
 causes extra hard disk seeking and block I/O. Creating files with
 dir inode reservation is 13% faster than normal ext4. Unlinking all the
 directories and files is 29.2% faster, as expected.  When the number of
 directories is increased, the performance improvement will be more
 considerable. More benchmark results will be posted here if necessary;
 I need more time to run more test cases.
   Maybe to get a better idea - could you compare how long it takes to
 untar a kernel tree, run find through the whole kernel tree, and remove
 the tree? Also, measuring CPU load during your benchmarks would be
 useful, so that we can see whether we increase the cost of searching
 the bitmaps too much...

Sure, good ideas, I will add these items to my benchmark list.

 
   The patch is nicely short ;)
Thanks for the encouragement.
The first version was more than 2000 lines, including kernel and e2fsprogs
patches. Finally I found that a kernel patch of only around 100 lines can
achieve the same performance. Really nice :-)

 
 diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
 index 17b5df1..f838a72 100644
 --- a/fs/ext4/ialloc.c
 +++ b/fs/ext4/ialloc.c
 @@ -10,6 +10,8 @@
   *  Stephen Tweedie ([EMAIL PROTECTED]), 1993
   *  Big-endian to little-endian byte-swapping/bitmaps by
   *David S. Miller ([EMAIL PROTECTED]), 1995
 + *  Directory inodes reservation by
 + *Coly Li ([EMAIL PROTECTED]), 2007
   */

   #include <linux/time.h>
 @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
  return -1;
  }

 +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
 +		int mode, ext4_group_t *group, unsigned long *ino)
 +{
 +	struct super_block *sb;
 +	struct ext4_sb_info *sbi;
 +	struct ext4_group_desc *gdp = NULL;
 +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
 +	ext4_group_t ires_group = *group;
 +	unsigned long ires_ino;
 +	int i, bit;
 +
 +	sb = dir->i_sb;
 +	sbi = EXT4_SB(sb);
 +
 +	/* if the inode number is not for a directory,
 +	 * only try to allocate after the directory's inode
 +	 */
 +	if (!S_ISDIR(mode)) {
 +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
 +		return 0;
 +	}
   ^^^ You don't set a group here - is this intentional? It means we get
 the group from find_group_other() and there we start searching at a
 place corresponding to parent's inode number... That would be an
 interesting heuristic but I'm not sure if that's what you want.

Yes, if allocating a file inode, we just return the first inode offset in
the reserved area of the parent directory. In this case, the group is
decided by find_group_other() or find_group_orlov();
ext4_ino_from_ireserve() just tries to persuade the linear inode
allocator to search for a free inode slot after the parent's inode.

 
 +
 +	/* reserve inodes for new directory */
 +	for (i = 0; i < sbi->s_groups_count; i++) {
 +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
 +		if (!gdp)
 +			goto fail;
 +		bit = 0;
 +try_same_group:
 +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
 +			brelse(bitmap_bh);
 +			bitmap_bh = read_inode_bitmap(sb, ires_group);
 +			if (!bitmap_bh)
 +				goto fail;
 +
 +			BUFFER_TRACE(bitmap_bh, "get_write_access");
 +			if (ext4_journal_get_write_access(
 +					handle, bitmap_bh) != 0)
 +				goto fail;
 +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
 +					bit, bitmap_bh->b_data)) {
 +				/* we won it */
 +				BUFFER_TRACE(bitmap_bh,
 +					"call ext4_journal_dirty_metadata");
 +				if (ext4_journal_dirty_metadata(handle,
 +   

Re: e2fsprogs-interim scm tree?

2007-11-20 Thread Christian Kujau

On Tue, 20 Nov 2007, Theodore Tso wrote:

Please do report any problems you find, though.  Ted has said that the
sourceforge bugtracker is the right place to do this for now.


BTW, when reporting a bug against the git tree, please include the
branch and the git ID (i.e., that which is reported by git-describe),
so there is no question which version you were running at the time.
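
For example (illustrative output only; the tag and commit id here are
made up):

$ git describe
v1.40.2-45-g1234567

i.e. the nearest tag, the number of commits since that tag, and the
abbreviated commit id.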


Thanks for all the hints. But I don't intend to run e2fsprogs more than 
needed (just the occasional tune2fs/e2fsck on an already created fs). As 
soon as I have a testbox again, I can go crazy with these tools :)


Christian.
--
BOFH excuse #324:

Your packets were eaten by the terminator


Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Andreas Dilger
On Nov 20, 2007  12:22 -0800, Mingming Cao wrote:
 On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
  Mingming Cao wrote:
   On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
  The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because
  the directory inode itself also uses an inode slot from the reserved
  area, the number of slots left for files is (s_dir_ireserve_nr - 1).
  Except for the reserved inode count, your understanding exactly
  matches my idea.
 
 The performance gain on a large number of directories looks interesting,
 but I am afraid this makes the uninitialized block group feature less
 useful (for speeding up fsck) than before :( The reserved inodes will
 push the unused-inode watermark higher than before, and spread the
 directories over to other block groups earlier than before. Maybe it's
 worth it... not sure.

My original thoughts on the design for this were slightly different:
- that the per-directory reserved window would scale with the size of
  the directory, so that even (or especially) with htree directories the 
  inodes would be kept in hash-ordered sets to speed up stat/unlink
- it would be possible/desirable to re-use the existing block bitmap
  reservation code to handle inode bitmap reservation for directories
  while those directories are in-core.  We already have the mechanisms
  for this; all that would need to change is to have the reservation code
  point at the inode bitmaps, but I don't know how easy that is
- after an unmount/remount it would be desirable to re-use the same blocks
  for getting good hash-inode mappings, wherein lies the problem of
  compatibility

One possible solution for the restart problem is searching the directory
leaf block in which an inode is being added for the inode numbers and
trying to use those as a goal for the inode allocation...  This has a
minor ordering problem, because usually the inode is allocated before
the dirent is created, but it isn't impossible to fix (e.g. find the
dirent slot first, keep a pointer to it, check for inode goals, and then
fill in the dirent inum after allocating the inode), as sketched below.
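
Roughly, as pseudocode (the helper names here are placeholders for
illustration, not existing ext4 functions):

	/* 1. find and hold the dirent slot first */
	slot = find_dirent_slot(handle, dir, name);	/* hypothetical */
	/* 2. scan the inode numbers in that leaf block for a goal */
	goal = pick_inode_goal(slot);			/* hypothetical */
	/* 3. allocate the inode near the goal */
	inode = ext4_new_inode_near(handle, dir, mode, goal); /* hypothetical */
	/* 4. only now fill the inum into the dirent */
	fill_dirent(slot, inode->i_ino);		/* hypothetical */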

   5, Performance number
   On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
   partition, and tried to create 5 directories, and create 15 (1KB)
   files in each directory alternately. After a remount, I tried to
   remove all the directories and files recursively with 'rm -rf'.
   Below is the benchmark result,
                   normal ext4          ext4 with dir inode reservation
   mount options:  -o data=writeback    -o data=writeback,dir_ireserve=low
   Create dirs:    real 0m49.101s       real 2m59.703s
   Create files:   real 24m17.962s      real 21m8.161s
   Unlink all:     real 24m43.788s      real 17m29.862s

One likely reason that the create dirs step is slower is that this is
doing a lot more IO than in the past.  Only a single inode in each
inode table block is being used, so that means that a lot of empty
bytes are being read and written (maybe 16x as much data in this case).

Also, in what order are you creating files in the directories?  If you
are creating them in directory order like:

for (f = 0; f < 15; f++)
        for (i = 0; i < 5; i++)
                touch dir$i/f$f

then it is completely unsurprising that directory reservation is faster
at file create/unlink because those inodes are now contiguous at the
expense of having gaps in the inode sequence.  Creating 15 files per
directory is of course the optimum test case also.

How does this patch behave with benchmarks like dbench, mongo, postmark?

 It would be nice to remember the last allocated bit for each block
 group, so we don't have to start from bit 0 (in the worst case scanning
 the entire block group) for every single directory. Probably this can
 be saved in the in-core block group descriptor.  Right now the in-core
 block group descriptor structure is the same as the on-disk block-group
 structure, though; it might be worth separating them and providing
 load/store helper functions.

Note that mballoc already creates an in-memory struct for each group.
I think the initialization of this should be moved outside of mballoc
so that it can be used for other purposes as you propose.

Eric had a benchmark where creating many files/subdirs would cause
a huge slowdown because of bitmap searching, and having a per-group
pointer with the first free inode (or last allocated inode might be
less work to track) would speed this up a lot. 
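
Something like the following, purely as illustration (this is not the
real mballoc group info; the struct and field names are made up):

	/* hypothetical in-core per-group info carrying an allocation hint */
	struct ext4_group_info_hint {
		/* ... existing per-group fields ... */
		unsigned int	last_alloc_ino;	/* last inode bit allocated */
	};

	/* start the search at the hint instead of bit 0, and fall back
	 * to scanning the low bits only if nothing is free past it */
	bit = find_next_zero_bit(bitmap, EXT4_INODES_PER_GROUP(sb),
				 grp->last_alloc_ino);
	if (bit >= EXT4_INODES_PER_GROUP(sb))
		bit = find_next_zero_bit(bitmap, grp->last_alloc_ino, 0);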

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
