Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Andreas Dilger
On Feb 26, 2008  08:39 -0800, Eric Sandeen wrote:
 Takashi Sato wrote:
 
  o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
As Andreas Dilger and Christoph Hellwig advised me, I have elevated
them to include/linux/fs.h as below.
  #define FIFREEZE_IOWR('X', 119, int)
     #define FITHAW  _IOWR('X', 120, int)
The ioctl numbers used by XFS applications don't need to be changed.
But my following ioctl for the freeze needs the parameter
as the timeout period.  So if XFS applications don't want the timeout
feature as the current implementation, the parameter needs to be
changed 1 (level?) into 0.
 
 So, existing xfs applications calling the xfs ioctl now will behave
 differently, right?  We can only keep the same ioctl number if the
 calling semantics are the same.  Keeping the same number but changing
 the semantics is harmful, IMHO

Do we know what this parameter was supposed to mean?

We could special case 1 if needed to keep compatibility (documenting
this clearly), either making it == 0, or some very long timeout (1h
or whatever).  A relatively minor wart I think.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2008-02-24 Thread Andreas Dilger
On Feb 22, 2008  19:15 -0500, Theodore Ts'o wrote:
 So before the recent patch were we actually creating long symlinks in
 extents format?  Or were we just setting the flag but still treating
 them as a block number?  If it was the latter, I guess we can put in
 code into e2fsck to detect that case, and convert it back to a
 singleton block number.  

Eric informed me that the long symlinks were actually stored in extent
mapped blocks.  That is not harmful, because it can only be a single
block and it will always fit into the inode.  The other thing to note
is that extent mapping is REQUIRED for  32-bit blocknumbers, so we
may as well fix e2fsprogs to allow these symlinks to be handled normally.

  I'd say when e2fsprogs has an official release with extents support,
  and there are no show-stopping bugs in the existing code...  I don't
  think that is too far off anymore.
 
 I guess I'd be a *bit* more cautious.  We still have some code patches
 such as the delayed allocation and to a lesser extent the online
 defrag patches which have the possibility of introducing bugs.  Once
 all of those get merged and we have a full kernel release cycle to fix
 the last remaining bugs, that's when I would drop the -dev from the
 name.

Well, there isn't any hard requirement to include delalloc into the
first non-dev ext4 release.  Yes, it would be desirable, but I don't
think we HAVE to have it.  I think the important thing is that the
on-disk format is no longer changing.

One thing that still needs to be done (AFAIK) is the removal of the
auto-setting of filesystem feature flags, and allow tune2fs/e2fsck
to set/leave the flags that we currently set automatically.  I'd
hazard that for feature flags which have been around a LONG time (like
EAs and LARGEFILE) that these should be enabled by default.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2008-02-22 Thread Andreas Dilger
On Feb 21, 2008  10:40 -0600, Eric Sandeen wrote:
 Ok, but my concern is what happens to those long symlinks when they
 really truly are in extents format.  One option is to say hey it was
 ext4DEV, deal with it and nuke the symlink, but if possible, converting
 back to the proper format would be nice.

Is that actually the case though?  That should be pretty easy to massage
into storing a block number.  The difficulty is if the long symlink block
is beyond 32-bit blocknr, in which case it actually needs extents format.
We may as well bite the bullet and fix the code to be the same as with
htree fakeroot index block reading and use the proper mapping to find
the symlink block.  See htree_blk_iter_cb() for how to do that.

  One related question is whether we want to try to get support for full
  64-bit physical block numbers into ext4.  I think there were some
  rough draft patches floating about, but IIRC they didn't
  simultaneously support the 48-bit extent format.
 
 I too had assumed that 48 bits would be it for now; it should be
 sufficient for a good while.  I guess what I'd like to see if a usable
 ext4 out there in the near future, with stuff added on later as
 necessary; delalloc, flex_bg (if that doesn't make 2.6.25...) etc.

At some point 32-bit logical block numbers will also be an issue, but
the need for 16TB+ non-sparse single files is rare even in my world.

 Oh, speaking of all this - what do you think the criteria are for
 dropping the dev from ext4dev?  How do we decide that it's cooked
 enough?  :)

I'd say when e2fsprogs has an official release with extents support,
and there are no show-stopping bugs in the existing code...  I don't
think that is too far off anymore.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [E2FSPROGS, RFC] New mke2fs types parsing

2008-02-21 Thread Andreas Dilger
On Feb 20, 2008  17:20 -0500, Theodore Ts'o wrote:
 There are only three things which mke2fs will do, in my design:

This should all go into the mke2fs man page...

 [fs_types]
   ext3 = {
   features = has_journal
   }
   ext4 = {
   features = extents,flex_bg
   inode_size = 256
   }

Presumably the ext4 feature should also have features = has_journal?
If this is the default for ext4, why would it need to be given for ext3?

We should also add dir_nlink,flexbg while we are in there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] blkid detection for ZFS

2008-02-21 Thread Andreas Dilger
On Feb 20, 2008  07:57 -0500, Theodore Ts'o wrote:
 On Thu, Feb 14, 2008 at 06:07:40PM -0700, Andreas Dilger wrote:
  Some input is welcome here also...  There is a UUID (GUID) for the whole
  pool (aggregation of devices that ZFS filesystems might live on), a
  UUID for the virtual device (vdev) (akin to MD RAID set) that a disk
  is part of and also a separate UUID for each device.  There is a LABEL
  (pool name) for the whole pool, but not one for an individual filesystem.
 
 Are there devices for that are made available for the vdev and the
 pool?  I assume not for the pool since that's a filesystem entity, but
 what about the vdev?
 
 In general, blkid is all about mapping the UUID of what lives on the
 device to the device filename, for the benefit of programs like mount
 and fsck.
 
 I don't know enough about ZFS in terms of how you would mount a
 filesystem which is part of a pool.  How is the filesystem specified
 to the mount command?

Good question.  For Lustre (linux or Solaris), we want to be able to find
the pool by name, and then use ZFS tools to import the pool and make the
filesystems available to Lustre.  The current ZFS tools (as ported to
Linux) scan all of /dev/* directly, but I'd much prefer to use libblkid
for that since it knows about PVs, RAID devices, etc.

Filesystems in a ZFS pool are specified via {poolname}/{fsname}, but
to get fsname from disk is much more involved than I want to get,
since it almost involves importing the pool and parsing a whole tree
of parameters and indexes.

I'd be pretty happy to just know from blkid that a given device is
used by ZFS for lustrepool or fusepool or whatever it is called.

 So it would seem to me that it would be better to make the UUID be for
 a particular ZFS physical disk be the UUID for that disk, and not for
 the whole pool.  The question really, though, is what actually would
 be most useful --- who is going to actually use blkid on a Solaris
 system with ZFS?  It may be that the right answer is to put the pool
 UUID as a separate tag; blkid supports more than just the standard
 LABEL, UUID, TYPE, etc. tags.  You could easily stash the pool UUID in
 a POOL_GUID tag, if it would be useful for some blkid callers.

OK, maybe I'll go that route, since I won't stricly be having UUIDs
or LABELs that directly map to filesystems.
 
  On a related note - on Solaris the ZFS filesystems always live in a GPT
  partition table, and I note that libblkid doesn't identify this.  Is that
  something we want to start adding to libblkid (e.g. GPT, DOS, LVM, etc)?
 
 What do you mean by not identifying the GPT partition table?  At the
 moment we haven't been identifying the whole disk partition tables,
 mainly becuase there isn't much use for it especially for the DOS MBR
 (no uuid or label to speak of).
 
 I just checked in a patch from Eric to detect LVM2 PV's, because it
 was useful for the Anaconda developers.  I wouldn't have any
 objections accepting a patch which detected the whole-disk device and
 returned the GPT label/UUID information, but I probably wouldn't code
 it myself.  Still, if it someone thought it was *useful* and would use
 it, and thus felt called to write a patch, I'd certainly accept it.

That was my question.  I didn't see the LVM2 identification patch until
after my email, but this makes it fairly clear that identification of
block devices isn't verboten.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How were some of the lustre e2fsprogs test cases generated?

2008-02-19 Thread Andreas Dilger
On Feb 18, 2008  19:36 -0500, Theodore Ts'o wrote:
 One minor correction --- the clusterfs e2fsprogs extents code checks
 to see if the ee_leaf_hi field is non-zero, and complains if so.
 However, it ignores the ee_start_hi field for interior (non-leaf)
 nodes in the extent tree, and a number of tests do have non-zero
 ee_start_hi fields which cause my version of e2fsprogs to (rightly)
 complain.
 
 If you fix this, a whole bunch of tests will fail as a result, and not
 exercise the code paths that the tests were apparently trying to
 exercise.  Which is what is causing me a bit of worry and wonder about
 how those test cases were originally generated

The original CFS extents kernel patch had a bug where the _hi fields
were not initialized correctly to zero.  The CFS exents e2fsck
patches would clear the _hi fields in the extents and index blocks,
but I disabled that in the upstream patch submission because it will
be incorrect for 48-bit filesystems.

That's the high_bits_ok check in e2fsck_ext_block_verify() for error
PR_1_EXTENT_HI, that only allows the high bits when there are  2^32
blocks in the filesystem.  It's possible I made a mistake when I added
that part of the patch, but the regression tests still passed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How were some of the lustre e2fsprogs test cases generated?

2008-02-19 Thread Andreas Dilger
On Feb 19, 2008  07:29 -0500, Theodore Ts'o wrote:
 On Tue, Feb 19, 2008 at 04:40:32AM -0700, Andreas Dilger wrote:
  No, it hasn't always been true that we cleared the _hi fields in the
  kernel code.  But, it has been a year or more since we found this bug,
  and all CFS e2fsprogs releases since then have cleared the _hi fields,
  and there has not been any other e2fsprogs that supports extents, so
  we expect that there are no filesystems left in the field with this
  issue, and even then the current code will prefer to clear the _hi
  bits instead of considering the whole extent corrupt.
 
 I checked again, and it looks like the interim code is indeed clearing
 the _hi bits.  I managed to confuse myself into thinking it didn't for
 index nodes, but I checked again and it seems to be doing the right
 thing.
 
 The reason why I asked is that the extents code in the 'next' branch
 of e2fsprogs *does* consider the whole extent to be corrupt, since in
 the long run once we start 64-bit block number extent blocks, if the
 physical block number (including the high 16 bits) is greater than
 s_blocks_count, simply masking off the high 16 bits of the 48 bit
 extent block is probably not the right way of dealing with the
 problem.
 
 I think that's probably a safe thing to do since all of your customers
 who might have had a filesystem with non-zero _hi fields have almost
 certainly run e2fsck to clear the _hi bits at least once; do you
 concur that is a safe assumption?  Or would you prefer that I add some
 code that tries to clear just the _hi bits, perhaps controlled by a
 configuration flag in e2fsck.conf?

I'm OK with either.  We might consider patching e2fsck to return to the
more permissive CFS behaviour with _hi bits for our own releases, or
just leave it.  Checking back in our patches, we fixed the kernel code
in July '06 and the e2fsck code in Jan '07, so I hope people have run
an e2fsck on their filesystems in the last 1.5 years.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][7/28] e2fsprogs-extents.patch

2008-02-18 Thread Andreas Dilger
On Feb 18, 2008  11:56 -0600, Eric Sandeen wrote:
  @@ -904,21 +910,75 @@ void e2fsck_pass1(e2fsck_t ctx)
  +   eh = (struct ext3_extent_header *)inode-i_block;
  +   if ((inode-i_flags  EXT4_EXTENTS_FL)) {
  +   if ((LINUX_S_ISREG(inode-i_mode) ||
  +LINUX_S_ISDIR(inode-i_mode)) 
 
 So this trips up on things like sockets, fifos, and block  char nodes.

Hrm, not impossible, since Lustre only uses extent-based filesystems for
regular file storage.

 Also this is unhappy:
 
  @@ -137,7 +141,7 @@ int e2fsck_pass1_check_device_inode(ext2
   * If the index flag is set, then this is a bogus
   * device/fifo/socket
   */
  -   if (inode-i_flags  EXT2_INDEX_FL)
  +   if (inode-i_flags  (EXT2_INDEX_FL | EXT4_EXTENTS_FL))
  return 0;
 
 Do we really care if these have the extents flag set?  IOW should we
 make sure the kernel doesn't set the flag, or should we make e2fsck not
 care...

The Lustre extents patches clear the EXT4_EXTENTS_FL always (i.e. they
are never set on directories) so we've never seen these problems.

 There are enough checks in e2fsck to show the intent was that these
 files should not have the extents flag set, but I'm not sure why it
 matters enough that the kernel needs to run around being sure to clear
 it
 
 Or... (rambling on now) it seems odd to me that zero-length files have
 the extents flag set at all; should we only set extents when we actually
 get a block allocated to the file?  That would also take care of this
 from the kernel side I think.

Yes, I'd be for e2fsck clearing this flag, but as I mentioned in the
concall, I think it is better to have the kernel just stop inheriting
all flags from the parent directory, or possibly just have a fixed
range of flags that are being propogated to child inodes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] ext4: Fix kernel BUG at fs/ext4/mballoc.c:910!

2008-02-14 Thread Andreas Dilger
On Feb 14, 2008  17:35 +0100, Valerie Clement wrote:
 From: Valerie Clement [EMAIL PROTECTED]
 
 With the flex_bg feature enabled, a large file creation oopses the
 kernel.
 The BUG_ON is:
   BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb));
 
 As the allocation of the bitmaps and the inode table can be done
 outside the block group with flex_bg, this allows to allocate up to
 EXT4_BLOCKS_PER_GROUP blocks in a group.
 Depending on the group size and the block size, extents might be 
 larger than BLOCKS_PER_GROUP(); use EXT_INIT_MAX_LEN instead of 
 BLOCKS_PER_GROUP().

In fact, my earlier review of this patch was incorrect, and Aneesh pointed
out the correct answer.  The ext4_mb_mark_free_simple() function is only
called from ext4_mb_generate_buddy() to generate the buddy bitmap from the
on-disk block bitmap, and in that case the @len parameter should always
be = EXT4_BLOCKS_PER_GROUP().  I think the original patch was correct.

Sorry about the confusion.  I thought at first glance this was for
freeing the blocks from releasing an extent, but that is incorrect.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] blkid detection for ZFS

2008-02-14 Thread Andreas Dilger
Attached is a patch to detect ZFS in libblkid.  It isn't by any means
complete, because it doesn't report the LABEL or UUID of the device,
nor names any of the constituent filesystems.  The latter is quite
complex to implement and may be beyond the scope of libblkid.

Some input is welcome here also...  There is a UUID (GUID) for the whole
pool (aggregation of devices that ZFS filesystems might live on), a
UUID for the virtual device (vdev) (akin to MD RAID set) that a disk
is part of and also a separate UUID for each device.  There is a LABEL
(pool name) for the whole pool, but not one for an individual filesystem.

I'm thinking of making the blkid UUID be the GUID of the whole pool, as
any device in the pool would be sufficient to locate all of the component
devices.  This means all devices in the same pool will return the same
UUID, but for identification that should be fine I think...  I haven't
checked for pathologies in libblkid regarding that yet.

On a related note - on Solaris the ZFS filesystems always live in a GPT
partition table, and I note that libblkid doesn't identify this.  Is that
something we want to start adding to libblkid (e.g. GPT, DOS, LVM, etc)?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Index: e2fsprogs-cfs/lib/blkid/probe.c
===
--- e2fsprogs-cfs.orig/lib/blkid/probe.c
+++ e2fsprogs-cfs/lib/blkid/probe.c
@@ -647,6 +647,21 @@ static int probe_jfs(struct blkid_probe 
return 0;
 }
 
+static int probe_zfs(struct blkid_probe *probe, struct blkid_magic *id,
+unsigned char *buf)
+{
+   struct zfs_uber_block *zub;
+   char *vdev_label;
+   const char *pool_name = 0;
+
+   zub = (struct zfs_uber_block *)buf;
+
+   /* read nvpair data for pool name, pool GUID (complex) */
+   //blkid_set_tag(probe-dev, LABEL, pool_name, sizeof(pool_name));
+   //set_uuid(probe-dev, pool_guid, 0);
+   return 0;
+}
+
 static int probe_luks(struct blkid_probe *probe,
   struct blkid_magic *id __BLKID_ATTR((unused)),
   unsigned char *buf)
@@ -896,15 +911,6 @@ static int probe_hfsplus(struct blkid_pr
 }
 
 /*
- * BLKID_BLK_OFFS is at least as large as the highest bim_kboff defined
- * in the type_array table below + bim_kbalign.
- *
- * When probing for a lot of magics, we handle everything in 1kB buffers so
- * that we don't have to worry about reading each combination of block sizes.
- */
-#define BLKID_BLK_OFFS 64  /* currently reiserfs */
-
-/*
  * Various filesystem magics that we can check for.  Note that kboff and
  * sboff are in kilobytes and bytes respectively.  All magics are in
  * byte strings so we don't worry about endian issues.
@@ -954,6 +960,8 @@ static struct blkid_magic type_array[] =
   { iso9660, 32,  1,  5, CD001,probe_iso9660 },
   { iso9660, 32,  9,  5, CDROM,probe_iso9660 },
   { jfs, 32,  0,  4, JFS1, probe_jfs },
+  { zfs, 128, 0,  8, \0\0\0\0\x00\xba\xb1\x0c, probe_zfs },
+  { zfs, 128, 0,  8, \x0c\xb1\xba\x00\0\0\0\0, probe_zfs },
   { hfsplus,  1,  0,  2, BD,   probe_hfsplus },
   { hfsplus,  1,  0,  2, H+,   0 },
   { hfs,  1,  0,  2, BD,   0 },
@@ -1074,7 +1082,7 @@ try_again:
if (!buf)
continue;
 
-   if (memcmp(id-bim_magic, buf + (id-bim_sboff0x3ff),
+   if (memcmp(id-bim_magic, buf + (id-bim_sboff  0x3ff),
   id-bim_len))
continue;
 
@@ -1104,7 +1112,7 @@ try_again:
dev = 0;
goto found_type;
}
-   
+
 found_type:
if (dev  type) {
dev-bid_devno = st.st_rdev;
@@ -1113,7 +1121,7 @@ found_type:
cache-bic_flags |= BLKID_BIC_FL_CHANGED;
 
blkid_set_tag(dev, TYPE, type, 0);
-   
+
DBG(DEBUG_PROBE, printf(%s: devno 0x%04llx, type %s\n,
   dev-bid_name, (long long)st.st_rdev, type));
}
Index: e2fsprogs-cfs/lib/blkid/probe.h
===
--- e2fsprogs-cfs.orig/lib/blkid/probe.h
+++ e2fsprogs-cfs/lib/blkid/probe.h
@@ -190,6 +190,16 @@ struct jfs_super_block {
unsigned char   js_loguuid[16];
 };
 
+#define UBERBLOCK_MAGIC 0x00bab10c  /* oo-ba-bloc!  */
+struct zfs_uberblock {
+   __u64   ub_magic;   /* UBERBLOCK_MAGIC  */
+   __u64   ub_version; /* ZFS_VERSION  */
+   __u64   ub_txg; /* txg of last sync */
+   __u64   ub_guid_sum;/* sum of all vdev guids*/
+   __u64   ub_timestamp;   /* UTC time of last sync

Re: [PATCH] ext4: Fix kernel BUG at fs/ext4/mballoc.c:910!

2008-02-13 Thread Andreas Dilger
On Feb 13, 2008  18:19 +0100, Valerie Clement wrote:
 From: Valerie Clement [EMAIL PROTECTED]
 
 With the flex_bg feature enabled, a large file creation oopses the
 kernel.
 The BUG_ON is:
   BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb));

 As the allocation of the bitmaps and the inode table can be done
 outside the block group with flex_bg, this allows to allocate up to
 EXT4_BLOCKS_PER_GROUP blocks in a group.

Caution is needed here.  In the past we were limited to BLOCKS_PER_GROUP()
blocks per extent (32768 blocks at most, regardless of blocksize I think)
but now an extent might be larger.

Can you please verify that the extent-length limits for initialized vs.
uninitialized extents are being hit so that extents don't accidentally
grow to be  32768 blocks long and suddenly get marked as short uninitialized
extents.

Note that the assertion can still be hit if groups are created with fewer
blocks, or with blocksize  4096.  For example, if we have blocksize = 1024
this gives BLOCKS_PER_GROUP=8192, but an extent can be up to 32768 blocks.

I think the right assertion is now:

BUG_ON(len  EXT4_INIT_MAX_LEN);

if FLEX_BG is active.  I'm not sure if we want to keep the stricter assertion:

BUG_ON(len  EXT4_HAS_INCOMPAT_FEATURE_FLEX_BG(sb) ? EXT4_INIT_MAX_LEN :
EXT4_BLOCKS_PER_GROUP(sb));

but it might be worthwhile at least initially, and I don't think the CPU cost
is very high.

 diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
 index b0f84b4..0275150 100644
 --- a/fs/ext4/mballoc.c
 +++ b/fs/ext4/mballoc.c
 @@ -907,7 +907,7 @@ static void ext4_mb_mark_free_simple(struct super_block 
 *sb,
   unsigned short chunk;
   unsigned short border;
  
 - BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb));
 + BUG_ON(len  EXT4_BLOCKS_PER_GROUP(sb));
  
   border = 2  sb-s_blocksize_bits;
  
 @@ -3286,7 +3286,7 @@ static void ext4_mb_normalize_request(struct 
 ext4_allocation_context *ac,
   }
   BUG_ON(start + size = ac-ac_o_ex.fe_logical 
   start  ac-ac_o_ex.fe_logical);
 - BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
 + BUG_ON(size = 0 || size  EXT4_BLOCKS_PER_GROUP(ac-ac_sb));

Please separate this into two BUG_ON() statements, so it is clear which
one is being hit.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's in e2fsprogs.git (stable)

2008-02-11 Thread Andreas Dilger
On Feb 10, 2008  23:41 -0500, Theodore Ts'o wrote:
 Most of the maint branch went into the e2fsprogs 1.40.6 release that
 happened this weekend.

Including the URL of the GIT repo in these messages would be useful.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][0/28] Lustre e2fsprogs patch series

2008-02-11 Thread Andreas Dilger
On Feb 10, 2008  23:19 -0500, Theodore Ts'o wrote:
 On Sat, Feb 02, 2008 at 12:59:43AM -0700, Andreas Dilger wrote:
  The following series of emails will contain the large part of the
  e2fsprogs patch series that is used for Lustre.  It will not contain
  the regression tests for EXTENTS nor the DIR_NLINK features, as those
  are very large and were previously submitted.
 
 I've applied these patches to the tip of maint, and exported it as
 e2fsprogs-interim on the e2fsprogs git repository.  There quite a
 few patch conflicts, mostly due to some changes that had happened on
 the tip of maint, but also apparently because your patchset was
 missing the flex bg changes.  I haven't applied them yet, but I'll
 probably tack them at the end.

The patch was based on e2fsprogs-1.40.5.  Also note that the majority
of patches are intended for upstream inclusion, with the exception of
the extents patches, which you are reworking.

  A full tarball that includes the patches, series, and regression tests
  will be uploaded to ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/

This didn't work out, as our FTP site was subsumed into a Sun-hosted
download site that week and I'm no longer able to make public-access
uploads.  I have to find some other location to host that tarball.

 If you could sanity check to make sure they are sane, I would
 appreciate it.

I need to catch up from travelling last week, hopefully I'll get to
it by the end of this week.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3 freeze feature

2008-02-08 Thread Andreas Dilger
On Feb 08, 2008  19:48 +0900, Takashi Sato wrote:
 OK I would like to implement the freeze feature on VFS
 as the filesystem independent ioctl so that it can be
 available on filesystems that have already had write_super_lockfs()
 and unlockfs().
 The usage for the freeze ioctl is the following.
  int ioctl(int fd, int FIFREEZE, long *timeval);
fd:file descriptor of mountpoint
FIFREEZE:request cord for freeze
timeval:timeout period (second)

 And the unfreeze ioctl is the following.
  int ioctl(int fd, int FITHAW, NULL);
fd:file descriptor of mountpoint
FITHAW:Request cord for unfreeze

You may as well make the common ioctl the same as the XFS version,
both by number and parameters, so that applications which already
understand the XFS ioctl will work on other filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] extra checking for in-inode EAs

2008-02-05 Thread Andreas Dilger
When investigating the EA problem reported on this list, I noticed that some
of the checks for the in-inode EAs were removed (possibly when the unordered
EAs-in-inode patch was removed).  The following patch returns the checks for
the e_value_offs.  This passes make check with the Lustre EA test cases.

A more complete check (not implemented here) would be to ensure that
the EAs don't overlap as is done with the external EAs.  Some extra
whitespace is removed in the first hunk.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

--- e2fsck/pass1.c.orig 2008-02-04 10:41:50.0 -0700
+++ e2fsck/pass1.c  2008-02-04 17:36:34.0 -0700
@@ -268,14 +268,14 @@
/* scan all entry's headers first */
 
/* take finish entry 0UL into account */
-   remain = storage_size - sizeof(__u32); 
+   remain = storage_size - sizeof(__u32);
 
while (!EXT2_EXT_IS_LAST_ENTRY(entry)) {
__u32 hash;
 
/* header eats this space */
remain -= sizeof(struct ext2_ext_attr_entry);
-   
+
/* is attribute name valid? */
if (EXT2_EXT_ATTR_SIZE(entry-e_name_len)  remain) {
pctx-num = entry-e_name_len;
@@ -293,6 +293,21 @@
goto fix;
}
 
+   /* check value placement */
+   if (start + entry-e_value_offs  end) {
+   pctx-num = entry-e_value_offset;
+   problem = PR_1_ATTR_VALUE_OFFSET;
+   goto fix;
+   }
+
+   /* check value offset + size */
+   if (start + entry-e_value_offs +
+   EXT2_XATTR_SIZE(entry-e_value_size)  end) {
+   pctx-num = entry-e_value_size;
+   problem = PR_1_ATTR_VALUE_SIZE;
+   goto fix;
+   }
+
/* e_value_block must be 0 in inode's ea */
if (entry-e_value_block != 0) {
pctx-num = entry-e_value_block;
@@ -310,7 +325,7 @@
goto fix;
}
 
-   remain -= entry-e_value_size;
+   remain -= EXT2_XATTR_SIZE(entry-e_value_size);
 
entry = EXT2_EXT_ATTR_NEXT(entry);
}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9855] ext3 ACL corruption

2008-02-04 Thread Andreas Dilger
On Jan 31, 2008  22:44 +1030, Kevin Shanahan wrote:
 On Thu, 2008-01-31 at 03:05 -0700, Andreas Dilger wrote:
  ...  To get the interesting bits you need:
  
  debugfs: stat 95   # prints decoded inode, File ACL: is a block 
  number
  debugfs: imap 95   # prints inode block number, offset
  
  dd if=/dev/mapper/vg_main-lv_samba of=/tmp/inode.bin bs=4k count=1 
  skip={iblock}
  dd if=/dev/mapper/vg_main-lv_samba of=/tmp/inode.bin bs=4k count=1 
  skip={ACLblk}
 
 Ah, ok - learning fast. Lets see how I go this time:
 
 # debugfs /dev/mapper/vg_main-lv_samba
 debugfs:  stat 3342652
 
 Inode: 3342652   Type: regularMode:  0770   Flags: 0x0   Generation: 
 3684645243
 User: 0   Group: 10140   Size: 18432
 File ACL: 0Directory ACL: 0
 Links: 1   Blockcount: 40
 Fragment:  Address: 0Number: 0Size: 0
 ctime: 0x475be06e -- Sun Dec  9 23:02:46 2007
 atime: 0x475d4073 -- Tue Dec 11 00:04:43 2007
 mtime: 0x45d2686a -- Wed Feb 14 12:09:54 2007
 Size of extra inode fields: 4
 Extended attributes stored in inode body: 
= 01 00 00 00 01 00 07 00 04 00 05 00 08 00 05 00 d6 27 00 00 08 00 07 00 
 09 28 00 00 08 00 07 00 0a 28 00 00 10 00 07 00 20 00 00 00  (44)
   DOSATTRIB = 0x20 (4)
 BLOCKS:
 (0):6713397, (1):6713399, (2):6713395, (3):6713405, (4):6713396
 TOTAL: 5
 
 debugfs:  imap 3342652
 Inode 3342652 is part of block group 204
   located at block 6684693, offset 0x0b00

The hexdump of this data (od -Ax -tx4 -a) shows the EA is in good shape:

000b80 0004ea0200480200
   eot nul nul nul nul nul stx   j nul stx   H nul nul nul nul nul

inode.i_extra_isize=0x0004
ext3_xattr_ibody_header.h_magic=0xea02

[EA1 entry]
ext3_xattr_entry.e_name_len=0x00  (unused for POSIX_ACL_ACCESS)
ext3_xattr_entry.e_name_index=0x02 (EXT3_INDEX_POSIX_ACL_ACCESS)
ext3_xattr_entry.e_value_offs=0x0048 = 72
ext3_xattr_entry.e_value_block=0x  (unused)



000b90 002c00740109
 , nul nul nul nul nul nul nul  ht soh   t nul nul nul nul nul
[EA1 cont.]
ext3_xattr_entry.e_value_size=0x002c = 44
ext3_xattr_entry.e_hash=0x  (currently unused)

[EA2 entry]
ext3_xattr_entry.e_name_len=0x09
ext3_xattr_entry.e_name_index=0x01 (EXT3_INDEX_USER)
ext3_xattr_entry.e_value_offs=0x0074 = 116
ext3_xattr_entry.e_value_block=0x  (unused)

000ba0 000441534f4449525454
   eot nul nul nul nul nul nul nul   D   O   S   A   T   T   R   I
[EA2 cont.]
ext3_xattr_entry.e_value_size=0x002c = 44
ext3_xattr_entry.e_hash=0x  (currently unused)
ext3_xattr_entry.e_name=DOSATTRIB

000bb0 0042
 B nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
000bc0 
   nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul

000bd0 0001000700010005000400050008
   soh nul nul nul soh nul bel nul eot nul enq nul  bs nul enq nul
[EA1 data]
ext3_acl_header.a_version=0x0001
ext3_acl_entry_short[0].e_tag=0x0001
ext3_acl_entry_short[0].e_perm=0x0007
ext3_acl_entry_short[1].e_tag=0x0004
ext3_acl_entry_short[1].e_perm=0x0005
ext3_acl_entry_short[2].e_tag=0x0008
ext3_acl_entry_short[2].e_perm=0x0005

000be0 27d600070008280900070008
 V   ' nul nul  bs nul bel nul  ht   ( nul nul  bs nul bel nul
[EA1 data cont]
ext3_acl_entry_short[2].e_id=0x27d6
ext3_acl_entry[3].e_tag=0x0008
ext3_acl_entry[3].e_perm=0x0007
ext3_acl_entry[3].e_id=0x2809

ext3_acl_entry[5].e_tag=0x0008
ext3_acl_entry[5].e_perm=0x0007
ext3_acl_entry[5].e_id=0x280a

000bf0 280a00070010002030327830
nl   ( nul nul dle nul bel nul  sp nul nul nul   0   x   2   0
[EA1 data cont]
ext3_acl_entry[6].e_tag=0x0010
ext3_acl_entry[6].e_perm=0x0007
ext3_acl_entry[6].e_id=0x0020

[EA2 data]
0x20

 I'm assuming that File ACL: 0 means that there's no ACL block.

Right.

 e2fsck 1.40-WIP (14-Nov-2006)
 Pass 1: Checking inodes, blocks, and sizes
 (entry-e_value_offs + entry-e_value_size: 116, offs: 120)
 Extended attribute in inode 3342652 has a value offset (72) which is
 invalid
 Clear? no
 ...

Hmm, I wonder if e2fsck is calculating the wrong file offset or something?
The kernel appears to be taking the EA data offset from the end of
i_extra_isize and the ext3_xattr_ibody_header fields (so 0x88 + e_value_offs
from the start of the inode).

Conversely, debugfs isn't having any problem with this EA at all.

h, I think I see the problem, and this was fixed in newer e2fsck.
The EAs are stored out of order in the inode and older e2fsprogs
considered that an error.  That was fixed in the final 1.40 release:

Remove check in e2fsck which requires EA's in inodes to be sorted;
they don't need to be sorted, and e2fsck was previously wrongly
clearing unsorted EA's

[PATCH][1/28] e2fsprogs-specdotin.patch

2008-02-02 Thread Andreas Dilger
Add the distro type to the RPM release number, so that it is
possible release multiple distro packages without having conflicting
RPM package names.

Allow the RPM built from upstream to replace the split packages provided
by the distros.  At some point in the future it may be desirable to also
split the RPM built by this spec file, but this is complicated by the
fact that SLES and RHEL have different splits.

Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.5/e2fsprogs.spec.in
===
--- e2fsprogs-1.40.5.orig/e2fsprogs.spec.in
+++ e2fsprogs-1.40.5/e2fsprogs.spec.in
@@ -6,13 +6,22 @@
 Summary: Utilities for managing the second extended (ext2) filesystem.
 Name: e2fsprogs
 Version: @E2FSPROGS_PKGVER@
-Release: 0
+Release: 0%{_vendor}
 License: GPLv2
 Group: System Environment/Base
 Source:  
ftp://download.sourceforge.net/pub/sourceforge/e2fsprogs/e2fsprogs-%{version}.tar.gz
 Url: http://e2fsprogs.sourceforge.net/
 Prereq: /sbin/ldconfig
 BuildRoot: %{_tmppath}/%{name}-root
+%if %{_vendor} == suse
+Group: System/Filesystems
+Provides: e2fsbn ext2fs libcom_err = %{version}
+Obsoletes: ext2fs libcom_err  %{version}
+%else
+Group: System Environment/Base
+Obsoletes: e2fsprogs-libs  %{version}
+Provides: e2fsprogs-libs = %{version}
+%endif
 
 %description
 The e2fsprogs package contains a number of utilities for creating,

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [2/28] e2fsprogs-eacheck.patch

2008-02-02 Thread Andreas Dilger
Verify in-inode EA structure.
Allow in-inode EAs to have a checksum.
Connect zero-length inodes that have an EA to lost+found.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.5/e2fsck/e2fsck.h
===
--- e2fsprogs-1.40.5.orig/e2fsck/e2fsck.h
+++ e2fsprogs-1.40.5/e2fsck/e2fsck.h
@@ -478,6 +478,9 @@ extern void init_resource_track(struct r
 extern int inode_has_valid_blocks(struct ext2_inode *inode);
 extern void e2fsck_read_inode(e2fsck_t ctx, unsigned long ino,
  struct ext2_inode * inode, const char * proc);
+extern void e2fsck_read_inode_full(e2fsck_t ctx, unsigned long ino,
+  struct ext2_inode *inode,
+  const int bufsize, const char *proc);
 extern void e2fsck_write_inode(e2fsck_t ctx, unsigned long ino,
   struct ext2_inode * inode, const char * proc);
 extern void e2fsck_write_inode_full(e2fsck_t ctx, unsigned long ino,
Index: e2fsprogs-1.40.5/e2fsck/pass1.c
===
--- e2fsprogs-1.40.5.orig/e2fsck/pass1.c
+++ e2fsprogs-1.40.5/e2fsck/pass1.c
@@ -264,6 +264,7 @@ static void check_ea_in_inode(e2fsck_t c
remain = storage_size - sizeof(__u32); 
 
while (!EXT2_EXT_IS_LAST_ENTRY(entry)) {
+   __u32 hash;
 
/* header eats this space */
remain -= sizeof(struct ext2_ext_attr_entry);
@@ -291,9 +292,12 @@ static void check_ea_in_inode(e2fsck_t c
problem = PR_1_ATTR_VALUE_BLOCK;
goto fix;
}
-   
-   /* e_hash must be 0 in inode's ea */
-   if (entry-e_hash != 0) {
+
+   hash = ext2fs_ext_attr_hash_entry(entry,
+ start + entry-e_value_offs);
+
+   /* e_hash may be 0 in older inode's ea */
+   if (entry-e_hash != 0  entry-e_hash != hash) {
pctx-num = entry-e_hash;
problem = PR_1_ATTR_HASH;
goto fix;
@@ -308,15 +312,12 @@ fix:
 * it seems like a corruption. it's very unlikely we could repair
 * EA(s) in automatic fashion -bzzz
 */
-#if 0
-   problem = PR_1_ATTR_HASH;
-#endif
if (problem == 0 || !fix_problem(ctx, problem, pctx))
return;
 
-   /* simple remove all possible EA(s) */
+   /* simply remove all remaining EA(s) */
*((__u32 *)start) = 0UL;
-   e2fsck_write_inode_full(ctx, pctx-ino, pctx-inode,
+   e2fsck_write_inode_full(ctx, pctx-ino,(struct ext2_inode *)pctx-inode,
EXT2_INODE_SIZE(sb), pass1);
 }
 
@@ -1360,10 +1361,13 @@ static int check_ext_attr(e2fsck_t ctx, 
entry = (struct ext2_ext_attr_entry *)(header+1);
end = block_buf + fs-blocksize;
while ((char *)entry  end  *(__u32 *)entry) {
+   __u32 hash;
+
if (region_allocate(region, (char *)entry - (char *)header,
   EXT2_EXT_ATTR_LEN(entry-e_name_len))) {
if (fix_problem(ctx, PR_1_EA_ALLOC_COLLISION, pctx))
goto clear_extattr;
+   break;
}
if ((ctx-ext_attr_ver == 1 
 (entry-e_name_len == 0 || entry-e_name_index != 0)) ||
@@ -1371,6 +1375,7 @@ static int check_ext_attr(e2fsck_t ctx, 
 entry-e_name_index == 0)) {
if (fix_problem(ctx, PR_1_EA_BAD_NAME, pctx))
goto clear_extattr;
+   break;
}
if (entry-e_value_block != 0) {
if (fix_problem(ctx, PR_1_EA_BAD_VALUE, pctx))
@@ -1387,6 +1392,17 @@ static int check_ext_attr(e2fsck_t ctx, 
if (fix_problem(ctx, PR_1_EA_ALLOC_COLLISION, pctx))
goto clear_extattr;
}
+
+   hash = ext2fs_ext_attr_hash_entry(entry, block_buf +
+entry-e_value_offs);
+
+   if (entry-e_hash != hash) {
+   pctx-num = entry-e_hash;
+   if (fix_problem(ctx, PR_1_ATTR_HASH, pctx))
+   goto clear_extattr;
+   entry-e_hash = hash;
+   }
+
entry = EXT2_EXT_ATTR_NEXT(entry);
}
if (region_allocate(region, (char *)entry - (char *)header, 4)) {
@@ -1508,8 +1524,11 @@ static void check_blocks(e2fsck_t ctx, s
}
}
 
-   if (inode-i_file_acl  check_ext_attr(ctx, pctx, block_buf))
+   if (inode-i_file_acl  check_ext_attr(ctx, pctx, block_buf)) {
+   if (ctx-flags  E2F_FLAG_SIGNAL_MASK

Re: [PATCH][4/28] e2fsprogs-tests-f_unsorted_EAs.patch

2008-02-02 Thread Andreas Dilger
Attached binary patch.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



e2fsprogs-tests-f_unsorted_EAs.patch
Description: Binary data


Re: [PATCH][6/28] e2fsprogs-nlinks.patch

2008-02-02 Thread Andreas Dilger

Add support for the DIR_NLINK feature.

This patch includes the changes required to e2fsck to understand the 
nlink count changes made in the kernel. In pass2, while counting the 
links for a directory, if the link count exceeds 65000, its permanently 
set to EXT2_LINK_MAX + 10. In pass4, when the counted and actual nlink
counts are compared, e2fsck does not flag an error if counted links =
EXT2_NLINK_MAX + 10 and existing link count is 1. 

It also handles the case when a directory had more than 65000 subdirs 
and they were later deleted. The nlink count of such a directory remains 
1. In pass4 if counted links are 2 and if existing nlink count = 1, 
e2fsck corrects the nlink count without displaying any errors. 

The file hard link count is also increased to 65000, but this cannot be
exceeded.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]

Index: e2fsprogs-1.40.1/e2fsck/pass2.c
===
--- e2fsprogs-1.40.1.orig/e2fsck/pass2.c
+++ e2fsprogs-1.40.1/e2fsck/pass2.c
@@ -717,7 +717,7 @@ static int check_dir_block(ext2_filsys f
blk_t   block_nr = db-blk;
ext2_ino_t  ino = db-ino;
ext2_ino_t  subdir_parent;
-   __u16   links;
+   __u32   links;
struct check_dir_struct *cd;
char*buf;
e2fsck_tctx;
@@ -1024,9 +1024,11 @@ static int check_dir_block(ext2_filsys f
dups_found++;
} else
dict_alloc_insert(de_dict, dirent, dirent);
-   
-   ext2fs_icount_increment(ctx-inode_count, dirent-inode,
-   links);
+
+   ext2fs_icount_inc32(ctx-inode_count, dirent-inode, links,
+   ext2fs_test_inode_bitmap(ctx-inode_dir_map,
+dirent-inode) ?
+   EXT2_LINK_MAX : (__u32)~0U);
if (links  1)
ctx-fs_links_count++;
ctx-fs_total_count++;
Index: e2fsprogs-1.40.1/e2fsck/pass4.c
===
--- e2fsprogs-1.40.1.orig/e2fsck/pass4.c
+++ e2fsprogs-1.40.1/e2fsck/pass4.c
@@ -99,7 +99,8 @@ void e2fsck_pass4(e2fsck_t ctx)
struct resource_track   rtrack;
 #endif
struct problem_context  pctx;
-   __u16   link_count, link_counted;
+   __u16   link_count;
+   __u32   link_counted;
char*buf = 0;
int group, maxgroup;

@@ -145,7 +146,7 @@ void e2fsck_pass4(e2fsck_t ctx)
 ext2fs_test_inode_bitmap(ctx-inode_bb_map, i)))
continue;
ext2fs_icount_fetch(ctx-inode_link_info, i, link_count);
-   ext2fs_icount_fetch(ctx-inode_count, i, link_counted);
+   ext2fs_icount_fetch32(ctx-inode_count, i, link_counted);
if (link_counted == 0) {
if (!buf)
buf = e2fsck_allocate_memory(ctx,
@@ -156,10 +157,12 @@ void e2fsck_pass4(e2fsck_t ctx)
continue;
ext2fs_icount_fetch(ctx-inode_link_info, i,
link_count);
-   ext2fs_icount_fetch(ctx-inode_count, i,
-   link_counted);
+   ext2fs_icount_fetch32(ctx-inode_count, i,
+ link_counted);
}
-   if (link_counted != link_count) {
+   if (link_counted != link_count 
+   !(ext2fs_test_inode_bitmap(ctx-inode_dir_map, i) 
+ link_count == 1  link_counted  EXT2_LINK_MAX)) {
e2fsck_read_inode(ctx, i, inode, pass4);
pctx.ino = i;
pctx.inode = inode;
@@ -169,7 +172,12 @@ void e2fsck_pass4(e2fsck_t ctx)
PR_4_INCONSISTENT_COUNT, pctx);
}
pctx.num = link_counted;
-   if (fix_problem(ctx, PR_4_BAD_REF_COUNT, pctx)) {
+   /* i_link_count was previously exceeded, but no longer
+* is, fix this but don't consider it an error */
+   if ((LINUX_S_ISDIR(inode-i_mode)  link_counted  1 
+(inode-i_flags  EXT2_INDEX_FL) 
+link_count == 1  !(ctx-options  E2F_OPT_NO)) ||
+(fix_problem(ctx, PR_4_BAD_REF_COUNT, pctx))) {
inode-i_links_count = link_counted;
e2fsck_write_inode(ctx, i, inode, pass4);
}
Index

Re: [PATCH][8/28] e2fsprogs-config-before-cmdline.patch

2008-02-02 Thread Andreas Dilger

The patch changes the order that the config file and command line are
parsed so that command line has precedence.  It also allows multiple
-E options to be specified on the command line.

Signed-off-by: Jim Garlick [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.4/e2fsck/unix.c
===
--- e2fsprogs-1.40.4.orig/e2fsck/unix.c
+++ e2fsprogs-1.40.4/e2fsck/unix.c
@@ -588,7 +588,6 @@ static errcode_t PRS(int argc, char *arg
 #ifdef HAVE_SIGNAL_H
struct sigactionsa;
 #endif
-   char*extended_opts = 0;
char*cp;
int res;/* result of sscanf */
 #ifdef CONFIG_JBD_DEBUG
@@ -619,6 +618,12 @@ static errcode_t PRS(int argc, char *arg
ctx-program_name = *argv;
else
ctx-program_name = e2fsck;
+
+   if ((cp = getenv(E2FSCK_CONFIG)) != NULL)
+   config_fn[0] = cp;
+   profile_set_syntax_err_cb(syntax_err_report);
+   profile_init(config_fn, ctx-profile);
+
while ((c = getopt (argc, argv, 
panyrcC:B:dE:fvtFVM:b:I:j:P:l:L:N:SsDk)) != EOF)
switch (c) {
case 'C':
@@ -645,7 +650,7 @@ static errcode_t PRS(int argc, char *arg
ctx-options |= E2F_OPT_COMPRESS_DIRS;
break;
case 'E':
-   extended_opts = optarg;
+   parse_extended_opts(ctx, optarg);
break;
case 'p':
case 'a':
@@ -771,13 +776,6 @@ static errcode_t PRS(int argc, char *arg
argv[optind]);
fatal_error(ctx, 0);
}
-   if (extended_opts)
-   parse_extended_opts(ctx, extended_opts);
-
-   if ((cp = getenv(E2FSCK_CONFIG)) != NULL)
-   config_fn[0] = cp;
-   profile_set_syntax_err_cb(syntax_err_report);
-   profile_init(config_fn, ctx-profile);
 
if (flush) {
fd = open(ctx-filesystem_name, O_RDONLY, 0);

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][9/28] e2fsprogs-SLES10--m-support.patch

2008-02-02 Thread Andreas Dilger
)) {
+   continue;
+   }
+   }
fsck_device(fs, interactive);
if (serialize ||
(max_running  (num_running = max_running))) {

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][11/28] e2fsprogs-nlinks-flag.patch

2008-02-02 Thread Andreas Dilger
;
+   dir_nlink = sb-s_feature_ro_compat 
+   EXT4_FEATURE_RO_COMPAT_DIR_NLINK;
filetype = sb-s_feature_incompat 
EXT2_FEATURE_INCOMPAT_FILETYPE;
flex_bg = sb-s_feature_incompat 
@@ -359,6 +365,14 @@ static void update_feature_set(ext2_fils
if (uuid_is_null((unsigned char *) sb-s_hash_seed))
uuid_generate((unsigned char *) sb-s_hash_seed);
}
+
+   if (old_dir_nlink  !dir_nlink) {
+   fputs(_(The dir_nlink flag was cleared.  
+   Please run e2fsck before using the filesystem\n
+   to verify no many-linked directories exist or 
+   data loss may result.\n), stderr);
+   }
+
if (!flex_bg  old_flex_bg) {
if (ext2fs_check_desc(fs)) {
fputs(_(Clearing the flex_bg flag would 

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][12/28] e2fsprogs-expand-extra-isize.patch

2008-02-02 Thread Andreas Dilger

This patch adds a -E expand_extra_isize feature which makes sure that
_every_ used inode has i_extra_isize = s_min_extra_isize if
s_min_extra_isize is set. Else it makes sure that i_extra_isize of every
inode is equal to sizeof(ext2_inode_large) - 128.

This is useful for the case where nanosecond timestamps or 64-bit inode
version fields are required for all inodes in the filesystem.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]

Index: e2fsprogs-1.40.5/lib/ext2fs/ext_attr.c
===
--- e2fsprogs-1.40.5.orig/lib/ext2fs/ext_attr.c
+++ e2fsprogs-1.40.5/lib/ext2fs/ext_attr.c
@@ -17,6 +17,7 @@
 #endif
 #include string.h
 #include time.h
+#include errno.h
 
 #include ext2_fs.h
 #include ext2_ext_attr.h
@@ -60,11 +61,39 @@ __u32 ext2fs_ext_attr_hash_entry(struct 
 #undef NAME_HASH_SHIFT
 #undef VALUE_HASH_SHIFT
 
+#define BLOCK_HASH_SHIFT 16
+/*
+ * Re-compute the extended attribute hash value after an entry has changed.
+ */
+static void ext2fs_attr_rehash(struct ext2_ext_attr_header *header,
+  struct ext2_ext_attr_entry *entry)
+{
+   struct ext2_ext_attr_entry *here;
+   __u32 hash = 0;
+
+   entry-e_hash = ext2fs_ext_attr_hash_entry(entry, (char *) header +
+  entry-e_value_offs);
+
+   here = ENTRY(header+1);
+   while (!EXT2_EXT_IS_LAST_ENTRY(here)) {
+   if (!here-e_hash) {
+   /* Block is not shared if an entry's hash value == 0 */
+   hash = 0;
+   break;
+   }
+   hash = (hash  BLOCK_HASH_SHIFT) ^
+  (hash  (8*sizeof(hash) - BLOCK_HASH_SHIFT)) ^
+  here-e_hash;
+   here = EXT2_EXT_ATTR_NEXT(here);
+   }
+   header-h_hash = hash;
+}
+
 errcode_t ext2fs_read_ext_attr(ext2_filsys fs, blk_t block, void *buf)
 {
errcode_t   retval;
 
-   retval = io_channel_read_blk(fs-io, block, 1, buf);
+   retval = io_channel_read_blk(fs-io, block, 1, buf);
if (retval)
return retval;
 #ifdef WORDS_BIGENDIAN
@@ -88,7 +117,7 @@ errcode_t ext2fs_write_ext_attr(ext2_fil
 #else
write_buf = (char *) inbuf;
 #endif
-   retval = io_channel_write_blk(fs-io, block, 1, write_buf);
+   retval = io_channel_write_blk(fs-io, block, 1, write_buf);
if (buf)
ext2fs_free_mem(buf);
if (!retval)
@@ -122,7 +151,10 @@ errcode_t ext2fs_adjust_ea_refcount(ext2
if (retval)
goto errout;
 
-   header = (struct ext2_ext_attr_header *) block_buf;
+   header = BHDR(block_buf);
+   if (header-h_magic != EXT2_EXT_ATTR_MAGIC)
+   return EXT2_ET_EA_BAD_MAGIC;
+
header-h_refcount += adjust;
if (newcount)
*newcount = header-h_refcount;
@@ -136,3 +168,881 @@ errout:
ext2fs_free_mem(buf);
return retval;
 }
+
+struct ext2_attr_info {
+   int name_index;
+   const char *name;
+   const char *value;
+   int value_len;
+};
+
+struct ext2_attr_search {
+   struct ext2_ext_attr_entry *first;
+   char *base;
+   char *end;
+   struct ext2_ext_attr_entry *here;
+   int not_found;
+};
+
+struct ext2_attr_ibody_find {
+   ext2_ino_t ino;
+   struct ext2_attr_search s;
+};
+
+struct ext2_attr_block_find {
+   struct ext2_attr_search s;
+   char *block;
+};
+
+void ext2fs_attr_shift_entries(struct ext2_ext_attr_entry *entry,
+  int value_offs_shift, char *to,
+  char *from, int n)
+{
+   struct ext2_ext_attr_entry *last = entry;
+
+   /* Adjust the value offsets of the entries */
+   for (; !EXT2_EXT_IS_LAST_ENTRY(last); last = EXT2_EXT_ATTR_NEXT(last)) {
+   if (!last-e_value_block  last-e_value_size) {
+   last-e_value_offs = last-e_value_offs +
+   value_offs_shift;
+   }
+   }
+   /* Shift the entries by n bytes */
+   memmove(to, from, n);
+}
+
+/*
+ * This function returns the free space present in the inode or the EA block.
+ * total is number of bytes taken up by the EA entries and is used to shift the
+ * EAs in ext2fs_expand_extra_isize().
+ */
+int ext2fs_attr_free_space(struct ext2_ext_attr_entry *last,
+  int *min_offs, char *base, int *total)
+{
+   for (; !EXT2_EXT_IS_LAST_ENTRY(last); last = EXT2_EXT_ATTR_NEXT(last)) {
+   *total += EXT2_EXT_ATTR_LEN(last-e_name_len);
+   if (!last-e_value_block  last-e_value_size) {
+   int offs = last-e_value_offs;
+   if (offs  *min_offs)
+   *min_offs = offs;
+   }
+   }
+
+   return (*min_offs - ((char

Re: [PATCH][15/28] e2fsprogs-ibadness-counter.patch

2008-02-02 Thread Andreas Dilger

The present e2fsck code checks the inode, per field basis. It doesn't
take into consideration to total sanity of the inode. This may cause
e2fsck turning a garbage inode into an apparently sane inode (It is a
vessel of fertilizer, and none may abide its strength.).

The following patch adds a heuristics to detect the degree of badness of
an inode. icount mechanism is used to keep track of the badness of every
inode. The badness is increased as various fields in inode are found to
be corrupt. Badness above a certain threshold value results in deletion
of the inode. The default threshold value is 7, it can be specified to
e2fsck using -E inode_badness_threshold=value

This can avoid lengthy pass1b shared block processing, where a corrupt
chunk of the inode table has resulted in a bunch of garbage inodes
suddenly having shared blocks with a lot of good inodes (or each other).

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Index: e2fsprogs-1.40.4/e2fsck/e2fsck.h
===
--- e2fsprogs-1.40.4.orig/e2fsck/e2fsck.h
+++ e2fsprogs-1.40.4/e2fsck/e2fsck.h
@@ -11,6 +11,7 @@
 
 #include stdio.h
 #include string.h
+#include stddef.h
 #ifdef HAVE_UNISTD_H
 #include unistd.h
 #endif
@@ -195,6 +196,18 @@ typedef enum {
E2F_CLONE_ZERO
 } clone_opt_t;
 
+#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)  \
+   ((offsetof(typeof(*ext4_inode), field) +\
+ sizeof(ext4_inode-field))\
+   = (EXT2_GOOD_OLD_INODE_SIZE +  \
+   (einode)-i_extra_isize))   \
+
+#define BADNESS_NORMAL 1
+#define BADNESS_HIGH   2
+#define BADNESS_THRESHOLD  8
+#define BADNESS_BAD_MODE   100
+#define BADNESS_LARGE_FILE 219902322ULL
+
 /*
  * Define the extended attribute refcount structure
  */
@@ -229,7 +242,6 @@ struct e2fsck_struct {
unsigned long max);
 
ext2fs_inode_bitmap inode_used_map; /* Inodes which are in use */
-   ext2fs_inode_bitmap inode_bad_map; /* Inodes which are bad somehow */
ext2fs_inode_bitmap inode_dir_map; /* Inodes which are directories */
ext2fs_inode_bitmap inode_bb_map; /* Inodes which are in bad blocks */
ext2fs_inode_bitmap inode_imagic_map; /* AFS inodes */
@@ -244,6 +256,8 @@ struct e2fsck_struct {
 */
ext2_icount_t   inode_count;
ext2_icount_t inode_link_info;
+   ext2_icount_t inode_badness;
+   int inode_badness_threshold;
 
ext2_refcount_t refcount;
ext2_refcount_t refcount_extra;
@@ -344,6 +358,7 @@ struct e2fsck_struct {
/* misc fields */
time_t now;
time_t time_fudge;  /* For working around buggy init scripts */
+   time_t now_tolerance_val;
int ext_attr_ver;
 shared_opt_t shared;
 clone_opt_t clone;
@@ -454,6 +469,8 @@ extern int e2fsck_pass1_check_device_ino
   struct ext2_inode *inode);
 extern int e2fsck_pass1_check_symlink(ext2_filsys fs,
  struct ext2_inode *inode, char *buf);
+extern void e2fsck_mark_inode_bad(e2fsck_t ctx, ino_t ino, int count);
+extern int is_inode_bad(e2fsck_t ctx, ino_t ino);
 
 /* pass2.c */
 extern int e2fsck_process_bad_inode(e2fsck_t ctx, ext2_ino_t dir,
Index: e2fsprogs-1.40.4/e2fsck/pass1.c
===
--- e2fsprogs-1.40.4.orig/e2fsck/pass1.c
+++ e2fsprogs-1.40.4/e2fsck/pass1.c
@@ -20,7 +20,8 @@
  * - A bitmap of which inodes are in use.  (inode_used_map)
  * - A bitmap of which inodes are directories. (inode_dir_map)
  * - A bitmap of which inodes are regular files.   (inode_reg_map)
- * - A bitmap of which inodes have bad fields. (inode_bad_map)
+ * - An icount mechanism is used to keep track of
+ *   inodes with bad fields and its badness(ctx-inode_badness)
  * - A bitmap of which inodes are in bad blocks.   (inode_bb_map)
  * - A bitmap of which inodes are imagic inodes.   (inode_imagic_map)
  * - A bitmap of which inodes need to be expanded  (expand_eisize_map)
@@ -68,7 +69,6 @@ static void check_blocks(e2fsck_t ctx, s
 static void mark_table_blocks(e2fsck_t ctx);
 static void alloc_bb_map(e2fsck_t ctx);
 static void alloc_imagic_map(e2fsck_t ctx);
-static void mark_inode_bad(e2fsck_t ctx, ino_t ino);
 static void handle_fs_bad_blocks(e2fsck_t ctx);
 static void process_inodes(e2fsck_t ctx, char *block_buf);
 static EXT2_QSORT_TYPE process_inode_cmp(const void *a, const void *b);
@@ -220,6 +220,7 @@ static void check_immutable(e2fsck_t ctx
if (!(pctx-inode-i_flags  BAD_SPECIAL_FLAGS))
return;
 
+   e2fsck_mark_inode_bad(ctx, pctx-ino, BADNESS_NORMAL);
if (!fix_problem(ctx, PR_1_SET_IMMUTABLE, pctx))
return;
 
@@ -238,6 +239,7

Re: [PATCH][19/28] e2fsprogs-stride_option.patch

2008-02-02 Thread Andreas Dilger

Add support for setting the s_raid_stride and s_raid_stripe_width
fields in the superblock via mke2fs and tune2fs.c.  This is useful
for mballoc to align block allocation on the RAID stripe boundaries.

Fix up the debugfs ssv command to set a number of new superblock fields.

Signed-off-by: Rupesh Thakare [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.5/lib/ext2fs/initialize.c
===
--- e2fsprogs-1.40.5.orig/lib/ext2fs/initialize.c
+++ e2fsprogs-1.40.5/lib/ext2fs/initialize.c
@@ -156,6 +156,8 @@ errcode_t ext2fs_initialize(const char *
set_field(s_feature_incompat, 0);
set_field(s_feature_ro_compat, 0);
set_field(s_first_meta_bg, 0);
+   set_field(s_raid_stride, 0);/* default stride size: 0 */
+   set_field(s_raid_stripe_width, 0);  /* default stripe width: 0 */
set_field(s_flags, 0);
if (super-s_feature_incompat  ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
retval = EXT2_ET_UNSUPP_FEATURE;
Index: e2fsprogs-1.40.5/misc/mke2fs.c
===
--- e2fsprogs-1.40.5.orig/misc/mke2fs.c
+++ e2fsprogs-1.40.5/misc/mke2fs.c
@@ -773,7 +773,7 @@ static int set_os(struct ext2_super_bloc
 static void parse_extended_opts(struct ext2_super_block *param, 
const char *opts)
 {
-   char*buf, *token, *next, *p, *arg;
+   char*buf, *token, *next, *p, *arg, *badopt = ;
int len;
int r_usage = 0;
 
@@ -800,16 +800,32 @@ static void parse_extended_opts(struct e
if (strcmp(token, stride) == 0) {
if (!arg) {
r_usage++;
+   badopt = token;
continue;
}
-   fs_stride = strtoul(arg, p, 0);
-   if (*p || (fs_stride == 0)) {
+   param-s_raid_stride = strtoul(arg, p, 0);
+   if (*p || (param-s_raid_stride == 0)) {
fprintf(stderr,
_(Invalid stride parameter: %s\n),
arg);
r_usage++;
continue;
}
+   } else if (strcmp(token, stripe-width) == 0 ||
+  strcmp(token, stripe_width) == 0) {
+   if (!arg) {
+   r_usage++;
+   badopt = token;
+   continue;
+   }
+   param-s_raid_stripe_width = strtoul(arg, p, 0);
+   if (*p || (param-s_raid_stripe_width == 0)) {
+   fprintf(stderr,
+   _(Invalid stripe-width parameter: 
%s\n),
+   arg);
+   r_usage++;
+   continue;
+   }
} else if (!strcmp(token, resize)) {
unsigned long resize, bpg, rsv_groups;
unsigned long group_desc_count, desc_blocks;
@@ -818,6 +834,7 @@ static void parse_extended_opts(struct e
 
if (!arg) {
r_usage++;
+   badopt = token;
continue;
}
 
@@ -868,21 +885,31 @@ static void parse_extended_opts(struct e
}
} else if (!strcmp(token, test_fs)) {
param-s_flags |= EXT2_FLAGS_TEST_FILESYS;
-   } else
+   } else {
r_usage++;
+   badopt = token;
+   }
}
if (r_usage) {
-   fprintf(stderr, _(\nBad options specified.\n\n
+   fprintf(stderr, _(\nBad option(s) specified: %s\n\n
Extended options are separated by commas, 
and may take an argument which\n
\tis set off by an equals ('=') sign.\n\n
Valid extended options are:\n
-   \tstride=stride length in blocks\n
-   \tresize=resize maximum size in blocks\n
-   \ttest_fs\n));
+   \tstride=RAID per-disk data chunk in blocks\n
+   \tstripe-width=RAID stride * data disks in blocks\n
+   \tresize=resize maximum size in blocks\n\n
+   \ttest_fs\n),
+   badopt);
free(buf);
exit(1);
}
+   if (param-s_raid_stride 
+   (param-s_raid_stripe_width % param-s_raid_stride) != 0

Re: [PATCH][20/28] e2fsprogs-mmp.patch

2008-02-02 Thread Andreas Dilger

Add multi-mount protection support to libext2fs (INCOMPAT_MMP feature).

This allows mke2fs, e2fsck, and others to detect if the filesystem is
mounted on a remote node (on SAN disks) and avoid corrupting the
filesystem.  For e2fsprogs this only means that it check the MMP block
to see if the filesystem is in use, and mark the filesystem busy while
e2fsck is running on the system.

There is no requirement that e2fsck updates the MMP block in any regular
interval, but e2fsck does this occasionally to provide additional
information to the sysadmin in case of conflict.

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.5/lib/e2p/feature.c
===
--- e2fsprogs-1.40.5.orig/lib/e2p/feature.c
+++ e2fsprogs-1.40.5/lib/e2p/feature.c
@@ -67,6 +67,8 @@ static struct feature feature_list[] = {
extent },
{   E2P_FEATURE_INCOMPAT, EXT4_FEATURE_INCOMPAT_64BIT,
64bit },
+   {   E2P_FEATURE_INCOMPAT, EXT4_FEATURE_INCOMPAT_MMP,
+   mmp },
{   E2P_FEATURE_INCOMPAT, EXT4_FEATURE_INCOMPAT_FLEX_BG,
 flex_bg},
{   0, 0, 0 },
Index: e2fsprogs-1.40.5/lib/ext2fs/ext2_fs.h
===
--- e2fsprogs-1.40.5.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-1.40.5/lib/ext2fs/ext2_fs.h
@@ -566,11 +566,11 @@ struct ext2_super_block {
__u16   s_min_extra_isize;  /* All inodes have at least # bytes */
__u16   s_want_extra_isize; /* New inodes should reserve # bytes */
__u32   s_flags;/* Miscellaneous flags */
-   __u16   s_raid_stride;  /* RAID stride */
-   __u16   s_mmp_interval; /* # seconds to wait in MMP checking */
-   __u64   s_mmp_block;/* Block for multi-mount protection */
-   __u32   s_raid_stripe_width;/* blocks on all data disks (N*stride)*/
-   __u32   s_reserved[163];/* Padding to the end of the block */
+   __u16   s_raid_stride;  /* RAID stride */
+   __u16   s_mmp_update_interval;  /* # seconds to wait in MMP checking */
+   __u64   s_mmp_block;/* Block for multi-mount protection */
+   __u32   s_raid_stripe_width;/* blocks on all data disks (N*stride)*/
+   __u32   s_reserved[163];/* Padding to the end of the block */
 };
 
 /*
@@ -637,7 +637,8 @@ struct ext2_super_block {
 
 
 #define EXT2_FEATURE_COMPAT_SUPP   0
-#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE)
+#define EXT2_FEATURE_INCOMPAT_SUPP(EXT2_FEATURE_INCOMPAT_FILETYPE| \
+  EXT4_FEATURE_INCOMPAT_MMP)
 #define EXT2_FEATURE_RO_COMPAT_SUPP(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK| \
@@ -717,26 +718,34 @@ struct ext2_dir_entry_2 {
 /*
  * This structure will be used for multiple mount protection. It will be
  * written into the block number saved in the s_mmp_block field in the
- * superblock.
- */
-#defineEXT2_MMP_MAGIC0x004D4D50 /* ASCII for MMP */
-#defineEXT2_MMP_CLEAN0xFF4D4D50 /* Value of mmp_seq for clean 
unmount */
-#defineEXT2_MMP_FSCK_ON  0xE24D4D50 /* Value of mmp_seq when being 
fscked */
+ * superblock. Programs that check MMP should assume that if
+ * SEQ_FSCK (or any unknown code above SEQ_MAX) is present then it is NOT safe
+ * to use the filesystem, regardless of how old the timestamp is.
+ */
+#define EXT2_MMP_MAGIC 0x004D4D50U /* ASCII for MMP */
+#define EXT2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define EXT2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define EXT2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
 
 struct mmp_struct {
-   __u32   mmp_magic;
-   __u32   mmp_seq;
-   __u64   mmp_time;
-   charmmp_nodename[64];
-   charmmp_bdevname[32];
-   __u16   mmp_interval;
+   __u32   mmp_magic;  /* Magic number for MMP */
+   __u32   mmp_seq;/* Sequence no. updated periodically */
+   __u64   mmp_time;   /* Time last updated */
+   charmmp_nodename[64];   /* Node which last updated MMP block */
+   charmmp_bdevname[32];   /* Bdev which last updated MMP block */
+   __u16   mmp_check_interval; /* Changed mmp_check_interval */
__u16   mmp_pad1;
-   __u32   mmp_pad2;
+   __u32   mmp_pad2[227];
 };
 
 /*
- * Interval in number of seconds to update the MMP sequence number.
+ * Default interval in seconds to update the MMP sequence number.
+ */
+#define EXT2_MMP_UPDATE_INTERVAL   1
+
+/*
+ * Minimum interval for MMP checking in seconds

Re: [PATCH][25/28] e2fsprogs-i_size-corruption.patch

2008-02-02 Thread Andreas Dilger

Fix handling of block preallocation support in cases where the kernel
PAGE_SIZE is larger than the filesystem blocksize.

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.2/e2fsck/pass1.c
===
--- e2fsprogs-1.40.2.orig/e2fsck/pass1.c
+++ e2fsprogs-1.40.2/e2fsck/pass1.c
@@ -2103,7 +2103,7 @@ static void check_blocks(e2fsck_t ctx, s
if ((pb.last_block = 0) 
/* allow allocated blocks to end of PAGE_SIZE */
(size  (__u64)pb.last_block * fs-blocksize) 
-   (pb.last_block / blkpg * blkpg != pb.last_block ||
+   ((pb.last_block+1) / blkpg * blkpg != (pb.last_block+1) ||
 size  (__u64)(pb.last_block  ~(blkpg-1)) *fs-blocksize))
bad_size = 3;
else if (size  ext2_max_sizes[fs-super-s_log_block_size])


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][26/28] e2fsprogs-fiemap.patch

2008-02-02 Thread Andreas Dilger

Add support for ioctl(FIEMAP) to filefrag.  If the kernel supports FIEMAP
the filefrag program prefers this more efficient mechanism to get extent
information instead of repeated FIBMAP calls.

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.2/misc/filefrag.c
===
--- e2fsprogs-1.40.2.orig/misc/filefrag.c
+++ e2fsprogs-1.40.2/misc/filefrag.c
@@ -38,11 +38,47 @@ extern int optind;
 #include sys/vfs.h
 #include sys/ioctl.h
 #include linux/fd.h
+#include ext2fs/ext2_types.h
 
 int verbose = 0;
 
-#define FIBMAP_IO(0x00,1)  /* bmap access */
-#define FIGETBSZ   _IO(0x00,2) /* get the block size used for bmap */
+struct fiemap_extent {
+   __u64 fe_offset; /* offset in bytes for the start of the extent */
+   __u64 fe_length; /* length in bytes for the extent */
+   __u32 fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
+   __u32 fe_lun;/* logical device number for extent (starting at 0) */
+};
+
+struct fiemap {
+   __u64 fm_start; /* logical starting byte offset (in/out) */
+   __u64 fm_length;/* logical length of map (in/out) */
+   __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
+   __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
+   __u64 fm_unused;
+   struct fiemap_extent fm_extents[0];
+};
+
+#define FIEMAP_FLAG_SYNC   0x0001 /* sync file data before map */
+#define FIEMAP_FLAG_HSM_READ0x0002 /* get data from HSM before map */
+#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
+#define FIEMAP_FLAG_INCOMPAT0xff00 /* error for unknown flags in here 
*/
+
+#define FIEMAP_EXTENT_HOLE  0x0001 /* has no data or space allocation 
*/
+#define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* space allocated, but no data */
+#define FIEMAP_EXTENT_UNMAPPED  0x0004 /* has data but no space 
allocation*/
+#define FIEMAP_EXTENT_ERROR 0x0008 /* mapping error, errno in 
fe_start*/
+#define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* cannot access data directly */
+#define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
+#define FIEMAP_EXTENT_DELALLOC  0x0040 /* has data but not yet written,
+ must have EXTENT_UNKNOWN set */
+#define FIEMAP_EXTENT_SECONDARY 0x0080 /* data (also) in secondary storage,
+ not in primary if EXTENT_UNKNOWN*/
+#define FIEMAP_EXTENT_EOF   0x0100 /* if fm_start+fm_len is beyond 
EOF*/
+
+
+#define FIBMAP _IO(0x00, 1)/* bmap access */
+#define FIGETBSZ   _IO(0x00, 2)/* get the block size used for bmap */
+#define EXT4_IOC_FIEMAP_IOWR('f', 10, struct fiemap) /* get file  
extent info*/
 
 #define EXT4_EXTENTS_FL0x0008 /* Inode uses 
extents */
 #defineEXT3_IOC_GETFLAGS   _IOR('f', 1, long)
@@ -71,6 +107,62 @@ static unsigned long get_bmap(int fd, un
return b;
 }
 
+int filefrag_fiemap(int fd, int bs, int *num_extents)
+{
+   char buf[4096] = ;
+   struct fiemap *fiemap = (struct fiemap *)buf;
+   int count = (sizeof(buf) - sizeof(*fiemap)) /
+   sizeof(struct fiemap_extent);
+   __u64 logical_blk = 0, last_blk = 0;
+   unsigned long flags;
+   int tot_extents = 0;
+   int eof = 0;
+   int i;
+   int rc;
+
+   memset(fiemap, 0, sizeof(struct fiemap));
+   fiemap-fm_extent_count = count;
+   fiemap-fm_length = ~0ULL;
+   if (!verbose)
+   flags |= FIEMAP_FLAG_NUM_EXTENTS;
+
+   do {
+   fiemap-fm_length = ~0ULL;
+   fiemap-fm_flags = flags;
+   fiemap-fm_extent_count = count;
+   rc = ioctl (fd, EXT4_IOC_FIEMAP, (unsigned long) fiemap);
+   if (rc)
+   return rc;
+
+   if (!verbose) {
+   *num_extents = fiemap-fm_extent_count;
+   goto out;
+   }
+
+   for (i = 0; i  fiemap-fm_extent_count; i++) {
+   __u64 phy_blk;
+   unsigned long ext_len;
+
+   phy_blk = fiemap-fm_extents[i].fe_offset / bs;
+   ext_len = fiemap-fm_extents[i].fe_length / bs;
+   if (logical_blk  (phy_blk != last_blk+1))
+   printf(Discontinuity: Block %llu is at %llu 
+  (was %llu)\n, logical_blk, phy_blk,
+  last_blk);
+   logical_blk += ext_len;
+   last_blk = phy_blk + ext_len - 1;
+   if (fiemap-fm_extents[i].fe_flags  FIEMAP_EXTENT_EOF)
+   eof = 1;
+   }
+   fiemap

Re: [PATCH][27/28] e2fsprogs-debugfs-supported_features.patch

2008-02-02 Thread Andreas Dilger

Print out the currently supported features of e2fsprogs/libext2fs
via a new debugfs supported_features command.

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.2/debugfs/debug_cmds.ct
===
--- e2fsprogs-1.40.2.orig/debugfs/debug_cmds.ct
+++ e2fsprogs-1.40.2/debugfs/debug_cmds.ct
@@ -154,5 +154,8 @@ request do_dump_unused, Dump unused blo
 request do_set_current_time, Set current time to use when setting filesystme 
fields,
set_current_time;
 
+request do_supported_features, Print features supported by this version of 
e2fsprogs,
+   supported_features;
+
 end;
 
Index: e2fsprogs-1.40.2/debugfs/debugfs.c
===
--- e2fsprogs-1.40.2.orig/debugfs/debugfs.c
+++ e2fsprogs-1.40.2/debugfs/debugfs.c
@@ -1772,6 +1772,44 @@ void do_set_current_time(int argc, char 
}
 }
 
+void do_supported_features(int argc, char *argv[])
+{
+FILE   *out = stdout;
+inti, j, ret;
+__u32  supp[3] = { EXT2_LIB_FEATURE_COMPAT_SUPP,
+   EXT2_LIB_FEATURE_INCOMPAT_SUPP,
+   EXT2_LIB_FEATURE_RO_COMPAT_SUPP };
+   __u32   m;
+   int compat;
+   unsigned int feature_flag;
+
+   if (argc = 1) {
+   ret = e2p_string2feature(argv[1], compat, feature_flag);
+   if (ret)
+   goto err;
+
+   if (!(supp[compat]  feature_flag))
+   goto err;
+
+   fprintf(out, Supported feature: %s\n, argv[1]);
+   } else {
+   fprintf(out, Supported features:);
+   for (i = 0; i  3; i++) {
+   for (j = 0, m = 1; j  32; j++, m = 1) {
+   if (supp[i]  m)
+   fprintf(out,  %s,
+   e2p_feature2string(i, m));
+   }
+   }
+   fprintf(out, \n);
+   }
+
+   return;
+
+err:
+   com_err(argv[0], 0, Unknown feature: %s\n, argv[1]);
+}
+
 static int source_file(const char *cmd_file, int sci_idx)
 {
FILE*f;
Index: e2fsprogs-1.40.2/lib/ext2fs/ext2_fs.h
===
--- e2fsprogs-1.40.2.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-1.40.2/lib/ext2fs/ext2_fs.h
@@ -656,8 +656,7 @@ struct ext2_super_block {
 #define EXT2_FEATURE_RO_COMPAT_SUPP(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK| \
-EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE| \
-EXT2_FEATURE_RO_COMPAT_BTREE_DIR)
+EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE)
 
 /*
  * Default values for user and/or group using reserved blocks

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][28/28] e2fsprogs-lts-make_rpms.patch

2008-02-02 Thread Andreas Dilger

Allow make rpm to take some extra configure options from the build
environment without having to patch the code.

Build the tarball in a temporary directory instead of the e2fsprogs
source directory.

Signed-off-by: Michael MacDonald [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

diff -Naur e2fsprogs-1.40.2/contrib/build-rpm 
e2fsprogs-1.40.2.new/contrib/build-rpm
--- e2fsprogs-1.40.2/contrib/build-rpm  2007-06-30 08:58:34.0 -0400
+++ e2fsprogs-1.40.2.new/contrib/build-rpm  2007-12-21 12:49:43.0 
-0500
@@ -1,5 +1,10 @@
 #!/bin/sh
 
+# enable xtrace output if requested
+if [ -n ${ENABLE_XTRACE:-''} ]; then
+set -x
+fi
+
 # Build an e2fsprogs RPM from cvs
 
 pwd=`pwd`
@@ -8,8 +13,11 @@
 pkgvers=`grep Version: e2fsprogs.spec | awk '{print $2;}'`
 builddir=${pkgname}-${pkgvers}
 
+# ensure that $TMP is set to something
+TMP=${TMP:-'/tmp'}
+
 cd ..
-tmpdir=`mktemp -d rpmtmp.XX`
+tmpdir=`mktemp -d ${RPM_TMPDIR:-$TMP}/rpmtmp.XX`
 
 # We need to build a tarball for the SRPM using $builddir as the 
 # directory name (since that's what RPM will expect it to unpack
@@ -25,10 +33,13 @@
 (cd $tmpdir  tar czfh ${builddir}.tar.gz $EXCLUDE $builddir)
 
 [ `rpmbuild --version 2 /dev/null` ]  RPM=rpmbuild || RPM=rpm
-$RPM --define _sourcedir `pwd`/$tmpdir -ba $currdir/e2fsprogs.spec
-
-ret=$?
-rm -rf $tmpdir
-exit $?
 
+$RPM --define _sourcedir $tmpdir \
+ --define _topdir ${RPM_TOPDIR:-$(rpm -E %_topdir)} \
+ --define _tmpdir ${RPM_TMPDIR:-$TMP} \
+ --define extra_config_flags ${EXTRA_CONFIG_FLAGS:-''} \
+ -ba $currdir/e2fsprogs.spec
 
+rpm_exit=$?
+rm -rf $tmpdir
+exit $rpm_exit
diff -Naur e2fsprogs-1.40.2/e2fsprogs.spec.in 
e2fsprogs-1.40.2.new/e2fsprogs.spec.in
--- e2fsprogs-1.40.2/e2fsprogs.spec.in  2007-12-12 22:48:29.0 -0500
+++ e2fsprogs-1.40.2.new/e2fsprogs.spec.in  2007-12-21 12:39:11.0 
-0500
@@ -50,7 +50,8 @@
 %setup
 
 %build
-%configure --enable-elf-shlibs --enable-nls
+%configure --enable-elf-shlibs --enable-nls \
+   %{?extra_config_flags:%extra_config_flags}
 make
 
 %install

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][0/28] Lustre e2fsprogs patch series

2008-02-01 Thread Andreas Dilger
The following series of emails will contain the large part of the
e2fsprogs patch series that is used for Lustre.  It will not contain
the regression tests for EXTENTS nor the DIR_NLINK features, as those
are very large and were previously submitted.

A full tarball that includes the patches, series, and regression tests
will be uploaded to ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/

Patch series:

e2fsprogs-specdotin.patch
e2fsprogs-eacheck.patch 
e2fsprogs-extended_ops.patch
e2fsprogs-tests-f_unsorted_EAs.patch
e2fsprogs-tests-f_ea_checks.patch
e2fsprogs-nlinks.patch 
e2fsprogs-extents.patch
e2fsprogs-config-before-cmdline.patch
e2fsprogs-SLES10--m-support.patch
e2fsprogs-uninit.patch
e2fsprogs-nlinks-flag.patch 
e2fsprogs-expand-extra-isize.patch
e2fsprogs-tests-f_expisize.patch
e2fsprogs-tests-f_expisize_ea_del.patch
e2fsprogs-ibadness-counter.patch
e2fsprogs-tests-f_ibadness.patch
e2fsprogs-tests-f_ibadness_bad_extents.patch
e2fsprogs-tests-f_random_corruption.patch
e2fsprogs-stride_option.patch
e2fsprogs-mmp.patch
e2fsprogs-journal_chksum.patch
e2fsprogs-tests-f_jchksum_bblk.patch
e2fsprogs-tests-f_jchksum_blast_trans.patch
e2fsprogs-tests-f_jchksum_remount.patch
e2fsprogs-i_size-corruption.patch
e2fsprogs-fiemap.patch
e2fsprogs-debugfs-supported_features.patch
e2fsprogs-lts-make_rpms.patch

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] jbd: fix assertion failure in journal_next_log_block

2008-01-31 Thread Andreas Dilger
On Jan 31, 2008  11:14 -0500, Josef Bacik wrote:
[snip excellent analysis]
 So you get into this situation where
 t_nr_buffers (the actual number of buffers that are on the transaction) is
 greater than the number of buffers accounted for via t_outstanding_credits.
 This presents a problem since as we loop through writting buffers to the
 journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than
 t_outstanding_credits then we end up with a negative number for
 t_outstanding_credits
 
 Signed-off-by: Josef Bacik [EMAIL PROTECTED]

Do you know what kernel this problem was introduced in, or is this a
long standing problem?  Presumably the same is needed for jbd2?

Once we have some decent amount of testing going on with ext4, I think
it makes sense to merge the jbd2 changes back into jbd and return to
a single code base, since there is nothing in the jbd2 code that ext3
can't also work with (i.e. all of the changes are properly isolated
with compatibility flags and such).

 @@ -1056,7 +1056,7 @@ static inline int jbd_space_needed(journal_t *journal)
   int nblocks = journal-j_max_transaction_buffers;
   if (journal-j_committing_transaction)
   nblocks += journal-j_committing_transaction-
 - t_outstanding_credits;
 + t_nr_buffers;

(trivial) this can be moved back onto the previous line.

 @@ -1168,7 +1168,7 @@ static inline int jbd_space_needed(journal_t *journal)
   int nblocks = journal-j_max_transaction_buffers;
   if (journal-j_committing_transaction)
   nblocks += journal-j_committing_transaction-
 - t_outstanding_credits;
 + t_nr_buffers;

Same...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsck ext4 with extents possible?

2008-01-30 Thread Andreas Dilger
On Jan 30, 2008  11:54 +0100, supersud501 wrote:
 i'm testing the ext4-filesystem for months now (no data loss experienced ), 
 but i noticed e2fsck not being able to check a ext4-filesystem with extents 
 (e2fsck: Filesystem has unsupported feature(s)).

 i just pulled linus' kernel tree (with the newly added patches to ext4) and 
 e2fsprogs git tree (from git.kernel.org), but still e2fsprogs is not able 
 to check my ext4-fs.

 so i wonder whether there's somewhere a patch for e2fsprogs flying around 
 i've missed or whether checking ext4 with extents is simply not supported 
 yet? if so, how's development going on with this?

There is an e2fsprogs RPM that can check ext4 filesystems with extents,
available at ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/ (no
warranties, etc).  We've been using this for a while already, and are
working to get the patches incorporated into the upstream e2fsprogs
in the interim while Ted is reimplementing the extents support.

Note the above RPMs contain patches which are NOT related just to ext4
and should not become part of a distro.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: file magic for ext4

2008-01-30 Thread Andreas Dilger
On Jan 30, 2008  11:19 -0600, Eric Sandeen wrote:
 This seems to work fine; any comments, either about
 the logic, or the text descriptions, or the various
 options I've shown/not shown?
 
 To test, copy your magic file somewhere local, 
 replace the ext2/3 section with the following,
 and do something like:
 
 # file -C -m magic; file -m ./magic -s /dev/sda1
 
 -Eric
 
 # ext2/ext3 filesystems - Andreas Dilger [EMAIL PROTECTED]

Lol, you may as well fix up the email  [EMAIL PROTECTED] will
probably be the least transient.

 0x438   leshort 0xEF53  Linux
 0x44c  lelong  x   rev %d
 0x43e  leshort x   \b.%d
 # No journal?  ext2
 0x45c  lelong  ^0x004  ext2 filesystem data
 0x43a leshort ^0x001  (mounted or unclean)
 # Has a journal?  ext3 or ext4
 0x45c  lelong  0x004
 #  and small INCOMPAT?
 0x460 lelong  0x040
 #   and small RO_COMPAT?
 0x464 lelong 0x008  ext3 filesystem data
 #   else large RO_COMPAT?
 0x464 lelong 0x007  ext4 filesystem data
 #  else large INCOMPAT?
 0x460 lelong  0x03f  ext4 filesystem data
 # General flags for any ext* fs
 0x460  lelong  0x004  (needs journal recovery)
 0x43a  leshort 0x002  (errors)
 # INCOMPAT flags
 0x460  lelong  0x001  (compressed)
 #0x460 lelong  0x002  (filetype)
 #0x460 lelong  0x010  (meta bg)
 0x460  lelong  0x040  (extents)
 0x460  lelong  0x080  (64bit)
 #0x460 lelong  0x100  (mmp)
 #0x460 lelong  0x200  (flex bg)
 # RO_INCOMPAT flags
 #0x464 lelong  0x001  (sparse super)
 0x464  lelong  0x002  (large files)
 0x464  lelong  0x008  (huge files)
 #0x464 lelong  0x010  (gdt checksum)
 #0x464 lelong  0x020  (many subdirs)
 #0x463 lelong  0x040  (extra isize)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 9855] New: ext3 ACL corruption

2008-01-30 Thread Andreas Dilger
On Jan 30, 2008  14:49 -0800, Andrew Morton wrote:
  Problem Description:
  On several occasions now I have had e2fsck prune away ACLs on my file 
  systems
  during a file system check after rebooting a number of (reasonably) long
  running Samba servers. This morning I decided to manually run fsck before
  rebooting one of these:
  
  # e2fsck -pfv /dev/mapper/vg_main-lv_samba
  (entry-e_value_offs + entry-e_value_size: 116, offs: 120)
  /dev/mapper/vg_main-lv_samba: Extended attribute in inode 163841 has a value
  offset (56) which is invalid
  CLEARED.
  (entry-e_value_offs + entry-e_value_size: 116, offs: 120)
  /dev/mapper/vg_main-lv_samba: Extended attribute in inode 262146 has a value
  offset (56) which is invalid
  CLEARED.

While these error messages still exist in e2fsck, this code appears to
have been changed somewhat because these same error messages no longer
get printed in e2fsprogs 1.40.5.

  Inode size:   256 

This is a bit interesting, since it isn't very common to use large inodes.
I suspect this relates to the problem.

  These are production Samba servers making fairly extensive use of file and
  directory ACLs. Thus far, I've only noticed the corruptions when it came 
  time
  to upgrade to a new kernel and reboot (and the boot scripts then run fsck).
  Note that I've never noticed any issues at runtime because of this - only 
  when
  I later realised that ACLs had been removed from random files and/or
  directories.
  
  I think I will implement some scripts to unmount and run fsck nightly from
  cron, so I can at least detect the corruption a little earlier. If there is
  some more helpful debugging output I can provide, please let me know.

There is just such a script in the thread forced fsck (again?).  Since you
are using LVs for the filesystem.

If you are able to reproduce this, could you please dump the inode and EA
block before fixing the problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integrating patches in SLES10 e2fsprogs

2008-01-29 Thread Andreas Dilger
On Jan 28, 2008  10:54 -0600, Eric Sandeen wrote:
 Theodore Tso wrote:
  On Mon, Jan 28, 2008 at 04:26:53PM +0100, Matthias Koenig wrote:
  Patch6: e2fsprogs-mdraid.patch
 
  This apparently adds a new environment variable,
  BLKID_SKIP_CHECK_MDRAID, which forces blkid to not detect mdraid
  devices.  I'm not sure why.
  Workaround for people having stale RAID signature on their disk:
  https://bugzilla.novell.com/show_bug.cgi?id=100530
  
  Hmm... there's got to be a better way around this.
 
 Won't help existing block devices, but it'd be nice to have a common
 library which could be called @ mkfs time to wipe out all known
 signatures...
 
 mkfs.xfs tries to do this, but it'd be silly to duplicate in every mkfs.

Well, blkid already has a way to _detect_ a lot of filesystem signatures,
so it might be relatively easy to exploit the type_array[] entries to have
it zap out all of these blocks.  That said, the majority of them are in
the first 68kB of the filesystem (mdraid excluded) so it shouldn't be too
hard to zero them out.  Let's hope nobody ever uses 0x as magic.

mke2fs already tries to do this, though I notice:
- the zap_sector() call will skip the entire write if there is a BSD bootblock,
  instead of skipping only the first sector(s) and overwriting the rest...
  Since I don't know much about BSD bootblocks, I don't know what the right
  behaviour is, but I can guess we still want to zero out 4-68kB (or whatever).
- it only overwrites up to sector 8 (4kB) and not further into the disk to
  catch e.g. reiserfs superblocks.  Usually it will overwrite this anyways
  (GDT, bitmaps, inode table), but in some rare cases it might not.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
 Have y'all been following the /dev/mem_notify patches?
 http://article.gmane.org/gmane.linux.kernel/628653

Having the notification be via poll() is a very restrictive processing
model.  Having the notification be via a signal means that any kind of
process (and not just those that are event loop driven) can register
a callback at some arbitrary point in the code and be notified.  I
don't object to the poll() interface, but it would be good to have a
signal mechanism also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patch queue update

2008-01-24 Thread Andreas Dilger
On Jan 24, 2008  20:20 +0530, Aneesh Kumar K.V wrote:
 @@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are 
 accepted:
  extents  ext4 will use extents to address file data.  The
   file system will no longer be mountable by ext3.
  
 +noextentsext4 will not use extents for new files created.
 +

s/new files created/newly created files/

  journal_checksum Enable checksumming of the journal transactions.
   This will allow the recovery code in e2fsck and the
   kernel to detect corruption in the kernel.  It is a
 @@ -206,6 +208,10 @@ nobh (a) cache disk block mapping 
 information
   nobh option tries to avoid associating buffer
   heads (supported only for writeback mode).
  
 +mballoc  (*) Use the mutliblock allocator for block 
 allocation
 +nomballocdisabled multiblock allocator for block allocation.
 +stripe=n filesystem blocks per stripe for a RAID configuration.

Please provide a more verbose description of what a stripe is, since the
RAID terminology is sadly vague.  Something like number of filesystem blocks 
that mballoc will try to use for allocation size and alignment.  For RAID5/6
systems this should be the number of data disks * number of filesystem blocks
per data disk.


 @@ -3948,9 +3942,8 @@ repeat:
   spin_unlock(pa-pa_lock);
   spin_unlock(ei-i_prealloc_lock);
   printk(KERN_ERR uh-oh! used pa while discarding\n);
 - dump_stack();
 - current-state = TASK_UNINTERRUPTIBLE;
 - schedule_timeout(HZ);
 + WARN_ON(1);

This printk and dump stack can just go away, we have removed it from our
mballoc patch as well because it was only needed for determining how often
this condition is hit and is otherwise useless.

 @@ -577,14 +529,12 @@ err_out:
*
* FIXME!! we may be touching bitmaps in different block groups.
*/
 -
   if (ext4_journal_extend(handle,
   4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb)) != 0) {
  
   ext4_journal_restart(handle,
   4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb));
   }
 -

There don't actually need to be braces here either.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Integrating patches in SLES10 e2fsprogs

2008-01-24 Thread Andreas Dilger
I was looking through the SLES10 e2fsprogs patch set, and I wonder if some
of them could be integrated upstream, and if any effort had been made in
that direction in the past?  In particular, the addition of et_list_lock()
and et_list_unlock() to libcom_err cause failures if e2fsprogs is updated
to a non-SLES10 derived RPM.


A list of patches and (my) descriptions are below:
libcom_err-no-static-buffer.patch - avoids static buffer returned to caller
by error_message() function
libcom_err-no-init_error_table.patch - removes init_error_table() function
   (maybe because it isn't thread safe?),
   but I think this could be made thread
   safe by adding locking around use of
   _et_dynamic_list, or maybe it is
   obsoleted by add_error_table()?
libcom_err-no-e2fsck.static.patch - can't build e2fsck.static because of
-lpthread in libcom_err-mutex.patch, but
nothing uses e2fsck.static anymore?
libcom_err-mutex.patch - add et_list_{un,}lock() via pthread mutex


e2fsprogs-blkid.diff - Adds documentation of BLKID_FILE environment variable.
   This is actually implemented directly in libblkid in
   e2fsprogs-1.40.2 but no mention of it in the man pages.
e2fsprogs-mdraid.patch - allows skip of mdraid probing, not sure why?
e2fsprogs-probe_reiserfs-fpe.patch - fixes a legitimate bug in probe_reiserfs,
 though it might be better to just return
 an error if the blocksize is bad?


In addition to this, the SLES10 .spec file is completely different than that
shipped with upstream e2fsprogs, and I'd like to reconcile that if possible.
In particular it has libcom_err and libss in a separate RPM (libcom_err).
I understand that FC8 (not sure about RHEl5) has also split out some of the
libraries, but in a different way (e2fsprogs-libs) and that is a bit of a
headache.  It might be possible to reconcile with suitable rpm-fu, but it
would be desirable that SLES pick up these changes in the future...

I don't want to spam the list with all of the patches yet, but if there is
interest in merging these upstream then I can provide versions of these
patches against the current e2fsprogs instead of 1.38 that is in SLES10.

Eric, I haven't looked at the FC8/9 e2fsprogs yet, but do they also have
a ton of patches (possibly in the -pu branch), or do they track upstream
more closely?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Andreas Dilger
On Jan 24, 2008  18:32 +0100, Bodo Eggert wrote:
 I think a single, system-wide signal is the second-to worst solution: All
 applications (or the wrong one, if you select one) would free their caches
 and start to crawl, and either stay in this state or slowly increase their
 caches again until they get signaled again. And the signal would either
 come too early or too late. The userspace daemon could collect the weighted
 demand of memory from all applications and tell them how much to use.

Well, sending a few signals (maybe to the top 5 processes in the OOM killer
list) is still a LOT better than OOM-killing them without warning...  That
way important system processes could be taught to understand SIGDANGER and
maybe do something about it instead of being killed, and if Firefox and
other memory hungry processes flush some of their cache it is not fatal.

I wouldn't think that SIGDANGER means free all of your cache, since the
memory usage clearly wasn't a problem a few seconds previously, so as
an application writer I'd code it as flush the oldest 10% of my cache
or similar, and the kernel could send SIGDANGER again (or kill the real
offender) if the memory usage again becomes an issue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH, RFC] Add new development flag to the ext4 filesystem

2008-01-23 Thread Andreas Dilger
On Jan 23, 2008  11:53 -0500, Theodore Tso wrote:
 Since I'm still hoping that
 some point in the future, fs/ext4 could subsume fs/ext3 so we don't
 have to worry about bug fixes going into fs/ext2 AND fs/ext3 AND
 fs/ext4, I have my own reasons for wanting that.

If any newbie kernel hacker wants a filesystem project, allowing ext4
to mount ext2 filesystems w/o a journal would be very useful.  I
suspect that a simple flag check in the ext4_journal_* wrappers of the
jbd2 functions would be enough in many cases.

One of the reasons to keep ext2 around is that ext3 cannot mount the
filesystem without a journal, and removing that limitation for ext4
would bring us one step closer to removing a ton of duplicate code.
Another reason for ext2 vs. ext3 was overhead from journaling, and
that could also be removed by allowing ext4 to mount ext2 filesystems
w/o a journal.

Maybe a good proposal for a Google Summer-of-Code project.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

2008-01-23 Thread Andreas Dilger
On Jan 23, 2008  14:07 -0800, Andrew Morton wrote:
  +#define mb_correct_addr_and_bit(bit, addr) \
  +{  \
  +   bit += ((unsigned long) addr  3UL)  3;   \
  +   addr = (void *) ((unsigned long) addr  ~3UL);  \
  +}
 
 Why do these exist?

They seem to be a holdover from when mballoc stored the buddy bitmaps
on disk.  That no longer happens (to avoid bitmap vs. buddy consistency
problems), so I suspect they can be removed.

I can't comment on many of the other issues because Alex wrote most
of the code.

 Gosh what a lot of code.  Is it faster?

Yes, and also importantly it uses a lot less CPU to do a given amount
of allocation, which is critical in our environments where there is
very high disk bandwidth on a single node and CPU becomes the limiting
factor of the IO speed.  This of course also helps any write-intensive
environment where the CPU is doing something useful.

Some older test results include:
https://ols2006.108.redhat.com/2007/Reprints/mathur-Reprint.pdf (Section 7)

In the linux-ext4 thread compilebench numbers for ext4:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg03834.html

http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png

note the ext-read-compare.png graph shows lower read performance, but
a couple of bugs in mballoc were since fixed to have ext4 allocate more
contiguous extents.

In the old linux-ext4 thread [RFC] delayed allocation testing on node zefir
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg00587.html

: dd2048rw 
: REAL   UTIME  STIME  READWRITTEN DETAILS
EXT3: 58.46  23 1491   25722097292 17 extents
EXT4: 44.56  19 1018   12  2097244 19 extents
REISERFS: 56.80  26 1370   29522097336 457 extents
JFS : 45.77  22 9840   2097216 1 extents
XFS : 50.97  20 1394   0   2100825 7 extents

: kernuntar
: REAL   UTIME  STIME  READWRITTEN DETAILS
EXT3: 56.99  5037   65168  252016  
EXT4: 55.03  5034   55336  249884  
REISERFS: 52.55  4996   85464  238068  
JFS : 70.15  5057   630496 288116  
XFS : 72.84  5052   953132 316798  

: kernstat 
: REAL   UTIME  STIME  READWRITTEN DETAILS
EXT3: 2.83   8  15 58920   
EXT4: 0.51   9  10 58920   
REISERFS: 0.81   7  49 26960   
JFS : 6.19   11 49 12552   0   
XFS : 2.09   9  61 65040   

: kerncat  
: REAL   UTIME  STIME  READWRITTEN DETAILS
EXT3: 9.48   25 213241624  0   
EXT4: 6.29   27 197238560  0   
REISERFS: 14.69  33 230234744  0   
JFS : 23.51  23 231244596  0   
XFS : 18.24  36 254238548  0   

: kernrm   
: REAL   UTIME  STIME  READWRITTEN DETAILS
EXT3: 4.82   4  10896284672
EXT4: 1.61   5  11065364632
REISERFS: 3.15   8  2762768236 
JFS : 33.90  7  16814400   33048   
XFS : 20.03  8  296663286160   


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 21, 2008  23:17 -0500, [EMAIL PROTECTED] wrote:
 On Tue, 22 Jan 2008 14:38:30 +1100, David Chinner said:
  Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
  to a processes that aren't masking the signal followed by a short
  grace period to allow the processes to free up some memory before
  swapping out pages from that process?
 
 AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
 the system was about to hit OOM, not when it was about to start swapping.

I'd tried to advocate SIGDANGER some years ago as well, but none of
the kernel maintainers were interested.  It definitely makes sense
to have some sort of mechanism like this.  At the time I first brought
it up it was in conjunction with Netscape using too much cache on some
system, but it would be just as useful for all kinds of other memory-
hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 22, 2008  14:38 +1100, David Chinner wrote:
 On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
  I discussed this with Ted at one point also.  This is a generic problem,
  not just for readahead, because fsck can run multiple e2fsck in parallel
  and in case of many large filesystems on a single node this can cause
  memory usage problems also.
  
  What I was proposing is that fsck.{fstype} be modified to return an
  estimated minimum amount of memory needed, and some desired amount of
  memory (i.e. readahead) to fsck the filesystem, using some parameter like
  fsck.{fstype} --report-memory-needed /dev/XXX.  If this does not
  return the output in the expected format, or returns an error then fsck
  will assume some amount of memory based on the device size and continue
  as it does today.
 
 And while fsck is running, some other program runs that uses
 memory and blows your carefully calculated paramters to smithereens?

Well, fsck has a rather restricted working environment, because it is
run before most other processes start (i.e. single-user mode).  For fsck
initiated by an admin in other runlevels the admin would need to specify
the upper limit of memory usage.  My proposal was only for the single-user
fsck at boot time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-17 Thread Andreas Dilger
On Jan 15, 2008  23:25 -0500, [EMAIL PROTECTED] wrote:
 I've got multiple boxes across the hall that have 50T of disk on them, in one
 case as one large filesystem, and the users want *more* *bigger* still (damned
 researchers - you put a 15 teraflop supercomputer in the room, and then they
 want someplace to *put* all the numbers that come spewing out of there.. ;)
 
 There comes a point where that downtime gets too long to be politically
 expedient.  6-2 may not be a biggie, because you can likely get a 6 hour
 window.  24-8 suddenly looks a lot different.
 
 (Having said that, I'll admit the one 52T filesystem is an SGI Itanium box
 running Suse and using XFS rather than ext3).
 
 Has anybody done a back-of-envelope of what this would do for fsck times for
 a max realistically achievable ext3 filesystem (i.e. 100T-200T or ext3
 design limit, whichever is smaller)?

This is exactly the kind of environment that Lustre is designed for.
Not only do you get parallel IO performance, but you can also do parallel
e2fsck on the individual filesystems when you need it.  Not that we aren't
also working on improving e2fsck performance, but 100 * 4TB e2fsck in
parallel is much better than 1 * 400TB e2fsck (probably not possible on
a system today due to RAM constraints though I haven't really done any
calculations either way).  I know customers were having RAM problems with
3 * 2TB e2fsck in parallel on a 2GB node.

Most customers these days use 2-4 4-8TB filesystems per server
(inexpensive SMP node with 2-4GB RAM instead of a single monstrous SGI box).

We have many Lustre filesystems in the 100-200TB range today, some over 1PB
already and much larger ones being planned.

 (And one of the research crew had a not-totally-on-crack proposal to get a
 petabyte of spinning oxide.  Figuring out how to back that up would probably
 have landed firmly in my lap.  Ouch. ;)

Charge them for 2PB of storage, and use rsync :-).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: New bitmap and inode table allocation for FLEX_BG

2008-01-11 Thread Andreas Dilger
On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
 +blk_t ext2fs_flexbg_offset(ext2_filsys fs, dgrp_t group, int flexbg_size, 
 +ext2fs_block_bitmap bmap, int offset, int size)

Could you please add some comments for what this function is trying to do?

 + last_grp = (group + (flexbg_size - (group % flexbg_size) - 1));

Is this the same as:

last_grp = group + (flexbg_size - 1) / flexbg_size * flexbg_size

(i.e. trying to round up to the next even multiple of flexbg_size)?

Didn't we decide to have flexbg_size be a power-of-two value, so we could
use shift and mask instead of divide and mod?  It's less of an issue because
group is only a 32-bit value, I guess.

 + if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap, 
 +first_free))
 + return first_free;
 + 
 + if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size, 
 +bmap, first_free))
 + return first_free;
 +
 + return first_free;
 +}
 +
  errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
 ext2fs_block_bitmap bmap)
  {
   errcode_t   retval;
   blk_t   group_blk, start_blk, last_blk, new_blk, blk;
 - int j;
 + dgrp_t  last_grp;
 + int j, rem_grps, flexbg_size = 0;
  
   group_blk = ext2fs_group_first_block(fs, group);
   last_blk = ext2fs_group_last_block(fs, group);
  
   if (!bmap)
   bmap = fs-block_map;
 +
 + if (EXT2_HAS_INCOMPAT_FEATURE (fs-super,
 +EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
 + flexbg_size = 1  fs-super-s_log_groups_per_flex;
 + rem_grps = flexbg_size - (group % flexbg_size);

Hmm, no point in doing groups % flexbg_size if we have
s_log_groups_per_flex.  Could do groups  (flexbg_size - 1) instead.

 @@ -101,7 +102,11 @@ int ext2fs_super_and_bgd_loc(ext2_filsys fs,
 + if (flex_bg_size) {
 + if ((group % flex_bg_size) == 0)
 + numblocks -= 2 + fs-inode_blocks_per_group;

Ditto.

 @@ -1045,6 +1046,20 @@ static void PRS(int argc, char *argv[])
   exit(1);
   }
   break;
 + case 'G':
 + flex_bg_size = strtoul(optarg, tmp, 0);
 + if (*tmp) {
 + com_err(program_name, 0,
 + _(Illegal number for Flex_BG size));
 + exit(1);
 + }
 + if (flex_bg_size  2 || 
 + (flex_bg_size  (flex_bg_size-1)) != 0) {
 + com_err(program_name, 0,
 + _(Flex_BG size must be a power of 2));
 + exit(1);
 + }
 + break;

We've been putting new options under -E var=value...  I don't know what
Ted's thoughs are on using new option letters, though this one might qualify.

 @@ -1444,6 +1459,16 @@ static void PRS(int argc, char *argv[])
   }
   }
  
 + if(flex_bg_size) {

Space after if .

 + shift = 0;
 + tmp = flex_bg_size;
 + while ((tmp = 1UL) != 0UL)
 + shift++;

There isn't a log2 function?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] New inode allocation for FLEX_BG meta-data groups.

2008-01-11 Thread Andreas Dilger
On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
 @@ -127,6 +127,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, 
 struct buffer_head *bh,
   mark_bitmap_end(group_blocks, sb-s_blocksize * 8, bh-b_data);
   }
  
 + if (sbi-s_log_groups_per_flex)
 + return free_blocks;
   return free_blocks - sbi-s_itb_per_group - 2;

To be honest, I think this is a wart in ext4_init_block_bitmap() that
it returns the number of free blocks in the group.  That value should
really be set at mke2fs or e2fsck time, and if the last group is marked
BLOCK_UNINIT it gets the free blocks count wrong because it always starts
with EXT4_BLOCKS_PER_GROUP().

The above patch may also be incorrect since there may be inode tables or
bitmaps in the above group even in the case of FLEX_BG filesystems.

 +#define free_block_ratio 10
 +
 +static int find_group_flex(struct super_block *sb, struct inode *parent, 
 ext4_group_t *best_group)
 +{
 + n_fbg_groups = (sbi-s_groups_count + flex_size - 1) / flex_size;

Can be a shift?

I would suggest doing some kind of testing to see how well this allocation
policy is working.  We don't want to force all allocations contiguously at
the start of the filesystem, or we end up with FAT...

 +static int ext4_fill_flex_info(struct super_block *sb)
 +{
 + sbi-s_log_groups_per_flex = sbi-s_es-s_log_groups_per_flex;

Hmm, I guess no le*_to_cpu() because this is 8 bits?

 +
 + flex_group_count = (sbi-s_groups_count + groups_per_flex - 1) /
 + groups_per_flex;
 + sbi-s_flex_groups = kmalloc(flex_group_count *
 +  sizeof(struct flex_groups), GFP_KERNEL);
 + if (sbi-s_flex_groups == NULL) {
 + printk(KERN_ERR EXT4-fs: not enough memory\n);

This should report not enough memory for N flex groups or something.

 @@ -2105,6 +2154,13 @@ static int ext4_fill_super (struct super_block *sb,
 + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
 + if (!ext4_fill_flex_info(sb)) {
 + printk(KERN_ERR
 +EXT4-fs: unable to initialize flex_bg meta 
 info!\n);
 + goto failed_mount2;

Should this be considered a fatal error, or could sbi-s_log_groups_per_flex
just be set to 0 and the filesystem be used as-is (maybe with sub-optimal
allocations or something)?  Otherwise this renders the filesystem unusable.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.

2008-01-10 Thread Andreas Dilger
On Jan 10, 2008  09:58 +0530, Aneesh Kumar K.V wrote:
  d) if s_stripe is still  s_blocks_per_group try s_raid_stride
  e) if s_stripe is still  s_blocks_per_group use 0
 
 But i guess mke2fs and tune2fs should validate the value of
 s_raid_stripe_width and s_raid_stride. Both of them should be less that
 blocks per group. Should I add extra check in the kernel for them ?

It's true that mke2fs and tune2fs should validate this, but it is also
possible to become corrupted, and e2fsck doesn't fix it yet nor can it
make a good estimate of the right value.

   + if (!sbi-s_stripe ||
   + sbi-s_stripe = sbi-s_blocks_per_group) {
  
 
 So what do you think should it be  or =. Looking at the mballoc I
 guess it should work with stripe size equal to blocks per group. I am
 not sure how efficient the allocation would be though.

I think if s_stripe == s_blocks_per_group that is fine...  For 1kB block
filesystem that is only 8MB in size.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] set s_raid_stride and s_raid_stripe_width

2008-01-09 Thread Andreas Dilger
This is a resend of a patch for tune2fs and mke2fs to allow setting
the s_raid_stride and s_raid_stripe_width fields in the superblock.
These aren't used by the kernel yet, but it is desirable to have mballoc
use the s_raid_stripe_width value to align and size allocations on RAID
boundaries to avoid read-modify-write on RAID 5/6 systems.

This patch has improved error messages for parsing extended options, and
prints a warning if the stripe-width is not a multiple of the stride.

It also adds a whole bunch of missing fields to debugfs set_super_value,
and adds parsing of 64-bit values and checking of overflow therein.

What was very bizarre is that having _XOPEN_SOURCE 500 set would confuse the
use of strtoull() to mis-parse values between 33 and 64 bits on my system.
Using __USE_ISOC9X directly isn't useful because it is unset in features.h.

The use of _XOPEN_SOURCE 600 to define __USE_ISOC9X works at least as far
back as FC3 with glibc 2.3.6 (2005), so should be reasonably portable.
Similarly, per the strtoull() man page (and testing) if errno isn't checked
there is no detection of 64-bit overflow of the input value.

Signed-off-by: Rupesh Thakare  [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.4/lib/ext2fs/initialize.c
===
--- e2fsprogs-1.40.4.orig/lib/ext2fs/initialize.c
+++ e2fsprogs-1.40.4/lib/ext2fs/initialize.c
@@ -156,6 +156,8 @@ errcode_t ext2fs_initialize(const char *
set_field(s_feature_incompat, 0);
set_field(s_feature_ro_compat, 0);
set_field(s_first_meta_bg, 0);
+   set_field(s_raid_stride, 0);/* default stride size: 0 */
+   set_field(s_raid_stripe_width, 0);  /* default stripe width: 0 */
if (super-s_feature_incompat  ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
retval = EXT2_ET_UNSUPP_FEATURE;
goto cleanup;
Index: e2fsprogs-1.40.4/misc/mke2fs.c
===
--- e2fsprogs-1.40.4.orig/misc/mke2fs.c
+++ e2fsprogs-1.40.4/misc/mke2fs.c
@@ -773,7 +773,7 @@ static int set_os(struct ext2_super_bloc
 static void parse_extended_opts(struct ext2_super_block *param, 
const char *opts)
 {
-   char*buf, *token, *next, *p, *arg;
+   char*buf, *token, *next, *p, *arg, *badopt = ;
int len;
int r_usage = 0;
 
@@ -800,16 +800,32 @@ static void parse_extended_opts(struct e
if (strcmp(token, stride) == 0) {
if (!arg) {
r_usage++;
+   badopt = token;
continue;
}
-   fs_stride = strtoul(arg, p, 0);
-   if (*p || (fs_stride == 0)) {
+   param-s_raid_stride = strtoul(arg, p, 0);
+   if (*p || (param-s_raid_stride == 0)) {
fprintf(stderr,
_(Invalid stride parameter: %s\n),
arg);
r_usage++;
continue;
}
+   } else if (strcmp(token, stripe-width) == 0 ||
+  strcmp(token, stripe_width) == 0) {
+   if (!arg) {
+   r_usage++;
+   badopt = token;
+   continue;
+   }
+   param-s_raid_stripe_width = strtoul(arg, p, 0);
+   if (*p || (param-s_raid_stripe_width == 0)) {
+   fprintf(stderr,
+   _(Invalid stripe-width parameter: 
%s\n),
+   arg);
+   r_usage++;
+   continue;
+   }
} else if (!strcmp(token, resize)) {
unsigned long resize, bpg, rsv_groups;
unsigned long group_desc_count, desc_blocks;
@@ -818,6 +834,7 @@ static void parse_extended_opts(struct e
 
if (!arg) {
r_usage++;
+   badopt = token;
continue;
}
 
@@ -866,22 +883,32 @@ static void parse_extended_opts(struct e
 
param-s_reserved_gdt_blocks = rsv_gdb;
}
-   } else
+   } else {
r_usage++;
+   badopt = token;
+   }
}
if (r_usage) {
-   fprintf(stderr, _(\nBad options specified.\n\n
+   fprintf(stderr, _(\nBad option(s) specified: %s\n\n
Extended options

Re: [PATCH] set s_raid_stride and s_raid_stripe_width

2008-01-09 Thread Andreas Dilger
On Jan 09, 2008  21:33 +0530, Aneesh Kumar K.V wrote:
  +   if (param-s_raid_stride 
  +   (param-s_raid_stripe_width % param-s_raid_stride) != 0)
  +   fprintf(stderr, _(\nWarning: RAID stripe-width %u not an even 
  + multiple of stride %u.\n\n),
  +   param-s_raid_stripe_width, param-s_raid_stride);
 
 
 We also want to validate that s_raid_stripe_width is not =
 blocks_per_group.

And probably that s_raid_stripe_width = s_raid_stride.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.

2008-01-09 Thread Andreas Dilger
On Jan 09, 2008  22:37 +0530, Aneesh Kumar K.V wrote:
 The stipe size used during block allocation is calculated as below.
 a) if we specify a stripe=value option using mount time. Use that value.
 b) if not use the value specified in super block.
 b) If the value specfied at mount time is greater than blocks per group use
 the value specified ini the super block.

What if the value on disk is also bad?  I'd add (after the second 'b' :-):

d) if s_stripe is still  s_blocks_per_group try s_raid_stride
e) if s_stripe is still  s_blocks_per_group use 0

 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
 ---
  fs/ext4/super.c |   10 ++
  1 files changed, 10 insertions(+), 0 deletions(-)
 
 diff --git a/fs/ext4/super.c b/fs/ext4/super.c
 index db1edc8..10330eb 100644
 --- a/fs/ext4/super.c
 +++ b/fs/ext4/super.c
 @@ -2136,6 +2136,16 @@ static int ext4_fill_super (struct super_block *sb, 
 void *data, int silent)
   sbi-s_rsv_window_head.rsv_alloc_hit = 0;
   sbi-s_rsv_window_head.rsv_goal_size = 0;
   ext4_rsv_window_add(sb, sbi-s_rsv_window_head);
 + /*
 +  * set the stripe size. If we have specified it via mount option, then
 +  * use the mount option value. If the value specified at mount time is
 +  * greater than the blocks per group use the super block value.
 +  * Allocator needs it be less than blocks per group.
 +  */
 + if (!sbi-s_stripe ||
 + sbi-s_stripe = sbi-s_blocks_per_group) {

Please fix alignment to be just after '(' on previous line.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: play with 8TB to 16TB fs's better

2008-01-09 Thread Andreas Dilger
On Jan 09, 2008  16:04 -0500, Josef Bacik wrote:
 +static int valid_size(unsigned long long *size64, int blocksize)
 +{
 + /* see if we are above 16tb */
 + if ((*size64 / blocksize)  0x) {
 + /* if we are just at 16tb adjust the size slightly */

I'd change these comments to read 2^32 blocks instead of 16TB because
this is a different size with 1kB or 16kB blocks...

 +static int check_for_wrap(const char *file, int blocksize)
 +{
 + int fd, tmp, total = 0;
 + char buffer[blocksize];

I don't think this is ANSI-C to declare the stack variable size based
on a function parameter.  This instead should malloc() a buffer instead.
It probably makes sense to make it a unique non-zero pattern, just to
catch the case where the block is not read from disk, or is read from some
other spot on disk that IS zero.

 +#ifdef HAVE_OPEN64
 + fd = open64(file, O_RDWR);
 +#else
 + fd = open(file, O_RDWR);
 +#endif

It would be desirable to use the ext2fs_open() fs handle, and
io_channel_write_block() to do the write at block 2^31+1 to verify the libext2fs
code.  It appears the io_channel_{read,write}_block() code would at least work
with 64-bit offsets on 64-bit architectures, because it takes an unsigned long
as the block number...

The other issue is that none of the unix IO calls will be portable to the
other IO managers, and will circumvent the undo IO manager, so it is better
to use the normal io_channel_{read,write}_block().

 + if (ext2fs_sync_device(fd, 1)) {
 + fprintf(stderr, Error flushing cache to disk %s\n, file);
 + close(fd);
 + exit(1);
 + }

I'm not sure we need a sync+flush before the second write?

 + memset(buffer, 0xa, blocksize);

It would be better to set this to some more unique pattern (e.g. current
time in first 8 bytes) to ensure there isn't some hold-over data from a
previous run or something.

 + memset(buffer, 0xa, blocksize);

I don't think we need this second memset, since the buffer should still be
the existing non-zero data.

 + for (tmp = 0; tmp  blocksize; tmp++) {
 + if (buffer[tmp] != 0x0) {
 + close(fd);
 + return -1;
 + }

Would probably be easier to just allocate a second buffer, copy it from
the original buffer at the beginning of the test, and then memcmp() it
against the block read from disk.
 + com_err(program_name, retval, Write wrapped, 
 + filesystem is too large for the disk 
 + to handle\n);
 +
 + fprintf(stderr, \nWarning: older 2.6 kernels (2.6.18 and 
 + older) may have problems with such a \n\tlarge 
 + filesystem.  If you have problems try a newer 
 + kernel\n);

Why are some messages com_err() and others fprintf()?  The content of the
message is good though.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: play with 8TB to 16TB fs's better

2008-01-08 Thread Andreas Dilger
On Jan 08, 2008  14:33 -0500, Josef Bacik wrote:
 @@ -190,8 +190,13 @@ errcode_t ext2fs_get_device_size(const c
   ioctl(fd, BLKGETSIZE64, size64) = 0) {
   if ((sizeof(*retblocks)  sizeof(unsigned long long)) 
   ((size64 / blocksize)  0x)) {
 - rc = EFBIG;
 - goto out;
 + /* 16tb fs is fine, just adjust slightly */
 + if ((size64 / blocksize) == 0x1) {
 + size64--;
 + } else {
 + rc = EFBIG;
 + goto out;
 + }

It might be cleaner to localize this check/fixup into a small helper function?

 +++ e2fsprogs/misc/mke2fs.c
 @@ -1455,13 +1455,6 @@ static void PRS(int argc, char *argv[])
 - if (!force  fs_param.s_blocks_count = ((unsigned) 1  31)) {
 - com_err(program_name, 0,
 - _(Filesystem too large.  No more than 2**31-1 blocks\n
 -   \t (8TB using a blocksize of 4k) are currently 
 supported.));
 - exit(1);
 - }
 -
   if ((blocksize  4096) 
   (fs_param.s_feature_compat  EXT3_FEATURE_COMPAT_HAS_JOURNAL))
   fprintf(stderr, _(\nWarning: some 2.4 kernels do not support 

It is also worthwhile to report at least a warning for filesystems larger
than 0x7fff blocks that older kernels (2.6.18 and older, IIRC) don't
necessarily work correctly with such large filesystems.

Doing something like having mke2fs zero out block 1, flush it from cache
with ioctl(BLKFLSBUF), then write some data at 8TB+1 to verify it doesn't
clobber block 1 might also be prudent.  I've seen some RAID arrays do this
in the past, and when we pass 0x blocks we should do the same so
it may as well be a simple helper function.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] UPDATED3: types fixup for mballoc

2008-01-08 Thread Andreas Dilger
On Jan 08, 2008  13:54 -0600, Eric Sandeen wrote:
 Note, the calculations Andreas  I were discussing only work properly
 for stripe = blocks per group... I don't know if we'd need to enforce
 that at mount time?

I think that would be prudent, but can be done in a separate patch.
If the RAID stripe width is so large that one has to do read-modify-write
for a whole group write (8MB @ 1kB blocksize, 128MB @ 4kB blocksize)
then I don't think we can use that to align allocations.

I think Aneesh might be working on getting s_raid_stripe_width from the
superblock, and we may as well do the sanity checking in the same patch.
If sb-s_raid_stripe_width  EXT4_BLOCKS_PER_GROUP, then as a fallback
I'd suggest using sb-s_raid_stride if  EXT4_BLOCKS_PER_GROUP, or just
ignoring both if unsuitable.

 I ran into a potential overflow in ext4_mb_scan_aligned, 
 and went looking for others in mballoc.  This patch hits a 
 few spots, compile-tested only at this point, comments welcome.
 
 This patch:
 
 changes fe_len to an int, I don't think we need it to be a long,
 looking at how it's used (should it be a grpblk_t?)  Also change
 anything assigned to return value of mb_find_extent, since it returns
 fe_len.
 
 changes anything that does groupno * EXT4_BLOCKS_PER_GROUP
 or pa-pa_pstart + whatever to an ext4_fsblk_t
 
 avoids 64-bit divides  modulos, and...
 
 fixes up any related formats
 
 Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

You can add a Signed-off-by: Andreas Dilger [EMAIL PROTECTED] too.
The revised calcs look good to me.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9692] New: journal_data mount option causes filesystem

2008-01-07 Thread Andreas Dilger
On Jan 06, 2008  19:30 -0600, Jayson King wrote:
 This looks to be an off-by-one bug with e2fsck in the function
 check_blocks(), and there isn't any actual filesystem corruption
 (e2fsck causes the corruption).

This is actually a problem for cases where blocksize != pagesize.
We have a similar patch in our e2fsprogs, and I thought we sent
an equivalent patch to Ted previously...

-   (pb.last_block / blkpg * blkpg != pb.last_block ||
+   ((pb.last_block+1) / blkpg * blkpg != (pb.last_block+1) ||


Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

 From 654f24814e7b80d3b16bec2a67c13c43cb20eb2f Mon Sep 17 00:00:00 2001
 From: Jayson R. King [EMAIL PROTECTED]
 Date: Sun, 6 Jan 2008 18:14:18 -0600
 Subject: e2fsck: Fix off-by-one error in check_blocks()
 
 e2fsck allows extra blocks to be allocated to an inode up to the next
 multiple of page size iff the block size is not equal to page size. An
 off-by-one error in checking for this causes e2fsck to wrongly detect
 a bad i_size for such inodes and results in incorrectly adjusting the
 i_size to include those blocks.
 
 Signed-off-by: Jayson R. King [EMAIL PROTECTED]
 ---
  e2fsck/pass1.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
 index 56218ae..7bf0686 100644
 --- a/e2fsck/pass1.c
 +++ b/e2fsck/pass1.c
 @@ -1593,7 +1593,7 @@ static void check_blocks(e2fsck_t ctx, struct 
 problem_context *pctx,
   if ((pb.last_block = 0) 
   /* allow allocated blocks to end of PAGE_SIZE */
   (size  (__u64)pb.last_block * fs-blocksize) 
 - (pb.last_block / blkpg * blkpg != pb.last_block ||
 + ((pb.last_block+1)  (blkpg-1) != 0 ||
size  (__u64)(pb.last_block  ~(blkpg-1)) *fs-blocksize))
   bad_size = 3;
   else if (size  ext2_max_sizes[fs-super-s_log_block_size])
 -- 
 1.5.3.3
 
 


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-04 Thread Andreas Dilger
On Jan 04, 2008  14:41 +0100, Richard Knutsson wrote:
  @@ -54,6 +54,6 @@ int __ext4_journal_dirty_metadata(const char *where,
  {
  int err = jbd2_journal_dirty_metadata(handle, bh);
  if (err)
 -ext4_journal_abort_handle(where, __FUNCTION__, bh, handle,err);
 +ext4_journal_abort_handle(where, __FUNCTION__, bh, handle, err);
  return err;
  }

 What about changing the __FUNCTION__ to __func__, while you are at it?

What's wrong with __FUNCTION__?  I thought that was ANSI C?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] UPDATED: types fixup for mballoc

2008-01-03 Thread Andreas Dilger
On Jan 02, 2008  14:01 -0600, Eric Sandeen wrote:
 This patch:
 
 changes fe_len to an int, I don't think we need it to be a long,
 looking at how it's used (should it be a grpblk_t?)  Also change
 anything assigned to return value of mb_find_extent, since it returns
 fe_len.
 
 changes anything that does groupno * EXT4_BLOCKS_PER_GROUP
 or pa-pa_pstart + whatever to an ext4_fsblk_t
 
 fixes up any related formats
 
 The change to ext4_mb_scan_aligned to avoid a 64-bit modulo
 could use an extra look I think.

 @@ -1732,17 +1735,18 @@ static void ext4_mb_scan_aligned(struct 
   BUG_ON(sbi-s_stripe == 0);
  
 - /* find first stripe-aligned block */
 - i = e4b-bd_group * EXT4_BLOCKS_PER_GROUP(sb)
 - + le32_to_cpu(sbi-s_es-s_first_data_block);
 - i = ((i + sbi-s_stripe - 1) / sbi-s_stripe) * sbi-s_stripe;
 - i = (i - le32_to_cpu(sbi-s_es-s_first_data_block))
 - % EXT4_BLOCKS_PER_GROUP(sb);

 + /* find first stripe-aligned block in group */
 + a = e4b-bd_group * EXT4_BLOCKS_PER_GROUP(sb)
 + + le32_to_cpu(sbi-s_es-s_first_data_block);
 + a = (a + sbi-s_stripe - 1)  ~((ext4_fsblk_t)sbi-s_stripe - 1);
 + a = a - le32_to_cpu(sbi-s_es-s_first_data_block);
 + i = do_div(a, EXT4_BLOCKS_PER_GROUP(sb));

I don't think this is correct...  This code should only be used if s_stripe
is NOT a power-of-two value, otherwise we can just use the buddy maps to
find aligned chunks.  As a result I don't think that  (s_stripe - 1)
change is equivalent to / s_stripe * s_stripe.

   while (i  sb-s_blocksize * 8) {
   if (!mb_test_bit(i, bitmap)) {

Hrmmm, I thought this should be while (i  EXT4_BLOCKS_PER_GROUP(sb))
and not sb-s_blocksize * 8?  Did that fix get lost somewhere?

There are a few other changes in the patch related to this fix:
https://bugzilla.lustre.org/attachment.cgi?id=13275

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] UPDATED2: types fixup for mballoc

2008-01-03 Thread Andreas Dilger
On Jan 03, 2008  14:10 -0600, Eric Sandeen wrote:
 @@ -1732,17 +1735,19 @@ static void ext4_mb_scan_aligned(struct 
 + /* find first stripe-aligned block in group */
 + a = e4b-bd_group * EXT4_BLOCKS_PER_GROUP(sb)
   + le32_to_cpu(sbi-s_es-s_first_data_block);
 + a += sbi-s_stripe - 1;

Why not just have + sbi-s_stipe - 1 on the previous line, instead of +=?
Also minor coding style nit: the + is normally at the end of the previous
line instead of at the start of the next line, so:

/* find first stripe-aligned block in group */
a = e4b-bd_group * EXT4_BLOCKS_PER_GROUP(sb) +
le32_to_cpu(sbi-s_es-s_first_data_block) + sbi-s_stripe - 1;

 + a = (a * sbi-s_stripe) - le32_to_cpu(sbi-s_es-s_first_data_block);
 + i = do_div(a, EXT4_BLOCKS_PER_GROUP(sb));

I guess this doesn't stricly need to be a modulus either, because we know
which group this will be in and can just subtract the start.  You can just do:
 
/* find first stripe-aligned block in this group */
group_start = e4b-bd_group * EXT4_BLOCKS_PER_GROUP(sb) +;
le32_to_cpu(sbi-s_es-s_first_data_block);

a = group_start + sbi-s_stripe - 1;
do_div(a, sbi-s_stripe);
a = (a * sbi-s_stripe) - group_start;

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6.24 patch] let EXT4DEV_FS depend on BROKEN

2008-01-02 Thread Andreas Dilger
On Jan 02, 2008  03:32 +0200, Adrian Bunk wrote:
 It might make sense to offer ext4 in -mm and even in early -rc kernels, 
 but I've already seen people using ext4 simply because a stable kernel 
 offered it - and that's definitely not intended.
 
 Anyone who _really_ wants to test ext4 should anyway be able to do the 
 trivial change of removing the depends on BROKEN line.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 @@ -138,7 +138,7 @@ config EXT3_FS_SECURITY
  
  config EXT4DEV_FS
   tristate Ext4dev/ext4 extended fs support development (EXPERIMENTAL)
 - depends on EXPERIMENTAL
 + depends on BROKEN
   select JBD2
   select CRC16
   help

Isn't CONFIG_EXPERIMENTAL enough?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.

2007-12-24 Thread Andreas Dilger
On Dec 24, 2007  21:18 +0300, Alex Tomas wrote:
 Andreas Dilger wrote:
 On Dec 21, 2007  16:39 +0530, Aneesh Kumar K.V wrote:
 @@ -3790,7 +3782,9 @@ repeat:
 /* if we still need more blocks and some PAs were used, try again */
 if (free  needed  busy) {
 +   busy = 0;
 ext4_unlock_group(sb, group);
 +   schedule_timeout(HZ);
 goto repeat;
 }

 Is there nothing we could actually wait on instead of just sleeping for
 1 second?

 actually it was done for simplicity - in my tests busy PA happened quite rare.
 I have no objection to improve this with special wait queue.

If it is a very rare case, then I have no objection.  I just wanted to
avoid some sort of Nagle case where suddenly a workload is taking 1s
instead of 1ms to complete each IO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2007-12-17 Thread Andreas Dilger
On Dec 17, 2007  12:11 -0500, Theodore Tso wrote:
 * ak/undo-mgr (Mon Aug 13 15:56:26 2007 +0530) 6 commits
  - e2fsprogs: Make mke2fs use undo I/O manager.

Did you see Eric's report that using the undo manager for mke2fs caused the
performance to completely tank?  There is already enough memory pressure
caused by a regular mke2fs that having to save the blocks into tdb for a
large filesystem makes it unbearably slow.

We had also wanted to move from using db4 to tdb for the Lustre lfsck data
(collection of EA information for distributed fsck) but even at 1 files
the tdb performance was growing exponentially slower than db4 and we gave up.
I suspect the same problem hits undo manager when the number of blocks to
save is very high.

I think it might be viable to use undo manager for mke2fs when e.g.
uninit_groups and lazy_bg are enabled, since the amount of device IO is
very small then.  Otherwise, it may be that undo manager should be reserved
for e2fsck right now, or possibly reworked to just do IO to a sparse file
similar to doing an e2image before the mke2fs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2007-12-17 Thread Andreas Dilger
On Dec 17, 2007  17:59 -0500, Theodore Tso wrote:
 On Mon, Dec 17, 2007 at 03:34:55PM -0700, Andreas Dilger wrote:
  We had also wanted to move from using db4 to tdb for the Lustre lfsck data
  (collection of EA information for distributed fsck) but even at 1 files
  the tdb performance was growing exponentially slower than db4 and we gave 
  up.
  I suspect the same problem hits undo manager when the number of blocks to
  save is very high.
 
 Hm.  I was very concerned about using db4, mainly because of the ABI
 and on-disk format compatibility nightmare, which is why I chose tdb.

Yes, we have had all sorts of compatibility problems using db4 (e.g.
RHEL and SLES ship different package names, put the libraries and headers
in different locations, don't support overlapping sets of db4 libraries
between releases, etc), which is why we were hoping to be able to use tdb.

 But the performance problems are starting to make me worry.  Do you
 know how many tdb entries you had before tdb performance started going
 really badly down the toilet?  I wonder if there are some tuning knobs
 we could tweak to the performance numbers.

There is some test data at https://bugzilla.lustre.org/attachment.cgi?id=13924
which is a PDF file.  This shows 1000 items is reasonable, and 1 is not.

The majority of the time is taken looking up existing entries, and this
is due to one unusual requirement of the Lustre usage to be notified
of duplicate insertions to detect duplicate use of objects, so this may
have been a major factor in the slowdown.  It isn't really practical to
use a regular libext2fs bitmap for our case, since the collision space
is a 64-bit integer, but maybe we could have done this with an RB tree
or some other mechanism.

So, your mileage may vary with the undo manager usage, but it is
definitely worth writing a test case (e.g. time creation of filesystems of
progressively larger size on a large device) and seeing how bad it gets.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Flex_BG ialloc awareness V2.

2007-12-14 Thread Andreas Dilger
On Dec 13, 2007  20:36 -0600, Jose R. Santos wrote:
 ... if the value in the super block is corrupted and
 does not represent the actual flexbg size, the inode allocation will
 behave in weird unexpected ways.  Just as we check that the bitmaps are
 within the block group range (when not using flexbg), we should
 probably sanity check the size of the flexbg as reported in the super
 block.
 
 Or do you believe the check is unnecessary?

Well, I can imagine in some cases that the flexbg will not be completely
contiguous on disk (e.g. after a filesystem resize, if there are bad
blocks, etc).  As long as the group descriptors themselves are correct
(i.e. referencing valid bitmaps/itable) then it shouldn't cause a mount
failure if the per-group data isn't strictly aligned according to the
superblock flexbg count.

We would need to validate the group descriptor separately though (e.g.
group checksums).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Flex_BG ialloc awareness V2.

2007-12-13 Thread Andreas Dilger
On Dec 13, 2007  09:51 -0600, Jose R. Santos wrote:
 Now, storing the bits only guaranties that the flexbg size is always a
 power-of-two and does not guarantee that the super block flexbg size
 represents the actual meta-data grouping on disk.  For this we need to
 verify that the bitmap offsets match what the super block reports.  It
 may be an unlikely scenario, but it may be worth it to check this as
 well at mount time.

I'm not sure what you mean...  Isn't the flexbg size just a count of
the number of block groups?  If it is always a power of two, and the
groups per metabg is always a power of two (it is) then they will
always be even multiples.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Flex_BG ialloc awareness V2.

2007-12-11 Thread Andreas Dilger
On Dec 07, 2007  09:52 -0600, Jose R. Santos wrote:
 Andreas Dilger [EMAIL PROTECTED] wrote:
  There is no particular reason that this ratio needs to be *100, it could
  just as easily be a fraction of 256 and make the multiply into a shift.
  The free_block_ratio would be 26 in that case.
 
 The idea here is to reserve 10% (free_block_ratio) of free blocks in a
 flexbg for allocation of new files and expansion of existing one.  The
 *100 make the math here easy but this still something that need to be
 tune further.  I'm sure we can do this in a series of shifts, just
 haven't spent the time thinking of a clever way to do this.

This is a common misconception for code to have 10% mean 10 / 100.  It
is just as good to have 26/256

   @@ -622,7 +631,9 @@ struct ext4_super_block {
 __le16  s_mmp_interval; /* # seconds to wait in MMP checking */
 __le64  s_mmp_block;/* Block for multi-mount protection */
 __le32  s_raid_stripe_width;/* blocks on all data disks (N*stride)*/
   - __u32   s_reserved[163];/* Padding to the end of the block */
   + __le16  s_flex_bg_size; /* FLEX_BG group size */
  
  Shouldn't this be s_flex_bg_bits?
 
 I debated whether to store this as the s_flex_bg_size and calculate the
 bits during the filesystem mount time or just stored the bit in the
 super block to begging with.  The reason I stored the size is that it
 seemed more in line with the other fields in the super block.  I don't
 mind either way since this is more of a style issue, although saving an
 extra 8bits in the super block may be good enough reason to change it. 

I'd think being able to avoid the divide for every inode allocation is more
important than 8 bits in the superblock.

  My preference would be to have if (EXT2_HAS_INCOMPAT...) { ... } else {
  (i.e. add { } for the first part) since there are { } on the second part,
  and it is just easier to read.
 
 Mine too, but checkpatch complained about this. :)

Time to fix checkpatch it would seem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Flex_BG ialloc awareness V2.

2007-12-11 Thread Andreas Dilger
On Dec 11, 2007  10:08 -0600, Jose R. Santos wrote:
  I'd think being able to avoid the divide for every inode allocation is more
  important than 8 bits in the superblock.
 
 We already avoid the divide since what we store in the sbi IS the bits
 which are calculated at mount time for each fs.  Base on the other
 fields in the super block struct, I decided to put explicit size of the
 flexbg in the super block.  The kernel code can decide how best to use
 that number which in this case its used to calculate the number of bits
 in order to avoid doing divides.
 
 So this is really a styling issue in how to record data in the super
 block.  The only technical issue with this is whether it's important to
 save those extra 8 bits in the super block struct.

Well, if it is stored in the superblock as a non-power-of-two value, then
there always exists the possibility that it is set incorrectly (maybe by
a version of mke2fs that doesn't verify this) and the code will not do the
right thing.  Storing it in bits (as is done with e.g. s_log_block_size and
s_log_frag_size) ensures there is no possibility of a value that isn't a
power-of-two.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] 64 bit blocks support for e2fsprogs

2007-12-10 Thread Andreas Dilger
On Dec 07, 2007  11:14 -0500, Josef Bacik wrote:
 At this point I've only added the helper functions and converted mke2fs
 (sloppily for now) to use the helper functions.  I know we want to make this
 as painless as possible to go between ext2/3/4 so what I plan on doing is
 leave everything as it is and only allow large disks to be formatted if
 EXT4_FEATURE_INCOMPAT_64BIT is set.  Any feedback would be awesome, and if
 somebody is already working on this please let me know so I'm not duplicating
 work.  Thanks much,

 +_INLINE_ void ext2fs_blocks_count_set(struct ext2_super_block *super,
 +   blk64_t blk)
 +{
 + super-s_blocks_count = ext2fs_cpu_to_le32((__u32)blk);
 + super-s_blocks_count_hi = ext2fs_cpu_to_le32(blk  32);
 +}

I'm undecided of whether there should be checking of whether INCOMPAT_64BIT
set in the superblock before using the s_blocks*_hi fields.  In one sense
these fields are only valid when the INCOMPAT_64BIT flag is set, but
there is also a concern that the 64BIT flag might be incorrectly unset...

Given that these are the ext2fs_* interfaces, and e2fsck should probably
be doing its own validation of the 64BIT flag, along with the size of the
physical device and the s_blocks*_hi fields I think the ext2fs_* functions
SHOULD check for INCOMPAT_64BIT being set before using the _hi fields.

 @@ -191,9 +192,9 @@ static void test_disk(ext2_filsys fs, badblocks_list 
 *bb_list)
 - sprintf(buf, badblocks -b %d -X %s%s%s %u, fs-blocksize,
 + sprintf(buf, badblocks -b %d -X %s%s%s %lu, fs-blocksize,
   quiet ?  : -s , (cflag  1) ? -w  : ,
 - fs-device_name, fs-super-s_blocks_count-1);
 + fs-device_name, ext2fs_blocks64_count(fs)-1);

%lu is not correct for a 64-bit field, which will be long long on 32-bit
systems.  Instead I'd suggest using %llu and casting the return of
ext2fs_blocks64_count() to (unsigned long long).

 - retval = zero_blocks(fs, 0, fs-super-s_blocks_count,
 + retval = zero_blocks(fs, 0, ext2fs_blocks64_count(fs),
progress, blk, count);

Since zero_blocks() only takes a blk_t as a parameter I think this is
incorrect.  However, given it is a static internal function we can also
change the prototype to fix it.

 @@ -686,9 +690,10 @@ static void show_stats(ext2_filsys fs)
 + if (ext2fs_blocks_count(fs_param.super) != ext2fs_blocks_count(s))
   fprintf(stderr, _(warning: %u blocks unused.\n\n),
 -fs_param.s_blocks_count - s-s_blocks_count);
 +ext2fs_blocks_count(fs_param.super) - 
 +ext2fs_blocks_count(s));

Need to cast the difference to (unsigned).  The difference will never need
to be a 64-bit value, but the returned type will be blk64_t.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Flex_BG ialloc awareness V2.

2007-12-07 Thread Andreas Dilger
, 
 +EXT4_FEATURE_INCOMPAT_FLEX_BG)) 
 + ext2fs_allocate_flex_groups(fs);
 + 
 + else {
 + for (i = 0; i  fs-group_desc_count; i++) {
 + retval = ext2fs_allocate_group_table(fs, i, 
 fs-block_map);
 + if (retval)
 + return retval;
 + }
   }

My preference would be to have if (EXT2_HAS_INCOMPAT...) { ... } else {
(i.e. add { } for the first part) since there are { } on the second part,
and it is just easier to read.

 @@ -1045,6 +1046,19 @@ static void PRS(int argc, char *argv[])
 + if ((flex_bg_size  (flex_bg_size-1)) != 0) {
 + com_err(program_name, 0,
 + _(Flex_BG size must be a power of 2));
 + exit(1);

If flex_bg_size is a power of two then there isn't any need to store anything
except __u8 s_flex_bg_bits in the superblock.

 @@ -1444,6 +1458,10 @@ static void PRS(int argc, char *argv[])
 + if(flex_bg_size) {
 + fs_param.s_flex_bg_size = ext2fs_swab16(flex_bg_size);
 + }

Space between if and (, and no need for braces for a single line body.

It would also be nice to get a m_flexbg test case along with this patch
that (at minimum) creates a filesystem with flexbg enabled, and then
runs e2fsck on it.  This was broken for the lazy_bg feature for a long
time, so it makes sense to add a test to verify each new feature has
some basic functionality.  If the f_random_corruption test is in the
git tree, it would be good to add the flex_bg option to the list of
possible feature combinations to test.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] ext4: different maxbytes functions for bitmap extent files

2007-12-05 Thread Andreas Dilger
On Dec 04, 2007  09:58 -0600, Eric Sandeen wrote:
 +static loff_t ext4_max_size(int blkbits)
 +{
 + loff_t res;
 + loff_t upper_limit = MAX_LFS_FILESIZE;
 +
 + /* small i_blocks in vfs inode? */
 + if (sizeof(blkcnt_t)  sizeof(u64)) {

It would probably be more future-proof by having a struct inode here and
using if (sizeof(inode-i_blocks)  sizeof(__u64)).

 + upper_limit = (1LL  32) - 1;
 +
 + /* total blocks in file system block size */
 + upper_limit = (blkbits - 9);
 + upper_limit = blkbits;

There is probably a cleaner way to do this, maybe:

upper_limit = (1LL  (32 + 9) - 1)  ~((1  blkbits) - 1);

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Understanding mballoc

2007-12-03 Thread Andreas Dilger
On Dec 03, 2007  23:42 +0530, Aneesh Kumar K.V wrote:
 This is my attempt at understanding multi block allocator. I have
 few questions marked as FIXME below. Can you help answering them.
 Most of this data is already in the patch queue as commit message.
 I have updated some details regarding preallocation. Once we
 understand the details i will update the patch queue commit message.

Some comments below, Alex can answer more authoritatively.

 If we are not able to find blocks in the inode prealloc space and if we have
 the group allocation flag set then we look at the locality group prealloc
 space. These are per CPU prealloc list repreasented as
 
 ext4_sb_info.s_locality_groups[smp_processor_id()]
 
 /* FIXME!! 
 After getting the locality group related to the current CPU we could be
 scheduled out and scheduled in on different CPU. So why are we putting the
 locality group per cpu ?
 */

I think just to avoid contention between CPUs.  It is possible to get
scheduled at this point it is definitely unlikely.  There does appear
to still be proper locking for the locality group, so at worst we get
contention between 2 CPUs for the preallocation instead of all of them.

 /* FIXME: 
 We need to explain the normalization of the request length.
 What are the conditions we are checking the request length
 against. Why are group request always requested at 512 blocks ?

Probably no particular reason for 512 blocks = 2MB, other than some
decent number of smaller requests can fit in there before looking
for another one.

One note for normalization - regarding recent benchmarks that show
e2fsck performance improvement for clustering of indirect blocks it
would also seem that allocating index blocks in the same preallocation
group could provide a similar improvement for mballoc+extents.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Flex_BG ialloc awareness.

2007-12-03 Thread Andreas Dilger
On Dec 03, 2007  13:05 -0600, Jose R. Santos wrote:
 @@ -600,6 +600,7 @@ void ext4_free_blocks_sb(handle_t *handle, struct 
 super_block *sb,
   ext4_grpblk_t group_freed;
 + ext4_group_t meta_group;

Please do not call these meta_groups.  This already means something very
specific (i.e. desc_per_block groups) and using it for FLEX_BG is confusing.
One possibly desirable relation is if the FLEX_BG count is some integer or
power-of-two multiple of the metabg count.  That would allow the FLEX_BG
code to share the same in-memory group struct as the mballoc code and save
on some memory overhead.

 + meta_group = ext4_meta_group(sbi, block_group);
 + 
 spin_lock(sbi-s_meta_groups[meta_group].meta_group_lock);
 + sbi-s_meta_groups[meta_group].free_inodes++;
 + if (is_directory)
 + sbi-s_meta_groups[meta_group].num_dirs--;
 + 
 spin_unlock(sbi-s_meta_groups[meta_group].meta_group_lock);

This can be as many as hundreds or thousands of spin locks.  Why not use
the same hashed locking code as the group descriptors themselves?

spin_lock(sb_bgl_lock(sbi, meta_group));
spin_unlock(sb_bgl_lock(sbi, meta_group));

This scales with the number of CPUs and chance of contention is very low.

 +int find_group_meta(struct super_block *sb, struct inode *parent)
 +{
 + ext4_group_t parent_mgroup = parent_group / sbi-s_groups_per_meta;

This could use ext4_meta_group(sbi, parent_group)?

 +static inline ext4_group_t ext4_meta_group(struct ext4_sb_info *sbi,
 +  ext4_group_t block_group)
 +{
 + return block_group/sbi-s_groups_per_meta;
 +}

It would be preferable to limit s_groups_per_meta to be a power-of-two
so that this can become a shift instead of a divide.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: add block bitmap validation

2007-11-21 Thread Andreas Dilger
On Nov 19, 2007  20:26 +0530, Aneesh Kumar K.V wrote:
 +static int ext4_valid_block_bitmap(struct super_block *sb,
 + struct ext4_group_desc *desc,
 + unsigned int block_group,
 + struct buffer_head *bh)
 +{
 + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
 + /* with FLEX_BG, the inode/block bitmaps and itable
 +  * blocks may not be in the group at all
 +  * so the bitmap validation will be skipped for those groups
 +  * or it has to also read the block group where the bitmaps
 +  * are located to verify they are set.
 +  */
 + return 1;
 + }

Rather than skipping the bitmap check entirely when FLEX_BG is enabled,
it is probably better to just validate whether each of the bitmap/itable
offsets are within the bitmap being read, like:

offset = ...
if (offset  EXT4_BLOCKS_PER_GROUP(sb) 
ext4_test_bit(offset, bh-b_data))
/* bad block bitmap */
goto err_out;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Andreas Dilger
On Nov 20, 2007  12:22 -0800, Mingming Cao wrote:
 On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
  Mingming Cao wrote:
   On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
  The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because
  directory inode will also use a inode slot from reserved area, reset
  slots number for files is (s_dir_ireserve_nr - 1).  Except for the
  reserved inodes number, your understanding exactly matches my idea.
 
 The performance gain on large number of directories looks interesting,
 but I am afraid this makes the uninitialized block group feature less
 useful (to speed up fsck) than before:( The reserved inode will cause
 the unused inode watermark higher than before, and spread the
 directories over to other block groups earlier than before. Maybe it
 worth it...not sure.

My original thoughts on the design for this were slightly different:
- that the per-directory reserved window would scale with the size of
  the directory, so that even (or especially) with htree directories the 
  inodes would be kept in hash-ordered sets to speed up stat/unlink
- it would be possible/desirable to re-use the existing block bitmap
  reservation code to handle inode bitmap reservation for directories
  while those directories are in-core.  We already have the mechanisms
  for this, all that would need to change is have the reservation code
  point at the inode bitmaps but I don't know how easy that is
- after an unmount/remount it would be desirable to re-use the same blocks
  for getting good hash-inode mappings, wherein lies the problem of
  compatibility

One possible solutions for the restart problem is searching the directory
leaf block in which an inode is being added for the inode numbers and try
to use those as a goal for the inode allocation...  Has a minor problem
with ordering, because usually the inode is allocated before the dirent
is created, but isn't impossible to fix (e.g. find dirent slot first,
keep a pointer to it, check for inode goals, and then fill in dirent
inum after allocating inode)

   5, Performance number
   On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
   partition, and tried to create 5 directories, and create 15 (1KB)
   files in each directory alternatively. After a remount, I tried to
   remove all the directories and files recursively by a 'rm -rf'.
   Below is the benchmark result,
normal ext4 ext4 with dir inode reservation
mount options:  -o data=writeback   -o 
   data=writeback,dir_ireserve=low
Create dirs:real0m49.101s   real2m59.703s
Create files:   real24m17.962s  real21m8.161s
Unlink all: real24m43.788s  real17m29.862s

One likely reason that the create dirs step is slower is that this is
doing a lot more IO than in the past.  Only a single inode in each
inode table block is being used, so that means that a lot of empty
bytes are being read and written (maybe 16x as much data in this case).

Also, in what order are you creating files in the directories?  If you
are creating them in directory order like:

for (f = 0; f  15; f++)
for (i = 0; i  5; i++)
touch dir$i/f$f

then it is completely unsurprising that directory reservation is faster
at file create/unlink because those inodes are now contiguous at the
expense of having gaps in the inode sequence.  Creating 15 files per
directory is of course the optimum test case also.

How does this patch behave with benchmarks like dbench, mongo, postmark?

 It would be nice to remember the last allocated bit for each block
 group, so we don't have to start from bit 0 (in the worse case scan the
 entire block group) for every single directory. Probably this can be
 saved in the in-core block group descriptor.  Right now the in core
 block group descriptor structure is the same on-disk block-group
 structure though, it might worth to separate them and provide load/store
 helper functions.

Note that mballoc already creates an in-memory struct for each group.
I think the initialization of this should be moved outside of mballoc
so that it can be used for other purposes as you propose.

Eric had a benchmark where creating many files/subdirs would cause
a huge slowdown because of bitmap searching, and having a per-group
pointer with the first free inode (or last allocated inode might be
less work to track) would speed this up a lot. 

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: test for EA validity checking

2007-11-15 Thread Andreas Dilger
On Nov 10, 2007  05:10 -0700, Andreas Dilger wrote:
 Attached is a test image for extended attribute block checking.
 This small image contains a number of different kinds of corruptions
 (bad magic, bad checksum, empty block, bad EA block number), as well
 as an old-format (v1) EA, and a valid v2 EA.

Attached is an updated patch with a new test image, that includes a
corrupt EA set on a block device.  That had been causing us some
e2fsck heartburn because of random inode table corruption.

This goes along with an updated expand-extra-isize patch.  The root of
the problem was that I had added an extra check to verify the EA magic
in ext2fs_read_ext_attr(), but the return value of this function was used
in check_blocks() to set pctx-errcode, and later in check_ext_attr() the
non-zero value of pctx-errcode caused an abort when
ext2fs_inode_has_valid_blocks() was false and it caused PR_1_BLOCK_ITERATE
to be hit (a fatal error).

I ended up checking the magic in all of the ext2fs_read_ext_attr() callsites,
but IMHO the library should at least be doing basic validity checking like
this.

One question that remains unclear is whether pctx-errcode being set from
the early call of check_ext_attr() should cause check_blocks() to abort?
In the common regular-file/directory/slow-symlink case pctx-errcode is
reset by the call to ext2fs_block_iterate2(), so it would seem reasonable
to go back to checking EA magic in ext2fs_read_ext_attr(), and then clearing
pctx-errcode in check_ext_attr() if the problem is fixed.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



e2fsprogs-tests-f_ea_checks.patch
Description: Binary data
Index: e2fsprogs-1.40.1/lib/ext2fs/ext_attr.c
===
--- e2fsprogs-1.40.1.orig/lib/ext2fs/ext_attr.c
+++ e2fsprogs-1.40.1/lib/ext2fs/ext_attr.c
@@ -17,6 +17,7 @@
 #endif
 #include string.h
 #include time.h
+#include errno.h
 
 #include ext2_fs.h
 #include ext2_ext_attr.h
@@ -60,11 +61,39 @@ __u32 ext2fs_ext_attr_hash_entry(struct 
 #undef NAME_HASH_SHIFT
 #undef VALUE_HASH_SHIFT
 
+#define BLOCK_HASH_SHIFT 16
+/*
+ * Re-compute the extended attribute hash value after an entry has changed.
+ */
+static void ext2fs_attr_rehash(struct ext2_ext_attr_header *header,
+  struct ext2_ext_attr_entry *entry)
+{
+   struct ext2_ext_attr_entry *here;
+   __u32 hash = 0;
+
+   entry-e_hash = ext2fs_ext_attr_hash_entry(entry, (char *) header +
+  entry-e_value_offs);
+
+   here = ENTRY(header+1);
+   while (!EXT2_EXT_IS_LAST_ENTRY(here)) {
+   if (!here-e_hash) {
+   /* Block is not shared if an entry's hash value == 0 */
+   hash = 0;
+   break;
+   }
+   hash = (hash  BLOCK_HASH_SHIFT) ^
+  (hash  (8*sizeof(hash) - BLOCK_HASH_SHIFT)) ^
+  here-e_hash;
+   here = EXT2_EXT_ATTR_NEXT(here);
+   }
+   header-h_hash = hash;
+}
+
 errcode_t ext2fs_read_ext_attr(ext2_filsys fs, blk_t block, void *buf)
 {
errcode_t   retval;
 
-   retval = io_channel_read_blk(fs-io, block, 1, buf);
+   retval = io_channel_read_blk(fs-io, block, 1, buf);
if (retval)
return retval;
 #ifdef EXT2FS_ENABLE_SWAPFS
@@ -92,7 +121,7 @@ errcode_t ext2fs_write_ext_attr(ext2_fil
} else
 #endif
write_buf = (char *) inbuf;
-   retval = io_channel_write_blk(fs-io, block, 1, write_buf);
+   retval = io_channel_write_blk(fs-io, block, 1, write_buf);
if (buf)
ext2fs_free_mem(buf);
if (!retval)
@@ -126,7 +155,10 @@ errcode_t ext2fs_adjust_ea_refcount(ext2
if (retval)
goto errout;
 
-   header = (struct ext2_ext_attr_header *) block_buf;
+   header = BHDR(block_buf);
+   if (header-h_magic != EXT2_EXT_ATTR_MAGIC)
+   return EXT2_ET_EA_BAD_MAGIC;
+
header-h_refcount += adjust;
if (newcount)
*newcount = header-h_refcount;
@@ -140,3 +172,881 @@ errout:
ext2fs_free_mem(buf);
return retval;
 }
+
+struct ext2_attr_info {
+   int name_index;
+   const char *name;
+   const char *value;
+   int value_len;
+};
+
+struct ext2_attr_search {
+   struct ext2_ext_attr_entry *first;
+   char *base;
+   char *end;
+   struct ext2_ext_attr_entry *here;
+   int not_found;
+};
+
+struct ext2_attr_ibody_find {
+   ext2_ino_t ino;
+   struct ext2_attr_search s;
+};
+
+struct ext2_attr_block_find {
+   struct ext2_attr_search s;
+   char *block;
+};
+
+void ext2fs_attr_shift_entries(struct ext2_ext_attr_entry *entry,
+  int value_offs_shift, char *to,
+  char *from, int n)
+{
+   struct

test for EA validity checking

2007-11-10 Thread Andreas Dilger
Attached is a test image for extended attribute block checking.
This small image contains a number of different kinds of corruptions
(bad magic, bad checksum, empty block, bad EA block number), as well
as an old-format (v1) EA, and a valid v2 EA.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



f_ea_checks.tgz
Description: GNU Zip compressed data


Re: ext3 metaclustering performance numbers and updated patch

2007-11-09 Thread Andreas Dilger
number of indirect blocks/group is?  For large files ( 8MB or so)
I suspect the above will be sufficient, but for small files ( 2MB)
this wouldn't be enough as the indirect blocks will not be filled.
I guess it isn't harmful in the case where these reserved blocks run
out...

 +ext3_get_metacluster_range(struct super_block *sb,
 + ext3_grpblk_t *mc_start,
 + ext3_grpblk_t *mc_end)  /* exclusive */
 +{
 + *mc_start = EXT3_BLOCKS_PER_GROUP(sb) / 2;

Have you tested with mc_start right after the inode table?  For e2fsck
that might be slightly better positioning as it could potentially be
read into cache without a seek, and it avoids breaking up the group in
the middle if you are creating large files.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix check_mntent_file() to pass mode for open(O_CREAT)

2007-11-08 Thread Andreas Dilger
On my FC8 install, ismounted.c fails to build because open(O_CREAT) is
used without passing a mode.  The following trivial patch fixes it.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
===
--- e2fsprogs-1.40.2.orig/lib/ext2fs/ismounted.c
+++ e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
@@ -147,7 +147,7 @@ static errcode_t check_mntent_file(const
 is_root:
 #define TEST_FILE /.ismount-test-file
*mount_flags |= EXT2_MF_ISROOT;
-   fd = open(TEST_FILE, O_RDWR|O_CREAT);
+   fd = open(TEST_FILE, O_RDWR|O_CREAT, 0600);
if (fd  0) {
if (errno == EROFS)
*mount_flags |= EXT2_MF_READONLY;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] allow tune2fs to set/clear resize_inode

2007-11-07 Thread Andreas Dilger
On Nov 06, 2007  13:51 -0500, Theodore Tso wrote:
 On Tue, Nov 06, 2007 at 09:12:55AM +0800, Andreas Dilger wrote:
  What is needed is an ext2prepare-like step that involves resize2fs code
  to move the file/dir blocks and then the move inode table, as if the
  filesystem were going to be resized to the new maximum resize limit,
  and then create the resize inode but do not actually add new blocks/groups
  at the end of the filesystem.
 
 Yeah, the plan was to eventually add ext2prepare-like code into
 tune2fs, using the undo I/O manager for safety.  But that's been
 relatively low priority.
 
 BTW, I've gotten ~2 bug reports from Debian users claiming that
 ext2prepare had trashed their filesystem.  I don't have any clean
 evidence about whether it was a userspace error or some kind of bug in
 ext2prepare, possibly conflicting with some new ext3 feature that
 we've since added that ext2prepare doesn't properly account for
 (extended attributes, maybe?).  
 
 I have not had time to look into it, but thought has crossed my mind
 that a quick hack would be to splice the undo manager into
 ext2prepare, have it run e2fsck, and if it fails, do a rollback,
 create an e2image file, and then instruct the user to send in a bug
 report.  :-)

I don't think it would be very easy to splice undo manager into ext2prepare.
I'd rather see time spent to make resize2fs handle the prepare functionality
and then ext2resize can be entirely obsoleted.

Aneesh, adding undo manager to resize2fs would be an excellent use of that
library, and going from resize2fs to resize2fs --prepare-only (or whatever)
would be trivial I think.  Is that something you'd be interested to work on?

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: delalloc fragmenting files?

2007-11-07 Thread Andreas Dilger
On Nov 06, 2007  13:54 -0600, Eric Sandeen wrote:
 Hmm bad news is when I add uninit_groups into the mix, it goes a little
 south again, with some out-of-order extents.  Not the end of the world,
 but a little unexpected?
 
 
 Discontinuity: Block 1430784 is at 24183810 (was 24181761)
 Discontinuity: Block 1461760 is at 24216578 (was 24214785)
 Discontinuity: Block 1492480 is at 37888 (was 24247297)
 Discontinuity: Block 1519616 is at 850944 (was 65023)
 Discontinuity: Block 1520640 is at 883712 (was 851967)
 Discontinuity: Block 1521664 is at 1670144 (was 884735)
 Discontinuity: Block 1522688 is at 2685952 (was 1671167)
 Discontinuity: Block 1523712 is at 4226048 (was 2686975)
 Discontinuity: Block 1524736 is at 11271168 (was 4227071)
 Discontinuity: Block 1525760 is at 23952384 (was 11272191)

I think part of the issue is that by default the groups marked BLOCK_UNINIT
are skipped, to avoid dirtying those groups if they have never been used
before.  This policy could be changed in the mballoc code pretty easily if
you think it is a net loss.  Note that the size of the extents is large
enough (120MB or more) that some small reordering is probably not going
to affect the performance in any meaningful way.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: More testing: 4x parallel 2G writes, sequential reads

2007-11-07 Thread Andreas Dilger
On Nov 07, 2007  16:42 -0600, Eric Sandeen wrote:
 I tried ext4 vs. xfs doing 4 parallel 2G IO writes in 1M units to 4
 different subdirectories of the root of the filesystem:
 
 http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads.png
 http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads.png
 http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_threads.png
 
 and then read them back sequentially:
 
 http://people.redhat.com/esandeen/seekwatcher/ext4_4_threads_read.png
 http://people.redhat.com/esandeen/seekwatcher/xfs_4_threads_read.png
 http://people.redhat.com/esandeen/seekwatcher/ext4_xfs_4_read_threads.png
 
 At the end of the write, ext4 had on the order of 400 extents/file, xfs
 had on the order of 30 extents/file.  It's clear especially from the
 read graph that ext4 is interleaving the 4 files, in about 5M chunks on
 average.  Throughput seems comparable between ext4  xfs nonetheless.

The question is what the best result is for this kind of workload?
In HPC applications the common case is that you will also have the data
files read back in parallel instead of serially.

The test shows ext4 finishing marginally faster in the write case, and
marginally slower in the read case.  What happens if you have 4 parallel
readers?

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2007-11-05 Thread Andreas Dilger
On Nov 04, 2007  23:32 -0500, Theodore Tso wrote:
 On Mon, Nov 05, 2007 at 10:15:33AM +0800, Andreas Dilger wrote:
  On Nov 04, 2007  17:42 -0500, Theodore Ts'o wrote:
   * js/uninit (Sun Oct 21 21:04:24 2007 -0500) 14 commits
- Add m_uninit test case.
- Add new mm_lazy test case.
- Fix test cases.
  
  Do you have all of the f_uninit test cases we made?
  
  f_uninit_bad_free_inodes
  f_uninit_blk_used_not_set
  f_uninit_checksum_bad
  f_uninit_disable
  f_uninit_enable
  f_uninit_inode_past_unused
  f_uninit_last_uninit
  f_uninit_set_inode_not_set
 
 Nope, it wasn't in the patches that Jose sent out.  I assume they're
 in the CFS e2fsprogs?

Yes.  You can get our full patch series at:

ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/e2fsprogs-1.40.2-cfs1.patches.tgz

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: delalloc fragmenting files?

2007-11-05 Thread Andreas Dilger
On Nov 05, 2007  11:37 -0600, Eric Sandeen wrote:
 Alex Tomas wrote:
  thanks for the data. the attached patch should fix couple issues:
  broken history output and policy in that smaller usable chunk to
  be used. can you give it a spin, please?
 
 This looks *much* better:
 
 First block: 100355
 Last block: 2342657
 Discontinuity: Block 30208 is at 133120 (was 130562)
 Discontinuity: Block 60928 is at 165891 (was 163839)
 Discontinuity: Block 90368 is at 197634 (was 195330)
 Discontinuity: Block 121856 is at 232448 (was 229121)
 Discontinuity: Block 150528 is at 263170 (was 261119)
 Discontinuity: Block 181504 is at 296963 (was 294145)
 Discontinuity: Block 210944 is at 328706 (was 326402)
 Discontinuity: Block 241664 is at 361474 (was 359425)
 Discontinuity: Block 272896 is at 395264 (was 392705)
 Discontinuity: Block 303360 is at 427010 (was 425727)
 Discontinuity: Block 334080 is at 459778 (was 457729)
 Discontinuity: Block 365568 is at 493568 (was 491265)
 Discontinuity: Block 395264 is at 525314 (was 523263)
 Discontinuity: Block 426240 is at 558082 (was 556289)
 

On a related note - the FIEMAP patches to filefrag also include a new
output format that is much more useful, IMHO.  The new format is like:

{filename}
ext: [logical start.. end kB]: phys start..end kB:   kB:lun: flags
  0: [ 0.. 30207]: 401416..522251:   120828: 0 :
  1: [ 30208.. 60927]: 532480..655359:   122880: 0 :
  2: [ 60928..121855]: 790536..916484:   125948: 0 :

Hopefully Kalpak will be able to post the updated patches here soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] allow tune2fs to set/clear resize_inode

2007-11-05 Thread Andreas Dilger
On Nov 05, 2007  14:45 -0600, Eric Sandeen wrote:
 (this patch has been carried in Red Hat / Fedora rpms for a while, not 
 sure why it never got sent upstream... clearing this allows for mounting 
 filesystems with a resize_inode on older systems)

 Allow tune2fs to set  clear resize_inode; requires fsck afterwards.

Two things that are a bit confusing:
- since RESIZE_INODE is a COMPAT feature, I don't see how this affects
  mounting the filesystem on older systems?  Is there a bug somewhere?
- why not use something like remove_journal_inode() (or create a new
  helper function like ext2fs_unlink(), and move the kill_file_by_inode()
  and release_blocks_proc() into lib/ext2fs as ext2fs_delete_inode())
  to avoid the need for an e2fsck?

   if ((sparse != old_sparse) ||
 - (filetype != old_filetype)) {
 + (filetype != old_filetype) ||
 + (resize_inode != old_resize_inode)) {
   sb-s_state = ~EXT2_VALID_FS;
   printf(\n%s\n, _(please_fsck));
   }

I don't know that it is so easy to enable RESIZE_INODE on an existing
filesystem as just setting the feature flag and running e2fsck?  The
reserved group descriptor blocks will potentially conflict with the
bitmaps and inode tables.

What is needed is an ext2prepare-like step that involves resize2fs code
to move the file/dir blocks and then the move inode table, as if the
filesystem were going to be resized to the new maximum resize limit,
and then create the resize inode but do not actually add new blocks/groups
at the end of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in e2fsprogs.git (topics)

2007-11-04 Thread Andreas Dilger
On Nov 04, 2007  17:42 -0500, Theodore Ts'o wrote:
 * js/uninit (Sun Oct 21 21:04:24 2007 -0500) 14 commits
  - Add m_uninit test case.
  - Add new mm_lazy test case.
  - Fix test cases.

Do you have all of the f_uninit test cases we made?

f_uninit_bad_free_inodes
f_uninit_blk_used_not_set
f_uninit_checksum_bad
f_uninit_disable
f_uninit_enable
f_uninit_inode_past_unused
f_uninit_last_uninit
f_uninit_set_inode_not_set

  - Update uninit block group documetation for some of the utilities.
  - Make e2fsck uninit block group aware.
  - Make debugfs uninit block group aware.
  - Make resize2fs uninit block group aware.
  - Make dumpe2fs uninit block group aware.
  - Make tune2fs uninit block group aware.
  - Add support for creating filesystems using uninit block group.
  - Rename feature name from gdt_checksum to uninit_groups.
  - Add uninit block group support on libe2fs.
  - Add initial checksum support.
  + Reorder some of the $(SRCS) in alphabetical order.
 
   This still doesn't pass make check in lib/ext2fs, but I'll
   take a closer look at this in the near future.

That's a bit of a surprise, since the e2fsprogs-uninit patches in
our own CVS repo pass make check.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

2007-11-02 Thread Andreas Dilger
On Nov 02, 2007  08:31 -0800, Badari Pulavarty wrote:
 On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
  On Nov 01, 2007  17:40 -0700, Mingming Cao wrote:
   Current journal checksumming patch failed fsstress test on NUMA. The 
   bh-b_data passed to the crc32_be () function could be NULL pointer, 
   which caused kernel oops immediately when running fsstress with -o 
   journal_checksum. It is because the page is part of highmem on NUMA box.
   We need to kmap the page before access the bh-b_data to calculate
   the checksums.
  
  I have no objection to the patch, per-se, but I'm surprised that there
  would ever be a buffer head pointing at a page in high memory?  That
  seems contrary to what I would expect...
 
 I was surprised to see that too while helping Mingming/Avantika track
 this issue. I was under impression that we are checksumming only
 metadata and it should be lowmem. But only buffer_heads are in lowmem.
 Pages that point to can be in Highmem.

But...  this implies that every user of bh-b_data needs to kmap, and I
don't see that in the code anywhere else.  That makes me think something
else is going wrong here.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC]Ext4: Use get_cpu()/put_cpu() in preemptible context

2007-11-02 Thread Andreas Dilger
On Nov 02, 2007  17:35 -0700, Mingming Cao wrote:
 Index: linux-2.6.24-rc1/fs/ext4/mballoc.c
 ===
 --- linux-2.6.24-rc1.orig/fs/ext4/mballoc.c   2007-11-02 17:22:18.0 
 -0700
 +++ linux-2.6.24-rc1/fs/ext4/mballoc.c2007-11-02 17:23:02.0 
 -0700
 @@ -4006,7 +4006,8 @@ static void ext4_mb_group_or_file(struct
   return;
  
   BUG_ON(ac-ac_lg != NULL);
 - ac-ac_lg = sbi-s_locality_groups[smp_processor_id()];
 + ac-ac_lg = sbi-s_locality_groups[get_cpu()];
 + put_cpu();
  
   /* we're going to use group allocation */
   ac-ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;

Shouldn't the put_cpu() be after ac-ac_lg is no longer being used?
I guess there would otherwise be a danger of other processes using
the same s_locality_groups[] struct?

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[EMAIL PROTECTED]: [RFC PATCH ext3/ext4] orphan list corruption due bad inode]

2007-11-01 Thread Andreas Dilger
I just noticed this patch in the old parts of my inbox, but it was never
landed to ext3 or ext4.

- Forwarded message from Vasily Averin [EMAIL PROTECTED] -

Date: Mon, 04 Jun 2007 09:19:10 +0400
From: Vasily Averin [EMAIL PROTECTED]
To: Linux Kernel Mailing List [EMAIL PROTECTED],
Andrew Morton [EMAIL PROTECTED], [EMAIL PROTECTED],
linux-ext4@vger.kernel.org, Stephen Tweedie [EMAIL PROTECTED],
[EMAIL PROTECTED]
Subject: [RFC PATCH ext3/ext4] orphan list corruption due bad inode

After ext3 orphan list check has been added into ext3_destroy_inode() (please 
see my previous patch) the following situation has been detected:
 EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file 
(37901290), 0
 Inode 0101a15b7840: orphan list check failed!
 0773 6f665f00 74616d72 0573 65725f00 06737270 6600 616d726f
...
 Call Trace: [80211ea9] ext3_destroy_inode+0x79/0x90
  [801a2b16] sys_unlink+0x126/0x1a0
  [80111479] error_exit+0x0/0x81
  [80110aba] system_call+0x7e/0x83

First messages said that unlinked inode has i_nlink=0, then ext3_unlink() adds 
this inode into orphan list.

Second message means that this inode has not been removed from orphan list. 
Inode dump has showed that i_fop = bad_file_ops and it can be set in 
make_bad_inode() only. Then I've found that ext3_read_inode() can call 
make_bad_inode() without any error/warning messages, for example in the 
following case:
...
if (inode-i_nlink == 0) {
if (inode-i_mode == 0 ||
!(EXT3_SB(inode-i_sb)-s_mount_state  EXT3_ORPHAN_FS)) {
/* this inode is deleted */
brelse (bh);
goto bad_inode;
...

Bad inode can live some time, ext3_unlink can add it to orphan list, but
ext3_delete_inode() do not deleted this inode from orphan list. As
result we can have orphan list corruption detected in ext3_destroy_inode().

However it is not clear for me how to fix this issue correctly.

As far as i see is_bad_inode() is called after iget() in all places excluding 
ext3_lookup() and ext3_get_parent(). I believe it makes sense to add bad inode 
check to these functions too and call iput if bad inode detected.

Signed-off-by:  Vasily Averin [EMAIL PROTECTED]

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 9bb046d..e3ac8c3 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1019,6 +1019,11 @@ static struct dentry *ext3_lookup(struct inode * dir, 
struct dentry *dentry, str
 
if (!inode)
return ERR_PTR(-EACCES);
+
+   if (is_bad_inode(inode)) {
+   iput(inode);
+   return ERR_PTR(-ENOENT);
+   }
}
return d_splice_alias(inode, dentry);
 }
@@ -1054,6 +1059,11 @@ struct dentry *ext3_get_parent(struct dentry *child)
if (!inode)
return ERR_PTR(-EACCES);
 
+   if (is_bad_inode(inode)) {
+   iput(inode);
+   return ERR_PTR(-ENOENT);
+   }
+
parent = d_alloc_anon(inode);
if (!parent) {
iput(inode);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4ec57be..70db358 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1017,6 +1017,11 @@ static struct dentry *ext4_lookup(struct inode * dir, 
struct dentry *dentry, str
 
if (!inode)
return ERR_PTR(-EACCES);
+
+   if (is_bad_inode(inode)) {
+   iput(inode);
+   return ERR_PTR(-ENOENT);
+   }
}
return d_splice_alias(inode, dentry);
 }
@@ -1052,6 +1057,11 @@ struct dentry *ext4_get_parent(struct dentry *child)
if (!inode)
return ERR_PTR(-EACCES);
 
+   if (is_bad_inode(inode)) {
+   iput(inode);
+   return ERR_PTR(-ENOENT);
+   }
+
parent = d_alloc_anon(inode);
if (!parent) {
iput(inode);

- End forwarded message -

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

2007-11-01 Thread Andreas Dilger
On Nov 01, 2007  17:40 -0700, Mingming Cao wrote:
 Current journal checksumming patch failed fsstress test on NUMA. The 
 bh-b_data passed to the crc32_be () function could be NULL pointer, 
 which caused kernel oops immediately when running fsstress with -o 
 journal_checksum. It is because the page is part of highmem on NUMA box.
 We need to kmap the page before access the bh-b_data to calculate
 the checksums.

I have no objection to the patch, per-se, but I'm surprised that there
would ever be a buffer head pointing at a page in high memory?  That
seems contrary to what I would expect...

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
to linux-ext4 entitled [PATCH] FIEMAP ioctl.   I've asked Kalpak to post
an updated version of that patch along with the changes to the filefrag
tool to use FIEMAP.

 FIEMAP_1.0.txt ==

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be filefrag [options]
{filename ...} and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

..

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
ioctl block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
__u64 fe_offset;/* offset in bytes for the start of the extent */
__u64 fe_length;/* length in bytes for the extent */
__u32 fe_flags; /* returned FIEMAP_EXTENT_* flags for the extent */
__u32 fe_lun;   /* logical device number for extent(starting at 0)*/
};



struct fiemap {
__u64 fm_start; /* logical byte offset (in/out) */
__u64 fm_length;/* logical length of map (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags (in/out) */
__u32 fm_extent_count;  /* extents in fm_extents (in/out) */
__u64 fm_unused;

struct fiemap_extent fm_extents[0];  
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file}
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from FIEMPA_FLAG_* array, if needed};

ioctl(fd, FIEMAP, fiemap);
{verify fiemap flags are understood }

for (i = 0; i  fiemap.fm_extent_count; i++) {
{ process extent fiemap.fm_extents[i]};
}


The logic for the filefrag would be similar to above. The size of the
extent array will be extrapolated from the filesize and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x0100 to incidate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ0x0002 /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  I'm a little bit confused by fe_offset. Is it a physical offset, or a
  logical offset? The reason I ask is that your description above says FIEMAP
  ioctl will return the logical to physical mapping for the extent that
  contains the specified logical byte address. Which seems to imply physical,
  but your math to get to the next logical start in a very fragmented file,
  implies that fe_offset is a logical offset:
  
 fm_start = fm_extents[fm_extent_count - 1].fe_offset +
   fm_extents[fm_extent_count - 1].fe_length + 1; 
 
 Note the distinction between fe_offset (which is a physical offset for
 a single extent) and fm_offset (which is a logical offset for that file).

Actually, that is completely bunk.  What it should say is something like:
filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   Thanks for posting this. I believe that an interface such as FIEMAP
 would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

 On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
  The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
  ioctl block device ioctl used for mapping an individual logical block
  address in a file to a physical block address in the block device. The
  FIEMAP ioctl will return the logical to physical mapping for the extent
  that contains the specified logical byte address.
  
  struct fiemap_extent {
  __u64 fe_offset;/* offset in bytes for the start of the extent */
 
 I'm a little bit confused by fe_offset. Is it a physical offset, or a
 logical offset? The reason I ask is that your description above says FIEMAP
 ioctl will return the logical to physical mapping for the extent that
 contains the specified logical byte address. Which seems to imply physical,
 but your math to get to the next logical start in a very fragmented file,
 implies that fe_offset is a logical offset:
 
fm_start = fm_extents[fm_extent_count - 1].fe_offset +
  fm_extents[fm_extent_count - 1].fe_length + 1; 

Note the distinction between fe_offset (which is a physical offset for
a single extent) and fm_offset (which is a logical offset for that file).

  We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
  will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
 
 I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

  #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
  #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF*/
 
 Is EOF here considering beyond i_size or beyond allocation?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
EOF flag may be set on one or more earlier extents.

  FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
  encrypted, compressed, etc.)
 
 Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
 Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
 blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given?  One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.  The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  17:11 -0700, Mark Fasheh wrote:
 On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
   Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
   blocks.
  
  Hmm, but part of the issue would be how to request the extra data, and
  what offset it would be given?  One could, for example, use negative
  offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
  or similar, I hadn't given that much thought.
 
 Well, fe_offset and fe_length are already expressed in bytes, so we could
 just put the byte offset to where the inline data starts in there. fe_length
 is just used as the length allocated for inline-data.
 
 If fe_offset is required to be block aligned, then we could add a field to
 express an offset within the block where data would be found - say
 'fe_data_start_offset'. In the non-inline case, we could guarantee that
 fe_data_start_offset is zero. That way software which doesn't want to care
 whether something is inline-data (for example, a backup program) or not
 could just blidly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking.  Mapping in-inode data is
just fine using the existing interface.  The byte offset of the data is
given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: delalloc fragmenting files?

2007-10-26 Thread Andreas Dilger
On Oct 26, 2007  16:24 -0500, Eric Sandeen wrote:
 The resulting file had over 4k extents.
 [EMAIL PROTECTED] ~]# filefrag -v /mnt/test/foobar  | grep -i extents
 File is stored in extents format
 /mnt/test/foobar: 4075 extents found

On a related note - we're just putting the finishing touches on the
FIEMAP patches for ext4 + e2fsprogs, so that we can get decent looking
output from filefrag, and much more efficiently than FIBMAP.

 if I don't mount with delalloc:
 
 mount -t ext4dev -o data=writeback,extents,mballoc /dev/sdb7 /mnt/test
 
 and run the same dd, I get 229 extents:
 
 [EMAIL PROTECTED] ~]# filefrag -v /mnt/test/foobar  | grep -i extents
 File is stored in extents format
 /mnt/test/foobar: 229 extents found

One of the issues is that w/o delalloc the mballoc code only gets 
single-block allocations, so there might be a problem with the interface
to mballoc.  That might be caused by the fact the patches were changed
at one point from delalloc-atop-mballoc to mballoc-atop-delalloc, and
something was missed in that conversion.

Have you tried O_DIRECT?  That is another way to access mballoc w/o
using delalloc.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Clustering indirect blocks in Ext2

2007-10-25 Thread Andreas Dilger
On Oct 25, 2007  03:21 -0700, Abhishek Rai wrote:
 This patch modifies the block allocation strategy in ext2 in order to
 improve fsck performance.
 
 Most of Ext2 metadata is clustered on disk. For example, Ext2
 partitions the block space into block groups and stores the metadata
 for each block group (inode table, block bitmap, inode bitmap) at the
 beginning of the block group. Clustering related metadata together not
 only helps ext2 I/O performance by keeping data and related metadata
 close together, but also helps fsck since it is able to find all the
 metadata in one place. However, indirect blocks are an exception.
 Indirect blocks are allocated on-demand and are spread out along with
 the data. This layout enables good I/O performance due to the close
 proximity between an indirect block and its data blocks but it makes
 things difficult for fsck which must now rotate almost the entire disk
 in order to read all indirect blocks.

I understand this does not change the on-disk format, but it does
introduce complexity into the ext2 code base, which we have been
trying to avoid for several reasons (risk of introducing bugs in
ext2, keeping it less complex for easier understanding of code).

There is a fair amount of existing work for reducing e2fsck time both
for crash recovery and full scanning of the filesystem.

Of course with ext3 journaling this removes most of the need for e2fsck
at boot time, but it does impact performance to some extent.  In ext4
there are several other features that also reduce e2fsck time, likely
more than what you will be getting with your patch.

- uninit_groups: keep a high watermark of inodes in use in each group, to
  avoid scanning the unused inodes during a full scan.  This has been
  shown to reduce full e2fsck times by 90%.
- extents: reduces the file metadata by at least an order of magnitude
  over indirect blocks.  For unfragmented files an extent-mapped inode
  can map up to 512MB without even using an indirect (index) block.  No
  indirect block reads/seeks is always better than optimized reads/seeks.
- delalloc+mballoc: this improves ext4 performance to be equal or better
  than ext2 performance for large IO by doing better block allocation to
  ensure large extents are allocated and avoiding seeks during IO and
  keeping the extents compact for fewer/no index blocks.

We also have Lustre patches against ext3 for most of these features
against older vendor kernels (SLES10 2.6.16, RHEL5 2.6.18) if that is
of interest to you (only delalloc isn't included in the existing Lustre
patch set, but I believe Alex had delalloc patches for 2.6.18 kernels
in the past).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Clustering indirect blocks in Ext2

2007-10-25 Thread Andreas Dilger
On Oct 25, 2007  15:56 -0700, Abhishek Rai wrote:
 While this patch does add some complexity to ext2, it has the benefit
 of backward and forward compatibility which will probably make it
 attractive for more people than any change that changes on-disk
 format.

To be honest, I think the number of people using ext2 on their systems
is relatively small compared to ext3 because of the e2fsck hit on each
boot.  IMHO, that means the engineering effort spent on improving
e2fsck for ext2 is less worthwhile than if the same effort was spent
on testing ext4 and the improvements made there.

My understanding is that the primary reason Google is using ext2 instead
of ext3 is because of the performance impact of journaling.  With the
performance (and also scalability) improvements in ext4, doesn't it make
sense to put test/development time and effort toward ext4?

 Thanks for pointing these out. extents and delalloc+mballoc are of
 course useful but are not a simple transition though I'm definitely
 considering trying them out.

Note that delalloc and mballoc don't strictly require extents, as
they are in-memory optimizations only.

 Regarding the uninit_groups patch, I think it can be implemented in a
 backward compatible way as follows. Instead of modifying the group
 desc to store the number of unused inodes (bg_itable_inodes), we can
 alternatively define an implicit boundary in every group's inode
 bitmap by having a special free marker inode with a certain
 signature. Whenever we need to allocate inodes in a group beyond this
 boundary, we shift the boundary by using a later inode as the free
 marker inode. The idea is that new ext2 will try to allocate inodes
 from before the marker and fsck will not seek past the marker.

The problem with this is that ext2 is not journalled and it is possible
that updates are not ordered on disk.  The danger is that the update
of the marker block is lost, but inodes are allocated after it. 

 - Over time markers drift towards higher inode numbers but never
 travel backwards, so a pathological workload can kill all markers
 bringing us back to old behavior, but this is very unlikely.

This is currently true of the uninit_groups feature also, because it
is a lot easier to avoid the problem of safely shrinking the high
watermark.  On the next e2fsck it will shrink the high watermark for
each group again.


Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Enable delalloc and mballoc by default.

2007-10-24 Thread Andreas Dilger
On Oct 24, 2007  12:22 -0500, Eric Sandeen wrote:
 Aneesh Kumar K.V wrote:
  @@ -1279,6 +1280,9 @@ clear_qf_name:
  case Opt_delalloc:
  set_opt (sbi-s_mount_opt, DELALLOC);
  break;
 
 If delalloc, mballoc, extents are the new defaults, is there a reason to
 keep them as options?  When would you need to specify -o extents, now,
 for example?  (though my brain is fuzzy today, maybe I'm missing
 something)  If this were not a filesystem ending in dev I could see
 keeping it for compatibility with existing fstabs

It is useful to be able to mount w/o extents/delalloc/mballoc for perf
testing and functional testing of the block-mapped file path in ext4.
Also, some users might want the ability to use features of ext4 w/o
the incompatibility of extents.

  set_opt(sbi-s_mount_opt, EXTENTS);
  +   set_opt(sbi-s_mount_opt, DELALLOC);
  +   set_opt(sbi-s_mount_opt, MBALLOC);

I think the other thing to do is enable the INCOMPAT_EXTENTS flag in
mkfs.ext4 by default, so that extents is enabled/disabled in the
same manner as other ext* features.  We can remove the above once
we have an e2fsprogs that specifically sets all of the ext4 features
(large inodes, etc) for ext4 filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Enable delalloc and mballoc by default.

2007-10-24 Thread Andreas Dilger
On Oct 24, 2007  16:15 -0500, Eric Sandeen wrote:
 Andreas Dilger wrote:
  It is useful to be able to mount w/o extents/delalloc/mballoc for perf
  testing and functional testing of the block-mapped file path in ext4.
  Also, some users might want the ability to use features of ext4 w/o
  the incompatibility of extents.
 
 Right, I understand the reason for noextents, nodelalloc, nomballoc.
 Above, I ask what is the point of having the *defaults* (extents,
 delalloc, mballoc) as mount options?

Partly for symmetry, and also to be able to override any no* options
that might have been specified previously (e.g. in some script that the
user doesn't directly control, but can pass additional options to).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3 warnings from LTP rename14

2007-10-18 Thread Andreas Dilger
On Oct 16, 2007  21:17 -0500, Eric Sandeen wrote:
 Eric Sandeen wrote:
  Martin Habets wrote:
  I ran the ltp-full-20070930 tests on 2.6.23-rc9-mph4 (sparc32 SMP), and am
  seeing the following warnings:
  
  This makes me a little nervous about my change
  ef2b02d3e617cb0400eedf2668f86215e1b0e6af
  (ext34: ensure do_split leaves enough free space in both blocks)
  
  Do you know when this first showed up?  Could you test -rc6?
  
  You say you can reproduce it; have you checked (fsck'd) the filesystem
  in between, and is it in good shape?
  
  -Eric
 
 FWIW, I ran rename14 standalone a few times on 2.6.23.1 with no
 problems...
 
 [EMAIL PROTECTED] ltp-full-20070930]#
 ./testcases/kernel/syscalls/rename/rename14
 rename141  PASS  :  Test Passed
 [EMAIL PROTECTED] ltp-full-20070930]#
 ./testcases/kernel/syscalls/rename/rename14
 rename141  PASS  :  Test Passed
 [EMAIL PROTECTED] ltp-full-20070930]#
 ./testcases/kernel/syscalls/rename/rename14
 rename141  PASS  :  Test Passed
 [EMAIL PROTECTED] ~]# uname -a
 Linux bear-05.lab.msp.redhat.com 2.6.23.1 #1 SMP Mon Oct 15 15:28:08 CDT
 2007 i686 athlon i386 GNU/Linux

It is probably significant that the original machine is a sparc32 (big
endian).  I'd suspect you can reproduce this on a PPC system also.  You
might also consider running sparse on it ext3/ext4 in case you missed an
le32_to_cpu() or something.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] ext4 update

2007-10-18 Thread Andreas Dilger
On Oct 16, 2007  21:51 -0700, [EMAIL PROTECTED] wrote:
 On Wed, 17 Oct 2007, Theodore Ts'o wrote:
 It has a number random cleanups and bug fixes, and two new features.
 The first is uninitialized block groups, which allows fast mke2fs
 operations plus as well as speeding up e2fsck by allowing it to skip
 parts of the inode tables that haven't been used yet.
 
 nice feature, is there any work on a tool to go through a well-used 
 filesystem and mark unused block groups as uninitialized? (I would guess 
 that such a tool may want to move files to make this so)

Yes, just set the feature flag via tune2fs and then run e2fsck on it.
The second e2fsck shown below is just a demonstration of the speedup.

# tune2fs -O uninit_groups /dev/foo
tune2fs 1.39.cfs9 (7-Apr-2007)

Please run e2fsck on the filesystem.

# time e2fsck -fy /dev/foo
e2fsck 1.39.cfs9 (7-Apr-2007)
Group descriptor 0 checksum is invalid.  Fix? yes

{repeats for all groups}

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-OST: 9099/1640160 files (0.4% non-contiguous), 221996/6554520
blocks

real0m17.273s
user0m4.930s
sys 0m1.749s

# time e2fsck -fy /dev/hda3
e2fsck 1.39.cfs9 (7-Apr-2007)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-OST: 9099/1640160 files (0.4% non-contiguous), 221996/6554520
blocks

real0m2.412s
user0m0.604s
sys 0m0.077s


The caveats are:
(a) this is a read-only feature, so you can't mount such a filesystem r/w
on an older kernel.  You can disable it with tune2fs -O ^uninit_groups
and run a full e2fsck on it again.
(b) I don't think there is an official e2fsprogs release with support for this
feature yet (it's in the pipe, however).
(c) The actual speedup depends on how full the filesystem is, but since ext*
usually has way too many inodes, it is generally pretty good.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   >