Re: [PATCH] e2fsprogs: Check journal inode sanity and recreate journal

2007-05-07 Thread Theodore Tso
On Thu, Mar 15, 2007 at 02:43:47AM +0530, Kalpak Shah wrote:
> Index: e2fsprogs-1.39/lib/ext2fs/mkjournal.c
> ===
> --- e2fsprogs-1.39.orig/lib/ext2fs/mkjournal.c
> +++ e2fsprogs-1.39/lib/ext2fs/mkjournal.c
> + if (fs->super->s_blocks_count < 2048) {
> + fputs(("\nFilesystem too small for a journal\n"), stderr);
> + return 0;
> + }

Code in lib/ext2fs isn't allowed to do any output to stdio (except for
debugging purposes).  It causes internationalization problems, and
it's just in general a bad idea for ext2fs library code to try to do
any UI.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted

2007-05-07 Thread Theodore Tso
On Thu, Apr 19, 2007 at 05:35:36PM +0530, Kalpak Shah wrote:
> This patch removes a code snippet from check_ea_in_inode() in pass1
> which checks if the EA values in the inode are sorted or not. The
> comments in fs/ext*/xattr.c state that the EA values in the external
> EA block are sorted but those in the inode need not be sorted. I
> have also attached a test image which has unsorted EAs in the
> inodes. The current e2fsck wrongly clears the EAs in the inode.

Applied.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs

2007-05-07 Thread Eric Sandeen

Theodore Tso wrote:

On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote:

e2fsprogs doesn't handle large (> 16 bit) UID/GID...


Applied, with one correction:


--- e2fsprogs-hg.orig/misc/mke2fs.c
+++ e2fsprogs-hg/misc/mke2fs.c
@@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys 
 _("while reading root inode"));

exit(1);
}
-   inode.i_uid = getuid();
-   if (inode.i_uid)
-   inode.i_gid = getgid();
+   uid = getuid();
+   inode.i_uid = uid;
+   inode.i_uid_high = uid >> 16;
+   if (inode.i_uid) {

^^^

This should be "uid" instead.  Otherwise, the gid won't be set if the
uid is a multiple of 65536.

- Ted


Doh... dumb mistake.  nice catch.  thanks.

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote:
> e2fsprogs doesn't handle large (> 16 bit) UID/GID...

Applied, with one correction:

> --- e2fsprogs-hg.orig/misc/mke2fs.c
> +++ e2fsprogs-hg/misc/mke2fs.c
> @@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys 
>   _("while reading root inode"));
>   exit(1);
>   }
> - inode.i_uid = getuid();
> - if (inode.i_uid)
> - inode.i_gid = getgid();
> + uid = getuid();
> + inode.i_uid = uid;
> + inode.i_uid_high = uid >> 16;
> + if (inode.i_uid) {
^^^

This should be "uid" instead.  Otherwise, the gid won't be set if the
uid is a multiple of 65536.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote:
> Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line
> for current defaults...

Applied.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.21-ext4-1

2007-05-07 Thread David Chinner
On Mon, May 07, 2007 at 01:56:23PM -0700, Mingming Cao wrote:
> In any case, it would be useful to add a new set of testsuites for the
> new fallocate() syscall and fsstress in LTP testsuites to automatically
> the preallocation code in ext4/XFS.

I hacked an existing XFS test prog to do manual testing of the fallocate()
syscall. In the XFSQA suite we have various pre-alloc enhanced utils (e.g.
fsstress, fsx, etc) that we should probably update to be able to use both
fallocate and xfsctl so we can test both.

Here's all the programs we use that have preallocation awareness:

chook 982% grep RESVSP ltp/*.c  | awk '/^ltp/  { split($1,a,":"); print a[1] 
;}' | uniq
ltp/doio.c
ltp/fsstress.c
ltp/fsx.c
ltp/growfiles.c
ltp/iogen.c
chook 983% grep RESVSP src/* | awk '/^src/ { split($1,a,":"); print a[1] ;}' | 
uniq
src/alloc.c
src/fstest.c
src/iopat.c
src/randholes.c
src/resvtest.c
src/unwritten_mmap.c
src/unwritten_sync.c

BTW, have you guys tested mmap writes into unwritten extents? ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will
prevent the obvious error case where someone tries to preallocate 10GB
when there is only 2GB left.  But it won't help if there are multiple
processes trying to allocate blocks the same time.  On the other hand,
that case is probably relatively rare, and in that case, the
filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that could simply release all
unitialized extents back to the system.  The problem is distinguishing
between the unitialized extents that had just been newly added, versus
the ones that had there from before.  (On the other hand, if the
filesystem was completely full, releasing unitialized blocks wouldn't
be the worse thing in the world to do, although releasing previously
fallocated blocks probably does violate the princple of least
surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window?  When the filesystem
is almost full, the free extents are likely going to be scattered all
over the disk.  The general principle of grabbing all of the extents
and keeping them in an in-memory data structure, and only adding them
to the extent tree would work, though; I'm just not sure we could do
it using the existing reservation window code, since it only supports
a single reservation window per file, yes?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.



Agreed, thanks for the clarification.

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only is that this is the only way to allocate space in the
> >filesystem without doing the explicit zeroing.
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
> On Mon, 07 May 2007 17:00:24 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > > +   while (ret >= 0 && ret < max_blocks) {
> > > +   block = block + ret;
> > > +   max_blocks = max_blocks - ret;
> > > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > > + max_blocks, &map_bh,
> > > + EXT4_CREATE_UNINITIALIZED_EXT, 
> > > 0);
> > > +   BUG_ON(!ret);
> > > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > > +   && ((block + ret) > (i_size_read(inode) << 
> > > blkbits)))
> > > +   nblocks = nblocks + ret;
> > > +   }
> > > +
> > > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, 
> > > &retries))
> > > +   goto retry;
> > > +
> > > Now the interesting question is: what do we do if we get halfway through
> > > this loop and then run out of space?  We could leave the disk all filled 
> > > up
> > > and then return failure to the caller, but that's pretty poor behaviour,
> > > IMO.
> > > 
> > The current code handles earlier ENOSPC by three times retries. After
> > that if we still run out of space, then it's propably right to notify
> > the caller there isn't much space left.
> > 
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
> 
> yes, but my point is that the proposed behaviour is really quite bad.
> 
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.


> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
> 

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.
> 
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme 
> which
> delayed allocation will need to introduce.
> 
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.


Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 19:14:42 -0400
> Theodore Tso <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > > Actually, this is a non-issue.  The reason that it is handled for 
> > > > extent-only
> > > > is that this is the only way to allocate space in the filesystem without
> > > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > > ext4 with block-mapped files) the filesystem should return an error 
> > > > (e.g.
> > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
> > > > userspace.
> > > 
> > > It can be a bit suboptimal from the layout POV.  The reservations code 
> > > will
> > > largely save us here, but kernel support might make it a bit better.
> > 
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
> 
> No!  Reservations code is *critical* here.  Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
> 
> My point is that even though reservations save us, we could do even-better
> in-kernel.
> 

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating "please spend time if needed to find
a window covers at least N free blocks".

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
> 
> > This wlil result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality.  That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate().  On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> 
> A PVR implementor would take all this over and would do it themselves, for
> sure.
> 
> > If we think applications programmers badly need to know in advance if
> > posix_fallocate() will be fast or slow, probably the right thing is to
> > define a new fpathconf() configuration option so they can query to see
> > whether a particular file will support a fast posix_fallocate().  I'm
> > not 100% convinced such complexity is really needed, but I'm willing
> > to be convinced  what do folks think?
> > 
> 
> An application could do sys_fallocate(one-byte) to work out whether it's
> supported in-kernel, I guess.
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao <[EMAIL PROTECTED]> wrote:

> > +   while (ret >= 0 && ret < max_blocks) {
> > +   block = block + ret;
> > +   max_blocks = max_blocks - ret;
> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > +   && ((block + ret) > (i_size_read(inode) << 
> > blkbits)))
> > +   nblocks = nblocks + ret;
> > +   }
> > +
> > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, 
> > &retries))
> > +   goto retry;
> > +
> > Now the interesting question is: what do we do if we get halfway through
> > this loop and then run out of space?  We could leave the disk all filled up
> > and then return failure to the caller, but that's pretty poor behaviour,
> > IMO.
> > 
> The current code handles earlier ENOSPC by three times retries. After
> that if we still run out of space, then it's propably right to notify
> the caller there isn't much space left.
> 
> We could extend the block reservation window size before the while loop
> so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 05:37:54 -0600
> Andreas Dilger <[EMAIL PROTECTED]> wrote:
> 
> > > > +   block = offset >> blkbits;
> > > > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> 
> > > > blkbits)
> > > > +- block;
> > > > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > > 
> > > Now I'm mystified.  Given that we're allocating an arbitrary amount of 
> > > disk
> > > space, and that this disk space will require an arbitrary amount of
> > > metadata, how can we work out how much journal space we'll be needing
> > > without at least looking at `len'?
> > 
> > Good question.
> > 
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks.  That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
> 
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
> 
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space.  If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
> 
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons.  If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
> 
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop: 
> 
>   while (len) {
>   journal_start();
>   len -= do_fallocate(len, ...);
>   journal_stop();
>   }
> 
> 

I agree.  There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{


> +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

> +   handle=ext4_journal_start(inode, credits +
> +   
> EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +   if (IS_ERR(handle))
> +   return PTR_ERR(handle);
> +retry:
> +   ret = 0;
> +   while (ret >= 0 && ret < max_blocks) {
> +   block = block + ret;
> +   max_blocks = max_blocks - ret;
> +   ret = ext4_ext_get_blocks(handle, inode, block,
> + max_blocks, &map_bh,
> + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +   BUG_ON(!ret);
> +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +   && ((block + ret) > (i_size_read(inode) << blkbits)))
> +   nblocks = nblocks + ret;
> +   }
> +
> +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +   goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space?  We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
> 
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend 

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> >>Final point: it's fairly disappointing that the present implementation is
> >>ext4-only, and extent-only.  I do think we should be aiming at an ext4
> >>bitmap-based implementation and an ext3 implementation.
> >
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only
> >is that this is the only way to allocate space in the filesystem without
> >doing the explicit zeroing.  For other filesystems (including ext3 and
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as unitialized, ext4 returns zero's, to avoid returning the
uninitalized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitalized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
two (i.e., 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 19:14:42 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > Actually, this is a non-issue.  The reason that it is handled for 
> > > extent-only
> > > is that this is the only way to allocate space in the filesystem without
> > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
> > > userspace.
> > 
> > It can be a bit suboptimal from the layout POV.  The reservations code will
> > largely save us here, but kernel support might make it a bit better.
> 
> Actually, the reservations code won't matter, since glibc will fall
> back to its current behavior, which is it will do the preallocation by
> explicitly writing zeros to the file.

No!  Reservations code is *critical* here.  Without reservations, we get
disastrously-bad layout if two processes were running a large fallocate()
at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
lengths save us).

My point is that even though reservations save us, we could do even-better
in-kernel.

But then, a smart application would bypass the glibc() fallocate()
implementation and would tune the reservation window size and would use
direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).

> This wlil result in the same
> layout as if we had done the persistent preallocation, but of course
> it will mean the posix_fallocate() could potentially take a long time
> if you're a PVR and you're reserving a gig or two for a two hour movie
> at high quality.  That seems suboptimal, granted, and ideally the
> application should be warned about this before it calls
> posix_fallocate().  On the other hand, it's what happens today, all
> the time, so applications won't be too badly surprised.

A PVR implementor would take all this over and would do it themselves, for
sure.

> If we think applications programmers badly need to know in advance if
> posix_fallocate() will be fast or slow, probably the right thing is to
> define a new fpathconf() configuration option so they can query to see
> whether a particular file will support a fast posix_fallocate().  I'm
> not 100% convinced such complexity is really needed, but I'm willing
> to be convinced  what do folks think?
> 

An application could do sys_fallocate(one-byte) to work out whether it's
supported in-kernel, I guess.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > Actually, this is a non-issue.  The reason that it is handled for 
> > extent-only
> > is that this is the only way to allocate space in the filesystem without
> > doing the explicit zeroing.  For other filesystems (including ext3 and
> > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> 
> It can be a bit suboptimal from the layout POV.  The reservations code will
> largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced  what do folks think?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

On May 07, 2007  13:58 -0700, Andrew Morton wrote:

Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.


Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and


Precisely /how/ do you avoid the zeroing issue, for extents?

If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
otherwise the implementation is broken.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> > Final point: it's fairly disappointing that the present implementation is
> > ext4-only, and extent-only.  I do think we should be aiming at an ext4
> > bitmap-based implementation and an ext3 implementation.
> 
> Actually, this is a non-issue.  The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing.  For other filesystems (including ext3 and
> ext4 with block-mapped files) the filesystem should return an error (e.g.
> -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> Final point: it's fairly disappointing that the present implementation is
> ext4-only, and extent-only.  I do think we should be aiming at an ext4
> bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Andreas Dilger
On May 07, 2007  14:52 -0500, Eric Sandeen wrote:
> Theodore Tso wrote:
> > How likely do you think the case will be that mke2fs.conf would be
> > missing?  I'm trying to figure out how high priority of an item this
> > really is.
> 
> Well, not too likely, although for some reason I guess it happened in
> the installer root in FC6 or so.  That's what raised the issue.

Ah, good point - there are probably lots of installers and rescue disks
that grab mke2fs but don't know anything about /etc/mke2fs.conf.

> > We could enhance the profile code so that it could read in the profile
> > from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
> > but that adds bloat --- the question is how necessary do we think that
> > really is?
> 
> I guess it doesn't really sound *necessary* - it's just that if we have
> 2 different "defaults" and they drift, it can be confusing...

Since the shipped mke2fs.conf is only on the order of a few hundred bytes
I don't think it is a huge issue to include it.  I like the idea of it
being compiled into the code and yet consistent with the default install.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread Andreas Dilger
On May 07, 2007  11:50 -0400, Theodore Tso wrote:
> On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote:
> > On Mon, 7 May 2007, Theodore Tso wrote:
> > >We're going to have to rethink the defaults of the journal size as it
> > >relates to on-line resizing, but there might not be a lot of great
> > >answers here.  We'll also have to look at the on-line resizing code
> > >and see if there's a way to break up the resize operation into smaller
> > >transactions as a way of avoiding this problem --- but that would
> > >still leave the user stuck with a pathetically small 4M journal on a
> > >3G filesystem.
> 
> Clearly the right thing for resize2fs to do when it's doing an
> off-line resize is to just adjust the journal inode and make it
> bigger.  I'll add that to my todo list, but of course, patches to do
> that would be gratefully accepted...

For that matter, there was a paper from U. Wisconsin showing that
having the journal in the middle of the filesystem gave noticably
better performance due to lower average seek times.  If anyone is
looking at messing with the journal that is probably also a good
and easy place to start (e.g. may be as easy as just giving a goal
block of s_blocks_count / 2 to the journal create code).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:16 +0530, Aneesh Kumar K.V wrote:
> Theodore Tso wrote:
> >Actually, changing the inode size is actually going to be more useful
> >to a larger number of people, since they can use it today to support
> >in-inode EA's.  In addition, changing the inode size must be done
> >off-line, whereas it's not so clear that off-line conversion to
> >extents is the best way to go (more on that below).
> 
> That way we don't need to add kernel code that convert ext3 inode to 
> ext4 inode while defragmentation. Last time i looked at the online 
> defrag code, i was not able to figure out a easy way to take indirect 
> map inode as input, then defrag the same and write the result in extent 
> map.

I think the online defrag code _should_ be fairly easily made to handle
defrag of block-mapped files, so long as it is always creating extent-
mapped files in the end.  All that should be necessary is to have the
kernel read data from the block-mapped file and write it into the newly
mapped blocks.  That functionality needs to exist for a long time anyways
to support upgrading the filesystem so it shouldn't be a huge burden.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> > > + block = offset >> blkbits;
> > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > +  - block;
> > > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > 
> > Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> > space, and that this disk space will require an arbitrary amount of
> > metadata, how can we work out how much journal space we'll be needing
> > without at least looking at `len'?
> 
> Good question.
> 
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks.  That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.21-ext4-1

2007-05-07 Thread Mingming Cao
On Mon, 2007-04-30 at 11:14 -0400, Theodore Ts'o wrote:
> I've respun the ext4 development patchset, with Amit's updated fallocate
> patches.  I've added Dave's patch to add ia64 support to the fallocate
> system call, but *not* the XFS fallocate support patches.  (Probably
> better for them to live in an xfs tree, where they can more easily
> tested and updated.)  Yes, we haven't reached complete closure on the
> fallocate system call calling convention, but it's enough for us to get
> more testing in -mm.
> 
> Also added Johann's jbd2-stats-through-procfs patches; it provides
> useful help in turning the size of the journal, which will be useful in
> benchmarking efforts.  In addition, Alex Tomas's patch to free
> just-allocated patches when there is an error inserting the extent into
> the extent tree has also been included.
> 
> The patches have been compile-tested on x86, and compile/run-tested on
> x86/UML.  Would appreciate reports about testing on other platforms.
> 
I have tested this patch series on ppc64, x86_64 with
dbench/tiobench/fsx, all runs fine. 

I am not sure what level of testing Amit has done about the fallocate()
and preallocation code on various archs. I couldn't find a available
s390 and ia64 machines with free partition yet.

In any case, it would be useful to add a new set of testsuites for the
new fallocate() syscall and fsstress in LTP testsuites to automatically
the preallocation code in ext4/XFS.

thanks,
Mingming
> Thanks,
> 
>   - Ted
> 
> P.S.  One bug which I've noted --- if there is a failure due to disk
> filling up, running e2fsck on the filesystem will show that the i_blocks
> fields on the inodes where there was a failure to allocate disk blocks
> are left incorrect.  I'm guessing this is a bug in the delayed
> allocation patches.  Alex, when you have a moment, could you take a
> look?  Thanks!!
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Eric Sandeen
Theodore Tso wrote:
> On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of
>> our testers filed a bug that said "mkfs.ext3 is much slower when
>> mke2fs.conf is missing..."
>>
>> This is because the shipped defaults in mke2fs.conf do not match the
>> shipped defaults in the mkfs code itself; he wound up making a 1k
>> block filesystem on a very large block device, for example.
>>
>> So - How about this patch, to bring them back into line?  
> 
> It doesn't actually bring them completely back into line, since mke2fs
> will use different block sizes depending on the size of the
> filesystem.  So your patch makes the default probably a bit more
> reasonable, and so I'll probably end up applying it, but it definitely
> isn't a complete replacement for /etc/mke2fs.conf.

Well, sure, I don't mean for it to *replace* mke2fs.conf...

Hm, does having

[defaults]
blocksize = 4096

in mke2fs.conf turn off the blocksize heuristics and force 4k?  is
what's in mke2fs.conf a starting point or an absolute?  I guess I need
to read up on the code...

> How likely do you think the case will be that mke2fs.conf would be
> missing?  I'm trying to figure out how high priority of an item this
> really is.

Well, not too likely, although for some reason I guess it happened in
the installer root in FC6 or so.  That's what raised the issue.

> We could enhance the profile code so that it could read in the profile
> from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
> but that adds bloat --- the question is how necessary do we think that
> really is?

I guess it doesn't really sound *necessary* - it's just that if we have
2 different "defaults" and they drift, it can be confusing...

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread Andreas Dilger
On May 06, 2007  21:40 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen <[EMAIL PROTECTED]> 
> wrote:
> 
> > 2.6.20.6, FC4:
> > 
> > I created a 91248k ext3 fs with 4k blocksize:
> > 
> > Next, I tried to resize it to about 3G using ext2online while mounted:
> > 
> > At that time the kernel said:
> > 
> > |JBD: ext2online wants too many credits (744 > 256)
> > 
> > What is the limitation I should be aware of? Has it something to do with
> > the journal log size?

Yes, for very small filesystems the default journal size is only 4MB, and
the maximum transaction size is 1MB (256 blocks).

If the filesystem was 128MB or larger in the initial mke2fs then the
journal would be 16MB and the resize would get up to 1024 blocks for
the transaction.

I'm not sure what the right solution is for this.  If you know you will
be resizing the fs you could increase the initial journal size at mke2fs
time (-J size=16).  Alternately, it is possible resize to an intermediate
size and then delete and recreate the journal via tune2fs (which would
be the larger size by default) but that can only be done offline.  

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 03, 2007  21:31 -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
> 
> But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > len)
> > +{
> > +   handle_t *handle;
> > +   ext4_fsblk_t block, max_blocks;
> > +   int ret, ret2, nblocks = 0, retries = 0;
> > +   struct buffer_head map_bh;
> > +   unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +   /* Currently supporting (pre)allocate mode _only_ */
> > +   if (mode != FA_ALLOCATE)
> > +   return -EOPNOTSUPP;
> > +
> > +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +   return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
> 
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > +   block = offset >> blkbits;
> > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +- block;
> > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of
> our testers filed a bug that said "mkfs.ext3 is much slower when
> mke2fs.conf is missing..."
> 
> This is because the shipped defaults in mke2fs.conf do not match the
> shipped defaults in the mkfs code itself; he wound up making a 1k
> block filesystem on a very large block device, for example.
> 
> So - How about this patch, to bring them back into line?  

It doesn't actually bring them completely back into line, since mke2fs
will use different block sizes depending on the size of the
filesystem.  So your patch makes the default probably a bit more
reasonable, and so I'll probably end up applying it, but it definitely
isn't a complete replacement for /etc/mke2fs.conf.

How likely do you think the case will be that mke2fs.conf would be
missing?  I'm trying to figure out how high priority of an item this
really is.

> Which makes me wonder; having "defaults" in 2 different places is
> bound to get out of sync; should we instead generate both code &
> config file defaults (and maybe man page defaults) from a common
> source?

I had been working on the assumption that the defaults if mke2fs.conf
were not present were more in the nature of emergency defaults as
opposed something that could be used a fully functional set of
configuration parameters.  So the assumption was that when you
installed the RPM, mke2fs.conf would also be there.

We could enhance the profile code so that it could read in the profile
from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
but that adds bloat --- the question is how necessary do we think that
really is?

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Creating a >32bit blocks filesystem.

2007-05-07 Thread Jose R. Santos
On Mon, 7 May 2007 11:19:52 -0500
"Jose R. Santos" <[EMAIL PROTECTED]> wrote:
> I tried the patches and while the tools such as mkfs and debugfs seem
> to work fine, I am still unable to mount a filesystem with block
> numbers exceeding 32 bits.

Correction,

I tested debugfs on a 32bit blocks filesystem.  Both debugfs and fsck
failed when attempting to use a 64bit filesystem.

-JRS
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Accomodate 32-bit uid/guid values in e2fsprogs

2007-05-07 Thread Eric Sandeen
e2fsprogs doesn't handle large (> 16 bit) UID/GID...

Create an fs by user with large UID:

bash-3.1$ id
uid=501666(newuser) gid=501666(newuser) groups=501666(newuser)
bash-3.1$ /sbin/mke2fs -q fsfile
fsfile is not a block special device.
Proceed anyway? (y,n) y

Now mount it:

[EMAIL PROTECTED] tmp]# mount -o loop fsfile /mnt/tmp/
[EMAIL PROTECTED] tmp]# ls -la /mnt/tmp/
total 13
drwxr-xr-x 3 42914 42914  1024 May  7 11:55 .
drwxr-xr-x 4 root  root 28 Jan 23 10:34 ..
drwx-- 2 root  root  12288 May  7 11:55 lost+found

uid/gid is wrong (only bottom 16 bits)!  (it's wrong on disk...)

And even if I get it right on disk:

[EMAIL PROTECTED] tmp]# ls -lan /mnt/tmp/
total 13
drwxr-xr-x 3 501666 501666  1024 May  7 11:57 .
drwxr-xr-x 4  0  028 Jan 23 10:34 ..
drwx-- 2  0  0 12288 May  7 11:57 lost+found

debugfs has further problems:

[EMAIL PROTECTED] tmp]# /sbin/debugfs /tmp/fsfile 
debugfs 1.39 (29-May-2006)
debugfs:  stat .
Inode: 2   Type: directoryMode:  0755   Flags: 0x0   Generation: 0
User: 42914   Group: 42914   Size: 1024
File ACL: 0Directory ACL: 0
Links: 3   Blockcount: 2
Fragment:  Address: 0Number: 0Size: 0
ctime: 0x463f5a8a -- Mon May  7 11:57:46 2007
atime: 0x463f5a90 -- Mon May  7 11:57:52 2007
mtime: 0x463f5a8a -- Mon May  7 11:57:46 2007
BLOCKS:
(0):508
TOTAL: 1

debugfs:  ls -l
  2   40755 (2)  42914  429141024  7-May-2007 11:57 .
  2   40755 (2)  42914  429141024  7-May-2007 11:57 ..
 11   40700 (2)  0  0   12288  7-May-2007 11:57 lost+found

The attached patch creates inode_uid() and inode_gid() macros to accommodate
the high bits when reading uid/gid, on platforms that support it.

It also accommodates large UID/GID at mkfs time.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: e2fsprogs-hg/debugfs/debugfs.c
===
--- e2fsprogs-hg.orig/debugfs/debugfs.c
+++ e2fsprogs-hg/debugfs/debugfs.c
@@ -534,7 +534,7 @@ void internal_dump_inode(FILE *out, cons
prefix, 
inode->i_mode & 0777, inode->i_flags, inode->i_generation);
fprintf(out, "%sUser: %5d   Group: %5d   Size: ",
-   prefix, inode->i_uid, inode->i_gid);
+   prefix, inode_uid(*inode), inode_gid(*inode));
if (LINUX_S_ISREG(inode->i_mode)) {
__u64 i_size = (inode->i_size |
((unsigned long long)inode->i_size_high << 32));
Index: e2fsprogs-hg/debugfs/ls.c
===
--- e2fsprogs-hg.orig/debugfs/ls.c
+++ e2fsprogs-hg/debugfs/ls.c
@@ -88,7 +88,7 @@ static int list_dir_proc(ext2_ino_t dir 
}
fprintf(ls->f, "%c%6u%c %6o (%d)  %5d  %5d   ", lbr, ino, rbr,
inode.i_mode, dirent->name_len >> 8,
-   inode.i_uid, inode.i_gid);
+   inode_uid(inode), inode_gid(inode));
if (LINUX_S_ISDIR(inode.i_mode))
fprintf(ls->f, "%5d", inode.i_size);
else
Index: e2fsprogs-hg/debugfs/lsdel.c
===
--- e2fsprogs-hg.orig/debugfs/lsdel.c
+++ e2fsprogs-hg/debugfs/lsdel.c
@@ -23,7 +23,7 @@
 struct deleted_info {
ext2_ino_t  ino;
unsigned short  mode;
-   unsigned short  uid;
+   __u32   uid;
__u64   size;
time_t  dtime;
int num_blocks;
@@ -160,7 +160,7 @@ void do_lsdel(int argc, char **argv)

delarray[num_delarray].ino = ino;
delarray[num_delarray].mode = inode.i_mode;
-   delarray[num_delarray].uid = inode.i_uid;
+   delarray[num_delarray].uid = inode_uid(inode);
delarray[num_delarray].size = inode.i_size;
if (!LINUX_S_ISDIR(inode.i_mode))
delarray[num_delarray].size |= 
Index: e2fsprogs-hg/e2fsck/message.c
===
--- e2fsprogs-hg.orig/e2fsck/message.c
+++ e2fsprogs-hg/e2fsck/message.c
@@ -304,12 +304,10 @@ static _INLINE_ void expand_inode_expres
  inode->i_dir_acl : 0));
break;
case 'u':
-   printf("%d", (inode->i_uid |
- (inode->osd2.linux2.l_i_uid_high << 16)));
+   printf("%d", inode_uid(*inode));
break;
case 'g':
-   printf("%d", (inode->i_gid |
- (inode->osd2.linux2.l_i_gid_high << 16)));
+   printf("%d", inode_gid(*inode));
break;
case 't':
if (LINUX_S_ISREG(inode->i_mode)) 
Index: e2fsprogs-hg/lib/ext2fs/ext2_fs.h

[PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Eric Sandeen
One of our testers filed a bug that said "mkfs.ext3 is much slower when 
mke2fs.conf is missing..."

This is because the shipped defaults in mke2fs.conf do not match the shipped 
defaults in the
mkfs code itself; he wound up making a 1k block filesystem on a very large 
block device,
for example.

So - How about this patch, to bring them back into line?  Which makes me 
wonder; having
"defaults" in 2 different places is bound to get out of sync; should we instead 
generate
both code & config file defaults (and maybe man page defaults) from a common 
source?

Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line for current 
defaults...

Thanks,
-Eric

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: e2fsprogs-hg/misc/mke2fs.c
===
--- e2fsprogs-hg.orig/misc/mke2fs.c
+++ e2fsprogs-hg/misc/mke2fs.c
@@ -1288,7 +1288,8 @@ static void PRS(int argc, char *argv[])
tmp = tmp2 = NULL;
if (fs_param.s_rev_level != EXT2_GOOD_OLD_REV) {
profile_get_string(profile, "defaults", "base_features", 0,
-  "filetype,sparse_super", &tmp);
+  
"sparse_super,filetype,resize_inode,dir_index",
+  &tmp);
profile_get_string(profile, "fs_types", fs_type, 
   "base_features", tmp, &tmp2);
edit_feature(tmp2, &fs_param.s_feature_compat);
@@ -1365,7 +1366,7 @@ static void PRS(int argc, char *argv[])

if (blocksize <= 0) {
profile_get_integer(profile, "defaults", "blocksize", 0,
-   1024, &use_bsize);
+   4096, &use_bsize);
profile_get_integer(profile, "fs_types", fs_type, 
"blocksize", use_bsize, &use_bsize);
 


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Creating a >32bit blocks filesystem.

2007-05-07 Thread Jose R. Santos
On Fri, 04 May 2007 16:58:35 +0200
Valerie Clement <[EMAIL PROTECTED]> wrote:
> Hi Jose,
> I began to port our modifications done for the 64-bit support against 
> the new version of e2fsprogs Ted posted at the beginning of the week.
> Note that it is *just* for test use as it breaks the backwards 
> compatibility.
> I did a few tests with a kernel 2.6.17-rc7 and it seems to work, at 
> least mkfs, debugfs and fsck tools.

Hi Valerie,

I tried the patches and while the tools such as mkfs and debugfs seem
to work fine, I am still unable to mount a filesystem with block
numbers exceeding 32 bits.  I am testing on a 2.6.21.1 kernel with the
ext4 patches from:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.21-ext4-1

The following error shows up on the kernel log:

[12145.598822] EXT4-fs error (device dm-2): ext4_check_descriptors: Block 
bitmap for group 0 not in group (block 18446744069414584320)!
[12145.670781] EXT4-fs: group descriptors corrupted!

So its failing very early in ext4_check_descriptors().  The hi 32 bits of 
block_bitmap for the first group seem to be set all to 1s.

Thanks

-JRS
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:16:31PM +0530, Aneesh Kumar K.V wrote:
> 
> One of the option i was thinking was to use this tool to migrate to 
> extent map and then towards the end use the online defrag ioctl to 
> defrag the resulting ext4 inode.

As we discussed on the phone, my recommendation would be to take your
existing code and move it into the kernel so that triggered off of an
ioctl, your code could migrate an inode from using indirect blocks to
extents while a filesystem is mounted.  The main things you will need
to watch for is to make sure the inode is locked so that another
process doesn't try to extend or truncate it, and to use the JBD layer
to provide appropriate journalling support.

Given that ext4migrated imported the kernel extent functions, it
should be relatively straightforward to simply make them use the
kernel extent functions while in kernel space.

Once the the inode has been converted on-line then it can be
defragmented on-line.  That would be much more convenient than having
to unmount the filesystem to do the off-line migration, followed by
mounting it to do the defragmentation.

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote:
> On Mon, 7 May 2007, Theodore Tso wrote:
> 
> >We're going to have to rethink the defaults of the journal size as it
> >relates to on-line resizing, but there might not be a lot of great
> >answers here.  We'll also have to look at the on-line resizing code
> >and see if there's a way to break up the resize operation into smaller
> >transactions as a way of avoiding this problem --- but that would
> >still leave the user stuck with a pathetically small 4M journal on a
> >3G filesystem.
> 
> why not just allocate a new journal (something along the lines of convert 
> the ext3 to ext2 then back to ext3 would be the ugly, scripted way of 
> doing it)
> 
> or is the problem that you are trying to resize things without remounting 
> them (and therefor without flushing the journal)

Yes, I'm referring to the difficulties of getting this right when
doing online (mounted) resizing, as that was the context of the
original bug report.

Clearly the right thing for resize2fs to do when it's doing an
off-line resize is to just adjust the journal inode and make it
bigger.  I'll add that to my todo list, but of course, patches to do
that would be gratefully accepted...

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Dave Kleikamp
On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > wrote:

> > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > > len)
> > > +{
> > > + handle_t *handle;
> > > + ext4_fsblk_t block, max_blocks;
> > > + int ret, ret2, nblocks = 0, retries = 0;
> > > + struct buffer_head map_bh;
> > > + unsigned int credits, blkbits = inode->i_blkbits;
> > > +
> > > + /* Currently supporting (pre)allocate mode _only_ */
> > > + if (mode != FA_ALLOCATE)
> > > + return -EOPNOTSUPP;
> > > +
> > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > + return -ENOTTY;
> > 
> > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > news.  The changelog would be an appropriate place to communicate this,
> > along with reasons why, or a description of the plan to fix it.
> 
> Ok. Will add this in the function description as well.
> 
> > Also, posix says nothing about fallocate() returning ENOTTY.
> 
> Right. I don't seem to find any suitable error from posix description.
> Can you please suggest an error code which might make more sense here ?
> Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> non-extent files.

Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
here, and fall back to the current library code to do preallocation?
This way, the caller of fallocate() will never see this return code, so
it won't violate posix.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread david

On Mon, 7 May 2007, Theodore Tso wrote:


We're going to have to rethink the defaults of the journal size as it
relates to on-line resizing, but there might not be a lot of great
answers here.  We'll also have to look at the on-line resizing code
and see if there's a way to break up the resize operation into smaller
transactions as a way of avoiding this problem --- but that would
still leave the user stuck with a pathetically small 4M journal on a
3G filesystem.


why not just allocate a new journal (something along the lines of convert 
the ext3 to ext2 then back to ext3 would be the ugly, scripted way of 
doing it)


or is the problem that you are trying to resize things without remounting 
them (and therefor without flushing the journal)


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 12:26:26AM +0200, Frank van Maarseveen wrote:
> 2.6.20.6, FC4:
> 
> |JBD: ext2online wants too many credits (744 > 256)

You would be better off using resize2fs from e2fsprogs 1.39, BTW

About the only reason to keep using the ext2resize packages is
ext2prepare, which is the off-line tool to add a resizing inode after
the fact, and the main reason that functionality hasn't been absorbed
into e2fsprogs is that it needs to be completely rewritten before I'm
willing to trust and maintain it...

> What is the limitation I should be aware of? Has it something to do with
> the journal log size?

Yes, the problem is the journal log size.  Since the original
filesystem was so small, the journal was kept small so as not to use a
huge percentage of the filesystem size.  Of course, with a larger
filesystem you probably want a larger journal anyway to get better
performance.  So that brings up a related problem, which is we don't
have a way of dynamically increasing the size of the journal, which we
probably would want to do as part of resizing the filesystem size.

So the short-term workaround for now, if you know that you're going to
take a filesystem which is 22M and increase it to 3G, is to create it
with largish journal in the first place:

mke2fs -j -b 4096 -J size=16M /dev/vol1/project 22812

Note that if you were creating a 3G filesystem from scratch, by
default it would be using a 128M journal --- but it's a little hard to
fit a 128M journal in a 22M filesystem.  :-/

We're going to have to rethink the defaults of the journal size as it
relates to on-line resizing, but there might not be a lot of great
answers here.  We'll also have to look at the on-line resizing code
and see if there's a way to break up the resize operation into smaller
transactions as a way of avoiding this problem --- but that would
still leave the user stuck with a pathetically small 4M journal on a
3G filesystem.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-07 Thread Frank van Maarseveen
On Sun, May 06, 2007 at 09:40:14PM -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen <[EMAIL PROTECTED]> 
> wrote:
> 
> > Steps to reproduce:
> > Create a 3G partition, say /dev/vol1/project
> > mke2fs -j -b 4096 /dev/vol1/project 22812
> > mount it
> > ext2online /dev/vol1/project said:
> > 
> > | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> > | ext2online: ext2_ioctl: No space left on device
> > | 
> > | ext2online: unable to resize /dev/mapper/vol1-project
> > 
> > kernel said:
> > 
> > | JBD: ext2online wants too many credits (721 > 256)

There's a threshold for the problem depending on the initial
size. This one fails:

mke2fs -j -b 4096 /dev/<3GB-blockdev> 32768
(mount + ext2online or resize2fs)
kernel: JBD: resize2fs wants too many credits (1034 > 1024)

Add one block to the initial mke2fs (32768+1 == 32769) and the
problem is gone.

Without the -b 4096 there's another resize problem

mke2fs -j /dev/loop1 2048
mount /dev/loop1 /1
resize2fs /dev/loop1

says:
resize2fs 1.40-WIP (14-Nov-2006)
Filesystem at /dev/loop1 is mounted on /1; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 12
Performing an on-line resize of /dev/loop1 to 3072000 (1k) blocks.
resize2fs: Invalid argument While trying to add group #256

and the kernel says:

May  7 15:36:08 lokka EXT3-fs warning (device loop1): verify_reserved_gdb: 
May  7 15:36:08 lokka reserved GDT 10 missing grp 1 (8202)
May  7 15:36:08 lokka  

After that, the filesystem has been resized to 2GB. I recall a 2G
(?) limit for ext3 resizing with 1k blocksize but trying the above with
4096 1k blocks seems to work. fsck says it's ok all the time.

-- 
Frank
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Aneesh Kumar K.V



Theodore Tso wrote:


That is correct. My next step is to enhance the tool to support 
migration to large inodes.


I would actually suggest that the place to add that functionality
would be via "tune2fs -I ".  Since mke2fs -I 
is used to set the inode size, it makes sense that tune2fs -I would
change the inode size.  Obviously this would have to be done when the
filesystem is not mounted, and tune2fs should abort gracefully if
determines that the filesystem is mounted.



That is nice !!. I will  add the large inode migration support to tune2fs.

-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Aneesh Kumar K.V

Hi,

Theodore Tso wrote:

On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote:

Andreas Dilger wrote:

If this code could also (or optionally just) increase the size of inodes
it would be very useful.  AFAIK right now it will only change the inodes

>from block-mapped to extent-mapped?


Actually, changing the inode size is actually going to be more useful
to a larger number of people, since they can use it today to support
in-inode EA's.  In addition, changing the inode size must be done
off-line, whereas it's not so clear that off-line conversion to
extents is the best way to go (more on that below).




One of the option i was thinking was to use this tool to migrate to 
extent map and then towards the end use the online defrag ioctl to 
defrag the resulting ext4 inode.


That way we don't need to add kernel code that convert ext3 inode to 
ext4 inode while defragmentation. Last time i looked at the online 
defrag code, i was not able to figure out a easy way to take indirect 
map inode as input, then defrag the same and write the result in extent 
map.



-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote:
> Andreas Dilger wrote:
> >If this code could also (or optionally just) increase the size of inodes
> >it would be very useful.  AFAIK right now it will only change the inodes
> >from block-mapped to extent-mapped?

Actually, changing the inode size is actually going to be more useful
to a larger number of people, since they can use it today to support
in-inode EA's.  In addition, changing the inode size must be done
off-line, whereas it's not so clear that off-line conversion to
extents is the best way to go (more on that below).

One generic concern I have about simply migrating existing files to
use extents is that unless we actually also combine the tool with
defragmentation support, the file ends up being fairly badly
fragmented simply because of the "holes" left in the file from the
indirect blocks.  Therefore the tree ends up being far larger than it
needs to be, and future attempts allocate blocks in that block group
will fill in the holes, further degrading the performance of the
filesystem and whatever file is being written at the time.

The other reason why increasing the inode size would actually be more
valuable is that right now, if we have the disk space, the user can do
on-line migration of a file simply by copying it and then moving it
over the original file --- and if we have delayed allocation support
and sufficient memory, we can avoid the fragmentation problems
currently with the off-line ext4migration approach.

So we may need to think a bit about what's the best way to handle
ext4migrate.  It may be that correct approach is to combine it with a
defragmentation tool, either on-line or off-line.  

> That is correct. My next step is to enhance the tool to support 
> migration to large inodes.

I would actually suggest that the place to add that functionality
would be via "tune2fs -I ".  Since mke2fs -I 
is used to set the inode size, it makes sense that tune2fs -I would
change the inode size.  Obviously this would have to be done when the
filesystem is not mounted, and tune2fs should abort gracefully if
determines that the filesystem is mounted.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Ulrich Drepper

Jakub Jelinek wrote:

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail".  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for len 
bytes"
doesn't make sense for negative size.  


This wording has already been cleaned up.  The current draft for the 
next revision reads:



[EINVAL] The len argument is less than or equal to zero, or the offset
 argument is less than zero, or the underlying file system does not
 support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len<0 is already outlawed.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora
On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
> On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote:
> > /*
> >+ * ext4_ext_try_to_merge:
> >+ * tries to merge the "ex" extent to the next extent in the tree.
> >+ * It always tries to merge towards right. If you want to merge towards
> >+ * left, pass "ex - 1" as argument instead of "ex".
> >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> >+ * 1 if they got merged.
> >+ */
> >+int ext4_ext_try_to_merge(struct inode *inode,
> >+   struct ext4_ext_path *path,
> >+   struct ext4_extent *ex)
> >+{
> 
> Please either use proper kerneldoc format or drop
> "ext4_ext_try_to_merge" from the comment.

Ok, Thanks.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] Add unix COW io manager

2007-05-07 Thread Theodore Tso
On Fri, May 04, 2007 at 02:43:21PM +0530, Aneesh Kumar K.V wrote:
> + * unix_cow_io.c --- This is the Unix (well, really POSIX) implementation
> + *   of the COW I/O manager.

I really wish you had based this not on unix_io.c, but rather on
test_io.c, for two reasons (a) this reduces code duplication, (b) it
means that we can chain/stack I/O managers.  i.e., test_io ->
cow_manager -> unix_io.

The one change I would recommend is that instead of assigning to a
globally visible external variable (as I did in test_io, and which I
now regret as a not necessarily being portable --- different shared
library implementations have different restrictions on whether you can
access global variables defined in shared libraries from an
application or vice versa --- with AIX having some of the most peverse
shared library restritions, at least 7-8 years ago when I had the
misfortune of trying to make Kerberos v5 shared libraries work on AIX
:-).  Instead, what I would recommend is creating a function which an
application could call to set the backing I/O manager for the
cow_manager, as well as the filenames to use for the tdb journal.

The second change I would recommend is to NOT call it "cow_manager"
but rather an "undo_manager", since that makes it clear that what is
being backed up is not the work that we are about to do, but rather
the original contents of the disk.  I would also rename the
ext4_replay tool something like e2undofs, since "replay" has the
connotation that you are applying changes that were done in a log,
when in fact what you are really doing is backing out changes.

We want to make sure we avoid this confusion, since Xen and UML and
most other virtualization engines, for good reasons, do the exact
opposite --- the COW file contains the changes, and the base
file/device remains unchanged.

> + retval = tdb_store(data->tdb, tdb_key, tdb_data, TDB_INSERT);
> + if (retval == -1) {
> + /* FIXME!! Not sure what should be the error */
> + retval = EXT2_ET_SHORT_WRITE;
> + goto error_out;
> + }

You need to write the tdb_store() around a tdb_transaction_start() and
tdb_transaction_commit() pair, so that we can be sure the original
data is actually synced to disk.  Otherwise, on a crash, we could have
a situation where the original data on disk has been overwritten, but
the backup of the original block was never forced to disk!

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Pekka Enberg

On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote:

 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_extent *ex)
+{


Please either use proper kerneldoc format or drop
"ext4_ext_try_to_merge" from the comment.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + */
> > +int ext4_ext_try_to_merge(struct inode *inode,
> > +   struct ext4_ext_path *path,
> > +   struct ext4_extent *ex)
> > +{
> > +   struct ext4_extent_header *eh;
> > +   unsigned int depth, len;
> > +   int merge_done=0, uninitialized = 0;
> 
> space around "=", please.
> 
> Many people prefer not to do the multiple-definitions-per-line, btw:
> 
>   int merge_done = 0;
>   int uninitialized = 0;

Ok. Will make the change.

> 
> reasons:
> 
> - If gives you some space for a nice comment
> 
> - It makes patches much more readable, and it makes rejects easier to fix
> 
> - standardisation.
> 
> > +   depth = ext_depth(inode);
> > +   BUG_ON(path[depth].p_hdr == NULL);
> > +   eh = path[depth].p_hdr;
> > +
> > +   while (ex < EXT_LAST_EXTENT(eh)) {
> > +   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> > +   break;
> > +   /* merge with next extent! */
> > +   if (ext4_ext_is_uninitialized(ex))
> > +   uninitialized = 1;
> > +   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +   + ext4_ext_get_actual_len(ex + 1));
> > +   if (uninitialized)
> > +   ext4_ext_mark_uninitialized(ex);
> > +
> > +   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> > +   len = (EXT_LAST_EXTENT(eh) - ex - 1)
> > +   * sizeof(struct ext4_extent);
> > +   memmove(ex + 1, ex + 2, len);
> > +   }
> > +   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
> 
> Kenrel convention is to put spaces around "-"

Will fix this.

> 
> > +   merge_done = 1;
> > +   BUG_ON(eh->eh_entries == 0);
> 
> eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
> better to warn and run ext4_error() here?
Ok.
> 
> > +   }
> > +
> > +   return merge_done;
> > +}
> > +
> > +
> >
> > ...
> >
> > +/*
> > + * ext4_ext_convert_to_initialized:
> > + * this function is called by ext4_ext_get_blocks() if someone tries to 
> > write
> > + * to an uninitialized extent. It may result in splitting the uninitialized
> > + * extent into multiple extents (upto three). Atleast one initialized 
> > extent
> > + * and atmost two uninitialized extents can result.
> 
> There are some typos here
> 
> > + * There are three possibilities:
> > + *   a> No split required: Entire extent should be initialized.
> > + *   b> Split into two extents: Only one end of the extent is being 
> > written to.
> > + *   c> Split into three extents: Somone is writing in middle of the 
> > extent.
> 
> and here
> 
Ok. Will fix them.
> > + */
> > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> > +   struct ext4_ext_path *path,
> > +   ext4_fsblk_t iblock,
> > +   unsigned long max_blocks)
> > +{
> > +   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> > +   struct ext4_extent_header *eh;
> > +   unsigned int allocated, ee_block, ee_len, depth;
> > +   ext4_fsblk_t newblock;
> > +   int err = 0, ret = 0;
> > +
> > +   depth = ext_depth(inode);
> > +   eh = path[depth].p_hdr;
> > +   ex = path[depth].p_ext;
> > +   ee_block = le32_to_cpu(ex->ee_block);
> > +   ee_len = ext4_ext_get_actual_len(ex);
> > +   allocated = ee_len - (iblock - ee_block);
> > +   newblock = iblock - ee_block + ext_pblock(ex);
> > +   ex2 = ex;
> > +
> > +   /* ex1: ee_block to iblock - 1 : uninitialized */
> > +   if (iblock > ee_block) {
> > +   ex1 = ex;
> > +   ex1->ee_len = cpu_to_le16(iblock - ee_block);
> > +   ext4_ext_mark_uninitialized(ex1);
> > +   ex2 = &newex;
> > +   }
> > +   /* for sanity, update the length of the ex2 extent before
> > +* we insert ex3, if ex1 is NULL. This is to avoid temporary
> > +* overlap of blocks.
> > +*/
> > +   if (!ex1 && allocated > max_blocks)
> > +   ex2->ee_len = cpu_to_le16(max_blocks);
> > +   /* ex3: to ee_block + ee_len : uninitialised */
> > +   if (allocated > max_blocks) {
> > +   unsigned int newdepth;
> > +   ex3 = &newex;
> > +   ex3->ee_block = cpu_to_le32(iblock + max_blocks);
> > +   ext4_ext_store_pblock(ex3, newblock + max_blocks);
> > +   ex3->ee_len = cpu_to_le16(allocated - max_blocks);
> > +   ext4_ext_mark_uninitialized(ex3);
> > +   err = ext4_ext_insert_extent(handle, inode, path, ex3);
> > +   if (err)
> > +   goto out;
> > +   /* The depth, and hence eh & ex might change
> > +* as part of the insert above.
> > +*/
> > +   newdepth = ext_depth(inode);
> > +   if (newdepth != dep

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch has the ext4 implemtation of fallocate system call.
> > 
> > ...
> >
> > +   /* ext4_can_extents_be_merged should have checked that either
> > +* both extents are uninitialized, or both aren't. Thus we
> > +* need to check only one of them here.
> > +*/
> 
> Please always format multiline comments like this:
> 
>   /*
>* ext4_can_extents_be_merged should have checked that either
>* both extents are uninitialized, or both aren't. Thus we
>* need to check only one of them here.
>*/

Ok.
 
> > ...
> >
> > +/*
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
> 
> But what?   Here is where it should be described.

Ok. Will expand the description.
 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > len)
> > +{
> > +   handle_t *handle;
> > +   ext4_fsblk_t block, max_blocks;
> > +   int ret, ret2, nblocks = 0, retries = 0;
> > +   struct buffer_head map_bh;
> > +   unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +   /* Currently supporting (pre)allocate mode _only_ */
> > +   if (mode != FA_ALLOCATE)
> > +   return -EOPNOTSUPP;
> > +
> > +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +   return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
> Also, posix says nothing about fallocate() returning ENOTTY.

Right. I don't seem to find any suitable error from posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.
 
> > +   block = offset >> blkbits;
> > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +- block;
> > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

You are right to say that the credits can not be fixed here. But, 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the "segment range" already
being allocated case) the allocation request might touch.
One solution I have thought is to check the buffer credits after a call to
ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the
credits are falling short. Incase journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of "len".
 
> > +   handle=ext4_journal_start(inode, credits +
> 
> Please always put spaces around "="A
Ok.
> 
> > +   EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> 
> And around "+"
Ok.
> 
> > +   if (IS_ERR(handle))
> > +   return PTR_ERR(handle);
> > +retry:
> > +   ret = 0;
> > +   while (ret >= 0 && ret < max_blocks) {
> > +   block = block + ret;
> > +   max_blocks = max_blocks - ret;
> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ok. Will do that.
> 
> > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> 
> Use buffer_new() here.   A separate patch which fixes the three existing
> instances of open-coded BH_foo usage would be appreciated.

Ok.
> 
> > +   && ((block + ret) > (i_size_read(inode) << blkbits)))
> 
> Check for wrap though the sign bit and through zero please.
Ok.
> 
> > +   nblocks = nblocks + ret;
> > +   }
> > +
> > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > +   goto retry;
> > +
> > +   /* Time to update the file size.
> > +* Update only when preallocation was requested beyond the file size.

Re: [PATCH 3/5] ext4: Extent overlap bugfix

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > +unsigned int ext4_ext_check_overlap(struct inode *inode,
> > +   struct ext4_extent *newext,
> > +   struct ext4_ext_path *path)
> > +{
> > +   unsigned long b1, b2;
> > +   unsigned int depth, len1;
> > +
> > +   b1 = le32_to_cpu(newext->ee_block);
> > +   len1 = le16_to_cpu(newext->ee_len);
> > +   depth = ext_depth(inode);
> > +   if (!path[depth].p_ext)
> > +   goto out;
> > +   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> > +
> > +   /* get the next allocated block if the extent in the path
> > +* is before the requested block(s) */
> > +   if (b2 < b1) {
> > +   b2 = ext4_ext_next_allocated_block(path);
> > +   if (b2 == EXT_MAX_BLOCK)
> > +   goto out;
> > +   }
> > +
> > +   if (b1 + len1 > b2) {
> 
> Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!
 
> > +   newext->ee_len = cpu_to_le16(b2 - b1);
> > +   return 1;
> > +   }


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the 
> above
> check was relaxed?

I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?

One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora
Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.
> 
> If that's all too much, this material should at least be spelled out in the
> changelog.  Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
> 
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice.  Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
> 
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?

Ok. I will add a detailed comment here.

> 
> > +{
> > +   struct file *file;
> > +   struct inode *inode;
> > +   long ret = -EINVAL;
> > +
> > +   if (len == 0 || offset < 0)
> > +   goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

> 
> > +   ret = -EBADF;
> > +   file = fget(fd);
> > +   if (!file)
> > +   goto out;
> > +   if (!(file->f_mode & FMODE_WRITE))
> > +   goto out_fput;
> > +
> > +   inode = file->f_path.dentry->d_inode;
> > +
> > +   ret = -ESPIPE;
> > +   if (S_ISFIFO(inode->i_mode))
> > +   goto out_fput;
> > +
> > +   ret = -ENODEV;
> > +   if (!S_ISREG(inode->i_mode))
> > +   goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

True. 
 
> > +   ret = -EFBIG;
> > +   if (offset + len > inode->i_sb->s_maxbytes)
> > +   goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
> here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
> > +   if (inode->i_op && inode->i_op->fallocate)
> > +   ret = inode->i_op->fallocate(inode, mode, offset, len);
> > +   else
> > +   ret = -ENOSYS;
> 
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
> 
> If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
> > +out_fput:
> > +   fput(file);
> > +out:
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
> 
> I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE0x1
> > +#define FA_DEALLOCATE  0x2
> 
> Now those aren't in posix.  They should be documented, along with their
> expected semantics.

Will add a comment describing the role of these modes.
 
> >  #ifdef __KERNEL__
> >  
> >  #include 
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> > ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > int (*removexattr) (struct dentry *, const char *);
> > void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
> 
> I really do think it's better to put the variable names in definitions such
> as this.  Especially when we have two identically-typed variables next to
> each other like that.  Quick: which one is the offset and which is the
> length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html