Re: [PATCH] e2fsprogs: Check journal inode sanity and recreate journal
On Thu, Mar 15, 2007 at 02:43:47AM +0530, Kalpak Shah wrote: > Index: e2fsprogs-1.39/lib/ext2fs/mkjournal.c > === > --- e2fsprogs-1.39.orig/lib/ext2fs/mkjournal.c > +++ e2fsprogs-1.39/lib/ext2fs/mkjournal.c > + if (fs->super->s_blocks_count < 2048) { > + fputs(("\nFilesystem too small for a journal\n"), stderr); > + return 0; > + } Code in lib/ext2fs isn't allowed to do any output to stdio (except for debugging purposes). It causes internationalization problems, and it's just in general a bad idea for ext2fs library code to try to do any UI. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted
On Thu, Apr 19, 2007 at 05:35:36PM +0530, Kalpak Shah wrote: > This patch removes a code snippet from check_ea_in_inode() in pass1 > which checks if the EA values in the inode are sorted or not. The > comments in fs/ext*/xattr.c state that the EA values in the external > EA block are sorted but those in the inode need not be sorted. I > have also attached a test image which has unsorted EAs in the > inodes. The current e2fsck wrongly clears the EAs in the inode. Applied. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs
Theodore Tso wrote: On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote: e2fsprogs doesn't handle large (> 16 bit) UID/GID... Applied, with one correction: --- e2fsprogs-hg.orig/misc/mke2fs.c +++ e2fsprogs-hg/misc/mke2fs.c @@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys _("while reading root inode")); exit(1); } - inode.i_uid = getuid(); - if (inode.i_uid) - inode.i_gid = getgid(); + uid = getuid(); + inode.i_uid = uid; + inode.i_uid_high = uid >> 16; + if (inode.i_uid) { ^^^ This should be "uid" instead. Otherwise, the gid won't be set if the uid is a multiple of 65536. - Ted Doh... dumb mistake. nice catch. thanks. -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs
On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote: > e2fsprogs doesn't handle large (> 16 bit) UID/GID... Applied, with one correction: > --- e2fsprogs-hg.orig/misc/mke2fs.c > +++ e2fsprogs-hg/misc/mke2fs.c > @@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys > _("while reading root inode")); > exit(1); > } > - inode.i_uid = getuid(); > - if (inode.i_uid) > - inode.i_gid = getgid(); > + uid = getuid(); > + inode.i_uid = uid; > + inode.i_uid_high = uid >> 16; > + if (inode.i_uid) { ^^^ This should be "uid" instead. Otherwise, the gid won't be set if the uid is a multiple of 65536. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: > Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line > for current defaults... Applied. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-ext4-1
On Mon, May 07, 2007 at 01:56:23PM -0700, Mingming Cao wrote: > In any case, it would be useful to add a new set of testsuites for the > new fallocate() syscall and fsstress in LTP testsuites to automatically > the preallocation code in ext4/XFS. I hacked an existing XFS test prog to do manual testing of the fallocate() syscall. In the XFSQA suite we have various pre-alloc enhanced utils (e.g. fsstress, fsx, etc) that we should probably update to be able to use both fallocate and xfsctl so we can test both. Here's all the programs we use that have preallocation awareness: chook 982% grep RESVSP ltp/*.c | awk '/^ltp/ { split($1,a,":"); print a[1] ;}' | uniq ltp/doio.c ltp/fsstress.c ltp/fsx.c ltp/growfiles.c ltp/iogen.c chook 983% grep RESVSP src/* | awk '/^src/ { split($1,a,":"); print a[1] ;}' | uniq src/alloc.c src/fstest.c src/iopat.c src/randholes.c src/resvtest.c src/unwritten_mmap.c src/unwritten_sync.c BTW, have you guys tested mmap writes into unwritten extents? ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > We could check the total number of fs free blocks account before > preallocation happens, if there isn't enough space left, there is no > need to bother preallocating. Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. But it won't help if there are multiple processes trying to allocate blocks the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case. On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > Userspace could presumably repair the mess in most situations by truncating > the file back again. The kernel cannot do that because there might be live > data in amongst there. Actually, the kernel could do it, in that could simply release all unitialized extents back to the system. The problem is distinguishing between the unitialized extents that had just been newly added, versus the ones that had there from before. (On the other hand, if the filesystem was completely full, releasing unitialized blocks wouldn't be the worse thing in the world to do, although releasing previously fallocated blocks probably does violate the princple of least surprise, even if it's what the user would have wanted.) On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > If there is enough free space, we could make a reservation window that > have at least N free blocks and mark it not stealable by other files. So > later we will not run into the ENOSPC error. Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes? - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
Andreas Dilger wrote: My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Agreed, thanks for the clarification. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 07, 2007 19:02 -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >Actually, this is a non-issue. The reason that it is handled for > >extent-only is that this is the only way to allocate space in the > >filesystem without doing the explicit zeroing. > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: > On Mon, 07 May 2007 17:00:24 -0700 > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > + while (ret >= 0 && ret < max_blocks) { > > > + block = block + ret; > > > + max_blocks = max_blocks - ret; > > > + ret = ext4_ext_get_blocks(handle, inode, block, > > > + max_blocks, &map_bh, > > > + EXT4_CREATE_UNINITIALIZED_EXT, > > > 0); > > > + BUG_ON(!ret); > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > > + && ((block + ret) > (i_size_read(inode) << > > > blkbits))) > > > + nblocks = nblocks + ret; > > > + } > > > + > > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, > > > &retries)) > > > + goto retry; > > > + > > > Now the interesting question is: what do we do if we get halfway through > > > this loop and then run out of space? We could leave the disk all filled > > > up > > > and then return failure to the caller, but that's pretty poor behaviour, > > > IMO. > > > > > The current code handles earlier ENOSPC by three times retries. After > > that if we still run out of space, then it's propably right to notify > > the caller there isn't much space left. > > > > We could extend the block reservation window size before the while loop > > so we could get a lower chance to get more fragmented. > > yes, but my point is that the proposed behaviour is really quite bad. > I agree your point, that's why I mention it only helped the fragmentation issue but not the ENOSPC case. > We will attempt to allocate the disk space and then we will return failure, > having consumed all the disk space and having partially and uselessly > populated an unknown amount of the file. > Not totally useless I think. If only half of the space is preallocated because run out of space, the application can decide whether it's good enough to start to use this preallocated space or wait for the fs to have more free space. > Userspace could presumably repair the mess in most situations by truncating > the file back again. The kernel cannot do that because there might be live > data in amongst there. > > So we'd need to either keep track of which blocks were newly-allocated and > then free them all again on the error path (doesn't work right across > commit+crash+recovery) or we could later use the space-reservation scheme > which > delayed allocation will need to introduce. > > Or we could decide to live with the above IMO-crappy behaviour. In fact Amit and I had raised this issue before, whether it's okay to do allow partial preallocation. At that moment the feedback is it's no much different than the current zero-out-preallocation behavior: people might preallocating half-way then later deal with ENOSPC. We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that have at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. The fs free blocks account is just a estimate though. Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 19:14:42 -0400 > Theodore Tso <[EMAIL PROTECTED]> wrote: > > > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > > Actually, this is a non-issue. The reason that it is handled for > > > > extent-only > > > > is that this is the only way to allocate space in the filesystem without > > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > > ext4 with block-mapped files) the filesystem should return an error > > > > (e.g. > > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in > > > > userspace. > > > > > > It can be a bit suboptimal from the layout POV. The reservations code > > > will > > > largely save us here, but kernel support might make it a bit better. > > > > Actually, the reservations code won't matter, since glibc will fall > > back to its current behavior, which is it will do the preallocation by > > explicitly writing zeros to the file. > > No! Reservations code is *critical* here. Without reservations, we get > disastrously-bad layout if two processes were running a large fallocate() > at the same time. (This is an SMP-only problem, btw: on UP the timeslice > lengths save us). > > My point is that even though reservations save us, we could do even-better > in-kernel. > In this case, since the number of blocks to preallocate (eg. N=10GB) is clear, we could improve the current reservation code, to allow callers explicitly ask for a new window that have the minimum N free blocks for the blocks-to-preallocated(rather than just have at least 1 free blocks). Before the ext4_fallocate() is called, the right reservation window size is set with the flag to indicating "please spend time if needed to find a window covers at least N free blocks". So for ex4 block mapped files, later when glibc is doing allocation and zeroing, the ext4 block-mapped allocator will knows to reserve the right amount of free blocks before allocating and zeroing 10GB space. I am not sure whether this worth the effort though. > But then, a smart application would bypass the glibc() fallocate() > implementation and would tune the reservation window size and would use > direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). > > > This wlil result in the same > > layout as if we had done the persistent preallocation, but of course > > it will mean the posix_fallocate() could potentially take a long time > > if you're a PVR and you're reserving a gig or two for a two hour movie > > at high quality. That seems suboptimal, granted, and ideally the > > application should be warned about this before it calls > > posix_fallocate(). On the other hand, it's what happens today, all > > the time, so applications won't be too badly surprised. > > A PVR implementor would take all this over and would do it themselves, for > sure. > > > If we think applications programmers badly need to know in advance if > > posix_fallocate() will be fast or slow, probably the right thing is to > > define a new fpathconf() configuration option so they can query to see > > whether a particular file will support a fast posix_fallocate(). I'm > > not 100% convinced such complexity is really needed, but I'm willing > > to be convinced what do folks think? > > > > An application could do sys_fallocate(one-byte) to work out whether it's > supported in-kernel, I guess. > - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao <[EMAIL PROTECTED]> wrote: > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > + && ((block + ret) > (i_size_read(inode) << > > blkbits))) > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, > > &retries)) > > + goto retry; > > + > > Now the interesting question is: what do we do if we get halfway through > > this loop and then run out of space? We could leave the disk all filled up > > and then return failure to the caller, but that's pretty poor behaviour, > > IMO. > > > The current code handles earlier ENOSPC by three times retries. After > that if we still run out of space, then it's propably right to notify > the caller there isn't much space left. > > We could extend the block reservation window size before the while loop > so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 05:37:54 -0600 > Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > > > + block = offset >> blkbits; > > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> > > > > blkbits) > > > > +- block; > > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of > > > disk > > > space, and that this disk space will require an arbitrary amount of > > > metadata, how can we work out how much journal space we'll be needing > > > without at least looking at `len'? > > > > Good question. > > > > The uninitialized extent can cover up to 128MB with a single entry. > > If @path isn't specified, then ext4_ext_calc_credits_for_insert() > > function returns the maximum number of extents needed to insert a leaf, > > including splitting all of the index blocks. That would allow up to 43GB > > (340 extents/block * 128MB) to be preallocated, but it still needs to take > > the size of the preallocation into account (adding 3 blocks per 43GB - a > > leaf block, a bitmap block and a group descriptor). > > I think the use of ext4_journal_extend() (as Amit has proposed) will help > here, but it is not sufficient. > > Because under some circumstances, a journal_extend() failure could mean > that we fail to allocate all the required disk space. If it is infrequent > enough, that is acceptable when the caller is using fallocate() for > performance reasons. > > But it is very much not acceptable if the caller is using fallocate() for > space-reservation reasons. If you used fallocate to reserve 1GB of disk > and fallocate() "succeeded" and you later get ENOSPC then you'd have a > right to get a bit upset. > > So I think the ext3/4 fallocate() implementation will need to be > implemented as a loop: > > while (len) { > journal_start(); > len -= do_fallocate(len, ...); > journal_stop(); > } > > I agree. There is already a loop in Amit's current's patch to call ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to ask for in each journal_start()? > +/* > + * ext4_fallocate: > + * preallocate space for a file > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > + */ > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > +{ > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunk of blocks(extents) for this single preallocation request when fs is fragmented (or part of preallocation request is already fulfilled) I think we should move this calculation inside the loop as well,and we really do not need to grab the lock to calculate the credit if the @path is always NULL, all the function does is mathmatics. I can't think of any good way to estimate the total credits needed for this whole preallocation request. Looked at ext4_get_block(), which is used for DIO code to deal with large amount of block allocation. The credit reservation is quite weak there too. The DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32) > + handle=ext4_journal_start(inode, credits + > + > EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > +retry: > + ret = 0; > + while (ret >= 0 && ret < max_blocks) { > + block = block + ret; > + max_blocks = max_blocks - ret; > + ret = ext4_ext_get_blocks(handle, inode, block, > + max_blocks, &map_bh, > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > + BUG_ON(!ret); > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > + && ((block + ret) > (i_size_read(inode) << blkbits))) > + nblocks = nblocks + ret; > + } > + > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > + goto retry; > + > Now the interesting question is: what do we do if we get halfway through > this loop and then run out of space? We could leave the disk all filled up > and then return failure to the caller, but that's pretty poor behaviour, > IMO. > The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >On May 07, 2007 13:58 -0700, Andrew Morton wrote: > >>Final point: it's fairly disappointing that the present implementation is > >>ext4-only, and extent-only. I do think we should be aiming at an ext4 > >>bitmap-based implementation and an ext3 implementation. > > > >Actually, this is a non-issue. The reason that it is handled for > >extent-only > >is that this is the only way to allocate space in the filesystem without > >doing the explicit zeroing. For other filesystems (including ext3 and > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. When reading from a block where the extent is marked as unitialized, ext4 returns zero's, to avoid returning the uninitalized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitalized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by two (i.e., 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso <[EMAIL PROTECTED]> wrote: > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > Actually, this is a non-issue. The reason that it is handled for > > > extent-only > > > is that this is the only way to allocate space in the filesystem without > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in > > > userspace. > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > largely save us here, but kernel support might make it a bit better. > > Actually, the reservations code won't matter, since glibc will fall > back to its current behavior, which is it will do the preallocation by > explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously-bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even-better in-kernel. But then, a smart application would bypass the glibc() fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). > This wlil result in the same > layout as if we had done the persistent preallocation, but of course > it will mean the posix_fallocate() could potentially take a long time > if you're a PVR and you're reserving a gig or two for a two hour movie > at high quality. That seems suboptimal, granted, and ideally the > application should be warned about this before it calls > posix_fallocate(). On the other hand, it's what happens today, all > the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. > If we think applications programmers badly need to know in advance if > posix_fallocate() will be fast or slow, probably the right thing is to > define a new fpathconf() configuration option so they can query to see > whether a particular file will support a fast posix_fallocate(). I'm > not 100% convinced such complexity is really needed, but I'm willing > to be convinced what do folks think? > An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > Actually, this is a non-issue. The reason that it is handled for > > extent-only > > is that this is the only way to allocate space in the filesystem without > > doing the explicit zeroing. For other filesystems (including ext3 and > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > It can be a bit suboptimal from the layout POV. The reservations code will > largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. This wlil result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think applications programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced what do folks think? - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
Andreas Dilger wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger <[EMAIL PROTECTED]> wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: > > Final point: it's fairly disappointing that the present implementation is > > ext4-only, and extent-only. I do think we should be aiming at an ext4 > > bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and > ext4 with block-mapped files) the filesystem should return an error (e.g. > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 07, 2007 13:58 -0700, Andrew Morton wrote: > Final point: it's fairly disappointing that the present implementation is > ext4-only, and extent-only. I do think we should be aiming at an ext4 > bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On May 07, 2007 14:52 -0500, Eric Sandeen wrote: > Theodore Tso wrote: > > How likely do you think the case will be that mke2fs.conf would be > > missing? I'm trying to figure out how high priority of an item this > > really is. > > Well, not too likely, although for some reason I guess it happened in > the installer root in FC6 or so. That's what raised the issue. Ah, good point - there are probably lots of installers and rescue disks that grab mke2fs but don't know anything about /etc/mke2fs.conf. > > We could enhance the profile code so that it could read in the profile > > from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, > > but that adds bloat --- the question is how necessary do we think that > > really is? > > I guess it doesn't really sound *necessary* - it's just that if we have > 2 different "defaults" and they drift, it can be confusing... Since the shipped mke2fs.conf is only on the order of a few hundred bytes I don't think it is a huge issue to include it. I like the idea of it being compiled into the code and yet consistent with the default install. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On May 07, 2007 11:50 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote: > > On Mon, 7 May 2007, Theodore Tso wrote: > > >We're going to have to rethink the defaults of the journal size as it > > >relates to on-line resizing, but there might not be a lot of great > > >answers here. We'll also have to look at the on-line resizing code > > >and see if there's a way to break up the resize operation into smaller > > >transactions as a way of avoiding this problem --- but that would > > >still leave the user stuck with a pathetically small 4M journal on a > > >3G filesystem. > > Clearly the right thing for resize2fs to do when it's doing an > off-line resize is to just adjust the journal inode and make it > bigger. I'll add that to my todo list, but of course, patches to do > that would be gratefully accepted... For that matter, there was a paper from U. Wisconsin showing that having the journal in the middle of the filesystem gave noticably better performance due to lower average seek times. If anyone is looking at messing with the journal that is probably also a good and easy place to start (e.g. may be as easy as just giving a goal block of s_blocks_count / 2 to the journal create code). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On May 07, 2007 19:16 +0530, Aneesh Kumar K.V wrote: > Theodore Tso wrote: > >Actually, changing the inode size is actually going to be more useful > >to a larger number of people, since they can use it today to support > >in-inode EA's. In addition, changing the inode size must be done > >off-line, whereas it's not so clear that off-line conversion to > >extents is the best way to go (more on that below). > > That way we don't need to add kernel code that convert ext3 inode to > ext4 inode while defragmentation. Last time i looked at the online > defrag code, i was not able to figure out a easy way to take indirect > map inode as input, then defrag the same and write the result in extent > map. I think the online defrag code _should_ be fairly easily made to handle defrag of block-mapped files, so long as it is always creating extent- mapped files in the end. All that should be necessary is to have the kernel read data from the block-mapped file and write it into the newly mapped blocks. That functionality needs to exist for a long time anyways to support upgrading the filesystem so it shouldn't be a huge burden. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > + block = offset >> blkbits; > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > + - block; > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > space, and that this disk space will require an arbitrary amount of > > metadata, how can we work out how much journal space we'll be needing > > without at least looking at `len'? > > Good question. > > The uninitialized extent can cover up to 128MB with a single entry. > If @path isn't specified, then ext4_ext_calc_credits_for_insert() > function returns the maximum number of extents needed to insert a leaf, > including splitting all of the index blocks. That would allow up to 43GB > (340 extents/block * 128MB) to be preallocated, but it still needs to take > the size of the preallocation into account (adding 3 blocks per 43GB - a > leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() "succeeded" and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. Does the proposed implementation handle quotas correctly, btw? Has that been tested? Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-ext4-1
On Mon, 2007-04-30 at 11:14 -0400, Theodore Ts'o wrote: > I've respun the ext4 development patchset, with Amit's updated fallocate > patches. I've added Dave's patch to add ia64 support to the fallocate > system call, but *not* the XFS fallocate support patches. (Probably > better for them to live in an xfs tree, where they can more easily > tested and updated.) Yes, we haven't reached complete closure on the > fallocate system call calling convention, but it's enough for us to get > more testing in -mm. > > Also added Johann's jbd2-stats-through-procfs patches; it provides > useful help in turning the size of the journal, which will be useful in > benchmarking efforts. In addition, Alex Tomas's patch to free > just-allocated patches when there is an error inserting the extent into > the extent tree has also been included. > > The patches have been compile-tested on x86, and compile/run-tested on > x86/UML. Would appreciate reports about testing on other platforms. > I have tested this patch series on ppc64, x86_64 with dbench/tiobench/fsx, all runs fine. I am not sure what level of testing Amit has done about the fallocate() and preallocation code on various archs. I couldn't find a available s390 and ia64 machines with free partition yet. In any case, it would be useful to add a new set of testsuites for the new fallocate() syscall and fsstress in LTP testsuites to automatically the preallocation code in ext4/XFS. thanks, Mingming > Thanks, > > - Ted > > P.S. One bug which I've noted --- if there is a failure due to disk > filling up, running e2fsck on the filesystem will show that the i_blocks > fields on the inodes where there was a failure to allocate disk blocks > are left incorrect. I'm guessing this is a bug in the delayed > allocation patches. Alex, when you have a moment, could you take a > look? Thanks!! > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
Theodore Tso wrote: > On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of >> our testers filed a bug that said "mkfs.ext3 is much slower when >> mke2fs.conf is missing..." >> >> This is because the shipped defaults in mke2fs.conf do not match the >> shipped defaults in the mkfs code itself; he wound up making a 1k >> block filesystem on a very large block device, for example. >> >> So - How about this patch, to bring them back into line? > > It doesn't actually bring them completely back into line, since mke2fs > will use different block sizes depending on the size of the > filesystem. So your patch makes the default probably a bit more > reasonable, and so I'll probably end up applying it, but it definitely > isn't a complete replacement for /etc/mke2fs.conf. Well, sure, I don't mean for it to *replace* mke2fs.conf... Hm, does having [defaults] blocksize = 4096 in mke2fs.conf turn off the blocksize heuristics and force 4k? is what's in mke2fs.conf a starting point or an absolute? I guess I need to read up on the code... > How likely do you think the case will be that mke2fs.conf would be > missing? I'm trying to figure out how high priority of an item this > really is. Well, not too likely, although for some reason I guess it happened in the installer root in FC6 or so. That's what raised the issue. > We could enhance the profile code so that it could read in the profile > from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, > but that adds bloat --- the question is how necessary do we think that > really is? I guess it doesn't really sound *necessary* - it's just that if we have 2 different "defaults" and they drift, it can be confusing... -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On May 06, 2007 21:40 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen <[EMAIL PROTECTED]> > wrote: > > > 2.6.20.6, FC4: > > > > I created a 91248k ext3 fs with 4k blocksize: > > > > Next, I tried to resize it to about 3G using ext2online while mounted: > > > > At that time the kernel said: > > > > |JBD: ext2online wants too many credits (744 > 256) > > > > What is the limitation I should be aware of? Has it something to do with > > the journal log size? Yes, for very small filesystems the default journal size is only 4MB, and the maximum transaction size is 1MB (256 blocks). If the filesystem was 128MB or larger in the initial mke2fs then the journal would be 16MB and the resize would get up to 1024 blocks for the transaction. I'm not sure what the right solution is for this. If you know you will be resizing the fs you could increase the initial journal size at mke2fs time (-J size=16). Alternately, it is possible resize to an intermediate size and then delete and recreate the journal via tune2fs (which would be the larger size by default) but that can only be done offline. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 03, 2007 21:31 -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t > > len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. > > Also, posix says nothing about fallocate() returning ENOTTY. I _think_ this is to convince glibc to do the zero-filling in userspace, but I'm not up on the API specifics. > > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > +- block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Also, since @path is not being given then truncate_mutex is not needed. > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of > our testers filed a bug that said "mkfs.ext3 is much slower when > mke2fs.conf is missing..." > > This is because the shipped defaults in mke2fs.conf do not match the > shipped defaults in the mkfs code itself; he wound up making a 1k > block filesystem on a very large block device, for example. > > So - How about this patch, to bring them back into line? It doesn't actually bring them completely back into line, since mke2fs will use different block sizes depending on the size of the filesystem. So your patch makes the default probably a bit more reasonable, and so I'll probably end up applying it, but it definitely isn't a complete replacement for /etc/mke2fs.conf. How likely do you think the case will be that mke2fs.conf would be missing? I'm trying to figure out how high priority of an item this really is. > Which makes me wonder; having "defaults" in 2 different places is > bound to get out of sync; should we instead generate both code & > config file defaults (and maybe man page defaults) from a common > source? I had been working on the assumption that the defaults if mke2fs.conf were not present were more in the nature of emergency defaults as opposed something that could be used a fully functional set of configuration parameters. So the assumption was that when you installed the RPM, mke2fs.conf would also be there. We could enhance the profile code so that it could read in the profile from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, but that adds bloat --- the question is how necessary do we think that really is? Regards, - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Creating a >32bit blocks filesystem.
On Mon, 7 May 2007 11:19:52 -0500 "Jose R. Santos" <[EMAIL PROTECTED]> wrote: > I tried the patches and while the tools such as mkfs and debugfs seem > to work fine, I am still unable to mount a filesystem with block > numbers exceeding 32 bits. Correction, I tested debugfs on a 32bit blocks filesystem. Both debugfs and fsck failed when attempting to use a 64bit filesystem. -JRS - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Accomodate 32-bit uid/guid values in e2fsprogs
e2fsprogs doesn't handle large (> 16 bit) UID/GID... Create an fs by user with large UID: bash-3.1$ id uid=501666(newuser) gid=501666(newuser) groups=501666(newuser) bash-3.1$ /sbin/mke2fs -q fsfile fsfile is not a block special device. Proceed anyway? (y,n) y Now mount it: [EMAIL PROTECTED] tmp]# mount -o loop fsfile /mnt/tmp/ [EMAIL PROTECTED] tmp]# ls -la /mnt/tmp/ total 13 drwxr-xr-x 3 42914 42914 1024 May 7 11:55 . drwxr-xr-x 4 root root 28 Jan 23 10:34 .. drwx-- 2 root root 12288 May 7 11:55 lost+found uid/gid is wrong (only bottom 16 bits)! (it's wrong on disk...) And even if I get it right on disk: [EMAIL PROTECTED] tmp]# ls -lan /mnt/tmp/ total 13 drwxr-xr-x 3 501666 501666 1024 May 7 11:57 . drwxr-xr-x 4 0 028 Jan 23 10:34 .. drwx-- 2 0 0 12288 May 7 11:57 lost+found debugfs has further problems: [EMAIL PROTECTED] tmp]# /sbin/debugfs /tmp/fsfile debugfs 1.39 (29-May-2006) debugfs: stat . Inode: 2 Type: directoryMode: 0755 Flags: 0x0 Generation: 0 User: 42914 Group: 42914 Size: 1024 File ACL: 0Directory ACL: 0 Links: 3 Blockcount: 2 Fragment: Address: 0Number: 0Size: 0 ctime: 0x463f5a8a -- Mon May 7 11:57:46 2007 atime: 0x463f5a90 -- Mon May 7 11:57:52 2007 mtime: 0x463f5a8a -- Mon May 7 11:57:46 2007 BLOCKS: (0):508 TOTAL: 1 debugfs: ls -l 2 40755 (2) 42914 429141024 7-May-2007 11:57 . 2 40755 (2) 42914 429141024 7-May-2007 11:57 .. 11 40700 (2) 0 0 12288 7-May-2007 11:57 lost+found The attached patch creates inode_uid() and inode_gid() macros to accommodate the high bits when reading uid/gid, on platforms that support it. It also accommodates large UID/GID at mkfs time. Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> Index: e2fsprogs-hg/debugfs/debugfs.c === --- e2fsprogs-hg.orig/debugfs/debugfs.c +++ e2fsprogs-hg/debugfs/debugfs.c @@ -534,7 +534,7 @@ void internal_dump_inode(FILE *out, cons prefix, inode->i_mode & 0777, inode->i_flags, inode->i_generation); fprintf(out, "%sUser: %5d Group: %5d Size: ", - prefix, inode->i_uid, inode->i_gid); + prefix, inode_uid(*inode), inode_gid(*inode)); if (LINUX_S_ISREG(inode->i_mode)) { __u64 i_size = (inode->i_size | ((unsigned long long)inode->i_size_high << 32)); Index: e2fsprogs-hg/debugfs/ls.c === --- e2fsprogs-hg.orig/debugfs/ls.c +++ e2fsprogs-hg/debugfs/ls.c @@ -88,7 +88,7 @@ static int list_dir_proc(ext2_ino_t dir } fprintf(ls->f, "%c%6u%c %6o (%d) %5d %5d ", lbr, ino, rbr, inode.i_mode, dirent->name_len >> 8, - inode.i_uid, inode.i_gid); + inode_uid(inode), inode_gid(inode)); if (LINUX_S_ISDIR(inode.i_mode)) fprintf(ls->f, "%5d", inode.i_size); else Index: e2fsprogs-hg/debugfs/lsdel.c === --- e2fsprogs-hg.orig/debugfs/lsdel.c +++ e2fsprogs-hg/debugfs/lsdel.c @@ -23,7 +23,7 @@ struct deleted_info { ext2_ino_t ino; unsigned short mode; - unsigned short uid; + __u32 uid; __u64 size; time_t dtime; int num_blocks; @@ -160,7 +160,7 @@ void do_lsdel(int argc, char **argv) delarray[num_delarray].ino = ino; delarray[num_delarray].mode = inode.i_mode; - delarray[num_delarray].uid = inode.i_uid; + delarray[num_delarray].uid = inode_uid(inode); delarray[num_delarray].size = inode.i_size; if (!LINUX_S_ISDIR(inode.i_mode)) delarray[num_delarray].size |= Index: e2fsprogs-hg/e2fsck/message.c === --- e2fsprogs-hg.orig/e2fsck/message.c +++ e2fsprogs-hg/e2fsck/message.c @@ -304,12 +304,10 @@ static _INLINE_ void expand_inode_expres inode->i_dir_acl : 0)); break; case 'u': - printf("%d", (inode->i_uid | - (inode->osd2.linux2.l_i_uid_high << 16))); + printf("%d", inode_uid(*inode)); break; case 'g': - printf("%d", (inode->i_gid | - (inode->osd2.linux2.l_i_gid_high << 16))); + printf("%d", inode_gid(*inode)); break; case 't': if (LINUX_S_ISREG(inode->i_mode)) Index: e2fsprogs-hg/lib/ext2fs/ext2_fs.h
[PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
One of our testers filed a bug that said "mkfs.ext3 is much slower when mke2fs.conf is missing..." This is because the shipped defaults in mke2fs.conf do not match the shipped defaults in the mkfs code itself; he wound up making a 1k block filesystem on a very large block device, for example. So - How about this patch, to bring them back into line? Which makes me wonder; having "defaults" in 2 different places is bound to get out of sync; should we instead generate both code & config file defaults (and maybe man page defaults) from a common source? Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line for current defaults... Thanks, -Eric Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> Index: e2fsprogs-hg/misc/mke2fs.c === --- e2fsprogs-hg.orig/misc/mke2fs.c +++ e2fsprogs-hg/misc/mke2fs.c @@ -1288,7 +1288,8 @@ static void PRS(int argc, char *argv[]) tmp = tmp2 = NULL; if (fs_param.s_rev_level != EXT2_GOOD_OLD_REV) { profile_get_string(profile, "defaults", "base_features", 0, - "filetype,sparse_super", &tmp); + "sparse_super,filetype,resize_inode,dir_index", + &tmp); profile_get_string(profile, "fs_types", fs_type, "base_features", tmp, &tmp2); edit_feature(tmp2, &fs_param.s_feature_compat); @@ -1365,7 +1366,7 @@ static void PRS(int argc, char *argv[]) if (blocksize <= 0) { profile_get_integer(profile, "defaults", "blocksize", 0, - 1024, &use_bsize); + 4096, &use_bsize); profile_get_integer(profile, "fs_types", fs_type, "blocksize", use_bsize, &use_bsize); - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Creating a >32bit blocks filesystem.
On Fri, 04 May 2007 16:58:35 +0200 Valerie Clement <[EMAIL PROTECTED]> wrote: > Hi Jose, > I began to port our modifications done for the 64-bit support against > the new version of e2fsprogs Ted posted at the beginning of the week. > Note that it is *just* for test use as it breaks the backwards > compatibility. > I did a few tests with a kernel 2.6.17-rc7 and it seems to work, at > least mkfs, debugfs and fsck tools. Hi Valerie, I tried the patches and while the tools such as mkfs and debugfs seem to work fine, I am still unable to mount a filesystem with block numbers exceeding 32 bits. I am testing on a 2.6.21.1 kernel with the ext4 patches from: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.21-ext4-1 The following error shows up on the kernel log: [12145.598822] EXT4-fs error (device dm-2): ext4_check_descriptors: Block bitmap for group 0 not in group (block 18446744069414584320)! [12145.670781] EXT4-fs: group descriptors corrupted! So its failing very early in ext4_check_descriptors(). The hi 32 bits of block_bitmap for the first group seem to be set all to 1s. Thanks -JRS - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On Mon, May 07, 2007 at 07:16:31PM +0530, Aneesh Kumar K.V wrote: > > One of the option i was thinking was to use this tool to migrate to > extent map and then towards the end use the online defrag ioctl to > defrag the resulting ext4 inode. As we discussed on the phone, my recommendation would be to take your existing code and move it into the kernel so that triggered off of an ioctl, your code could migrate an inode from using indirect blocks to extents while a filesystem is mounted. The main things you will need to watch for is to make sure the inode is locked so that another process doesn't try to extend or truncate it, and to use the JBD layer to provide appropriate journalling support. Given that ext4migrated imported the kernel extent functions, it should be relatively straightforward to simply make them use the kernel extent functions while in kernel space. Once the the inode has been converted on-line then it can be defragmented on-line. That would be much more convenient than having to unmount the filesystem to do the off-line migration, followed by mounting it to do the defragmentation. Regards, - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote: > On Mon, 7 May 2007, Theodore Tso wrote: > > >We're going to have to rethink the defaults of the journal size as it > >relates to on-line resizing, but there might not be a lot of great > >answers here. We'll also have to look at the on-line resizing code > >and see if there's a way to break up the resize operation into smaller > >transactions as a way of avoiding this problem --- but that would > >still leave the user stuck with a pathetically small 4M journal on a > >3G filesystem. > > why not just allocate a new journal (something along the lines of convert > the ext3 to ext2 then back to ext3 would be the ugly, scripted way of > doing it) > > or is the problem that you are trying to resize things without remounting > them (and therefor without flushing the journal) Yes, I'm referring to the difficulties of getting this right when doing online (mounted) resizing, as that was the context of the original bug report. Clearly the right thing for resize2fs to do when it's doing an off-line resize is to just adjust the journal inode and make it bigger. I'll add that to my todo list, but of course, patches to do that would be gratefully accepted... - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> > > wrote: > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t > > > len) > > > +{ > > > + handle_t *handle; > > > + ext4_fsblk_t block, max_blocks; > > > + int ret, ret2, nblocks = 0, retries = 0; > > > + struct buffer_head map_bh; > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > + > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > + if (mode != FA_ALLOCATE) > > > + return -EOPNOTSUPP; > > > + > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > + return -ENOTTY; > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > news. The changelog would be an appropriate place to communicate this, > > along with reasons why, or a description of the plan to fix it. > > Ok. Will add this in the function description as well. > > > Also, posix says nothing about fallocate() returning ENOTTY. > > Right. I don't seem to find any suitable error from posix description. > Can you please suggest an error code which might make more sense here ? > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On Mon, 7 May 2007, Theodore Tso wrote: We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. why not just allocate a new journal (something along the lines of convert the ext3 to ext2 then back to ext3 would be the ugly, scripted way of doing it) or is the problem that you are trying to resize things without remounting them (and therefor without flushing the journal) David Lang - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On Mon, May 07, 2007 at 12:26:26AM +0200, Frank van Maarseveen wrote: > 2.6.20.6, FC4: > > |JBD: ext2online wants too many credits (744 > 256) You would be better off using resize2fs from e2fsprogs 1.39, BTW About the only reason to keep using the ext2resize packages is ext2prepare, which is the off-line tool to add a resizing inode after the fact, and the main reason that functionality hasn't been absorbed into e2fsprogs is that it needs to be completely rewritten before I'm willing to trust and maintain it... > What is the limitation I should be aware of? Has it something to do with > the journal log size? Yes, the problem is the journal log size. Since the original filesystem was so small, the journal was kept small so as not to use a huge percentage of the filesystem size. Of course, with a larger filesystem you probably want a larger journal anyway to get better performance. So that brings up a related problem, which is we don't have a way of dynamically increasing the size of the journal, which we probably would want to do as part of resizing the filesystem size. So the short-term workaround for now, if you know that you're going to take a filesystem which is 22M and increase it to 3G, is to create it with largish journal in the first place: mke2fs -j -b 4096 -J size=16M /dev/vol1/project 22812 Note that if you were creating a 3G filesystem from scratch, by default it would be using a 128M journal --- but it's a little hard to fit a 128M journal in a 22M filesystem. :-/ We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 > 256)
On Sun, May 06, 2007 at 09:40:14PM -0700, Andrew Morton wrote: > On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen <[EMAIL PROTECTED]> > wrote: > > > Steps to reproduce: > > Create a 3G partition, say /dev/vol1/project > > mke2fs -j -b 4096 /dev/vol1/project 22812 > > mount it > > ext2online /dev/vol1/project said: > > > > | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b > > | ext2online: ext2_ioctl: No space left on device > > | > > | ext2online: unable to resize /dev/mapper/vol1-project > > > > kernel said: > > > > | JBD: ext2online wants too many credits (721 > 256) There's a threshold for the problem depending on the initial size. This one fails: mke2fs -j -b 4096 /dev/<3GB-blockdev> 32768 (mount + ext2online or resize2fs) kernel: JBD: resize2fs wants too many credits (1034 > 1024) Add one block to the initial mke2fs (32768+1 == 32769) and the problem is gone. Without the -b 4096 there's another resize problem mke2fs -j /dev/loop1 2048 mount /dev/loop1 /1 resize2fs /dev/loop1 says: resize2fs 1.40-WIP (14-Nov-2006) Filesystem at /dev/loop1 is mounted on /1; on-line resizing required old desc_blocks = 1, new_desc_blocks = 12 Performing an on-line resize of /dev/loop1 to 3072000 (1k) blocks. resize2fs: Invalid argument While trying to add group #256 and the kernel says: May 7 15:36:08 lokka EXT3-fs warning (device loop1): verify_reserved_gdb: May 7 15:36:08 lokka reserved GDT 10 missing grp 1 (8202) May 7 15:36:08 lokka After that, the filesystem has been resized to 2GB. I recall a 2G (?) limit for ext3 resizing with 1k blocksize but trying the above with 4096 1k blocks seems to work. fsck says it's ok all the time. -- Frank - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
Theodore Tso wrote: That is correct. My next step is to enhance the tool to support migration to large inodes. I would actually suggest that the place to add that functionality would be via "tune2fs -I ". Since mke2fs -I is used to set the inode size, it makes sense that tune2fs -I would change the inode size. Obviously this would have to be done when the filesystem is not mounted, and tune2fs should abort gracefully if determines that the filesystem is mounted. That is nice !!. I will add the large inode migration support to tune2fs. -aneesh - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
Hi, Theodore Tso wrote: On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote: Andreas Dilger wrote: If this code could also (or optionally just) increase the size of inodes it would be very useful. AFAIK right now it will only change the inodes >from block-mapped to extent-mapped? Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). One of the option i was thinking was to use this tool to migrate to extent map and then towards the end use the online defrag ioctl to defrag the resulting ext4 inode. That way we don't need to add kernel code that convert ext3 inode to ext4 inode while defragmentation. Last time i looked at the online defrag code, i was not able to figure out a easy way to take indirect map inode as input, then defrag the same and write the result in extent map. -aneesh - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote: > Andreas Dilger wrote: > >If this code could also (or optionally just) increase the size of inodes > >it would be very useful. AFAIK right now it will only change the inodes > >from block-mapped to extent-mapped? Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). One generic concern I have about simply migrating existing files to use extents is that unless we actually also combine the tool with defragmentation support, the file ends up being fairly badly fragmented simply because of the "holes" left in the file from the indirect blocks. Therefore the tree ends up being far larger than it needs to be, and future attempts allocate blocks in that block group will fill in the holes, further degrading the performance of the filesystem and whatever file is being written at the time. The other reason why increasing the inode size would actually be more valuable is that right now, if we have the disk space, the user can do on-line migration of a file simply by copying it and then moving it over the original file --- and if we have delayed allocation support and sufficient memory, we can avoid the fragmentation problems currently with the off-line ext4migration approach. So we may need to think a bit about what's the best way to handle ext4migrate. It may be that correct approach is to combine it with a defragmentation tool, either on-line or off-line. > That is correct. My next step is to enhance the tool to support > migration to large inodes. I would actually suggest that the place to add that functionality would be via "tune2fs -I ". Since mke2fs -I is used to set the inode size, it makes sense that tune2fs -I would change the inode size. Obviously this would have to be done when the filesystem is not mounted, and tune2fs should abort gracefully if determines that the filesystem is mounted. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Jakub Jelinek wrote: is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len<0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote: > On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote: > > /* > >+ * ext4_ext_try_to_merge: > >+ * tries to merge the "ex" extent to the next extent in the tree. > >+ * It always tries to merge towards right. If you want to merge towards > >+ * left, pass "ex - 1" as argument instead of "ex". > >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > >+ * 1 if they got merged. > >+ */ > >+int ext4_ext_try_to_merge(struct inode *inode, > >+ struct ext4_ext_path *path, > >+ struct ext4_extent *ex) > >+{ > > Please either use proper kerneldoc format or drop > "ext4_ext_try_to_merge" from the comment. Ok, Thanks. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] Add unix COW io manager
On Fri, May 04, 2007 at 02:43:21PM +0530, Aneesh Kumar K.V wrote: > + * unix_cow_io.c --- This is the Unix (well, really POSIX) implementation > + * of the COW I/O manager. I really wish you had based this not on unix_io.c, but rather on test_io.c, for two reasons (a) this reduces code duplication, (b) it means that we can chain/stack I/O managers. i.e., test_io -> cow_manager -> unix_io. The one change I would recommend is that instead of assigning to a globally visible external variable (as I did in test_io, and which I now regret as a not necessarily being portable --- different shared library implementations have different restrictions on whether you can access global variables defined in shared libraries from an application or vice versa --- with AIX having some of the most peverse shared library restritions, at least 7-8 years ago when I had the misfortune of trying to make Kerberos v5 shared libraries work on AIX :-). Instead, what I would recommend is creating a function which an application could call to set the backing I/O manager for the cow_manager, as well as the filenames to use for the tdb journal. The second change I would recommend is to NOT call it "cow_manager" but rather an "undo_manager", since that makes it clear that what is being backed up is not the work that we are about to do, but rather the original contents of the disk. I would also rename the ext4_replay tool something like e2undofs, since "replay" has the connotation that you are applying changes that were done in a log, when in fact what you are really doing is backing out changes. We want to make sure we avoid this confusion, since Xen and UML and most other virtualization engines, for good reasons, do the exact opposite --- the COW file contains the changes, and the base file/device remains unchanged. > + retval = tdb_store(data->tdb, tdb_key, tdb_data, TDB_INSERT); > + if (retval == -1) { > + /* FIXME!! Not sure what should be the error */ > + retval = EXT2_ET_SHORT_WRITE; > + goto error_out; > + } You need to write the tdb_store() around a tdb_transaction_start() and tdb_transaction_commit() pair, so that we can be sure the original data is actually synced to disk. Otherwise, on a crash, we could have a situation where the original data on disk has been overwritten, but the backup of the original block was never forced to disk! - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On 4/26/07, Amit K. Arora <[EMAIL PROTECTED]> wrote: /* + * ext4_ext_try_to_merge: + * tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ Please either use proper kerneldoc format or drop "ext4_ext_try_to_merge" from the comment. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > + */ > > +int ext4_ext_try_to_merge(struct inode *inode, > > + struct ext4_ext_path *path, > > + struct ext4_extent *ex) > > +{ > > + struct ext4_extent_header *eh; > > + unsigned int depth, len; > > + int merge_done=0, uninitialized = 0; > > space around "=", please. > > Many people prefer not to do the multiple-definitions-per-line, btw: > > int merge_done = 0; > int uninitialized = 0; Ok. Will make the change. > > reasons: > > - If gives you some space for a nice comment > > - It makes patches much more readable, and it makes rejects easier to fix > > - standardisation. > > > + depth = ext_depth(inode); > > + BUG_ON(path[depth].p_hdr == NULL); > > + eh = path[depth].p_hdr; > > + > > + while (ex < EXT_LAST_EXTENT(eh)) { > > + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) > > + break; > > + /* merge with next extent! */ > > + if (ext4_ext_is_uninitialized(ex)) > > + uninitialized = 1; > > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > > + + ext4_ext_get_actual_len(ex + 1)); > > + if (uninitialized) > > + ext4_ext_mark_uninitialized(ex); > > + > > + if (ex + 1 < EXT_LAST_EXTENT(eh)) { > > + len = (EXT_LAST_EXTENT(eh) - ex - 1) > > + * sizeof(struct ext4_extent); > > + memmove(ex + 1, ex + 2, len); > > + } > > + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); > > Kenrel convention is to put spaces around "-" Will fix this. > > > + merge_done = 1; > > + BUG_ON(eh->eh_entries == 0); > > eek, scary BUG_ON. Do we really need to be that severe? Would it be > better to warn and run ext4_error() here? Ok. > > > + } > > + > > + return merge_done; > > +} > > + > > + > > > > ... > > > > +/* > > + * ext4_ext_convert_to_initialized: > > + * this function is called by ext4_ext_get_blocks() if someone tries to > > write > > + * to an uninitialized extent. It may result in splitting the uninitialized > > + * extent into multiple extents (upto three). Atleast one initialized > > extent > > + * and atmost two uninitialized extents can result. > > There are some typos here > > > + * There are three possibilities: > > + * a> No split required: Entire extent should be initialized. > > + * b> Split into two extents: Only one end of the extent is being > > written to. > > + * c> Split into three extents: Somone is writing in middle of the > > extent. > > and here > Ok. Will fix them. > > + */ > > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > > + struct ext4_ext_path *path, > > + ext4_fsblk_t iblock, > > + unsigned long max_blocks) > > +{ > > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > > + struct ext4_extent_header *eh; > > + unsigned int allocated, ee_block, ee_len, depth; > > + ext4_fsblk_t newblock; > > + int err = 0, ret = 0; > > + > > + depth = ext_depth(inode); > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + ee_block = le32_to_cpu(ex->ee_block); > > + ee_len = ext4_ext_get_actual_len(ex); > > + allocated = ee_len - (iblock - ee_block); > > + newblock = iblock - ee_block + ext_pblock(ex); > > + ex2 = ex; > > + > > + /* ex1: ee_block to iblock - 1 : uninitialized */ > > + if (iblock > ee_block) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* for sanity, update the length of the ex2 extent before > > +* we insert ex3, if ex1 is NULL. This is to avoid temporary > > +* overlap of blocks. > > +*/ > > + if (!ex1 && allocated > max_blocks) > > + ex2->ee_len = cpu_to_le16(max_blocks); > > + /* ex3: to ee_block + ee_len : uninitialised */ > > + if (allocated > max_blocks) { > > + unsigned int newdepth; > > + ex3 = &newex; > > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > > + ext4_ext_mark_uninitialized(ex3); > > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > > + if (err) > > + goto out; > > + /* The depth, and hence eh & ex might change > > +* as part of the insert above. > > +*/ > > + newdepth = ext_depth(inode); > > + if (newdepth != dep
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch has the ext4 implemtation of fallocate system call. > > > > ... > > > > + /* ext4_can_extents_be_merged should have checked that either > > +* both extents are uninitialized, or both aren't. Thus we > > +* need to check only one of them here. > > +*/ > > Please always format multiline comments like this: > > /* >* ext4_can_extents_be_merged should have checked that either >* both extents are uninitialized, or both aren't. Thus we >* need to check only one of them here. >*/ Ok. > > ... > > > > +/* > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. Ok. Will expand the description. > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t > > len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. > Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. > > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > +- block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? You are right to say that the credits can not be fixed here. But, 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the "segment range" already being allocated case) the allocation request might touch. One solution I have thought is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the credits are falling short. Incase journal_extend fails, we call journal_restart. This will automatically take care of how much journal space we might need for any value of "len". > > + handle=ext4_journal_start(inode, credits + > > Please always put spaces around "="A Ok. > > > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); > > And around "+" Ok. > > > + if (IS_ERR(handle)) > > + return PTR_ERR(handle); > > +retry: > > + ret = 0; > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ok. Will do that. > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > Use buffer_new() here. A separate patch which fixes the three existing > instances of open-coded BH_foo usage would be appreciated. Ok. > > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > Check for wrap though the sign bit and through zero please. Ok. > > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > + /* Time to update the file size. > > +* Update only when preallocation was requested beyond the file size.
Re: [PATCH 3/5] ext4: Extent overlap bugfix
On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > +unsigned int ext4_ext_check_overlap(struct inode *inode, > > + struct ext4_extent *newext, > > + struct ext4_ext_path *path) > > +{ > > + unsigned long b1, b2; > > + unsigned int depth, len1; > > + > > + b1 = le32_to_cpu(newext->ee_block); > > + len1 = le16_to_cpu(newext->ee_len); > > + depth = ext_depth(inode); > > + if (!path[depth].p_ext) > > + goto out; > > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > > + > > + /* get the next allocated block if the extent in the path > > +* is before the requested block(s) */ > > + if (b2 < b1) { > > + b2 = ext4_ext_next_allocated_block(path); > > + if (b2 == EXT_MAX_BLOCK) > > + goto out; > > + } > > + > > + if (b1 + len1 > b2) { > > Are we sure that b1+len cannot wrap through zero here? No. Will add a check here for this. Thanks! > > + newext->ee_len = cpu_to_le16(b2 - b1); > > + return 1; > > + } -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the > above > check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? One thing to be thought in this case is the error code which should be returned by the file system implementation, incase it doesn't support preallocation for directories. Should it be -ENODEV (to match with what posix says) , or something else (which might make more sense in this case) ? -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. > > If that's all too much, this material should at least be spelled out in the > changelog. Because there's no way in which this change can be fully > reviewed unless someone (ie: you) tells us what it is setting out to > achieve. > > If we 100% implement some standard then a URL for what we claim to > implement would suffice. Given that we're at least using different types from > posix I doubt if such a thing would be sufficient. > > And given the complexity and potential variability within the filesystem > implementations of this, I'd expect that _something_ additional needs to be > said? Ok. I will add a detailed comment here. > > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I think we should go ahead with current glibc implementation (which Jakub poited at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. > > > + ret = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + if (!(file->f_mode & FMODE_WRITE)) > > + goto out_fput; > > + > > + inode = file->f_path.dentry->d_inode; > > + > > + ret = -ESPIPE; > > + if (S_ISFIFO(inode->i_mode)) > > + goto out_fput; > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. True. > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment > here would settle the reader's mind. Ok. I will add a check here for wrap though zero. > > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, mode, offset, len); > > + else > > + ret = -ENOSYS; > > If we _are_ going to support negative `len', as posix suggests, I think we > should perform the appropriate sanity conversions to `offset' and `len' > right here, rather than expecting each filesystem to do it. > > If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). > > +out_fput: > > + fput(file); > > +out: > > + return ret; > > +} > > +EXPORT_SYMBOL(sys_fallocate); > > I don't believe this needs to be exported to modules? Ok. Will remove it. > > +/* > > + * fallocate() modes > > + */ > > +#define FA_ALLOCATE0x1 > > +#define FA_DEALLOCATE 0x2 > > Now those aren't in posix. They should be documented, along with their > expected semantics. Will add a comment describing the role of these modes. > > #ifdef __KERNEL__ > > > > #include > > @@ -1125,6 +1131,7 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *, int, loff_t, loff_t); > > I really do think it's better to put the variable names in definitions such > as this. Especially when we have two identically-typed variables next to > each other like that. Quick: which one is the offset and which is the > length? Ok. Will add the variable names here. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html