Re: [PATCH 0/5] fallocate system call
On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote:
> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
>
> IIRC, the argument for FA_ALLOCATE changing file size is that
> posix_fallocate() is supposed to change the file size.

But it's not posix_fallocate; it's something more generic. glibc can
do posix_fallocate using truncate + fallocate.

> Note that the way XFS implements growing the file size after the
> allocation is via a truncate

What's wrong with that? That seems very reasonable.

> That's what I did, because otherwise you'd use ftruncate64().
> Without documented behaviour or an ext4 implementation, I have to
> ask what it's supposed to do, though ;)

How many *real* users are there for ext4? Why does 'what ext4 does'
define 'the semantics'? Surely semantics should be decided either by
precedent (if there is an existing relevant userbase) or by sensible
thought and some debate?
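For illustration, that glibc composition might look like this sketch
(assuming the size-preserving FA_ALLOCATE semantics argued for here;
the fallocate() wrapper itself is hypothetical, not an existing glibc
call):

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical thin wrapper around the proposed syscall. */
extern long fallocate(int fd, int mode, off_t offset, off_t len);
#define FA_ALLOCATE	0x1

int my_posix_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;

	/* Allocate the blocks; under the semantics argued for here
	 * this does not move i_size, even past EOF. */
	if (fallocate(fd, FA_ALLOCATE, offset, len))
		return errno;

	/* ...so glibc extends the size itself, per POSIX. */
	if (fstat(fd, &st))
		return errno;
	if (offset + len > st.st_size && ftruncate(fd, offset + len))
		return errno;
	return 0;
}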
Re: [PATCH 0/5] fallocate system call
On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
>
> > For FA_ALLOCATE, it's supposed to change the file size if we
> > allocate past EOF, right?
>
> I would argue no. Use truncate for that.

I'm going from the ext4 implementation because the semantics have not
been documented yet. IIRC, the argument for FA_ALLOCATE changing file
size is that posix_fallocate() is supposed to change the file size.

I think that having a mode for real preallocation and another for
posix_fallocate is a valid thing to do...

Note that the way XFS implements growing the file size after the
allocation is via a truncate.

> > For FA_DEALLOCATE, does it change the filesize at all?
>
> Same as above.
>
> > Or does it just punch a hole in the file?
>
> Yes.

That's what I did, because otherwise you'd use ftruncate64(). Without
documented behaviour or an ext4 implementation, I have to ask what
it's supposed to do, though ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 0/5] fallocate system call
On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:

> For FA_ALLOCATE, it's supposed to change the file size if we
> allocate past EOF, right?

I would argue no. Use truncate for that.

> For FA_DEALLOCATE, does it change the filesize at all?

Same as above.

> Or does it just punch a hole in the file?

Yes.

> FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
> not change file size) so we can preallocate beyond EOF for apps
> which use O_APPEND (i.e. changing file size would cause problems for
> them).

FA_ALLOCATE should be able to allocate past EOF, I would argue.
[PATCH 1/1] fs: add 4th case to do_path_lookup
Signed-off-by: Josef 'Jeff' Sipek <[EMAIL PROTECTED]>

diff --git a/fs/namei.c b/fs/namei.c
index 2995fba..1516a9b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1125,6 +1125,10 @@ static int fastcall do_path_lookup(int dfd, const char *name,
 		nd->mnt = mntget(fs->rootmnt);
 		nd->dentry = dget(fs->root);
 		read_unlock(&fs->lock);
+	} else if (flags & LOOKUP_ONE) {
+		/* nd->mnt and nd->dentry already set, just grab references */
+		mntget(nd->mnt);
+		dget(nd->dentry);
 	} else if (dfd == AT_FDCWD) {
 		read_lock(&fs->lock);
 		nd->mnt = mntget(fs->pwdmnt);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 92b422b..aa89d97 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -48,6 +48,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
  *  - internal "there are more path compnents" flag
  *  - locked when lookup done with dcache_lock held
  *  - dentry cache is untrusted; force a real lookup
+ *  - lookup path from given dentry/vfsmount pair
  */
 #define LOOKUP_FOLLOW		 1
 #define LOOKUP_DIRECTORY	 2
@@ -55,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_PARENT		16
 #define LOOKUP_NOALT		32
 #define LOOKUP_REVAL		64
+#define LOOKUP_ONE		128
 /*
  * Intent data
  */
[PATCH 0/1] [RFC] New mode for path_lookup (V1)
Stackable file systems frequently need to look up paths or path
components starting from an arbitrary point in the namespace
(identified by a dentry and a vfsmount). Currently, such file systems
use lookup_one_len, which is frowned upon [1] as it does not pass the
lookup intent along; not passing a lookup intent can, for example,
trigger BUG_ON's when stacking on top of NFSv4.

The following patch introduces a new mode to path_lookup to allow
lookups to start from an arbitrary point in the namespace. This
approach was suggested by Christoph Hellwig at the Linux Storage &
Filesystem workshop in February of this year.

One indicates that the lookup should be relative to a dentry-vfsmnt
pair by using the LOOKUP_ONE flag. For example, the following snippet
of code looks up "pathcomponent" in a directory pointed to by
parent_{dentry,vfsmnt}:

	nd.dentry = parent_dentry;
	nd.mnt = parent_vfsmnt;

	err = path_lookup("pathcomponent", LOOKUP_ONE, &nd);
	if (!err) {
		/* exists */

		...

		/* once done, release the references */
		path_release(&nd);
	} else if (err == -ENOENT) {
		/* doesn't exist */
	} else {
		/* other error */
	}

VFS functions such as lookup_create can be used on the nameidata
structure to pass the create intent to the file system.

Currently, there is no easy way to pass the LOOKUP_OPEN intent. The
proper way would be to call open_namei. We'd like to get comments
about what's necessary to make stackable file systems do lookups
right - this includes potential changes to open_namei.

Josef 'Jeff' Sipek.

[1] http://marc.info/?l=linux-kernel&m=117343337823760&w=2
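For the create case, usage would presumably mirror sys_mknodat. A
sketch (whether LOOKUP_ONE composes with LOOKUP_PARENT like this is
exactly the sort of thing we'd like comments on; example_create is a
hypothetical caller, everything else is existing VFS API plus the new
flag):

/*
 * Sketch only: create "name" in the directory given by
 * (parent_dentry, parent_vfsmnt), passing the create intent along.
 */
static int example_create(struct dentry *parent_dentry,
			  struct vfsmount *parent_vfsmnt,
			  const char *name, int mode)
{
	struct nameidata nd;
	struct dentry *dentry;
	int err;

	nd.dentry = parent_dentry;
	nd.mnt = parent_vfsmnt;

	err = path_lookup(name, LOOKUP_ONE | LOOKUP_PARENT, &nd);
	if (err)
		return err;

	/* lookup_create() returns the negative child dentry with the
	 * parent's i_mutex held, even on error */
	dentry = lookup_create(&nd, 0);
	err = PTR_ERR(dentry);
	if (!IS_ERR(dentry)) {
		err = vfs_create(nd.dentry->d_inode, dentry, mode, &nd);
		dput(dentry);
	}
	mutex_unlock(&nd.dentry->d_inode->i_mutex);
	path_release(&nd);
	return err;
}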
[PATCH] Add preallocation beyond EOF to fallocate
Add new mode to ->fallocate() to allow allocation to occur beyond
the current EOF without changing the file size. Implement in XFS
->fallocate() vector.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 fs/xfs/linux-2.6/xfs_iops.c |    8 +++++---
 include/linux/fs.h          |    1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:09:48.233375382 +1000
@@ -833,11 +833,13 @@ xfs_vn_fallocate(
 			VNODE_POSITION_XFS);
 
 	switch (mode) {
-	case FA_ALLOCATE:	/* changes file size */
-		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
-						&bf, 0, NULL, 0);
+	case FA_ALLOCATE:	/* changes file size */
 		if (offset + len > i_size_read(inode))
 			do_setattr = offset + len;
+		/* FALL THROUGH */
+	case FA_PREALLOCATE:	/* no filesize change */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
 		break;
 	case FA_DEALLOCATE:
 		/* XXX: changes file size? this just punches a hole */
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h	2007-04-27 18:48:01.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/fs.h	2007-04-30 11:08:05.790903661 +1000
@@ -269,6 +269,7 @@ extern int dir_notify_enable;
  */
 #define FA_ALLOCATE	0x1
 #define FA_DEALLOCATE	0x2
+#define FA_PREALLOCATE	0x3
 
 #ifdef __KERNEL__
[PATCH] XFS ->fallocate() support
Add XFS support for ->fallocate() vector.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 fs/xfs/linux-2.6/xfs_iops.c |   48 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-02-07 13:24:32.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.225095992 +1000
@@ -812,6 +812,53 @@ xfs_vn_removexattr(
 	return namesp->attr_remove(vp, attr, xflags);
 }
 
+STATIC long
+xfs_vn_fallocate(
+	struct inode	*inode,
+	int		mode,
+	loff_t		offset,
+	loff_t		len)
+{
+	long		error = -EOPNOTSUPP;
+	bhv_vnode_t	*vp = vn_from_inode(inode);
+	bhv_desc_t	*bdp;
+	int		do_setattr = 0;
+	xfs_flock64_t	bf;
+
+	bf.l_whence = 0;
+	bf.l_start = offset;
+	bf.l_len = len;
+
+	bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS,
+				VNODE_POSITION_XFS);
+
+	switch (mode) {
+	case FA_ALLOCATE:	/* changes file size */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
+		if (offset + len > i_size_read(inode))
+			do_setattr = offset + len;
+		break;
+	case FA_DEALLOCATE:
+		/* XXX: changes file size? this just punches a hole */
+		error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP,
+						&bf, 0, NULL, 0);
+		break;
+	default:
+		break;
+	}
+
+	/* Change file size if needed */
+	if (!error && do_setattr) {
+		bhv_vattr_t	va;
+
+		va.va_mask = XFS_AT_SIZE;
+		va.va_size = do_setattr;
+		error = bhv_vop_setattr(vp, &va, 0, NULL);
+	}
+
+	return error;
+}
 
 struct inode_operations xfs_inode_operations = {
 	.permission		= xfs_vn_permission,
 	.setattr		= xfs_vn_setattr,
@@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat
 	.getxattr		= xfs_vn_getxattr,
 	.listxattr		= xfs_vn_listxattr,
 	.removexattr		= xfs_vn_removexattr,
+	.fallocate		= xfs_vn_fallocate,
 };
 
 struct inode_operations xfs_dir_inode_operations = {
[PATCH] ia64 fallocate syscall
ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S
===================================================================
--- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S	2007-03-29 19:01:41.000000000 +1000
+++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S	2007-04-27 19:12:56.829396661 +1000
@@ -1612,5 +1612,6 @@ sys_call_table:
 	data8 sys_vmsplice
 	data8 sys_ni_syscall		// reserved for move_pages
 	data8 sys_getcpu
+	data8 sys_fallocate
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h
===================================================================
--- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h	2007-03-29 19:03:37.000000000 +1000
+++ 2.6.x-xfs-new/include/asm-ia64/unistd.h	2007-04-27 19:18:18.215568425 +1000
@@ -293,11 +293,12 @@
 #define __NR_vmsplice			1302
 /* 1303 reserved for move_pages */
 #define __NR_getcpu			1304
+#define __NR_fallocate			1305
 
 #ifdef __KERNEL__
 
-#define NR_syscalls			281 /* length of syscall table */
+#define NR_syscalls			282 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> What's your goal here? Is it to speed up fsck's after an unclean
> shutdown to the point where you don't need to use a journal or some
> kind of soft updates scheme?
>
> Is it to speed up fsck's after the kernel has detected some kind of
> internal consistency error?
>
> Is it to speed up fsck's after you no longer have confidence that
> random blocks in the filesystem may have gotten corrupted, due to a
> suspend/resume bug, hard drive failure, reported CRC/parity errors
> when writing to the device, or reports of massive ECC failures in
> your memory that could have caused random blocks to have been written
> with multiple bit flips?

Mostly this last.

> The first is relatively easy, but as you move down the list, things
> get progressively harder, since it's no longer possible to use a per
> tile "clean bit" to assume that you get to skip checking that
> particular tile or chunk.

Tile headers contain a timestamp that lets you say "let's revisit
every clean block that hasn't been checked in the last 6 months". And
the blocks containing the clean bits are themselves covered by CRC!

> > Divide disk into a bunch of tiles. For each tile, allocate a one
> > block tile header that contains (inode, checksum) pairs for each
> > block in the tile. Unused blocks get marked inode -1, filesystem
> > metadata blocks -2. The first element contains a last-clean
> > timestamp, a clean flag and a checksum for the block itself. For 4K
> > blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> > with ~.2% overhead.
>
> So what happens for files that are bigger than 2MB? Presumably they
> consist of blocks that must come from more than one tile, right? So
> is an inode allowed to reference blocks outside of its tile? Or does
> an inode that needs to span multiple tiles have a local sub-inode in
> each tile, with back pointers to the parent inode?

We don't need any form of sub-inode or continuation inode, which is
why it's not mentioned. This is one of two main differences from
chunkfs. The other is reverse maps (aka back pointers) for
blocks -> inodes and inodes -> directories, which obviate the need
for large amounts of memory to check for collisions.

When you check a tile, you check all the inodes that "cover" that
tile (ie have overlapping data or metadata). So if a large file spans
many tiles, you end up reading many tiles to check one, and a
completely naive checker would check the same file many times in the
course of a full scan. Total work would then be multiplied by the
average tile-crossing rate for files.

However most of this work can be trivially avoided, of course. You
can keep a small in-memory LRU cache of recently-checked inodes to
avoid having to repeatedly check files in neighboring blocks as your
scan progresses. This LRU doesn't need to be very big at all to be
quite effective, as the sketch below illustrates.
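A rough sketch of that scan loop, with every helper and constant
hypothetical (this is the shape of the algorithm, not a real on-disk
format):

#include <stdint.h>

#define LRU_SLOTS	64	/* "doesn't need to be very big" */

struct checked_lru {
	uint32_t ino[LRU_SLOTS];
	int	 next;			/* round-robin eviction */
};

static int lru_contains(struct checked_lru *lru, uint32_t ino)
{
	int i;

	for (i = 0; i < LRU_SLOTS; i++)
		if (lru->ino[i] == ino)
			return 1;
	return 0;
}

static void lru_add(struct checked_lru *lru, uint32_t ino)
{
	lru->ino[lru->next] = ino;
	lru->next = (lru->next + 1) % LRU_SLOTS;
}

static int check_tile(struct tile *t, struct checked_lru *lru)
{
	int b;

	if (tile_clean_and_current(t))
		return 0;
	if (!tile_header_crc_ok(t))
		return -1;

	for (b = 0; b < BLOCKS_PER_TILE; b++) {
		uint32_t ino = tile_block_owner(t, b);

		if (ino == INO_FREE || ino == INO_META)
			continue;
		if (lru_contains(lru, ino))
			continue;	/* whole inode verified moments ago */

		/* Walk the whole inode: forward pointers, reverse
		 * pointers and per-block CRCs must all agree. */
		if (check_inode(ino))
			return -1;
		lru_add(lru, ino);
	}
	tile_mark_clean(t);
	return 0;
}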
This does mean that our time to make progress on a check is bounded
at the top by the size of our largest file. If we have a degenerate
filesystem filled with a single file, this will in fact take as long
as a conventional fsck. If your filesystem has, say, 100 roughly
equally-sized files, you're back in Chunkfs territory.

Note that even though our time usage scales with max file size, our
memory usage is bound by a small constant. We only need one tile
header in memory plus what it takes to walk an inode's data. The
block -> inode reverse map means that we don't have to keep a
filesystem- or chunk-sized map to spot block collisions. Each block
knows where it belongs and spotting broken links is easy.

So we should have no trouble checking an exabyte-sized filesystem on
a 4MB box. Even if it has one exabyte-sized file! We check the first
tile, see that it points to our file, then iterate through that file,
checking that the forward and reverse pointers for each block match
and all CRCs match, etc. We cache the file's inode as clean, finish
checking anything else in the first tile, then mark it clean. When we
get to the next tile (and the next billion after that!), we notice
that each block points back to our cached inode and skip rechecking
it.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> Hopefully the clean bitmap block is protected by a checksum. After
> all, the smaller set of clean bitmap blocks is going to be constantly
> updated as tiles get dirtied, and then cleaned. What if they get
> corrupted? How
Re: [PATCH 0/5] fallocate system call
On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote:
> Based on the discussion, this new patchset uses the following as the
> interface for the fallocate() system call:
>
> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Ok, so now for the hard questions - what are the semantics of
FA_ALLOCATE and FA_DEALLOCATE?

For FA_ALLOCATE, it's supposed to change the file size if we allocate
past EOF, right? What's the return value supposed to be? Zero for
success, error otherwise? Does this update a/m/ctime at all? How
persistent is this preallocation? Should it be there "forever" or for
the lifetime of the currently open fd that it was preallocated on?

For FA_DEALLOCATE, does it change the filesize at all? Or does it
just punch a hole in the file? If it does change file size, what
happens when you punch out preallocation beyond EOF? What's the
return value supposed to be?

> Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> preallocation and deallocation of preallocated blocks respectively.
> More modes can be added, when required.

FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
not change file size) so we can preallocate beyond EOF for apps which
use O_APPEND (i.e. changing the file size would cause problems for
them).

> ToDos:
> =====
> 1> Implementation on other architectures (other than i386, x86_64,
>    ppc64 and s390(x))

I'll have ia64 soon.

> 2> A generic file system operation to handle fallocate
>    (generic_fallocate), for filesystems that do _not_ have the
>    fallocate inode operation implemented.
> 3> Changes to glibc,
>    a) to support fallocate() system call
>    b) so that posix_fallocate() and posix_fallocate64() call
>       fallocate() system call
> 4> Changes to XFS to implement the fallocate inode operation

And that's what I'm doing now, hence all the questions ;)

BTW, do you have a test program for this, or will I need to write one
myself?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
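A minimal test sketch for the proposed syscall might look like the
following (assuming a 64-bit box; the FA_* values are from this
patchset and the syscall number is the ia64 one from the patch in this
thread - both need adjusting per-arch):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/syscall.h>

/* Values from the patches in this thread. */
#define FA_ALLOCATE	0x1
#define FA_DEALLOCATE	0x2

#ifndef __NR_fallocate
#define __NR_fallocate	1305	/* ia64 value; other arches differ */
#endif

int main(int argc, char **argv)
{
	struct stat st;
	long long offset, len;
	int fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <offset> <len>\n", argv[0]);
		return 1;
	}
	offset = atoll(argv[2]);
	len = atoll(argv[3]);

	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 64-bit args passed directly; 32-bit arches would need the
	 * usual hi/lo register-pair games, glossed over here. */
	if (syscall(__NR_fallocate, fd, FA_ALLOCATE, offset, len))
		perror("fallocate");

	/* Did the size change, and did blocks actually get allocated? */
	if (fstat(fd, &st) == 0)
		printf("size=%lld blocks=%lld\n",
		       (long long)st.st_size, (long long)st.st_blocks);

	close(fd);
	return 0;
}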
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

What's your goal here? Is it to speed up fsck's after an unclean
shutdown to the point where you don't need to use a journal or some
kind of soft updates scheme?

Is it to speed up fsck's after the kernel has detected some kind of
internal consistency error?

Is it to speed up fsck's after you no longer have confidence that
random blocks in the filesystem may have gotten corrupted, due to a
suspend/resume bug, hard drive failure, reported CRC/parity errors
when writing to the device, or reports of massive ECC failures in your
memory that could have caused random blocks to have been written with
multiple bit flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to use a per
tile "clean bit" to assume that you get to skip checking that
particular tile or chunk.

> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

So what happens for files that are bigger than 2MB? Presumably they
consist of blocks that must come from more than one tile, right? So is
an inode allowed to reference blocks outside of its tile? Or does an
inode that needs to span multiple tiles have a local sub-inode in each
tile, with back pointers to the parent inode?

Note that both design paths have some serious tradeoffs. If you allow
an inode to span multiple tiles, now you can't check the block
allocation data structures without scanning all of the tiles involved.
If you have sub-inodes, now you have to have bidirectional pointers
and you have to validate those pointers after validating all of the
individual tiles.

This is one of the toughest aspects of either the chunkfs or tilefs
design, and when we discussed chunkfs at past filesystem workshops, we
didn't come to any firm conclusions about the best way to solve it,
except to acknowledge that it is a hard problem.

My personal inclination is to use substantially bigger chunks than the
2MB that you've proposed, and make each chunk more like 2 or 4 GB, and
to enforce a rule which says an inode in a chunk can only reference
blocks in that local chunk, and to try like mad to keep directories
referencing inodes in a chunk, and to keep inodes referencing blocks
within that chunk. When a file is bigger than a chunk, then you will
be forced to use indirection pointers that basically say, "for offsets
2GB-4GB, reference inode <i> in chunk <c>, and for 4GB-8GB, check out
inode <i'> in chunk <c'>", etc.

I won't say that this is definitely the best way to do things, but I
note that you haven't really addressed this design point, and there
are no obvious best ways of handling this.

> [Note that CRCs are optional so we can cut the overhead in half.
> I choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRCs are just fine. If you
want better protection against accidental corruption, then the answer
is to use a bigger CRC. Using a cryptographic hash like SHA-n is pure
overkill unless you're trying to design protection against a malicious
attacker, in which case you've got a much bigger set of problems that
you have to address first --- you don't get a cryptographically secure
filesystem by replacing a CRC with a SHA-n hash function.

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum. After
all, the smaller set of clean bitmap blocks is going to be constantly
updated as tiles get dirtied, and then cleaned. What if they get
corrupted? How does the checker notice? And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

> Checking a tile:
>
>  Read the tile
>  If clean and current, we're done.
>  Check the tile header checksum
>  Check the checksum on each block in the tile
>  Check that metadata blocks are metadata
>  Check that in
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 05:58:48PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> >
> > Some things we need to check during fsck:
> >
> >  all directories point to in-use inodes
> >  all in-use inodes are referred to by directories
> >  all inodes in use are marked in use
> >  all free inodes are marked free
> >  all inodes point to in-use blocks
> >  all inode refcounts are correct
> >  all inode block counts are correct
> >  free inode count is correct
> >
> >  no blocks are used twice
> >  all used blocks are marked as used
> >  all free blocks are marked as free
> >  optional: all block contents are correct
> >
> >  statfs information matches filesystem content
>
> This one may still require a full fsck in your current approach. One
> of the good aspects of ChunkFS (assuming my understanding matches
> reality) is to have per-chunk counters for free blocks, free inodes,
> etc. For a fast fsck you would need to have these counters per-unit
> as well. It doesn't matter whether your unit is a tile, chunk,
> blockgroup or karboozel.

Yes. This is probably not a very big problem. Once we've checked all
the tiles in a block group and they're clean, we know that the block
and inode bitmaps are correct, so we can doublecheck the
per-blockgroup counters.

One obvious approach is simply batching a blockgroup's worth of tiles
at a time (if tile header entries are 64 bits, then there are 64 tiles
in a block group). Then you keep running totals and check at the end.
Memory usage is still O(single tile).

Summing blockgroup totals across a million block groups... we'll
probably want something like:

> Having some tree structure for these counters would also help. Statfs
> requires adding all counters for all units. Smaller units speed up
> fsck but slow down statfs. With a tree statfs can be O(log(n)).

--
Mathematics is the supreme nostalgia of our time.
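For illustration, the batching described above might look like this
sketch (all names hypothetical; check_tile is the per-tile check
described earlier in the thread):

#include <stdint.h>

#define TILES_PER_GROUP	64

struct fsck_totals {
	uint64_t free_blocks;
	uint64_t free_inodes;
};

static int check_block_group(struct block_group *bg)
{
	struct fsck_totals t = { 0, 0 };
	int i;

	for (i = 0; i < TILES_PER_GROUP; i++) {
		struct tile *tile = group_tile(bg, i);

		if (check_tile(tile))		/* per-tile check */
			return -1;
		tile_count_free(tile, &t);	/* accumulate free counts */
	}

	/*
	 * All tiles are clean, so the bitmaps are now trustworthy;
	 * the group's summary counters must match what we counted.
	 */
	if (t.free_blocks != bg_free_blocks(bg) ||
	    t.free_inodes != bg_free_inodes(bg))
		return -1;
	return 0;
}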
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, 29 April 2007 18:34:59 +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
>
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> Problem I see is that your scheme doesn't support metadata checksums only.

Why? It is fairly simple to run many different schemes with this:
- Generate checksums for everything, compare for each access.
- Generate checksums for everything, only compare metadata checksums.
- Generate checksums for everything, only compare at fsck time.
- Generate metadata checksums, use 0x_ as data "checksums".
- Not generate any checksums.

Users without checksums would still pay the .1% space overhead, sure.
I'd bet they already pay more for unused inodes.

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 06:34:59PM +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
>
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> Problem I see is that your scheme doesn't support metadata checksums
> only. IMHO those are the most interesting because they have the
> potential to be basically zero cost, unlike full data checksumming.
> And doing metadata checksums is enough to handle the fsck problem.

It does. Checksums are not in any way an essential part of this
approach. We can checksum nothing, just metadata, or everything.
Further, we might find ourselves wanting to change the setting, so
we'd probably just want to leave slots in the tile headers for CRCs
always. So for now, let's imagine three mount options: nocrc, crc,
and metacrc.

> I'm sure there are many cases where full checksumming makes sense
> too, but it shouldn't be forced on everybody because it will slow
> down some important workloads (like O_DIRECT).

Right.

> Metadata checksums would be best just put into the file system's
> data structures. Essentially every object (inode, extent, directory
> entry, super block) should have a checksum that can be incrementally
> updated.

Putting CRCs on extents is trouble. You can't validate them without
reading the whole extent, and writing into the middle of an extent is
also hard.

But this is kind of a rat-hole. The more immediate question is: do the
reverse mapping pointers make sense for fsck purposes?

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
>
> Some things we need to check during fsck:
>
>  all directories point to in-use inodes
>  all in-use inodes are referred to by directories
>  all inodes in use are marked in use
>  all free inodes are marked free
>  all inodes point to in-use blocks
>  all inode refcounts are correct
>  all inode block counts are correct
>  free inode count is correct
>
>  no blocks are used twice
>  all used blocks are marked as used
>  all free blocks are marked as free
>  optional: all block contents are correct
>
>  statfs information matches filesystem content

This one may still require a full fsck in your current approach. One
of the good aspects of ChunkFS (assuming my understanding matches
reality) is to have per-chunk counters for free blocks, free inodes,
etc. For a fast fsck you would need to have these counters per-unit
as well. It doesn't matter whether your unit is a tile, chunk,
blockgroup or karboozel.

Having some tree structure for these counters would also help. Statfs
requires adding all counters for all units. Smaller units speed up
fsck but slow down statfs. With a tree statfs can be O(log(n)).

Jörn

--
"Translations are and will always be problematic. They inflict
violence upon two languages." (translation from German)
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, 29 April 2007 07:57:18 -0500, Matt Mackall wrote:
> On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
>
> Thanks. I think this is a somewhat more direct solution than ChunkFS,
> but a) I haven't followed ChunkFS closely and b) I haven't been
> thinking about fsck very long, so this is still just presented as
> fodder for discussion.

After thinking about it for a while, I believe you have a problem as
well. Will cover that in a later mail.

> > You should add a 64bit fpos field. That allows you to easily check
> > for addressing errors.
>
> Describe the scenario where this manifests, please.

A block with a checksum is written somewhere. The block matches its
checksum, but a bit flipped on the address bus. To catch this you have
to match 1) block contents, 2) checksum and 3) inode tree.

ZFS does it by having the checksum next to the block pointers in
indirect blocks. LogFS does it by having a block header with a
checksum and an (ino, pos, level) tuple. Level is 0 for data blocks,
1 for indirect blocks, etc. In the very rare case that a data block
gets written to the offset belonging to one of its indirect blocks or
vice versa, this is necessary to catch the error. LogFS needs it for
different reasons and I wouldn't mind if you just ignore that detail.

> It just occurred to me that my approach is analogous to object-based
> rmap on the filesystem. The fpos proposal I think makes it more like
> the original per-pte rmap. This is not to say I think the same
> lessons apply, as I'm not clear what you're proposing yet.
>
> Ooh.. I also just realized the tile approach allows much easier
> defragging/shrinking of large filesystems because finding the
> associated inode for blocks you want to move is fast.

It also allows for a background scan of the filesystem. If data is
rotting on the medium, the only chance to detect this is by reading
and checking it. Lots of data is never read from userspace, so it can
accumulate lots of errors. Without rmap the filesystem cannot verify
random blocks.

Jörn

--
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
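Concretely, the suggestion turns each 8-byte tile header entry into
something like the sketch below (names hypothetical). Note that the
64-bit fpos doubles the entry to 16 bytes, so a 4K header block then
covers 256 blocks instead of 512 and the ~.2% overhead roughly
doubles:

#include <stdint.h>

#define TILE_INO_FREE	((uint32_t)-1)	/* unused block */
#define TILE_INO_META	((uint32_t)-2)	/* filesystem metadata */

struct tile_entry {
	uint32_t ino;	/* owning inode, or the markers above */
	uint32_t crc;	/* CRC32 of the block contents */
	uint64_t fpos;	/* byte offset of this block within its file */
};

/* Hypothetical forward-map helper: which block does inode 'ino'
 * map at file offset 'fpos'? */
extern uint64_t inode_block_at(uint32_t ino, uint64_t fpos);

/*
 * A misdirected write lands in a block whose contents still match
 * their checksum; only the three-way match of contents, checksum and
 * inode tree catches it.
 */
static int block_backref_ok(const struct tile_entry *te, uint64_t blocknr)
{
	if (te->ino == TILE_INO_FREE || te->ino == TILE_INO_META)
		return 1;	/* nothing to cross-check */
	return inode_block_at(te->ino, te->fpos) == blocknr;
}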
Re: [RFC] TileFS - a proposal for scalable integrity checking
Matt Mackall <[EMAIL PROTECTED]> writes:

> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

Problem I see is that your scheme doesn't support metadata checksums
only. IMHO those are the most interesting because they have the
potential to be basically zero cost, unlike full data checksumming.
And doing metadata checksums is enough to handle the fsck problem.

I'm sure there are many cases where full checksumming makes sense too,
but it shouldn't be forced on everybody because it will slow down some
important workloads (like O_DIRECT).

Metadata checksums would be best just put into the file system's data
structures. Essentially every object (inode, extent, directory entry,
super block) should have a checksum that can be incrementally updated.

-Andi
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> >
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also
> > be very small, while greatly improving filesystem integrity.
>
> I like it a lot. Until now it appears to solve more problems and
> cause fewer new problems than ChunkFS.

Thanks. I think this is a somewhat more direct solution than ChunkFS,
but a) I haven't followed ChunkFS closely and b) I haven't been
thinking about fsck very long, so this is still just presented as
fodder for discussion.

> > [...]
> >
> > Divide disk into a bunch of tiles. For each tile, allocate a one
> > block tile header that contains (inode, checksum) pairs for each
> > block in the tile. Unused blocks get marked inode -1, filesystem
> > metadata blocks -2. The first element contains a last-clean
> > timestamp, a clean flag and a checksum for the block itself. For 4K
> > blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> > with ~.2% overhead.
>
> You should add a 64bit fpos field. That allows you to easily check
> for addressing errors.

Describe the scenario where this manifests, please.

It just occurred to me that my approach is analogous to object-based
rmap on the filesystem. The fpos proposal I think makes it more like
the original per-pte rmap. This is not to say I think the same lessons
apply, as I'm not clear what you're proposing yet.

Ooh.. I also just realized the tile approach allows much easier
defragging/shrinking of large filesystems because finding the
associated inode for blocks you want to move is fast.

> > [Note that CRCs are optional so we can cut the overhead in half. I
> > choose CRCs here because they're capable of catching the vast
> > majority of accidental corruptions at a small cost and mostly serve
> > to protect against errors not caught by on-disk ECC (eg cable
> > noise, kernel bugs, cosmic rays). Replacing CRCs with a stronger
> > hash like SHA-n is perfectly doable.]
>
> The checksum cannot protect against a maliciously prepared medium
> anyway, so crypto makes little sense.

In a past life, I wrote a device mapper layer that kept a
cryptographic hash per cluster of the underlying device, with a
top-level digital signature of said hashes. That gets you pretty close
to tamper-proof, in theory. Practice of course is a different matter,
so don't try this at home.

As it happens, this earlier system was the inspiration for the tile
idea, the integrity parts of which have been kicking around in my head
since I heard ZFS was tracking checksums.

> Crc32 can provably (if you trust those who did the proof) detect all
> 1, 2 and 3-bit errors and has a 1:2^32 chance of missing any
> remaining errors. That is fairly hard to improve on.

Indeed.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> You can and possibly should organize this as a tree, similar to a
> file. One bit at the lowest level marks a tile as dirty. One bit for
> each indirect block pointer marks some tiles behind the pointer as
> dirty. That scales logarithmically to any filesystem size.

Right. 3 levels takes us to 512TB, etc.

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
>
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

I like it a lot. Until now it appears to solve more problems and cause
fewer new problems than ChunkFS.

> [...]
>
> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

You should add a 64bit fpos field. That allows you to easily check for
addressing errors.

> [Note that CRCs are optional so we can cut the overhead in half. I
> choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

The checksum cannot protect against a maliciously prepared medium
anyway, so crypto makes little sense.

Crc32 can provably (if you trust those who did the proof) detect all
1, 2 and 3-bit errors and has a 1:2^32 chance of missing any remaining
errors. That is fairly hard to improve on.

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

You can and possibly should organize this as a tree, similar to a
file. One bit at the lowest level marks a tile as dirty. One bit for
each indirect block pointer marks some tiles behind the pointer as
dirty. That scales logarithmically to any filesystem size.

Jörn

--
I don't understand it. Nobody does.
-- Richard P. Feynman