Re: [PATCH 0/5] fallocate system call
On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote:
> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
>
> IIRC, the argument for FA_ALLOCATE changing file size is that
> posix_fallocate() is supposed to change the file size.

But it's not posix_fallocate; it's something more generic. glibc can
do posix_fallocate using truncate + fallocate.

> Note that the way XFS implements growing the file size after the
> allocation is via a truncate

What's wrong with that? That seems very reasonable.

> That's what I did, because otherwise you'd use ftruncate64().
> Without documented behaviour or an ext4 implementation, I have to
> ask what it's supposed to do, though ;)

How many *real* users are there for ext4? Why does 'what ext4 does'
define 'the semantics'? Surely semantics should be decided either by
precedent (if there is an existing relevant userbase) or by sensible
thought and some debate?
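For illustration, that glibc composition might look like this sketch
(assuming the size-preserving FA_ALLOCATE semantics argued for here;
the fallocate() wrapper itself is hypothetical, not an existing glibc
call):

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical thin wrapper around the proposed syscall. */
extern long fallocate(int fd, int mode, off_t offset, off_t len);
#define FA_ALLOCATE	0x1

int my_posix_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;

	/* Allocate the blocks; under the semantics argued for here
	 * this does not move i_size, even past EOF. */
	if (fallocate(fd, FA_ALLOCATE, offset, len))
		return errno;

	/* ...so glibc extends the size itself, per POSIX. */
	if (fstat(fd, &st))
		return errno;
	if (offset + len > st.st_size && ftruncate(fd, offset + len))
		return errno;
	return 0;
}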
Re: [PATCH 0/5] fallocate system call
On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
>
> > For FA_ALLOCATE, it's supposed to change the file size if we
> > allocate past EOF, right?
>
> I would argue no. Use truncate for that.

I'm going from the ext4 implementation because the semantics have not
been documented yet. IIRC, the argument for FA_ALLOCATE changing file
size is that posix_fallocate() is supposed to change the file size.

I think that having a mode for real preallocation and another for
posix_fallocate is a valid thing to do...

Note that the way XFS implements growing the file size after the
allocation is via a truncate.

> > For FA_DEALLOCATE, does it change the filesize at all?
>
> Same as above.
>
> > Or does it just punch a hole in the file?
>
> Yes.

That's what I did, because otherwise you'd use ftruncate64(). Without
documented behaviour or an ext4 implementation, I have to ask what
it's supposed to do, though ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 0/5] fallocate system call
On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:

> For FA_ALLOCATE, it's supposed to change the file size if we
> allocate past EOF, right?

I would argue no. Use truncate for that.

> For FA_DEALLOCATE, does it change the filesize at all?

Same as above.

> Or does it just punch a hole in the file?

Yes.

> FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
> not change file size) so we can preallocate beyond EOF for apps
> which use O_APPEND (i.e. changing file size would cause problems for
> them).

FA_ALLOCATE should be able to allocate past EOF, I would argue.
[PATCH 1/1] fs: add 4th case to do_path_lookup
Signed-off-by: Josef 'Jeff' Sipek <[EMAIL PROTECTED]>

diff --git a/fs/namei.c b/fs/namei.c
index 2995fba..1516a9b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1125,6 +1125,10 @@ static int fastcall do_path_lookup(int dfd, const char *name,
 		nd->mnt = mntget(fs->rootmnt);
 		nd->dentry = dget(fs->root);
 		read_unlock(&fs->lock);
+	} else if (flags & LOOKUP_ONE) {
+		/* nd->mnt and nd->dentry already set, just grab references */
+		mntget(nd->mnt);
+		dget(nd->dentry);
 	} else if (dfd == AT_FDCWD) {
 		read_lock(&fs->lock);
 		nd->mnt = mntget(fs->pwdmnt);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 92b422b..aa89d97 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -48,6 +48,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
  *  - internal "there are more path compnents" flag
  *  - locked when lookup done with dcache_lock held
  *  - dentry cache is untrusted; force a real lookup
+ *  - lookup path from given dentry/vfsmount pair
  */
 #define LOOKUP_FOLLOW		 1
 #define LOOKUP_DIRECTORY	 2
@@ -55,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_PARENT		16
 #define LOOKUP_NOALT		32
 #define LOOKUP_REVAL		64
+#define LOOKUP_ONE		128
 /*
  * Intent data
  */
[PATCH 0/1] [RFC] New mode for path_lookup (V1)
Stackable file systems frequently need to look up paths or path
components starting from an arbitrary point in the namespace
(identified by a dentry and a vfsmount). Currently, such file systems
use lookup_one_len, which is frowned upon [1] as it does not pass the
lookup intent along; not passing a lookup intent can, for example,
trigger BUG_ON's when stacking on top of NFSv4.

The following patch introduces a new mode to path_lookup to allow
lookups to start from an arbitrary point in the namespace. This
approach was suggested by Christoph Hellwig at the Linux Storage &
Filesystem workshop in February of this year.

One indicates that the lookup should be relative to a dentry-vfsmnt
pair by using the LOOKUP_ONE flag. For example, the following snippet
of code looks up "pathcomponent" in a directory pointed to by
parent_{dentry,vfsmnt}:

	nd.dentry = parent_dentry;
	nd.mnt = parent_vfsmnt;

	err = path_lookup("pathcomponent", LOOKUP_ONE, &nd);
	if (!err) {
		/* exists */

		...

		/* once done, release the references */
		path_release(&nd);
	} else if (err == -ENOENT) {
		/* doesn't exist */
	} else {
		/* other error */
	}

VFS functions such as lookup_create can be used on the nameidata
structure to pass the create intent to the file system.

Currently, there is no easy way to pass the LOOKUP_OPEN intent. The
proper way would be to call open_namei. We'd like to get comments
about what's necessary to make stackable file systems do lookups
right - this includes potential changes to open_namei.

Josef 'Jeff' Sipek.

[1] http://marc.info/?l=linux-kernel&m=117343337823760&w=2
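For the create case, usage would presumably mirror sys_mknodat. A
sketch (whether LOOKUP_ONE composes with LOOKUP_PARENT like this is
exactly the sort of thing we'd like comments on; example_create is a
hypothetical caller, everything else is existing VFS API plus the new
flag):

/*
 * Sketch only: create "name" in the directory given by
 * (parent_dentry, parent_vfsmnt), passing the create intent along.
 */
static int example_create(struct dentry *parent_dentry,
			  struct vfsmount *parent_vfsmnt,
			  const char *name, int mode)
{
	struct nameidata nd;
	struct dentry *dentry;
	int err;

	nd.dentry = parent_dentry;
	nd.mnt = parent_vfsmnt;

	err = path_lookup(name, LOOKUP_ONE | LOOKUP_PARENT, &nd);
	if (err)
		return err;

	/* lookup_create() returns the negative child dentry with the
	 * parent's i_mutex held, even on error */
	dentry = lookup_create(&nd, 0);
	err = PTR_ERR(dentry);
	if (!IS_ERR(dentry)) {
		err = vfs_create(nd.dentry->d_inode, dentry, mode, &nd);
		dput(dentry);
	}
	mutex_unlock(&nd.dentry->d_inode->i_mutex);
	path_release(&nd);
	return err;
}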
[PATCH] Add preallocation beyond EOF to fallocate
Add new mode to ->fallocate() to allow allocation to occur beyond
the current EOF without changing the file size. Implement in XFS
->fallocate() vector.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 fs/xfs/linux-2.6/xfs_iops.c |    8 +++++---
 include/linux/fs.h          |    1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:09:48.233375382 +1000
@@ -833,11 +833,13 @@ xfs_vn_fallocate(
 			VNODE_POSITION_XFS);
 
 	switch (mode) {
-	case FA_ALLOCATE:	/* changes file size */
-		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
-						&bf, 0, NULL, 0);
+	case FA_ALLOCATE:	/* changes file size */
 		if (offset + len > i_size_read(inode))
 			do_setattr = offset + len;
+		/* FALL THROUGH */
+	case FA_PREALLOCATE:	/* no filesize change */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
 		break;
 	case FA_DEALLOCATE:
 		/* XXX: changes file size? this just punches a hole */
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h	2007-04-27 18:48:01.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/fs.h	2007-04-30 11:08:05.790903661 +1000
@@ -269,6 +269,7 @@ extern int dir_notify_enable;
  */
 #define FA_ALLOCATE	0x1
 #define FA_DEALLOCATE	0x2
+#define FA_PREALLOCATE	0x3
 
 #ifdef __KERNEL__
[PATCH] XFS ->fallocate() support
Add XFS support for ->fallocate() vector.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 fs/xfs/linux-2.6/xfs_iops.c |   48 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-02-07 13:24:32.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.225095992 +1000
@@ -812,6 +812,53 @@ xfs_vn_removexattr(
 	return namesp->attr_remove(vp, attr, xflags);
 }
 
+STATIC long
+xfs_vn_fallocate(
+	struct inode	*inode,
+	int		mode,
+	loff_t		offset,
+	loff_t		len)
+{
+	long		error = -EOPNOTSUPP;
+	bhv_vnode_t	*vp = vn_from_inode(inode);
+	bhv_desc_t	*bdp;
+	int		do_setattr = 0;
+	xfs_flock64_t	bf;
+
+	bf.l_whence = 0;
+	bf.l_start = offset;
+	bf.l_len = len;
+
+	bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS,
+				VNODE_POSITION_XFS);
+
+	switch (mode) {
+	case FA_ALLOCATE:	/* changes file size */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
+		if (offset + len > i_size_read(inode))
+			do_setattr = offset + len;
+		break;
+	case FA_DEALLOCATE:
+		/* XXX: changes file size? this just punches a hole */
+		error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP,
+						&bf, 0, NULL, 0);
+		break;
+	default:
+		break;
+	}
+
+	/* Change file size if needed */
+	if (!error && do_setattr) {
+		bhv_vattr_t	va;
+
+		va.va_mask = XFS_AT_SIZE;
+		va.va_size = do_setattr;
+		error = bhv_vop_setattr(vp, &va, 0, NULL);
+	}
+
+	return error;
+}
 
 struct inode_operations xfs_inode_operations = {
 	.permission		= xfs_vn_permission,
 	.setattr		= xfs_vn_setattr,
@@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat
 	.getxattr		= xfs_vn_getxattr,
 	.listxattr		= xfs_vn_listxattr,
 	.removexattr		= xfs_vn_removexattr,
+	.fallocate		= xfs_vn_fallocate,
 };
 
 struct inode_operations xfs_dir_inode_operations = {
[PATCH] ia64 fallocate syscall
ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S
===================================================================
--- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S	2007-03-29 19:01:41.000000000 +1000
+++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S	2007-04-27 19:12:56.829396661 +1000
@@ -1612,5 +1612,6 @@ sys_call_table:
 	data8 sys_vmsplice
 	data8 sys_ni_syscall		// reserved for move_pages
 	data8 sys_getcpu
+	data8 sys_fallocate
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h
===================================================================
--- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h	2007-03-29 19:03:37.000000000 +1000
+++ 2.6.x-xfs-new/include/asm-ia64/unistd.h	2007-04-27 19:18:18.215568425 +1000
@@ -293,11 +293,12 @@
 #define __NR_vmsplice			1302
 /* 1303 reserved for move_pages */
 #define __NR_getcpu			1304
+#define __NR_fallocate			1305
 
 #ifdef __KERNEL__
 
-#define NR_syscalls			281 /* length of syscall table */
+#define NR_syscalls			282 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> What's your goal here? Is it to speed up fsck's after an unclean
> shutdown to the point where you don't need to use a journal or some
> kind of soft updates scheme?
>
> Is it to speed up fsck's after the kernel has detected some kind of
> internal consistency error?
>
> Is it to speed up fsck's after you no longer have confidence that
> random blocks in the filesystem may have gotten corrupted, due to a
> suspend/resume bug, hard drive failure, reported CRC/parity errors
> when writing to the device, or reports of massive ECC failures in
> your memory that could have caused random blocks to have been written
> with multiple bit flips?

Mostly this last.

> The first is relatively easy, but as you move down the list, things
> get progressively harder, since it's no longer possible to use a per
> tile "clean bit" to assume that you get to skip checking that
> particular tile or chunk.

Tile headers contain a timestamp that lets you say "let's revisit
every clean block that hasn't been checked in the last 6 months". And
the blocks containing the clean bits are themselves covered by CRC!

> > Divide disk into a bunch of tiles. For each tile, allocate a one
> > block tile header that contains (inode, checksum) pairs for each
> > block in the tile. Unused blocks get marked inode -1, filesystem
> > metadata blocks -2. The first element contains a last-clean
> > timestamp, a clean flag and a checksum for the block itself. For 4K
> > blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> > with ~.2% overhead.
>
> So what happens for files that are bigger than 2MB? Presumably they
> consist of blocks that must come from more than one tile, right? So
> is an inode allowed to reference blocks outside of its tile? Or does
> an inode that needs to span multiple tiles have a local sub-inode in
> each tile, with back pointers to the parent inode?

We don't need any form of sub-inode or continuation inode, which is
why it's not mentioned. This is one of two main differences from
chunkfs. The other is reverse maps (aka back pointers) for
blocks -> inodes and inodes -> directories, which obviate the need
for large amounts of memory to check for collisions.

When you check a tile, you check all the inodes that "cover" that
tile (ie have overlapping data or metadata). So if a large file spans
many tiles, you end up reading many tiles to check one, and a
completely naive checker would check the same file many times in the
course of a full scan. Total work would then be multiplied by the
average tile-crossing rate for files.

However most of this work can be trivially avoided, of course. You
can keep a small in-memory LRU cache of recently-checked inodes to
avoid having to repeatedly check files in neighboring blocks as your
scan progresses. This LRU doesn't need to be very big at all to be
quite effective, as the sketch below illustrates.
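A rough sketch of that scan loop, with every helper and constant
hypothetical (this is the shape of the algorithm, not a real on-disk
format):

#include <stdint.h>

#define LRU_SLOTS	64	/* "doesn't need to be very big" */

struct checked_lru {
	uint32_t ino[LRU_SLOTS];
	int	 next;			/* round-robin eviction */
};

static int lru_contains(struct checked_lru *lru, uint32_t ino)
{
	int i;

	for (i = 0; i < LRU_SLOTS; i++)
		if (lru->ino[i] == ino)
			return 1;
	return 0;
}

static void lru_add(struct checked_lru *lru, uint32_t ino)
{
	lru->ino[lru->next] = ino;
	lru->next = (lru->next + 1) % LRU_SLOTS;
}

static int check_tile(struct tile *t, struct checked_lru *lru)
{
	int b;

	if (tile_clean_and_current(t))
		return 0;
	if (!tile_header_crc_ok(t))
		return -1;

	for (b = 0; b < BLOCKS_PER_TILE; b++) {
		uint32_t ino = tile_block_owner(t, b);

		if (ino == INO_FREE || ino == INO_META)
			continue;
		if (lru_contains(lru, ino))
			continue;	/* whole inode verified moments ago */

		/* Walk the whole inode: forward pointers, reverse
		 * pointers and per-block CRCs must all agree. */
		if (check_inode(ino))
			return -1;
		lru_add(lru, ino);
	}
	tile_mark_clean(t);
	return 0;
}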
This does mean that our time to make progress on a check is bounded
at the top by the size of our largest file. If we have a degenerate
filesystem filled with a single file, this will in fact take as long
as a conventional fsck. If your filesystem has, say, 100 roughly
equally-sized files, you're back in Chunkfs territory.

Note that even though our time usage scales with max file size, our
memory usage is bound by a small constant. We only need one tile
header in memory plus what it takes to walk an inode's data. The
block -> inode reverse map means that we don't have to keep a
filesystem- or chunk-sized map to spot block collisions. Each block
knows where it belongs and spotting broken links is easy.

So we should have no trouble checking an exabyte-sized filesystem on
a 4MB box. Even if it has one exabyte-sized file! We check the first
tile, see that it points to our file, then iterate through that file,
checking that the forward and reverse pointers for each block match
and all CRCs match, etc. We cache the file's inode as clean, finish
checking anything else in the first tile, then mark it clean. When we
get to the next tile (and the next billion after that!), we notice
that each block points back to our cached inode and skip rechecking
it.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> Hopefully the clean bitmap block is protected by a checksum. After
> all, the smaller set of clean bitmap blocks is going to be constantly
> updated as tiles get dirtied, and then cleaned. What if they get
> corrupted? How
Re: [PATCH 0/5] fallocate system call
On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote:
> Based on the discussion, this new patchset uses the following as the
> interface for the fallocate() system call:
>
> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Ok, so now for the hard questions - what are the semantics of
FA_ALLOCATE and FA_DEALLOCATE?

For FA_ALLOCATE, it's supposed to change the file size if we allocate
past EOF, right? What's the return value supposed to be? Zero for
success, error otherwise? Does this update a/m/ctime at all? How
persistent is this preallocation? Should it be there "forever" or for
the lifetime of the currently open fd that it was preallocated on?

For FA_DEALLOCATE, does it change the filesize at all? Or does it
just punch a hole in the file? If it does change file size, what
happens when you punch out preallocation beyond EOF? What's the
return value supposed to be?

> Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> preallocation and deallocation of preallocated blocks respectively.
> More modes can be added, when required.

FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
not change file size) so we can preallocate beyond EOF for apps which
use O_APPEND (i.e. changing the file size would cause problems for
them).

> ToDos:
> =====
> 1> Implementation on other architectures (other than i386, x86_64,
>    ppc64 and s390(x))

I'll have ia64 soon.

> 2> A generic file system operation to handle fallocate
>    (generic_fallocate), for filesystems that do _not_ have the
>    fallocate inode operation implemented.
> 3> Changes to glibc,
>    a) to support fallocate() system call
>    b) so that posix_fallocate() and posix_fallocate64() call
>       fallocate() system call
> 4> Changes to XFS to implement the fallocate inode operation

And that's what I'm doing now, hence all the questions ;)

BTW, do you have a test program for this, or will I need to write one
myself?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
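A minimal test sketch for the proposed syscall might look like the
following (assuming a 64-bit box; the FA_* values are from this
patchset and the syscall number is the ia64 one from the patch in this
thread - both need adjusting per-arch):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/syscall.h>

/* Values from the patches in this thread. */
#define FA_ALLOCATE	0x1
#define FA_DEALLOCATE	0x2

#ifndef __NR_fallocate
#define __NR_fallocate	1305	/* ia64 value; other arches differ */
#endif

int main(int argc, char **argv)
{
	struct stat st;
	long long offset, len;
	int fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <offset> <len>\n", argv[0]);
		return 1;
	}
	offset = atoll(argv[2]);
	len = atoll(argv[3]);

	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 64-bit args passed directly; 32-bit arches would need the
	 * usual hi/lo register-pair games, glossed over here. */
	if (syscall(__NR_fallocate, fd, FA_ALLOCATE, offset, len))
		perror("fallocate");

	/* Did the size change, and did blocks actually get allocated? */
	if (fstat(fd, &st) == 0)
		printf("size=%lld blocks=%lld\n",
		       (long long)st.st_size, (long long)st.st_blocks);

	close(fd);
	return 0;
}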
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

What's your goal here? Is it to speed up fsck's after an unclean
shutdown to the point where you don't need to use a journal or some
kind of soft updates scheme?

Is it to speed up fsck's after the kernel has detected some kind of
internal consistency error?

Is it to speed up fsck's after you no longer have confidence that
random blocks in the filesystem may have gotten corrupted, due to a
suspend/resume bug, hard drive failure, reported CRC/parity errors
when writing to the device, or reports of massive ECC failures in your
memory that could have caused random blocks to have been written with
multiple bit flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to use a per
tile "clean bit" to assume that you get to skip checking that
particular tile or chunk.

> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

So what happens for files that are bigger than 2MB? Presumably they
consist of blocks that must come from more than one tile, right? So is
an inode allowed to reference blocks outside of its tile? Or does an
inode that needs to span multiple tiles have a local sub-inode in each
tile, with back pointers to the parent inode?

Note that both design paths have some serious tradeoffs. If you allow
an inode to span multiple tiles, now you can't check the block
allocation data structures without scanning all of the tiles involved.
If you have sub-inodes, now you have to have bidirectional pointers
and you have to validate those pointers after validating all of the
individual tiles.

This is one of the toughest aspects of either the chunkfs or tilefs
design, and when we discussed chunkfs at past filesystem workshops, we
didn't come to any firm conclusions about the best way to solve it,
except to acknowledge that it is a hard problem.

My personal inclination is to use substantially bigger chunks than the
2MB that you've proposed, and make each chunk more like 2 or 4 GB, and
to enforce a rule which says an inode in a chunk can only reference
blocks in that local chunk, and to try like mad to keep directories
referencing inodes in a chunk, and to keep inodes referencing blocks
within that chunk. When a file is bigger than a chunk, then you will
be forced to use indirection pointers that basically say, "for offsets
2GB-4GB, reference inode <i> in chunk <c>, and for 4GB-8GB, check out
inode <i'> in chunk <c'>", etc.

I won't say that this is definitely the best way to do things, but I
note that you haven't really addressed this design point, and there
are no obvious best ways of handling this.

> [Note that CRCs are optional so we can cut the overhead in half.
> I choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRCs are just fine. If you
want better protection against accidental corruption, then the answer
is to use a bigger CRC. Using a cryptographic hash like SHA-n is pure
overkill unless you're trying to design protection against a malicious
attacker, in which case you've got a much bigger set of problems that
you have to address first --- you don't get a cryptographically secure
filesystem by replacing a CRC with a SHA-n hash function.

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum. After
all, the smaller set of clean bitmap blocks is going to be constantly
updated as tiles get dirtied, and then cleaned. What if they get
corrupted? How does the checker notice? And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

> Checking a tile:
>
>  Read the tile
>  If clean and current, we're done.
>  Check the tile header checksum
>  Check the checksum on each block in the tile
>  Check that metadata blocks are metadata
>  Check that in
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 05:58:48PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> >
> > Some things we need to check during fsck:
> >
> >  all directories point to in-use inodes
> >  all in-use inodes are referred to by directories
> >  all inodes in use are marked in use
> >  all free inodes are marked free
> >  all inodes point to in-use blocks
> >  all inode refcounts are correct
> >  all inode block counts are correct
> >  free inode count is correct
> >
> >  no blocks are used twice
> >  all used blocks are marked as used
> >  all free blocks are marked as free
> >  optional: all block contents are correct
> >
> >  statfs information matches filesystem content
>
> This one may still require a full fsck in your current approach. One
> of the good aspects of ChunkFS (assuming my understanding matches
> reality) is to have per-chunk counters for free blocks, free inodes,
> etc. For a fast fsck you would need to have these counters per-unit
> as well. It doesn't matter whether your unit is a tile, chunk,
> blockgroup or karboozel.

Yes. This is probably not a very big problem. Once we've checked all
the tiles in a block group and they're clean, we know that the block
and inode bitmaps are correct, so we can doublecheck the
per-blockgroup counters.

One obvious approach is simply batching a blockgroup's worth of tiles
at a time (if tile header entries are 64 bits, then there are 64 tiles
in a block group). Then you keep running totals and check at the end.
Memory usage is still O(single tile).

Summing blockgroup totals across a million block groups... we'll
probably want something like:

> Having some tree structure for these counters would also help. Statfs
> requires adding all counters for all units. Smaller units speed up
> fsck but slow down statfs. With a tree statfs can be O(log(n)).

--
Mathematics is the supreme nostalgia of our time.
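For illustration, the batching described above might look like this
sketch (all names hypothetical; check_tile is the per-tile check
described earlier in the thread):

#include <stdint.h>

#define TILES_PER_GROUP	64

struct fsck_totals {
	uint64_t free_blocks;
	uint64_t free_inodes;
};

static int check_block_group(struct block_group *bg)
{
	struct fsck_totals t = { 0, 0 };
	int i;

	for (i = 0; i < TILES_PER_GROUP; i++) {
		struct tile *tile = group_tile(bg, i);

		if (check_tile(tile))		/* per-tile check */
			return -1;
		tile_count_free(tile, &t);	/* accumulate free counts */
	}

	/*
	 * All tiles are clean, so the bitmaps are now trustworthy;
	 * the group's summary counters must match what we counted.
	 */
	if (t.free_blocks != bg_free_blocks(bg) ||
	    t.free_inodes != bg_free_inodes(bg))
		return -1;
	return 0;
}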
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, 29 April 2007 18:34:59 +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
>
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> Problem I see is that your scheme doesn't support metadata checksums only.

Why? It is fairly simple to run many different schemes with this:
- Generate checksums for everything, compare for each access.
- Generate checksums for everything, only compare metadata checksums.
- Generate checksums for everything, only compare at fsck time.
- Generate metadata checksums, use 0x_ as data "checksums".
- Not generate any checksums.

Users without checksums would still pay the .1% space overhead, sure.
I'd bet they already pay more for unused inodes.

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 06:34:59PM +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
>
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
>
> Problem I see is that your scheme doesn't support metadata checksums
> only. IMHO those are the most interesting because they have the
> potential to be basically zero cost, unlike full data checksumming.
> And doing metadata checksums is enough to handle the fsck problem.

It does. Checksums are not in any way an essential part of this
approach. We can checksum nothing, just metadata, or everything.
Further, we might find ourselves wanting to change the setting, so
we'd probably just want to leave slots in the tile headers for CRCs
always. So for now, let's imagine three mount options: nocrc, crc,
and metacrc.

> I'm sure there are many cases where full checksumming makes sense
> too, but it shouldn't be forced on everybody because it will slow
> down some important workloads (like O_DIRECT).

Right.

> Metadata checksums would be best just put into the file system's
> data structures. Essentially every object (inode, extent, directory
> entry, super block) should have a checksum that can be incrementally
> updated.

Putting CRCs on extents is trouble. You can't validate them without
reading the whole extent, and writing into the middle of an extent is
also hard.

But this is kind of a rat-hole. The more immediate question is: do the
reverse mapping pointers make sense for fsck purposes?

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
>
> Some things we need to check during fsck:
>
>  all directories point to in-use inodes
>  all in-use inodes are referred to by directories
>  all inodes in use are marked in use
>  all free inodes are marked free
>  all inodes point to in-use blocks
>  all inode refcounts are correct
>  all inode block counts are correct
>  free inode count is correct
>
>  no blocks are used twice
>  all used blocks are marked as used
>  all free blocks are marked as free
>  optional: all block contents are correct
>
>  statfs information matches filesystem content

This one may still require a full fsck in your current approach. One
of the good aspects of ChunkFS (assuming my understanding matches
reality) is to have per-chunk counters for free blocks, free inodes,
etc. For a fast fsck you would need to have these counters per-unit
as well. It doesn't matter whether your unit is a tile, chunk,
blockgroup or karboozel.

Having some tree structure for these counters would also help. Statfs
requires adding all counters for all units. Smaller units speed up
fsck but slow down statfs. With a tree statfs can be O(log(n)).

Jörn

--
"Translations are and will always be problematic. They inflict
violence upon two languages." (translation from German)
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, 29 April 2007 07:57:18 -0500, Matt Mackall wrote:
> On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
>
> Thanks. I think this is a somewhat more direct solution than ChunkFS,
> but a) I haven't followed ChunkFS closely and b) I haven't been
> thinking about fsck very long, so this is still just presented as
> fodder for discussion.

After thinking about it for a while, I believe you have a problem as
well. Will cover that in a later mail.

> > You should add a 64bit fpos field. That allows you to easily check
> > for addressing errors.
>
> Describe the scenario where this manifests, please.

A block with a checksum is written somewhere. The block matches its
checksum, but a bit flipped on the address bus. To catch this you have
to match 1) block contents, 2) checksum and 3) inode tree.

ZFS does it by having the checksum next to the block pointers in
indirect blocks. LogFS does it by having a block header with a
checksum and an (ino, pos, level) tuple. Level is 0 for data blocks,
1 for indirect blocks, etc. In the very rare case that a data block
gets written to the offset belonging to one of its indirect blocks or
vice versa, this is necessary to catch the error. LogFS needs it for
different reasons and I wouldn't mind if you just ignore that detail.

> It just occurred to me that my approach is analogous to object-based
> rmap on the filesystem. The fpos proposal I think makes it more like
> the original per-pte rmap. This is not to say I think the same
> lessons apply, as I'm not clear what you're proposing yet.
>
> Ooh.. I also just realized the tile approach allows much easier
> defragging/shrinking of large filesystems because finding the
> associated inode for blocks you want to move is fast.

It also allows for a background scan of the filesystem. If data is
rotting on the medium, the only chance to detect this is by reading
and checking it. Lots of data is never read from userspace, so it can
accumulate lots of errors. Without rmap the filesystem cannot verify
random blocks.

Jörn

--
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
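Concretely, the suggestion turns each 8-byte tile header entry into
something like the sketch below (names hypothetical). Note that the
64-bit fpos doubles the entry to 16 bytes, so a 4K header block then
covers 256 blocks instead of 512 and the ~.2% overhead roughly
doubles:

#include <stdint.h>

#define TILE_INO_FREE	((uint32_t)-1)	/* unused block */
#define TILE_INO_META	((uint32_t)-2)	/* filesystem metadata */

struct tile_entry {
	uint32_t ino;	/* owning inode, or the markers above */
	uint32_t crc;	/* CRC32 of the block contents */
	uint64_t fpos;	/* byte offset of this block within its file */
};

/* Hypothetical forward-map helper: which block does inode 'ino'
 * map at file offset 'fpos'? */
extern uint64_t inode_block_at(uint32_t ino, uint64_t fpos);

/*
 * A misdirected write lands in a block whose contents still match
 * their checksum; only the three-way match of contents, checksum and
 * inode tree catches it.
 */
static int block_backref_ok(const struct tile_entry *te, uint64_t blocknr)
{
	if (te->ino == TILE_INO_FREE || te->ino == TILE_INO_META)
		return 1;	/* nothing to cross-check */
	return inode_block_at(te->ino, te->fpos) == blocknr;
}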
Re: [RFC] TileFS - a proposal for scalable integrity checking
Matt Mackall <[EMAIL PROTECTED]> writes:

> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

Problem I see is that your scheme doesn't support metadata checksums
only. IMHO those are the most interesting because they have the
potential to be basically zero cost, unlike full data checksumming.
And doing metadata checksums is enough to handle the fsck problem.

I'm sure there are many cases where full checksumming makes sense too,
but it shouldn't be forced on everybody because it will slow down some
important workloads (like O_DIRECT).

Metadata checksums would be best just put into the file system's data
structures. Essentially every object (inode, extent, directory entry,
super block) should have a checksum that can be incrementally updated.

-Andi
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> >
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also
> > be very small, while greatly improving filesystem integrity.
>
> I like it a lot. Until now it appears to solve more problems and
> cause fewer new problems than ChunkFS.

Thanks. I think this is a somewhat more direct solution than ChunkFS,
but a) I haven't followed ChunkFS closely and b) I haven't been
thinking about fsck very long, so this is still just presented as
fodder for discussion.

> > [...]
> >
> > Divide disk into a bunch of tiles. For each tile, allocate a one
> > block tile header that contains (inode, checksum) pairs for each
> > block in the tile. Unused blocks get marked inode -1, filesystem
> > metadata blocks -2. The first element contains a last-clean
> > timestamp, a clean flag and a checksum for the block itself. For 4K
> > blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> > with ~.2% overhead.
>
> You should add a 64bit fpos field. That allows you to easily check
> for addressing errors.

Describe the scenario where this manifests, please.

It just occurred to me that my approach is analogous to object-based
rmap on the filesystem. The fpos proposal I think makes it more like
the original per-pte rmap. This is not to say I think the same lessons
apply, as I'm not clear what you're proposing yet.

Ooh.. I also just realized the tile approach allows much easier
defragging/shrinking of large filesystems because finding the
associated inode for blocks you want to move is fast.

> > [Note that CRCs are optional so we can cut the overhead in half. I
> > choose CRCs here because they're capable of catching the vast
> > majority of accidental corruptions at a small cost and mostly serve
> > to protect against errors not caught by on-disk ECC (eg cable
> > noise, kernel bugs, cosmic rays). Replacing CRCs with a stronger
> > hash like SHA-n is perfectly doable.]
>
> The checksum cannot protect against a maliciously prepared medium
> anyway, so crypto makes little sense.

In a past life, I wrote a device mapper layer that kept a
cryptographic hash per cluster of the underlying device, with a
top-level digital signature of said hashes. That gets you pretty close
to tamper-proof, in theory. Practice of course is a different matter,
so don't try this at home.

As it happens, this earlier system was the inspiration for the tile
idea, the integrity parts of which have been kicking around in my head
since I heard ZFS was tracking checksums.

> Crc32 can provably (if you trust those who did the proof) detect all
> 1, 2 and 3-bit errors and has a 1:2^32 chance of missing any
> remaining errors. That is fairly hard to improve on.

Indeed.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> You can and possibly should organize this as a tree, similar to a
> file. One bit at the lowest level marks a tile as dirty. One bit for
> each indirect block pointer marks some tiles behind the pointer as
> dirty. That scales logarithmically to any filesystem size.

Right. 3 levels takes us to 512TB, etc.

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
>
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

I like it a lot. Until now it appears to solve more problems and cause
fewer new problems than ChunkFS.

> [...]
>
> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

You should add a 64bit fpos field. That allows you to easily check for
addressing errors.

> [Note that CRCs are optional so we can cut the overhead in half. I
> choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

The checksum cannot protect against a maliciously prepared medium
anyway, so crypto makes little sense.

Crc32 can provably (if you trust those who did the proof) detect all
1, 2 and 3-bit errors and has a 1:2^32 chance of missing any remaining
errors. That is fairly hard to improve on.

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

You can and possibly should organize this as a tree, similar to a
file. One bit at the lowest level marks a tile as dirty. One bit for
each indirect block pointer marks some tiles behind the pointer as
dirty. That scales logarithmically to any filesystem size.

Jörn

--
I don't understand it. Nobody does.
-- Richard P. Feynman