Hi,

I was not really expecting to see people testing out of space handling in Tux3 
at this early stage, but Marcin apparently failed to get that memo, so he went 
ahead and demonstrated directory corruption by hitting nospace in his mass file 
creation test.  To avoid further embarrassment of this nature, I put together a 
preliminary patch to fail gracefully in the file create instead of just 
stumbling forward and running out of disk blocks deep in metadata update.

The classic way of dealing with this messy issue is, at the beginning of each 
atomic filesystem change one overestimates the worst case number of blocks that 
will be required to store the change, including all associated metadata blocks, 
and reserve that many "credits" against the remaining free blocks on the 
filesystem.  If the remaining free blocks count lies below some safety margin 
depending on the user privilege and our own confidence that the reservation 
code is correct, we bail out of the change with ENOSPC, which bubbles up to the 
application.  If remaining free blocks are sufficient, on the other hand, we 
press on with the change, counting the number of new blocks actually allocated. 
 When the change is done (but not necessarily recorded on disk) any unused 
credit, that is, reservation in excess of actual blocks consumed, is returned 
to the credit pool.

Well, I was too lazy to implement all the accounting code needed to keep track 
of actual allocations, and in some cases the actual allocation does not occur 
inside the change anyway, but is deferred for a later flush operation.  We 
already have this situation with delayed write allocation and we will soon have 
more of that with delayed bitmap block flushing.  So rather than do that big, 
fragile hack right now, I tried a more creative approach.  I make a gross 
overestimate of number of blocks required to store the transaction as described 
above, but do not attempt to return unused credits to the credit pool.  
Instead, I just let Tux3 "hit the wall" and detect insufficient unreserved free 
blocks while there still is plenty of actual free space on the volume.  At this 
point, I force a flush to disk and reset credits to zero, then repeat the 
reservation check against actual volume free space, now that all earlier 
reservations have been translated into a concrete on-disk representation.  If 
this second check fails, it is a real out of space condition and ENOSPC is 
returned to the application.

Most of the time, the second check does not fail.  However, as the volume gets 
close to full, the frequency of the flush/recheck fallback cycle increases.  
When there are just a handful of blocks left, every change_begin causes a full 
flush, which does not take very long, because not much cache has been dirtied 
since the previous flush.

In practice, this appears to work very well.  Normal operation is scarcely 
affected, because all I have done is add some efficient arithmetic to the 
change_begin.  We don't take any locks for the check, because the check is 
conservation.

During a flush triggered by exceeding the credit limit, recursive calls to 
change_begin do take place, but we need to avoid recursive flushes.  This is 
handled (hackishly) by negating the sb->margin variable (our safety margin for 
reservation calculations).  Then any reservation check will succeed if it sees 
that negative flag, on the theory that the operation is part of a flush and 
space for it has already been reserved.  Well, this is not actually correct, 
because new filesystem operations could arrive during the flush, and we want 
those just to wait instead of bypassing the credit check.  So there is some 
cleanup work to do, but in practice this works, and fixes Marcin's observed 
directory corruption.  We need to refine this synchronization strategy to be 
less of a hack.

One thing that needs to be cleaned up: right now, map_region just fails if it 
fails to allocate an extent exactly the size it requested from balloc, even 
though there may be smaller runs of free blocks adding up to the required 
amount still available on the volume.  We need to implement a fallback to 
attempt a fragmented allocation if no extent of the ideal size is available, or 
does not lie within an acceptable distance from the allocation goal.  This is a 
slightly messy requirement of the extent allocation strategy that is not 
optional, otherwise fragmentation can cause spurious out of space errors when 
there is still sufficient space available.  This is probably not even rare when 
somebody runs a lot of filesystem activity against a nearly full volume, an all 
too common usage pattern.  Untarring a kernel tree that starts to fail on out 
of space partway through, generating thousands of attempts to store files in 
rapid succession, some of them succeeding, is a common scenario that can very 
quickly find flaws in a broken out of space handling strategy.

I used a really crude technique to flush the cache.  For some reason, none of 
the VFS machinery for flushing inodes is exposed to modules.  Except for 
freeze_bdev, which incorporates a flush, among various other odd things.  So I 
do a freeze_bdev to flush the volume, which locks out further writes to the 
filesystem as a side effect (but not further metadata operations as far as I 
can see, maybe I did not look carefully enough).  This is pretty gross, and I 
am not at all sure that it cannot deadlock.  Our flush really should be a delta 
commit, not this heavy handed thing that tries to handle all filesystems 
uniformly without really knowing what it is doing.  However it does seem to 
work pretty well, and lets us see how nospace handling needs to work, even 
before the delta commit mechanism is fully in place.

It is pretty cool the way the system behaves as it gets close to full.  Flushes 
are triggered with accelerating frequency, consuming less free space on eacy 
cycle.  Finally, we hit true out of space and the application screeches to a 
halt with an error message, all in good order.  Then, as far as I can see, 
there is no filesystem corruption as Marcin was able to trigger easily on an 
unpatched system.  We do need to confirm this by putting together some sort of 
preliminary fsck, however, things are looking pretty good.  I think we can 
evolve this off the wall idea into a solid out of space handling mechanism that 
is light, tight and efficient.

Regards,

Daniel
diff -r fbdd7c18a67c user/commit.c
--- a/user/commit.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/commit.c	Mon Mar 02 15:17:44 2009 -0800
@@ -242,12 +242,13 @@ static int commit_delta(struct sb *sb)
 	return unstash(sb, &sb->defree, retire_bfree);
 }
 
-void change_begin(struct sb *sb)
+int change_begin(struct sb *sb)
 {
 	down_read(&sb->delta_lock);
+	return 0;
 }
 
-void change_end(struct sb *sb)
+int change_end(struct sb *sb)
 {
 	if (need_delta(sb)) {
 		unsigned delta = sb->delta;
@@ -262,6 +263,7 @@ void change_end(struct sb *sb)
 		up_write(&sb->delta_lock);
 	} else
 		up_read(&sb->delta_lock);
+	return 0;
 }
 
 int bitmap_io(struct buffer_head *buffer, int write)
diff -r fbdd7c18a67c user/filemap.c
--- a/user/filemap.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/filemap.c	Mon Mar 02 15:17:44 2009 -0800
@@ -14,6 +14,13 @@
 #include "kernel/ileaf.c"
 #include "kernel/balloc.c"
 #include "kernel/filemap.c"
+
+int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len)
+{
+	return ioabs(dev->fd, data, len, rw, offset);
+}
+
+#include "kernel/commit.c"
 
 /*
  * Extrapolate from single buffer flush or blockread to opportunistic exent IO
@@ -112,8 +119,8 @@ int filemap_extent_io(struct buffer_head
 }
 
 #ifdef build_filemap
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 static void check_created_seg(struct seg *seg)
 {
diff -r fbdd7c18a67c user/ileaf.c
--- a/user/ileaf.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/ileaf.c	Mon Mar 02 15:17:44 2009 -0800
@@ -21,8 +21,8 @@
 #include "kernel/xattr.c"
 #include "kernel/ileaf.c"
 
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 struct ileaf *ileaf_create(struct btree *btree)
 {
diff -r fbdd7c18a67c user/inode.c
--- a/user/inode.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/inode.c	Mon Mar 02 15:17:44 2009 -0800
@@ -293,8 +293,8 @@ void tuxclose(struct inode *inode)
 #include "super.c"
 
 #ifdef build_inode
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 int main(int argc, char *argv[])
 {
diff -r fbdd7c18a67c user/kernel/commit.c
--- a/user/kernel/commit.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/commit.c	Mon Mar 02 15:17:44 2009 -0800
@@ -32,6 +32,7 @@ int load_sb(struct sb *sb)
 	sb->atomgen = from_be_u32(super->atomgen);
 	sb->freeatom = from_be_u32(super->freeatom);
 	sb->dictsize = from_be_u64(super->dictsize);
+	sb->margin = 100; // should be tunable?
 	trace("blocksize %u, blockbits %u, blockmask %08x",
 	      sb->blocksize, sb->blockbits, sb->blockmask);
 	trace("volblocks %Lu, freeblocks %Lu, nextalloc %Lu",
@@ -61,3 +62,35 @@ int load_itable(struct sb *sb)
 	init_btree(itable_btree(sb), sb, unpack_root(iroot_val), &itable_ops);
 	return 0;
 }
+
+#ifdef __KERNEL__
+int reserve_credits(struct sb *sb, unsigned credits)
+{
+	sb->credits += credits;
+	if (sb->margin > 0 && sb->freeblocks < sb->credits + sb->margin) {
+		struct block_device *bdev = vfs_sb(sb)->s_bdev;
+		warn(">>> freeblocks = %Lx, credits = %Lx", (L)sb->freeblocks, (L)sb->credits);
+		sb->margin = -sb->margin;
+		freeze_bdev(bdev);
+		sb->credits = 0;
+		thaw_bdev(bdev, vfs_sb(sb));
+		sb->margin = -sb->margin;
+		warn(">>> nospace = %i", sb->freeblocks < sb->credits + sb->margin);
+		if (sb->freeblocks < sb->credits + sb->margin) {
+			return -ENOSPC;
+		}
+	}
+	return 0;
+}
+
+int change_begin(struct sb *sb)
+{
+	return reserve_credits(sb, 10);
+}
+
+int change_end(struct sb *sb)
+{
+	// <try to return unused credits to pool> //
+	return 0;
+}
+#endif
diff -r fbdd7c18a67c user/kernel/dir.c
--- a/user/kernel/dir.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/dir.c	Mon Mar 02 15:17:44 2009 -0800
@@ -132,7 +132,7 @@ loff_t _tux_create_entry(struct inode *d
 		while (entry <= limit) {
 			if (entry->rec_len == 0) {
 				brelse(buffer);
-				tux_error(dir->i_sb, "zero-length directory entry");
+				warn("[CORRUPTION] Zero length directory record detected!");
 				return -EIO;
 			}
 			name_len = TUX_REC_LEN(entry->name_len);
diff -r fbdd7c18a67c user/kernel/filemap.c
--- a/user/kernel/filemap.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/filemap.c	Mon Mar 02 15:17:44 2009 -0800
@@ -511,7 +511,11 @@ static int tux3_da_write_begin(struct fi
 			       loff_t pos, unsigned len, unsigned flags,
 			       struct page **pagep, void **fsdata)
 {
+	struct sb *sb = tux_sb(mapping->host->i_sb);
+	int err;
 	*pagep = NULL;
+	if ((err = reserve_credits(sb, (len >> sb->blockbits) + 10)))
+		return err;
 	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 				 tux3_da_get_block);
 }
@@ -519,8 +523,10 @@ static int tux3_writepage(struct page *p
 static int tux3_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct sb *sb = tux_sb(page->mapping->host->i_sb);
-	change_begin(sb);
-	int err = block_write_full_page(page, tux3_get_block, wbc);
+	int err = change_begin(sb);
+	if (err)
+		return err;
+	err = block_write_full_page(page, tux3_get_block, wbc);
 	change_end(sb);
 	return err;
 }
diff -r fbdd7c18a67c user/kernel/namei.c
--- a/user/kernel/namei.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/namei.c	Mon Mar 02 15:17:44 2009 -0800
@@ -66,7 +66,8 @@ static int tux3_mknod(struct inode *dir,
 	if (!huge_valid_dev(rdev))
 		return -EINVAL;
 
-	change_begin(tux_sb(dir->i_sb));
+	if ((err = change_begin(tux_sb(dir->i_sb))))
+		return err;
 	inode = tux_create_inode(dir, mode, rdev);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -105,8 +106,8 @@ static int tux3_link(struct dentry *old_
 
 	if (inode->i_nlink >= TUX_LINK_MAX)
 		return -EMLINK;
-
-	change_begin(tux_sb(inode->i_sb));
+	if ((err = change_begin(tux_sb(inode->i_sb))))
+		return err;
 	inode->i_ctime = gettime();
 	inode_inc_link_count(inode);
 	atomic_inc(&inode->i_count);
@@ -125,7 +126,8 @@ static int tux3_symlink(struct inode *di
 	struct inode *inode;
 	int err;
 
-	change_begin(tux_sb(dir->i_sb));
+	if ((err = change_begin(tux_sb(dir->i_sb))))
+		return err;
 	inode = tux_create_inode(dir, S_IFLNK | S_IRWXUGO, 0);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -146,9 +148,10 @@ static int tux3_unlink(struct inode *dir
 static int tux3_unlink(struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = dentry->d_inode;
-	change_begin(tux_sb(inode->i_sb));
-	int err = tux_del_dirent(dir, dentry);
-	if (!err) {
+	int err;
+	if ((err = change_begin(tux_sb(dir->i_sb))))
+		return err;
+	if (!(tux_del_dirent(dir, dentry))) {
 		inode->i_ctime = dir->i_ctime;
 		inode_dec_link_count(inode);
 	}
@@ -161,19 +164,18 @@ static int tux3_rmdir(struct inode *dir,
 	struct inode *inode = dentry->d_inode;
 	int err;
 
-	err = tux_dir_is_empty(inode);
-	if (!err) {
-		change_begin(tux_sb(inode->i_sb));
-		err = tux_del_dirent(dir, dentry);
-		if (!err) {
-			inode->i_ctime = dir->i_ctime;
-			inode->i_size = 0;
-			clear_nlink(inode);
-			mark_inode_dirty(inode);
-			inode_dec_link_count(dir);
-		}
-		change_end(tux_sb(inode->i_sb));
+	if ((err = tux_dir_is_empty(inode)))
+		return err;
+	if ((err = change_begin(tux_sb(dir->i_sb))))
+		return err;
+	if (!(tux_del_dirent(dir, dentry))) {
+		inode->i_ctime = dir->i_ctime;
+		inode->i_size = 0;
+		clear_nlink(inode);
+		mark_inode_dirty(inode);
+		inode_dec_link_count(dir);
 	}
+	change_end(tux_sb(inode->i_sb));
 	return err;
 }
 
@@ -194,7 +196,8 @@ static int tux3_rename(struct inode *old
 	/* FIXME: is this needed? */
 	BUG_ON(from_be_u64(old_entry->inum) != tux_inode(old_inode)->inum);
 
-	change_begin(tux_sb(old_inode->i_sb));
+	if ((err = change_begin(tux_sb(old_dir->i_sb))))
+		return err;
 	if (new_inode) {
 		int old_is_dir = S_ISDIR(old_inode->i_mode);
 		if (old_is_dir) {
diff -r fbdd7c18a67c user/kernel/tux3.h
--- a/user/kernel/tux3.h	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/tux3.h	Mon Mar 02 15:17:44 2009 -0800
@@ -208,7 +208,7 @@ static inline void flink_last_del(struct
 /* Tux3 disk format */
 
 #define SB_MAGIC_SIZE 8
-#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x18 } /* date of latest incompatible sb format */
+#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x28 } /* date of latest incompatible sb format */
 /*
  * disk format revision history
  * !!! always update this for every incompatible change !!!
@@ -317,6 +317,8 @@ struct sb {
 	struct rw_semaphore delta_lock; /* delta transition exclusive */
 	unsigned blocksize, blockbits, blockmask;
 	block_t volblocks, freeblocks, nextalloc;
+	unsigned credits;	/* Blocks reserved for inflight operations */
+	int margin;	/* ENOSPC when freeblocks less than this */
 	unsigned entries_per_node; /* must be per-btree type, get rid of this */
 	unsigned max_inodes_per_block; /* get rid of this and use entries per leaf */
 	unsigned version;	/* Currently mounted volume version view */
@@ -881,14 +883,20 @@ int retire_frees(struct sb *sb, struct s
 int retire_frees(struct sb *sb, struct stash *defree);
 void destroy_defree(struct stash *defree);
 
-/* commit.c */
+/* utility.c */
 int vecio(int rw, struct block_device *dev, sector_t sector,
 	bio_end_io_t endio, void *data, unsigned vecs, struct bio_vec *vec);
 int syncio(int rw, struct block_device *dev, sector_t sector, unsigned vecs, struct bio_vec *vec);
 int devio(int rw, struct block_device *dev, loff_t offset, void *data, unsigned len);
+
+/* commit.c */
+
 int load_sb(struct sb *sb);
 int save_sb(struct sb *sb);
 int load_itable(struct sb *sb);
+int reserve_credits(struct sb *sb, unsigned credits);
+int change_begin(struct sb *sb);
+int change_end(struct sb *sb);
 
 /* temporary hack for buffer */
 struct buffer_head *blockread(struct address_space *mapping, block_t iblock);
@@ -916,9 +924,6 @@ static inline struct buffer_head *blockd
 	return buffer;
 }
 
-static inline void change_begin(struct sb *sb) { }
-static inline void change_end(struct sb *sb) { }
-
 #endif /* __KERNEL__ */
 
 static inline struct buffer_head *vol_getblk(struct sb *sb, block_t block)
diff -r fbdd7c18a67c user/super.c
--- a/user/super.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/super.c	Mon Mar 02 15:17:44 2009 -0800
@@ -13,13 +13,6 @@
 #ifndef trace
 #define trace trace_on
 #endif
-
-int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len)
-{
-	return ioabs(dev->fd, data, len, rw, offset);
-}
-
-#include "kernel/commit.c"
 
 int sync_super(struct sb *sb)
 {
diff -r fbdd7c18a67c user/tux3.c
--- a/user/tux3.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3.c	Mon Mar 02 15:17:44 2009 -0800
@@ -11,8 +11,8 @@
 #include "inode.c"
 #include <getopt.h>
 
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 int fls(uint32_t v)
 {
diff -r fbdd7c18a67c user/tux3.h
--- a/user/tux3.h	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3.h	Mon Mar 02 15:17:44 2009 -0800
@@ -212,9 +212,6 @@ static inline dev_t huge_decode_dev(u64 
 
 #define mark_btree_dirty(x) do {} while (0)
 
-void change_begin(struct sb *sb);
-void change_end(struct sb *sb);
-
 #define INIT_INODE(inode, sb, mode)			\
 	.i_sb = sb,					\
 	.i_mode = mode,					\
@@ -267,4 +264,7 @@ static inline void mark_inode_dirty(stru
 
 enum rw { READ, WRITE };
 
+int change_begin(struct sb *sb);
+int change_end(struct sb *sb);
+
 #endif
diff -r fbdd7c18a67c user/tux3fuse.c
--- a/user/tux3fuse.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3fuse.c	Mon Mar 02 15:17:44 2009 -0800
@@ -48,8 +48,8 @@
 #undef trace
 #define trace trace_on
 
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 static struct sb *sb;
 static struct dev *dev;
diff -r fbdd7c18a67c user/tux3graph.c
--- a/user/tux3graph.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3graph.c	Mon Mar 02 15:17:44 2009 -0800
@@ -13,8 +13,8 @@
 #include "inode.c"
 #include <getopt.h>
 
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 static const char *dtree_names[] = {
 	[TUX_BITMAP_INO]	= "bitmap",
diff -r fbdd7c18a67c user/xattr.c
--- a/user/xattr.c	Sun Mar 01 11:11:01 2009 +0900
+++ b/user/xattr.c	Mon Mar 02 15:17:44 2009 -0800
@@ -21,8 +21,8 @@
 #include "kernel/xattr.c"
 #include "kernel/ileaf.c"
 
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
 
 int main(int argc, char *argv[])
 {
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3

Reply via email to