Hi, I was not really expecting to see people testing out of space handling in Tux3 at this early stage, but Marcin apparently failed to get that memo, so he went ahead and demonstrated directory corruption by hitting nospace in his mass file creation test. To avoid further embarrassment of this nature, I put together a preliminary patch to fail gracefully in the file create instead of just stumbling forward and running out of disk blocks deep in metadata update.
The classic way of dealing with this messy issue is, at the beginning of each atomic filesystem change one overestimates the worst case number of blocks that will be required to store the change, including all associated metadata blocks, and reserve that many "credits" against the remaining free blocks on the filesystem. If the remaining free blocks count lies below some safety margin depending on the user privilege and our own confidence that the reservation code is correct, we bail out of the change with ENOSPC, which bubbles up to the application. If remaining free blocks are sufficient, on the other hand, we press on with the change, counting the number of new blocks actually allocated. When the change is done (but not necessarily recorded on disk) any unused credit, that is, reservation in excess of actual blocks consumed, is returned to the credit pool. Well, I was too lazy to implement all the accounting code needed to keep track of actual allocations, and in some cases the actual allocation does not occur inside the change anyway, but is deferred for a later flush operation. We already have this situation with delayed write allocation and we will soon have more of that with delayed bitmap block flushing. So rather than do that big, fragile hack right now, I tried a more creative approach. I make a gross overestimate of number of blocks required to store the transaction as described above, but do not attempt to return unused credits to the credit pool. Instead, I just let Tux3 "hit the wall" and detect insufficient unreserved free blocks while there still is plenty of actual free space on the volume. At this point, I force a flush to disk and reset credits to zero, then repeat the reservation check against actual volume free space, now that all earlier reservations have been translated into a concrete on-disk representation. If this second check fails, it is a real out of space condition and ENOSPC is returned to the application. Most of the time, the second check does not fail. However, as the volume gets close to full, the frequency of the flush/recheck fallback cycle increases. When there are just a handful of blocks left, every change_begin causes a full flush, which does not take very long, because not much cache has been dirtied since the previous flush. In practice, this appears to work very well. Normal operation is scarcely affected, because all I have done is add some efficient arithmetic to the change_begin. We don't take any locks for the check, because the check is conservation. During a flush triggered by exceeding the credit limit, recursive calls to change_begin do take place, but we need to avoid recursive flushes. This is handled (hackishly) by negating the sb->margin variable (our safety margin for reservation calculations). Then any reservation check will succeed if it sees that negative flag, on the theory that the operation is part of a flush and space for it has already been reserved. Well, this is not actually correct, because new filesystem operations could arrive during the flush, and we want those just to wait instead of bypassing the credit check. So there is some cleanup work to do, but in practice this works, and fixes Marcin's observed directory corruption. We need to refine this synchronization strategy to be less of a hack. One thing that needs to be cleaned up: right now, map_region just fails if it fails to allocate an extent exactly the size it requested from balloc, even though there may be smaller runs of free blocks adding up to the required amount still available on the volume. We need to implement a fallback to attempt a fragmented allocation if no extent of the ideal size is available, or does not lie within an acceptable distance from the allocation goal. This is a slightly messy requirement of the extent allocation strategy that is not optional, otherwise fragmentation can cause spurious out of space errors when there is still sufficient space available. This is probably not even rare when somebody runs a lot of filesystem activity against a nearly full volume, an all too common usage pattern. Untarring a kernel tree that starts to fail on out of space partway through, generating thousands of attempts to store files in rapid succession, some of them succeeding, is a common scenario that can very quickly find flaws in a broken out of space handling strategy. I used a really crude technique to flush the cache. For some reason, none of the VFS machinery for flushing inodes is exposed to modules. Except for freeze_bdev, which incorporates a flush, among various other odd things. So I do a freeze_bdev to flush the volume, which locks out further writes to the filesystem as a side effect (but not further metadata operations as far as I can see, maybe I did not look carefully enough). This is pretty gross, and I am not at all sure that it cannot deadlock. Our flush really should be a delta commit, not this heavy handed thing that tries to handle all filesystems uniformly without really knowing what it is doing. However it does seem to work pretty well, and lets us see how nospace handling needs to work, even before the delta commit mechanism is fully in place. It is pretty cool the way the system behaves as it gets close to full. Flushes are triggered with accelerating frequency, consuming less free space on eacy cycle. Finally, we hit true out of space and the application screeches to a halt with an error message, all in good order. Then, as far as I can see, there is no filesystem corruption as Marcin was able to trigger easily on an unpatched system. We do need to confirm this by putting together some sort of preliminary fsck, however, things are looking pretty good. I think we can evolve this off the wall idea into a solid out of space handling mechanism that is light, tight and efficient. Regards, Daniel
diff -r fbdd7c18a67c user/commit.c --- a/user/commit.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/commit.c Mon Mar 02 15:17:44 2009 -0800 @@ -242,12 +242,13 @@ static int commit_delta(struct sb *sb) return unstash(sb, &sb->defree, retire_bfree); } -void change_begin(struct sb *sb) +int change_begin(struct sb *sb) { down_read(&sb->delta_lock); + return 0; } -void change_end(struct sb *sb) +int change_end(struct sb *sb) { if (need_delta(sb)) { unsigned delta = sb->delta; @@ -262,6 +263,7 @@ void change_end(struct sb *sb) up_write(&sb->delta_lock); } else up_read(&sb->delta_lock); + return 0; } int bitmap_io(struct buffer_head *buffer, int write) diff -r fbdd7c18a67c user/filemap.c --- a/user/filemap.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/filemap.c Mon Mar 02 15:17:44 2009 -0800 @@ -14,6 +14,13 @@ #include "kernel/ileaf.c" #include "kernel/balloc.c" #include "kernel/filemap.c" + +int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len) +{ + return ioabs(dev->fd, data, len, rw, offset); +} + +#include "kernel/commit.c" /* * Extrapolate from single buffer flush or blockread to opportunistic exent IO @@ -112,8 +119,8 @@ int filemap_extent_io(struct buffer_head } #ifdef build_filemap -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } static void check_created_seg(struct seg *seg) { diff -r fbdd7c18a67c user/ileaf.c --- a/user/ileaf.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/ileaf.c Mon Mar 02 15:17:44 2009 -0800 @@ -21,8 +21,8 @@ #include "kernel/xattr.c" #include "kernel/ileaf.c" -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } struct ileaf *ileaf_create(struct btree *btree) { diff -r fbdd7c18a67c user/inode.c --- a/user/inode.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/inode.c Mon Mar 02 15:17:44 2009 -0800 @@ -293,8 +293,8 @@ void tuxclose(struct inode *inode) #include "super.c" #ifdef build_inode -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } int main(int argc, char *argv[]) { diff -r fbdd7c18a67c user/kernel/commit.c --- a/user/kernel/commit.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/kernel/commit.c Mon Mar 02 15:17:44 2009 -0800 @@ -32,6 +32,7 @@ int load_sb(struct sb *sb) sb->atomgen = from_be_u32(super->atomgen); sb->freeatom = from_be_u32(super->freeatom); sb->dictsize = from_be_u64(super->dictsize); + sb->margin = 100; // should be tunable? trace("blocksize %u, blockbits %u, blockmask %08x", sb->blocksize, sb->blockbits, sb->blockmask); trace("volblocks %Lu, freeblocks %Lu, nextalloc %Lu", @@ -61,3 +62,35 @@ int load_itable(struct sb *sb) init_btree(itable_btree(sb), sb, unpack_root(iroot_val), &itable_ops); return 0; } + +#ifdef __KERNEL__ +int reserve_credits(struct sb *sb, unsigned credits) +{ + sb->credits += credits; + if (sb->margin > 0 && sb->freeblocks < sb->credits + sb->margin) { + struct block_device *bdev = vfs_sb(sb)->s_bdev; + warn(">>> freeblocks = %Lx, credits = %Lx", (L)sb->freeblocks, (L)sb->credits); + sb->margin = -sb->margin; + freeze_bdev(bdev); + sb->credits = 0; + thaw_bdev(bdev, vfs_sb(sb)); + sb->margin = -sb->margin; + warn(">>> nospace = %i", sb->freeblocks < sb->credits + sb->margin); + if (sb->freeblocks < sb->credits + sb->margin) { + return -ENOSPC; + } + } + return 0; +} + +int change_begin(struct sb *sb) +{ + return reserve_credits(sb, 10); +} + +int change_end(struct sb *sb) +{ + // <try to return unused credits to pool> // + return 0; +} +#endif diff -r fbdd7c18a67c user/kernel/dir.c --- a/user/kernel/dir.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/kernel/dir.c Mon Mar 02 15:17:44 2009 -0800 @@ -132,7 +132,7 @@ loff_t _tux_create_entry(struct inode *d while (entry <= limit) { if (entry->rec_len == 0) { brelse(buffer); - tux_error(dir->i_sb, "zero-length directory entry"); + warn("[CORRUPTION] Zero length directory record detected!"); return -EIO; } name_len = TUX_REC_LEN(entry->name_len); diff -r fbdd7c18a67c user/kernel/filemap.c --- a/user/kernel/filemap.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/kernel/filemap.c Mon Mar 02 15:17:44 2009 -0800 @@ -511,7 +511,11 @@ static int tux3_da_write_begin(struct fi loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { + struct sb *sb = tux_sb(mapping->host->i_sb); + int err; *pagep = NULL; + if ((err = reserve_credits(sb, (len >> sb->blockbits) + 10))) + return err; return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, tux3_da_get_block); } @@ -519,8 +523,10 @@ static int tux3_writepage(struct page *p static int tux3_writepage(struct page *page, struct writeback_control *wbc) { struct sb *sb = tux_sb(page->mapping->host->i_sb); - change_begin(sb); - int err = block_write_full_page(page, tux3_get_block, wbc); + int err = change_begin(sb); + if (err) + return err; + err = block_write_full_page(page, tux3_get_block, wbc); change_end(sb); return err; } diff -r fbdd7c18a67c user/kernel/namei.c --- a/user/kernel/namei.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/kernel/namei.c Mon Mar 02 15:17:44 2009 -0800 @@ -66,7 +66,8 @@ static int tux3_mknod(struct inode *dir, if (!huge_valid_dev(rdev)) return -EINVAL; - change_begin(tux_sb(dir->i_sb)); + if ((err = change_begin(tux_sb(dir->i_sb)))) + return err; inode = tux_create_inode(dir, mode, rdev); err = PTR_ERR(inode); if (!IS_ERR(inode)) { @@ -105,8 +106,8 @@ static int tux3_link(struct dentry *old_ if (inode->i_nlink >= TUX_LINK_MAX) return -EMLINK; - - change_begin(tux_sb(inode->i_sb)); + if ((err = change_begin(tux_sb(inode->i_sb)))) + return err; inode->i_ctime = gettime(); inode_inc_link_count(inode); atomic_inc(&inode->i_count); @@ -125,7 +126,8 @@ static int tux3_symlink(struct inode *di struct inode *inode; int err; - change_begin(tux_sb(dir->i_sb)); + if ((err = change_begin(tux_sb(dir->i_sb)))) + return err; inode = tux_create_inode(dir, S_IFLNK | S_IRWXUGO, 0); err = PTR_ERR(inode); if (!IS_ERR(inode)) { @@ -146,9 +148,10 @@ static int tux3_unlink(struct inode *dir static int tux3_unlink(struct inode *dir, struct dentry *dentry) { struct inode *inode = dentry->d_inode; - change_begin(tux_sb(inode->i_sb)); - int err = tux_del_dirent(dir, dentry); - if (!err) { + int err; + if ((err = change_begin(tux_sb(dir->i_sb)))) + return err; + if (!(tux_del_dirent(dir, dentry))) { inode->i_ctime = dir->i_ctime; inode_dec_link_count(inode); } @@ -161,19 +164,18 @@ static int tux3_rmdir(struct inode *dir, struct inode *inode = dentry->d_inode; int err; - err = tux_dir_is_empty(inode); - if (!err) { - change_begin(tux_sb(inode->i_sb)); - err = tux_del_dirent(dir, dentry); - if (!err) { - inode->i_ctime = dir->i_ctime; - inode->i_size = 0; - clear_nlink(inode); - mark_inode_dirty(inode); - inode_dec_link_count(dir); - } - change_end(tux_sb(inode->i_sb)); + if ((err = tux_dir_is_empty(inode))) + return err; + if ((err = change_begin(tux_sb(dir->i_sb)))) + return err; + if (!(tux_del_dirent(dir, dentry))) { + inode->i_ctime = dir->i_ctime; + inode->i_size = 0; + clear_nlink(inode); + mark_inode_dirty(inode); + inode_dec_link_count(dir); } + change_end(tux_sb(inode->i_sb)); return err; } @@ -194,7 +196,8 @@ static int tux3_rename(struct inode *old /* FIXME: is this needed? */ BUG_ON(from_be_u64(old_entry->inum) != tux_inode(old_inode)->inum); - change_begin(tux_sb(old_inode->i_sb)); + if ((err = change_begin(tux_sb(old_dir->i_sb)))) + return err; if (new_inode) { int old_is_dir = S_ISDIR(old_inode->i_mode); if (old_is_dir) { diff -r fbdd7c18a67c user/kernel/tux3.h --- a/user/kernel/tux3.h Sun Mar 01 11:11:01 2009 +0900 +++ b/user/kernel/tux3.h Mon Mar 02 15:17:44 2009 -0800 @@ -208,7 +208,7 @@ static inline void flink_last_del(struct /* Tux3 disk format */ #define SB_MAGIC_SIZE 8 -#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x18 } /* date of latest incompatible sb format */ +#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x28 } /* date of latest incompatible sb format */ /* * disk format revision history * !!! always update this for every incompatible change !!! @@ -317,6 +317,8 @@ struct sb { struct rw_semaphore delta_lock; /* delta transition exclusive */ unsigned blocksize, blockbits, blockmask; block_t volblocks, freeblocks, nextalloc; + unsigned credits; /* Blocks reserved for inflight operations */ + int margin; /* ENOSPC when freeblocks less than this */ unsigned entries_per_node; /* must be per-btree type, get rid of this */ unsigned max_inodes_per_block; /* get rid of this and use entries per leaf */ unsigned version; /* Currently mounted volume version view */ @@ -881,14 +883,20 @@ int retire_frees(struct sb *sb, struct s int retire_frees(struct sb *sb, struct stash *defree); void destroy_defree(struct stash *defree); -/* commit.c */ +/* utility.c */ int vecio(int rw, struct block_device *dev, sector_t sector, bio_end_io_t endio, void *data, unsigned vecs, struct bio_vec *vec); int syncio(int rw, struct block_device *dev, sector_t sector, unsigned vecs, struct bio_vec *vec); int devio(int rw, struct block_device *dev, loff_t offset, void *data, unsigned len); + +/* commit.c */ + int load_sb(struct sb *sb); int save_sb(struct sb *sb); int load_itable(struct sb *sb); +int reserve_credits(struct sb *sb, unsigned credits); +int change_begin(struct sb *sb); +int change_end(struct sb *sb); /* temporary hack for buffer */ struct buffer_head *blockread(struct address_space *mapping, block_t iblock); @@ -916,9 +924,6 @@ static inline struct buffer_head *blockd return buffer; } -static inline void change_begin(struct sb *sb) { } -static inline void change_end(struct sb *sb) { } - #endif /* __KERNEL__ */ static inline struct buffer_head *vol_getblk(struct sb *sb, block_t block) diff -r fbdd7c18a67c user/super.c --- a/user/super.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/super.c Mon Mar 02 15:17:44 2009 -0800 @@ -13,13 +13,6 @@ #ifndef trace #define trace trace_on #endif - -int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len) -{ - return ioabs(dev->fd, data, len, rw, offset); -} - -#include "kernel/commit.c" int sync_super(struct sb *sb) { diff -r fbdd7c18a67c user/tux3.c --- a/user/tux3.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/tux3.c Mon Mar 02 15:17:44 2009 -0800 @@ -11,8 +11,8 @@ #include "inode.c" #include <getopt.h> -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } int fls(uint32_t v) { diff -r fbdd7c18a67c user/tux3.h --- a/user/tux3.h Sun Mar 01 11:11:01 2009 +0900 +++ b/user/tux3.h Mon Mar 02 15:17:44 2009 -0800 @@ -212,9 +212,6 @@ static inline dev_t huge_decode_dev(u64 #define mark_btree_dirty(x) do {} while (0) -void change_begin(struct sb *sb); -void change_end(struct sb *sb); - #define INIT_INODE(inode, sb, mode) \ .i_sb = sb, \ .i_mode = mode, \ @@ -267,4 +264,7 @@ static inline void mark_inode_dirty(stru enum rw { READ, WRITE }; +int change_begin(struct sb *sb); +int change_end(struct sb *sb); + #endif diff -r fbdd7c18a67c user/tux3fuse.c --- a/user/tux3fuse.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/tux3fuse.c Mon Mar 02 15:17:44 2009 -0800 @@ -48,8 +48,8 @@ #undef trace #define trace trace_on -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } static struct sb *sb; static struct dev *dev; diff -r fbdd7c18a67c user/tux3graph.c --- a/user/tux3graph.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/tux3graph.c Mon Mar 02 15:17:44 2009 -0800 @@ -13,8 +13,8 @@ #include "inode.c" #include <getopt.h> -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } static const char *dtree_names[] = { [TUX_BITMAP_INO] = "bitmap", diff -r fbdd7c18a67c user/xattr.c --- a/user/xattr.c Sun Mar 01 11:11:01 2009 +0900 +++ b/user/xattr.c Mon Mar 02 15:17:44 2009 -0800 @@ -21,8 +21,8 @@ #include "kernel/xattr.c" #include "kernel/ileaf.c" -void change_begin(struct sb *sb) { } -void change_end(struct sb *sb) { } +int change_begin(struct sb *sb) { return 0; } +int change_end(struct sb *sb) { return 0; } int main(int argc, char *argv[]) {
_______________________________________________ Tux3 mailing list Tux3@tux3.org http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3