Hi,
I was not really expecting to see people testing out of space handling in Tux3
at this early stage, but Marcin apparently failed to get that memo, so he went
ahead and demonstrated directory corruption by hitting nospace in his mass file
creation test. To avoid further embarrassment of this nature, I put together a
preliminary patch to fail gracefully in the file create instead of just
stumbling forward and running out of disk blocks deep in metadata update.
The classic way of dealing with this messy issue is, at the beginning of each
atomic filesystem change one overestimates the worst case number of blocks that
will be required to store the change, including all associated metadata blocks,
and reserve that many "credits" against the remaining free blocks on the
filesystem. If the remaining free blocks count lies below some safety margin
depending on the user privilege and our own confidence that the reservation
code is correct, we bail out of the change with ENOSPC, which bubbles up to the
application. If remaining free blocks are sufficient, on the other hand, we
press on with the change, counting the number of new blocks actually allocated.
When the change is done (but not necessarily recorded on disk) any unused
credit, that is, reservation in excess of actual blocks consumed, is returned
to the credit pool.
Well, I was too lazy to implement all the accounting code needed to keep track
of actual allocations, and in some cases the actual allocation does not occur
inside the change anyway, but is deferred for a later flush operation. We
already have this situation with delayed write allocation and we will soon have
more of that with delayed bitmap block flushing. So rather than do that big,
fragile hack right now, I tried a more creative approach. I make a gross
overestimate of number of blocks required to store the transaction as described
above, but do not attempt to return unused credits to the credit pool.
Instead, I just let Tux3 "hit the wall" and detect insufficient unreserved free
blocks while there still is plenty of actual free space on the volume. At this
point, I force a flush to disk and reset credits to zero, then repeat the
reservation check against actual volume free space, now that all earlier
reservations have been translated into a concrete on-disk representation. If
this second check fails, it is a real out of space condition and ENOSPC is
returned to the application.
Most of the time, the second check does not fail. However, as the volume gets
close to full, the frequency of the flush/recheck fallback cycle increases.
When there are just a handful of blocks left, every change_begin causes a full
flush, which does not take very long, because not much cache has been dirtied
since the previous flush.
In practice, this appears to work very well. Normal operation is scarcely
affected, because all I have done is add some efficient arithmetic to the
change_begin. We don't take any locks for the check, because the check is
conservation.
During a flush triggered by exceeding the credit limit, recursive calls to
change_begin do take place, but we need to avoid recursive flushes. This is
handled (hackishly) by negating the sb->margin variable (our safety margin for
reservation calculations). Then any reservation check will succeed if it sees
that negative flag, on the theory that the operation is part of a flush and
space for it has already been reserved. Well, this is not actually correct,
because new filesystem operations could arrive during the flush, and we want
those just to wait instead of bypassing the credit check. So there is some
cleanup work to do, but in practice this works, and fixes Marcin's observed
directory corruption. We need to refine this synchronization strategy to be
less of a hack.
One thing that needs to be cleaned up: right now, map_region just fails if it
fails to allocate an extent exactly the size it requested from balloc, even
though there may be smaller runs of free blocks adding up to the required
amount still available on the volume. We need to implement a fallback to
attempt a fragmented allocation if no extent of the ideal size is available, or
does not lie within an acceptable distance from the allocation goal. This is a
slightly messy requirement of the extent allocation strategy that is not
optional, otherwise fragmentation can cause spurious out of space errors when
there is still sufficient space available. This is probably not even rare when
somebody runs a lot of filesystem activity against a nearly full volume, an all
too common usage pattern. Untarring a kernel tree that starts to fail on out
of space partway through, generating thousands of attempts to store files in
rapid succession, some of them succeeding, is a common scenario that can very
quickly find flaws in a broken out of space handling strategy.
I used a really crude technique to flush the cache. For some reason, none of
the VFS machinery for flushing inodes is exposed to modules. Except for
freeze_bdev, which incorporates a flush, among various other odd things. So I
do a freeze_bdev to flush the volume, which locks out further writes to the
filesystem as a side effect (but not further metadata operations as far as I
can see, maybe I did not look carefully enough). This is pretty gross, and I
am not at all sure that it cannot deadlock. Our flush really should be a delta
commit, not this heavy handed thing that tries to handle all filesystems
uniformly without really knowing what it is doing. However it does seem to
work pretty well, and lets us see how nospace handling needs to work, even
before the delta commit mechanism is fully in place.
It is pretty cool the way the system behaves as it gets close to full. Flushes
are triggered with accelerating frequency, consuming less free space on eacy
cycle. Finally, we hit true out of space and the application screeches to a
halt with an error message, all in good order. Then, as far as I can see,
there is no filesystem corruption as Marcin was able to trigger easily on an
unpatched system. We do need to confirm this by putting together some sort of
preliminary fsck, however, things are looking pretty good. I think we can
evolve this off the wall idea into a solid out of space handling mechanism that
is light, tight and efficient.
Regards,
Daniel
diff -r fbdd7c18a67c user/commit.c
--- a/user/commit.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/commit.c Mon Mar 02 15:17:44 2009 -0800
@@ -242,12 +242,13 @@ static int commit_delta(struct sb *sb)
return unstash(sb, &sb->defree, retire_bfree);
}
-void change_begin(struct sb *sb)
+int change_begin(struct sb *sb)
{
down_read(&sb->delta_lock);
+ return 0;
}
-void change_end(struct sb *sb)
+int change_end(struct sb *sb)
{
if (need_delta(sb)) {
unsigned delta = sb->delta;
@@ -262,6 +263,7 @@ void change_end(struct sb *sb)
up_write(&sb->delta_lock);
} else
up_read(&sb->delta_lock);
+ return 0;
}
int bitmap_io(struct buffer_head *buffer, int write)
diff -r fbdd7c18a67c user/filemap.c
--- a/user/filemap.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/filemap.c Mon Mar 02 15:17:44 2009 -0800
@@ -14,6 +14,13 @@
#include "kernel/ileaf.c"
#include "kernel/balloc.c"
#include "kernel/filemap.c"
+
+int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len)
+{
+ return ioabs(dev->fd, data, len, rw, offset);
+}
+
+#include "kernel/commit.c"
/*
* Extrapolate from single buffer flush or blockread to opportunistic exent IO
@@ -112,8 +119,8 @@ int filemap_extent_io(struct buffer_head
}
#ifdef build_filemap
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
static void check_created_seg(struct seg *seg)
{
diff -r fbdd7c18a67c user/ileaf.c
--- a/user/ileaf.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/ileaf.c Mon Mar 02 15:17:44 2009 -0800
@@ -21,8 +21,8 @@
#include "kernel/xattr.c"
#include "kernel/ileaf.c"
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
struct ileaf *ileaf_create(struct btree *btree)
{
diff -r fbdd7c18a67c user/inode.c
--- a/user/inode.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/inode.c Mon Mar 02 15:17:44 2009 -0800
@@ -293,8 +293,8 @@ void tuxclose(struct inode *inode)
#include "super.c"
#ifdef build_inode
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
int main(int argc, char *argv[])
{
diff -r fbdd7c18a67c user/kernel/commit.c
--- a/user/kernel/commit.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/commit.c Mon Mar 02 15:17:44 2009 -0800
@@ -32,6 +32,7 @@ int load_sb(struct sb *sb)
sb->atomgen = from_be_u32(super->atomgen);
sb->freeatom = from_be_u32(super->freeatom);
sb->dictsize = from_be_u64(super->dictsize);
+ sb->margin = 100; // should be tunable?
trace("blocksize %u, blockbits %u, blockmask %08x",
sb->blocksize, sb->blockbits, sb->blockmask);
trace("volblocks %Lu, freeblocks %Lu, nextalloc %Lu",
@@ -61,3 +62,35 @@ int load_itable(struct sb *sb)
init_btree(itable_btree(sb), sb, unpack_root(iroot_val), &itable_ops);
return 0;
}
+
+#ifdef __KERNEL__
+int reserve_credits(struct sb *sb, unsigned credits)
+{
+ sb->credits += credits;
+ if (sb->margin > 0 && sb->freeblocks < sb->credits + sb->margin) {
+ struct block_device *bdev = vfs_sb(sb)->s_bdev;
+ warn(">>> freeblocks = %Lx, credits = %Lx", (L)sb->freeblocks, (L)sb->credits);
+ sb->margin = -sb->margin;
+ freeze_bdev(bdev);
+ sb->credits = 0;
+ thaw_bdev(bdev, vfs_sb(sb));
+ sb->margin = -sb->margin;
+ warn(">>> nospace = %i", sb->freeblocks < sb->credits + sb->margin);
+ if (sb->freeblocks < sb->credits + sb->margin) {
+ return -ENOSPC;
+ }
+ }
+ return 0;
+}
+
+int change_begin(struct sb *sb)
+{
+ return reserve_credits(sb, 10);
+}
+
+int change_end(struct sb *sb)
+{
+ // <try to return unused credits to pool> //
+ return 0;
+}
+#endif
diff -r fbdd7c18a67c user/kernel/dir.c
--- a/user/kernel/dir.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/dir.c Mon Mar 02 15:17:44 2009 -0800
@@ -132,7 +132,7 @@ loff_t _tux_create_entry(struct inode *d
while (entry <= limit) {
if (entry->rec_len == 0) {
brelse(buffer);
- tux_error(dir->i_sb, "zero-length directory entry");
+ warn("[CORRUPTION] Zero length directory record detected!");
return -EIO;
}
name_len = TUX_REC_LEN(entry->name_len);
diff -r fbdd7c18a67c user/kernel/filemap.c
--- a/user/kernel/filemap.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/filemap.c Mon Mar 02 15:17:44 2009 -0800
@@ -511,7 +511,11 @@ static int tux3_da_write_begin(struct fi
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
+ struct sb *sb = tux_sb(mapping->host->i_sb);
+ int err;
*pagep = NULL;
+ if ((err = reserve_credits(sb, (len >> sb->blockbits) + 10)))
+ return err;
return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
tux3_da_get_block);
}
@@ -519,8 +523,10 @@ static int tux3_writepage(struct page *p
static int tux3_writepage(struct page *page, struct writeback_control *wbc)
{
struct sb *sb = tux_sb(page->mapping->host->i_sb);
- change_begin(sb);
- int err = block_write_full_page(page, tux3_get_block, wbc);
+ int err = change_begin(sb);
+ if (err)
+ return err;
+ err = block_write_full_page(page, tux3_get_block, wbc);
change_end(sb);
return err;
}
diff -r fbdd7c18a67c user/kernel/namei.c
--- a/user/kernel/namei.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/namei.c Mon Mar 02 15:17:44 2009 -0800
@@ -66,7 +66,8 @@ static int tux3_mknod(struct inode *dir,
if (!huge_valid_dev(rdev))
return -EINVAL;
- change_begin(tux_sb(dir->i_sb));
+ if ((err = change_begin(tux_sb(dir->i_sb))))
+ return err;
inode = tux_create_inode(dir, mode, rdev);
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
@@ -105,8 +106,8 @@ static int tux3_link(struct dentry *old_
if (inode->i_nlink >= TUX_LINK_MAX)
return -EMLINK;
-
- change_begin(tux_sb(inode->i_sb));
+ if ((err = change_begin(tux_sb(inode->i_sb))))
+ return err;
inode->i_ctime = gettime();
inode_inc_link_count(inode);
atomic_inc(&inode->i_count);
@@ -125,7 +126,8 @@ static int tux3_symlink(struct inode *di
struct inode *inode;
int err;
- change_begin(tux_sb(dir->i_sb));
+ if ((err = change_begin(tux_sb(dir->i_sb))))
+ return err;
inode = tux_create_inode(dir, S_IFLNK | S_IRWXUGO, 0);
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
@@ -146,9 +148,10 @@ static int tux3_unlink(struct inode *dir
static int tux3_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- change_begin(tux_sb(inode->i_sb));
- int err = tux_del_dirent(dir, dentry);
- if (!err) {
+ int err;
+ if ((err = change_begin(tux_sb(dir->i_sb))))
+ return err;
+ if (!(tux_del_dirent(dir, dentry))) {
inode->i_ctime = dir->i_ctime;
inode_dec_link_count(inode);
}
@@ -161,19 +164,18 @@ static int tux3_rmdir(struct inode *dir,
struct inode *inode = dentry->d_inode;
int err;
- err = tux_dir_is_empty(inode);
- if (!err) {
- change_begin(tux_sb(inode->i_sb));
- err = tux_del_dirent(dir, dentry);
- if (!err) {
- inode->i_ctime = dir->i_ctime;
- inode->i_size = 0;
- clear_nlink(inode);
- mark_inode_dirty(inode);
- inode_dec_link_count(dir);
- }
- change_end(tux_sb(inode->i_sb));
+ if ((err = tux_dir_is_empty(inode)))
+ return err;
+ if ((err = change_begin(tux_sb(dir->i_sb))))
+ return err;
+ if (!(tux_del_dirent(dir, dentry))) {
+ inode->i_ctime = dir->i_ctime;
+ inode->i_size = 0;
+ clear_nlink(inode);
+ mark_inode_dirty(inode);
+ inode_dec_link_count(dir);
}
+ change_end(tux_sb(inode->i_sb));
return err;
}
@@ -194,7 +196,8 @@ static int tux3_rename(struct inode *old
/* FIXME: is this needed? */
BUG_ON(from_be_u64(old_entry->inum) != tux_inode(old_inode)->inum);
- change_begin(tux_sb(old_inode->i_sb));
+ if ((err = change_begin(tux_sb(old_dir->i_sb))))
+ return err;
if (new_inode) {
int old_is_dir = S_ISDIR(old_inode->i_mode);
if (old_is_dir) {
diff -r fbdd7c18a67c user/kernel/tux3.h
--- a/user/kernel/tux3.h Sun Mar 01 11:11:01 2009 +0900
+++ b/user/kernel/tux3.h Mon Mar 02 15:17:44 2009 -0800
@@ -208,7 +208,7 @@ static inline void flink_last_del(struct
/* Tux3 disk format */
#define SB_MAGIC_SIZE 8
-#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x18 } /* date of latest incompatible sb format */
+#define SB_MAGIC { 't', 'u', 'x', '3', 0xdd, 0x09, 0x02, 0x28 } /* date of latest incompatible sb format */
/*
* disk format revision history
* !!! always update this for every incompatible change !!!
@@ -317,6 +317,8 @@ struct sb {
struct rw_semaphore delta_lock; /* delta transition exclusive */
unsigned blocksize, blockbits, blockmask;
block_t volblocks, freeblocks, nextalloc;
+ unsigned credits; /* Blocks reserved for inflight operations */
+ int margin; /* ENOSPC when freeblocks less than this */
unsigned entries_per_node; /* must be per-btree type, get rid of this */
unsigned max_inodes_per_block; /* get rid of this and use entries per leaf */
unsigned version; /* Currently mounted volume version view */
@@ -881,14 +883,20 @@ int retire_frees(struct sb *sb, struct s
int retire_frees(struct sb *sb, struct stash *defree);
void destroy_defree(struct stash *defree);
-/* commit.c */
+/* utility.c */
int vecio(int rw, struct block_device *dev, sector_t sector,
bio_end_io_t endio, void *data, unsigned vecs, struct bio_vec *vec);
int syncio(int rw, struct block_device *dev, sector_t sector, unsigned vecs, struct bio_vec *vec);
int devio(int rw, struct block_device *dev, loff_t offset, void *data, unsigned len);
+
+/* commit.c */
+
int load_sb(struct sb *sb);
int save_sb(struct sb *sb);
int load_itable(struct sb *sb);
+int reserve_credits(struct sb *sb, unsigned credits);
+int change_begin(struct sb *sb);
+int change_end(struct sb *sb);
/* temporary hack for buffer */
struct buffer_head *blockread(struct address_space *mapping, block_t iblock);
@@ -916,9 +924,6 @@ static inline struct buffer_head *blockd
return buffer;
}
-static inline void change_begin(struct sb *sb) { }
-static inline void change_end(struct sb *sb) { }
-
#endif /* __KERNEL__ */
static inline struct buffer_head *vol_getblk(struct sb *sb, block_t block)
diff -r fbdd7c18a67c user/super.c
--- a/user/super.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/super.c Mon Mar 02 15:17:44 2009 -0800
@@ -13,13 +13,6 @@
#ifndef trace
#define trace trace_on
#endif
-
-int devio(int rw, struct dev *dev, loff_t offset, void *data, unsigned len)
-{
- return ioabs(dev->fd, data, len, rw, offset);
-}
-
-#include "kernel/commit.c"
int sync_super(struct sb *sb)
{
diff -r fbdd7c18a67c user/tux3.c
--- a/user/tux3.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3.c Mon Mar 02 15:17:44 2009 -0800
@@ -11,8 +11,8 @@
#include "inode.c"
#include <getopt.h>
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
int fls(uint32_t v)
{
diff -r fbdd7c18a67c user/tux3.h
--- a/user/tux3.h Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3.h Mon Mar 02 15:17:44 2009 -0800
@@ -212,9 +212,6 @@ static inline dev_t huge_decode_dev(u64
#define mark_btree_dirty(x) do {} while (0)
-void change_begin(struct sb *sb);
-void change_end(struct sb *sb);
-
#define INIT_INODE(inode, sb, mode) \
.i_sb = sb, \
.i_mode = mode, \
@@ -267,4 +264,7 @@ static inline void mark_inode_dirty(stru
enum rw { READ, WRITE };
+int change_begin(struct sb *sb);
+int change_end(struct sb *sb);
+
#endif
diff -r fbdd7c18a67c user/tux3fuse.c
--- a/user/tux3fuse.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3fuse.c Mon Mar 02 15:17:44 2009 -0800
@@ -48,8 +48,8 @@
#undef trace
#define trace trace_on
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
static struct sb *sb;
static struct dev *dev;
diff -r fbdd7c18a67c user/tux3graph.c
--- a/user/tux3graph.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/tux3graph.c Mon Mar 02 15:17:44 2009 -0800
@@ -13,8 +13,8 @@
#include "inode.c"
#include <getopt.h>
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
static const char *dtree_names[] = {
[TUX_BITMAP_INO] = "bitmap",
diff -r fbdd7c18a67c user/xattr.c
--- a/user/xattr.c Sun Mar 01 11:11:01 2009 +0900
+++ b/user/xattr.c Mon Mar 02 15:17:44 2009 -0800
@@ -21,8 +21,8 @@
#include "kernel/xattr.c"
#include "kernel/ileaf.c"
-void change_begin(struct sb *sb) { }
-void change_end(struct sb *sb) { }
+int change_begin(struct sb *sb) { return 0; }
+int change_end(struct sb *sb) { return 0; }
int main(int argc, char *argv[])
{
_______________________________________________
Tux3 mailing list
[email protected]
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3