Re: System completely unresponsive after `btrfs balance start -dconvert=raid0 /` and `btrfs fi show /`
Carmine Paolino posted on Tue, 13 Oct 2015 23:21:49 +0200 as excerpted:

> I have an home server with 3 hard drives that I added to the same btrfs
> filesystem. Several hours ago I run `btrfs balance start -dconvert=raid0
> /` and as soon as I run `btrfs fi show /` I lost my ssh connection to
> the machine. The machine is still on, but it doesn’t even respond to
> ping[. ...]
>
> (I have a 250gb internal hard drive, a 120gb usb 2.0 one and a 2TB usb
> 2.0 one so the transfer speeds are pretty low)

I won't attempt to answer the primary question[1] directly, but can point
out that in many cases, USB-connected devices simply don't have a stable
enough connection to work reliably in a multi-device btrfs. There are
several possibilities for failure, including flaky connections (sometimes
assisted by cats or kids), unstable USB host port drivers, and unstable
USB/ATA translators. A number of folks have reported problems with such
filesystems with devices connected over USB, problems that simply
disappear if they direct-connect the exact same devices to a proper SATA
port. The problem seems to be /dramatically/ worse with USB-connected
devices than it is with, for instance, PCIE-based SATA expansion cards.

Single-device btrfs with USB-attached devices seems to work rather better,
because at least in that case, if the connection is flaky, the entire
filesystem appears and disappears at once, and btrfs' COW, atomic-commit
and data-integrity features kick in to help deal with the connection's
instability.

Arguably, a two-device raid1 (both data/metadata, with metadata including
system) should work reasonably well too, as long as scrubs are done after
reconnection when there's trouble with one of the pair, because in that
case, all data appears on both devices. But single and raid0 modes are
likely to have severe issues in that sort of environment, because even
temporary disconnection of a single device means loss of access to some
data/metadata on the filesystem.

Raid10, 3+-device-raid1, and raid5/6 are more complex situations. They
should survive loss of at least one device, but keeping the filesystem
healthy in the presence of unstable connections is... complex enough I'd
hate to be the one having to deal with it, which means I can't recommend
it to others, either.

So I'd recommend either connecting all devices internally if possible, or
setting up the USB-connected devices with separate filesystems, if
internal direct-connection isn't possible.

---
[1] Sysadmin's rule of backups. If the data isn't backed up, by
definition it is of less value than the resource and hassle cost of
backup. No exceptions -- post-loss claims to the contrary simply put the
lie to the claims, as actions spoke louder than words and they defined the
cost of the backup as more expensive than the data that would have been
backed up. Worst-case is then loss of data that was by definition of less
value than the cost of backup, and the more valuable resource and hassle
cost of the backup was avoided, so the comparatively lower value data loss
is no big deal.

So in a case like this, I'd simply power down and take my chances of
filesystem loss, strictly limiting the time and resources I'd devote to
any further attempt at recovery, because the data is by definition either
backed up, or of such low value that a backup was considered too expensive
to do, meaning there's a very real possibility of spending more time in a
recovery attempt that's iffy at best, than the data on the filesystem is
actually worth, either because there are backups, or because it's
throw-away data in the first place.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: State of Dedup / Defrag
On Tue, Oct 13, 2015 at 02:59:59PM -0400, Rich Freeman wrote:
> What is the current state of Dedup and Defrag in btrfs? I seem to
> recall there having been problems a few months ago and I've stopped
> using it, but I haven't seen much news since.

It has been 1 day since a kernel bug leading to data loss was fixed in
the ioctl calls for dedup (commit 6e685a1e3e9054d43fac58f2bc0cd070df915079
from fdmanana yesterday); however, to hit that particular bug you'd need
to be doing something unusual with the ioctls--in particular, a thing that
makes no sense for dedup, and that dedup userspace programs intentionally
avoid doing.

There was another bug for defrag 68 days ago.

I wouldn't try to use dedup on a kernel older than v4.1 because of these
fixes in 4.1 and later:

- allow dedup of the ends of files when they are not aligned to 4K.
  Before this was fixed, up to 1GB of space could be wasted per file.

- no mtime update on extent-same. With the update, rsync and backup
  programs think all the deduped files are modified. The next rsync
  after dedup would immediately un-dedup (redup?) all the deduped files.

- fixes for deadlocks. If dedup is running at the same time as other
  readers of files (e.g. deduping /usr or a tree on a busy file server),
  a deadlock was inevitable.

IMHO these fixes really made dedup usable for the first time.

There are some other fixes that appeared after v4.1, but they should not
impact cases where mostly static data is deduped without concurrent
modifications. Do dedup a photo or video file collection. Don't dedup
a live database server on a filesystem with compression enabled...yet.

Using dedup and defrag at the same time is still a bad idea. The features
work against each other: autodefrag skips anything that has been deduped,
while manual defrag un-dedups everything it touches. The effect of defrag
on dedup depends on the choice of dedup userspace strategy, so defrag can
either be helpful or harmful.

Autodefrag in my experience pushes write latencies up to insane levels.
Data ends up making multiple round-trips to the disk _with_ extra
constraints on the allocator on the second and later passes, and while
this is happening any other writes on the filesystem block an absurdly
long time. It can easily cost more I/O time than it saves.

That said, there are some kernel patches floating around to fix the
allocator, so at least we can hope autodefrag will be less bad someday.
[PATCH v2] btrfs: compress: put variables defined per compress type in struct to make cache friendly
The following variables are defined per compress type:

 - struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
 - spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
 - int comp_num_workspace[BTRFS_COMPRESS_TYPES]
 - atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
 - wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]

With this layout, when the variables of one compress type are accessed,
the neighbouring addresses hold the same variable for the other compress
types rather than the remaining variables of that type. So this patch
puts these variables in a struct to make them cache friendly.

Signed-off-by: Byongho Lee
---
V2: Apply David's review comment. Rename struct comp to btrfs_comp_ws
    and trim its members to 'ws' instead of 'workspace'.

 fs/btrfs/compression.c | 94 ++
 1 file changed, 48 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ce62324c78e7..8e94ae5fe732 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -744,11 +744,13 @@ out:
 	return ret;
 }
-static struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES];
-static spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES];
-static int comp_num_workspace[BTRFS_COMPRESS_TYPES];
-static atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES];
-static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES];
+static struct {
+	struct list_head idle_ws;
+	spinlock_t ws_lock;
+	int num_ws;
+	atomic_t alloc_ws;
+	wait_queue_head_t ws_wait;
+} btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
 	&btrfs_zlib_compress,
@@ -760,10 +762,10 @@ void __init btrfs_init_compress(void)
 	int i;
 	for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
-		INIT_LIST_HEAD(&comp_idle_workspace[i]);
-		spin_lock_init(&comp_workspace_lock[i]);
-		atomic_set(&comp_alloc_workspace[i], 0);
-		init_waitqueue_head(&comp_workspace_wait[i]);
+		INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
+		spin_lock_init(&btrfs_comp_ws[i].ws_lock);
+		atomic_set(&btrfs_comp_ws[i].alloc_ws, 0);
+		init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
 	}
 }
@@ -777,38 +779,38 @@ static struct list_head *find_workspace(int type)
 	int cpus = num_online_cpus();
 	int idx = type - 1;
-	struct list_head *idle_workspace = &comp_idle_workspace[idx];
-	spinlock_t *workspace_lock = &comp_workspace_lock[idx];
-	atomic_t *alloc_workspace = &comp_alloc_workspace[idx];
-	wait_queue_head_t *workspace_wait = &comp_workspace_wait[idx];
-	int *num_workspace = &comp_num_workspace[idx];
+	struct list_head *idle_ws = &btrfs_comp_ws[idx].idle_ws;
+	spinlock_t *ws_lock = &btrfs_comp_ws[idx].ws_lock;
+	atomic_t *alloc_ws = &btrfs_comp_ws[idx].alloc_ws;
+	wait_queue_head_t *ws_wait = &btrfs_comp_ws[idx].ws_wait;
+	int *num_ws = &btrfs_comp_ws[idx].num_ws;
 again:
-	spin_lock(workspace_lock);
-	if (!list_empty(idle_workspace)) {
-		workspace = idle_workspace->next;
+	spin_lock(ws_lock);
+	if (!list_empty(idle_ws)) {
+		workspace = idle_ws->next;
 		list_del(workspace);
-		(*num_workspace)--;
-		spin_unlock(workspace_lock);
+		(*num_ws)--;
+		spin_unlock(ws_lock);
 		return workspace;
 	}
-	if (atomic_read(alloc_workspace) > cpus) {
+	if (atomic_read(alloc_ws) > cpus) {
 		DEFINE_WAIT(wait);
-		spin_unlock(workspace_lock);
-		prepare_to_wait(workspace_wait, &wait, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(alloc_workspace) > cpus && !*num_workspace)
+		spin_unlock(ws_lock);
+		prepare_to_wait(ws_wait, &wait, TASK_UNINTERRUPTIBLE);
+		if (atomic_read(alloc_ws) > cpus && !*num_ws)
 			schedule();
-		finish_wait(workspace_wait, &wait);
+		finish_wait(ws_wait, &wait);
 		goto again;
 	}
-	atomic_inc(alloc_workspace);
-	spin_unlock(workspace_lock);
+	atomic_inc(alloc_ws);
+	spin_unlock(ws_lock);
 	workspace = btrfs_compress_op[idx]->alloc_workspace();
 	if (IS_ERR(workspace)) {
-		atomic_dec(alloc_workspace);
-		wake_up(workspace_wait);
+		atomic_dec(alloc_ws);
+		wake_up(ws_wait);
 	}
 	return workspace;
 }
@@ -820,27 +822,27 @@ again:
 static void free_workspace(int type, struct list_head *workspace)
 {
 	int idx = type - 1;
-	struct list_head *idle_workspace = &comp_idle_workspace[idx];
-	spinlock_t *workspace_lock = &comp_workspace_lock[idx];
-	atomic_t *alloc_workspace = &comp_alloc_workspace[idx];
-	wait_queue_head_t
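
The layout argument in the changelog can be tried outside the kernel. The following is only a userspace sketch, not the kernel code: `NUM_TYPES`, `struct comp_ws`, `ws_table` and `ws_span()` are hypothetical names, and plain pointer/int members stand in for the kernel's list, lock and wait-queue types. It shows that, with an array of structs, all state belonging to one compress type sits in one contiguous span.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TYPES 2	/* hypothetical stand-in for BTRFS_COMPRESS_TYPES */

/*
 * Array-of-structs layout: everything for compress type i lives in
 * ws_table[i]. With one array per variable (the old layout), the
 * neighbours of lock[i] would be lock[i-1] and lock[i+1] instead.
 */
struct comp_ws {
	void *idle_ws;	/* stand-in for struct list_head */
	int ws_lock;	/* stand-in for spinlock_t */
	int num_ws;
};

static struct comp_ws ws_table[NUM_TYPES];

/* Bytes from the first member to one past the last member of type idx. */
static size_t ws_span(int idx)
{
	assert(idx >= 0 && idx < NUM_TYPES);
	const char *first = (const char *)&ws_table[idx].idle_ws;
	const char *end = (const char *)(&ws_table[idx].num_ws + 1);
	return (size_t)(end - first);
}
```

Whether this actually reduces cache misses depends on member sizes and the cache line size of the machine; the sketch only demonstrates the adjacency.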
Re: [PATCH] btrfs: fix use after free iterating extrefs
On Tue, Oct 13, 2015 at 12:17:55PM -0700, Mark Fasheh wrote:
> On Tue, Oct 13, 2015 at 02:06:48PM -0400, Chris Mason wrote:
> > The code for btrfs inode-resolve has never worked properly for
> > files with enough hard links to trigger extrefs. It was trying to
> > get the leaf out of a path after freeing the path:
> >
> > 	btrfs_release_path(path);
> > 	leaf = path->nodes[0];
> > 	item_size = btrfs_item_size_nr(leaf, slot);
> >
> > The fix here is to use the extent buffer we cloned just a little higher
> > up to avoid deadlocks caused by using the leaf in the path.
> >
> > Signed-off-by: Chris Mason
> > cc: sta...@vger.kernel.org # v3.7+
> > cc: Mark Fasheh
>
> Reviewed-by: Mark Fasheh

Thanks Mark and Filipe, I've tested this and queued it up.

-chris
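
The ordering problem described in the quoted changelog can be sketched in plain userspace C. This is an illustration under assumptions, not the btrfs code: `struct leaf`, `struct path`, `release_path()` and `item_size_via_clone()` are hypothetical stand-ins for the extent buffer, `btrfs_path` and `btrfs_release_path()`. The point is only that data must be read from a private copy taken before the release, never through the path afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's extent buffer and path. */
struct leaf { int item_size; };
struct path { struct leaf *node; };

/* Drops the path's reference; path->node must not be used afterwards. */
static void release_path(struct path *p)
{
	free(p->node);
	p->node = NULL;
}

/* Safe ordering: take a private copy of the leaf first, then release. */
static int item_size_via_clone(struct path *p)
{
	assert(p && p->node);
	struct leaf clone = *p->node;	/* independent of the path */
	release_path(p);
	return clone.item_size;		/* reads the copy, not freed memory */
}
```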
Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong wrote:
> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> >> > This allows us to have an in-kernel copy mechanism that avoids frequent
> >> > switches between kernel and user space. This is especially useful so
> >> > NFSD can support server-side copies.
> >> >
> >> > I make pagecache copies configurable by adding three new (exclusive)
> >> > flags:
> >> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> >> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> >> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >> >   ranges are identical.
> >>
> >> All but FR_COPY really should be a separate system call. Clones (and
> >> dedup as a special case of clones) are really a separate beast from file
> >> copies.
> >>
> >> If I want to clone a file I either want it clone fully or fail, not copy
> >> a certain amount. That means that a) we need to return an error not
> >> short "write", and b) locking implementations are important - we need to
> >> prevent other applications from racing with our clone even if it is
> >> large, while to get these semantics for the possible short returning
> >> file copy will require a proper userland locking protocol. Last but not
> >> least file copies need to be interruptible while clones should be not.
> >> All this is already important for local file systems and even more
> >> important for NFS exporting.
> >>
> >> So I'd suggest to drop this patch and just let your syscall handle
> >> actual copies with all their horrors. We can go with Peng's patches
> >> to generalize the btrfs ioctls for clones for now which is what everyone
> >> already uses anyway, and then add a separate sys_file_clone later.
> >
> > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME
> > ioctl.
> >
> > What does everyone think about generalizing EXTENT_SAME? The interface
> > enables one to ask the kernel to dedupe multiple file ranges in a single
> > call. That's more complex than what I was proposing with
> > COPY_FR_DEDUP(E), but I'm assuming that the extra complexity buys us the
> > ability to ... multi-dedupe at the same time, with locks held on the
> > source file?
>
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Quite frankly, I'm not sure. Assuming NFS doesn't already have some sort of
deduplication primitive (I could be totally wrong about that) I'd probably
just leave the appropriate ops function pointer set to NULL and return
-EOPNOTSUPP to userspace. Trying to fake it by comparing contents on the
client and issuing a reflink might be doable with hard locks but if I had to
guess I'd say that's even less palatable than simply bailing out. :)

IOW: I was only considering the filesystems that already support dedupe,
which is basically btrfs and future-XFS.

--D
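
The "leave the ops pointer NULL and return -EOPNOTSUPP" approach mentioned above can be sketched as follows. This is a userspace illustration under assumptions: `struct fs_ops`, `vfs_dedupe()` and `toy_dedupe()` are hypothetical names, not the kernel's `file_operations` or any proposed interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-filesystem operations table; not the kernel's struct. */
struct fs_ops {
	int (*dedupe_range)(int src_fd, int dst_fd, size_t len);
};

/* Generic entry point: a filesystem that leaves the hook NULL gets -EOPNOTSUPP. */
static int vfs_dedupe(const struct fs_ops *ops, int src_fd, int dst_fd,
		      size_t len)
{
	assert(ops != NULL);
	if (!ops->dedupe_range)
		return -EOPNOTSUPP;
	return ops->dedupe_range(src_fd, dst_fd, len);
}

/* Toy hook standing in for a filesystem that does support dedupe. */
static int toy_dedupe(int src_fd, int dst_fd, size_t len)
{
	(void)src_fd;
	(void)dst_fd;
	(void)len;
	return 0;
}
```

With this shape, a filesystem such as NFS would simply not fill in the hook, and callers see one consistent error instead of a client-side emulation.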
Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote:
> Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
>
> What does everyone think about generalizing EXTENT_SAME? The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call. That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
>
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.

It's not pretty, but if the btrfs folks have a good reason for it I
don't see a reason to diverge.
Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Explicit dedup has no chance of working over NFS or other network
protocols without protocol changes.
[PATCH 1/2] btrfs: extend balance filter limit to take minimum and maximum
The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.

Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMITS).

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.h           | 14 --
 fs/btrfs/volumes.c         | 14 ++
 fs/btrfs/volumes.h         |  1 +
 include/uapi/linux/btrfs.h | 13 -
 4 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe33be80..7d2e1b6d0ac1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -846,8 +846,18 @@ struct btrfs_disk_balance_args {
 	/* BTRFS_BALANCE_ARGS_* */
 	__le64 flags;
-	/* BTRFS_BALANCE_ARGS_LIMIT value */
-	__le64 limit;
+	/*
+	 * BTRFS_BALANCE_ARGS_LIMIT with value 'limit'
+	 * BTRFS_BALANCE_ARGS_LIMITS - the extend version can use minimum and
+	 * maximum
+	 */
+	union {
+		__le64 limit;
+		struct {
+			__le32 limit_min;
+			__le32 limit_max;
+		};
+	};
 	__le64 unused[7];
 } __attribute__ ((__packed__));
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fc735869c18..0693e974f1c0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3250,6 +3250,15 @@ static int should_balance_chunk(struct btrfs_root *root,
 			return 0;
 		else
 			bargs->limit--;
+	} else if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMITS)) {
+		if (bargs->limit_min < bargs->limit_max) {
+			bargs->limit_max--;
+		} else if (bargs->limit_min == bargs->limit_max) {
+			bargs->limit_min = UINT_MAX;
+			bargs->limit_max = 0;
+		} else {
+			return 0;
+		}
 	}
 	return 1;
@@ -3274,6 +3283,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	int ret;
 	int enospc_errors = 0;
 	bool counting = true;
+	/* The single value limit and min/max limits use the same bytes in the */
 	u64 limit_data = bctl->data.limit;
 	u64 limit_meta = bctl->meta.limit;
 	u64 limit_sys = bctl->sys.limit;
@@ -3317,6 +3327,10 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	spin_unlock(&fs_info->balance_lock);
 again:
 	if (!counting) {
+		/*
+		 * The single value limit and min/max limits use the same bytes
+		 * in the
+		 */
 		bctl->data.limit = limit_data;
 		bctl->meta.limit = limit_meta;
 		bctl->sys.limit = limit_sys;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2ca784a14e84..1c9d8edd7d57 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -375,6 +375,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DRANGE (1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE (1ULL << 4)
 #define BTRFS_BALANCE_ARGS_LIMIT (1ULL << 5)
+#define BTRFS_BALANCE_ARGS_LIMITS (1ULL << 6)
 /*
  * Profile changing flags. When SOFT is set we won't relocate chunk if
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b6dec05c7196..264ecea5ecfc 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -217,7 +217,18 @@ struct btrfs_balance_args {
 	__u64 flags;
-	__u64 limit;		/* limit number of processed chunks */
+	/*
+	 * BTRFS_BALANCE_ARGS_LIMIT with value 'limit'
+	 * BTRFS_BALANCE_ARGS_LIMITS - the extend version can use minimum and
+	 * maximum
+	 */
+	union {
+		__u64 limit;		/* limit number of processed chunks */
+		struct {
+			__u32 limit_min;
+			__u32 limit_max;
+		};
+	};
 	__u64 unused[7];
 } __attribute__ ((__packed__));
--
2.6.1
[PATCH 0/2] Balance filters: stripes, enhanced limit
Update to balance filters, intended for 4.4:

* new 'stripes=' - process only chunks that span a given number of
  devices, specified by a range
* updated 'limit=' - previously a single number was accepted, now it's a
  range, so we can also specify a minimum number of chunks to process

There will be more documentation about the use in the btrfs-progs patches,
the kernel side just applies the ranges. The update to 'limit' is backward
compatible and reuses the previous struct member.

Can be pulled from

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git dev/balance-filters

I'm finalizing the progs patches and haven't tested them extensively.

David Sterba (1):
  btrfs: extend balance filter limit to take minimum and maximum

Gabríel Arthúr Pétursson (1):
  btrfs: add balance filter for stripes

 fs/btrfs/ctree.h           | 23 ---
 fs/btrfs/volumes.c         | 33 +
 fs/btrfs/volumes.h         |  2 ++
 include/uapi/linux/btrfs.h | 23 +--
 4 files changed, 76 insertions(+), 5 deletions(-)

--
2.6.1
[PATCH 2/2] btrfs: add balance filter for stripes
From: Gabríel Arthúr Pétursson

Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.

Signed-off-by: Gabríel Arthúr Pétursson
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba
---
 fs/btrfs/ctree.h           |  9 -
 fs/btrfs/volumes.c         | 19 +++
 fs/btrfs/volumes.h         |  1 +
 include/uapi/linux/btrfs.h | 10 +-
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7d2e1b6d0ac1..e2eefa222999 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -859,7 +859,14 @@ struct btrfs_disk_balance_args {
 		};
 	};
-	__le64 unused[7];
+	/*
+	 * Process chunks that cross stripes_min..stripes_max devices,
+	 * BTRFS_BALANCE_ARGS_STRIPES
+	 */
+	__le32 stripes_min;
+	__le32 stripes_max;
+
+	__le64 unused[6];
 } __attribute__ ((__packed__));
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0693e974f1c0..51c0e5b219a3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3170,6 +3170,19 @@ static int chunk_vrange_filter(struct extent_buffer *leaf,
 	return 1;
 }
+static int chunk_stripes_filter(struct extent_buffer *leaf,
+			       struct btrfs_chunk *chunk,
+			       struct btrfs_balance_args *bargs)
+{
+	int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+
+	if (bargs->stripes_min <= num_stripes
+			&& num_stripes <= bargs->stripes_max)
+		return 0;
+
+	return 1;
+}
+
 static int chunk_soft_convert_filter(u64 chunk_type,
 				     struct btrfs_balance_args *bargs)
 {
@@ -3236,6 +3249,12 @@ static int should_balance_chunk(struct btrfs_root *root,
 		return 0;
 	}
+	/* stripes filter */
+	if ((bargs->flags & BTRFS_BALANCE_ARGS_STRIPES) &&
+			chunk_stripes_filter(leaf, chunk, bargs)) {
+		return 0;
+	}
+
 	/* soft profile changing mode */
 	if ((bargs->flags & BTRFS_BALANCE_ARGS_SOFT) &&
 	    chunk_soft_convert_filter(chunk_type, bargs)) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1c9d8edd7d57..a87d96d75d07 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -376,6 +376,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_VRANGE (1ULL << 4)
 #define BTRFS_BALANCE_ARGS_LIMIT (1ULL << 5)
 #define BTRFS_BALANCE_ARGS_LIMITS (1ULL << 6)
+#define BTRFS_BALANCE_ARGS_STRIPES (1ULL << 7)
 /*
  * Profile changing flags. When SOFT is set we won't relocate chunk if
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 264ecea5ecfc..ab720200d0f7 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -229,7 +229,15 @@ struct btrfs_balance_args {
 			__u32 limit_max;
 		};
 	};
-	__u64 unused[7];
+
+	/*
+	 * Process chunks that cross stripes_min..stripes_max devices,
+	 * BTRFS_BALANCE_ARGS_STRIPES
+	 */
+	__le32 stripes_min;
+	__le32 stripes_max;
+
+	__u64 unused[6];
 } __attribute__ ((__packed__));
 /* report balance progress to userspace */
--
2.6.1
Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> leaves holes alone.

Yes, I've seen the implementation.

> Obviously we haven't yet figured out what are peoples' preferences in terms of
> "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> fill the holes". It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> that fills the holes while unsharing is going on.
>
> Personally I suspect that the most interest is in filling holes and unsharing,
> because they don't want to pay for allocation at a critical stage for anywhere
> in the file. But I could be wrong, so allowing both goals to be expressed via
> mode allows flexibility.

Exactly. And a normal falloc should do just that - fill holes and ensure
that we don't need to COW already allocated blocks. So I don't think we
need a new fallocate interface for that.

The question is if we want a copy interface that gives you the same
semantics as if you also called an fallocate on the destination range.
For that case we'd usually want to avoid doing the clone and instead do
an in-kernel or hardware assisted copy and then fill the holes with
unwritten extents.
[PATCH] btrfs-progs: mkfs: Enable -d dup for single device
The current code doesn't support the dup profile for data on a single
device, except in mixed mode, for the following reasons:
1: On some SSDs with a deduplication function, it has no effect.
2: For a physical device, if the entire disk breaks, -d dup cannot help.
3: Half the performance compared with the single profile.
4: We have a workaround: create multiple partitions on a single device,
   and btrfs will treat them as multiple devices.

Instead of refusing -d dup, we have a better choice: give the user the
chance to select it, and output a warning about the problems above.

Signed-off-by: Zhao Lei
---
 mkfs.c  |  2 +-
 utils.c | 10 --
 utils.h |  2 +-
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index ecd6fbf..7fa7cfc 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1578,7 +1578,7 @@ int main(int ac, char **av)
 		}
 	}
 	ret = test_num_disk_vs_raid(metadata_profile, data_profile,
-			dev_cnt, mixed);
+			dev_cnt, mixed, ssd);
 	if (ret)
 		exit(1);
diff --git a/utils.c b/utils.c
index f1e3248..d81c2d9 100644
--- a/utils.c
+++ b/utils.c
@@ -2425,7 +2425,7 @@ static int group_profile_devs_min(u64 flag)
 }
 int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
-	u64 dev_cnt, int mixed)
+	u64 dev_cnt, int mixed, int ssd)
 {
 	u64 allowed = 0;
@@ -2466,11 +2466,9 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
 		return 1;
 	}
-	if (!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP)) {
-		fprintf(stderr,
-			"ERROR: DUP for data is allowed only in mixed mode\n");
-		return 1;
-	}
+	warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd,
+		   "DUP have no effect if your SSD have deduplication function");
+
 	return 0;
 }
diff --git a/utils.h b/utils.h
index 044ea15..b85f3fe 100644
--- a/utils.h
+++ b/utils.h
@@ -167,7 +167,7 @@ int test_dev_for_mkfs(char *file, int force_overwrite);
 int get_label_mounted(const char *mount_path, char *labelp);
 int get_label_unmounted(const char *dev, char *label);
 int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
-	u64 dev_cnt, int mixed);
+	u64 dev_cnt, int mixed, int ssd);
 int group_profile_max_safe_loss(u64 flags);
 int is_vol_small(char *file);
 int csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf,
--
1.8.5.1
behavior of BTRFS in relation to inodes when moving/copying files between filesystems
Hi!

With BTRFS to XFS/Ext4 the inode number of the target file stays the same
in both the cp and the mv case (/mnt/zeit is a freshly created XFS in this
example):

merkaba:~> ls -li foo /mnt/zeit/moo
6609270 foo
99 /mnt/zeit/moo
merkaba:~> cp foo /mnt/zeit/moo
merkaba:~> ls -li foo /mnt/zeit/moo
6609270 8 foo
99 /mnt/zeit/moo
merkaba:~> cp -p foo /mnt/zeit/moo
merkaba:~> ls -li foo /mnt/zeit/moo
6609270 foo
99 /mnt/zeit/moo
merkaba:~> mv foo /mnt/zeit/moo
merkaba:~> ls -lid /mnt/zeit/moo
99 -rw-r--r-- 1 root root 6 Okt 13 12:28 /mnt/zeit/moo

With BTRFS as the target filesystem, however, in the mv case I get a new
inode:

merkaba:~> ls -li foo /home/moo
6609289 -rw-r--r-- 1 root root 6 Okt 13 12:34 foo
16476276 -rw-r--r-- 1 root root 6 Okt 13 12:34 /home/moo
merkaba:~> cp foo /home/moo
merkaba:~> ls -li foo /home/moo
6609289 -rw-r--r-- 1 root root 6 Okt 13 12:34 foo
16476276 -rw-r--r-- 1 root root 6 Okt 13 12:34 /home/moo
merkaba:~> cp -p foo /home/moo
merkaba:~> ls -li foo /home/moo
6609289 -rw-r--r-- 1 root root 6 Okt 13 12:34 foo
16476276 -rw-r--r-- 1 root root 6 Okt 13 12:34 /home/moo
merkaba:~> mv foo /home/moo
merkaba:~> ls -li /home/moo
16476280 -rw-r--r-- 1 root root 6 Okt 13 12:34 /home/moo

Is this intentional and/or somehow related to the copy on write specifics
of the filesystem?

I think even with COW it can just overwrite the existing file instead of
removing the old one and creating a new one – but it wouldn't give much of
a benefit unless the target file is nocow.

(Also I thought only certain other utilities had supercow powers, but well
BTRFS seems to have them as well :)

Thanks,
--
Martin
RE: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device
Hi, David Sterba > -Original Message- > From: David Sterba [mailto:dste...@suse.cz] > Sent: Tuesday, October 13, 2015 8:36 PM > To: Zhao Lei> Cc: dste...@suse.cz; linux-btrfs@vger.kernel.org; c...@fb.com > Subject: Re: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device > > On Tue, Oct 13, 2015 at 07:40:30PM +0800, Zhao Lei wrote: > > > What I remember from the comment is that "it's slightly offset that > > > would lead to corruption". > > > > Before this patch, I had done git blame to search why the condition > > was added, and hadn't found the exact reason. > > found it: commit bc3f116fec194f1d7329b160c266fe16b9266a1e and it was not > aobut data/dup but mixed bgs with sectorisze != nodesize: > > 26 + nodesize = btrfs_super_nodesize(disk_super); > 27 + leafsize = btrfs_super_leafsize(disk_super); > 28 + sectorsize = btrfs_super_sectorsize(disk_super); > 29 + stripesize = btrfs_super_stripesize(disk_super); > 30 + > 31 + /* > 32 +* mixed block groups end up with duplicate but slightly offset > 33 +* extent buffers for the same range. It leads to corruptions > 34 +*/ > 35 + if ((features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS) > && > 36 + (sectorsize != leafsize)) { > 37 + printk(KERN_WARNING "btrfs: unequal > leaf/node/sector sizes " > 38 + "are not allowed for mixed block > groups on %s\n", > 39 + sb->s_id); > 40 + goto fail_alloc; > 41 + } > 42 + > Thanks for this information, I'll investigate is similar problem in non-mixed with dup. > > I will queue xfstests(btrfs/generic) at this profile with all mount > > option for multi-times, to check is something wrong with this. > > Thanks. We need to cover more: the balance conversion forbids data/dup > profile, I'm not sure if scrub handles that properly, and the ususal suspects > in > the rescue tools (fsck, restore, chunk-recover). > Agree, a new profile may be have potential problem because existing code haven't check the support status. 
IMHO, it is still necessary unless we can prove this feature should not exist. But we'll need to do more work to confirm the potential problems above.

Thanks
Zhaolei
Re: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device
On Tue, Oct 13, 2015 at 07:40:30PM +0800, Zhao Lei wrote:
> > What I remember from the comment is that "it's slightly offset that would
> > lead to corruption".
>
> Before this patch, I had done git blame to search why the condition was added,
> and hadn't found the exact reason.

Found it: commit bc3f116fec194f1d7329b160c266fe16b9266a1e, and it was not about data/dup but about mixed bgs with sectorsize != nodesize:

+	nodesize = btrfs_super_nodesize(disk_super);
+	leafsize = btrfs_super_leafsize(disk_super);
+	sectorsize = btrfs_super_sectorsize(disk_super);
+	stripesize = btrfs_super_stripesize(disk_super);
+
+	/*
+	 * mixed block groups end up with duplicate but slightly offset
+	 * extent buffers for the same range.  It leads to corruptions
+	 */
+	if ((features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS) &&
+	    (sectorsize != leafsize)) {
+		printk(KERN_WARNING "btrfs: unequal leaf/node/sector sizes "
+				"are not allowed for mixed block groups on %s\n",
+				sb->s_id);
+		goto fail_alloc;
+	}
+

> I will queue xfstests(btrfs/generic) at this profile with all mount option
> for multi-times, to check is something wrong with this.

Thanks. We need to cover more: the balance conversion forbids the data/dup profile, I'm not sure if scrub handles that properly, and then there are the usual suspects in the rescue tools (fsck, restore, chunk-recover).
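The condition in the quoted commit can be restated as a small standalone predicate, which makes it easy to see that it only ever fires for mixed block groups and says nothing about data/dup. This is a sketch outside the kernel: the function name and the DEMO_ constant are illustrative stand-ins, and the bit value is an assumption that real code would take from the btrfs headers.

```c
#include <stdint.h>

/* Illustrative stand-in for BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (assumed bit). */
#define DEMO_INCOMPAT_MIXED_GROUPS (1ULL << 2)

/*
 * Returns 1 if the size combination passes the quoted check, 0 if the
 * mount would be refused: mixed block groups need sectorsize == leafsize.
 */
static int mixed_bg_sizes_ok(uint64_t features, uint32_t sectorsize,
			     uint32_t leafsize)
{
	if ((features & DEMO_INCOMPAT_MIXED_GROUPS) && sectorsize != leafsize)
		return 0;
	return 1;
}
```

With the mixed flag clear, any sectorsize/leafsize pair passes, so this particular check by itself does not constrain a non-mixed filesystem using DUP for data.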
Re: [PATCH RESEND 0/3] btrfs-progs: Introduce device delete by devid
On Sat, Oct 10, 2015 at 10:30:55PM +0800, Anand Jain wrote:
> This is the btrfs-progs part of the kernel patch
>    Btrfs: Introduce device delete by devid

Thanks, now in next/delete-by-id-v3. I made some changes, so please have a look. Notably, I've dropped the BTRFS_VOL_ARG_V2_FLAGS mask; this belongs to the kernel only (unless you need it in userspace, of course).
Re: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device
On Tue, Oct 13, 2015 at 06:29:41PM +0800, Zhao Lei wrote:
> Current code don't support dup profile in single device, except it
> is in mixed mode, because following reason:
> 1: In some ssd with deduplication function, it have no effect.
> 2: For a physical device, it the entire disk broken, -d dup can
>    not help.
> 3: Half performance comparing with single profile.
> 4: We have a workaround: Create multi-partition in single device,
>    and btefs will treat them as multi device.

While the above makes sense and is true, I'm not sure that DUP was disabled for these reasons. I'm sure that I read a comment from Chris that dup for data is intentionally disabled because it would lead to corruption: the code for DUP for metadata would not work for data. I can't find the comment, but the doubt is there. So unless I find it or get otherwise convinced that it's ok, I won't merge the patch. I hope you understand that.

What I remember from the comment is that "it's slightly offset that would lead to corruption".
RE: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device
Hi, David Sterba

Thanks for the review.

> -----Original Message-----
> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Tuesday, October 13, 2015 7:29 PM
> To: Zhao Lei
> Cc: linux-btrfs@vger.kernel.org; c...@fb.com
> Subject: Re: [PATCH] btrfs-progs: mkfs: Enable -d dup for single device
>
> On Tue, Oct 13, 2015 at 06:29:41PM +0800, Zhao Lei wrote:
> > Current code don't support dup profile in single device, except it is
> > in mixed mode, because following reason:
> > 1: In some ssd with deduplication function, it have no effect.
> > 2: For a physical device, it the entire disk broken, -d dup can
> >    not help.
> > 3: Half performance comparing with single profile.
> > 4: We have a workaround: Create multi-partition in single device,
> >    and btefs will treat them as multi device.
>
> While the above makes sense is true, I'm not sure that DUP was disabled for
> these reasons. I'm sure that I read a comment from Chris that dup for data is
> intentionally disabled because this would lead to corruption, the code for DUP
> for metadata would not work for data. And I can't find the comment, but the
> doubt is there. So unless I find it or get otherwise convicend that it's ok,
> I won't merge the patch. I hope you understand that.
>
> What I remember from the comment is that "it's slightly offset that would lead
> to corruption".

Before this patch, I had run git blame to find out why the condition was added, but hadn't found the exact reason.

I will queue xfstests (btrfs/generic) on this profile with all mount options, multiple times, to check whether something is wrong with it.

Thanks
Zhaolei
[PATCH 0/3] btrfs-progs: mkfs: Fix different mixed type by argument sequence
Given a 200M vdd1 and a 1G vdd2, in current code:
  mkfs.btrfs -f /dev/vdd1 /dev/vdd2
and
  mkfs.btrfs -f /dev/vdd2 /dev/vdd1
will create different "mixed" types. See [PATCH 2/3] for details.

This patchset also includes some small fixes.

Zhao Lei (3):
  btrfs-progs: mkfs: Remove saved_optind in mkfs.btrfs
  btrfs-progs: mkfs: Fix different mixed type by argument sequence
  btrfs-progs: mkfs: Fix inaccurate mixed information

 mkfs.c | 43 ++-
 1 file changed, 22 insertions(+), 21 deletions(-)

-- 
1.8.5.1
[PATCH 1/3] btrfs-progs: mkfs: Remove saved_optind in mkfs.btrfs
No need to use complex logic for iter devs in mkfs.c, as backup optind, increase/decrease optind and reset dev_cnt. A simple for() loop is enough for above request. Signed-off-by: Zhao Lei--- mkfs.c | 25 +++-- 1 file changed, 7 insertions(+), 18 deletions(-) diff --git a/mkfs.c b/mkfs.c index 7fa7cfc..cdae94d 100644 --- a/mkfs.c +++ b/mkfs.c @@ -1354,7 +1354,6 @@ int main(int ac, char **av) u64 size_of_data = 0; u64 source_dir_size = 0; int dev_cnt = 0; - int saved_optind; char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 }; u64 features = BTRFS_MKFS_DEFAULT_FEATURES; struct mkfs_allocation allocation = { 0 }; @@ -1467,7 +1466,6 @@ int main(int ac, char **av) } sectorsize = max(sectorsize, (u32)sysconf(_SC_PAGESIZE)); - saved_optind = optind; dev_cnt = ac - optind; if (dev_cnt == 0) print_usage(1); @@ -1490,18 +1488,15 @@ int main(int ac, char **av) exit(1); } } - - while (dev_cnt-- > 0) { - file = av[optind++]; + + for (i = optind; i < optind + dev_cnt; i++) { + file = av[i]; if (is_block_device(file) == 1) if (test_dev_for_mkfs(file, force_overwrite)) exit(1); } - optind = saved_optind; - dev_cnt = ac - optind; - - file = av[optind++]; + file = av[optind]; ssd = is_ssd(file); if (is_vol_small(file) || mixed) { @@ -1557,7 +1552,7 @@ int main(int ac, char **av) btrfs_min_dev_size(nodesize)); exit(1); } - for (i = saved_optind; i < saved_optind + dev_cnt; i++) { + for (i = optind; i < optind + dev_cnt; i++) { char *path; path = av[i]; @@ -1588,8 +1583,6 @@ int main(int ac, char **av) printf("See %s for more information.\n\n", PACKAGE_URL); } - dev_cnt--; - if (!source_dir_set) { /* * open without O_EXCL so that the problem should not @@ -1720,13 +1713,10 @@ int main(int ac, char **av) if (is_block_device(file) == 1) btrfs_register_one_device(file); - if (dev_cnt == 0) - goto raid_groups; - - while (dev_cnt-- > 0) { + for (i = optind + 1; i < optind + dev_cnt; i++) { int old_mixed = mixed; - file = av[optind++]; + file = av[i]; /* * open without O_EXCL so that the problem 
should not @@ -1771,7 +1761,6 @@ int main(int ac, char **av) btrfs_register_one_device(file); } -raid_groups: if (!source_dir_set) { ret = create_raid_groups(trans, root, data_profile, metadata_profile, mixed, ); -- 1.8.5.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] btrfs-progs: mkfs: Fix different mixed type by argument sequence
Given a 200M vdd1 and a 1G vdd2, in current code:

 # mkfs.btrfs -f /dev/vdd1 /dev/vdd2
 SMALL VOLUME: forcing mixed metadata/data groups
 btrfs-progs v4.1.2
 See http://btrfs.wiki.kernel.org for more information.

 Label:              (null)
 UUID:               7aa6fc75-ce23-4033-9d47-fd046afa2992
 Node size:          4096
 Sector size:        4096
 Filesystem size:    1.20GiB
 Block group profiles:
   Data+Metadata:    single            8.00MiB
   System:           single            4.00MiB
 SSD detected:       no
 Incompat features:  mixed-bg, extref, skinny-metadata
 Number of devices:  2
 Devices:
    ID        SIZE  PATH
     1   200.29MiB  /dev/vdd1
     2     1.00GiB  /dev/vdd2
 #
 # mkfs.btrfs -f /dev/vdd2 /dev/vdd1
 btrfs-progs v4.1.2
 See http://btrfs.wiki.kernel.org for more information.

 Label:              (null)
 UUID:               ac659809-66c1-427d-934d-bd4c209c91a8
 Node size:          16384
 Sector size:        4096
 Filesystem size:    1.20GiB
 Block group profiles:
   Data:             RAID0           136.00MiB
   Metadata:         RAID1            69.38MiB
   System:           RAID1            12.00MiB
 SSD detected:       no
 Incompat features:  extref, skinny-metadata
 Number of devices:  2
 Devices:
    ID        SIZE  PATH
     1     1.00GiB  /dev/vdd2
     2   200.29MiB  /dev/vdd1

We can see that
  mkfs.btrfs -f /dev/vdd1 /dev/vdd2
and
  mkfs.btrfs -f /dev/vdd2 /dev/vdd1
end up with different "mixed" types.

Reason:
  The current code decides whether to use the mixed type by looking
  only at the first device.

Fix:
  Use the mixed type only if all devices are small.

Signed-off-by: Zhao Lei
---
 mkfs.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index cdae94d..29cab13 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1358,6 +1358,7 @@ int main(int ac, char **av)
 	u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
 	struct mkfs_allocation allocation = { 0 };
 	struct btrfs_mkfs_config mkfs_cfg;
+	int large_device_cnt = 0;
 
 	while(1) {
 		int c;
@@ -1494,17 +1495,25 @@ int main(int ac, char **av)
 		if (is_block_device(file) == 1)
 			if (test_dev_for_mkfs(file, force_overwrite))
 				exit(1);
+		ret = is_vol_small(file);
+		if (ret < 0) {
+			error("Failed to check size for '%s': %s",
+				file, strerror(-ret));
+			exit(1);
+		}
+		large_device_cnt += (!ret);
+		ret = 0;
 	}
 
-	file = av[optind];
-	ssd = is_ssd(file);
-
-	if (is_vol_small(file) || mixed) {
+	if (!large_device_cnt || mixed) {
 		if (verbose)
 			printf("SMALL VOLUME: forcing mixed metadata/data groups\n");
 		mixed = 1;
 	}
 
+	file = av[optind];
+	ssd = is_ssd(file);
+
 	/*
 	 * Set default profiles according to number of added devices.
 	 * For mixed groups defaults are single/single.
-- 
1.8.5.1
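The difference between the old and the patched decision can be sketched outside mkfs.c by working on plain device sizes. Everything here is illustrative: the function names are invented, and the 1 GiB cutoff for a "small" volume is an assumption chosen to match the example output, not a value quoted in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold for this sketch: volumes below 1 GiB count as small. */
#define DEMO_SMALL_VOLUME_SIZE (1024ULL * 1024 * 1024)

static int demo_is_small(uint64_t size_bytes)
{
	return size_bytes < DEMO_SMALL_VOLUME_SIZE;
}

/* Old behaviour: only the first listed device decides. */
static int mixed_by_first_device(const uint64_t *sizes, size_t n)
{
	return n > 0 && demo_is_small(sizes[0]);
}

/* Patched behaviour: force mixed only if no device is large. */
static int mixed_if_all_small(const uint64_t *sizes, size_t n)
{
	size_t large = 0;

	for (size_t i = 0; i < n; i++)
		large += !demo_is_small(sizes[i]);
	return large == 0;
}
```

With sizes of roughly 200 MiB and 1 GiB, the first function flips its answer when the argument order is swapped, while the second gives the same answer for both orders.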
[PATCH 3/3] btrfs-progs: mkfs: Fix inaccurate mixed information
In current code, with a "BIG VOLUME" /dev/vdd2:
 # ./mkfs.btrfs -f -M /dev/vdd2
 SMALL VOLUME: forcing mixed metadata/data groups
 ...

This patch changes the above output to:
 Using mixed metadata/data groups
and prints the "SMALL VOLUME" message only when a small volume is actually being used.

Signed-off-by: Zhao Lei
---
 mkfs.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index 29cab13..0064a78 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1505,7 +1505,10 @@ int main(int ac, char **av)
 		ret = 0;
 	}
 
-	if (!large_device_cnt || mixed) {
+	if (mixed) {
+		if (verbose)
+			printf("Using mixed metadata/data groups\n");
+	} else if (!large_device_cnt) {
 		if (verbose)
 			printf("SMALL VOLUME: forcing mixed metadata/data groups\n");
 		mixed = 1;
-- 
1.8.5.1
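The resulting message selection can be written as a tiny pure function, which makes the three outcomes easy to check. This is only a sketch of the branch structure in the hunk; mixed_message and its parameters are hypothetical names, not part of mkfs.c.

```c
#include <stddef.h>

/*
 * Decide whether mixed block groups are used and which message applies.
 * mixed_requested: mixed mode was asked for explicitly (e.g. via -M)
 * large_device_cnt: number of devices that are not small
 * Stores the final mixed setting in *mixed_out; returns the message to
 * print in verbose mode, or NULL if there is nothing to report.
 */
static const char *mixed_message(int mixed_requested, int large_device_cnt,
				 int *mixed_out)
{
	if (mixed_requested) {
		*mixed_out = 1;
		return "Using mixed metadata/data groups";
	}
	if (large_device_cnt == 0) {
		*mixed_out = 1;
		return "SMALL VOLUME: forcing mixed metadata/data groups";
	}
	*mixed_out = 0;
	return NULL;
}
```

An explicit request wins over the size test, so a large volume formatted with mixed mode no longer gets described as a small volume.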
[PATCH] Btrfs: fix file corruption and data loss after cloning inline extents
From: Filipe MananaCurrently the clone ioctl allows to clone an inline extent from one file to another that already has other (non-inlined) extents. This is a problem because btrfs is not designed to deal with files having inline and regular extents, if a file has an inline extent then it must be the only extent in the file and must start at file offset 0. Having a file with an inline extent followed by regular extents results in EIO errors when doing reads or writes against the first 4K of the file. Also, the clone ioctl allows one to lose data if the source file consists of a single inline extent, with a size of N bytes, and the destination file consists of a single inline extent with a size of M bytes, where we have M > N. In this case the clone operation removes the inline extent from the destination file and then copies the inline extent from the source file into the destination file - we lose the M - N bytes from the destination file, a read operation will get the value 0x00 for any bytes in the the range [N, M] (the destination inode's i_size remained as M, that's why we can read past N bytes). So fix this by not allowing such destructive operations to happen and return errno EOPNOTSUPP to user space. Currently the fstest btrfs/035 tests the data loss case but it totally ignores this - i.e. expects the operation to succeed and does not check the we got data loss. The following test case for fstests exercises all these cases that result in file corruption and data loss: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . 
./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" rm -f $seqres.full test_cloning_inline_extents() { local mkfs_opts=$1 local mount_opts=$2 _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 _scratch_mount $mount_opts # File bar, the source for all the following clone operations, consists # of a # single inline extent (50 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ | _filter_xfs_io # Test cloning into a file with an extent (non-inlined) where the # destination offset overlaps that extent. It should not be possible to # clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo data after clone operation:" # All bytes should have the value 0xaa (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io # Test cloning the inline extent against a file which has a hole in its # first 4K followed by a non-inlined extent. It should not be possible # as well to clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo2 data after clone operation:" # All bytes should have the value 0x00 (clone operation failed and did # not modify our file). 
od -t x1 $SCRATCH_MNT/foo2 $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io # Test cloning the inline extent against a file which has a size of zero # but has a prealloc extent. It should not be possible as well to clone # the inline extent from file bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "First 50 bytes of foo3 after clone operation:" # Should not be able to read any bytes, file has 0 bytes i_size (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo3 $XFS_IO_PROG -c
[PATCH 1/2] fstests: btrfs test for cloning of inline extents
From: Filipe MananaTest several cases of cloning inline extents that used to lead to file corruption or data loss. These file corruption and data loss cases are fixed by the linux kernel patch titled: "Btrfs: fix file corruption and data loss after cloning inline extents" Signed-off-by: Filipe Manana --- tests/btrfs/110 | 199 tests/btrfs/110.out | 257 tests/btrfs/group | 1 + 3 files changed, 457 insertions(+) create mode 100755 tests/btrfs/110 create mode 100644 tests/btrfs/110.out diff --git a/tests/btrfs/110 b/tests/btrfs/110 new file mode 100755 index 000..327c8c0 --- /dev/null +++ b/tests/btrfs/110 @@ -0,0 +1,199 @@ +#! /bin/bash +# FSQA Test No. 110 +# +# Test several cases of cloning inline extents that used to lead to file +# corruption or data loss. +# +#--- +# +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved. +# Author: Filipe Manana +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. 
./common/filter + +# real QA test starts here +_need_to_be_root +_supported_fs btrfs +_supported_os Linux +_require_scratch +_require_cloner +_require_btrfs_fs_feature "no_holes" +_require_btrfs_mkfs_feature "no-holes" + +rm -f $seqres.full + +test_cloning_inline_extents() +{ + local mkfs_opts=$1 + local mount_opts=$2 + + _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 + _scratch_mount $mount_opts + + # File bar, the source for all the following clone operations, consists +# of a # single inline extent (50 bytes). + $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ + | _filter_xfs_io + + # Test cloning into a file with an extent (non-inlined) where the + # destination offset overlaps that extent. It should not be possible to + # clone the inline extent from file bar into this file. + $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ + | _filter_xfs_io + $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo + + # Doing IO against any range in the first 4K of the file should work. + # Due to a past clone ioctl bug which allowed cloning the inline extent, + # these operations resulted in EIO errors. + echo "File foo data after clone operation:" + # All bytes should have the value 0xaa (clone operation failed and did + # not modify our file). + od -t x1 $SCRATCH_MNT/foo + $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io + + # Test cloning the inline extent against a file which has a hole in its + # first 4K followed by a non-inlined extent. It should not be possible + # as well to clone the inline extent from file bar into this file. + $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ + | _filter_xfs_io + $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 + + # Doing IO against any range in the first 4K of the file should work. + # Due to a past clone ioctl bug which allowed cloning the inline extent, + # these operations resulted in EIO errors. 
+ echo "File foo2 data after clone operation:" + # All bytes should have the value 0x00 (clone operation failed and did + # not modify our file). + od -t x1 $SCRATCH_MNT/foo2 + $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io + + # Test cloning the inline extent against a file which has a size of zero + # but has a prealloc extent. It should not be possible as well to clone + # the inline extent from file bar into this file. + $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io + $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 + + # Doing IO against any range in the
[PATCH 2/2] fstests: update btrfs/035 to check for data loss
From: Filipe Manana

The test currently verifies that cloning one file with an inline extent with a size of 10 bytes into a file with an inline extent that has a size of 20 bytes succeeds. But this results in data loss, because the btrfs clone operation drops the 20 bytes inline extent from the destination inode and then copies the 10 bytes inline extent from the source file into the destination file, resulting in data loss of the last 10 bytes of data that the destination file had.

Fixing btrfs to correctly operate for this case (not resulting in data loss) is actually a lot of work and brings a lot of complexity, especially considering that any of the inline extents can be compressed. For the moment there's a fix to make the clone operation return the errno EOPNOTSUPP and not touch any of the inodes. This is the same approach we use for other cases involving operations against inline extents, so this just adds one more case that should have never been allowed. Cloning inline extents is a rare operation and pointless, since it involves copying them and not doing any actual deduplication or saving space.

The btrfs patch for the linux kernel that prevents this data loss, and fixes some file corruption cases, is titled:

  "Btrfs: fix file corruption and data loss after cloning inline extents"

Signed-off-by: Filipe Manana
---
 tests/btrfs/035     | 14 ++
 tests/btrfs/035.out |  9 +
 2 files changed, 23 insertions(+)

diff --git a/tests/btrfs/035 b/tests/btrfs/035
index 35ddfce..0f8a70d 100755
--- a/tests/btrfs/035
+++ b/tests/btrfs/035
@@ -67,9 +67,23 @@ echo "attempting ioctl (src.clone1 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
 
+# The clone operation should have failed. If it did not it meant we had data
+# loss, because file "src.clone1" has an inline extent which is 10 bytes long
+# while file "src" has an inline extent which is 20 bytes long. The clone
+# operation would remove the inline extent of "src" and then copy the inline
+# extent from "src.clone1" into "src", which means we would lose the last 10
+# bytes of data from "src" (on read we would get 0x00 as the value for any
+# of those 10 bytes, because the file's size remains as 20 bytes).
+echo "File src data after attempt to clone from src.clone1 into src:"
+od -t x1 $SCRATCH_MNT/src
+
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
 echo "attempting ioctl (src.clone2 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
 
+# The clone operation should have succeeded.
+echo "File src data after attempt to clone from src.clone2 into src:"
+od -t x1 $SCRATCH_MNT/src
+
 status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
index f86cadf..3ea7d77 100644
--- a/tests/btrfs/035.out
+++ b/tests/btrfs/035.out
@@ -1,3 +1,12 @@
 QA output created by 035
 attempting ioctl (src.clone1 src)
+clone failed: Operation not supported
+File src data after attempt to clone from src.clone1 into src:
+0000000 62 62 62 62 62 62 62 62 62 62 63 63 63 63 63 63
+0000020 63 63 63 63
+0000024
 attempting ioctl (src.clone2 src)
+File src data after attempt to clone from src.clone2 into src:
+0000000 62 62 62 62 62 62 62 62 62 62 63 63 63 63 63 63
+0000020 63 63 63 63
+0000024
-- 
2.1.3
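The 10-byte versus 20-byte data loss described in the commit message can be reproduced with a toy in-memory model. This is not btrfs code: the struct, the function names and the guard condition are simplifications invented for illustration, and the only thing borrowed from the patch description is the behaviour (extent replaced, i_size left alone, refusal with EOPNOTSUPP as the interim fix).

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define INLINE_MAX 64	/* arbitrary cap for this toy model */

struct toy_inode {
	uint8_t inline_data[INLINE_MAX];	/* bytes held by the inline extent */
	size_t extent_len;			/* length of the inline extent */
	size_t i_size;				/* file size seen by readers */
};

/* Old behaviour: replace the extent but leave i_size untouched. */
static void toy_buggy_clone(const struct toy_inode *src, struct toy_inode *dst)
{
	memcpy(dst->inline_data, src->inline_data, src->extent_len);
	dst->extent_len = src->extent_len;
}

/* Interim fix, simplified: refuse when the clone would drop data. */
static int toy_fixed_clone(const struct toy_inode *src, struct toy_inode *dst)
{
	if (src->extent_len < dst->i_size)
		return -EOPNOTSUPP;
	memcpy(dst->inline_data, src->inline_data, src->extent_len);
	dst->extent_len = src->extent_len;
	dst->i_size = src->extent_len;
	return 0;
}

/* Bytes past the extent but inside i_size read back as zero. */
static uint8_t toy_read(const struct toy_inode *ino, size_t off)
{
	return off < ino->extent_len ? ino->inline_data[off] : 0;
}
```

Cloning a 10-byte source into a 20-byte destination with toy_buggy_clone leaves i_size at 20 while offsets 10 to 19 read as 0x00, which is exactly the loss the updated test is meant to catch.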
Kernel error in extent-tree, forced readonly
One of my backup disks hit a btrfs bug yesterday, leaving me with a filesystem forced read-only (see kernel trace below). The error is reproducible and happens on first access after mounting.

This disk receives snapshots incrementally ("ssh btrfs send -p | btrfs receive") from several hosts on a daily schedule, and deletes old ones.

- kernel-4.2.3 (no quota support, no acl support)
- btrfs-progs-4.2.2
- mount: noatime,autodefrag,compress=zlib,subvolid=0

My short analysis reveals that the backref 1044608417792 points to 3231, which is most probably the parent subvolume used by a "btrfs receive" operation, and which was deleted directly after the receive succeeded (using --commit-after).

I suspect the error could have been triggered by an unmount operation run directly after several (~10) receive and (~10) delete operations. I enabled this unmounting feature in my cron job two days ago; maybe this wasn't a good idea after all? Before that, the filesystem was always mounted, and I have been doing backups like this for about one year without any problems.

Can someone please give me some hints on how I can get rid of the broken backrefs? How do I find out which file is causing the trouble, and which subvolume I need to delete? And how can I do this on a read-only fs?

If you are interested, I could leave this disk untouched for some days and help with debugging.

# btrfs fi df /mnt/btr_backup Data, single: total=487.29GiB, used=481.65GiB System, DUP: total=32.00MiB, used=40.00KiB Metadata, DUP: total=18.50GiB, used=7.33GiB GlobalReserve, single: total=512.00MiB, used=0.00B Kernel trace: [ cut here ] WARNING: CPU: 0 PID: 17264 at fs/btrfs/extent-tree.c:6255 __btrfs_free_extent.isra.72+0xab0/0xce0() Modules linked in: isofs sr_mod cdrom usblp f2fs usb_storage bridge stp llc tun cpufreq_ondemand vfat fat mmc_block snd_hda_codec_hdmi nvidia(PO) x86_pkg_temp_thermal iwldvm btusb kvm_intel btrtl btbcm btintel dell_laptop snd_hda_codec_idt bluetooth kvm dcdbas snd_hda_codec_generic psmouse dell_smm_hwmon iwlwifi sdhci_pci sdhci mmc_core snd_hda_intel snd_hda_codec thermal snd_hwdep snd_hda_core snd_pcm parport_pc snd_timer parport xhci_pci snd xhci_hcd acpi_cpufreq soundcore dell_rbtn processor battery dell_smo8800 ac CPU: 0 PID: 17264 Comm: btrfs-cleaner Tainted: P O 4.2.3-gentoo #1 Hardware name: Dell Inc. Latitude E6430/0H3MT5, BIOS A16 08/19/2014 81778e68 81618a50 810691e7 88021fad46c0 00f337835000 fffe 1000 880138044000 811dcfc0 Call Trace: [] ? dump_stack+0x47/0x67 [] ? warn_slowpath_common+0x77/0xb0 [] ? __btrfs_free_extent.isra.72+0xab0/0xce0 [] ? __btrfs_run_delayed_refs+0x7a0/0xf80 [] ? __percpu_counter_add+0x52/0x70 [] ? btrfs_free_tree_block+0xe0/0x1e0 [] ? btrfs_run_delayed_refs.part.78+0x6a/0x250 [] ? walk_up_tree+0xe0/0x1d0 [] ? btrfs_should_end_transaction+0x3e/0x60 [] ? btrfs_drop_snapshot+0x41c/0x810 [] ? btrfs_clean_one_deleted_snapshot+0x9e/0xd0 [] ? cleaner_kthread+0x141/0x1d0 [] ? btrfs_destroy_pinned_extent+0xa0/0xa0 [] ? kthread+0xbc/0xe0 [] ? kthread_create_on_node+0x170/0x170 [] ? ret_from_fork+0x3f/0x70 [] ? 
kthread_create_on_node+0x170/0x170 ---[ end trace 937617c32053608b ]--- BTRFS info (device sdc1): leaf 1044023648256 total ptrs 55 free space 526 \x09item 0 key (1044608286720 168 4096) itemoff 3944 itemsize 51 \x09\x09extent refs 1 gen 10950 flags 2 \x09\x09tree block key (18446744073709551606 128 190953070592) level 0 \x09\x09tree block backref root 7 \x09item 1 key (1044608290816 168 4096) itemoff 3893 itemsize 51 \x09\x09extent refs 1 gen 11983 flags 258 \x09\x09tree block key (3259 12 3215) level 0 \x09\x09tree block backref root 3242 \x09item 2 key (1044608294912 168 4096) itemoff 3842 itemsize 51 \x09\x09extent refs 1 gen 10950 flags 258 \x09\x09tree block key (82282 108 0) level 0 \x09\x09shared block backref parent 1045418885120 \x09item 3 key (1044608299008 168 4096) itemoff 3782 itemsize 60 \x09\x09extent refs 2 gen 11983 flags 258 \x09\x09tree block key (770034 12 3265) level 0 \x09\x09tree block backref root 3242 \x09\x09tree block backref root 3231 \x09item 4 key (1044608303104 168 4096) itemoff 3731 itemsize 51 \x09\x09extent refs 1 gen 10950 flags 258 \x09\x09tree block key (82306 1 0) level 0 \x09\x09shared block backref parent 1045418885120 \x09item 5 key (1044608311296 168 4096) itemoff 3680 itemsize 51 \x09\x09extent refs 1 gen 10950 flags 258 \x09\x09tree block key (446163 1 0) level 0 \x09\x09shared block backref parent 1059954139136 \x09item 6 key (1044608315392 168 4096) itemoff 3629 itemsize 51 \x09\x09extent refs 1 gen 10950 flags 2 \x09\x09tree block key (18446744073709551606 128 190953070592) level 0 \x09\x09tree block backref root 7 \x09item 7 key (1044608319488 168 4096) itemoff 3578 itemsize 51 \x09\x09extent refs 1 gen 11983 flags 258 \x09\x09tree block key (3885 108 0) level 0 \x09\x09tree block
[PATCH 3/7] btrfs-progs: add helpers for parsing 32bit ranges
Signed-off-by: David Sterba
---
 cmds-balance.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/cmds-balance.c b/cmds-balance.c
index 62bee3cc78b6..72714b23b45c 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -159,6 +159,37 @@ static int parse_range_strict(const char *range, u64 *start, u64 *end)
 	return 1;
 }
 
+/*
+ * Convert 64bit range to 32bit with boundary checks
+ */
+static int range_to_u32(u64 start, u64 end, u32 *start32, u32 *end32)
+{
+	if (start > (u32)-1)
+		return 1;
+
+	if (end != (u64)-1 && end > (u32)-1)
+		return 1;
+
+	*start32 = (u32)start;
+	*end32 = (u32)end;
+
+	return 0;
+}
+
+static int parse_range_u32(const char *range, u32 *start, u32 *end)
+{
+	u64 tmp_start;
+	u64 tmp_end;
+
+	if (parse_range(range, &tmp_start, &tmp_end))
+		return 1;
+
+	if (range_to_u32(tmp_start, tmp_end, start, end))
+		return 1;
+
+	return 0;
+}
+
 static int parse_filters(char *filters, struct btrfs_balance_args *args)
 {
 	char *this_char;
-- 
2.6.1
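To see what the boundary checks accept and reject, range_to_u32 can be lifted out of the patch and exercised on its own. The function body below is the one from the hunk; the two typedefs are local stand-ins for the btrfs-progs u64/u32 types so that the snippet compiles in isolation.

```c
#include <stdint.h>

/* Local stand-ins for the btrfs-progs integer typedefs. */
typedef uint64_t u64;
typedef uint32_t u32;

/* Narrow a 64-bit range to 32 bits; returns 1 if a bound does not fit. */
static int range_to_u32(u64 start, u64 end, u32 *start32, u32 *end32)
{
	if (start > (u32)-1)
		return 1;

	/* (u64)-1 means "no upper bound" and is allowed through */
	if (end != (u64)-1 && end > (u32)-1)
		return 1;

	*start32 = (u32)start;
	*end32 = (u32)end;	/* (u64)-1 truncates to (u32)-1 */

	return 0;
}
```

One consequence worth keeping in mind: the open-ended marker (u64)-1 becomes (u32)-1 after the cast, so an explicit upper bound of 4294967295 ends up with the same 32-bit value as an open-ended range.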
[PATCH 5/7] btrfs-progs: extend balance args to take min/max limit filter
Add the overlapping limit and [limit_min, limit_max] members to the
balance args. The min/max values are interpreted iff the corresponding
flag BTRFS_BALANCE_ARGS_LIMITS is set. Note that the values are only
32bit, but this should be enough for the foreseeable future.

Signed-off-by: David Sterba
---
 Documentation/btrfs-balance.asciidoc |  4
 cmds-balance.c                       |  4
 ioctl.h                              | 13 -
 volumes.h                            |  1 +
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-balance.asciidoc b/Documentation/btrfs-balance.asciidoc
index 6d2fd0c36086..61517461ca90 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -109,6 +109,10 @@ parameters.
 Process only given number of chunks, after all filters apply. This can be used
 to specifically target a chunk in connection with other filters (drange, vrange)
 or just simply limit the amount of work done by a single balance run.
++
+The argument may be a single value or a range. The single value *N* means *at
+most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
+only the single value format.
 
 *soft*::
 Takes no parameters. Only has meaning when converting between profiles.
diff --git a/cmds-balance.c b/cmds-balance.c
index dba6613b1540..1eadba417abc 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -343,6 +343,10 @@ static void dump_balance_args(struct btrfs_balance_args *args)
 		       (unsigned long long)args->vend);
 	if (args->flags & BTRFS_BALANCE_ARGS_LIMIT)
 		printf(", limit=%llu", (unsigned long long)args->limit);
+	if (args->flags & BTRFS_BALANCE_ARGS_LIMITS) {
+		printf(", limit=");
+		print_range_u32(args->limit_min, args->limit_max);
+	}
 
 	printf("\n");
 }
diff --git a/ioctl.h b/ioctl.h
index dff015a52b43..ff7a1a0610a1 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -227,7 +227,18 @@ struct btrfs_balance_args {
 
 	__u64 flags;
 
-	__u64 limit;		/* limit number of processed chunks */
+	/*
+	 * BTRFS_BALANCE_ARGS_LIMIT with value 'limit'
+	 * BTRFS_BALANCE_ARGS_LIMITS - the extend version can use minimum and
+	 * maximum
+	 */
+	union {
+		__u64 limit;		/* limit number of processed chunks */
+		struct {
+			__u32 limit_min;
+			__u32 limit_max;
+		};
+	};
 	__u64 unused[7];
 } __attribute__ ((__packed__));
 
diff --git a/volumes.h b/volumes.h
index 4ecb99314a0c..cb6f5752cdda 100644
--- a/volumes.h
+++ b/volumes.h
@@ -136,6 +136,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DRANGE      (1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE      (1ULL << 4)
 #define BTRFS_BALANCE_ARGS_LIMIT       (1ULL << 5)
+#define BTRFS_BALANCE_ARGS_LIMITS      (1ULL << 6)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
-- 
2.6.1
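The overlay described in this patch can be checked in isolation. The sketch below is an illustration under assumptions, not the btrfs-progs header: `struct balance_args_sketch` is a hypothetical, cut-down stand-in for `struct btrfs_balance_args`, and the two setter helpers are invented for the example. Only the flag bit positions (5 and 6) are taken from the diff. It shows that the union keeps the structure size unchanged and that the flags decide which view of the shared 8 bytes is valid.

```c
#include <stddef.h>
#include <stdint.h>

#define ARGS_LIMIT  (1ULL << 5)	/* the single 64-bit 'limit' is valid */
#define ARGS_LIMITS (1ULL << 6)	/* 'limit_min'/'limit_max' are valid */

/* Hypothetical reduced layout: only the fields relevant to the overlay. */
struct balance_args_sketch {
	uint64_t flags;
	union {
		uint64_t limit;
		struct {
			uint32_t limit_min;
			uint32_t limit_max;
		};
	};
} __attribute__((__packed__));

/* Replacing the u64 with the union must not grow the structure. */
_Static_assert(sizeof(struct balance_args_sketch) == 16,
	       "union must occupy exactly the space of the old u64");

/* Select the single-value form; the two flags are mutually exclusive. */
static void set_limit_single(struct balance_args_sketch *args, uint64_t limit)
{
	args->limit = limit;
	args->flags &= ~ARGS_LIMITS;
	args->flags |= ARGS_LIMIT;
}

/* Select the min/max form, which reuses the same 8 bytes. */
static void set_limit_range(struct balance_args_sketch *args,
			    uint32_t min, uint32_t max)
{
	args->limit_min = min;
	args->limit_max = max;
	args->flags &= ~ARGS_LIMIT;
	args->flags |= ARGS_LIMITS;
}
```

Because both views share storage, a consumer that reads `limit` while `ARGS_LIMITS` is set would see the two 32-bit values fused into one number, which is why the flag has to be checked first.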
[PATCH 4/7] btrfs-progs: add helpers to print ranges
Signed-off-by: David Sterba
---
 cmds-balance.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/cmds-balance.c b/cmds-balance.c
index 72714b23b45c..dba6613b1540 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -190,6 +190,25 @@ static int parse_range_u32(const char *range, u32 *start, u32 *end)
 	return 0;
 }
 
+__attribute__ ((unused))
+static void print_range(u64 start, u64 end)
+{
+	if (start)
+		printf("%llu", (unsigned long long)start);
+	printf("..");
+	if (end != (u64)-1)
+		printf("%llu", (unsigned long long)end);
+}
+
+static void print_range_u32(u32 start, u32 end)
+{
+	if (start)
+		printf("%u", start);
+	printf("..");
+	if (end != (u32)-1)
+		printf("%u", end);
+}
+
 static int parse_filters(char *filters, struct btrfs_balance_args *args)
 {
 	char *this_char;
-- 
2.6.1
[PATCH 7/7] btrfs-progs: balance: add stripes filter
From: Gabríel Arthúr Pétursson

Add new balance filter 'stripes=' to process only chunks that are spread
across given number of chunks. The range must be specified with both
values, but they can be the same to denote exact number of stripes.

Signed-off-by: Gabríel Arthúr Pétursson
[ reworked a bit to use the range helpers, dropped the single value for
stripes ]
Signed-off-by: David Sterba
---
 Documentation/btrfs-balance.asciidoc |  4
 cmds-balance.c                       | 17 +
 ioctl.h                              |  4 +++-
 volumes.h                            |  1 +
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-balance.asciidoc b/Documentation/btrfs-balance.asciidoc
index 61517461ca90..6bb9fffdf188 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -114,6 +114,10 @@ The argument may be a single value or a range. The single value *N* means *at
 most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
 only the single value format.
 
+*stripes*::
+Balances only block groups which have the given number of stripes. The
+parameter is a range specified as .
+
 *soft*::
 Takes no parameters. Only has meaning when converting between profiles.
 When doing convert from one profile to another and soft mode is on,
diff --git a/cmds-balance.c b/cmds-balance.c
index 7aaf33d03630..a958e584eeb5 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -319,6 +319,19 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 				args->flags &= ~BTRFS_BALANCE_ARGS_LIMITS;
 				args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
 			}
+			args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
+		} else if (!strcmp(this_char, "stripes")) {
+			if (!value || !*value) {
+				fprintf(stderr,
+					"the stripes filter requires an argument\n");
+				return 1;
+			}
+			if (parse_range_u32(value, &args->stripes_min,
+					&args->stripes_max)) {
+				fprintf(stderr, "Invalid stripes argument\n");
+				return 1;
+			}
+			args->flags |= BTRFS_BALANCE_ARGS_STRIPES;
 		} else {
 			fprintf(stderr, "Unrecognized balance option '%s'\n",
 				this_char);
@@ -359,6 +372,10 @@ static void dump_balance_args(struct btrfs_balance_args *args)
 		printf(", limit=");
 		print_range_u32(args->limit_min, args->limit_max);
 	}
+	if (args->flags & BTRFS_BALANCE_ARGS_STRIPES) {
+		printf(", stripes=");
+		print_range_u32(args->stripes_min, args->stripes_max);
+	}
 
 	printf("\n");
 }
diff --git a/ioctl.h b/ioctl.h
index ff7a1a0610a1..50f9e1485a30 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -239,7 +239,9 @@ struct btrfs_balance_args {
 			__u32 limit_max;
 		};
 	};
-	__u64 unused[7];
+	__u32 stripes_min;
+	__u32 stripes_max;
+	__u64 unused[6];
 } __attribute__ ((__packed__));
 
 /* report balance progress to userspace */
diff --git a/volumes.h b/volumes.h
index cb6f5752cdda..150ea7f31659 100644
--- a/volumes.h
+++ b/volumes.h
@@ -137,6 +137,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_VRANGE      (1ULL << 4)
 #define BTRFS_BALANCE_ARGS_LIMIT       (1ULL << 5)
 #define BTRFS_BALANCE_ARGS_LIMITS      (1ULL << 6)
+#define BTRFS_BALANCE_ARGS_STRIPES     (1ULL << 7)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
-- 
2.6.1
[PATCH 6/7] btrfs-progs: balance: enhance the limit filter with range
We can do more with the balance filter. Enhance it so we can also specify
the minimum number of block groups to process.

The 'limit' filter now accepts a range (a..b, can be partial) and needs
kernel support. The 'limit=value' filter is equivalent to 'limit=..value'
but works on older kernels as well. The min/max values are 32bit, unlike
the single-value limit which is 64bit.

Signed-off-by: David Sterba
---
 cmds-balance.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 1eadba417abc..7aaf33d03630 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -301,12 +301,24 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 					"the limit filter requires an argument\n");
 				return 1;
 			}
-			if (parse_u64(value, &args->limit)) {
-				fprintf(stderr, "Invalid limit argument: %s\n",
-					value);
-				return 1;
+			/*
+			 * Try to parse the range first. A single value is not
+			 * a valid range
+			 */
+			if (parse_range_u32(value, &args->limit_min,
+					&args->limit_max) == 0) {
+				args->flags &= ~BTRFS_BALANCE_ARGS_LIMIT;
+				args->flags |= BTRFS_BALANCE_ARGS_LIMITS;
+			} else {
+				if (parse_u64(value, &args->limit)) {
+					fprintf(stderr,
+						"Invalid limit argument: %s\n",
+						value);
+					return 1;
+				}
+				args->flags &= ~BTRFS_BALANCE_ARGS_LIMITS;
+				args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
 			}
-			args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
 		} else {
 			fprintf(stderr, "Unrecognized balance option '%s'\n",
 				this_char);
-- 
2.6.1
[PATCH 2/7] btrfs-progs: extend parse_range API to accept a relaxed range
In some cases we want to accept a range of type [a..a]. Add a new
function to do the 'a < b' check for the caller and use it.

Signed-off-by: David Sterba
---
 cmds-balance.c | 30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 798b533aa7d6..62bee3cc78b6 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -126,9 +126,10 @@ static int parse_range(const char *range, u64 *start, u64 *end)
 			return 1;
 	}
 
-	if (*start >= *end) {
-		fprintf(stderr, "Range %llu..%llu doesn't make "
-			"sense\n", (unsigned long long)*start,
+	if (*start > *end) {
+		fprintf(stderr,
+			"ERROR: range %llu..%llu doesn't make sense\n",
+			(unsigned long long)*start,
 			(unsigned long long)*end);
 		return 1;
 	}
@@ -139,6 +140,25 @@ static int parse_range(const char *range, u64 *start, u64 *end)
 	return 1;
 }
 
+/*
+ * Parse range and check if start < end
+ */
+static int parse_range_strict(const char *range, u64 *start, u64 *end)
+{
+	if (parse_range(range, start, end) == 0) {
+		if (*start >= *end) {
+			fprintf(stderr,
+				"ERROR: range %llu..%llu not allowed\n",
+				(unsigned long long)*start,
+				(unsigned long long)*end);
+			return 1;
+		}
+		return 0;
+	}
+
+	return 1;
+}
+
 static int parse_filters(char *filters, struct btrfs_balance_args *args)
 {
 	char *this_char;
@@ -196,7 +216,7 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 					"an argument\n");
 				return 1;
 			}
-			if (parse_range(value, &args->pstart, &args->pend)) {
+			if (parse_range_strict(value, &args->pstart, &args->pend)) {
 				fprintf(stderr, "Invalid drange argument\n");
 				return 1;
 			}
@@ -207,7 +227,7 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 					"an argument\n");
 				return 1;
 			}
-			if (parse_range(value, &args->vstart, &args->vend)) {
+			if (parse_range_strict(value, &args->vstart, &args->vend)) {
 				fprintf(stderr, "Invalid vrange argument\n");
 				return 1;
 			}
-- 
2.6.1
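The difference between the relaxed and the strict check is easiest to see on concrete inputs. The following is a minimal standalone sketch, not the btrfs-progs implementation: the function names are hypothetical, `uint64_t` stands in for `u64`, and unlike the real `parse_range` it does not write into the input string or print error messages. It accepts `a..b`, `a..` and `..b`, rejects a bare value, and leaves the `a < b` requirement to a separate strict wrapper.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse the n characters at s as a decimal u64; returns 0 on success. */
static int parse_u64_n(const char *s, size_t n, uint64_t *out)
{
	char buf[32];
	char *end;
	unsigned long long v;

	if (n == 0 || n >= sizeof(buf))
		return 1;
	memcpy(buf, s, n);
	buf[n] = '\0';

	/* strtoull would accept whitespace and signs; require a digit. */
	if (buf[0] < '0' || buf[0] > '9')
		return 1;

	errno = 0;
	v = strtoull(buf, &end, 10);
	if (errno || *end != '\0')
		return 1;

	*out = v;
	return 0;
}

/* Relaxed form: "a..b" with a <= b, "a.." or "..b". Returns 0 if valid. */
static int parse_range_sketch(const char *range, uint64_t *start, uint64_t *end)
{
	const char *dots = strstr(range, "..");
	const char *rest;

	if (!dots)
		return 1;		/* bare value: not a range */
	rest = dots + 2;

	if (dots == range && *rest == '\0')
		return 1;		/* ".." alone: both ends missing */

	if (dots == range)
		*start = 0;		/* "..b" starts at 0 */
	else if (parse_u64_n(range, (size_t)(dots - range), start))
		return 1;

	if (*rest == '\0')
		*end = UINT64_MAX;	/* "a.." has no upper bound */
	else if (parse_u64_n(rest, strlen(rest), end))
		return 1;

	return *start > *end ? 1 : 0;
}

/* Strict form: additionally rejects a..a. */
static int parse_range_strict_sketch(const char *range, uint64_t *start,
				     uint64_t *end)
{
	if (parse_range_sketch(range, start, end))
		return 1;
	return *start >= *end ? 1 : 0;
}
```

With this split, `3..3` passes the relaxed parser but fails the strict one, which matches the intent stated in the commit message: callers such as drange and vrange keep the strict behaviour, while other callers can accept an exact value written as a range.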
[PATCH 1/7] btrfs-progs: cleanup and comment parse_range
Simplify a check and unindent some code.

Signed-off-by: David Sterba
---
 cmds-balance.c | 63 ++
 1 file changed, 37 insertions(+), 26 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 9af218bbfa51..798b533aa7d6 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -88,43 +88,54 @@ static int parse_u64(const char *str, u64 *result)
 	return 0;
 }
 
+/*
+ * Parse range that's missing some part that can be implicit:
+ * a..b	- exact range, a can be equal to b
+ * a..	- implicitly unbounded maximum (end == (u64)-1)
+ * ..b	- implicitly starting at 0
+ * a	- invalid; unclear semantics, use parse_u64 instead
+ *
+ * Returned values are u64, value validation and interpretation should be done
+ * by the caller.
+ */
 static int parse_range(const char *range, u64 *start, u64 *end)
 {
 	char *dots;
+	const char *rest;
+	int skipped = 0;
 
 	dots = strstr(range, "..");
-	if (dots) {
-		const char *rest = dots + 2;
-		int skipped = 0;
-
-		*dots = 0;
+	if (!dots)
+		return 1;
 
-		if (!*rest) {
-			*end = (u64)-1;
-			skipped++;
-		} else {
-			if (parse_u64(rest, end))
-				return 1;
-		}
-		if (dots == range) {
-			*start = 0;
-			skipped++;
-		} else {
-			if (parse_u64(range, start))
-				return 1;
-		}
+	rest = dots + 2;
+	*dots = 0;
 
-		if (*start >= *end) {
-			fprintf(stderr, "Range %llu..%llu doesn't make "
-				"sense\n", (unsigned long long)*start,
-				(unsigned long long)*end);
+	if (!*rest) {
+		*end = (u64)-1;
+		skipped++;
+	} else {
+		if (parse_u64(rest, end))
 			return 1;
-		}
+	}
+	if (dots == range) {
+		*start = 0;
+		skipped++;
+	} else {
+		if (parse_u64(range, start))
+			return 1;
+	}
 
-		if (skipped <= 1)
-			return 0;
+	if (*start >= *end) {
+		fprintf(stderr, "Range %llu..%llu doesn't make "
+			"sense\n", (unsigned long long)*start,
+			(unsigned long long)*end);
+		return 1;
 	}
 
+	if (skipped <= 1)
+		return 0;
+
 	return 1;
 }
 
-- 
2.6.1
[PATCH 0/7] Btrfs-progs, balance filters: stripes, limits
Here's the userspace part that enables the use of stripes and enhanced
limit filters.

David Sterba (6):
  btrfs-progs: cleanup and comment parse_range
  btrfs-progs: extend parse_range API to accept a relaxed range
  btrfs-progs: add helpers for parsing 32bit ranges
  btrfs-progs: add helpers to print ranges
  btrfs-progs: extend balance args to take min/max limit filter
  btrfs-progs: balance: enhance the limit filter with range

Gabríel Arthúr Pétursson (1):
  btrfs-progs: balance: add stripes filter

 Documentation/btrfs-balance.asciidoc |   8 ++
 cmds-balance.c                       | 172 +--
 ioctl.h                              |  17 +++-
 volumes.h                            |   2 +
 4 files changed, 168 insertions(+), 31 deletions(-)

-- 
2.6.1
[PATCH] Btrfs: fix double range unlock of hole region when reading page
From: Filipe Manana

If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end up
unlocking the hole's range and then later our caller unlocks it again,
which might have already been locked by some other task once the first
unlock happened.

Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.

Fix this by leaving the unlock exclusively to the caller.

Signed-off-by: Filipe Manana
---
 fs/btrfs/extent_io.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ecb1204..6e6df34 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3070,8 +3070,12 @@ static int __do_readpage(struct extent_io_tree *tree,
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
-			unlock_extent_cached(tree, cur, cur + iosize - 1,
-					     &cached, GFP_NOFS);
+			if (parent_locked)
+				free_extent_state(cached);
+			else
+				unlock_extent_cached(tree, cur,
+						     cur + iosize - 1,
+						     &cached, GFP_NOFS);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
-- 
2.1.3
Re: [PATCH 00/11] btrfs-progs: Use btrfs_open_dir to avoid show error of ioctl or tree search
On Mon, Oct 12, 2015 at 09:22:53PM +0800, Zhao Lei wrote:
> Use btrfs_open_dir() instead of open_file_or_dir(), to show error before
> real action(in ioctl or tree search), to make the error message exact
> and unified.
>
> Zhao Lei (11):
>   btrfs-progs: subvolume: use btrfs_open_dir for btrfs subvolume command
>   btrfs-progs: filesystem: use btrfs_open_dir for btrfs filesystem
>     command
>   btrfs-progs: balance: use btrfs_open_dir for btrfs balance command
>   btrfs-progs: inspect: Bypass unnecessary clean function in open_error
>   btrfs-progs: inspect: set return value of error case
>   btrfs-progs: inspect: use btrfs_open_dir for btrfs inspect command
>   btrfs-progs: qgroup: use btrfs_open_dir for btrfs qgroup command
>   btrfs-progs: quota: use btrfs_open_dir for btrfs quota command
>   btrfs-progs: use btrfs_open_dir in open_path_or_dev_mnt
>   btrfs-progs: replace: use btrfs_open_dir for btrfs replace command
>   btrfs-progs: fragments: use btrfs_open_dir for btrfs-fragments command

All merged, thanks! I appreciate you took the time to test all the
changes and the patch separation made review easy.
[PATCH] btrfs: compress: put variables defined per compress type in struct to make cache friendly
Below variables are defined per compress type.
 - struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
 - spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
 - int comp_num_workspace[BTRFS_COMPRESS_TYPES]
 - atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
 - wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]

BTW, while accessing one compress type of these variables, the next or
before address is other compress types of it.
So this patch puts these variables in a struct to make cache friendly.

Signed-off-by: Byongho Lee
---
 fs/btrfs/compression.c | 46 --
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ce62324c78e7..85a80931ae3f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -744,11 +744,13 @@ out:
 	return ret;
 }
 
-static struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES];
-static spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES];
-static int comp_num_workspace[BTRFS_COMPRESS_TYPES];
-static atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES];
-static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES];
+static struct {
+	struct list_head idle_workspace;
+	spinlock_t workspace_lock;
+	int num_workspace;
+	atomic_t alloc_workspace;
+	wait_queue_head_t workspace_wait;
+} comp[BTRFS_COMPRESS_TYPES];
 
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
 	&btrfs_zlib_compress,
@@ -760,10 +762,10 @@ void __init btrfs_init_compress(void)
 	int i;
 
 	for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
-		INIT_LIST_HEAD(&comp_idle_workspace[i]);
-		spin_lock_init(&comp_workspace_lock[i]);
-		atomic_set(&comp_alloc_workspace[i], 0);
-		init_waitqueue_head(&comp_workspace_wait[i]);
+		INIT_LIST_HEAD(&comp[i].idle_workspace);
+		spin_lock_init(&comp[i].workspace_lock);
+		atomic_set(&comp[i].alloc_workspace, 0);
+		init_waitqueue_head(&comp[i].workspace_wait);
 	}
 }
 
@@ -777,11 +779,11 @@ static struct list_head *find_workspace(int type)
 	int cpus = num_online_cpus();
 	int idx = type - 1;
 
-	struct list_head *idle_workspace	= &comp_idle_workspace[idx];
-	spinlock_t *workspace_lock		= &comp_workspace_lock[idx];
-	atomic_t *alloc_workspace		= &comp_alloc_workspace[idx];
-	wait_queue_head_t *workspace_wait	= &comp_workspace_wait[idx];
-	int *num_workspace			= &comp_num_workspace[idx];
+	struct list_head *idle_workspace	= &comp[idx].idle_workspace;
+	spinlock_t *workspace_lock		= &comp[idx].workspace_lock;
+	atomic_t *alloc_workspace		= &comp[idx].alloc_workspace;
+	wait_queue_head_t *workspace_wait	= &comp[idx].workspace_wait;
+	int *num_workspace			= &comp[idx].num_workspace;
 again:
 	spin_lock(workspace_lock);
 	if (!list_empty(idle_workspace)) {
@@ -820,11 +822,11 @@ again:
 static void free_workspace(int type, struct list_head *workspace)
 {
 	int idx = type - 1;
-	struct list_head *idle_workspace	= &comp_idle_workspace[idx];
-	spinlock_t *workspace_lock		= &comp_workspace_lock[idx];
-	atomic_t *alloc_workspace		= &comp_alloc_workspace[idx];
-	wait_queue_head_t *workspace_wait	= &comp_workspace_wait[idx];
-	int *num_workspace			= &comp_num_workspace[idx];
+	struct list_head *idle_workspace	= &comp[idx].idle_workspace;
+	spinlock_t *workspace_lock		= &comp[idx].workspace_lock;
+	atomic_t *alloc_workspace		= &comp[idx].alloc_workspace;
+	wait_queue_head_t *workspace_wait	= &comp[idx].workspace_wait;
+	int *num_workspace			= &comp[idx].num_workspace;
 
 	spin_lock(workspace_lock);
 	if (*num_workspace < num_online_cpus()) {
@@ -852,11 +854,11 @@ static void free_workspaces(void)
 	int i;
 
 	for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
-		while (!list_empty(&comp_idle_workspace[i])) {
-			workspace = comp_idle_workspace[i].next;
+		while (!list_empty(&comp[i].idle_workspace)) {
+			workspace = comp[i].idle_workspace.next;
 			list_del(workspace);
 			btrfs_compress_op[i]->free_workspace(workspace);
-			atomic_dec(&comp_alloc_workspace[i]);
+			atomic_dec(&comp[i].alloc_workspace);
 		}
 	}
 }
-- 
2.6.1
Re: BTRFS with 8TB SMR drives
On Mon, Oct 12, 2015 at 06:25:52PM +0200, Henk Slager wrote:
> and looking at this spec:
> http://www.seagate.com/files/www-content/product-content/hdd-fam/seagate-archive-hdd/en-us/docs/archive-hdd-dS1834-3-1411us.pdf
>
> it seems that it is a drive-managed SMR disk. I am not sure why David
> assumes it is host-managed, maybe drive firmware/functionality can be
> bypassed.

Because the drive-managed ones are not interesting from the filesystem
POV.
Re: [PATCH] btrfs: compress: put variables defined per compress type in struct to make cache friendly
On Wed, Oct 14, 2015 at 01:13:26AM +0900, Byongho Lee wrote:
> Below variables are defined per compress type.
>  - struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
>  - spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
>  - int comp_num_workspace[BTRFS_COMPRESS_TYPES]
>  - atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
>  - wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]
>
> BTW, while accessing one compress type of these variables, the next or
> before address is other compress types of it.
> So this patch puts these variables in a struct to make cache friendly.

Nice.

> +static struct {
> +	struct list_head idle_workspace;
> +	spinlock_t workspace_lock;
> +	int num_workspace;
> +	atomic_t alloc_workspace;
> +	wait_queue_head_t workspace_wait;
> +} comp[BTRFS_COMPRESS_TYPES];

The name became too generic, please rename it to btrfs_comp_ws.
btrfs_comp_workspaces would be too long. I won't mind trimming the
members to 'ws' instead of 'workspace' so this does not result in too
wild code formatting. The use of the workspaces is localized only to the
compression code so it will not be confusing.
Re: [PATCH] btrfs: fix resending received snapshot with parent
On Tue, Oct 13, 2015 at 1:31 AM, Ed Tomlinson wrote:
> On Friday, October 9, 2015 4:24:10 PM EDT, Filipe Manana wrote:
>>
>> On Wed, Sep 30, 2015 at 8:23 PM, Robin Ruede wrote:
>>>
>>> This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.
>>>
>>> When a snapshot is received, its received_uuid is set to the original
>>> uuid of the subvolume. When that snapshot is then resent to a third
>>> filesystem, it's received_uuid is set to the second uuid
>>> instead of the original one. The same was true for the parent_uuid.
>>> ...
>>
>> Reviewed-by: Filipe Manana
>>
>> Thanks for fixing this.
>> I've added this to my integration branch [1] and will send soon a pull
>> request to Chris for 4.4, including this fix plus a few others for
>> send/receive, after some more testing.
>>
>> I've also made an xfstest for it [1, 2]
>
>
> Another thanks for this fix.  It fixes things here.  I am runing 4.2.3 with
> the 4.3 btrfs tree pulled on top of it along with this fix.  Incremental
> sends
> are now working again.
> Tested-by: Ed Tomlinson
>
> This fixes a regression, can we please get into 4.3?

I've tagged it for stable backports in my 4.4 integration branch [1].
Thanks.

[1] http://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.4

thanks

>
> Thanks
> Ed Tomlinson
>

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
Re: [PATCH] btrfs: fix use after free iterating extrefs
On Tue, Oct 13, 2015 at 02:06:48PM -0400, Chris Mason wrote:
> The code for btrfs inode-resolve has never worked properly for
> files with enough hard links to trigger extrefs.  It was trying to
> get the leaf out of a path after freeing the path:
>
> 	btrfs_release_path(path);
> 	leaf = path->nodes[0];
> 	item_size = btrfs_item_size_nr(leaf, slot);
>
> The fix here is to use the extent buffer we cloned just a little higher
> up to avoid deadlocks caused by using the leaf in the path.
>
> Signed-off-by: Chris Mason
> cc: sta...@vger.kernel.org # v3.7+
> cc: Mark Fasheh

Reviewed-by: Mark Fasheh

Thanks for the CC Chris.
	--Mark

--
Mark Fasheh
Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
On 10/09/2015 07:15 AM, Pádraig Brady wrote:
> On 08/10/15 02:40, Neil Brown wrote:
>> Anna Schumaker writes:
>>
>>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in,
>>> loff_t pos_in,
>>>  			    struct file *file_out, loff_t pos_out,
>>>  			    size_t len, unsigned int flags)
>>>  {
>>> -	struct inode *inode_in;
>>> -	struct inode *inode_out;
>>>  	ssize_t ret;
>>>
>>> -	if (flags)
>>> +	/* Flags should only be used exclusively. */
>>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>>  		return -EINVAL;
>>>
>>
>> Do you also need:
>>
>>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>>  		return -EINVAL;
>>
>> so that future user-space can test if the kernel supports new flags?
>
> Seems like a good idea, yes.
>
> Also that got me thinking about COPY_FR_SPARSE.
> What's the current behavior when copying a sparse range?
> Is the hole propagated by default (good), or is it expanded?

I haven't tried it, but I think the hole would be expanded :(.  I'm
having splice() handle the pagecache copy part, and (as far as I know)
splice() doesn't know anything about sparse files.  I might be able to
put in some kind of fallocate() / splice() loop to copy the range in
multiple pieces.

I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will
have to determine how best to interpret "auto".  I'm more inclined to add
a single COPY_FR_SPARSE flag to enable creating sparse files, and then
have the application tell us what to do for any given range.

Anna

>
> Note cp(1) has --sparse={never,auto,always}. Auto is the default,
> so it would be good I think if that was the default mode for
> copy_file_range().
> With other sparse modes, we'd have to avoid copy_file_range() unless
> there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}.
> Note currently cp --sparse=always will detect runs of zeros and also
> avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE,
> ...)
>
> thanks,
> Pádraig.
>
Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
On 10/07/2015 09:40 PM, Neil Brown wrote:
> Anna Schumaker writes:
>
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in,
>> loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>
>
> Do you also need:
>
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>  		return -EINVAL;
>
> so that future user-space can test if the kernel supports new flags?

Probably.  I'll add that in!

Thanks,
Anna

>
> NeilBrown
>
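The two checks discussed above (at most one mode flag, and no unknown bits) can be combined in a small validator. This is a sketch under assumptions, not the proposed kernel code: the thread does not show the actual `COPY_FR_*` values, so the bit positions below are hypothetical, as is the helper name `check_copy_flags`. It also swaps the three pairwise tests from the patch for the clear-lowest-set-bit test `flags & (flags - 1)`, which is equivalent once unknown bits have been rejected.

```c
#include <errno.h>

/* Hypothetical bit values; the real definitions are not shown in the thread. */
#define COPY_FR_COPY    (1u << 0)
#define COPY_FR_REFLINK (1u << 1)
#define COPY_FR_DEDUP   (1u << 2)
#define COPY_FR_KNOWN   (COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP)

/* Returns 0 if flags are acceptable, -EINVAL otherwise. */
static int check_copy_flags(unsigned int flags)
{
	/* Reject unknown bits so user space can probe for newer flags. */
	if (flags & ~COPY_FR_KNOWN)
		return -EINVAL;

	/*
	 * Allow at most one mode bit: clearing the lowest set bit must
	 * leave nothing behind. flags == 0 also passes this test.
	 */
	if (flags & (flags - 1))
		return -EINVAL;

	return 0;
}
```

Rejecting unknown bits is what makes feature probing work: an application can pass a flag introduced later and tell from -EINVAL that the running kernel does not support it.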
Re: [PATCH 0/3] btrfs-progs: mkfs: Fix different mixed type by argument sequence
On Tue, Oct 13, 2015 at 08:52:16PM +0800, Zhao Lei wrote:
> Given a 200G vdd1 and 1G vdd2:
>
> In current code:
>   mkfs.btrfs -f /dev/vdd1 /dev/vdd2
> and
>   mkfs.btrfs -f /dev/vdd2 /dev/vdd1
> will create different "mixed" type.

I think combining large and small devices was not the intended use for
mixed-bg; nevertheless, the current behaviour is not right.

Chandan is working on dropping the forced mixed-bg completely. We've
discussed that on IRC, I'm ok with it but this needs more testing. So
far it looks fine, small filesystems get created and usable, though some
tuning might be needed.

My intention for 4.3 is to take Chandan's work provided that we test it
enough. There are like 3 weeks left. In case of problems, I'll take this
patchset so at least we get the inconsistent behaviour fixed.
[PATCH] btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason
cc: sta...@vger.kernel.org # v3.7+
cc: Mark Fasheh
---
 fs/btrfs/backref.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ecbc63d..9a2ec79 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1828,7 +1828,6 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
 	int found = 0;
 	struct extent_buffer *eb;
 	struct btrfs_inode_extref *extref;
-	struct extent_buffer *leaf;
 	u32 item_size;
 	u32 cur_offset;
 	unsigned long ptr;
@@ -1856,9 +1855,8 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
 		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 		btrfs_release_path(path);
 
-		leaf = path->nodes[0];
-		item_size = btrfs_item_size_nr(leaf, slot);
-		ptr = btrfs_item_ptr_offset(leaf, slot);
+		item_size = btrfs_item_size_nr(eb, slot);
+		ptr = btrfs_item_ptr_offset(eb, slot);
 		cur_offset = 0;
 
 		while (cur_offset < item_size) {
@@ -1872,7 +1870,7 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
 			if (ret)
 				break;
 
-			cur_offset += btrfs_inode_extref_name_len(leaf, extref);
+			cur_offset += btrfs_inode_extref_name_len(eb, extref);
 			cur_offset += sizeof(*extref);
 		}
 		btrfs_tree_read_unlock_blocking(eb);
-- 
2.4.6
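The pattern behind this fix (take a private copy, release the shared object, then read only from the copy) can be shown with a toy model. Everything below is hypothetical and heavily simplified: `struct leaf`, `struct path`, `release_path` and `clone_leaf` are stand-ins invented for this sketch and do not reflect the kernel's extent buffer or path handling. It only illustrates the ordering the patch restores.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins; not the kernel's extent_buffer or btrfs_path. */
struct leaf {
	uint32_t item_size[4];
};

struct path {
	struct leaf *node;
};

/* Releasing the path gives up the leaf; it must not be used afterwards. */
static void release_path(struct path *p)
{
	free(p->node);
	p->node = NULL;
}

/* Make a private copy that stays valid independently of the path. */
static struct leaf *clone_leaf(const struct leaf *src)
{
	struct leaf *eb = malloc(sizeof(*eb));

	if (eb)
		memcpy(eb, src, sizeof(*eb));
	return eb;
}

/* Returns the item size at 'slot', or 0 if the clone could not be made. */
static uint32_t item_size_after_release(struct path *p, int slot)
{
	struct leaf *eb = clone_leaf(p->node);
	uint32_t size;

	if (!eb)
		return 0;

	release_path(p);

	/* From here on, read from the clone, never from p->node. */
	size = eb->item_size[slot];

	free(eb);
	return size;
}
```

The buggy ordering corresponds to reading `p->node->item_size[slot]` after `release_path(p)`; the fixed ordering reads the same value from `eb`, which the caller still owns.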
Re: [PATCH] btrfs: fix use after free iterating extrefs
On Tue, Oct 13, 2015 at 7:06 PM, Chris Mason wrote:
> The code for btrfs inode-resolve has never worked properly for
> files with enough hard links to trigger extrefs. It was trying to
> get the leaf out of a path after freeing the path:
>
>	btrfs_release_path(path);
>	leaf = path->nodes[0];
>	item_size = btrfs_item_size_nr(leaf, slot);
>
> The fix here is to use the extent buffer we cloned just a little higher
> up to avoid deadlocks caused by using the leaf in the path.
>
> Signed-off-by: Chris Mason
> cc: sta...@vger.kernel.org # v3.7+
> cc: Mark Fasheh

Reviewed-by: Filipe Manana

Looks good to me. I failed to notice that problem at commit [1]

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3fe81ce206f3805e0eb5d886aabbf91064655144

> ---
>  fs/btrfs/backref.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index ecbc63d..9a2ec79 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1828,7 +1828,6 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
>  	int found = 0;
>  	struct extent_buffer *eb;
>  	struct btrfs_inode_extref *extref;
> -	struct extent_buffer *leaf;
>  	u32 item_size;
>  	u32 cur_offset;
>  	unsigned long ptr;
> @@ -1856,9 +1855,8 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
>  		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>  		btrfs_release_path(path);
>
> -		leaf = path->nodes[0];
> -		item_size = btrfs_item_size_nr(leaf, slot);
> -		ptr = btrfs_item_ptr_offset(leaf, slot);
> +		item_size = btrfs_item_size_nr(eb, slot);
> +		ptr = btrfs_item_ptr_offset(eb, slot);
>  		cur_offset = 0;
>
>  		while (cur_offset < item_size) {
> @@ -1872,7 +1870,7 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
>  			if (ret)
>  				break;
>
> -			cur_offset += btrfs_inode_extref_name_len(leaf, extref);
> +			cur_offset += btrfs_inode_extref_name_len(eb, extref);
>  			cur_offset += sizeof(*extref);
>  		}
>  		btrfs_tree_read_unlock_blocking(eb);
> --
> 2.4.6

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
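
The ordering problem described in the patch can be shown with a small userspace analogue. This is only a sketch under stated assumptions: `struct buf`, `struct path`, `clone_buf()`, `release_path()` and `read_item_size()` are hypothetical stand-ins invented for illustration, not the kernel's `extent_buffer`, `btrfs_path` or `btrfs_release_path()`. It shows the pattern the fix follows: take a private copy first, release the path, and from then on read only from the copy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; not the real btrfs structures. */
struct buf {
	unsigned int item_size;
};

struct path {
	struct buf *nodes[1];
};

/* Copy the leaf so the caller owns data that outlives the path. */
static struct buf *clone_buf(const struct buf *src)
{
	struct buf *copy = malloc(sizeof(*copy));

	if (copy)
		memcpy(copy, src, sizeof(*copy));
	return copy;
}

/* Drop the path's buffer; p->nodes[0] must not be read afterwards. */
static void release_path(struct path *p)
{
	free(p->nodes[0]);
	p->nodes[0] = NULL;
}

/* Clone first, release the path, then read only from the clone. */
static int read_item_size(struct path *p, unsigned int *out)
{
	struct buf *eb = clone_buf(p->nodes[0]);

	if (!eb)
		return -1;
	release_path(p);
	/* The buggy ordering would read p->nodes[0]->item_size here. */
	*out = eb->item_size;
	free(eb);
	return 0;
}
```

Calling `read_item_size()` on a path whose `nodes[0]` holds an item size of 53 stores 53 in `*out` and leaves `nodes[0]` cleared, so any later access through the path fails visibly rather than touching freed memory.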
State of Dedup / Defrag
What is the current state of Dedup and Defrag in btrfs? I seem to recall
there having been problems a few months ago and I've stopped using it,
but I haven't seen much news since.

I'm interested in both the 3.18 and subsequent kernel series.

--
Rich
System completely unresponsive after `btrfs balance start -dconvert=raid0 /` and `btrfs fi show /`
Hi all,

I have a home server with 3 hard drives that I added to the same btrfs
filesystem. Several hours ago I ran `btrfs balance start -dconvert=raid0 /`
and as soon as I ran `btrfs fi show /` I lost my ssh connection to the
machine.

The machine is still on, but it doesn’t even respond to ping: I always get
a request timeout and sometimes even a "host is down" message. Its fans are
spinning at full blast and the hard drives' LEDs show activity all the
time. I run Plex Home Theater there too, and the display output is frozen
at the moment I ran those two commands.

I left it running because I am afraid of losing everything by powering it
down manually.

Should I leave it like this and let it finish? How long might it take? (I
have a 250 GB internal hard drive, a 120 GB USB 2.0 one and a 2 TB USB 2.0
one, so the transfer speeds are pretty low.) Is it safe to power it off
manually? Should I file a bug afterwards?

Any help would be appreciated.

Thanks,
Carmine
Can't mount btrfs: corrupt leaf, slot offset bad
I rebooted my server last night and discovered that my btrfs filesystem
(3-disk raid1) would not mount anymore. After doing some research and
getting nowhere, I went to IRC, where user darkling asked me a few
questions, asked for the output of btrfs-debug-tree, and ultimately sent me
here, saying I should include a handful of things.

Before I go further, let's get the required info out of the way:

uname -a:
Linux archhost1 4.2.3-1-ARCH #1 SMP PREEMPT Sat Oct 3 18:52:50 CEST 2015 x86_64 GNU/Linux

btrfs --version:
btrfs-progs v4.2.1

Output from "btrfs fi show":
Label: none  uuid: 5470630f-39f4-4d39-90a2-277d7991722a
	Total devices 3 FS bytes used 3.10TiB
	devid 1 size 3.64TiB used 2.12TiB path /dev/sdd
	devid 2 size 3.64TiB used 2.12TiB path /dev/sde
	devid 3 size 3.64TiB used 2.12TiB path /dev/sdc

First, I am able to mount with -o ro,recovery, but not with just
-o recovery. When I attempt to mount without ro, I get this in dmesg:

[44478.800613] BTRFS critical (device sde): corrupt leaf, slot offset bad: block=5674754899968,root=1, slot=147
[44478.802489] BTRFS critical (device sde): corrupt leaf, slot offset bad: block=5674754899968,root=1, slot=147
[44478.804072] BTRFS error (device sde): Error removing orphan entry, stopping orphan cleanup
[44478.805856] BTRFS error (device sde): could not do orphan cleanup -22
[44482.635498] BTRFS: open_ctree failed

Running "btrfs-debug-tree -b 5674754899968 /dev/sde" gave me this:

leaf 5674754899968 items 207 free space 30 generation 884595 owner 5
fs uuid 5470630f-39f4-4d39-90a2-277d7991722a
chunk uuid c269615e-7397-41bc-95d0-dfdb2a696b23
[...]
	item 145 key (273094 EXTENT_DATA 364924928) itemoff 8545 itemsize 53
		extent data disk byte 8658465382400 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0
	item 146 key (273094 EXTENT_DATA 364929024) itemoff 8492 itemsize 53
		extent data disk byte 8658465378304 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0
	item 147 key (273094 EXTENT_DATA 364933120) itemoff 8439 itemsize 53
		extent data disk byte 8677950173184 nr 24576
		extent data offset 0 nr 20480 ram 24576
		extent compression 0
	item 148 key (273094 EXTENT_DATA 364953600) itemoff 8333 itemsize 53
		extent data disk byte 8677990363136 nr 20480
		extent data offset 0 nr 16384 ram 20480
		extent compression 0
	item 149 key (273094 EXTENT_DATA 364957696) itemoff 8386 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 18446744073709514752 ram 18446744073709514752
		extent compression 0
	item 150 key (273094 EXTENT_DATA 364969984) itemoff 8280 itemsize 53
		extent data disk byte 8678063341568 nr 20480
		extent data offset 0 nr 16384 ram 20480
		extent compression 0
	item 151 key (273094 EXTENT_DATA 365002752) itemoff 8227 itemsize 53
		extent data disk byte 8678025232384 nr 36864
		extent data offset 0 nr 32768 ram 36864
		extent compression 0
	item 152 key (273094 EXTENT_DATA 365019136) itemoff 8174 itemsize 53
		extent data disk byte 8678112104448 nr 36864
		extent data offset 0 nr 32768 ram 36864
		extent compression 0
	item 153 key (273094 EXTENT_DATA 365051904) itemoff 8121 itemsize 53
		extent data disk byte 8678052835328 nr 53248
		extent data offset 0 nr 49152 ram 53248
		extent compression 0
	item 154 key (273094 EXTENT_DATA 365101056) itemoff 8068 itemsize 53
		extent data disk byte 8678090510336 nr 20480
		extent data offset 0 nr 16384 ram 20480
		extent compression 0
	item 155 key (273094 EXTENT_DATA 365117440) itemoff 8015 itemsize 53
		extent data disk byte 8678117130240 nr 20480
		extent data offset 0 nr 16384 ram 20480
		extent compression 0
[...]
Output from "btrfs check --readonly /dev/sde":

Checking filesystem on /dev/sde
UUID: 5470630f-39f4-4d39-90a2-277d7991722a
checking extents
incorrect offsets 8439 8386
bad block 5674754899968
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots

Output from (failed) "btrfs check --repair /dev/sdc" (which I tried prior
to seeking help):

enabling repair mode
Checking filesystem on /dev/sdc
UUID: 5470630f-39f4-4d39-90a2-277d7991722a
checking extents
incorrect offsets 8439 8386
shifting item nr 148 by bytes in block 5674754899968
items overlap, can't fix
cmds-check.c:4059: fix_item_offset: Assertion `ret` failed.

darkling also mentioned that btrfs-zero-log might help too, but that I should get confirmation from one of the devs on
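
The numbers in the dump and in the "incorrect offsets 8439 8386" message fit a simple invariant: item data is packed back to back, so each item's offset should equal the offset plus size of the item in the next slot. The sketch below checks that invariant on the itemoff/itemsize pairs of items 145 to 150, copied from the dump. It is an illustration only, on the assumption that this is the comparison being reported; `struct item`, `first_bad_slot()` and `dump_items` are hypothetical names, not kernel or btrfs-progs code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified item header: only the two fields the check needs. */
struct item {
	uint32_t offset; /* "itemoff" in the dump */
	uint32_t size;   /* "itemsize" in the dump */
};

/*
 * Return the index of the first slot whose offset differs from the end
 * (offset + size) of the item in the following slot, or -1 if every
 * adjacent pair is consistent.
 */
static int first_bad_slot(const struct item *items, size_t nr)
{
	for (size_t slot = 0; slot + 1 < nr; slot++) {
		uint32_t next_end = items[slot + 1].offset + items[slot + 1].size;

		if (items[slot].offset != next_end)
			return (int)slot;
	}
	return -1;
}

/* itemoff/itemsize of items 145..150, copied from the dump. */
static const struct item dump_items[] = {
	{ 8545, 53 }, /* item 145 */
	{ 8492, 53 }, /* item 146 */
	{ 8439, 53 }, /* item 147 */
	{ 8333, 53 }, /* item 148 */
	{ 8386, 53 }, /* item 149 */
	{ 8280, 53 }, /* item 150 */
};
```

With this data, `first_bad_slot(dump_items, 6)` returns index 2, which is item 147: its offset is 8439, while item 148 ends at 8333 + 53 = 8386. Those are the two values btrfs check prints, and the slot matches the one in dmesg. Items 148 and 149 look as if their offsets were swapped, since 8386 followed by 8333 would satisfy the invariant.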
Re: [PATCH] btrfs: compress: put variables defined per compress type in struct to make cache friendly
David Sterba writes:
>
>> +static struct {
>> +	struct list_head idle_workspace;
>> +	spinlock_t workspace_lock;
>> +	int num_workspace;
>> +	atomic_t alloc_workspace;
>> +	wait_queue_head_t workspace_wait;
>> +} comp[BTRFS_COMPRESS_TYPES];
>
> The name became too generic, please rename it to btrfs_comp_ws.
> btrfs_comp_workspaces would be too long. I won't mind trimming the
> members to 'ws' instead of 'workspace' so this does not result in too
> wild code formatting. The use of the workspaces is localized only to the
> compression code so it will not be confusing.

Thanks for the feedback. I will prepare a v2 patch applying your comments.