Re: fsck: to repair or not to repair
On May 11 2016, Nikolaus Rath wrote:
> Hello,
>
> I recently ran btrfsck on one of my file systems, and got the following
> messages:
>
> checking extents
> checking free space cache
> checking fs roots
> root 5 inode 3149867 errors 400, nbytes wrong
> root 5 inode 3150237 errors 400, nbytes wrong
> root 5 inode 3150238 errors 400, nbytes wrong
> root 5 inode 3150242 errors 400, nbytes wrong
> root 5 inode 3150260 errors 400, nbytes wrong
> [ lots of similar messages with different inode numbers ]
> root 5 inode 15595011 errors 400, nbytes wrong
> root 5 inode 15595016 errors 400, nbytes wrong
> Checking filesystem on /dev/mapper/vg0-nikratio_crypt
> UUID: 8742472d-a9b0-4ab6-b67a-5d21f14f7a38
> found 263648960636 bytes used err is 1
> total csum bytes: 395314372
> total tree bytes: 908644352
> total fs tree bytes: 352735232
> total extent tree bytes: 95039488
> btree space waste bytes: 156301160
> file data blocks allocated: 675209801728
>  referenced 410351722496
> Btrfs v3.17
>
> Can someone explain to me the risk that I run by attempting a repair,
> and (conversely) what I put at stake when continuing to use this file
> system as-is?

To follow up on this: after finding out which files were affected (using
btrfs inspect-internal), I was able to fix the problem without using
btrfsck, by simply copying the data, deleting the file, and restoring it:

cat affected-files.txt | while read -r name; do
    rsync -a "${name}" "/backup/location/${name}"
    rm -f "${name}"
    cp -a "/backup/location/${name}" "${name}"
done

(I used rsync to avoid cp making use of reflinks.) After this procedure,
btrfsck reported no more problems.

Best,
-Nikolaus

--
GPG encrypted emails preferred.
Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.« -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
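Mapping the reported inode numbers back to paths is what `btrfs inspect-internal inode-resolve` does; pulling the inode numbers out of the btrfsck report first is a small parsing job. A minimal sketch (the sample lines below are taken from the report above):

```python
import re

def affected_inodes(fsck_output):
    """Extract (root, inode) pairs from btrfsck 'errors 400, nbytes wrong' lines."""
    pat = re.compile(r"^root (\d+) inode (\d+) errors", re.MULTILINE)
    return [(int(root), int(ino)) for root, ino in pat.findall(fsck_output)]

report = """checking fs roots
root 5 inode 3149867 errors 400, nbytes wrong
root 5 inode 3150237 errors 400, nbytes wrong"""

for root, ino in affected_inodes(report):
    # Each inode can then be resolved to a path with e.g.:
    #   btrfs inspect-internal inode-resolve <ino> <mountpoint>
    print(root, ino)
```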
Re: How to map extents to files
On Jun 01 2016, Nikolaus Rath wrote:
> Hello,
>
> For one of my btrfs volumes, btrfsck reports a lot of the following
> warnings:
>
> [...]
> checking extents
> bad extent [138477568, 138510336), type mismatch with chunk
> bad extent [140091392, 140148736), type mismatch with chunk
> bad extent [140148736, 140201984), type mismatch with chunk
> bad extent [140836864, 140865536), type mismatch with chunk
> [...]
>
> Is there a way to discover which files are affected by this (in
> particular so that I can take a look at them before and after a btrfsck
> --repair)?

Hmm. Anyone? Is there really no way to figure out what might be affected
if I attempt a repair?

Best,
-Nikolaus

--
GPG encrypted emails preferred.
Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
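For a single logical address, `btrfs inspect-internal logical-resolve` can map it to paths; for whole ranges, once the file-to-extent mapping has been dumped (e.g. via FIEMAP or a tree dump), finding the affected files is a plain interval-overlap check between the bad extent ranges and each file's extents. A sketch with made-up file mappings (the paths and extents are illustrative, not from the report):

```python
def overlaps(a, b):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def affected_files(bad_extents, file_extents):
    """file_extents: {path: [(start, end), ...]} in the same logical address space
    as the bad extent ranges printed by btrfsck."""
    return sorted({path
                   for path, extents in file_extents.items()
                   for ext in extents
                   for bad in bad_extents
                   if overlaps(ext, bad)})

bad = [(138477568, 138510336), (140091392, 140148736)]
files = {
    "/mnt/a": [(138000000, 138500000)],   # overlaps the first bad extent
    "/mnt/b": [(139000000, 139500000)],   # clean
}
print(affected_files(bad, files))  # ['/mnt/a']
```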
[PATCH 1/2] btrfs: plumb fs_info into btrfs_work
From: Jeff Mahoney

In order to provide an fsid for trace events, we'll need a btrfs_fs_info
pointer. The most lightweight way to do that for btrfs_work structures is
to associate it with the __btrfs_workqueue structure. Each queued
btrfs_work structure has a workqueue associated with it, so that's a
natural fit. It's a privately defined structure, so we add accessors to
retrieve the fs_info pointer.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/async-thread.c | 31 +--
 fs/btrfs/async-thread.h |  6 +-
 fs/btrfs/disk-io.c      | 47 ---
 fs/btrfs/scrub.c        | 10 +-
 4 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 5fb60ea..e0f071f 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -34,6 +34,10 @@ struct __btrfs_workqueue {
 	struct workqueue_struct *normal_wq;
+
+	/* File system this workqueue services */
+	struct btrfs_fs_info *fs_info;
+
 	/* List head pointing to ordered work list */
 	struct list_head ordered_list;
@@ -70,6 +74,18 @@ void btrfs_##name(struct work_struct *arg)		\
 	normal_work_helper(work);				\
 }
+struct btrfs_fs_info *
+btrfs_workqueue_owner(struct __btrfs_workqueue *wq)
+{
+	return wq->fs_info;
+}
+
+struct btrfs_fs_info *
+btrfs_work_owner(struct btrfs_work *work)
+{
+	return work->wq->fs_info;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
@@ -94,14 +110,15 @@ BTRFS_WORK_HELPER(scrubnc_helper);
 BTRFS_WORK_HELPER(scrubparity_helper);
 
 static struct __btrfs_workqueue *
-__btrfs_alloc_workqueue(const char *name, unsigned int flags, int limit_active,
-			int thresh)
+__btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info, const char *name,
+			unsigned int flags, int limit_active, int thresh)
 {
 	struct __btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
 
 	if (!ret)
 		return NULL;
 
+	ret->fs_info = fs_info;
 	ret->limit_active = limit_active;
 	atomic_set(&ret->pending, 0);
 	if (thresh == 0)
@@ -143,7 +160,8 @@
__btrfs_alloc_workqueue(const char *name, unsigned int flags, int limit_active,
 
 static inline void
 __btrfs_destroy_workqueue(struct __btrfs_workqueue *wq);
 
-struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
+					      const char *name,
 					      unsigned int flags,
 					      int limit_active,
 					      int thresh)
@@ -153,7 +171,8 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
 	if (!ret)
 		return NULL;
 
-	ret->normal = __btrfs_alloc_workqueue(name, flags & ~WQ_HIGHPRI,
+	ret->normal = __btrfs_alloc_workqueue(fs_info, name,
+					      flags & ~WQ_HIGHPRI,
 					      limit_active, thresh);
 	if (!ret->normal) {
 		kfree(ret);
@@ -161,8 +180,8 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
 	}
 
 	if (flags & WQ_HIGHPRI) {
-		ret->high = __btrfs_alloc_workqueue(name, flags, limit_active,
-						    thresh);
+		ret->high = __btrfs_alloc_workqueue(fs_info, name, flags,
+						    limit_active, thresh);
 		if (!ret->high) {
 			__btrfs_destroy_workqueue(ret->normal);
 			kfree(ret);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index ad4d064..8e52484 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -21,6 +21,7 @@
 #define __BTRFS_ASYNC_THREAD_
 #include 
+struct btrfs_fs_info;
 struct btrfs_workqueue;
 /* Internal use only */
 struct __btrfs_workqueue;
@@ -67,7 +68,8 @@ BTRFS_WORK_HELPER_PROTO(scrubnc_helper);
 BTRFS_WORK_HELPER_PROTO(scrubparity_helper);
 
-struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
+					      const char *name,
 					      unsigned int flags,
 					      int limit_active,
 					      int thresh);
@@ -80,4 +82,6 @@ void btrfs_queue_work(struct btrfs_workqueue *wq,
 void btrfs_destroy_workqueue(struct btrfs_workqueue *wq);
 void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
+struct btrfs
[PATCH 2/2] btrfs: prefix fsid to all trace events
From: Jeff Mahoney

When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event. This patch adds a macro
to prefix standard information to the head of a trace event. The
extent_state alloc/free events are all that's left without an fs_info
available.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/delayed-ref.c       |   9 +-
 fs/btrfs/extent-tree.c       |  10 +-
 fs/btrfs/qgroup.c            |  19 +--
 fs/btrfs/qgroup.h            |   9 +-
 fs/btrfs/super.c             |   2 +-
 include/trace/events/btrfs.h | 282 ---
 6 files changed, 182 insertions(+), 149 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..e7b1ec0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -606,7 +606,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	qrecord->num_bytes = num_bytes;
 	qrecord->old_roots = NULL;
 
-	qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
+	qexisting = btrfs_qgroup_insert_dirty_extent(fs_info,
+						     delayed_refs,
 						     qrecord);
 	if (qexisting)
 		kfree(qrecord);
@@ -615,7 +616,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&head_ref->lock);
 	mutex_init(&head_ref->mutex);
 
-	trace_add_delayed_ref_head(ref, head_ref, action);
+	trace_add_delayed_ref_head(fs_info, ref, head_ref, action);
 
 	existing = htree_insert(&delayed_refs->href_root,
 				&head_ref->href_node);
@@ -682,7 +683,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->type = BTRFS_TREE_BLOCK_REF_KEY;
 	full_ref->level = level;
 
-	trace_add_delayed_tree_ref(ref, full_ref, action);
+	trace_add_delayed_tree_ref(fs_info, ref, full_ref, action);
 
 	ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
 
@@ -739,7 +740,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	full_ref->objectid = owner;
 	full_ref->offset = offset;
 
-	trace_add_delayed_data_ref(ref, full_ref, action);
+	trace_add_delayed_data_ref(fs_info, ref, full_ref, action);
 
 	ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
 
diff --git
a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..ecb68bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2194,7 +2194,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
 
 	ref = btrfs_delayed_node_to_data_ref(node);
-	trace_run_delayed_data_ref(node, ref, node->action);
+	trace_run_delayed_data_ref(root->fs_info, node, ref, node->action);
 
 	if (node->type == BTRFS_SHARED_DATA_REF_KEY)
 		parent = ref->parent;
@@ -2349,7 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 						 SKINNY_METADATA);
 
 	ref = btrfs_delayed_node_to_tree_ref(node);
-	trace_run_delayed_tree_ref(node, ref, node->action);
+	trace_run_delayed_tree_ref(root->fs_info, node, ref, node->action);
 
 	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY)
 		parent = ref->parent;
@@ -2413,7 +2413,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 		 */
 		BUG_ON(extent_op);
 		head = btrfs_delayed_node_to_head(node);
-		trace_run_delayed_ref_head(node, head, node->action);
+		trace_run_delayed_ref_head(root->fs_info, node, head,
+					   node->action);
 
 		if (insert_reserved) {
 			btrfs_pin_extent(root, node->bytenr,
@@ -8316,7 +8317,8 @@ static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
 	delayed_refs = &trans->transaction->delayed_refs;
 	spin_lock(&delayed_refs->lock);
-	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+	if (btrfs_qgroup_insert_dirty_extent(trans->root->fs_info,
+					     delayed_refs, qrecord))
 		kfree(qrecord);
 	spin_unlock(&delayed_refs->lock);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9d4c05b..13e28d8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1453,9 +1453,10 @@ int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-struct btrfs_qgroup_extent_record
-*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
-				  struct btrfs_qgroup_extent_record *record)
+struct btrfs_qgroup_extent_record *
+btrfs_qgroup_insert_dirty_extent(struct btrfs_fs_info *fs_info
Re: [PATCH] btrfs-progs: add the error message when open fails
Hi, David,

On 2016/06/09 22:13, David Sterba wrote:
> On Thu, Jun 09, 2016 at 10:23:15AM +0900, Tsutomu Itoh wrote:
>> When open in btrfs_open_devices failed, only the following message is
>> displayed. Therefore the user doesn't understand the reason why open
>> failed.
>>
>>   # btrfs check /dev/sdb8
>>   Couldn't open file system
>>
>> This patch adds the error message when open fails.
>
> I think the message should be printed by the caller, not by the helper.

However, in this case the caller does not know which device failed to
open.

Thanks,
Tsutomu
[PATCH] Btrfs-progs: add check-only option for balance
From: Liu Bo

This aims to decide whether a balance can reduce the number of data block
groups and, if it can, it shows the objectid of the block group to pass
to '-dvrange'.

With this, you can run 'btrfs balance start -c mnt' or
'btrfs balance start --check-only mnt':

--
$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group    12582912 (len     8388608 used     786432)
block_group  1103101952 (len  1073741824 used  536870912)
block_group  2176843776 (len  1073741824 used 1073741824)
total bgs 3 total_free 544473088
min_used bg 12582912 has (min_used 786432 free 7602176)
run 'btrfs balance start -dvrange=12582912..12582913 your_mnt'

$ btrfs balance start -dvrange=12582912..12582913 /mnt/btrfs
Done, had to relocate 1 out of 5 chunks

$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group  1103101952 (len  1073741824 used  537395200)
block_group  2176843776 (len  1073741824 used 1073741824)
total bgs 2 total_free 536346624
min_used bg 1103101952 has (min_used 537395200 free 536346624)
--

So you now know how to babysit your btrfs in a smart way.
Signed-off-by: Liu Bo
Signed-off-by: Ashish Samant
---
 cmds-balance.c | 127 +++-
 1 files changed, 126 insertions(+), 1 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 8f3bf5b..e2aab6c 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -493,6 +493,116 @@ out:
 	return ret;
 }
 
+/* return 0 if balance can remove a data block group, otherwise return 1 */
+static int search_data_bgs(const char *path)
+{
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk;
+	struct btrfs_ioctl_search_header *header;
+	struct btrfs_block_group_item *bg;
+	unsigned long off = 0;
+	DIR *dirstream = NULL;
+	int e;
+	int fd;
+	int i;
+	u64 total_free = 0;
+	u64 min_used = (u64)-1;
+	u64 free_of_min_used = 0;
+	u64 bg_of_min_used = 0;
+	u64 flags;
+	u64 used;
+	int ret = 0;
+	int nr_data_bgs = 0;
+
+	fd = btrfs_open_dir(path, &dirstream, 1);
+	if (fd < 0)
+		return 1;
+
+	memset(&args, 0, sizeof(args));
+	sk = &args.key;
+
+	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+	sk->min_objectid = sk->min_offset = sk->min_transid = 0;
+	sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
+	sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	sk->nr_items = 65536;
+
+	while (1) {
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		e = errno;
+		if (ret < 0) {
+			fprintf(stderr, "ret %d error '%s'\n", ret,
+				strerror(e));
+			return ret;
+		}
+		/*
+		 * it should not happen.
+		 */
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			header = (struct btrfs_ioctl_search_header *)(args.buf +
+								      off);
+
+			off += sizeof(*header);
+			if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+				bg = (struct btrfs_block_group_item *)
+				     (args.buf + off);
+				flags = btrfs_block_group_flags(bg);
+				if (flags & BTRFS_BLOCK_GROUP_DATA) {
+					nr_data_bgs++;
+					used = btrfs_block_group_used(bg);
+					printf(
+		"block_group %15llu (len %11llu used %11llu)\n",
+					       header->objectid,
+					       header->offset, used);
+					total_free += header->offset - used;
+					if (min_used >= used) {
+						min_used = used;
+						free_of_min_used =
+							header->offset - used;
+						bg_of_min_used =
+							header->objectid;
+					}
+				}
+			}
+
+			off += header->len;
+			sk->min_objectid = header->objectid;
+			sk->min_type = header->type;
+			sk->min_offset = header->offset
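The selection logic in search_data_bgs() boils down to one pass over the data block groups, keeping the one with the least used bytes. The numbers from the commit message reproduce that pass (a sketch of the selection, not the C code itself):

```python
def pick_balance_target(data_block_groups):
    """data_block_groups: list of (objectid, length, used) tuples for DATA
    block groups. Returns (total_free, objectid of the min-used block group),
    the same values search_data_bgs() prints before the -dvrange suggestion."""
    total_free = 0
    min_used = None
    bg_of_min_used = None
    for objectid, length, used in data_block_groups:
        total_free += length - used
        # '<=' so later ties win, matching the patch's 'min_used >= used' test.
        if min_used is None or used <= min_used:
            min_used = used
            bg_of_min_used = objectid
    return total_free, bg_of_min_used

# The three block groups from the first example run above:
bgs = [
    (12582912, 8388608, 786432),
    (1103101952, 1073741824, 536870912),
    (2176843776, 1073741824, 1073741824),
]
free, target = pick_balance_target(bgs)
print(f"run 'btrfs balance start -dvrange={target}..{target + 1} your_mnt'")
# run 'btrfs balance start -dvrange=12582912..12582913 your_mnt'
```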
Re: Recommended why to use btrfs for production?
Chris Murphy posted on Thu, 09 Jun 2016 11:39:23 -0600 as excerpted:

> Yeah but somewhere there's a chunk that's likely affected by two losses,
> with a probability much higher than for conventional raid10 where such a
> loss is very binary: if the loss is a mirrored pair, the whole array and
> filesystem implodes; if the loss does not affect an entire mirrored
> pair, the whole array survives.
>
> The thing with Btrfs raid 10 is you can't really tell in advance to what
> degree you have loss. It's not a binary condition, it has a gray area
> where a lot of data can still be retrieved, but the instant you hit
> missing data it's a loss, and if you hit missing metadata then the fs
> will either go read only or crash, it just can't continue. So that
> "walking on egg shells" behavior in a 2+ drive loss is really different
> from a conventional raid10 where it's either gonna completely work or
> completely fail.

Yes, thanks, CMurphy. That's exactly what I was trying to explain. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH] btrfs: chunk_width_limit mount option
On Mon, Jun 6, 2016 at 11:00 AM, Hugo Mills wrote:
> On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote:
>> On Mon, Jun 6, 2016 at 5:17 AM, David Sterba wrote:
>> > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
>> >> This patch adds mount option 'chunk_width_limit=X', which when set
>> >> forces the chunk allocator to use only up to X devices when
>> >> allocating a chunk. This may help reduce the seek penalties seen in
>> >> filesystems with large numbers of devices.
>> >
>> > There is a use for such limit but passing it as a mount is not the
>> > right way to do it. Changing the value requires remount, different
>> > raid levels might need different values, it's not persistent etc.
>>
>> Thanks for the feedback. I agree that it's less than an ideal way to
>> do it. I guess I was a bit quick to release proof-of-concept code.
>> I'll try to do some more work on the configuration side of things and
>> resubmit.
>
>    Last time something like this came up in discussion, the consensus
> was to put the configuration in xattrs in the btrfs namespace. There
> should be one with a name to indicate "global" configuration, applied
> to the top of subvol 5. If the feature allows configuration on a
> per-subvol or per-object basis, then there should also be a name for
> the relevant xattr (also in the btrfs namespace) that can be created
> on each object as required.
>
>    Hugo.
>
> --
> Hugo Mills             | Klytus, I'm bored. What plaything can you offer me
> hugo@... carfax.org.uk | today?
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          | Ming the Merciless, Flash Gordon

OK, since I'm new here I'll throw my thoughts out before writing and
submitting another patch.

I wouldn't want to have to read the xattrs from disk every time a chunk
needs to be allocated. That means the allocator configuration will need
to be cached in the fs_info structure.
My current line of thinking is to create a btrfs_allocator_config
structure containing all of the allocator parameters - for now, a
user-set devs_max for each RAID level. I'd add a couple of ioctls to
allow the userspace tools to get and set these parameters. The "set"
ioctl would also persist this configuration to one or more xattrs on the
root of subvol 5, per Hugo's suggestion.

I'm still a bit hung up on the proper way to read the config back at
mount time. Patching the mount routines doesn't seem like a good way to
do it. Perhaps just doing it at the first invocation of
__btrfs_alloc_chunk is best? Any other thoughts?

-Andrew
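The effect of the proposed limit on the allocator is easy to model: instead of striping a new chunk across every device with free space, the allocator would sort the candidates and keep at most devs_max of them. A toy sketch of that clamp (devs_max/devs_min are the concepts under discussion, not an existing btrfs API; the "most free space first" ordering mirrors how the kernel allocator ranks candidate devices):

```python
def pick_stripe_devices(devices_free, devs_min, devs_max=None):
    """devices_free: {devid: unallocated bytes}. Returns the devids a new
    chunk would stripe across, optionally clamped to devs_max (the proposed
    chunk_width_limit)."""
    # Rank candidates by unallocated space, most free space first.
    candidates = sorted(devices_free, key=lambda d: devices_free[d], reverse=True)
    if devs_max is not None:
        candidates = candidates[:devs_max]
    if len(candidates) < devs_min:
        # e.g. raid10 needs at least 4 devices, raid1 at least 2
        raise ValueError("not enough devices for this RAID level")
    return candidates

free = {1: 500, 2: 400, 3: 300, 4: 200, 5: 100}
print(pick_stripe_devices(free, devs_min=2))              # [1, 2, 3, 4, 5]
print(pick_stripe_devices(free, devs_min=2, devs_max=3))  # [1, 2, 3]
```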
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
On Wed, Jun 8, 2016 at 5:10 PM, Hans van Kranenburg wrote:
> Hi list,
>
> On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>>> Hi,
>>>
>>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>>> somewhere that shows interesting behaviour: while no interesting disk
>>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>>
>>> A picture, telling more than 1000 words:
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>>
>> Nice picture.
>> Really better than 1000 words.
>>
>> AFAIK, the problem may be caused by fragments.
>>
>> And I even saw some early prototypes inside the code to allow btrfs to
>> do allocations of smaller extents than required.
>> (E.g. caller needs a 2M extent, but btrfs returns 2 1M extents)
>>
>> But it's still a prototype and it seems no one is really working on it
>> now.
>>
>> So when btrfs is writing new data, for example to write about 16M of
>> data, it will need to allocate a 16M continuous extent, and if it can't
>> find a large enough space to allocate, it creates a new data chunk.
>>
>> Despite the already awesome chunk level usage picture, I hope there is
>> info about extent level allocation to confirm my assumption.
>>
>> You could dump it by calling "btrfs-debug-tree -t 2 ".
>> It's normally recommended to do it unmounted, but it's still possible to
>> call it mounted, although not 100% perfect though.
>> (Then I'd better find a good way to draw a picture of
>> allocated/unallocated space and how fragmented the chunks are)
>
> So, I finally found some spare time to continue investigating. In the
> meantime, the filesystem has happily been allocating new chunks every few
> days, filling them up way below 10% with data before starting a new one.
>
> The chunk allocation primarily seems to happen during cron.daily. But,
> manually executing all the cronjobs that are in there, even multiple
> times, does not result in newly allocated chunks. Yay. :(
>
> After the previous post, I put a little script in between every two jobs
> in /etc/cron.daily that prints the output of btrfs fi df to syslog and
> sleeps for 10 minutes, so I can easily find out afterwards during which
> one it happened.
>
> Bingo! The "apt" cron.daily, which refreshes package lists and triggers
> unattended-upgrades.
>
> Jun  7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
> [...]
> 2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
> [...]
> Jun  7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB
>
> And, this thing is clever enough to do things once a day, even if you
> would execute it multiple times... (Hehehe...)
>
> Ok, let's try doing some apt-get update then.
>
> Today, the latest added chunks look like this:
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 1073741824 used 115499008 used_pct 10
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 268435456 used 72663040 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length 1073741824 used 86986752 used_pct 8
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length 1073741824 used 64032768 used_pct 5
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length 1073741824 used 71712768 used_pct 6
>
> Now I apt-get update...
>
> before: Data, single: total=13.00GiB, used=5.64GiB
> during: Data, single: total=13.00GiB, used=5.59GiB
> after : Data, single: total=14.00GiB, used=5.64GiB
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 1073741824 used 119279616 used_pct 11
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 268435456 used 73170944 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length 1073741824 used 82251776 used_pct 7
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length 1073741824 used 6041600 used_pct 0
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length 1073741824 used 46178304 used_pct 4
> chunk vaddr 71280099328
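Watching allocation creep the way the cron snippet above does only needs the total/used pair from a 'btrfs fi df' line; parsing it and reporting the allocated-but-unused slack is trivial (a sketch, with the line format taken from the log output above):

```python
import re

def data_slack(fi_df_line):
    """Return (total, used, slack) in bytes from a
    'Data, single: total=..GiB, used=..GiB' style line."""
    m = re.search(r"total=([\d.]+)GiB, used=([\d.]+)GiB", fi_df_line)
    total, used = (float(x) * 2**30 for x in m.groups())
    return total, used, total - used

line = "Data, single: total=13.00GiB, used=5.64GiB"
total, used, slack = data_slack(line)
print(f"{slack / 2**30:.2f} GiB allocated but unused")  # 7.36 GiB allocated but unused
```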
Re: Recommended why to use btrfs for production?
On Thu, Jun 9, 2016 at 5:38 AM, Austin S. Hemmelgarn wrote:
> On 2016-06-09 02:16, Duncan wrote:
>> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
>> excerpted:
>>
>>> As far as BTRFS raid10 mode in general, there are a few things that are
>>> important to remember about it:
>>> 1. It stores exactly two copies of everything, any extra disks just add
>>> to the stripe length on each copy.
>>
>> I'll add one more, potentially very important, related to this one:
>>
>> Btrfs raid mode (any of them) works in relation to individual chunks,
>> *NOT* individual devices.
>>
>> What that means for btrfs raid10, in combination with the above exactly
>> two copies rule, is that it works rather differently than a standard
>> raid10, which can tolerate loss of two devices as long as they're from
>> the same mirror set, as the other mirror set will then still be whole.
>> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
>> a second device close to assures loss of data, because the very likely
>> true assumption is that both mirror sets will be affected for some
>> chunks, but not others.
>
> Actually, that's not _quite_ the case. Assuming that you have an even
> number of devices, BTRFS raid10 will currently always span all the
> available devices with two striped copies of the data (if there's an odd
> number, it spans one less than the total, and rotates which one gets
> left out of each chunk). This means that as long as all the devices are
> the same size and you have stripes that are the full width of the array
> (you can end up with shorter ones if you have run in degraded mode or
> expanded the array), your probability of data loss per-chunk goes down
> as you add more devices (because the probability of a two device failure
> affecting both copies of a stripe in a given chunk decreases), but goes
> up as you add more chunks (because you then have to apply that
> probability for each individual chunk).
> Once you've lost one disk, the probability that losing another will
> compromise a specific chunk is:
> 1/(N - 1)
> where N is the total number of devices.
> The probability that it will compromise _any_ chunk is:
> (1/(N - 1))/C
> where C is the total number of chunks.
> BTRFS raid1 mode actually has the exact same probabilities, but they
> apply even if you have an odd number of disks.

Yeah but somewhere there's a chunk that's likely affected by two losses,
with a probability much higher than for conventional raid10 where such a
loss is very binary: if the loss is a mirrored pair, the whole array and
filesystem implodes; if the loss does not affect an entire mirrored pair,
the whole array survives.

The thing with Btrfs raid 10 is you can't really tell in advance to what
degree you have loss. It's not a binary condition, it has a gray area
where a lot of data can still be retrieved, but the instant you hit
missing data it's a loss, and if you hit missing metadata then the fs
will either go read only or crash, it just can't continue. So that
"walking on egg shells" behavior in a 2+ drive loss is really different
from a conventional raid10 where it's either gonna completely work or
completely fail.

--
Chris Murphy
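Putting numbers on the "gray area" is straightforward. Taking the quoted per-chunk figure p = 1/(N - 1) at face value, one consistent way to aggregate it over C chunks (assuming independence, which real stripe placement only approximates) is 1 - (1 - p)^C, the chance that at least one chunk loses both copies; note this grows with C, as the prose says, rather than dividing by C. A quick sketch:

```python
def p_chunk_compromised(n_devices):
    """Per-chunk estimate quoted above: after one loss, chance a second
    random device loss hits the mirror of a given chunk."""
    return 1 / (n_devices - 1)

def p_any_chunk_compromised(n_devices, n_chunks):
    """Chance that at least one of n_chunks chunks loses both copies,
    treating chunks as independent (an approximation)."""
    p = p_chunk_compromised(n_devices)
    return 1 - (1 - p) ** n_chunks

# 6 devices: any one chunk survives a second loss with 80% probability...
print(p_chunk_compromised(6))                       # 0.2
# ...but across 100 chunks, hitting some chunk is a near certainty.
print(round(p_any_chunk_compromised(6, 100), 6))    # 1.0
```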
Re: Replacing drives with larger ones in a 4 drive raid1
On 09.06.2016, at 17:20, Duncan <1i5t5.dun...@cox.net> wrote:
> Are those the 8 TB SMR "archive" drives?

No, they are Western Digital Red drives. Thanks for the detailed
follow-up anyway. :)

Half a year ago, when I evaluated hard drives, the 8 TB category offered
only the Hitachi 8 TB helium drives for 800 bucks, and the Seagate SMR
drives for 250 bucks. I bought myself one of the Seagate SMR ones for
testing, and figured out it wouldn't work for my use case (I now use it
in a write-very-seldom context). For my two NASes I went with 6 TB WD Red
drives all around. Nowadays there are more choices of 8 TB drives, such
as the WD Reds I'm switching my backup NAS to.

> I haven't been following the issue very closely, but be aware that there
> were serious issues with those drives a few kernels back, and that while
> those issues are now fixed, the drives themselves operate rather
> differently than normal drives, and simply don't work well in normal
> usage.
>
> The short version is that they really are designed for archiving and
> work well when used for that purpose -- a mostly write once and leave it
> there for archiving and retrieval, but rarely if ever rewrite it, type
> usage. However, they work rather poorly in normal usage where data is
> rewritten, because they have to rewrite entire zones of data, and that
> takes much longer than simply rewriting individual sectors on normal
> drives does.
>
> With the kernel patches to fix the initial problems they do work well
> enough, tho performance may not be what you expect, but the key to
> keeping them working well is being aware that they continue to do
> rewrites in the background for long after they are done with the initial
> write, and shutting them down while they are doing them can be an issue.
>
> Due to btrfs' data checksumming feature, small variances to data that
> wouldn't normally be detected on non-checksumming filesystems were
> detected far sooner on btrfs, making it far more sensitive to these
> small errors. However, if you use the drives for their intended nearly
> write-only purpose, and/or very seldom power down the drives at all or
> do so only long after (give it half an hour, say) any writes have
> completed, as long as you're running a current kernel with the initial
> issues patched, you should be fine. Just don't treat them like normal
> drives.
>
> If OTOH you need more normal drive usage including lots of data
> rewrites, especially if you frequently poweroff the devices, strongly
> consider avoiding those 8 TB SMR drives, at least until the technology
> has a few more years to mature.
>
> There's more information in other threads on the list and on other
> lists, if you need it and nobody posts more direct information (such as
> the specific patches in question and what specific kernel versions they
> hit) here. I could find it but I'd have to do a search in my own list
> archives, and now that you are aware of the problem, you can of course
> do the search as well, if you need to. =:^)
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
[PATCH 2/2] Btrfs: improve performance on fsync against new inode after rename/unlink
From: Filipe Manana

With commit 56f23fdbb600 ("Btrfs: fix file/data loss caused by fsync
after rename and new inode") we got a simple fix for a functional issue
when the following sequence of actions is done:

  at transaction N
    create file A at directory D

  at transaction N + M (where M >= 1)
    move/rename existing file A from directory D to directory E
    create a new file named A at directory D
    fsync the new file
    power fail

The solution was to simply detect such a scenario and fall back to a full
transaction commit when we detect it. However this turned out to have a
significant impact on throughput (and a bit on latency too) for
benchmarks using the dbench tool, which simulates real workloads from
smbd (Samba) servers. For example on a test vm (with a debug kernel):

Unpatched:
Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms

Patched:
Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms

The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb600 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.

This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged, so that at log replay time we lose neither the new
file nor the old file, and the old file gets the name it was renamed to.
This also ends up avoiding a full transaction commit for a similar case that involves an unlink instead of a rename of the old file:

  at transaction N
    create file A at directory D
  at transaction N + M (where M >= 1)
    remove file A
    create a new file named A at directory D
    fsync the new file
    power fail

Signed-off-by: Filipe Manana --- fs/btrfs/inode.c| 19 +++- fs/btrfs/tree-log.c | 85 - 2 files changed, 95 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 1ae0fc8..6dcd38b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4194,6 +4194,7 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) int err = 0; struct btrfs_root *root = BTRFS_I(dir)->root; struct btrfs_trans_handle *trans; + u64 last_unlink_trans; if (inode->i_size > BTRFS_EMPTY_DIR_SIZE) return -ENOTEMPTY; @@ -4216,11 +4217,27 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) if (err) goto out; + last_unlink_trans = BTRFS_I(inode)->last_unlink_trans; + /* now the directory is empty */ err = btrfs_unlink_inode(trans, root, dir, d_inode(dentry), dentry->d_name.name, dentry->d_name.len); - if (!err) + if (!err) { btrfs_i_size_write(inode, 0); + /* +* Propagate the last_unlink_trans value of the deleted dir to +* its parent directory.
This is to prevent an unrecoverable +* log tree in the case we do something like this: +* 1) create dir foo +* 2) create snapshot under dir foo +* 3) delete the snapshot +* 4) rmdir foo +* 5) mkdir foo +* 6) fsync foo or some file inside foo +*/ + if (last_unlink_trans >= trans->transid) + BTRFS_I(dir)->last_unlink_trans = last_unlink_trans; + } out: btrfs_end_transaction(trans, root); btrfs_btree_balance_dirty(root); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index aa8fed1..0c39afb 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4469,7 +4469,8 @@ static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans, static int btrfs_check_ref_name_override(struct extent_buffer *eb, const int slot, const struct btrfs_key *key, -struct inode *inode) +struct inode *inode, +u64 *other_ino) { int ret; struct btrfs_path *search_path; @@ -4528,7 +4529,16 @@ static int btrfs_check_ref_name_override(struct extent_buffer *eb, search_path, parent, name, this_name_len, 0); if (di && !IS_ERR(di)) { - ret = 1; + struct btrfs_key di_key; + + btrfs_dir_item_key_to_cpu(search_path->nodes[0], + di, &di_key); + if (di_key.type == BTRFS_INODE_ITEM_KEY) { + ret = 1; + *
[PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk
From: Filipe Manana

When we attempt to read an inode from disk, we end up always returning an -ESTALE error to the caller regardless of the actual failure reason, which can be an out of memory problem (when allocating a path), some error found when reading from the fs/subvolume btree (like a genuine IO error), or the inode simply not existing. So let's start returning the real error code to the callers, so that they don't treat all -ESTALE errors as meaning that the inode does not exist (such as during orphan cleanup). This will also be needed for a subsequent patch in the same series dealing with a special fsync case.

Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e5558d9..1ae0fc8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3419,10 +3419,10 @@ int btrfs_orphan_cleanup(struct btrfs_root *root) found_key.offset = 0; inode = btrfs_iget(root->fs_info->sb, &found_key, root, NULL); ret = PTR_ERR_OR_ZERO(inode); - if (ret && ret != -ESTALE) + if (ret && ret != -ENOENT) goto out; - if (ret == -ESTALE && root == root->fs_info->tree_root) { + if (ret == -ENOENT && root == root->fs_info->tree_root) { struct btrfs_root *dead_root; struct btrfs_fs_info *fs_info = root->fs_info; int is_dead_root = 0; @@ -3458,7 +3458,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root) * Inode is already gone but the orphan item is still there, * kill the orphan item.
*/ - if (ret == -ESTALE) { + if (ret == -ENOENT) { trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -3617,7 +3617,7 @@ static noinline int acls_after_inode_item(struct extent_buffer *leaf, /* * read an inode from the btree into the in-memory inode */ -static void btrfs_read_locked_inode(struct inode *inode) +static int btrfs_read_locked_inode(struct inode *inode) { struct btrfs_path *path; struct extent_buffer *leaf; @@ -3636,14 +3636,19 @@ static void btrfs_read_locked_inode(struct inode *inode) filled = true; path = btrfs_alloc_path(); - if (!path) + if (!path) { + ret = -ENOMEM; goto make_bad; + } memcpy(&location, &BTRFS_I(inode)->location, sizeof(location)); ret = btrfs_lookup_inode(NULL, root, path, &location, 0); - if (ret) + if (ret) { + if (ret > 0) + ret = -ENOENT; goto make_bad; + } leaf = path->nodes[0]; @@ -3796,11 +3801,12 @@ cache_acl: } btrfs_update_iflags(inode); - return; + return 0; make_bad: btrfs_free_path(path); make_bad_inode(inode); + return ret; } /* @@ -5585,7 +5591,9 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, return ERR_PTR(-ENOMEM); if (inode->i_state & I_NEW) { - btrfs_read_locked_inode(inode); + int ret; + + ret = btrfs_read_locked_inode(inode); if (!is_bad_inode(inode)) { inode_tree_add(inode); unlock_new_inode(inode); @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, } else { unlock_new_inode(inode); iput(inode); - inode = ERR_PTR(-ESTALE); + ASSERT(ret < 0); + inode = ERR_PTR(ret < 0 ? ret : -ESTALE); } } -- 2.7.0.rc3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
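The point of the patch above (return the real cause of the failure instead of a catch-all, so callers can react only to "inode does not exist") can be illustrated with a toy user-space analogue. All names here are invented for illustration; this is not the btrfs code:

```python
import errno

# Toy analogue of the pattern the patch fixes: propagate the real errno
# so a caller such as orphan cleanup can safely treat only ENOENT as
# "the inode is truly gone", instead of one catch-all error.

def read_inode(store, ino, *, oom=False):
    """Return inode data, or raise an OSError carrying the real errno."""
    if oom:
        raise OSError(errno.ENOMEM, "allocation failed")  # e.g. path alloc
    if ino not in store:
        raise OSError(errno.ENOENT, "no such inode")      # genuinely absent
    return store[ino]

def cleanup_orphan(store, ino):
    """Delete the orphan item only when the inode truly does not exist."""
    try:
        read_inode(store, ino)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return "killed orphan item"
        raise  # IO/memory errors must not be mistaken for a missing inode
    return "inode still present"

store = {42: b"data"}
print(cleanup_orphan(store, 42))  # inode exists
print(cleanup_orphan(store, 7))   # truly missing -> safe to clean up
```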
Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]
On 09.06.2016 at 16:52, Duncan wrote: > Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted: > >> Hi, >> >> Do I need to worry about this? >> >> Thanks. >> >> Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59 >> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux > There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1 > and should appear in later 4.6.x stable releases (tho I don't track > stable myself so couldn't tell you if it's there yet or not). > > In general, some people were reporting many of these warnings during the > 4.6 rcs with no visible damage or further problems, so it doesn't tend to > be anything but the irritation of the warnings in practice, tho under the > right conditions, in theory, it could be a bigger issue. But I know of > no one who actually experienced a problem beyond the warn noise. > > So in practice it's a get the patch as soon as you can thing, but not > something to be hugely worried about in the mean time. > FYI I ran into the same error on Debian sid with 4.6.0, and Holger pointed out: This is 4.6.0-whatever, not 4.6.1. The problem is fixed in the real 4.6.1 (released last week), which you can find described in commit 4704fa54.. at https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.6.1 -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as excerpted: > The next question is what files these extents belong to. To find out, I > need to open up the extent items I get back and follow a backreference > to an inode object. Might do that tomorrow, fun. > > To be honest, I suspect /var/log and/or the file storage of mailman to > be the cause of the fragmentation, since there's logging from postfix, > mailman and nginx going on all day long in a slow but steady tempo. > While using btrfs for a number of use cases at work now, we normally > don't use it for the root filesystem. And the cases where it's used as > root filesystem don't do much logging or mail. FWIW, that's one reason I have a dedicated partition (and filesystem) for logs, here. (The other reason is that should something go runaway log-spewing, I get a warning much sooner when my log filesystem fills up, not much later, with much worse implications, when the main filesystem fills up!) > And no, autodefrag is not in the mount options currently. Would that be > helpful in this case? It should be helpful, yes. Be aware that autodefrag works best with smaller (sub-half-gig) files, however, and that it used to cause performance issues with larger database and VM files, in particular. There used to be a warning on the wiki about that, which was recently removed, so apparently it's not the issue that it was, but you might wish to monitor any databases or VMs with gig-plus files to see if it's going to be a performance issue, once you turn on autodefrag. The other issue with autodefrag is that if it hasn't been on and things are heavily fragmented, it can at first drive down performance as it rewrites all these heavily fragmented files, until it catches up and is mostly dealing only with the normal refragmentation load.
Of course the best way around that is to run autodefrag from the first time you mount the filesystem and start writing to it, so it never gets overly fragmented in the first place. For a currently in-use and highly fragmented filesystem, you have two choices: either back up and do a fresh mkfs.btrfs so you can start with a clean filesystem and autodefrag from the beginning, or do a manual defrag. However, be aware that if you have snapshots locking down the old extents in their fragmented form, a manual defrag will copy the data to new extents without releasing the old ones as they're locked in place by the snapshots, thus using additional space. Worse, if the filesystem is already heavily fragmented and snapshots are locking most of those fragments in place, defrag likely won't help a lot, because the free space as well will be heavily fragmented. So starting off with a clean and new filesystem and using autodefrag from the beginning really is your best bet. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
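Why a defrag under snapshots costs extra space can be seen with a tiny refcounting model. This is an invented toy model, not btrfs's actual extent bookkeeping: a live subvolume and a snapshot share 100 fragmented 4 KiB extents; "defragging" the subvolume allocates one new contiguous extent while the snapshot keeps every old extent pinned, so allocated space roughly doubles.

```python
# Toy model of shared extents: a snapshot pins the fragmented extents,
# so defragmenting the live subvolume cannot release them.

class Fs:
    def __init__(self):
        self.extents = {}   # extent id -> size in bytes
        self.next_id = 0

    def alloc(self, size):
        eid = self.next_id
        self.next_id += 1
        self.extents[eid] = size
        return eid

    def used(self, *trees):
        # space is held by every extent referenced from any tree
        live = {eid for tree in trees for eid in tree}
        return sum(self.extents[eid] for eid in live)

fs = Fs()
subvol = [fs.alloc(4096) for _ in range(100)]  # 100 fragmented 4 KiB extents
snapshot = list(subvol)                        # snapshot shares all of them

before = fs.used(subvol, snapshot)
subvol = [fs.alloc(4096 * 100)]                # defrag: one new 400 KiB extent
after = fs.used(subvol, snapshot)

print(before, after)  # old extents stay pinned, so usage roughly doubles
```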
Re: Replacing drives with larger ones in a 4 drive raid1
boli posted on Wed, 08 Jun 2016 20:55:13 +0200 as excerpted: > Recently I had the idea to replace the 6 TB HDDs with 8 TB ones ("WD > Red"), because their price is now acceptable. Are those the 8 TB SMR "archive" drives? I haven't been following the issue very closely, but be aware that there were serious issues with those drives a few kernels back, and that while those issues are now fixed, the drives themselves operate rather differently than normal drives, and simply don't work well in normal usage. The short version is that they really are designed for archiving and work well when used for that purpose -- a mostly write once and leave it there for archiving and retrieval but rarely if ever rewrite it, type usage. However, they work rather poorly in normal usage where data is rewritten, because they have to rewrite entire zones of data, and that takes much longer than simply rewriting individual sectors on normal drives does. With the kernel patches to fix the initial problems they do work well enough, tho performance may not be what you expect, but the key to keeping them working well is being aware that they continue to do rewrites in the background for long after they are done with the initial write, and shutting them down while they are doing them can be an issue. Due to btrfs' data checksumming feature, small variances to data that wouldn't normally be detected on non-checksumming filesystems were detected far sooner on btrfs, making it far more sensitive to these small errors. However, if you use the drives for their intended nearly write- only purpose, and/or very seldom power down the drives at all or do so only long after (give it half an hour, say) any writes have completed, as long as you're running a current kernel with the initial issues patched, you should be fine. Just don't treat them like normal drives. 
If OTOH you need more normal drive usage including lots of data rewrites, especially if you frequently poweroff the devices, strongly consider avoiding those 8 TB SMR drives, at least until the technology has a few more years to mature. There's more information on other threads on the list and on other lists, if you need it and nobody posts more direct information (such as the specific patches in question and what specific kernel versions they hit) here. I could find it but I'd have to do a search in my own list archives, and now that you are aware of the problem, you can of course do the search as well, if you need to. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]
On Thu, Jun 9, 2016 at 10:52 AM, Duncan <1i5t5.dun...@cox.net> wrote: > Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted: > >> Hi, >> >> Do I need to worry about this? >> >> Thanks. >> >> Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59 >> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux > > There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1 > and should appear in later 4.6.x stable releases (tho I don't track > stable myself so couldn't tell you if it's there yet or not). > It's in 4.6.1 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=4704fa547224fd49041c26f7fca3710a47fc449f -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]
Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted: > Hi, > > Do I need to worry about this? > > Thanks. > > Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59 > UTC 2016 x86_64 x86_64 x86_64 GNU/Linux There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1 and should appear in later 4.6.x stable releases (tho I don't track stable myself so couldn't tell you if it's there yet or not). In general, some people were reporting many of these warnings during the 4.6 rcs with no visible damage or further problems, so it doesn't tend to be anything but the irritation of the warnings in practice, tho under the right conditions, in theory, it could be a bigger issue. But I know of no one who actually experienced a problem beyond the warn noise. So in practice it's a get the patch as soon as you can thing, but not something to be hugely worried about in the mean time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Allocator behaviour during device delete
On 06/09/2016 03:07 PM, Austin S. Hemmelgarn wrote: On 2016-06-09 08:34, Brendan Hide wrote: Hey, all I noticed this odd behaviour while migrating from a 1TB spindle to SSD (in this case on a LUKS-encrypted 200GB partition) - and am curious if this behaviour I've noted below is expected or known. I figure it is a bug. Depending on the situation, it *could* be severe. In my case it was simply annoying. --- Steps After having added the new device (btrfs dev add), I deleted the old device (btrfs dev del) Then, whilst waiting for that to complete, I started a watch of "btrfs fi show /". Note that the below is very close to the output at the time - but is not actually copy/pasted from the output. Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 298.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot devid1 is the old disk while devid2 is the new SSD After a few minutes, I saw that the numbers have changed - but that the SSD still had no data: Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 284.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot The "FS bytes used" amount was changing a lot - but mostly stayed near the original total, which is expected since there was very little happening other than the "migration". I'm not certain of the exact point where it started using the new disk's space. I figure that may have been helpful to pinpoint. :-/ OK, I'm pretty sure I know what was going on in this case. Your assumption that device delete uses the balance code is correct, and that is why you see what's happening happening. There are two key bits that are missing though: 1. Balance will never allocate chunks when it doesn't need to. 2. The space usage listed in fi show is how much space is allocated to chunks, not how much is used in those chunks. 
In this case, based on what you've said, you had a lot of empty or mostly empty chunks. As a result of this, the device delete was both copying data, and consolidating free space. If you have a lot of empty or mostly empty chunks, it's not unusual for a device delete to look like this until you start hitting chunks that have actual data in them. The primary point of this behavior is that it makes it possible to directly switch to a smaller device without having to run a balance and then a resize before replacing the device, and then resize again afterwards. Thanks, Austin. Your explanation is along the lines of my thinking though. The new disk should have had *some* data written to it at that point, as it started out at over 600GiB in allocation (should have probably mentioned that already). Consolidating or not, I would consider data being written to the old disk to be a bug, even if it is considered minor. I'll set up a reproducible test later today to prove/disprove the theory. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
On Wed, Jun 8, 2016 at 5:36 AM, Jeff Mahoney wrote: > The test for !trans->blocks_used in btrfs_abort_transaction is > insufficient to determine whether it's safe to drop the transaction > handle on the floor. btrfs_cow_block, informed by should_cow_block, > can return blocks that have already been CoW'd in the current > transaction. trans->blocks_used is only incremented for new block > allocations. If an operation overlaps the blocks in the current > transaction entirely and must abort the transaction, we'll happily > let it clean up the trans handle even though it may have modified > the blocks and will commit an incomplete operation. IMHO it wouldn't hurt to be more explicit here and mention that if the transaction handle ends up not COWing any nodes/leafs, calling abort against the handle won't abort the transaction. > > In the long-term, I'd like to do closer tracking of when the fs > is actually modified so we can still recover as gracefully as possible, > but that approach will need some discussion. In the short term, > since this is the only code using trans->blocks_used, let's just > switch it to a bool indicating whether any blocks were used and set > it when should_cow_block returns false. 
> > Cc: sta...@vger.kernel.org # 3.4+ > Signed-off-by: Jeff Mahoney But anyway, Reviewed-by: Filipe Manana > --- > fs/btrfs/ctree.c | 5 - > fs/btrfs/extent-tree.c | 2 +- > fs/btrfs/super.c | 2 +- > fs/btrfs/transaction.h | 2 +- > 4 files changed, 7 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 427c36b..135af4e 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -1552,6 +1552,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle > *trans, >trans->transid, root->fs_info->generation); > > if (!should_cow_block(trans, root, buf)) { > + trans->dirty = true; > *cow_ret = buf; > return 0; > } > @@ -2773,8 +2774,10 @@ again: > * then we don't want to set the path blocking, > * so we test it here > */ > - if (!should_cow_block(trans, root, b)) > + if (!should_cow_block(trans, root, b)) { > + trans->dirty = true; > goto cow_done; > + } > > /* > * must have write locks on this node and the > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 689d25a..1ed31eb 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8044,7 +8044,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, > struct btrfs_root *root, > set_extent_dirty(&trans->transaction->dirty_pages, buf->start, > buf->start + buf->len - 1, GFP_NOFS); > } > - trans->blocks_used++; > + trans->dirty = true; > /* this returns a buffer locked for blocking */ > return buf; > } > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 4e59a91..bd07e01 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -235,7 +235,7 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle > *trans, > trans->aborted = errno; > /* Nothing used. The other threads that have joined this > * transaction may be able to continue. 
*/ > - if (!trans->blocks_used && list_empty(&trans->new_bgs)) { > + if (!trans->dirty && list_empty(&trans->new_bgs)) { > const char *errstr; > > errstr = btrfs_decode_error(errno); > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h > index 9fe0ec2..c5abee4 100644 > --- a/fs/btrfs/transaction.h > +++ b/fs/btrfs/transaction.h > @@ -110,7 +110,6 @@ struct btrfs_trans_handle { > u64 chunk_bytes_reserved; > unsigned long use_count; > unsigned long blocks_reserved; > - unsigned long blocks_used; > unsigned long delayed_ref_updates; > struct btrfs_transaction *transaction; > struct btrfs_block_rsv *block_rsv; > @@ -121,6 +120,7 @@ struct btrfs_trans_handle { > bool can_flush_pending_bgs; > bool reloc_reserved; > bool sync; > + bool dirty; > unsigned int type; > /* > * this root is only needed to validate that the root passed to > -- > 2.7.1 > > > -- > Jeff Mahoney > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
Re: [PATCH v2] Btrfs: check if extent buffer is aligned to sectorsize
On Mon, Jun 06, 2016 at 12:01:23PM -0700, Liu Bo wrote: > Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer > via alloc_extent_buffer(). An unaligned eb can have more pages than it > should have, which ends up extent buffer's leak or some corrupted content > in extent buffer. > > This adds a warning to let us quickly know what was happening. > > Now that alloc_extent_buffer() no more returns NULL, this changes its > caller and callers of its caller to match with the new error > handling. > > Signed-off-by: Liu Bo Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: add the error message when open fails
On Thu, Jun 09, 2016 at 10:23:15AM +0900, Tsutomu Itoh wrote: > When open in btrfs_open_devices failed, only the following message is > displayed. Therefore the user doesn't understand the reason why open > failed. > > # btrfs check /dev/sdb8 > Couldn't open file system > > This patch adds the error message when open fails. I think the message should be printed by the caller, not by the helper. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Allocator behaviour during device delete
On 2016-06-09 08:34, Brendan Hide wrote: Hey, all I noticed this odd behaviour while migrating from a 1TB spindle to SSD (in this case on a LUKS-encrypted 200GB partition) - and am curious if this behaviour I've noted below is expected or known. I figure it is a bug. Depending on the situation, it *could* be severe. In my case it was simply annoying. --- Steps After having added the new device (btrfs dev add), I deleted the old device (btrfs dev del) Then, whilst waiting for that to complete, I started a watch of "btrfs fi show /". Note that the below is very close to the output at the time - but is not actually copy/pasted from the output. Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 298.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot devid1 is the old disk while devid2 is the new SSD After a few minutes, I saw that the numbers have changed - but that the SSD still had no data: Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 284.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot The "FS bytes used" amount was changing a lot - but mostly stayed near the original total, which is expected since there was very little happening other than the "migration". I'm not certain of the exact point where it started using the new disk's space. I figure that may have been helpful to pinpoint. :-/ OK, I'm pretty sure I know what was going on in this case. Your assumption that device delete uses the balance code is correct, and that is why you see what's happening happening. There are two key bits that are missing though: 1. Balance will never allocate chunks when it doesn't need to. 2. The space usage listed in fi show is how much space is allocated to chunks, not how much is used in those chunks. 
In this case, based on what you've said, you had a lot of empty or mostly empty chunks. As a result of this, the device delete was both copying data, and consolidating free space. If you have a lot of empty or mostly empty chunks, it's not unusual for a device delete to look like this until you start hitting chunks that have actual data in them. The primary point of this behavior is that it makes it possible to directly switch to a smaller device without having to run a balance and then a resize before replacing the device, and then resize again afterwards. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Allocator behaviour during device delete
Hey, all I noticed this odd behaviour while migrating from a 1TB spindle to SSD (in this case on a LUKS-encrypted 200GB partition) - and am curious if this behaviour I've noted below is expected or known. I figure it is a bug. Depending on the situation, it *could* be severe. In my case it was simply annoying. --- Steps After having added the new device (btrfs dev add), I deleted the old device (btrfs dev del) Then, whilst waiting for that to complete, I started a watch of "btrfs fi show /". Note that the below is very close to the output at the time - but is not actually copy/pasted from the output. > Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 > Total devices 2 FS bytes used 115.03GiB > devid1 size 0.00GiB used 298.06GiB path /dev/sda2 > devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot devid1 is the old disk while devid2 is the new SSD After a few minutes, I saw that the numbers have changed - but that the SSD still had no data: > Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 > Total devices 2 FS bytes used 115.03GiB > devid1 size 0.00GiB used 284.06GiB path /dev/sda2 > devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot The "FS bytes used" amount was changing a lot - but mostly stayed near the original total, which is expected since there was very little happening other than the "migration". I'm not certain of the exact point where it started using the new disk's space. I figure that may have been helpful to pinpoint. :-/ --- Educated guess as to what was happening: Key: Though the available space on devid1 is displayed as 0 GiB, internally the allocator still sees most of the device's space as available. The allocator will continue writing to the old disk even though the intention is to remove it. The dev delete operation goes through the chunks in sequence and does a "normal" balance operation on each, which the kernel simply sends to the "normal" single allocator. 
At the start of the operation, the allocator will see that the device of 1TB has more space available than the 200GB device, thus it writes the data to a new chunk on the 1TB spindle. Only after the chunk is balanced away, does the operation mark *only* that "source" chunk as being unavailable. As each chunk is subsequently balanced away, eventually the allocator will see that there is more space available on the new device than on the old device (1:199/2:200), thus the next chunk gets allocated to the new device. The same occurs for the next chunk (1:198/2:199) and so on, until the device finally has zero usage and is removed completely. --- Naive approach for a fix (assuming my assessment above is correct) At the start: 1. "Balance away"/Mark-as-Unavailable empty space 2. Balance away the *current* chunks (data+metadata) that would otherwise be written to if the device was still available 3. As before, balance in whatever order is applicable. --- Severity I figure that, for my use-case, this isn't a severe issue. However, in the case where you want quickly to remove a potentially failing disk (common use case for dev delete), I'd much rather that btrfs does *not* write data to the disk I'm trying to remove, making this a potentially severe bug. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
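The educated guess above can be turned into a toy simulation. Everything here is an assumption (the rule "write the relocated chunk to the device with the most unallocated space", the 1 GiB chunk size, and the rounded numbers); it models the guessed behavior, not the actual kernel allocator. Under those assumptions, relocated chunks keep landing back on the old 1 TB disk until its remaining unallocated space drops to the SSD's, which would match the observed phase where the SSD stays at 0 GiB used:

```python
def simulate_delete(src_unalloc, dst_unalloc, chunks_on_src):
    """Relocate chunks off the device being deleted (src), assuming each
    relocated chunk is rewritten to whichever device currently has the
    most unallocated space, and that freed source chunks become
    unavailable rather than free (per the guess above). Units: 1 GiB
    chunks."""
    to_src = to_dst = 0
    while chunks_on_src > 0:
        if src_unalloc > dst_unalloc:
            # the new chunk lands back on the old disk; it still has to
            # be relocated later, so chunks_on_src does not shrink
            src_unalloc -= 1
            to_src += 1
        else:
            dst_unalloc -= 1
            chunks_on_src -= 1
            to_dst += 1
    return to_src, to_dst

# hypothetical numbers loosely based on the report: ~700 GiB unallocated
# on the 1 TB source, 200 GiB on the SSD, ~115 GiB of real data to move
to_src, to_dst = simulate_delete(700, 200, 115)
print(to_src, to_dst)  # many chunks are wastefully rewritten to src first
```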
Re: Recommended why to use btrfs for production?
On 2016-06-09 02:16, Duncan wrote: Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as excerpted: As far as BTRFS raid10 mode in general, there are a few things that are important to remember about it: 1. It stores exactly two copies of everything, any extra disks just add to the stripe length on each copy. I'll add one more, potentially very important, related to this one: Btrfs raid mode (any of them) works in relation to individual chunks, *NOT* individual devices. What that means for btrfs raid10 in combination with the above exactly two copies rule, is that it works rather differently than a standard raid10, which can tolerate loss of two devices as long as they're from the same mirror set, as the other mirror set will then still be whole. Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of a second device close to assures loss of data, because the very likely true assumption is that both mirror sets will be affected for some chunks, but not others. Actually, that's not _quite_ the case. Assuming that you have an even number of devices, BTRFS raid10 will currently always span all the available devices with two striped copies of the data (if there's an odd number, it spans one less than the total, and rotates which one gets left out of each chunk). This means that as long as all the devices are the same size and you have have stripes that are the full width of the array (you can end up with shorter ones if you have run in degraded mode or expanded the array), your probability of data loss per-chunk goes down as you add more devices (because the probability of a two device failure affecting both copies of a stripe in a given chunk decreases), but goes up as you add more chunks (because you then have to apply that probability for each individual chunk). Once you've lost one disk, the probability that losing another will compromise a specific chunk is: 1/(N - 1) Where N is the total number of devices. 
The probability that it will compromise _any_ chunk is:

    1 - (1 - 1/(N - 1))^C

Where C is the total number of chunks; this is the chance that at least
one chunk has both of its copies on the two failed devices, and it grows
with C.

BTRFS raid1 mode actually has the exact same probabilities, but they
apply even if you have an odd number of disks.

> By using a layered approach, btrfs raid1 on top (for its
> error-correction-from-the-other-copy feature) of a pair of mdraid0s,
> you force one of the btrfs raid1 copies onto each of the mdraid0s,
> thus making allocation more deterministic than btrfs raid10, and can
> thus again tolerate the loss of two devices, as long as they're from
> the same underlying mdraid0.
>
> (Traditionally, raid1 on top of raid0 is called raid01, and is
> discouraged compared to raid10, raid0 on top of raid1, because device
> failure and replacement with the latter triggers a much more
> localized rebuild than with the former: across the pair of devices in
> the raid1 when it's closest to the physical devices, but across the
> whole array, one raid0 to the other, when the raid1 is on top.
> However, btrfs raid1's data integrity and error repair from the good
> mirror is generally considered useful enough to be worth the rebuild
> inefficiency of the raid01 design.)
>
> So in regard to failure tolerance, btrfs raid10 is far closer to
> traditional raid5: loss of a single device is tolerated; loss of a
> second before a repair is complete generally means data loss --
> there's no chance of it being on the same mirror set to save you, as
> there is with traditional raid10. Similarly, btrfs raid10 doesn't
> have the cleanly separate pair of mirrors on raid0 arrays that
> traditional raid10 does, and thus doesn't have the fault tolerance of
> losing, say, the connection or power to one entire device bank, as
> long as it's all one mirror set, that traditional raid10 has.
>
> And again, doing the layered thing with btrfs raid1 on top and
> mdraid0 (or whatever else) underneath gets that back for you, if you
> set it up that way, of course.

And will get you better performance than just BTRFS most of the time too.
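Those odds can be sketched numerically. This is a back-of-the-envelope
model, not anything btrfs itself computes: it assumes each chunk's
mirror pairing is independent and uniformly distributed across devices,
which real allocation only approximates.

```python
# Simplified model of btrfs raid10 data-loss odds after a *second*
# device failure. Assumes every chunk's mirror pairing is independent
# and uniform across devices (an idealization of real allocation).

def p_chunk_lost(n_devices: int) -> float:
    """Probability that one specific chunk loses both copies when a
    second device fails, given one device is already gone."""
    return 1 / (n_devices - 1)

def p_any_chunk_lost(n_devices: int, n_chunks: int) -> float:
    """Probability that at least one of n_chunks loses both copies."""
    return 1 - (1 - p_chunk_lost(n_devices)) ** n_chunks

if __name__ == "__main__":
    # Per-chunk risk drops as devices are added...
    for n in (4, 6, 10):
        print(f"{n} devices: per-chunk risk {p_chunk_lost(n):.3f}")
    # ...but the risk of losing *some* chunk climbs with chunk count.
    for c in (10, 100, 1000):
        print(f"6 devices, {c} chunks: "
              f"any-chunk risk {p_any_chunk_lost(6, c):.3f}")
```

With a real filesystem holding thousands of chunks, the any-chunk risk
approaches 1 quickly, which is why a second failure "close to assures"
data loss regardless of device count.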
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
On 06/09/2016 10:52 AM, Marc Haber wrote:
> On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
>> So, instead of being the cause, apt-get update causing a new chunk
>> to be allocated might as well be the result of existing ones already
>> filled up with too many fragments.
>>
>> The next question is what files these extents belong to. To find
>> out, I need to open up the extent items I get back and follow a
>> backreference to an inode object. Might do that tomorrow, fun.
>
> Does your apt use pdiffs to update the packages lists? If yes, I'd
> try turning it off just for the fun of it and to see whether this
> changes btrfs' allocation behavior. I have never looked at apt's
> pdiff stuff in detail, but I guess that it creates many tiny
> temporary files.

No, it does not:

    Acquire::Pdiffs "false";

-- 
Hans van Kranenburg - System / Network Engineer
Mendix | Driving Digital Innovation | www.mendix.com
Re: [PATCH v3] btrfs: fix check_shared for fiemap ioctl
On Wed, Jun 08, 2016 at 08:53:00AM -0700, Mark Fasheh wrote:
> On Wed, Jun 08, 2016 at 01:13:03PM +0800, Lu Fengqi wrote:
> > Only in the case of a different root_id or a different object_id
> > did check_shared identify an extent as shared. However, if an
> > extent is referred to by different offsets of the same file, it
> > should also be identified as shared. In addition, check_shared's
> > loop scales at least as n^3, so if an extent has too many
> > references it can even cause a soft hang.
> >
> > First, add all delayed refs to the ref_tree and calculate the
> > unique_refs; if unique_refs is greater than one, return
> > BACKREF_FOUND_SHARED. Then individually add the on-disk references
> > (inline/keyed) to the ref_tree and calculate the unique_refs of the
> > ref_tree to check if it is greater than one. Because we return
> > SHARED as soon as there are two references, the time complexity is
> > close to constant.
> >
> > Reported-by: Tsutomu Itoh
> > Signed-off-by: Lu Fengqi
> > ---
> > The caller is fiemap, called from an ioctl. Because it isn't on a
> > writeout path, we temporarily use GFP_KERNEL in ref_root_alloc()
> > and ref_tree_add(). If we use ref_tree to replace the existing
> > backref structure later, we can consider whether to use GFP_NOFS
> > again.
>
> NACK.
>
> You don't need to be on a writeout path to deadlock, you simply need
> to be holding locks that the writeout path takes when you allocate.
> If the allocator does writeout to free memory then you deadlock.
> Fiemap is locking down extents which may also get locked down when
> you allocate within those locks. See my e-mail here for details:
>
> http://www.spinics.net/lists/linux-btrfs/msg55789.html

There seems to be a code path that leads to a deadlock scenario; it
depends on the extent state of the file under fiemap.
If there's a delalloc range in the middle of the file, fiemap locks the
whole range, then asks for memory that could trigger writeout; the
delalloc range would need to be written out, which in turn requires
locking the delalloc range again. There's no direct lock acquisition,
but rather waiting for the extent's LOCKED bit to be unset. So, use
GFP_NOFS for now, though I still hope we can find a way to avoid the
rb-tree and the allocations altogether.
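The early-exit idea in the patch can be illustrated in miniature. This
is a hypothetical Python sketch, not the kernel's ref_tree code: the
`Ref` tuple and `is_shared` helper are invented here to show why
counting references with the file offset included, and stopping at two,
keeps the check near-constant time.

```python
# Sketch of the check_shared fix: an extent is "shared" as soon as two
# distinct references to it exist, where a reference is identified by
# (root_id, objectid, offset) -- so the same file referencing the
# extent at two different offsets counts as shared too.

from typing import Iterable, NamedTuple

class Ref(NamedTuple):
    root_id: int    # subvolume tree the reference lives in
    objectid: int   # inode number
    offset: int     # file offset referencing the extent

def is_shared(refs: Iterable[Ref]) -> bool:
    """Return True the moment a second unique reference is seen, so
    runtime stays near-constant even for heavily reflinked extents
    (no need to walk the full n^3 backref loop)."""
    seen = set()
    for ref in refs:
        seen.add(ref)
        if len(seen) > 1:
            return True    # early exit: two unique refs mean shared
    return False

# One inode (root 5, inode 257) referencing the same extent at two
# offsets: shared, even though root_id and objectid match.
refs = [Ref(5, 257, 0), Ref(5, 257, 4096)]
```

Duplicate entries for the identical (root, inode, offset) triple
collapse in the set and do not make the extent shared, mirroring the
unique_refs counting described in the commit message.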
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
> So, instead of being the cause, apt-get update causing a new chunk to
> be allocated might as well be the result of existing ones already
> filled up with too many fragments.
>
> The next question is what files these extents belong to. To find out,
> I need to open up the extent items I get back and follow a
> backreference to an inode object. Might do that tomorrow, fun.

Does your apt use pdiffs to update the packages lists? If yes, I'd try
turning it off just for the fun of it and to see whether this changes
btrfs' allocation behavior. I have never looked at apt's pdiff stuff in
detail, but I guess that it creates many tiny temporary files.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: +49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: +49 6224 1600421
Re: [PATCH 00/21] Delete CURRENT_TIME and CURRENT_TIME_SEC macros
Hi,

Deepa Dinamani writes:
> drivers/usb/gadget/function/f_fs.c | 2 +-
> drivers/usb/gadget/legacy/inode.c  | 2 +-

for drivers/usb/gadget:

Acked-by: Felipe Balbi

-- 
balbi