Re: Linux-next regression?
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +0000, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug. It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that
> time).
> Also, the massive write of btrfs defrag can explain the massive
> trigger of the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains
the trouble you hit.  Thanks for the report, I did end up using this as
a datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris
Re: Linux-next regression?
On 28 Nov 2018, at 11:05, Andrea Gelmini wrote: > On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote: >> >> But it's less a concerning problem since it doesn't reach latest RC, >> so >> if you could reproduce it stably, I'd recommend to do a bisect. > > No problem to bisect, usually. > But right now it's not possible for me, I explain further. > Anyway, here the rest of the story. > > So, in the end I: > a) booted with 4.20.0-rc4 > b) updated backup > c) did the btrfs check --read-only > d) seven steps, everything is perfect > e) no complains on screen or in logs (never had) > f) so, started to compile linux-next 20181128 (on another partition) > e) without using (reading or writing) on /home, I started > f) btrfs filesystem defrag -v -r -t 128M /home > g) it worked without complain (in screen or logs) > h) then, reboot with kernel tag 20181128 > i) and no way to mount: I think (hope) this is: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Which was just nailed down to a blkmq bug. It triggers when you have scsi devices using elevator=none over blkmq. -chris
Re: btrfs development - question about crypto api integration
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote: > On 29.11.18 г. 18:43 ч., Jean Fobe wrote: >> Hi all, >> I've been studying LZ4 and other compression algorithms on the >> kernel, and seen other projects such as zram and ubifs using the >> crypto api. Is there a technical reason for not using the crypto api >> for compression (and possibly for encryption) in btrfs? >> I did not find any design/technical implementation choices in >> btrfs development in the developer's FAQ on the wiki. If I completely >> missed it, could someone point me in the right direction? >> Lastly, if there is no technical reason for this, would it be >> something interesting to have implemented? > > I personally think it would be better if btrfs' exploited the generic > framework. And in fact when you look at zstd, btrfs does use the > generic, low-level ZSTD routines but not the crypto library wrappers. > If > I were I'd try and convert zstd (since it's the most recently added > algorithm) to using the crypto layer to see if there are any lurking > problems. Back when I first added the zlib support, the zlib API was both easier to use and a better fit for our async worker threads. That doesn't mean we shouldn't switch, it's just how we got to step one ;) -chris
Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
On 28 Nov 2018, at 14:06, David Sterba wrote: > On Tue, Nov 27, 2018 at 03:08:08PM -0500, Josef Bacik wrote: >> On Tue, Nov 27, 2018 at 07:59:42PM +0000, Chris Mason wrote: >>> On 27 Nov 2018, at 14:54, Josef Bacik wrote: >>> >>>> On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote: >>>>> >>>>> >>>>> On 21.11.18 г. 21:09 ч., Josef Bacik wrote: >>>>>> The cleaner thread usually takes care of delayed iputs, with the >>>>>> exception of the btrfs_end_transaction_throttle path. The >>>>>> cleaner >>>>>> thread only gets woken up every 30 seconds, so instead wake it up >>>>>> to >>>>>> do >>>>>> it's work so that we can free up that space as quickly as >>>>>> possible. >>>>> >>>>> Have you done any measurements how this affects the overall >>>>> system. I >>>>> suspect this introduces a lot of noise since now we are going to >>>>> be >>>>> doing a thread wakeup on every iput, does this give a chance to >>>>> have >>>>> nice, large batches of iputs that the cost of wake up can be >>>>> amortized >>>>> across? >>>> >>>> I ran the whole patchset with our A/B testing stuff and the >>>> patchset >>>> was a 5% >>>> win overall, so I'm inclined to think it's fine. Thanks, >>> >>> It's a good point though, a delayed wakeup may be less overhead. >> >> Sure, but how do we go about that without it sometimes messing up? >> In practice >> the only time we're doing this is at the end of finish_ordered_io, so >> likely to >> not be a constant stream. I suppose since we have places where we >> force it to >> run that we don't really need this. IDK, I'm fine with dropping it. >> Thanks, > > The transaction thread wakes up cleaner periodically (commit interval, > 30s by default), so the time to process iputs is not unbounded. > > I have the same concerns as Nikolay, coupling the wakeup to all > delayed > iputs could result in smaller batches. 
But some of the callers of > btrfs_add_delayed_iput might benefit from the extra wakeup, like > btrfs_remove_block_group, so I don't want to leave the idea yet. Yeah, I love the idea, I'm just worried about a tiny bit of rate limiting. Since we're only waking up a fixed process and not creating new work queue items every time, it's probably fine, but a delay of HZ/2 would probably be even better. -chris
Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
On 27 Nov 2018, at 14:54, Josef Bacik wrote: > On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote: >> >> >> On 21.11.18 г. 21:09 ч., Josef Bacik wrote: >>> The cleaner thread usually takes care of delayed iputs, with the >>> exception of the btrfs_end_transaction_throttle path. The cleaner >>> thread only gets woken up every 30 seconds, so instead wake it up to >>> do >>> it's work so that we can free up that space as quickly as possible. >> >> Have you done any measurements how this affects the overall system. I >> suspect this introduces a lot of noise since now we are going to be >> doing a thread wakeup on every iput, does this give a chance to have >> nice, large batches of iputs that the cost of wake up can be >> amortized >> across? > > I ran the whole patchset with our A/B testing stuff and the patchset > was a 5% > win overall, so I'm inclined to think it's fine. Thanks, It's a good point though, a delayed wakeup may be less overhead. -chris
Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount
On 1 Nov 2018, at 12:00, Omar Sandoval wrote:

> On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
>>> How is that? kthread_stop() frees the task struct, so
>>> wake_up_process() would be a use-after-free.
>>
>> That was an assumption for the demonstration purposes, the wording
>> was confusing sorry.
>
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop
> the thread and make wake_up a noop, and then we don't need to add
> special cases everywhere else.

This is how Omar won me over to kthread_park().  I think Dave's
alternatives can be made correct, but Omar's way solves a whole class of
potential problems.  The park() makes the thread safely ignore future
work, and we won't have to chase down weird bugs where people are
sending future work at the wrong time.

-chris
Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount
On 1 Nov 2018, at 6:15, David Sterba wrote: > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: >> From: Omar Sandoval >> >> There's a race between close_ctree() and cleaner_kthread(). >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it >> sees it set, but this is racy; the cleaner might have already checked >> the bit and could be cleaning stuff. In particular, if it deletes >> unused >> block groups, it will create delayed iputs for the free space cache >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no >> longer running delayed iputs after a commit. Therefore, if the >> cleaner >> creates more delayed iputs after delayed iputs are run in >> btrfs_commit_super(), we will leak inodes on unmount and get a busy >> inode crash from the VFS. >> >> Fix it by parking the cleaner > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > we're missing a commit or clean up data structures. Messing with state > of a thread would be the last thing I'd try after proving that it's > not > possible to fix in the logic of btrfs itself. > > The shutdown sequence in close_tree is quite tricky and we've had bugs > there. The interdependencies of thread and data structures and other > subsystems cannot have loops that could not find an ordering that will > not leak something. > > It's not a big problem if some step is done more than once, like > committing or cleaning up some other structures if we know that > it could create new. The problem is the cleaner thread needs to be told to stop doing new work, and we need to wait for the work it's already doing to be finished. We're getting "stop doing new work" already because the cleaner thread checks to see if the FS is closing, but we don't have a way today to wait for him to finish what he's already doing. kthread_park() is basically the same as adding another mutex or synchronization point. 
I'm not sure how we could change close_tree() or the final commit to pick this up more effectively? -chris
Re: [PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot
On 1 Nov 2018, at 6:21, ethanlien wrote:

> Nikolay Borisov wrote on 2018-11-01 18:01:
>> On 1.11.18 г. 11:56 ч., ethanlien wrote:
>>> Nikolay Borisov wrote on 2018-11-01 16:59:
>>>> On 1.11.18 г. 8:49 ч., Ethan Lien wrote:
>>>>> Snapshot is expected to be fast. But if there are writers steadily
>>>>> creating dirty pages in our subvolume, the snapshot may take a
>>>>> very long time to complete. To fix the problem, we use tagged
>>>>> writepage for the snapshot flusher as we do in the generic
>>>>> write_cache_pages(), so we can omit pages dirtied after the
>>>>> snapshot command.
>>>>
>>>> So the gist of this commit really is that you detect when
>>>> filemap_flush has been called from snapshot context and tag all
>>>> pages at *that* time as PAGECACHE_TAG_TOWRITE and then write them,
>>>> ignoring any pages that might have been dirtied beforehand. Your
>>>> description kind of dances around this idea without really saying
>>>> it explicitly. Those semantics make sense, however I'd like it to
>>>> be stated more explicitly in the change log. However, this is done
>>>> at the expense of consistency, so we have faster snapshots but
>>>> depending on the files which are stored in them they might be
>>>> broken (i.e. a database) since they will be missing pages.
>>>
>>> tag_pages_for_writeback() will tag all pages with
>>> PAGECACHE_TAG_DIRTY. If a dirty page gets missed here, it means
>>> someone has initiated the flush before us, so we will wait for that
>>> dirty page to complete in create_snapshot() ->
>>> btrfs_wait_ordered_extents().
>>
>> And what happens in the scenario where you have someone dirtying
>> pages, you issue the snapshot ioctl, mark all currently dirtied
>> pages as TOWRITE and then you have more delalloc pages being dirtied
>> following the initial call to tag_pages_for_writeback, you will miss
>> those, no?
>
> We don't freeze the filesystem when doing snapshot, so there is no
> guarantee the page dirtied right after we send the ioctl will be
> included in the snapshot.
> If a page is dirtied while we scan the dirty pages and its offset is
> before our current index, we miss it in our current snapshot
> implementation too.

This looks like a pretty good compromise to me between btrfs v0's don't
flush at all on snapshot and today's full sync on snapshot.  The full
sync is always going to be a little racey wrt concurrent writes.

-chris
Re: [PATCH 0/2] address lock contention of btree root
On 21 Aug 2018, at 14:15, Liu Bo wrote: On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote: On 16 Aug 2018, at 17:07, Liu Bo wrote: The lock contention on btree nodes (esp. root node) is apparently a bottleneck when there're multiple readers and writers concurrently trying to access them. Unfortunately this is by design and it's not easy to fix it unless with some complex changes, however, there is still some room. With a stable workload based on fsmark which has 16 threads creating 1,600K files, we could see that a good amount of overhead comes from switching path between spinning mode and blocking mode in btrfs_search_slot(). Patch 1 provides more details about the overhead and test results from fsmark and dbench. Patch 2 kills leave_spinning due to the behaviour change from patch 1. This is really interesting, do you have numbers about how often we are able to stay spinning? I didn't gather how long we stay spinning, I'm less worried about length of time spinning than I am how often we're able to call btrfs_search_slot() without ever blocking. If one caller ends up going into blocking mode, it can cascade into all of the other callers, which can slow things down in low-to-medium contention workloads. The flip side is that maybe the adaptive spinning in the mutexes is good enough now and we can just deleting the spinning completely. but I was tracking (1) how long a read lock or write lock can wait until it gets the lock, with vanilla kernel, for read lock, it could be up to 10ms while for write lock, it could be up to 200ms. Nice, please add the stats to the patch descriptions, both before and after. I'd love to see a histogram like you can get out of bcc's argdist.py. -chris
Re: [PATCH 0/2] address lock contention of btree root
On 16 Aug 2018, at 17:07, Liu Bo wrote: The lock contention on btree nodes (esp. root node) is apparently a bottleneck when there're multiple readers and writers concurrently trying to access them. Unfortunately this is by design and it's not easy to fix it unless with some complex changes, however, there is still some room. With a stable workload based on fsmark which has 16 threads creating 1,600K files, we could see that a good amount of overhead comes from switching path between spinning mode and blocking mode in btrfs_search_slot(). Patch 1 provides more details about the overhead and test results from fsmark and dbench. Patch 2 kills leave_spinning due to the behaviour change from patch 1. This is really interesting, do you have numbers about how often we are able to stay spinning? IOW, do we end up blocking every time? -chris
Re: Do btrfs compression option changes need to be atomic?
On 21 Aug 2018, at 9:46, David Howells wrote: Should changes to the compression options on a btrfs mount be atomic, the problem being that they're split across three variables? Further to that, how much of an issue is it if the configuration is split out into its own struct that is accessed from struct btrfs_fs_info using RCU? Hi David, They shouldn't need to be atomic, we're making compression decisions on a per-extent basis and sometimes bailing on the compression if it doesn't make things smaller. -chris
[PATCH v2] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

	ret2 = btrfs_delalloc_reserve_space() // returns zero on success
	lock_page(page)
	if (page->mapping != inode->i_mapping)
		goto out_unlock;
	...
out_unlock:
	if (!ret2) {
		...
		return VM_FAULT_LOCKED;
	}

This ends up triggering this WARNING in btrfs_destroy_inode()

	WARN_ON(BTRFS_I(inode)->block_rsv.size);

xfstests generic/095 was able to reliably reproduce the errors.

Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.

Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason
---
Changes since v1: don't set the vm_fault_t 'ret' to zero, just move our
goto target around instead.

 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 193f933..63f713a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9036,13 +9036,14 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
 
-out_unlock:
 	if (!ret2) {
 		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
 		sb_end_pagefault(inode->i_sb);
 		extent_changeset_free(data_reserved);
 		return VM_FAULT_LOCKED;
 	}
+
+out_unlock:
 	unlock_page(page);
 out:
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, (ret != 0));
-- 
2.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
On 25 Jun 2018, at 9:54, David Sterba wrote:

> On Mon, Jun 25, 2018 at 06:45:32AM -0700, Chris Mason wrote:
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 193f933..38403aa 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>>  		goto out_unlock;
>>  	}
>>  	ret2 = 0;
>> +	ret = 0;
>
> That's something I'd rather avoid, now there are two error values that
> need to be tracked. Shouldn't the 'ret2 = 0' be removed completely?

Yeah, the whole thing hurts my eyes.  Fixing.

>>  	/* page is wholly or partially inside EOF */
>>  	if (page_start + PAGE_SIZE > size)
>> @@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>>  	unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
>>  
>>  out_unlock:
>> -	if (!ret2) {
>> +	if (!ret) {
>
> I'm not sure is safe, the new ret is one of the VM_FAULT_ values and
> it's not 0 by default and in fact I don't see anything that would set
> it to 0 directly or indirectly. Which means that the code in the out:
> label also needs to be revisited:

Good point, ok.

-chris
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On 25 Jun 2018, at 7:10, David Sterba wrote:

On Fri, Jun 22, 2018 at 05:25:54PM -0400, Chris Mason wrote:

The bug came here:

commit a528a24150870c5c16cbbbec69dba7e992b08456
Author: Souptick Joarder
Date:   Wed Jun 6 19:54:44 2018 +0530

    btrfs: change return type of btrfs_page_mkwrite to vm_fault_t

When page->mapping != mapping, we improperly return success because ret2
is zero on goto out_unlock.  I'm running a fix through a full xfstests
and I'll post soon.

The ret/ret2 pattern is IMO used in the wrong way, the ret2 is some
temporary value and it should be set and tested next to each other, not
holding the state across the function.

The fix I'd propose is to avoid relying on it and add a separate exit
block, similar to out_noreserve:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8981,7 +8981,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		ret = VM_FAULT_SIGBUS;
 		goto out_unlock;
 	}
-	ret2 = 0;
 
 	/* page is wholly or partially inside EOF */
 	if (page_start + PAGE_SIZE > size)
@@ -9004,14 +9003,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
-
-out_unlock:
-	if (!ret2) {
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
-		sb_end_pagefault(inode->i_sb);
-		extent_changeset_free(data_reserved);
-		return VM_FAULT_LOCKED;
-	}
 	unlock_page(page);

VM_FAULT_LOCKED is where we return success.  The out_unlock goto is
confusing because it's actually only used for failure, but the goto
lands right above the if (everything actually worked) {} test.  On top
of the patch I sent today, moving out_unlock: after the if would make it
more clear.

-chris
[PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

	ret2 = btrfs_delalloc_reserve_space() // returns zero on success
	lock_page(page)
	if (page->mapping != inode->i_mapping)
		goto out_unlock;
	...
out_unlock:
	if (!ret2) {
		...
		return VM_FAULT_LOCKED;
	}

This ends up triggering this WARNING in btrfs_destroy_inode()

	WARN_ON(BTRFS_I(inode)->block_rsv.size);

The fix used here is to use ret instead of ret2 in our check for
success.

Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 193f933..38403aa 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		goto out_unlock;
 	}
 	ret2 = 0;
+	ret = 0;
 
 	/* page is wholly or partially inside EOF */
 	if (page_start + PAGE_SIZE > size)
@@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
 
 out_unlock:
-	if (!ret2) {
+	if (!ret) {
 		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
-- 
2.9.5
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On 20 Jun 2018, at 16:24, David Sterba wrote: On Wed, Jun 20, 2018 at 03:48:08PM -0400, Chris Mason wrote: generic/095 [18:07:03][ 3769.317862] run fstests generic/095 at 2018-06-20 18:07:03 Hmpf, I pass both 095 and 208 here. [ 3774.849685] BTRFS: device fsid 3acffad9-28e5-43ce-80e1-f5032e334cba devid 1 transid 5 /dev/vdb [ 3774.875409] BTRFS info (device vdb): disk space caching is enabled [ 3774.877723] BTRFS info (device vdb): has skinny extents [ 3774.879371] BTRFS info (device vdb): flagging fs with big metadata feature [ 3774.885020] BTRFS info (device vdb): checking UUID tree [ 3775.593329] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1 [ 3776.642812] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0 [ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319 btrfs_destroy_inode+0x1d5/0x290 [btrfs] Which warning is this in your tree? The file_write patch is more likely to have screwed up our bits and the fixup worker is more likely to have screwed up nrpages. 9311 void btrfs_destroy_inode(struct inode *inode) 9312 { 9313 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); 9314 struct btrfs_ordered_extent *ordered; 9315 struct btrfs_root *root = BTRFS_I(inode)->root; 9316 9317 WARN_ON(!hlist_empty(>i_dentry)); 9318 WARN_ON(inode->i_data.nrpages); 9319 WARN_ON(BTRFS_I(inode)->block_rsv.reserved); The branch is the last pull, ie. no other 4.18-rc1 stuff plus your two patches. The bug came here: commit a528a24150870c5c16cbbbec69dba7e992b08456 Author: Souptick Joarder Date: Wed Jun 6 19:54:44 2018 +0530 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t When page->mapping != mapping, we improperly return success because ret2 is zero on goto out_unlock. 
I'm running a fix through a full xfstests and I'll post soon.

-chris
Re: [RFC PATCH] btrfs: Remove V0 extent support
On 21 Jun 2018, at 5:22, David Sterba wrote:

On Thu, Jun 21, 2018 at 09:45:00AM +0300, Nikolay Borisov wrote:

The v0 compat code was introduced in commit 5d4f98a28c7d ("Btrfs: Mixed
back reference (FORWARD ROLLING FORMAT CHANGE)") 9 years ago, which was
merged in 2.6.31.  This means that the code is there to support
filesystems which are _VERY_ old and if you are using btrfs on such an
old kernel, you have much bigger problems.  This coupled with the fact
that no one is likely testing/maintaining this code likely means it has
bugs lurking.  All things considered I think 43 kernel releases later
it's high time this remnant of the past got removed.

Ack.  Given the age of the format and nearly zero chance of such a
filesystem to be still in use, it should be ok to remove it directly and
not add some sort of warning when such a filesystem is found.

Agreed, it seems very unlikely to me that we have happy v0 users.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in
case we have a v0 with no support intact as a sort of safety-net.

I think there should be some verbose way to report what happened and
possibly do proper error handling and go to read-only eventually.  The
type of checks for v0 do not seem to be something that could be done at
mount time.

Yeah, with the code we have right now, it's really obvious the BUG()s
are for V0 extents.  With all the ifdefs removed it's much less clear.
I'd swap them out for a transaction abort with a message about the v0
extents.  Also, there's at least one ASSERT() left over in a #else,
which I would also turn into a more explicit abort or at least comment.

-chris
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On 20 Jun 2018, at 15:33, David Sterba wrote: On Wed, Jun 20, 2018 at 07:56:10AM -0700, Chris Mason wrote: We've been hunting the root cause of data crc errors here at FB for a while. We'd find one or two corrupted files, usually displaying crc errors without any corresponding IO errors from the storage. The bug was rare enough that we'd need to watch a large number of machines for a few days just to catch it happening. We're still running these patches through testing, but the fixup worker bug seems to account for the vast majority of crc errors we're seeing in the fleet. It's cleaning pages that were dirty, and creating a window where they can be reclaimed before we finish processing the page. I'm having flashbacks when I see 'fixup worker', and the test generic/208 does not make it better: generic/095 [18:07:03][ 3769.317862] run fstests generic/095 at 2018-06-20 18:07:03 [ 3774.849685] BTRFS: device fsid 3acffad9-28e5-43ce-80e1-f5032e334cba devid 1 transid 5 /dev/vdb [ 3774.875409] BTRFS info (device vdb): disk space caching is enabled [ 3774.877723] BTRFS info (device vdb): has skinny extents [ 3774.879371] BTRFS info (device vdb): flagging fs with big metadata feature [ 3774.885020] BTRFS info (device vdb): checking UUID tree [ 3775.593329] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1 [ 3776.642812] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! 
[ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0
[ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319 btrfs_destroy_inode+0x1d5/0x290 [btrfs]
[ 3776.924182] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: libcrc32c]
[ 3776.927703] CPU: 0 PID: 12036 Comm: umount Not tainted 4.17.0-rc7-default+ #153
[ 3776.929164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 3776.931006] RIP: 0010:btrfs_destroy_inode+0x1d5/0x290 [btrfs]

Running generic/095 on current Linus git (without my patches), I'm
seeing this same warning.  This makes me a little happy because I have
my patches in prod, but mostly sad because it's easier to find when the
suspect pool is small.  I'll bisect.

-chris
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On 20 Jun 2018, at 15:33, David Sterba wrote: On Wed, Jun 20, 2018 at 07:56:10AM -0700, Chris Mason wrote: We've been hunting the root cause of data crc errors here at FB for a while. We'd find one or two corrupted files, usually displaying crc errors without any corresponding IO errors from the storage. The bug was rare enough that we'd need to watch a large number of machines for a few days just to catch it happening. We're still running these patches through testing, but the fixup worker bug seems to account for the vast majority of crc errors we're seeing in the fleet. It's cleaning pages that were dirty, and creating a window where they can be reclaimed before we finish processing the page. I'm having flashbacks when I see 'fixup worker', Yeah, I don't understand how so much pain can live in one little function. and the test generic/208 does not make it better: generic/095 [18:07:03][ 3769.317862] run fstests generic/095 at 2018-06-20 18:07:03 Hmpf, I pass both 095 and 208 here. [ 3774.849685] BTRFS: device fsid 3acffad9-28e5-43ce-80e1-f5032e334cba devid 1 transid 5 /dev/vdb [ 3774.875409] BTRFS info (device vdb): disk space caching is enabled [ 3774.877723] BTRFS info (device vdb): has skinny extents [ 3774.879371] BTRFS info (device vdb): flagging fs with big metadata feature [ 3774.885020] BTRFS info (device vdb): checking UUID tree [ 3775.593329] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1 [ 3776.642812] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0 [ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319 btrfs_destroy_inode+0x1d5/0x290 [btrfs] Which warning is this in your tree? 
The file_write patch is more likely to have screwed up our bits and the
fixup worker is more likely to have screwed up nrpages.

-chris
[PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
We've been hunting the root cause of data crc errors here at FB for a
while.  We'd find one or two corrupted files, usually displaying crc
errors without any corresponding IO errors from the storage.  The bug
was rare enough that we'd need to watch a large number of machines for a
few days just to catch it happening.

We're still running these patches through testing, but the fixup worker
bug seems to account for the vast majority of crc errors we're seeing in
the fleet.  It's cleaning pages that were dirty, and creating a window
where they can be reclaimed before we finish processing the page.

btrfs_file_write() has a similar bug when copy_from_user catches a page
fault and we're writing to a page that was already dirty when file_write
started.  This one is much harder to trigger, and I haven't confirmed
yet that we're seeing it in the fleet.
[PATCH 2/2] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
For COW, btrfs expects dirty pages to have been through a few setup steps. This includes reserving space for the new block allocations and marking the range in the state tree for delayed allocation. A few places outside btrfs will dirty pages directly, especially when unmapping mmap'd pages. In order for these to properly go through COW, we run them through a fixup worker to wait for stable pages, and do the delalloc prep.

87826df0ec36 added a window where the dirty pages were cleaned, but pending more action from the fixup worker. During this window, page migration can jump in and relocate the page. Once our fixup work actually starts, it finds page->mapping is NULL and we end up freeing the page without ever writing it.

This leads to crc errors and other exciting problems, since it screws up the whole state machine for waiting for ordered extents.

The fix here is to keep the page dirty while we're waiting for the fixup worker to get to work. This also makes sure the error handling in btrfs_writepage_fixup_worker does the right thing with dirty bits when we run out of space.

Signed-off-by: Chris Mason

 fs/btrfs/inode.c | 67 +---
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0b86cf1..5538900 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2100,11 +2100,21 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	page = fixup->page;
 again:
 	lock_page(page);
-	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
-		ClearPageChecked(page);
+
+	/*
+	 * before we queued this fixup, we took a reference on the page.
+	 * page->mapping may go NULL, but it shouldn't be moved to a
+	 * different address space.
+	 */
+	if (!page->mapping || !PageDirty(page) || !PageChecked(page))
 		goto out_page;
-	}
+
+	/*
+	 * we keep the PageChecked() bit set until we're done with the
+	 * btrfs_start_ordered_extent() dance that we do below.  That
+	 * drops and retakes the page lock, so we don't want new
+	 * fixup workers queued for this page during the churn.
+	 */
 	inode = page->mapping->host;
 	page_start = page_offset(page);
 	page_end = page_offset(page) + PAGE_SIZE - 1;
@@ -2129,33 +2139,46 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
					   PAGE_SIZE);
-	if (ret) {
-		mapping_set_error(page->mapping, ret);
-		end_extent_writepage(page, ret, page_start, page_end);
-		ClearPageChecked(page);
-		goto out;
-	}
+	if (ret)
+		goto out_error;
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
					&cached_state, 0);
-	if (ret) {
-		mapping_set_error(page->mapping, ret);
-		end_extent_writepage(page, ret, page_start, page_end);
-		ClearPageChecked(page);
-		goto out;
-	}
+	if (ret)
+		goto out_error;
-	ClearPageChecked(page);
-	set_page_dirty(page);
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
+
+	/*
+	 * everything went as planned, we're now the proud owners of a
+	 * Dirty page with delayed allocation bits set and space reserved
+	 * for our COW destination.
+	 *
+	 * The page was dirty when we started, nothing should have cleaned it.
+	 */
+	BUG_ON(!PageDirty(page));
+
 out:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
			     &cached_state);
 out_page:
+	ClearPageChecked(page);
 	unlock_page(page);
 	put_page(page);
 	kfree(fixup);
 	extent_changeset_free(data_reserved);
+	return;
+
+out_error:
+	/*
+	 * We hit ENOSPC or other errors.  Update the mapping and page to
+	 * reflect the errors and clean the page.
+	 */
+	mapping_set_error(page->mapping, ret);
+	end_extent_writepage(page, ret, page_start, page_end);
+	clear_page_dirty_for_io(page);
+	SetPageError(page);
+	goto out;
 }

 /*
@@ -2179,6 +2202,13 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end)
 	if (TestClearPagePrivate2(page))
 		return 0;

+	/*
+	 * PageChecked is set below when we create a fixup worker for this page,
+	 * don't try to create another one if we're already PageChecked()
+	 *
+	 * The extent_io writepage code will redirty the page if we send
+	 * back EAGAIN.
+	 */
 	if (PageChecked(page))
 		return -EAGAIN;

@@ -2192,7 +,8 @@ stat
[PATCH 1/2] Btrfs: don't clean dirty pages during buffered writes
During buffered writes, we follow this basic series of steps:

again:
	lock all the pages
	wait for writeback on all the pages
	Take the extent range lock
	wait for ordered extents on the whole range
	clean all the pages

	if (copy_from_user_in_atomic() hits a fault) {
		drop our locks
		goto again;
	}

	dirty all the pages
	release all the locks

The extra waiting, cleaning and locking are there to make sure we don't modify pages in flight to the drive, after they've been crc'd.

If some of the pages in the range were already dirty when the write began, and we need to goto again, we create a window where a dirty page has been cleaned and unlocked. It may be reclaimed before we're able to lock it again, which means we'll read the old contents off the drive and lose any modifications that had been pending writeback.

We don't actually need to clean the pages. All of the other locking in place makes sure we don't start IO on the pages, so we can just leave them dirty for the duration of the write.

Fixes: 73d59314e6ed (the original btrfs merge)
Signed-off-by: Chris Mason

 fs/btrfs/file.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f660ba1..89ec4d2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -534,6 +534,15 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 	end_of_last_block = start_pos + num_bytes - 1;

+	/*
+	 * the pages may have already been dirty, clear out old accounting
+	 * so we can set things up properly
+	 */
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, end_of_last_block,
+			 EXTENT_DIRTY | EXTENT_DELALLOC |
+			 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
+			 cached);
+
 	if (!btrfs_is_free_space_inode(BTRFS_I(inode))) {
 		if (start_pos >= isize &&
 		    !(BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)) {
@@ -1504,18 +1513,27 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 		}
 		if (ordered)
 			btrfs_put_ordered_extent(ordered);
-		clear_extent_bit(&inode->io_tree, start_pos, last_pos,
-				 EXTENT_DIRTY | EXTENT_DELALLOC |
-				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
-				 0, 0, cached_state);
+
 		*lockstart = start_pos;
 		*lockend = last_pos;
 		ret = 1;
 	}

+	/*
+	 * It's possible the pages are dirty right now, but we don't want
+	 * to clean them yet because copy_from_user may catch a page fault
+	 * and we might have to fall back to one page at a time.  If that
+	 * happens, we'll unlock these pages and we'd have a window where
+	 * reclaim could sneak in and drop the once-dirty page on the floor
+	 * without writing it.
+	 *
+	 * We have the pages locked and the extent range locked, so there's
+	 * no way someone can start IO on any dirty pages in this range.
+	 *
+	 * we'll call btrfs_dirty_pages() later on, and that will flip around
+	 * delalloc bits and dirty the pages as required.
+	 */
 	for (i = 0; i < num_pages; i++) {
-		if (clear_page_dirty_for_io(pages[i]))
-			account_page_redirty(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}
--
2.9.5
Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression
On 19 Jun 2018, at 23:51, Huang, Ying wrote: "Huang, Ying" writes: Hi, Josef, Do you have time to take a look at the regression? kernel test robot writes: Greeting, FYI, we noticed a -12.3% regression of blogbench.write_score and a +9.6% improvement of blogbench.read_score due to commit: commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use sc->priority for slab shrink targets") https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master in testcase: blogbench on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 8G memory with following parameters: disk: 1SSD fs: btrfs cpufreq_governor: performance test-description: Blogbench is a portable filesystem benchmark that tries to reproduce the load of a real-world busy file server. test-url: I'm surprised, this patch is a big win in production here at FB. I'll have to reproduce these results to better understand what is going on. My first guess is that since we have fewer inodes in slab, we're reading more inodes from disk in order to do the writes. But that should also make our read scores lower. -chris
Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches
On 6 Jun 2018, at 9:38, Liu Bo wrote: On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason wrote: On 5 Jun 2018, at 16:03, Andrew Morton wrote: (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote: https://bugzilla.kernel.org/show_bug.cgi?id=199931 Bug ID: 199931 Summary: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches A long tale of woe here. Chris, do you think the pagecache corruption is a general thing, or is it possible that btrfs is contributing? Also, that 4.4 oom-killer regression sounds very serious. This week I found a bug in btrfs file write with how we handle stable pages. Basically it works like this:

write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
	btrfs prefaults the userland page
	lock_and_cleanup_extent_if_need() <- stable pages
		wait for writeback()
		clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean. That's normally fine, unless our prefaulted page isn't in ram anymore.

	iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry. But along the way, we completely lost the dirty bit on the page. If the page is dropped by drop_caches, the writes are lost. We'll just read back the stale contents of that page during the retry loop. This won't result in crc errors because the bytes we lost were never crc'd. So we're going to carefully redirty the page under the page lock, right? I don't think we actually need to clean it. We have the page locked, writeback won't start until we unlock. It could result in zeros in the file because we're basically reading a hole, and those zeros could move around in the page depending on which part of the page was dirty when the writes were lost. I got a question, while re-reading this page, wouldn't it read old/stale on-disk data?
If it was never written we should be treating it like a hole, but I'll double check. -chris
Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches
On 5 Jun 2018, at 16:03, Andrew Morton wrote: (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote: https://bugzilla.kernel.org/show_bug.cgi?id=199931 Bug ID: 199931 Summary: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches A long tale of woe here. Chris, do you think the pagecache corruption is a general thing, or is it possible that btrfs is contributing? Also, that 4.4 oom-killer regression sounds very serious. This week I found a bug in btrfs file write with how we handle stable pages. Basically it works like this:

write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
	btrfs prefaults the userland page
	lock_and_cleanup_extent_if_need() <- stable pages
		wait for writeback()
		clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean. That's normally fine, unless our prefaulted page isn't in ram anymore.

	iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry. But along the way, we completely lost the dirty bit on the page. If the page is dropped by drop_caches, the writes are lost. We'll just read back the stale contents of that page during the retry loop. This won't result in crc errors because the bytes we lost were never crc'd. It could result in zeros in the file because we're basically reading a hole, and those zeros could move around in the page depending on which part of the page was dirty when the writes were lost. I spent a morning trying to trigger this with drop_caches and couldn't make it happen, even with schedule_timeout()s inserted and other tricks. But I was able to get corruptions if I manually invalidated pages in the critical section. I'm working on a patch, and I'll check and see if any of the other recent fixes Dave integrated may have a less exotic explanation.
-chris
Re: [PATCH] Btrfs: implement unlocked buffered write
On 24 May 2018, at 4:46, robbieko wrote: Chris Mason wrote on 2018-05-23 23:56: On 23 May 2018, at 3:26, robbieko wrote: But we're not avoiding the inode lock completely, we're just dropping it for the expensive parts of writing to the file. A quick guess about what the expensive parts are: 1) balance_dirty_pages() 2) btrfs_btree_balance_dirty() 3) metadata reservations/enospc waiting. The expensive part of buffered_write are: 1. prepare_pages() --wait_on_page_writeback() Because writeback submit page to PG_writeback. We must wait until the page writeback IO ends. Is this based on timing the fio job or something else? We can trigger a stable page prep run before we take the mutex, but stable pages shouldn't be part of random IO workloads unless we're doing random IO inside a file that mostly fits in ram. This problem is easily encountered in the context of VMs and File-based iSCSI LUNs. Most of them are random write pattern. Fio can quickly simulate the problem. With really random IO, to a device bigger than memory, the chances of us writing to the same block twice with it still in flight are pretty small. With VMs and other virtual block devices backed by btrfs, it's easier to hit stable pages because filesystems try pretty hard to be less than completely random. So which is the best way? 1. unlocked buffered write (like direct-io) 2. stable page prep 3. Or is there any way to completely avoid stable pages? balance_dirty_pages() is a much more common waiting point, and doing that with the inode lock held definitely adds contention. Yes. I agree. But for latency, balance_dirty_pages is usually a relatively smooth latency, lock_and_cleanup_extent_if_need and prepare_pages are dependent on IO, so the latency is a multiple of growth. I think the best way is to peel out a stable page write wait before the inode lock is taken, and only when the write is inside i_size.
We'll need to keep the existing stable page wait as well, but waiting before the inode lock is taken should handle the vast majority of IO in flight. I'd also push the balance_dirty_pages() outside the inode lock, as long as the write is relatively small. For really getting rid of stable pages, that's a bigger project ;) -chris
Re: [PATCH] Btrfs: implement unlocked buffered write
On 23 May 2018, at 2:37, Christoph Hellwig wrote: On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote: And what protects two writes from interleaving their results now? page locks...ish, we at least won't have results interleaved in a single page. For btrfs it'll actually be multiple pages since we try to do more than one at a time. I think you are going to break just about every assumption people have that any single write is going to be atomic vs another write. E.g. this comes from the posix read definition reference by the write definition: "I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic means that all the bytes from a single operation that started out together end up together, without interleaving from other I/O operations. It is a known attribute of terminals that this is not honored, and terminals are explicitly (and implicitly permanently) excepted, making the behavior unspecified. The behavior for other device types is also left unspecified, but the wording is intended to imply that future standards might choose to specify atomicity (or not). " Without taking the inode lock (or some sort of range lock) you can easily interleave data from two write requests. "This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control." I'm always more worried about truncate than standards ;) But just to be clear, I'm not disagreeing with you at all. Interleaved writes just wasn't the first thing that jumped to my mind. But we're not avoiding the inode lock completely, we're just dropping it for the expensive parts of writing to the file. A quick guess about what the expensive parts are: The way I read the patch it basically 'avoids' the inode lock for almost the whole write call, just minus some setup. 
Yeah, if we can get 90% of the way there by pushing some balance_dirty_pages() outside the lock (or whatever other expensive setup we're doing), I'd be much happier. -chris
Re: [PATCH] Btrfs: implement unlocked buffered write
On 23 May 2018, at 3:26, robbieko wrote: Chris Mason wrote on 2018-05-23 02:31: On 22 May 2018, at 14:08, Christoph Hellwig wrote: On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote: From: Robbie Ko <robbi...@synology.com>

This idea is from direct io. By this patch, we can make the buffered write parallel, and improve the performance and latency. But because we can not update isize without i_mutex, the unlocked buffered write just can be done in front of the EOF. We needn't worry about the race between buffered write and truncate, because the truncate need wait until all the buffered write end. And we also needn't worry about the race between dio write and punch hole, because we have extent lock to protect our operation. I ran fio to test the performance of this feature. And what protects two writes from interleaving their results now? page locks...ish, we at least won't have results interleaved in a single page. For btrfs it'll actually be multiple pages since we try to do more than one at a time. I haven't verified all the assumptions around truncate and fallocate and friends expecting the dio special locking to be inside i_size. In general this makes me a little uncomfortable. But we're not avoiding the inode lock completely, we're just dropping it for the expensive parts of writing to the file. A quick guess about what the expensive parts are: 1) balance_dirty_pages() 2) btrfs_btree_balance_dirty() 3) metadata reservations/enospc waiting. The expensive part of buffered_write are: 1. prepare_pages() --wait_on_page_writeback() Because writeback submit page to PG_writeback. We must wait until the page writeback IO ends. Is this based on timing the fio job or something else? We can trigger a stable page prep run before we take the mutex, but stable pages shouldn't be part of random IO workloads unless we're doing random IO inside a file that mostly fits in ram.
balance_dirty_pages() is a much more common waiting point, and doing that with the inode lock held definitely adds contention. 2. lock_and_cleanup_extent_if_need --btrfs_start_ordered_extent When a large number of ordered_extent queue is in endio_write_workers workqueue. Buffered_write assumes that ordered_extent is the last one in the endio_write_workers workqueue, and waits for all ordered_extents to be processed before because the workqueue is a FIFO. This isn't completely accurate, but it falls into the stable pages bucket as well. We can push a lighter version of the stable page wait before the inode lock when the IO is inside of i_size. It won't completely remove the possibility that someone else dirties those pages again, but if the workload is random or really splitting up the file, it'll make the work done with the lock held much less. -chris
Re: [PATCH] Btrfs: implement unlocked buffered write
On 22 May 2018, at 14:08, Christoph Hellwig wrote: On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote: From: Robbie Ko

This idea is from direct io. By this patch, we can make the buffered write parallel, and improve the performance and latency. But because we can not update isize without i_mutex, the unlocked buffered write just can be done in front of the EOF. We needn't worry about the race between buffered write and truncate, because the truncate need wait until all the buffered write end. And we also needn't worry about the race between dio write and punch hole, because we have extent lock to protect our operation. I ran fio to test the performance of this feature. And what protects two writes from interleaving their results now? page locks...ish, we at least won't have results interleaved in a single page. For btrfs it'll actually be multiple pages since we try to do more than one at a time. I haven't verified all the assumptions around truncate and fallocate and friends expecting the dio special locking to be inside i_size. In general this makes me a little uncomfortable. But we're not avoiding the inode lock completely, we're just dropping it for the expensive parts of writing to the file. A quick guess about what the expensive parts are: 1) balance_dirty_pages() 2) btrfs_btree_balance_dirty() 3) metadata reservations/enospc waiting. Can I bribe you to benchmark how much each of those things is impacting the iops/latency benefits? -chris
Re: verify key failure
On 14 May 2018, at 10:35, Liu Bo wrote: Hi, I got another warning of verify_level_key by running btrfs/124 in a loop, I'm testing against 4.17-rc3. Not sure if it's false positive. How long does this take to trigger? -chris
Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs
On 11 May 2018, at 10:10, David Sterba wrote: On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote: On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote: Move the btrfs label ioctls up to the vfs for general use. This retains 256 chars as the maximum size through the interface, which is the btrfs limit and AFAIK exceeds any other filesystem's maximum label size.

Signed-off-by: Eric Sandeen
Reviewed-by: Andreas Dilger
Reviewed-by: David Sterba

No objections (and it obviously ought to go through btrfs tree). I can take it through my tree, but Eric mentioned that there's a patch for xfs that depends on it. In this case it would make sense to take both patches at once via the xfs tree. There are no pending conflicting changes in btrfs. Probably easiest to just have a separate pull dedicated just for this series. That way it doesn't really matter which tree it goes through. -chris
Re: metadata_ratio mount option?
On 7 May 2018, at 12:16, Martin Svec wrote: Hello Chris, On 7.5.2018 at 16:49, Chris Mason wrote: On 7 May 2018, at 7:40, Martin Svec wrote: Hi, According to man btrfs [1], I assume that metadata_ratio=1 mount option should force allocation of one metadata chunk after every allocated data chunk. However, when I set this option and start filling btrfs with "dd if=/dev/zero of=dummyfile.dat", only data chunks are allocated but no metadata ones. So, how does the metadata_ratio option really work? Note that I'm trying to use this option as a workaround of the bug reported here: [ urls that FB email server eats, sorry ] It's link to "Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs" thread :) Oh yeah, the link worked fine, it just goes through this url defense monster that munges it in replies. i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC errors. metadata_ratio is almost but not quite what you want. It sets a flag on the space_info to force a chunk allocation the next time we decide to call should_alloc_chunk(). Thanks to the overcommit code, we usually don't call that until the metadata we think we're going to need is bigger than the metadata space available. In other words, by the time we're into the code that honors the force flag, reservations are already high enough to make us allocate the chunk anyway. Yeah, that's how I understood the code. So I think metadata_ratio man section is quite confusing because it implies that btrfs guarantees given metadata to data chunk space ratio, which isn't true. I tried to use metadata_ratio to experiment with forcing more metadata slop space, but really I have to tweak the overcommit code first. Omar beat me to a better solution, tracking down our transient ENOSPC problems here at FB to reservations done for orphans. Do you have a lot of deleted files still being held open? lsof /mntpoint | grep deleted will list them. I'll take a look during backup window.
The initial bug report describes our rsync workload in detail, for your reference. We're working through a patch for the orphans here. You've got a ton of bytes pinned, which isn't a great match for the symptoms we see: [285169.096630] BTRFS info (device sdb): space_info 4 has 18446744072120172544 free, is not full [285169.096633] BTRFS info (device sdb): space_info total=273804165120, used=269218267136, pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072 But, your may_use count is high enough that you might be hitting this problem. Otherwise I'll work out a patch to make some more metadata chunks while Josef is perfecting his great delayed ref update. As mentioned in the bug report, we have a custom patch that dedicates SSDs for metadata chunks and HDDs for data chunks. So, all we need is to preallocate metadata chunks to occupy all of the SSD space and our issues will be gone. Note that btrfs with SSD-backed metadata works absolutely great for rsync backups, even if there're billions of files and thousands of snapshots. The global reservation ENOSPC is the last issue we're struggling with. Great, we'll get this nailed down, thanks! -chris
Re: metadata_ratio mount option?
On 7 May 2018, at 7:40, Martin Svec wrote: Hi, According to man btrfs [1], I assume that metadata_ratio=1 mount option should force allocation of one metadata chunk after every allocated data chunk. However, when I set this option and start filling btrfs with "dd if=/dev/zero of=dummyfile.dat", only data chunks are allocated but no metadata ones. So, how does the metadata_ratio option really work? Note that I'm trying to use this option as a workaround of the bug reported here: [ urls that FB email server eats, sorry ] i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC errors. metadata_ratio is almost but not quite what you want. It sets a flag on the space_info to force a chunk allocation the next time we decide to call should_alloc_chunk(). Thanks to the overcommit code, we usually don't call that until the metadata we think we're going to need is bigger than the metadata space available. In other words, by the time we're into the code that honors the force flag, reservations are already high enough to make us allocate the chunk anyway. I tried to use metadata_ratio to experiment with forcing more metadata slop space, but really I have to tweak the overcommit code first. Omar beat me to a better solution, tracking down our transient ENOSPC problems here at FB to reservations done for orphans. Do you have a lot of deleted files still being held open? lsof /mntpoint | grep deleted will list them. We're working through a patch for the orphans here. You've got a ton of bytes pinned, which isn't a great match for the symptoms we see: [285169.096630] BTRFS info (device sdb): space_info 4 has 18446744072120172544 free, is not full [285169.096633] BTRFS info (device sdb): space_info total=273804165120, used=269218267136, pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072 But, your may_use count is high enough that you might be hitting this problem. 
Otherwise I'll work out a patch to make some more metadata chunks while Josef is perfecting his great delayed ref update. -chris
Re: Inconsistent behavior of fsync in btrfs
On 29 Apr 2018, at 18:16, Theodore Y. Ts'o wrote: On Sun, Apr 29, 2018 at 03:55:39PM -0500, Vijay Chidambaram wrote: In the spirit of clarifying fsync behavior, we have one more case where we'd like to find out what should be expected. Consider this:

	Mkdir A
	Creat A/bar
	Fsync A/bar
	Rename A to B
	Fsync B/bar
	-- Crash --

A/bar has been fsynced previously, so it's not a newly created file. After the crash, in ext4 and btrfs, can we expect directory B and B/bar to exist? For ext4, no. The POSIX semantics apply: bar will *either* be in A, or in B. Same for btrfs. If the rename for B goes down, it'll be a side effect of other decisions and not on purpose. I'd actually like for the rename to be on disk in the normal case, but we won't always be able to catch it. If you modify the file bar such that the mod time has been updated, then fsync(2) --- but not necessarily fdatasync(2) --- will cause the inode modifications to be committed, and this will cause the updates to directory B from the rename to be committed as a side-effect. Note though that there are plenty of people who consider this to be a performance bug, and not a feature, and there have been papers proposed by your fellow academics that if implemented, would change this to no longer be true. In general with these sorts of things it would be useful to reason about this in the context of real world applications and why they want such guarantees. These guarantees can cost performance hits, and so there is a cost/benefit tradeoff involved. So my preference is to negotiate with application writers, and ask *why* they want such guarantees, and to explore whether there are better ways of achieving their high level goals before we legislate this to be an iron-clad commitment which might make application A happy, but performance-seeking user B unhappy. I know this is not POSIX compliant, but from prior comments, it seems like both ext4 and btrfs would like to persist directory entries upon fsync of newly created files.
>> So we were wondering if this extended to this case.
>
> We had real world examples of users/applications who suffered data
> loss when the directory entries for newly created files were not
> persisted. It was on the basis of these complaints that we made this
> commitment, since it seemed more important than the relatively minor
> performance hit.

Agreeing with Ted and expanding a bit. If fsync(some file) doesn't persist the name for that file, applications need to fsync the directories, which can be double the log commits. Getting everything down to disk in one fsync() is much better for both the application and the FS.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
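The double log commit Chris describes is easy to see at the syscall level. This is a sketch (filenames and data are made up) of the portable belt-and-suspenders pattern: fsync the file, then fsync its parent directory for filesystems that don't persist the new name on fsync(file):

```python
import os
import tempfile

def create_and_persist(path):
    """Create a file and persist both its data and its directory entry."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(fd, b"important data\n")
        os.fsync(fd)   # on btrfs/ext4 this also persists the new name
    finally:
        os.close(fd)

    # Portable belt-and-suspenders: the second log commit Chris mentions.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)  # persist the directory entry itself
    finally:
        os.close(dfd)

workdir = tempfile.mkdtemp()
create_and_persist(os.path.join(workdir, "bar"))
```

On filesystems where fsync(file) already persists the name, the directory fsync is the redundant commit this thread is about avoiding.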
Re: Inconsistent behavior of fsync in btrfs
On 27 Apr 2018, at 10:07, David Sterba wrote:

> On Thu, Apr 26, 2018 at 07:59:23PM -0500, Jayashree Mohan wrote:
>> Thanks for the response. We are using a tool we developed called
>> CrashMonkey[1] to run crash consistency tests and generate the bug
>> reports above. We'd be happy to guide you through setting up
>> CrashMonkey and getting these bugs reproduced. However, if you want
>> to be able to reproduce them with your current setup (xfstest +
>> dm-flakey), I have the workload scripts attached to the end of the
>> mail which might make your task simpler.
>>
>> Interestingly we seem to have found another bug that breaks rename
>> atomicity and results in a previously fsynced file missing.
>>
>> Workload:
>> 1. mkdir A
>> 2. creat A/bar (*)
>> 3. fsync A/bar
>> 4. mkdir B
>> 5. creat B/bar
>> 6. rename B/bar A/bar
>> 7. creat A/foo
>> 8. fsync A/foo
>> 9. fsync A
>> --- crash ---
>>
>> When we recover from the crash, we see that file A/bar goes missing.
>> If the rename did not persist, we expect to see A/bar (*) created in
>> step 2 above, or if the rename indeed persisted, we still expect file
>> A/bar to be present.
>
> I'm no fsync expert and the lack of standard or well defined behaviour
> (mentioned elsewhere) leads me to question, on what do you base your
> expectations? Not only for this report, but in general during your
> testing.
>
> Comparing various filesystems will show that at best it's
> implementation defined and everybody has their own reasons for doing
> it one way or another, or request fsync at particular time etc. We
> have a manual page in section 5 that contains general topics of btrfs,
> so documenting the fsync specifics would be good.

My goal for the fsync tree log was to make it just do the right thing most of the time. We mostly got there, thanks to a ton of fixes and test cases from Filipe.

fsync(some file) -- all the names for this file will exist, without having to fsync the directory.

fsync(some dir) -- ugh, don't fsync the directory.
But if you do, all the files/subdirs will exist, and unlinks will be done and renames will be included. This is slow and may require a full FS commit, which is why we don't want dirs fsunk.

-chris
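For reference, the nine-step workload quoted in this thread maps to the following syscall sequence. This is only a sketch: the actual crash injection needs CrashMonkey or dm-flakey, so this shows what the application issues, not the crash itself:

```python
import os
import tempfile

def fsync_path(path):
    """fsync a file or directory by path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

root = tempfile.mkdtemp()
A = os.path.join(root, "A")
B = os.path.join(root, "B")

os.mkdir(A)                                                 # 1. mkdir A
open(os.path.join(A, "bar"), "w").close()                   # 2. creat A/bar (*)
fsync_path(os.path.join(A, "bar"))                          # 3. fsync A/bar
os.mkdir(B)                                                 # 4. mkdir B
open(os.path.join(B, "bar"), "w").close()                   # 5. creat B/bar
os.rename(os.path.join(B, "bar"), os.path.join(A, "bar"))   # 6. rename B/bar A/bar
open(os.path.join(A, "foo"), "w").close()                   # 7. creat A/foo
fsync_path(os.path.join(A, "foo"))                          # 8. fsync A/foo
fsync_path(A)                                               # 9. fsync A
# --- crash would be injected here; A/bar must exist either way ---
```

Whether the crash lands before or after the rename commits, some version of A/bar should survive; the reported bug is that neither does.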
Re: Inconsistent behavior of fsync in btrfs
On 26 Apr 2018, at 18:59, Jayashree Mohan wrote:

> Hi Chris,
>
> Thanks for the response. We are using a tool we developed called
> CrashMonkey[1] to run crash consistency tests and generate the bug
> reports above. We'd be happy to guide you through setting up
> CrashMonkey and getting these bugs reproduced. However, if you want to
> be able to reproduce them with your current setup (xfstest +
> dm-flakey), I have the workload scripts attached to the end of the
> mail which might make your task simpler.
>
> Interestingly we seem to have found another bug that breaks rename
> atomicity and results in a previously fsynced file missing.
>
> Workload:
> 1. mkdir A
> 2. creat A/bar (*)
> 3. fsync A/bar
> 4. mkdir B
> 5. creat B/bar
> 6. rename B/bar A/bar
> 7. creat A/foo
> 8. fsync A/foo
> 9. fsync A

The original workloads have been easy to reproduce/fix so far. I want to make sure my patches aren't slowing things down and I'll send them out ~Monday. This one looks similar but I'll double check the rename handling and make sure I've got all the cases.

-chris
Re: Inconsistent behavior of fsync in btrfs
On 24 Apr 2018, at 20:35, Jayashree Mohan wrote:

> Hi,
>
> While investigating crash consistency bugs on btrfs, we came across
> workloads that demonstrate inconsistent behavior of fsync. Consider
> the following workload where fsync on the directory did not persist
> it. Only file B/foo gets persisted, and both A/foo and C/foo are
> missing. This seems like inconsistent behavior as only the first fsync
> persists the file, while all others don't seem to. Do you agree if
> this is indeed incorrect and needs fixing? All the above tests pass on
> ext4 and xfs. Please let us know what you feel about such
> inconsistency.

The btrfs fsync log is more fine grained than xfs/ext, but fsync(any file) should be enough to persist that file in its current directory. I'll get these reproduced and see if we can nail down why we're not logging the location properly.

-chris
Re: Periodic frame losses when recording to btrfs volume with OBS
On 01/22/2018 04:17 PM, Sebastian Ochmann wrote:
> Hello,
>
> I attached to the ffmpeg-mux process for a little while and pasted the
> result here:
>
> https://pastebin.com/XHaMLX8z
>
> Can you help me with interpreting this result? If you'd like me to run
> strace with specific options, please let me know. This is a level of
> debugging I'm not dealing with on a daily basis. :)

Going to guess it's these sequences:

lseek(3, 1302012898, SEEK_SET) = 1302012898
write(3, "\37C\266u\1\377\377\377\377\377\377\377\277\204|\271\347J\347\203\3@\0\243CY\202\0\0\0!\21"..., 262144) = 262144
write(3, "\310\22\323g7J#h\351\0\323\270\f\206\5\207(.\232\246\27\371/\376\341\0\0\200\th\3\37"..., 262144) = 262144
write(3, "\225*\245<8N\32\263\237k\331]\313\215\301\366$\7\216\0349\302AS\201\302\307T\361\365\3375"..., 262144) = 262144
write(3, "\272e\37\255\250\24n\235\341E\272Me\36'\345W\353\2337K.n\367\264\\\370\307\341_\206|"..., 262144) = 262144
write(3, ""..., 53271) = 53271
lseek(3, 1302012902, SEEK_SET) = 1302012902
write(3, "\1\0\0\0\0\20\320\v", 8) = 8
lseek(3, 1303114745, SEEK_SET) = 1303114745

It's seeking, writing, then jumping back and updating what had been written before. That's going to hit the stable page writing code in btrfs that I had mentioned earlier. At Facebook, we've been experimenting with fixes for this that are limited to O_APPEND slowly growing log files.

Let me think harder...

-chris
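The pattern in that strace — append large chunks, then seek back and patch a small length field — can be reproduced in a few lines. The file layout and field offset below are hypothetical, just to show the shape of the I/O that redirties an already-written page:

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.mkv")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)

SIZE_FIELD_OFF = 8                      # hypothetical length field offset
os.write(fd, b"HDR!HDR!")               # 8-byte header
os.write(fd, struct.pack("<Q", 0))      # length placeholder
payload = b"\xab" * 262144
os.write(fd, payload)                   # big sequential writes, like the strace

# Jump back and patch the length field: this redirties a page that may
# already be under writeback, which is what stable pages make us wait on.
os.lseek(fd, SIZE_FIELD_OFF, os.SEEK_SET)
os.write(fd, struct.pack("<Q", len(payload)))
os.close(fd)
```

If the page holding the length field is still in flight when the patch write arrives, the writer blocks in btrfs_start_ordered_extent, which shows up as the periodic stalls reported in this thread.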
Re: Periodic frame losses when recording to btrfs volume with OBS
On 01/22/2018 01:33 PM, Sebastian Ochmann wrote:

[ skipping to the traces ;) ]

2866 ffmpeg-mux D
[] btrfs_start_ordered_extent+0x101/0x130 [btrfs]
[] lock_and_cleanup_extent_if_need+0x340/0x380 [btrfs]
[] __btrfs_buffered_write+0x261/0x740 [btrfs]
[] btrfs_file_write_iter+0x20f/0x650 [btrfs]
[] __vfs_write+0xf9/0x170
[] vfs_write+0xad/0x1a0
[] SyS_write+0x52/0xc0
[] entry_SYSCALL_64_fastpath+0x1a/0x7d
[] 0x

This is where we wait for writes that are already in flight before we're allowed to redirty those pages in the file. It'll happen when we either overwrite a page in the file that we've already written, or when we're trickling down writes slowly in non-4K aligned writes.

You can probably figure out pretty quickly which is the case by stracing ffmpeg-mux. Since lower dirty ratios made it happen more often for you, my guess is the app is sending down unaligned writes.

-chris
Re: Periodic frame losses when recording to btrfs volume with OBS
On 01/20/2018 05:47 AM, Sebastian Ochmann wrote:

> Hello,
>
> I would like to describe a real-world use case where btrfs does not
> perform well for me. I'm recording 60 fps, larger-than-1080p video
> using OBS Studio [1] where it is important that the video stream is
> encoded and written out to disk in real-time for a prolonged period of
> time (2-5 hours). The result is a H264 video encoded on the GPU with a
> data rate ranging from approximately 10-50 MB/s. The hardware used is
> powerful enough to handle this task.
>
> When I use a XFS volume for recording, no matter whether it's a SSD or
> HDD, the recording is smooth and no frame drops are reported (OBS has
> a nice Stats window where it shows the number of frames dropped due to
> encoding lag which seemingly also includes writing the data out to
> disk). However, when using a btrfs volume I quickly observe severe,
> periodic frame drops. It's not single frames but larger chunks of
> frames that are dropped at a time. I tried mounting the volume with
> nobarrier but to no avail.
>
> Of course, the simple fix is to use a FS that works for me(TM).
> However I thought since this is a common real-world use case I'd
> describe the symptoms here in case anyone is interested in analyzing
> this behavior. It's not immediately obvious that the FS makes such a
> difference. Also, if anyone has an idea what I could try to mitigate
> this issue (mount or mkfs options?) I can try that.
>
> I saw this behavior on two different machines with kernels 4.14.13 and
> 4.14.5, both Arch Linux. btrfs-progs 4.14, OBS 20.1.3-241-gf5c3af1b
> built from git.

This could be a few different things, trying without the space cache was already suggested, and that's a top suspect. How does the application do the writes? Are they always 4K aligned or does it send them out in odd sizes?
The easiest way to nail it down is to use offcputime from the iovisor project:

https://github.com/iovisor/bcc/blob/master/tools/offcputime.py

If you haven't already configured this it may take a little while, but it's the perfect tool for this problem. Otherwise, if the stalls are long enough you can try to catch it with /proc/<pid>/stack. I've attached a helper script I often use to dump the stack trace of all the tasks in D state. Just run walker.py and it should give you something useful. You can use walker.py -a to see all the tasks instead of just D state.

This just walks /proc/<pid>/stack, so you'll need to run it as someone with permissions to see the stack traces of the procs you care about.

-chris

#!/usr/bin/env python3
#
# this walks all the tasks on the system and prints out a stack trace
# of any tasks waiting in D state.  If you pass -a, it will print out
# the stack of every task it finds.
#
# It also makes a histogram of the common stacks so you can see where
# more of the tasks are.
#

import sys
import os
from optparse import OptionParser

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-a", "--all_tasks", help="Dump all stacks",
                  default=False, action="store_true")
parser.add_option("-s", "--smaps", help="Dump /proc/pid/smaps",
                  default=False, action="store_true")
parser.add_option("-S", "--sort", help="/proc/pid/smaps sort key",
                  default="size", type="str")
parser.add_option("-p", "--pid", help="Filter on pid", default=None,
                  type="str")
parser.add_option("-c", "--command", help="Filter on command name",
                  default=None, type="str")
parser.add_option("-f", "--files", help="don't collapse open files",
                  default=False, action="store_true")
parser.add_option("-v", "--verbose", help="details++", default=False,
                  action="store_true")

(options, args) = parser.parse_args()

stacks = {}
maps = {}
lines = []

# parse the units from a number and normalize into KB
def parse_number(s):
    try:
        words = s.split()
        unit = words[-1].lower()
        number = int(words[1])
        tag = words[0].lower().rstrip(':')

        # we store in kb
        if unit == "mb":
            number = number * 1024
        elif unit == "gb":
            number = number * 1024 * 1024
        elif unit == "tb":
            number = number * 1024 * 1024 * 1024
        return (tag, number)
    except:
        return (None, None)

# pretty print a number in KB with units
def print_number(num):
    # we store the number in kb
    units = ['KB', 'MB', 'GB', 'TB']
    index = 0
    while num > 1024:
        num /= 1024
        index += 1
    final = float(num + num / 1024)
    return (final, units[index])

# find a given line in the record and pretty print it
def print_line(header, record, tag):
    num, unit = print_number(record[tag])
    if options.verbose:
        header = header + " "
    else:
        header = ""
    print("\t%s%s: %.2f %s" % (header, tag.capitalize(), num, unit))

# print all the lines we care about in a given record
def
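The attached walker.py is cut off in the archive. A condensed sketch of the same idea — walk /proc, keep tasks in D state, and histogram identical stack traces — is below; the function name and output format are mine, not the original script's:

```python
import os

def d_state_stacks(all_tasks=False):
    """Collect /proc/<pid>/stack for tasks in D state (or all tasks)
    and group identical stacks together, like walker.py does."""
    stacks = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                # the state field comes right after the "(comm)" field
                state = f.read().rsplit(")", 1)[1].split()[0]
            if not all_tasks and state != "D":
                continue
            with open("/proc/%s/stack" % pid) as f:
                stack = f.read()
        except (OSError, IndexError):
            continue  # task exited, or we lack permission for its stack
        stacks.setdefault(stack, []).append(pid)
    return stacks

if __name__ == "__main__":
    # most common stacks last, histogram style
    for stack, pids in sorted(d_state_stacks().items(),
                              key=lambda kv: len(kv[1])):
        print("%d task(s): %s\n%s" % (len(pids), ",".join(pids), stack))
```

As with the original script, /proc/<pid>/stack is only readable with sufficient privileges, so run it as root to see other users' tasks.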
Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)
On 12/20/2017 03:59 PM, Timofey Titovets wrote:

> How to reproduce:
> touch test_file
> chattr +C test_file
> dd if=/dev/zero of=test_file bs=1M count=1
> btrfs fi def -vrczlib test_file
> filefrag -v test_file
>
> Filesystem type is: 9123683e
> File size of test_file is 1048576 (256 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:  length: expected: flags:
>    0:    0..   31: 72917050.. 72917081:     32:           encoded
>    1:   32..   63: 72917118.. 72917149:     32: 72917082: encoded
>    2:   64..   95: 72919494.. 72919525:     32: 72917150: encoded
>    3:   96..  127: 72927576.. 72927607:     32: 72919526: encoded
>    4:  128..  159: 72943261.. 72943292:     32: 72927608: encoded
>    5:  160..  191: 72944929.. 72944960:     32: 72943293: encoded
>    6:  192..  223: 72944952.. 72944983:     32: 72944961: encoded
>    7:  224..  255: 72967084.. 72967115:     32: 72944984: last,encoded,eof
> test_file: 8 extents found
>
> I can't find right now where that error happens in the code, but it's
> reproducible on Linux 4.14.8

We'll silently cow in a few cases, this is one.

-chris
[PATCH] btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
refcounts have a generic implementation and an asm optimized one. The generic version has extra debugging to make sure that once a refcount goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping over the warnings when the generic refcounts are used. We ended up with this race:

Process A                            Process B
btrfs_get_delayed_node()
  spin_lock(root->inode_lock)
  radix_tree_lookup()
                                     __btrfs_release_delayed_node()
                                       refcount_dec_and_test(&delayed_node->refs)
                                       our refcount is now zero
  refcount_add(2) <---
  warning here, refcount unchanged
                                       spin_lock(root->inode_lock)
                                       radix_tree_delete()

With the generic refcounts, we actually warn again when process B above tries to release his refcount because refcount_add() turned into a no-op.

We saw this in production on older kernels without the asm optimized refcounts. The fix used here is to use refcount_inc_not_zero() to detect when the object is in the middle of being freed and return NULL. This is almost always the right answer anyway, since we usually end up pitching the delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra check for zero refcounts before radix tree deletion. btrfs_get_delayed_node() was the only path that was allowing refcounts to go from zero to one.
Signed-off-by: Chris Mason <c...@fb.com>
Fixes: 6de5f18e7b0da
cc: <sta...@vger.kernel.org> #4.12+
---
 fs/btrfs/delayed-inode.c | 45 ++---
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79..84c54af 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -87,6 +87,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(

 	spin_lock(&root->inode_lock);
 	node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
+
 	if (node) {
 		if (btrfs_inode->delayed_node) {
 			refcount_inc(&node->refs);	/* can be accessed */
@@ -94,9 +95,30 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
 			spin_unlock(&root->inode_lock);
 			return node;
 		}
-		btrfs_inode->delayed_node = node;
-		/* can be accessed and cached in the inode */
-		refcount_add(2, &node->refs);
+
+		/* it's possible that we're racing into the middle of
+		 * removing this node from the radix tree.  In this case,
+		 * the refcount was zero and it should never go back
+		 * to one.  Just return NULL like it was never in the radix
+		 * at all; our release function is in the process of removing
+		 * it.
+		 *
+		 * Some implementations of refcount_inc refuse to
+		 * bump the refcount once it has hit zero.  If we don't do
+		 * this dance here, refcount_inc() may decide to
+		 * just WARN_ONCE() instead of actually bumping the refcount.
+		 *
+		 * If this node is properly in the radix, we want to
+		 * bump the refcount twice, once for the inode
+		 * and once for this get operation.
+		 */
+		if (refcount_inc_not_zero(&node->refs)) {
+			refcount_inc(&node->refs);
+			btrfs_inode->delayed_node = node;
+		} else {
+			node = NULL;
+		}
+
 		spin_unlock(&root->inode_lock);
 		return node;
 	}
@@ -254,17 +276,18 @@ static void __btrfs_release_delayed_node(
 	mutex_unlock(&delayed_node->mutex);

 	if (refcount_dec_and_test(&delayed_node->refs)) {
-		bool free = false;
 		struct btrfs_root *root = delayed_node->root;
+
 		spin_lock(&root->inode_lock);
-		if (refcount_read(&delayed_node->refs) == 0) {
-			radix_tree_delete(&root->delayed_nodes_tree,
-					  delayed_node->inode_id);
-			free = true;
-		}
+		/*
+		 * once our refcount goes to zero, nobody is allowed to
+		 * bump it back up.  We can delete it now
+		 */
+		ASSERT(refcount_read(&delayed_node->refs) == 0);
+		radix_tree_delete(&root->delayed_nodes_tree,
+				  delayed_node->inode_id);
 		spin_unlock(&root->inode_lock);
-		if (free)
-
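The rule the patch enforces — a refcount that has hit zero must never be resurrected, so lookup paths use inc_not_zero() and treat failure as "not found" — can be modeled in userspace. This is only an illustration: a plain lock stands in for the kernel's atomic refcount_t, and the names are stand-ins for the btrfs code:

```python
import threading

class RefCount:
    """Toy model of refcount_t: once it hits zero it must never go back up."""
    def __init__(self, n=1):
        self._lock = threading.Lock()
        self.count = n

    def inc_not_zero(self):
        with self._lock:
            if self.count == 0:
                return False      # object is mid-free; caller must back off
            self.count += 1
            return True

    def inc(self):
        with self._lock:
            assert self.count > 0, "resurrecting a dead refcount"
            self.count += 1

    def dec_and_test(self):
        with self._lock:
            self.count -= 1
            return self.count == 0

def get_delayed_node(refs):
    """Shape of the fixed lookup: two refs on success, NULL on a race."""
    if refs.inc_not_zero():
        refs.inc()                # one ref for the inode, one for the caller
        return refs
    return None                   # racing with release: act like a cache miss
```

The buggy version did the equivalent of an unconditional add-by-two, which is exactly what the generic refcount implementation warns about once the count has reached zero.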
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/30/2017 12:23 PM, David Sterba wrote:

> On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:
>> On 11/29/2017 12:05 PM, Tejun Heo wrote:
>>> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>>>> What has happened with this patch set?
>>>>
>>>> No idea. cc'ing Chris directly. Chris, if the patchset looks good,
>>>> can you please route them through the btrfs tree?
>>>
>>> lol looking at the patchset again, I'm not sure that's obviously the
>>> right tree. It can either be cgroup, block or btrfs. If no one
>>> objects, I'll just route them through cgroup.
>>
>> We'll have to coordinate a bit during the next merge window but I
>> don't have a problem with these going in through cgroup. Dave does
>> this sound good to you?
>
> There are only minor changes to btrfs code so cgroup tree would be
> better.
>
>> I'd like to include my patch to do all crcs inline (instead of
>> handing off to helper threads) when io controls are in place. By the
>> merge window we should have some good data on how much it's all
>> helping.
>
> Are there any problems in sight if the inline crc and cgroup changes
> go separately? I assume there's a runtime dependency, not a code
> dependency, so it could be sorted by the right merge order.

The feature is just more useful with the inline crcs. Without them we end up with kworkers doing both high and low prio submissions and it all boils down to the speed of the lowest priority.

-chris
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/29/2017 12:05 PM, Tejun Heo wrote:

> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>> What has happened with this patch set?
>>
>> No idea. cc'ing Chris directly. Chris, if the patchset looks good,
>> can you please route them through the btrfs tree?
>
> lol looking at the patchset again, I'm not sure that's obviously the
> right tree. It can either be cgroup, block or btrfs. If no one
> objects, I'll just route them through cgroup.

We'll have to coordinate a bit during the next merge window but I don't have a problem with these going in through cgroup. Dave does this sound good to you?

I'd like to include my patch to do all crcs inline (instead of handing off to helper threads) when io controls are in place. By the merge window we should have some good data on how much it's all helping.

-chris
Re: [PATCH] btrfs: fix deadlock when writing out space cache
On 11/16/2017 03:09 AM, Nikolay Borisov wrote:

> On 15.11.2017 23:20, Josef Bacik wrote:
>> From: Josef Bacik
>>
>> If we fail to prepare our pages for whatever reason (out of memory in
>> our case) we need to make sure to drop the block_group->data_rwsem,
>> otherwise hilarity ensues.
>>
>> Signed-off-by: Josef Bacik
>> ---
>>  fs/btrfs/free-space-cache.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index cdc9f4015ec3..a6c643275210 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -1263,8 +1263,12 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>>
>>  	/* Lock all pages first so we can lock the extent safely. */
>>  	ret = io_ctl_prepare_pages(io_ctl, inode, 0);
>> -	if (ret)
>> +	if (ret) {
>> +		if (block_group &&
>> +		    (block_group->flags & BTRFS_BLOCK_GROUP_DATA))
>> +			up_write(&block_group->data_rwsem);
>>  		goto out;
>> +	}
>
> Which function after the out: label causes a deadlock -
> btrfs_update_inode (unlikely) or invalidate_inode_pages2?

Neither, out: just doesn't drop the data_rwsem mutex, so it leaves the block group locked.

-chris
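The bug class here is an error path that jumps past the unlock, leaving the lock held forever. A userspace sketch of the shape of the fix (the names are stand-ins for the kernel code, not real btrfs interfaces):

```python
import threading

data_rwsem = threading.Lock()        # stand-in for block_group->data_rwsem

def write_out_cache(prepare_ok):
    """Model of __btrfs_write_out_cache's locking around page prepare."""
    data_rwsem.acquire()             # taken earlier, like the caller's lock
    ret = 0 if prepare_ok else -12   # -ENOMEM from the page-prepare step
    if ret:
        data_rwsem.release()         # the fix: drop the lock before bailing
        return ret
    try:
        pass                         # ... write out the cache pages ...
    finally:
        data_rwsem.release()
    return 0
```

Without the release on the error return, the next task to take data_rwsem blocks forever, which is the deadlock the patch title describes.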
Re: [PATCH] btrfs: fix deadlock when writing out space cache
On 11/15/2017 06:46 PM, Liu Bo wrote:

> On Wed, Nov 15, 2017 at 04:20:52PM -0500, Josef Bacik wrote:
>> From: Josef Bacik
>>
>> If we fail to prepare our pages for whatever reason (out of memory in
>> our case) we need to make sure to drop the block_group->data_rwsem,
>> otherwise hilarity ensues.

Thanks Josef, I searched all the logs and it looks like we've really only hit this twice this month. It's surprising we haven't seen this more given how often we OOM.

-chris
[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. Herbert had asked about the crypto patch when we discussed the pull, but I didn't realize he really meant not-right-now. I've rebased it out of this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week. Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
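As a sanity check, the Ratio and MB/s columns above are consistent with dividing the 211,988,480 B input by the compressed size and by the wall time (using decimal megabytes). A quick recomputation for a few rows:

```python
uncompressed = 211988480              # silesia.tar, bytes

# (compressed size in bytes, time in seconds) from the table above
table = {
    "zstd -1": (73645762, 1.044),
    "zstd -3": (66988878, 1.761),
    "zlib -1": (77260026, 2.895),
    "zlib -9": (67613382, 22.554),
}

for method, (size, secs) in table.items():
    ratio = uncompressed / size       # e.g. zstd -1 -> 2.878
    mb_s = uncompressed / secs / 1e6  # e.g. zstd -1 -> 203.05 MB/s
    print("%-8s ratio %.3f  %.2f MB/s" % (method, ratio, mb_s))
```

The recomputed values match the table's Ratio and MB/s columns, which confirms those columns are derived from the 211,988,480-byte input.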
Nick Terrell (3) commits (+14222/-12):
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig       |   2 +
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/compression.c |   1 +
 fs/btrfs/compression.h |   6 +-
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/disk-io.c     |   2 +
 fs/btrfs/ioctl.c       |   6 +-
 fs/btrfs/props.c       |   6 +
 fs/btrfs/super.c       |  12 +-
 fs/btrfs/sysfs.c       |   2 +
 fs/btrfs/zstd.c        | 432 ++
 fs/squashfs/Kconfig    |  14 +
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:

> On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:
>>  crypto/Kconfig   |   9 +
>>  crypto/Makefile  |   1 +
>>  crypto/testmgr.c |  10 +
>>  crypto/testmgr.h |  71 +
>>  crypto/zstd.c    | 265
>
> Is there anyone going to use zstd through the crypto API? If not then
> I don't see the point in adding it at this point. Especially as the
> compression API is still in a state of flux.

That part was requested by intel, but I'm happy to leave it out for another time. The rest of the patch series doesn't depend on it at all.

-chris
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On 09/08/2017 03:33 PM, Chris Mason wrote:

> Hi Linus,
>
> Nick Terrell's patch series to add zstd support to the kernel has been
> floating around for a while. After talking with Dave Sterba, Herbert
> and Phillip, we decided to send the whole thing in as one pull
> request.
>
> I have it in my zstd branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd
>
> There's a trivial conflict with the main btrfs pull that Dave Sterba
> just sent. His pull deletes BTRFS_COMPRESS_LAST in
> fs/btrfs/compression.h, and I've put the sample resolution in a branch
> named zstd-4.14-merge. My idea was that you'd take our main btrfs pull
> first and this one second, but the conflicts are small enough it's not
> a big deal.
>
> zstd is a big win in speed over zlib and in compression ratio over
> lzo, and the compression team here at FB has gotten great results
> using it in production. Nick will continue to update the kernel side
> with new improvements from the open source zstd userland code.

Just to clarify, we've been testing the kernel side of this here at FB, but our zstd use in prod is limited to the application side.

-chris
[GIT PULL] zstd support (lib, btrfs, squashfs)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
Nick Terrell (4) commits (+14578/-12):
    crypto: Add zstd support (+356/-0)
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (5) commits (+14756/-12)
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote:
> On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:
>> Quite a few applications actually _do_ have some degree of secondary
>> verification or protection from a crash. Go look at almost any
>> database software.
>
> Then please give proper references for this! This is from 2015, where
> you claimed this already and I looked up all the bigger DBs and they
> either couldn't do it at all, didn't do it per default, or it required
> application support (i.e. from the programs using the DB):
> https://www.spinics.net/lists/linux-btrfs/msg50258.html
>
>> It usually will not have checksumming, but it will almost always have
>> support for a journal, which is enough to cover the particular data
>> loss scenario we're talking about (unexpected unclean shutdown).
>
> I don't think we talk about this: We talk about people wanting
> checksumming to notice e.g. silent data corruption. The crash case is
> only the corner case about what happens then if data is written
> correctly but csums not.

We use the crcs to catch storage gone wrong, both in terms of simple things like cabling, bus errors, drives gone crazy or exotic problems like every time I reboot the box a handful of sectors return EFI partition table headers instead of the data I wrote. You don't need data center scale for this to happen, but it does help...

So, we do catch crc errors in prod and they do keep us from replicating bad data over good data. Some databases also crc, and all drives have correction bits of some kind. There's nothing wrong with crcs happening at lots of layers.

Btrfs couples the crcs with COW because it's the least complicated way to protect against:

* bits flipping
* IO getting lost on the way to the drive, leaving stale but valid data in place
* IO from sector A going to sector B instead, overwriting valid data with other valid data.
It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps.

In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data.

My real goal is to make COW fast enough that we can leave it on for the database applications too. Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc
On 08/11/2017 10:44 AM, Chris Mason wrote:
> Hmpf, forgot to put the sha in Linus' tree:
>
> 17024ad0a0fdfcfe53043afb969b813d3e020c21

And Nikolay just reminded me this is already in Greg's queue. Whoops.

-chris
Re: [PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc
Hmpf, forgot to put the sha in Linus' tree:

17024ad0a0fdfcfe53043afb969b813d3e020c21

-chris

On 08/11/2017 10:41 AM, Chris Mason wrote:
> From: Omar Sandoval <osan...@fb.com>
>
> If a lot of metadata is reserved for outstanding delayed allocations, we
> rely on shrink_delalloc() to reclaim metadata space in order to fulfill
> reservation tickets. However, shrink_delalloc() has a shortcut where if
> it determines that space can be overcommitted, it will stop early. This
> made sense before the ticketed enospc system, but now it means that
> shrink_delalloc() will often not reclaim enough space to fulfill any
> tickets, leading to an early ENOSPC. (Reservation tickets don't care
> about being able to overcommit, they need every byte accounted for.)
>
> Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
> all of the metadata it is supposed to. This fixes early ENOSPCs we were
> seeing when doing a btrfs receive to populate a new filesystem, as well
> as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
>
> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
> Tested-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
> Cc: sta...@vger.kernel.org
> Reviewed-by: Josef Bacik <jba...@fb.com>
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> Signed-off-by: David Sterba <dste...@suse.com>
> Signed-off-by: Chris Mason <c...@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 33d979e9..83eecd3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
>  		else
>  			flush = BTRFS_RESERVE_NO_FLUSH;
>  		spin_lock(&space_info->lock);
> -		if (can_overcommit(root, space_info, orig, flush)) {
> -			spin_unlock(&space_info->lock);
> -			break;
> -		}
>  		if (list_empty(&space_info->tickets) &&
>  		    list_empty(&space_info->priority_tickets)) {
>  			spin_unlock(&space_info->lock);
Re: FAILED: patch "[PATCH] Btrfs: fix early ENOSPC due to delalloc" failed to apply to 4.12-stable tree
On 08/04/2017 03:29 PM, Christoph Anton Mitterer wrote:
> Hey.
>
> Could someone of the devs put some attention on this...?
>
> Thanks,
> Chris :-)

Done, you can also grab it here:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-stable-4.12&id=354850ad1948af13248031e5180d495044d05aa5

-chris

> On Mon, 2017-07-31 at 18:06 -0700, gre...@linuxfoundation.org wrote:
>> The patch below does not apply to the 4.12-stable tree.
>> If someone wants it applied there, or to any other stable or longterm
>> tree, then please email the backport, including the original git
>> commit id to.
>>
>> thanks,
>>
>> greg k-h
>>
>> -- original commit in Linus's tree --
>>
>> From 17024ad0a0fdfcfe53043afb969b813d3e020c21 Mon Sep 17 00:00:00 2001
>> From: Omar Sandoval
>> Date: Thu, 20 Jul 2017 15:10:35 -0700
>> Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc
>>
>> If a lot of metadata is reserved for outstanding delayed allocations,
>> we rely on shrink_delalloc() to reclaim metadata space in order to
>> fulfill reservation tickets. However, shrink_delalloc() has a shortcut
>> where if it determines that space can be overcommitted, it will stop
>> early. This made sense before the ticketed enospc system, but now it
>> means that shrink_delalloc() will often not reclaim enough space to
>> fulfill any tickets, leading to an early ENOSPC. (Reservation tickets
>> don't care about being able to overcommit, they need every byte
>> accounted for.)
>>
>> Fix it by getting rid of the shortcut so that shrink_delalloc()
>> reclaims all of the metadata it is supposed to. This fixes early
>> ENOSPCs we were seeing when doing a btrfs receive to populate a new
>> filesystem, as well as early ENOSPCs Christoph saw when doing a big
>> cp -r onto Btrfs.
>>
>> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
>> Tested-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
>> Cc: sta...@vger.kernel.org
>> Reviewed-by: Josef Bacik
>> Signed-off-by: Omar Sandoval
>> Signed-off-by: David Sterba
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a6635f07b8f1..e3b0b4196d3d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4825,10 +4825,6 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>>  		else
>>  			flush = BTRFS_RESERVE_NO_FLUSH;
>>  		spin_lock(&space_info->lock);
>> -		if (can_overcommit(fs_info, space_info, orig, flush, false)) {
>> -			spin_unlock(&space_info->lock);
>> -			break;
>> -		}
>>  		if (list_empty(&space_info->tickets) &&
>>  		    list_empty(&space_info->priority_tickets)) {
>>  			spin_unlock(&space_info->lock);
[PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc
From: Omar Sandoval <osan...@fb.com>

If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Cc: sta...@vger.kernel.org
Reviewed-by: Josef Bacik <jba...@fb.com>
Signed-off-by: Omar Sandoval <osan...@fb.com>
Signed-off-by: David Sterba <dste...@suse.com>
Signed-off-by: Chris Mason <c...@fb.com>
---
 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..83eecd3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 	else
 		flush = BTRFS_RESERVE_NO_FLUSH;
 	spin_lock(&space_info->lock);
-	if (can_overcommit(root, space_info, orig, flush)) {
-		spin_unlock(&space_info->lock);
-		break;
-	}
 	if (list_empty(&space_info->tickets) &&
 	    list_empty(&space_info->priority_tickets)) {
 		spin_unlock(&space_info->lock);
--
2.9.3
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:25 PM, Hugo Mills wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> These benchmarks are misleading because they compress the whole file
>>> as a single stream without resetting the dictionary, which isn't how
>>> data will typically be compressed in kernel mode. With filesystem
>>> compression the data has to be divided into small chunks that can
>>> each be decompressed independently. That eliminates one of the
>>> primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as
>> well. The numbers were in line with what Nick is posting here. zstd
>> is a big win over both lzo and zlib from a btrfs point of view.
>>
>> It's true Nick's patches only support a single compression level in
>> btrfs, but that's because btrfs doesn't have a way to pass in the
>> compression ratio. It could easily be a mount option, it was just
>> outside the scope of Nick's initial work.
>
> Could we please not add more mount options? I get that they're easy to
> implement, but it's a very blunt instrument. What we tend to see (with
> both nodatacow and compress) is people using the mount options, then
> asking for exceptions, discovering that they can't do that, and then
> falling back to doing it with attributes or btrfs properties. Could we
> just start with btrfs properties this time round, and cut out the mount
> option part of this cycle.
>
> In the long run, it'd be great to see most of the btrfs-specific mount
> options get deprecated and ultimately removed entirely, in favour of
> attributes/properties, where feasible.

It's a good point, and as was commented later down I'd just do mount -o compress=zstd:3 or something. But I do prefer properties in general for this. My big point was just that next step is outside of Nick's scope.
-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:00 PM, Eric Biggers wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>>>> The memory reported is the amount of memory the compressor requests.
>>>>
>>>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>>>> |----------|-----------|----------|-------|---------|----------|----------|
>>>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>>>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>>>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>>>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>>>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>>>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>>>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>>>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>>>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>>>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>>>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>>>
>>> These benchmarks are misleading because they compress the whole file
>>> as a single stream without resetting the dictionary, which isn't how
>>> data will typically be compressed in kernel mode. With filesystem
>>> compression the data has to be divided into small chunks that can
>>> each be decompressed independently. That eliminates one of the
>>> primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as
>> well. The numbers were in line with what Nick is posting here. zstd
>> is a big win over both lzo and zlib from a btrfs point of view.
>>
>> It's true Nick's patches only support a single compression level in
>> btrfs, but that's because btrfs doesn't have a way to pass in the
>> compression ratio. It could easily be a mount option, it was just
>> outside the scope of Nick's initial work.
> I am not surprised --- Zstandard is closer to the state of the art,
> both format-wise and implementation-wise, than the other choices in
> BTRFS. My point is that benchmarks need to account for how much data is
> compressed at a time. This is a common mistake when comparing different
> compression algorithms; the algorithm name and compression level do not
> tell the whole story. The dictionary size is extremely significant.
>
> No one is going to compress or decompress a 200 MB file as a single
> stream in kernel mode, so it does not make sense to justify adding
> Zstandard *to the kernel* based on such a benchmark. It is going to be
> divided into chunks. How big are the chunks in BTRFS? I thought that it
> compressed only one page (4 KiB) at a time, but I hope that has been,
> or is being, improved; 32 KiB - 128 KiB should be a better amount. (And
> if the amount of data compressed at a time happens to be different
> between the different algorithms, note that BTRFS benchmarks are likely
> to be measuring that as much as the algorithms themselves.)

Btrfs hooks the compression code into the delayed allocation mechanism we use to gather large extents for COW. So if you write 100MB to a file, we'll have 100MB to compress at a time (within the limits of the amount of pages we allow to collect before forcing it down).

But we want to balance how much memory you might need to uncompress during random reads. So we have an artificial limit of 128KB that we send at a time to the compression code. It's easy to change this, it's just a tradeoff made to limit the cost of reading small bits. It's the same for zlib, lzo and the new zstd patch.

-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 04:30 AM, Eric Biggers wrote:
> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>> The memory reported is the amount of memory the compressor requests.
>>
>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>> |----------|-----------|----------|-------|---------|----------|----------|
>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>
> These benchmarks are misleading because they compress the whole file as
> a single stream without resetting the dictionary, which isn't how data
> will typically be compressed in kernel mode. With filesystem
> compression the data has to be divided into small chunks that can each
> be decompressed independently. That eliminates one of the primary
> advantages of Zstandard (support for large dictionary sizes).

I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.

-chris
Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range
On 08/03/2017 11:25 AM, Wang Shilong wrote:
> On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason <c...@fb.com> wrote:
>> On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:
>>> From: Filipe Manana <fdman...@suse.com>
>>>
>>> If the range being cleared was not marked for defrag and we are not
>>> about to clear the range from the defrag status, we don't need to
>>> lock and unlock the inode.
>>>
>>> Signed-off-by: Filipe Manana <fdman...@suse.com>
>>
>> Thanks Filipe, looks like it goes all the way back to:
>>
>> commit 47059d930f0e002ff851beea87d738146804726d
>> Author: Wang Shilong <wangsl.f...@cn.fujitsu.com>
>> Date:   Thu Jul 3 18:22:07 2014 +0800
>>
>>     Btrfs: make defragment work with nodatacow option
>>
>> I can't see how the inode lock is required here.
>
> This blames to me, thanks for fixing it.

No blame ;) I'll take code that works any day.

-chris
Re: [PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
On 08/03/2017 08:44 AM, Nikolay Borisov wrote:
> Currently the code checks whether we should do data checksumming in
> btrfs_submit_direct and the boolean result of this check is passed to
> btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio
> which actually consumes it. The last function actually has all the
> necessary context to figure out whether to skip the check or not, so
> let's move the check closer to where it's being consumed. No
> functional changes.

I like it, thanks.

Reviewed-by: Chris Mason <c...@fb.com>

-chris
Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range
On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:
> From: Filipe Manana <fdman...@suse.com>
>
> If the range being cleared was not marked for defrag and we are not
> about to clear the range from the defrag status, we don't need to lock
> and unlock the inode.
>
> Signed-off-by: Filipe Manana <fdman...@suse.com>

Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong <wangsl.f...@cn.fujitsu.com>
Date:   Thu Jul 3 18:22:07 2014 +0800

    Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.

Reviewed-by: Chris Mason <c...@fb.com>

-chris

> ---
>  fs/btrfs/inode.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eb495e956d53..51c45c0a8553 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void *private_data,
>  	u64 len = state->end + 1 - state->start;
>  	u32 num_extents = count_max_extents(len);
>
> -	spin_lock(&inode->lock);
> -	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
> +	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) {
> +		spin_lock(&inode->lock);
>  		inode->defrag_bytes -= len;
> -	spin_unlock(&inode->lock);
> +		spin_unlock(&inode->lock);
> +	}
>
>  	/*
>  	 * set_bit and clear bit hooks normally require _irqsave/restore
Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
On 08/01/2017 01:39 PM, Austin S. Hemmelgarn wrote:
> On 2017-08-01 13:25, Roman Mamedov wrote:
>> On Tue, 1 Aug 2017 10:14:23 -0600 Liu Bo wrote:
>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>
>> Could it be possible to designate areas on the in-array devices to be
>> used as journal? While md doesn't have much spare room in its metadata
>> for extraneous things like this, Btrfs could use almost as much as it
>> wants to, adding to size of the FS metadata areas. Reliability-wise,
>> the log could be stored as RAID1 chunks.
>>
>> It doesn't seem convenient to need having an additional storage device
>> around just for the log, and also needing to maintain its fault
>> tolerance yourself (so the log device would better be on a mirror,
>> such as mdadm RAID1? more expense and maintenance complexity).
>
> I agree, MD pretty much needs a separate device simply because they
> can't allocate arbitrary space on the other array members. BTRFS can do
> that though, and I would actually think that that would be _easier_ to
> implement than having a separate device. That said, I do think that it
> would need to be a separate chunk type, because things could get really
> complicated if the metadata is itself using a parity raid profile.

Thanks for running with this Liu, I'm reading through all the patches. I do agree that it's better to put the logging into a dedicated chunk type, that way we can have it default to either double or triple mirroring.

-chris
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/02/2017 04:38 AM, Brendan Hide wrote:
> The title seems alarmist to me - and I suspect it is going to be
> misconstrued. :-/

Supporting any filesystem is a huge amount of work. I don't have a problem with Redhat or any distro picking and choosing the projects they want to support.

At least inside of FB, our own internal btrfs usage is continuing to grow. Btrfs is becoming a big part of how we ship containers and other workloads where snapshots improve performance. We also heavily use XFS, so I'm happy to see RH's long standing investment there continue.

-chris
Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode
On 07/24/2017 03:06 PM, Austin S. Hemmelgarn wrote:
> On 2017-07-24 14:53, Chris Mason wrote:
>> On 07/24/2017 02:41 PM, David Sterba wrote:
>>> would it be ok for you to keep ssd working as before? I'd really
>>> like to get this patch merged soon because "do not use ssd mode for
>>> ssd" has started to be the recommended workaround. Once this sticks,
>>> we won't need to have any ssd mode anymore ...
>>
>> Works for me. I do want to make sure that commits in this area
>> include the workload they were targeting, how they were measured and
>> what impacts they had. That way when we go back to try and change
>> this again we'll understand what profiles we want to preserve.
>
> Just thinking long term here, but might it make sense to (eventually)
> allow the user to tune how big a chunk of space to look for? I know
> that ext* have options to do this kind of thing, and I think XFS does
> too (but they do it at filesystem creation time), and I do know people
> who make use of those to make sure things are working at their
> absolute best.

Agreed, we have space in the on-disk format to record those preferences, but we haven't done it in the past.

-chris
Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode
On 07/24/2017 02:41 PM, David Sterba wrote:
> On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
>> On 07/24/2017 10:25 AM, David Sterba wrote:
>>> Thanks for the extensive historical summary, this change really
>>> deserves it. Decoupling the assumptions about the device's block
>>> management is really a good thing, mount option 'ssd' should mean
>>> that the device just has cheap seeks. Moving the allocation tweaks
>>> to ssd_spread provides a way to keep the behaviour for anybody who
>>> wants it.
>>>
>>> I'd like to push this change to 4.13-rc3, as I don't think we need
>>> more time to let other users to test this. The effects of current
>>> ssd implementation have been debated and debugged on IRC for a long
>>> time.
>>
>> The description is great, but I'd love to see many more benchmarks.
>> At Facebook we use the current ssd_spread mode in production on top
>> of hardware raid5/6 (spinning storage) because it dramatically
>> reduces the read/modify/write cycles done for metadata writes.
>
> Well, I think this is an example that ssd got misused because of the
> side effects of the allocation. If you observe good patterns for
> raid5, then the allocator should be adapted for that case, otherwise
> ssd/ssd_spread should be independent of the raid level.

Absolutely. The optimizations that made ssd_spread useful for first generation flash are the same things that raid5/6 need. Big writes, or said differently a minimum size for fast writes.

>> If we're going to play around with these, we need a good way to
>> measure free space fragmentation as part of benchmarks, as well as
>> the IO patterns coming out of the allocator.
>
> Hans has a tool that visualizes the fragmentation. Most complaints
> I've seen were about 'ssd' itself, excessive fragmentation, early
> ENOSPC. Not many people use ssd_spread, 'ssd' gets turned on
> automatically so it has much wider impact.

At least for our uses, ssd_spread matters much more for metadata than data (the data writes are large and metadata is small).

> From the changes overview:
>
>> 1. Throw out the current ssd_spread behaviour.
>
> would it be ok for you to keep ssd working as before? I'd really like
> to get this patch merged soon because "do not use ssd mode for ssd"
> has started to be the recommended workaround. Once this sticks, we
> won't need to have any ssd mode anymore ...

Works for me. I do want to make sure that commits in this area include the workload they were targeting, how they were measured and what impacts they had. That way when we go back to try and change this again we'll understand what profiles we want to preserve.

-chris
Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode
On 07/24/2017 10:25 AM, David Sterba wrote:
> Thanks for the extensive historical summary, this change really
> deserves it. Decoupling the assumptions about the device's block
> management is really a good thing, mount option 'ssd' should mean that
> the device just has cheap seeks. Moving the allocation tweaks to
> ssd_spread provides a way to keep the behaviour for anybody who wants
> it.
>
> I'd like to push this change to 4.13-rc3, as I don't think we need
> more time to let other users to test this. The effects of current ssd
> implementation have been debated and debugged on IRC for a long time.

The description is great, but I'd love to see many more benchmarks. At Facebook we use the current ssd_spread mode in production on top of hardware raid5/6 (spinning storage) because it dramatically reduces the read/modify/write cycles done for metadata writes.

If we're going to play around with these, we need a good way to measure free space fragmentation as part of benchmarks, as well as the IO patterns coming out of the allocator. At least for our uses, ssd_spread matters much more for metadata than data (the data writes are large and metadata is small).

Thanks again,
Chris
Re: [PULL] Btrfs for 4.13, part 1
On 06/23/2017 11:16 AM, David Sterba wrote:
> Hi,
>
> this is the main batch for 4.13. There are some user visible changes,
> see below. The core updates improve error handling (mostly related to
> bios), with the usual incremental work on the GFP_NOFS (mis)use
> removal. All patches have been in for-next for an extensive amount of
> time. There will be followups, but I want to push the series (111
> patches) forward.
>
> There are also some updates to adjacent subsystems (writeback and
> block layer), so I want to give some stable point for merging in the
> upcoming weeks.

Thanks Dave, I ran this (along with the updates we added) through a long stress and the usual xfstests. For everyone else on the list, since I'm heading off to vacation until ~July 9th, Dave is sending this off to Linus once the merge window starts.

-chris
[PATCH v2] btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes
everyone that uses the results of that math to u64 as well.

Reported-by: Dave Jones <da...@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <c...@fb.com>
---
v1->v2: Fixed (int) cast in calc_reclaim_items_nr (thanks Holger Hoffstätte)

 fs/btrfs/dev-replace.c  |  4 ++--
 fs/btrfs/extent-tree.c  | 12 ++--
 fs/btrfs/ioctl.c        |  2 +-
 fs/btrfs/ordered-data.c | 17 -
 fs/btrfs/ordered-data.h |  4 ++--
 fs/btrfs/qgroup.c       |  2 +-
 fs/btrfs/relocation.c   |  2 +-
 fs/btrfs/scrub.c        |  2 +-
 fs/btrfs/super.c        |  2 +-
 fs/btrfs/transaction.c  |  2 +-
 10 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5fe1ca8..bee3ede 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -388,7 +388,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		btrfs_err(fs_info, "kobj add dev failed %d", ret);
 
-	btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
 	/* force writing the updated state information to disk */
 	trans = btrfs_start_transaction(root, 0);
@@ -507,7 +507,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
-	btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..705247a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4238,7 +4238,7 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 		if (need_commit > 0) {
 			btrfs_start_delalloc_roots(fs_info, 0, -1);
-			btrfs_wait_ordered_roots(fs_info, -1, 0,
+			btrfs_wait_ordered_roots(fs_info, U64_MAX, 0,
 						 (u64)-1);
 		}
@@ -4698,14 +4698,14 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_fs_info *fs_info,
 	}
 }
 
-static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
+static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
 					u64 to_reclaim)
 {
 	u64 bytes;
-	int nr;
+	u64 nr;
 
 	bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-	nr = (int)div64_u64(to_reclaim, bytes);
+	nr = div64_u64(to_reclaim, bytes);
 	if (!nr)
 		nr = 1;
 	return nr;
@@ -4725,15 +4725,15 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 	struct btrfs_trans_handle *trans;
 	u64 delalloc_bytes;
 	u64 max_reclaim;
+	u64 items;
 	long time_left;
 	unsigned long nr_pages;
 	int loops;
-	int items;
 	enum btrfs_reserve_flush_enum flush;
 
 	/* Calc the number of the pages we need flush for space reservation */
 	items = calc_reclaim_items_nr(fs_info, to_reclaim);
-	to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM;
+	to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	block_rsv = &fs_info->delalloc_block_rsv;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..7712217 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -689,7 +689,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (ret)
 		goto dec_and_free;
 
-	btrfs_wait_ordered_extents(root, -1, 0, (u64)-1);
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
 	btrfs_init_block_rsv(&pending_snapshot->block_rsv, BTRFS_BLOCK_RSV_TEMP);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 7b40e2e..a3aca49 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -663,7 +663,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
  * wait for all the ordered extents in a root.  This is done when balancing
  * space between drives.
  */
-int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
+u64 bt
Re: [PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr
On 06/23/2017 11:29 AM, Holger Hoffstätte wrote:
> On 06/23/17 16:32, Chris Mason wrote:
> [..]
>> -static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
>> +static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
>>  					u64 to_reclaim)
>>  {
>>  	u64 bytes;
>> -	int nr;
>> +	u64 nr;
>>
>>  	bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
>>  	nr = (int)div64_u64(to_reclaim, bytes);
>        ^^
> Isn't this a bit odd? I can't even tell whether it matters, just
> randomly noticed it because I genuinely dislike type casts..

Crud, yes that's broken.  Thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes
everyone that uses the results of that math to u64 as well.

Reported-by: Dave Jones <da...@codemonkey.org.uk>
Fixes: 70e7af244f24c94604ef6eca32ad297632018583
Signed-off-by: Chris Mason <c...@fb.com>
---
 fs/btrfs/dev-replace.c  |  4 ++--
 fs/btrfs/extent-tree.c  | 10 +-
 fs/btrfs/ioctl.c        |  2 +-
 fs/btrfs/ordered-data.c | 17 -
 fs/btrfs/ordered-data.h |  4 ++--
 fs/btrfs/qgroup.c       |  2 +-
 fs/btrfs/relocation.c   |  2 +-
 fs/btrfs/scrub.c        |  2 +-
 fs/btrfs/super.c        |  2 +-
 fs/btrfs/transaction.c  |  2 +-
 10 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5fe1ca8..bee3ede 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -388,7 +388,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		btrfs_err(fs_info, "kobj add dev failed %d", ret);
 
-	btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
 	/* force writing the updated state information to disk */
 	trans = btrfs_start_transaction(root, 0);
@@ -507,7 +507,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
-	btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..d675324 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4238,7 +4238,7 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 		if (need_commit > 0) {
 			btrfs_start_delalloc_roots(fs_info, 0, -1);
-			btrfs_wait_ordered_roots(fs_info, -1, 0,
+			btrfs_wait_ordered_roots(fs_info, U64_MAX, 0,
 						 (u64)-1);
 		}
@@ -4698,11 +4698,11 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_fs_info *fs_info,
 	}
 }
 
-static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
+static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
 					u64 to_reclaim)
 {
 	u64 bytes;
-	int nr;
+	u64 nr;
 
 	bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
 	nr = (int)div64_u64(to_reclaim, bytes);
@@ -4725,15 +4725,15 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 	struct btrfs_trans_handle *trans;
 	u64 delalloc_bytes;
 	u64 max_reclaim;
+	u64 items;
 	long time_left;
 	unsigned long nr_pages;
 	int loops;
-	int items;
 	enum btrfs_reserve_flush_enum flush;
 
 	/* Calc the number of the pages we need flush for space reservation */
 	items = calc_reclaim_items_nr(fs_info, to_reclaim);
-	to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM;
+	to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	block_rsv = &fs_info->delalloc_block_rsv;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..7712217 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -689,7 +689,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (ret)
 		goto dec_and_free;
 
-	btrfs_wait_ordered_extents(root, -1, 0, (u64)-1);
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
 	btrfs_init_block_rsv(&pending_snapshot->block_rsv, BTRFS_BLOCK_RSV_TEMP);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 7b40e2e..a3aca49 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -663,7 +663,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
  * wait for all the ordered extents in a root.  This is done when balancing
  * space between drives.
  */
-int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
+u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const u64 range_start, const u64 range_len)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -671,7 +671,7 @@ int btrfs_wait_ordered_extents(struct btrfs_
Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk
On 06/21/2017 05:08 PM, Jeff Mahoney wrote:
> On 6/21/17 4:31 PM, Chris Mason wrote:
>> On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
>>> On 6/14/17 11:44 AM, je...@suse.com wrote:
>>>> From: Jeff Mahoney <je...@suse.com>
>>>>
>>>> In a heavy write scenario, we can end up with a large number of
>>>> pinned bytes.  This can translate into (very) premature ENOSPC
>>>> because pinned bytes must be accounted for when allowing a
>>>> reservation but aren't accounted for when deciding whether to
>>>> create a new chunk.
>>>>
>>>> This patch adds the accounting to should_alloc_chunk so that we
>>>> can create the chunk.
>>>>
>>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>>> ---
>>>>  fs/btrfs/extent-tree.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index cb0b924..d027807 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>>>>  {
>>>>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>>>  	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>>>> -	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>>>> +	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
>>>>  	u64 thresh;
>>>>
>>>>  	if (force == CHUNK_ALLOC_FORCE)
>>>
>>> Ignore this patch.  It certainly allocates chunks more aggressively,
>>> but it means we end up with a ton of metadata chunks even when we
>>> don't have much metadata.
>>
>> Josef and I pushed this needle back and forth a bunch of times in the
>> early days.  I still think we can allocate a few more chunks than we
>> do now...
>
> I agree.  This patch was to fix an issue that we are seeing during
> installation.  It'd stop with ENOSPC with >50GB completely unallocated.
> The patch passed the test cases that were failing before but now it's
> failing differently.

I was worried this pattern might be the end result:

Data,single: Size:4.00GiB, Used:3.32GiB
   /dev/vde	 4.00GiB
Metadata,DUP: Size:20.00GiB, Used:204.12MiB
   /dev/vde	40.00GiB
System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/vde	16.00MiB

This is on a fresh file system with just "cp /usr /mnt" executed.  I'm
looking into it a bit more now.

Does this failure still happen with Omar's ENOSPC fix (commit:
70e7af244f24c94604ef6eca32ad297632018583)

-chris
Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk
On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
> On 6/14/17 11:44 AM, je...@suse.com wrote:
>> From: Jeff Mahoney
>>
>> In a heavy write scenario, we can end up with a large number of pinned
>> bytes.  This can translate into (very) premature ENOSPC because pinned
>> bytes must be accounted for when allowing a reservation but aren't
>> accounted for when deciding whether to create a new chunk.
>>
>> This patch adds the accounting to should_alloc_chunk so that we can
>> create the chunk.
>>
>> Signed-off-by: Jeff Mahoney
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index cb0b924..d027807 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>>  {
>>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>  	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>> -	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>> +	u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
>>  	u64 thresh;
>>
>>  	if (force == CHUNK_ALLOC_FORCE)
>
> Ignore this patch.  It certainly allocates chunks more aggressively, but
> it means we end up with a ton of metadata chunks even when we don't
> have much metadata.

Josef and I pushed this needle back and forth a bunch of times in the
early days.  I still think we can allocate a few more chunks than we do
now...

-chris
Re: btrfs_wait_ordered_roots warning triggered
On 06/21/2017 11:16 AM, Dave Jones wrote:
> WARNING: CPU: 2 PID: 7153 at fs/btrfs/ordered-data.c:753 btrfs_wait_ordered_roots+0x1a3/0x220
> CPU: 2 PID: 7153 Comm: kworker/u8:7 Not tainted 4.12.0-rc6-think+ #4
> Workqueue: events_unbound btrfs_async_reclaim_metadata_space
> task: 8804f08d5380 task.stack: c9000895c000
> RIP: 0010:btrfs_wait_ordered_roots+0x1a3/0x220
> RSP: 0018:c9000895fc70 EFLAGS: 00010286
> RAX: RBX: f7efdfb1 RCX: 25c3
> RDX: 880507c0cde0 RSI: RDI: 0002
> RBP: c9000895fce8 R08: R09: 0001
> R10: c9000895fbd8 R11: 8804f08d5380 R12: 8804f4930008
> R13: 0001 R14: 8804f36f R15: 8804f4930850
> FS: () GS:880507c0() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 7fd40c4a1bf2 CR3: 00013995e000 CR4: 001406e0
> DR0: 7f17be89 DR1: DR2:
> DR3: DR6: 0ff0 DR7: 0600
> Call Trace:
>  flush_space+0x285/0x6e0
>  btrfs_async_reclaim_metadata_space+0xd7/0x420
>  process_one_work+0x24d/0x680
>  worker_thread+0x4e/0x3b0
>  kthread+0x117/0x150
>  ? process_one_work+0x680/0x680
>  ? kthread_create_on_node+0x70/0x70
>  ret_from_fork+0x27/0x40

Thanks Dave, which trinity command line triggered this?

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.12

We've been hitting an early enospc problem on production machines that
Omar tracked down to an old int->u64 mistake.  I waited a bit on this
pull to make sure it was really the problem from production, but it's on
~2100 hosts now and I think we're good.

Omar also noticed a commit in the queue would make new early ENOSPC
problems.  I pulled that out for now, which is why the top three commits
are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've
been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
    btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
    btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
    Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
    btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
    Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
    btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
    btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
    btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
    btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/dir-item.c    |   2 +-
 fs/btrfs/disk-io.c     |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c       |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)
Re: [PULL] Btrfs fixes for 4.12
On 06/05/2017 01:47 PM, David Sterba wrote:
> Hi,
>
> please pull the following assorted fixes to 4.12.  Thanks.
>
> The following changes since commit 9bcaaea7418d09691f1ffab5c49aacafe3eef9d0:
>
>   btrfs: fix the gfp_mask for the reada_zones radix tree (2017-05-04 16:56:11 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.12
>
> for you to fetch changes up to 2eec59665d545ce443184b9c7a32a65b4dde6d4c:
>
>   Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io (2017-06-01 17:00:11 +0200)
>
> Colin Ian King (1):
>       btrfs: fix incorrect error return ret being passed to mapping_set_error
>
> David Sterba (1):
>       btrfs: use correct types for page indices in btrfs_page_exists_in_range
>
> Jan Kara (1):
>       btrfs: Make flush bios explicitely sync
>
> Jeff Mahoney (2):
>       btrfs: fix memory leak in update_space_info failure path
>       btrfs: fix race with relocation recovery and fs_root setup
>
> Liu Bo (2):
>       Btrfs: skip commit transaction if we don't have enough pinned bytes

Thanks Dave, I've been bashing on these along with Omar's enospc fix for
the last few days.  Omar mentioned that until we fix the pinned bytes
counter we should leave out this commit from Liu Bo, so I've rebased it
out for now.

-chris
Re: btrfs list corruption and soft lockups while testing writeback error handling
On 05/11/2017 03:52 PM, Jeff Layton wrote:
> On Thu, 2017-05-11 at 07:13 -0400, Jeff Layton wrote:
>> I finally got my writeback error handling test to work on btrfs
>> (thanks, Chris!), by making the filesystem stripe the data and mirror
>> the metadata across two devices.  The test passes now, but on one run,
>> I got the following list corruption warning and then a soft lockup
>> (which is probably fallout from the list corruption).  I ran the test
>> several times before and since then without this failure, so I don't
>> have a clear reproducer.
>>
>> The kernel in this instance is basically a v4.11 kernel with my pile
>> of writeback error handling patches on top:
>>
>> https://git.samba.org/?p=jlayton/linux.git;a=shortlog;h=refs/heads/wberr
>>
>> It may be that they are a contributing factor, but this smells more
>> like a bug down in btrfs.  Let me know if you need other info:
>>
>> [ btrfs inode logging ]
>>
>> (cc'ing Liu Bo since we were discussing this earlier this week)
>
> I can't reproduce this on stock v4.11, so I think this is a bug in my
> series.  I think this is due to the differences in how errors are being
> reported from filemap_fdatawait_range now causing some transactions to
> end up being freed while they're still on the log_ctxs list.  I'm
> working on hunting down the problem now.  Sorry for the noise!

There's a list in the inode logging code that we consistently seem to
find list debugging assertions with.  We've fixed up all the known
issues, but I wouldn't be surprised if we've got a goto fail in there.
I'll take a look ;)

-chris
Re: [GIT PULL] Btrfs
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
>
> My for-linus-4.12 branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> for-linus-4.12

I hit send too soon, sorry.  There's a trivial conflict with our WARN_ON
fix that went into 4.11.  I pushed the resolution to
for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
  	qgroup->excl += sign * num_bytes;
  	qgroup->excl_cmpr += sign * num_bytes;
  	if (sign > 0) {
+ 		trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
- 		if (WARN_ON(qgroup->reserved < num_bytes))
+ 		if (qgroup->reserved < num_bytes)
  			report_reserved_underflow(fs_info, qgroup, num_bytes);
  		else
  			qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
  	WARN_ON(sign < 0 && qgroup->excl < num_bytes);
  	qgroup->excl += sign * num_bytes;
  	if (sign > 0) {
+ 		trace_qgroup_update_reserve(fs_info, qgroup,
+ 					    -(s64)num_bytes);
- 		if (WARN_ON(qgroup->reserved < num_bytes))
+ 		if (qgroup->reserved < num_bytes)
  			report_reserved_underflow(fs_info, qgroup,
  						  num_bytes);
  		else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  	qg = unode_aux_to_qgroup(unode);
 
+ 	trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
- 	if (WARN_ON(qg->reserved < num_bytes))
+ 	if (qg->reserved < num_bytes)
  		report_reserved_underflow(fs_info, qg, num_bytes);
  	else
  		qg->reserved -= num_bytes;
[GIT PULL] Btrfs
	bdev_get_queue (+3/-4)
	btrfs: check if the device is flush capable (+4/-0)
	btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
    btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
    btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
    btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
    btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
    btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
    btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
    btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
    Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
    MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
    Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS                  |   2 +
 fs/btrfs/backref.c           |  41 ++-
 fs/btrfs/btrfs_inode.h       |   7 +
 fs/btrfs/compression.c       |  18 +-
 fs/btrfs/ctree.c             |  20 +-
 fs/btrfs/ctree.h             |  34 +-
 fs/btrfs/delayed-inode.c     |  46 +--
 fs/btrfs/delayed-inode.h     |   6 +-
 fs/btrfs/delayed-ref.c       |   8 +-
 fs/btrfs/delayed-ref.h       |   8 +-
 fs/btrfs/dev-replace.c       |   9 +-
 fs/btrfs/disk-io.c           |  13 +-
 fs/btrfs/disk-io.h           |   4 +-
 fs/btrfs/extent-tree.c       |  35 +-
 fs/btrfs/extent_io.c         |  59 +--
 fs/btrfs/extent_io.h         |   8 +-
 fs/btrfs/extent_map.c        |  10 +-
 fs/btrfs/extent_map.h        |   3 +-
 fs/btrfs/file.c              |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c             | 289 +++
 fs/btrfs/ioctl.c             |  33 +-
 fs/btrfs/ordered-data.c      |  20 +-
 fs/btrfs/ordered-data.h      |   2 +-
 fs/btrfs/qgroup.c            | 102 ++
 fs/btrfs/qgroup.h            |  51 ++-
 fs/btrfs/raid56.c            |  38 +-
 fs/btrfs/reada.c             |  37 +-
 fs/btrfs/root-tree.c         |   3 +-
 fs/btrfs/scrub.c             | 331 +++--
 fs/btrfs/send.c              |  23 +-
 fs/btrfs/super.c             |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c       |  48 ++-
 fs/btrfs/transaction.h       |   6 +-
 fs/btrfs/tree-log.c          |   2 +-
 fs/btrfs/volumes.c           | 854 +++
 fs/btrfs/volumes.h           |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)
Re: [PATCH] btrfs: always write superblocks synchronously
On 05/03/2017 04:36 AM, Jan Kara wrote:
> On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:
>> Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
>> synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
>> Since REQ_FUA and REQ_FLUSH flags are stripped from submitted IO when
>> the disk doesn't have volatile write cache and thus effectively make
>> the write async.  This was seen to cause performance hits up to 90%
>> regression in disk IO related benchmarks such as reaim and dbench[1].
>>
>> Fix the problem by making sure the first superblock write is also
>> treated as synchronous since they can block progress of the
>> journalling (commit, log syncs) machinery and thus the whole
>> filesystem.
>>
>> Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
>> Cc: stable
>> Cc: Jan Kara
>> Signed-off-by: Davidlohr Bueso
>
> I wasn't patient enough and already sent the fix as part of my series
> fixing other filesystems [1].  It also fixes one more place in btrfs
> that needs REQ_SYNC to return to the original behavior.

Thanks guys.

-chris
[GIT PULL] Btrfs
Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise
than we really want.  The larger fix for the underflow got delayed a bit
and it's better for now to put it under CONFIG_BTRFS_DEBUG.

David Sterba (1) commits (+7/-4):
    btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

 fs/btrfs/qgroup.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
Re: [GIT PULL] Btrfs bug fixes for 4.12
On 04/26/2017 01:52 PM, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> Please consider the following changes for the 4.12 merge window.  These
> are all bug fixes and nothing particularly outstanding compared to
> changes for past merge windows.  Thanks.
>
> The following changes since commit a967efb30b3afa3d858edd6a17f544f9e9e46eea:
>
>   Btrfs: fix potential use-after-free for cloned bio (2017-04-11 18:49:56 +0200)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.12
>
> for you to fetch changes up to a7e3b975a0f9296162b72ac6ab7fad9631a07630:
>
>   Btrfs: fix reported number of inode blocks (2017-04-26 16:27:26 +0100)
>
> Filipe Manana (5):
>       Btrfs: fix invalid attempt to free reserved space on failure to cow range
>       Btrfs: fix incorrect space accounting after failure to insert inline extent
>       Btrfs: fix extent map leak during fallocate error path
>       Btrfs: send, fix file hole not being preserved due to inline extent
>       Btrfs: fix reported number of inode blocks
>
> Qu Wenruo (2):
>       btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
>       btrfs: Handle delalloc error correctly to avoid ordered extent hang

Thanks Filipe, pulling this in.

-chris
Re: [PULL] Btrfs, updates for 4.12
On 04/26/2017 11:06 AM, Filipe Manana wrote:
> Hi,
>
> Did you actually ran xfstests with those readahead patches to
> preallocate radix tree nodes?
>
> With those 2 patches applied (Chris' for-linus.4,12 branch) this
> breaks things and many btrfs specific tests (at least, since I can't
> get pass them) result in tons of traces like the following in a debug
> kernel:
>
> [ 8180.696804] BUG: sleeping function called from invalid context at mm/slab.h:432
> [ 8180.703584] in_atomic(): 1, irqs_disabled(): 0, pid: 28583, name: btrfs
> [ 8180.724146] 2 locks held by btrfs/28583:
> [ 8180.726427] #0: (sb_writers#12){.+.+.+}, at: [] mnt_want_write_file+0x25/0x4d
> [ 8180.736742] #1: (&(_info->reada_lock)->rlock){+.+.+.}, at: [] reada_add_block+0x2fe/0x6cd [btrfs]
> [ 8180.766321] Preemption disabled at:
> [ 8180.766326] [] preempt_count_add+0x65/0x68
> [ 8180.794837] CPU: 5 PID: 28583 Comm: btrfs Tainted: G W 4.11.0-rc8-btrfs-next-39+ #1
> [ 8180.798818] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 8180.798818] Call Trace:
> [ 8180.798818]  dump_stack+0x68/0x92
> [ 8180.798818]  ? preempt_count_add+0x65/0x68
> [ 8180.798818]  ___might_sleep+0x20f/0x226
> [ 8180.798818]  __might_sleep+0x77/0x7e
> [ 8180.798818]  slab_pre_alloc_hook+0x32/0x4f
> [ 8180.798818]  kmem_cache_alloc+0x39/0x233
> [ 8180.798818]  ? radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  __radix_tree_create+0xc3/0x143
> [ 8180.798818]  __radix_tree_insert+0x32/0xc0
> [ 8180.798818]  reada_add_block+0x318/0x6cd [btrfs]

So radix_tree_preload doesn't work the way I thought it did.  It
populates a per-cpu pool of radix tree nodes so the allocation is sure
not to fail.

But, when we go to actually allocate the node during radix_tree_insert:

static struct radix_tree_node *
radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
		      struct radix_tree_root *root,
		      unsigned int shift, unsigned int offset,
		      unsigned int count, unsigned int exceptional)
{
	struct radix_tree_node *ret = NULL;

	/*
	 * Preload code isn't irq safe and it doesn't make sense to use
	 * preloading during an interrupt anyway as all the allocations have
	 * to be atomic. So just do normal allocation when in interrupt.
	 */
	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
		struct radix_tree_preload *rtp;

		/*
		 * Even if the caller has preloaded, try to allocate from the
		 * cache first for the new node to get accounted to the memory
		 * cgroup.
		 */
		ret = kmem_cache_alloc(radix_tree_node_cachep,
				       gfp_mask | __GFP_NOWARN);
		if (ret)
			goto out;

		/*
		 * Provided the caller has preloaded here, we will always
		 * succeed in getting a node here (and never reach
		 * kmem_cache_alloc)
		 */
		rtp = this_cpu_ptr(&radix_tree_preloads);
		if (rtp->nr) {
			ret = rtp->nodes;
			rtp->nodes = ret->parent;
			rtp->nr--;
		}

		/*
		 *
Re: [PULL] Btrfs, updates for 4.12
On 04/26/2017 11:06 AM, Filipe Manana wrote:
> Hi,
>
> Did you actually ran xfstests with those readahead patches to
> preallocate radix tree nodes?
>
> With those 2 patches applied (Chris' for-linus.4,12 branch) this
> breaks things and many btrfs specific tests (at least, since I can't
> get pass them) result in tons of traces like the following in a debug
> kernel:

Huh, I did and these didn't come up.  I'll double check I have
preemption enabled.

-chris
Re: [PULL] Btrfs, qgroup message level adjustment, for 4.11-rc8
On 04/25/2017 10:21 AM, David Sterba wrote:
> On Wed, Apr 19, 2017 at 12:51:03PM +0200, David Sterba wrote:
>> Hi,
>>
>> a single-patch pull request, the qgroup use can trigger an underflow
>> warning frequently.  The warning is for debugging and should not be in
>> the final release of 4.11 as we won't be able to fix it.

Sigh, I looked for this pull after you mentioned it last week, couldn't
find it and assumed you held off.  Sorry, maybe Josef has the right idea
ditching exchange.  I'll send today.

-chris
[GIT PULL] Btrfs
Hi Linus,

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch
we're testing/sending by hand for this release.

Liu Bo (3) commits (+11/-13):
    Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
    Btrfs: fix potential use-after-free for cloned bio (+1/-1)
    Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
    btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

 fs/btrfs/inode.c   | 22 ++
 fs/btrfs/super.c   |  3 +++
 fs/btrfs/volumes.c |  2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
    btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
    Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
    Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c     |  6 +++---
 fs/btrfs/qgroup.c    | 10 +-
 fs/btrfs/send.c      |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents.  I
didn't tag this one for stable because I want to do individual tested
backports.  It's a little tricky and I'd rather do some extra testing on
it along the way.  Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
    Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
    btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
    btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

 fs/btrfs/extent_io.c       |  3 ++-
 fs/btrfs/inode.c           | 14 ++
 include/uapi/linux/btrfs.h | 27 ---
 3 files changed, 16 insertions(+), 28 deletions(-)
Re: [PULL] Btrfs updates for 4.11-rc2
On 03/09/2017 08:55 AM, David Sterba wrote:
> Hi,
>
> there's a regression fix for the assertion failure reported by Dave Jones, the rest of patches are minor updates. Please pull, thanks.
>
> The following changes since commit e9f467d028cd7d8bee2a4d6c4fb806caf8cd580b:
>
>   Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 (2017-02-28 14:35:09 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.11-rc2
>
> for you to fetch changes up to 94b5a2924fc0ec2bb2347ca31156be8fabe907c4:
>
>   Btrfs: consistent usage of types in balance_args (2017-03-09 14:23:00 +0100)

Thanks Dave, these all tested well all week, but a few were arguably cleanups.  I've pulled out just the fixes for this time and I'll queue the others for 4.12.

-chris
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On 03/10/2017 01:56 PM, Zygo Blaxell wrote:
> On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote:
>> On 03/09/2017 11:41 PM, Zygo Blaxell wrote:
>>> On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
>>>> On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
>>>>> This is a story about 4 distinct (and very old) btrfs bugs.
>>>>
>>>> Really great write up.
>>>>
>>>> [ ... ]
>>>>
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index 25ac2cf..4d41a31 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>>>>>  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>>>>>  	ret = btrfs_decompress(compress_type, tmp, page, extent_offset,
>>>>>  			       inline_size, max_size);
>>>>> +	WARN_ON(max_size + pg_offset > PAGE_SIZE);
>>>>
>>>> Can you please drop this WARN_ON and make the math reflect any possible pg_offset?  I do agree it shouldn't be happening, but it's easy to correct for and the WARN is likely to get lost.
>>>
>>> I'm not sure how to do that.  It looks like I'd have to pass pg_offset through btrfs_decompress to the decompress functions?
>>>
>>> 	ret = btrfs_decompress(compress_type, tmp, page, extent_offset,
>>> 			       inline_size, max_size, pg_offset);
>>>
>>> and in the compression functions get pg_offset from the argument list instead of hardcoding zero.
>>
>> Yeah, it's a good point.  Both zlib and lzo are assuming a zero pg_offset right now, but just like there are wacky corners allowing inline extents followed by more data, there are a few wacky corners allowing inline extents at the end of the file.
>>
>> Let's not mix that change in with this one though.  For now, just get the memset right and we can pass pg_offset down in a later patch.
>
> Are you saying "fix the memset in the patch" (and if so, what's wrong with it?), or are you saying "let's take the patch with its memset as is, and fix the pg_offset > 0 issues later"?

Your WARN_ON() would fire when this math is bad:

	memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);

Instead of warning, just don't memset if pg_offset + max_size >= PAGE_SIZE.

-chris
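The guarded memset Chris asks for here can be sketched in userspace.  This is a toy model, not the kernel patch: a plain buffer stands in for the kmap'd page, and the helper name is made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Zero the tail of a page-sized buffer past the decompressed bytes.
 * Hypothetical stand-in for the uncompress_inline() fix discussed above:
 * instead of WARN_ON()ing when pg_offset + max_size would overrun the
 * page, simply skip the memset, since the length subtraction would
 * otherwise underflow. */
static void zero_page_tail(unsigned char *map, size_t pg_offset, size_t max_size)
{
	if (pg_offset + max_size < PAGE_SIZE)
		memset(map + pg_offset + max_size, 0,
		       PAGE_SIZE - max_size - pg_offset);
}
```

With pg_offset + max_size == PAGE_SIZE the guard makes the call a no-op rather than a zero-length (or underflowing) memset.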
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On 03/09/2017 11:41 PM, Zygo Blaxell wrote:
> On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
>> On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
>>> This is a story about 4 distinct (and very old) btrfs bugs.
>>
>> Really great write up.
>>
>> [ ... ]
>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 25ac2cf..4d41a31 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>>>  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>>>  	ret = btrfs_decompress(compress_type, tmp, page, extent_offset,
>>>  			       inline_size, max_size);
>>> +	WARN_ON(max_size + pg_offset > PAGE_SIZE);
>>
>> Can you please drop this WARN_ON and make the math reflect any possible pg_offset?  I do agree it shouldn't be happening, but it's easy to correct for and the WARN is likely to get lost.
>
> I'm not sure how to do that.  It looks like I'd have to pass pg_offset through btrfs_decompress to the decompress functions?
>
> 	ret = btrfs_decompress(compress_type, tmp, page, extent_offset,
> 			       inline_size, max_size, pg_offset);
>
> and in the compression functions get pg_offset from the argument list instead of hardcoding zero.

Yeah, it's a good point.  Both zlib and lzo are assuming a zero pg_offset right now, but just like there are wacky corners allowing inline extents followed by more data, there are a few wacky corners allowing inline extents at the end of the file.

Let's not mix that change in with this one though.  For now, just get the memset right and we can pass pg_offset down in a later patch.

-chris
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
> This is a story about 4 distinct (and very old) btrfs bugs.

Really great write up.

[ ... ]

> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 25ac2cf..4d41a31 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>  	ret = btrfs_decompress(compress_type, tmp, page, extent_offset,
>  			       inline_size, max_size);
> +	WARN_ON(max_size + pg_offset > PAGE_SIZE);

Can you please drop this WARN_ON and make the math reflect any possible pg_offset?  I do agree it shouldn't be happening, but it's easy to correct for and the WARN is likely to get lost.

> +	if (max_size + pg_offset < PAGE_SIZE) {
> +		char *map = kmap(page);
> +		memset(map + pg_offset + max_size, 0,
> +		       PAGE_SIZE - max_size - pg_offset);
> +		kunmap(page);
> +	}

Both lzo and zlib have a memset to cover the gap between what they actually decompress and the max_size that we pass here.  That's important because ram_bytes may not be 100% accurate.  Can you also please toss in a comment about how the decompression code is responsible for the memset up to max_bytes?

-chris
Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11
On 03/06/2017 11:44 AM, David Sterba wrote:
> On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:
>> Any response?
>>
>> These patches are already here for at least 2 kernel releases.
>>
>> And are all bug fixes, and fix bugs that are already reported.
>>
>> I don't see any reason why it should be delayed for so long.
>
> The reason is not enough review for the whole series.  The delay is unfortunate, same as almost zero people capable & willing to review it, despite your efforts to keep the series up to date with current code.

I've been hesitant to take more series of qgroup fixes because past sets have ended up causing problems down the line.  I'm queuing it for v4.12 while I collect tests and reviews though, it's definitely not fair for you to have to keep rebasing it without suggestions on what to fix.

Thanks!

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has Btrfs round two.  These are mostly a continuation of Dave Sterba's collection of cleanups, but Filipe also has some bug fixes and performance improvements.

Nikolay Borisov (42) commits (+611/-579):
    btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
    btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
    btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
    btrfs: make btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
    btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
    btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
    btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
    btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
    btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
    btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
    btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
    btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
    btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
    btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
    btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
    btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
    btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
    btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
    btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
    btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
    btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
    btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
    btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
    btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
    btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
    btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
    btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
    btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
    btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
    btrfs: make repair_io_failure take btrfs_inode (+12/-11)
    btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
    btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
    btrfs: make clean_io_failure take btrfs_inode (+15/-14)
    btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
    btrfs: make free_io_failure take btrfs_inode (+13/-11)
    btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
    btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
    btrfs: Make get_extent_t take btrfs_inode (+59/-54)
    btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
    btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
    btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
    btrfs: derive maximum output size in the compression implementation (+9/-14)
    btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
    btrfs: merge length input and output parameter in compress_pages (+18/-20)
    btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
    btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
    btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
    btrfs: constify device path passed to relevant helpers (+22/-18)
    btrfs: document existence of extent_io ops callbacks (+26/-11)
    btrfs: handle allocation error in update_dev_stat_item (+2/-1)
    btrfs: export compression buffer limits in a header (+15/-10)
    btrfs: constify name of subvolume in creation helpers (+3/-3)
    btrfs: constify buffers used by compression helpers (+3/-3)
    btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
    btrfs: constify input buffer of btrfs_csum_data (+3/-3)
    btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
    Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
    Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
    Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
    Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
    Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
    Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
    Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
    Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits
Re: [PULL] Btrfs cleanups for 4.11, part 2
On 02/28/2017 10:09 AM, David Sterba wrote:
> Hi,
>
> this is the second half of the 4.11 batch, the rest of the cleanups. Please pull, thanks.
>
> The following changes since commit 6288d6eabc7505f42dda34a2c2962f91914be3a4:
>
>   Btrfs: use the correct type when creating cow dio extent (2017-02-22 15:55:03 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.11-part2
>
> for you to fetch changes up to 20a7db8ab3f2057a518448b1728d504ffadef65e:
>
>   btrfs: add dummy callback for readpage_io_failed and drop checks (2017-02-28 14:29:24 +0100)

Thanks Dave, I've got this along with Filipe's pull.

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting.  There is a pretty big variety here, cleaning up internal APIs and fixing corner cases.

David Sterba (46) commits (+235/-313):
    btrfs: remove unused parameter from btrfs_subvolume_release_metadata (+6/-11)
    btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
    btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
    btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
    btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
    btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
    btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
    btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
    btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
    btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
    btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
    btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
    btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
    btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
    btrfs: remove unused parameter from read_block_for_search (+5/-5)
    btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
    btrfs: remove unused parameter from init_first_rw_device (+3/-5)
    btrfs: make space cache inode readahead failure nonfatal (+3/-7)
    btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
    btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
    btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
    btrfs: remove unused parameter from submit_extent_page (+3/-9)
    btrfs: remove unused parameter from clean_tree_block (+17/-19)
    btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
    btrfs: remove unused parameter from __add_inline_refs (+2/-3)
    btrfs: remove unused parameter from add_pending_csums (+2/-4)
    btrfs: remove unused parameter from update_nr_written (+4/-4)
    btrfs: remove unused parameter from __push_leaf_right (+2/-3)
    btrfs: remove unused parameter from check_async_write (+2/-2)
    btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
    btrfs: remove unused parameter from __push_leaf_left (+2/-3)
    btrfs: remove unused parameter from write_dev_supers (+3/-3)
    btrfs: remove unused parameter from __add_inode_ref (+1/-2)
    btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
    btrfs: remove unused parameter from create_snapshot (+2/-2)
    btrfs: ulist: make the finalization function public (+2/-1)
    btrfs: remove unused parameter from tree_move_down (+2/-2)
    btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
    btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
    btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
    btrfs: remove unused parameter from split_item (+2/-3)
    btrfs: merge two superblock writing helpers (+4/-11)
    btrfs: qgroups: opencode qgroup_free helper (+9/-9)
    btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
    btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
    btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
    btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
    btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
    btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
    btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
    btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
    btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
    btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
    btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
    btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
    btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
    btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
    btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
    btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
    btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
    btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
    btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
    btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
Re: [GIT PULL] Btrfs bug fixes and improvements for 4.11 (2nd retry)
On Fri, Feb 24, 2017 at 03:25:09AM +, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris, since my previous pull request (sent timely) was either missed or not pulled for some reason I'm not aware of, here I send it again (with one more patch included). The following is taken from the former pull request:

Hi Filipe, I've got this queued up and will send Monday/Tuesday.  Thanks!

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Has two last minute fixes.  The highest priority here is a regression fix for the decompression code, but we also fixed up a problem with the 32 bit compat ioctls.

The decompression bug could hand back the wrong data on big reads when zlib was used.  I have a larger cleanup to make the math here less error prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
    Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
    btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

 fs/btrfs/compression.c | 39 ---
 fs/btrfs/ioctl.c       |  6 --
 2 files changed, 28 insertions(+), 17 deletions(-)
Re: [PATCH v2] Btrfs: fix btrfs_decompress_buf2page()
On Fri, Feb 10, 2017 at 08:12:35PM -0800, Pat Erley wrote:
> On 02/10/17 15:03, Omar Sandoval wrote:
>> From: Omar Sandoval
>>
>> If btrfs_decompress_buf2page() is handed a bio with its page in the middle of the working buffer, then we adjust the offset into the working buffer.  After we copy into the bio, we advance the iterator by the number of bytes we copied.  Then, we have some logic to handle the case of discontiguous pages and adjust the offset into the working buffer again.  However, if we didn't advance the bio to a new page, we may enter this case in error, essentially repeating the adjustment that we already made when we entered the function.  The end result is bogus data in the bio.
>>
>> Previously, we only checked for this case when we advanced to a new page, but the conversion to bio iterators changed that.  This restores the old, correct behavior.
>
> I can confirm this fixes the corruption I was seeing.
>
> Feel free to add:
>
> Tested-by: Pat Erley

Thanks again Pat for bisecting this down.  It passed overnight so I'm sending it in right now.

-chris
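The invariant Omar's patch restores can be illustrated with a tiny userspace model.  This is a sketch only: a toy page size, an array of fixed-size pages standing in for bio pages, and made-up names; it shows the shape of the loop (the offset bookkeeping is updated only when the copy actually crosses into a new destination page), not the kernel code itself.

```c
#include <string.h>

#define TOY_PAGE_SIZE 16 /* tiny pages keep the example readable */

/* Toy model of a buf2page-style copy loop: copy `len` bytes from a
 * working buffer into a sequence of fixed-size pages, starting at an
 * offset into the first page.  The page index and page offset are
 * re-adjusted only when a page is actually filled -- the analogue of
 * the "did we advance to a new page?" check discussed above, as opposed
 * to repeating the adjustment on every iteration. */
static void copy_to_pages(const unsigned char *workbuf, size_t len,
			  unsigned char pages[][TOY_PAGE_SIZE],
			  size_t first_pg_off)
{
	size_t buf_off = 0;
	size_t pg = 0, pg_off = first_pg_off;

	while (buf_off < len) {
		size_t n = TOY_PAGE_SIZE - pg_off; /* room left in this page */

		if (n > len - buf_off)
			n = len - buf_off;         /* bytes left to copy */
		memcpy(&pages[pg][pg_off], workbuf + buf_off, n);
		buf_off += n;
		pg_off += n;
		if (pg_off == TOY_PAGE_SIZE) {     /* advanced to a new page */
			pg++;
			pg_off = 0;
		}
	}
}
```

Re-running the offset adjustment when the page has not changed is exactly the "repeating the adjustment" failure mode the commit message describes.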
Re: [PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10
On Wed, Feb 08, 2017 at 05:51:28PM +0100, David Sterba wrote:
> Hi, could you please merge this single-patch pull request, for 4.10 still? There are quite a few patches on top of v4.10-rc7, so this IMHO does not look too bad even late in the release cycle. Though it's a fix for an uncommon usecase of 32bit userspace on 64bit kernel, it fixes basic operation of the ioctls. Thanks.

Hi Dave, I'll pull this in, thanks.

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Has some fixes that we've collected from the list.  We still have one more pending to nail down a regression in lzo compression, but I wanted to get this batch out the door.

Omar Sandoval (3) commits (+2/-6):
    Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (+0/-2)
    Btrfs: remove old tree_root case in btrfs_read_locked_inode() (+1/-4)
    Btrfs: disable xattr operations on subvolume directories (+1/-0)

Liu Bo (1) commits (+12/-1):
    Btrfs: fix truncate down when no_holes feature is enabled

Chandan Rajendra (1) commits (+2/-2):
    Btrfs: Fix deadlock between direct IO and fast fsync

Wang Xiaoguang (1) commits (+1/-0):
    btrfs: fix false enospc error when truncating heavily reflinked file

Total: (6) commits (+17/-9)

 fs/btrfs/inode.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)